Pipes in R Tutorial For Beginners (article) – DataCamp

Pipes in R Tutorial For Beginners

Learn more about the famous pipe operator %>% and other pipes in R, why and how you should use them and what alternatives you can consider!

You might have already seen or used the pipe operator when working with packages such as dplyr and magrittr. But do you know where pipes and the famous %>% operator come from, what they exactly are, or how, when and why you should use them? Can you also come up with some alternatives?

This tutorial will give you an introduction to pipes in R: where they come from, what they are, why and how to use them, and what alternatives to consider.

Are you interested in learning more about manipulating data in R with dplyr? Take a look at DataCamp’s Data Manipulation in R with dplyr course.

Pipe Operator in R: Introduction

To understand what the pipe operator in R is and what you can do with it, it’s necessary to consider the full picture, to learn the history behind it. Questions such as “where does this weird combination of symbols come from and why was it made like this?” might be on top of your mind. You’ll discover the answers to these and more questions in this section.

Now, you can look at the history from three perspectives: from a mathematical point of view, from a holistic point of view of programming languages, and from the point of view of the R language itself. You’ll cover all three in what follows!

History of the Pipe Operator in R

Mathematical History

If you have two functions, let’s say f: B → C and g: A → B, you can chain these functions together by taking the output of one function and inserting it into the next. In short, “chaining” means that you pass an intermediate result onto the next function, but you’ll see more about that later.

For example, in f(g(x)), g(x) serves as an input for f(), while x, of course, serves as input to g().

If you want to write this down, you use the notation f ∘ g, which reads as “f after g”: first apply g, then apply f to its result. Alternatively, you can visually represent this as:

Image Credit: James Balamuta, “Piping Data”
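The same composition can be sketched in base R itself; f and g below are arbitrary single-argument functions chosen purely for illustration:

```r
# g runs first (g: A -> B), then f (f: B -> C)
g <- function(x) x + 1
f <- function(x) x * 2

# (f . g)(x) is f(g(x)): compose() builds the chained function
compose <- function(f, g) function(x) f(g(x))
h <- compose(f, g)

h(3)     # f(g(3)) = (3 + 1) * 2 = 8
f(g(3))  # the same computation written as a nested call
```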

Pipe Operators in Other Programming Languages

As mentioned in the introduction to this section, this operator is not new in programming: in the Shell or Terminal, you can pass the output of one command to the next with the pipe character |. Similarly, F# has a forward pipe operator, which will prove to be important later on! Lastly, it’s also good to know that Haskell contains many piping operations that are derived from the Shell or Terminal.
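A minimal Shell illustration (the commands and strings are just an example): each | hands the previous command’s standard output to the next command’s standard input.

```shell
# Emit three lines, keep those containing "b", then count the survivors
printf 'ab\nbc\ncd\n' | grep 'b' | wc -l    # prints 2
```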

Pipes in R

Now that you have seen some history of the pipe operator in other programming languages, it’s time to focus on R. The history of this operator in R starts, according to this fantastic blog post written by Adolfo Álvarez, on January 17th, 2012, when an anonymous user asked the following question in this Stack Overflow post:

How can you implement F#’s forward pipe operator in R? The operator makes it possible to easily chain a sequence of calculations. For example, when you have an input data and want to call functions foo and bar in sequence, you can write data |> foo |> bar.

The answer came from Ben Bolker, professor at McMaster University, who replied:

I don’t know how well it would hold up to any real use, but this seems (?) to do what you want, at least for single-argument functions …

"%>%" <- function(x,f) do.call(f,list(x))
pi %>% sin
[1] 1.224647e-16
pi %>% sin %>% cos
[1] 1
cos(sin(pi))
[1] 1

About nine months later, Hadley Wickham started the dplyr package on GitHub. You might now know Hadley, Chief Scientist at RStudio, as the author of many popular R packages (such as this last package!) and as the instructor for DataCamp’s Writing Functions in R course.

Be that as it may, it wasn’t until 2013 that the first pipe, %.%, appeared in this package. As Adolfo Álvarez rightfully mentions in his blog post, it accompanied a function called chain(), whose purpose was to simplify the notation for applying several functions to a single data frame in R.

The %.% pipe would not be around for long, as Stefan Bache proposed an alternative on the 29th of December 2013, that included the operator as you might now know it:

iris %>%
  subset(Sepal.Length > 5) %>%
  aggregate(. ~ Species, ., mean)

Bache continued to work with this pipe operation, and at the end of 2013, the magrittr package came into being. In the meantime, Hadley Wickham continued to work on dplyr, and in April 2014, the %.% operator got replaced with the one that you now know, %>%.

Later that year, Kun Ren published the pipeR package on GitHub, which incorporated a different pipe operator, %>>%, which was designed to add more flexibility to the piping process. However, it’s safe to say that the %>% is now established in the R language, especially with the recent popularity of the Tidyverse.

What Is It?

Knowing the history is one thing, but that still doesn’t give you an idea of what F#’s forward pipe operator is nor what it actually does in R.

In F#, the pipe-forward operator |> is syntactic sugar for chained method calls. Or, stated more simply, it lets you pass an intermediate result onto the next function.

Remember that “chaining” means that you invoke multiple method calls. As each method returns an object, you can actually allow the calls to be chained together in a single statement, without needing variables to store the intermediate results.

In R, the pipe operator is, as you have already seen, %>%. If you’re not familiar with F#, you can think of this operator as being similar to the + in a ggplot2 statement. Its function is very similar to that of the F# operator: it takes the output of one statement and makes it the input of the next statement. When describing it, you can think of it as a “THEN”.

Take, for example, following code chunk and read it aloud:

iris %>%
  subset(Sepal.Length > 5) %>%
  aggregate(. ~ Species, ., mean)

You’re right, the code chunk above will translate to something like “you take the Iris data, then you subset the data and then you aggregate the data”.

This is one of the most powerful things about the Tidyverse. In fact, having a standardized chain of processing actions is called “a pipeline”. Making pipelines for a data format is great, because you can apply that pipeline to incoming data that has the same formatting and have it output in a ggplot2-friendly format, for example.

Why Use It?

R is a functional language, which means that your code often contains a lot of parentheses, ( and ). When you have complex code, this often means nesting those parentheses together, which makes your R code hard to read and understand. Here’s where %>% comes to the rescue!

Take a look at the following example, which is a typical example of nested code:

# Initialize `x`
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

# Compute the logarithm of `x`, return suitably lagged and iterated differences, 
# compute the exponential function and round the result
round(exp(diff(log(x))), 1)
[1]  3.3  1.8  1.6  0.5  0.3  0.1 48.8  1.1

With the help of %>%, you can rewrite the above code as follows:

# Import `magrittr`
library(magrittr)

# Perform the same computations on `x` as above
x %>% log() %>%
    diff() %>%
    exp() %>%
    round(1)

Does this seem difficult to you? No worries! You’ll learn more on how to go about this later on in this tutorial.

Note that you need to import the magrittr library to get the above code to work. That’s because the pipe operator is, as you read above, part of the magrittr library and is, since 2014, also a part of dplyr. If you forget to import the library, you’ll get an error like Error in eval(expr, envir, enclos): could not find function "%>%".

Also note that it isn’t a formal requirement to add the parentheses after log, diff, and exp, but, within the R community, some use them to increase the readability of the code.

In short, here are four reasons why you should be using pipes in R:

  • You’ll structure the sequence of your data operations from left to right, as opposed to from the inside out;
  • You’ll avoid nested function calls;
  • You’ll minimize the need for local variables and function definitions; and
  • You’ll make it easy to add steps anywhere in the sequence of operations.

These reasons are taken from the magrittr documentation itself. Implicitly, you see the arguments of readability and flexibility returning.

Additional Pipes

Even though %>% is the (main) pipe operator of the magrittr package, there are a couple of other operators that you should know and that are part of the same package:

  • The compound assignment operator %<>%;
# Initialize `x` 
x <- rnorm(100)

# Update value of `x` and assign it to `x`
x %<>% abs %>% sort
  • The tee operator %T>%;
rnorm(200) %>%
matrix(ncol = 2) %T>%
plot %>% 
colSums

For now, it’s good to know that the above code chunk is actually a shortcut for:

rnorm(200) %>%
matrix(ncol = 2) %T>%
{ plot(.); . } %>% 
colSums

But you’ll see more about that later on!

  • The exposition pipe operator %$%.
data.frame(z = rnorm(100)) %$% 
  ts.plot(z)

Of course, these three operators work slightly differently than the main %>% operator. You’ll see more about their functionalities and their usage later on in this tutorial!

Note that, even though you’ll most often see the magrittr pipes, you might also encounter other pipes as you go along! Some examples are wrapr's dot arrow pipe %.>%, its to-dot pipe %>.%, and the Bizarro pipe ->.;.

How to Use Pipes in R

Now that you know how the %>% operator originated, what it actually is, and why you should use it, it’s time for you to discover how you can actually use it to your advantage. You will see that there are quite a few ways in which you can use it!

Basic Piping

Before you go into the more advanced usages of the operator, it’s good to first take a look at the most basic examples that use the operator. In essence, you’ll see that there are 3 rules that you can follow when you’re first starting out:

  • f(x) can be rewritten as x %>% f

In short, this means that functions that take one argument, function(argument), can be rewritten as follows: argument %>% function(). Take a look at the following, more practical example to understand how these two are equivalent:

# Compute the logarithm of `x` 
log(x)

# Compute the logarithm of `x` 
x %>% log()
  • f(x, y) can be rewritten as x %>% f(y)

Of course, there are a lot of functions that don’t just take one argument, but multiple. This is the case here: you see that the function takes two arguments, x and y. Similar to what you have seen in the first example, you can rewrite the function by following the structure argument1 %>% function(argument2), where argument1 is piped in as the function’s first argument and argument2 is supplied as its second.

This all seems quite theoretical. Let’s take a look at a more practical example:

# Round pi
round(pi, 6)

# Round pi 
pi %>% round(6)
  • x %>% f %>% g %>% h can be rewritten as h(g(f(x)))

This might seem complex, but it isn’t quite like that when you look at a real-life R example:

# Import `babynames` data
library(babynames)
# Import `dplyr` library
library(dplyr)

# Load the data
data(babynames)

# Count how many young boys with the name "Taylor" are born
sum(select(filter(babynames,sex=="M",name=="Taylor"),n))

# Do the same but now with `%>%`
babynames%>%filter(sex=="M",name=="Taylor")%>%
            select(n)%>%
            sum

Note how you work from the inside out when you rewrite the nested code: you first put in the babynames, then you use %>% to first filter() the data. After that, you’ll select n and lastly, you’ll sum() everything.

Remember also that you already saw another example of such nested code that was converted to more readable code in the beginning of this tutorial, where you used the log(), diff(), exp(), and round() functions to perform calculations on x.

Functions that Use the Current Environment

Unfortunately, there are some exceptions to the more general rules that were outlined in the previous section. Let’s take a look at some of them here.

Consider this example, where you use the assign() function to assign the value 10 to the variable x.

# Assign `10` to `x`
assign("x", 10)

# Assign `100` to `x` 
"x" %>% assign(100)

# Return `x`
x

10

You see that the second call with the assign() function, in combination with the pipe, doesn’t work properly. The value of x is not updated.

Why is this?

That’s because the function assigns the new value 100 to a temporary environment used by %>%. So, if you want to use assign() with the pipe, you must be explicit about the environment:

# Define your environment
env <- environment()

# Add the environment to `assign()`
"x" %>% assign(100, envir = env)

# Return `x`
x

100

Functions with Lazy Evaluation

Arguments within functions are only computed when the function uses them in R: no arguments are computed before you call your function! The pipe, however, evaluates its left-hand side before passing the result on, which breaks functions that rely on the lazy evaluation of their arguments.

One place that this is a problem is tryCatch(), which lets you capture and handle errors, like in this example:

tryCatch(stop("!"), error = function(e) "An error")

stop("!") %>% 
  tryCatch(error = function(e) "An error")

‘An error’

Error in eval(expr, envir, enclos): !
Traceback:

1. stop("!") %>% tryCatch(error = function(e) "An error")
2. eval(lhs, parent, parent)
3. eval(expr, envir, enclos)
4. stop("!")

You’ll see that the nested way of writing down this line of code works perfectly, while the piped alternative returns an error. Other functions with the same behavior are try(), suppressMessages(), and suppressWarnings() in base R.

Argument Placeholder

There are also instances where you can use the pipe operator as an argument placeholder. Take a look at the following examples:

  • f(x, y) can be rewritten as y %>% f(x, .)

In some cases, you won’t want the piped value to land in the first argument position of the function call, which has been the case in every example that you have seen up until now. Reconsider this line of code:

pi %>% round(6)

If you rewrite this line of code, pi becomes the first argument of your round() function. But what if you want to pipe a value into the second, third, … argument instead? That’s what the magrittr placeholder, the dot (.), is for.

Take a look at this example, where the value is actually at the third position in the function call:

"Ceci n'est pas une pipe" %>% gsub("une", "un", .)

[1] "Ceci n'est pas un pipe"

  • f(y, z = x) can be rewritten as x %>% f(y, z = .)

Likewise, you might want to make the value of a specific argument within your function call the magrittr placeholder. Consider the following line of code:

6 %>% round(pi, digits=.)

Re-using the Placeholder for Attributes

It is straightforward to use the placeholder several times in a right-hand side expression. However, when the placeholder only appears in nested expressions, magrittr will still apply the first-argument rule, because in most cases this results in cleaner code.

Here are some general “rules” that you can take into account when you’re working with argument placeholders in nested function calls:

  • f(x, y = nrow(x), z = ncol(x)) can be rewritten as x %>% f(y = nrow(.), z = ncol(.))
# Initialize a matrix `ma` 
ma <- matrix(1:12, 3, 4)

# Return the maximum of the values inputted
max(ma, nrow(ma), ncol(ma))

# Return the maximum of the values inputted
ma %>% max(nrow(ma), ncol(ma))

12

12

The behavior can be overruled by enclosing the right-hand side in braces:

  • f(y = nrow(x), z = ncol(x)) can be rewritten as x %>% {f(y = nrow(.), z = ncol(.))}
# Only return the maximum of the `nrow(ma)` and `ncol(ma)` input values
ma %>% {max(nrow(ma), ncol(ma))}

4

To conclude, also take a look at the following example, where you could possibly want to adjust the workings of the argument placeholder in the nested function call:

# The function that you want to rewrite
paste(1:5, letters[1:5])

# The nested function call with dot placeholder
1:5 %>%
  paste(., letters[.])
[1] "1 a" "2 b" "3 c" "4 d" "5 e"
[1] "1 a" "2 b" "3 c" "4 d" "5 e"

You see that if the placeholder is only used in a nested function call, the piped value will also be placed as the first argument! If you want to prevent this, you can use the curly brackets { and }:

# The nested function call with dot placeholder and curly brackets
1:5 %>% {
  paste(letters[.])
}

# Rewrite the above function call 
paste(letters[1:5])
[1] "a" "b" "c" "d" "e"
[1] "a" "b" "c" "d" "e"

Building Unary Functions

Unary functions are functions that take one argument. Any pipeline that you might make that consists of a dot ., followed by functions and that is chained together with %>% can be used later if you want to apply it to values. Take a look at the following example of such a pipeline:

. %>% cos %>% sin

This pipeline takes some input, after which both the cos() and sin() functions are applied to it.

But you’re not there yet! If you want this pipeline to do exactly that which you have just read, you need to assign it first to a variable f, for example. After that, you can re-use it later to do the operations that are contained within the pipeline on other values.

# Unary function
f <- . %>% cos %>% sin 

f
structure(function (value) 
freduce(value, `_function_list`), class = c("fseq", "function"
))

Remember also that you could put parentheses after the cos() and sin() functions in the line of code if you want to improve readability. Consider the same example with parentheses: . %>% cos() %>% sin().

You see, building functions in magrittr is very similar to building functions with base R! If you’re not sure how similar they actually are, check out the line above and compare it with the next line of code; both lines have the same result!

# is equivalent to 
f <- function(.) sin(cos(.)) 

f
function (.) 
sin(cos(.))

Compound Assignment Pipe Operations

There are situations where you want to overwrite the value of the left-hand side, just like in the example right below. Intuitively, you will use the assignment operator <- to do this.

# Load in the Iris data
iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE)

# Add column names to the Iris data
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")

# Compute the square root of `iris$Sepal.Length` and assign it to the variable
iris$Sepal.Length <- 
  iris$Sepal.Length %>%
  sqrt()

However, there is a compound assignment pipe operator, which allows you to use a shorthand notation to assign the result of your pipeline immediately to the left-hand side:

# Compute the square root of `iris$Sepal.Length` and assign it to the variable
iris$Sepal.Length %<>% sqrt

# Return `Sepal.Length`
iris$Sepal.Length

Note that the compound assignment operator %<>% needs to be the first pipe operator in the chain for this to work. This is completely in line with what you just read about the operator being a shorthand notation for a longer notation with repetition, where you use the regular <- assignment operator.

As a result, this operator will assign a result of a pipeline rather than returning it.

Tee Operations with The Tee Operator

The tee operator works exactly like %>%, but it returns the left-hand side value rather than the potential result of the right-hand side operations.

This means that the tee operator can come in handy in situations where you have included functions that are used for their side effect, such as plotting with plot() or printing to a file.

In other words, functions like plot() typically don’t return anything. That means that, after calling plot(), for example, your pipeline would end. However, in the following example, the tee operator %T>% allows you to continue your pipeline even after you have used plot():

set.seed(123)
rnorm(200) %>%
matrix(ncol = 2) %T>%
plot %>% 
colSums


Exposing Data Variables with the Exposition Operator

When you’re working with R, you’ll find that many functions take a data argument. Consider, for example, the lm() function or the with() function. These functions are useful in a pipeline where your data is first processed and then passed into the function.

For functions that don’t have a data argument, such as the cor() function, it’s still handy if you can expose the variables in the data. That’s where the %$% operator comes in. Consider the following example:

iris %>%
  subset(Sepal.Length > mean(Sepal.Length)) %$%
  cor(Sepal.Length, Sepal.Width)

0.336696922252551

With the help of %$% you make sure that Sepal.Length and Sepal.Width are exposed to cor(). Likewise, you see that the data in the data.frame() function is passed to the ts.plot() to plot several time series on a common plot:

data.frame(z = rnorm(100)) %$%
  ts.plot(z)


dplyr and magrittr

In the introduction to this tutorial, you already learned that the development of dplyr and magrittr occurred around the same time, namely, around 2013-2014. And, as you have read, the magrittr package is also part of the Tidyverse.

In this section, you will discover how exciting it can be when you combine both packages in your R code.

For those of you who are new to the dplyr package, you should know that this R package was built around five verbs, namely, “select”, “filter”, “arrange”, “mutate” and “summarize”. If you have already manipulated data for some data science project, you will know that these verbs make up the majority of the data manipulation tasks that you generally need to perform on your data.

Take an example of some traditional code that makes use of these dplyr functions:

library(hflights)

grouped_flights <- group_by(hflights, Year, Month, DayofMonth)
flights_data <- select(grouped_flights, Year:DayofMonth, ArrDelay, DepDelay)
summarized_flights <- summarise(flights_data, 
                arr = mean(ArrDelay, na.rm = TRUE), 
                dep = mean(DepDelay, na.rm = TRUE))
final_result <- filter(summarized_flights, arr > 30 | dep > 30)

final_result
Year Month DayofMonth      arr      dep
2011     2          4 44.08088 47.17216
2011     3          3 35.12898 38.20064
2011     3         14 46.63830 36.13657
2011     4          4 38.71651 27.94915
2011     4         25 37.79845 22.25574
2011     5         12 69.52046 64.52039
2011     5         20 37.02857 26.55090
2011     6         22 65.51852 62.30979
2011     7         29 29.55755 31.86944
2011     9         29 39.19649 32.49528
2011    10          9 61.90172 59.52586
2011    11         15 43.68134 39.23333
2011    12         29 26.30096 30.78855
2011    12         31 46.48465 54.17137

When you look at this example, you immediately understand why dplyr and magrittr are able to work so well together:

hflights %>% 
    group_by(Year, Month, DayofMonth) %>% 
    select(Year:DayofMonth, ArrDelay, DepDelay) %>% 
    summarise(arr = mean(ArrDelay, na.rm = TRUE), dep = mean(DepDelay, na.rm = TRUE)) %>% 
    filter(arr > 30 | dep > 30)

Both code chunks are fairly long, but you could argue that the second code chunk is more clear if you want to follow along through all of the operations. With the creation of intermediate variables in the first code chunk, you could possibly lose the “flow” of the code. By using %>%, you gain a more clear overview of the operations that are being performed on the data!

In short, dplyr and magrittr are your dreamteam for manipulating data in R!

RStudio Keyboard Shortcuts for Pipes

Adding all these pipes to your R code can be a challenging task! To make your life easier, John Mount, co-founder and Principal Consultant at Win-Vector, LLC and DataCamp instructor, has released a package with some RStudio add-ins that allow you to create keyboard shortcuts for pipes in R. Addins are actually R functions with a bit of special registration metadata. An example of a simple addin can, for example, be a function that inserts a commonly used snippet of text, but can also get very complex!

With these addins, you’ll be able to execute R functions interactively from within the RStudio IDE, either by using keyboard shortcuts or by going through the Addins menu.

Note that this package is actually a fork from RStudio’s original add-in package, which you can find here. Be careful though, the support for addins is available only within the most recent release of RStudio! If you want to know more on how you can install these RStudio addins, check out this page.

You can download the add-ins and keyboard shortcuts here.

When Not To Use the Pipe Operator in R

In the above, you have seen that pipes are definitely something that you should be using when you’re programming with R. More specifically, you have seen this by covering some cases in which pipes prove to be very useful! However, there are some situations, outlined by Hadley Wickham in “R for Data Science”, in which you can best avoid them:

  • Your pipes are longer than (say) ten steps.

In cases like these, it’s better to create intermediate objects with meaningful names. It will not only be easier for you to debug your code, but you’ll also understand your code better and it’ll be easier for others to understand your code.

  • You have multiple inputs or outputs.

If you aren’t transforming one primary object, but two or more objects are combined together, it’s better not to use the pipe.

  • You are starting to think about a directed graph with a complex dependency structure.

Pipes are fundamentally linear and expressing complex relationships with them will only result in complex code that will be hard to read and understand.

  • You’re doing internal package development

Using pipes in internal package development is a no-go, as it makes it harder to debug!

For more reflections on this topic, check out this Stack Overflow discussion. Other situations that appear in that discussion are loops, package dependencies, argument order and readability.

In short, you could summarize it all as follows: keep the two things in mind that make this construct so great, namely, readability and flexibility. As soon as one of these two big advantages is compromised, you might consider some alternatives in favor of the pipes.

Alternatives to Pipes in R

After all that you have read, you might also be interested in some alternatives that exist in the R programming language. Some of the solutions that you have seen in this tutorial were the following:

  • Create intermediate variables with meaningful names;

Instead of chaining all operations together and outputting one single result, break up the chain and make sure you save intermediate results in separate variables. Be careful with the naming of these variables: the goal should always be to make your code as understandable as possible!

  • Nest your code so that you read it from the inside out;

One of the possible objections that you could have against pipes is that they go against the “flow” that you have been accustomed to with base R. The solution is then to stick with nesting your code! But what if you don’t like pipes and you also find nesting confusing? The solution here can be to use indentation to highlight the hierarchy.

  • … Do you have more suggestions? Make sure to let me know – Drop me a tweet @willems_karlijn

Conclusion

You have covered a lot of ground in this tutorial: you have seen where %>% comes from, what it exactly is, why you should use it, and how you should use it. You’ve seen that the dplyr and magrittr packages work wonderfully together and that there are even more operators out there! Lastly, you have also seen some cases in which you shouldn’t use it when you’re programming in R and what alternatives you can use in such cases.

If you’re interested in learning more about the Tidyverse, consider DataCamp’s Introduction to the Tidyverse course.

 

Source: Pipes in R Tutorial For Beginners (article) – DataCamp

Data Preprocessing – Everything About Data Preprocessing (Cleansing)

This post walks through, in order, the tasks to perform during the stage often called exploratory data analysis (EDA). Data preprocessing proceeds in this order: checking the data set – handling missing values – handling outliers – feature engineering.

The most time-consuming part of data analysis is this exploratory stage. According to a CrowdFlower survey cited by Forbes, data analysts spend about 80% of their working hours collecting and preprocessing data.
(In the same survey, it was also picked as the least-liked part of the job.)

1. Checking the Data Set

This is the stage where you get to know the data set you intend to analyze. You perform the following two checks on it.

A. Checking the Variables

Check the definitions of the independent and dependent variables, each variable’s type (categorical or continuous), and each variable’s data type (Date, Character, Numeric, and so on).

This is true of other tools as well, but in R, fitting a model can give completely different results depending on a variable’s data type, so check the types in advance and correct any that are set wrong at this stage.

B. Checking the Raw Data

B-1. Univariate Analysis

Here you check descriptive statistics for one variable at a time. Use a histogram or boxplot to inspect each variable’s distribution together with its mean, mode, and median. For categorical variables, check the frequency distribution with a bar chart.

B-2. Bivariate Analysis

Here you analyze the relationship between two variables. Choose an appropriate visualization and analysis method according to the types of the two variables.

B-3. Three or More Variables

It’s more tedious, but sometimes you need to visualize and analyze the relationships among three or more variables. If at least one categorical variable is involved, split the data by its categories and then apply the methods above.

For example, if you have gender information along with annual income, education level, and height, you could split by gender and use a t-test to check whether annual income and education level are independent, or split by education level and check the correlation between annual income and height.

To examine the relationships among three or more continuous variables, either convert a continuous variable into a categorical one through feature engineering and analyze as above, or (not recommended, but if you really must) draw a 3D plot and inspect it visually.

 

2. Missing Value Treatment

If you build a model while missing values are still present, the relationships between variables can be distorted, and the model’s accuracy drops.

Missing values arise in various ways. Depending on whether values are missing at random or whether the missingness is related to other variables, the treatment differs somewhat.

Types of missing value treatment

A. Deletion

You can either delete every observation that contains a missing value (listwise deletion), or delete only the observations that have missing values in the variables you will include in the model (partial deletion).

Listwise deletion is convenient, but the reduced number of observations can lower the model’s validity; with partial deletion, the retained variables differ from model to model, which increases maintenance cost.

Use deletion when values are missing at random. If they are not missing at random and you work with the observations deleted, the resulting model can be distorted.
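Both flavors of deletion can be sketched with base R functions; the data frame and column names here are hypothetical:

```r
# A small data frame with missing values
df <- data.frame(height = c(170, NA, 158, 180),
                 weight = c(70, 65, NA, 80),
                 age    = c(30, 25, 40, NA))

# Listwise deletion: drop every row that has any NA
complete_df <- na.omit(df)

# Partial deletion: drop NA rows only for the variables the model uses
model_df <- df[complete.cases(df[, c("height", "weight")]), c("height", "weight")]

nrow(complete_df)  # 1 row survives listwise deletion
nrow(model_df)     # 2 rows survive when only height and weight matter
```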

B. Replacing with Another Value (Mean, Mode, Median)

Missing values can be replaced with the mean, mode, or median of the other observations. There are two styles: blanket imputation, which replaces with, say, the mean of all observations, and similar-group imputation, which uses a categorical variable to replace with the mean of observations of a similar type.

(Example: if men’s average height is 173 and women’s is 158, replace a missing height for a male observation with 173.)

When the missingness is related to other variables, imputation can be useful; but with similar-group imputation, the choice of which categorical variable defines a “similar type” is arbitrary, so the model may still end up distorted.
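A minimal sketch of both imputation styles in R, using a made-up height/gender data frame:

```r
df <- data.frame(gender = c("M", "M", "F", "F"),
                 height = c(173, NA, 158, NA))

# Blanket imputation: replace every NA with the overall mean
overall <- df
overall$height[is.na(overall$height)] <- mean(overall$height, na.rm = TRUE)

# Similar-group imputation: replace NA with the mean of the same gender
grouped <- df
group_means <- ave(grouped$height, grouped$gender,
                   FUN = function(v) mean(v, na.rm = TRUE))
grouped$height[is.na(grouped$height)] <- group_means[is.na(grouped$height)]

overall$height  # both NAs become 165.5, the overall mean
grouped$height  # the male NA becomes 173, the female NA 158
```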

C. Inserting Predicted Values

Use the observations without missing values as training data to build a model that predicts the missing values, then use it to predict the missing values in the remaining observations. Regression or logistic regression is typically used.

This is somewhat less arbitrary than simple replacement, but when values are missing across many variables, few usable predictors remain and it’s hard to fit a suitable model; and when that model’s predictive power is low, this method becomes hard to use.
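A sketch of prediction-based imputation with a simple linear regression; the data is simulated purely for illustration:

```r
set.seed(1)
df <- data.frame(x = 1:20)
df$y <- 2 * df$x + rnorm(20)
df$y[c(5, 12)] <- NA  # knock out two values to impute

# Train on the complete rows, then predict the missing ones
fit <- lm(y ~ x, data = df[!is.na(df$y), ])
missing <- is.na(df$y)
df$y[missing] <- predict(fit, newdata = df[missing, , drop = FALSE])

sum(is.na(df$y))  # 0: every missing value has been filled in
```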

3. Outlier Treatment

Image credit: stats.stackexchange.com

An outlier is an observation that lies far from the rest of the data or sample and that has the potential to distort a model.

Finding Outliers

The easiest, simplest way to find outliers is to visualize each variable’s distribution. A boxplot or histogram is generally used for a single variable, and a scatter plot for finding outliers between two variables.

Find the hidden outlier. Image credit: www.analyticsvidhya.com

Visual inspection is intuitive, but it is also subjective and tedious, since you have to go through the variables one by one.

Another way to find outliers between two variables is to fit a regression model on them and examine the residuals, studentized (or standardized) residuals, leverage, and Cook’s D values.
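Both approaches can be sketched in base R; the data here is simulated, with one outlier planted by hand:

```r
set.seed(42)
x <- rnorm(50)
y <- 2 * x + rnorm(50, sd = 0.5)
y[50] <- y[50] + 10  # plant an outlier in the last observation

# Univariate: boxplot.stats() flags points beyond the whiskers
boxplot.stats(y)$out

# Bivariate: fit a regression and inspect the diagnostics
fit <- lm(y ~ x)
which.max(abs(rstudent(fit)))   # largest studentized residual
which.max(cooks.distance(fit))  # most influential point (Cook's D)
```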

Treating Outliers

A. Simple Deletion

If an outlier is due to human error, simply delete the observation. This applies to plain typos, unrealistic answers to open-ended survey questions, errors introduced during data processing, and the like.

B. Replacing with Another Value

When the absolute number of observations is small, removing outliers by deletion makes an already small data set even smaller.

In that case, even if the outlier came from human error, instead of deleting the observation you can replace it with another value (such as the mean), or, much as with missing values, build a prediction model from the other variables, predict the outlying value, and substitute the prediction.

C. Turning the Outlier into a Variable

When an outlier occurred naturally, a model built after simple deletion or replacement may not explain or predict the phenomenon of interest well.

For example, looking only at the other observations, years of experience and salary appear to be proportional; but once the outlier, a fifth-year employee earning $35,000, is included, the model’s explanatory power drops sharply.

For naturally occurring outliers, don’t delete them immediately; it’s important to look into them more carefully first.

Suppose, for instance, that the outlier above is someone in a professional occupation such as a doctor. In that case, encoding “professional occupation” as a yes/no variable lets you keep the outlier in the model instead of discarding it.

D. Resampling

Another way to handle naturally occurring outliers is to separate them out and build the model without them.

Suppose, as below, there is an outlier with more than 15 years of experience, someone whose career is long but whose salary has not grown in proportion.

(Difference from the previous case: there, the observation was an outlier only on the response variable, salary, not on the explanatory variable, experience; here it is an outlier on both.)

In this case you can handle the outlier simply by deleting it and noting that the analysis covers only people with up to 10 years of experience.

E. Analyzing the cases separately

In the same example, it may be that salary really does drop once a career gets very long (for health reasons, for instance).

If so, excluding the outlier may not accurately describe the phenomenon. A better approach is to build both models, one including the outlier and one excluding it, and document each.

If a naturally occurring outlier shows nothing else unusual, analyzing the cases separately is preferable to simply excluding it.

4. Feature Engineering

Feature engineering is the process of adding information to a dataset using the variables it already has. It makes existing data more useful without adding new observations or variables.

A. SCALING

Variable transformation is used when you want to change a variable's units, when its distribution is skewed, or when relationships between variables are hard to see.

The most commonly used transform is the log; taking the square root is similar but used less often.
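A small simulated illustration: a log transform makes a right-skewed (lognormal) variable roughly symmetric, which a simple moment-based skewness measure confirms:

```r
set.seed(3)
income <- rlnorm(1000, meanlog = 10, sdlog = 1)   # right-skewed, like income data

# Simple moment-based skewness measure
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew(income)         # strongly positive (long right tail)
skew(log(income))    # near 0: far more symmetric
skew(sqrt(income))   # in between: sqrt compresses less than log
```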


B. BINNING

Binning converts a continuous variable into a categorical one. For example, if salary exists as a number, you can convert it into bands such as under 1,000,000 won, 1,010,000-2,000,000 won, and so on.

There are no fixed rules for binning, so an analyst can bin creatively, guided by their understanding of the business.
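In R, binning is typically done with `cut()`; a sketch with hypothetical salaries in 10,000-won units, binned into 1,000,000-won bands:

```r
# Hypothetical salaries in units of 10,000 won (so 100 = 1,000,000 won)
salary <- c(80, 150, 240, 310, 95, 180)

salary_band <- cut(salary,
                   breaks = c(0, 100, 200, 300, Inf),
                   labels = c("under 1M", "1M-2M", "2M-3M", "over 3M"))
table(salary_band)
```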

C. TRANSFORM

This means creating new variables from the properties of existing ones.

For example, with daily sales data you could add a variable that splits each date into weekday/weekend; with e-sports attendance data you could add whether SKT T1 played that day.

As with binning, there are no fixed rules for transforms; the variables you can create depend on the analyst's understanding of the business.
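A sketch of the weekday/weekend transform on hypothetical daily sales data; `format(date, "%u")` gives the ISO weekday number (1 = Monday, 7 = Sunday) and is locale-independent, unlike `weekdays()`:

```r
# Hypothetical daily sales for one week starting Wed 2017-03-01
sales <- data.frame(date  = as.Date("2017-03-01") + 0:6,
                    units = c(12, 15, 11, 14, 20, 31, 28))

# Derive a weekend indicator; %u is the ISO weekday (1 = Mon ... 7 = Sun)
sales$is_weekend <- as.integer(format(sales$date, "%u") %in% c("6", "7"))
sales            # exactly two weekend rows: Sat 03-04 and Sun 03-05
```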

D. DUMMY

The opposite of binning: dummy coding converts a categorical variable into numeric (0/1) variables, mainly when the analysis method you want to use requires it.
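In R, dummy coding is usually done with `model.matrix()`, which expands a factor into 0/1 columns (shown here on a hypothetical blood-type factor):

```r
# Hypothetical categorical variable
d <- data.frame(blood = factor(c("A", "B", "O", "AB", "A")))

# Treatment coding: the first level becomes the baseline, the rest 0/1 columns
model.matrix(~ blood, data = d)

# One-hot coding: drop the intercept to keep a column for every level
dummies <- model.matrix(~ blood - 1, data = d)
ncol(dummies)    # 4 columns, one per blood type
```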


 

Wrapping up

Garbage in, garbage out: checking the state of your data before model building and preprocessing it to suit the analysis you have designed is an essential step for accurate results. It is like sourcing good ingredients and preparing them well before cooking.

I hope this post has given you the big picture of data preprocessing.

 

References

1) MeasuringU, 7 Ways To Handle Missing Data

2) Boston University Technical Report, Marina Soley-Bori, Dealing with missing data: Key assumptions and methods for applied analysis

3) R-bloggers, Imputing missing data with R; MICE package

4) Analytics Vidhya, A comprehensive guide to data exploration

5) The analysis factor, Outliers: To Drop or Not to Drop

6) Kellogg, Outliers

Source: Data Preprocessing - Everything About Data Preprocessing (Cleansing)

How to Create Infographics in R – nandeshwar.info

What’s so special about this?

Now you may ask: “What’s so special about this?” Well, the theory says that, like regular dominoes, these dominoes could keep pushing the next ones down: even the small 2-inch domino can eventually bring down the tallest of dominoes. Keep going and,

  • with the 15th domino you will reach the height of one of the tallest dinosaurs, Argentinosaurus (16m tall)
  • with the 18th domino you will reach the top of the Statue of Liberty (46m)
  • with the 40th domino you will reach the space station (370km)
  • and by the 57th, you will reach the moon (370,000km)

People use the expression “reach for the moon” to mean attempting very difficult tasks, often in an almost defeatist tone; this example, however, empowers us to think that it IS indeed possible for a small domino to build up and reach the moon.

It might be easy for some people to coast on whatever knowledge they currently have, but we know that in the knowledge economy we must continuously improve and learn. I heard this recently from Mike Rayburn: “coasting only happens downhill.” Although it is easy to coast at a job, coasting will only bring us down. A more difficult but invigorating approach is to become a “virtuoso”: to reach mastery in your chosen field.
public speaker quote mike rayburn "coasting only happens when you are going downhill"

virtuoso
noun
a person who has a masterly or dazzling skill or technique in any field of activity

Reaching for the moon is of course extremely difficult and improbable for most of us, but the metaphor is powerful. We start with one small step, we repeat that step with increased intensity and momentum, and we can achieve our goals. With focus, effort, and momentum, it is possible to achieve even the most improbable goals. Find the one thing that will make you successful and repeat it every day with increasing intensity.
improve 1% every day, you will be 37 times better in a year quote

Recipe for Infographics in R

Ingredients

Now I feel better: I have justified replicating the above example in R. With that settled, I searched for some basics and found some fantastic threads on Stack Overflow about using images in R and ggplot2.

Hint

It was hardly obvious to me that infographics are called pictograms in statistics. Remember this when you search for information on infographics in R.

After knowing that it was possible to create infographics in R, I searched for some vector art. I found them on vecteezy.com and vector.me.

Not lying

Edward Tufte, in his book The Visual Display of Quantitative Information, famously described how graphic designers (or let’s say data communicators) “lie” with data, especially when the objects they plot are not drawn in true proportion. My challenge was thus to avoid lying and still communicate the message.

Tufte's quote on proportion and lie factor
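Tufte quantifies this with the lie factor: the size of the effect shown in the graphic divided by the size of the effect in the data. Values far from 1 indicate distortion; his famous fuel-economy example scores about 14.8:

```r
# Tufte's lie factor: effect size shown in the graphic / effect size in the data
lie_factor <- function(graphic_change, data_change) graphic_change / data_change

# Tufte's fuel-economy chart: the data change 53%, the drawing changes 783%
lie_factor(7.83, 0.53)   # about 14.8: the graphic exaggerates roughly 15x
```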

R code

Now to the fun part: getting our hands dirty in R (when we're not pulling our hair out dealing with R).

Step 1: Load my favorite libraries

library('ggplot2')
library('scales') # for number formatting
library('png') # to read png files
library('grid') # for expanding the plot
library('Cairo') # for high quality plots and anti-aliasing 
library('plyr') # for easy data manipulation

Step 2: Generate data and the base plot

dominoes <- data.frame(n = 1:58, height = 0.051 *1.5^(0:57)) # 2inch is 0.051 meters
base_plot <- qplot(x = n, y = height, data = dominoes, geom = "line") #+  scale_y_sqrt()
base_plot <- base_plot  + labs(x = "Sequence Number", y = "Height/Distance\n(meters)") +  theme(axis.ticks = element_blank(), panel.background = element_rect(fill = "white", colour = "white"), legend.position = "none")
base_plot <- base_plot  + theme(axis.title.y = element_text(angle = 0), axis.text = element_text(size = 18), axis.title = element_text(size = 20))
base_plot <- base_plot +  theme(plot.margin = unit(c(1,1,18,1), "lines")) + scale_y_continuous(labels = comma)
base_plot

Note

Note the plot.margin argument: I increased the bottom margin of the plot by supplying unit(c(1,1,18,1), "lines") (top, right, bottom, left), leaving room for the images below the axis.

We get this plot:
line plot in R using ggplot2 of a geometric series

Step 3: Read all the vector arts in a Grob form

domino_img <- readPNG("domino.png")
domino_grob <- rasterGrob(domino_img, interpolate = TRUE)
 
eiffel_tower_img <- readPNG("eiffel-tower.png")
eiffel_tower_grob <- rasterGrob(eiffel_tower_img, interpolate = TRUE)
 
pisa_img <- readPNG("pisa-tower.png")
pisa_grob <- rasterGrob(pisa_img, interpolate = TRUE)
 
liberty_img <- readPNG("statue-of-liberty.png")
libery_grob <- rasterGrob(liberty_img, interpolate = TRUE)
 
long_neck_dino_img <- readPNG("dinosaur-long-neck.png")
long_neck_dino_grob <- rasterGrob(long_neck_dino_img, interpolate = TRUE)

Step 4: Line up the images without lying

p <- base_plot + annotation_custom(eiffel_tower_grob, xmin = 20, xmax = 26, ymin = 0, ymax = 381) + annotation_custom(libery_grob, xmin = 17, xmax = 19, ymin = 0, ymax = 50) + annotation_custom(long_neck_dino_grob, xmin = 13, xmax = 17, ymin = 0, ymax = 15)
 
CairoPNG(filename = "domino-effect-geometric-progression.png", width = 1800, height = 600, quality = 90)
plot(p)
dev.off()

From step 4, we get this:
A geometric series infographics visualized in R

Shucks! All this work for this boring-looking graph. Not lying is not fun. Although the Argentinosaurus, Statue of Liberty, and Eiffel Tower are all drawn proportional to their heights, the plot lacks appeal. I thought the next best thing would be to place each object close to its value on the x-axis. Another benefit of this approach: I could add some other objects with very small and very large y-axis values, i.e. a domino, the space station, and our moon.

Step 5: Place more images using a custom function

#create a data frame to store file names and x/y coordinates
grob_placement <- data.frame(imgname = c("dinosaur-long-neck.png", "statue-of-liberty.png", "eiffel-tower.png", "space-station.png", "moon.png"),                            
                             xmins = c(13, 17, 20, 38, 53),
                             ymins = rep(-1*10^8, 5),
                             ymaxs = rep(-4.5*10^8, 5),
                            stringsAsFactors = FALSE)
grob_placement$xmaxs <- grob_placement$xmins + 4
 
#make a function to create the grobs and call the annotation_custom function
add_images <- function(df) {
  dlply(df, .(imgname), function(df){
  img <- readPNG(unique(df$imgname))
  grb <- rasterGrob(img, interpolate = TRUE) 
  annotation_custom(grb, xmin = df$xmins, xmax = df$xmaxs, ymin = df$ymins, ymax = df$ymaxs)
  })
}

Step 6: Add text labels

#text data frame with x/y coordinates
img_texts <- data.frame(imgname = c("domino", "dino", "space-station", "moon"),                            
                             xs = c(1, 13, 38, 53),
                             ys = rep(-5.2*10^8, 4),
                             texts = c("1st domino is\nonly 2in",
                                       "15th domino will reach Argentinosaurus (16m).\nBy 18th domino, you will reach the statue of liberty (46m).\nThe 23rd domino will be taller than the Eiffel Tower (300m)",
                                       "40th domino will\nreach the ISS (370km)",
                                       "57th domino will\nreach the moon (370,000km)"
                                       ))
 
add_texts <- function(df) {
  dlply(df, .(imgname), function(df){
    annotation_custom(grob = textGrob(label = df$texts, hjust = 0),
      xmin = df$xs, xmax = df$xs, ymin = df$ys, ymax = df$ys)    
  })
}

Step 7: Put everything together

base_plot + add_images(grob_placement) + add_texts(img_texts)
 
CairoPNG(filename = "domino-effect-geometric-progression-2.png", width = 1800, height = 600, quality = 90)
g <- base_plot + add_images(grob_placement) + add_texts(img_texts) + annotation_custom(domino_grob, xmin = 1, xmax = 2, ymin = -1*10^8, ymax = -5*10^8)
gt <- ggplot_gtable(ggplot_build(g))
gt$layout$clip[gt$layout$name == "panel"] <- "off"
grid.draw(gt)
dev.off()

This is what we get. Not bad, huh?
A geometric series infographics visualized in R

We still have a problem: our beloved moon is smaller than the space station, because I placed all the images in rectangles of the same height. I could have made the moon slightly bigger, but then I could not have maintained the proportions. I decided it was better to keep all the objects in similar-size rectangles than to change proportions at will. If you have other ideas, please let me know.

Step 8: Make it pretty

And by pretty, I mean, upload the final plot to Canva and add the orange color. 🙂 Here is my final version:
Final infographics created with R and finalized in Canva

There it is! It is possible to use R to create infographics or pictograms, and the obvious advantage, as I explained in my post Tableau vs. R, is a programming language's repeatability and reproducibility. You can, of course, edit the output plots in Illustrator or GIMP, but for quick wins, R's output is fantastic. Can you think of any other ideas for creating infographics in R?


Full Script

 
#http://stackoverflow.com/questions/14113691/pictorial-chart-in-r?lq=1
#http://stackoverflow.com/questions/6797457/images-as-labels-in-a-graph?lq=1
#http://stackoverflow.com/questions/20733328/labelling-the-plots-with-images-on-graph-in-ggplot2?rq=1
#http://stackoverflow.com/questions/25014492/geom-bar-pictograms-how-to?lq=1
#http://stackoverflow.com/questions/19625328/make-the-value-of-the-fill-the-actual-fill-in-ggplot2/20196002#20196002
#http://stackoverflow.com/questions/12409960/ggplot2-annotate-outside-of-plot?lq=1
library('ggplot2')
library('scales')
library('png')
library('grid')
library('Cairo')
library('plyr')
 
dominoes <- data.frame(n = 1:58, height = 0.051 *1.5^(0:57)) # 2inch is 0.051 meters
base_plot <- qplot(x = n, y = height, data = dominoes, geom = "line") #+  scale_y_sqrt()
base_plot <- base_plot  + labs(x = "Sequence Number", y = "Height/Distance\n(meters)") +  theme(axis.ticks = element_blank(), panel.background = element_rect(fill = "white", colour = "white"), legend.position = "none")
base_plot <- base_plot  + theme(axis.title.y = element_text(angle = 0), axis.text = element_text(size = 18), axis.title = element_text(size = 20))
base_plot <- base_plot +  theme(plot.margin = unit(c(1,1,18,1), "lines")) + scale_y_continuous(labels = comma)
base_plot
 
domino_img <- readPNG("domino.png")
domino_grob <- rasterGrob(domino_img, interpolate = TRUE)
 
eiffel_tower_img <- readPNG("eiffel-tower.png")
eiffel_tower_grob <- rasterGrob(eiffel_tower_img, interpolate = TRUE)
 
pisa_img <- readPNG("pisa-tower.png")
pisa_grob <- rasterGrob(pisa_img, interpolate = TRUE)
 
liberty_img <- readPNG("statue-of-liberty.png")
libery_grob <- rasterGrob(liberty_img, interpolate = TRUE)
 
long_neck_dino_img <- readPNG("dinosaur-long-neck.png")
long_neck_dino_grob <- rasterGrob(long_neck_dino_img, interpolate = TRUE)
 
 
#space station is 370,149.120 meters 
 
#this version tries to scale images by their heights
p <- base_plot + annotation_custom(eiffel_tower_grob, xmin = 20, xmax = 26, ymin = 0, ymax = 381) + annotation_custom(libery_grob, xmin = 17, xmax = 19, ymin = 0, ymax = 50) + annotation_custom(long_neck_dino_grob, xmin = 13, xmax = 17, ymin = 0, ymax = 15)
 
CairoPNG(filename = "domino-effect-geometric-progression.png", width = 1800, height = 600, quality = 90)
plot(p)
dev.off()
 
 
 
#this version just places a picture at the number
grob_placement <- data.frame(imgname = c("dinosaur-long-neck.png", "statue-of-liberty.png", "eiffel-tower.png", "space-station.png", "moon.png"),                            
                             xmins = c(13, 17, 20, 38, 53),
                             ymins = rep(-1*10^8, 5),
                             ymaxs = rep(-4.5*10^8, 5),
                            stringsAsFactors = FALSE)
grob_placement$xmaxs <- grob_placement$xmins + 4
 
#make a function to create the grobs and call the annotation_custom function
add_images <- function(df) {
  dlply(df, .(imgname), function(df){
  img <- readPNG(unique(df$imgname))
  grb <- rasterGrob(img, interpolate = TRUE) 
  annotation_custom(grb, xmin = df$xmins, xmax = df$xmaxs, ymin = df$ymins, ymax = df$ymaxs)
  })
}
 
img_texts <- data.frame(imgname = c("domino", "dino", "space-station", "moon"),                            
                             xs = c(1, 13, 38, 53),
                             ys = rep(-5.2*10^8, 4),
                             texts = c("1st domino is\nonly 2in",
                                       "15th domino will reach Argentinosaurus (16m).\nBy 18th domino, you will reach the statue of liberty (46m).\nThe 23rd domino will be taller than the Eiffel Tower (300m)",
                                       "40th domino will\nreach the ISS (370km)",
                                       "57th domino will\nreach the moon (370,000km)"
                                       ))
 
add_texts <- function(df) {
  dlply(df, .(imgname), function(df){
    annotation_custom(grob = textGrob(label = df$texts, hjust = 0),
      xmin = df$xs, xmax = df$xs, ymin = df$ys, ymax = df$ys)    
  })
}
 
base_plot + add_images(grob_placement) + add_texts(img_texts)
 
CairoPNG(filename = "domino-effect-geometric-progression-2.png", width = 1800, height = 600, quality = 90)
g <- base_plot + add_images(grob_placement) + add_texts(img_texts) + annotation_custom(domino_grob, xmin = 1, xmax = 2, ymin = -1*10^8, ymax = -5*10^8)
gt <- ggplot_gtable(ggplot_build(g))
gt$layout$clip[gt$layout$name == "panel"] <- "off"
grid.draw(gt)
dev.off()

 

Source: How to Create Infographics in R – nandeshwar.info

Quick Heatmap Plotting with ggplot2

A post on the FlowingData blog shows how to quickly create the heatmap below using R base graphics.

This post shows how to achieve a very similar result with ggplot2.

nba_heatmap_revised.png


Getting the data

FlowingData used last season's NBA basketball statistics provided by databasebasketball.com, and the csv file with the data can be downloaded directly from that website.

> nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")

 

The players are ordered by the points they scored, and the Name variable is converted to a factor to ensure proper sorting of the plot.

> nba$Name <- with(nba, reorder(Name, PTS))

 

FlowingData uses the heatmap function from the stats package, which requires the plotted values in matrix format, whereas ggplot2 works with data frames. For easier processing, the data frame is converted from wide format to long format.

The game statistics have very different ranges, so to make them comparable all the individual statistics are rescaled.

> library(ggplot2)
> library(reshape2) # for melt()
> library(plyr)     # for ddply()
> library(scales)   # for rescale()
> nba.m <- melt(nba)
> nba.m <- ddply(nba.m, .(variable), transform,
+     rescale = rescale(value))

Plotting

ggplot2 has no specific heatmap plotting function, but combining geom_tile with a smooth gradient fill does the job very well.

> (p <- ggplot(nba.m, aes(variable, Name)) + geom_tile(aes(fill = rescale),
+     colour = "white") + scale_fill_gradient(low = "white",
+     high = "steelblue"))
basketball_heatmap-008.png

A few finishing touches are applied to the formatting, and the heatmap is ready to be shown.

> base_size <- 9
> # opts(), theme_blank() and theme_text() were removed from ggplot2;
> # theme(), element_blank() and element_text() are their replacements
> p + theme_grey(base_size = base_size) + labs(x = "",
+     y = "") + scale_x_discrete(expand = c(0, 0)) +
+     scale_y_discrete(expand = c(0, 0)) + theme(legend.position = "none",
+     axis.ticks = element_blank(), axis.text.x = element_text(size = base_size *
+         0.8, angle = 330, hjust = 0, colour = "grey50"))
basketball_heatmap-010.png

Update on rescaling

When preparing the data for the figures above, every variable was rescaled to values between 0 and 1.

Jim pointed out in the comments (and I didn't get it at first) that the heatmap function uses a different scaling method, so the figures are not identical. Below is an updated version of the heatmap that looks much more like the original.

> nba.s <- ddply(nba.m, .(variable), transform,
+     rescale = scale(value))
> last_plot() %+% nba.s
basketball_heatmap-013.png

Source: ggplot2: Quick Heatmap Plotting | Learning R

Overcoming the Fear of Programming

Did you say you have never programmed in your life? Never heard words like classes and objects, data frames, methods, inheritance, or loops? Are you afraid of programming?

Don't be. Programming can be fun and stimulating, and once you start and learn it, you will love spending time programming your strategies. You will want to watch your code spring into action in the blink of an eye, and you will see how powerful it can be.

The Executive Programme in Algorithmic Trading (EPAT™) makes extensive use of the Python and R programming languages to teach strategies, backtesting, and optimization. Using R, we will show how you can overcome the fear of programming. Here are a few suggestions for newbie programmers.

1) Think and let the questions pop in your mind

As a newbie programmer when you have a task to code, even before you start on it, spend some time ideating on how you would like to solve it step-by-step. Simply let questions pop up in your mind, as many questions as your mind may throw up.

Here are a few questions:
Is it possible to download stock price data in R from Google Finance?
How to delete a column in R? How to compute an exponential moving average (EMA)?
How do I draw a line chart in R? How to merge two data sets?
Is it possible to save the results in an excel workbook using R?

2) Google the questions for answers

Use Google search to see whether solutions exist for the questions that you have raised. Let us take the second question: how to delete a column in R? We posted the question in Google search and, as we can see from the screenshot below, the solution appears in the very first result Google shows.

R is an open-source project, and there are hundreds of articles, blogs, forums, tutorials, YouTube videos, and books on the net that will help you overcome the fear of programming and take you from beginner to intermediate, and eventually to expert if you aspire to it.

The chart below shows the number of questions/threads posted by newbie and expert programmers on two popular websites. As you can see, R clearly tops the results with more than 10 thousand questions/threads.
(Source: www.r4stats.com)

Let us search on Google whether QuantInsti™ has put up any programming material on R.
As you can see from the Google results, QuantInsti™ has posted quality content on its website to help newbie programmers design and model quantitative trading strategies in R. You can read all the rich content posted regularly by QuantInsti™ here: https://www.quantinsti.com/blog

3) Use the print command in R

As a newbie programmer, don’t get intimidated when you come across complex-looking code on the internet. If you are unable to figure out what exactly the code does, just copy the code into R. You can use a simple print command to help understand how the code works.

You can also use Ctrl+Enter to execute the code line by line and see the results in the console.

Let us take an example of an MACD trading strategy posted on QuantInsti’s blog.

An example of a trading strategy coded using Quantmod Package in R

I am unsure of the working of commands at line 9 and line 11. So I simply inserted a print(head(returns)) command at line 10 and one more at line 12. Thereafter I ran the code. Below is the result as shown in the console window of R.

The returns = returns['2008-06-02/2015-09-22'] command simply trims the original NSEI.Close price returns series: it previously started from 2007-09-17, and it now starts at 2008-06-02 and ends at 2015-09-22.

4) Use help() and example() functions in R

You can also make use of the help() and example() functions in R to understand code and to learn new ways of coding. Continuing with the code above, I am unsure what the ROC function does at line 9 of the code.

I used the help(“ROC”) command, and R displays all the relevant information regarding the usage and arguments of the ROC function.

There are hundreds of add-on packages in R that make programming easy yet powerful.

Below is the link to view all the available packages in R:
https://cran.r-project.org/web/packages/available_packages_by_name.html

5) Give time to programming

Programming can be a very rewarding experience, and we expect that you devote some time towards learning and honing your programming skills. Below is a word cloud of some essential characteristics a good programmer should possess. The best suggestion would be to just start programming!!

Next Step

If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to be a successful trader. Enroll now!

As a newbie programmer, you have just made a start. The faculty at QuantInsti™ will teach and guide you through different aspects of programming in R and Python. Over the course of the program, you will learn different data structures, classes and objects, functions, and many other aspects which will enable you to program algorithmic trading strategies in the most efficient and powerful way.


 

Source: Overcome the Fear of Programming | R-bloggers

xwMOOC Machine Learning

Learning objectives

  • Convert table data into tidy data.
  • Convert tidy data into categorical data, i.e. the factor data structure.
  • Reuse existing R functionality by visualizing with the vcd package's mosaic() function.

1. Data structures and visualization for categorical data

One of the most common forms of data we encounter every day is the table. Paradoxically, it is also the form with the least published guidance on how to work with it. Even in statistics departments, continuous data gets plenty of attention, while few people get to fully understand and practice with categorical data.

In fact, visualizing categorical data and presenting it in various tabular forms takes a range of knowledge:

  • the table data type
  • the concept of tidy data
  • using and interpreting the vcd package's mosaic() function
  • the forcats package for categorical (factor) data
  • the knitr package's kable() function and Markdown for rendering tables on the web

In short, you convert ordinary tabular data into tidy form, run it through exploratory data analysis, and produce the final output.

Mosaic plots and data structures

2. Working freely with table data

With that background in place, we start analyzing the HairEyeColor dataset included in R's datasets package.

2.1. Setup

Load the packages needed for categorical data analysis and visualization.

# 0. Load data ----------------------------------------------
library(tidyverse)
library(datasets)
library(forcats)
library(ggmosaic)
library(vcd)
library(gridExtra)
library(knitr)

2.2. Table data

Load the well-known categorical dataset HairEyeColor. It is not a data frame but table-type data. There are functions for getting it into the familiar data frame type:

  • tbl_df()
  • as_data_frame()

tbl_df() and as_data_frame() are useful functions that convert table data into a data frame.

data("HairEyeColor")

# 1. Data transformation ----------------------------------------------

## 1.1 Table data --> tidy data ------------------------

hair_eye_df <- apply(HairEyeColor, c(1, 2), sum)

kable(hair_eye_df, digits=0)
        Brown  Blue  Hazel  Green
Black      68    20     15      5
Brown     119    84     54     29
Red        26    17     14     14
Blond       7    94     10     16
tbl_df <- as_data_frame(HairEyeColor)

tbl_df(HairEyeColor)
# A tibble: 32 × 4
    Hair   Eye   Sex     n
   <chr> <chr> <chr> <dbl>
1  Black Brown  Male    32
2  Brown Brown  Male    53
3    Red Brown  Male    10
4  Blond Brown  Male     3
5  Black  Blue  Male    11
6  Brown  Blue  Male    50
7    Red  Blue  Male    10
8  Blond  Blue  Male    30
9  Black Hazel  Male    10
10 Brown Hazel  Male    25
# ... with 22 more rows
# kable(tbl_df)

2.3. Tidy data

Once converted, the data frame is in long format; to compare it with the original table, reshape it back with the spread() function.

## 1.2 Long & wide data formats ------------------------

long_df <- tbl_df %>% group_by(Hair, Eye) %>% 
    summarise(cnt = sum(n))

# compare with:
# hair_eye_df
long_df %>% spread(Eye, cnt) %>% kable(digits=0)
Hair   Blue  Brown  Green  Hazel
Black    20     68      5     15
Blond    94      7     16     10
Brown    84    119     29     54
Red      17     26     14     14

2.4. Visualizing univariate categorical data

Once the data frame is tidy, convert each variable to its appropriate type. This is where the factor-handling functions of the forcats package come in. The factor type exists as a concept in other programming languages, but it is rarely used in practice, and few languages offer functionality as rich as R's.

## 1.3 Categorical data ------------------------

long_df %>% ungroup() %>%  mutate(Hair = factor(Hair)) %>% 
    group_by(Hair) %>% 
    summarise(hair_sum = sum(cnt)) %>% 
        ggplot(aes(hair_sum, fct_reorder(Hair, hair_sum))) + geom_point()

long_df %>% ungroup() %>%  mutate(Eye = factor(Eye)) %>% 
    group_by(Eye) %>% 
    summarise(eye_sum = sum(cnt)) %>% 
    ggplot(aes(eye_sum, fct_reorder(Eye, eye_sum))) + geom_point()

long_df %>% ungroup() %>%  mutate(Eye = factor(Eye),
                                  Hair = factor(Hair)) %>% 
    group_by(Eye, Hair) %>% 
    summarise(eye_hair_sum = sum(cnt)) %>% 
    tidyr::unite(eye_hair, Eye, Hair) %>% 
    ggplot(aes(eye_hair_sum, fct_reorder(eye_hair, eye_hair_sum))) + geom_point() 

3. Mosaic plots

You can draw a mosaic plot in ggplot as well, but it does not offer residual-based shading. With the ggmosaic package, though, a mosaic plot can be built following the grammar of graphics, using the geom_mosaic() function.

To get a residual-shaded plot, however, you must pass table-type data to the mosaic() function provided by the vcd package.

# 2. Mosaic plot ------------------------

long_df %>% ungroup() %>%  mutate(Eye = factor(Eye),
                                  Hair = factor(Hair)) %>% 
    ggplot() +
    geom_mosaic(aes(weight=cnt,x=product(Hair),fill=Eye))

# 3. Mosaic plot statistical model ------------------------

mosaic(HairEyeColor, shade=TRUE, legend=TRUE)

xtabs(cnt ~ Hair + Eye, long_df)
       Eye
Hair    Blue Brown Green Hazel
  Black   20    68     5    15
  Blond   94     7    16    10
  Brown   84   119    29    54
  Red     17    26    14    14
mosaic(xtabs(cnt ~ Hair + Eye, long_df), shade = TRUE, legend=TRUE)

# vcd::mosaic(hair_eye_df, shade = TRUE, legend=TRUE)

Source: xwMOOC Machine Learning

How to Start an R Project

R is one of the most widely used programming languages for data analysis and data mining. Starting out with R can be a little intimidating for beginners, and at times it is hard even for statistics experts.

There are several ways to access R. You can install it on a Mac, Windows, or Linux machine and run it from the terminal, and there are various clients you can install to improve the user experience.

Datazar, on the other hand, offers a cloud-based client for R. You can use R in your browser to analyze data, create charts, use packages, and share your results.

 

Creating a project

Project-creation popup

After logging in to Datazar, click the “New Project” button at the top right and enter a project name in the popup. When you're done, click “Create Project”, and that's it!

 

Choosing an R interface

You can now choose between two interfaces: the R console and the R notebook. Both are equally useful; which one to use comes down to personal preference.

Click the “R Console” button and you will be taken to your newly created R console. It looks a bit like a terminal, with a text input field at the bottom.

 

The R console

The R console is your portal to R, and it is pretty neat. Let's first test a hello-world message command.

> message("Hello World")

R comes with datasets already included in the core program. So let’s use the famous iris dataset and play around with it. We’ll run two commands as below:

 


> iris
> head(iris)

The first command returns the entire dataset, and the second command returns just the first part of the dataset.

The R Console has command history support so you can use your keyboard arrows to navigate to your previous commands.

Now let’s look at graphics. The R console will return the graphics in-line with the text instead of a separate window as in the terminal. Using the console gives you a more natural feel with a little bit extra something.

Importing External Datasets

Having a very sophisticated interface is useless if you can’t use data you have collected or gathered. Let’s look at how you can use CSVs in your console.

Above the console is a button named “File.” It will show you the list of files you have in your project. Click on the checkbox that’s next to the file you want to import to your R console and click “Load Selected Files.” This ensures only the files you want in your session are loaded and keeps your workspace clean.

Let’s save the dataset to a variable called dataset:

> dataset<-read.csv("Dataset.csv")

Since this dataset is kind of long, let’s take a part of it and save it to another variable called sample:

> sample<-dataset[1:100,]

And finally plot the sample dataset:

> with(sample,plot(exper,wage,col=union))

Importing External Functions

The method for importing external functions to your R workspace is exactly the same as the method for importing datasets. Once you’ve imported the function you want from your project, use the following function to use it:

> source("customFunction.r")

This R file can be an R script that contains all your custom-reusable functions. Or even functions you copied from somewhere else.

Use External Libraries

If you want to juice up your R workspace with extra packages, all you have to do is run this function:

> library("someLibrary")

R has an enormous number of packages, contributed by the community on a regular basis. Packages like ggplot2 make your R experience come to life.

R Notebook

Although we used the R console throughout this entire guide, here’s what it would have looked like if it was made with the R notebook interface.

R notebooks are very useful when you want to go back and edit code, especially if you’re working in a team and you want a more presentable format. R consoles on the other hand are great for quick, dirty explorations. Dirty because all your error and commands will be shown. Again, it’s all a personal preference.

Links

Feel free to copy my file and play around with it:

R Console: https://www.datazar.com/file/f3cc3548b-7aa5-419c-83af-03139317ccae

R Notebook: https://www.datazar.com/file/f4e42b36f-944a-4f81-b1eb-688fc7ae9bf5

Resources

Useful resources and documentation when using R:

CRAN Manuals

CRAN: Manuals

R-Bloggers

R-bloggers

Datazar Blog: R Language

Datazar Blog

R Tutor: Introductions

R Introduction | R Tutorial


How to Start an R Project was originally published in Datazar Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

 

Source: How to Start an R Project | R-bloggers

Creating Interactive Charts with R, Shiny, MySQL and AnyChart JS via a Template


Data visualization and charting have been evolving into an ever more important area of web development. People simply perceive information much better when it is represented graphically rather than as raw numbers. As a result, business intelligence applications, reports, and so on implement graphs and charts extensively to visualize and clarify data and, as a consequence, to speed up and facilitate its analysis for decision-making.

There are multiple ways to handle data visualization in R, but today we will look at how to create interactive charts with the popular JavaScript (HTML5) charting library AnyChart, which recently got an official R, Shiny and MySQL template that makes the whole process very easy and straightforward.

In this step-by-step tutorial we will take a detailed look at the template and a basic pie chart example, and then show how to quickly modify the template to get different data visualizations.

AnyChart에 대하여

AnyChart는 대화식 차트를 웹 사이트 및 웹 앱에 추가하는 유연한 크로스 브라우저 JS 차트 라이브러리입니다. 기본적으로 설치 및 플랫폼 및 데이터베이스 작업이 필요하지 않습니다. AnyChart의 기능 중 일부는 다음과 같습니다.

  • Dozens of supported chart types, with the number continuously growing
  • A comprehensive API reference and documentation
  • Many ready-to-use chart samples that make the first steps with the library easy
  • Deep customization of the appearance of all graphics and various additional elements

Currently, templates for popular technology stacks such as R, Shiny and MySQL further facilitate AnyChart’s integration.

Getting started

First of all, let’s make sure the R language is installed. If not, you can visit the official R website and follow the instructions.

If you have worked with R before, most likely you already have RStudio, and you are welcome to create a project in it now, because the R part of this tutorial can be done there. If you do not have RStudio at the moment, you can install it from the official RStudio website. But, actually, using RStudio is not mandatory, and a plain text editor will be enough in our case.

After that, we should check whether MySQL is properly installed. To do that, open a terminal window and enter the following command:

$ mysql --version
mysql  Ver 14.14 Distrib 5.7.16, for Linux (x86_64) using  EditLine wrapper

You should receive the response written above (or a similar one) to be sure all is well. Please follow these instructions to install MySQL if you do not have it at the moment.

Now that all the required components have been installed, we are ready to write some code for our example.

Basic template

First, to download the R, Shiny and MySQL template for AnyChart, type the next command in the terminal:

The folder you are getting here features the following structure:

r-shiny-mysql-template/
     www/
          css/
               style.css  # css style
     app.R                # main application code
     database_backup.sql  # MySQL database dump
     LICENSE
     README.md
     index.html           # html template

Let’s take a look at the project files and examine how this sample works. We’ll run the example first.

Open the terminal and go to the repository folder:

$ cd r-shiny-mysql-template

Set up the MySQL database. To specify your username and password, make use of the -u and -p flags:

$ mysql < database_backup.sql

Then run the R command line, using the command below:

$ R

And install the Shiny and RMySQL packages, loading the Shiny library at the end:

> install.packages("shiny")
> install.packages("RMySQL")
> library(shiny)

If you face any problems during the installation of these dependencies, carefully read error messages, e.g. you might need sudo apt-get install libmysqlclient-dev for installing RMySQL.

Finally, run the application:

> runApp("{PATH_TO_TEMPLATE}") # e.g. runApp("/workspace/r-shiny-mysql-template")

And the new tab that should have just opened in your browser shows you the example included in the template:

Interactive Pie chart created with R and AnyChart JS charting library. Basic sample from R, Shiny and MySQL integration template

Basic template: code

Now, let’s go back to the folder with our template to see how it works.

The LICENSE and README.md files contain information about the license and the template (how to run it, technologies, structure, etc.) respectively. They are not functionally important to our project, so we will not explore them here; please check these files yourself for a general understanding.

The style.css file is responsible for the styles of the page.

The database_backup.sql file contains the code for creating the MySQL table and user and for writing data to the table. You can use your own table or change the data in this one.

Let’s move on to the code. First, open the app.R file. This file ensures the connection to the MySQL database, reads data, and passes it to the index.html file, which contains the main code using AnyChart. The following part of the app.R code contains the htmlTemplate function; here we specify the name of the file the data will be transmitted to, the titles of our page and chart, and the chart data, JSON-encoded from the MySQL database.

htmlTemplate("index.html",
     title = "Anychart R Shiny template",
     chartTitle = shQuote("Top 5 fruits"),
     chartData = toJSON(loadData())
)
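For context, here is a sketch of what the database part of app.R (the loadData() call used above) might look like. The connection parameters mirror those that appear later in the ui.R listing; the ORDER BY ... LIMIT 5 clause is my assumption, based on the “Top 5 fruits” chart title, not the author’s exact code:

```r
library(RMySQL)
library(jsonlite)

# Connect to the database created by database_backup.sql
# (credentials as in the ui.R listing later in this article)
db <- dbConnect(MySQL(),
                dbname   = "anychart_db",
                host     = "localhost",
                port     = 3306,
                user     = "anychart_user",
                password = "anychart_pass")

# Read the five largest rows -- the LIMIT is an assumption made here
# to match the "Top 5 fruits" title
loadData <- function() {
  dbGetQuery(db, "SELECT name, value FROM fruits ORDER BY value DESC LIMIT 5")
}
```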

The main thing here is the index.html file, which is where the template for creating charts actually lives. As you can see, the first part of this file simply connects all the necessary files to the code, including the AnyChart library, the CSS file with styles, and so on. I’ll skip this for now and proceed directly to the script tag and the anychart.onDocumentReady(function () {...}) function.

anychart.onDocumentReady(function() {
     var chart = anychart.pie({{ chartData }});
     chart.title({{ chartTitle }});
     chart.container("container");
     chart.draw();
});

This pattern works as follows. We create a pie chart by using the pie() function and get the data that have already been read and prepared in the R code. Please note that the names of the variables containing data are the same in the app.R and index.html files. Then we display the chart title via chart.title({{ chartTitle }}) and specify the ID of the element that will contain the chart, which is a div with id = "container" in this case. To render everything we coded, we call chart.draw().

Modifying the template to create a custom chart

Now that we’ve explored the basic example included in the template, we can move forward and create our own, custom interactive chart. To do that, we simply need to change the template a little bit and add some features if needed. Let’s see how it works.

First, we create all the necessary files by ourselves or make a new project using RStudio.

Second, we add a project folder named anychart. Its structure should look as illustrated below. Please note that some differences are possible (and acceptable) if you are using a new project in RStudio.

anychart/
     www/
          css/
               style.css  # css style
     ui.R                 # main application code
     server.R             # sub code
     database_backup.sql  # data set
     index.html           # html template

Now you know what files you need. If you’ve made a project with RStudio, the ui.R and server.R files are created automatically. If you’ve made the project by yourself, just create empty files with the same names and extensions as specified above.

The main difference from the original example included in the template is that we should change the file index.html and divide app.R into parts. You can copy the rest of the files or create new ones for your own chart.

Please take a look at the server.R file. If you’ve made the project using RStudio, it was created automatically and you don’t need to change anything in it. However, if you’ve made it by yourself, open it in a text editor and add the code below, which is standard for the Shiny framework. You can read more about that here.
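The server.R code referred to here did not survive in this copy of the article (the full listing appears at the end of it). Since this template renders the chart entirely in index.html, even a minimal stub like the following would do; this is a sketch on my part, not necessarily the author’s exact code:

```r
# server.R -- minimal Shiny server stub.
# The chart is rendered entirely by the AnyChart code in index.html,
# so the server function has nothing to compute.
library(shiny)

shinyServer(function(input, output) {
})
```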

The file structure of ui.R is similar to that of app.R, so you can copy app.R from the template and change/add the following lines:

loadData = dbGetQuery(db, "SELECT name, value FROM fruits")
data1 <- character()
# data preparation
for (var in 1:nrow(loadData)) {
     c = c(as.character(loadData[var, 1]), loadData[var, 2])
     data1 <- c(data1, c)
}
data = matrix(data1, nrow = nrow(loadData), ncol = 2, byrow = TRUE)
ui = function(){
     htmlTemplate("index.html",
          title = "Anychart R Shiny template",
          chartTitle = shQuote("Fruits"),
          chartData = toJSON(data)
)}

Since we are going to change the chart type from pie to 3D vertical bar (column), the data needs some preparation before being passed to index.html. The main difference is that we will use all the data from the database, not just the top 5 positions.
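To see concretely what this preparation step produces, here is a self-contained sketch that runs the same loop on a small hypothetical data frame standing in for the MySQL query result:

```r
# Hypothetical stand-in for the data frame returned by dbGetQuery()
loadData <- data.frame(name  = c("apple", "orange", "banana"),
                       value = c(100, 58, 81),
                       stringsAsFactors = FALSE)

data1 <- character()
for (var in 1:nrow(loadData)) {
  # coerce each (name, value) pair to character and append it
  pair  <- c(as.character(loadData[var, 1]), loadData[var, 2])
  data1 <- c(data1, pair)
}

# one row per fruit; byrow = TRUE restores the (name, value) pairs,
# and both columns end up as character strings
data <- matrix(data1, nrow = nrow(loadData), ncol = 2, byrow = TRUE)
print(data)
```

Note that the numeric values arrive in the matrix as strings ("100", "58", …), which toJSON() then serializes for the chart.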

We will slightly modify and expand the basic template. Let’s see the resulting code of the index.html first (the script tag) and then explore it.

anychart.onDocumentReady(function() {
    var chart = anychart.column3d({{ chartData }});
    chart.title({{ chartTitle }});
    chart.animation(true);

    var xAxis = chart.xAxis();
    xAxis.title("fruits");
    var yAxis = chart.yAxis();
    yAxis.title("pounds, t");
    var yScale = chart.yScale();
    yScale.minimum(0);
    yScale.maximum(120);
    chart.container("container");
    chart.draw();
});

With the help of var chart = anychart.column3d({{ chartData }}), we are creating a 3D column chart by using the column3d() function. Here you can choose any other chart type you need; consider getting help from Chartopedia if you are unsure which one works best in your situation.

Next, we add animation to the column chart via chart.animation(true) to make it appear gradually on page load.

In the following section, we are creating two variables, xAxis and yAxis. Including these is required if you want to provide the coordinate axes of the chart with captions. So, you should create variables that will hold the X and Y axes, and then use the title() function on each of them to pass the captions you want to see.

The next block is basically optional. We are explicitly specifying the maximum and minimum values for the Y axis; otherwise, AnyChart will calculate these values on its own. You can do the same for the X axis.

And that’s it! Our 3D column chart is ready, and all seems to be fine for successfully running the code. The only thing left to do before that is to change the MySQL table to make it look as follows:

('apple',100),
('orange',58),
('banana',81),
('lemon',42),
('melon',21),
('kiwi',66),
('mango',22),
('pear',48),
('coconut',29),
('cherries',65),
('grapes',31),
('strawberries',76);

To see what you’ve got, follow the same steps as for running the R, Shiny and MySQL template example, but do not forget to change the path and the folder name to anychart. So, let’s open the terminal and run the following commands:

$ cd anychart
$ mysql < database_backup.sql
$ R
     > install.packages("shiny")
     > install.packages("RMySQL")
     > library(shiny)
     > runApp("{PATH_TO_TEMPLATE}") # e.g. runApp("/workspace/anychart")

Interactive 3D Column chart made with R and AnyChart JS charting library, based on R, Shiny and MySQL integration template

For consistency purposes, I am including the code of ui.R and server.R below. The full source code of this example can be found on GitHub.

ui.R:

library(shiny)
library(RMySQL)
library(jsonlite)
data1 <- character()
db = dbConnect(MySQL(),
     dbname = "anychart_db",
     host = "localhost",
     port = 3306,
     user = "anychart_user",
     password = "anychart_pass")
loadData = dbGetQuery(db, "SELECT name, value FROM fruits")
# data preparation
for (var in 1:nrow(loadData)) {
     c = c(as.character(loadData[var, 1]), loadData[var, 2])
     data1 <- c(data1, c)
}
data = matrix(data1, nrow = nrow(loadData), ncol = 2, byrow = TRUE)
server = function(input, output){}
ui = function(){
     htmlTemplate("index.html",
     title = "Anychart R Shiny template",
     chartTitle = shQuote("Fruits"),
     chartData = toJSON(data)
)}
shinyApp(ui = ui, server = server)

server.R:

library(shiny)
shinyServer(function(input, output) {
     output$distPlot <- renderPlot({
          # generate bins based on input$bins from ui.R
          x    <- faithful[, 2]
          bins <- seq(min(x), max(x), length.out = input$bins + 1)
          # draw the chart with the specified number of bins
          hist(x, breaks = bins, col = 'darkgray', border = 'white')
 })
})

Conclusion

When your technology stack includes R, Shiny and MySQL, using AnyChart JS with the integration template discussed in this tutorial requires no big effort and allows you to add beautiful interactive JavaScript-based charts to your web apps quite quickly. It is also worth mentioning that you can customize the look and feel of charts created this way as deeply as needed by using some of the library’s numerous out-of-the-box features: add or remove axis labels, change the background color and how the axis is positioned, leverage interactivity, and so on.

The scope of this tutorial is actually even broader, because the process described here not only applies to the AnyChart JS charting library, but is also mostly the same for its sister libraries AnyMap (geovisualization in maps), AnyStock (date/time graphs), and AnyGantt (charts for project management). All of them are free for non-profit projects but – I must put it clearly here again just in case – require a special license for commercial use.

I hope you find this article helpful in your activities when it comes to interactive data visualization in R. Now ask your questions, please, if any.