## 6 Reasons to Learn R for Business

Data Science for Business (DS4B) is the future of business analytics, but it can be hard to figure out where to start. The last thing you want to do is waste time with the wrong tool. Using your time effectively comes down to two things: (1) choosing the right tool for the job, and (2) efficiently learning how to use that tool to return business value. This article focuses on the first part, explaining why R is the right choice across six points. The next article focuses on the second part: learning R in 12 weeks.

## REASON 1: R HAS THE BEST OVERALL QUALITIES

There are a number of tools available for business analysis/business intelligence (with DS4B being a subset of this area). Each tool has its pros and cons, many of which are important in the business context. We can use these attributes to compare how each tool stacks up against the others! We did a qualitative assessment using several criteria:

• Business Capability (1 = Low, 10 = High)
• Ease of Learning (1 = Difficult, 10 = Easy)
• Cost (Free/Minimal, Low, High)
• Trend (0 = Fast Decline, 5 = Stable, 10 = Fast Growth)

Further discussion on the assessment is included in the Appendix at the end of the article.

What we saw was particularly interesting. A trendline emerged, exposing a tradeoff between learning curve and DS4B capability rating. The most flexible tools are more difficult to learn but tend to have higher business capability. Conversely, the “easy-to-learn” tools are often not the best long-term tools for business or data science capability. Our opinion: go for capability over ease of use.

Of the top tools in capability, R has the best mix of desirable attributes including high data science for business capability, low cost, and it’s growing very fast. The only downside is the learning curve. The rest of the article explains why R is so great for business.

## REASON 2: R IS DATA SCIENCE FOR NON-COMPUTER SCIENTISTS

If you are seeking high-performance data science tools, you really have two options: R or Python. When starting out, you should pick one. It’s a mistake to try to learn both. Your choice comes down to what’s right for you. The difference between R and Python has been described in numerous infographics and debates online, but the most overlooked reason is person-programming language fit. Don’t understand what we mean? Let’s break it down.

Fact 1: Most people interested in learning data science for business are not computer scientists. They are business professionals, non-software engineers (e.g. mechanical, chemical), and other technical-to-business converts. This is important because of where each language excels.

Fact 2: Most activities in business and finance involve communication. This comes in the form of reports, dashboards, and interactive web applications that allow decision makers to recognize when things are not going well and to make well-informed decisions that improve the business.

Now that we recognize what’s important, let’s learn about the two major players in data science.

Python is a general-purpose programming language developed by software engineers that has solid programming libraries for math, statistics and machine learning. Python has best-in-class tools for pure machine learning and deep learning, but lacks much of the infrastructure for subjects like econometrics and for communication tools such as reporting. Because of this, Python is well-suited for computer scientists and software engineers.

R is a statistical programming language developed by scientists that has open source libraries for statistics, machine learning, and data science. R lends itself well to business because of its depth of topic-specific packages and its communication infrastructure. R has packages covering a wide range of topics such as econometrics, finance, and time series. R has best-in-class tools for visualization, reporting, and interactivity, which are as important to business as they are to science. Because of this, R is well-suited for scientists, engineers and business professionals.

### WHAT SHOULD YOU DO?

Don’t make the decision tougher than it is. Think about where you are coming from:

• Are you a computer scientist or software engineer? If yes, choose Python.
• Are you an analytics professional or mechanical/industrial/chemical engineer looking to get into data science? If yes, choose R.

Think about what you are trying to do:

• Are you trying to build a self-driving car? If yes, choose Python.
• Are you trying to communicate business analytics throughout your organization? If yes, choose R.

## REASON 3: LEARNING R IS EASY WITH THE TIDYVERSE

Learning R used to be a major challenge. Base R was a complex and inconsistent programming language. Structure and formality were not the top priority, as they are in other programming languages. This all changed with the “tidyverse”, a set of packages and tools that have a consistently structured programming interface.

When tools such as dplyr and ggplot2 came to fruition, it made the learning curve much easier by providing a consistent and structured approach to working with data. As Hadley Wickham and many others continued to evolve R, the tidyverse came to be, which includes a series of commonly used packages for data manipulation, visualization, iteration, modeling, and communication. The end result is that R is now much easier to learn (we’ll show you in our next article!).

Source: tidyverse.org
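To give a flavor of this consistency, here is a minimal sketch (using the built-in mtcars data set) of the tidyverse style that dplyr and ggplot2 introduced:

library(tidyverse)

# Group, summarize, and plot with one consistent grammar
mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  ggplot(aes(factor(cyl), avg_mpg)) +
  geom_col()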

R continues to evolve in a structured manner, with advanced packages that are built on top of the tidyverse infrastructure. A new focus is being placed on modeling and algorithms, which we are excited to see. Further, the tidyverse is being extended to cover topical areas such as text (tidytext) and finance (tidyquant). For newcomers, this should give you confidence in selecting this language. R has a bright future.

## REASON 4: R HAS BRAINS, MUSCLE, AND HEART

Saying R is powerful is actually an understatement. From the business context, R is like Excel on steroids! But more important than just muscle is the combination of what R offers: brains, muscle, and heart.

### R HAS BRAINS

R implements cutting-edge algorithms including:

• H2O (h2o) – High-end machine learning package
• Keras/TensorFlow (keras, tensorflow) – Go-to deep learning packages
• xgboost – Top Kaggle algorithm
• And many more!

These tools are used everywhere from AI products to Kaggle Competitions, and you can use them in your business analyses.

### R HAS MUSCLE

R has powerful tools for:

• Vectorized Operations – R uses vectorized operations to make math computations lightning fast right out of the box (see the sketch after this list)
• Loops (purrr)
• Parallelizing operations (parallel, future)
• Speeding up code using C++ (Rcpp)
• Connecting to other languages (rJava, reticulate)
• Working With Databases – Connecting to databases (dbplyr, odbc, bigrquery)
• Handling Big Data – Connecting to Apache Spark (sparklyr)
• And many more!
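As a minimal illustration of the first point above, the vectorized sum() replaces an explicit accumulation loop:

x <- runif(1e6)

# Loop version: accumulate one element at a time
total <- 0
for (xi in x) total <- total + xi

# Vectorized version: a single call that runs in optimized C code
sum(x)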

### R HAS HEART

We already talked about the infrastructure, the tidyverse, that enables the ecosystem of applications to be built using a consistent approach. It’s this infrastructure that brings life into your data analysis. The tidyverse enables:

• Data manipulation (dplyr, tidyr)
• Working with data types (stringr for strings, lubridate for date/datetime, forcats for categorical/factors)
• Visualization (ggplot2)
• Programming (purrr, tidyeval)
• Communication (rmarkdown, shiny)

## REASON 5: R IS BUILT FOR BUSINESS

Two major advantages of R versus every other programming language are that it can produce business-ready reports and machine learning-powered web applications. Neither Python nor Tableau nor any other tool can currently do this as efficiently as R can. The two capabilities we refer to are rmarkdown for report generation and shiny for interactive web applications.

### RMARKDOWN

Rmarkdown is a framework for creating reproducible reports that has since been extended to building blogs, presentations, websites, books, journals, and more. It’s the technology behind this blog, and it allows us to include the code with the text so that anyone can follow the analysis and see the output right next to the explanation. What’s really cool is how far the technology has evolved.
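The examples that originally accompanied this section were visual, but the basic workflow is easy to sketch. Assuming a report.Rmd file (the file name is illustrative) that mixes narrative text with R code chunks, a single render call produces a polished document:

# report.Rmd mixes narrative text with R code chunks;
# rendering it produces a polished HTML (or PDF, Word, ...) report
rmarkdown::render("report.Rmd", output_format = "html_document")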

### SHINY

Source: shiny.rstudio.com

Shiny is a framework for creating interactive web applications that are powered by R. Shiny is a major consulting area for us: four out of five assignments involve building a web application using shiny. It’s not only powerful, it also enables non-data scientists to gain the benefits of data science via interactive decision-making tools. Here’s an example of a Google Trend app built with shiny.
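As a minimal sketch of the framework (not the Google Trend app itself), a complete shiny app needs only a UI definition and a server function:

library(shiny)

# UI: one input and one plot output
ui <- fluidPage(
  numericInput("n", "Number of observations:", 100),
  plotOutput("hist")
)

# Server: re-renders the plot whenever the input changes
server <- function(input, output) {
  output$hist <- renderPlot(hist(rnorm(input$n)))
}

shinyApp(ui, server)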

## REASON 6: R COMMUNITY SUPPORT

Being a powerful language alone is not enough. To be successful, a language needs community support. We’ll hit on two ways that R excels in this respect: CRAN and the R Community.

### CRAN: COMMUNITY-PROVIDED R PACKAGES

CRAN is like the Apple App Store, except everything is free, super useful, and built for R. With over 14,000 packages, it has almost everything you could possibly want, from machine learning to high-performance computing to finance and econometrics! The task views cover specific areas and are one way to explore R’s offerings. CRAN is community-driven, with top open source authors such as Hadley Wickham and Dirk Eddelbuettel leading the way. Package development is a great way to contribute to the community, especially for those looking to showcase their coding skills and give back!
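Every CRAN package installs the same way; for example, using tidyquant (mentioned above):

# Download a package from CRAN and load it for use
install.packages("tidyquant")
library(tidyquant)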

### COMMUNITY SUPPORT

You begin with R because of its capability, you stay with R because of its community. The R Community is the coolest part. It’s tight-knit, opinionated, fun, silly, and highly knowledgeable… all of the things you want in a high performing team.

#### SOCIAL/WEB

R users can be found all over the web. A few of the popular hangouts are:

#### CONFERENCES

R-focused business conferences are gaining traction in a big way. Here are a few that we attend and/or will be attending in the future:

• EARL – Mango Solution’s conference on enterprise and business applications of R
• R/Finance – Community-hosted conference on financial asset and portfolio analytics and applied finance
• Rstudio Conf – Rstudio’s technology conference
• New York R – Business and technology-focused R conference

#### MEETUPS

A really cool thing about R is that many major cities have a meetup nearby. Meetups are exactly what you think: a group of R users getting together to talk R. They are usually funded by the R Consortium. You can get a full list of meetups here.

## CONCLUSION

R has a wide range of benefits, making it our obvious choice for Data Science for Business (DS4B). That’s not to say that Python isn’t a good choice as well, but, for the wide range of business needs, there’s nothing that compares to R. In this article we saw why R is a great choice. In the next article we’ll show you how to learn R in 12 weeks.

Business Science specializes in “ROI-driven data science”. Our focus is machine learning and data science in business and financial applications. We build web applications and automated reports to put machine learning in the hands of decision makers. Visit Business Science or contact us to learn more!

Interested in learning data science for business? Enroll in Business Science University. We’ll teach you how to apply data science and machine learning in real-world business applications. We take you through the entire process of modeling problems, creating interactive data products, and distributing solutions within an organization. We are launching courses in early 2018!

## APPENDIX – DISCUSSION ON DS4B TOOL ASSESSMENT

Here’s some additional information on the tool assessment. We have provided the code used to make the visualization, the criteria explanation, and the tool assessment.

### CRITERIA EXPLANATION

Our assessment of the most powerful DS4B tools was based on four criteria:

• Business Capability (1 = Low, 10 = High): How well-suited is the tool for use in the business? Does it include features needed for business, such as advanced analytics, interactivity, communication, and web apps?
• Ease of Learning (1 = Difficult, 10 = Easy): How easy is it to pick up? Can you learn it in a week of short courses or will it take a longer time horizon to become proficient?
• Cost (Free/Minimal, Low, High): Cost has two undesirable effects. From a first-order perspective, the organization has to spend money. This is not in-and-of-itself undesirable because the software companies can theoretically spend on R&D and other efforts to advance the product. The second-order effect of lowering adoption is much more concerning. High-cost tools tend to have much less discussion in the online world, whereas open source or low-cost tools have great trends.
• Trend (0 = Fast Decline, 5 = Stable, 10 = Fast Growth): We used Stack Overflow Insights question volume as a proxy for the trend of usage over time. A major assumption is that a growing number of Stack Overflow questions indicates that usage is growing at a similar rate.

Source: Stack Overflow Trends

### INDIVIDUAL TOOL ASSESSMENT

#### R:

• DS4B Capability = 10: Has it all. Great data science capability, great visualization libraries, Shiny for interactive web apps, rmarkdown for professional reporting.
• Learning Curve = 4: A lot to learn, but learning is getting easier with the tidyverse.
• Trend = 10: Stack overflow questions are growing at a very fast pace.
• Cost = Low: Free and open source

#### PYTHON:

• DS4B Capability = 7: Has great machine learning and deep learning libraries. Can connect to any major database. Communication is limited to Flask/Django web applications, which can be difficult to build. Does not have a business reporting infrastructure comparable to rmarkdown.
• Learning Curve = 4: A lot to learn, but learning is relatively easy compared to other object oriented programming languages like Java.
• Trend = 10: Stack overflow questions are growing at a very fast pace.
• Cost = Low: Free and open source

#### EXCEL:

• DS4B Capability = 4: Mainly a spreadsheet software but has programming built in with VBA. Difficult to integrate R, but is possible. No data science libraries.
• Learning Curve = 10: Relatively easy to become an advanced user.
• Trend = 7: Stack overflow questions are growing at a relatively fast pace.
• Cost = Low: Comes with Microsoft Office, which most organizations use.

#### TABLEAU:

• DS4B Capability = 6: Has R integrated, but is very difficult to implement advanced algorithms and not as flexible as R+shiny.
• Learning Curve = 7: Very easy to pick up.
• Trend = 6: Stack overflow questions are growing at a relatively fast pace.
• Cost = Low: Free public version. Enterprise licenses are relatively affordable.

#### POWERBI:

• DS4B Capability = 5: Similar to Tableau, but not quite as feature-rich. Can integrate R to some extent.
• Learning Curve = 8: Very easy to pick up.
• Trend = 6: Expected to have same trend as Tableau.
• Cost = Low: Free public version. Licenses are very affordable.

#### MATLAB:

• DS4B Capability = 6: Can do a lot with it, but lacks the infrastructure to use for business.
• Learning Curve = 2: Matlab is quite difficult to learn.
• Trend = 1: Stack overflow growth is declining at a rapid pace.
• Cost = High: Matlab licenses are very expensive. Licensing structure does not scale well.

#### SAS:

• DS4B Capability = 8: Has data science, database connection, business reporting and visualization capabilities. Can also build applications. However, limited by closed-source nature. Does not get latest technologies like tensorflow and H2O.
• Learning Curve = 4: Similar to most data science programming languages for the tough stuff. Has a GUI for the easy stuff.
• Trend = 3: Stack Overflow growth is declining.
• Cost = High: Expensive for licenses. Licensing structure does not scale well.

### CODE FOR THE DS4B TOOL ASSESSMENT VISUALIZATION

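The original code block did not survive extraction; below is a minimal sketch that reproduces the spirit of the plot, using the ratings from the individual tool assessments above (the exact aesthetics of the original are an assumption):

library(tidyverse)

# Ratings transcribed from the individual tool assessments
tools <- tribble(
  ~tool,     ~capability, ~ease_of_learning, ~trend, ~cost,
  "R",        10,          4,                 10,     "Low",
  "Python",    7,          4,                 10,     "Low",
  "Excel",     4,         10,                  7,     "Low",
  "Tableau",   6,          7,                  6,     "Low",
  "PowerBI",   5,          8,                  6,     "Low",
  "Matlab",    6,          2,                  1,     "High",
  "SAS",       8,          4,                  3,     "High"
)

# Capability vs. learning curve, with trend mapped to point size
ggplot(tools, aes(ease_of_learning, capability)) +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed") +
  geom_point(aes(size = trend, color = cost)) +
  geom_text(aes(label = tool), vjust = -1) +
  labs(
    title = "DS4B Tool Assessment",
    x = "Ease of Learning (10 = Easy)",
    y = "Business Capability (10 = High)"
  )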

## Pipes in R Tutorial For Beginners (article) by DataCamp

Learn more about the famous pipe operator %>% and other pipes in R, why and how you should use them and what alternatives you can consider!

You might have already seen or used the pipe operator when you’re working with packages such as dplyr, magrittr, … But do you know where pipes and the famous %>% operator come from, what they exactly are, or how, when and why you should use them? Can you also come up with some alternatives?

This tutorial will give you an introduction to pipes in R and will cover the following topics:

Are you interested in learning more about manipulating data in R with dplyr? Take a look at DataCamp’s Data Manipulation in R with dplyr course.

## Pipe Operator in R: Introduction

To understand what the pipe operator in R is and what you can do with it, it’s necessary to consider the full picture, to learn the history behind it. Questions such as “where does this weird combination of symbols come from and why was it made like this?” might be on top of your mind. You’ll discover the answers to these and more questions in this section.

Now, you can look at the history from three perspectives: from a mathematical point of view, from a holistic point of view of programming languages, and from the point of view of the R language itself. You’ll cover all three in what follows!

### History of the Pipe Operator in R

#### Mathematical History

If you have two functions, let’s say f: B → C and g: A → B, you can chain these functions together by taking the output of one function and inserting it into the next. In short, “chaining” means that you pass an intermediate result onto the next function, but you’ll see more about that later.

For example, in f(g(x)), the result g(x) serves as an input for f(), while x, of course, serves as input to g().

If you would want to note this down, you would use the notation f ∘ g, which reads as “f follows g”. Alternatively, you can visually represent this as:

Image Credit: James Balamuta, “Piping Data”

#### Pipe Operators in Other Programming Languages

As mentioned in the introduction to this section, this operator is not new in programming: in the Shell or Terminal, you can pass commands from one to the next with the pipeline character |. Similarly, F# has a forward pipe operator, which will prove to be important later on! Lastly, it’s also good to know that Haskell contains many piping operations that are derived from the Shell or Terminal.

#### Pipes in R

Now that you have seen some history of the pipe operator in other programming languages, it’s time to focus on R. The history of this operator in R starts, according to this fantastic blog post written by Adolfo Álvarez, on January 17th, 2012, when an anonymous user asked the following question in this Stack Overflow post:

How can you implement F#’s forward pipe operator in R? The operator makes it possible to easily chain a sequence of calculations. For example, when you have an input data and want to call functions foo and bar in sequence, you can write data |> foo |> bar?

The answer came from Ben Bolker, professor at McMaster University, who replied:

I don’t know how well it would hold up to any real use, but this seems (?) to do what you want, at least for single-argument functions …

"%>%" <- function(x,f) do.call(f,list(x))
pi %>% sin
[1] 1.224606e-16
pi %>% sin %>% cos
[1] 1
cos(sin(pi))
[1] 1


About nine months later, Hadley Wickham started the dplyr package on GitHub. You might now know Hadley, Chief Scientist at RStudio, as the author of many popular R packages (such as this last package!) and as the instructor for DataCamp’s Writing Functions in R course.

Be that as it may, it wasn’t until 2013 that the first pipe %.% appeared in this package. As Adolfo Álvarez rightfully mentions in his blog post, the function was named chain(), and its purpose was to simplify the notation for applying several functions to a single data frame in R.

The %.% pipe would not be around for long, as Stefan Bache proposed an alternative on the 29th of December 2013 that included the operator as you might now know it:

iris %>%
subset(Sepal.Length > 5) %>%
aggregate(. ~ Species, ., mean)


Bache continued to work with this pipe operation and at the end of 2013, the magrittr package came into being. In the meantime, Hadley Wickham continued to work on dplyr and in April 2014, the %.% operator got replaced with the one that you now know, %>%.

Later that year, Kun Ren published the pipeR package on GitHub, which incorporated a different pipe operator, %>>%, which was designed to add more flexibility to the piping process. However, it’s safe to say that the %>% is now established in the R language, especially with the recent popularity of the Tidyverse.

### What Is It?

Knowing the history is one thing, but that still doesn’t give you an idea of what F#’s forward pipe operator is nor what it actually does in R.

In F#, the pipe-forward operator |> is syntactic sugar for chained method calls. Or, stated more simply, it lets you pass an intermediate result onto the next function.

Remember that “chaining” means that you invoke multiple method calls. As each method returns an object, you can actually allow the calls to be chained together in a single statement, without needing variables to store the intermediate results.

In R, the pipe operator is, as you have already seen, %>%. If you’re not familiar with F#, you can think of this operator as being similar to the + in a ggplot2 statement. Its function is very similar to that of the F# operator: it takes the output of one statement and makes it the input of the next statement. When describing it, you can think of it as a “THEN”.

Take, for example, following code chunk and read it aloud:

iris %>%
subset(Sepal.Length > 5) %>%
aggregate(. ~ Species, ., mean)


You’re right, the code chunk above will translate to something like “you take the Iris data, then you subset the data and then you aggregate the data”.

This is one of the most powerful things about the Tidyverse. In fact, having a standardized chain of processing actions is called “a pipeline”. Making pipelines for a data format is great, because you can apply that pipeline to incoming data that has the same formatting and have it output in a ggplot2-friendly format, for example.

### Why Use It?

R is a functional language, which means that your code often contains a lot of parentheses, ( and ). When you have complex code, this often means that you will have to nest those parentheses together. This makes your R code hard to read and understand. Here’s where %>% comes to the rescue!

Take a look at the following example, which is a typical example of nested code:

# Initialize x
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

# Compute the logarithm of x, return suitably lagged and iterated differences,
# compute the exponential function and round the result
round(exp(diff(log(x))), 1)

[1]  3.3  1.8  1.6  0.5  0.3  0.1 48.8  1.1

With the help of %>%, you can rewrite the above code as follows:

# Import magrittr
library(magrittr)

# Perform the same computations on x as above
x %>% log() %>%
diff() %>%
exp() %>%
round(1)


Note that you need to import the magrittr library to get the above code to work. That’s because the pipe operator is, as you read above, part of the magrittr library and is, since 2014, also a part of dplyr. If you forget to import the library, you’ll get an error like Error in eval(expr, envir, enclos): could not find function "%>%".

Also note that it isn’t a formal requirement to add the parentheses after log, diff and exp, but, within the R community, some use them to increase the readability of the code.

In short, here are four reasons why you should be using pipes in R:

• You’ll structure the sequence of your data operations from left to right, as opposed to from the inside out;
• You’ll avoid nested function calls;
• You’ll minimize the need for local variables and function definitions; and
• You’ll make it easy to add steps anywhere in the sequence of operations.

These reasons are taken from the magrittr documentation itself. Implicitly, you see the arguments of readability and flexibility returning.

Even though %>% is the (main) pipe operator of the magrittr package, there are a couple of other operators that you should know and that are part of the same package:

• The compound assignment operator %<>%;
# Initialize x
x <- rnorm(100)

# Update value of x and assign it to x
x %<>% abs %>% sort

• The tee operator %T>%;
rnorm(200) %>%
matrix(ncol = 2) %T>%
plot %>%
colSums


For now, it’s enough to know that the above code chunk is actually a shortcut for:

rnorm(200) %>%
matrix(ncol = 2) %T>%
{ plot(.); . } %>%
colSums


But you’ll see more about that later on!

• The exposition pipe operator %$%;
data.frame(z = rnorm(100)) %$%
ts.plot(z)


Of course, these three operators work slightly differently than the main %>% operator. You’ll see more about their functionalities and their usage later on in this tutorial!

Note that, even though you’ll most often see the magrittr pipes, you might also encounter other pipes as you go along! Some examples are wrapr’s dot arrow pipe %.>%, its “to dot” pipe %>.%, and the Bizarro pipe ->.;.

## How to Use Pipes in R

Now that you know how the %>% operator originated, what it actually is and why you should use it, it’s time for you to discover how you can actually use it to your advantage. You will see that there are quite some ways in which you can use it!

### Basic Piping

Before you go into the more advanced usages of the operator, it’s good to first take a look at the most basic examples that use the operator. In essence, you’ll see that there are 3 rules that you can follow when you’re first starting out:

• f(x) can be rewritten as x %>% f

In short, this means that functions that take one argument, function(argument), can be rewritten as follows: argument %>% function(). Take a look at the following, more practical example to understand how these two are equivalent:

# Compute the logarithm of x
log(x)

# Compute the logarithm of x
x %>% log()

• f(x, y) can be rewritten as x %>% f(y)

Of course, there are a lot of functions that don’t just take one argument, but multiple. This is the case here: you see that the function takes two arguments, x and y. Similar to what you have seen in the first example, you can rewrite the function by following the structure argument1 %>% function(argument2), where argument1 is piped in as the function’s first argument and argument2 is the remaining argument.

This all seems quite theoretical. Let’s take a look at a more practical example:

# Round pi
round(pi, 6)

# Round pi
pi %>% round(6)

• x %>% f %>% g %>% h can be rewritten as h(g(f(x)))

This might seem complex, but it isn’t quite like that when you look at a real-life R example:

# Import babynames data
library(babynames)
# Import dplyr library
library(dplyr)

data(babynames)

# Count how many young boys with the name "Taylor" are born
sum(select(filter(babynames,sex=="M",name=="Taylor"),n))

# Do the same but now with %>%
babynames%>%filter(sex=="M",name=="Taylor")%>%
select(n)%>%
sum


Note how you work from the inside out when you rewrite the nested code: you first put in babynames, then you use %>% to filter() the data. After that, you select n and lastly, you sum() everything.

Remember also that you already saw another example of such a nested code that was converted to more readable code in the beginning of this tutorial, where you used the log(), diff(), exp() and round() functions to perform calculations on x.

#### Functions that Use the Current Environment

Unfortunately, there are some exceptions to the more general rules that were outlined in the previous section. Let’s take a look at some of them here.

Consider this example, where you use the assign() function to assign the value 10 to the variable x.

# Assign 10 to x
assign("x", 10)

# Assign 100 to x
"x" %>% assign(100)

# Return x
x


10

You see that the second call with the assign() function, in combination with the pipe, doesn’t work properly. The value of x is not updated.

Why is this?

That’s because the function assigns the new value 100 to a temporary environment used by %>%. So, if you want to use assign() with the pipe, you must be explicit about the environment:

# Define your environment
env <- environment()

# Add the environment to assign()
"x" %>% assign(100, envir = env)

# Return x
x


100

#### Functions with Lazy Evaluation

In R, arguments within functions are only computed when the function uses them. This means that no arguments are computed before you call your function! It also means that the pipe computes each element of the pipeline in turn.

One place that this is a problem is tryCatch(), which lets you capture and handle errors, like in this example:

tryCatch(stop("!"), error = function(e) "An error")

stop("!") %>%
tryCatch(error = function(e) "An error")


‘An error’

Error in eval(expr, envir, enclos): !
Traceback:

1. stop("!") %>% tryCatch(error = function(e) "An error")

2. eval(lhs, parent, parent)

3. eval(expr, envir, enclos)

4. stop("!")


You’ll see that the nested way of writing down this line of code works perfectly, while the piped alternative returns an error. Other functions with the same behavior are try(), suppressMessages(), and suppressWarnings() in base R.

### Argument Placeholder

There are also instances where you can use the pipe operator as an argument placeholder. Take a look at the following examples:

• f(x, y) can be rewritten as y %>% f(x, .)

In some cases, you won’t want the value or the magrittr placeholder to be at the first position of the function call, which has been the case in every example that you have seen up until now. Reconsider this line of code:

pi %>% round(6)


If you would rewrite this line of code, pi would be the first argument in your round() function. But what if you want to replace the second, third, … argument and use that one as the magrittr placeholder in your function call?

Take a look at this example, where the value is actually at the third position in the function call:

"Ceci n'est pas une pipe" %>% gsub("une", "un", .)


‘Ceci n\’est pas un pipe’

• f(y, z = x) can be rewritten as x %>% f(y, z = .)

Likewise, you might want to make the value of a specific argument within your function call the magrittr placeholder. Consider the following line of code:

6 %>% round(pi, digits=.)


### Re-using the Placeholder for Attributes

It is straightforward to use the placeholder several times in a right-hand side expression. However, when the placeholder only appears in a nested expression, magrittr will still apply the first-argument rule. The reason is that in most cases this produces cleaner code.

Here are some general “rules” that you can take into account when you’re working with argument placeholders in nested function calls:

• f(x, y = nrow(x), z = ncol(x)) can be rewritten as x %>% f(y = nrow(.), z = ncol(.))
# Initialize a matrix ma
ma <- matrix(1:12, 3, 4)

# Return the maximum of the values inputted
max(ma, nrow(ma), ncol(ma))

# Return the maximum of the values inputted
ma %>% max(nrow(ma), ncol(ma))


12

12

The behavior can be overruled by enclosing the right-hand side in braces:

• f(y = nrow(x), z = ncol(x)) can be rewritten as x %>% {f(y = nrow(.), z = ncol(.))}
# Only return the maximum of the nrow(ma) and ncol(ma) input values
ma %>% {max(nrow(ma), ncol(ma))}


4

To conclude, also take a look at the following example, where you could possibly want to adjust the workings of the argument placeholder in the nested function call:

# The function that you want to rewrite
paste(1:5, letters[1:5])

# The nested function call with dot placeholder
1:5 %>%
paste(., letters[.])

[1] "1 a" "2 b" "3 c" "4 d" "5 e"
[1] "1 a" "2 b" "3 c" "4 d" "5 e"

You see that if the placeholder is only used in a nested function call, the magrittr placeholder will also be placed as the first argument! If you want to avoid this from happening, you can use the curly brackets { and }:

# The nested function call with dot placeholder and curly brackets
1:5 %>% {
paste(letters[.])
}

# Rewrite the above function call
paste(letters[1:5])

[1] "a" "b" "c" "d" "e"
[1] "a" "b" "c" "d" "e"

### Building Unary Functions

Unary functions are functions that take one argument. Any pipeline that you might make that consists of a dot ., followed by functions chained together with %>%, can be used later if you want to apply it to values. Take a look at the following example of such a pipeline:

. %>% cos %>% sin


This pipeline would take some input, after which both the cos() and sin() functions would be applied to it.

But you’re not there yet! If you want this pipeline to do exactly that which you have just read, you need to assign it first to a variable f, for example. After that, you can re-use it later to do the operations that are contained within the pipeline on other values.

# Unary function
f <- . %>% cos %>% sin

f

structure(function (value)
freduce(value, _function_list), class = c("fseq", "function"
))

Remember also that you could put parentheses after the cos() and sin() functions in the line of code if you want to improve readability. Consider the same example with parentheses: . %>% cos() %>% sin().

You see, building functions in magrittr is very similar to building functions with base R! If you’re not sure how similar they actually are, check out the line above and compare it with the next line of code; both lines have the same result!

# is equivalent to
f <- function(.) sin(cos(.))

f

function (.)
sin(cos(.))

### Compound Assignment Pipe Operations

There are situations where you want to overwrite the value of the left-hand side, just like in the example right below. Intuitively, you will use the assignment operator <- to do this.

# Load in the Iris data
data(iris)

# Add column names to the Iris data
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")

# Compute the square root of iris$Sepal.Length and assign it to the variable
iris$Sepal.Length <- iris$Sepal.Length %>% sqrt()

However, there is a compound assignment pipe operator, which allows you to use a shorthand notation to assign the result of your pipeline immediately to the left-hand side:

# Compute the square root of iris$Sepal.Length and assign it to the variable
iris$Sepal.Length %<>% sqrt

# Return Sepal.Length
iris$Sepal.Length


Note that the compound assignment operator %<>% needs to be the first pipe operator in the chain for this to work. This is completely in line with what you just read about the operator being a shorthand notation for a longer notation with repetition, where you use the regular <- assignment operator.

As a result, this operator will assign the result of a pipeline rather than returning it.
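To make the shorthand concrete, here is a minimal sketch of the equivalence:

library(magrittr)

x <- c(-2, 5, -1)

# Compound assignment pipe: pipe x through abs and sort, then overwrite x
x %<>% abs %>% sort

# ...which is shorthand for the longer form with the regular assignment operator:
# x <- x %>% abs %>% sort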

### Tee Operations with The Tee Operator

The tee operator works exactly like %>%, but it returns the left-hand side value rather than the potential result of the right-hand side operations.

This means that the tee operator can come in handy in situations where you have included functions that are used for their side effect, such as plotting with plot() or printing to a file.

In other words, functions like plot() typically don’t return anything. That means that, after calling plot(), for example, your pipeline would end. However, in the following example, the tee operator %T>% allows you to continue your pipeline even after you have used plot():

set.seed(123)
rnorm(200) %>%
matrix(ncol = 2) %T>%
plot %>%
colSums


### Exposing Data Variables with the Exposition Operator

When you’re working with R, you’ll find that many functions take a data argument. Consider, for example, the lm() function or the with() function. These functions are useful in a pipeline where your data is first processed and then passed into the function.
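For instance, because lm() takes a data argument, processed data can be piped straight into a model; a minimal sketch with the built-in iris data:

library(magrittr)

# Filter first, then hand the result to lm() via its data argument
iris %>%
  subset(Sepal.Length > 5) %>%
  lm(Sepal.Width ~ Species, data = .)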

For functions that don’t have a data argument, such as the cor() function, it’s still handy if you can expose the variables in the data. That’s where the %$% operator comes in. Consider the following example:

iris %>%
subset(Sepal.Length > mean(Sepal.Length)) %$%
cor(Sepal.Length, Sepal.Width)


0.336696922252551

With the help of %$% you make sure that Sepal.Length and Sepal.Width are exposed to cor(). Likewise, you see that the data in the data.frame() function is passed to ts.plot() to plot several time series on a common plot:

data.frame(z = rnorm(100)) %$%
ts.plot(z)


## dplyr and magrittr

In the introduction to this tutorial, you already learned that the development of dplyr and magrittr occurred around the same time, namely, around 2013-2014. And, as you have read, the magrittr package is also part of the Tidyverse.

In this section, you will discover how exciting it can be when you combine both packages in your R code.

For those of you who are new to the dplyr package, you should know that this R package was built around five verbs, namely, “select”, “filter”, “arrange”, “mutate” and “summarize”. If you have already manipulated data for some data science project, you will know that these verbs make up the majority of the data manipulation tasks that you generally need to perform on your data.

Take an example of some traditional code that makes use of these dplyr functions:

library(hflights)

grouped_flights <- group_by(hflights, Year, Month, DayofMonth)
flights_data <- select(grouped_flights, Year:DayofMonth, ArrDelay, DepDelay)
summarized_flights <- summarise(flights_data,
arr = mean(ArrDelay, na.rm = TRUE),
dep = mean(DepDelay, na.rm = TRUE))
final_result <- filter(summarized_flights, arr > 30 | dep > 30)

final_result

Year Month DayofMonth arr dep
2011 2 4 44.08088 47.17216
2011 3 3 35.12898 38.20064
2011 3 14 46.63830 36.13657
2011 4 4 38.71651 27.94915
2011 4 25 37.79845 22.25574
2011 5 12 69.52046 64.52039
2011 5 20 37.02857 26.55090
2011 6 22 65.51852 62.30979
2011 7 29 29.55755 31.86944
2011 9 29 39.19649 32.49528
2011 10 9 61.90172 59.52586
2011 11 15 43.68134 39.23333
2011 12 29 26.30096 30.78855
2011 12 31 46.48465 54.17137

When you look at this example, you immediately understand why dplyr and magrittr are able to work so well together:

hflights %>%
group_by(Year, Month, DayofMonth) %>%
select(Year:DayofMonth, ArrDelay, DepDelay) %>%
summarise(arr = mean(ArrDelay, na.rm = TRUE), dep = mean(DepDelay, na.rm = TRUE)) %>%
filter(arr > 30 | dep > 30)


Both code chunks are fairly long, but you could argue that the second code chunk is clearer if you want to follow along through all of the operations. With the creation of intermediate variables in the first code chunk, you could possibly lose the “flow” of the code. By using %>%, you gain a clearer overview of the operations that are being performed on the data!

In short, dplyr and magrittr are your dream team for manipulating data in R!

## RStudio Keyboard Shortcuts for Pipes

Adding all these pipes to your R code can be a challenging task! To make your life easier, John Mount, co-founder and Principal Consultant at Win-Vector, LLC and DataCamp instructor, has released a package with some RStudio add-ins that allow you to create keyboard shortcuts for pipes in R. Addins are actually R functions with a bit of special registration metadata. A simple addin can, for example, be a function that inserts a commonly used snippet of text, but addins can also get very complex!

With these addins, you’ll be able to execute R functions interactively from within the RStudio IDE, either by using keyboard shortcuts or by going through the Addins menu.

Note that this package is actually a fork from RStudio’s original add-in package, which you can find here. Be careful though, the support for addins is available only within the most recent release of RStudio! If you want to know more on how you can install these RStudio addins, check out this page.

## When Not To Use the Pipe Operator in R

In the above, you have seen that pipes are definitely something that you should be using when you’re programming with R. More specifically, you have seen this by covering some cases in which pipes prove to be very useful! However, there are some situations, outlined by Hadley Wickham in “R for Data Science”, in which you can best avoid them:

• Your pipes are longer than (say) ten steps.

In cases like these, it’s better to create intermediate objects with meaningful names. It will not only be easier for you to debug your code, but you’ll also understand your code better and it’ll be easier for others to understand your code.

• You have multiple inputs or outputs.

If you aren’t transforming one primary object, but two or more objects are combined together, it’s better not to use the pipe.

• You are starting to think about a directed graph with a complex dependency structure.

Pipes are fundamentally linear and expressing complex relationships with them will only result in complex code that will be hard to read and understand.

• You’re doing internal package development

Using pipes in internal package development is a no-go, as it makes it harder to debug!

For more reflections on this topic, check out this Stack Overflow discussion. Other situations that appear in that discussion are loops, package dependencies, argument order and readability.

In short, you could summarize it all as follows: keep the two things in mind that make this construct so great, namely, readability and flexibility. As soon as one of these two big advantages is compromised, you might consider some alternatives in favor of the pipes.

## Alternatives to Pipes in R

After all that you have read, you might also be interested in some alternatives that exist in the R programming language. Some of the solutions that you have seen in this tutorial were the following:

• Create intermediate variables with meaningful names;

Instead of chaining all operations together and outputting one single result, break up the chain and make sure you save intermediate results in separate variables. Be careful with the naming of these variables: the goal should always be to make your code as understandable as possible!

• Nest your code so that you read it from the inside out;

One of the possible objections that you could have against pipes is that they go against the “flow” that you have been accustomed to with base R. The solution is then to stick with nesting your code! But what if you don’t like pipes and you also find nesting quite confusing? The solution here can be to use indentation to highlight the hierarchy.

• … Do you have more suggestions? Make sure to let me know – Drop me a tweet @willems_karlijn

## Conclusion

You have covered a lot of ground in this tutorial: you have seen where %>% comes from, what it exactly is, why you should use it and how you should use it. You’ve seen that the dplyr and magrittr packages work wonderfully together and that there are even more operators out there! Lastly, you have also seen some cases in which you shouldn’t use it when you’re programming in R and what alternatives you can use in such cases.

If you’re interested in learning more about the Tidyverse, consider DataCamp’s Introduction to the Tidyverse course.

## Five Tips to Improve Your R Code (article) by DataCamp

Five useful tips that you can use to effectively improve your R code, from using seq() to create sequences to ditching which() and much more!

@drsimonj here with five simple tricks I find myself sharing all the time with fellow R users to improve their code!

## 1. More fun to sequence from 1

Next time you use the colon operator to create a sequence from 1 like 1:n, try seq().

# Sequence a vector
x <- runif(10)
seq(x)
#>  [1]  1  2  3  4  5  6  7  8  9 10

# Sequence an integer
seq(nrow(mtcars))
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
#> [24] 24 25 26 27 28 29 30 31 32


The colon operator can produce unexpected results that can create all sorts of problems without you noticing! Take a look at what happens when you want to sequence the length of an empty vector:

# Empty vector
x <- c()

1:length(x)
#> [1] 1 0

seq(x)
#> integer(0)


You’ll also notice that this saves you from using functions like length(). When applied to an object of a certain length, seq() will automatically create a sequence from 1 to the length of the object.

## 2. vector() what you c()

Next time you create an empty vector with c(), try to replace it with vector("type", length).

# A numeric vector with 5 elements
vector("numeric", 5)
#> [1] 0 0 0 0 0

# A character vector with 3 elements
vector("character", 3)
#> [1] "" "" ""


Doing this improves memory usage and increases speed! You often know upfront what type of values will go into a vector, and how long the vector will be. Using c() means R has to slowly work both of these things out. So help give it a boost with vector()!

A good example of this value is in a for loop. People often write loops by declaring an empty vector and growing it with c() like this:

x <- c()
for (i in seq(5)) {
x <- c(x, i)
}

#> x at step 1 : 1
#> x at step 2 : 1, 2
#> x at step 3 : 1, 2, 3
#> x at step 4 : 1, 2, 3, 4
#> x at step 5 : 1, 2, 3, 4, 5


Instead, pre-define the type and length with vector(), and reference positions by index, like this:

n <- 5
x <- vector("integer", n)
for (i in seq(n)) {
x[i] <- i
}

#> x at step 1 : 1, 0, 0, 0, 0
#> x at step 2 : 1, 2, 0, 0, 0
#> x at step 3 : 1, 2, 3, 0, 0
#> x at step 4 : 1, 2, 3, 4, 0
#> x at step 5 : 1, 2, 3, 4, 5


Here’s a quick speed comparison:

n <- 1e5

x_empty <- c()
system.time(for(i in seq(n)) x_empty <- c(x_empty, i))
#>    user  system elapsed
#>  15.238   2.327  17.650

x_zeros <- vector("integer", n)
system.time(for(i in seq(n)) x_zeros[i] <- i)
#>    user  system elapsed
#>   0.007   0.000   0.007


That should be convincing enough!

## 3. Ditch the which()

Next time you use which(), try to ditch it! People often use which() to get indices from some boolean condition, and then select values at those indices. This is not necessary.

Getting vector elements greater than 5:

x <- 3:7

# Using which (not necessary)
x[which(x > 5)]
#> [1] 6 7

# No which
x[x > 5]
#> [1] 6 7


Or counting number of values greater than 5:

# Using which
length(which(x > 5))
#> [1] 2

# Without which
sum(x > 5)
#> [1] 2


Why should you ditch which()? It’s often unnecessary and boolean vectors are all you need.

For example, R lets you select elements flagged as TRUE in a boolean vector:

condition <- x > 5
condition
#> [1] FALSE FALSE FALSE  TRUE  TRUE
x[condition]
#> [1] 6 7


Also, when combined with sum() or mean(), boolean vectors can be used to get the count or proportion of values meeting a condition:

sum(condition)
#> [1] 2
mean(condition)
#> [1] 0.4


which() tells you the indices of TRUE values:

which(condition)
#> [1] 4 5


And while the results are not wrong, it’s just not necessary. For example, I often see people combining which() and length() to test whether any or all values are TRUE. Instead, you just need any() or all():

x <- c(1, 2, 12)

# Using which() and length() to test if any values are greater than 10
if (length(which(x > 10)) > 0)
print("At least one value is greater than 10")
#> [1] "At least one value is greater than 10"

# Wrapping a boolean vector with any()
if (any(x > 10))
print("At least one value is greater than 10")
#> [1] "At least one value is greater than 10"

# Using which() and length() to test if all values are positive
if (length(which(x > 0)) == length(x))
print("All values are positive")
#> [1] "All values are positive"

# Wrapping a boolean vector with all()
if (all(x > 0))
print("All values are positive")
#> [1] "All values are positive"


Oh, and it saves you a little time…

x <- runif(1e8)

system.time(x[which(x > .5)])
#>    user  system elapsed
#>   1.156   0.522   1.686

system.time(x[x > .5])
#>    user  system elapsed
#>   1.071   0.442   1.662


## 4. factor that factor!

Ever removed values from a factor and found you’re stuck with old levels that don’t exist anymore? I see all sorts of creative ways to deal with this. The simplest solution is often just to wrap it in factor() again.

This example creates a factor with four levels ("a", "b", "c" and "d"):

# A factor with four levels
x <- factor(c("a", "b", "c", "d"))
x
#> [1] a b c d
#> Levels: a b c d

plot(x)


If you drop all cases of one level ("d"), the level is still recorded in the factor:

# Drop all values for one level
x <- x[x != "d"]

# But we still have this level!
x
#> [1] a b c
#> Levels: a b c d

plot(x)


A super simple method for removing it is to use factor() again:

x <- factor(x)
x
#> [1] a b c
#> Levels: a b c

plot(x)


This is typically a good solution to a problem that gets a lot of people mad. So save yourself a headache and factor that factor!

## 5. First you get the $, then you get the power

Next time you want to extract values from a data.frame column where the rows meet a condition, specify the column with $ before the rows with [.

Say you want the horsepower (hp) for cars with 4 cylinders (cyl), using the mtcars data set. You can write either of these:

# rows first, column second - not ideal
mtcars[mtcars$cyl == 4, ]$hp
#>  [1]  93  62  95  66  52  65  97  66  91 113 109

# column first, rows second - much better
mtcars$hp[mtcars$cyl == 4]
#>  [1]  93  62  95  66  52  65  97  66  91 113 109


The tip here is to use the second approach.

But why is that?

First reason: do away with that pesky comma! When you specify rows before the column, you need to remember the comma: mtcars[mtcars$cyl == 4,]$hp. When you specify column first, this means that you’re now referring to a vector, and don’t need the comma!

Second reason: speed! Let’s test it out on a larger data frame:

# Simulate a data frame...
n <- 1e7
d <- data.frame(
a = seq(n),
b = runif(n)
)

# rows first, column second - not ideal
system.time(d[d$b > .5, ]$a)
#>    user  system elapsed
#>   0.497   0.126   0.629

# column first, rows second - much better
system.time(d$a[d$b > .5])
#>    user  system elapsed
#>   0.089   0.017   0.107


Worth it, right?

Still, if you want to hone your skills as an R data frame ninja, I suggest learning dplyr. You can get a good overview on the dplyr website or really learn the ropes with online courses like DataCamp’s Data Manipulation in R with dplyr.

## Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

## [Book Review] Efficient R Programming

In the crowded market of data science and R language books, Lovelace and Gillespie's Efficient R Programming (2016) stands out. Across ten comprehensive chapters, the authors cover the main principles of developing efficient R programs. Unless you are a member of the R core development team, this book will be useful to you, whether you are a beginner R programmer or an advanced data scientist or engineer. It is packed with useful tips and techniques that help improve both the efficiency of your R programs and the efficiency of your development process. Even though I have used R daily for more than four years, every chapter of this book deepened my understanding of techniques I had already learned while offering new insights into how to improve my R code. Each chapter of Efficient R Programming covers a single topic, includes a "Top five tips" list, surveys a variety of packages and techniques, and provides useful exercises and problem sets to consolidate the key insights.

In Chapter 1. Introduction, the authors orient the audience to the key characteristics of R that affect its efficiency, compared to other programming languages. Importantly, the authors address R efficiency not just in the expected sense of algorithmic speed and complexity, but broaden its scope to include programmer productivity and how it relates to programming idioms, IDEs, coding conventions, and community support – all things that can improve the efficiency of writing and maintaining code. This is doubly important for a language like R, which is notoriously flexible in its ability to solve problems in multiple ways. The first chapter concludes by introducing the reader to two valuable packages: (1) microbenchmark, an accurate benchmarking tool with nanosecond precision; and (2) profvis, a handy tool for profiling larger chunks of code. These two packages are repeatedly used throughout the remainder of the book to illustrate key concepts and highlight efficient techniques.
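To give a flavor of the benchmarking workflow the book builds on, here is a minimal microbenchmark sketch comparing two equivalent subsetting idioms:

library(microbenchmark)

x <- runif(1e4)

# Compare two ways of selecting the same elements
microbenchmark(
  with_which = x[which(x > 0.5)],
  logical    = x[x > 0.5]
)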

In Chapter 2. Efficient Setup, the reader is introduced to techniques for setting up a development environment that facilitates efficient workflow. Here the authors cover choices in operating system, R version, R start-up, alternative R interpreters, and how to maintain up-to-date packages with tools like packrat and installr. I found their overview of the R startup process particularly useful, as the authors taught me how to modify my .Renviron and .Rprofile files to cache external API keys and customize my R environment, for example by adding alias shortcuts to commonly used functions. The chapter concludes by discussing how to setup and customize the RStudio environment (e.g., modifying code editing preference, editing keyboard shortcuts, and turning off restore .Rdata to help prevent bugs), which can greatly improve individual efficiency.
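As an illustration of this kind of start-up customization (the key name MY_API_KEY and the ht() shortcut are hypothetical examples, not from the book):

# In ~/.Renviron (plain KEY=value lines, not R code):
#   MY_API_KEY=your-secret-key

# In ~/.Rprofile (ordinary R code run at start-up):
api_key <- Sys.getenv("MY_API_KEY")                     # read the cached key
ht <- function(d, n = 5) rbind(head(d, n), tail(d, n))  # alias: show head and tail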

Chapter 3. Efficient Programming introduces the reader to efficient programming by discussing “big picture” programming techniques and how they relate to the R language. This chapter will most likely be beneficial to established programmers who are new to R, as well as to data scientists and analysts who have limited exposure to programming in a production environment. In this chapter the authors introduce the “golden rule of R programming” before delving into the usual suspects of inefficient R code. Usefully, the book illustrates multiple ways of performing the same task (e.g., data selection) with different code snippets, and highlights the performance differences through benchmarked results. Here we learn about the pitfalls of growing vectors, the benefits of vectorization, and the proper use of factors. The chapter wraps up with the requisite overview of the apply function family, before discussing the use of variable caching (package memoise) and byte compilation as important techniques in writing fast, responsive R code.
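A minimal sketch of the caching idea with memoise (the slow function is a stand-in for any expensive computation):

library(memoise)

slow_lookup <- function(x) { Sys.sleep(1); x^2 }  # stand-in for an expensive call
fast_lookup <- memoise(slow_lookup)

fast_lookup(4)  # ~1 second: computed and cached
fast_lookup(4)  # near-instant: returned from the cache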

Chapter 4. Efficient Workflow will be of primary use to junior-level programmers, analysts, and project managers who haven’t had enough time or practice to develop their own efficient workflows. This chapter discusses the importance of project planning, audience, and scope before delving into common tools that facilitate project management. In my opinion, one of the best aspects of R is the huge, maddeningly broad number of packages that are available on CRAN and GitHub. The authors provide useful advice and techniques for identifying the packages that will be of most use to your project. A brief discussion on the use of R Markdown and knitr concludes this chapter.

Chapter 5. Efficient Input/Output is devoted to efficient read/write operations. Anybody who has ever struggled with loading a big file into R for analysis will appreciate this discussion and the packages covered in this chapter. The rio package, which can handle a wide variety of common data file types, provides a useful starting point for exploratory work on a new project. Other packages that are discussed (including readr and data.table) provide more efficient I/O than those in base R. The chapter ends with a discussion of two new file formats and associated packages (feather and RProtoBuf) that can be used for cross-language, fast, efficient serialized data I/O.
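A quick sketch of the reading options mentioned above, assuming a local data.csv file exists:

library(rio)
d1 <- import("data.csv")      # rio chooses the reader from the file extension

library(readr)
d2 <- read_csv("data.csv")    # faster and more consistent than base read.csv()

library(data.table)
d3 <- fread("data.csv")       # very fast reader for large files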

Chapter 6. Efficient Data Carpentry introduces what are, in my opinion, the most useful R tools for data munging – what Lovelace and Gillespie prefer to call by the more admirable term “data carpentry.” This chapter could more aptly be titled the “Tidyverse” or the “Hadleyverse”, for most of the tools discussed in this chapter were developed by prolific R package writer Hadley Wickham. Sections of the chapter are devoted to each of the primary packages of the tidyverse: tibble, a more useful and user-friendly data.frame; tidyr, used for reshaping data between wide and long forms; stringr, which provides a consistent API over obtuse regex functions; dplyr, used for efficient data processing including filtering, sorting, mutating, joining, and summarizing; and of course magrittr, for piping all these operations together with %>%. A brief section on package data.table rounds out the discussion on efficient data carpentry.
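A small taste of this style of data carpentry, using tibble and the gather() API that tidyr offered when the book was written:

library(tibble)
library(tidyr)

tb <- tibble(id = 1:2, a = c(10, 20), b = c(30, 40))

# Reshape from wide to long form
gather(tb, key = "variable", value = "value", a:b)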

Chapter 7. Efficient Optimization begins with the requisite optimization quote by computer scientist Donald Knuth:

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.

In this chapter, the authors introduce profvis, and they illustrate the utility of this package by showing how it can be used to identify bottlenecks in a Monte Carlo simulation of a Monopoly game. The authors next examine alternative methods in base R that can be used for greater efficiency. These include discussion of if() vs. ifelse(), sorting operations, AND (&) and OR (|) vs. && and ||, row/column operations, and sparse matrices. The authors then apply these tricks to the Monopoly code to show a 20-fold decrease in run time. The chapter concludes with a discussion and examples of parallelization, and the use of Rcpp as an R interface to underlying fast and efficient C++ code.
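A minimal profvis sketch that surfaces a deliberate bottleneck (growing a vector inside a loop), in the spirit of the Monopoly example:

library(profvis)

profvis({
  out <- c()
  for (i in 1:1e4) out <- c(out, sqrt(i))  # profvis flags this line as the hot spot
})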

I found the chapter Efficient Hardware to be the least useful in the book (spoiler alert: add more RAM or migrate to cloud-based services), though the chapter on Efficient Collaboration will be particularly useful for novice data scientists lacking real-world experience developing data artifacts and production applications in a distributed, collaborative environment. In this chapter, the authors discuss the importance of coding style, code comments, version control, and code review. The final chapter, Efficient Learning, will find appreciative readers among those just getting started with R (and if this describes you, I would suggest that you start with this chapter first!). Here the authors discuss using and navigating R’s excellent internal help utility, as well as the importance of vignettes and source code in learning and understanding. After briefly introducing swirl, the book concludes with a discussion of online resources, including Stack Overflow; the authors thankfully provide the newbie with important information on how to ask the right questions and the importance of providing a great reproducible R example.

In summary, Lovelace and Gillespie’s Efficient R Programming does an admirable job of illustrating the key techniques and packages for writing efficient R programs. The book will appeal to a wide audience from advanced R programmers to those just starting out. In my opinion, the book hits that pragmatic sweet spot between breadth and depth, and it usefully contains links to external resources for those wishing to delve deeper into a specific topic. After reading this book, I immediately went to work refactoring a Shiny dashboard application I am developing and several internal R packages I maintain for our data science team. In a matter of a few short hours, I witnessed a 5 to 10-fold performance increase in these applications just by implementing a couple of new techniques. I was particularly impressed with the greatly improved end-user performance and the ease with which I implemented intelligent caching with the memoise package for a consumer decision tree application I am developing. If you care deeply about writing beautiful, clean, efficient code and bringing your data science to the next level, I highly recommend adding Efficient R Programming to your arsenal.
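
For the curious, the memoise pattern behind those caching gains is only a few lines; here is a minimal sketch (slow_lookup is a hypothetical stand-in for an expensive call):

```r
library(memoise)

slow_lookup <- function(x) { Sys.sleep(1); x^2 }  # pretend this hits a database
fast_lookup <- memoise(slow_lookup)

system.time(fast_lookup(4))  # ~1 second on the first call
system.time(fast_lookup(4))  # near-instant: the result comes from the cache
```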

The book is published by O’Reilly Media and is available online at the authors’ website, as well as through Safari.

## The 5 Most Effective Ways to Learn R

Whether you’re plotting a simple time series or building a predictive model for the next election, the R programming language’s flexibility will ensure you have all the capabilities you need to get the job done. In this blog we will take a look at five effective tactics for learning this essential data science language, as well as some of the top resources associated with each. These tactics should be used to complement one another on your path to mastering the world’s most powerful statistical language!

#### 1. Watch Instructive Videos

We often flock to YouTube when we want to learn how to play a song on the piano, change a tire, or chop an onion, but why should it be any different when learning how to perform calculations using the most popular statistical programming language? LearnR, Google Developers, and MarinStatsLectures are all fantastic YouTube channels with playlists specifically dedicated to the R language.

#### 2. Read R Blogs

There’s a good chance you came across this article through the R-bloggers website, which curates content from some of the best blogs about R on the web today. Since 750+ blogs are curated on R-bloggers alone, you shouldn’t have a problem finding an article on the exact topic or use case you’re interested in!

A few notable R blogs:

#### 3. Take an Online Course

As we’ve mentioned in previous blogs, there are a great number of online classes you can take to learn specific technical skills. In many instances these courses are free, or very affordable, with some offering discounts to college students. Why spend thousands of dollars on a university course when you can get as good an understanding, if not better (IMHO), online?

Some sites that offer great R courses include:

#### 4. Read Books

Many times, books get a bad rap, since most programming concepts can be found online for free. Sure, if you are going to use a book just as a reference, you’d probably be better off saving that money and taking to Google search. However, if you’re a beginner, or someone who wants to learn the fundamentals, working through an entire book at the foundational level will provide a high degree of understanding.

There is a fantastic list of the best books for R at Data Science Central.

#### 5. Experiment!

You can read articles and watch videos all day long, but if you never try it for yourself, you’ll never learn! Datazar is a great place for you to jump right in and experiment with what you’ve learned. You can immediately start by opening the R console or creating a notebook in our cloud-based environment. If you get stuck, you can consult with other users and even work with scripts that have been opened up by others!

I hope you found this helpful and as always if you would like to share any additional resources, feel free to drop them in the comments below!

The 5 Most Effective Ways to Learn R was originally published on the Datazar Blog on Medium.

## Tutorials for Learning R


There are tons of resources to help you learn the different aspects of R, and as a beginner this can be overwhelming. It’s also a dynamic language and rapidly changing, so it’s important to keep up with the latest tools and technologies.

That’s why R-bloggers and DataCamp have worked together to bring you a learning path for R. Each section points you to relevant resources and tools to get you started and keep you engaged to continue learning. It’s a mix of materials ranging from documentation, online courses, books, and more.

Just like R, this learning path is a dynamic resource. We want to continually evolve and improve the resources to provide the best possible learning experience. So if you have suggestions for improvement please email tal.galili@gmail.com with your feedback.

## Learning Path

Getting started:  The basics of R

R packages

Data Manipulation

Data Visualization

Data Science & Machine Learning with R

Reporting Results in R

Learning advanced R topics in (paid) online courses

Next steps

## Getting started:  The basics of R

The best way to learn R is by doing. In case you are just getting started with R, this free Introduction to R tutorial by DataCamp is a great resource, as well as its successor, Intermediate R Programming (subscription required). Both courses teach you R programming and data science interactively, at your own pace, in the comfort of your browser. You get immediate feedback during exercises, with helpful hints along the way so you don’t get stuck.

Another free online interactive learning tutorial for R is available on O’Reilly’s Code School website, called Try R. An offline interactive learning resource is swirl, an R package that makes it fun and easy to become an R programmer. You can take a swirl course by (i) installing the package in R, and (ii) selecting a course from the course library. If you want to start right away without needing to install anything, you can also opt for the online version of swirl.
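
Getting started with swirl takes just three lines at the R prompt:

```r
install.packages("swirl")  # one-time installation
library(swirl)
swirl()                    # then pick a course from the interactive menu
```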

There are also some very good MOOCs available on edX and Coursera that teach you the basics of R programming. On edX you can find Introduction to R Programming by Microsoft, an 8-hour course that focuses on the fundamentals and basic syntax of R. On Coursera there is the very popular R Programming course by Johns Hopkins. Both are highly recommended!

If you instead prefer to learn R via a written tutorial or book, there is plenty of choice. There is the Introduction to R manual by CRAN, as well as some very accessible books like Jared Lander’s R for Everyone or R in Action by Robert Kabacoff.

You can download a copy of R from the Comprehensive R Archive Network (CRAN). There are binaries available for Linux, Mac and Windows.

Once R is installed you can choose to either work with the basic R console, or with an integrated development environment (IDE). RStudio is by far the most popular IDE for R and supports debugging, workspace management, plotting and much more (make sure to check out the RStudio shortcuts).

Next to RStudio you also have Architect, an Eclipse-based IDE for R. If you prefer to work with a graphical user interface, you can have a look at R Commander (aka Rcmdr) or Deducer.

## R packages

R packages are the fuel that drives the growth and popularity of R. R packages are bundles of code, data, documentation, and tests that are easy to share with others. Before you can use a package, you first have to install it. Some packages, like the base package, are automatically installed when you install R. Others, such as the ggplot2 package, don’t come with the bundled R installation and need to be installed separately.

Many (but not all) R packages are organized and available from CRAN, a network of servers around the world that store identical, up-to-date versions of code and documentation for R. You can easily install these packages from inside R using the install.packages() function. CRAN also maintains a set of Task Views that identify all the packages associated with a particular task, such as TimeSeries.

Next to CRAN there is also Bioconductor, which has packages for the analysis of high-throughput genomic data, as well as the GitHub and Bitbucket repositories of R package developers. You can easily install packages from these repositories using the devtools package.
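
In practice, installing from CRAN and from GitHub looks like this (the hadley/ggplot2 repository is used purely as an example):

```r
# From CRAN
install.packages("ggplot2")
library(ggplot2)

# From GitHub or Bitbucket, via devtools
install.packages("devtools")
devtools::install_github("hadley/ggplot2")   # example repository
# devtools::install_bitbucket("user/repo")   # placeholder repository
```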

Finding a package can be hard, but luckily you can easily search packages from CRAN, GitHub, and Bioconductor using Rdocumentation or inside-R, or you can have a look at this quick list of useful R packages.

Finally, once you start working with R, you’ll quickly find out that R package dependencies can cause a lot of headaches. Once you are confronted with that issue, make sure to check out packrat (see the video tutorial) or checkpoint. When you need to update R on Windows, you can use the updateR() function from the installr package.
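
A minimal sketch of both tools (packrat assumes you are inside a project directory; updateR() works on Windows only):

```r
# Freeze a project's package dependencies with packrat
install.packages("packrat")
packrat::init()       # snapshot the packages used in the current project

# Update R itself on Windows with installr
install.packages("installr")
installr::updateR()
```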

## Importing your data into R

The data you want to import into R can come in all sorts of formats: flat files, statistical software files, databases, and web data.

Getting different types of data into R often requires a different approach for each. To learn more about getting different data types into R, you can check out this online Importing Data into R tutorial (subscription required), this post on data importing, or this webinar by RStudio.

• Flat files are typically simple text files that contain table data. The standard distribution of R provides functionality to import these flat files into R as a data frame with functions such as read.table() and read.csv() from the utils package. Specific R packages to import flat-file data are readr, a fast and very easy-to-use package that is less verbose than utils and multiple times faster (more information), and data.table, whose fread() function is great for importing and munging data into R.
• Software packages such as SAS, STATA and SPSS use and produce their own file types. The haven package by Hadley Wickham can deal with importing SAS, STATA and SPSS data files into R and is very easy to use. Alternatively there is the foreign package, which is able to import not only SAS, STATA and SPSS files but also more exotic formats like Systat and Weka, and is able to export data back to various formats. (Tip: if you’re switching from SAS, SPSS or STATA to R, check out Bob Muenchen’s tutorial (subscription required).)
• The packages used to connect to and import from a relational database depend on the type of database you want to connect to. To connect to a MySQL database, for example, you will need the RMySQL package; others include the RPostgreSQL and ROracle packages. The functions you can then use to access and manipulate the database are specified in another R package called DBI (see the sketch after this list).
• If you want to harvest web data using R, you need to connect R to resources online using APIs or through scraping with packages like rvest. To get started with all of this, there is a great resource freely available on the blog of Rolf Fredheim.
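
Here is the promised sketch for the statistical-software and database bullets; the file survey.sav and all connection details are placeholders:

```r
# Statistical-software files via haven
library(haven)
spss_data <- read_sav("survey.sav")       # hypothetical SPSS file

# A relational database via DBI + RMySQL (credentials are placeholders)
library(DBI)
con <- dbConnect(RMySQL::MySQL(), dbname = "mydb",
                 host = "localhost", user = "user", password = "pass")
results <- dbGetQuery(con, "SELECT * FROM sales LIMIT 10")
dbDisconnect(con)
```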

## Data Manipulation

Turning your raw data into well-structured data is important for robust analysis, and it makes the data suitable for processing. R has many built-in functions for data processing, but they are not always that easy to use. Luckily, there are some great packages that can help you:

• The tidyr package allows you to “tidy” your data. Tidy data is data where each column is a variable and each row an observation. As such, it turns your data into data that is easy to work with. Check this excellent resource on how you can tidy your data using tidyr.
• If you want to do string manipulation, you should learn about the stringr package. The vignette is very understandable, and full of useful examples to get you started.
• dplyr is a great package when working with data frame like objects (in memory and out of memory). It combines speed with a very intuitive syntax. To learn more on dplyr you can take this data manipulation course (subscription required) and check out this handy cheat sheet.
• When performing heavy data wrangling tasks, the data.table package should be your “go-to” package. It’s blazingly fast, and once you get the hang of its syntax you will find yourself using data.table all the time (see the sketch after this list). Check this data analysis course (subscription required) to discover the ins and outs of data.table, and use this cheat sheet as a reference.
• Chances are you will find yourself working with times and dates at some point. This can be a painful process, but luckily lubridate makes it a bit easier. Check its vignette to better understand how you can use lubridate in your day-to-day analysis.
• Base R has limited functionality for handling time series data. Fortunately, there are packages like zoo, xts and quantmod. Take this tutorial by Eric Zivot to better understand how to use these packages and how to work with time series data in R.
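
The sketch promised above: a one-line data.table aggregation plus a taste of lubridate, both on built-in or trivial data:

```r
library(data.table)

# data.table's DT[i, j, by] syntax: filter, compute, and group in one call
DT <- as.data.table(mtcars)
DT[hp > 100, .(mean_mpg = mean(mpg), n_cars = .N), by = cyl]

# lubridate makes date arithmetic painless
library(lubridate)
d <- ymd("2017-03-15")
d + months(2)   # "2017-05-15"
```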

If you want a general overview of data manipulation with R, you can read more in the book Data Manipulation with R or see the Data Wrangling with R video by RStudio. If you run into trouble handling your data frames, check these 15 easy solutions to your data frame problems.

## Data Visualization

One of the things that makes R such a great tool is its data visualization capabilities. For performing visualizations in R, ggplot2 is probably the most well-known package and a must-learn for beginners! You can find all the relevant information to get you started with ggplot2 on http://ggplot2.org/, and make sure to check out the cheatsheet and the upcoming book. Next to ggplot2, you also have packages such as ggvis for interactive web graphics (see the tutorial (subscription required)), googleVis to interface with Google Charts (learn to re-create this TED talk), Plotly for R, and many more. See the task view for some hidden gems, and if you have issues plotting your data this post might help you out.
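
As a small taste of ggplot2’s grammar, the following builds a layered scatterplot from the built-in mtcars data:

```r
library(ggplot2)

# Map data to aesthetics, then add layers
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")
```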

In R there is a whole task view dedicated to handling spatial data, with packages that allow you to create beautiful maps.

To get started, look at a package such as ggmap, which allows you to visualize spatial data and models on top of static maps from sources such as Google Maps and OpenStreetMap. Alternatively, you can start playing around with maptools, choroplethr, and the tmap package. If you need a great tutorial, take this Introduction to visualising spatial data in R.

You’ll often see that visualizations in R make use of magnificent color schemes that fit the graph or map like a glove. If you want to achieve this for your own visualizations, dig into the RColorBrewer package and ColorBrewer.

One of the latest visualization tools in R is HTML widgets. HTML widgets work just like R plots, but they create interactive web visualizations such as dynamic maps (leaflet), time-series charts (dygraphs), and interactive tables (DataTables). There are some very nice examples of HTML widgets in the wild, and solid documentation on how to create your own (not in a reading mood? just watch this video).
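
To see how little code an HTML widget takes, here is a minimal leaflet map (the coordinates are just an example):

```r
library(leaflet)

# An interactive, pannable map in three piped calls
leaflet() %>%
  addTiles() %>%                                             # OpenStreetMap tiles
  addMarkers(lng = -0.1278, lat = 51.5074, popup = "London")
```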

If you want to get some inspiration on what visualization to create next, you can have a look at blogs dedicated to visualizations such as FlowingData.

## Data Science & Machine Learning with R

There are many beginner resources on how to do data science with R. A list of available online courses:

Alternatively, if you prefer a good read:

Once you start doing some machine learning with R, you will quickly find yourself using packages such as caret, rpart and randomForest. Luckily, there are some great learning resources for these packages and for machine learning in general. If you are just getting started, this guide will get you going in no time. Alternatively, you can have a look at the books Mastering Machine Learning with R and Machine Learning with R. If you are looking for step-by-step tutorials that guide you through a real-life example, there is the Kaggle Machine Learning course, or you can have a look at Wiekvoet’s blog.
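
A first caret model can be this short (a sketch using the built-in iris data; the rpart package must be installed for this method):

```r
library(caret)

# Train a decision tree with 10-fold cross-validation
fit <- train(Species ~ ., data = iris, method = "rpart",
             trControl = trainControl(method = "cv", number = 10))
fit   # prints accuracy across the tuning grid
```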

## Reporting Results in R

R Markdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R. It is a great tool for reporting your data analysis in a reproducible manner, thereby making the analysis more useful and understandable. R Markdown is based on knitr and pandoc. With R Markdown, R generates a final document that replaces the R code with its results. This document can be in HTML, Word, PDF, ioslides, and other formats. You can even create interactive R Markdown documents using Shiny. This 4-hour tutorial on Reporting with R Markdown (subscription required) gets you going with R Markdown, and in addition you can use this nice cheat sheet for future reference.
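
A complete R Markdown document can be as small as the sketch below; save it as report.Rmd (a placeholder name) and render it with rmarkdown::render("report.Rmd"):

````
---
title: "Example Report"
output: html_document
---

## Results

```{r}
summary(mtcars$mpg)
plot(mtcars$wt, mtcars$mpg)
```
````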

Next to R Markdown, you should also make sure to check out Shiny. Shiny makes it incredibly easy to build interactive web applications with R. It allows you to turn your analysis into interactive web applications without needing to know HTML, CSS or JavaScript. RStudio maintains a great learning portal to get you started with Shiny, including this set of video tutorials (click on the essentials of the Shiny Learning Roadmap). More advanced topics are available, as well as a great set of examples.
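
The canonical minimal Shiny app fits in a dozen lines:

```r
library(shiny)

# UI: a slider controlling the number of observations
ui <- fluidPage(
  sliderInput("n", "Number of observations:", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# Server: re-draw the histogram whenever the slider moves
server <- function(input, output) {
  output$hist <- renderPlot(hist(rnorm(input$n)))
}

shinyApp(ui = ui, server = server)
```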

## Learning advanced R topics in (paid) online courses

### DataCamp

Other than its free courses, DataCamp also offers access to all of its advanced R courses for $25/month. These include:

### Udemy

Another company is Udemy. While they do not offer video + interactive sessions like DataCamp, they do offer extensive video lessons covering some other topics in using R and learning statistics. For readers of R-bloggers, Udemy is offering access to its courses for $15-$30 per course; use the code RBLOGGERS30 for an extra 30% discount. Here are some of their courses:

### Statistics.com

Statistics.com is an online learning website with 100+ courses in statistics, analytics, data mining, text mining, forecasting, social network analysis, spatial analysis, etc. They have kindly agreed to offer R-bloggers readers a reduced rate of $399 for any of their 23 courses in R, Python, SQL or SAS. These are high-impact courses, each 4 weeks long (normally costing up to $589). They feature hands-on exercises and projects and the opportunity to receive answers online from leading experts like Paul Murrell (member of the R core development team), Chris Brunsdon (co-developer of the GISTools package), Ben Baumer (former statistician for the NY Mets baseball team), and others. These instructors will answer all your questions (via a private discussion forum) over a 4-week period.

You may use the code “R-Blogger16” when registering. You can register for any R, Python, Hadoop, SQL or SAS course starting on any date. Here is a list of the R-related courses:

Using R as a statistical package

Building R programming skills – for those familiar with R, or experienced with other programming languages or statistical computing environments

Applying R to specific domains or applications

You may pick any of the R courses from their catalog page:
www.statistics.com/course-catalog/

## Next steps

Once you become more fluent in writing R syntax (and consequently addicted to R), you will want to unlock more of its power (read: do some really nifty stuff). In that case, make sure to check out Rcpp, an R package that makes it easier to integrate C++ code with R, or RevoScaleR (start the free tutorial).
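
To give a flavor of Rcpp, cppFunction() compiles a C++ snippet and exposes it to R in one step (this requires a working C++ toolchain, e.g. Rtools on Windows):

```r
library(Rcpp)

# Compile a small C++ function and call it like any R function
cppFunction("
  double sumC(NumericVector x) {
    double total = 0;
    for (int i = 0; i < x.size(); ++i) total += x[i];
    return total;
  }
")

sumC(c(1, 2, 3.5))  # 6.5
```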

After spending some time writing R code (and becoming an R addict), you’ll reach a point where you want to start writing your own R package. Hilary Parker from Etsy has written a short tutorial on how to create your first package, and if you’re really serious about it you should read R Packages, an upcoming book by Hadley Wickham that is already available for free on the web.
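
The basic devtools workflow for a new package is only a few calls (a sketch; "mypackage" is a placeholder name):

```r
library(devtools)

create("mypackage")     # scaffold a new package skeleton
# ... add your functions under mypackage/R/ with roxygen2 comments ...
document("mypackage")   # generate man pages from the roxygen2 comments
check("mypackage")      # run R CMD check before sharing
```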

If you want to start learning on the inner workings of R and improve your understanding of it, the best way to get you started is by reading Advanced R.

Finally, come visit us again at R-bloggers.com to read the latest news and tutorials from bloggers in the R community.

## Big Data Analytics Expert Course Enrollment – Korea Data Agency Big Data Academy

Course Objectives

To develop big data analytics experts who can create new value through big data analysis in their own line of work, building on knowledge of big data analysis planning, analysis methods, and the use of analysis tools.
This program is offered in two tracks: a statistics-centered big data analysis course (cohorts 18 and 20) and a big data analysis course based on large-scale data processing technology (cohorts 19 and 21). Please be sure to review the curriculum before applying.

Eligibility

<Applicants must meet all of the conditions below>
Industry employees who are carrying out, or are scheduled to carry out, a big data project
At least three years of experience analyzing data on markets, customers, products, etc.
(Cohorts 18, 20) Prior experience with basic statistics
(Cohorts 19, 21) Prior experience with programming

Notes

The Big Data Academy is a training program for working professionals; university and graduate students and the unemployed are not eligible.
Those who have previously completed a Big Data Academy program may not enroll again.
Those who cancel enrollment without a legitimate reason, or who fail to meet the completion criteria, may not enroll for two years, including the year in question.

Tuition: KRW 3,000,000

Large-company employees: 80% of tuition subsidized (self-payment of KRW 600,000)
SME employees, freelancers, etc.: 100% of tuition subsidized
Source: DBguide.net, the data professionals’ knowledge portal

## R Packages | Hadley Wickham

Published in April 2015, this book attracted attention because its author is Hadley Wickham.

Hadley Wickham is a true data scientist, respected by countless data analysts for developing R packages so important that it is hard to imagine R without him. He is currently Chief Scientist at RStudio and an adjunct professor of statistics at the University of Auckland, Stanford University, and Rice University. With the goal of building tools that make data science easier, faster, and more fun, he has made major contributions to R package development. His major R packages include the following.

DATA SCIENCE
– ggplot2 for data visualization.
– dplyr for data manipulation.
– tidyr for tidying data.
– stringr for working with strings.
– lubridate for working with date/times.

DATA IMPORT
– haven for SAS, SPSS, and Stata files.
– httr for talking to web APIs.
– rvest for scraping websites.
– xml2 for importing XML files.

SOFTWARE ENGINEERING
– devtools for general package development.
– roxygen2 for in-line documentation.
– testthat for unit testing.

## The Art of Data Science | Roger D. Peng and Elizabeth Matsui

Roger D. Peng is a professor of biostatistics at the Johns Hopkins Bloomberg School of Public Health. He is also well known as a co-founder and instructor of the Johns Hopkins Data Science Specialization on Coursera. In 2016 he received the American Public Health Association’s Mortimer Spiegelman Award, which honors a statistician who has made outstanding contributions to public health.

He recently published a book on bookdown.org, The Art of Data Science, which I am sharing here. He also publishes books that are helpful for learning R, free of charge, on online publishing sites such as leanpub.com, which has been a great help to many people doing data analysis.

Elizabeth Matsui is a professor of pediatrics, epidemiology, and environmental health sciences at Johns Hopkins University and a pediatric allergist/immunologist. Together with Dr. Peng she directs a data management and analysis center that supports epidemiological studies and clinical trials, and she is a co-founder of Skybrude Consulting, LLC, a data science consulting firm.

Before reading the book, have a look at the sample first to see what topics it covers.