6 Reasons to Learn R for Business

Data Science for Business (DS4B) is the future of business analytics, but it is still hard to figure out where to start. The last thing you want to do is waste time with the wrong tool. Using your time effectively comes down to two things: (1) choosing the right tool for the job, and (2) efficiently learning how to use that tool to return business value. This article focuses on the first part, explaining in six points why R is the right choice. The next article will focus on the second part: learning R in 12 weeks.

REASON 1: R HAS THE BEST OVERALL QUALITIES

There are a number of tools available for business analysis/business intelligence (with DS4B being a subset of this area). Each tool has its pros and cons, many of which are important in the business context. We can use these attributes to compare how each tool stacks up against the others! We did a qualitative assessment using several criteria:

• Business Capability (1 = Low, 10 = High)
• Ease of Learning (1 = Difficult, 10 = Easy)
• Cost (Free/Minimal, Low, High)
• Trend (0 = Fast Decline, 5 = Stable, 10 = Fast Growth)

Further discussion on the assessment is included in the Appendix at the end of the article.

What we saw was particularly interesting. A trendline developed, exposing a tradeoff between learning curve and DS4B capability rating: the most flexible tools are more difficult to learn but tend to have higher business capability. Conversely, the “easy-to-learn” tools are often not the best long-term tools for business or data science capability. Our opinion: go for capability over ease of use.

Of the top tools in capability, R has the best mix of desirable attributes including high data science for business capability, low cost, and it’s growing very fast. The only downside is the learning curve. The rest of the article explains why R is so great for business.

REASON 2: R IS DATA SCIENCE FOR NON-COMPUTER SCIENTISTS

If you are seeking high-performance data science tools, you really have two options: R or Python. When starting out, you should pick one; it’s a mistake to try to learn both. Your choice comes down to what’s right for you. The differences between R and Python have been described in numerous infographics and debates online, but the most overlooked factor is person-programming-language fit. Don’t understand what we mean? Let’s break it down.

Fact 1: Most people interested in learning data science for business are not computer scientists. They are business professionals, non-software engineers (e.g. mechanical, chemical), and other technical-to-business converts. This is important because of where each language excels.

Fact 2: Most activities in business and finance involve communication. This comes in the form of reports, dashboards, and interactive web applications that allow decision makers to recognize when things are not going well and to make well-informed decisions that improve the business.

Now that we recognize what’s important, let’s learn about the two major players in data science.

Python is a general-purpose programming language developed by software engineers that has solid libraries for math, statistics, and machine learning. Python has best-in-class tools for pure machine learning and deep learning, but lacks much of the infrastructure for subjects like econometrics, as well as communication tools such as reporting. Because of this, Python is well-suited for computer scientists and software engineers.

R is a statistical programming language developed by scientists that has open source libraries for statistics, machine learning, and data science. R lends itself well to business because of its depth of topic-specific packages and its communication infrastructure. R has packages covering a wide range of topics such as econometrics, finance, and time series. R has best-in-class tools for visualization, reporting, and interactivity, which are as important to business as they are to science. Because of this, R is well-suited for scientists, engineers and business professionals.

WHAT SHOULD YOU DO?

Don’t make the decision tougher than what it is. Think about where you are coming from:

• Are you a computer scientist or software engineer? If yes, choose Python.
• Are you an analytics professional or mechanical/industrial/chemical engineer looking to get into data science? If yes, choose R.

Think about what you are trying to do:

• Are you trying to build a self-driving car? If yes, choose Python.
• Are you trying to communicate business analytics throughout your organization? If yes, choose R.

REASON 3: LEARNING R IS EASY WITH THE TIDYVERSE

Learning R used to be a major challenge. Base R is a complex and inconsistent programming language, and structure and formality were not top priorities as they are in other languages. This all changed with the “tidyverse”, a set of packages and tools that share a consistently structured programming interface.

When tools such as dplyr and ggplot2 came to fruition, it made the learning curve much easier by providing a consistent and structured approach to working with data. As Hadley Wickham and many others continued to evolve R, the tidyverse came to be, which includes a series of commonly used packages for data manipulation, visualization, iteration, modeling, and communication. The end result is that R is now much easier to learn (we’ll show you in our next article!).

Source: tidyverse.org
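As a small taste of that consistency, here is a minimal sketch of our own (using the built-in mtcars dataset, not an excerpt from the original post) in which dplyr verbs and a ggplot2 layer chain together through one uniform interface:

library(tidyverse)

# Summarise average fuel economy by cylinder count, then plot it
mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  ggplot(aes(x = factor(cyl), y = avg_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Average MPG")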

R continues to evolve in a structured manner, with advanced packages that are built on top of the tidyverse infrastructure. A new focus is being placed on modeling and algorithms, which we are excited to see. Further, the tidyverse is being extended to cover topical areas such as text (tidytext) and finance (tidyquant). For newcomers, this should give you confidence in selecting this language. R has a bright future.

REASON 4: R HAS BRAINS, MUSCLE, AND HEART

Saying R is powerful is actually an understatement. From the business context, R is like Excel on steroids! But more important than just muscle is the combination of what R offers: brains, muscle, and heart.

R HAS BRAINS

R implements cutting-edge algorithms including:

• H2O (h2o) – High-end machine learning package
• Keras/TensorFlow (keras, tensorflow) – Go-to deep learning packages
• xgboost – Top Kaggle algorithm
• And many more!

These tools are used everywhere from AI products to Kaggle Competitions, and you can use them in your business analyses.

R HAS MUSCLE

R has powerful tools for:

• Vectorized Operations – R uses vectorized operations to make math computations lightning fast right out of the box (see the short sketch after this list)
• Loops (purrr)
• Parallelizing operations (parallel, future)
• Speeding up code using C++ (Rcpp)
• Connecting to other languages (rJava, reticulate)
• Working With Databases – Connecting to databases (dbplyr, odbc, bigrquery)
• Handling Big Data – Connecting to Apache Spark (sparklyr)
• And many more!
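To illustrate the first point, here is a minimal sketch of vectorization at work; the whole-vector expression replaces an explicit loop:

# One million random numbers
x <- runif(1e6)

# Element-wise math in a single vectorized call; no explicit loop needed
y <- sqrt(x) * 2 + 1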

R HAS HEART

We already talked about the infrastructure, the tidyverse, that enables the ecosystem of applications to be built using a consistent approach. It’s this infrastructure that brings life into your data analysis. The tidyverse enables:

• Data manipulation (dplyr, tidyr)
• Working with data types (stringr for strings, lubridate for date/datetime, forcats for categorical/factors)
• Visualization (ggplot2)
• Programming (purrr, tidyeval)
• Communication (rmarkdown, shiny)

REASON 5: R IS BUILT FOR BUSINESS

Two major advantages of R versus every other programming language are that it can produce business-ready reports and machine learning-powered web applications. Neither Python nor Tableau nor any other tool can currently do this as efficiently as R can. The two capabilities we refer to are rmarkdown for report generation and shiny for interactive web applications.

RMARKDOWN

Rmarkdown is a framework for creating reproducible reports that has since been extended to building blogs, presentations, websites, books, journals, and more. It’s the technology behind this blog, and it allows us to include the code with the text so that anyone can follow the analysis and see the output right alongside the explanation. What’s really cool is how far the technology has evolved.
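To give a flavor, here is a minimal, hypothetical R Markdown document (the title, chunk, and contents are illustrative, not from the original post); prose, code, and output live in one knittable file:

---
title: "Monthly Sales Report"
output: html_document
---

Sales were steady this month. The full analysis is reproducible: the code below runs every time the report is knit, so the numbers stay current.

```{r mpg-summary}
# This chunk executes when the document is rendered
summary(mtcars$mpg)
```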

SHINY

Source: shiny.rstudio.com

Shiny is a framework for creating interactive web applications that are powered by R. Shiny is a major consulting area for us as four of five assignments involve building a web application using shiny. It’s not only powerful, it enables non-data scientists to gain the benefit of data science via interactive decision making tools. Here’s an example of a Google Trend app built with shiny.
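For a sense of scale, a complete (if toy) shiny app fits in a couple dozen lines. This is a minimal sketch of our own, not the Google Trend app referenced above:

library(shiny)

ui <- fluidPage(
  titlePanel("Histogram Explorer"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

server <- function(input, output) {
  # Re-renders automatically whenever the slider moves
  output$hist <- renderPlot({
    hist(faithful$eruptions, breaks = input$bins,
         col = "cornflowerblue", main = "Old Faithful eruptions")
  })
}

shinyApp(ui, server)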

REASON 6: R COMMUNITY SUPPORT

Being a powerful language alone is not enough. To be successful, a language needs community support. We’ll hit on two ways that R excels in this respect: CRAN and the R Community.

CRAN: COMMUNITY-PROVIDED R PACKAGES

CRAN is like the Apple App store, except everything is free, super useful, and built for R. With over 14,000 packages, it has most everything you can possibly want from machine learning to high-performance computing to finance and econometrics! The task views cover specific areas and are one way to explore R’s offerings. CRAN is community-driven, with top open source authors such as Hadley Wickham and Dirk Eddelbuettel leading the way. Package development is a great way to contribute to the community especially for those looking to showcase their coding skills and give back!
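Installing from CRAN is one line, and the ctv package can install every package in a task view at once (the “Finance” view is used here as an illustration):

# Install a single package from CRAN
install.packages("tidyquant")

# Install all packages listed in a CRAN Task View
install.packages("ctv")
ctv::install.views("Finance")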

COMMUNITY SUPPORT

You begin with R because of its capability, you stay with R because of its community. The R Community is the coolest part. It’s tight-knit, opinionated, fun, silly, and highly knowledgeable… all of the things you want in a high performing team.

SOCIAL/WEB

R users can be found all over the web in a number of popular hangouts.

CONFERENCES

R-focused business conferences are gaining traction in a big way. Here are a few that we attend and/or will be attending in the future:

• EARL – Mango Solution’s conference on enterprise and business applications of R
• R/Finance – Community-hosted conference on financial asset and portfolio analytics and applied finance
• Rstudio Conf – Rstudio’s technology conference
• New York R – Business and technology-focused R conference

MEETUPS

A really cool thing about R is that many major cities have a meetup nearby. Meetups are exactly what you think: a group of R users getting together to talk R. They are usually funded by the R Consortium. You can get a full list of meetups here.

CONCLUSION

R has a wide range of benefits, making it our obvious choice for Data Science for Business (DS4B). That’s not to say that Python isn’t a good choice as well, but for the wide range of business needs, there’s nothing that compares to R. In this article we saw why R is a great choice. In the next article we’ll show you how to learn R in 12 weeks.

Business Science specializes in “ROI-driven data science”. Our focus is machine learning and data science in business and financial applications. We build web applications and automated reports to put machine learning in the hands of decision makers. Visit Business Science or contact us to learn more!

Interested in learning data science for business? Enroll in Business Science University. We’ll teach you how to apply data science and machine learning in real-world business applications. We take you through the entire process of modeling problems, creating interactive data products, and distributing solutions within an organization. We are launching courses in early 2018!

APPENDIX – DISCUSSION ON DS4B TOOL ASSESSMENT

Here’s some additional information on the tool assessment. We have provided the code used to make the visualization, the criteria explanation, and the tool assessment.

CRITERIA EXPLANATION

Our assessment of the most powerful DS4B tools was based on four criteria:

• Business Capability (1 = Low, 10 = High): How well-suited is the tool for use in business? Does it include the features the business needs, such as advanced analytics, interactivity, communication, and web apps?
• Ease of Learning (1 = Difficult, 10 = Easy): How easy is it to pick up? Can you learn it in a week of short courses or will it take a longer time horizon to become proficient?
• Cost (Free/Minimal, Low, High): Cost has two undesirable effects. From a first-order perspective, the organization has to spend money. This is not in-and-of-itself undesirable, because software companies can theoretically spend on R&D and other efforts to advance the product. The second-order effect of lowering adoption is much more concerning: high-cost tools tend to generate far less discussion online, whereas open-source or low-cost tools show strong growth trends.
• Trend (0 = Fast Decline, 5 = Stable, 10 = Fast Growth): We used Stack Overflow Insights on questions as a proxy for the trend of usage over time. A major assumption is that when the number of Stack Overflow questions for a tool is growing, its usage is growing along a similar trend.

Source: Stack Overflow Trends

INDIVIDUAL TOOL ASSESSMENT

R:

• DS4B Capability = 10: Has it all. Great data science capability, great visualization libraries, Shiny for interactive web apps, rmarkdown for professional reporting.
• Learning Curve = 4: A lot to learn, but learning is getting easier with the tidyverse.
• Trend = 10: Stack overflow questions are growing at a very fast pace.
• Cost = Low: Free and open source

PYTHON:

• DS4B Capability = 7: Has great machine learning and deep learning libraries. Can connect to any major database. Communication is limited by flask / Django web applications, which can be difficult to build. Does not have a business reporting infrastructure comparable to rmarkdown.
• Learning Curve = 4: A lot to learn, but learning is relatively easy compared to other object oriented programming languages like Java.
• Trend = 10: Stack overflow questions are growing at a very fast pace.
• Cost = Low: Free and open source

EXCEL:

• DS4B Capability = 4: Mainly spreadsheet software, but has programming built in with VBA. Integrating R is difficult, but possible. No data science libraries.
• Learning Curve = 10: Relatively easy to become an advanced user.
• Trend = 7: Stack overflow questions are growing at a relatively fast pace.
• Cost = Low: Comes with Microsoft Office, which most organizations use.

TABLEAU:

• DS4B Capability = 6: Has R integrated, but is very difficult to implement advanced algorithms and not as flexible as R+shiny.
• Learning Curve = 7: Very easy to pick up.
• Trend = 6: Stack overflow questions are growing at a relatively fast pace.
• Cost = Low: Free public version. Enterprise licenses are relatively affordable.

POWERBI:

• DS4B Capability = 5: Similar to Tableau, but not quite as feature-rich. Can integrate R to some extent.
• Learning Curve = 8: Very easy to pick up.
• Trend = 6: Expected to have same trend as Tableau.
• Cost = Low: Free public version. Licenses are very affordable.

MATLAB:

• DS4B Capability = 6: Can do a lot with it, but lacks the infrastructure to use for business.
• Learning Curve = 2: Matlab is quite difficult to learn.
• Trend = 1: Stack overflow growth is declining at a rapid pace.
• Cost = High: Matlab licenses are very expensive. Licensing structure does not scale well.

SAS:

• DS4B Capability = 8: Has data science, database connection, business reporting and visualization capabilities. Can also build applications. However, limited by closed-source nature. Does not get latest technologies like tensorflow and H2O.
• Learning Curve = 4: Similar to most data science programming languages for the tough stuff. Has a GUI for the easy stuff.
• Trend = 3: Stack Overflow growth is declining.
• Cost = High: Expensive for licenses. Licensing structure does not scale well.

CODE FOR THE DS4B TOOL ASSESSMENT VISUALIZATION
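The code from the original post was not preserved in this copy. Below is a minimal sketch of our own showing how such a chart could be rebuilt with ggplot2, using the scores transcribed from the individual tool assessments above:

library(ggplot2)

# Scores transcribed from the individual tool assessments above
tools <- data.frame(
  tool       = c("R", "Python", "Excel", "Tableau", "PowerBI", "Matlab", "SAS"),
  capability = c(10, 7, 4, 6, 5, 6, 8),
  learning   = c(4, 4, 10, 7, 8, 2, 4),
  trend      = c(10, 10, 7, 6, 6, 1, 3)
)

# DS4B capability vs. ease of learning, with point size encoding the trend
ggplot(tools, aes(x = learning, y = capability, size = trend, label = tool)) +
  geom_point(alpha = 0.5, color = "cornflowerblue") +
  geom_text(size = 3, vjust = -1.5) +
  labs(x = "Ease of Learning", y = "Business Capability",
       title = "DS4B Tool Assessment") +
  theme_minimal()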


Pipes in R Tutorial For Beginners (article) by DataCamp

Learn more about the famous pipe operator %>% and other pipes in R, why and how you should use them and what alternatives you can consider!

You might have already seen or used the pipe operator when you’re working with packages such as dplyr or magrittr. But do you know where pipes and the famous %>% operator come from, what they exactly are, or how, when and why you should use them? Can you also come up with some alternatives?

This tutorial will give you an introduction to pipes in R.

Are you interested in learning more about manipulating data in R with dplyr? Take a look at DataCamp’s Data Manipulation in R with dplyr course.

Pipe Operator in R: Introduction

To understand what the pipe operator in R is and what you can do with it, it’s necessary to consider the full picture, to learn the history behind it. Questions such as “where does this weird combination of symbols come from and why was it made like this?” might be on top of your mind. You’ll discover the answers to these and more questions in this section.

Now, you can look at the history from three perspectives: from a mathematical point of view, from a holistic point of view of programming languages, and from the point of view of the R language itself. You’ll cover all three in what follows!

History of the Pipe Operator in R

Mathematical History

If you have two functions, let’s say f: B → C and g: A → B, you can chain these functions together by taking the output of one function and inserting it into the next. In short, “chaining” means that you pass an intermediate result onto the next function, but you’ll see more about that later.

For example, you can write f(g(x)): g(x) serves as the input for f(), while x, of course, serves as the input to g().

If you would want to note this down, you would use the notation f ∘ g, which reads as “f follows g”. Alternatively, you can visually represent this as:

Image Credit: James Balamuta, “Piping Data”

Pipe Operators in Other Programming Languages

As mentioned in the introduction to this section, this operator is not new in programming: in the Shell or Terminal, you can pass command from one to the next with the pipeline character |. Similarly, F# has a forward pipe operator, which will prove to be important later on! Lastly, it’s also good to know that Haskell contains many piping operations that are derived from the Shell or Terminal.

Pipes in R

Now that you have seen some history of the pipe operator in other programming languages, it’s time to focus on R. The history of this operator in R starts, according to this fantastic blog post written by Adolfo Álvarez, on January 17th, 2012, when an anonymous user asked the following question in this Stack Overflow post:

How can you implement F#’s forward pipe operator in R? The operator makes it possible to easily chain a sequence of calculations. For example, when you have an input data and want to call functions foo and bar in sequence, you can write data |> foo |> bar?

The answer came from Ben Bolker, professor at McMaster University, who replied:

I don’t know how well it would hold up to any real use, but this seems (?) to do what you want, at least for single-argument functions …

"%>%" <- function(x,f) do.call(f,list(x))
pi %>% sin
[1] 1.224606e-16
pi %>% sin %>% cos
[1] 1
cos(sin(pi))
[1] 1


About nine months later, Hadley Wickham started the dplyr package on GitHub. You might now know Hadley, Chief Scientist at RStudio, as the author of many popular R packages (such as this last package!) and as the instructor for DataCamp’s Writing Functions in R course.

Be that as it may, it wasn’t until 2013 that the first pipe, %.%, appeared in this package. As Adolfo Álvarez rightfully mentions in his blog post, the function was named chain(), and its purpose was to simplify the notation for applying several functions to a single data frame in R.

The %.% pipe would not be around for long, as Stefan Bache proposed an alternative on the 29th of December 2013, that included the operator as you might now know it:

iris %>%
subset(Sepal.Length > 5) %>%
aggregate(. ~ Species, ., mean)


Bache continued to work with this pipe operation and at the end of 2013, the magrittr package came into being. In the meantime, Hadley Wickham continued to work on dplyr and in April 2014, the %.% operator got replaced with the one that you now know, %>%.

Later that year, Kun Ren published the pipeR package on GitHub, which incorporated a different pipe operator, %>>%, which was designed to add more flexibility to the piping process. However, it’s safe to say that the %>% is now established in the R language, especially with the recent popularity of the Tidyverse.

What Is It?

Knowing the history is one thing, but that still doesn’t give you an idea of what F#’s forward pipe operator is nor what it actually does in R.

In F#, the pipe-forward operator |> is syntactic sugar for chained method calls. Or, stated more simply, it lets you pass an intermediate result onto the next function.

Remember that “chaining” means that you invoke multiple method calls. As each method returns an object, you can actually allow the calls to be chained together in a single statement, without needing variables to store the intermediate results.

In R, the pipe operator is, as you have already seen, %>%. If you’re not familiar with F#, you can think of this operator as being similar to the + in a ggplot2 statement. Its function is very similar to that of the F# operator: it takes the output of one statement and makes it the input of the next statement. When describing it, you can think of it as a “THEN”.

Take, for example, following code chunk and read it aloud:

iris %>%
subset(Sepal.Length > 5) %>%
aggregate(. ~ Species, ., mean)


You’re right, the code chunk above will translate to something like “you take the Iris data, then you subset the data and then you aggregate the data”.

This is one of the most powerful things about the Tidyverse. In fact, having a standardized chain of processing actions is called “a pipeline”. Making pipelines for a data format is great, because you can apply that pipeline to incoming data that has the same formatting and have it output in, for example, a ggplot2-friendly format.

Why Use It?

R is a functional language, which means that your code often contains a lot of parentheses, ( and ). When you have complex code, this often means that you will have to nest those parentheses together, which makes your R code hard to read and understand. Here’s where %>% comes to the rescue!

Take a look at the following example, which is a typical example of nested code:

# Initialize x
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

# Compute the logarithm of x, return suitably lagged and iterated differences,
# compute the exponential function and round the result
round(exp(diff(log(x))), 1)

[1]  3.3  1.8  1.6  0.5  0.3  0.1 48.8  1.1

With the help of %>%, you can rewrite the above code as follows:

# Import magrittr
library(magrittr)

# Perform the same computations on x as above
x %>% log() %>%
diff() %>%
exp() %>%
round(1)


Note that you need to import the magrittr library to get the above code to work. That’s because the pipe operator is, as you read above, part of the magrittr library and is, since 2014, also a part of dplyr. If you forget to import the library, you’ll get an error like Error in eval(expr, envir, enclos): could not find function "%>%".

Also note that it isn’t a formal requirement to add the parentheses after log, diff and exp, but, within the R community, some will use them to increase the readability of the code.

In short, here are four reasons why you should be using pipes in R:

• You’ll structure the sequence of your data operations from left to right, as opposed to from the inside out;
• You’ll avoid nested function calls;
• You’ll minimize the need for local variables and function definitions; And
• You’ll make it easy to add steps anywhere in the sequence of operations.

These reasons are taken from the magrittr documentation itself. Implicitly, you see the arguments of readability and flexibility returning.

Even though %>% is the (main) pipe operator of the magrittr package, there are a couple of other operators that you should know and that are part of the same package:

• The compound assignment operator %<>%;
# Initialize x
x <- rnorm(100)

# Update value of x and assign it to x
x %<>% abs %>% sort

• The tee operator %T>%;
rnorm(200) %>%
matrix(ncol = 2) %T>%
plot %>%
colSums


Note that it’s good to know for now that the above code chunk is actually a shortcut for:

rnorm(200) %>%
matrix(ncol = 2) %T>%
{ plot(.); . } %>%
colSums


But you’ll see more about that later on!

• The exposition pipe operator %$%:
data.frame(z = rnorm(100)) %$%
ts.plot(z)


Of course, these three operators work slightly differently than the main %>% operator. You’ll see more about their functionalities and their usage later on in this tutorial!

Note that, even though you’ll most often see the magrittr pipes, you might also encounter other pipes as you go along! Some examples are wrapr’s dot arrow pipe %.>% or its “to dot” pipe %>.%, or the Bizarro pipe ->.;.

How to Use Pipes in R

Now that you know how the %>% operator originated, what it actually is and why you should use it, it’s time for you to discover how you can actually use it to your advantage. You will see that there are quite some ways in which you can use it!

Basic Piping

Before you go into the more advanced usages of the operator, it’s good to first take a look at the most basic examples that use the operator. In essence, you’ll see that there are 3 rules that you can follow when you’re first starting out:

• f(x) can be rewritten as x %>% f

In short, this means that functions that take one argument, function(argument), can be rewritten as follows: argument %>% function(). Take a look at the following, more practical example to understand how these two are equivalent:

# Compute the logarithm of x
log(x)

# Compute the logarithm of x
x %>% log()

• f(x, y) can be rewritten as x %>% f(y)

Of course, there are a lot of functions that don’t just take one argument, but multiple. This is the case here: you see that the function takes two arguments, x and y. Similar to what you have seen in the first example, you can rewrite the function by following the structure argument1 %>% function(argument2), where argument1 is piped into the function as its first argument and argument2 remains an ordinary argument of the call.

This all seems quite theoretical. Let’s take a look at a more practical example:

# Round pi
round(pi, 6)

# Round pi
pi %>% round(6)

• x %>% f %>% g %>% h can be rewritten as h(g(f(x)))

This might seem complex, but it isn’t quite like that when you look at a real-life R example:

# Import babynames data
library(babynames)
# Import dplyr library
library(dplyr)

data(babynames)

# Count how many young boys with the name "Taylor" are born
sum(select(filter(babynames,sex=="M",name=="Taylor"),n))

# Do the same but now with %>%
babynames %>%
  filter(sex == "M", name == "Taylor") %>%
  select(n) %>%
  sum()


Note how you work from the inside out when you rewrite the nested code: you first put in the babynames, then you use %>% to first filter() the data. After that, you’ll select n and lastly, you’ll sum() everything.

Remember also that you already saw another example of such nested code converted to more readable code at the beginning of this tutorial, where you used the log(), diff(), exp() and round() functions to perform calculations on x.

Functions that Use the Current Environment

Unfortunately, there are some exceptions to the more general rules that were outlined in the previous section. Let’s take a look at some of them here.

Consider this example, where you use the assign() function to assign the value 10 to the variable x.

# Assign 10 to x
assign("x", 10)

# Assign 100 to x
"x" %>% assign(100)

# Return x
x


10

You see that the second call with the assign() function, in combination with the pipe, doesn’t work properly. The value of x is not updated.

Why is this?

That’s because the function assigns the new value 100 to a temporary environment used by %>%. So, if you want to use assign() with the pipe, you must be explicit about the environment:

# Define your environment
env <- environment()

# Add the environment to assign()
"x" %>% assign(100, envir = env)

# Return x
x


100

Functions with Lazy Evaluation

In R, arguments within functions are only computed when the function uses them. This means that no arguments are computed before you call your function! It also means that the pipe computes each element of the function in turn.

One place that this is a problem is tryCatch(), which lets you capture and handle errors, like in this example:

tryCatch(stop("!"), error = function(e) "An error")

stop("!") %>%
tryCatch(error = function(e) "An error")


‘An error’

Error in eval(expr, envir, enclos): !
Traceback:

1. stop("!") %>% tryCatch(error = function(e) "An error")

2. eval(lhs, parent, parent)

3. eval(expr, envir, enclos)

4. stop("!")


You’ll see that the nested way of writing down this line of code works perfectly, while the piped alternative returns an error. Other functions with the same behavior are try(), suppressMessages(), and suppressWarnings() in base R.

Argument Placeholder

There are also instances where you can use the pipe operator as an argument placeholder. Take a look at the following examples:

• f(x, y) can be rewritten as y %>% f(x, .)

In some cases, you won’t want the piped value to land in the first argument position of the function call, which has been the case in every example that you have seen up until now. Reconsider this line of code:

pi %>% round(6)


If you rewrite this line of code, pi is the first argument in your round() function. But what if you want the piped value to replace the second, third, … argument of your function call instead?

Take a look at this example, where the value is actually at the third position in the function call:

"Ceci n'est pas une pipe" %>% gsub("une", "un", .)


‘Ceci n\’est pas un pipe’

• f(y, z = x) can be rewritten as x %>% f(y, z = .)

Likewise, you might want to make the value of a specific argument within your function call the magrittr placeholder. Consider the following line of code:

6 %>% round(pi, digits=.)


Re-using the Placeholder for Attributes

It is straightforward to use the placeholder several times in a right-hand-side expression. However, when the placeholder only appears in a nested expression, magrittr will still apply the first-argument rule. The reason is that, in most cases, this results in cleaner code.

Here are some general “rules” that you can take into account when you’re working with argument placeholders in nested function calls:

• f(x, y = nrow(x), z = ncol(x)) can be rewritten as x %>% f(y = nrow(.), z = ncol(.))
# Initialize a matrix ma
ma <- matrix(1:12, 3, 4)

# Return the maximum of the values inputted
max(ma, nrow(ma), ncol(ma))

# Return the maximum of the values inputted
ma %>% max(nrow(ma), ncol(ma))


12

12

The behavior can be overruled by enclosing the right-hand side in braces:

• f(y = nrow(x), z = ncol(x)) can be rewritten as x %>% {f(y = nrow(.), z = ncol(.))}
# Only return the maximum of the nrow(ma) and ncol(ma) input values
ma %>% {max(nrow(ma), ncol(ma))}


4

To conclude, also take a look at the following example, where you could possibly want to adjust the workings of the argument placeholder in the nested function call:

# The function that you want to rewrite
paste(1:5, letters[1:5])

# The nested function call with dot placeholder
1:5 %>%
paste(., letters[.])

1. ‘1 a’
2. ‘2 b’
3. ‘3 c’
4. ‘4 d’
5. ‘5 e’
1. ‘1 a’
2. ‘2 b’
3. ‘3 c’
4. ‘4 d’
5. ‘5 e’

You see that if the placeholder is only used in a nested function call, the magrittr placeholder will also be placed as the first argument! If you want to avoid this from happening, you can use the curly brackets { and }:

# The nested function call with dot placeholder and curly brackets
1:5 %>% {
paste(letters[.])
}

# Rewrite the above function call
paste(letters[1:5])

1. ‘a’
2. ‘b’
3. ‘c’
4. ‘d’
5. ‘e’
1. ‘a’
2. ‘b’
3. ‘c’
4. ‘d’
5. ‘e’

Building Unary Functions

Unary functions are functions that take one argument. Any pipeline that consists of a dot (.) followed by functions chained together with %>% can be saved and applied to values later. Take a look at the following example of such a pipeline:

. %>% cos %>% sin


This pipeline would take some input, after which both the cos() and sin() functions would be applied to it.

But you’re not there yet! If you want this pipeline to do exactly that which you have just read, you need to assign it first to a variable f, for example. After that, you can re-use it later to do the operations that are contained within the pipeline on other values.

# Unary function
f <- . %>% cos %>% sin

f

structure(function (value)
  freduce(value, `_function_list`), class = c("fseq", "function"))

Remember also that you could put parentheses after the cos() and sin() functions in the line of code if you want to improve readability. Consider the same example with parentheses: . %>% cos() %>% sin().

You see, building functions in magrittr is very similar to building functions with base R! If you’re not sure how similar they actually are, check out the line above and compare it with the next line of code; both lines have the same result!

# is equivalent to
f <- function(.) sin(cos(.))

f

function (.)
sin(cos(.))

Compound Assignment Pipe Operations

There are situations where you want to overwrite the value of the left-hand side, just like in the example right below. Intuitively, you will use the assignment operator <- to do this.

# Load in the Iris data
data(iris)

# Add column names to the Iris data
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")

# Compute the square root of iris$Sepal.Length and assign it to the variable
iris$Sepal.Length <-
  iris$Sepal.Length %>%
  sqrt()

However, there is a compound assignment pipe operator, which allows you to use a shorthand notation to assign the result of your pipeline immediately to the left-hand side:

# Compute the square root of iris$Sepal.Length and assign it to the variable
iris$Sepal.Length %<>% sqrt

# Return Sepal.Length
iris$Sepal.Length


Note that the compound assignment operator %<>% needs to be the first pipe operator in the chain for this to work. This is completely in line with what you just read about the operator being a shorthand notation for a longer notation with repetition, where you use the regular <- assignment operator.

As a result, this operator will assign a result of a pipeline rather than returning it.

Tee Operations with The Tee Operator

The tee operator works exactly like %>%, but it returns the left-hand side value rather than the potential result of the right-hand side operations.

This means that the tee operator can come in handy in situations where you have included functions that are used for their side effect, such as plotting with plot() or printing to a file.

In other words, functions like plot() typically don’t return anything. That means that, after calling plot(), for example, your pipeline would end. However, in the following example, the tee operator %T>% allows you to continue your pipeline even after you have used plot():

set.seed(123)
rnorm(200) %>%
matrix(ncol = 2) %T>%
plot %>%
colSums


Exposing Data Variables with the Exposition Operator

When you’re working with R, you’ll find that many functions take a data argument. Consider, for example, the lm() function or the with() function. These functions are useful in a pipeline where your data is first processed and then passed into the function.
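For example (a minimal sketch of our own with the built-in mtcars data), the . placeholder sends the processed data into lm()’s data argument:

library(dplyr)

# Filter first, then fit a model on the result via the data argument
mtcars %>%
  filter(cyl == 4) %>%
  lm(mpg ~ wt, data = .)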

For functions that don’t have a data argument, such as the cor() function, it’s still handy if you can expose the variables in the data. That’s where the %$% operator comes in. Consider the following example:

iris %>%
  subset(Sepal.Length > mean(Sepal.Length)) %$%
  cor(Sepal.Length, Sepal.Width)

0.336696922252551

With the help of %$% you make sure that Sepal.Length and Sepal.Width are exposed to cor(). Likewise, you see that the data built with the data.frame() function is passed to ts.plot() to plot several time series on a common plot:

data.frame(z = rnorm(100)) %$%
  ts.plot(z)


dplyr and magrittr

In the introduction to this tutorial, you already learned that the development of dplyr and magrittr occurred around the same time, namely, around 2013-2014. And, as you have read, the magrittr package is also part of the Tidyverse.

In this section, you will discover how exciting it can be when you combine both packages in your R code.

For those of you who are new to the dplyr package, you should know that this R package was built around five verbs, namely, “select”, “filter”, “arrange”, “mutate” and “summarize”. If you have already manipulated data for some data science project, you will know that these verbs make up the majority of the data manipulation tasks that you generally need to perform on your data.

Take an example of some traditional code that makes use of these dplyr functions:

library(hflights)

grouped_flights <- group_by(hflights, Year, Month, DayofMonth)
flights_data <- select(grouped_flights, Year:DayofMonth, ArrDelay, DepDelay)
summarized_flights <- summarise(flights_data,
arr = mean(ArrDelay, na.rm = TRUE),
dep = mean(DepDelay, na.rm = TRUE))
final_result <- filter(summarized_flights, arr > 30 | dep > 30)

final_result

Year Month DayofMonth arr dep
2011 2 4 44.08088 47.17216
2011 3 3 35.12898 38.20064
2011 3 14 46.63830 36.13657
2011 4 4 38.71651 27.94915
2011 4 25 37.79845 22.25574
2011 5 12 69.52046 64.52039
2011 5 20 37.02857 26.55090
2011 6 22 65.51852 62.30979
2011 7 29 29.55755 31.86944
2011 9 29 39.19649 32.49528
2011 10 9 61.90172 59.52586
2011 11 15 43.68134 39.23333
2011 12 29 26.30096 30.78855
2011 12 31 46.48465 54.17137

When you look at this example, you immediately understand why dplyr and magrittr are able to work so well together:

hflights %>%
group_by(Year, Month, DayofMonth) %>%
select(Year:DayofMonth, ArrDelay, DepDelay) %>%
summarise(arr = mean(ArrDelay, na.rm = TRUE), dep = mean(DepDelay, na.rm = TRUE)) %>%
filter(arr > 30 | dep > 30)


Both code chunks are fairly long, but you could argue that the second code chunk is clearer if you want to follow along through all of the operations. With the creation of intermediate variables in the first code chunk, you could possibly lose the “flow” of the code. By using %>%, you gain a clearer overview of the operations that are being performed on the data!

In short, dplyr and magrittr are your dreamteam for manipulating data in R!

RStudio Keyboard Shortcuts for Pipes

Adding all these pipes to your R code can be a challenging task! To make your life easier, John Mount, co-founder and Principal Consultant at Win-Vector, LLC and DataCamp instructor, has released a package with some RStudio add-ins that allow you to create keyboard shortcuts for pipes in R. Addins are actually R functions with a bit of special registration metadata. An example of a simple addin can, for example, be a function that inserts a commonly used snippet of text, but can also get very complex!

With these addins, you’ll be able to execute R functions interactively from within the RStudio IDE, either by using keyboard shortcuts or by going through the Addins menu.

Note that this package is actually a fork from RStudio’s original add-in package, which you can find here. Be careful though, the support for addins is available only within the most recent release of RStudio! If you want to know more on how you can install these RStudio addins, check out this page.

When Not To Use the Pipe Operator in R

In the above, you have seen that pipes are definitely something that you should be using when you’re programming with R. More specifically, you have seen this by covering some cases in which pipes prove to be very useful! However, there are some situations, outlined by Hadley Wickham in “R for Data Science”, in which it’s best to avoid them:

• Your pipes are longer than (say) ten steps.

In cases like these, it’s better to create intermediate objects with meaningful names. It will not only be easier for you to debug your code, but you’ll also understand your code better and it’ll be easier for others to understand your code.
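For instance, a long pipeline can be split at natural checkpoints (a sketch of our own using dplyr and the built-in mtcars data; the variable names are illustrative):

library(dplyr)

# Instead of one long pipe, save meaningful intermediate results
cyl_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg), avg_hp = mean(hp))

# Each step can now be inspected and debugged on its own
efficient_cyl <- cyl_summary %>%
  filter(avg_mpg > 20)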

• You have multiple inputs or outputs.

If you aren’t transforming one primary object, but two or more objects are combined together, it’s better not to use the pipe.

• You are starting to think about a directed graph with a complex dependency structure.

Pipes are fundamentally linear and expressing complex relationships with them will only result in complex code that will be hard to read and understand.

• You’re doing internal package development

Using pipes in internal package development is a no-go, as it makes it harder to debug!

For more reflections on this topic, check out this Stack Overflow discussion. Other situations that appear in that discussion are loops, package dependencies, argument order and readability.

In short, you could summarize it all as follows: keep the two things in mind that make this construct so great, namely, readability and flexibility. As soon as one of these two big advantages is compromised, you might consider some alternatives in favor of the pipes.

Alternatives to Pipes in R

After all that you have read, you might also be interested in some alternatives that exist in the R programming language. Some of the solutions that you have seen in this tutorial were the following:

• Create intermediate variables with meaningful names;

Instead of chaining all operations together and outputting one single result, break up the chain and make sure you save intermediate results in separate variables. Be careful with the naming of these variables: the goal should always be to make your code as understandable as possible!

• Nest your code so that you read it from the inside out;

One of the possible objections that you could have against pipes is the fact that it goes against the “flow” that you have been accustomed to with base R. The solution is then to stick with nesting your code! But what to do then if you don’t like pipes but you also think nesting can be quite confusing? The solution here can be to use tabs to highlight the hierarchy.

• … Do you have more suggestions? Make sure to let me know – Drop me a tweet @willems_karlijn

Conclusion

You have covered a lot of ground in this tutorial: you have seen where %>% comes from, what it exactly is, why you should use it and how you should use it. You’ve seen that the dplyr and magrittr packages work wonderfully together and that there are even more operators out there! Lastly, you have also seen some cases in which you shouldn’t use it when you’re programming in R and what alternatives you can use in such cases.

If you’re interested in learning more about the Tidyverse, consider DataCamp’s Introduction to the Tidyverse course.

Five Tips to Improve Your R Code (article) by DataCamp

Five useful tips that you can use to effectively improve your R code, from using seq() to create sequences to ditching which() and much more!

@drsimonj here with five simple tricks I find myself sharing all the time with fellow R users to improve their code!

1. More fun to sequence from 1

Next time you use the colon operator to create a sequence from 1 like 1:n, try seq().

# Sequence a vector
x <- runif(10)
seq(x)
#>  [1]  1  2  3  4  5  6  7  8  9 10

# Sequence an integer
seq(nrow(mtcars))
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
#> [24] 24 25 26 27 28 29 30 31 32


The colon operator can produce unexpected results that can create all sorts of problems without you noticing! Take a look at what happens when you want to sequence the length of an empty vector:

# Empty vector
x <- c()

1:length(x)
#> [1] 1 0

seq(x)
#> integer(0)


You’ll also notice that this saves you from using functions like length(). When applied to an object of a certain length, seq() will automatically create a sequence from 1 to the length of the object.
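Relatedly, base R also provides seq_along() and seq_len(), which are purpose-built for these two cases and make the intent explicit:

x <- runif(10)

seq_along(x)           # 1 2 ... 10; returns integer(0) for an empty vector
seq_len(nrow(mtcars))  # 1 2 ... 32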

2. vector() what you c()

Next time you create an empty vector with c(), try to replace it with vector("type", length).

# A numeric vector with 5 elements
vector("numeric", 5)
#> [1] 0 0 0 0 0

# A character vector with 3 elements
vector("character", 3)
#> [1] "" "" ""


Doing this improves memory usage and increases speed! You often know upfront what type of values will go into a vector, and how long the vector will be. Using c() means R has to slowly work both of these things out. So help give it a boost with vector()!

A good example of this value is in a for loop. People often write loops by declaring an empty vector and growing it with c() like this:

x <- c()
for (i in seq(5)) {
x <- c(x, i)
}

#> x at step 1 : 1
#> x at step 2 : 1, 2
#> x at step 3 : 1, 2, 3
#> x at step 4 : 1, 2, 3, 4
#> x at step 5 : 1, 2, 3, 4, 5


Instead, pre-define the type and length with vector(), and reference positions by index, like this:

n <- 5
x <- vector("integer", n)
for (i in seq(n)) {
x[i] <- i
}

#> x at step 1 : 1, 0, 0, 0, 0
#> x at step 2 : 1, 2, 0, 0, 0
#> x at step 3 : 1, 2, 3, 0, 0
#> x at step 4 : 1, 2, 3, 4, 0
#> x at step 5 : 1, 2, 3, 4, 5


Here’s a quick speed comparison:

n <- 1e5

x_empty <- c()
system.time(for(i in seq(n)) x_empty <- c(x_empty, i))
#>    user  system elapsed
#>  15.238   2.327  17.650

x_zeros <- vector("integer", n)
system.time(for(i in seq(n)) x_zeros[i] <- i)
#>    user  system elapsed
#>   0.007   0.000   0.007


That should be convincing enough!

3. Ditch the which()

Next time you use which(), try to ditch it! People often use which() to get indices from some boolean condition, and then select values at those indices. This is not necessary.

Getting vector elements greater than 5:

x <- 3:7

# Using which (not necessary)
x[which(x > 5)]
#> [1] 6 7

# No which
x[x > 5]
#> [1] 6 7


Or counting number of values greater than 5:

# Using which
length(which(x > 5))
#> [1] 2

# Without which
sum(x > 5)
#> [1] 2


Why should you ditch which()? It’s often unnecessary and boolean vectors are all you need.

For example, R lets you select elements flagged as TRUE in a boolean vector:

condition <- x > 5
condition
#> [1] FALSE FALSE FALSE  TRUE  TRUE
x[condition]
#> [1] 6 7


Also, when combined with sum() or mean(), boolean vectors can be used to get the count or proportion of values meeting a condition:

sum(condition)
#> [1] 2
mean(condition)
#> [1] 0.4


which() tells you the indices of TRUE values:

which(condition)
#> [1] 4 5


And while the results are not wrong, it’s just not necessary. For example, I often see people combining which() and length() to test whether any or all values are TRUE. Instead, you just need any() or all():

x <- c(1, 2, 12)

# Using which() and length() to test if any values are greater than 10
if (length(which(x > 10)) > 0)
print("At least one value is greater than 10")
#> [1] "At least one value is greater than 10"

# Wrapping a boolean vector with any()
if (any(x > 10))
print("At least one value is greater than 10")
#> [1] "At least one value is greater than 10"

# Using which() and length() to test if all values are positive
if (length(which(x > 0)) == length(x))
print("All values are positive")
#> [1] "All values are positive"

# Wrapping a boolean vector with all()
if (all(x > 0))
print("All values are positive")
#> [1] "All values are positive"


Oh, and it saves you a little time…

x <- runif(1e8)

system.time(x[which(x > .5)])
#>    user  system elapsed
#>   1.156   0.522   1.686

system.time(x[x > .5])
#>    user  system elapsed
#>   1.071   0.442   1.662


4. factor that factor!

Ever removed values from a factor and found you’re stuck with old levels that don’t exist anymore? I see all sorts of creative ways to deal with this. The simplest solution is often just to wrap it in factor() again.

This example creates a factor with four levels ("a", "b", "c" and "d"):

# A factor with four levels
x <- factor(c("a", "b", "c", "d"))
x
#> [1] a b c d
#> Levels: a b c d

plot(x)


If you drop all cases of one level ("d"), the level is still recorded in the factor:

# Drop all values for one level
x <- x[x != "d"]

# But we still have this level!
x
#> [1] a b c
#> Levels: a b c d

plot(x)


A super simple method for removing it is to use factor() again:

x <- factor(x)
x
#> [1] a b c
#> Levels: a b c

plot(x)


This is typically a good solution to a problem that gets a lot of people mad. So save yourself a headache and factor that factor!
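Base R also has droplevels(), which does the same job and reads as exactly what it does:

x <- factor(c("a", "b", "c", "d"))
x <- x[x != "d"]

# Drop the unused "d" level
droplevels(x)
#> [1] a b c
#> Levels: a b c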

5. First you get the $, then you get the power

Next time you want to extract values from a data.frame column where the rows meet a condition, specify the column with $ before the rows with [.

Say you want the horsepower (hp) for cars with 4 cylinders (cyl), using the mtcars data set. You can write either of these:

# rows first, column second - not ideal
mtcars[mtcars$cyl == 4, ]$hp
#>  [1]  93  62  95  66  52  65  97  66  91 113 109

# column first, rows second - much better
mtcars$hp[mtcars$cyl == 4]
#>  [1]  93  62  95  66  52  65  97  66  91 113 109


The tip here is to use the second approach.

But why is that?

First reason: do away with that pesky comma! When you specify rows before the column, you need to remember the comma: mtcars[mtcars$cyl == 4,]$hp. When you specify column first, this means that you’re now referring to a vector, and don’t need the comma!

Second reason: speed! Let’s test it out on a larger data frame:

# Simulate a data frame...
n <- 1e7
d <- data.frame(
a = seq(n),
b = runif(n)
)

# rows first, column second - not ideal
system.time(d[d$b > .5, ]$a)
#>    user  system elapsed
#>   0.497   0.126   0.629

# column first, rows second - much better
system.time(d$a[d$b > .5])
#>    user  system elapsed
#>   0.089   0.017   0.107


Worth it, right?

Still, if you want to hone your skills as an R data frame ninja, I suggest learning dplyr. You can get a good overview on the dplyr website or really learn the ropes with online courses like DataCamp’s Data Manipulation in R with dplyr.

Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

Introduction to Skewness · R Views

In previous posts here, here, and here, we spent quite a bit of time on portfolio volatility, using the standard deviation of returns as a proxy for volatility. Today we will begin a two-part series on additional statistics that aid our understanding of return dispersion: skewness and kurtosis. Beyond being fancy words and required vocabulary for CFA Level 1, these two concepts are both important and fascinating for lovers of returns distributions. For today, we will focus on skewness.

Skewness is the degree to which returns are asymmetric around the mean. Since a normal distribution is symmetric around the mean, skewness can be taken as one measure of how returns are not distributed normally. Why does skewness matter? If portfolio returns are right, or positively, skewed, it implies numerous small negative returns and a few large positive returns. If portfolio returns are left, or negatively, skewed, it implies numerous small positive returns and a few large negative returns. The phrase “large negative returns” should trigger Pavlovian sweating for investors, even if it’s preceded by a diminutive modifier like “just a few”. For a portfolio manager, a negatively skewed distribution of returns implies a portfolio at risk of rare but large losses. This makes us nervous and is a bit like saying, “I’m healthy, except for my occasional massive heart attack.”

Let’s get to it.

First, have a look at one equation for skewness:

$$\text{Skew} = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{3/2}}$$
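To make the formula concrete, here is a tiny by-hand computation on a toy return series (the values are illustrative only, not portfolio data):

# Toy return series
x <- c(-0.02, 0.01, 0.03, -0.05, 0.02)
n <- length(x)

# Third central moment over variance^(3/2), both computed with 1/n
skew <- (sum((x - mean(x))^3) / n) / ((sum((x - mean(x))^2) / n)^(3/2))
skew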

Skew has important substantive implications for risk, and is also a concept that lends itself to data visualization. In fact, I find the visualizations of skewness more illuminating than the numbers themselves (though the numbers are what matter in the end). In this section, we will cover how to calculate skewness using xts and tidyverse methods, how to calculate rolling skewness, and how to create several data visualizations as pedagogical aids. We will be working with our usual portfolio consisting of:

+ SPY (S&P500 fund) weighted 25%
+ EFA (a non-US equities fund) weighted 25%
+ IJS (a small-cap value fund) weighted 20%
+ EEM (an emerging-mkts fund) weighted 20%
+ AGG (a bond fund) weighted 10%

Before we can calculate the skewness, we need to find portfolio monthly returns, which was covered in this post.

Building off that previous work, we will be working with two objects of portfolio returns:

+ portfolio_returns_xts_rebalanced_monthly (an xts of monthly returns)
+ portfolio_returns_tq_rebalanced_monthly (a tibble of monthly returns)

Let’s begin in the xts world and make use of the skewness() function from PerformanceAnalytics.

library(PerformanceAnalytics)
skew_xts <-  skewness(portfolio_returns_xts_rebalanced_monthly$returns) skew_xts ## [1] -0.1710568 Our portfolio is relatively balanced, and a slight negative skewness of -0.1710568 is unsurprising and unworrisome. However, that final number could be omitting important information and we will resist the temptation to stop there. For example, is that slight negative skew being caused by one very large negative monthly return? If so, what happened? Or is it caused by several medium-sized negative returns? What caused those? Were they consecutive? Are they seasonal? We need to investigate further. Before doing so and having fun with data visualization, let’s explore the tidyverse methods and confirm consistent results. We will make use of the same skewness() function, but because we are using a tibble, we use summarise() as well and call summarise(skew = skewness(returns). It’s not necessary, but we are also going to run this calculation by hand, the same as we have done with standard deviation. Feel free to delete the by-hand section from your code should this be ported to enterprise scripts, but keep in mind that there is a benefit to forcing ourselves and loved ones to write out equations: it emphasizes what those nice built-in functions are doing under the hood. If a client, customer or risk officer were ever to drill into our skewness calculations, it would be nice to have a super-firm grasp on the equation. library(tidyverse) library(tidyquant) skew_tidy <- portfolio_returns_tq_rebalanced_monthly %>% summarise(skew_builtin = skewness(returns), skew_byhand = (sum((returns - mean(returns))^3)/length(returns))/ ((sum((returns - mean(returns))^2)/length(returns)))^(3/2)) %>% select(skew_builtin, skew_byhand) Let’s confirm that we have consistent calculations. skew_xts ## [1] -0.1710568 skew_tidy$skew_builtin
## [1] -0.1710568
skew_tidy$skew_byhand ## [1] -0.1710568 The results are consistent using xts and our tidyverse, by-hand methods. Again, though, that singular number -0.1710568 does not fully illuminate the riskiness or distribution of this portfolio. To dig deeper, let’s first visualize the density of returns with stat_density from ggplot2. portfolio_density_plot <- portfolio_returns_tq_rebalanced_monthly %>% ggplot(aes(x = returns)) + stat_density(geom = "line", alpha = 1, colour = "cornflowerblue") portfolio_density_plot The slight negative skew is a bit more evident here. It would be nice to shade the area that falls below some threshold again, and let’s go with the mean return. To do that, let’s create an object called shaded_area using ggplot_build(portfolio_density_plot)$data[[1]] %>% filter(x < mean(portfolio_returns_tq_rebalanced_monthly$returns)). That snippet will take our original ggplot object and create a new object filtered for x values less than mean return. Then we use geom_area to add the shaded area to portfolio_density_plot. shaded_area_data <- ggplot_build(portfolio_density_plot)$data[[1]] %>%
filter(x < mean(portfolio_returns_tq_rebalanced_monthly$returns)) portfolio_density_plot_shaded <- portfolio_density_plot + geom_area(data = shaded_area_data, aes(x = x, y = y), fill="pink", alpha = 0.5) portfolio_density_plot_shaded The shaded area highlights the mass of returns that fall below the mean. Let’s add a vertical line at the mean and median, and some explanatory labels. This will help to emphasize that negative skew indicates a mean less than the median. First, create variables for mean and median so that we can add a vertical line. median <- median(portfolio_returns_tq_rebalanced_monthly$returns)
mean <- mean(portfolio_returns_tq_rebalanced_monthly$returns) We want the vertical lines to just touch the density plot so we once again use a call to ggplot_build(portfolio_density_plot)$data[[1]].

median_line_data <-
ggplot_build(portfolio_density_plot)$data[[1]] %>%
  filter(x <= median)

Now we can start adding aesthetics to the latest iteration of our graph, which is stored in the object portfolio_density_plot_shaded.

portfolio_density_plot_shaded +
  geom_segment(aes(x = 0, y = 1.9, xend = -.045, yend = 1.9),
               arrow = arrow(length = unit(0.5, "cm")), size = .05) +
  annotate(geom = "text", x = -.02, y = .1, label = "returns <= mean",
           fontface = "plain", alpha = .8, vjust = -1) +
  geom_segment(data = shaded_area_data,
               aes(x = mean, y = 0, xend = mean, yend = density),
               color = "red", linetype = "dotted") +
  annotate(geom = "text", x = mean, y = 5, label = "mean", color = "red",
           fontface = "plain", angle = 90, alpha = .8, vjust = -1.75) +
  geom_segment(data = median_line_data,
               aes(x = median, y = 0, xend = median, yend = density),
               color = "black", linetype = "dotted") +
  annotate(geom = "text", x = median, y = 5, label = "median",
           fontface = "plain", angle = 90, alpha = .8, vjust = 1.75) +
  ggtitle("Density Plot Illustrating Skewness")

We added quite a bit to the chart, possibly too much, but it's better to be over-inclusive now to test different variants. We can delete any of those features when using this chart later, or refer back to these lines of code should we ever want to reuse some of the aesthetics.

At this point, we have calculated the skewness of this portfolio throughout its history, and done so using three methods. We have also created an explanatory visualization. Similar to the portfolio standard deviation, though, our work is not complete until we look at rolling skewness. Perhaps the first two years of the portfolio were positively skewed and the last two were negatively skewed, while the overall skewness is slightly negative. We would like to understand how the skewness has changed over time, and in different economic and market regimes. To do so, we calculate and visualize the rolling skewness over time.

In the xts world, calculating rolling skewness is almost identical to calculating rolling standard deviation, except we call the skewness() function instead of StdDev(). Since this is a rolling calculation, we need a window of time for each skewness; here, we will use a six-month window.

window <- 6

rolling_skew_xts <-
  na.omit(rollapply(portfolio_returns_xts_rebalanced_monthly,
                    window,
                    function(x) skewness(x)))

Now we pop that xts object into highcharter for a visualization. Let's make sure our y-axis range is large enough to capture the nature of the rolling skewness fluctuations by setting the range to between 3 and -3 with hc_yAxis(..., max = 3, min = -3). I find that if we keep the range from 1 to -1, it makes most rolling skews look like a roller coaster.

library(highcharter)

highchart(type = "stock") %>%
  hc_title(text = "Rolling") %>%
  hc_add_series(rolling_skew_xts, name = "Rolling skewness",
                color = "cornflowerblue") %>%
  hc_yAxis(title = list(text = "skewness"),
           opposite = FALSE, max = 3, min = -3) %>%
  hc_navigator(enabled = FALSE) %>%
  hc_scrollbar(enabled = FALSE)

For completeness of methods, we can calculate rolling skewness in a tibble and then use ggplot. We will make use of rollapply() from within tq_mutate in tidyquant.
rolling_skew_tidy <-
  portfolio_returns_tq_rebalanced_monthly %>%
  tq_mutate(select = returns,
            mutate_fun = rollapply,
            width = window,
            FUN = skewness,
            col_rename = "skew")

rolling_skew_tidy is ready for ggplot. ggplot is not purpose-built for time series plotting, but we can set aes(x = date, y = skew) to make the x-axis our date values.

library(scales)
theme_update(plot.title = element_text(hjust = 0.5))

rolling_skew_tidy %>%
  ggplot(aes(x = date, y = skew)) +
  geom_line(color = "cornflowerblue") +
  ggtitle("Rolling Skew with ggplot") +
  ylab(paste("Rolling", window, "month skewness", sep = " ")) +
  scale_y_continuous(limits = c(-3, 3), breaks = pretty_breaks(n = 8)) +
  scale_x_date(breaks = pretty_breaks(n = 8))

The rolling charts are quite illuminating and show that the six-month-interval skewness has been positive for about half the lifetime of this portfolio. Today, the overall skewness is negative, but the rolling skewness in mid-2016 was positive and greater than 1. It took a huge plunge starting at the end of 2016, and the lowest reading was -1.65 in March of 2017, most likely caused by one or two very large negative returns when the market was worried about the US election. We can see those worries start to abate as the rolling skewness becomes more positive throughout 2017. That's all for today. Thanks for reading and see you next time when we tackle kurtosis.

networkD3: D3 JavaScript Network Graphs from R

Dev-version: 0.4

About

This started as a port of Christopher Gandrud's R package d3Network for creating D3 network graphs to the htmlwidgets framework. The htmlwidgets framework greatly simplifies the package's syntax for exporting the graphs, and improves integration with RStudio's Viewer Pane, RMarkdown, and Shiny web apps. See below for examples. It currently supports several types of network graphs, including force directed networks (simpleNetwork, forceNetwork), Sankey diagrams (sankeyNetwork), tree diagrams (radialNetwork, diagonalNetwork), and dendrograms (dendroNetwork).

Install

networkD3 works very well with the most recent version of RStudio (>= v0.99, download). When you use this version of RStudio, graphs will appear in the Viewer Pane. Not only does this give you a handy way of seeing and tweaking your graphs, but you can also export the graphs to the clipboard or a PNG/JPEG/TIFF/etc. file. The package can be downloaded from CRAN.

Usage

For a full set of examples for each of the functions see this page.

Note: You are probably used to R's 1-based numbering (i.e. counting in R starts from 1). However, networkD3 plots are created using JavaScript, which is 0-based. So, your data links will need to start from 0. See this data set for an example. You can also use igraph to build your graph data and then use the igraph_to_networkD3 function to convert this data to a suitable object for networkD3 plotting.

> simpleNetwork

For very basic force directed network graphics you can use simpleNetwork. For example:

# Load package
library(networkD3)

# Create fake data
src <- c("A", "A", "A", "A", "B", "B", "C", "C", "D")
target <- c("B", "C", "D", "J", "E", "F", "G", "H", "I")
networkData <- data.frame(src, target)

# Plot
simpleNetwork(networkData)

> forceNetwork

Use forceNetwork to have more control over the appearance of the force directed network and to plot more complicated networks.
Here is an example:

# Load data
data(MisLinks)
data(MisNodes)

# Plot
forceNetwork(Links = MisLinks, Nodes = MisNodes,
             Source = "source", Target = "target",
             Value = "value", NodeID = "name",
             Group = "group", opacity = 0.8)

From version 0.1.3 you can also allow scroll-wheel zooming by setting zoom = TRUE.

> sankeyNetwork

You can also create Sankey diagrams with sankeyNetwork. Here is an example using downloaded JSON data:

# Load energy projection data
URL <- paste0(
  "https://cdn.rawgit.com/christophergandrud/networkD3/",
  "master/JSONdata/energy.json")
Energy <- jsonlite::fromJSON(URL)

# Plot
sankeyNetwork(Links = Energy$links, Nodes = Energy$nodes,
              Source = "source", Target = "target",
              Value = "value", NodeID = "name",
              units = "TWh", fontSize = 12, nodeWidth = 30)

> radialNetwork

From version 0.2, tree diagrams can be created using radialNetwork or diagonalNetwork.
URL <- paste0(
  "https://cdn.rawgit.com/christophergandrud/networkD3/",
  "master/JSONdata//flare.json")

## Convert to list format
Flare <- jsonlite::fromJSON(URL, simplifyDataFrame = FALSE)

# Use subset of data for more readable diagram
Flare$children = Flare$children[1:3]

radialNetwork(List = Flare, fontSize = 10, opacity = 0.9)

diagonalNetwork(List = Flare, fontSize = 10, opacity = 0.9)

> dendroNetwork

From version 0.2, it is also possible to create dendrograms using dendroNetwork.

hc <- hclust(dist(USArrests), "ave")
dendroNetwork(hc, height = 600)

Interacting with igraph

You can use igraph to create network graph data that can be plotted with networkD3. The igraph_to_networkD3 function converts igraph graphs to lists that work well with networkD3. For example:

# Load igraph
library(igraph)

# Use igraph to make the graph and find membership
karate <- make_graph("Zachary")
wc <- cluster_walktrap(karate)
members <- membership(wc)

# Convert to object suitable for networkD3
karate_d3 <- igraph_to_networkD3(karate, group = members)

# Create force directed network plot
forceNetwork(Links = karate_d3$links, Nodes = karate_d3$nodes,
             Source = 'source', Target = 'target',
             NodeID = 'name', Group = 'group')

Output

Saving to an external stand alone HTML file

Use saveNetwork to save a network to a stand alone HTML file:

library(magrittr)
simpleNetwork(networkData) %>% saveNetwork(file = 'Net1.html')

Including in an RMarkdown file

It is simple to include a networkD3 graphic in an RMarkdown file. Simply place the code to create the graph in a code chunk the same way you would any other plot. Check out this simple example.

Including in Shiny web apps

You can also easily include networkD3 graphs in Shiny web apps.
In the server.R file, create the graph by placing the function inside of render*Network, where the * is either Simple, Force, or Sankey depending on the graph type. For example, using the MisLinks and MisNodes data from earlier:

output$force <- renderForceNetwork({
  forceNetwork(Links = MisLinks, Nodes = MisNodes,
               Source = "source", Target = "target",
               Value = "value", NodeID = "name",
               Group = "group", opacity = 0.8)
})
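Each render*Network function is paired with a *NetworkOutput function on the UI side; a minimal sketch for the example above (the output id "force" must match the name used in the server code):

# In ui.R, reserve space for the graph rendered by output$force
forceNetworkOutput("force")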
data.frame(z = rnorm(100)) %$% ts.plot(z)

Of course, these three operators work slightly differently than the main %>% operator. You'll see more about their functionalities and their usage later on in this tutorial! Note that, even though you'll most often see the magrittr pipes, you might also encounter other pipes as you go along! Some examples are wrapr's dot arrow pipe %.>% or to dot pipe %>.%, or the Bizarro pipe ->.;.

How to Use Pipes in R

Now that you know how the %>% operator originated, what it actually is and why you should use it, it's time for you to discover how you can actually use it to your advantage. You will see that there are quite some ways in which you can use it!

Basic Piping

Before you go into the more advanced usages of the operator, it's good to first take a look at the most basic examples that use the operator. In essence, you'll see that there are 3 rules that you can follow when you're first starting out:

• f(x) can be rewritten as x %>% f

In short, this means that functions that take one argument, function(argument), can be rewritten as follows: argument %>% function(). Take a look at the following, more practical example to understand how these two are equivalent:

# Compute the logarithm of x
log(x)

# Compute the logarithm of x
x %>% log()

• f(x, y) can be rewritten as x %>% f(y)

Of course, there are a lot of functions that don't just take one argument, but multiple. This is the case here: you see that the function takes two arguments, x and y. Similar to what you have seen in the first example, you can rewrite the function by following the structure argument1 %>% function(argument2), where argument1 is piped into the function as its first argument and argument2 remains in the function call. This all seems quite theoretical. Let's take a look at a more practical example:

# Round pi
round(pi, 6)

# Round pi
pi %>% round(6)

• x %>% f %>% g %>% h can be rewritten as h(g(f(x)))

This might seem complex, but it isn't quite like that when you look at a real-life R example:

# Import babynames data
library(babynames)

# Import dplyr library
library(dplyr)

# Load the data
data(babynames)

# Count how many young boys with the name "Taylor" were born
sum(select(filter(babynames, sex == "M", name == "Taylor"), n))

# Do the same but now with %>%
babynames %>%
  filter(sex == "M", name == "Taylor") %>%
  select(n) %>%
  sum()

Note how you work from the inside out when you rewrite the nested code: you first put in the babynames, then you use %>% to first filter() the data. After that, you'll select n and lastly, you'll sum() everything. Remember also that you already saw another example of such nested code that was converted to more readable code in the beginning of this tutorial, where you used the log(), diff(), exp() and round() functions to perform calculations on x.

Functions that Use the Current Environment

Unfortunately, there are some exceptions to the more general rules that were outlined in the previous section. Let's take a look at some of them here. Consider this example, where you use the assign() function to assign the value 10 to the variable x.

# Assign 10 to x
assign("x", 10)

# Assign 100 to x
"x" %>% assign(100)

# Return x
x

10

You see that the second call with the assign() function, in combination with the pipe, doesn't work properly. The value of x is not updated. Why is this? That's because the function assigns the new value 100 to a temporary environment used by %>%.
So, if you want to use assign() with the pipe, you must be explicit about the environment:

# Define your environment
env <- environment()

# Add the environment to assign()
"x" %>% assign(100, envir = env)

# Return x
x

100

Functions with Lazy Evaluation

Arguments within functions are only computed when the function uses them in R. This means that no arguments are computed before you call your function! That means also that the pipe computes each element of the function in turn. One place that this is a problem is tryCatch(), which lets you capture and handle errors, like in this example:

tryCatch(stop("!"), error = function(e) "An error")

stop("!") %>% tryCatch(error = function(e) "An error")

'An error'

Error in eval(expr, envir, enclos): !
Traceback:
1. stop("!") %>% tryCatch(error = function(e) "An error")
2. eval(lhs, parent, parent)
3. eval(expr, envir, enclos)
4. stop("!")

You'll see that the nested way of writing down this line of code works perfectly, while the piped alternative returns an error. Other functions with the same behavior are try(), suppressMessages(), and suppressWarnings() in base R.

Argument Placeholder

There are also instances where you can use the pipe operator as an argument placeholder. Take a look at the following examples:

• f(x, y) can be rewritten as y %>% f(x, .)

In some cases, you won't want the value or the magrittr placeholder to go to the function call at the first position, which has been the case in every example that you have seen up until now. Reconsider this line of code:

pi %>% round(6)

If you would rewrite this line of code, pi would be the first argument in your round() function. But what if you would want to replace the second, third, … argument and use that one as the magrittr placeholder to your function call? Take a look at this example, where the value is actually at the third position in the function call:

"Ceci n'est pas une pipe" %>% gsub("une", "un", .)

'Ceci n\'est pas un pipe'

• f(y, z = x) can be rewritten as x %>% f(y, z = .)

Likewise, you might want to make the value of a specific argument within your function call the magrittr placeholder. Consider the following line of code:

6 %>% round(pi, digits = .)

Re-using the Placeholder for Attributes

It is straightforward to use the placeholder several times in a right-hand side expression. However, when the placeholder only appears in a nested expression, magrittr will still apply the first-argument rule. The reason is that in most cases this results in cleaner code. Here are some general "rules" that you can take into account when you're working with argument placeholders in nested function calls:

• f(x, y = nrow(x), z = ncol(x)) can be rewritten as x %>% f(y = nrow(.), z = ncol(.))

# Initialize a matrix ma
ma <- matrix(1:12, 3, 4)

# Return the maximum of the values inputted
max(ma, nrow(ma), ncol(ma))

# Return the maximum of the values inputted
ma %>% max(nrow(ma), ncol(ma))

12
12

The behavior can be overruled by enclosing the right-hand side in braces:

• f(y = nrow(x), z = ncol(x)) can be rewritten as x %>% {f(y = nrow(.), z = ncol(.))}

# Only return the maximum of the nrow(ma) and ncol(ma) input values
ma %>% {max(nrow(ma), ncol(ma))}

4

To conclude, also take a look at the following example, where you could possibly want to adjust the workings of the argument placeholder in the nested function call:

# The function that you want to rewrite
paste(1:5, letters[1:5])

# The nested function call with dot placeholder
1:5 %>% paste(., letters[.])

1. '1 a'  2. '2 b'  3. '3 c'  4. '4 d'  5. '5 e'
1. '1 a'  2. '2 b'  3. '3 c'  4. '4 d'  5. '5 e'

You see that if the placeholder is only used in a nested function call, the magrittr placeholder will also be placed as the first argument! If you want to avoid this from happening, you can use the curly brackets { and }:

# The nested function call with dot placeholder and curly brackets
1:5 %>% { paste(letters[.]) }

# Rewrite the above function call
paste(letters[1:5])

1. 'a'  2. 'b'  3. 'c'  4. 'd'  5. 'e'
1. 'a'  2. 'b'  3. 'c'  4. 'd'  5. 'e'

Building Unary Functions

Unary functions are functions that take one argument. Any pipeline that you might make that consists of a dot ., followed by functions and that is chained together with %>% can be used later if you want to apply it to values. Take a look at the following example of such a pipeline:

. %>% cos %>% sin

This pipeline would take some input, after which both the cos() and sin() functions would be applied to it. But you're not there yet! If you want this pipeline to do exactly that which you have just read, you need to assign it first to a variable f, for example. After that, you can re-use it later to do the operations that are contained within the pipeline on other values.

# Unary function
f <- . %>% cos %>% sin

f

structure(function (value) freduce(value, `_function_list`), class = c("fseq", "function"))

Remember also that you could put parentheses after the cos() and sin() functions in the line of code if you want to improve readability. Consider the same example with parentheses: . %>% cos() %>% sin(). You see, building functions in magrittr is very similar to building functions with base R! If you're not sure how similar they actually are, check out the line above and compare it with the next line of code; both lines have the same result!

# is equivalent to
f <- function(.) sin(cos(.))

f

function (.) sin(cos(.))

Compound Assignment Pipe Operations

There are situations where you want to overwrite the value of the left-hand side, just like in the example right below. Intuitively, you will use the assignment operator <- to do this.

# Load in the Iris data
iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE)

# Add column names to the Iris data
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")

# Compute the square root of iris$Sepal.Length and assign it to the variable
iris$Sepal.Length <- iris$Sepal.Length %>%
sqrt()


However, there is a compound assignment pipe operator, which allows you to use a shorthand notation to assign the result of your pipeline immediately to the left-hand side:

# Compute the square root of iris$Sepal.Length and assign it to the variable
iris$Sepal.Length %<>% sqrt

# Return Sepal.Length
iris$Sepal.Length

Note that the compound assignment operator %<>% needs to be the first pipe operator in the chain for this to work. This is completely in line with what you just read about the operator being a shorthand notation for a longer notation with repetition, where you use the regular <- assignment operator. As a result, this operator will assign a result of a pipeline rather than returning it.

Tee Operations with The Tee Operator

The tee operator works exactly like %>%, but it returns the left-hand side value rather than the potential result of the right-hand side operations. This means that the tee operator can come in handy in situations where you have included functions that are used for their side effect, such as plotting with plot() or printing to a file. In other words, functions like plot() typically don't return anything. That means that, after calling plot(), for example, your pipeline would end. However, in the following example, the tee operator %T>% allows you to continue your pipeline even after you have used plot():

set.seed(123)
rnorm(200) %>%
  matrix(ncol = 2) %T>%
  plot %>%
  colSums

Exposing Data Variables with the Exposition Operator

When you're working with R, you'll find that many functions take a data argument. Consider, for example, the lm() function or the with() function. These functions are useful in a pipeline where your data is first processed and then passed into the function. For functions that don't have a data argument, such as the cor() function, it's still handy if you can expose the variables in the data. That's where the %$% operator comes in. Consider the following example:

iris %>%
  subset(Sepal.Length > mean(Sepal.Length)) %$%
  cor(Sepal.Length, Sepal.Width)

0.336696922252551

With the help of %$%, you make sure that Sepal.Length and Sepal.Width are exposed to cor(). Likewise, you see that the data in the data.frame() function is passed to ts.plot() to plot several time series on a common plot:

data.frame(z = rnorm(100)) %$% ts.plot(z)

dplyr and magrittr

In the introduction to this tutorial, you already learned that the development of dplyr and magrittr occurred around the same time, namely, around 2013-2014. And, as you have read, the magrittr package is also part of the Tidyverse. In this section, you will discover how exciting it can be when you combine both packages in your R code.

For those of you who are new to the dplyr package, you should know that this R package was built around five verbs, namely, "select", "filter", "arrange", "mutate" and "summarize". If you have already manipulated data for some data science project, you will know that these verbs make up the majority of the data manipulation tasks that you generally need to perform on your data.

Take an example of some traditional code that makes use of these dplyr functions:

library(hflights)

grouped_flights <- group_by(hflights, Year, Month, DayofMonth)
flights_data <- select(grouped_flights, Year:DayofMonth, ArrDelay, DepDelay)
summarized_flights <- summarise(flights_data,
                                arr = mean(ArrDelay, na.rm = TRUE),
                                dep = mean(DepDelay, na.rm = TRUE))
final_result <- filter(summarized_flights, arr > 30 | dep > 30)

final_result

Year Month DayofMonth      arr      dep
2011     2          4 44.08088 47.17216
2011     3          3 35.12898 38.20064
2011     3         14 46.63830 36.13657
2011     4          4 38.71651 27.94915
2011     4         25 37.79845 22.25574
2011     5         12 69.52046 64.52039
2011     5         20 37.02857 26.55090
2011     6         22 65.51852 62.30979
2011     7         29 29.55755 31.86944
2011     9         29 39.19649 32.49528
2011    10          9 61.90172 59.52586
2011    11         15 43.68134 39.23333
2011    12         29 26.30096 30.78855
2011    12         31 46.48465 54.17137

When you look at this example, you immediately understand why dplyr and magrittr are able to work so well together:

hflights %>%
  group_by(Year, Month, DayofMonth) %>%
  select(Year:DayofMonth, ArrDelay, DepDelay) %>%
  summarise(arr = mean(ArrDelay, na.rm = TRUE),
            dep = mean(DepDelay, na.rm = TRUE)) %>%
  filter(arr > 30 | dep > 30)

Both code chunks are fairly long, but you could argue that the second code chunk is clearer if you want to follow along through all of the operations. With the creation of intermediate variables in the first code chunk, you could possibly lose the "flow" of the code. By using %>%, you gain a clearer overview of the operations that are being performed on the data! In short, dplyr and magrittr are your dream team for manipulating data in R!

RStudio Keyboard Shortcuts for Pipes

Adding all these pipes to your R code can be a challenging task! To make your life easier, John Mount, co-founder and Principal Consultant at Win-Vector, LLC and DataCamp instructor, has released a package with some RStudio add-ins that allow you to create keyboard shortcuts for pipes in R. Add-ins are actually R functions with a bit of special registration metadata. An example of a simple add-in can, for example, be a function that inserts a commonly used snippet of text, but add-ins can also get very complex! With these add-ins, you'll be able to execute R functions interactively from within the RStudio IDE, either by using keyboard shortcuts or by going through the Addins menu. Note that this package is actually a fork of RStudio's original add-in package, which you can find here. Be careful though: the support for add-ins is available only within the most recent release of RStudio! If you want to know more about how you can install these RStudio add-ins, check out this page. You can download the add-ins and keyboard shortcuts here.
When Not To Use the Pipe Operator in R

In the above, you have seen that pipes are definitely something that you should be using when you're programming with R. More specifically, you have seen this by covering some cases in which pipes prove to be very useful! However, there are some situations, outlined by Hadley Wickham in "R for Data Science", in which you can best avoid them:

• Your pipes are longer than (say) ten steps. In cases like these, it's better to create intermediate objects with meaningful names. It will not only be easier for you to debug your code, but you'll also understand your code better and it'll be easier for others to understand your code.
• You have multiple inputs or outputs. If you aren't transforming one primary object, but two or more objects are combined together, it's better not to use the pipe.
• You are starting to think about a directed graph with a complex dependency structure. Pipes are fundamentally linear, and expressing complex relationships with them will only result in complex code that will be hard to read and understand.
• You're doing internal package development. Using pipes in internal package development is a no-go, as it makes it harder to debug!

For more reflections on this topic, check out this Stack Overflow discussion. Other situations that appear in that discussion are loops, package dependencies, argument order and readability. In short, you could summarize it all as follows: keep in mind the two things that make this construct so great, namely, readability and flexibility. As soon as one of these two big advantages is compromised, you might consider some alternatives in favor of the pipes.

Alternatives to Pipes in R

After all that you have read, you might also be interested in some alternatives that exist in the R programming language. Some of the solutions that you have seen in this tutorial were the following:

• Create intermediate variables with meaningful names. Instead of chaining all operations together and outputting one single result, break up the chain and make sure you save intermediate results in separate variables. Be careful with the naming of these variables: the goal should always be to make your code as understandable as possible!
• Nest your code so that you read it from the inside out. One of the possible objections that you could have against pipes is the fact that they go against the "flow" that you have been accustomed to with base R. The solution is then to stick with nesting your code! But what to do then if you don't like pipes but you also think nesting can be quite confusing? The solution here can be to use tabs to highlight the hierarchy.
• … Do you have more suggestions? Make sure to let me know. Drop me a tweet @willems_karlijn

Conclusion

You have covered a lot of ground in this tutorial: you have seen where %>% comes from, what it exactly is, why you should use it and how you should use it. You've seen that the dplyr and magrittr packages work wonderfully together and that there are even more operators out there! Lastly, you have also seen some cases in which you shouldn't use it when you're programming in R and what alternatives you can use in such cases. If you're interested in learning more about the Tidyverse, consider DataCamp's Introduction to the Tidyverse course.

Data Preprocessing: Everything About Data Cleansing

In this post, I would like to go through, in order, the tasks that should be performed in the stage known as exploratory data analysis (EDA). EDA proceeds in the order of checking the data set, handling missing values, handling outliers, and feature engineering. Of all the stages of data analysis, the one that takes the most time is exploratory data analysis.
According to a CrowdFlower survey cited by Forbes, data analysts spend about 80% of their working time on collecting and preprocessing data. (In the same survey, though, it was also voted the least enjoyable part of data analysis work.)

1. Checking the Data Set

This is the stage where you get familiar with the data set you are about to analyze. You will perform the following two checks on the data set.

A. Checking the variables

Check the definitions of the independent and dependent variables, the type of each variable (categorical or continuous), and each variable's data type (Date, Character, Numeric, and so on). As with any other tool, in R a model fit can produce completely different results depending on a variable's data type, so check the variable types in advance and correct any that are set incorrectly at this stage.

B. Checking the raw data

B-1. Univariate analysis

This is the stage of checking descriptive statistics for one variable at a time. Use a histogram or boxplot to inspect each variable's distribution along with its mean, mode, and median. For categorical variables, check the distribution of frequency counts (a bar chart is the natural choice here).

B-2. Bivariate analysis

This is the stage of analyzing the relationship between two variables. Choose the appropriate visualization and analysis method according to the types of the two variables involved.

B-3. Three or more variables

It can be tedious, but there will also be cases where you need to visualize and analyze relationships among three or more variables. If at least one categorical variable is involved, split the data by its categories and then apply the methods above. For example, if you have gender information together with annual income, education, and height, you could split by gender and use a t-test to check whether annual income and education are independent, or split by education and examine the correlation between annual income and height. To examine the relationship among three or more continuous variables, either convert some continuous variables into categorical ones through feature engineering, or (not recommended, but if you really must) draw a 3D chart and inspect it visually.

2. Missing Value Treatment

If you build a model while missing values are present, the relationships between variables can be distorted and the model's accuracy drops. Missing values arise in different ways, and the treatment differs somewhat depending on whether they occur at random or whether their occurrence is related to other variables.

Types of missing value treatment

A. Deletion

You can delete every observation in which any missing value occurs (complete deletion, also called listwise deletion), or delete only the observations with missing values in the variables that will enter the model (partial deletion). Complete deletion is convenient, but the reduced number of observations can weaken the model's validity; with partial deletion, the variable set differs from model to model, so the management cost grows. Deletion should be used when missing values occur at random. If they do not occur at random and you still build on data with those observations deleted, a distorted model can result.

B. Substitution with another value (mean, mode, median)

When a missing value occurs, you can substitute the mean, mode, or median of the other observations. There is blanket substitution, which uses a value such as the mean of all observations, and similar-type substitution, which uses a categorical variable to substitute the mean of similar observations. (Example: if the average height of men is 173 and that of women is 158, a male observation's missing height is replaced with 173.) Substitution can be useful when the occurrence of missing values is related to other variables, but in similar-type substitution the choice of which categorical variable defines "similar" is arbitrary, so the model can still be distorted.

C. Inserting predicted values

Use the observations without missing values as training data to build a model that predicts the missing values, then use this model to predict the missing values of the remaining observations. Regression or logistic regression is typically used. This is somewhat less arbitrary than substitution, but when missing values occur across many variables there are few usable predictors, making it hard to build a suitable model; and if that model's predictive power is low, this method is hard to use.

3. Outlier Treatment

An outlier is an observation that lies far away from the rest of the data/sample and has the potential to distort a model.

Detecting outliers

An easy, simple way to find outliers is to visualize the variables' distributions. In general, use a boxplot or histogram for a single variable and a scatter plot for outliers between two variables. Visual inspection is intuitive but also subjective, and checking variables one by one is tedious. Another way to find outliers between two variables is to fit a regression between them and examine the residuals, studentized (or standardized) residuals, leverage, and Cook's D values.

Handling outliers

A. Simple deletion

If an outlier was caused by human error, simply delete the observation. This is appropriate for plain typos, unrealistic answers to open-ended survey questions, errors introduced during data processing, and so on.

B. Substitution with another value

When the absolute number of observations is small, removing outliers by deletion shrinks the data even further. In such cases, even if the outlier arose from human error, instead of deleting the observation you can substitute another value (such as the mean), or, as with missing values, build a prediction model from the other variables, predict the outlying value, and substitute the prediction.

C. Converting the outlier into a variable

If an outlier occurred naturally, a model built after simply deleting or substituting it may not explain or predict the phenomenon well. For example, suppose that across the other observations salary rises with years of experience, but including one outlier, a five-year employee with a $35,000 salary, sharply lowers the model's explanatory power.

For naturally occurring outliers, it is important not to delete them right away, but first to take a closer, more careful look at what the outlier represents.

For example, suppose the outlier above is someone who works in a professional occupation, such as a doctor. In that case, if you encode professional status as a Yes/No variable, you can keep the outlier in the model instead of deleting it.
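A minimal sketch of this encoding in R, using a hypothetical data frame (all column names and numbers here are illustrative, not from the original post):

# Hypothetical data: the high earner at 5 years of experience is a doctor
df <- data.frame(experience = c(1, 3, 5, 7, 5),
                 salary     = c(40, 55, 70, 85, 350),
                 job        = c("clerk", "clerk", "clerk", "clerk", "doctor"))

# Encode professional-occupation status as a Yes/No factor
df$is_professional <- factor(ifelse(df$job == "doctor", "Yes", "No"))

# The outlier can now stay in the model as its own explained case
fit <- lm(salary ~ experience + is_professional, data = df)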

D. Resampling

Another way to handle naturally occurring outliers is to separate them out and build the model on the rest of the data.

Suppose there is an outlier with more than 15 years of experience, as below. This is an observation whose experience is long but whose salary has not grown in proportion.

(The difference from the case above: there, the observation was not an outlier in the explanatory variable, experience, and only the dependent variable, salary, deviated from the prediction; here, the observation is an outlier in both the explanatory and the dependent variable.)

In this case, the simple treatment is to delete the outlier and add a note that the analysis covers only people with up to 10 years of experience.
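A minimal sketch of this treatment (hypothetical data; the 10-year cutoff mirrors the example above):

# Hypothetical data: one observation with unusually long experience
df <- data.frame(experience = c(2, 4, 6, 8, 10, 18),
                 salary     = c(45, 55, 65, 80, 95, 60))

# Restrict the analysis range and document the restriction in the report
df_trimmed <- subset(df, experience <= 10)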

E. Splitting cases and analyzing them separately

In the same scenario as above, it may actually be true that salary falls when experience gets very long (for health reasons, for example).

In that case, excluding the outlier may fail to describe the phenomenon accurately. A better approach is to build both a model that includes the outlier and a model that excludes it, and attach an explanation to each.

If nothing particularly unusual turns up about a naturally occurring outlier, I recommend splitting the cases and analyzing them separately rather than simply excluding it.
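To make that concrete, a hedged sketch that fits and compares both models, reusing the hypothetical df and df_trimmed from the sketch above:

# Model including the outlier vs. model excluding it
fit_all     <- lm(salary ~ experience, data = df)
fit_trimmed <- lm(salary ~ experience, data = df_trimmed)

# Compare the fitted slopes; report both models, each with its own explanation
coef(fit_all)
coef(fit_trimmed)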

4. Feature Engineering

Feature engineering is the process of adding information to the data using existing variables. It is a way of making existing data more useful without adding new observations or variables.

A. SCALING

Variable transformation is used when you want to change a variable's unit of measure, when a variable's distribution is skewed, or when relationships between variables do not show up clearly.

The most frequently used transformation is the log function; taking the square root is similar in spirit but used somewhat less often.
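As a small illustration (hypothetical right-skewed values, not data from the post):

# A right-skewed variable: most values small, one very large
income <- c(20, 25, 30, 40, 60, 500)

income_log  <- log(income)    # strongly compresses the long right tail
income_sqrt <- sqrt(income)   # similar idea, milder effect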

B. BINNING

Binning converts a continuous variable into a categorical one. For example, if salary exists as a raw number, you can convert it into a categorical variable with brackets such as under 1.00 million KRW, 1.01 to 2.00 million KRW, and so on.

There are no fixed rules for binning, so you can bin creatively, guided by your business understanding of the data.
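In R, cut() is the usual tool for this kind of binning; a minimal sketch with hypothetical salaries in units of 10,000 KRW:

# Hypothetical annual salaries (unit: 10,000 KRW)
salary <- c(80, 150, 220, 310, 95)

# Bin into brackets; right = FALSE makes each bin left-closed, e.g. [100, 200)
salary_bin <- cut(salary,
                  breaks = c(0, 100, 200, 300, Inf),
                  labels = c("under 100", "100-200", "200-300", "300 and up"),
                  right  = FALSE)

table(salary_bin)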

C. TRANSFORM

Transformation creates new variables from the properties of existing ones.

For example, if you have daily sales data, you might add a variable that splits each date into weekday/weekend; for eSports attendance data, you might add whether SKT T1 plays a match on that day, and so on.

There are no special rules for transformation either; a wide variety of variables can be created depending on the analyst's business understanding.
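A hedged sketch of the weekday/weekend example (hypothetical sales data; note that weekdays() returns locale-dependent day names, so an English locale is assumed):

# Hypothetical daily sales data for one week
sales <- data.frame(date  = as.Date("2018-01-01") + 0:6,
                    units = c(12, 15, 14, 13, 20, 35, 33))

# Derive a weekday/weekend variable from the date
sales$day_type <- ifelse(weekdays(sales$date) %in% c("Saturday", "Sunday"),
                         "weekend", "weekday")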

D. DUMMY

In the opposite direction from binning, dummy variables are used to convert a categorical variable into a numeric one, mainly when the analysis method you want to apply requires it.
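A minimal sketch using base R's model.matrix(), which expands a factor into numeric 0/1 dummy columns (hypothetical education levels):

# Hypothetical categorical variable
grade <- factor(c("HS", "BA", "BA", "MA"))

# One 0/1 column per level; "- 1" drops the intercept so every level gets a column
model.matrix(~ grade - 1)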

Wrapping Up…

Garbage in, garbage out: checking the state of your data before model building, and preprocessing it to suit the analysis you have designed, is an essential step for getting accurate results. It is like sourcing good ingredients and preparing them properly before you cook.

I hope this post has given readers a big-picture view of data preprocessing.

References

1) MeasuringU, 7 Ways To Handle Missing Data

2) Boston University Technical Report, Marina Soley-Bori, Dealing with missing data: Key assumptions and methods for applied analysis

3) R-bloggers, Imputing missing data with R; MICE package

4) Analytics Vidhya, A comprehensive guide to data exploration

5) The Analysis Factor, Outliers: To Drop or Not to Drop

6) Kellogg, Outliers