## 6 REASONS TO LEARN R FOR BUSINESS

Data Science for Business (DS4B) is the future of business analytics, but it can be hard to figure out where to start. The last thing you want to do is waste time with the wrong tools. To use your time effectively, there are two parts: (1) selecting the right tool for the job, and (2) efficiently learning how to use the tool to return business value. This article focuses on the first part, explaining why R is the right choice across six points. The next article focuses on the second part: how to learn R in 12 weeks.

## REASON 1: R HAS THE BEST OVERALL QUALITIES

There are a number of tools available for business analysis / business intelligence (with DS4B being a subset of this area). Each tool has its pros and cons, many of which are important in the business context. We can use these attributes to compare how each tool stacks up against the others! We did a qualitative assessment using several criteria:

• Business Capability (1 = Low, 10 = High)
• Ease of Learning (1 = Difficult, 10 = Easy)
• Cost (Free/Minimal, Low, High)
• Trend (0 = Fast Decline, 5 = Stable, 10 = Fast Growth)

Further discussion on the assessment is included in the Appendix at the end of the article.

What we saw was particularly interesting. A trendline developed exposing a tradeoff between learning curve and DS4B capability rating. The most flexible tools are more difficult to learn but tend to have higher business capability. Conversely, the “easy-to-learn” tools are often not the best long-term tools for business or data science capability. Our opinion is to go for capability over ease of use.

Of the top tools in capability, R has the best mix of desirable attributes including high data science for business capability, low cost, and it’s growing very fast. The only downside is the learning curve. The rest of the article explains why R is so great for business.

## REASON 2: R IS DATA SCIENCE FOR NON-COMPUTER SCIENTISTS

If you are seeking high-performance data science tools, you really have two options: R or Python. When starting out, you should pick one. It’s a mistake to try to learn both. Your choice comes down to what’s right for you. The difference between R and Python has been described in numerous infographics and debates online, but the most overlooked reason is person-programming-language fit. Don’t understand what we mean? Let’s break it down.

Fact 1: Most people interested in learning data science for business are not computer scientists. They are business professionals, non-software engineers (e.g. mechanical, chemical), and other technical-to-business converts. This is important because of where each language excels.

Fact 2: Most activities in business and finance involve communication. This comes in the form of reports, dashboards, and interactive web applications that allow decision makers to recognize when things are not going well and to make well-informed decisions that improve the business.

Now that we recognize what’s important, let’s learn about the two major players in data science.

### ABOUT PYTHON

Python is a general-purpose programming language developed by software engineers that has solid programming libraries for math, statistics and machine learning. Python has best-in-class tools for pure machine learning and deep learning, but lacks much of the infrastructure for subjects like econometrics and communication tools such as reporting. Because of this, Python is well-suited for computer scientists and software engineers.

### ABOUT R

R is a statistical programming language developed by scientists that has open source libraries for statistics, machine learning, and data science. R lends itself well to business because of its depth of topic-specific packages and its communication infrastructure. R has packages covering a wide range of topics such as econometrics, finance, and time series. R has best-in-class tools for visualization, reporting, and interactivity, which are as important to business as they are to science. Because of this, R is well-suited for scientists, engineers and business professionals.

### WHAT SHOULD YOU DO?

Don’t make the decision tougher than what it is. Think about where you are coming from:

• Are you a computer scientist or software engineer? If yes, choose Python.
• Are you an analytics professional or mechanical/industrial/chemical engineer looking to get into data science? If yes, choose R.

Think about what you are trying to do:

• Are you trying to build a self-driving car? If yes, choose Python.
• Are you trying to communicate business analytics throughout your organization? If yes, choose R.

## REASON 3: LEARNING R IS EASY WITH THE TIDYVERSE

Learning R used to be a major challenge. Base R was a complex and inconsistent programming language. Structure and formality were not top priorities as they are in other programming languages. This all changed with the “tidyverse”, a set of packages and tools that have a consistently structured programming interface.

When tools such as dplyr and ggplot2 came to fruition, it made the learning curve much easier by providing a consistent and structured approach to working with data. As Hadley Wickham and many others continued to evolve R, the tidyverse came to be, which includes a series of commonly used packages for data manipulation, visualization, iteration, modeling, and communication. The end result is that R is now much easier to learn (we’ll show you in our next article!).
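As a small taste of the tidyverse style (a minimal sketch, assuming the dplyr package is installed; the mtcars dataset ships with R):

```r
library(dplyr)

# Average fuel economy by cylinder count, highest first
mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  arrange(desc(avg_mpg))
```

Each verb does one thing, and %>% chains them in reading order, which is exactly what makes the learning curve gentler.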

Source: tidyverse.org

R continues to evolve in a structured manner, with advanced packages that are built on top of the tidyverse infrastructure. A new focus is being placed on modeling and algorithms, which we are excited to see. Further, the tidyverse is being extended to cover topical areas such as text (tidytext) and finance (tidyquant). For newcomers, this should give you confidence in selecting this language. R has a bright future.

## REASON 4: R HAS BRAINS, MUSCLE, AND HEART

Saying R is powerful is actually an understatement. From the business context, R is like Excel on steroids! But more important than just muscle is the combination of what R offers: brains, muscle, and heart.

### R HAS BRAINS

R implements cutting-edge algorithms including:

• H2O (h2o) – High-end machine learning package
• Keras/TensorFlow (keras, tensorflow) – Go-to deep learning packages
• xgboost – Top Kaggle algorithm
• And many more!

These tools are used everywhere from AI products to Kaggle Competitions, and you can use them in your business analyses.

### R HAS MUSCLE

R has powerful tools for:

• Vectorized Operations – R uses vectorized operations to make math computations lightning fast right out of the box
• Loops (purrr)
• Parallelizing operations (parallel, future)
• Speeding up code using C++ (Rcpp)
• Connecting to other languages (rJava, reticulate)
• Working With Databases – Connecting to databases (dbplyr, odbc, bigrquery)
• Handling Big Data – Connecting to Apache Spark (sparklyr)
• And many more!
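The first point can be sketched in pure base R (variable names here are illustrative):

```r
# Element-wise math on a whole vector -- no explicit loop required
x <- 1:100000
y <- sqrt(x) + 2 * x          # one vectorized expression

# The explicit-loop equivalent: more verbose and markedly slower
z <- numeric(length(x))
for (i in seq_along(x)) {
  z[i] <- sqrt(i) + 2 * i
}

all.equal(y, z)               # both give the same result
```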

### R HAS HEART

We already talked about the infrastructure, the tidyverse, that enables the ecosystem of applications to be built using a consistent approach. It’s this infrastructure that brings life into your data analysis. The tidyverse enables:

• Data manipulation (dplyr, tidyr)
• Working with data types (stringr for strings, lubridate for date/datetime, forcats for categorical/factors)
• Visualization (ggplot2)
• Programming (purrr, tidyeval)
• Communication (rmarkdown, shiny)
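A minimal sketch of the data-type packages above (assuming stringr and lubridate are installed; the example values are illustrative):

```r
library(stringr)
library(lubridate)

# stringr: consistent str_* verbs for strings
str_detect(c("apple", "banana"), "an")   # returns FALSE TRUE

# lubridate: readable helpers for dates
d <- ymd("2018-01-15")
month(d)                                 # returns 1
d + months(1)                            # one month later
```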

## REASON 5: R IS BUILT FOR BUSINESS

Two major advantages of R versus every other programming language are that it can produce business-ready reports and machine learning-powered web applications. Neither Python nor Tableau nor any other tool can currently do this as efficiently as R can. The two capabilities we refer to are rmarkdown for report generation and shiny for interactive web applications.

### RMARKDOWN

Rmarkdown is a framework for creating reproducible reports that has since been extended to building blogs, presentations, websites, books, journals, and more. It’s the technology behind this blog, and it allows us to include the code with the text so that anyone can follow the analysis and see the output right alongside the explanation.
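As a minimal sketch (the title and chunk below are illustrative, not from the original article), an Rmarkdown document is just plain text interleaved with executable R chunks:

````markdown
---
title: "Monthly Sales Report"
output: html_document
---

Sales grew this quarter. The figure below is regenerated
from the data every time the report is knitted.

```{r pressure-plot}
plot(pressure)  # any R chunk runs at render time
```
````

Knitting the file (e.g. with rmarkdown::render()) re-runs every chunk, so the numbers and figures in the report cannot drift out of sync with the analysis.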

### SHINY

Source: shiny.rstudio.com

Shiny is a framework for creating interactive web applications that are powered by R. Shiny is a major consulting area for us as four of five assignments involve building a web application using shiny. It’s not only powerful, it enables non-data scientists to gain the benefit of data science via interactive decision making tools. Here’s an example of a Google Trend app built with shiny.
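A hedged, minimal sketch of a shiny app (assuming the shiny package is installed; the input and output names are illustrative):

```r
library(shiny)

# UI: one slider input and one plot output
ui <- fluidPage(
  sliderInput("n", "Number of points:", min = 10, max = 500, value = 100),
  plotOutput("scatter")
)

# Server: re-draws the scatter plot whenever the slider moves
server <- function(input, output) {
  output$scatter <- renderPlot({
    plot(rnorm(input$n), rnorm(input$n))
  })
}

# Launch interactively with: shinyApp(ui, server)
```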

## REASON 6: R COMMUNITY SUPPORT

Being a powerful language alone is not enough. To be successful, a language needs community support. We’ll hit on two ways that R excels in this respect: CRAN and the R Community.

### CRAN: COMMUNITY-PROVIDED R PACKAGES

CRAN is like the Apple App Store, except everything is free, super useful, and built for R. With over 14,000 packages, it has almost everything you could possibly want, from machine learning to high-performance computing to finance and econometrics! The task views cover specific areas and are one way to explore R’s offerings. CRAN is community-driven, with top open source authors such as Hadley Wickham and Dirk Eddelbuettel leading the way. Package development is a great way to contribute to the community, especially for those looking to showcase their coding skills and give back!

### COMMUNITY SUPPORT

You begin with R because of its capability, you stay with R because of its community. The R Community is the coolest part. It’s tight-knit, opinionated, fun, silly, and highly knowledgeable… all of the things you want in a high performing team.

#### SOCIAL/WEB

R users can be found all over the web, in a number of popular hangouts.

#### CONFERENCES

R-focused business conferences are gaining traction in a big way. Here are a few that we attend and/or will be attending in the future:

• EARL – Mango Solution’s conference on enterprise and business applications of R
• R/Finance – Community-hosted conference on financial asset and portfolio analytics and applied finance
• RStudio Conf – RStudio’s technology conference
• New York R – Business and technology-focused R conference

#### MEETUPS

A really cool thing about R is that many major cities have a meetup nearby. Meetups are exactly what you think: a group of R-users getting together to talk R. They are usually funded by R-Consortium. You can get a full list of meetups here.

## CONCLUSION

R has a wide range of benefits, making it our obvious choice for Data Science for Business (DS4B). That’s not to say that Python isn’t a good choice as well, but for the wide range of needs in business, there’s nothing that compares to R. In this article we saw why R is a great choice. In the next article we’ll show you how to learn R in 12 weeks.

## ABOUT BUSINESS SCIENCE

Business Science specializes in “ROI-driven data science”. Our focus is machine learning and data science in business and financial applications. We build web applications and automated reports to put machine learning in the hands of decision makers. Visit Business Science or contact us to learn more!

## BUSINESS SCIENCE UNIVERSITY

Interested in learning data science for business? Enroll in Business Science University. We’ll teach you how to apply data science and machine learning in real-world business applications. We take you through the entire process of modeling problems, creating interactive data products, and distributing solutions within an organization. We are launching courses in early 2018!

## APPENDIX – DISCUSSION ON DS4B TOOL ASSESSMENT

Here’s some additional information on the tool assessment. We have provided the code used to make the visualization, the criteria explanation, and the tool assessment.

### CRITERIA EXPLANATION

Our assessment of the most powerful DS4B tools was based on four criteria:

• Business Capability (1 = Low, 10 = High): How well-suited is the tool for use in business? Does it include features needed for business such as advanced analytics, interactivity, communication, and web apps?
• Ease of Learning (1 = Difficult, 10 = Easy): How easy is it to pick up? Can you learn it in a week of short courses, or will it take a longer time horizon to become proficient?
• Cost (Free/Minimal, Low, High): Cost has two undesirable effects. From a first-order perspective, the organization has to spend money. This is not in and of itself undesirable, because software companies can theoretically spend on R&D and other efforts to advance the product. The second-order effect of lowering adoption is much more concerning: high-cost tools tend to have much less discussion online, whereas open source or low-cost tools show strong growth trends.
• Trend (0 = Fast Decline, 5 = Stable, 10 = Fast Growth): We used Stack Overflow Insights question counts as a proxy for the trend in usage over time. A major assumption is that a growing number of Stack Overflow questions indicates that usage is increasing at a similar rate.

Source: Stack Overflow Trends

### INDIVIDUAL TOOL ASSESSMENT

#### R:

• DS4B Capability = 10: Has it all. Great data science capability, great visualization libraries, Shiny for interactive web apps, rmarkdown for professional reporting.
• Learning Curve = 4: A lot to learn, but learning is getting easier with the tidyverse.
• Trend = 10: Stack overflow questions are growing at a very fast pace.
• Cost = Low: Free and open source

#### PYTHON:

• DS4B Capability = 7: Has great machine learning and deep learning libraries. Can connect to any major database. Communication is limited by flask / Django web applications, which can be difficult to build. Does not have a business reporting infrastructure comparable to rmarkdown.
• Learning Curve = 4: A lot to learn, but learning is relatively easy compared to other object oriented programming languages like Java.
• Trend = 10: Stack overflow questions are growing at a very fast pace.
• Cost = Low: Free and open source

#### EXCEL:

• DS4B Capability = 4: Mainly spreadsheet software, but has programming built in with VBA. Difficult to integrate R, but it is possible. No data science libraries.
• Learning Curve = 10: Relatively easy to become an advanced user.
• Trend = 7: Stack overflow questions are growing at a relatively fast pace.
• Cost = Low: Comes with Microsoft Office, which most organizations use.

#### TABLEAU:

• DS4B Capability = 6: Has R integrated, but is very difficult to implement advanced algorithms and not as flexible as R+shiny.
• Learning Curve = 7: Very easy to pick up.
• Trend = 6: Stack overflow questions are growing at a relatively fast pace.
• Cost = Low: Free public version. Enterprise licenses are relatively affordable.

#### POWERBI:

• DS4B Capability = 5: Similar to Tableau, but not quite as feature-rich. Can integrate R to some extent.
• Learning Curve = 8: Very easy to pick up.
• Trend = 6: Expected to have same trend as Tableau.
• Cost = Low: Free public version. Licenses are very affordable.

#### MATLAB:

• DS4B Capability = 6: Can do a lot with it, but lacks the infrastructure to use for business.
• Learning Curve = 2: Matlab is quite difficult to learn.
• Trend = 1: Stack overflow growth is declining at a rapid pace.
• Cost = High: Matlab licenses are very expensive. Licensing structure does not scale well.

#### SAS:

• DS4B Capability = 8: Has data science, database connection, business reporting and visualization capabilities. Can also build applications. However, it is limited by its closed-source nature and does not get the latest technologies like tensorflow and H2O.
• Learning Curve = 4: Similar to most data science programming languages for the tough stuff. Has a GUI for the easy stuff.
• Trend = 3: Stack Overflow growth is declining.
• Cost = High: Expensive for licenses. Licensing structure does not scale well.

### CODE FOR THE DS4B TOOL ASSESSMENT VISUALIZATION
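The original code block did not survive here; below is a hedged reconstruction (scores transcribed from the individual tool assessment above; the ggplot2 aesthetics are our assumptions, not the article’s original code):

```r
library(ggplot2)

# Scores transcribed from the individual tool assessment above
tools <- data.frame(
  tool       = c("R", "Python", "Excel", "Tableau", "PowerBI", "Matlab", "SAS"),
  capability = c(10, 7, 4, 6, 5, 6, 8),
  ease       = c(4, 4, 10, 7, 8, 2, 4),
  trend      = c(10, 10, 7, 6, 6, 1, 3)
)

# Capability vs. ease of learning, with point size mapped to trend
p <- ggplot(tools, aes(x = ease, y = capability, size = trend, label = tool)) +
  geom_point(alpha = 0.6) +
  geom_text(vjust = -1, size = 3, show.legend = FALSE) +
  labs(x = "Ease of Learning", y = "Business Capability (DS4B)",
       title = "DS4B Tool Assessment")

# print(p) renders the chart
```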


## Pipes in R Tutorial For Beginners (article) by DataCamp

Learn more about the famous pipe operator %>% and other pipes in R, why and how you should use them and what alternatives you can consider!

You might have already seen or used the pipe operator when you’re working with packages such as dplyr, magrittr, … But do you know where pipes and the famous %>% operator come from, what they exactly are, or how, when and why you should use them? Can you also come up with some alternatives?

This tutorial will give you an introduction to pipes in R, covering their history, how to use them, and some alternatives.

Are you interested in learning more about manipulating data in R with dplyr? Take a look at DataCamp’s Data Manipulation in R with dplyr course.

## Pipe Operator in R: Introduction

To understand what the pipe operator in R is and what you can do with it, it’s necessary to consider the full picture, to learn the history behind it. Questions such as “where does this weird combination of symbols come from and why was it made like this?” might be on top of your mind. You’ll discover the answers to these and more questions in this section.

Now, you can look at the history from three perspectives: from a mathematical point of view, from a holistic point of view of programming languages, and from the point of view of the R language itself. You’ll cover all three in what follows!

### History of the Pipe Operator in R

#### Mathematical History

If you have two functions, let’s say f: B → C and g: A → B, you can chain these functions together by taking the output of one function and inserting it into the next. In short, “chaining” means that you pass an intermediate result onto the next function, but you’ll see more about that later.

For example, in f(g(x)), g(x) serves as an input for f(), while x, of course, serves as input to g().

If you would want to note this down, you would use the notation f ∘ g, which reads as “f follows g”. Alternatively, you can visually represent this as:

Image Credit: James Balamuta, “Piping Data”
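The same idea can be sketched directly in R (the function names f, g and h here are illustrative, not from the tutorial):

```r
f <- sin
g <- cos

# f(g(x)): the output of g serves as the input of f
x <- 1
f(g(x))

# The composition "f follows g" as a single reusable function
h <- function(x) f(g(x))
h(x)
```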

#### Pipe Operators in Other Programming Languages

As mentioned in the introduction to this section, this operator is not new in programming: in the Shell or Terminal, you can pass the output of one command to the next with the pipeline character |. Similarly, F# has a forward pipe operator, which will prove to be important later on! Lastly, it’s also good to know that Haskell contains many piping operations that are derived from the Shell or Terminal.

#### Pipes in R

Now that you have seen some history of the pipe operator in other programming languages, it’s time to focus on R. The history of this operator in R starts, according to this fantastic blog post written by Adolfo Álvarez, on January 17th, 2012, when an anonymous user asked the following question in this Stack Overflow post:

How can you implement F#’s forward pipe operator in R? The operator makes it possible to easily chain a sequence of calculations. For example, when you have an input data and want to call functions foo and bar in sequence, you can write data |> foo |> bar?

The answer came from Ben Bolker, professor at McMaster University, who replied:

I don’t know how well it would hold up to any real use, but this seems (?) to do what you want, at least for single-argument functions …

"%>%" <- function(x,f) do.call(f,list(x))
pi %>% sin
[1] 1.224606e-16
pi %>% sin %>% cos
[1] 1
cos(sin(pi))
[1] 1


About nine months later, Hadley Wickham started the dplyr package on GitHub. You might now know Hadley, Chief Scientist at RStudio, as the author of many popular R packages (such as this last package!) and as the instructor for DataCamp’s Writing Functions in R course.

Be that as it may, it wasn’t until 2013 that the first pipe %.% appeared in this package. As Adolfo Álvarez rightfully mentions in his blog post, it was accompanied by the chain() function, whose purpose was to simplify the notation for applying several functions to a single data frame in R.

The %.% pipe would not be around for long, as Stefan Bache proposed an alternative on the 29th of December 2013, that included the operator as you might now know it:

iris %>%
  subset(Sepal.Length > 5) %>%
  aggregate(. ~ Species, ., mean)


Bache continued to work with this pipe operation and at the end of 2013, the magrittr package came to being. In the meantime, Hadley Wickham continued to work on dplyr and in April 2014, the %.% operator got replaced with the one that you now know, %>%.

Later that year, Kun Ren published the pipeR package on GitHub, which incorporated a different pipe operator, %>>%, which was designed to add more flexibility to the piping process. However, it’s safe to say that the %>% is now established in the R language, especially with the recent popularity of the Tidyverse.

### What Is It?

Knowing the history is one thing, but that still doesn’t give you an idea of what F#’s forward pipe operator is nor what it actually does in R.

In F#, the pipe-forward operator |> is syntactic sugar for chained method calls. Or, stated more simply, it lets you pass an intermediate result onto the next function.

Remember that “chaining” means that you invoke multiple method calls. As each method returns an object, you can actually allow the calls to be chained together in a single statement, without needing variables to store the intermediate results.

In R, the pipe operator is, as you have already seen, %>%. If you’re not familiar with F#, you can think of this operator as being similar to the + in a ggplot2 statement. Its function is very similar to that of the F# operator: it takes the output of one statement and makes it the input of the next statement. When describing it, you can think of it as a “THEN”.

Take, for example, following code chunk and read it aloud:

iris %>%
  subset(Sepal.Length > 5) %>%
  aggregate(. ~ Species, ., mean)


You’re right, the code chunk above will translate to something like “you take the Iris data, then you subset the data and then you aggregate the data”.

This is one of the most powerful things about the Tidyverse. In fact, having a standardized chain of processing actions is called “a pipeline”. Making pipelines for a data format is great, because you can apply that pipeline to incoming data that has the same formatting and have it output in a ggplot2-friendly format, for example.

### Why Use It?

R is a functional language, which means that your code often contains a lot of parentheses, ( and ). When you have complex code, this often means that you will have to nest those parentheses together, which makes your R code hard to read and understand. Here’s where %>% comes to the rescue!

Take a look at the following example, which is a typical example of nested code:

# Initialize x
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

# Compute the logarithm of x, return suitably lagged and iterated differences,
# compute the exponential function and round the result
round(exp(diff(log(x))), 1)

[1]  3.3  1.8  1.6  0.5  0.3  0.1 48.8  1.1

With the help of %>%, you can rewrite the above code as follows:

# Import magrittr
library(magrittr)

# Perform the same computations on x as above
x %>%
  log() %>%
  diff() %>%
  exp() %>%
  round(1)


Does this seem difficult to you? No worries! You’ll learn more on how to go about this later on in this tutorial.

Note that you need to import the magrittr library to get the above code to work. That’s because the pipe operator is, as you read above, part of the magrittr library and is, since 2014, also a part of dplyr. If you forget to import the library, you’ll get an error like Error in eval(expr, envir, enclos): could not find function "%>%".

Also note that it isn’t a formal requirement to add the parentheses after log, diff and exp, but within the R community, some add them to increase the readability of the code.

In short, here are four reasons why you should be using pipes in R:

• You’ll structure the sequence of your data operations from left to right, as opposed to from the inside out;
• You’ll avoid nested function calls;
• You’ll minimize the need for local variables and function definitions; And
• You’ll make it easy to add steps anywhere in the sequence of operations.

These reasons are taken from the magrittr documentation itself. Implicitly, you see the arguments of readability and flexibility returning.

### Additional Pipes

Even though %>% is the (main) pipe operator of the magrittr package, there are a couple of other operators that you should know and that are part of the same package:

• The compound assignment operator %<>%;
# Initialize x
x <- rnorm(100)

# Update value of x and assign it to x
x %<>% abs %>% sort

• The tee operator %T>%;
rnorm(200) %>%
  matrix(ncol = 2) %T>%
  plot %>%
  colSums


Note that it’s good to know for now that the above code chunk is actually a shortcut for:

rnorm(200) %>%
  matrix(ncol = 2) %T>%
  { plot(.); . } %>%
  colSums


But you’ll see more about that later on!

• The exposition pipe operator %$%;
data.frame(z = rnorm(100)) %$%
  ts.plot(z)


Of course, these three operators work slightly differently than the main %>% operator. You’ll see more about their functionalities and their usage later on in this tutorial!

Note that, even though you’ll most often see the magrittr pipes, you might also encounter other pipes as you go along! Some examples are wrapr‘s dot arrow pipe %.>% or to dot pipe %>.%, or the Bizarro pipe ->.;.

## How to Use Pipes in R

Now that you know how the %>% operator originated, what it actually is and why you should use it, it’s time for you to discover how you can actually use it to your advantage. You will see that there are quite some ways in which you can use it!

### Basic Piping

Before you go into the more advanced usages of the operator, it’s good to first take a look at the most basic examples that use the operator. In essence, you’ll see that there are 3 rules that you can follow when you’re first starting out:

• f(x) can be rewritten as x %>% f

In short, this means that functions that take one argument, function(argument), can be rewritten as follows: argument %>% function(). Take a look at the following, more practical example to understand how these two are equivalent:

# Compute the logarithm of x
log(x)

# Compute the logarithm of x
x %>% log()

• f(x, y) can be rewritten as x %>% f(y)

Of course, there are a lot of functions that don’t take just one argument, but multiple. This is the case here: you see that the function takes two arguments, x and y. Similar to what you have seen in the first example, you can rewrite the function by following the structure argument1 %>% function(argument2), where argument1 is piped in as the function’s first argument and argument2 is supplied inside the call.

This all seems quite theoretical. Let’s take a look at a more practical example:

# Round pi
round(pi, 6)

# Round pi
pi %>% round(6)

• x %>% f %>% g %>% h can be rewritten as h(g(f(x)))

This might seem complex, but it isn’t quite like that when you look at a real-life R example:

# Import babynames data
library(babynames)
# Import dplyr library
library(dplyr)

# Load the data
data(babynames)

# Count how many young boys with the name "Taylor" are born
sum(select(filter(babynames,sex=="M",name=="Taylor"),n))

# Do the same but now with %>%
babynames %>%
  filter(sex == "M", name == "Taylor") %>%
  select(n) %>%
  sum


Note how you work from the inside out when you rewrite the nested code: you first put in the babynames, then you use %>% to first filter() the data. After that, you’ll select n and lastly, you’ll sum() everything.

Remember also that you already saw another example of such nested code being converted to more readable code at the beginning of this tutorial, where you used the log(), diff(), exp() and round() functions to perform calculations on x.

#### Functions that Use the Current Environment

Unfortunately, there are some exceptions to the more general rules that were outlined in the previous section. Let’s take a look at some of them here.

Consider this example, where you use the assign() function to assign the value 10 to the variable x.

# Assign 10 to x
assign("x", 10)

# Assign 100 to x
"x" %>% assign(100)

# Return x
x


10

You see that the second call with the assign() function, in combination with the pipe, doesn’t work properly. The value of x is not updated.

Why is this?

That’s because the function assigns the new value 100 to a temporary environment used by %>%. So, if you want to use assign() with the pipe, you must be explicit about the environment:

# Define your environment
env <- environment()

# Add the environment to assign()
"x" %>% assign(100, envir = env)

# Return x
x


100

#### Functions with Lazy Evaluation

Arguments within functions are only computed when the function uses them in R. This means that no arguments are computed before you call your function! It also means that the pipe computes each element of the function in turn.

One place that this is a problem is tryCatch(), which lets you capture and handle errors, like in this example:

tryCatch(stop("!"), error = function(e) "An error")

stop("!") %>%
tryCatch(error = function(e) "An error")


‘An error’

Error in eval(expr, envir, enclos): !
Traceback:

1. stop("!") %>% tryCatch(error = function(e) "An error")

2. eval(lhs, parent, parent)

3. eval(expr, envir, enclos)

4. stop("!")


You’ll see that the nested way of writing down this line of code works perfectly, while the piped alternative returns an error. Other functions with the same behavior are try(), suppressMessages(), and suppressWarnings() in base R.

### Argument Placeholder

There are also instances where you can use the pipe operator as an argument placeholder. Take a look at the following examples:

• f(x, y) can be rewritten as y %>% f(x, .)

In some cases, you won’t want the piped value to land in the first argument position of the function call, which has been the case in every example that you have seen up until now. Reconsider this line of code:

pi %>% round(6)


If you would rewrite this line of code, pi would be the first argument in your round() function. But what if you want the piped value to fill the second, third, … argument of the function call instead? That’s where the magrittr placeholder . comes in.

Take a look at this example, where the value is actually at the third position in the function call:

"Ceci n'est pas une pipe" %>% gsub("une", "un", .)


‘Ceci n\’est pas un pipe’

• f(y, z = x) can be rewritten as x %>% f(y, z = .)

Likewise, you might want to make the value of a specific argument within your function call the magrittr placeholder. Consider the following line of code:

6 %>% round(pi, digits=.)


### Re-using the Placeholder for Attributes

It is straightforward to use the placeholder several times in a right-hand side expression. However, when the placeholder only appears in a nested expression, magrittr will still apply the first-argument rule. The reason is that in most cases this results in cleaner code.

Here are some general “rules” that you can take into account when you’re working with argument placeholders in nested function calls:

• f(x, y = nrow(x), z = ncol(x)) can be rewritten as x %>% f(y = nrow(.), z = ncol(.))
# Initialize a matrix ma
ma <- matrix(1:12, 3, 4)

# Return the maximum of the values inputted
max(ma, nrow(ma), ncol(ma))

# Return the maximum of the values inputted
ma %>% max(nrow(.), ncol(.))


12

12

The behavior can be overruled by enclosing the right-hand side in braces:

• f(y = nrow(x), z = ncol(x)) can be rewritten as x %>% {f(y = nrow(.), z = ncol(.))}
# Only return the maximum of the nrow(ma) and ncol(ma) input values
ma %>% {max(nrow(.), ncol(.))}


4

To conclude, also take a look at the following example, where you could possibly want to adjust the workings of the argument placeholder in the nested function call:

# The function that you want to rewrite
paste(1:5, letters[1:5])

# The nested function call with dot placeholder
1:5 %>%
paste(., letters[.])

1. ‘1 a’
2. ‘2 b’
3. ‘3 c’
4. ‘4 d’
5. ‘5 e’
1. ‘1 a’
2. ‘2 b’
3. ‘3 c’
4. ‘4 d’
5. ‘5 e’

You see that if the placeholder is only used in a nested function call, the magrittr placeholder will also be placed as the first argument! If you want to prevent this from happening, you can use the curly brackets { and }:

# The nested function call with dot placeholder and curly brackets
1:5 %>% {
paste(letters[.])
}

# Rewrite the above function call
paste(letters[1:5])

1. ‘a’
2. ‘b’
3. ‘c’
4. ‘d’
5. ‘e’
1. ‘a’
2. ‘b’
3. ‘c’
4. ‘d’
5. ‘e’

### Building Unary Functions

Unary functions are functions that take one argument. Any pipeline that starts with the dot placeholder ., followed by functions chained together with %>%, can be saved and applied later to whatever values you like. Take a look at the following example of such a pipeline:

. %>% cos %>% sin


This pipeline would take some input, after which both the cos() and sin() functions would be applied to it.

But you’re not there yet! If you want this pipeline to do exactly that which you have just read, you need to assign it first to a variable f, for example. After that, you can re-use it later to do the operations that are contained within the pipeline on other values.

# Unary function
f <- . %>% cos %>% sin

f

structure(function (value)
freduce(value, _function_list), class = c("fseq", "function"
))

Remember also that you could put parentheses after the cos() and sin() functions in the line of code if you want to improve readability. Consider the same example with parentheses: . %>% cos() %>% sin().

You see, building functions in magrittr is very similar to building functions with base R! If you’re not sure how similar they actually are, check out the line above and compare it with the next line of code; both lines have the same result!

# is equivalent to
f <- function(.) sin(cos(.))

f

function (.)
sin(cos(.))
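To see the unary function in action, you can call it like any other function. Here is a quick sketch, assuming the magrittr package is installed and loaded:

```r
library(magrittr)

# Build the unary function from a dot-initiated pipeline
f <- . %>% cos() %>% sin()

# Applying f to 0 computes sin(cos(0)), i.e. sin(1)
f(0)
```

The call f(0) returns the same value as sin(cos(0)), confirming the equivalence with the base R definition above.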

### Compound Assignment Pipe Operations

There are situations where you want to overwrite the value of the left-hand side, just like in the example right below. Intuitively, you will use the assignment operator <- to do this.

# Load in the Iris data
iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE)

# Add column names to the Iris data
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")

# Compute the square root of iris$Sepal.Length and assign it to the variable
iris$Sepal.Length <- iris$Sepal.Length %>% sqrt()


However, there is a compound assignment pipe operator %<>%, which allows you to use a shorthand notation to assign the result of your pipeline immediately to the left-hand side:

# Compute the square root of iris$Sepal.Length and assign it to the variable
iris$Sepal.Length %<>% sqrt

# Return Sepal.Length
iris$Sepal.Length


Note that the compound assignment operator %<>% needs to be the first pipe operator in the chain for this to work. This is completely in line with what you just read about the operator being a shorthand notation for a longer notation with repetition, where you use the regular <- assignment operator.

As a result, this operator will assign a result of a pipeline rather than returning it.
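As a minimal sketch of this behavior on a plain vector (assuming the magrittr package is installed and loaded):

```r
library(magrittr)

# A numeric vector
x <- c(1, 4, 9)

# %<>% pipes x into sqrt() and assigns the result back to x in one step
x %<>% sqrt

# x now holds 1 2 3
x
```

This is exactly equivalent to writing x <- x %>% sqrt(), just shorter.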

### Tee Operations with The Tee Operator

The tee operator works exactly like %>%, but it returns the left-hand side value rather than the potential result of the right-hand side operations.

This means that the tee operator can come in handy in situations where you have included functions that are used for their side effect, such as plotting with plot() or printing to a file.

In other words, functions like plot() typically don’t return anything. That means that, after calling plot(), for example, your pipeline would end. However, in the following example, the tee operator %T>% allows you to continue your pipeline even after you have used plot():

set.seed(123)
rnorm(200) %>%
matrix(ncol = 2) %T>%
plot %>%
colSums


### Exposing Data Variables with the Exposition Operator

When you’re working with R, you’ll find that many functions take a data argument. Consider, for example, the lm() function or the with() function. These functions are useful in a pipeline where your data is first processed and then passed into the function.

For functions that don’t have a data argument, such as the cor() function, it’s still handy if you can expose the variables in the data. That’s where the %$% operator comes in. Consider the following example:

iris %>%
subset(Sepal.Length > mean(Sepal.Length)) %$%
cor(Sepal.Length, Sepal.Width)


0.336696922252551

With the help of %$%, you make sure that Sepal.Length and Sepal.Width are exposed to cor(). Likewise, you see that the data in the data.frame() function is passed to ts.plot() to plot several time series on a common plot:

data.frame(z = rnorm(100)) %$%
ts.plot(z)


## dplyr and magrittr

In the introduction to this tutorial, you already learned that the development of dplyr and magrittr occurred around the same time, namely, around 2013-2014. And, as you have read, the magrittr package is also part of the Tidyverse.

In this section, you will discover how exciting it can be when you combine both packages in your R code.

For those of you who are new to the dplyr package, you should know that this R package was built around five verbs, namely, “select”, “filter”, “arrange”, “mutate” and “summarize”. If you have already manipulated data for some data science project, you will know that these verbs make up the majority of the data manipulation tasks that you generally need to perform on your data.

Take an example of some traditional code that makes use of these dplyr functions:

library(hflights)

grouped_flights <- group_by(hflights, Year, Month, DayofMonth)
flights_data <- select(grouped_flights, Year:DayofMonth, ArrDelay, DepDelay)
summarized_flights <- summarise(flights_data,
arr = mean(ArrDelay, na.rm = TRUE),
dep = mean(DepDelay, na.rm = TRUE))
final_result <- filter(summarized_flights, arr > 30 | dep > 30)

final_result

Year Month DayofMonth arr dep
2011 2 4 44.08088 47.17216
2011 3 3 35.12898 38.20064
2011 3 14 46.63830 36.13657
2011 4 4 38.71651 27.94915
2011 4 25 37.79845 22.25574
2011 5 12 69.52046 64.52039
2011 5 20 37.02857 26.55090
2011 6 22 65.51852 62.30979
2011 7 29 29.55755 31.86944
2011 9 29 39.19649 32.49528
2011 10 9 61.90172 59.52586
2011 11 15 43.68134 39.23333
2011 12 29 26.30096 30.78855
2011 12 31 46.48465 54.17137

When you look at this example, you immediately understand why dplyr and magrittr are able to work so well together:

hflights %>%
group_by(Year, Month, DayofMonth) %>%
select(Year:DayofMonth, ArrDelay, DepDelay) %>%
summarise(arr = mean(ArrDelay, na.rm = TRUE), dep = mean(DepDelay, na.rm = TRUE)) %>%
filter(arr > 30 | dep > 30)


Both code chunks are fairly long, but you could argue that the second code chunk is more clear if you want to follow along through all of the operations. With the creation of intermediate variables in the first code chunk, you could possibly lose the “flow” of the code. By using %>%, you gain a more clear overview of the operations that are being performed on the data!

In short, dplyr and magrittr are your dreamteam for manipulating data in R!

## RStudio Keyboard Shortcuts for Pipes

Adding all these pipes to your R code can be a challenging task! To make your life easier, John Mount, co-founder and Principal Consultant at Win-Vector, LLC and DataCamp instructor, has released a package with some RStudio add-ins that allow you to create keyboard shortcuts for pipes in R. Addins are actually R functions with a bit of special registration metadata. An example of a simple addin can, for example, be a function that inserts a commonly used snippet of text, but can also get very complex!

With these addins, you’ll be able to execute R functions interactively from within the RStudio IDE, either by using keyboard shortcuts or by going through the Addins menu.

Note that this package is actually a fork from RStudio’s original add-in package, which you can find here. Be careful though, the support for addins is available only within the most recent release of RStudio! If you want to know more on how you can install these RStudio addins, check out this page.

You can download the add-ins and keyboard shortcuts here.

## When Not To Use the Pipe Operator in R

In the above, you have seen that pipes are definitely something that you should be using when you’re programming with R. More specifically, you have seen this by covering some cases in which pipes prove to be very useful! However, there are some situations, outlined by Hadley Wickham in “R for Data Science”, in which it is best to avoid them:

• Your pipes are longer than (say) ten steps.

In cases like these, it’s better to create intermediate objects with meaningful names. It will not only be easier for you to debug your code, but you’ll also understand your code better and it’ll be easier for others to understand your code.

• You have multiple inputs or outputs.

If you aren’t transforming one primary object, but two or more objects are combined together, it’s better not to use the pipe.

• You are starting to think about a directed graph with a complex dependency structure.

Pipes are fundamentally linear and expressing complex relationships with them will only result in complex code that will be hard to read and understand.

• You’re doing internal package development

Using pipes in internal package development is a no-go, as it makes it harder to debug!

For more reflections on this topic, check out this Stack Overflow discussion. Other situations that appear in that discussion are loops, package dependencies, argument order and readability.

In short, you could summarize it all as follows: keep the two things in mind that make this construct so great, namely, readability and flexibility. As soon as one of these two big advantages is compromised, you might consider some alternatives in favor of the pipes.

## Alternatives to Pipes in R

After all that you have read by now, you might also be interested in some alternatives that exist in the R programming language. Some of the solutions that you have seen in this tutorial were the following:

• Create intermediate variables with meaningful names;

Instead of chaining all operations together and outputting one single result, break up the chain and make sure you save intermediate results in separate variables. Be careful with the naming of these variables: the goal should always be to make your code as understandable as possible!
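For instance, a nested call such as round(exp(diff(log(x)))), which appeared earlier in this tutorial, can be broken up into named steps. This is a base R sketch; the vector x and the variable names are purely illustrative:

```r
# Some example data to transform
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

# Each step gets a meaningful name, making it easy to inspect and debug
logged        <- log(x)
differenced   <- diff(logged)
exponentiated <- exp(differenced)
rounded       <- round(exponentiated, 1)

rounded
```

Each intermediate object can now be printed or tested on its own, which is exactly the debugging benefit described above.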

• Nest your code so that you read it from the inside out;

One of the possible objections that you could have against pipes is the fact that they go against the “flow” that you have been accustomed to with base R. The solution is then to stick with nesting your code! But what to do if you don’t like pipes but also find nesting quite confusing? The solution here can be to use tabs to highlight the hierarchy.
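As a small sketch of nested code with the hierarchy made visible through indentation (base R only):

```r
x <- 1:10

# Read from the inside out: sqrt, then mean, then round;
# each level of nesting gets its own level of indentation
result <- round(
  mean(
    sqrt(x)
  ),
  2
)

result
#> [1] 2.25
```

The indentation makes the order of operations obvious even without pipes.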

• … Do you have more suggestions? Make sure to let me know – Drop me a tweet @willems_karlijn

## Conclusion

You have covered a lot of ground in this tutorial: you have seen where %>%comes from, what it exactly is, why you should use it and how you should use it. You’ve seen that the dplyr and magrittr packages work wonderfully together and that there are even more operators out there! Lastly, you have also seen some cases in which you shouldn’t use it when you’re programming in R and what alternatives you can use in such cases.

If you’re interested in learning more about the Tidyverse, consider DataCamp’s Introduction to the Tidyverse course.

## Five Tips to Improve Your R Code (article) by DataCamp

Five useful tips that you can use to effectively improve your R code, from using seq() to create sequences to ditching which() and much more!

@drsimonj here with five simple tricks I find myself sharing all the time with fellow R users to improve their code!

## 1. More fun to sequence from 1

Next time you use the colon operator to create a sequence from 1 like 1:n, try seq().

# Sequence a vector
x <- runif(10)
seq(x)
#>  [1]  1  2  3  4  5  6  7  8  9 10

# Sequence an integer
seq(nrow(mtcars))
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
#> [24] 24 25 26 27 28 29 30 31 32


The colon operator can produce unexpected results that can create all sorts of problems without you noticing! Take a look at what happens when you want to sequence the length of an empty vector:

# Empty vector
x <- c()

1:length(x)
#> [1] 1 0

seq(x)
#> integer(0)


You’ll also notice that this saves you from using functions like length(). When applied to an object of a certain length, seq() will automatically create a sequence from 1 to the length of the object.
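One caveat is worth knowing: when seq() is given a vector of length one, it treats the value as a length and behaves like 1:n instead of sequencing positions. If your vector might contain a single number, seq_along() is the unambiguous alternative (a short sketch):

```r
# A vector that happens to contain a single number
x <- c(10)

# seq() sees one number and produces the sequence 1:10
seq(x)
#> [1]  1  2  3  4  5  6  7  8  9 10

# seq_along() always sequences the positions, so it returns 1
seq_along(x)
#> [1] 1
```

For empty vectors both seq_along(x) and seq(x) return integer(0), so either is safe there; the length-one case is where they differ.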

## 2. vector() what you c()

Next time you create an empty vector with c(), try to replace it with vector("type", length).

# A numeric vector with 5 elements
vector("numeric", 5)
#> [1] 0 0 0 0 0

# A character vector with 3 elements
vector("character", 3)
#> [1] "" "" ""


Doing this improves memory usage and increases speed! You often know upfront what type of values will go into a vector, and how long the vector will be. Using c() means R has to slowly work both of these things out. So help give it a boost with vector()!

A good example of this value is in a for loop. People often write loops by declaring an empty vector and growing it with c() like this:

x <- c()
for (i in seq(5)) {
x <- c(x, i)
}

#> x at step 1 : 1
#> x at step 2 : 1, 2
#> x at step 3 : 1, 2, 3
#> x at step 4 : 1, 2, 3, 4
#> x at step 5 : 1, 2, 3, 4, 5


Instead, pre-define the type and length with vector(), and reference positions by index, like this:

n <- 5
x <- vector("integer", n)
for (i in seq(n)) {
x[i] <- i
}

#> x at step 1 : 1, 0, 0, 0, 0
#> x at step 2 : 1, 2, 0, 0, 0
#> x at step 3 : 1, 2, 3, 0, 0
#> x at step 4 : 1, 2, 3, 4, 0
#> x at step 5 : 1, 2, 3, 4, 5


Here’s a quick speed comparison:

n <- 1e5

x_empty <- c()
system.time(for(i in seq(n)) x_empty <- c(x_empty, i))
#>    user  system elapsed
#>  15.238   2.327  17.650

x_zeros <- vector("integer", n)
system.time(for(i in seq(n)) x_zeros[i] <- i)
#>    user  system elapsed
#>   0.007   0.000   0.007


That should be convincing enough!

## 3. Ditch the which()

Next time you use which(), try to ditch it! People often use which() to get indices from some boolean condition, and then select values at those indices. This is not necessary.

Getting vector elements greater than 5:

x <- 3:7

# Using which (not necessary)
x[which(x > 5)]
#> [1] 6 7

# No which
x[x > 5]
#> [1] 6 7


Or counting number of values greater than 5:

# Using which
length(which(x > 5))
#> [1] 2

# Without which
sum(x > 5)
#> [1] 2


Why should you ditch which()? It’s often unnecessary and boolean vectors are all you need.

For example, R lets you select elements flagged as TRUE in a boolean vector:

condition <- x > 5
condition
#> [1] FALSE FALSE FALSE  TRUE  TRUE
x[condition]
#> [1] 6 7


Also, when combined with sum() or mean(), boolean vectors can be used to get the count or proportion of values meeting a condition:

sum(condition)
#> [1] 2
mean(condition)
#> [1] 0.4


which() tells you the indices of TRUE values:

which(condition)
#> [1] 4 5


And while the results are not wrong, it’s just not necessary. For example, I often see people combining which() and length() to test whether any or all values are TRUE. Instead, you just need any() or all():

x <- c(1, 2, 12)

# Using which() and length() to test if any values are greater than 10
if (length(which(x > 10)) > 0)
print("At least one value is greater than 10")
#> [1] "At least one value is greater than 10"

# Wrapping a boolean vector with any()
if (any(x > 10))
print("At least one value is greater than 10")
#> [1] "At least one value is greater than 10"

# Using which() and length() to test if all values are positive
if (length(which(x > 0)) == length(x))
print("All values are positive")
#> [1] "All values are positive"

# Wrapping a boolean vector with all()
if (all(x > 0))
print("All values are positive")
#> [1] "All values are positive"


Oh, and it saves you a little time…

x <- runif(1e8)

system.time(x[which(x > .5)])
#>    user  system elapsed
#>   1.156   0.522   1.686

system.time(x[x > .5])
#>    user  system elapsed
#>   1.071   0.442   1.662


## 4. factor that factor!

Ever removed values from a factor and found you’re stuck with old levels that don’t exist anymore? I see all sorts of creative ways to deal with this. The simplest solution is often just to wrap it in factor() again.

This example creates a factor with four levels ("a", "b", "c", and "d"):

# A factor with four levels
x <- factor(c("a", "b", "c", "d"))
x
#> [1] a b c d
#> Levels: a b c d

plot(x)


If you drop all cases of one level ("d"), the level is still recorded in the factor:

# Drop all values for one level
x <- x[x != "d"]

# But we still have this level!
x
#> [1] a b c
#> Levels: a b c d

plot(x)


A super simple method for removing it is to use factor() again:

x <- factor(x)
x
#> [1] a b c
#> Levels: a b c

plot(x)


This is typically a good solution to a problem that gets a lot of people mad. So save yourself a headache and factor that factor!
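Base R also ships a helper dedicated to this exact job, droplevels(), which you might prefer when you want the intent to be explicit (a quick sketch):

```r
# Recreate the factor and drop all "d" cases
x <- factor(c("a", "b", "c", "d"))
x <- x[x != "d"]

# droplevels() removes the unused levels, just like wrapping in factor()
x <- droplevels(x)
x
#> [1] a b c
#> Levels: a b c
```

Both approaches give the same result here; droplevels() simply reads as "drop the unused levels" at a glance.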

## 5. First you get the $, then you get the power

Next time you want to extract values from a data.frame column where the rows meet a condition, specify the column with $ before the rows with [.

Say you want the horsepower (hp) for cars with 4 cylinders (cyl), using the mtcars data set. You can write either of these:

# rows first, column second - not ideal
mtcars[mtcars$cyl == 4, ]$hp
#>  [1]  93  62  95  66  52  65  97  66  91 113 109

# column first, rows second - much better
mtcars$hp[mtcars$cyl == 4]
#>  [1]  93  62  95  66  52  65  97  66  91 113 109


The tip here is to use the second approach.

But why is that?

First reason: do away with that pesky comma! When you specify rows before the column, you need to remember the comma: mtcars[mtcars$cyl == 4,]$hp. When you specify column first, this means that you’re now referring to a vector, and don’t need the comma!

Second reason: speed! Let’s test it out on a larger data frame:

# Simulate a data frame...
n <- 1e7
d <- data.frame(
a = seq(n),
b = runif(n)
)

# rows first, column second - not ideal
system.time(d[d$b > .5, ]$a)
#>    user  system elapsed
#>   0.497   0.126   0.629

# column first, rows second - much better
system.time(d$a[d$b > .5])
#>    user  system elapsed
#>   0.089   0.017   0.107


Worth it, right?

Still, if you want to hone your skills as an R data frame ninja, I suggest learning dplyr. You can get a good overview on the dplyr website or really learn the ropes with online courses like DataCamp’s Data Manipulation in R with dplyr.

## Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

## Introduction to Skewness · R Views

In previous posts here, here, and here, we spent quite a bit of time on portfolio volatility, using the standard deviation of returns as a proxy for volatility. Today we will begin a two-part series on additional statistics that aid our understanding of return dispersion: skewness and kurtosis. Beyond being fancy words and required vocabulary for CFA level 1, these two concepts are both important and fascinating for lovers of returns distributions. For today, we will focus on skewness.

Skewness is the degree to which returns are asymmetric around the mean. Since a normal distribution is symmetric around the mean, skewness can be taken as one measure of how returns are not distributed normally. Why does skewness matter? If portfolio returns are right, or positively, skewed, it implies numerous small negative returns and a few large positive returns. If portfolio returns are left, or negatively, skewed, it implies numerous small positive returns and few large negative returns. The phrase “large negative returns” should trigger Pavlovian sweating for investors, even if it’s preceded by a diminutive modifier like “just a few”. For a portfolio manager, a negatively skewed distribution of returns implies a portfolio at risk of rare but large losses. This makes us nervous and is a bit like saying, “I’m healthy, except for my occasional massive heart attack.”

Let’s get to it.

First, have a look at one equation for skewness:

$$
\text{Skew} = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{3/2}}
$$
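The equation maps directly onto base R. Here is a quick by-hand sketch on a small, made-up vector of returns (the numbers are purely illustrative):

```r
# A hypothetical vector of monthly returns
x <- c(-0.02, 0.01, 0.03, -0.05, 0.02)
n <- length(x)

# Numerator: the third central moment; denominator: (variance)^(3/2)
skew <- (sum((x - mean(x))^3) / n) /
  ((sum((x - mean(x))^2) / n)^(3/2))

# The single large negative return pulls the skew below zero
skew
```

Note that the one large negative value (-0.05) is enough to make this toy series negatively skewed.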

Skew has important substantive implications for risk, and is also a concept that lends itself to data visualization. In fact, I find the visualizations of skewness more illuminating than the numbers themselves (though the numbers are what matter in the end). In this section, we will cover how to calculate skewness using xts and tidyverse methods, how to calculate rolling skewness, and how to create several data visualizations as pedagogical aids. We will be working with our usual portfolio consisting of:

+ SPY (S&P500 fund) weighted 25%
+ EFA (a non-US equities fund) weighted 25%
+ IJS (a small-cap value fund) weighted 20%
+ EEM (an emerging-mkts fund) weighted 20%
+ AGG (a bond fund) weighted 10%

Before we can calculate the skewness, we need to find portfolio monthly returns, which was covered in this post.

Building off that previous work, we will be working with two objects of portfolio returns:

+ portfolio_returns_xts_rebalanced_monthly (an xts of monthly returns)
+ portfolio_returns_tq_rebalanced_monthly (a tibble of monthly returns)

Let’s begin in the xts world and make use of the skewness() function from PerformanceAnalytics.

library(PerformanceAnalytics)
skew_xts <- skewness(portfolio_returns_xts_rebalanced_monthly$returns)

skew_xts

## [1] -0.1710568

Our portfolio is relatively balanced, and a slight negative skewness of -0.1710568 is unsurprising and unworrisome. However, that final number could be omitting important information and we will resist the temptation to stop there. For example, is that slight negative skew being caused by one very large negative monthly return? If so, what happened? Or is it caused by several medium-sized negative returns? What caused those? Were they consecutive? Are they seasonal? We need to investigate further.

Before doing so and having fun with data visualization, let’s explore the tidyverse methods and confirm consistent results. We will make use of the same skewness() function, but because we are using a tibble, we use summarise() as well and call summarise(skew = skewness(returns)). It’s not necessary, but we are also going to run this calculation by hand, the same as we have done with standard deviation. Feel free to delete the by-hand section from your code should this be ported to enterprise scripts, but keep in mind that there is a benefit to forcing ourselves and loved ones to write out equations: it emphasizes what those nice built-in functions are doing under the hood. If a client, customer or risk officer were ever to drill into our skewness calculations, it would be nice to have a super-firm grasp on the equation.

library(tidyverse)
library(tidyquant)

skew_tidy <-
  portfolio_returns_tq_rebalanced_monthly %>%
  summarise(skew_builtin = skewness(returns),
            skew_byhand =
              (sum((returns - mean(returns))^3) / length(returns)) /
              ((sum((returns - mean(returns))^2) / length(returns))^(3/2))) %>%
  select(skew_builtin, skew_byhand)

Let’s confirm that we have consistent calculations.

skew_xts

## [1] -0.1710568

skew_tidy$skew_builtin

## [1] -0.1710568

skew_tidy$skew_byhand

## [1] -0.1710568

The results are consistent using xts and our tidyverse, by-hand methods. Again, though, that singular number -0.1710568 does not fully illuminate the riskiness or distribution of this portfolio. To dig deeper, let’s first visualize the density of returns with stat_density from ggplot2.

portfolio_density_plot <- portfolio_returns_tq_rebalanced_monthly %>%
  ggplot(aes(x = returns)) +
  stat_density(geom = "line", alpha = 1, colour = "cornflowerblue")

portfolio_density_plot

The slight negative skew is a bit more evident here. It would be nice to shade the area that falls below some threshold; again, let’s go with the mean return. To do that, let’s create an object called shaded_area_data using ggplot_build(portfolio_density_plot)$data[[1]] %>% filter(x < mean(portfolio_returns_tq_rebalanced_monthly$returns)). That snippet will take our original ggplot object and create a new object, filtered for x values less than the mean return. Then we use geom_area to add the shaded area to portfolio_density_plot.

shaded_area_data <-
  ggplot_build(portfolio_density_plot)$data[[1]] %>%
  filter(x < mean(portfolio_returns_tq_rebalanced_monthly$returns))

portfolio_density_plot_shaded <-
  portfolio_density_plot +
  geom_area(data = shaded_area_data, aes(x = x, y = y), fill = "pink", alpha = 0.5)

portfolio_density_plot_shaded

The shaded area highlights the mass of returns that fall below the mean. Let’s add a vertical line at the mean and median, and some explanatory labels. This will help to emphasize that negative skew indicates a mean less than the median. First, create variables for the mean and median so that we can add a vertical line.

median <- median(portfolio_returns_tq_rebalanced_monthly$returns)
mean <- mean(portfolio_returns_tq_rebalanced_monthly$returns)

We want the vertical lines to just touch the density plot, so we once again use a call to ggplot_build(portfolio_density_plot)$data[[1]].

median_line_data <-
data.frame(z = rnorm(100)) %$% ts.plot(z)  Of course, these three operators work slightly differently than the main %>%operator. You’ll see more about their functionalities and their usage later on in this tutorial! Note that, even though you’ll most often see the magrittr pipes, you might also encounter other pipes as you go along! Some examples are wrapr‘s dot arrow pipe %.>% or to dot pipe %>.%, or the Bizarro pipe ->.;. ## How to Use Pipes in R Now that you know how the %>% operator originated, what it actually is and why you should use it, it’s time for you to discover how you can actually use it to your advantage. You will see that there are quite some ways in which you can use it! ### Basic Piping Before you go into the more advanced usages of the operator, it’s good to first take a look at the most basic examples that use the operator. In essence, you’ll see that there are 3 rules that you can follow when you’re first starting out: • f(x) can be rewritten as x %>% f In short, this means that functions that take one argument, function(argument), can be rewritten as follows: argument %>% function(). Take a look at the following, more practical example to understand how these two are equivalent: # Compute the logarithm of x log(x) # Compute the logarithm of x x %>% log()  • f(x, y) can be rewritten as x %>% f(y) Of course, there are a lot of functions that don’t just take one argument, but multiple. This is the case here: you see that the function takes two arguments, x and y. Similar to what you have seen in the first example, you can rewrite the function by following the structure argument1 %>% function(argument2), where argument1 is the magrittr placeholder and argument2 the function call. This all seems quite theoretical. 
Let’s take a look at a more practical example: # Round pi round(pi, 6) # Round pi pi %>% round(6)  • x %>% f %>% g %>% h can be rewritten as h(g(f(x))) This might seem complex, but it isn’t quite like that when you look at a real-life R example: # Import babynames data library(babynames) # Import dplyr library library(dplyr) # Load the data data(babynames) # Count how many young boys with the name "Taylor" are born sum(select(filter(babynames,sex=="M",name=="Taylor"),n)) # Do the same but now with %>% babynames%>%filter(sex=="M",name=="Taylor")%>% select(n)%>% sum  Note how you work from the inside out when you rewrite the nested code: you first put in the babynames, then you use %>% to first filter() the data. After that, you’ll select n and lastly, you’ll sum() everything. Remember also that you already saw another example of such a nested code that was converted to more readable code in the beginning of this tutorial, where you used the log()diff()exp() and round() functions to perform calculations on x. #### Functions that Use the Current Environment Unfortunately, there are some exceptions to the more general rules that were outlined in the previous section. Let’s take a look at some of them here. Consider this example, where you use the assign() function to assign the value 10 to the variable x. # Assign 10 to x assign("x", 10) # Assign 100 to x "x" %>% assign(100) # Return x x  10 You see that the second call with the assign() function, in combination with the pipe, doesn’t work properly. The value of x is not updated. Why is this? That’s because the function assigns the new value 100 to a temporary environment used by %>%. So, if you want to use assign() with the pipe, you must be explicit about the environment: # Define your environment env <- environment() # Add the environment to assign() "x" %>% assign(100, envir = env) # Return x x  100 #### Functions with Lazy Evalution Arguments within functions are only computed when the function uses them in R. 
This means that no arguments are computed before you call your function! That means also that the pipe computes each element of the function in turn. One place that this is a problem is tryCatch(), which lets you capture and handle errors, like in this example: tryCatch(stop("!"), error = function(e) "An error") stop("!") %>% tryCatch(error = function(e) "An error")  ‘An error’ Error in eval(expr, envir, enclos): ! Traceback: 1. stop("!") %>% tryCatch(error = function(e) "An error") 2. eval(lhs, parent, parent) 3. eval(expr, envir, enclos) 4. stop("!")  You’ll see that the nested way of writing down this line of code works perfectly, while the piped alternative returns an error. Other functions with the same behavior are try()suppressMessages(), and suppressWarnings() in base R. ### Argument Placeholder There are also instances where you can use the pipe operator as an argument placeholder. Take a look at the following examples: • f(x, y) can be rewritten as y %>% f(x, .) In some cases, you won’t want the value or the magrittr placeholder to the function call at the first position, which has been the case in every example that you have seen up until now. Reconsider this line of code: pi %>% round(6)  If you would rewrite this line of code, pi would be the first argument in your round() function. But what if you would want to replace the second, third, … argument and use that one as the magrittr placeholder to your function call? Take a look at this example, where the value is actually at the third position in the function call: "Ceci n'est pas une pipe" %>% gsub("une", "un", .)  ‘Ceci n\’est pas un pipe’ • f(y, z = x) can be rewritten as x %>% f(y, z = .) Likewise, you might want to make the value of a specific argument within your function call the magrittr placeholder. Consider the following line of code: 6 %>% round(pi, digits=.)  
### Re-using the Placeholder for Attributes It is straight-forward to use the placeholder several times in a right-hand side expression. However, when the placeholder only appears in a nested expressions magrittr will still apply the first-argument rule. The reason is that in most cases this results more clean code. Here are some general “rules” that you can take into account when you’re working with argument placeholders in nested function calls: • f(x, y = nrow(x), z = ncol(x)) can be rewritten as x %>% f(y = nrow(.), z = ncol(.)) # Initialize a matrix ma ma <- matrix(1:12, 3, 4) # Return the maximum of the values inputted max(ma, nrow(ma), ncol(ma)) # Return the maximum of the values inputted ma %>% max(nrow(ma), ncol(ma))  12 12 The behavior can be overruled by enclosing the right-hand side in braces: • f(y = nrow(x), z = ncol(x)) can be rewritten as x %>% {f(y = nrow(.), z = ncol(.))} # Only return the maximum of the nrow(ma) and ncol(ma) input values ma %>% {max(nrow(ma), ncol(ma))}  4 To conclude, also take a look at the following example, where you could possibly want to adjust the workings of the argument placeholder in the nested function call: # The function that you want to rewrite paste(1:5, letters[1:5]) # The nested function call with dot placeholder 1:5 %>% paste(., letters[.])  1. ‘1 a’ 2. ‘2 b’ 3. ‘3 c’ 4. ‘4 d’ 5. ‘5 e’ 1. ‘1 a’ 2. ‘2 b’ 3. ‘3 c’ 4. ‘4 d’ 5. ‘5 e’ You see that if the placeholder is only used in a nested function call, the magrittr placeholder will also be placed as the first argument! If you want to avoid this from happening, you can use the curly brackets { and }: # The nested function call with dot placeholder and curly brackets 1:5 %>% { paste(letters[.]) } # Rewrite the above function call paste(letters[1:5])  1. ‘a’ 2. ‘b’ 3. ‘c’ 4. ‘d’ 5. ‘e’ 1. ‘a’ 2. ‘b’ 3. ‘c’ 4. ‘d’ 5. ‘e’ ### Building Unary Functions Unary functions are functions that take one argument. 
Any pipeline that consists of a dot `.` followed by functions chained together with `%>%` can be reused later if you want to apply it to values. Take a look at the following example of such a pipeline:

```r
. %>% cos %>% sin
```

This pipeline would take some input and apply first `cos()` and then `sin()` to it. But you're not there yet! If you want this pipeline to do exactly what you have just read, you need to assign it to a variable first, `f` for example. After that, you can reuse it to perform the operations contained in the pipeline on other values.

```r
# Unary function
f <- . %>% cos %>% sin
f
```

```
structure(function (value)
freduce(value, `_function_list`), class = c("fseq", "function"))
```

Remember that you could also put parentheses after the `cos()` and `sin()` functions in this line of code to improve readability: `. %>% cos() %>% sin()`. You see, building functions in magrittr is very similar to building functions with base R! If you're not sure how similar they actually are, compare the line above with the next line of code; both have the same result!

```r
# is equivalent to
f <- function(.) sin(cos(.))
f
```

```
function (.) sin(cos(.))
```

### Compound Assignment Pipe Operations

There are situations where you want to overwrite the value of the left-hand side, just like in the example right below. Intuitively, you would use the assignment operator `<-` to do this.

```r
# Load in the Iris data
iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE)

# Add column names to the Iris data
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")

# Compute the square root of iris$Sepal.Length and assign it to the variable
iris$Sepal.Length <- iris$Sepal.Length %>%
  sqrt()
```
However, there is a compound assignment pipe operator, which allows you to use a shorthand notation to assign the result of your pipeline immediately to the left-hand side:

```r
# Compute the square root of iris$Sepal.Length and assign it to the variable
iris$Sepal.Length %<>% sqrt
```

```r
# Return Sepal.Length
iris$Sepal.Length
```

Note that the compound assignment operator `%<>%` needs to be the first pipe operator in the chain for this to work. This is completely in line with what you just read about the operator being a shorthand notation for a longer notation with repetition, where you use the regular `<-` assignment operator. As a result, this operator will assign the result of a pipeline rather than returning it.

### Tee Operations with the Tee Operator

The tee operator works exactly like `%>%`, but it returns the left-hand side value rather than the potential result of the right-hand side operations. This means that the tee operator can come in handy in situations where you have included functions that are used for their side effect, such as plotting with `plot()` or printing to a file. In other words, functions like `plot()` typically don't return anything, so after calling `plot()`, your pipeline would end. However, in the following example, the tee operator `%T>%` allows you to continue your pipeline even after you have used `plot()`:

```r
set.seed(123)
rnorm(200) %>%
  matrix(ncol = 2) %T>%
  plot %>%
  colSums
```

### Exposing Data Variables with the Exposition Operator

When you're working with R, you'll find that many functions take a `data` argument. Consider, for example, the `lm()` function or the `with()` function. These functions are useful in a pipeline where your data is first processed and then passed into the function. For functions that don't have a `data` argument, such as the `cor()` function, it's still handy if you can expose the variables in the data. That's where the `%$%` operator comes in. Consider the following example:

```r
iris %>%
  subset(Sepal.Length > mean(Sepal.Length)) %$%
  cor(Sepal.Length, Sepal.Width)
```

```
0.336696922252551
```

With the help of `%$%` you make sure that `Sepal.Length` and `Sepal.Width` are exposed to `cor()`. Likewise, you see that the data in the `data.frame()` call is passed to `ts.plot()` to plot several time series on a common plot:

```r
data.frame(z = rnorm(100)) %$% ts.plot(z)
```

## dplyr and magrittr

In the introduction to this tutorial, you already learned that the development of dplyr and magrittr occurred around the same time, namely, around 2013-2014. And, as you have read, the magrittr package is also part of the Tidyverse. In this section, you will discover how exciting it can be when you combine both packages in your R code.

For those of you who are new to the dplyr package, you should know that this R package was built around five verbs, namely, "select", "filter", "arrange", "mutate" and "summarize". If you have already manipulated data for some data science project, you will know that these verbs make up the majority of the data manipulation tasks that you generally need to perform on your data.

Take an example of some traditional code that makes use of these dplyr functions:

```r
library(hflights)

grouped_flights <- group_by(hflights, Year, Month, DayofMonth)
flights_data <- select(grouped_flights, Year:DayofMonth, ArrDelay, DepDelay)
summarized_flights <- summarise(flights_data,
                                arr = mean(ArrDelay, na.rm = TRUE),
                                dep = mean(DepDelay, na.rm = TRUE))
final_result <- filter(summarized_flights, arr > 30 | dep > 30)

final_result
```

| Year | Month | DayofMonth | arr      | dep      |
|------|-------|------------|----------|----------|
| 2011 | 2     | 4          | 44.08088 | 47.17216 |
| 2011 | 3     | 3          | 35.12898 | 38.20064 |
| 2011 | 3     | 14         | 46.63830 | 36.13657 |
| 2011 | 4     | 4          | 38.71651 | 27.94915 |
| 2011 | 4     | 25         | 37.79845 | 22.25574 |
| 2011 | 5     | 12         | 69.52046 | 64.52039 |
| 2011 | 5     | 20         | 37.02857 | 26.55090 |
| 2011 | 6     | 22         | 65.51852 | 62.30979 |
| 2011 | 7     | 29         | 29.55755 | 31.86944 |
| 2011 | 9     | 29         | 39.19649 | 32.49528 |
| 2011 | 10    | 9          | 61.90172 | 59.52586 |
| 2011 | 11    | 15         | 43.68134 | 39.23333 |
| 2011 | 12    | 29         | 26.30096 | 30.78855 |
| 2011 | 12    | 31         | 46.48465 | 54.17137 |

When you look at this example, you immediately understand why dplyr and magrittr are able to work so well together:

```r
hflights %>%
  group_by(Year, Month, DayofMonth) %>%
  select(Year:DayofMonth, ArrDelay, DepDelay) %>%
  summarise(arr = mean(ArrDelay, na.rm = TRUE),
            dep = mean(DepDelay, na.rm = TRUE)) %>%
  filter(arr > 30 | dep > 30)
```

Both code chunks are fairly
long, but you could argue that the second chunk is clearer if you want to follow along through all of the operations: with the intermediate variables in the first chunk, you can easily lose the "flow" of the code. By using `%>%`, you gain a clearer overview of the operations that are being performed on the data! In short, dplyr and magrittr are your dream team for manipulating data in R!

## RStudio Keyboard Shortcuts for Pipes

Adding all these pipes to your R code can be a challenging task! To make your life easier, John Mount, co-founder and Principal Consultant at Win-Vector, LLC and DataCamp instructor, has released a package with some RStudio add-ins that allow you to create keyboard shortcuts for pipes in R. Add-ins are actually R functions with a bit of special registration metadata. A simple add-in can, for example, be a function that inserts a commonly used snippet of text, but add-ins can also get very complex! With these add-ins, you'll be able to execute R functions interactively from within the RStudio IDE, either by using keyboard shortcuts or by going through the Addins menu.

Note that this package is actually a fork of RStudio's original add-in package, which you can find here. Be careful though: support for add-ins is available only in the most recent releases of RStudio! If you want to know more about how you can install these RStudio add-ins, check out this page. You can download the add-ins and keyboard shortcuts here.

## When Not To Use the Pipe Operator in R

In the above, you have seen that pipes are definitely something you should be using when you're programming with R. More specifically, you have seen this by covering some cases in which pipes prove to be very useful! However, there are some situations, outlined by Hadley Wickham in "R for Data Science", in which you can best avoid them:

• Your pipes are longer than (say) ten steps.
In cases like these, it's better to create intermediate objects with meaningful names. It will not only be easier for you to debug your code, but you'll also understand your code better and it'll be easier for others to understand it.

• You have multiple inputs or outputs.

If you aren't transforming one primary object, but combining two or more objects together, it's better not to use the pipe.

• You are starting to think about a directed graph with a complex dependency structure.

Pipes are fundamentally linear, and expressing complex relationships with them will only result in complex code that is hard to read and understand.

• You're doing internal package development.

Using pipes in internal package development is a no-go, as it makes debugging harder!

For more reflections on this topic, check out this Stack Overflow discussion. Other situations that appear in that discussion are loops, package dependencies, argument order and readability. In short, you could summarize it all as follows: keep in mind the two things that make this construct so great, namely, readability and flexibility. As soon as one of these two big advantages is compromised, you should consider some alternatives to the pipes.

## Alternatives to Pipes in R

After all that you have read, you might also be interested in some alternatives that exist in the R programming language. Some of the solutions that you have seen in this tutorial were the following:

• Create intermediate variables with meaningful names.

Instead of chaining all operations together and outputting one single result, break up the chain and make sure you save intermediate results in separate variables. Be careful with the naming of these variables: the goal should always be to make your code as understandable as possible!
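As a sketch of this first alternative, here is how a short pipeline could be broken into intermediate variables with meaningful names (the data and the variable names here are made up for illustration):

```r
library(magrittr)

measurements <- c(4, 9, 16, 25)

# Piped version: one chain, no intermediate names
result_piped <- measurements %>% sqrt() %>% log() %>% round(2)

# Alternative: intermediate variables with meaningful names
root_values  <- sqrt(measurements)
log_roots    <- log(root_values)
rounded_logs <- round(log_roots, 2)

identical(result_piped, rounded_logs)  # TRUE
```

Both versions compute the same result; the second one is easier to debug step by step, at the cost of naming each intermediate object.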
• Nest your code so that you read it from the inside out.

One of the possible objections you could have against pipes is that they go against the "flow" you have been accustomed to with base R. The solution is then to stick with nesting your code! But what if you don't like pipes and you also find nesting confusing? The solution here can be to use indentation to highlight the hierarchy.

• …

Do you have more suggestions? Make sure to let me know – drop me a tweet @willems_karlijn

## Conclusion

You have covered a lot of ground in this tutorial: you have seen where `%>%` comes from, what it exactly is, why you should use it and how you should use it. You've seen that the dplyr and magrittr packages work wonderfully together and that there are even more operators out there! Lastly, you have also seen some cases in which you shouldn't use it when you're programming in R and what alternatives you can use in such cases.

If you're interested in learning more about the Tidyverse, consider DataCamp's Introduction to the Tidyverse course.

## Data Preprocessing: Everything About Data Preprocessing (Cleansing)

Data preprocessing, which is also often called exploratory data analysis (EDA), is the most time-consuming stage of a data analysis project. According to a CrowdFlower survey cited by Forbes, data analysts spend about 80% of their working hours on data collection and preprocessing. (In the same survey, it was also voted the least-liked part of data analysis work.)

This post walks through, in order, the tasks you should perform during the data preprocessing stage. Data preprocessing proceeds in this order: checking the data set – handling missing values – handling outliers – feature engineering.

#### 1. Checking the Data Set

This is the stage where you get familiar with the data set you want to analyze. You perform the following two checks on the data set.

##### A. Checking the Variables

Check the definitions of the independent and dependent variables, the type of each variable (categorical or continuous), and the data type of each variable (Date, Character, Numeric, and so on). This applies to other tools as well, but in R, model fitting can produce completely different results depending on a variable's data type, so check the variable types in advance and correct any that are set incorrectly at this stage.

##### B. Checking the Raw Data

B-1. Univariate analysis

This is the stage where you check descriptive statistics for each individual variable.
Use a histogram or boxplot to check each variable's distribution along with its mean, mode, and median. For a categorical variable, check the frequency distribution of its categories with a bar plot.

B-2. Bivariate analysis

This is the stage where you analyze the relationship between two variables. Choose an appropriate visualization and analysis method according to the types of the two variables.

B-3. Three or more variables

It is tedious, but you may also need to visualize and analyze the relationships among three or more variables. If at least one categorical variable is involved, split the data by its categories and then apply the methods above. For example, given gender along with annual income, education, and height, you could split by gender and use a t-test to check whether annual income and education are independent, or split by education and check the correlation between annual income and height. To examine the relationship among three or more continuous variables, either convert a continuous variable into a categorical one through feature engineering, or (not recommended, but if you really need to) draw a 3D plot and inspect it visually.

#### 2. Missing Value Treatment

If you build a model while missing values remain, the relationships between variables can be distorted and the model's accuracy suffers. Missing values arise in various ways, and the appropriate treatment differs somewhat depending on whether the values are missing at random or whether their absence is related to other variables.

#### Types of Missing Value Treatment

##### A. Deletion

You can delete every observation that contains a missing value (listwise deletion), or delete only the observations with missing values among the variables actually included in the model (pairwise deletion). Listwise deletion is simple, but the reduced number of observations can weaken the model's validity; pairwise deletion increases maintenance cost because the retained variables differ from model to model. Deletion should be used only when values are missing at random: if the values are not missing at random and you delete those observations anyway, the resulting model can be distorted.

##### B. Substituting Other Values (Mean, Mode, Median)

Missing values can be replaced with the mean, mode, or median of the other observations. You can substitute a single overall value (e.g., the mean of all observations), or use a categorical variable to substitute the typical value of similar observations (e.g., if the average height is 173 cm for men and 158 cm for women, replace a missing male height with 173). Substitution can be useful when missingness is related to other variables, but in the similar-group approach the choice of which categorical variable defines "similar" is arbitrary, so the model can still be distorted.

##### C. Inserting Predicted Values

Build a model that predicts the missing values, using the observations without missing values as training data, and then use that model to predict the missing values in the remaining observations. Regression or logistic regression is typically used. This is somewhat less arbitrary than simple substitution, but when missing values occur across many variables there are few usable predictors, making it hard to fit a good model; and if the resulting model's predictive power is low, this method is hard to use.

#### 3. Outlier Treatment

An outlier is an observation that lies far from the rest of the data/sample and can distort the model.

#### Finding Outliers

An easy and simple way to find outliers is to visualize the variable's distribution. Typically you use a boxplot or histogram for a single variable, and a scatter plot to find outliers between two variables.
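As a minimal sketch of the visual approach (with a made-up salary vector), base R's boxplot.stats() reports the points a boxplot would draw as outliers, i.e. values lying more than 1.5 times the interquartile range beyond the box:

```r
# Made-up salaries with one extreme value
salary <- c(32, 35, 36, 38, 40, 41, 43, 120)

# Points a boxplot would flag as outliers
boxplot.stats(salary)$out  # 120

# Essentially the same rule, written out with quantiles
q   <- quantile(salary, c(0.25, 0.75))
iqr <- q[2] - q[1]
salary[salary < q[1] - 1.5 * iqr | salary > q[2] + 1.5 * iqr]
```

Both lines flag 120 as the only outlier here. (Note that boxplot.stats() uses Tukey's hinges rather than quantile(), so the two fences can differ slightly on other data.)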
Visual inspection is intuitive, but it is also subjective and tedious because you have to check variables one by one. Another way to find outliers between two variables is to fit a regression model between them and examine the residuals, studentized (or standardized) residuals, leverage, and Cook's D values.

#### Handling Outliers

###### A. Simple Deletion

If an outlier was caused by human error, simply delete the observation. This applies to simple typos, unrealistic answers in free-form surveys, errors introduced during data processing, and so on.

###### B. Substituting Other Values

If the absolute number of observations is small, removing outliers by deletion shrinks the data even further. In that case, even if the outlier was caused by human error, instead of deleting the observation you can replace it with another value (such as the mean), or, as with missing values, build a prediction model from the other variables, predict the outlying value, and substitute the prediction.

###### C. Converting Into a Variable

If an outlier occurred naturally, a model built after simply deleting or replacing it may fail to explain or predict the phenomenon you are interested in. For example, in the graph below, the other observations alone suggest that salary is proportional to years of experience, but including the outlier of a 5-year employee earning $35,000 greatly reduces the model's explanatory power.

For naturally occurring outliers, don't delete them right away; it is important to first take a closer look and understand them.

For example, suppose the outlier above is someone in a professional occupation, such as a doctor. In that case, you can encode professional occupation as a Yes/No variable and keep the outlier in the model instead of deleting it.

###### D. Resampling

Another way to handle naturally occurring outliers is to separate them out and build a model without them.

Suppose there is an outlier with more than 15 years of experience, as below. This observation has a long career, but a salary that has not grown proportionally.

(The difference from the previous case: there, the observation was not an outlier in the explanatory variable, i.e. experience, and only the dependent variable, salary, deviated from the prediction; here, the observation is an outlier in both the explanatory and the dependent variable.)

In this case, you can simply delete the outlier and handle it by noting that the scope of the analysis is limited to people with up to 10 years of experience.

###### E. Analyzing Cases Separately

In the same example, it may actually happen that salaries decrease when careers get very long (for health reasons, for instance).

In that case, excluding the outlier may not describe the phenomenon accurately. A better approach is to build both a model that includes the outlier and a model that excludes it, and add an explanation for each model.

If a naturally occurring outlier shows no particular peculiarity, we recommend analyzing the cases separately rather than simply excluding it.

#### 4. Feature Engineering

Feature engineering is the process of adding information to your data using the variables you already have: a set of methods for making existing data more useful without adding new observations or variables.

##### A. SCALING

Use variable transformation when you want to change a variable's unit, when a variable's distribution is skewed, or when the relationships between variables are not clearly visible.

The most frequently used transformation is the log function; taking the square root is a similar but somewhat less common alternative.
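As a small illustration with a made-up, right-skewed income vector, the log transform compresses the long right tail far more than the raw scale does:

```r
income <- c(20, 25, 30, 40, 60, 100, 400)  # made-up, right-skewed values

log_income  <- log(income)
sqrt_income <- sqrt(income)

max(income) / min(income)          # ratio of 20 on the raw scale
max(log_income) / min(log_income)  # ratio of only ~2 after the log transform
```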

##### B. BINNING

Binning converts a continuous variable into a categorical one. For example, if you have salary as a numeric value, you can convert it into a categorical variable with bins such as "under 1,000,000 KRW", "1,000,000–2,000,000 KRW", and so on.

There are no fixed rules for binning, so you can bin creatively depending on your understanding of the business.
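A minimal sketch of binning with base R's cut(), using made-up salary values (in units of 10,000 KRW); the break points are an arbitrary choice for illustration:

```r
salary <- c(80, 150, 230, 320, 95, 410)  # made-up values, in units of 10,000 KRW

# Convert the continuous salary variable into a categorical one
salary_bin <- cut(salary,
                  breaks = c(0, 100, 200, 300, Inf),
                  labels = c("under 100", "100-200", "200-300", "over 300"))
table(salary_bin)
```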

##### C. TRANSFORM

Transforming creates new variables from the properties of existing ones.

For example, if you have sales data by date, you could add a variable that splits the date into weekday/weekend; for eSports attendance data, you could add a variable indicating whether SKT T1 has a match on that day.

As with binning, there are no fixed rules for transforming, and a wide variety of variables can be created depending on the analyst's understanding of the business.
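The weekday/weekend example above can be sketched in a few lines of base R (the dates and sales figures are made up):

```r
sales <- data.frame(
  date  = as.Date(c("2018-03-02", "2018-03-03", "2018-03-04", "2018-03-05")),
  units = c(10, 25, 30, 12)
)

# Derive a weekend/weekday flag from the existing date column
# (wday is 0 for Sunday and 6 for Saturday, independent of locale)
sales$day_type <- ifelse(as.POSIXlt(sales$date)$wday %in% c(0, 6),
                         "weekend", "weekday")
sales
```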

##### D. DUMMY

The opposite of binning: dummy variables are used to convert a categorical variable into a numeric one, mainly when the analysis method you want to use requires numeric inputs.
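As a minimal sketch, base R's model.matrix() expands a categorical variable into 0/1 dummy columns (the tiny data frame is made up):

```r
df <- data.frame(gender = factor(c("M", "F", "F", "M")))

# One 0/1 column per level; dropping the intercept keeps all levels
dummies <- model.matrix(~ gender - 1, data = df)
dummies
```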

#### Closing Thoughts…

Garbage in, garbage out: checking the state of your data before model building and preprocessing it appropriately for your planned analysis is an essential step toward accurate results. It is like sourcing good ingredients and preparing them well before you cook.

I hope this post has given you the big picture of data preprocessing.

##### References

1) MeasuringU, 7 Ways To Handle Missing Data

2) Boston University Technical Report, Marina Soley-Bori, Dealing with missing data: Key assumptions and methods for applied analysis

3) R-bloggers, Imputing missing data with R; MICE package

4) Analytics Vidhya, A comprehensive guide to data exploration

5) The analysis factor, Outliers: To Drop or Not to Drop

6) Kellogg, Outliers

## How to Create Infographics in R – nandeshwar.info

What’s so special about this?

Now you may ask: “What’s so special about this?” Well, the theory supports that like regular dominoes, these dominoes could keep pushing the next ones down. Even the small, 2in domino can bring down the tallest of dominoes. You keep going on and,

• with the 15th domino, you will reach the height of one of the tallest dinosaurs, Argentinosaurus (16 m tall)
• with the 18th domino, you will reach the top of the Statue of Liberty (46 m)
• with the 40th domino, you will be at the space station (370 km)
• and by the 57th, you will reach the moon (370,000 km)

People use the expression “reach for the moon” meaning try to achieve very difficult tasks, in almost a defeating tone; however, this example empowers us to think that it IS indeed possible for a small domino to build up to reach the moon.

It might be easy for some people to sustain with whatever current knowledge they have, but we know that in the knowledge economy we must continuously improve and learn. I heard this recently from Mike Rayburn: “coasting only happens downhill.” Although it is easy to coast at a job, it will only bring us down. Another difficult, but invigorating approach is to become a “virtuoso“: mastery in the chosen field.

virtuoso
noun
a person who has a masterly or dazzling skill or technique in any field of activity

Reaching for the moon is of course extremely difficult and improbable for most of us, but the metaphor is powerful. We start with one small step, we repeat that step with increasing intensity and momentum, and we can achieve our goals. With focus, effort and momentum, it is possible to achieve even the most improbable goals. Find the one thing that will make you successful and repeat it every day with increasing intensity.

## Recipe for Infographics in R

### Ingredients

Now I feel better. I justified myself to replicate the above example in R. After I justified myself, I searched for some basics and found some fantastic threads on stackoverflow on using images in R and ggplot2.

### Hint

It was hardly obvious to me that infographics in statistics are called pictograms. Remember this when you search for information on infographics in R.

After knowing that it was possible to create infographics in R, I searched for some vector art. I found them on vecteezy.com and vector.me.

### Not lying

Edward Tufte, in his book The Visual Display of Quantitative Information, famously described how graphic designers (or let’s say data communicators) “lie” with data, especially when the objects they plot are hardly in true proportions. My challenge was thus to avoid lying and still communicate the message.

### R code

Now to the fun part! Getting our hands dirty in R when not pulling our hair dealing with R.

#### Step 1:Load my favorite libraries

```r
library('ggplot2')
library('scales')  # for number formatting
library('png')     # to read png files
library('grid')    # for expanding the plot
library('Cairo')   # for high quality plots and anti-aliasing
library('plyr')    # for easy data manipulation
```

#### Step 2: Generate data and the base plot

```r
dominoes <- data.frame(n = 1:58, height = 0.051 * 1.5^(0:57))  # 2 inches is 0.051 meters

base_plot <- qplot(x = n, y = height, data = dominoes, geom = "line")  #+ scale_y_sqrt()
base_plot <- base_plot +
  labs(x = "Sequence Number", y = "Height/Distance\n(meters)") +
  theme(axis.ticks = element_blank(),
        panel.background = element_rect(fill = "white", colour = "white"),
        legend.position = "none")
base_plot <- base_plot +
  theme(axis.title.y = element_text(angle = 0),
        axis.text = element_text(size = 18),
        axis.title = element_text(size = 20))
base_plot <- base_plot +
  theme(plot.margin = unit(c(1, 1, 18, 1), "lines")) +
  scale_y_continuous(labels = comma)
base_plot
```

### Note

Note the plot.margin argument: I increased the height of the plot by supplying unit(c(1, 1, 18, 1), "lines").

We get this plot:

#### Step 3: Read all the vector arts in a Grob form

```r
domino_img <- readPNG("domino.png")
domino_grob <- rasterGrob(domino_img, interpolate = TRUE)

eiffel_tower_img <- readPNG("eiffel-tower.png")
eiffel_tower_grob <- rasterGrob(eiffel_tower_img, interpolate = TRUE)

pisa_img <- readPNG("pisa-tower.png")
pisa_grob <- rasterGrob(pisa_img, interpolate = TRUE)

liberty_img <- readPNG("statue-of-liberty.png")
libery_grob <- rasterGrob(liberty_img, interpolate = TRUE)

long_neck_dino_img <- readPNG("dinosaur-long-neck.png")
long_neck_dino_grob <- rasterGrob(long_neck_dino_img, interpolate = TRUE)
```

#### Step 4: Line up the images without lying

```r
p <- base_plot +
  annotation_custom(eiffel_tower_grob, xmin = 20, xmax = 26, ymin = 0, ymax = 381) +
  annotation_custom(libery_grob, xmin = 17, xmax = 19, ymin = 0, ymax = 50) +
  annotation_custom(long_neck_dino_grob, xmin = 13, xmax = 17, ymin = 0, ymax = 15)

CairoPNG(filename = "domino-effect-geometric-progression.png",
         width = 1800, height = 600, quality = 90)
plot(p)
dev.off()
```

From step 4, we get this:

Shucks! All this for this boring looking graph. Not lying is not fun. Although the Argentinosaurus, statue of liberty, and Eiffel Tower all are proportionate to their heights, the plot lacks appeal. I thought the next best thing would be to place all the objects close to their values on the x-axis. Another benefit of this approach: I added some other objects that have very small and big y-axis values i.e. a domino, the space station and our moon.

#### Step 7: Put everything together

 base_plot + add_images(grob_placement) + add_texts(img_texts)   CairoPNG(filename = "domino-effect-geometric-progression-2.png", width = 1800, height = 600, quality = 90) g <- base_plot + add_images(grob_placement) + add_texts(img_texts) + annotation_custom(domino_grob, xmin = 1, xmax = 2, ymin = -1*10^8, ymax = -5*10^8) gt <- ggplot_gtable(ggplot_build(g)) gt$layout$clip[gt$layout$name == "panel"] <- "off" grid.draw(gt) dev.off() 

This is what we get. Not bad, huh?

We still have a problem: our beloved moon is smaller than the space station, because I placed all the images in rectangles of same height. I could have made the moon slightly bigger, but I could not have maintained the proportion. I thought it is better to have all the objects in similar size rectangles than changing proportions at will. If you have other ideas, please let me know.

#### Step 8: Make it pretty

And by pretty, I mean, upload the final plot to Canva and add the orange color. 🙂 Here is my final version:

There it is! It is possible to use R to create infographics or pictograms, and the obvious advantage, as I explained in my post Tableau vs. R, is a programming language's repeatability and reproducibility. You can, of course, edit the output plots in Illustrator or GIMP, but for quick wins, R's output is fantastic. Can you think of any other ideas to create infographics in R?


## Full Script

```r
# http://stackoverflow.com/questions/14113691/pictorial-chart-in-r?lq=1
# http://stackoverflow.com/questions/6797457/images-as-labels-in-a-graph?lq=1
# http://stackoverflow.com/questions/20733328/labelling-the-plots-with-images-on-graph-in-ggplot2?rq=1
# http://stackoverflow.com/questions/25014492/geom-bar-pictograms-how-to?lq=1
# http://stackoverflow.com/questions/19625328/make-the-value-of-the-fill-the-actual-fill-in-ggplot2/20196002#20196002
# http://stackoverflow.com/questions/12409960/ggplot2-annotate-outside-of-plot?lq=1
library('ggplot2')
library('scales')
library('png')
library('grid')
library('Cairo')
library('plyr')

dominoes <- data.frame(n = 1:58, height = 0.051 * 1.5^(0:57))  # 2 inches is 0.051 meters
base_plot <- qplot(x = n, y = height, data = dominoes, geom = "line")  #+ scale_y_sqrt()
base_plot <- base_plot +
  labs(x = "Sequence Number", y = "Height/Distance\n(meters)") +
  theme(axis.ticks = element_blank(),
        panel.background = element_rect(fill = "white", colour = "white"),
        legend.position = "none")
base_plot <- base_plot +
  theme(axis.title.y = element_text(angle = 0),
        axis.text = element_text(size = 18),
        axis.title = element_text(size = 20))
base_plot <- base_plot +
  theme(plot.margin = unit(c(1, 1, 18, 1), "lines")) +
  scale_y_continuous(labels = comma)
base_plot

domino_img <- readPNG("domino.png")
domino_grob <- rasterGrob(domino_img, interpolate = TRUE)

eiffel_tower_img <- readPNG("eiffel-tower.png")
eiffel_tower_grob <- rasterGrob(eiffel_tower_img, interpolate = TRUE)

pisa_img <- readPNG("pisa-tower.png")
pisa_grob <- rasterGrob(pisa_img, interpolate = TRUE)

liberty_img <- readPNG("statue-of-liberty.png")
libery_grob <- rasterGrob(liberty_img, interpolate = TRUE)

long_neck_dino_img <- readPNG("dinosaur-long-neck.png")
long_neck_dino_grob <- rasterGrob(long_neck_dino_img, interpolate = TRUE)

# space station is 370,149.120 meters

# this version tries to scale images by their heights
p <- base_plot +
  annotation_custom(eiffel_tower_grob, xmin = 20, xmax = 26, ymin = 0, ymax = 381) +
  annotation_custom(libery_grob, xmin = 17, xmax = 19, ymin = 0, ymax = 50) +
  annotation_custom(long_neck_dino_grob, xmin = 13, xmax = 17, ymin = 0, ymax = 15)

CairoPNG(filename = "domino-effect-geometric-progression.png",
         width = 1800, height = 600, quality = 90)
plot(p)
dev.off()

# this version just places a picture at the number
grob_placement <- data.frame(imgname = c("dinosaur-long-neck.png",
                                         "statue-of-liberty.png",
                                         "eiffel-tower.png",
                                         "space-station.png",
                                         "moon.png"),
                             xmins = c(13, 17, 20, 38, 53),
                             ymins = rep(-1 * 10^8, 5),
                             ymaxs = rep(-4.5 * 10^8, 5),
                             stringsAsFactors = FALSE)
grob_placement$xmaxs <- grob_placement$xmins + 4

# make a function to create the grobs and call the annotation_custom function
add_images <- function(df) {
  dlply(df, .(imgname), function(df) {
    img <- readPNG(unique(df$imgname))
    grb <- rasterGrob(img, interpolate = TRUE)
    annotation_custom(grb,
                      xmin = df$xmins, xmax = df$xmax,
                      ymin = df$ymins, ymax = df$ymaxs)
  })
}

img_texts <- data.frame(imgname = c("domino", "dino", "space-station", "moon"),
                        xs = c(1, 13, 38, 53),
                        ys = rep(-5.2 * 10^8, 4),
                        texts = c("1st domino is\nonly 2in",
                                  "15th domino will reach Argentinosaurus (16m).\nBy 18th domino, you will reach the statue of liberty (46m).\n23 domino will be taller than the Eiffel Tower (300m)",
                                  "40th domino will\nreach the ISS (370km)",
                                  "57th domino will\nreach the moon (370,000km)"))

add_texts <- function(df) {
  dlply(df, .(imgname), function(df) {
    annotation_custom(grob = textGrob(label = df$texts, hjust = 0),
                      xmin = df$xs, xmax = df$xs, ymin = df$ys, ymax = df$ys)
  })
}

base_plot + add_images(grob_placement) + add_texts(img_texts)

CairoPNG(filename = "domino-effect-geometric-progression-2.png",
         width = 1800, height = 600, quality = 90)
g <- base_plot + add_images(grob_placement) + add_texts(img_texts) +
  annotation_custom(domino_grob, xmin = 1, xmax = 2,
                    ymin = -1 * 10^8, ymax = -5 * 10^8)
gt <- ggplot_gtable(ggplot_build(g))
gt$layout$clip[gt$layout$name == "panel"] <- "off"
grid.draw(gt)
dev.off()
```

## ggplot2에서 heatmap 플로팅 빠르게 해보기

A post on the FlowingData blog shows how to quickly make the heatmap below using R base graphics.

This post shows how to achieve a very similar result using ggplot2.

## Getting the Data

FlowingData used last season's NBA basketball statistics provided by databasebasketball.com, and the csv file with the data can be downloaded directly from that website.

```r
> nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
```

The players are ordered by the points they scored, and the Name variable is converted to a factor to ensure proper sorting of the plot.

```r
> nba$Name <- with(nba, reorder(Name, PTS))
```

Whereas FlowingData uses the heatmap function from the stats package, which requires the plotted values in matrix format, ggplot2 works with data frames. For easier processing, the data frame is converted from wide format to long format.

The game statistics have very different ranges, so all individual statistics are rescaled to make them comparable.

```r
> library(ggplot2)
> library(reshape2)  # for melt()
> library(plyr)      # for ddply()
> library(scales)    # for rescale()
> nba.m <- melt(nba)
> nba.m <- ddply(nba.m, .(variable), transform,
+               rescale = rescale(value))
```

## Plotting

ggplot2 has no specific heatmap plotting function, but combining geom_tile with a smooth gradient fill does the job very well.

```r
> (p <- ggplot(nba.m, aes(variable, Name)) +
+    geom_tile(aes(fill = rescale), colour = "white") +
+    scale_fill_gradient(low = "white", high = "steelblue"))
```

A few finishing touches are applied to the formatting, and the heatmap plot is ready to be displayed. (Note that in current versions of ggplot2, opts() has been replaced by theme(), and theme_blank() and theme_text() by element_blank() and element_text().)

```r
> base_size <- 9
> p + theme_grey(base_size = base_size) +
+   labs(x = "", y = "") +
+   scale_x_discrete(expand = c(0, 0)) +
+   scale_y_discrete(expand = c(0, 0)) +
+   opts(legend.position = "none",
+        axis.ticks = theme_blank(),
+        axis.text.x = theme_text(size = base_size * 0.8, angle = 330,
+                                 hjust = 0, colour = "grey50"))
```

## Update on Rescaling

When preparing the data for the plot above, all variables were rescaled to values between 0 and 1.

Jim pointed out in the comments that the heatmap function uses a different scaling method (which I didn't get at first), so the plots are not identical. Below is an updated version of the heatmap that looks much more like the original.

```r
> nba.s <- ddply(nba.m, .(variable), transform,
+               rescale = scale(value))
> last_plot() %+% nba.s
```

## Overcoming the Fear of Programming

Never programmed before in your life? Never heard words like classes and objects, data frames, methods, inheritance, or loops? Afraid of programming?

Don't be afraid. Programming can be fun and stimulating, and once you start programming and learning, you will love spending time programming your strategies. You will want to watch your code run in the blink of an eye, and you will see how powerful it can be.

The Executive Programme in Algorithmic Trading (EPAT™) course makes extensive use of the Python and R programming languages to teach strategies, backtesting, and optimization. Here we show how you can overcome your fear of programming with the help of R. Below are some suggestions for beginner programmers.

### 1) Think and let the questions pop in your mind

As a newbie programmer when you have a task to code, even before you start on it, spend some time ideating on how you would like to solve it step-by-step. Simply let questions pop up in your mind, as many questions as your mind may throw up.

Here are a few questions:

• Is it possible to download stock price data in R from Google Finance?
• How do I delete a column in R?
• How do I compute an exponential moving average (EMA)?
• How do I draw a line chart in R?
• How do I merge two data sets?
• Is it possible to save the results in an Excel workbook using R?
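To take two of these questions as examples (deleting a column and merging two data sets), the base R answers are one-liners; the data frames below are made up for illustration:

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# How do I delete a column in R? Assign NULL to it.
df$b <- NULL
names(df)  # "a" "c"

# How do I merge two data sets? merge() joins on shared key columns.
prices  <- data.frame(id = 1:3, price = c(10, 20, 30))
volumes <- data.frame(id = 2:3, volume = c(100, 200))
merge(prices, volumes, by = "id")  # rows for id 2 and 3
```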

### 2) Google the questions for answers

Use Google search to see whether solutions exist for the questions that you have raised. Let us take the second question: how do I delete a column in R? We posted the question in Google search, and as we can see from the screenshot below, the solution appears in the very first result.

R is an open-source project, and there are hundreds of articles, blogs, forums, tutorials, Youtube videos on the net and books which will help you overcome the fear of programming and transition you from a beginner to an intermediate level, and eventually to an expert if you aspire to.

The chart below shows the number of questions/threads posted by newbie and expert programmers on two popular websites. As you can see, R clearly tops the results with more than 10 thousand questions/threads.
(Source: www.r4stats.com )

Let us search in google whether QuantInsti™ has put up any programming material on R.
As you can see from the google results, QuantInsti™ has posted quality content on its website to help newbie programmers design and model quantitative trading strategies in R. You can read all the rich content posted regularly by QuantInsti™ here – https://www.quantinsti.com/blog

### 3) Use the print command in R

As a newbie programmer, don’t be intimidated when you come across complex-looking code on the internet. If you cannot figure out exactly what the code does, just copy it into R. A simple print() command will help you understand how the code works.

You can also press Ctrl+Enter to execute the code line by line and see the results in the console.

Let us take an example of an MACD trading strategy posted on QuantInsti’s blog.

An example of a trading strategy coded using Quantmod Package in R

I was unsure how the commands at line 9 and line 11 work, so I simply inserted a print(head(returns)) command at line 10 and another at line 12, then ran the code. Below is the result shown in R’s console window.

The returns = returns['2008-06-02/2015-09-22'] command simply trims the original NSEI.Close returns series. The series earlier started on 2007-09-17; it now starts on 2008-06-02 and ends on 2015-09-22.
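The same trimming can be reproduced on a toy series. This sketch assumes only the xts package (which quantmod loads) and uses made-up return values in place of the real NSEI data:

```r
library(xts)

# Toy daily returns spanning the same dates as the original series
dates   <- seq(as.Date("2007-09-17"), as.Date("2015-09-22"), by = "day")
returns <- xts(rnorm(length(dates), sd = 0.01), order.by = dates)

print(head(returns))                         # series starts at 2007-09-17
returns <- returns["2008-06-02/2015-09-22"]  # date-range subsetting trims it
print(head(returns))                         # series now starts at 2008-06-02
```

The quoted string inside the brackets is an ISO-8601 date range, a subsetting style that xts objects support directly.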

### 4) Use help() and example() functions in R

You can also make use of the help() and example() functions in R to understand a piece of code and to learn new ways of coding. Continuing with the code above, suppose I am unsure what the ROC function at line 9 does.

I used the help("ROC") command, and R displayed all the relevant information on the usage and arguments of the ROC function.
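A minimal sketch of these calls (ROC comes from the TTR package, which quantmod loads automatically):

```r
library(TTR)

help("ROC")     # opens the documentation: description, usage, arguments
?ROC            # shorthand for the same help() call
example("ROC")  # runs the examples shipped on ROC's help page
```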

There are thousands of add-on packages in R which make programming easy and yet powerful.

Below is the link to view all the available packages in R:
https://cran.r-project.org/web/packages/available_packages_by_name.html

### 5) Give time to programming

Programming can be a very rewarding experience, and we expect you to devote some time to learning and honing your programming skills. Below is a word cloud of some essential characteristics a good programmer should possess. The best suggestion would be to just start programming!

### Next Step

If you want to learn the various aspects of algorithmic trading, check out the Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to be a successful trader. Enroll now!

As a newbie programmer, you have just made a start. The faculty at QuantInsti™ will teach and guide you through different aspects of programming in R and Python. Over the course of the program, you will learn different data structures, classes and objects, functions, and many other aspects which will enable you to program algorithmic trading strategies in the most efficient and powerful way.


## xwMOOC 기계학습

### Learning objectives

• Convert table data into tidy data.
• Convert the tidy data into categorical data, i.e. a factor data structure.
• Visualize it with the mosaic() function from the vcd package, reusing an existing R function.

## 1. Data structures and visualization for categorical data

One of the data formats we encounter most often in everyday work is the table. Paradoxically, though, it is also the format with the least published guidance on how to use it. Even in statistics departments continuous data is covered extensively, while few people have had the chance to properly understand and practice with categorical data.

In fact, visualizing categorical data and presenting it in various tabular forms calls for a range of background knowledge:

• the table data type
• the concept of tidy data
• using and interpreting the mosaic() function from the vcd package
• the forcats package for categorical (factor) data
• the kable() function from the knitr package, plus Markdown, for presenting tables on the web

In short, ordinary tabular data is first converted into tidy form, then run through exploratory data analysis to produce the final output.

## 2. Working freely with table data

Building on this background, we begin the analysis with the HairEyeColor dataset included in R's datasets package.

### 2.1. Setup

Load the packages needed for categorical data analysis and for producing the visualization outputs.

# 0. Load packages ----------------------------------------------
library(tidyverse)
library(datasets)
library(forcats)
library(ggmosaic)
library(vcd)
library(gridExtra)
library(knitr)

### 2.2. Table data

Load the HairEyeColor dataset, a well-known categorical dataset. HairEyeColor is not a data frame but a table object. Two functions convert it into the familiar data-frame form:

• tbl_df()
• as_data_frame()

tbl_df() and as_data_frame() both turn a table object into a data frame (tibble); note that current tidyverse releases supersede both with as_tibble().

data("HairEyeColor")

# 1. Data transformation ----------------------------------------------

## 1.1 Table data --> tidy data ------------------------

hair_eye_df <- apply(HairEyeColor, c(1, 2), sum)

kable(hair_eye_df, digits=0)
        Brown   Blue   Hazel   Green
Black      68     20      15       5
Brown     119     84      54      29
Red        26     17      14      14
Blond       7     94      10      16
tbl_df <- as_data_frame(HairEyeColor)

tbl_df(HairEyeColor)
# A tibble: 32 × 4
Hair   Eye   Sex     n
<chr> <chr> <chr> <dbl>
1  Black Brown  Male    32
2  Brown Brown  Male    53
3    Red Brown  Male    10
4  Blond Brown  Male     3
5  Black  Blue  Male    11
6  Brown  Blue  Male    50
7    Red  Blue  Male    10
8  Blond  Blue  Male    30
9  Black Hazel  Male    10
10 Brown Hazel  Male    25
# ... with 22 more rows

# kable(tbl_df)

### 2.3. Tidy data

Once converted to a data frame, the data is in long format; to compare it against the original table, reshape it back to wide form with the spread() function.

## 1.2 Long & wide data formats ------------------------

long_df <- tbl_df %>% group_by(Hair, Eye) %>%
summarise(cnt = sum(n))

# Compare with hair_eye_df above
long_df %>% spread(Eye, cnt) %>% kable(digits=0)
Hair    Blue   Brown   Green   Hazel
Black     20      68       5      15
Blond     94       7      16      10
Brown     84     119      29      54
Red       17      26      14      14

### 2.4. Visualizing univariate categorical data

Once the data frame is tidy, each variable is converted to its appropriate type. This is where the factor-handling functions of the forcats package come in. The factor type exists as a concept in other programming languages, but it is rarely used in practice there, and few languages provide as rich a feature set for it as R does.

## 1.3 Categorical data ------------------------

long_df %>% ungroup() %>%
  mutate(Hair = factor(Hair)) %>%
  group_by(Hair) %>%
  summarise(hair_sum = sum(cnt)) %>%
  ggplot(aes(hair_sum, fct_reorder(Hair, hair_sum))) + geom_point()

long_df %>% ungroup() %>%
  mutate(Eye = factor(Eye)) %>%
  group_by(Eye) %>%
  summarise(eye_sum = sum(cnt)) %>%
  ggplot(aes(eye_sum, fct_reorder(Eye, eye_sum))) + geom_point()

long_df %>% ungroup() %>%
  mutate(Eye = factor(Eye), Hair = factor(Hair)) %>%
  group_by(Eye, Hair) %>%
  summarise(eye_hair_sum = sum(cnt)) %>%
  tidyr::unite(eye_hair, Eye, Hair) %>%
  ggplot(aes(eye_hair_sum, fct_reorder(eye_hair, eye_hair_sum))) + geom_point()

## 3. Mosaic plots

A mosaic plot can also be drawn with ggplot, but plain ggplot provides no residual-based shading. With the ggmosaic package, however, a mosaic plot can be built in the grammar-of-graphics style using the geom_mosaic() function.

To shade the plot by residuals, though, the data must be passed as a table object to the mosaic() function provided by the vcd package.

# 2. Mosaic plot ------------------------

long_df %>% ungroup() %>%
  mutate(Eye = factor(Eye), Hair = factor(Hair)) %>%
  ggplot() +
  geom_mosaic(aes(weight = cnt, x = product(Hair), fill = Eye))

# 3. Mosaic plot with a statistical model ------------------------

mosaic(HairEyeColor, shade=TRUE, legend=TRUE)

xtabs(cnt ~ Hair + Eye, long_df)
       Eye
Hair    Blue Brown Green Hazel
Black   20    68     5    15
Blond   94     7    16    10
Brown   84   119    29    54
Red     17    26    14    14

mosaic(xtabs(cnt ~ Hair + Eye, long_df), shade = TRUE, legend=TRUE)
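The shading in vcd::mosaic() reflects the Pearson residuals of the independence model. As a cross-check, the same residuals can be computed directly with chisq.test():

```r
# Pearson residuals (observed - expected) / sqrt(expected) for Hair x Eye
tab <- margin.table(HairEyeColor, c(1, 2))  # collapse the table over Sex
res <- chisq.test(tab)$residuals
round(res, 1)
# Strongly positive cells (e.g. Blond/Blue) occur far more often than
# independence predicts; these are the cells highlighted in the plot.
```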

# vcd::mosaic(hair_eye_df, shade = TRUE, legend=TRUE)



Source: xwMOOC 기계학습