R 에서 tidyverse 패키지를 제대로 이해해 보기위해 타이타닉 데이터 분석을 시작해 봅시다. dplyr 및 ggplot2를 사용하여 데이터 필터링, 정렬, 요약, 변형 및 시각화 방법을 학습하십시오

And the second part here:

Note that you can find all the code for the session in this repository.

The core tidyverse includes the packages that you’re likely to use in everyday data analyses, such as
ggplot2
for data visualization and
dplyr
for data wrangling. You’ll focus on these two in this tutorial.

Remember that a package is essentially a set of tools for doing stuff with data. If you’d like to know more about packages in R, check out this tutorial.

For more on the Tidyverse, check out David Robinson’s Introduction to Tidyverse course on DataCamp and the Learn the Tidyverse resources.

Also check out this week’s episode of DataFramed, a conversation with Mara Averick, Tidyverse dev. advocate at RStudio.

To start off, you should install the tidyverse, if you haven’t already:
# Install the tidyverse# install.packages("tidyverse")
Now that you’ve installed the tidyverse, it’s time to load your data and check out some of the observations.

In this tutorial, you’ll be exploring the Titanic dataset, where each observation is a person and each variable is a feature such as
Name
,
Age
and
Survived
(or not).

Load your data like this:
# Import the Tidyverselibrary(tidyverse)# Import datapassengers <- read.csv("data/train.csv")# Check out the first several observations of your dataframepassengers##   PassengerId Survived Pclass## 1           1        0      3## 2           2        1      1## 3           3        1      3## 4           4        1      1## 5           5        0      3## 6           6        0      3##                                                  Name    Sex Age SibSp## 1                             Braund, Mr. Owen Harris   male  22     1## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1## 3                              Heikkinen, Miss. Laina female  26     0## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1## 5                            Allen, Mr. William Henry   male  35     0## 6                                    Moran, Mr. James   male  NA     0##   Parch           Ticket    Fare Cabin Embarked## 1     0        A/5 21171  7.2500              S## 2     0         PC 17599 71.2833   C85        C## 3     0 STON/O2. 3101282  7.9250              S## 4     0           113803 53.1000  C123        S## 5     0           373450  8.0500              S## 6     0           330877  8.4583              Q
Note that the path that you will give to
read.csv()
can change depending on your setup. If you’d like to just keep the same file path, you can always check out the folder structure of this project in our GitHub repository, which you can find here.

Note also that if you want to know more about all the variables that you see as a result of executing
passengers
, you can take a look at the dataset description.

To get an overview of your data, you can use the
summary()
function:
# Summarize titanicsummary(passengers)##   PassengerId       Survived          Pclass     ##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  ##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  ##  Median :446.0   Median :0.0000   Median :3.000  ##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  ##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  ##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  ##                                                  ##                                     Name         Sex           Age       ##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  ##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  ##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  ##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  ##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  ##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  ##  (Other)                              :885                NA's   :177    ##      SibSp           Parch             Ticket         Fare       ##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  ##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  ##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  ##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  ##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  ##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  ##                                   (Other) :852                   ##          Cabin     Embarked##             :687    :  2   ##  B96 B98    :  4   C:168   ##  C23 C25 C27:  4   Q: 77   ##  G6         :  4   S:644   ##  C22 C26    :  3           ##  D          :  3           ##  (Other)    :186
Now do the same using a pipe
%>%
, one of the handiest tools in the tidyverse:
# Summarize titanic using a pipepassengers %>%  summary()##   PassengerId       Survived          Pclass     ##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  ##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  ##  Median :446.0   Median :0.0000   Median :3.000  ##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  ##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  ##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  ##                                                  ##                                     Name         Sex           Age       ##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  ##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  ##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  ##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  ##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  ##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  ##  (Other)                              :885                NA's   :177    ##      SibSp           Parch             Ticket         Fare       ##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  ##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  ##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  ##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  ##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  ##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  ##                                   (Other) :852                   ##          Cabin     Embarked##             :687    :  2   ##  B96 B98    :  4   C:168   ##  C23 C25 C27:  4   Q: 77   ##  G6         :  4   S:644   ##  C22 C26    :  3           ##  D          :  3           ##  (Other)    :186
Tip: if you’d like to know more about the use of the pipe operator in R, you can check out this tutorial.

Do the same after dropping observations that have missing values. Here’s a hint: you can concatenate pipes!
# Summarize titanic after dropping napassengers %>%  drop_na() %>%  summary()##   PassengerId       Survived          Pclass     ##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  ##  1st Qu.:222.2   1st Qu.:0.0000   1st Qu.:1.000  ##  Median :445.0   Median :0.0000   Median :2.000  ##  Mean   :448.6   Mean   :0.4062   Mean   :2.237  ##  3rd Qu.:677.8   3rd Qu.:1.0000   3rd Qu.:3.000  ##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  ##                                                  ##                                     Name         Sex           Age       ##  Abbing, Mr. Anthony                  :  1   female:261   Min.   : 0.42  ##  Abbott, Mr. Rossmore Edward          :  1   male  :453   1st Qu.:20.12  ##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  ##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  ##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  ##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  ##  (Other)                              :708                               ##      SibSp            Parch                 Ticket         Fare       ##  Min.   :0.0000   Min.   :0.0000   347082      :  7   Min.   :  0.00  ##  1st Qu.:0.0000   1st Qu.:0.0000   3101295     :  6   1st Qu.:  8.05  ##  Median :0.0000   Median :0.0000   347088      :  6   Median : 15.74  ##  Mean   :0.5126   Mean   :0.4314   CA 2144     :  6   Mean   : 34.69  ##  3rd Qu.:1.0000   3rd Qu.:1.0000   382652      :  5   3rd Qu.: 33.38  ##  Max.   :5.0000   Max.   :6.0000   S.O.C. 14879:  5   Max.   :512.33  ##                                    (Other)     :679                   ##          Cabin     Embarked##             :529    :  2   ##  B96 B98    :  4   C:130   ##  C23 C25 C27:  4   Q: 28   ##  G6         :  4   S:554   ##  C22 C26    :  3           ##  D          :  3           ##  (Other)    :167
You may have noticed stylistic consistency in the code that you’ve written above. That’s because you’re adhering to a style guide. In data science and programming/coding in general, it’s important to get accustomed to using a style guide asap. As Hadley Wickham puts it in thetidyverse style guide,

Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.

In the next section, you’ll tackle data wrangling with
dplyr
to filter your data, arrange it and create new features by mutating old ones.

Wrangle your data

Now it’s time to explore your data and get some initial insight into the dataset. You’ll be using
dplyr
verbs such as
filter()
,
arrange()
and
mutate()
, which do exactly what they say.

Let’s sat that you wanted to choose a particular set of observations, say, those for which the “Sex” was ‘female’.
dplyr
allows us to do intuitively and in a language that mirrors how you think and talk about data.

The
filter
verb chooses only the observations that match the condition. See it in action:
# Filter to get all "male" rowspassengers %>%  filter(Sex == "male")##   PassengerId Survived Pclass                           Name  Sex Age## 1           1        0      3        Braund, Mr. Owen Harris male  22## 2           5        0      3       Allen, Mr. William Henry male  35## 3           6        0      3               Moran, Mr. James male  NA## 4           7        0      1        McCarthy, Mr. Timothy J male  54## 5           8        0      3 Palsson, Master. Gosta Leonard male   2## 6          13        0      3 Saundercock, Mr. William Henry male  20##   SibSp Parch    Ticket    Fare Cabin Embarked## 1     1     0 A/5 21171  7.2500              S## 2     0     0    373450  8.0500              S## 3     0     0    330877  8.4583              Q## 4     0     0     17463 51.8625   E46        S## 5     3     1    349909 21.0750              S## 6     0     0 A/5. 2151  8.0500              S
When looking at the results of the code chunk above, you’ll see that it looks like many men didn’t survive the sinking of the RMS Titanic. This is interesting, and you’ll explore this more rigorously later on in this tutorial!

Note that you can read
dplyr
code like a sentence: take your data and then (
%>%
) filter it according the condition that the sex is male. The code that you see in the above code chunk does not alter the original data frame.
filter(Sex = "male")
is a common error (Hugo made this same mistake!); there’s a cool tidyverse error guide.
# Filter to get all "female" rowspassengers %>%  filter(Sex == "female")##   PassengerId Survived Pclass## 1           2        1      1## 2           3        1      3## 3           4        1      1## 4           9        1      3## 5          10        1      2## 6          11        1      3##                                                  Name    Sex Age SibSp## 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1## 2                              Heikkinen, Miss. Laina female  26     0## 3        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1## 4   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female  27     0## 5                 Nasser, Mrs. Nicholas (Adele Achem) female  14     1## 6                     Sandstrom, Miss. Marguerite Rut female   4     1##   Parch           Ticket    Fare Cabin Embarked## 1     0         PC 17599 71.2833   C85        C## 2     0 STON/O2. 3101282  7.9250              S## 3     0           113803 53.1000  C123        S## 4     2           347742 11.1333              S## 5     0           237736 30.0708              C## 6     1          PP 9549 16.7000    G6        S
Women seem to be more likely to have survived the Titanic disaster, at least, anecdotally speaking.

When you explore data, you should be asking and answering questions that you might be interested in to the best of your abilities! Right now, You might want to
arrange()
your observations by increasing
Fare
to see if you can notice any trends. You can use the verb
arrange()
to achieve this:
# Arrange by increasing Farepassengers %>%  arrange(Fare)##   PassengerId Survived Pclass                            Name  Sex Age## 1         180        0      3             Leonard, Mr. Lionel male  36## 2         264        0      1           Harrison, Mr. William male  40## 3         272        1      3    Tornquist, Mr. William Henry male  25## 4         278        0      2     Parkes, Mr. Francis "Frank" male  NA## 5         303        0      3 Johnson, Mr. William Cahoone Jr male  19## 6         414        0      2  Cunningham, Mr. Alfred Fleming male  NA##   SibSp Parch Ticket Fare Cabin Embarked## 1     0     0   LINE    0              S## 2     0     0 112059    0   B94        S## 3     0     0   LINE    0              S## 4     0     0 239853    0              S## 5     0     0   LINE    0              S## 6     0     0 239853    0              S
A lot of the people who paid less, did not survive the disaster. That’s something interesting that you’ve just discovered just by re-arranging your data!

You can also
arrange
by decreasing
Fare
:
# Arrange by decreasing Farepassengers %>%  arrange(desc(Fare))##   PassengerId Survived Pclass                               Name    Sex## 1         259        1      1                   Ward, Miss. Anna female## 2         680        1      1 Cardeza, Mr. Thomas Drake Martinez   male## 3         738        1      1             Lesurer, Mr. Gustave J   male## 4          28        0      1     Fortune, Mr. Charles Alexander   male## 5          89        1      1         Fortune, Miss. Mabel Helen female## 6         342        1      1     Fortune, Miss. Alice Elizabeth female##   Age SibSp Parch   Ticket     Fare       Cabin Embarked## 1  35     0     0 PC 17755 512.3292                    C## 2  36     0     1 PC 17755 512.3292 B51 B53 B55        C## 3  35     0     0 PC 17755 512.3292        B101        C## 4  19     3     2    19950 263.0000 C23 C25 C27        S## 5  23     3     2    19950 263.0000 C23 C25 C27        S## 6  24     3     2    19950 263.0000 C23 C25 C27        S
There are indeed more survivors at the top end!

Sometimes you may wish to create new variables. You know that the variable
Parch
is the number of parents and children while
SibSp
is the number of siblings and spouses. You can add these together to get a new variable
FamSize
. This is feature engineering and it’s a big part of machine learning a lot of the time!

Now, to make your new variable, you
mutate()
the original variables into the new one.
# Create new column FamSize (size of family)passengers %>%  mutate(FamSize = Parch + SibSp)##   PassengerId Survived Pclass## 1           1        0      3## 2           2        1      1## 3           3        1      3## 4           4        1      1## 5           5        0      3## 6           6        0      3##                                                  Name    Sex Age SibSp## 1                             Braund, Mr. Owen Harris   male  22     1## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1## 3                              Heikkinen, Miss. Laina female  26     0## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1## 5                            Allen, Mr. William Henry   male  35     0## 6                                    Moran, Mr. James   male  NA     0##   Parch           Ticket    Fare Cabin Embarked FamSize## 1     0        A/5 21171  7.2500              S       1## 2     0         PC 17599 71.2833   C85        C       1## 3     0 STON/O2. 3101282  7.9250              S       0## 4     0           113803 53.1000  C123        S       1## 5     0           373450  8.0500              S       0## 6     0           330877  8.4583              Q       0
Note that
mutate()
can be used to create new columns but also modify existing columns in much the same way that a mutation may from a biological perspective. It’s not exactly right but a nice way to provide context for verb choice.

Now that you have your extra variable, you can ask other questions, such as “Is it possible that larger family have a lower rate of survival?”.

To test this hypothesis, create a new variable
FamSize
as the sum of
Parch
and
SibSp
as above, then arrange by decreasing
FamSize
:
# Create new column FamSize (size of family)# Arrange by decreasing FamSizepassengers %>%  mutate(FamSize = Parch + SibSp) %>%  arrange(desc(FamSize))##   PassengerId Survived Pclass                         Name    Sex Age## 1         160        0      3   Sage, Master. Thomas Henry   male  NA## 2         181        0      3 Sage, Miss. Constance Gladys female  NA## 3         202        0      3          Sage, Mr. Frederick   male  NA## 4         325        0      3     Sage, Mr. George John Jr   male  NA## 5         793        0      3      Sage, Miss. Stella Anna female  NA## 6         847        0      3     Sage, Mr. Douglas Bullen   male  NA##   SibSp Parch   Ticket  Fare Cabin Embarked FamSize## 1     8     2 CA. 2343 69.55              S      10## 2     8     2 CA. 2343 69.55              S      10## 3     8     2 CA. 2343 69.55              S      10## 4     8     2 CA. 2343 69.55              S      10## 5     8     2 CA. 2343 69.55              S      10## 6     8     2 CA. 2343 69.55              S      10
Everyone in the top family did not survive! This might be telling: maybe if you’re part of a large family, you weren’t able to get off of the Titanic in time.

However, as zeroes and ones don’t often say a great deal, mutate the values of the
Survived
variable to strings
No
and
Yes
(and create new data frame!):
# Turn numerical values of Survived column to "No" & "Yes" (new data frame)passengers1 <- passengers %>%  mutate(Survived = ifelse(Survived == 0, "No", "Yes"))passengers1##   PassengerId Survived Pclass## 1           1       No      3## 2           2      Yes      1## 3           3      Yes      3## 4           4      Yes      1## 5           5       No      3## 6           6       No      3##                                                  Name    Sex Age SibSp## 1                             Braund, Mr. Owen Harris   male  22     1## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1## 3                              Heikkinen, Miss. Laina female  26     0## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1## 5                            Allen, Mr. William Henry   male  35     0## 6                                    Moran, Mr. James   male  NA     0##   Parch           Ticket    Fare Cabin Embarked## 1     0        A/5 21171  7.2500              S## 2     0         PC 17599 71.2833   C85        C## 3     0 STON/O2. 3101282  7.9250              S## 4     0           113803 53.1000  C123        S## 5     0           373450  8.0500              S## 6     0           330877  8.4583              Q

Plotting your data

To plot your data using
ggplot2
, you specify three things:

  • Your data
  • Your aesthetics (for example, what’s on the x-axis)
  • Your layers (for example, barplot, scatterplot)

First you’ll plot a barplot of
Sex
to see how many males and females were recorded aboard the Titanic:
# Plot barplot of passenger Sexggplot(passengers, aes(x = Sex)) +  geom_bar()

From the above plot, you can tell that there were about 600 men and 300 women aboard of the RMS Titanic.

Note that the
aes()
function is really a way to map the
aes
thetics of the plot to the variables in the data. See Modern Dive by Chester Ismay & Albert Y. Kim for more.

Now it’s time for some scatter plots. Is the
Age
of any given passenger correlated with the
Fare
that they paid?
# Scatter plot of Age vs Fareggplot(passengers, aes(x = Age, y = Fare)) +  geom_point()## Warning: Removed 177 rows containing missing values (geom_point).

You gathered a lot of insights just from this one plot! You see that a lot of the people who paid more, were also substantially older. There might have been an increase in fare according to age… Additionally, you immediately see the two outliers on top of the plot, which you might want to investigate further!

You can read your
ggplot2
code like a sentence, just like your
dplyr
code above: “You take the data as titanic and you map
Age
to the x axis and
Fare
to the y axis, adding points on as the layer of the plot.”

Let’s take the previous plot and color each point by
Sex
, to see any correlations between
Sex
,
Age
and
Fare
:
# Scatter plot of Age vs Fare colored by Sexggplot(passengers %>% drop_na(), aes(x = Age, y = Fare, color = Sex)) +  geom_point()

You see that a lot of the people down the bottom, who paid less, were men. You also have a cluster of women towards the top of the plot that were slightly older and also paid more to get on board of the Titanic.

Visualizing three variables (two numeric ones,
Age
and
Fare
, and one categorical one,
Sex
) on one plot is pretty cool, but what if you wanted to throw the
Survived
variable into the mix to see if there are any apparent trends? You can do this with faceting, which is a way to produce multiple plots simultaneously:
# Scatter plot of Age vs Fare colored by Sex faceted by Survivedggplot(passengers1, aes(x = Age, y = Fare, color = Sex)) +  geom_point() +  facet_grid(~Survived)## Warning: Removed 177 rows containing missing values (geom_point).

Suddenly, you see that a lot of the women who you noted up there in the previous plot have survived, and that most of them who didn’t survive, paid under 50 units. Tip: try to figure out the units of the currency yourself!

Now let’s redo your bar plot of passenger
Sex
and this time fill your bars according to
Survived
:
# Plot barplot of passenger Sex & fill according to Survivalggplot(passengers1, aes(x = Sex, fill = Survived)) +  geom_bar()

You see that the vast proportion of men did not survive, whereas over two-thirds of the women did survive!

Summarizing and grouping your data

Use the
summarise()
verb to find out the mean fare paid:
# Check out mean Farepassengers %>%  summarise(meanFare = mean(Fare))##   meanFare## 1 32.20421
Use the
summarise()
verb to find out the median fare paid:
# Check out mean Farepassengers %>%  summarise(medianFare = median(Fare))##   medianFare## 1    14.4542
You can also use the
filter()
and
summarise()
verbs to find out the mean fare paid among men:
# Check out mean Fare for menpassengers %>%  filter(Sex == "male") %>%  summarise(meanFare = mean(Fare))##   meanFare## 1 25.52389
Use the
filter()
and
summarise()
verbs to find out the mean fare paid among women:
# Check out mean Fare for womenpassengers %>%  filter(Sex == "female") %>%  summarise(meanFare = mean(Fare))##   meanFare## 1 44.47982
Use the
filter()
and
summarise()
verbs to find out the mean fare paid among women and how many women survived:
# Check out mean Fare & number of survivors for womenpassengers %>%  filter(Sex == "female") %>%  summarise(meanFare = mean(Fare), numSurv = sum(Survived))##   meanFare numSurv## 1 44.47982     233
Use the
group_by()
and
summarise()
verbs to find the mean fare and number of survivors as a function of sex:
# Check out mean Fare & number of survivors grouped by Sexpassengers %>%  group_by(Sex) %>%  summarise(meanFare = mean(Fare), numSurv = sum(Survived))## # A tibble: 2 x 3##   Sex    meanFare numSurv##   <fct>     <dbl>   <int>## 1 female     44.5     233## 2 male       25.5     109
Use the
group_by()
and
summarise()
verbs to find the mean fare and proportion of survivors as a function of sex:
# Check out mean Fare & proportion of survivors grouped by Sexpassengers %>%  group_by(Sex) %>%  summarise(meanFare = mean(Fare), numSurv = sum(Survived)/n())## # A tibble: 2 x 3##   Sex    meanFare numSurv##   <fct>     <dbl>   <dbl>## 1 female     44.5   0.742## 2 male       25.5   0.189

Conclusion

In this tutorial, you have gone from zero to one with the basics of data analysis using the tidyverse and tidy tools. You’ve learnt how to
filter()
your data,
arrange()
and
mutate()
it, plot and
summarise()
it using
dplyr
and
ggplot2
, all by writing code that mirrors the way you think and talk about data. Congratulations. I’d encourage you to learn more by taking David Robinson’s Introduction to Tidyverse course and by using these tools on other datasets that interest you. Share these analyses with us on twitter @DataCamp and @hugobowne. Thanks for reading.

소스: Introduction to the Tidyverse: Tutorial (article) – DataCamp