New package polypoly (helper functions for orthogonal polynomials) – Higher Order Functions

Last week, I released a new package called polypoly to CRAN. It wraps up some common tasks for dealing with orthogonal polynomials into a single package. The README shows off the main functionality, as well as the neat “logo” I made for the package. In this post, I use the package on some word recognition data.

Demo: Growth curve analysis

I primarily use orthogonal polynomials to model data from eyetracking experiments where growth curves describe how the probability of looking at an image changes as the image is named. The analysis technique, including orthogonal polynomials and mixed effects models of eyetracking data, is described in Mirman’s 2014 book.

In our 2015 paper, toddlers saw two images on a computer screen. The objects in the images started with different consonants: for example, duck and ball. The toddlers heard sentences like “find the ball”, and we measured how their gaze location onscreen changed in response to speech. This setup is a pretty standard procedure for studying spoken word recognition.

We manipulated the vowel in the word the. In the facilitating condition, the vowel has acoustic information (via anticipatory coarticulation) which would allow an adult listener to predict the upcoming consonant. In the neutral condition, the vowel provides no cues about the upcoming consonant. The scientific question is whether these kiddos can take advantage of these acoustic cues during word recognition.

Here’s how the data look, both in R and in a plot.

library(ggplot2)
library(dplyr)

# The data
d
#> # A tibble: 986 x 6
#>     Subj    Condition  Time ToDistractor ToTarget Proportion
#>    <int>        <chr> <int>        <int>    <int>      <dbl>
#>  1     1 facilitating   200            9        9  0.5000000
#>  2     1 facilitating   250            9       10  0.5263158
#>  3     1 facilitating   300            6       12  0.6666667
#>  4     1 facilitating   350            6       12  0.6666667
#>  5     1 facilitating   400            6       12  0.6666667
#>  6     1 facilitating   450            6       12  0.6666667
#>  7     1 facilitating   500            6       12  0.6666667
#>  8     1 facilitating   550            6       12  0.6666667
#>  9     1 facilitating   600            4       12  0.7500000
#> 10     1 facilitating   650            3       15  0.8333333
#> # ... with 976 more rows

# Helper dataframe of where to put condition labels on the next plot
df_labs <- data_frame(
  Time = c(650, 800),
  Proportion = c(.775, .625), 
  Condition = c("facilitating", "neutral"))

p <- ggplot(d) + 
  aes(x = Time, y = Proportion, color = Condition) + 
  geom_hline(yintercept = .5, size = 2, color = "white") +
  stat_summary(fun.data = mean_se) + 
  geom_text(aes(label = Condition), data = df_labs, size = 6) +
  labs(x = "Time after noun onset [ms]", 
       y = "Proportion looks to named image",
       caption = "Mean ± SE. N = 29 children.") + 
  guides(color = "none")
p

Eyetracking data from Mahr et al. (2015)

Early on, children look equal amounts to both images on average (.5), and the proportion of looks to the named image increases as the word unfolds. In the facilitating condition, that rise happens earlier.

We fit a mixed-effects logistic regression model to estimate how the probability of looking to the named image changes over time, across conditions, and within children. We use cubic orthogonal polynomials to represent Time. For each time point, we have three predictors available to us: Time^1, Time^2, and Time^3. (Plus, there’s a constant “intercept” term.) Our model’s growth curve will be a weighted combination of these polynomial curves. The code below shows off about half the functionality of the package. :bowtie:

poly(unique(d$Time), 3) %>% 
  # Force Time^1 term to range from -.5 to .5. Rescale others accordingly.
  polypoly::poly_rescale(scale_width = 1) %>% 
  polypoly::poly_plot()

Three orthogonal polynomial curves
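
To make “weighted combination” concrete, here is a small illustration of my own (not from the original post): the weights below are arbitrary made-up numbers, chosen only to show that a candidate growth curve is the polynomial basis matrix multiplied by a coefficient vector, plus a constant.

basis <- polypoly::poly_rescale(poly(sort(unique(d$Time)), 3), scale_width = 1)
weights <- c(1.5, 0.5, -0.3)       # arbitrary linear, quadratic, cubic weights (illustration only)
curve <- 0.5 + basis %*% weights   # a constant of 0.5 plus the weighted polynomial terms
plot(sort(unique(d$Time)), curve, type = "l",
     xlab = "Time", ylab = "Illustrative growth curve")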

I think people sometimes describe the contributions of these curves to the overall growth curve as trends: “A negative linear trend”, “a significant quadratic trend”, etc. I like that word because it makes the terminology a little less intimidating.

Quick aside: Why orthogonal polynomials?

Why do we use orthogonal polynomial terms? First, note that the simple polynomials x, x^2 and x^3 are correlated. Orthogonal ones are not correlated. (Hence, the name.)

# Simple
poly(1:10, 3, raw = TRUE) %>% 
  cor() %>% 
  round(2)
#>      1    2    3
#> 1 1.00 0.97 0.93
#> 2 0.97 1.00 0.99
#> 3 0.93 0.99 1.00

# Orthogonal
poly(1:10, 3, raw = FALSE) %>% 
  cor() %>% 
  round(2)
#>   1 2 3
#> 1 1 0 0
#> 2 0 1 0
#> 3 0 0 1

Adding new correlated predictors to a model is a problem. The parameter estimates will change as different predictors are added. Here we simulate some fake data, and fit three models with 1-, 2- and 3-degree raw polynomials.

x <- 1:10
y <- x + 
  rnorm(1, mean = 100) * (x) +
  rnorm(1, mean = 0, sd = .01) * (x) ^ 2 +
  rnorm(1, mean = -1) * (x) ^ 3 + 
  rnorm(10)

models <- list(
  m1 = lm(y ~ x),
  m2 = lm(y ~ x + I(x^2)),
  m3 = lm(y ~ x + I(x^2) + I(x^3))
)

As expected, the estimates for the effects change from model to model:

models %>% 
  lapply(broom::tidy) %>% 
  bind_rows(.id = "model") %>% 
  select(model:estimate) %>% 
  mutate(estimate = round(estimate, 2))
#>   model        term estimate
#> 1    m1 (Intercept)    75.69
#> 2    m1           x    72.91
#> 3    m2 (Intercept)   -23.91
#> 4    m2           x   122.72
#> 5    m2      I(x^2)    -4.53
#> 6    m3 (Intercept)     1.15
#> 7    m3           x   100.48
#> 8    m3      I(x^2)     0.29
#> 9    m3      I(x^3)    -0.29

But with orthogonal polynomials, the parameter estimates don’t change from model to model.

models2 <- list(
  m1 = lm(y ~ poly(x, 1)),
  m2 = lm(y ~ poly(x, 2)),
  m3 = lm(y ~ poly(x, 3))
)

models2 %>% 
  lapply(broom::tidy) %>% 
  bind_rows(.id = "model") %>% 
  select(model:estimate) %>% 
  mutate(estimate = round(estimate, 2))
#>   model        term estimate
#> 1    m1 (Intercept)   476.72
#> 2    m1  poly(x, 1)   662.27
#> 3    m2 (Intercept)   476.72
#> 4    m2 poly(x, 2)1   662.27
#> 5    m2 poly(x, 2)2  -104.03
#> 6    m3 (Intercept)   476.72
#> 7    m3 poly(x, 3)1   662.27
#> 8    m3 poly(x, 3)2  -104.03
#> 9    m3 poly(x, 3)3   -16.24

That’s probably the simplest reason why orthogonal polynomials are preferred. (I can’t remember any others right now.)

Back to the data

Before fitting the model, I use poly_add_columns() to add polynomial terms as columns to the dataframe. (For speed here, I use a simplified random effects structure, estimating growth curve parameters for each Child x Condition combination.)

library(lme4)
#> Loading required package: Matrix
#> Loading required package: methods

d <- d %>% 
  polypoly::poly_add_columns(Time, degree = 3, prefix = "ot", scale_width = 1) %>% 
  # Change the reference level
  mutate(Condition = factor(Condition, c("neutral", "facilitating")))

m <- glmer(
  cbind(ToTarget, ToDistractor) ~ 
    (ot1 + ot2 + ot3) * Condition + 
    (ot1 + ot2 + ot3 | Subj:Condition), 
  family = binomial, 
  data = d)

We can confirm that the model captures the overall shape of the growth curves.

# The lines here are not quite the overall average, but the averages of 29
# individual fits (for each participant). That's why the caption is a little
# weird.
p + 
  stat_summary(aes(y = fitted(m)), fun.y = mean, geom = "line") + 
  labs(caption = "Line: Average of model-fitted values. Points: Mean ± SE.")

Eyetracking data with model fits overlaid

We can inspect the model summary as well.

arm::display(m)
#> glmer(formula = cbind(ToTarget, ToDistractor) ~ (ot1 + ot2 + 
#>     ot3) * Condition + (ot1 + ot2 + ot3 | Subj:Condition), data = d, 
#>     family = binomial)
#>                           coef.est coef.se
#> (Intercept)                0.47     0.10  
#> ot1                        1.57     0.28  
#> ot2                        0.45     0.11  
#> ot3                       -0.34     0.09  
#> Conditionfacilitating      0.23     0.14  
#> ot1:Conditionfacilitating  0.45     0.39  
#> ot2:Conditionfacilitating -0.44     0.16  
#> ot3:Conditionfacilitating  0.11     0.13  
#> 
#> Error terms:
#>  Groups         Name        Std.Dev. Corr              
#>  Subj:Condition (Intercept) 0.53                       
#>                 ot1         1.46      0.23             
#>                 ot2         0.52     -0.05  0.31       
#>                 ot3         0.39     -0.08 -0.64  0.09 
#>  Residual                   1.00                       
#> ---
#> number of obs: 986, groups: Subj:Condition, 58
#> AIC = 4788.2, DIC = -3961.1
#> deviance = 395.6

The model summary indicates a significant Condition x Time^2 interaction, but really, only the intercept and Time^1 can ever be interpreted directly. To understand the model fit, we visualize how each of the polynomial terms is weighted.

Here we create a matrix of the polynomial terms plus a column of ones for the intercept.

time_mat <- poly(sort(unique(d$Time)), 3) %>%
  polypoly::poly_rescale(1) %>%
  cbind(constant = 1, .)
round(time_mat, 2)
#>       constant     1     2     3
#>  [1,]        1 -0.50  0.57 -0.57
#>  [2,]        1 -0.44  0.36 -0.14
#>  [3,]        1 -0.37  0.17  0.14
#>  [4,]        1 -0.31  0.01  0.30
#>  [5,]        1 -0.25 -0.11  0.36
#>  [6,]        1 -0.19 -0.22  0.34
#>  [7,]        1 -0.12 -0.29  0.26
#>  [8,]        1 -0.06 -0.33  0.14
#>  [9,]        1  0.00 -0.34  0.00
#> [10,]        1  0.06 -0.33 -0.14
#> [11,]        1  0.12 -0.29 -0.26
#> [12,]        1  0.19 -0.22 -0.34
#> [13,]        1  0.25 -0.11 -0.36
#> [14,]        1  0.31  0.01 -0.30
#> [15,]        1  0.37  0.17 -0.14
#> [16,]        1  0.44  0.36  0.14
#> [17,]        1  0.50  0.57  0.57

To compute the weighted values, we multiply by a diagonal matrix of the coefficients.

neut_coefs <- fixef(m)[1:4]
faci_coefs <- neut_coefs + fixef(m)[5:8]
faci_coefs
#>  (Intercept)          ot1          ot2          ot3 
#>  0.699926322  2.014186150  0.006646146 -0.226658408

set_colnames <- `colnames<-`

m_neut <- time_mat %*% diag(neut_coefs) %>%
  set_colnames(c("constant", "ot1", "ot2", "ot3")) 

m_faci <- time_mat %*% diag(faci_coefs) %>%
  set_colnames(c("constant", "ot1", "ot2", "ot3")) 

# Convince ourselves with an example
round(m_faci, 2)
#>       constant   ot1 ot2   ot3
#>  [1,]      0.7 -1.01   0  0.13
#>  [2,]      0.7 -0.88   0  0.03
#>  [3,]      0.7 -0.76   0 -0.03
#>  [4,]      0.7 -0.63   0 -0.07
#>  [5,]      0.7 -0.50   0 -0.08
#>  [6,]      0.7 -0.38   0 -0.08
#>  [7,]      0.7 -0.25   0 -0.06
#>  [8,]      0.7 -0.13   0 -0.03
#>  [9,]      0.7  0.00   0  0.00
#> [10,]      0.7  0.13   0  0.03
#> [11,]      0.7  0.25   0  0.06
#> [12,]      0.7  0.38   0  0.08
#> [13,]      0.7  0.50   0  0.08
#> [14,]      0.7  0.63   0  0.07
#> [15,]      0.7  0.76   0  0.03
#> [16,]      0.7  0.88   0 -0.03
#> [17,]      0.7  1.01   0 -0.13

Then, we can use the poly_melt() function to get a dataframe from each weighted matrix and then plot each of the effects.

df_neut <- m_neut %>%
  polypoly::poly_melt() %>%
  tibble::add_column(Condition = "neutral")

df_faci <- m_faci %>% 
  polypoly::poly_melt() %>%
  tibble::add_column(Condition = "facilitating")

df_both <- bind_rows(df_faci, df_neut) %>% 
  mutate(Condition = factor(Condition, c("neutral", "facilitating")))

ggplot(df_both) +
  aes(x = observation, y = value, color = Condition) +
  geom_line() + 
  facet_wrap("degree")

Each of the polynomial effects weighted by condition

Visually, the quadratic effect on the neutral curve pulls the values down in the center (when the curves are most different) and pushes the values in the tails upwards (when the curves are closest). Although only the quadratic effect is nominally significant, the constant and linear terms hint at other, smaller effects, but those are too noisy to pin down.

It’s worth noting that the predictors and weights discussed above are on the log-odds/logit scale used inside of the model, instead of the proportion scale used in the plots of the data and model fits. Basically, these weighted values are summed together and then squeezed into the range [0, 1] with a nonlinear transformation. For these data, the two scales produce similar looking growth curves, but you can notice that the right ends of the curves are pinched slightly closer together in the probability-scale plot:

ggplot(df_both) +
  aes(x = observation, y = value, color = Condition) +
  stat_summary(fun.y = sum, geom = "line") + 
  ggtitle("logit scale") + 
  guides(color = "none")

ggplot(df_both) +
  aes(x = observation, y = value, color = Condition) +
  stat_summary(fun.y = function(xs) plogis(sum(xs)), geom = "line")  + 
  ggtitle("probability scale") + 
  guides(color = "none")

 

Source: New package polypoly (helper functions for orthogonal polynomials) – Higher Order Functions

5 ways to measure running time of R code | R-bloggers

A reviewer asked me to report detailed running times for all (so many :scream:) performed computations in one of my papers, and so I spent a Saturday morning figuring out my favorite way to benchmark R code. This is a quick summary of the options I found to be available.

A quick online search revealed at least three R packages for benchmarking R code (rbenchmark, microbenchmark, and tictoc). Additionally, base R provides at least two methods to measure the running time of R code (Sys.time and system.time). In the following I briefly go through the syntax of using each of the five options, and present my conclusions at the end.

1. Using Sys.time

The run time of a chunk of code can be measured by taking the difference between the time at the start and at the end of the code chunk. Simple yet flexible :sunglasses:.

sleep_for_a_minute <- function() { Sys.sleep(60) }

start_time <- Sys.time()
sleep_for_a_minute()
end_time <- Sys.time()

end_time - start_time
# Time difference of 1.000327 mins

2. Library tictoc

The functions tic and toc are used for benchmarking in much the same manner as the Sys.time approach just demonstrated. However, tictoc adds a lot more convenience.

The most recent development version[1] of tictoc can be installed from GitHub:

devtools::install_github("collectivemedia/tictoc")

One can time a single code chunk:

library(tictoc)

tic("sleeping")
print("falling asleep...")
sleep_for_a_minute()
print("...waking up")
toc()
# [1] "falling asleep..."
# [1] "...waking up"
# sleeping: 60.026 sec elapsed

Or nest multiple timers:

tic("total")
tic("data generation")
X <- matrix(rnorm(50000*1000), 50000, 1000)
b <- sample(1:1000, 1000)
y <- runif(1) + X %*% b + rnorm(50000)
toc()
tic("model fitting")
model <- lm(y ~ X)
toc()
toc()
# data generation: 3.792 sec elapsed
# model fitting: 39.278 sec elapsed
# total: 43.071 sec elapsed

3. Using system.time

One can time the evaluation of an R expression using system.time. For example, we can use it to measure the execution time of the function sleep_for_a_minute (defined above) as follows.

system.time({ sleep_for_a_minute() })
#   user  system elapsed
#  0.004   0.000  60.051

But what exactly are the reported times user, system, and elapsed? :confused:

Well, clearly elapsed is the wall clock time taken to execute the function sleep_for_a_minute, plus some benchmarking code wrapping it (that’s why it took slightly more than a minute to run I guess).

As for user and system times, William Dunlap has posted a great explanation to the r-help mailing list:

“User CPU time” gives the CPU time spent by the current process (i.e., the current R session) and “system CPU time” gives the CPU time spent by the kernel (the operating system) on behalf of the current process. The operating system is used for things like opening files, doing input or output, starting other processes, and looking at the system clock: operations that involve resources that many processes must share. Different operating systems will have different things done by the operating system.

:grinning:
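
To make that distinction concrete, here is a minimal sketch of my own (not part of the quoted explanation) contrasting a call that merely sleeps, where almost no CPU time is used, with a CPU-bound computation, where user CPU time accounts for most of the elapsed time:

# Sleeping: elapsed time is about 5 seconds, but user CPU time is near zero
system.time(Sys.sleep(5))

# CPU-bound work: user CPU time makes up most of the elapsed time
system.time(sum(sqrt(abs(rnorm(1e7)))))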

4. Library rbenchmark

The documentation of the function benchmark from the rbenchmark R package describes it as “a simple wrapper around system.time”. However, it adds a lot of convenience compared to bare system.time calls. For example, it requires just one benchmark call to time multiple replications of multiple expressions, and the returned results are conveniently organized in a data frame.

I installed the development version[1] of the rbenchmark package from GitHub:

devtools::install_github("eddelbuettel/rbenchmark")

For example purposes, let’s compare the time required to compute linear regression coefficients using three alternative computational procedures:

  1. lm,
  2. the Moore-Penrose pseudoinverse,
  3. the same normal equations as in 2., but solved as a linear system without an explicit matrix inverse.

library(rbenchmark)

benchmark("lm" = {
            X <- matrix(rnorm(1000), 100, 10)
            y <- X %*% sample(1:10, 10) + rnorm(100)
            b <- lm(y ~ X + 0)$coef
          },
          "pseudoinverse" = {
            X <- matrix(rnorm(1000), 100, 10)
            y <- X %*% sample(1:10, 10) + rnorm(100)
            b <- solve(t(X) %*% X) %*% t(X) %*% y
          },
          "linear system" = {
            X <- matrix(rnorm(1000), 100, 10)
            y <- X %*% sample(1:10, 10) + rnorm(100)
            b <- solve(t(X) %*% X, t(X) %*% y)
          },
          replications = 1000,
          columns = c("test", "replications", "elapsed",
                      "relative", "user.self", "sys.self"))

#            test replications elapsed relative user.self sys.self
# 3 linear system         1000   0.167    1.000     0.208    0.240
# 1            lm         1000   0.930    5.569     0.952    0.212
# 2 pseudoinverse         1000   0.240    1.437     0.332    0.612

Here, the meaning of elapsed, user.self, and sys.self is the same as described above in the section about system.time, and relative is simply each test’s elapsed time as a ratio of the fastest test’s elapsed time. Interestingly, lm is by far the slowest here.
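
As a sanity check on that definition, here is a small made-up comparison (the expressions and the name res are mine, not from the post) that recomputes relative by hand from the elapsed column:

library(rbenchmark)

# Made-up comparison, used only to illustrate how 'relative' is defined
res <- benchmark("sqrt"  = sqrt(abs(rnorm(1e6))),
                 "power" = abs(rnorm(1e6))^0.5,
                 replications = 50,
                 columns = c("test", "replications", "elapsed", "relative"))
res

# 'relative' is each elapsed time divided by the smallest elapsed time
# (the printed column is rounded)
res$elapsed / min(res$elapsed)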

5. Library microbenchmark

The most recent development version of microbenchmark can be installed from github:

devtools::install_github("olafmersmann/microbenchmarkCore")
devtools::install_github("olafmersmann/microbenchmark")

Much like benchmark from the package rbenchmark, the function microbenchmark can be used to compare running times of multiple R code chunks. But it offers a great deal of convenience and additional functionality.

I find that one particularly nice feature of microbenchmark is the ability to automatically check the results of the benchmarked expressions with a user-specified function. This is demonstrated below, where we again compare three methods for computing the coefficient vector of a linear model.

library(microbenchmark)

set.seed(2017)
n <- 10000
p <- 100
X <- matrix(rnorm(n*p), n, p)
y <- X %*% rnorm(p) + rnorm(n)  # noise vector of length n

check_for_equal_coefs <- function(values) {
  tol <- 1e-12
  max_error <- max(c(abs(values[[1]] - values[[2]]),
                     abs(values[[2]] - values[[3]]),
                     abs(values[[1]] - values[[3]])))
  max_error < tol
}

mbm <- microbenchmark("lm" = { b <- lm(y ~ X + 0)$coef },
               "pseudoinverse" = {
                 b <- solve(t(X) %*% X) %*% t(X) %*% y
               },
               "linear system" = {
                 b <- solve(t(X) %*% X, t(X) %*% y)
               },
               check = check_for_equal_coefs)

mbm
# Unit: milliseconds
#           expr      min        lq      mean    median        uq      max neval cld
#             lm 96.12717 124.43298 150.72674 135.12729 188.32154 236.4910   100   c
#  pseudoinverse 26.61816  28.81151  53.32246  30.69587  80.61303 145.0489   100  b
#  linear system 16.70331  18.58778  35.14599  19.48467  22.69537 138.6660   100 a

We used the function argument check to check for equality (up to a maximal error of 1e-12) of the results returned by the three methods. If the results weren’t equal, microbenchmark would return an error message.

Another great feature is the integration with ggplot2 for plotting microbenchmark results.

library(ggplot2)
autoplot(mbm)

Microbenchmark results plot

Conclusion

The given demonstration of the different benchmarking functions is surely not exhaustive. Nevertheless, I reached some conclusions for my personal benchmarking needs:

  • The Sys.time approach as well as the tictoc package can be used for timing (potentially nested) steps of a complicated algorithm (that’s often my use case). However, tictoc is more convenient, and (most importantly) foolproof.
  • We saw that microbenchmark returns other types of measurements than benchmark, and I think that in most situations the microbenchmark measurements are of a higher practical significance :stuck_out_tongue:.
  • To my knowledge microbenchmark is the only benchmarking package that has visualizations built in :+1:.

For these reasons I will go with microbenchmark and tictoc. :bowtie:


  1. Though the repository does not seem to be very active, so the GitHub version is probably no different from the stable release on CRAN.

 

Source: 5 ways to measure running time of R code | R-bloggers

Connecting R to an Oracle Database | R-bloggers

R is a very popular language for doing analytics, and particularly statistics, on your data. There are a number of R functions for reading in data, but most of them take a delimited text file (such as .CSV) for input. That’s great if your existing data is in a spreadsheet, but if you have large amounts of data, it’s probably stored in a relational database. If you work for a large company, chances are that it is an Oracle database.

The most efficient way to access an Oracle database from R is using the RODBC package, available from CRAN. If the RODBC package is not installed in your R environment, use the install.packages("RODBC") command to install it. ODBC stands for Open DataBase Connectivity, an open standard application programming interface (API) for databases. ODBC was created by the SQL Access Group and first released in September 1992. Although Microsoft Windows was the first to provide an ODBC product, versions now exist for Linux and Macintosh platforms as well. ODBC is built in to current versions of Windows. If you are using a different operating system, you’ll need to install an ODBC driver manager.

Before you can access a database from R, you’ll need to create a Data Source Name, or DSN. This is an alias to the database, which provides the connection details. In Windows, you create the DSN using the ODBC Data Source Administrator. This tool can be found in the Control Panel. In Windows 10, it’s under System and Security -> Administrative Tools -> ODBC Data Sources. Or you can just type “ODBC” in the search box. On my system, it looks like this:

As you can see, I already have a connection to an Oracle database. To set one up, click Add, and you’ll get this box:

Select the appropriate driver (in my case, Oracle in OraDB12Home1) and click the Finish button. A Driver Configuration box opens:

For “Data Source Name,” you can put in almost anything you want. This is the name you will use in R when you connect to the database.

The “Description” field is optional, and again, you can put in whatever you want.

TNS Service Name is the name that you (or your company’s database administrator) assigned when configuring the Oracle database. And “User ID” is the ID that you use with the database.

After you fill in these fields, click the “Test Connection” button. Another box pops up, with the TNS Service Name and User ID already populated, and an empty field for your password. Enter your password and click “OK.” You should see a “Connection Successful” message. If not, check the Service Name, User ID, and Password.

Now you are ready to connect R to the database.

Here’s the R code that you need:

# Load RODBC package
library(RODBC)

# Create a connection to the database called "channel"
channel <- odbcConnect("DATABASE", uid="USERNAME", pwd="PASSWORD")

# Query the database and put the results into the data frame
# "dataframe"

 dataframe <- sqlQuery(channel, "
 SELECT *
 FROM
 SCHEMA.DATATABLE")

# When finished, it's a good idea to close the connection
odbcClose(channel)

A couple of comments about this code are in order:

First, I don’t like the idea of having a password appear, unencrypted, in the R program. One possible solution is to prompt the user for the password before creating the connection:

pswd <- readline("Input Password: ")
channel <- odbcConnect("DATABASE", uid="USERNAME",  pwd=pswd)

This will enable the connection to be made without compromising the security of the password.

Second, sqlQuery will pass to Oracle whatever is inside the quotation marks. This is the workhorse function of the RODBC package. The term ‘query’ covers any valid SQL statement, including table creation and updates as well as SELECTs.

Finally, I should mention that R works with data that is loaded into the computer’s memory. If you try to load a really huge database into memory all at once, it will a) take a very long time, and b) possibly fail due to exceeding your computer’s memory capacity. Of course, relational database systems like Oracle are the natural habitat of very large data sets, so that may be your motivation for connecting R to Oracle in the first place. Carefully constructed SQL Queries will let Oracle do the work of managing the data, and return just the data that R needs for performing analytics.
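
As an illustration of that principle, here is a sketch of a query that pushes the filtering and aggregation to Oracle so that only a small summary table comes back to R. The DSN, credentials, schema, table, and column names below are placeholders to replace with your own:

library(RODBC)

channel <- odbcConnect("DATABASE", uid = "USERNAME", pwd = "PASSWORD")

# Let Oracle do the filtering and aggregation; only the monthly summary
# is returned to R
monthly_totals <- sqlQuery(channel, "
  SELECT TO_CHAR(ORDER_DATE, 'YYYY-MM') AS ORDER_MONTH,
         SUM(ORDER_AMOUNT)              AS TOTAL_AMOUNT
  FROM   SCHEMA.ORDERS
  WHERE  ORDER_DATE >= DATE '2017-01-01'
  GROUP  BY TO_CHAR(ORDER_DATE, 'YYYY-MM')
  ORDER  BY ORDER_MONTH")

odbcClose(channel)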

Writing SQL Queries is beyond the scope of this blog post. If you need help with that, there are plenty of free tutorials on the web, or you might find this book helpful: Oracle 12c for Dummies

 

Source: Connecting R to an Oracle Database | R-bloggers

R vs Python: Different similarities and similar differences | R-bloggers

A debate about which language is better suited for data science, R or Python, can set off diehard fans of these languages into a tizzy. This post tries to look at some of the different similarities and similar differences between these languages. To a large extent, the ease or difficulty of learning R or Python is subjective. I have heard that R has a steeper learning curve than Python, and also vice versa; this probably depends on one’s degree of familiarity with the language. To a large extent, both R and Python do the same things in just slightly different ways and syntaxes, and the ease or difficulty of a given R/Python construct is largely in the eyes of the beholder (nay, the programmer), we could say. I include my own experience with the languages below.

1. R data types

R has two main data types

  1. Character
  2. Numeric

Python has several data types

  1. Int
  2. float
  3. Long
  4. Complex and so on

2. R Vector vs Python List

A common data type in R is the vector. Python has a similar data type, the list.

# R vectors
a<-c(4,5,1,3,4,5)
print(a[3])
## [1] 1
print(a[3:4]) # R does not always need the explicit print. 
## [1] 1 3
#R type of variable
print(class(a))
## [1] "numeric"
# Length of a
print(length(a))
## [1] 6
# Python lists
a=[4,5,1,3,4,5] # 
print(a[2]) # Some python IDEs require the explicit print
print(a[2:5])
print(type(a))
# Length of a
print(len(a))
## 1
## [1, 3, 4]
## <class 'list'>
## 6

2a. Other data types – Python

Python also has certain other data types like the tuple and the dictionary, as shown below. R does not have direct equivalents of all of them; nevertheless, we can do everything in R that Python does with them (a rough R analogue of the dictionary follows the Python example below).

# Python tuple
b = (4,5,7,8)
print(b)


#Python dictionary
c={'name':'Ganesh','age':54,'Work':'Professional'}
print(c)
#Print type of variable c
print(type(c))
## (4, 5, 7, 8)
## {'name': 'Ganesh', 'age': 54, 'Work': 'Professional'}
## <class 'dict'>
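
For comparison, the closest everyday analogue in R to a Python dictionary is a named list. Here is a minimal sketch of mine (mirroring the values in the Python example above):

#R - a named list as a rough analogue of a Python dictionary
c1 <- list(name = "Ganesh", age = 54, Work = "Professional")
print(c1$name)
## [1] "Ganesh"
print(c1[["age"]])
## [1] 54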

2b. Type of a variable

To know the type of a variable in R we use class(); in Python the corresponding command is type().

#R - Type of variable
a<-c(4,5,1,3,4,5)
print(class(a))
## [1] "numeric"
#Python - Print the types of a list and a tuple
a=[4,5,1,3,4,5]
print(type(a))
b=(4,3,"the",2)
print(type(b))
## <class 'list'>
## <class 'tuple'>

3. Length

To know length in R, use length()

#R - Length of vector
# Length of a
a<-c(4,5,1,3,4,5)
print(length(a))
## [1] 6

To know the length of a list, tuple or dict we can use len().

# Python - Length of list , tuple etc
# Length of a
a=[4,5,1,3,4,5]
print(len(a))
# Length of b
b = (4,5,7,8)
print(len(b))
## 6
## 4

4. Accessing help

To access help in R we use the ‘?’ or the ‘help’ function

#R - Help - To be done in R console or RStudio
#?sapply
#help(sapply)

Help in Python on any topic can be accessed as follows:

#Python help - This can be done on a (I)Python console
#help(len)
#?len

5. Subsetting

The key difference between R and Python with regard to subsetting is that in R the index starts at 1, while in Python it starts at 0, much like C, C++ or Java. To subset a vector in R we use:

#R - Subset
a<-c(4,5,1,3,4,8,12,18,1)
print(a[3])
## [1] 1
# To print a range or a slice. Print from the 3rd to the 6th element
print(a[3:6])
## [1] 1 3 4 8

Python also uses indices. The difference in Python is that the index starts from 0.

#Python - Subset
a=[4,5,1,3,4,8,12,18,1]
# Print the 4th element (starts from 0)
print(a[3])

# Print a slice from the 4th to the 6th element
print(a[3:6])
## 3
## [3, 4, 8]

6. Operations on vectors in R and on lists in Python

In R we can do many operations on vectors, for example element-wise addition, subtraction, product, exponentiation, etc., as shown below.

#R - Operations on vectors
a<- c(5,2,3,1,7)
b<- c(1,5,4,6,8)

#Element wise Addition
print(a+b)
## [1]  6  7  7  7 15
#Element wise subtraction
print(a-b)
## [1]  4 -3 -1 -5 -1
#Element wise product
print(a*b)
## [1]  5 10 12  6 56
# Exponentiating the elements of a vector
print(a^2)
## [1] 25  4  9  1 49

In Python, to do this on lists we need to use the map and lambda functions as follows.

# Python - Operations on list
a =[5,2,3,1,7]
b =[1,5,4,6,8]

#Element wise addition with map & lambda
print(list(map(lambda x,y: x+y,a,b)))
#Element wise subtraction
print(list(map(lambda x,y: x-y,a,b)))
#Element wise product
print(list(map(lambda x,y: x*y,a,b)))
# Exponentiating the elements of a list
print(list(map(lambda x: x**2,a)))
## [6, 7, 7, 7, 15]
## [4, -3, -1, -5, -1]
## [5, 10, 12, 6, 56]
## [25, 4, 9, 1, 49]

However, if we create NumPy ndarrays from the lists, then we can do element-wise addition, subtraction, product, etc. just like in R. NumPy is a really powerful module with many, many functions for matrix manipulation.

import numpy as np
a =[5,2,3,1,7]
b =[1,5,4,6,8]
a=np.array(a)
b=np.array(b)
#Element wise addition
print(a+b)
#Element wise subtraction
print(a-b)
#Element wise product
print(a*b)
# Exponentiating the elements of a list
print(a**2)
## [ 6  7  7  7 15]
## [ 4 -3 -1 -5 -1]
## [ 5 10 12  6 56]
## [25  4  9  1 49]

7. Getting the index of an element

To determine the index of an element which satisfies a specific logical condition in R, use which(). In the code below, the index of the element equal to 1 is 4.

# R - Which
a<- c(5,2,3,1,7)
print(which(a == 1))
## [1] 4

With a NumPy array in Python, we can use np.where to get the same effect. The index will be 3, as the index starts from 0.

# Python - np.where
import numpy as np
a =[5,2,3,1,7]
a=np.array(a)
print(np.where(a==1))
## (array([3], dtype=int64),)

8. Data frames

R, by default, comes with a set of built-in datasets. In Python, some datasets come with the Scikit-Learn package.

# R 
# To check built datasets use
#data() - In R console or in R Studio
#iris - Don't print to console

We can use the built-in datasets that come with the Scikit-Learn package.

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
# This creates a Sklearn bunch
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)

9. Working with dataframes

With R you can work with dataframes directly. For more complex dataframe operations in R there are convenient packages like dplyr, reshape2, etc. For Python we need to use the Pandas package, which is quite comprehensive in what it can do with data frames. The most common operations on a dataframe are:

  • Check the size of the dataframe
  • Take a look at the top 5 or bottom 5 rows of dataframe
  • Check the content of the dataframe

a. Size

In R use dim()

#R - Size
dim(iris)
## [1] 150   5

For Python use .shape

#Python - size
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
iris.shape

b. Top & bottom 5 rows of dataframe

To know the top and bottom rows of a data frame we use head() and tail(), as shown below for R and Python.

#R 
head(iris,5)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
tail(iris,5)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica
#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.head(5))
print(iris.tail(5))
##    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
## 0                5.1               3.5                1.4               0.2
## 1                4.9               3.0                1.4               0.2
## 2                4.7               3.2                1.3               0.2
## 3                4.6               3.1                1.5               0.2
## 4                5.0               3.6                1.4               0.2
##      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
## 145                6.7               3.0                5.2               2.3
## 146                6.3               2.5                5.0               1.9
## 147                6.5               3.0                5.2               2.0
## 148                6.2               3.4                5.4               2.3
## 149                5.9               3.0                5.1               1.8

c. Check the content of the dataframe

#R
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 150 entries, 0 to 149
## Data columns (total 4 columns):
## sepal length (cm)    150 non-null float64
## sepal width (cm)     150 non-null float64
## petal length (cm)    150 non-null float64
## petal width (cm)     150 non-null float64
## dtypes: float64(4)
## memory usage: 4.8 KB
## None

d. Check column names

#R
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"
#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
#Get column names
print(iris.columns)
## Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
##        'petal width (cm)'],
##       dtype='object')

e. Rename columns

In R we can assign a vector to column names

#R
colnames(iris) <- c("lengthOfSepal","widthOfSepal","lengthOfPetal","widthOfPetal","Species")
colnames(iris)
## [1] "lengthOfSepal" "widthOfSepal"  "lengthOfPetal" "widthOfPetal" 
## [5] "Species"

In Python we can assign a list to the dataframe’s .columns attribute.

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
iris.columns = ["lengthOfSepal","widthOfSepal","lengthOfPetal","widthOfPetal"]
print(iris.columns)
## Index(['lengthOfSepal', 'widthOfSepal', 'lengthOfPetal', 'widthOfPetal'], dtype='object')

f. Details of dataframe

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 150 entries, 0 to 149
## Data columns (total 4 columns):
## sepal length (cm)    150 non-null float64
## sepal width (cm)     150 non-null float64
## petal length (cm)    150 non-null float64
## petal width (cm)     150 non-null float64
## dtypes: float64(4)
## memory usage: 4.8 KB
## None

g. Subsetting dataframes

# R
#To subset a dataframe 'df' in R we use df[row,column] or df[row vector,column vector]
#df[row,column]
iris[3,4]
## [1] 0.2
#df[row vector, column vector]
iris[2:5,1:3]
##   lengthOfSepal widthOfSepal lengthOfPetal
## 2           4.9          3.0           1.4
## 3           4.7          3.2           1.3
## 4           4.6          3.1           1.5
## 5           5.0          3.6           1.4
#If we omit the row vector, then it implies all rows or if we omit the column vector
# then implies all columns for that row
iris[2:5,]
##   lengthOfSepal widthOfSepal lengthOfPetal widthOfPetal Species
## 2           4.9          3.0           1.4          0.2  setosa
## 3           4.7          3.2           1.3          0.2  setosa
## 4           4.6          3.1           1.5          0.2  setosa
## 5           5.0          3.6           1.4          0.2  setosa
# In R we can also select specific columns by column name
# (note that the columns were renamed above, so we use the new name)
iris$lengthOfSepal[2:5]
## [1] 4.9 4.7 4.6 5.0
#Python
# To select an entire row we use .iloc. The index can be used with ':' as in
# .iloc[start row : end row]. If the start row is omitted it implies the beginning of the
# data frame; if the end row is omitted it implies all rows till the end.
#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.iloc[3])
print(iris[:5])
# In python we can select columns by column name as follows
print(iris['sepal length (cm)'][2:6])
#If you want to select more than 2 columns then you must use the double '[[]]' since the 
# index is a list itself
print(iris[['sepal length (cm)','sepal width (cm)']][4:7])
## sepal length (cm)    4.6
## sepal width (cm)     3.1
## petal length (cm)    1.5
## petal width (cm)     0.2
## Name: 3, dtype: float64
##    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
## 0                5.1               3.5                1.4               0.2
## 1                4.9               3.0                1.4               0.2
## 2                4.7               3.2                1.3               0.2
## 3                4.6               3.1                1.5               0.2
## 4                5.0               3.6                1.4               0.2
## 2    4.7
## 3    4.6
## 4    5.0
## 5    5.4
## Name: sepal length (cm), dtype: float64
##    sepal length (cm)  sepal width (cm)
## 4                5.0               3.6
## 5                5.4               3.9
## 6                4.6               3.4

h. Computing Mean, Standard deviation

#R 
#Mean
mean(iris$lengthOfSepal)
## [1] 5.843333
#Standard deviation
sd(iris$widthOfSepal)
## [1] 0.4358663
#Python
#Mean
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
# Convert to Pandas dataframe
print(iris['sepal length (cm)'].mean())
#Standard deviation
print(iris['sepal width (cm)'].std())
## 5.843333333333335
## 0.4335943113621737

i. Boxplot

A boxplot can be produced in R using base plotting.

#R
boxplot(iris$lengthOfSepal)

 

Matplotlib is a popular package in Python for plots

#Python
import sklearn as sklearn
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
img=plt.boxplot(iris['sepal length (cm)'])
plt.show(img)

j. Scatter plot

#R
plot(iris$widthOfSepal,iris$lengthOfSepal)

#Python
import matplotlib.pyplot as plt
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
img=plt.scatter(iris['sepal width (cm)'],iris['sepal length (cm)'])
#plt.show(img)

k. Read from csv file

#R
tendulkar= read.csv("tendulkar.csv",stringsAsFactors = FALSE,na.strings=c(NA,"-"))
#Dimensions of dataframe
dim(tendulkar)
## [1] 347  13
names(tendulkar)
##  [1] "X"          "Runs"       "Mins"       "BF"         "X4s"       
##  [6] "X6s"        "SR"         "Pos"        "Dismissal"  "Inns"      
## [11] "Opposition" "Ground"     "Start.Date"

Use pandas.read_csv() for Python

#Python
import pandas as pd
#Read csv
tendulkar= pd.read_csv("tendulkar.csv",na_values=["-"])
print(tendulkar.shape)
print(tendulkar.columns)
## (347, 13)
## Index(['Unnamed: 0', 'Runs', 'Mins', 'BF', '4s', '6s', 'SR', 'Pos',
##        'Dismissal', 'Inns', 'Opposition', 'Ground', 'Start Date'],
##       dtype='object')

l. Clean the dataframe in R and Python.

The following steps are done for both R and Python:
1. Remove rows with ‘DNB’
2. Remove rows with ‘TDNB’
3. Remove rows with ‘absent’
4. Remove the “*” indicating not out
5. Remove incomplete rows with NA for R or NaN in Python
6. Do a scatter plot

#R
# Remove rows with 'DNB'
a <- tendulkar$Runs != "DNB"
tendulkar <- tendulkar[a,]
dim(tendulkar)
## [1] 330  13
# Remove rows with 'TDNB'
b <- tendulkar$Runs != "TDNB"
tendulkar <- tendulkar[b,]

# Remove rows with absent
c <- tendulkar$Runs != "absent"
tendulkar <- tendulkar[c,]
dim(tendulkar)
## [1] 329  13
# Remove the "* indicating not out
tendulkar$Runs <- as.numeric(gsub("\\*","",tendulkar$Runs))
dim(tendulkar)
## [1] 329  13
# Select only complete rows - complete.cases()
c <- complete.cases(tendulkar)
#Subset the rows which are complete
tendulkar <- tendulkar[c,]
dim(tendulkar)
## [1] 327  13
# Do some base plotting - Scatter plot
plot(tendulkar$BF,tendulkar$Runs)

#Python 
import pandas as pd
import matplotlib.pyplot as plt
#Read csv
tendulkar= pd.read_csv("tendulkar.csv",na_values=["-"])
print(tendulkar.shape)
# Remove rows with 'DNB'
a=tendulkar.Runs !="DNB"
tendulkar=tendulkar[a]
print(tendulkar.shape)
# Remove rows with 'TDNB'
b=tendulkar.Runs !="TDNB"
tendulkar=tendulkar[b]
print(tendulkar.shape)
# Remove rows with absent
c= tendulkar.Runs != "absent"
tendulkar=tendulkar[c]
print(tendulkar.shape)
# Remove the "* indicating not out
tendulkar.Runs= tendulkar.Runs.str.replace(r"[*]","")
#Select only complete rows - dropna()
tendulkar=tendulkar.dropna()
print(tendulkar.shape)
tendulkar.Runs = tendulkar.Runs.astype(int)
tendulkar.BF = tendulkar.BF.astype(int)
#Scatter plot
plt.scatter(tendulkar.BF,tendulkar.Runs)
## (347, 13)
## (330, 13)
## (329, 13)
## (329, 13)
## (327, 13)

m. Chaining operations on dataframes

To chain a set of operations in R we need a package like dplyr; Pandas supports this directly. The following operations are done on the tendulkar data frame, with dplyr for R and Pandas for Python:

  1. Group by ground
  2. Compute average runs in each ground
  3. Arrange in descending order
#R
library(dplyr)
tendulkar1 <- tendulkar %>% group_by(Ground) %>% summarise(meanRuns= mean(Runs)) %>%
         arrange(desc(meanRuns))
head(tendulkar1,10)
## # A tibble: 10 × 2
##           Ground  meanRuns
##            <chr>     <dbl>
## 1         Multan 194.00000
## 2          Leeds 193.00000
## 3  Colombo (RPS) 143.00000
## 4        Lucknow 142.00000
## 5          Dhaka 132.75000
## 6     Manchester  93.50000
## 7         Sydney  87.22222
## 8   Bloemfontein  85.00000
## 9     Georgetown  81.00000
## 10 Colombo (SSC)  77.55556
#Python
import pandas as pd
#Read csv
tendulkar= pd.read_csv("tendulkar.csv",na_values=["-"])
print(tendulkar.shape)
# Remove rows with 'DNB'
a=tendulkar.Runs !="DNB"
tendulkar=tendulkar[a]
# Remove rows with 'TDNB'
b=tendulkar.Runs !="TDNB"
tendulkar=tendulkar[b]
# Remove rows with absent
c= tendulkar.Runs != "absent"
tendulkar=tendulkar[c]
# Remove the "* indicating not out
tendulkar.Runs= tendulkar.Runs.str.replace(r"[*]","")

#Select only complete rows - dropna()
tendulkar=tendulkar.dropna()
tendulkar.Runs = tendulkar.Runs.astype(int)
tendulkar.BF = tendulkar.BF.astype(int)
tendulkar1= tendulkar.groupby('Ground').mean()['Runs'].sort_values(ascending=False)
print(tendulkar1.head(10))
## (347, 13)
## Ground
## Multan           194.000000
## Leeds            193.000000
## Colombo (RPS)    143.000000
## Lucknow          142.000000
## Dhaka            132.750000
## Manchester        93.500000
## Sydney            87.222222
## Bloemfontein      85.000000
## Georgetown        81.000000
## Colombo (SSC)     77.555556
## Name: Runs, dtype: float64

10. Functions

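#R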
product <- function(a,b){
  c<- a*b
  c
}
product(5,7)
## [1] 35
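
#Python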
def product(a,b):
  c = a*b
  return c
  
print(product(5,7))
## 35

Conclusion

Personally, I took to R much like a duck takes to water. I found the R syntax very simple and mostly intuitive. R packages like dplyr, ggplot2 and reshape2 make the language quite irresistible. R is weakly typed and has mainly numeric and character types, as opposed to the full-fledged data types in Python.

Python has too many bells and whistles, which can be a little bewildering to the novice. It is possible that they become useful as one gets more experienced with the language. Also, I found that installing Python packages sometimes gives errors with Python versions 2.7 or 3.6, which leaves you scrambling to Google for fixes; that can be quite frustrating. R, on the other hand, makes installing packages a breeze.

Anyway, this is my current opinion, and like all opinions, may change in the course of time. Let’s see!

I may write a follow up post with more advanced features of R and Python. So do keep checking! Long live R! Viva la Python!

 

Source: R vs Python: Different similarities and similar differences | R-bloggers

Sankey charts for swinging voters

At a glance:

Sankey charts based on individual level survey data are a good way of showing change from election to election. I demonstrate this, via some complications with survey-reweighting and missing data, with the New Zealand Election Study for the 2014 and 2011 elections.


Continuing my examination of the individual level voting behaviour from the New Zealand Election Study, I wanted to look at the way individuals swap between parties, and between “did not vote” and a party, from one election to another. How much and how this happens is obviously an important question for both political scientists and for politicians.

Vote transition visualisations

I chose a Sankey chart as a way of showing the transition matrix from self-reported party vote in the 2011 election to the 2014 election. Here’s a static version:

And here is the more screen-friendly interactive version, with mouseover tooltips to give actual estimates:

The point with these graphics is to highlight the transitions. For example, what were the implications of turnout being higher in 2014 than 2011 (77.9% of enrolled voters in 2014 compared to 74.2% in 2011)? Judging from this survey data, the National Party gained 6.6% of the enrolled population in 2014 by converting them from a 2011 “did not vote” and lost only 3.6% in the other direction. This net gain of three percentage points was enough to win the election for the National-led coalition. In contrast, the Labour party had a net gain from “did not vote” in 2011 of only 0.2 percentage points. Remember though that these are survey-based estimates, and subject to statistical error.
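
Those figures come straight from the weighted transition table. Here is a sketch of my own (not part of the original write-up) of how they can be read off, assuming the calibrated data frame nzes5 that is built in the code section at the bottom of this post:

library(dplyr)

# Weighted 2011 -> 2014 transition table, as a percentage of the enrolled population
nzes5 %>%
   group_by(partyvote2011, partyvote2014) %>%
   summarise(weight = sum(weight)) %>%
   ungroup() %>%
   mutate(perc_of_enrolled = round(100 * weight / sum(weight), 1)) %>%
   arrange(desc(perc_of_enrolled))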

I find setting up and polishing Sankey charts – controlling colours for example – a bit of a pain, so the code at the bottom of this post on how this was done might be of interest.

Weighting, missing data, population and age complications

Those visualisations have a few hidden fishhooks, which careful readers would find if they compare the percentages in the tooltips of the interactive version with percentages reported by the New Zealand Electoral Commission.

  • The 2014 percentages are proportions of the enrolled population. As the 2014 turnout of enrolled voters was 77.9%, the numbers here are noticeably less than the usually cited percentages which were used to translate into seat counts (for example, National Party reported party vote of 47.0% of votes becomes 36.6% of enrolled voters)
  • The 2011 percentages are even harder to explain, because I’ve chosen not only to scale the party vote and “did not vote” to the 2011 enrolled population as reported by the Commission, but also to add in around 5% of the 2014 population that were too young to vote in 2011.

Two things I would have liked to have taken into account but wasn’t able to were:

  • The “leakage” from the 2011 election of people who were deceased or had left the country by 2014
  • Explicit recognition of people who voted in 2014 but not in 2011 because they were out of the country. There is a variable in the survey that picks up the year the respondent came to live in New Zealand if not born here, but for only 10 respondents was this 2012 or later (in contrast to age – there were 58 respondents aged 20 or less in 2014).

I re-weighted the survey so the 2014 and 2011 reported party votes matched the known totals (with the addition of people aged 15 to 17 in 2011). One doesn’t normally re-weight a survey based on answers provided by the respondents, but in this case I think it makes perfect sense to calibrate to the public totals. The biggest impact is that, for both years but particularly 2011, relying on the respondents’ self-report and the published weighting of the NZES materially understates the “did not vote” totals, compared to results calibrated to the known totals.

When party vote in 2011 had been forgotten or was an NA, and this wasn’t explained by being too young in 2011, I used multiple imputation based on a subset of relevant variables to give five instances of probable party vote to each such respondent.

Taken together, all this gives the visualisations a perspective based in 2014. It is better to think of it as “where did the 2014 voters come from” than “where did the 2011 voters go”. This is fairly natural when we consider it is the 2014 New Zealand Election Study, but is worth keeping in mind in interpretation.

Age (and hence the impact of new young voters coming in, and of older voters passing on) is important in voting behaviour, as even the most casual observation of politics shows. In New Zealand, the age distribution of party voters in 2014 is seen in the chart below:

Non-voters, Green voters and to a certain extent Labour voters are young; New Zealand First voters are older. If this interests you though, I suggest you look at the multivariate analysis in this blog post or, probably more fun, this fancy interactive web app which lets you play with the predicted probabilities of voting based on a combination of demographic and socio-economic status variables.
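
For what it’s worth, here is a sketch of my own (not from the original post) of one way such an age distribution chart could be drawn, assuming the calibrated data frame nzes5 (with columns dage, partyvote2014 and weight) built in the code section below:

library(ggplot2)

# Weighted age densities by 2014 party vote
ggplot(nzes5, aes(x = dage, weight = weight, fill = partyvote2014)) +
   geom_density(alpha = 0.3) +
   labs(x = "Age in 2014", fill = "Party vote 2014")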

Code

Here’s the R code that did that imputation, weighting, and the graphics:

library(tidyverse)
library(forcats)
library(riverplot)
library(sankeyD3)
library(foreign)
library(testthat)
library(mice)
library(grid)
library(survey)

# Caution networkD3 has its own sankeyD3 with less features.  Make sure you know which one you're using!
# Don't load up networkD3 in the same session

#============import data======================
# See previous blog posts for where this comes from:
unzip("D:/Downloads/NZES2014GeneralReleaseApril16.sav.zip")

nzes_orig <- read.spss("NZES2014GeneralReleaseApril16.sav", 
                       to.data.frame = TRUE, trim.factor.names = TRUE)

# Five versions of each row, so when we do imputation later on we
# are in effect doing multiple imputation:
nzes <- nzes_orig[rep(1:nrow(nzes_orig), each = 5), ] %>%
   # lump up party vote in 2014:
   mutate(partyvote2014 = ifelse(is.na(dpartyvote), "Did not vote", as.character(dpartyvote)),
          partyvote2014 = gsub("M.ori", "Maori", partyvote2014),
          partyvote2014 = gsub("net.Man", "net-Man", partyvote2014),
          partyvote2014 = fct_infreq(partyvote2014)) %>%
   mutate(partyvote2014 = fct_lump(partyvote2014, 5)) 

# party vote in 2011, and a smaller set of columns to do the imputation:
nzes2 <- nzes %>%
   mutate(partyvote2011 = as.factor(ifelse(dlastpvote == "Don't know", NA, as.character(dlastpvote)))) %>%
   # select a smaller number of variables so imputation is feasible:
   select(dwtfin, partyvote2014, partyvote2011, dethnicity_m, dage, dhhincome, dtradeunion, dprofassoc,
          dhousing, dclass, dedcons, dwkpt)

# impute the missing values.  Although we are imputing only a single set of values,
# because we are doing it with each row repeated five times this has the same impact,
# for the simple analysis we do later on, as multiple imputation:
nzes3 <- complete(mice(nzes2, m = 1))

# Lump up the 2011 vote, tidy labels, and work out who was too young to vote:
nzes4 <- nzes3 %>%
   mutate(partyvote2011 = fct_lump(partyvote2011, 5),
          partyvote2011 = ifelse(grepl("I did not vote", partyvote2011), "Did not vote", as.character(partyvote2011)),
          partyvote2011 = ifelse(dage < 21, "Too young to vote", partyvote2011),
          partyvote2014 = as.character(partyvote2014))

#===============re-weighting to match actual votes in 2011 and 2014===================

# This makes a big difference, for both years, but more for 2011. "Did not vote" is only 16% in 2011
# if we only calibrate to 2014 voting outcomes, but in reality was 25.8%.  We calibrate for both
# 2011 and 2014 so the better proportions for both elections.

#------------2014 actual outcomes----------------
# http://www.elections.org.nz/news-media/new-zealand-2014-general-election-official-results
actual_vote_2014 <- data_frame(
   partyvote2014 = c("National", "Labour", "Green", "NZ First", "Other", "Did not vote"),
   Freq = c(1131501, 604534, 257356, 208300, 
            31850 + 16689 + 5286 + 95598 + 34095 + 10961 + 5113 + 1730 + 1096 + 872 + 639,
            NA)
)

# calculate the did not vote, from the 77.9 percent turnout
actual_vote_2014[6, 2] <- (100 / 77.9 - 1) * sum(actual_vote_2014[1:5, 2])

# check I did the turnout sums right:
expect_equal(0.779 * sum(actual_vote_2014[ ,2]), sum(actual_vote_2014[1:5, 2]))

#---------------2011 actual outcomes---------------------
# http://www.elections.org.nz/events/past-events-0/2011-general-election/2011-general-election-official-results
actual_vote_2011 <- data_frame(
   partyvote2011 = c("National", "Labour", "Green", "NZ First", "Other", "Did not vote", "Too young to vote"),
   Freq = c(1058636, 614937, 247372, 147511, 
            31982 + 24168 + 23889 + 13443 + 59237 + 11738 + 1714 + 1595 + 1209,
            NA, NA)
)
# calculate the did not vote, from the 74.21 percent turnout at 
# http://www.elections.org.nz/events/past-events/2011-general-election
actual_vote_2011[6, 2] <- (100 / 74.21 - 1) * sum(actual_vote_2011[1:5, 2])

# check I did the turnout sums right:
expect_equal(0.7421 * sum(actual_vote_2011[1:6 ,2]), sum(actual_vote_2011[1:5, 2]))

# from the below, we conclude 4.8% of the 2014 population (as estimated by NZES)
# were too young to vote in 2011:
nzes_orig %>%
   mutate(tooyoung = dage < 21) %>%
   group_by(tooyoung) %>%
   summarise(pop = sum(dwtfin),
             n = n()) %>%
   ungroup() %>%
   mutate(prop = pop / sum(pop))

# this is pretty approximate but will do for our purposes.
actual_vote_2011[7, 2] <- 0.048 * sum(actual_vote_2011[1:6, 2])

# Force the 2011 numbers to match the 2014 population
actual_vote_2011$Freq <- actual_vote_2011$Freq / sum(actual_vote_2011$Freq) * sum(actual_vote_2014$Freq)

#------------create new weights--------------------
# set up survey design with the original weights:
nzes_svy <- svydesign(~1, data = nzes4, weights = ~dwtfin)

# calibrate weights to the known total party votes in 2011 and 2014:
nzes_cal <- calibrate(nzes_svy, 
                      list(~partyvote2014, ~partyvote2011),
                      list(actual_vote_2014, actual_vote_2011))

# extract those weights for use in straight R (not just "survey")
nzes5 <- nzes4 %>%
   mutate(weight = weights(nzes_cal))

# See impact of calibrated weights for the 2014 vote:
prop.table(svytable(~partyvote2014, nzes_svy)) # original weights
prop.table(svytable(~partyvote2014, nzes_cal)) # recalibrated weights

# See impact of calibrated weights for the 2011 vote:
prop.table(svytable(~partyvote2011, nzes_svy)) # original weights
prop.table(svytable(~partyvote2011, nzes_cal)) # recalibrated weights


#=======================previous years vote riverplot version============

the_data <- nzes5 %>%
   mutate(
       # add two spaces to ensure the partyvote2011 does not have any
       # names that exactly match the party vote in 2014
       partyvote2011 = paste(partyvote2011, "  "),
       partyvote2011 = gsub("M.ori", "Maori", partyvote2011)) %>%
   group_by(partyvote2011, partyvote2014) %>%
   summarise(vote_prop = sum(weight)) %>%
   ungroup() 

# change names to the names wanted by makeRiver
names(the_data) <- c("col1", "col2", "Value")

# node IDs need to be characters, I think
c1 <- unique(the_data$col1)
c2 <- unique(the_data$col2)
nodes_ref <- data_frame(fullname = c(c1, c2)) %>%
   mutate(position = rep(c(1, 2), times = c(length(c1), length(c2)))) %>%
   mutate(ID = LETTERS[1:n()])

edges <-  the_data %>%
   left_join(nodes_ref[ , c("fullname", "ID")], by = c("col1" = "fullname")) %>%
   rename(N1 = ID) %>%
   left_join(nodes_ref[ , c("fullname", "ID")], by = c("col2" = "fullname")) %>%
   rename(N2 = ID) %>%
   as.data.frame(stringsAsFactors = FALSE)

rp <- makeRiver(nodes = as.vector(nodes_ref$ID), edges = edges,
                node_labels = nodes_ref$fullname,
                # manual vertical positioning by parties.  Look at
                # nodes_ref to see the order in which positions are set.  
                # This is a pain, so I just let it stay with the defaults:
                #  node_ypos = c(4, 1, 1.8, 3, 6, 7, 5, 4, 1, 1.8, 3, 6, 7),
                node_xpos = nodes_ref$position,
                # set party colours; all based on those in nzelect::parties_v:
                node_styles = list(C = list(col = "#d82a20"), # red labour
                                   D = list(col = "#00529F"), # blue national
                                   E= list(col = "black"),   # black NZFirst
                                   B = list(col = "#098137"), # green
                                   J = list(col = "#d82a20"), # labour
                                   I = list(col = "#098137"), # green
                                   K = list(col = "#00529F"), # national
                                   L = list(col = "black")))  # NZ First

# Turn the text horizontal, and pale grey
ds <- default.style()
ds$srt <- 0
ds$textcol <- "grey95"

mygp <- gpar(col = "grey75")
# using PNG rather than SVG as vertical lines appear in the SVG version

par(bg = "grey40")

# note the plot_area argument - for some reason it defaults to using only half the
# vertical space available, so set this higher than 0.5:
plot(rp, default_style = ds, plot_area = 0.9)

title(main = "Self-reported party vote in 2011 compared to 2014", 
      col.main = "white", font.main = 1)

grid.text(x = 0.15, y = 0.1, label = "2011 party vote", gp = mygp)
grid.text(x = 0.85, y = 0.1, label = "2014 party vote", gp = mygp)
grid.text(x = 0.95, y = 0.03, 
          gp = gpar(fontsize = 7, col = "grey75"), just = "right",
          label = "Source: New Zealand Election Study, analysed at https://ellisp.github.io")

#=======================sankeyD3 version====================
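# networkD3/sankeyD3-style widgets expect zero-based integer node indices,
# hence the shift of the IDs below.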

nodes_ref2 <- nodes_ref %>%
   mutate(ID = as.numeric(as.factor(ID)) - 1) %>%
   as.data.frame()

edges2 <- the_data %>%
   ungroup() %>%
   left_join(nodes_ref2[ , c("fullname", "ID")], by = c("col1" = "fullname")) %>%
   rename(N1 = ID) %>%
   left_join(nodes_ref2[ , c("fullname", "ID")], by = c("col2" = "fullname")) %>%
   rename(N2 = ID) %>%
   as.data.frame(stringsAsFactors = FALSE) %>%
   mutate(Value = Value / sum(Value) * 100)

pal <- 'd3.scaleOrdinal()
         .range(["#DCDCDC", "#098137", "#d82a20", "#00529F",  "#000000",  "#DCDCDC", 
                 "#DCDCDC", "#098137", "#d82a20", "#00529F", "#000000", "#DCDCDC"]);'

sankeyNetwork(Links = edges2, Nodes = nodes_ref2, 
              Source = "N1", Target = "N2", Value = "Value",
              NodeID = "fullname",
              NodeGroup = "fullname",
              LinkGroup = "col2",
              fontSize = 12, nodeWidth = 30,
              colourScale = JS(pal),
              numberFormat = JS('d3.format(".1f")'),
              fontFamily = "Calibri", units = "%", 
              nodeShadow = TRUE,
              showNodeValues = FALSE,
              width = 700, height = 500) 

#=======other by the by analysis==================
# Age density plot by party vote

# Remember to weight by the survey weights - in this case it controls for
# the under or over sampling by age in the original design.
nzes5 %>%
   ggplot(aes(x = dage, fill = partyvote2014, weight = weight / sum(nzes5$weight))) +
   geom_density(alpha = 0.3) +
   facet_wrap(~partyvote2014, scales = "free_y") +
   scale_fill_manual(values = parties_v) +
   theme(legend.position = "none") +
   labs(x = "Age at time of 2014 election",
        caption = "Source: New Zealand Election Study") +
   ggtitle("Age distribution by Party Vote in the 2014 New Zealand General Election")

 

Source: Sankey charts for swinging voters

Modelling individual party vote from the 2014 New Zealand election study

At a glance:

I work through a fairly complete modelling case study utilising methods for complex surveys, multiple imputation, multilevel models, non-linear relationships and the bootstrap. People who voted for New Zealand First in the 2014 election were more likely to be older, male, born in New Zealand, and to identify as working class.


Someone asked on Twitter about the characteristics of New Zealand First voters. While some crude conclusions can be drawn from examining votes by the location they were cast in and then comparing that with census data, we really need individual-level data to answer the question properly. I set myself this task as a motivation for exploring the New Zealand Election Study.

This turned into a fairly lengthy blog post and includes more than 500 lines of code. Skip down to the very end if you just want to see how things turn out for the four main parties and the main conclusions from that. But first here’s the best version of the result with relation to New Zealand First:

Later down there’s a chart comparing this to the voters for the National, Labour and Green parties.

Functionality

Here are the R packages I needed to do this analysis, plus a few convenience functions I wrote specifically for it.

library(tidyverse)
library(scales)
library(foreign)   # for importing SPSS data
library(survey)    # for survey weighting and analysis
library(forcats)   # for manipulating factor levels
library(mgcv)      # for generalized additive models
library(glmnet)    # for elastic net regularisation
library(mice)      # for imputation
library(testthat)
library(broom)     # for reshaping model outputs
library(stringr)   # for str_wrap
library(boot)
library(doParallel)
library(gridExtra)
library(nzelect)   # for party colours
library(lme4)      # for mixed effects / multilevel modelling

#-------------convenience functions--------------
camel_to_english <- function(camelCase){
   return(gsub("([a-z])([A-Z])", "\\1 \\L\\2", camelCase, perl = TRUE))
}
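# e.g. camel_to_english("IdentifyWorkingClass") returns "Identify working class"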

# Convert five category membership question (for trade unions, business
# associations, etc) into Yes or No.
membership <- function(x){
   tmp <- fct_recode(x,
                     Yes = "I belong, but no one else in the house does",
                     Yes = "I do, plus another in the house",
                     Yes = "Another person belongs, but not me",
                     No  = "No, no one belongs",
                     No  = "Don't know") 
   tmp <- ifelse(tmp == "Yes", 1, 0)
   # Uncomment the next line if we want to turn NA into 0.
   # tmp <- ifelse(is.na(tmp), 0, tmp)
   return(tmp)
}

# Draw confidence interval segment chart for two types of models:
# those with variance-covariance matrices, and hence can use confint()
# on them; and pooled models created by with.mice() after multiple
# imputation.
sum_plot <- function(model, 
                     title = NULL, 
                     party = gsub("PartyVote", "", as.character(model$formula)[2]),
                     colour = "#000000",
                     type = c("vcov", "pool")){
   type <- match.arg(type)
   if(type == "vcov"){
      conf_ints <- cbind(tidy(confint(model)), coef(model))
   } else {
      conf_ints <- tidy(summary(pool(model))) %>%
         select(.rownames, lo.95, hi.95, est)
   }
   
   tmp <- conf_ints %>%
      filter(.rownames != "(Intercept)")
   names(tmp)    <- c("var", "lower", "upper", "point")

   p <- tmp %>%
      mutate(var = camel_to_english(var),
             var = fct_reorder(var, point)) %>%
      ggplot(aes(y = var))  +
      geom_vline(xintercept= 0) +
      geom_segment(aes(x = lower, xend = upper, yend = var), 
                   size = 3, colour = colour, alpha = 0.5) +
      geom_point(aes(x = point)) +
      ggtitle(title) +
      labs(caption = "New Zealand Election Survey, analysed at https://ellisp.github.io",
           x = paste("Impact on log odds of voting", party),
           y = str_wrap("Compared to 'no religion', 'age 30-55', 'high household income', 'school qualification', 'working full time'", 50))
   return(p)
}

Sourcing the New Zealand Election Study data

The New Zealand Election Study generously makes its data available for free. The code snippet below assumes the zip file with the SPSS version of the data has been downloaded to the location specified on the D: drive; obviously change this to wherever you have it saved.

#------------------data download-----------------

# Data downloaded from http://www.nzes.org/exec/show/data and because
# they want you to fill in a form to know who is using the data, I
# won't re-publish it myself
unzip("D:/Downloads/NZES2014GeneralReleaseApril16.sav.zip")

nzes_orig <- read.spss("NZES2014GeneralReleaseApril16.sav", 
                  to.data.frame = TRUE, trim.factor.names = TRUE)
varlab <- cbind(attributes(nzes_orig)$variable.labels)

DT::datatable(varlab)

SPSS data includes both short column names and full verbose questions, so the data frame of metadata called varlab in the R snippet above looks like this:

Data prep

The sample size is 2,835 and it includes 234 respondents who claim they party-voted for New Zealand First (note that nine of these were identified by the researchers as actually not voting at all; I made the call to count people on the basis of their self-report). I’m interested in a demographic characterisation of the New Zealand First voters compared to the rest of the New Zealand population (rather than, for example, compared to the voters of a single other party) so I’m going to be using a logistic regression with a response variable of 1 if the respondent voted for NZ First and 0 otherwise.

It's often not appreciated that, as a rule of thumb, you need more data per explanatory variable for logistic regression than for a continuous response: twice as much if the response is split 50/50, and ten times as much if it's split 90/10.

Based on Frank Harrell's sample size advice in Regression Modeling Strategies I've got somewhere between 11 and 23 degrees of freedom to “spend” on explanatory variables; that is, 1/10 to 1/20 of the number of cases of the least frequent value of the categorical response variable. I certainly can't afford to use all 437 columns in the data. Many of those columns are attitudinal variables that are out of scope for today (I hope to come back to them later), but there's still far too much socio-economic and demographic data to use it all in a model without more data. So I follow Harrell's general approach of working out how many degrees of freedom I have to spend; which explanatory variables are committed to being in (no sneaky stepwise reductions based on p values); and how to allocate the degrees of freedom to those variables.
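As a quick back-of-envelope check of that rule of thumb (pure arithmetic, using the 234 self-reported New Zealand First voters as the least frequent response value):

# Harrell-style rule of thumb: degrees of freedom available to "spend"
234 / 20   # about 11.7 at the conservative end
234 / 10   # about 23.4 at the generous end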

My options were restricted by what is available, and within what is available I had to make choices. For example, I collapsed down the nine household income categories into just two: “lower” and “higher”. In some cases, not being an expert in exactly the best data for theoretically interesting questions, I made some fairly arbitrary choices, e.g. using the answer to “what is your housing status?” to identify owner-occupiers, rather than “do you or any family member own a residence?”

Here’s the code that created the explanatory variables I wanted, a data set that is just a subset of the explanatory variables, the survey weights, and four possible response variables (party vote for the four highest voting parties).

#============rationalised version of feature creation===========
# This is a single 100 line command to aggregate various answers into
# simpler categorisations, because we don't have enough data to 
# analyse the original granular detail.
nzes <- nzes_orig %>%
   
   # party vote:
   mutate(dpartyvote2 = ifelse(is.na(dpartyvote), "Did not vote", as.character(dpartyvote)),
          dpartyvote2 = gsub("M.ori", "Maori", dpartyvote2),
          dpartyvote2 = gsub("net.Man", "net-Man", dpartyvote2),
          dpartyvote2 = fct_infreq(dpartyvote2)) %>%
   
   mutate(PartyVoteNZFirst = (dpartyvote2 == "NZ First"),
          PartyVoteLabour =  (dpartyvote2 == "Labour"),
          PartyVoteNational = (dpartyvote2 == "National"),
          PartyVoteGreen    = (dpartyvote2 == "Green")) %>%
   # voted at all (needed for weighting):
   mutate(Voted = 1 * (ddidvote == 1)) %>%
   
   # Two degrees of freedom for ethnicity:
   mutate(NotEuropean = 1 - dethnicity_e,
          Maori = dethnicity_m) %>%
   
   # Two degrees of freedom for income (lower, higher, don't know):
   mutate(HHIncome = fct_recode(dhhincome,
                                Lower = "No income",
                                Lower = "$NZ23,000 or less",
                                Lower = "$NZ23,001-$NZ31,000",
                                Lower = "$NZ31,001-$NZ39,800",
                                Lower = "$NZ39,801-$NZ55,000",
                                Higher = "$NZ55,001-$NZ76,100",
                                Higher = "$NZ76,101-$NZ110,800",
                                Higher = "$NZ110,801-$NZ147,699",
                                Higher = "$NZ147,700 or over",
                                `Don't know / NA` = "Don't know"),
          HHIncome = ifelse(is.na(HHIncome), "Don't know / NA", as.character(HHIncome)),
          HHIncome = fct_infreq(HHIncome)) %>%
   
   ## Two - four degrees of freedom for associations?
   mutate(HHMemberTradeUnion = membership(dtradeunion),
          HHMemberProfAssoc = membership(dprofassoc)) %>%
   
   ## One degree of freedom for born in NZ
   mutate(NZBorn = ifelse(dnzborn == "New Zealand", 1, 0)
          # uncomment the next line if you want to turn NA into zero:
          # , NZBorn = ifelse(is.na(NZBorn), 0, NZBorn)
   ) %>%
   
   ## One for sex
   mutate(Male = 1 * (dsex == "Male"),
          Male = ifelse(dsex == "Transsexual or transgender", NA, Male)) %>%
   
   ## Two-four for age
   mutate(Age = fct_collapse(as.character(dage),
                             `18-29` = as.character(18:29),
                             `30-55` = as.character(30:55),
                             `56+` = as.character(56:100)),
          # Uncomment the next line if you want to explicitly code the NAs
          # Age = ifelse(is.na(dage), "unknown", as.character(Age)),
          Age = fct_relevel(Age, "30-55")
   ) %>%
   
   ## One for housing.  Note there is also an alternative question "do you or any family member own a residence"
   mutate(OwnHouseOrFlat = 1 * grepl("Own house or flat", dhousing)) %>%
   
   # Two for religion
   mutate(Religion = fct_lump(dreligion, 2)) %>%
   
   # One for marital status
   mutate(Marital = 1 * (dmarital == "Married, in a civil union, or living with a partner")) %>%
   
   # One for self-identified class
   mutate(IdentifyWorkingClass = 1 * (dclass == "Working class")) %>%
   
   ## Two for education (University, None, Other)
   mutate(HighestQual = fct_collapse(dedcons, University = c("Undergraduate Degree", "Postgraduate degree", "PhD")),
          HighestQual = fct_lump(HighestQual, 2),
          HighestQual = ifelse(dedcons == "Not known", NA, as.character(HighestQual)),
          HighestQual = fct_relevel(HighestQual, "Other")
   ) %>%
   
   ## Two for working status
   mutate(WorkStatus = ifelse(!is.na(dwkpt), "Part time", NA),
          WorkStatus = ifelse(!is.na(dwkft), "Full time", WorkStatus),
          WorkStatus = ifelse(is.na(WorkStatus), "Other or unknown", WorkStatus),
          WorkStatus = fct_infreq(WorkStatus),
          Student = 1 * (!is.na(dwksch))) %>%
   
   ## One for occupation
   mutate(SuperviseAnyone = 1 * (dsupervise == "Yes")) %>%
   # Note there is detailed occupation information (for respondent and partner)
   # but I don't think we have enough data to use this in the model.
   
   ## None for industry.
   # We have industry of employment for respondent and partner
   # but I don't think we have enough data to use this.  Plus many people
   # don't know what industry they are in anyway.
   #
   
   ## One degree of freedom for area lived in?
   # Five nice categories here, but not enough data, so we'll collapse it to a single city/not-city indicator
   mutate(City = 1 * (dregsize == "A major city (over 100,000 population)")) 

#============filter and stockcheck missing data===========
# nzes_subset is a subset of all the columns in the original nzes.
# we want a dataframe with only the variables we intend to use;
# will use it down the track for imputation.
nzes_subset <- nzes %>%
   select(PartyVoteNZFirst,
          PartyVoteNational,
          PartyVoteLabour,
          PartyVoteGreen,
          NotEuropean, Maori,
          HHIncome, 
          OwnHouseOrFlat,
          HHMemberTradeUnion, HHMemberProfAssoc,
          NZBorn,
          Religion,
          Marital,
          HighestQual,
          IdentifyWorkingClass,
          WorkStatus,
          Student,
          SuperviseAnyone,
          City,
          Male,
          Age,
          dage,
          dwtfin,
          Voted)

# 29% rows missing at least one value:
sum(complete.cases(nzes_subset)) / nrow(nzes_subset)

As the last comment in the code above notes, about 29% of rows are missing at least one value. There are three ways of dealing with this:

  1. knock out all rows missing a value
  2. explicitly include a “missing” dummy variable for each factor (which means about a dozen or more extra degrees of freedom)
  3. multiple imputation (creating multiple datasets with different imputed values, reflecting uncertainty in their expected values as well as population randomness – and pooling models estimated from each dataset).

Imputation is definitely the best option, and I will come back to it; but in the early stages I am knocking out those 29% of the data, for workflow simplicity.
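A minimal sketch of option 1, for reference (the early models below effectively do this anyway, since R's glm() drops rows with missing values in the model variables by default):

# keep only the rows with no missing values (roughly 71% of them):
nzes_cc <- nzes_subset[complete.cases(nzes_subset), ]
nrow(nzes_cc) / nrow(nzes_subset)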

glm versus svyglm versus glm with weights

The NZES data deliberately oversampled some groups (conspicuously, Māori) and used weights to control for this and for the incidental imbalance by sex, age, education, and non-voting (voters were also oversampled).

My first model was deliberately naive: I chose to ignore the weights, something I wouldn't normally do, but I wanted to see what impact it had. This gave me these results:

When I correctly used the survey weights and estimated the model via Thomas Lumley's survey package, I got this slightly different set of results:

To check this, I also tried the base function glm with a weights = argument. The publishers normalised the weights so their mean value is nearly exactly 1 (which wouldn’t be the case with a survey used to estimate population totals for official statistics purposes), so in principle this should be basically ok:
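If you want to see that normalisation for yourself, a quick sanity check is:

# the published weights should average to (roughly) one:
mean(nzes_orig$dwtfin, na.rm = TRUE)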

The naive method returns different results from the two that correctly adjust for the weights. In particular, it over-estimates the evidence linking Māori ethnicity and unknown household income to a New Zealand First vote. The two methods that adjust for weights give basically identical results to each other.

Here’s the code that fitted those first three trial models:

# For convenience, here is a model formulation we'll be using several times:
form <- as.formula(PartyVoteNZFirst ~ 
                      NotEuropean + Maori + 
                      HHIncome + 
                      OwnHouseOrFlat +
                      HHMemberTradeUnion + HHMemberProfAssoc +
                      NZBorn +
                      Religion +
                      Marital + 
                      HighestQual +
                      IdentifyWorkingClass +
                      WorkStatus +
                      Student +
                      SuperviseAnyone +
                      City +
                      Male +
                      Age)

#---------glm v svyglm with knocking out the NAs--------------------
model_naive <- glm(form, data = nzes, family = "binomial")
sum_plot(model_naive, "Generalized linear model, no weights or imputation")

nzes_svy1 <- svydesign(~1, data = nzes, weights = ~dwtfin)
model_svy1 <- svyglm(form, design = nzes_svy1, family = "quasibinomial")
sum_plot(model_svy1, "Survey-specific `svyglm`, survey weights, no imputation")

model_glmw <- glm(form, data = nzes, family = "quasibinomial", weights = dwtfin)
sum_plot(model_glmw, "'Usual' `glm`, survey weights, no imputation")

Non-linear age effect

The data provide an age in years for each respondent (except missing values), which, unlike the other variables, leaves open the possibility of modelling a non-linear, curved relationship between age and the logarithm of the odds of voting New Zealand First. In my standard models I had split age into three categories (18-29, 30-55, and 56+, with 30-55 the reference point), which allows some crude non-linearity, but it is of interest to see if there is something else going on. To test this, I fit a generalized additive model (GAM) with the mgcv package, which allows the user to specify weights. The results for all the variables of interest were very similar, and there are complexities in producing confidence intervals for fixed effects out of a GAM, so I'm not presenting them here. Instead, here's the relationship between age and voting New Zealand First:

There's a definite non-linearity here, but it's not strong or defined enough for me to worry about, and I decided it's easier to stick to my three simple categories of young, middle and old from here on. It looks like the strong impact really kicks in at about age 65 rather than 55, but I can't go back and change my category coding just to accommodate that or I'll be guilty of p-hacking.

Here’s the code for fitting the GAM:

#---------------GAM with weights------------------------------
model_gam <- gam(PartyVoteNZFirst ~ 
                    NotEuropean + Maori + 
                    HHIncome + 
                    OwnHouseOrFlat +
                    HHMemberTradeUnion + HHMemberProfAssoc +
                    NZBorn +
                    Religion +
                    Marital + 
                    HighestQual +
                    IdentifyWorkingClass +
                    WorkStatus +
                    Student +
                    SuperviseAnyone +
                    City +
                    Male +
                    s(dage),
              data = nzes, family = "quasibinomial", weights = dwtfin)
summary(model_gam)

plot(model_gam, shade = TRUE, main = "Non-linear impact of age on voting for NZ First")

Elastic net regularization

To get a feel for the robustness of the results, I also tried using both the “lasso” and “ridge regression” to shrink coefficient estimates towards zero. Results not shown here but basically all coefficient estimates were reduced to zero in either method. This intrigues me… but a thorough investigation will have to wait for a later post (maybe).

Mixed-effects (multilevel) modelling with a regional electorate effect

I’ve just finished Andrew Gelman’s Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do. One of my key takeaways was that while richer states vote Democrats over Republicans (and poor states vice versa), within each state richer individuals vote Republican over Democrats (and poorer individuals vice versa); and the slopes vary in each state. That is, there’s an interaction effect between region and individual income as explanatory variables, and also quite a noticeable region effect that is independent of individual characteristics.

More generally, Gelman uses multi-level models with a regional (and district, perhaps?) random effect as well as individual randomness, when he can get the data. My first ever analysis of voting behaviour was working with David Charnock when he was building on similar research he’d done with Australian federal politics.

We don’t have anywhere near enough data to look into the interaction effects, but we can look to see if there is a regional voting pattern that can’t be explained by individual characteristics.

The NZES data has too many missing values for Regional Council, but has near-complete data on which electorate each respondent is in. So I fit a multi-level model with an electorate-level random effect. There was a noticeable amount of variation by electorate – standard deviation of 0.33 – in the tendency to vote New Zealand First in a model with no individual explanatory variables. However, this went all the way down to zero when the individual variables were included.

Interestingly, once we control for individual characteristics, we see no evidence of a residual region-based electorate effect in tendency to vote New Zealand First.

As an aside, this doesn’t apply to the tendency to vote National, Labour or Green (ie there is a residual electorate effect for those three parties), but investigating why would take me too far afield.

Code for the multilevel models:

#------------------regional multilevel model-----------------------
table(nzes$kregcon, useNA = "always") # too many missing regions to use
table(nzes$delect, useNA = "always")  # only four missing electorates

# Note that glmer interprets "weights" as meaning number of trials
# for a binomial glm, so I'm ignoring them

nzes <- nzes %>%
   mutate(Electorate = ifelse(is.na(delect), "unknown", delect))

model_ml_null <- glmer(PartyVoteNZFirst ~ (1 | Electorate), 
                       data = nzes, family = "binomial")
summary(model_ml_null) # standard deviation of 0.3297, quite noticeable


model_ml <- glmer(PartyVoteNZFirst ~ NotEuropean + Maori + HHIncome + OwnHouseOrFlat + 
                     HHMemberTradeUnion + HHMemberProfAssoc + NZBorn + Religion + 
                     Marital + HighestQual + IdentifyWorkingClass + WorkStatus + 
                     Student + SuperviseAnyone + City + Male + Age + (1 | Electorate), 
                  data = nzes, family = "binomial")

summary(model_ml)
# but "electorate effect" completely disappears to nearly zero when individual variables were included
# - at least for NZ First.  For other parties there is a) problem converging and b)
# there does seem to be a residual electorate effect

Bootstrap with imputation and recalibration of survey weights each resample

All the various models described above I regarded as basically exploratory. My “real” model and estimation process was always going to involve a bootstrap, and imputation of the missing values in a way that captures the randomness that would have been expected to be seen if we had a full set of data.

Recalibrating survey weights

The bootstrap involves a resample with replacement of a full sized sample from the sample we’ve got. One result of doing such a resample is that the weights will no longer correctly represent the proportions of people of various age, sex, ethnicity and voting status (did or didn’t vote) as the original weighted sample did. Fixing this requires creating a new set of weights for each bootstrap resample.

To calculate those weights, I need some more information on how the original weights were set. I haven’t been able to find (admittedly, after minimal effort) a definitive description of the weighting scheme of the NZES, but it definitely includes ethnicity, age, sex, and voting status; and I thought it included education. In fact, I particularly hope it included education, because a much-awaited review out today on the polling performance leading up to the 2016 US Presidential election pinged the failure of many state polls to weight by education as a major contributor to error (more educated people respond more to surveys; and in that particular year, education was more strongly related to vote than had been the case in previous elections).

We can get a glimpse of the sampling scheme (both deliberate and incidental under and oversampling) by a regression of the published survey weight on our explanatory variables. Looking at this, I’m sure that the weights are taking education into account:

This graphic nicely shows that Maori, voters, older people, and highly educated people were over-sampled and hence have low weights; whereas young people, men and students were under-sampled (probably through lower response rates rather than research design) and hence get slightly higher weights than usual.

mod <- lm(dwtfin ~ . -dage, data = nzes_subset)

tidy(mod) %>%
   filter(term != "(Intercept)") %>%
   mutate(term = fct_reorder(term, estimate)) %>%
   ggplot(aes(x = estimate, y = term)) + 
   geom_point() +
   labs(x = "Impact on weight") +
   ggtitle("Relative weights of different demographics, NZ Election Study 2014",
           "Voters, M\u0101ori, women, older, and university qualified people are over-sampled and have low weights to compensate")

Most commonly, in a bootstrap of a complex survey one would calculate a set of “replicate weights”, which have a one-off calibration to the correct population totals using Lumley’s survey package. However, I wanted to do the calibration for each resample separately, for convenience, because I was going to perform imputation separately for each resample, which adds quite a complication. Imputation depends on the observed data, hence should be performed for each resample to ensure the full randomness of the approach is taken into account.
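For reference, here's a rough sketch of that more conventional replicate-weights route (not what I do below). It assumes a survey design object built on a complete, imputed version of the data – called nzes_svy_full here purely as a placeholder – and the population margin totals defined in the next code chunk:

# One-off set of bootstrap replicate weights, calibrated once to the known margins.
# nzes_svy_full is a hypothetical svydesign object on a complete (no-NA) version of the data.
nzes_rep <- as.svrepdesign(nzes_svy_full, type = "bootstrap", replicates = 999)
nzes_rep_cal <- calibrate(nzes_rep,
                          list(~Maori, ~Age, ~HighestQual, ~Male, ~Voted),
                          list(maori_pop, age_pop, qual_pop, male_pop, voted_pop))
# any later svyglm() on nzes_rep_cal would then report replicate-based standard errors, e.g.
# svyglm(form, design = nzes_rep_cal, family = "quasibinomial")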

To be ready for this recalibration, I needed the sums of weights for various marginal totals from the very first sample – these totals reveal the implicit known, correct population totals the original researchers used in their weighting scheme. I won't be doing it exactly the way they did, because I'm using my own simplified versions of age, qualifications, etc.; but it should be close enough for my purposes. Here's the code that calculates those marginal totals for me, for later use in the bootstrap cycle:

#--------------define population marginal totals that we will force weights to add up to-------------
age_pop <- nzes_subset %>%
   group_by(Age) %>%
   summarise(Freq = sum(dwtfin)) %>%
   filter(!is.na(Age))

male_pop <- nzes_subset %>%
   group_by(Male) %>%
   summarise(Freq = sum(dwtfin)) %>%
   filter(!is.na(Male))

qual_pop <- nzes_subset %>%
   group_by(HighestQual) %>%
   summarise(Freq = sum(dwtfin)) %>%
   filter(!is.na(HighestQual))

maori_pop <- nzes_subset %>%
   group_by(Maori) %>%
   summarise(Freq = sum(dwtfin)) %>%
   filter(!is.na(Maori))

voted_pop <- nzes_subset %>%
   group_by(Voted) %>%
   summarise(Freq = sum(dwtfin)) 

Bootstrap results

Now, I’m ready to run my loop of bootstrap resamples, to get a decent “best possible” estimate of the relationship between demographic explanatory variables and tendency to party-vote New Zealand First, compared to the rest of the New Zealand adult population. Each resample will:

  • have a single set of imputed values calculated to replace its NAs, using the mice package that is usually used for multiple imputation even though I only want one imputation set per resample. mice adequately captures the existing observed structure in the data (much better than just imputing average values), as well as the observed randomness.
  • have its weights recalibrated to the correct marginal totals
  • have a model fit to it with that full set of data and new set of weights

The end result has already been shown at the top of this post, but as this is all so long here it is reproduced:

The code that does this exercise took about 90 minutes to run on 7 cores on my laptop:

#---------------define a bootstrap function-------------
# Function that does imputation, calibration of weights, and fits
# a survey GLM to the result:
imp_cal_fit <- function(x, i){
   # for dev:
   # x <- nzes_subset; i <- sample(1:nrow(x), nrow(x), replace = TRUE)
   
   # Resample the data
   nzes_res <- x[i, ]
   
   # Create a single complete, imputed version of the data
   nzes_imp <- complete(mice(nzes_res, 1, printFlag = FALSE))
   attributes(nzes_imp$Religion)$contrasts    <- NULL
   attributes(nzes_imp$WorkStatus)$contrasts  <- NULL
   attributes(nzes_imp$HHIncome)$contrasts    <- NULL
   attributes(nzes_imp$HighestQual)$contrasts <- NULL
   attributes(nzes_imp$Age)$contrasts         <- NULL
   
   # Set up as a survey object
   nzes_svy <- svydesign(~1, data = nzes_imp, weights = ~dwtfin)
   
   # force the marginal totals to match those from the original weighting
   nzes_svy_cal <- calibrate(nzes_svy, 
                         list(~Maori, ~Age, ~HighestQual, ~Male, ~Voted), 
                         list(maori_pop, age_pop, qual_pop, male_pop, voted_pop))
   
   # Fit the model
   model_svy_cal <- svyglm(form, design = nzes_svy_cal, family = "quasibinomial")
   
   
   # Return the results
   return(coef(model_svy_cal))
}

# Set up a cluster for parallel computing
cluster <- makeCluster(7) # only any good if you have at least 7 processors
registerDoParallel(cluster)

clusterEvalQ(cluster, {
   library(survey)
   library(mice)
})

# to mimic the original deliberate over-sampling of Maori, I make the bootstrap
# resampling stratify itself on Maori too.
# Note that this takes about 90 minutes, even with the parallel computing.  Each resample
# has to do the iterative imputation from scratch, then calibrate survey weights.
system.time({
nzes_booted <- boot(nzes_subset, imp_cal_fit, 
                    R = 999, strata = nzes_subset$Maori, 
                    parallel = "snow", cl = cluster)
})

# It's not common to do it this way.  More usually you make a set of many replicate
# weights and calibrate them as a once-off job, then use those weights for any future
# estimation.  But this doesn't conveniently fit in a workflow where we are doing a 
# new imputation for each bootstrap resample (which is usually admitted to be good 
# practice, but very rarely done, precisely because it's inconvenient for workflow).


# extract the 95% confidence intervals
k <- ncol(nzes_booted$t)
results <- as.data.frame(t(sapply(1:k, function(i){boot.ci(nzes_booted, type = "perc", index = i)$percent[4:5]})))
names(results) <- c("lower", "upper")
results$term <- names(coef(model_naive))

# draw graphic:
results %>%
   filter(term != "(Intercept)") %>%
   mutate(term = camel_to_english(term),
          mid = (lower + upper ) / 2,
          term = fct_reorder(term, mid)) %>%
   ggplot(aes(y = term)) +
   geom_vline(xintercept = 0) +
   geom_segment(aes(x = lower, xend = upper, yend = term), 
                colour = parties_v["NZ First"], alpha = 0.5, size = 3) +
   ggtitle("Party vote for New Zealand First in the 2014 election",
      "Bootstrap, imputation, recalibrated survey weights, svyglm") +
   labs(caption = "New Zealand Election Survey, analysed at https://ellispgithub.io",
        x = "Impact on log odds of voting New Zealand First",
        y = str_wrap("Compared to 'no religion', 'age30-55', 'high household income', 'school qualification', 'working full time'", 50))

Multiple imputation without bootstrap

The bootstrapped, multiply-imputed model gives a warm glow of thorough justificationness (if there is such a word) but is expensive in processor time. I was left with the impression that the standard model was fitting ok enough that even analytically calculated confidence intervals are probably good enough – after all, the bootstrap results looked pretty similar to the original non-bootstrap versions with no imputation at all.

To round out the analysis with fitting similar models to all parties, I skipped the bootstrap part and just used the original sample, but with five different sets of imputed values using the core functionality provided by mice. I didn’t bother to re-calibrate the weights for each set of imputed values because I’m fairly confident changes from the original weights would have been minimal. This means I have five different complete (no NA values) datasets, each of my four models is fit to each dataset, and the five results for each party are pooled using the Barnard-Rubin method.

Here’s the end results (click image to enlarge):

… and the code to do it is pretty simple:

#============multiple imputation all four main parties===============
# see http://r-survey.r-forge.r-project.org/survey/svymi.html for an alternative way to do this
# Also see https://stats.stackexchange.com/questions/149053/questions-on-multiple-imputation-with-mice-for-a-multigroup-sem-analysis-inclu

nzes_mi <- mice(nzes_subset)

attributes(nzes_mi$data$Age)$contrasts <- NULL
attributes(nzes_mi$data$Religion)$contrasts <- NULL
attributes(nzes_mi$data$HighestQual)$contrasts <- NULL
attributes(nzes_mi$data$WorkStatus)$contrasts <- NULL
attributes(nzes_mi$data$HHIncome)$contrasts <- NULL

responses <- paste0("PartyVote", c("National", "Labour", "NZFirst", "Green"))
colours <- parties_v[c("National", "Labour", "NZ First", "Green")]
p <- list()

form_gen <- "XXX ~ NotEuropean + Maori + HHIncome + OwnHouseOrFlat +
HHMemberTradeUnion + HHMemberProfAssoc + NZBorn + Religion +
Marital + HighestQual + IdentifyWorkingClass + WorkStatus +
Student + SuperviseAnyone + City + Male + Age"

for(i in 1:length(responses)){
   form2 <- gsub("XXX", responses[i], form_gen)
   model_mi <- with(data = nzes_mi,
                    glm(as.formula(form2), 
                        family = "quasibinomial", weights = dwtfin))
   p[[i]] <- sum_plot(model_mi, 
                      title = paste(gsub("PartyVote", "Party vote for ", responses[i]),
                                    "compared to rest of population"),
                      colour = colours[i],
                      type = "pool",
                      party = gsub("PartyVote", "", responses[i]))
}

grid.arrange(p[[1]], p[[2]], p[[3]], p[[4]])

Discussion

While I do a chunk of analysis on voting behaviour (eg see my election forecasts), I'm the wrong person for all sorts of reasons to give actual commentary on New Zealand politics. So I'm going to make only the most obvious comments:

  • To answer the original question at the beginning of this post, New Zealand First voters were more likely than the general population to be older, male, born in New Zealand, and to identify as working class
  • No surprise that home owners, higher income people, those of European heritage, those who don't identify as working class, and those without a trade union member in the household voted for the centre-right National party
  • Perhaps more surprising is that being born in New Zealand is estimated to make people less likely to vote for National after controlling for education, ethnicity, income, etc. (in fact, this one is surprising enough that I'd like to see it replicated with fresh data before making too much of it)
  • In addition to the variables that would be expected to lead to a Labour vote (union membership, lower income, fewer qualifications), being Maori and being female are both estimated to point in that direction
  • The characteristics of Green voters are an interesting but not that surprising list – one or more of university qualifications, being a student, working part time, a union member in the household, living in a city, not identifying as Christian, and not being of European ethnicity.

So there we go: the relationship of individual socio-economic status to voting behaviour in the 2014 New Zealand General Election. Rather sadly, I don't see much of this amazing dataset in the academic literature. Maybe I'm looking in the wrong spot.

Corrections and other comments welcomed. Beware – all the above code isn’t checked by anyone except me.

Big thanks to the Electoral Commission for funding the NZES, the NZES researchers themselves, and (as always) everyone behind the R core project and the 16 R packages listed near the beginning of this post.

> thankr::shoulders() %>%
+    mutate(maintainer = gsub("<.*>", "", maintainer)) %>%
+    group_by(maintainer) %>%
+    summarise(no_packages = sum(no_packages)) %>%
+    arrange(desc(no_packages)) %>%
+    as.data.frame()
             maintainer no_packages
1       Hadley Wickham           18
2          R Core Team           12
3         Brian Ripley            4
4        Kirill Müller            3
5         Rich Calaway            3
6           Yixuan Qiu            3
7    Dirk Eddelbuettel            2
8       Jennifer Bryan            2
9      Martin Maechler            2
10     "Thomas Lumley"            1
11       Achim Zeileis            1
12    Adelchi Azzalini            1
13     Baptiste Auguie            1
14          Ben Bolker            1
15   Charlotte Wickham            1
16      David Robinson            1
17     Deepayan Sarkar            1
18  Duncan Temple Lang            1
19        Gábor Csárdi            1
20        James Hester            1
21         Jelmer Ypma            1
22         Jeroen Ooms            1
23          Jim Hester            1
24          JJ Allaire            1
25           Joe Cheng            1
26 Katharine M. Mullen            1
27        Luke Tierney            1
28    Marek Gagolewski            1
29         Peter Ellis            1
30              R-core            1
31          Simon Wood            1
32     Stef van Buuren            1
33 Stefan Milton Bache            1
34    Terry M Therneau            1
35       Trevor Hastie            1
36       Vitalie Spinu            1
37     William Revelle            1
38       Winston Chang            1
39           Yihui Xie            1

 

Source: Modelling individual party vote from the 2014 New Zealand election study

Network analysis of Game of Thrones family ties

In this post, I am exploring network analysis techniques in a family network of major characters from Game of Thrones.

Not surprisingly, we learn that House Stark (specifically Ned and Sansa) and House Lannister (especially Tyrion) are the most important family connections in Game of Thrones; they also connect many of the storylines and are central parts of the narrative.

What is a network?

A network in this context is a graph of interconnected nodes/vertices. Nodes can e.g. be people in a social network, genes in a co-expression network, etc. Nodes are connected via ties/edges.
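As a tiny, made-up illustration of those terms (the names below are hypothetical and have nothing to do with the Game of Thrones data):

library(igraph)

# four people (nodes) and who is connected to whom (ties/edges):
toy_graph <- graph_from_literal(Ann - Bob, Bob - Cat, Cat - Dan, Ann - Cat)
V(toy_graph)      # the nodes/vertices
E(toy_graph)      # the ties/edges
degree(toy_graph) # number of connections per person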

What can network analysis tell us?

Network analysis can e.g. be used to explore relationships in social or professional networks. In such cases, we would typically ask questions like:

  • How many connections does each person have?
  • Who is the most connected (i.e. influential or “important”) person?
  • Are there clusters of tightly connected people?
  • Are there a few key players that connect clusters of people?
  • etc.

These answers can give us a lot of information about the patterns of how people interact.

The Game of Thrones character network

The basis for this network is Kaggle's Game of Thrones dataset (character-deaths.csv). Because most family relationships were missing in that dataset, I added the missing information partly by hand (based on A Wiki of Ice and Fire) and partly by scraping information from the Game of Thrones wiki. You can find the full code for how I generated the network on my Github page.

library(tidyverse)
library(igraph)
library(statnet)
load("union_edges.RData")
load("union_characters.RData")

I am using igraph to plot the initial network. To do so, I first create the graph from the edge- and node-table. An edge-table contains source and target nodes in the first two columns and optionally additional columns with edge attributes. Here, I have the type of interaction (mother, father or spouse), the color and line-type I want to assign to each edge.

Because the books and the TV series differ slightly, I have introduced edges that are only supported or hinted at by the TV series and are not part of the original narrative in the books. These edges are marked by being dotted instead of solid. An additional color for edges with unspecified parental origin is introduced as well. Originally, these served for interactions that were extracted from character names (i.e. characters whose names ended with “… son/daughter of …”) and could mean either mother or father. Now, they show unclear parentage or cases where there is a biological and a de facto father, as in the case of Jon Snow.

head(union_edges)
##               source            target   type   color   lty
## 1         Lysa Arryn      Robert Arryn mother #7570B3 solid
## 2       Jasper Arryn        Alys Arryn father #1B9E77 solid
## 3       Jasper Arryn         Jon Arryn father #1B9E77 solid
## 4          Jon Arryn      Robert Arryn father #1B9E77 solid
## 110 Cersei Lannister  Tommen Baratheon mother #7570B3 solid
## 210 Cersei Lannister Joffrey Baratheon mother #7570B3 solid

The node table contains one row for each character that is either a source or a target in the edge table. We can give any number and type of node attributes. Here, I chose the following columns from the original Kaggle dataset: gender/male (male = 1, female = 0), house (the house each character was born into) and popularity. House2 was meant to assign a color to only the major houses. Shape represents the gender.

head(union_characters)
##            name male culture          house popularity      house2   color  shape
## 1    Alys Arryn    0        House Arryn 0.08026756             circle
## 2 Elys Waynwood    0     House Waynwood 0.07023411             circle
## 3  Jasper Arryn    1        House Arryn 0.04347826             square
## 4   Jeyne Royce    0        House Royce 0.00000000             circle
## 5     Jon Arryn    1 Valemen    House Arryn 0.83612040             square
## 6    Lysa Arryn    0        House Tully 0.00000000 House Tully #F781BF circle

By default, we have a directed graph.

union_graph <- graph_from_data_frame(union_edges, directed = TRUE, vertices = union_characters)

For plotting the legend, I am summarizing the edge and node colors.

color_vertices <- union_characters %>%
  group_by(house, color) %>%
  summarise(n = n()) %>%
  filter(!is.na(color))

colors_edges <- union_edges %>%
  group_by(type, color) %>%
  summarise(n = n()) %>%
  filter(!is.na(color))

Now, we can plot the graph object (here with Fruchterman-Reingold layout):

layout <- layout_with_fr(union_graph)

Click on the image to get to the high resolution pdf:

plot(union_graph,
     layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph)$name),
     vertex.shape = V(union_graph)$shape,
     vertex.color = V(union_graph)$color, 
     vertex.size = (V(union_graph)$popularity + 0.5) * 5, 
     vertex.frame.color = "gray", 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8,
     edge.arrow.size = 0.5,
     edge.color = E(union_graph)$color,
     edge.lty = E(union_graph)$lty)
legend("topleft", legend = c(NA, "Node color:", as.character(color_vertices$house), NA, "Edge color:", as.character(colors_edges$type)), pch = 19,
       col = c(NA, NA, color_vertices$color, NA, NA, colors_edges$color), pt.cex = 5, cex = 2, bty = "n", ncol = 1,
       title = "") 
legend("topleft", legend = "", cex = 4, bty = "n", ncol = 1,
       title = "Game of Thrones Family Ties")

family_ties_1

Node color shows the major houses, node size the character’s popularity and node shape their gender (square for male, circle for female). Edge color shows interaction type.

As we can see, even with only a subset of characters from the Game of Thrones world, the network is already quite big. You can click on the image to open the pdf and zoom into specific parts of the plot and read the node labels/character names.

What we can see right away is that there are only limited connections between houses and that the Greyjoys are the only house that has no ties to any of the others.

Network analysis

How do we find out who the most important characters are in this network?

We consider a character “important” if he or she has connections to many other characters. There are a few network properties that tell us more about this. For this, I am considering the network as undirected, to account for parent/child relationships being mutual.

union_graph_undir <- as.undirected(union_graph, mode = "collapse")

Centrality

Centrality describes the number of edges that go into or out of each node. High-centrality networks have a few nodes with many connections; low-centrality networks have many nodes with similar numbers of edges.

“Centralization is a method for creating a graph level centralization measure from the centrality scores of the vertices.” centralize() help

For the whole network, we can calculate centrality by degree (centr_degree()), closeness (centr_clo()) or eigenvector centrality (centr_eigen()) of vertices.

centr_degree(union_graph_undir, mode = "total")$centralization
## [1] 0.04282795
centr_clo(union_graph_undir, mode = "total")$centralization
## [1] 0.01414082
centr_eigen(union_graph_undir, directed = FALSE)$centralization
## [1] 0.8787532

Node degree

Node degree or degree centrality describes how central a node is in the network (i.e. how many in- and outgoing edges it has or to how many other nodes it is directly connected via one edge).

“The degree of a vertex is its most basic structural property, the number of its adjacent edges.” From the help pages of degree()

We can calculate the number of outgoing or incoming edges of each node, or – as I am doing here – the sum of both.

union_graph_undir_degree <- igraph::degree(union_graph_undir, mode = "total")

#standardized by number of nodes
union_graph_undir_degree_std <- union_graph_undir_degree / (vcount(union_graph_undir) - 1)
node_degree <- data.frame(degree = union_graph_undir_degree,
                          degree_std = union_graph_undir_degree_std) %>%
  tibble::rownames_to_column()

union_characters <- left_join(union_characters, node_degree, by = c("name" = "rowname"))

node_degree %>%
  arrange(-degree) %>%
  .[1:10, ]
##            rowname degree degree_std
## 1  Quellon Greyjoy     12 0.05797101
## 2      Walder Frey     10 0.04830918
## 3   Oberyn Martell     10 0.04830918
## 4     Eddard Stark      9 0.04347826
## 5    Catelyn Stark      8 0.03864734
## 6       Emmon Frey      7 0.03381643
## 7  Genna Lannister      7 0.03381643
## 8     Merrett Frey      7 0.03381643
## 9    Balon Greyjoy      7 0.03381643
## 10 Jason Lannister      7 0.03381643

In this case, the node degree reflects how many offspring and spouses a character had. With three wives and several children, Quellon Greyjoy, the grandfather of Theon and Asha/Yara, comes out on top (of course, had I included all of Walder Frey's offspring and wives, he would easily be on top, but the network would have gotten infinitely more confusing).

Closeness

The closeness of a node describes its distance to all other nodes. A node with highest closeness is more central and can spread information to many other nodes.

closeness <- igraph::closeness(union_graph_undir, mode = "total")

#standardized by number of nodes
closeness_std <- closeness / (vcount(union_graph_undir) - 1)
node_closeness <- data.frame(closeness = closeness,
                          closeness_std = closeness_std) %>%
  tibble::rownames_to_column()

union_characters <- left_join(union_characters, node_closeness, by = c("name" = "rowname"))

node_closeness %>%
  arrange(-closeness) %>%
  .[1:10, ]
##             rowname    closeness closeness_std
## 1       Sansa Stark 0.0002013288  9.726028e-07
## 2  Tyrion Lannister 0.0002012882  9.724070e-07
## 3   Tywin Lannister 0.0002011668  9.718201e-07
## 4  Joanna Lannister 0.0002005616  9.688965e-07
## 5      Eddard Stark 0.0002002804  9.675381e-07
## 6     Catelyn Stark 0.0001986492  9.596579e-07
## 7  Cersei Lannister 0.0001984915  9.588960e-07
## 8   Jaime Lannister 0.0001975894  9.545382e-07
## 9    Jeyne Marbrand 0.0001966568  9.500330e-07
## 10  Tytos Lannister 0.0001966568  9.500330e-07

The characters with highest closeness all surround central characters that connect various storylines and houses in Game of Thrones.

Betweenness centrality

Betweenness describes the number of shortest paths between nodes. Nodes with high betweenness centrality are on the path between many other nodes, i.e. they are people who are key connections or bridges between different groups of nodes. In a social network, these nodes would be very important because they are likely to pass on information to a wide reach of people.

The igraph function betweenness() calculates vertex betweenness, edge_betweenness() calculates edge betweenness:

“The vertex and edge betweenness are (roughly) defined by the number of geodesics (shortest paths) going through a vertex or an edge.” igraph help for estimate_betweenness()

betweenness <- igraph::betweenness(union_graph_undir, directed = FALSE)

# standardize by number of node pairs
betweenness_std <- betweenness / ((vcount(union_graph_undir) - 1) * (vcount(union_graph_undir) - 2) / 2)

node_betweenness <- data.frame(betweenness = betweenness,
                               betweenness_std = betweenness_std) %>%
  tibble::rownames_to_column() 

union_characters <- left_join(union_characters, node_betweenness, by = c("name" = "rowname"))

node_betweenness %>%
  arrange(-betweenness) %>%
  .[1:10, ]
##              rowname betweenness betweenness_std
## 1       Eddard Stark    6926.864       0.3248846
## 2        Sansa Stark    6165.667       0.2891828
## 3   Tyrion Lannister    5617.482       0.2634718
## 4    Tywin Lannister    5070.395       0.2378123
## 5   Joanna Lannister    4737.524       0.2221999
## 6  Rhaegar Targaryen    4301.583       0.2017533
## 7    Margaery Tyrell    4016.417       0.1883784
## 8           Jon Snow    3558.884       0.1669192
## 9        Mace Tyrell    3392.500       0.1591154
## 10   Jason Lannister    3068.500       0.1439191
edge_betweenness <- igraph::edge_betweenness(union_graph_undir, directed = FALSE)

data.frame(edge = attr(E(union_graph_undir), "vnames"),
           betweenness = edge_betweenness) %>%
  tibble::rownames_to_column() %>%
  arrange(-betweenness) %>%
  .[1:10, ]
##    rowname                              edge betweenness
## 1      160      Sansa Stark|Tyrion Lannister    5604.149
## 2      207          Sansa Stark|Eddard Stark    4709.852
## 3      212        Rhaegar Targaryen|Jon Snow    3560.083
## 4      296       Margaery Tyrell|Mace Tyrell    3465.000
## 5      213             Eddard Stark|Jon Snow    3163.048
## 6      131  Jason Lannister|Joanna Lannister    3089.500
## 7      159 Joanna Lannister|Tyrion Lannister    2983.591
## 8      171  Tyrion Lannister|Tywin Lannister    2647.224
## 9      192    Elia Martell|Rhaegar Targaryen    2580.000
## 10     300         Luthor Tyrell|Mace Tyrell    2565.000

We can now plot this by feeding the node betweenness to vertex.size and the edge betweenness to edge.width in our plot function:

plot(union_graph_undir,
     layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph_undir)$name),
     vertex.shape = V(union_graph_undir)$shape,
     vertex.color = V(union_graph_undir)$color, 
     vertex.size = betweenness * 0.001, 
     vertex.frame.color = "gray", 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8,
     edge.width = edge_betweenness * 0.01,
     edge.arrow.size = 0.5,
     edge.color = E(union_graph_undir)$color,
     edge.lty = E(union_graph_undir)$lty)
legend("topleft", legend = c("Node color:", as.character(color_vertices$house), NA, "Edge color:", as.character(colors_edges$type)), pch = 19,
       col = c(NA, color_vertices$color, NA, NA, colors_edges$color), pt.cex = 5, cex = 2, bty = "n", ncol = 1)

family_ties_1

Ned Stark is the character with the highest betweenness. This makes sense, as he and his children (specifically Sansa and her arranged marriage to Tyrion) connect to other houses and are the central points from which the story unfolds. However, keep in mind that my choice of who was important enough to include in the network (e.g. the Stark ancestors) and who was not (e.g. the whole complicated mess that is the Targaryen and Frey family tree) makes this result somewhat biased.

Diameter

In contrast to the shortest path between two nodes, we can also calculate the longest of all shortest paths, the diameter:

diameter(union_graph_undir, directed = FALSE)
## [1] 21

In our network, the longest shortest path between two nodes has a length of 21.
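
If we want to see which two characters sit at the ends of this longest shortest path, a minimal sketch (assuming igraph >= 1.0, where the function is called farthest_vertices()) would be:

# Sketch: the two end points of the diameter path and their distance
igraph::farthest_vertices(union_graph_undir, directed = FALSE)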

“get_diameter returns a path with the actual diameter. If there are many shortest paths of the length of the diameter, then it returns the first one found.” diameter() help

We can also plot this:

union_graph_undir_diameter <- union_graph_undir
node_diameter <- get.diameter(union_graph_undir_diameter,  directed = FALSE)

V(union_graph_undir_diameter)$color <- scales::alpha(V(union_graph_undir_diameter)$color, alpha = 0.5)
V(union_graph_undir_diameter)$size <- 2

V(union_graph_undir_diameter)[node_diameter]$color <- "red"
V(union_graph_undir_diameter)[node_diameter]$size <- 5

E(union_graph_undir_diameter)$color <- "grey"
E(union_graph_undir_diameter)$width <- 1

E(union_graph_undir_diameter, path = node_diameter)$color <- "red"
E(union_graph_undir_diameter, path = node_diameter)$width <- 5

plot(union_graph_undir_diameter,
     layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph_undir_diameter)$name),
     vertex.shape = V(union_graph_undir_diameter)$shape,
     vertex.frame.color = "gray", 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8,
     edge.arrow.size = 0.5,
     edge.lty = E(union_graph_undir_diameter)$lty)
legend("topleft", legend = c("Node color:", as.character(color_vertices$house), NA, "Edge color:", as.character(colors_edges$type)), pch = 19,
       col = c(NA, color_vertices$color, NA, NA, colors_edges$color), pt.cex = 5, cex = 2, bty = "n", ncol = 1)

family_ties_1

Transitivity

“Transitivity measures the probability that the adjacent vertices of a vertex are connected. This is sometimes also called the clustering coefficient.” transitivity() help

We can calculate the transitivity or ratio of triangles to connected triples for the whole network:

transitivity(union_graph_undir, type = "global")
## [1] 0.2850679
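
As a minimal cross-check (a sketch, assuming igraph >= 1.0 for count_triangles()), the global transitivity is three times the number of triangles divided by the number of connected triples:

triangles <- sum(igraph::count_triangles(union_graph_undir)) / 3
triples <- sum(choose(igraph::degree(union_graph_undir), 2))
3 * triangles / triples  # should reproduce the global value above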

Or for each node:

transitivity <- data.frame(name = V(union_graph_undir)$name,
      transitivity = transitivity(union_graph_undir, type = "local")) %>%
  mutate(name = as.character(name))

union_characters <- left_join(union_characters, transitivity, by = "name")

transitivity %>%
  arrange(-transitivity) %>%
  .[1:10, ]
##                 name transitivity
## 1       Robert Arryn            1
## 2   Ormund Baratheon            1
## 3     Selyse Florent            1
## 4  Shireen Baratheon            1
## 5   Amarei Crakehall            1
## 6       Marissa Frey            1
## 7        Olyvar Frey            1
## 8        Perra Royce            1
## 9        Perwyn Frey            1
## 10         Tion Frey            1

Because ours is a family network, characters with a transitivity of one form triangles with their parents or offspring.
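
A toy illustration (not the Game of Thrones data): a child connected to two married parents forms a closed triangle, so every vertex in it has a local transitivity of 1.

toy_family <- igraph::graph_from_literal(Father - Mother, Father - Child, Mother - Child)
igraph::transitivity(toy_family, type = "local")
## [1] 1 1 1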

PageRank centrality

PageRank (originally used by Google to rank the importance of search results) is similar to eigenvector centrality. Eigenvector centrality scores nodes in a network according to the number of connections to high-degree nodes they have. It is therefore a measure of node importance. PageRank similarly considers nodes as more important if they have many incoming edges (or links).

page_rank <- page.rank(union_graph_undir, directed = FALSE)

page_rank_centrality <- data.frame(name = names(page_rank$vector),
      page_rank = page_rank$vector) %>%
  mutate(name = as.character(name))

union_characters <- left_join(union_characters, page_rank_centrality, by = "name")

page_rank_centrality %>%
  arrange(-page_rank) %>%
  .[1:10, ]
##                 name   page_rank
## 1     Oberyn Martell 0.018402407
## 2    Quellon Greyjoy 0.016128129
## 3        Walder Frey 0.012956029
## 4       Eddard Stark 0.011725019
## 5       Cregan Stark 0.010983561
## 6      Catelyn Stark 0.010555473
## 7       Lyarra Stark 0.009876629
## 8  Aegon V Targaryen 0.009688458
## 9      Balon Greyjoy 0.009647049
## 10         Jon Arryn 0.009623742

Oberyn Martell, Quellon Greyjoy and Walder Frey all have the highest number of spouses, children and grandchildren and are therefore scored highest for PageRank.

Matrix representation of a network

Connections between nodes can also be represented as an adjacency matrix. We can convert our graph object to an adjacency matrix with igraph’s as_adjacency_matrix() function. Whenever there is an edge between two nodes, the corresponding entry of the matrix is set to 1; otherwise it is 0.

adjacency <- as.matrix(as_adjacency_matrix(union_graph_undir))
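
A toy illustration (not the Game of Thrones data) of what such a matrix looks like for a small path graph A - B - C:

toy <- igraph::graph_from_literal(A - B, B - C)
as.matrix(igraph::as_adjacency_matrix(toy))
##   A B C
## A 0 1 0
## B 1 0 1
## C 0 1 0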

Eigenvector centrality

We can now calculate eigenvalues and eigenvectors, here of a PageRank-style matrix that we build from the adjacency matrix and the diagonal matrix of inverse degrees.

#degree diagonal matrix
degree_diag <- diag(1 / igraph::degree(union_graph_undir))

# PageRank matrix
pagerank <- adjacency %*% degree_diag

eigenvalues <- eigen(pagerank)

The eigenvector with the highest eigenvalue scores those vertices highly that have many edges or that are connected to vertices with many edges.

eigenvector <- data.frame(name = rownames(pagerank),
           eigenvector = as.numeric(eigenvalues$vectors[, which.max(eigenvalues$values)]))

union_characters <- left_join(union_characters, eigenvector, by = "name")

eigenvector %>%
  arrange(eigenvector) %>%
  .[1:10, ]
##                       name eigenvector
## 1          Quellon Greyjoy  -0.6625628
## 2            Balon Greyjoy  -0.3864950
## 3   Lady of House Sunderly  -0.3312814
## 4           Alannys Harlaw  -0.2760678
## 5  Lady of House Stonetree  -0.2208543
## 6      Asha (Yara) Greyjoy  -0.1656407
## 7            Robin Greyjoy  -0.1104271
## 8            Euron Greyjoy  -0.1104271
## 9          Urrigon Greyjoy  -0.1104271
## 10       Victarion Greyjoy  -0.1104271

Because of their highly connected family ties (i.e. there are only a handful of connections, but almost all of them form triangles), the Greyjoys received the highest absolute scores in this eigenvector.

We can find the eigenvector centrality scores with:

eigen_centrality <- igraph::eigen_centrality(union_graph_undir, directed = FALSE)

eigen_centrality <- data.frame(name = names(eigen_centrality$vector),
           eigen_centrality = eigen_centrality$vector) %>%
  mutate(name = as.character(name))

union_characters <- left_join(union_characters, eigen_centrality, by = "name")

eigen_centrality %>%
  arrange(-eigen_centrality) %>%
  .[1:10, ]
##                name eigen_centrality
## 1   Tywin Lannister        1.0000000
## 2  Cersei Lannister        0.9168980
## 3  Joanna Lannister        0.8358122
## 4    Jeyne Marbrand        0.8190076
## 5   Tytos Lannister        0.8190076
## 6   Genna Lannister        0.7788376
## 7   Jaime Lannister        0.7642870
## 8  Robert Baratheon        0.7087042
## 9        Emmon Frey        0.6538709
## 10      Walder Frey        0.6516021

When we consider eigenvector centrality, Tywin and the core Lannister family score highest.
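
As a minimal cross-check (a sketch, using the adjacency matrix created above): eigenvector centrality corresponds, up to sign and scaling, to the leading eigenvector of the symmetric adjacency matrix itself.

ev <- eigen(adjacency)$vectors[, 1]
ev <- abs(ev) / max(abs(ev))
names(ev) <- rownames(adjacency)
head(sort(ev, decreasing = TRUE))  # should rank the Lannister core highest, as above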

Who are the most important characters?

We can now compare all the node-level information to decide which characters are the most important in Game of Thrones. Such node level characteristics could also be used as input for machine learning algorithms.

Let’s look at all characters from the major houses:

union_characters %>%
  filter(!is.na(house2)) %>%
  dplyr::select(-contains("_std")) %>%
  gather(x, y, degree:eigen_centrality) %>%
  ggplot(aes(x = name, y = y, color = house2)) +
    geom_point(size = 3) +
    facet_grid(x ~ house2, scales = "free") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

family_ties_1

Taken together, we could say that House Stark (specifically Ned and Sansa) and House Lannister (especially Tyrion) are the most important family connections in Game of Thrones.

Groups of nodes

We can also analyze dyads (pairs of two nodes), triads (groups of three nodes) and bigger cliques in our network. For dyads, we can use the function dyad_census() from igraph or dyad.census() from sna. Both are identical and calculate a Holland and Leinhardt dyad census with

  • mut: The number of pairs with mutual connections (in our case, spouses).
  • asym: The number of pairs with non-mutual connections (in the original network: mother-child and father-child relationships; but in the undirected network, there are none).
  • null: The number of pairs with no connection between them.
#igraph::dyad_census(union_graph_undir)
sna::dyad.census(adjacency)
##      Mut Asym  Null
## [1,] 326    0 21202
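
As a quick sanity check (sketch): the three counts should add up to the total number of unordered node pairs, n * (n - 1) / 2, and indeed 326 + 0 + 21202 = 21528.

choose(vcount(union_graph_undir), 2)
## [1] 21528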

The same can be calculated for triads (see ?triad_census for details on what each output means).

#igraph::triad_census(union_graph_undir)
sna::triad.census(adjacency)
##          003 012   102 021D 021U 021C 111D 111U 030T 030C 201 120D 120U 120C 210 300
## [1,] 1412100   0 65261    0    0    0    0    0    0    0 790    0    0    0   0 105
triad.classify(adjacency, mode = "graph")
## [1] 2

We can also calculate the number of paths and cycles up to any length we specify, here e.g. of length <= 5. With tabulate.by.vertex = FALSE we obtain the aggregate counts of all paths or cycles up to the given maximum length; with tabulate.by.vertex = TRUE we obtain, for each node, the number of paths or cycles it belongs to.

node_kpath <- kpath.census(adjacency, maxlen = 5, mode = "graph", tabulate.by.vertex = TRUE, dyadic.tabulation = "sum")
edge_kpath <- kpath.census(adjacency, maxlen = 5, mode = "graph", tabulate.by.vertex = FALSE)
edge_kpath
## $path.count
##     1     2     3     4     5 
##   326  1105  2973  7183 17026

We could plot this as follows (though here it does not add much information):

gplot(node_kpath$paths.bydyad,
      label.cex = 0.5, 
      vertex.cex = 0.75,
      displaylabels = TRUE,
      edge.col = "grey")
node_kcycle <- kcycle.census(adjacency, maxlen = 8, mode = "graph", tabulate.by.vertex = TRUE, cycle.comembership = "sum")
edge_kcycle <- kcycle.census(adjacency, maxlen = 8, mode = "graph", tabulate.by.vertex = FALSE)
edge_kcycle
## $cycle.count
##   2   3   4   5   6   7   8 
##   0 105 136  27  57  58  86
node_kcycle_reduced <- node_kcycle$cycle.comemb
node_kcycle_reduced <- node_kcycle_reduced[which(rowSums(node_kcycle_reduced) > 0), which(colSums(node_kcycle_reduced) > 0)]

gplot(node_kcycle_reduced,
      label.cex = 0.5, 
      vertex.cex = 0.75,
      displaylabels = TRUE,
      edge.col = "grey")

family_ties_1

“A (maximal) clique is a maximal set of mutually adjacent vertices.” clique.census() help

node_clique <- clique.census(adjacency, mode = "graph", tabulate.by.vertex = TRUE, clique.comembership = "sum")
edge_clique <- clique.census(adjacency, mode = "graph", tabulate.by.vertex = FALSE, clique.comembership = "sum")
edge_clique$clique.count
##   1   2   3 
##   0  74 105
node_clique_reduced <- node_clique$clique.comemb
node_clique_reduced <- node_clique_reduced[which(rowSums(node_clique_reduced) > 0), which(colSums(node_clique_reduced) > 0)]

gplot(node_clique_reduced,
      label.cex = 0.5, 
      vertex.cex = 0.75,
      displaylabels = TRUE,
      edge.col = "grey")

family_ties_1
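
As a quick cross-check (a sketch; the function is called clique_num() in igraph >= 1.0 and clique.number() in older versions), the clique number, i.e. the size of the largest clique, should agree with the census above:

igraph::clique_num(union_graph_undir)  # expected to be 3, matching the census above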

The largest cliques in this network contain three nodes, i.e. parent/child triangles. It therefore does not make much sense to plot them all, but we could plot one of them, highlighted in color, with:

vcol <- rep("grey80", vcount(union_graph_undir))

# highlight first of largest cliques
vcol[unlist(largest_cliques(union_graph_undir)[[1]])] <- "red"

plot(union_graph_undir,
     layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph_undir)$name),
     vertex.shape = V(union_graph_undir)$shape,
     vertex.color = vcol, 
     vertex.size = 5, 
     vertex.frame.color = "gray", 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8,
     edge.width = 2,
     edge.arrow.size = 0.5,
     edge.color = E(union_graph_undir)$color,
     edge.lty = E(union_graph_undir)$lty)

Clustering

We can also look for groups within our network by clustering node groups according to their edge betweenness:

ceb <- cluster_edge_betweenness(union_graph_undir)
modularity(ceb)
## [1] 0.8359884
plot(ceb,
     union_graph_undir,
     layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph_undir)$name),
     vertex.shape = V(union_graph_undir)$shape,
     vertex.size = (V(union_graph_undir)$popularity + 0.5) * 5, 
     vertex.frame.color = "gray", 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8)

family_ties_1

Or based on propagating labels:

clp <- cluster_label_prop(union_graph_undir)

plot(clp,
     union_graph_undir,
     layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph_undir)$name),
     vertex.shape = V(union_graph_undir)$shape,
     vertex.size = (V(union_graph_undir)$popularity + 0.5) * 5, 
     vertex.frame.color = "gray", 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8)

family_ties_1
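
To get a rough sense of how similar the two clusterings are, a minimal sketch would compare their memberships, e.g. with normalized mutual information (1 would mean identical partitions):

igraph::compare(ceb, clp, method = "nmi")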

Network properties

We can also feed our adjacency matrix to other functions, like GenInd() from the NetIndices package. This function calculates a number of network properties: the number of compartments (N), total system throughput (T..), total system throughflow (TST), number of internal links (Lint), total number of links (Ltot), link density (LD), connectance (C), average link weight (Tijbar), average compartment throughflow (TSTbar) and compartmentalization, i.e. the degree of connectedness of subsystems in the network (Cbar).

library(NetIndices)
graph.properties <- GenInd(adjacency)
graph.properties
## $N
## [1] 208
## 
## $T..
## [1] 652
## 
## $TST
## [1] 652
## 
## $Lint
## [1] 652
## 
## $Ltot
## [1] 652
## 
## $LD
## [1] 3.134615
## 
## $C
## [1] 0.01514307
## 
## $Tijbar
## [1] 1
## 
## $TSTbar
## [1] 3.134615
## 
## $Cbar
## [1] 0.01086163
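
As a quick sanity check (sketch), the link density and connectance follow directly from the node (N = 208) and link (Ltot = 652) counts reported above:

652 / 208          # LD: average number of links per node
## [1] 3.134615
652 / (208 * 207)  # C: fraction of the possible directed links that are realized
## [1] 0.01514307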

Alternatively, the network package provides additional functions to obtain network properties. Here, we can again feed in the adjacency matrix of our network and convert it to a network object.

library(network)
adj_network <- network(adjacency, directed = TRUE)
adj_network
##  Network attributes:
##   vertices = 208 
##   directed = TRUE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   bipartite = FALSE 
##   total edges= 652 
##     missing edges= 0 
##     non-missing edges= 652 
## 
##  Vertex attribute names: 
##     vertex.names 
## 
## No edge attributes

From this network object, we can e.g. get the number of dyads and edges within a network and the network size.

network.dyadcount(adj_network)
## [1] 43056
network.edgecount(adj_network)
## [1] 652
network.size(adj_network)
## [1] 208
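
The dyad count here is easy to verify by hand: for a directed network with n = 208 vertices, there are n * (n - 1) ordered pairs of nodes.

208 * 207
## [1] 43056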

“equiv.clust uses a definition of approximate equivalence (equiv.fun) to form a hierarchical clustering of network positions. Where dat consists of multiple relations, all specified relations are considered jointly in forming the equivalence clustering.” equiv.clust() help

ec <- equiv.clust(adj_network, mode = "graph", cluster.method = "average", plabels = network.vertex.names(adj_network))
ec
## Position Clustering:
## 
##  Equivalence function: sedist 
##  Equivalence metric: hamming 
##  Cluster method: average 
##  Graph order: 208
ec$cluster$labels <- ec$plabels
plot(ec)

family_ties_1

From the sna package, we can e.g. use functions that tell us the graph density and the dyadic reciprocity of the vertices or edges:

gden(adjacency)
## [1] 0.01514307
grecip(adjacency)
## Mut 
##   1
grecip(adjacency, measure = "edgewise")
## Mut 
##   1

sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                           LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] NetIndices_1.4.4     MASS_7.3-47          statnet_2016.9       sna_2.4              ergm.count_3.2.2     tergm_3.4.0          networkDynamic_0.9.0 ergm_3.7.1           network_1.13.0       statnet.common_3.3.0 igraph_1.0.1         dplyr_0.5.0          purrr_0.2.2          readr_1.1.0          tidyr_0.6.2          tibble_1.3.0         ggplot2_2.2.1        tidyverse_1.1.1     
## 
## loaded via a namespace (and not attached):
##  [1] lpSolve_5.6.13    reshape2_1.4.2    haven_1.0.0       lattice_0.20-35   colorspace_1.3-2  htmltools_0.3.6   yaml_2.1.14       foreign_0.8-68    DBI_0.6-1         modelr_0.1.0      readxl_1.0.0      trust_0.1-7       plyr_1.8.4        robustbase_0.92-7 stringr_1.2.0     munsell_0.4.3     gtable_0.2.0      cellranger_1.1.0  rvest_0.3.2       coda_0.19-1       psych_1.7.5       evaluate_0.10     labeling_0.3      knitr_1.15.1      forcats_0.2.0     parallel_3.4.0    DEoptimR_1.0-8    broom_0.4.2       Rcpp_0.12.10      scales_0.4.1      backports_1.0.5   jsonlite_1.4      mnormt_1.5-5      hms_0.3           digest_0.6.12     stringi_1.1.5     grid_3.4.0        rprojroot_1.2     tools_3.4.0       magrittr_1.5      lazyeval_0.2.0    Matrix_1.2-10     xml2_1.1.1        lubridate_1.6.0   assertthat_0.2.0  rmarkdown_1.5     httr_1.2.1        R6_2.2.0          nlme_3.1-131      compiler_3.4.0

 

Source: Network analysis of Game of Thrones family ties | R-bloggers

Pretty scatter plots with ggplot2 | R-bloggers

@drsimonj here to make pretty scatter plots of correlated variables with ggplot2!

We’ll learn how to create plots that look like this:

init-example-1.png

Data

In a data.frame d, we’ll simulate two correlated variables a and b of length n:

set.seed(170513)
n <- 200
d <- data.frame(a = rnorm(n))
d$b <- .4 * (d$a + rnorm(n))

head(d)
#>            a           b
#> 1 -0.9279965 -0.03795339
#> 2  0.9133158  0.21116682
#> 3  1.4516084  0.69060249
#> 4  0.5264596  0.22471694
#> 5 -1.9412516 -1.70890512
#> 6  1.4198574  0.30805526
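
As a quick check (sketch): because b is 0.4 times the sum of a and independent noise with the same variance, the correlation between a and b should come out around 1/sqrt(2), i.e. roughly 0.7.

cor(d$a, d$b)  # roughly 0.7 for this simulated sample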

Basic scatter plot

Using ggplot2, the basic scatter plot (with theme_minimal) is created via:

library(ggplot2)

ggplot(d, aes(a, b)) +
  geom_point() +
  theme_minimal()

unnamed-chunk-3-1.JPEG

Shape and size

There are many ways to tweak the shape and size of the points. Here’s the combination I settled on for this post:

ggplot(d, aes(a, b)) +
  geom_point(shape = 16, size = 5) +
  theme_minimal()

unnamed-chunk-4-1.JPEG

Color

We want to color the points in a way that helps to visualise the correlation between them.

One option is to color by one of the variables. For example, color by a (and hide legend):

ggplot(d, aes(a, b, color = a)) +
  geom_point(shape = 16, size = 5, show.legend = FALSE) +
  theme_minimal()

unnamed-chunk-5-1.JPEG

Although it’s subtle in this plot, the problem is that the color is changing as the points go from left to right. Instead, we want the color to change in a direction that characterises the correlation – diagonally in this case.

To do this, we can color points by the first principal component. Add it to the data frame as a variable pc and use it to color like so:

d$pc <- predict(prcomp(~a+b, d))[,1]

ggplot(d, aes(a, b, color = pc)) +
  geom_point(shape = 16, size = 5, show.legend = FALSE) +
  theme_minimal()

unnamed-chunk-6-1.JPEG

Now that we can add color, let’s pick something nice with the help of the scale_color_gradient functions and some nice hex codes (check out color-hex for inspiration). For example:

ggplot(d, aes(a, b, color = pc)) +
  geom_point(shape = 16, size = 5, show.legend = FALSE) +
  theme_minimal() +
  scale_color_gradient(low = "#0091ff", high = "#f0650e")

unnamed-chunk-7-1.JPEG

Transparency

Now it’s time to get rid of those offensive mushes by adjusting the transparency with alpha.

We could adjust it to be the same for every point:

ggplot(d, aes(a, b, color = pc)) +
  geom_point(shape = 16, size = 5, show.legend = FALSE, alpha = .4) +
  theme_minimal() +
  scale_color_gradient(low = "#0091ff", high = "#f0650e")

unnamed-chunk-8-1.JPEG

This is fine most of the time. However, what if you have many points? Let’s try with 5,000 points:

# Simulate data
set.seed(170513)
n <- 5000
d <- data.frame(a = rnorm(n))
d$b <- .4 * (d$a + rnorm(n))

# Compute first principal component
d$pc <- predict(prcomp(~a+b, d))[,1]

# Plot
ggplot(d, aes(a, b, color = pc)) +
  geom_point(shape = 16, size = 5, show.legend = FALSE, alpha = .4) +
  theme_minimal() +
  scale_color_gradient(low = "#0091ff", high = "#f0650e")

unnamed-chunk-9-1.JPEG

We’ve got another big mush. What if we take alpha down really low to .05?

ggplot(d, aes(a, b, color = pc)) +
  geom_point(shape = 16, size = 5, show.legend = FALSE, alpha = .05) +
  theme_minimal() +
  scale_color_gradient(low = "#0091ff", high = "#f0650e")

unnamed-chunk-10-1.JPEG

Better, except it’s now hard to see extreme points that are alone in space.

To solve this, we’ll map alpha to the inverse point density. That is, turn down alpha wherever there are lots of points! The trick is to use bivariate density, which can be added as follows:

# Add bivariate density for each point
d$density <- fields::interp.surface(
  MASS::kde2d(d$a, d$b), d[,c("a", "b")])

Now plot with alpha mapped to 1/density:

ggplot(d, aes(a, b, color = pc, alpha = 1/density)) +
  geom_point(shape = 16, size = 5, show.legend = FALSE) +
  theme_minimal() +
  scale_color_gradient(low = "#0091ff", high = "#f0650e")

unnamed-chunk-12-1.JPEG

You can see that distant points are now too vibrant. Our final fix is to use scale_alpha to tweak the alpha range. By default, this range is 0 to 1, making the most distant points have an alpha close to 1. Let’s restrict it to something better:

ggplot(d, aes(a, b, color = pc, alpha = 1/density)) +
  geom_point(shape = 16, size = 5, show.legend = FALSE) +
  theme_minimal() +
  scale_color_gradient(low = "#0091ff", high = "#f0650e") +
  scale_alpha(range = c(.05, .25))

unnamed-chunk-13-1.JPEG

Much better! No more mushy patches or lost points.

Bringing it together

Here’s a complete example with new data and colors:

# Simulate data
set.seed(170513)
n <- 2000
d <- data.frame(a = rnorm(n))
d$b <- -(d$a + rnorm(n, sd = 2))

# Add first principal component
d$pc <- predict(prcomp(~a+b, d))[,1]

# Add density for each point
d$density <- fields::interp.surface(
  MASS::kde2d(d$a, d$b), d[,c("a", "b")])

# Plot
ggplot(d, aes(a, b, color = pc, alpha = 1/density)) +
  geom_point(shape = 16, size = 5, show.legend = FALSE) +
  theme_minimal() +
  scale_color_gradient(low = "#32aeff", high = "#f2aeff") +
  scale_alpha(range = c(.25, .6))

unnamed-chunk-14-1.png

Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

 

Source: Pretty scatter plots with ggplot2 | R-bloggers

How to Create a Wall Street Journal Data Visualization in R

We know that the NY Times data visualizations are pretty awesome, but the Wall Street Journal’s data visualizations are nothing to laugh at. In fact, one of my favorite books on data visualization is by Dona Wong, the former graphics editor of the Wall Street Journal and a student of Ed Tufte at Yale (and yes, she was at the NY Times too).

The Wall Street Journal Data Visualization

One of my favorite data visualizations from the WSJ is the graphic on Chick-fil-A and the public opinion on same-sex marriage. The WSJ published this graphic in an article after the Chick-fil-A president commented against same-sex marriage. The author of the article suggested that Chick-fil-A could afford to express such sentiments because its stores, and hence its customers, were in regions where the majority of public opinion was against same-sex marriage.

WSJ Data Visualization on Chick-fil-A

What I liked about the data visualization:

  • The graphic itself was very neutral in its color palette and in its conclusions. It left it up to the readers to draw any conclusions.
  • The store counts in each state were easy to see and compare. There was no cute icon for each store.
  • It removed all the distractions from the data visualization. No extraneous colors, no 3-d effects, no unnecessary data.
  • The colors of various regions were matched by the colors of the legends for the pie charts.
  • Yes, pie charts:
    • They were suitable to explain the three opinion categories.
    • They all started with “Favor” at 12 o’clock.
    • They were not exploded or shown in 3-D, nor did they have 50 tiny slices.
    • The legends were printed only once.
    • A reader could quickly distinguish the public opinion slices using the simple colors.

WSJ Data Visualization in R

Like my previous post on NYT’s data visualization, I wanted to recreate this WSJ data visualization using R. Here’s how I tried to replicate that graphic:

Load favorite libraries

library(rvest) 
library(stringr)
library(dplyr)
library(ggplot2)
library(scales)
library(ggmap)
library(readr)
library(tidyr)
library(zipcode)
library(maps)
library(jsonlite)
library(grid)
library(gridExtra)
data(zipcode) ## load zipcode data in your workspace
  • rvest to scrape data from the web
  • stringr for string manipulation
  • dplyr for data manipulation
  • ggplot2 for plotting, duh!
  • scales to generate beautiful axis labels
  • ggmap to get and plot the United States map
  • readr to read data from CSVs
  • tidyr to transform data from long to wide form and vice versa
  • zipcode to clean up zipcodes and get the coordinates for each zipcode
  • jsonlite to get data from a json object
  • grid to line up charts
  • gridExtra because stackoverflow said so 🙂

Get Chick-fil-A locations

I found out that Chick-fil-A lists all its store locations on its website, organized by state. Every state has a separate page with the store locations in that state.

So, first, I got all the states with a store in an XML document:

states <- read_html("https://www.Chick-fil-A.com/Locations/Browse")

Then, using the CSS selector values, I parsed the XML to extract all the hyperlink text with the URL to each state:

locations <- states %>% html_nodes("article ul li a") %>% 
  html_attr("href")

You can use Selector Gadget to find CSS selectors of objects on a given web-page. Please do read rvest’s documentation for more information on using rvest for web-scraping.

You will get a character vector of all the relative urls of each state. Like this:

## [1] "/Locations/Browse/AL" "/Locations/Browse/AZ" "/Locations/Browse/AR"

So, we need to join this relative URL to the top domain:

locations <- paste0("https://www.Chick-fil-A.com", locations)

Now, we need to scrape every state page and extract the zipcode of all the stores on that page. I wrote a function to help us with that:

extract_location <- function(url){
  read_html(url) %>% 
    html_nodes(".location p") %>% 
    html_text() %>% 
    str_extract("\\d{5}(?=\n)") 
}

The function downloads each URL, finds the paragraph nodes inside the elements with the location class, converts the matches to text, and finally extracts the five-digit number (the zipcode) using a regular expression.
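
As a tiny illustration of the look-ahead (with a made-up address string, not actual scraped data), str_extract() keeps only the five digits that are immediately followed by a newline:

stringr::str_extract("Atlanta, GA 30303\n", "\\d{5}(?=\n)")
## [1] "30303"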

We need to pass the URL of each state to this function. We can achieve that with the sapply() function, but be patient as this will take a minute or two:

locationzips <- sapply(locations, extract_location, USE.NAMES = FALSE)

Clean up the store zip

To make the above data usable for this data visualization, we need to put the zipcode list in a data frame.

locationzips_df <- data.frame(zips = unlist(locationzips), stringsAsFactors = FALSE)

If this post goes viral, I wouldn’t want millions of people trying to download this data from Chick-fil-A’s site. I saved you these steps and stored the zips in a csv on my dropbox (notice the raw=1 parameter at the end to get the file directly and not the preview from dropbox):

locationzips_df <- read_csv("https://www.dropbox.com/s/x8xwdx61go07e4e/chickfilalocationzips.csv?raw=1", col_types = "c")

In case we got some bad data, clean up the zipcodes using the clean.zipcodes() function from the zipcode package.

locationzips_df$zips <- clean.zipcodes(locationzips_df$zips)

Merge coordinate data

Next, we need to get the coordinate data on each zipcode. Again, the dataset from the zipcode package provides us with that data.

locationzips_df <- merge(locationzips_df, select(zipcode, zip, latitude, longitude, state), by.x = 'zips', by.y = 'zip', all.x = TRUE)

Calculate the total number of stores by state

This is really easy with some dplyr magic:

rest_cnt_by_state <- count(locationzips_df, state)

This data frame will look something like this:

## # A tibble: 6 × 2
##   state     n
##   <chr> <int>
## 1    AL    77
## 2    AR    32
## 3    AZ    35
## 4    CA    90
## 5    CO    47
## 6    CT     7

Gather the public opinion data

The PRRI portal shows various survey data on American values. The latest year with data is 2015. I dug into the HTML to find the endpoints that serve the JSON data:

region_opinion_favor <- fromJSON("http://ava.publicreligion.org/ajx_map.regionsdata?category=lgbt_ssm&sc=2&year=2015&topic=lgbt")$regions
region_opinion_oppose <- fromJSON("http://ava.publicreligion.org/ajx_map.regionsdata?category=lgbt_ssm&sc=3&year=2015&topic=lgbt")$regions

Next, I added a field to note the opinion:

region_opinion_favor$opi <- 'favor'
region_opinion_oppose$opi <- 'oppose'

Then, I manipulated this data to make it usable for the pie charts:

region_opinion <- bind_rows(region_opinion_favor, region_opinion_oppose) %>% 
                   filter(region != 'national') %>%
                   mutate(region = recode(region, "1" = "Northeast", "2" = "Midwest", "3" = "South", "4" = "West")) %>%
                   spread(key = opi, value = percent) %>%
                   mutate(other = 100 - favor - oppose) %>%
                   gather(key = opi, value = percent, -region, -sort)  %>%
                   select(-sort) %>%
                   mutate(opi = factor(opi, levels = c('oppose',  'other', 'favor'), ordered = TRUE))

There’s a lot of stuff going on in this code. Let me explain:

  1. We bind the two data frames we created
  2. We remove the data row for the national average
  3. We recode the numerical regions to text regions
  4. We spread the data from long format to wide format. Read tidyr documentation for further explanation.
  5. Now that the opinions are two columns, we create another column for the remaining/unknown opinions.
  6. We bring everything back to long form using gather.
  7. We remove the sort column.
  8. We create an ordered factor for the opinion. This is so that the oppose opinion shows up at the top on the charts.

After all that, we get a data frame that looks like this:

##      region    opi percent
## 1 Northeast  favor      63
## 2   Midwest  favor      54
## 3     South  favor      46
## 4      West  favor      59
## 5 Northeast oppose      29
## 6   Midwest oppose      38

The WSJ data visualization did one more useful thing: it ordered the pie charts with the regions with most opposition to same-sex marriage at the top. The way to handle this kind of stuff in ggplot is to order the underlying factors.

regions_oppose_sorted <- arrange(filter(region_opinion, opi == 'oppose'), desc(percent))$region
region_opinion <- mutate(region_opinion, region = factor(region, levels = regions_oppose_sorted, ordered = TRUE))

Now that our dataframe is ready, we can create the pie charts using ggplot.

Create the pie charts

To create the pie charts in ggplot, we actually create stacked bar graphs first and then change the polar coordinates. We could also use the base R plotting functions, but I wanted to test the limits of ggplot.

opin_pie_charts <- ggplot(region_opinion, aes(x = 1, y = percent, fill = opi)) + geom_bar(width = 1,stat = "identity", size = 0.3, color = "white", alpha = 0.85, show.legend = FALSE)

There are a few things worth explaining here:

  • We set the aes x value to 1 as we really don’t have an x-axis.
  • We set the aes fill value to the variable opi that has values of oppose, favor and other.
  • We set the bar width to 1. I chose this value after some iterations.
  • We set the stat to identity because we don’t want ggplot to calculate the bar proportions.
  • We set the color to white – this is the color of the separator between each stack of the bars.
  • We set the size to 0.3, which gives us reasonable spacing between the stacks. I did so to match the WSJ visualization.

This is the chart we get:

And, your reaction totally may be like this:

first time you see the default ggplot stacked bar colors

Well, relax, Kevin. This will get better. I promise.

Next, we add facets to create a separate chart for each region as well as adding the polar coordinates to create the pie chart. I also added theme_void to remove all the gridlines, axis lines, and labels.

opin_pie_charts <- opin_pie_charts + facet_grid(region ~ ., switch = "y") + coord_polar("y")  + theme_void()

Resulting in this:

Feeling better, Kevin?


Let’s change the colors of the slices, which I picked using an image color picker, to match the WSJ’s data visualization:

opin_pie_charts <- opin_pie_charts + scale_fill_manual(values = c("favor" = "black", "oppose" = "#908E89", "other" = "#F7F3E8"))

Giving us this:

Next, let’s add the data labels. I just picked some positions where it made sense to show the labels:

#add labels
opin_pie_charts <- opin_pie_charts + geom_text(color = "white", size = rel(3.5), aes(x = 1, y = ifelse(opi == 'favor', 15, 85), label = ifelse(opi == 'other', NA, paste0(str_to_title(opi), "\n", percent(percent/100)))))

Resulting in this:

Next, we add the plot title as given in the WSJ data visualization:

opin_pie_charts <- opin_pie_charts + ggtitle(label = "Survey of views of gay\nmarriage, by region") + theme(plot.title = element_text(size = 10, face = "bold", hjust = 1))

And, the last step:

  • change the background color of the region labels,
  • emphasize the region labels
  • change the panel color as well as the plot background color

opin_pie_charts <- opin_pie_charts + theme(strip.background = element_rect(fill = "#E6D9B7"), strip.text = element_text(face = "bold"), plot.background = element_rect(fill = "#F6F4E8", color = NA), panel.background = element_rect(fill = "#F6F4E8", color = NA))

Getting us really close to the WSJ pie charts:

I should say that a pie chart in ggplot is difficult to customize because you lose one axis. Next time, I would try the base R pie chart.

Creating the map

Phew! That was some work to make the pie charts look similar to the WSJ data visualization. Now, more fun: the maps.

First, let’s get some data ready.

R comes with various data points on each of the states. So, we will get the centers of the states, abbreviations, regions, and state names.

state_centers_abb <- data.frame(state = tolower(state.name), 
                                stateabr = state.abb, 
                                center_long = state.center$x, center_lat = state.center$y,
                                region = state.region)

Which gives us this:

##        state stateabr center_long center_lat region
## 1    alabama       AL    -86.7509    32.5901  South
## 2     alaska       AK   -127.2500    49.2500   West
## 3    arizona       AZ   -111.6250    34.2192   West
## 4   arkansas       AR    -92.2992    34.7336  South
## 5 california       CA   -119.7730    36.5341   West
## 6   colorado       CO   -105.5130    38.6777   West

The regions in this data set have a value of North Central; let’s change that to Midwest.

state_centers_abb <- mutate(state_centers_abb, region = recode(region, "North Central" = "Midwest"))

Next, let’s get the polygon data on each state and merge it with the state centers data frame:

us_map <- map_data("state")
us_map <- rename(us_map, state = region)
us_map <- inner_join(us_map, state_centers_abb, by = c("state" = "state")) 
us_map <- us_map[order(us_map$order), ]

While plotting the polygon data using ggplot, you have to make sure that the order column of the polygon data frame is ordered; otherwise, you will get some wacky-looking shapes.

With some prep work done, let the fun begin. Let’s create the base map:

base_plot <- ggplot(data = left_join(us_map, rest_cnt_by_state, by = c("stateabr" = "state")), aes(x = long, y = lat, group = group, order = order, fill = region)) + geom_polygon(color = "white")

You will note that I’ve joined the polygon data frame with the counts of restaurants in each state. Also, similar to the stacked bar graphs, I’ve separated each state with the white color, giving us this:

I already see Kevin going crazy!

I again used the image color picker to select the colors from the WSJ data visualization and assigned them manually to each region. I also removed the legend:

base_plot <- base_plot + scale_fill_manual(values = c("West" = "#E6D9B7", "South" = "#F2DBD5", "Northeast" = "#D8E7D7", "Midwest" = "#D6DBE3"), guide = FALSE)

Generating this:

Next, remove all the distractions and change the background color:

base_plot <- base_plot + theme(axis.title = element_blank(), 
                                 axis.ticks = element_blank(), 
                                 axis.line = element_blank(), 
                                 axis.text = element_blank(), 
                                 panel.grid = element_blank(), 
                                 panel.border = element_blank(), 
                                 plot.background = element_rect(fill = "#F6F4E8"), 
                                 panel.background = element_rect(fill = "#F6F4E8"))

Giving us this:

Next up is adding the circles for restaurant counts.

base_plot <- base_plot + geom_point(aes(size = n, x = center_long, y = center_lat), shape =  1, stroke = 0.5, color = "grey60", inherit.aes = FALSE) + scale_size_area(max_size = 18, breaks = c(0, 10, 100, 300), guide = FALSE) + scale_shape(solid = FALSE)

OK. There is a lot of stuff going on here. First, we use geom_point to create the circles at the center of each state. The size of each circle is dictated by the number of restaurants in that state. The stroke parameter controls the thickness of the circle outline. We are also using inherit.aes = FALSE to create new aesthetics for this geom.

scale_size_area is very important because, as the documentation says:

scale_size scales area, scale_radius scales radius. The size aesthetic is most commonly used for points and text, and humans perceive the area of points (not their radius), so this provides for optimal perception. scale_size_area ensures that a value of 0 is mapped to a size of 0.

Don’t let this happen to you! Use the area and not the radius.

I also increased the maximum size of the circles and set the breaks for the circle sizes manually.

Generating this:

Since we removed the legend for the circle size and the WSJ graphic had one, let’s try to add that back in. This was challenging and hardly accurate. I played with some sizes and eyeballed it to match the radii of the different circles. I don’t recommend this at all. This is better handled in post-processing using Inkscape or Illustrator.

Please add the circles as a legend in post-processing. Adding circle grobs is not accurate and doesn’t produce the desired results.

base_plot <- base_plot + annotation_custom(grob = circleGrob(r = unit(1.4, "lines"), gp = gpar(fill = "gray87", alpha = 0.5)), xmin = -79, xmax = -77, ymin = 48 , ymax = 49) + annotation_custom(grob = circleGrob(r = unit(0.7, "lines"), gp = gpar(fill = "gray85", alpha = 0.87)), xmin = -79, xmax = -77, ymin = 47.4, ymax = 48) + annotation_custom(grob = circleGrob(r = unit(0.3, "lines"), gp = gpar(fill = "gray80", alpha = 0.95)), xmin = -79, xmax = -77, ymin = 46.5, ymax = 48)

Giving us:

Let’s add the title:

base_plot <- base_plot + ggtitle(label = "Number of Chick-fil-A restaurants by state and region") + theme(plot.title = element_text(size = 10, face = "bold", hjust = 0.14, vjust = -5))

Merging the states and pie charts

This is the easiest step of all. We use the grid.arrange function to merge the two plots to create our own WSJ data visualization in R.

png("wsj-Chick-fil-A-data-visualization-r-plot.png", width = 800, height = 500, units = "px")
grid.arrange(base_plot, opin_pie_charts, widths= c(4, 1), nrow = 1)
dev.off()

Here it is:
WSJ infographics in R

What do you think?

Some things I couldn’t make work:

  • Change the color of the facet labels of the pie charts. I toyed with the strip.text settings, but I couldn’t change all the colors. Perhaps it is easier to do with base R pie charts.
  • The circle-in-circle legend. I got the circles, but not the numbers.
  • The ‘other’ category label in the pie chart.
  • Make the circles on the maps look less jagged-y.

Does the data visualization still work?

Of course, the public opinion over same-sex marriage has changed across the states and Chick-fil-A has opened up more stores across the country.

It still does look like the bigger circles are in the regions where people oppose same-sex marriage.

I wanted to question that. According to Pew Research Center data, only 33% of Republicans favor same-sex marriage. So, if we were to plot the 2016 U.S. presidential election results by county and plot each zipcode with a Chick-fil-A store, we could see whether there are more stores in counties that voted for Donald Trump.

Let’s get started then.

Get the county level data. Tony McGovern already did all the hard work:

cnty_results <- read_csv("https://raw.githubusercontent.com/tonmcg/County_Level_Election_Results_12-16/master/2016_US_County_Level_Presidential_Results.csv")
cnty_results <- mutate(cnty_results, per_point_diff = per_gop - per_dem,
                       county_name =  tolower(str_replace(county_name, " County", "")),
                       county_name =  tolower(str_replace(county_name, " parish", "")))

Get the county map polygon data:

county_map <- map_data("county")
county_map <- rename(county_map, state = region, county = subregion)
county_map$stateabr <- state.abb[match(county_map$state, tolower(state.name))]

Join the polygon with county results:

cnty_map_results <- left_join(county_map, cnty_results, by = c("stateabr" = "state_abbr", "county" = "county_name")) %>% arrange(order, group)

Plot the results and counties:

cnty_plot <- ggplot(data = cnty_map_results, aes(x = long, y = lat, group = group, order = order)) + geom_polygon(aes(fill = per_point_diff), color = "gray90", size = 0.1)

This is what we get:

Now, we fill the map with red and blue colors for the republican and democratic voting counties:

cnty_plot <- cnty_plot + scale_fill_gradient2(midpoint = 0, low = "#2062A1", mid = "white", high = "red", na.value = "white", breaks = seq(from = 0, to = 1, length.out = 7), guide = FALSE)

Generating this:

Let’s remove all the distractions and add the store locations by zipcode:

cnty_plot <- cnty_plot + theme_void() + geom_point(data = locationzips_df, aes(x = longitude, y = latitude), size = 0.3, alpha = 0.2, inherit.aes = FALSE)

Giving us this:

wsj-chick-fil-a-locations-presidential-elections-county-data-visualization

Very different picture, wouldn’t you say? It actually looks like the store locations are in more Democratic-leaning counties, or at least in counties that are roughly equally divided between Republican and Democratic votes.

Of course, it is possible that I messed something up. But, I can conclude two things based on this chart:

  • Aggregation can mislead us into seeing non-existent patterns
  • A person’s political identification or views have nothing to do with the food he or she likes. And the Chick-fil-A leaders know that.

What do you think?

Complete Script

library(rvest) 
library(stringr)
library(dplyr)
library(ggplot2)
library(scales)
library(ggmap)
library(readr)
library(tidyr)
library(zipcode)
library(maps)
library(jsonlite)
library(grid)
library(gridExtra)
data(zipcode) ## load zipcode data in your workspace
 
states <- read_html("https://www.Chick-fil-A.com/Locations/Browse")
 
locations <- states %>% html_nodes("article ul li a") %>% 
  html_attr("href") 
 
locations <- paste0("https://www.Chick-fil-A.com", locations)
extract_location <- function(url){
  read_html(url) %>%
  html_nodes(".location p") %>%
  html_text() %>%
  str_extract("\\d{5}(?=\n)")
}
 
locationzips <- sapply(locations, extract_location, USE.NAMES = FALSE)
locationzips_df <- data.frame(zips = unlist(locationzips), stringsAsFactors = FALSE)
 
#locationzips_df <- read_csv("https://www.dropbox.com/s/x8xwdx61go07e4e/chickfilalocationzips.csv?raw=1", col_types = "c")
 
locationzips_df$zips <- clean.zipcodes(locationzips_df$zips)
locationzips_df <- merge(locationzips_df, select(zipcode, zip, latitude, longitude, state), by.x = 'zips', by.y = 'zip', all.x = TRUE)
 
rest_cnt_by_state <- count(locationzips_df, state)
head(rest_cnt_by_state)
 
region_opinion_favor <- fromJSON("http://ava.publicreligion.org/ajx_map.regionsdata?category=lgbt_ssm&sc=2&year=2015&topic=lgbt")$regions
region_opinion_oppose <- fromJSON("http://ava.publicreligion.org/ajx_map.regionsdata?category=lgbt_ssm&sc=3&year=2015&topic=lgbt")$regions
region_opinion_favor$opi <- 'favor'
region_opinion_oppose$opi <- 'oppose'
 
region_opinion <- bind_rows(region_opinion_favor, region_opinion_oppose) %>% 
                   filter(region != 'national') %>%
                   mutate(region = recode(region, "1" = "Northeast", "2" = "Midwest", "3" = "South", "4" = "West")) %>%
                   spread(key = opi, value = percent) %>%
                   mutate(other = 100 - favor - oppose) %>%
                   gather(key = opi, value = percent, -region, -sort)  %>%
                   select(-sort) %>%
                   mutate(opi = factor(opi, levels = c('oppose',  'other', 'favor'), ordered = TRUE))
head(region_opinion)
 
regions_oppose_sorted <- arrange(filter(region_opinion, opi == 'oppose'), desc(percent))$region
region_opinion <- mutate(region_opinion, region = factor(region, levels = regions_oppose_sorted, ordered = TRUE)) 
 
opin_pie_charts <- ggplot(region_opinion, aes(x = 1, y = percent, fill = opi)) + geom_bar(width = 1,stat = "identity", size = 0.3, color = "white", alpha = 0.85, show.legend = FALSE) 
plot(opin_pie_charts)
opin_pie_charts <- opin_pie_charts + facet_grid(region ~ ., switch = "y") + coord_polar("y")  + theme_void()
plot(opin_pie_charts)
opin_pie_charts <- opin_pie_charts + scale_fill_manual(values = c("favor" = "black", "oppose" = "#908E89", "other" = "#F7F3E8")) 
plot(opin_pie_charts)
 
#add labels
opin_pie_charts <- opin_pie_charts + geom_text(color = "white", size = rel(3.5), aes(x = 1, y = ifelse(opi == 'favor', 15, 85), label = ifelse(opi == 'other', NA, paste0(str_to_title(opi), "\n", percent(percent/100)))))
plot(opin_pie_charts)
opin_pie_charts <- opin_pie_charts + ggtitle(label = "Survey of views of gay\nmarriage, by region") + theme(plot.title = element_text(size = 10, face = "bold", hjust = 1)) 
plot(opin_pie_charts)
opin_pie_charts <- opin_pie_charts + theme(strip.background = element_rect(fill = "#E6D9B7"), strip.text = element_text(face = "bold"), plot.background = element_rect(fill = "#F6F4E8", color = NA), panel.background = element_rect(fill = "#F6F4E8", color = NA))
plot(opin_pie_charts)
 
state_centers_abb <- data.frame(state = tolower(state.name), 
                                stateabr = state.abb, 
                                center_long = state.center$x, center_lat = state.center$y,
                                region = state.region)
head(state_centers_abb)
 
state_centers_abb <- mutate(state_centers_abb, region = recode(region, "North Central" = "Midwest"))
 
us_map <- map_data("state")
us_map <- rename(us_map, state = region)
us_map <- inner_join(us_map, state_centers_abb, by = c("state" = "state")) 
us_map <- us_map[order(us_map$order), ]
 
base_plot <- ggplot(data = left_join(us_map, rest_cnt_by_state, by = c("stateabr" = "state")), aes(x = long, y = lat, group = group, order = order, fill = region)) + geom_polygon(color = "white") 
plot(base_plot)
 
base_plot <- base_plot + scale_fill_manual(values = c("West" = "#E6D9B7", "South" = "#F2DBD5", "Northeast" = "#D8E7D7", "Midwest" = "#D6DBE3"), guide = FALSE)
plot(base_plot)
 
base_plot <- base_plot + theme(axis.title = element_blank(), 
                                 axis.ticks = element_blank(), 
                                 axis.line = element_blank(), 
                                 axis.text = element_blank(), 
                                 panel.grid = element_blank(), 
                                 panel.border = element_blank(), 
                                 plot.background = element_rect(fill = "#F6F4E8"), 
                                 panel.background = element_rect(fill = "#F6F4E8"))
plot(base_plot)
 
base_plot <- base_plot + geom_point(aes(size = n, x = center_long, y = center_lat), shape =  1, stroke = 0.5, color = "grey60", inherit.aes = FALSE) + scale_size_area(max_size = 18, breaks = c(0, 10, 100, 300), guide = FALSE) + scale_shape(solid = FALSE) 
plot(base_plot)
 
base_plot <- base_plot + annotation_custom(grob = circleGrob(r = unit(1.4, "lines"), gp = gpar(fill = "gray87", alpha = 0.5)), xmin = -79, xmax = -77, ymin = 48 , ymax = 49) + annotation_custom(grob = circleGrob(r = unit(0.7, "lines"), gp = gpar(fill = "gray85", alpha = 0.87)), xmin = -79, xmax = -77, ymin = 47.4, ymax = 48) + annotation_custom(grob = circleGrob(r = unit(0.3, "lines"), gp = gpar(fill = "gray80", alpha = 0.95)), xmin = -79, xmax = -77, ymin = 46.5, ymax = 48)
plot(base_plot)
 
base_plot <- base_plot + ggtitle(label = "Number of Chick-fil-A restaurants by state and region") + theme(plot.title = element_text(size = 10, face = "bold", hjust = 0.14, vjust = -5))
png("wsj-Chick-fil-A-data-visualization-r-plot.png", width = 800, height = 500, units = "px")
grid.arrange(base_plot, opin_pie_charts, widths= c(4, 1), nrow = 1)
dev.off()
 
cnty_results <- read_csv("https://raw.githubusercontent.com/tonmcg/County_Level_Election_Results_12-16/master/2016_US_County_Level_Presidential_Results.csv")
cnty_results <- mutate(cnty_results, per_point_diff = per_gop - per_dem,
                       county_name =  tolower(str_replace(county_name, " County", "")),
                       county_name =  tolower(str_replace(county_name, " parish", "")))
 
county_map <- map_data("county")
county_map <- rename(county_map, state = region, county = subregion)
county_map$stateabr <- state.abb[match(county_map$state, tolower(state.name))]
cnty_map_results <- left_join(county_map, cnty_results, by = c("stateabr" = "state_abbr", "county" = "county_name")) %>% arrange(order, group) 
cnty_plot <- ggplot(data = cnty_map_results, aes(x = long, y = lat, group = group, order = order)) + geom_polygon(aes(fill = per_point_diff), color = "gray90", size = 0.1) 
plot(cnty_plot)
 
cnty_plot <- cnty_plot + scale_fill_gradient2(midpoint = 0, low = "#2062A1", mid = "white", high = "red", na.value = "white", breaks = seq(from = 0, to = 1, length.out = 7), guide = FALSE)
plot(cnty_plot)
 
cnty_plot <- cnty_plot + theme_void() + geom_point(data = locationzips_df, aes(x = longitude, y = latitude), size = 0.3, alpha = 0.2, inherit.aes = FALSE)

 

Source: How to Create a Wall Street Journal Data Visualization in R – nandeshwar.info

Getting started with data science – recommended resources

I often get asked which resources I would recommend for getting started with data science. Here are my recommendations, based on my own experience.

Books

Data Science for Business

Data Science for Business is a great overview book.

It is not very technical and makes the concepts very accessible. I found the section explaining how loss functions work especially enlightening, particularly compared to other explanations I have seen.

Most importantly, it is about the process of data science and what it means for a business. The book is good for managers, technical people and recent graduates alike.

I read the book on my phone on kindle, usually when I was travelling. It remained something readable and intelligible at early hours and late nights – a compelling endorsement in an area where the books can be dense and make one’s brain bleed out of one’s ears.

 

R for Data Science

R for Data Science: Import, Tidy, Transform, Visualize, and Model Data is written by Hadley Wickham and Garrett Grolemund. You can buy it and you can also access it online.

If you’re interested in learning to actually start doing data science as a practitioner, this book is a very accessible introduction to programming.

Starting gently, this book doesn’t teach you much about the use of R from a general programming perspective. It takes a very task oriented approach and teaches you R as you go along.

This book doesn’t cover the breadth and depth of data science in R, but it gives you a strong foundation in the coding skills you need and a sense of the process you’ll go through.

I really like this book but it’s important to note you may have some gaps in your knowledge if this is your main introduction to R programming.

Introductions to R

There are many introductions to R that are useful. The difficulty lies in that there aren’t many focusing on modern R.

I struggle with the ongoing debate about whether to teach people a strong understanding of base R (or vanilla R) or to teach them modern R (e.g. data.table and the tidyverse), which I perceive to be easier:

  • Base R is quirky and quite difficult to learn. Modern R is much easier.
  • Base R knowledge ensures you understand the core objects and programming constructs. Modern R focusses on tabular data, which is what most people need.
  • Base R is stable. Modern R is bleeding edge.

There are very good base R introductory books but we’re still relatively lacking in ones that incorporate a lot of modern R.

In-person

User groups

User groups are a great way to meet people in the field and see some talks in the area. For the most part, I recommend you go to meetup.com and look for local data science, python, or R meetups near you.

I love user groups and I think they’re a great way to build a local network of people you can chat to about your start in data science.

Training

There’s a reasonable number of people out there offering introductions to R and data science. I myself offer training, from bespoke workshops to community workshops to training for BI people. You can often search Eventbrite or ask on Twitter to find an event happening near you. Your local user group is another great place to ask.

Conferences

You can also check out a lot of conferences. KDnuggets keeps a good list of conferences that you can attend.

As data science is becoming so popular, you can often see a lot of data science content appearing at non-data-science conferences, especially data platform and analysis conferences, so you might be able to look to conferences you already know about for getting started with data science.

Online

Blogs

I love reading blogs as a way of learning in an area – not only do you get technical knowledge but people are kind enough to share theory and current trends too. I read a variety that might not suit everyone but if you’re just starting out I recommend Becoming a Data Scientist and R-bloggers.

Online practice

Getting hands-on is an important aspect of learning for me.

A great way to start doing that, in a very chuck-yourself-in-at-the-deep-end kind of way, is to start learning on Kaggle, especially with their Titanic competition.

For a gentler way of learning, you can start using Microsoft’s free notebook and machine learning platform to start coding things.

Podcasts

Alas, I’m not a podcast person so I’ve only one recommendation for you: Not So Standard Deviations.

DataCamp

By far my most recommended site for learning R and data science is DataCamp. DataCamp blends videos and online exercises, making it a great way to learn practically but still get the theory.

There are a number of free introductory courses, but DataCamp works on a monthly subscription model of $29. For the monthly fee, you can consume any and all of their courses.

DataCamp is great value for money, especially if you want to do an intensive month of learning and then end your subscription.

Coursera

Coursera is the original online course provider.

They’ve got some fantastic courses for getting started with data science.

With online courses like these, you can find yourself dropping off and not finishing a course. I recommend you start with just a course or two before you go for something like the $20k Masters in Data Science that you could work towards on there!

The post Getting started with data science – recommended resources appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.

 

Source: Getting started with data science – recommended resources | R-bloggers