LeaRning Path on R – Step by Step Guide to Learn Data Science on R


One of the most common problems people face when learning R is the lack of a structured path. They don’t know where to start, how to proceed, or which track to choose. And while there is an overload of good free resources available on the Internet, this abundance can be overwhelming and confusing at the same time.

To create this R learning path, Analytics Vidhya and DataCamp sat together and selected a comprehensive set of resources to help you learn R from scratch. This learning path is a great introduction for anyone new to data science or R, and if you are a more experienced R user you will be updated on some of the latest advancements.

This will help you learn R quickly and efficiently. Time to have fun while lea-R-ning!

 

Step 0: Warming up

Before starting your journey, the first question to answer is: why use R? And how would R be useful?

R is a fast-growing open-source competitor to commercial software packages like SAS, STATA and SPSS. The demand for R skills in the job market is rising rapidly, and companies such as Microsoft have recently pledged their commitment to R as a lingua franca of data science.

Watch this 90-second video from Revolution Analytics to get an idea of how useful R can be. Incidentally, Revolution Analytics was recently acquired by Microsoft.

 

Step 1: Setting up your machine

The easiest way to set up R is to download a copy onto your local computer from the Comprehensive R Archive Network (CRAN). You can choose between binaries for Linux, Mac and Windows.

Although you could consider working with the basic R console, we recommend installing one of R’s integrated development environments (IDEs). The best-known IDE is RStudio, which makes R coding much easier and faster: it allows you to type multiple lines of code, handle plots, install and maintain packages, and navigate your programming environment much more productively. An alternative to RStudio is Architect, an Eclipse-based workbench.

(Need a GUI? Check R-commander or Deducer)

Assignment 

  1. Install R and RStudio.
  2. Install the packages Rcmdr, rattle, and Deducer, along with all suggested packages and dependencies, including the GUI components.
  3. Load these packages with the library() command and open the GUIs one by one (see the sketch below).
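As a minimal sketch of this assignment (note that loading Rcmdr opens the R Commander window directly, rattle’s GUI is launched with an explicit rattle() call, and Deducer is designed to be used together with the JGR console):

install.packages(c("Rcmdr", "rattle", "Deducer"), dependencies = TRUE)

library(Rcmdr)     # loading Rcmdr opens the R Commander GUI
library(rattle)
rattle()           # start the rattle GUI explicitly
library(Deducer)   # Deducer's menus attach to the JGR console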

 

Step 2: Learn the basics of the R language

You should start by understanding the basics of the language, its libraries and its data structures: learn R programming, data handling and more.

If you prefer an online interactive learning environment to learn R’s syntax this free online R tutorial by DataCamp is a great way to get you going. Also check the successor to this course: intermediate R programming. An alternative learning tool is this online version of swirl where you can learn R in an environment similar to RStudio.

Next to these interactive learning environments, you can also choose to enroll in one of the MOOCs available on Coursera or edX.

In addition to these online resources, you can also consider the following excellent written resources:

Specifically, learn: read.table, data frames, table, summary, describe, loading and installing packages, and data visualization using the plot command.
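For instance, a first session covering most of these commands might look like the following, using the built-in mtcars data set (describe() is not in base R; it comes from add-on packages such as Hmisc):

df <- mtcars                 # a data frame that ships with R
head(df)                     # first six rows
summary(df)                  # per-column numeric summaries
table(df$cyl)                # frequency table of cylinder counts
plot(df$wt, df$mpg,          # base-graphics scatterplot
     xlab = "Weight", ylab = "Miles per gallon")

# reading your own flat file into a data frame works similarly:
# df2 <- read.table("my_file.txt", header = TRUE, sep = "\t")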

Assignment

  1. Take the free online R tutorial by DataCamp and become familiar with basic R syntax.
  2. Create a GitHub account at http://github.com.
  3. Learn to troubleshoot package installation by googling for help.
  4. Install the swirl package and learn R programming (see above).

 

Step 3: Understanding the R community

The major reason R is growing so rapidly and has become such a huge success is its strong community. At the center of this is R’s package ecosystem. These packages can be downloaded from the Comprehensive R Archive Network, or from Bioconductor, GitHub and Bitbucket. At Rdocumentation you can easily search packages from CRAN, GitHub and Bioconductor to find the one that fits your needs for the task at hand.

Beyond the package ecosystem, you can also easily find help and feedback on your R endeavours. First of all there is R’s built-in help system, which you can access via the command ? followed by the name of, for example, a function. There are also the Analytics Vidhya discussions and Stack Overflow, where R is one of the fastest-growing languages. Finally, there are numerous blogs run by R enthusiasts; a great collection of these is aggregated at R-bloggers.
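For example:

?median                      # help page for the median() function
help("median")               # the same, written out
help.search("linear model")  # search installed documentation by topic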

Assignment

 

Step 4: Importing and manipulating your data

Importing and manipulating your data are important steps in the data science workflow. R allows you to import different data formats using specific packages that can make your job easier (a short sketch follows the list):

  • readr for importing flat files
  • The readxl package for getting Excel files into R
  • The haven package for importing SAS, STATA and SPSS data files into R
  • Databases: connect via packages like RMySQL and RPostgreSQL, and access and manipulate via DBI
  • rvest for web scraping
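A hedged sketch of what each of these looks like in practice (all file names, connection details and table names below are placeholders):

library(readr)
flights <- read_csv("flights.csv")            # flat file

library(readxl)
sales <- read_excel("sales.xlsx", sheet = 1)  # Excel workbook

library(haven)
survey <- read_sas("survey.sas7bdat")         # SAS data set

library(DBI)
library(RMySQL)
con <- dbConnect(MySQL(), dbname = "shop", host = "localhost",
                 user = "analyst", password = "secret")
orders <- dbGetQuery(con, "SELECT * FROM orders")
dbDisconnect(con)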

Once your data is available in your working environment you are ready to start manipulating it using these packages:
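dplyr and tidyr are the usual work-horses here (both are covered in more depth in the Data Manipulation section later in this guide); as a small taste of dplyr:

library(dplyr)
mtcars %>%
  filter(cyl == 4) %>%                    # keep 4-cylinder cars
  group_by(gear) %>%                      # one group per gear count
  summarise(avg_mpg = mean(mpg)) %>%      # average fuel efficiency per group
  arrange(desc(avg_mpg))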

Assignment

 

Step 5: Effective Data Visualization

There is no greater satisfaction than creating your own data visualizations. However, visualizing data is as much an art as a skill. A great read on this is Edward Tufte’s principles for visualizing quantitative data, or Stephen Few’s writing on the pitfalls of dashboard design. Also check out the blog FlowingData by Nathan Yau for inspiration on creating visualizations using (mainly) R.

5.1: Plots everywhere

R offers multiple ways to create graphs. The standard way is to use R’s base graphics. However, there are far better tools (or packages) that let you create graphs more simply, and with much better-looking results:

  • Start by learning the grammar of graphics, a practical way to do data visualizations in R.
  • Probably the most important package to master if you want to get serious about data visualization in R is the ggplot2 package (a minimal example follows this list). ggplot2 is so popular that there are tons of resources available on the web for learning purposes, such as this online ggplot2 tutorial, a handy cheatsheet, and this book by the package’s creator, Hadley Wickham.
  • A package such as ggvis allows you to create interactive web graphics using the grammar of graphics (see tutorial).
  • Know this TED talk by Hans Rosling? Learn how to re-create it yourself with googleVis (an interface to Google Charts).
  • In case you run into issues plotting your data, this post might help as well.
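As a minimal taste of the grammar-of-graphics style with ggplot2:

library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +                      # one layer of points
  geom_smooth(method = "lm", se = FALSE) +    # a linear trend per cylinder group
  labs(x = "Weight", y = "Miles per gallon", colour = "Cylinders")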

See more visualization options in this CRAN task view.

Alternatively look at the data visualization guide to R.

5.2: Maps everywhere

Interested in visualizing spatial data? Take the tutorial Introduction to visualising spatial data in R and get started easily with these packages:

  • Visualize spatial data and models on top of static maps from sources such as Google Maps and Open Street Maps with ggmap.
  • Ari Lamstein’s choroplethr
  • The tmap package.


 

5.3: HTML widgets

A very promising new tool for visualization in R is HTML widgets. HTML widgets allow you to create interactive web visualizations in an easy way (see the tutorial by RStudio), and mastering this type of visualization is very likely to become a must-have R skill. Impress your friends and colleagues with these visualizations:
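As a hedged illustration, a complete interactive map with the leaflet widget takes only a few lines (the coordinates below are just an example):

library(leaflet)
leaflet() %>%
  addTiles() %>%                 # default OpenStreetMap base layer
  addMarkers(lng = -0.1276, lat = 51.5072, popup = "London")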

Assignment

 

Step 6: Data Mining and Machine Learning

For those who are new to statistics, we recommend these resources:

If you want to sharpen your machine learning skills, consider starting with these tutorials:

Make sure to see the various machine learning options available in R in the relevant CRAN task view.

Assignment

 

Step 7: Reporting Results

Communicating your results and sharing your insights with fellow data science enthusiasts is just as important as the analysis itself. Luckily, R has some very nifty tools for this that can save you a lot of time.

The first is R Markdown, a great tool for reporting your data analysis in a reproducible manner, based on knitr and pandoc. With R Markdown, R generates a final document that replaces the R code with its results. This document can be in html, Word, pdf, ioslides, and other formats. You can learn more via this tutorial and use this cheat sheet as a reference.
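A minimal R Markdown document, saved as report.Rmd and rendered with the Knit button in RStudio (or with rmarkdown::render("report.Rmd")), might look like this:

---
title: "My first report"
output: html_document
---

The mtcars data set has `r nrow(mtcars)` rows.

```{r scatter}
plot(mtcars$wt, mtcars$mpg)
```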

Next to R Markdown there is also ReporteRs, an R package for creating Microsoft (Word docx and PowerPoint pptx) and html documents that runs on Windows, Linux, Unix and Mac OS systems. Just like R Markdown, it’s an ideal tool for automating report generation from R. See here how to get started.

Last but not least there is Shiny, one of the most exciting tools in R at the moment. Shiny makes it incredibly easy to build interactive web applications with R. It allows you to turn your analysis into interactive web applications without needing to know HTML, CSS or JavaScript. If you want to get started with Shiny (and believe us, you should!), check out the RStudio learning portal.
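As a hedged sketch, a complete Shiny app fits in a handful of lines:

library(shiny)

ui <- fluidPage(
  sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot(hist(rnorm(input$n)))  # redraws whenever the slider moves
}

shinyApp(ui, server)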

Assignment

  • Create your first interactive report using R Markdown and/or ReporteRs
  • Try to build your very first Shiny app

 

Bonus Step: Practice

You will only become a great R programmer through practice. Therefore, make sure to tackle new data science challenges regularly. The best recommendation we can make to you here is to start competing with fellow data scientists on Kaggle: https://www.kaggle.com/c/titanic-gettingStarted.

Test your R Skills on live challenges – Practice Problems

 

Step 8: Time Series Analysis

R has a dedicated task view for Time Series. If you ever want to do something with time series analysis in R, this is definitely the place to start. You will soon see that the scope and depth of the available tools is tremendous.

You will not easily run out of online resources for learning time series analysis with R. Good starting points are A Little Book of R for Time Series, or check out Forecasting: Principles and Practice. In terms of packages, make sure you are familiar with the zoo package and with xts. zoo provides a commonly used format for saving time series objects, while xts gives you the tools to manipulate your time series data sets.
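As a small hedged sketch of the two packages working together (the series below is simulated):

library(xts)  # loads zoo as a dependency
dates   <- seq(as.Date("2017-01-01"), by = "day", length.out = 100)
prices  <- xts(cumsum(rnorm(100)), order.by = dates)  # toy random-walk series
monthly <- apply.monthly(prices, mean)                # aggregate days to monthly means
plot(monthly)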

Alternate resource: Comprehensive tutorial on Time Series

Assignment

  • Take one of the recommended time series tutorials listed above so you are ready to start your own analysis.
  • Use a package such as quantmod or Quandl to download financial data and start your own time series analysis.
  • Use a package such as dygraphs to create stunning visualizations of your time series data and analysis.

 

Bonus Step – Text Mining is Important Too!

To learn text mining, you can refer to the text mining module of the Analytics Edge course. Though the course is archived, you can still access the tutorials.

Practice

 

Step 9: Becoming an R Master

Now that you have learned most of data analytics with R, it is time to give some advanced topics a shot. There is a good chance that you already know many of these, but have a look at these tutorials too.

Want to apply your analytical skills and test your potential? Then participate in our hackathons to compete with data scientists from all over the world.

Source: LeaRning Path on R – Step by Step Guide to Learn Data Science on R

Marketing Multi-Channel Attribution model based on Sales Funnel with R | R-bloggers


This is the last post in the series of articles about using Multi-Channel Attribution in marketing. In the previous two articles (part 1 and part 2), we reviewed a simple and powerful approach based on Markov chains that allows you to effectively attribute marketing channels.

In this article, we will review another fascinating approach that marries heuristic and probabilistic methods. Again, the core idea is straightforward and effective.

Sales Funnel
Usually, companies have some idea of how their clients move along the user journey from first visiting a website to closing a purchase. This sequence of steps is called a Sales (purchasing or conversion) Funnel. Classically, the Sales Funnel includes at least four steps:
  • Awareness – the customer becomes aware of the existence of a product or service (“I didn’t know there was an app for that”),
  • Interest – actively expressing an interest in a product group (“I like how your app does X”),
  • Desire – aspiring to a particular brand or product (“Think I might buy a yearly membership”),
  • Action – taking the next step towards purchasing the chosen product (“Where do I enter payment details?”).

For an e-commerce site, we can come up with one or more conditions (events/actions) that serve as evidence of a customer passing each step of the Sales Funnel.

For some extra information about the Sales Funnel, you can take a look at my (rather ugly) approach to Sales Funnel visualization with R.

Companies naturally lose some share of visitors at each successive step of the Sales Funnel as it gets narrower; that’s why it looks like a string of bottlenecks. We can calculate the probability of transition from one step to the next based on the recorded history of transitions. On the other hand, customer journeys are sequences of sessions (visits), and these sessions are attributed to different marketing channels.

Therefore, we can link marketing channels to the probability of a customer passing through each step of the Sales Funnel. And here is the core idea of the concept: the probability of moving through each “bottleneck” represents the value of the marketing channel that leads a customer through it. The higher the probability of passing a “neck”, the lower the value of the channel that provided the transition; conversely, the lower the probability, the higher the value of the marketing channel in question.

Let’s study the concept with the following example. First off, we’ll define the Sales Funnel and a set of conditions that register a customer as passing through each step of the Funnel.

  • Step 0 (necessary condition) – the customer visits the site for the first time
  • Step 1 (awareness) – visits two of the site’s pages
  • Step 2 (interest) – reviews a product page
  • Step 3 (desire) – adds a product to the shopping cart
  • Step 4 (action) – completes the purchase

Second, we need to extract the data that includes sessions where corresponding events occurred. We’ll simulate this data with the following code:

library(tidyverse)
library(purrrlyr)
library(reshape2)

##### simulating the "real" data #####
set.seed(454)
df_raw <- data.frame(customer_id = paste0('id', sample(c(1:5000), replace = TRUE)),
                     date = as.POSIXct(rbeta(10000, 0.7, 10) * 10000000,
                                       origin = '2017-01-01', tz = "UTC"),
                     channel = paste0('channel_', sample(c(0:7), 10000, replace = TRUE,
                                                         prob = c(0.2, 0.12, 0.03, 0.07, 0.15, 0.25, 0.1, 0.08))),
                     site_visit = 1) %>%
  # simulate which funnel events fire in each session;
  # each step can only occur if the previous one did
  mutate(two_pages_visit = sample(c(0, 1), 10000, replace = TRUE, prob = c(0.8, 0.2)),
         product_page_visit = ifelse(two_pages_visit == 1,
                                     sample(c(0, 1),
                                            length(two_pages_visit[which(two_pages_visit == 1)]),
                                            replace = TRUE, prob = c(0.75, 0.25)),
                                     0),
         add_to_cart = ifelse(product_page_visit == 1,
                              sample(c(0, 1),
                                     length(product_page_visit[which(product_page_visit == 1)]),
                                     replace = TRUE, prob = c(0.1, 0.9)),
                              0),
         purchase = ifelse(add_to_cart == 1,
                           sample(c(0, 1),
                                  length(add_to_cart[which(add_to_cart == 1)]),
                                  replace = TRUE, prob = c(0.02, 0.98)),
                           0)) %>%
  dmap_at(c('customer_id', 'channel'), as.character) %>%
  arrange(date) %>%
  mutate(session_id = row_number()) %>%
  arrange(customer_id, session_id)

# reshape to long format: one row per (session, event), keeping fired events only
df_raw <- melt(df_raw, id.vars = c('customer_id', 'date', 'channel', 'session_id'),
               value.name = 'trigger', variable.name = 'event') %>%
  filter(trigger == 1) %>%
  select(-trigger) %>%
  arrange(customer_id, date)

And the data sample looks like:

Next up, the data needs to be preprocessed. For example, it would be useful to replace the NA/direct channel with the previous one, to separate first-time purchasers from returning customers, or even to create different Sales Funnels based on new vs. returning customers, segments, locations and so on. I will omit this step, but you can find some ideas on preprocessing in my previous blogpost.

The important thing about this approach is that we only attribute the marketing channel that first led the customer through a given step. For instance, suppose a customer first reviews a product page (step 2, interest) in a session brought by channel_1. Any future product page visits from other channels won’t be attributed until the customer makes a purchase and starts a new Sales Funnel journey.

Therefore, we will filter records for each customer and save the first unique event of each step of the Sales Funnel using the following code:

### removing not first events ###
df_customers <- df_raw %>%
 group_by(customer_id, event) %>%
 filter(date == min(date)) %>%
 ungroup()

Note that in this way we assume all customers were first-time buyers; therefore every subsequent purchase event is removed by the code above.

Now we can use the obtained data frame to compute the Sales Funnel’s transition probabilities, the importance of each Sales Funnel step, and their weighted importances. According to the method, the higher the probability, the lower the value of the channel. Therefore, we calculate the importance of each step as 1 minus its transition probability. After that, we need to weight the importances, because their sum will be higher than 1. We do these calculations with the following code:

### Sales Funnel probabilities ###
sf_probs <- df_customers %>%
 group_by(event) %>%
 summarise(customers_on_step = n()) %>%
 ungroup() %>%
 mutate(sf_probs = round(customers_on_step / customers_on_step[event == 'site_visit'], 3),
        sf_probs_step = round(customers_on_step / lag(customers_on_step), 3),
        sf_probs_step = ifelse(is.na(sf_probs_step) == TRUE, 1, sf_probs_step),
        sf_importance = 1 - sf_probs_step,
        sf_importance_weighted = sf_importance / sum(sf_importance))

A hint: it can be a good idea to compute the Sales Funnel probabilities over a limited prior period, for example 1-3 months. The reason is that the customer flow, or the capacities of the “necks”, can vary due to changes on the company’s site, changes in marketing campaigns, and so on. Therefore, you can analyze the dynamics of the Sales Funnel’s transition probabilities over time in order to find the appropriate period.

I can’t publish a blogpost without a visualization. This time I suggest another approach to Sales Funnel visualization, one that represents all customer journeys through the Sales Funnel, with the following code:

### Sales Funnel visualization ###
df_customers_plot <- df_customers %>%
 
 group_by(event) %>%
 arrange(channel) %>%
 mutate(pl = row_number()) %>%
 ungroup() %>%
 
 mutate(pl_new = case_when(
 event == 'two_pages_visit' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'two_pages_visit'])) / 2),
 event == 'product_page_visit' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'product_page_visit'])) / 2),
 event == 'add_to_cart' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'add_to_cart'])) / 2),
 event == 'purchase' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'purchase'])) / 2),
 TRUE ~ 0
 ),
 pl = pl + pl_new)
df_customers_plot$event <- factor(df_customers_plot$event, levels = c('purchase',
 'add_to_cart',
 'product_page_visit',
 'two_pages_visit',
 'site_visit'
 ))
# color palette
cols <- c('#4e79a7', '#f28e2b', '#e15759', '#76b7b2', '#59a14f',
 '#edc948', '#b07aa1', '#ff9da7', '#9c755f', '#bab0ac')
ggplot(df_customers_plot, aes(x = event, y = pl)) +
 theme_minimal() +
 scale_colour_manual(values = cols) +
 coord_flip() +
 geom_line(aes(group = customer_id, color = as.factor(channel)), size = 0.05) +
 geom_text(data = sf_probs, aes(x = event, y = 1, label = paste0(sf_probs*100, '%')), size = 4, fontface = 'bold') +
 guides(color = guide_legend(override.aes = list(size = 2))) +
 theme(legend.position = 'bottom',
 legend.direction = "horizontal",
 panel.grid.major.x = element_blank(),
 panel.grid.minor = element_blank(),
 plot.title = element_text(size = 20, face = "bold", vjust = 2, color = 'black', lineheight = 0.8),
 axis.title.y = element_text(size = 16, face = "bold"),
 axis.title.x = element_blank(),
 axis.text.x = element_blank(),
 axis.text.y = element_text(size = 8, angle = 90, hjust = 0.5, vjust = 0.5, face = "plain")) +
 ggtitle("Sales Funnel visualization - all customers journeys")

OK, it seems we now have everything we need for the final calculations. In the following code, we will remove all users who didn’t make a purchase. Then we’ll link the weighted importances of the Sales Funnel steps with sessions by event and, at last, sum them up.

### computing attribution ###
df_attrib <- df_customers %>%
 # removing customers without purchase
 group_by(customer_id) %>%
 filter(any(as.character(event) == 'purchase')) %>%
 ungroup() %>%
 
 # joining step's importances
 left_join(., sf_probs %>% select(event, sf_importance_weighted), by = 'event') %>%
 
 group_by(channel) %>%
 summarise(tot_attribution = sum(sf_importance_weighted)) %>%
 ungroup()

As a result, we’ve obtained the number of conversions distributed over the marketing channels:

In the same way, you can distribute revenue over the channels.
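For instance, assuming the raw data also carried a revenue column recorded on each purchase row (no such column exists in the simulation above, so this is only a sketch of the idea), the conversion code generalizes like this:

df_attrib_rev <- df_customers %>%
 group_by(customer_id) %>%
 filter(any(as.character(event) == 'purchase')) %>%
 # total revenue of each converted customer (assumed 'revenue' column)
 mutate(customer_revenue = sum(revenue, na.rm = TRUE)) %>%
 ungroup() %>%
 left_join(., sf_probs %>% select(event, sf_importance_weighted), by = 'event') %>%
 group_by(channel) %>%
 # each session receives its step's weighted share of the customer's revenue
 summarise(tot_revenue = sum(customer_revenue * sf_importance_weighted)) %>%
 ungroup()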

At the end of the article, I want to share the OWOX company blog, where you can read more about the approach: Funnel Based Attribution Model.

In addition, OWOX provides an automated system for Marketing Multi-Channel Attribution based on BigQuery. So if you are not familiar with R or don’t have a suitable data warehouse, I recommend testing their service.

The post Marketing Multi-Channel Attribution model based on Sales Funnel with R appeared first on AnalyzeCore – data is beautiful, data is a story.

Source: Marketing Multi-Channel Attribution model based on Sales Funnel with R | R-bloggers

Software Carpentry: Data Science

Data Science

Preparing for the race against the machines…

“The future is here, it’s just not evenly distributed yet.”
– William Gibson

Data science with R, RStudio, the tidyverse, Spark and AWS

  1. The R language
  2. Measures and R data structures
  3. Tidy data and models – broom
  4. R development environment infrastructure and the data science toolchain – Python
  5. The tidyverse data science framework
  6. R packages
  7. Working with diverse data
  8. Data products
    1. Data journalism – Andrew Flowers
    2. Shiny web apps
    3. Automating report generation (30 minutes)
    4. Sending email automatically from R
    5. Public data products
    6. MLB baseball
  9. Sorting

References

Source: Software Carpentry: Data Science

The Periodic Table of Data Science

This periodic table can serve as a guide for navigating the key players in the data science space. The resources in the table were selected by looking at surveys of data science users, such as O’Reilly’s 2016 Data Science Salary Survey, Gartner’s 2017 Magic Quadrant and the KDnuggets 2016 Software Poll, among other sources. The categories in the table are not all mutually exclusive.

 

Exploring the Periodic Table of Data Science

The left section of the table lists companies related to education: courses, boot camps and conferences. On the right, by contrast, are resources for staying up to date: the latest news, the most popular blogs and relevant material from the data science community. In the middle are the tools you can use to get started with data science: programming languages, projects and challenges, data visualization tools, and more.

The table organizes data science resources, tools and companies into the following 13 categories:

Courses: for those looking to learn data science, there are many sites and companies offering data science courses, such as the MOOCs from DataCamp, Coursera and edX. You’ll find plenty of options to match your learning style!

Boot camps: this section includes resources for those who are looking for more mentored options to learn data science. You’ll see that boot camps like The Data Incubator or Galvanize have been included.

Conferences: learning is not something you do only in courses or boot camps. Conferences are something that learners often forget, but they also contribute to learning data science: it’s important to attend them as an aspiring data scientist, as you’ll get in touch with the latest advancements and the best industry experts. Some of the ones listed in the table are useR!, the Tableau Conference and PyData.

Data: practice makes perfect, and this is also the case for data science. You’ll need to find data sets in order to start practicing what you learned in the courses on real-life data, or to build your data science portfolio. Data is the basic building block of data science, and finding it can be one of the hardest things. Some of the options you could consider when looking for cool data sets are data.world, Quandl and Statista.

Projects & Challenges, Competitions: after practicing, you might also consider taking on bigger projects: data science portfolios, competitions, challenges, …. You’ll find all of these in this category of the Periodic Table of Data Science! One of the most popular options is probably Kaggle, but also DrivenData or DataKind are worth checking out!

Programming Languages & Distributions: data scientists generally use not just one but many programming languages; some, like Python, have recently gained a lot of traction in the community, and Python distributions like Anaconda seem to be finding their way to data science aspirants.

Search & Data Management: this enormous category contains all tools that you can use to search and manage your data in some way. You’ll see, on the one hand, a search library like Lucene, but also a relational database management system like Oracle.

Machine Learning & Stats: this category not only offers you libraries to get started with machine learning and stats with programming languages such as Python, but also entire platforms, such as Alteryx or DataRobot.

Data Visualization & Reporting: after you have analyzed and modeled your data, you might be looking to visualize the results and report on what you have been investigating. You can make use of open-source options like Shiny or Matplotlib to do this, or fall back on commercial options such as Qlikview or Tableau.

Collaboration: collaboration is a trending topic in the data science community. As you grow, you’ll also find the need to work in teams (even if it’s just with one other person!) and in those cases, you’ll want to make use of notebooks like Jupyter. But even as you’re just working on your own, working with an IDE can come in handy if you’re just starting out. In such cases, consider Rodeo or Spyder.

Community & Q&A: asking questions and falling back on the community is one of the things you’ll probably do a lot when you’re learning data science. If you’re ever unsure where to find the answer to your data science question, you can be sure to find it on sites such as Stack Overflow, Quora, Reddit, etc.

News, Newsletters & Blogs: you’ll find that the community is evolving and growing rapidly: following the news and the latest trends is a necessity. General newsletters like Data Science Weekly or Data Elixir, or language-specific newsletters like Python Weekly or R Weekly, can give you your weekly dose of data science right in your mailbox. Blogging sites like R-bloggers or KDnuggets are also worth following!

Podcasts: last, but definitely not least, are the podcasts. These are great in many ways, as you’ll get introduced to expert interviews, like in Becoming A Data Scientist or to specific data science topics, like in Data Stories or Talking Machines!

Are you thinking of another resource that should be added to this periodic table?  Leave a comment below and tell us about it!

 

Source: The Periodic Table of Data Science | R-bloggers

The 5 Most Effective Ways to Learn R

Whether you’re plotting a simple time series or building a predictive model for the next election, the R programming language’s flexibility will ensure you have all the capabilities you need to get the job done. In this blog we will take a look at five effective tactics for learning this essential data science language, as well as some of the top resources associated with each. These tactics should be used to complement one another on your path to mastering the world’s most powerful statistical language!

1. Watch Instructive Videos

We often flock to YouTube when we want to learn how to play a song on the piano, change a tire, or chop an onion, but why should it be any different when learning how to perform calculations using the most popular statistical programming language? LearnR, Google Developers, and MarinStatsLectures are all fantastic YouTube channels with playlists specifically dedicated to the R language.

2. Read Blogs

There’s a good chance you came across this article through the R-bloggers website, which curates content from some of the best blogs about R that can be found on the web today. Since there are 750+ blogs that are curated on R-bloggers alone, you shouldn’t have a problem finding an article on the exact topic or use case you’re interested in!

A few notable R blogs:

3. Take an Online Course

As we’ve mentioned in previous blogs, there are a great number of online classes you can take to learn specific technical skills. In many instances these courses are free, or very affordable, with some offering discounts to college students. Why spend thousands of dollars on a university course when you can get as good an understanding, if not better (IMHO), online?

Some sites that offer great R courses include:

4. Read Books

Many times, books get a bad rap, since most programming concepts can be found online for free. Sure, if you are going to use a book just as a reference, you’d probably be better off saving that money and turning to Google search. However, if you’re a beginner, or someone who wants to learn the fundamentals, working through an entire book at the foundational level will provide a high degree of understanding.

There is a fantastic list of the best books for R at Data Science Central.

5. Experiment!

You can read articles and watch videos all day long, but if you never try it for yourself, you’ll never learn! Datazar is a great place for you to jump right in and experiment with what you’ve learned. You can immediately start by opening the R console or creating a notebook in our cloud-based environment. If you get stuck, you can consult with other users and even work with scripts that have been opened up by others!

I hope you found this helpful and as always if you would like to share any additional resources, feel free to drop them in the comments below!

Resources Included in this Article


The 5 Most Effective Ways to Learn R was originally published in Datazar Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: The 5 Most Effective Ways to Learn R | R-bloggers

Tutorials for Learning R

There are tons of resources to help you learn the different aspects of R, and as a beginner this can be overwhelming. It’s also a dynamic language and rapidly changing, so it’s important to keep up with the latest tools and technologies.

That’s why R-bloggers and DataCamp have worked together to bring you a learning path for R. Each section points you to relevant resources and tools to get you started and keep you engaged to continue learning. It’s a mix of materials ranging from documentation, online courses, books, and more.

Just like R, this learning path is a dynamic resource. We want to continually evolve and improve the resources to provide the best possible learning experience. So if you have suggestions for improvement please email tal.galili@gmail.com with your feedback.

Learning Path

Getting started: The basics of R

Setting up your machine

R packages

Importing your data into R

Data Manipulation

Data Visualization

Data Science & Machine Learning with R

Reporting Results in R

Learning advanced R topics in (paid) online courses

Next steps

Getting started: The basics of R


The best way to learn R is by doing. In case you are just getting started with R, this free Introduction to R tutorial by DataCamp is a great resource, as is its successor, Intermediate R programming (subscription required). Both courses teach you R programming and data science interactively, at your own pace, in the comfort of your browser. You get immediate feedback during exercises, with helpful hints along the way so you don’t get stuck.

Another free online interactive tutorial for R is available on O’Reilly’s Code School website, called Try R. An offline interactive learning resource is swirl, an R package that makes it fun and easy to become an R programmer. You can take a swirl course by (i) installing the package in R, and (ii) selecting a course from the course library. If you want to start right away without installing anything, you can also opt for the online version of swirl.

There are also some very good MOOCs available on edX and Coursera that teach the basics of R programming. On edX you can find Introduction to R Programming by Microsoft, an 8-hour course that focuses on the fundamentals and basic syntax of R. At Coursera there is the very popular R Programming course by Johns Hopkins. Both are highly recommended!

If you instead prefer to learn R via a written tutorial or book, there is plenty of choice. There is the Introduction to R manual by CRAN, as well as some very accessible books like Jared Lander’s R for Everyone or R in Action by Robert Kabacoff.

Setting up your machine

You can download a copy of R from the Comprehensive R Archive Network (CRAN). There are binaries available for Linux, Mac and Windows.

Once R is installed you can choose to either work with the basic R console, or with an integrated development environment (IDE). RStudio is by far the most popular IDE for R and supports debugging, workspace management, plotting and much more (make sure to check out the RStudio shortcuts).


Next to RStudio you also have Architect, an Eclipse-based IDE for R. If you prefer to work with a graphical user interface, you can have a look at R Commander (aka Rcmdr) or Deducer.

R packages


R packages are the fuel that drives the growth and popularity of R. They are bundles of code, data, documentation, and tests that are easy to share with others. Before you can use a package, you first have to install it. Some packages, like the base package, are automatically installed when you install R. Others, like the ggplot2 package, don’t come with the bundled R installation and need to be installed separately.

Many (but not all) R packages are organized and available from CRAN, a network of servers around the world that store identical, up-to-date versions of code and documentation for R. You can easily install these packages from inside R using the install.packages function. CRAN also maintains a set of Task Views that identify all the packages associated with a particular task, such as TimeSeries.

Next to CRAN you also have Bioconductor, which has packages for the analysis of high-throughput genomic data, as well as the GitHub and Bitbucket repositories of R package developers. You can easily install packages from these repositories using the devtools package.

Finding a package can be hard, but luckily you can easily search packages from CRAN, GitHub and Bioconductor using Rdocumentation or inside-R, or you can have a look at this quick list of useful R packages.

To end, once you start working with R, you’ll quickly find that R package dependencies can cause a lot of headaches. Once you get confronted with that issue, make sure to check out packrat (see the video tutorial) or checkpoint. When you need to update R on Windows, you can use the updateR() function from the installr package.
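In code, the everyday package workflow looks roughly like this (the GitHub repository below is just an example):

install.packages("ggplot2")            # install a package from CRAN
library(ggplot2)                       # load it for the current session

library(devtools)
install_github("tidyverse/ggplot2")    # development version from GitHub

# updating R itself on Windows:
# library(installr); updateR()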

Importing your data into R

The data you want to import into R can come in all sorts of formats: flat files, statistical software files, databases and web data.


Getting different types of data into R often requires a different approach. To learn more in general about how to get different data types into R, you can check out this online Importing Data into R tutorial (subscription required), this post on data importing, or this webinar by RStudio.

  • Flat files are typically simple text files that contain table data. The standard distribution of R provides functionality to import these flat files into R as a data frame, with functions such as read.table() and read.csv() from the utils package. Specific R packages for importing flat files are readr, a fast and very easy to use package that is less verbose than utils and multiple times faster (more information), and data.table, with its fread() function for importing and munging data into R (using the fread function).
  • Software packages such as SAS, STATA and SPSS use and produce their own file types. The haven package by Hadley Wickham can import SAS, STATA and SPSS data files into R and is very easy to use. Alternatively there is the foreign package, which is able to import not only SAS, STATA and SPSS files but also more exotic formats like Systat and Weka. It is also able to export data again to various formats. (Tip: if you’re switching from SAS, SPSS or STATA to R, check out Bob Muenchen’s tutorial (subscription required).)
  • The packages used to connect to and import from a relational database depend on the type of database you want to connect to. For example, to connect to a MySQL database you will need the RMySQL package; others are the RPostgreSQL and ROracle packages. The R functions you can then use to access and manipulate the database are specified in another R package called DBI.
  • If you want to harvest web data using R, you need to connect R to resources online using APIs or through scraping with packages like rvest. To get started with all of this, there is a great resource freely available on the blog of Rolf Fredheim.

Data Manipulation

Turning your raw data into well-structured data is important for robust analysis and makes data suitable for processing. R has many built-in functions for data processing, but they are not always easy to use. Luckily, there are some great packages that can help you:

  • The tidyr package allows you to “tidy” your data. Tidy data is data where each column is a variable and each row an observation. As such, it turns your data into data that is easy to work with. Check this excellent resource on how you can tidy your data using tidyr.
  • If you want to do string manipulation, you should learn about the stringr package. The vignette is very understandable and full of useful examples to get you started.
  • dplyr is a great package when working with data-frame-like objects (in memory and out of memory). It combines speed with a very intuitive syntax. To learn more about dplyr you can take this data manipulation course (subscription required) and check out this handy cheat sheet.
  • When performing heavy data wrangling tasks, the data.table package should be your “go-to” package. It’s blazingly fast, and once you get the hang of its syntax you will find yourself using data.table all the time. Check this data analysis course (subscription required) to discover the ins and outs of data.table, and use this cheat sheet as a reference. (A minimal sketch follows this list.)
  • Chances are you will find yourself working with times and dates at some point. This can be a painful process, but luckily lubridate makes it a bit easier. Check its vignette to better understand how you can use lubridate in your day-to-day analysis.
  • Base R has limited functionality for handling time series data. Fortunately, there are packages like zoo, xts and quantmod. Take this tutorial by Eric Zivot to better understand how to use these packages and how to work with time series data in R.
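Of these, data.table’s compact DT[i, j, by] syntax takes the most getting used to; a minimal sketch:

library(data.table)
dt <- as.data.table(mtcars)
# filter (i), compute (j) and group (by) in a single call
dt[cyl == 4, .(avg_mpg = mean(mpg)), by = gear]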

If you want a general overview of data manipulation with R, you can read more in the book Data Manipulation with R or watch the Data Wrangling with R video by RStudio. In case you run into trouble handling your data frames, check 15 easy solutions to your data frame problems.

Data Visualization

One of the things that make R such a great tool is its data visualization capabilities. For visualization in R, ggplot2 is probably the best-known package and a must-learn for beginners! You can find all the relevant information to get you started with ggplot2 on http://ggplot2.org/, and make sure to check out the cheatsheet and the upcoming book. Next to ggplot2, you also have packages such as ggvis for interactive web graphics (see the tutorial (subscription required)), googleVis to interface with Google Charts (learn to re-create this TED talk), Plotly for R, and many more. See the task view for some hidden gems, and if you have issues plotting your data this post might help you out.

In R there is a whole task view dedicated to handling spatial data, which allows you to create beautiful maps such as this famous one:


To get started, look at a package such as ggmap, which allows you to visualize spatial data and models on top of static maps from sources such as Google Maps and Open Street Maps. Alternatively, you can start playing around with maptools, choroplethr, and the tmap package. If you need a great tutorial, take this Introduction to visualising spatial data in R.

You’ll often see that visualizations in R make use of magnificent color schemes that fit the graph/map/… like a glove. If you want to achieve this for your own visualizations, then dive into the RColorBrewer package and ColorBrewer.

One of the latest visualization tools in R is HTML widgets. HTML widgets work just like R plots, but they create interactive web visualizations such as dynamic maps (leaflet), time-series charts (dygraphs), and interactive tables (DataTables). There are some very nice examples of HTML widgets in the wild, and solid documentation on how to create your own (not in a reading mood? just watch this video).

If you want to get some inspiration on what visualization to create next, you can have a look at blogs dedicated to visualizations such as FlowingData.

Data Science & Machine Learning with R

There are many beginner resources on how to do data science with R. A list of available online courses:

Alternatively, if you prefer a good read:

Once you start doing some machine learning with R, you will quickly find yourself using packages such as caret, rpart and randomForest. Luckily, there are some great learning resources for these packages and machine learning in general. If you are just getting started, this guide will get you going in no time. Alternatively, you can have a look at the books Mastering Machine Learning with R and Machine Learning with R. If you are looking for step-by-step tutorials that guide you through a real-life example, there is the Kaggle Machine Learning course, or you can have a look at Wiekvoet’s blog.

Reporting Results in R

R Markdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R. It is a great tool for reporting your data analysis in a reproducible manner, thereby making the analysis more useful and understandable. R Markdown is based on knitr and pandoc. With R Markdown, R generates a final document that replaces the R code with its results. This document can be in html, Word, pdf, ioslides, or other formats. You can even create interactive R Markdown documents using Shiny. This 4-hour tutorial on Reporting with R Markdown (subscription required) gets you going with R Markdown, and in addition you can use this nice cheat sheet for future reference.

Next to R Markdown, you should also make sure to check out Shiny. Shiny makes it incredibly easy to build interactive web applications with R. It allows you to turn your analysis into interactive web applications without needing to know HTML, CSS or JavaScript. RStudio maintains a great learning portal to get you started with Shiny, including this set of video tutorials (click on the Essentials of Shiny Learning Roadmap). More advanced topics are available, as well as a great set of examples.


Learning advanced R topics in (paid) online courses

DataCamp

Other than its free courses, DataCamp also offers access to all of its advanced R courses for $25/month; these include:

Udemy

Another company is Udemy. While they do not offer video + interactive sessions like DataCamp, they do offer extensive video lessons covering some other topics in using R and learning statistics. For readers of R-bloggers, Udemy is offering access to its courses for $15-$30 per course; use the code RBLOGGERS30 for an extra 30% discount. Here are some of their courses:

Statistics.com

Statistics.com is an online learning website with 100+ courses in statistics, analytics, data mining, text mining, forecasting, social network analysis, spatial analysis, etc.

They have kindly agreed to offer R-bloggers readers a reduced rate of $399 for any of their 23 courses in R, Python, SQL or SAS. These are high-impact courses, each 4 weeks long (normally costing up to $589). They feature hands-on exercises and projects, and the opportunity to receive answers online from leading experts like Paul Murrell (member of the R core development team), Chris Brunsdon (co-developer of the GISTools package), Ben Baumer (former statistician for the NY Mets baseball team), and others. These instructors will answer all your questions (via a private discussion forum) over a 4-week period.

You may use the code “R-Blogger16” when registering. You can register for any R, Python, Hadoop, SQL or SAS course starting on any date. Here is a list of the R-related courses:

Using R as a statistical package

Building R programming skills – for those familiar with R, or experienced with other programming languages or statistical computing environments

Applying R to specific domains or applications

You may pick any of the R courses from their catalog page:
www.statistics.com/course-catalog/

Next steps

Once you become more fluent in writing R syntax (and consequently addicted to R), you will want to unlock more of its power (read: do some really nifty stuff). In that case, make sure to check out Rcpp, an R package that makes it easier to integrate C++ code with R, or RevoScaleR (start the free tutorial).

After spending some time writing R code (and becoming an R addict), you’ll reach a point where you want to start writing your own R package. Hilary Parker from Etsy has written a short tutorial on how to create your first package, and if you’re really serious about it, read R Packages, an upcoming book by Hadley Wickham that is already available for free on the web.

If you want to learn about the inner workings of R and improve your understanding of it, the best way to get started is by reading Advanced R.

Finally, come visit us again at R-bloggers.com to read the latest news and tutorials from bloggers of the R community.

Source: Tutorials for learning R | R-bloggers

Exploring MOOCs: Coursera

Online lectures are quite familiar in Korea, but until just a few years ago they were a rather unfamiliar concept abroad: overseas, the private education market was far less developed and the infrastructure was not well established. Since the emergence of Massive Open Online Courses (MOOCs) in 2010, however, the online education market has been attracting new attention worldwide. Among MOOC providers, Coursera is without question the front-runner that made MOOCs flourish, and it continues to advance the MOOC industry through a variety of experiments and collaborations with universities.

The Coursera logo

 

“Making the best lectures by the best professors available to everyone”

Coursera is a service founded in 2012 by Andrew Ng and Daphne Koller, two professors who had been teaching at Stanford University. Both were already renowned authorities in computer science and data science before Coursera. How did two career professors come to found a company like Coursera? Daphne Koller’s 2012 TED talk gives a glimpse of the story behind its founding.

Daphne Koller’s parents were both academics, and she grew up in a family that had produced PhDs for three generations. Unlike Koller, who could study whatever she wanted, many people in the world live the opposite life. During her years as a professor, she came to think deeply about the educational environment in less developed countries and about the problem of high university tuition in the United States.

Daphne Koller, Coursera co-founder, giving a TED talk. <Source: still from Daphne Koller’s TED talk video>

“When I was young, I used to play in my father’s university lab, so attending a top university seemed completely natural to me. That is where the doors of opportunity open. Unfortunately, most people in the world are not so lucky. In some parts of the globe, for example South Africa, getting an education is not easy. South Africa’s education system was built in the days of apartheid1), when a small white minority ruled the country. As a result, today there are not enough places for the many people who want, and deserve, a higher education. That scarcity led to a crisis at the University of Johannesburg in January 2012. A handful of freshman places remained after the regular admissions process, and on the night before registration opened, thousands of people lined up for a mile outside the gate hoping to claim one. When the gate opened there was a stampede: twenty people were injured and one woman died. She was a mother who traded her life for the chance that her son might live a better one.” – from Daphne Koller’s TED talk

That is not all. Koller also pointed out that US university tuition had risen by no less than 559% since 1985, leaving many people unable to afford good classes for financial reasons. Online lectures were the means she chose to address these problems.

 

Coursera’s founders: Daphne Koller (left) and Andrew Ng. <Source: Coursera homepage>

In fact, Koller’s colleague Andrew Ng had already been experimenting with opening his annual machine learning course to anyone. The course’s original capacity was 400 students, but by moving it online he could teach 100,000 students at once. Through this the two professors confirmed the demand from learners who wanted access to good lectures, and they founded Coursera right away. Within just three months of the website’s launch in 2012, Coursera attracted 640,000 users from 190 countries. In the first year there were 1.5 million course enrollments, and videos were played 14 million times. As of early 2017, Coursera’s membership had grown to 24 million.

An example of the Coursera service <Source: Coursera homepage>

At first Coursera offered mostly computer science courses, but it now provides a much broader range, covering business, languages, management, the humanities and more. Today 149 universities partner with Coursera, and together they have produced more than 2,000 courses. Duke University, Johns Hopkins University, Michigan State University and the Wharton School are among the main contributors, and courses from most of the famous private universities can be found there. Harvard and MIT, absent from Coursera, provide their courses to the competing platform edX. Below are the most popular courses of 2016.

The most-enrolled courses on Coursera in 20162)

1. Learning How to Learn: Powerful mental tools to help you master tough subjects – University of California, San Diego
2. Machine Learning – Stanford University
3. Programming for Everybody (Getting Started with Python) – University of Michigan
4. R Programming – Johns Hopkins University
5. Speak English Professionally: In Person, Online & On the Phone – Georgia Institute of Technology
6. Grammar and Punctuation – University of California, Irvine
7. Seeing Through Photographs – The Museum of Modern Art
8. The Data Scientist’s Toolbox – Johns Hopkins University
9. Buddhism and Modern Psychology – Princeton University
10. Mastering Data Analysis in Excel – Duke University

Coursera courses run from as short as 4-6 weeks to as long as 4-6 months. In the past many courses were recordings of classroom lectures, but recently more and more are produced specifically for online delivery. Most of the shorter courses can be started whenever you like and taken at your own pace. The longer courses mostly have fixed start and end dates, sometimes with set periods for watching the videos and separate assignment deadlines.

Online lectures are made not only so that many students can take them; they are also used to raise the quality of the professor’s teaching. For example, at the end of each video Coursera summarizes the content or poses short-answer or multiple-choice quiz questions on the key concepts. What if, in a class of tens of thousands, more than 2,000 students submit the same wrong answer? The professor can then see much more precisely what students find confusing and reinforce the lesson accordingly.

Coursera in the spotlight for lifelong learning

When MOOCs first appeared, there was a lively debate over whether online education would replace existing universities. Feeling threatened, universities criticized MOOCs on the grounds that they are not face-to-face instruction and, lacking any enforcement, have low completion rates. That controversy has largely died down: over time it has become clear that MOOCs like Coursera are used as a supplement to universities rather than a substitute, and they are used more as a lifelong learning tool by graduates than by current undergraduates.

According to the 2015 report ‘Learner Outcomes in Open Online Courses’3), written jointly by Coursera, the University of Washington and the University of Pennsylvania on the basis of a survey of 50,000 Coursera users, MOOCs draw more interest from working users than from students. For example, 52% of participants responded that they take online courses for self-improvement and to advance their careers. Of those, 62% said that taking online courses actually helped them do their jobs better, and 43% said they became better qualified in the course of looking for a new job.

Types of Coursera users <Source: Coursera report – Learner Outcomes in Open Online Courses, 2015>

Of the survey respondents, 58% were full-time employees and 12% were employed part-time. By education level, 32% held a bachelor’s degree, 37% a master’s degree and 9% a doctorate. By age, users in their 30s formed the largest group at 25%, closely followed by those in their 20s at 24%. It is also notable that 16% were aged 60 or over.

Covering operating costs with paid services: degrees, bundled courses and more

Coursera currently has far more users and far more courses than its rivals edX and Udacity. Behind this popularity lie the active participation of many universities and the fact that it is free. The obvious question, then, is how Coursera makes money. Coursera is not a non-profit; it was not created purely for profit, but it does cover its basic operating costs itself.

Coursera’s first revenue model was selling certificates. Because Coursera values accessibility, all lectures are in principle open for free: if you only want to watch the lectures, you can register for the ‘audit’ version. Courses can include projects and assignments, but to receive feedback from professors or teaching assistants you must enroll in the paid version. If you watch all the videos and complete the assignments faithfully, you earn a certificate, and receiving that certificate also costs money. Most paid courses cost $29-$99.

The difference between Coursera’s free and paid courses; ‘Audit’ is the free option. <Source: Coursera homepage>

There are also course bundles called Specializations. To become a data scientist, for example, you cannot learn just one subject: you need to study the tools data scientists use, programming languages, analysis methods and more to build a solid grasp of the concepts. Coursera curates the courses you need to become an expert in a given field, bundles them together, and issues a separate certificate to students who complete the entire bundle. Specializations cost $250-$500, and completing one takes 4-6 months on average.

There is also an online master’s degree program, an experimental project run jointly by the University of Illinois and Coursera. Only an MBA track and a master’s in data science have been opened so far. The program is open only to holders of a bachelor’s degree, and despite being online, the University of Illinois selects students by considering recommendation letters, résumés, English test scores and more. Completing the online master’s takes one to three years, and tuition runs $15,000-$25,000.

Coursera’s master’s degree programs, run jointly with the University of Illinois. <Source: Coursera degrees homepage>

Expanding into an enterprise education platform

Exactly how much revenue this structure has generated for Coursera is unknown, since the company has not disclosed specific figures. Coursera raised investment steadily from 2012 to 2015, so it is most likely operating on that funding for now. To date Coursera has raised $146.1 million, roughly 165 billion Korean won.

Class Central, which runs an aggregation portal for MOOCs, once published a post4) estimating Coursera’s revenue based on material shared at a Coursera partner event. Class Central noted that Coursera has 100,000 monthly paying users, with 20,000 new paying users arriving every month, and estimated on that basis that it earned $50-60 million in 2016. It also noted that 270 students had enrolled in Coursera’s online MBA program and projected that, assuming each student paid $22,000 in tuition, the program would have generated $5 million in revenue.

‘Coursera for Business’ <Source: Coursera homepage (Coursera for Business)>

Coursera’s newest revenue model is Coursera for Business, an enterprise education platform that curates and recommends the courses and videos each company needs. A company administrator can organize courses into areas such as digital marketing, data science and leadership, invite the relevant employees, and then track and manage whether those employees actually enrolled and how much of the videos they watched. It can also be linked with existing corporate accounts. PayPal, L’Oréal and Air France are among the companies using Coursera for Business.

Coursera’s new experiments

Since 2016 Coursera has been going through a series of changes. First, its founders have stepped back from management and new people are leading the company. The current CEO is Rick Levin, former president of Yale, and executives with long careers at Intel, eBay, Netflix and Google have joined the management team. Daphne Koller moved to a biotech startup, and Andrew Ng joined the Chinese company Baidu to analyze and manage large-scale data.

The range of courses is also diversifying. Since 2016 Coursera has been releasing practical, project-centered classes such as ‘writing polished emails and memos’, ‘drawing infographics’, ‘how to write a résumé’ and ‘how to write a research paper’. Distinctive arts courses also stand out: project-centered MOOC classes used to be mostly IT-related, but recent additions include ‘making a comic book’, ‘writing a TV pilot’ and ‘making electronic music’. Among the IT offerings there are standard classes like ‘developing Android apps’ as well as short-format ones like ‘build a website in a week’. With these classes Coursera appears to be aiming to raise completion rates while also attracting paying students.

Coursera’s project-based courses. <Source: Coursera course pages>

Coursera is also experimenting with a job-linkage program5), which curates the background information and the good courses you need in order to land a particular job, together with student reviews and job listings. For example, beneath the ‘how to become a data scientist’ course, links to the recruiting pages of well-known US companies such as Target, Allstate and Airbnb are displayed, giving students concrete information about what they can do with what they have learned.

The payment model has changed as well. Specializations used to require full payment up front, but since 2016 a monthly fee applies, so students can pay as they progress through the material each month.

Coursera user growth by country. <Source: Coursera blog>

Coursera is also paying attention to global expansion and brand building. It has traditionally been most popular with American learners, but in the last few years its user base has been growing in many countries, including China, India, Brazil and Russia. In 2016 it aired its first TV commercial, and it has partnered with the US State Department to support education for refugees.

A Coursera TV commercial <Source: YouTube>

References

Source: Coursera – Naver Cast

 

RStudio Cheat Sheets

Recruiting for the Big Data Analytics Expert Program – Korea Data Agency Big Data Academy

Program Objectives

Training big data analytics experts who can create new value through big data analysis in the workplace, built on knowledge of big data analysis planning, analysis methods and analysis tools.
The program is offered as a statistics-centered big data analysis track (cohorts 18 and 20) and a big data analysis track based on large-scale data processing technology (cohorts 19 and 21), so please check the curriculum before applying.

Who Should Apply

<Applicants must meet all of the following conditions>
Industry employees who are carrying out, or are about to carry out, a big data project
At least 3 years of experience analyzing market, customer or product data
(Cohorts 18 and 20) Experience with basic statistics
(Cohorts 19 and 21) Programming experience

Notes

The Big Data Academy is a training program for current employees; university and graduate students and the unemployed are not eligible.
Anyone who has already completed a Big Data Academy program may not enroll again.
Anyone who cancels enrollment without a legitimate reason, or fails to complete the program for falling short of the completion criteria, may not enroll for two years, including the current year.

Tuition: KRW 3,000,000

Employees of large companies: 80% of tuition subsidized (co-payment of KRW 600,000)
Employees of small and medium-sized companies, freelancers, etc.: 100% of tuition subsidized
Source: DBguide.net – Data Expert Knowledge Portal