17 Free Data Science Projects To Boost Your Knowledge & Skills


Introduction

Data science projects offer you a promising way to kick-start your analytics career. Not only do you get to learn data science by applying it, you also get projects to showcase on your CV. Nowadays, recruiters evaluate a candidate’s potential by their work, not so much by certificates and resumes. It wouldn’t matter how much you tell them you know if you have nothing to show for it! That’s where most people struggle and miss out.

You might have worked on several problems, but if you can’t present and explain your work, how on earth would someone know what you are capable of? That’s where these projects help. Think of the time spent on these projects as your training sessions. I guarantee, the more time you spend, the better you’ll become!

The data sets in the list below are handpicked. I’ve made sure to give you a taste of a variety of problems from different domains and of different sizes. I believe everyone must learn to work smartly on large data sets, hence large data sets are included. I’ve also made sure all the data sets are open and free to access.


 

Useful Information

To help you decide where to start, I’ve divided the data sets into three levels:

  1. Beginner Level: This level comprises data sets which are fairly easy to work with and don’t require complex data science techniques. You can solve them using basic regression / classification algorithms. These data sets also have enough open tutorials to get you going, and I’ve provided tutorials in this list to help you get started.
  2. Intermediate Level: This level comprises data sets which are more challenging. It consists of mid-sized and large data sets which require some serious pattern recognition skills. Feature engineering will also make a difference here. There is no limit on the ML techniques you can use; everything under the sun can be put to work.
  3. Advanced Level: This level is best suited for people who understand advanced topics like neural networks, deep learning, and recommender systems. High-dimensional data sets are also featured here. This is also the time to get creative: see the creativity the best data scientists bring to their work and code.

 

Table of Contents

  1. Beginner Level
    • Iris Data
    • Titanic Data
    • Loan Prediction Data
    • Bigmart Sales Data
    • Boston Housing Data
  2. Intermediate Level
    • Human Activity Recognition Data
    • Black Friday Data
    • Text Mining (SIAM Competition) Data
    • Trip History Data
    • Million Song Data
    • Census Income Data
    • Movie Lens Data
  3. Advanced Level
    • Identify your Digits
    • Yelp Data
    • ImageNet Data
    • KDD Cup 1999 Data
    • Chicago Crime Data

 

Beginner Level

1. Iris Data Set

This is probably the most versatile, easy, and resourceful data set in the pattern recognition literature. Nothing could be simpler than the iris data set for learning classification techniques. If you are totally new to data science, this is your starting point. The data has only 150 rows & 4 feature columns.

Problem: Predict the flower class based on available attributes.

Start: Get Data | Tutorial: Get Here
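
To get a feel for how little it takes to start, here is a minimal sketch in R, using the copy of the iris data that ships with base R and a simple decision tree from the rpart package (the 70/30 split and the seed are arbitrary choices):

# iris ships with base R: 150 rows, 4 measurements plus the Species label
library(rpart)

set.seed(42)
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))   # 70/30 train/test split
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# fit a simple decision tree for the flower class
fit  <- rpart(Species ~ ., data = train, method = "class")
pred <- predict(fit, test, type = "class")

# confusion matrix and accuracy on the held-out rows
table(predicted = pred, actual = test$Species)
mean(pred == test$Species)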

 

2. Titanic Data Set

This is another of the most quoted data sets in the global data science community. With several tutorials and help guides available, this project should give you enough of a push to pursue data science further. With a healthy mix of categorical, numeric, and text variables, this data set has enough scope to support crazy ideas! This is a classification problem. The data has 891 rows & 12 columns.

Problem: Predict the survival of passengers in Titanic.

Start: Get Data | Tutorial: Get Here

 

3. Loan Prediction Data Set

Among all industries, the insurance domain makes some of the heaviest use of analytics & data science methods. This data set gives you a taste of working on data sets from insurance companies: the challenges faced, the strategies used, which variables influence the outcome, and so on. This is a classification problem. The data has 615 rows and 13 columns.

Problem: Predict if a loan will get approved or not.

Start: Get Data | Tutorial: Get Here

 

4. Bigmart Sales Data Set

Retail is another industry which extensively uses analytics to optimize business processes. Tasks like product placement, inventory management, customized offers, and product bundling are being handled smartly using data science techniques. As the name suggests, this data comprises the transaction records of a sales store. This is a regression problem. The data has 8523 rows of 12 variables.

Problem: Predict the sales.

Start: Get Data | Tutorial: Get Here

 

5. Boston Housing Data Set

This is another popular data set used in the pattern recognition literature. The data set comes from the real estate industry in Boston (US). This is a regression problem. The data has 506 rows and 14 columns. Thus, it’s a fairly small data set where you can attempt any technique without worrying about your laptop’s memory.

Problem: Predict the median value of owner-occupied homes.

Start: Get Data | Tutorial: Get Here
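
As a minimal regression sketch in R, you can use the copy of the Boston housing data that ships with the MASS package (the downloadable file should contain the same variables, though names may differ slightly):

library(MASS)

# Boston: 506 rows, 13 predictors plus medv (median home value, in $1000s)
data(Boston)

# an ordinary least squares baseline for the median value of owner-occupied homes
fit <- lm(medv ~ ., data = Boston)
summary(fit)$r.squared   # in-sample fit of the baseline

# predicted vs actual values for a quick sanity check
head(data.frame(actual = Boston$medv, predicted = round(fitted(fit), 1)))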

 

Intermediate Level

1. Human Activity Recognition

This data set was collected from recordings of 30 human subjects captured via smartphones with embedded inertial sensors. Many machine learning courses use this data for student practice; it’s your turn now. This is a multi-class classification problem. The data set has 10299 rows and 561 columns.

Problem: Predict the activity category of a human

Start: Get Data

 

2. Black Friday Data Set

This data set comprises sales transactions captured at a retail store. It’s a classic data set for exploring your feature engineering skills and the day-to-day understanding you’ve gained from your own shopping experience. It’s a regression problem. The data set has 550069 rows and 12 columns.

Problem: Predict purchase amount.

Start: Get Data

 

3. Text Mining Data Set

This data set originally comes from the SIAM text mining competition of 2007. It comprises aviation safety reports describing problems that occurred on certain flights. It is a multi-class, high-dimensional classification problem. It has 21519 rows and 30438 columns.

Problem: Classify the documents according to their labels

Start: Get Data | Get Information

 

4. Trip History Data Set

This data set comes from a bike sharing service in the US. It requires you to exercise your data munging skills. The data is provided quarter-wise from 2010 (Q4) onwards, and each file has 7 columns. It is a classification problem.

Problem: Predict the class of user

Start: Get Data

 

5. Million Song Data Set

million-songDidn’t you know analytics can be used in entertainment industry also? Do it yourself now. This data set puts forward a regression task. It consists of 515345 observations and 90 variables. However, this is just a tiny subset of original database of million song data. You should use data linked below.

Problem: Predict release year of the song

Start: Get Data

 

6. Census Income Data Set

us-censusIt’s an imbalanced classification and a classic machine learning problem. You know, machine learning is being extensively used to solve imbalanced problems such as cancer detection, fraud detection etc. It’s time to get your hand dirty. The data set has 48842 rows and 14 columns. For guidance, you can check my imbalanced data project.

Problem: Predict the income class of US population

Start: Get Data

 

7. Movie Lens Data Set

This data set allows you to build a recommendation engine. Have you created one before? It’s one of the most popular and most quoted data sets in the data science industry. It is available in various sizes; here I’ve used a fairly small one. It has 1 million ratings from 6000 users on 4000 movies.

Problem: Recommend new movies to users

Start: Get Data
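
To illustrate the core idea behind a simple user-based recommender, here is a sketch in base R on a toy ratings matrix (the real MovieLens files would first be reshaped into the same users-by-movies form; all numbers here are made up):

# toy user x movie ratings matrix (NA = not rated)
ratings <- matrix(c(5, 4, NA, 1,
                    4, 5, 1, NA,
                    1, NA, 5, 4,
                    NA, 1, 4, 5),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("user", 1:4), paste0("movie", 1:4)))

# cosine similarity between two users, over the movies both have rated
cosine <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)
  sum(a[ok] * b[ok]) / (sqrt(sum(a[ok]^2)) * sqrt(sum(b[ok]^2)))
}

target <- "user1"
sims <- sapply(setdiff(rownames(ratings), target),
               function(u) cosine(ratings[target, ], ratings[u, ]))

# score unseen movies as a similarity-weighted average of other users' ratings
unseen <- colnames(ratings)[is.na(ratings[target, ])]
scores <- sapply(unseen, function(m) {
  r <- ratings[names(sims), m]
  weighted.mean(r, w = sims, na.rm = TRUE)
})
sort(scores, decreasing = TRUE)   # recommend the highest-scoring movies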

 

Advanced Level

1. Identify your Digits Data Set

This data set allows you to study, analyze, and recognize elements in images. That’s essentially how your camera detects your face: image recognition! It’s your turn to build and test that technique. This is a digit recognition problem. The data set has 7000 images of 28 x 28 pixels, about 31 MB in total.

Problem: Identify digits from an image

Start: Get Data

 

2. Yelp Data Set

This data set is part of round 8 of the Yelp Dataset Challenge. It comprises nearly 200,000 images, provided alongside 3 JSON files of ~2 GB, with information about local businesses in 10 cities across 4 countries. You are required to find insights from the data: cultural trends, seasonal trends, category inference, text mining, social graph mining, etc.

Problem: Find insights from images

Start: Get Data

 

3. Image Net Data Set

ImageNet offers a variety of problems encompassing object detection, localization, classification, and scene parsing. All the images are freely available. You can search for almost any type of image and build your project around it. As of now, this image database has 14,197,122 images of multiple sizes, totalling up to 140 GB.

Problem: The problem to solve depends on the type of images you download.

Start: Get Data

 

4. KDD Cup 1999 Data Set

How could I miss the KDD Cup? KDD originally brought the flavor of data mining competitions to the world. Don’t you want to see what data sets they used to offer? I assure you, it’ll be an enriching experience. This data poses a classification problem. It has 4M rows and 48 columns in a ~1.2 GB file.

Problem: Classify network connections as good (normal) or bad (intrusions).

Start: Get Data

 

5. Chicago Crime Data Set

The ability to handle large data sets is expected of every data scientist these days. Companies no longer prefer to work on samples; they use the full data. This data set will give you much-needed hands-on experience of handling large data sets on your local machine. The problem is easy, but data management is the key! This data set has 6M observations. It’s a multi-class classification problem.

Problem: Predict the type of crime.

Start: Get Data | To download data, click on Export -> CSV
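
One way to keep the memory footprint manageable in R is to read only the columns you need with data.table::fread; a sketch follows (the file name is simply whatever Export -> CSV gives you, and the column names are assumptions based on the portal’s schema):

# read only a few columns of the large exported CSV
library(data.table)

crimes <- fread("Chicago_Crimes.csv",
                select = c("Date", "Primary Type", "Arrest", "Community Area"))

# quick look at the distribution of the target before any modelling
crimes[, .N, by = "Primary Type"][order(-N)]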

 

End Notes

Out of the 17 data sets listed above, start by finding the right match for your skills. If you are a beginner in machine learning, avoid taking up advanced level data sets. Don’t bite off more than you can chew, and don’t feel overwhelmed by how much you still have to do. Instead, focus on making step-by-step progress.

Once you complete 2 or 3 projects, showcase them on your resume and your GitHub profile (most important!). Lots of recruiters these days find candidates by browsing GitHub profiles. Your motive shouldn’t be to do all the projects, but to pick out selected ones based on the data set, domain, or data set size, whichever excites you the most. If you want me to solve any of the above problems and create a complete project like this, let me know.

Did you find this article useful? Have you already built a project on any of these data sets? Do share your experience, learnings, and suggestions in the comments below.

You can also test your skills and knowledge: check out Live Competitions and compete with the best data scientists from all over the world.

 

Source: 17 Free Data Science Projects To Boost Your Knowledge & Skills

Marketing Multi-Channel Attribution model based on Sales Funnel with R | R-bloggers


This is the last post in the series of articles about using Multi-Channel Attribution in marketing. In the previous two articles (part 1 and part 2), we reviewed a simple and powerful approach based on Markov chains that allows you to effectively attribute marketing channels.

In this article, we will review another fascinating approach that marries heuristic and probabilistic methods. Again, the core idea is straightforward and effective.

Sales Funnel
Usually, companies have some idea of how their clients move along the user journey from first visiting a website to closing a purchase. This sequence of steps is called a Sales (purchasing or conversion) Funnel. Classically, the Sales Funnel includes at least four steps:
  • Awareness – the customer becomes aware of the existence of a product or service (“I didn’t know there was an app for that”),
  • Interest – actively expressing an interest in a product group (“I like how your app does X”),
  • Desire – aspiring to a particular brand or product (“Think I might buy a yearly membership”),
  • Action – taking the next step towards purchasing the chosen product (“Where do I enter payment details?”).

For an e-commerce site, we can come up with one or more conditions (events/actions) that serve as evidence of passing each step of the Sales Funnel.

For some extra information about Sales Funnel, you can take a look at my (rather ugly) approach of Sales Funnel visualization with R.

Companies naturally lose some share of visitors at each subsequent step of the Sales Funnel as it gets narrower; that’s why it looks like a string of bottlenecks. We can calculate the probability of transition from one step to the next based on the recorded history of transitions. On the other hand, customer journeys are sequences of sessions (visits), and these sessions are attributed to different marketing channels.

Therefore, we can link marketing channels with the probability of a customer passing through each step of the Sales Funnel. And here is the core idea of the concept: the probability of moving through each “bottleneck” represents the value of the marketing channel which leads a customer through it. The higher the probability of passing a “neck”, the lower the value of the channel that provided the transition; and vice versa, the lower the probability, the higher the value of the marketing channel in question.
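
As a toy illustration of this inverse relationship (all transition probabilities below are made up), a step’s importance is one minus its transition probability, normalized so the importances sum to one:

# made-up step-to-step transition probabilities for a four-step funnel
step_probs <- c(two_pages_visit = 0.20, product_page_visit = 0.25,
                add_to_cart = 0.90, purchase = 0.98)

importance <- 1 - step_probs                          # harder steps are worth more
importance_weighted <- importance / sum(importance)   # normalize to sum to 1
round(importance_weighted, 3)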

Let’s study the concept with the following example. First off, we’ll define the Sales Funnel and a set of conditions which will register as customer passing through each step of the Funnel.

  • Step 0 (necessary condition) – the customer visits the site for the first time
  • Step 1 (awareness) – visits two of the site’s pages
  • Step 2 (interest) – reviews a product page
  • Step 3 (desire) – adds a product to the shopping cart
  • Step 4 (action) – completes the purchase

Second, we need to extract the data that includes sessions where corresponding events occurred. We’ll simulate this data with the following code:

library(tidyverse)
library(purrrlyr)
library(reshape2)

##### simulating the "real" data #####
set.seed(454)
df_raw <- data.frame(customer_id = paste0('id', sample(c(1:5000), replace = TRUE)),
                     date = as.POSIXct(rbeta(10000, 0.7, 10) * 10000000, origin = '2017-01-01', tz = "UTC"),
                     channel = paste0('channel_', sample(c(0:7), 10000, replace = TRUE,
                                                          prob = c(0.2, 0.12, 0.03, 0.07, 0.15, 0.25, 0.1, 0.08))),
                     site_visit = 1) %>%
  mutate(two_pages_visit = sample(c(0, 1),
                                  10000,
                                  replace = TRUE,
                                  prob = c(0.8, 0.2)),
         product_page_visit = ifelse(two_pages_visit == 1,
                                     sample(c(0, 1),
                                            length(two_pages_visit[which(two_pages_visit == 1)]),
                                            replace = TRUE, prob = c(0.75, 0.25)),
                                     0),
         add_to_cart = ifelse(product_page_visit == 1,
                              sample(c(0, 1),
                                     length(product_page_visit[which(product_page_visit == 1)]),
                                     replace = TRUE, prob = c(0.1, 0.9)),
                              0),
         purchase = ifelse(add_to_cart == 1,
                           sample(c(0, 1),
                                  length(add_to_cart[which(add_to_cart == 1)]),
                                  replace = TRUE, prob = c(0.02, 0.98)),
                           0)) %>%
  dmap_at(c('customer_id', 'channel'), as.character) %>%
  arrange(date) %>%
  mutate(session_id = row_number()) %>%
  arrange(customer_id, session_id)

df_raw <- melt(df_raw, id.vars = c('customer_id', 'date', 'channel', 'session_id'),
               value.name = 'trigger', variable.name = 'event') %>%
  filter(trigger == 1) %>%
  select(-trigger) %>%
  arrange(customer_id, date)

And the data sample looks like:

Next up, the data needs to be preprocessed. For example, it would be useful to replace the NA/direct channel with the previous one, to separate first-time purchasers from existing customers, or even to create different Sales Funnels based on new and existing customers, segments, locations, and so on. I will omit this step, but you can find some ideas on preprocessing in my previous blog post.

The important thing about this approach is that we only attribute the marketing channel that led the customer through a step of the Funnel for the first time. For instance, suppose a customer first reviews a product page (step 2, interest) after being brought in by channel_1. Any future product page visits from other channels won’t be attributed until the customer makes a purchase and starts a new Sales Funnel journey.

Therefore, we will filter records for each customer and save the first unique event of each step of the Sales Funnel using the following code:

### removing not first events ###
df_customers <- df_raw %>%
 group_by(customer_id, event) %>%
 filter(date == min(date)) %>%
 ungroup()

Note that in this way we assume all customers were first-time buyers; therefore any subsequent purchase event will be removed by the above code.

Now, we can use the obtained data frame to compute the Sales Funnel’s transition probabilities, the importance of each Sales Funnel step, and their weighted importances. According to the method, the higher the probability, the lower the value of the channel. Therefore, we calculate the importance of each step as 1 minus its transition probability. After that, we need to weight the importances because their sum will be greater than 1. We do these calculations with the following code:

### Sales Funnel probabilities ###
sf_probs <- df_customers %>%
 
 group_by(event) %>%
 summarise(customers_on_step = n()) %>%
 ungroup() %>%
 
 mutate(sf_probs = round(customers_on_step / customers_on_step[event == 'site_visit'], 3),
 sf_probs_step = round(customers_on_step / lag(customers_on_step), 3),
 sf_probs_step = ifelse(is.na(sf_probs_step) == TRUE, 1, sf_probs_step),
 sf_importance = 1 - sf_probs_step,
 sf_importance_weighted = sf_importance / sum(sf_importance)
 )

A hint: it can be a good idea to compute the Sales Funnel probabilities over a limited prior period, for example 1-3 months. The reason is that customer flow or the “necks’” capacities can vary due to changes on a company’s site, changes in marketing campaigns, and so on. Therefore, you can analyze the dynamics of the Sales Funnel’s transition probabilities in order to find an appropriate time period.

I can’t publish a blogpost without visualization. This time I suggest another approach for the Sales Funnel visualization that represents all customer journeys through the Sales Funnel with the following code:

### Sales Funnel visualization ###
df_customers_plot <- df_customers %>%
 
 group_by(event) %>%
 arrange(channel) %>%
 mutate(pl = row_number()) %>%
 ungroup() %>%
 
 mutate(pl_new = case_when(
 event == 'two_pages_visit' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'two_pages_visit'])) / 2),
 event == 'product_page_visit' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'product_page_visit'])) / 2),
 event == 'add_to_cart' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'add_to_cart'])) / 2),
 event == 'purchase' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'purchase'])) / 2),
 TRUE ~ 0
 ),
 pl = pl + pl_new)
df_customers_plot$event <- factor(df_customers_plot$event, levels = c('purchase',
 'add_to_cart',
 'product_page_visit',
 'two_pages_visit',
 'site_visit'
 ))
# color palette
cols <- c('#4e79a7', '#f28e2b', '#e15759', '#76b7b2', '#59a14f',
 '#edc948', '#b07aa1', '#ff9da7', '#9c755f', '#bab0ac')
ggplot(df_customers_plot, aes(x = event, y = pl)) +
 theme_minimal() +
 scale_colour_manual(values = cols) +
 coord_flip() +
 geom_line(aes(group = customer_id, color = as.factor(channel)), size = 0.05) +
 geom_text(data = sf_probs, aes(x = event, y = 1, label = paste0(sf_probs*100, '%')), size = 4, fontface = 'bold') +
 guides(color = guide_legend(override.aes = list(size = 2))) +
 theme(legend.position = 'bottom',
 legend.direction = "horizontal",
 panel.grid.major.x = element_blank(),
 panel.grid.minor = element_blank(),
 plot.title = element_text(size = 20, face = "bold", vjust = 2, color = 'black', lineheight = 0.8),
 axis.title.y = element_text(size = 16, face = "bold"),
 axis.title.x = element_blank(),
 axis.text.x = element_blank(),
 axis.text.y = element_text(size = 8, angle = 90, hjust = 0.5, vjust = 0.5, face = "plain")) +
 ggtitle("Sales Funnel visualization - all customers journeys")

OK, it seems we now have everything needed to make the final calculations. In the following code, we will remove all users that didn’t make a purchase, then link the weighted importances of the Sales Funnel steps with sessions by event, and finally summarize them.

### computing attribution ###
df_attrib <- df_customers %>%
 # removing customers without purchase
 group_by(customer_id) %>%
 filter(any(as.character(event) == 'purchase')) %>%
 ungroup() %>%
 
 # joining step's importances
 left_join(., sf_probs %>% select(event, sf_importance_weighted), by = 'event') %>%
 
 group_by(channel) %>%
 summarise(tot_attribution = sum(sf_importance_weighted)) %>%
 ungroup()

As the result, we’ve obtained the number of conversions that have been distributed by marketing channels:

In the same way, you can distribute revenue across channels.
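
A minimal sketch of how that might look, assuming a hypothetical revenue column recorded on the purchase event of df_customers (it is not part of the simulated data above): each converted customer’s order revenue is split across the steps of their journey in proportion to the weighted step importances, and then summed by channel.

### distributing revenue by channel (a sketch; `revenue` is a hypothetical column) ###
df_revenue <- df_customers %>%
 # keep only customers who purchased, and compute their total order revenue
 group_by(customer_id) %>%
 filter(any(as.character(event) == 'purchase')) %>%
 mutate(order_revenue = sum(revenue[event == 'purchase'])) %>%
 ungroup() %>%

 # attach weighted step importances, then each step's share of the order
 left_join(., sf_probs %>% select(event, sf_importance_weighted), by = 'event') %>%
 group_by(customer_id) %>%
 mutate(channel_share = sf_importance_weighted / sum(sf_importance_weighted)) %>%
 ungroup() %>%

 group_by(channel) %>%
 summarise(tot_revenue = sum(order_revenue * channel_share)) %>%
 ungroup()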

At the end of the article, I want to share the OWOX company blog, where you can read more about the approach: Funnel Based Attribution Model.

In addition, OWOX provides an automated system for Marketing Multi-Channel Attribution based on BigQuery. If you are not familiar with R or don’t have a suitable data warehouse, I recommend testing their service.

The post Marketing Multi-Channel Attribution model based on Sales Funnel with R appeared first on AnalyzeCore – data is beautiful, data is a story.

Source: Marketing Multi-Channel Attribution model based on Sales Funnel with R | R-bloggers

Announcing R Tools 1.0 for Visual Studio 2015


I’m delighted to announce the general availability of R Tools 1.0 for Visual Studio 2015 (RTVS). This release will be shortly followed by R Tools 1.0 for Visual Studio 2017 in early May.

RTVS is a free and open source plug-in that turns Visual Studio into a powerful and productive R development environment. Check out this video for a quick tour of its core features:

Core IDE Features

RTVS builds on Visual Studio, which means you get numerous features for free: from using multiple languages to world-class Editing and Debugging to over 7,000 extensions for every need:

  • A polyglot IDE – VS supports R, Python, C++, C#, Node.js, SQL, etc. projects simultaneously.
  • Editor – complete editing experience for R scripts and functions, including detachable/tabbed windows, syntax highlighting, and much more.
  • IntelliSense – (aka auto-completion) available in both the editor and the Interactive R window.
  • R Interactive Window – work with the R console directly from within Visual Studio.
  • History window – view, search, select previous commands and send to the Interactive window.
  • Variable Explorer – drill into your R data structures and examine their values.
  • Plotting – see all of your R plots in a Visual Studio tool window.
  • Debugging – breakpoints, stepping, watch windows, call stacks and more.
  • R Markdown – R Markdown/knitr support with export to Word and HTML.
  • Git – source code control via Git and GitHub.
  • Extensions – over 7,000 Extensions covering a wide spectrum from Data to Languages to Productivity.
  • Help – use ? and ?? to view R documentation within Visual Studio.


It’s Enterprise-Grade

RTVS includes various features that address the needs of individual data scientists as well as Data Science teams, for example:

SQL Server 2016

RTVS integrates with SQL Server 2016 R Services and SQL Server Tools for Visual Studio 2015. These separate downloads enhance RTVS with support for syntax coloring and Intellisense, interactive queries, and deployment of stored procedures directly from Visual Studio.


Microsoft R Client

Use the stock CRAN R interpreter, or the enhanced Microsoft R Client and its ScaleR functions that support multi-core and cluster computing for practicing data science at scale.

Visual Studio Team Services

Integrated support for git, continuous integration, agile tools, release management, testing, reporting, bug and work-item tracking through Visual Studio Team Services. Use our hosted service or host it yourself privately.

Remoting

Whether it’s data governance, security, or running large jobs on a powerful server, RTVS workspaces enable setting up your own R server or connecting to one in the cloud.

The road ahead

We’re very excited to officially bring another language to the Visual Studio family!  Along with Python Tools for Visual Studio, you have the two main languages for tackling most any analytics and ML related challenge.  In the near future (~May), we’ll release RTVS for Visual Studio 2017 as well. We’ll also resurrect the “Data Science workload” in VS2017 which gives you R, Python, F# and all their respective package distros in one convenient install. Beyond that, we’re looking forward to hearing from you on what features we should focus on next! R package development? Mixed R+C debugging? Model deployment? VS Code/R for cross-platform development? Let us know at the RTVS Github repository!

Thank you!

Bits: http://microsoft.github.io/RTVS-docs/installation
Code: https://github.com/Microsoft/RTVS
Docs: http://microsoft.github.io/RTVS-docs

Source: Announcing R Tools 1.0 for Visual Studio 2015 | R-bloggers

Let's try R Notebooks!


Today we’re excited to announce R Notebooks, which add a powerful notebook authoring engine to R Markdown. Notebook interfaces for data analysis have compelling advantages, including the close association of code and output and the ability to intersperse narrative with computation. Notebooks are also an excellent tool for teaching and a convenient way to share analyses.


You can try out R Notebooks today in the RStudio Preview Release.

Interactive R Markdown

 

As an authoring format, R Markdown bears many similarities to existing notebooks such as Jupyter and Beaker. However, code in a notebook is usually executed interactively, one cell at a time, whereas code in an R Markdown document is generally executed in batch.

R Notebooks bring the interactive execution model to R Markdown documents, letting you work quickly and iteratively in a notebook interface without giving up the plain-text tooling and reliable, production-quality output that R Markdown provides.

                                         R Markdown Notebooks    Traditional Notebooks
  Plain text representation                        ✓
  Same editor/tools used for R scripts             ✓
  Works well with version control                  ✓
  Focus on production output                       ✓
  Output inline with code                          ✓                        ✓
  Output cached across sessions                    ✓                        ✓
  Share code and output in a single file           ✓                        ✓
  Emphasized execution model               Interactive & batch        Interactive

The video below provides more background and a demo of notebooks in action.

 

 

Iterate quickly

In an ordinary R Markdown document, you have to re-knit the document to see your changes, which can take a while if it contains non-trivial computations. With an R Notebook, however, you can run code and see the results in the document immediately. The results can include any kind of content R produces, including console output, plots, data frames, and interactive HTML widgets.


You can see the progress of the code as it runs:


You can also preview the results of individual inline expressions:


Even LaTeX equations render in real time as you type them:


In this focused mode of interaction, you don’t need to keep the console, viewer, or output panes open: everything you need is at your fingertips in the editor, reducing distractions and helping you concentrate on your analysis. When you’re done, you’ll have a context-rich, nicely formatted, reproducible record of your work to keep for yourself or share with others.

 

Batteries included

R Notebooks can run more than just R code. You can run chunks written in other languages, such as Python, Bash, or C++ (Rcpp).


You can even run SQL directly:


This makes an R Notebook an excellent tool for orchestrating a fully reproducible data analysis workflow: you can easily ingest data and share it between languages using packages such as feather, or plain CSV files.
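
As a minimal sketch of what mixed-engine chunks can look like in a notebook (the in-memory SQLite connection and the tiny table are made up for illustration; Python or Bash chunks work the same way with ```{python} or ```{bash} headers), the R Markdown source might contain:

```{r}
# set up an in-memory SQLite connection for the SQL chunk below
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "flights_sample",
             data.frame(carrier = c("AA", "UA", "AA"), dep_delay = c(3, 12, 7)))
```

```{sql, connection=con}
-- knitr's SQL engine runs this against `con` and shows the result inline
SELECT carrier, AVG(dep_delay) AS avg_delay
FROM flights_sample
GROUP BY carrier
```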

Reproducible Notebooks

While you can run chunks (and even individual lines of R code!) in any order you like, a fully reproducible document must be able to be re-executed start-to-finish in a clean environment. There’s a built-in command to do this, too, so it’s easy to test your notebooks for reproducibility.



Rich output formats

Since they’re built on R Markdown, R Notebooks work seamlessly with other R Markdown output types. You can use any existing R Markdown document as a notebook, or render (knit) a notebook to any R Markdown output type.



You can use the same document as a notebook while you iterate quickly on your ideas, and later render it to a completely different format for publication, with no duplication of code, data, or output.

Sharing and publishing

 

R Notebooks are easy to share with collaborators. Because they’re plain-text files, they work well with version control systems such as Git. Collaborators don’t need RStudio to work with notebooks, either, since they can be rendered from the R console using the open source rmarkdown package.
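
For example (the file name here is hypothetical), a collaborator could render a notebook from any R session:

# render a notebook (or any R Markdown document) without RStudio;
# the output format is taken from the YAML header of the file
rmarkdown::render("analysis.Rmd")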

Rendered notebooks can be previewed right inside RStudio:


While the notebook preview looks similar to a rendered R Markdown document, the preview does not execute any of your R code chunks; it simply shows a rendered copy of the document along with the most recent chunk output. Because generating this preview is very fast (again, no R code is run), one is created every time you save the R Markdown document.

The generated HTML file has the special extension .nb.html. It is self-contained: it can be viewed locally without any dependencies, or published to any static web hosting service.


It also includes a bundled copy of the R Markdown source file, so it can be opened in RStudio to resume work on the notebook with all output intact.

Try it out

To try out R Notebooks, you’ll need to download the latest RStudio Preview Release.

You can find documentation on notebook features on the R Notebooks page of the R Markdown website, and we’ve also posted a video tutorial in the R Notebooks webinar.

We believe R Notebooks will be a powerful new addition to your toolkit. Give them a spin and let us know what you think!

 

 

Source: R Notebooks | RStudio Blog