The SeaClass R Package · R Views

The SeaClass R Package

The Operations Technology and Advanced Analytics Group (OTAAG) at Seagate Technology has decided to share an internal project that helps accelerate development of classification models. The interactive SeaClass tool is contained in an R-based package built using shiny and other CRAN packages commonly used for binary classification. The package is free to use and develop further, but any analysis mistakes are the sole responsibility of the user. Check out the demo video below:

Origin Story

Manufacturer data sets used for pass/fail analysis are typically highly imbalanced with many more passing cases than failing cases. In some situations, the failure examples may be scarce or nonexistent. In these extreme cases complex modeling techniques are likely inadvisable. Data scientists commonly consider data scarcity, class imbalance, and data dimensionality when discriminating between competing candidate approaches, such as anomaly detection, simple models, and complex models. Standard approaches are easily identified within each of these analysis categories, and can be exploited as reasonable initial classification analysis baselines.

The SeaClass application was created to generate template R code for the commonly encountered classification problems described above. The application allows data analysts to explore multiple models quickly with essentially no programming required. SeaClass provides an option to download the corresponding R code for further model training/testing. This workflow enables our data analysts to jump-start their modeling, saving time and initial hassles.

The Advanced Analytics group decided to open-source the package for several reasons. Firstly, we encourage other contributors to suggest improvements to the SeaClass application. Additionally, we are hopeful our application will inspire other code-generating projects within the R community. Lastly, our group benefits greatly from open-source tools, and it’s nice to give back every once in a while.

Package Overview

The SeaClass R package provides tools for analyzing classification problems. In particular, specialized tools are available for addressing the problem of imbalanced data sets. The SeaClass application provides an easy-to-use interface that requires only minimal R programming knowledge to get started, and can be launched using the RStudio Addins menu. The application allows the user to explore numerous methods by simply clicking on the available options and interacting with the generated results. The user can choose to download the code for any procedures they wish to explore further. SeaClass was designed to jump-start the analysis process for both novice and advanced R users. See the screenshots below for one demonstration.

Installation Instructions

The SeaClass application depends on multiple R packages. To install SeaClass and its dependencies, run:
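The installation snippet did not survive extraction. Assuming the package is distributed via GitHub (the repository path below is an assumption; adjust it to the actual location), installation typically looks like:

```r
# Install devtools first if you don't already have it:
install.packages("devtools")

# Install SeaClass and its CRAN dependencies from GitHub
# (repository path assumed):
devtools::install_github("ChrisDienes/SeaClass")
```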


Usage Instructions

Step 1. Begin by loading and preparing your data in R. Some general advice:

  • Your data set must be saved as an R data frame object.
  • The data set must contain a binary response variable (0/1, PASS/FAIL, A/B, etc.)
  • All other variables must be predictor variables.
  • Predictor variables can be numeric, categorical, or factors.
  • Including too many predictors may slow down the application and weaken performance.
  • Categorical predictors are often ignored when the number of levels exceeds 10, since they tend to exert undue influence on the models.
  • Missing values are not allowed and will trigger a warning. Please remove or impute NAs prior to starting the app.
  • Keep the number of observations (rows) to a medium or small size.
  • Data sets with many rows (>10,000) or many columns (>30) may slow down the app’s interactive responses.
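The checks above can be sketched in a few lines of base R; the data frame `df` and the median imputation below are hypothetical stand-ins for your own data and imputation strategy:

```r
# Hypothetical data frame with a binary response and a missing value:
df <- data.frame(
  RESPONSE = c("PASS", "FAIL", "PASS", "PASS"),
  TEMP     = c(21.0, NA, 19.5, 22.3),
  LINE     = factor(c("A", "B", "A", "B"))
)

# Confirm the response is binary:
stopifnot(length(unique(df$RESPONSE)) == 2)

# Locate missing values (the app will not accept them):
colSums(is.na(df))

# Either drop incomplete rows...
df_complete <- df[complete.cases(df), ]

# ...or impute, e.g. numeric columns with their median:
df$TEMP[is.na(df$TEMP)] <- median(df$TEMP, na.rm = TRUE)
```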

Step 2. After data preparation, start the application either by launching SeaClass from the RStudio Addins drop-down menu, or by calling the SeaClass function from the command line. For example:


### Make some fake data:
X <- matrix(rnorm(10000, 0, 1), ncol = 10, nrow = 1000)
X[1:100, 1:2] <- X[1:100, 1:2] + 3
Y <- c(rep(1, 100), rep(0, 900))
Fake_Data <- data.frame(Y = Y, X)

### Load the SeaClass rare failure data:
data(rareFailData)

### Start the interactive GUI:
SeaClass()

If the application fails to load, you may need to specify your favorite browser path first. For example:

options(browser = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe")

Step 3. The user has various options for configuring their analysis within the GUI. Once the analysis runs, the user can view the results, interact with the results (module-dependent), save the underlying R script, or start over. Additional help is provided within the application. See above screenshots for one depiction of these steps.

Step 4. Besides the SeaClass function, several other functions are contained within the library. For example:

### List available functions:
ls("package:SeaClass")
### Note this is a sample data set:
# data(rareFailData)
### Note code_output is a support function for SeaClass, not for general use.

### View help:
?accuracy_threshold

### Run example from help file:
### General Use: ###
x <- c(rnorm(100,0,1),rnorm(100,2,1))
group <- c(rep(0,100),rep(2,100))
accuracy_threshold(x=x, group=group, pos_class=2)
accuracy_threshold(x=x, group=group, pos_class=0)
### Bagged Example ###
replicate_function <- function(index){accuracy_threshold(x = x[index], group = group[index], pos_class = 2)[[2]]}
sample_cuts <- replicate(100, {
  sample_index <- sample.int(length(x), replace = TRUE)
  replicate_function(index = sample_index)
})
bagged_scores <- sapply(x, function(x) mean(x > sample_cuts))
unbagged_cut <- accuracy_threshold(x = x, group = group, pos_class = 2)[[2]]
unbagged_scores <- ifelse(x > unbagged_cut, 1, 0)
# Compare AUC:
PRROC::roc.curve(scores.class0 = bagged_scores, weights.class0 = ifelse(group == 2, 1, 0))[[2]]
PRROC::roc.curve(scores.class0 = unbagged_scores, weights.class0 = ifelse(group == 2, 1, 0))[[2]]
bagged_prediction <- ifelse(bagged_scores > 0.50, 2, 0)
unbagged_prediction <- ifelse(x > unbagged_cut, 2, 0)
# Compare Confusion Matrix:
table(bagged_prediction, group)
table(unbagged_prediction, group)

Source: The SeaClass R Package · R Views

LeaRning Path on R – Step by Step Guide to Learn Data Science on R


One of the common problems people face in learning R is the lack of a structured path. They don't know where to start, how to proceed, or which track to choose. And though there is an overload of good free resources available on the Internet, this can be as overwhelming as it is confusing.

To create this R learning path, Analytics Vidhya and DataCamp sat together and selected a comprehensive set of resources to help you learn R from scratch. This learning path is a great introduction for anyone new to data science or R, and if you are a more experienced R user you will be updated on some of the latest advancements.

This will help you learn R quickly and efficiently. Time to have fun while lea-R-ning!


Step 0: Warming up

Before starting your journey, the first question to answer is: Why use R? or How would R be useful?

R is a fast-growing open-source competitor to commercial software packages like SAS, STATA and SPSS. The demand for R skills in the job market is rising rapidly, and companies such as Microsoft have recently pledged their commitment to R as a lingua franca of data science.

Watch this 90-second video from Revolution Analytics to get an idea of how useful R can be. Incidentally, Revolution Analytics was recently acquired by Microsoft.


Step 1: Setting up your machine

The easiest way to set-up R is by downloading a copy of it on your local computer from the Comprehensive R Archive Network (CRAN). You can choose between binaries for Linux, Mac and Windows.

Although you could consider working with the basic R console, we recommend installing one of R's integrated development environments (IDEs). The best-known IDE is RStudio, which makes R coding much easier and faster: it allows you to type multiple lines of code, handle plots, install and maintain packages, and navigate your programming environment much more productively. An alternative to RStudio is Architect, an Eclipse-based workbench.

(Need a GUI? Check R-commander or Deducer)


  1. Install R, and RStudio
  2. Install the packages Rcmdr, rattle, and Deducer, along with all suggested packages and dependencies, including the GUIs.
  3. Load these packages using the library command and open the GUIs one by one.


Step 2: Learn the basics of R  language

You should start by understanding the basics of the language, its libraries, and its data structures: learn R programming, data handling, and more.

If you prefer an online interactive learning environment to learn R’s syntax this free online R tutorial by DataCamp is a great way to get you going. Also check the successor to this course: intermediate R programming. An alternative learning tool is this online version of swirl where you can learn R in an environment similar to RStudio.

Next to these interactive learning environments, you can also choose to enroll in one of the MOOCs available on Coursera or edX.

In addition to these online resources, you can also consider the following excellent written resources:

Specifically, learn: read.table, data frames, table, summary, describe, loading and installing packages, and data visualization using the plot command.
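A minimal base-R session covering those functions might look like this (the CSV written here is just a stand-in so the example is self-contained):

```r
# Write a small stand-in CSV file:
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,age,group",
             "ann,34,A",
             "bob,28,B",
             "cal,45,A"), tmp)

# read.csv / read.table return a data frame:
people <- read.csv(tmp)
class(people)        # "data.frame"

# Quick looks at the data:
summary(people$age)  # min / median / mean / max of a numeric column
table(people$group)  # counts per level

# Base graphics via the plot command:
# plot(people$age)   # uncomment in an interactive session
```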


  1. Take the free online R tutorial by DataCamp and become familiar with basic R syntax
  2. Create a github account at
  3. Learn to troubleshoot package installation above by googling for help.
  4. Install package swirl and learn R programming (see above)


Step 3: Understanding the R community

The major reason R is growing rapidly and is such a huge success is its strong community. At the center of this is R's package ecosystem. These packages can be downloaded from the Comprehensive R Archive Network, or from Bioconductor, GitHub and Bitbucket. At Rdocumentation you can easily search packages from CRAN, GitHub and Bioconductor that will fit your needs for the task at hand.

Next to the package ecosystem, you can also easily find help and feedback on your R endeavours. First of all, there is R's built-in help system, which you can access via the command ? followed by the name of, e.g., a function. There are also Analytics Vidhya Discussions and Stack Overflow, where R is one of the fastest-growing languages. Finally, there are numerous blogs run by R enthusiasts; a great collection of these is aggregated at R-bloggers.



Step 4: Importing and manipulating your data

Importing and manipulating your data are important steps in the data science workflow. R allows for the import of different data formats using specific packages that can make your job easier:

  • readr for importing flat files
  • The readxl package for getting Excel files into R
  • The haven package lets you import SAS, STATA and SPSS data files into R.
  • Databases: connect via packages like RMySQL and RPostgreSQL, and access and manipulate via DBI
  • rvest for web scraping

Once your data is available in your working environment you are ready to start manipulating it using these packages:
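The package list that followed this sentence did not survive extraction; dplyr and tidyr are the usual choices. As a minimal sketch, assuming dplyr is installed, a typical manipulation chain looks like:

```r
library(dplyr)

# A small made-up transactions table:
sales <- data.frame(
  store  = c("A", "A", "B", "B"),
  month  = c(1, 2, 1, 2),
  amount = c(100, 150, 80, 120)
)

# Filter rows, derive a column, and summarise per group:
monthly <- sales %>%
  filter(amount > 90) %>%
  mutate(amount_k = amount / 1000) %>%
  group_by(store) %>%
  summarise(total = sum(amount), .groups = "drop")

monthly
```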



Step 5: Effective Data Visualization

There is no greater satisfaction than creating your own data visualizations. However, visualizing data is as much an art as it is a skill. A great read on this is Edward Tufte's principles for visualizing quantitative data, or Stephen Few on the pitfalls of dashboard design. Also check out the blog FlowingData by Nathan Yau for inspiration on creating visualizations using (mainly) R.

5.1: Plots everywhere

R offers multiple ways of creating graphs. The standard way is to use R's base graphics. However, there are better tools (or packages) that make creating graphs simpler and, on top of that, produce far more beautiful results:

  • Start with learning the grammar of graphics, a practical way to do data visualizations in R.
  • Probably the most important package to master if you want to become serious about data visualization in R is the ggplot2 package. ggplot2 is so popular that there are tons of resources available on the web for learning purposes such as this online ggplot2 tutorial, a handy cheatsheet or this book by the creator of the package Hadley Wickham.
  • A package such as ggvis allows you to create interactive web graphics using the grammar of graphics (see tutorial).
  • Know this TED talk by Hans Rosling? Learn how to re-create it yourself with googleVis (an interface to Google Charts).
  • In case you run into issues plotting your data this post might help as well.

See more visualization options in this CRAN task view.

Alternatively look at the data visualization guide to R.
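A first grammar-of-graphics plot, using the built-in mtcars data set (a minimal sketch, assuming ggplot2 is installed):

```r
library(ggplot2)

# Map data columns to aesthetics, then add layers:
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders")

# In an interactive session, print(p) draws the plot;
# ggsave("mpg_vs_wt.png", p) writes it to disk.
```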

5.2: Maps everywhere

Interested in visualizing spatial data? Take the tutorial Introduction to visualising spatial data in R and get started easily with these packages:

  • Visualize spatial data and models on top of static maps from sources such as Google Maps and Open Street Maps with ggmap.
  • Ari Lamstein’s choroplethr
  • The tmap package.



5.3: HTML widgets

A very promising new tool for visualization in R is HTML widgets. HTML widgets allow you to create interactive web visualizations in an easy way (see the tutorial by RStudio), and mastering this type of visualization is very likely to become a must-have R skill. Impress your friends and colleagues with these visualizations:



Step 6: Data Mining and Machine Learning

For those that are new to statistics we recommend these resources:

If you want to sharpen your machine learning skills, consider starting with these tutorials:

Make sure to see the various machine learning options available in R in the relevant CRAN task view.



Step 7: Reporting Results

Communicating your results and sharing your insights with fellow data science enthusiasts is just as important as the analysis itself. Luckily, R has some very nifty tools for this that can save you a lot of time.

The first is R Markdown, a great tool for reporting your data analysis in a reproducible manner, based on knitr and pandoc. With R Markdown, R generates a final document that replaces the R code with its results. This document can be in html, Word, pdf, ioslides, etc. format. You can learn more about it via this tutorial and use this cheat sheet as a reference.
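For instance, a minimal R Markdown file (save as report.Rmd and render it with the rmarkdown package) might look like this; the title and chunk name are illustrative:

````markdown
---
title: "My First Report"
output: html_document
---

Results for the built-in `cars` data set:

```{r summary-chunk}
summary(cars)
plot(cars)
```
````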

Next to R Markdown there is also ReporteRs, an R package for creating Microsoft (Word docx and PowerPoint pptx) and html documents that runs on Windows, Linux, Unix and Mac OS systems. Just like R Markdown, it's an ideal tool to automate report generation from R. See here how to get started.

Last but not least there is Shiny, one of the most exciting tools around in R at the moment. Shiny makes it incredibly easy to build interactive web applications with R. It allows you to turn your analysis into an interactive web application without needing to know HTML, CSS or JavaScript. If you want to get started with Shiny (and believe us, you should!), check out the RStudio learning portal.
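A complete Shiny app fits in a handful of lines. A minimal sketch, assuming the shiny package is installed (the slider and scatter plot here are just an illustration):

```r
library(shiny)

# UI: one input control and one plot output:
ui <- fluidPage(
  sliderInput("n", "Number of points:", min = 10, max = 500, value = 100),
  plotOutput("scatter")
)

# Server: re-draws the plot whenever the slider moves:
server <- function(input, output) {
  output$scatter <- renderPlot({
    plot(rnorm(input$n), rnorm(input$n),
         xlab = "x", ylab = "y", main = paste(input$n, "random points"))
  })
}

app <- shinyApp(ui = ui, server = server)
# runApp(app)  # uncomment to launch in the browser
```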


  • Create your first interactive report using RMarkdown and/or ReporteRs
  • Try to build your very first Shiny app


Bonus Step: Practice

You will only become a great R programmer through practice. Therefore, make sure to tackle new data science challenges regularly. The best recommendation we can make to you here is to start competing with fellow data scientists on Kaggle:

Test your R Skills on live challenges – Practice Problems


Step 8: Time Series Analysis

R has a dedicated task view for time series. If you ever want to do something with time series analysis in R, this is definitely the place to start. You will soon see that the scope and depth of the available tools is tremendous.

You will not easily run out of online resources for learning time series analysis with R. Good starting points are A Little Book of R for Time Series, or check out Forecasting: Principles and Practice. In terms of packages, make sure you are familiar with zoo and xts. zoo provides a commonly used format for saving time series objects, while xts gives you the tools to manipulate your time series data sets.
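A first look at xts (assuming the package is installed; the dates and values below are made up). An xts object is essentially a matrix indexed by time, which buys you time-based subsetting and aggregation:

```r
library(xts)

# Build a daily series for January 2017:
dates  <- seq(as.Date("2017-01-01"), by = "day", length.out = 31)
values <- cumsum(rnorm(31))
series <- xts(values, order.by = dates)

# Time-based subsetting comes for free:
first_week <- series["2017-01-01/2017-01-07"]
nrow(first_week)  # 7

# Aggregate to weekly means:
weekly <- apply.weekly(series, mean)
```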

Alternate resource: Comprehensive tutorial on Time Series


  • Take one of the recommended time series tutorials listed above so you are ready to start your own analysis.
  • Use a package such as quantmod or Quandl to download financial data and start your own time series analysis.
  • Use a package such as dygraphs to create stunning visualizations of your time series data and analysis.


Bonus Step – Text Mining is Important Too!

To learn text mining, you can refer to the text mining module from the Analytics Edge course. Though the course is archived, you can still access the tutorials.



Step 9: Becoming an R Master

Now that you have learned most of data analytics using R, it is time to give some advanced topics a shot. There is a good chance that you already know many of these, but have a look at these tutorials too.

Do you want to apply your analytical skills and test your potential? Then participate in our hackathons to compete with data scientists from all over the world.

Source: LeaRning Path on R – Step by Step Guide to Learn Data Science on R

Data Scientist Skill Set – Data Science Central

1 Background

Data science is first and foremost a talent-based discipline and capability. Platforms, tools and IT infrastructure play an important but secondary role. Nevertheless, software and technology companies around the globe spend significant amounts of money talking business managers into buying or licensing their products which often times results in unsatisfying outcomes that do not come close to realizing the full potential of data science.

Talent is key – but unfortunately very rare and hard to identify. If you are trying to hire a data scientist these days you are facing the serious risk of recruiting someone with the wrong or an insufficient skill set. On top of things, talent is even more crucial for small or medium-sized companies whose data science teams are likely to stay relatively small. Wasting one or two head counts on wrong profiles might render an entire team inefficient.

The demand for data scientists has risen dramatically in recent years [1, 2, 3, 4, 5]:

  • New technologies have significantly improved our ability to manage and process data, including new types of data as well as large quantities of data.
  • A shift in mindset took place in business environments [6] regarding the utilization of data: from data as a reporting and business analytics necessity towards data as a valuable resource enabling smart decision making.
  • Last but not least, exciting new intellectual developments have taken place in relevant related academic disciplines like machine learning [7, 8] and natural language processing.

Due to high demand, the term ‘data scientist’ has developed into a recruiting buzzword that is broadly being abused these days. Experienced lead data scientists share a painful experience when trying to fill a vacant position: out of a hundred applicants, typically only a handful match the requirements to qualify for an interview. Some candidates already feel qualified to call themselves ‘data scientist’ after finishing a six-week online course on a statistical computing language. Unqualified individuals often times end up being hired by managers who themselves lack data science experience – leading to disappointment, frustration and an erosion of the term ‘data science’.

2 Who is a Data Scientist?

The data scientist skill set described in the following is based on the idea that it fundamentally rests on three pillars, each representing a skill set mostly orthogonal to the remaining two.

Following this idea, a solid data scientist needs to have the following three well-established skill sets:

  1. Technical skills,
  2. Analytical skills and
  3. Business skills.

Although technical skills are often times the focus of data science role descriptions, they represent only the basis of a data scientist’s skill set. Analytical skills are much harder to acquire (and to test) but represent the crucial core of a data scientist’s ability to solve business problems utilizing scientific approaches. Business skills enable a data scientist to thrive in corporate environments.

2.1 Technical skills | Basis

Technical skills are the basis of a data scientist’s skill set. They include coding skills in languages such as R or Python, the ability to handle various computational architectures, including different types of data bases and operating systems but also other skills such as parallel computing or high performance computing.

The ability to handle data is a necessity for data scientists. It includes data management, data consolidation, data cleansing and data modelling amongst others. As there is often times a high demand for these skills in corporate environments, it comes with the risk of focusing data scientists on data management tasks – thus distracting them from their actual work.

Almost more important than a candidate's current technical skill set is their mindset. A key factor is intellectual agility, which provides candidates with the ability to adapt to new computational environments in a short amount of time. This includes learning new coding languages, dealing with new types of databases or data structures, and keeping up with current technological developments like moving from relational databases to object-analytical approaches.
A data scientist with a static technical skill set will not thrive for long, as the discipline requires constant adaptation and learning. Strong candidates show a healthy appetite for developing their technical skills. When a candidate focuses on a tool discussion during an interview, it can be an indication of a narrow technical comfort zone with firm constraints.

Unfortunately, data science job profiles are often times narrowly focused on technical skills; caused by a) the misperception that a successful data scientist’s secret lies exclusively in the ability to handle a specific set of tools and b) a lack of knowledge on the hiring manager’s end as to what the right skill set looks like in the first place. Focusing on technical skills when evaluating candidates renders a significant risk.

2.2 Analytical skills | Core

Scientific problem solving is an essential part of data science. Analytical skills represent the ability to succeed at this complex and highly non-linear discipline. Establishing thorough analytical skills requires a great amount of commitment and dedication (a limiting factor contributing to the global shortage of data scientists).

Analytical skills include expertise in academic disciplines like computer science, machine learning, advanced statistics, probability theory, causal inference, artificial intelligence, feature extraction and others (including strong mathematical skills). The list can be extended almost infinitely [9, 10, 11] and has been subject to many debates.
Covering all potentially useful analytical disciplines is a lifetime achievement for any data scientist and not a requirement for a successful candidate. Rather, a data scientist needs to have a healthy mix of analytical skills to succeed. For instance, an expert on Markov chains and an expert on Bayesian networks might both be able to develop a solution for the very same business problem, utilizing their respective strengths and thus fundamentally different methods.

Analytical skills are typically developed through pursuing excellence in a highly quantitative academic field such as computer science, theoretical physics, computational math or bioinformatics. These skills are trained in academic institutions through exposure to hard, unsolved research problems that require a high level of intellectual curiosity and dedication to tackle and eventually solve. This is typically done over the course of a PhD.

Mastering a quantitative research question that nobody else has solved before is a non-linear process, inevitably accompanied by failing over and over again. However, this process of scientific problem solving shapes the analytical mind and builds the expertise to later succeed in data science. It typically consists of iterative cycles of

  1. implementing and adapting an analytical approach
  2. applying it and observing it fail, then
  3. investigating the problems and
  4. building an understanding why it failed and where the limitations of the approach lie
  5. coming up with a better, more refined approach.

These iterations are accompanied by key learnings and represent small steps towards the project goal, effectively zig-zagging towards the final solution.

A key requirement for analytical excellence is the right mind set: A data scientist needs to have an intrinsic, high level of curiosity and a strong appetite for intellectual challenges. Data scientists need to be able to pick up new methods and mathematical techniques in a short amount of time to then apply them to a problem at hand – often times within the limited time frame of an ongoing project.

A good way to test analytical skills during an interview process is to provide candidates with a business problem and real data, and ask them to spend a few hours working on it remotely. Discussing the code they wrote, the approach they chose, the solution they built and the insights they generated is a great way to evaluate their potential, and at the same time gives the candidates a first feeling for their potential new tasks.

2.3 Business Skills | Enablement

Business skills enable data scientists to thrive in a corporate environment.

It is important for data scientists to communicate effectively with business users utilizing business lingua and at the same time avoiding a shift towards a conversation that is too technical. Healthy data science projects start and end with the discussion of a business problem supported by a valid business case.

Data scientists need to have a good understanding of business processes as it will be required to make sure the solution they build can be integrated and ultimately consumed by the respective business users. Careful and smart change management almost always plays a role in data science projects as well. A solid portion of entrepreneurship and out-of-the-box thinking helps data scientists to consider business problems from new angles utilizing analytical methods that their business partners do not know about. Last but not least, many big and successful data science projects that ultimately lead to significant impact were achieved through ‘connecting the dots’ by data scientists who built up internal knowledge by working on different projects across departments and functions.

Candidates who come with strong technical and analytical skills are often times highly intelligent individuals looking for intellectual challenges. Even if they have no experience in an industry or in navigating a corporate environment, they can pick up required business skills in a short amount of time – given that they have a healthy appetite for solving business cases. Building strong analytical or technical skills takes orders of magnitude longer.

When trying to determine whether a candidate has an intrinsic interest in business questions or whether he or she would rather prefer to work in an academic setting, it can help to ask yourself the following questions:

  • How well can the candidate explain data science methods like deep learning to business users?
  • When discussing a business problem can the candidate communicate effectively in business terms while thinking about potential mathematical or technical approaches?
  • Will the business users who will collaborate with the data scientist in the future respect him or her as a partner at eye level?
  • Would you feel comfortable sending the candidate on their own to present to your manager?
  • Do you think the candidate will succeed in your business environment?

3 Recruiting

Data science requires a mix of different skills. In the end, this mix needs to be adapted to the requirements and situation at hand, and to the business problems that represent the biggest potential value for your company. Big data, for instance, is a strong buzzword, but in many companies data is under-utilized to such a degree that a data science team can focus on low-hanging fruit in the form of small and structured data sets for one or two years and still have a strong business impact.

A key characteristic of candidates that has not been mentioned so far and which can be hard to evaluate is attitude. Hiring data scientists for business consultant positions will require a different mindset and attitude than hiring for integration into an analytics unit or even to supplement a business team.

4 References

[1] NY Times, Data Science: The Numbers of Our Lives by Claire Cain Miller
[2] TechCrunch: How To Stem The Global Shortage Of Data Scientists
[3] Bloomberg: Help Wanted: Black Belts in Data
[4] McKinsey on US opportunities for growth
[5] McKinsey on big data and data science
[6] Big Data at Work: Dispelling the Myths, Uncovering the Opportunities; Thomas H. Davenport; Harvard Business Review Press (2014)
[7] Andrew Ng on Deep Learning
[8] Andrew Ng on Deep Learning Applications
[9] Data scientist Venn diagram by Drew Conway
[10] Swami Chandrasekaran’s data scientist skill map:
[11] Forbes: The best machine learning engineers have these 9 traits in common.


Source: Data Scientist Skill Set – Data Science Central

17 Free Data Science Projects To Boost Your Knowledge & Skills

17 Ultimate Data Science Projects To Boost Your Knowledge and Skills (& can be accessed freely)



Data science projects offer a promising way to kick-start your analytics career. Not only do you get to learn data science by applying it, you also get projects to showcase on your CV. Nowadays, recruiters evaluate a candidate's potential by his/her work, not so much by certificates and resumes. It wouldn't matter how much you tell them you know if you have nothing to show for it! That's where most people struggle and miss out!

You might have worked on several problems, but if you can't make your work presentable and explanatory, how on earth would anyone know what you are capable of? That's where these projects help. Think of the time spent on them as your training sessions. I guarantee, the more time you spend, the better you'll become!

The data sets in the list below are handpicked. I've made sure to provide you a taste of a variety of problems from different domains and of different sizes. I believe everyone must learn to work smartly on large data sets, hence large data sets are included. Also, I've made sure all the data sets are open and free to access.



Useful Information

To help you decide your starting line, I've divided the data sets into 3 levels, namely:

  1. Beginner Level: This level comprises data sets which are fairly easy to work with and don't require complex data science techniques. You can solve them using basic regression / classification algorithms. These data sets also have enough open tutorials to get you going, and I've provided tutorials in this list as well to help you get started.
  2. Intermediate Level: This level comprises data sets which are more challenging. It consists of mid-sized and large data sets which require some serious pattern recognition skills. Feature engineering will also make a difference here. There is no limit on the use of ML techniques; everything under the sun can be put to use.
  3. Advanced Level: This level is best suited for people who understand advanced topics like neural networks, deep learning, recommender systems, etc. High-dimensional data sets are also featured here. Also, this is the time to get creative – see the creativity the best data scientists bring to their work and code.


Table of Contents

  1. Beginner Level
    • Iris Data
    • Titanic Data
    • Loan Prediction Data
    • Bigmart Sales Data
    • Boston Housing Data
  2. Intermediate Level
    • Human Activity Recognition Data
    • Black Friday Data
    • SIAM Competition Data
    • Trip History Data
    • Million Song Data
    • Census Income Data
    • Movie Lens Data
  3. Advanced Level
    • Identify your Digits
    • Yelp Data
    • ImageNet Data
    • KDD Cup 1999
    • Chicago Crime Data


Beginner Level

1. Iris Data Set

This is probably the most versatile, easy, and resourceful data set in the pattern recognition literature. Nothing could be simpler than the Iris data set for learning classification techniques. If you are totally new to data science, this is your starting line. The data has only 150 rows & 4 columns.

Problem: Predict the flower class based on available attributes.

Start: Get Data | Tutorial: Get Here
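To see what a first classification baseline looks like, here is a minimal k-nearest-neighbours sketch in pure Python. The sample points and labels below are invented for illustration in the spirit of Iris; they are not real rows from the data set.

```python
import math
from collections import Counter

# Tiny hand-made sample in the spirit of Iris:
# (sepal length, petal length) -> species. Values are illustrative, not real rows.
train = [
    ((5.1, 1.4), "setosa"), ((4.9, 1.5), "setosa"), ((5.0, 1.3), "setosa"),
    ((6.4, 4.5), "versicolor"), ((6.0, 4.0), "versicolor"), ((5.9, 4.2), "versicolor"),
    ((7.1, 5.9), "virginica"), ((6.8, 5.5), "virginica"), ((7.2, 6.0), "virginica"),
]

def knn_predict(x, train, k=3):
    """Classify point x by majority vote among its k nearest neighbours."""
    nearest = sorted(train, key=lambda p: math.dist(x, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((5.0, 1.4), train))  # -> setosa
print(knn_predict((6.9, 5.7), train))  # -> virginica
```

On the real data you would split into train/test sets and tune k, but the voting logic stays exactly this simple.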


2. Titanic Data Set

This is another of the most quoted data sets in the global data science community. With several tutorials and help guides, this project should give you enough of a kick to pursue data science more deeply. With a healthy mix of variables comprising categories, numbers, and text, this data set has enough scope to support crazy ideas! This is a classification problem. The data has 891 rows & 12 columns.

Problem: Predict the survival of passengers in Titanic.

Start: Get Data | Tutorial: Get Here


3. Loan Prediction Data Set

Among all industries, the insurance and lending domain is one of the heaviest users of analytics & data science methods. This data set gives you a taste of working with data from such companies: what challenges are faced, what strategies are used, which variables influence the outcome, etc. This is a classification problem. The data has 615 rows and 13 columns.

Problem: Predict if a loan will get approved or not.

Start: Get Data | Tutorial: Get Here


4. Bigmart Sales Data Set

Retail is another industry that extensively uses analytics to optimize business processes. Tasks like product placement, inventory management, customized offers, and product bundling are being smartly handled with data science techniques. As the name suggests, this data comprises transaction records of a sales store. This is a regression problem. The data has 8523 rows of 12 variables.

Problem: Predict the sales.

Start: Get Data | Tutorial: Get Here


5. Boston Housing Data Set

This is another popular data set in the pattern recognition literature. It comes from the real estate industry in Boston (US). This is a regression problem. The data has 506 rows and 14 columns. Thus, it’s a fairly small data set on which you can attempt any technique without worrying about your laptop’s memory.

Problem: Predict the median value of owner-occupied homes.

Start: Get Data | Tutorial: Get Here
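As a taste of the regression task, here is a closed-form simple linear regression (ordinary least squares with one feature) in pure Python. The rooms/price numbers below are made up for illustration, not actual Boston Housing rows.

```python
# Simple linear regression baseline, the kind you might first fit to housing data.
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]        # average rooms per dwelling (invented)
price = [12.0, 17.0, 22.0, 27.0, 32.0]   # median home value, $1000s (invented)

n = len(rooms)
mean_x = sum(rooms) / n
mean_y = sum(price) / n

# OLS closed form: slope = cov(x, y) / var(x); intercept from the means
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(rooms, price)) \
      / sum((x - mean_x) ** 2 for x in rooms)
intercept = mean_y - slope * mean_x

def predict(x):
    return intercept + slope * x

print(round(predict(6.5), 2))  # -> 24.5, midway between the 6- and 7-room prices
```

The real data set has 13 predictors, so in practice you would move to multiple regression, but the fitting idea is the same.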


Intermediate Level

1. Human Activity Recognition

This data set was collected from recordings of 30 human subjects captured via smartphones with embedded inertial sensors. Many machine learning courses use this data for student practice. It’s your turn now. This is a multi-class classification problem. The data set has 10299 rows and 561 columns.

Problem: Predict the activity category of a human

Start: Get Data


2. Black Friday Data Set

This data set comprises sales transactions captured at a retail store. It’s a classic data set for exploring your feature engineering skills and day-to-day understanding from your own shopping experience. It’s a regression problem. The data set has 550069 rows and 12 columns.

Problem: Predict purchase amount.

Start: Get Data


3. Text Mining Data Set

This data set comes from the SIAM text mining competition 2007. It comprises aviation safety reports describing problem(s) that occurred on certain flights. It is a multi-class, high-dimensional classification problem. It has 21519 rows and 30438 columns.

Problem: Classify the documents according to their labels

Start: Get Data | Get Information
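The high dimensionality here comes from representing each report as a bag of words. A minimal TF-IDF sketch in pure Python shows where those 30438 columns come from; the three documents below are invented placeholders, not real safety reports.

```python
import math
from collections import Counter

# Invented stand-ins for aviation safety reports.
docs = [
    "engine failure during climb",
    "engine fire warning on takeoff",
    "landing gear warning light",
]

def tfidf(docs):
    """Return one {term: weight} dict per document (raw tf * idf)."""
    tokenized = [doc.split() for doc in docs]
    n = len(docs)
    # document frequency: in how many docs does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

vectors = tfidf(docs)
# 'engine' occurs in 2 of 3 docs, so it is down-weighted relative to
# 'failure', which occurs in only 1 of 3.
print(vectors[0])
```

Every distinct term across the corpus becomes a column, which is exactly how a modest text collection balloons into tens of thousands of features.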


4. Trip History Data Set

This data set comes from a bike sharing service in the US. It requires you to exercise your pro data munging skills. The data is provided quarter-wise from 2010 (Q4) onwards. Each file has 7 columns. It is a classification problem.

Problem: Predict the class of user

Start: Get Data


5. Million Song Data Set

Didn’t you know analytics can be used in the entertainment industry too? See for yourself. This data set puts forward a regression task. It consists of 515345 observations and 90 variables. However, this is just a tiny subset of the original Million Song Database. You should use the data linked below.

Problem: Predict release year of the song

Start: Get Data


6. Census Income Data Set

It’s an imbalanced classification task and a classic machine learning problem. Machine learning is being extensively used to solve imbalanced problems such as cancer detection, fraud detection, etc. It’s time to get your hands dirty. The data set has 48842 rows and 14 columns. For guidance, you can check my imbalanced data project.

Problem: Predict the income class of US population

Start: Get Data
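One common first response to class imbalance is random oversampling of the minority class. Here is a pure-Python sketch; the rows and the 95/5 split are invented to mimic the "far more <=50K than >50K" shape of the census data.

```python
import random
from collections import Counter

random.seed(0)

# Invented imbalanced data: 95 majority rows, 5 minority rows.
data = [("row%d" % i, "<=50K") for i in range(95)] + \
       [("row%d" % i, ">50K") for i in range(95, 100)]

def oversample(data):
    """Duplicate minority-class rows at random until the classes are balanced."""
    counts = Counter(label for _, label in data)
    (majority, _), (minority, _) = counts.most_common(2)
    deficit = counts[majority] - counts[minority]
    minority_rows = [row for row in data if row[1] == minority]
    return data + random.choices(minority_rows, k=deficit)

balanced = oversample(data)
print(Counter(label for _, label in balanced))  # both classes now at 95
```

Oversampling (or its mirror, undersampling the majority) is only a baseline; class weights and synthetic sampling like SMOTE are the usual next steps.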


7. Movie Lens Data Set

This data set lets you build a recommendation engine. Have you created one before? It’s one of the most popular & quoted data sets in the data science industry. It is available in various sizes; here I’ve used a fairly small one. It has 1 million ratings from 6000 users on 4000 movies.

Problem: Recommend new movies to users

Start: Get Data
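A classic starting point for MovieLens-style data is user-based collaborative filtering: find the user most similar to you and suggest what they liked that you haven’t seen. The users, movies, and ratings below are invented; note that computing cosine similarity over only the commonly rated movies, as done here for brevity, is a known simplification.

```python
import math

# Invented toy ratings on a MovieLens-like 1-5 scale.
ratings = {
    "alice": {"Toy Story": 5, "Jumanji": 3, "Heat": 1},
    "bob":   {"Toy Story": 4, "Jumanji": 3, "Heat": 2, "Casino": 5},
    "carol": {"Heat": 5, "Casino": 4},
}

def cosine(u, v):
    """Cosine similarity over the movies both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    return dot / (math.sqrt(sum(u[m] ** 2 for m in common)) *
                  math.sqrt(sum(v[m] ** 2 for m in common)))

def recommend(user, ratings):
    """Suggest unseen movies rated by the most similar other user."""
    others = [(cosine(ratings[user], ratings[o]), o) for o in ratings if o != user]
    _, best = max(others)
    return sorted(set(ratings[best]) - set(ratings[user]))

print(recommend("alice", ratings))  # -> ['Casino']
```

At the scale of 1 million ratings you would precompute similarities and use sparse representations, but the neighbour-voting idea is the same.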


Advanced Level

1. Identify your Digits Data Set

This data set lets you study, analyze, and recognize elements in images. That’s exactly how your camera detects faces, using image recognition! It’s your turn to build and test that technique. It’s a digit recognition problem. This data set has 7000 images of size 28 × 28, totaling 31MB.

Problem: Identify digits from an image

Start: Get Data


2. Yelp Data Set

This data set is part of round 8 of The Yelp Dataset Challenge. It comprises nearly 200,000 images, provided in 3 JSON files of ~2GB. These images provide information about local businesses in 10 cities across 4 countries. You are required to find insights in the data using cultural trends, seasonal trends, inferred categories, text mining, social graph mining, etc.

Problem: Find insights from images

Start: Get Data


3. Image Net Data Set

ImageNet offers a variety of problems encompassing object detection, localization, classification, and scene parsing. All the images are freely available. You can search for any type of image and build your project around it. As of now, this image engine has 14,197,122 images of multiple shapes, sizing up to 140GB.

Problem: The problem to solve depends on the type of images you download

Start: Get Data


4. KDD Cup 1999 Data Set

How could I miss the KDD Cup? Originally, KDD brought the taste of data mining competitions to the world. Don’t you want to see what data set they offered? I assure you, it’ll be an enriching experience. This data poses a classification problem. It has 4M rows and 48 columns in a ~1.2GB file.

Problem: Classify network connections as good or bad (intrusions).

Start: Get Data


5. Chicago Crime Data Set

The ability to handle large data sets is expected of every data scientist these days. Companies no longer prefer to work on samples; they use full data. This data set gives you much-needed hands-on experience handling large data sets on your local machine. The problem is easy, but data management is the key! This data set has 6M observations. It’s a multi-class classification problem.

Problem: Predict the type of crime.

Start: Get Data | To download data, click on Export -> CSV
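The key data management trick for a 6M-row CSV is streaming it row by row instead of loading everything into memory. A pure-Python sketch using the stdlib `csv` module is below; the sample rows are invented, and treat the `Primary Type` column name as an assumption about the export format.

```python
import csv
import io
from collections import Counter

# Invented stand-in for the exported crime CSV; a real file would be opened
# with open("chicago_crimes.csv") instead.
sample = io.StringIO(
    "ID,Primary Type,Year\n"
    "1,THEFT,2016\n"
    "2,BATTERY,2016\n"
    "3,THEFT,2017\n"
)

def count_crimes(handle):
    """Stream the file one row at a time, keeping only running counts."""
    counts = Counter()
    for row in csv.DictReader(handle):  # never holds more than one row
        counts[row["Primary Type"]] += 1
    return counts

print(count_crimes(sample))  # Counter({'THEFT': 2, 'BATTERY': 1})
```

Because memory use stays constant regardless of file size, the same loop works unchanged on the full 6M-row export; `pandas.read_csv(..., chunksize=...)` is the analogous approach if you prefer DataFrames.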


End Notes

Of the 17 data sets listed above, start by finding the right match for your skills. Say you are a beginner in machine learning: avoid taking up advanced-level data sets. Don’t bite off more than you can chew, and don’t feel overwhelmed by how much you still have to do. Instead, focus on making stepwise progress.

Once you complete 2 – 3 projects, showcase them on your resume and your GitHub profile (most important!). Lots of recruiters these days evaluate candidates by browsing GitHub profiles. Your motive shouldn’t be to do all the projects, but to pick selected ones based on data set, domain, or data set size, whichever excites you the most. If you want me to solve any of the above problems and create a complete project like this, let me know.

Did you find this article useful? Have you already built a project on any of these data sets? Do share your experience, lessons, and suggestions in the comments below.

You can test your skills and knowledge. Check out Live Competitions and compete with best Data Scientists from all over the world.


Source: 17 Free Data Science Projects To Boost Your Knowledge & Skills