17 Free Data Science Projects To Boost Your Knowledge & Skills




Data science projects offer you a promising way to kick-start your analytics career. Not only do you get to learn data science by applying it, you also get projects to showcase on your CV. Nowadays, recruiters evaluate a candidate’s potential by their work, not as much by certificates and resumes. It doesn’t matter how much you tell them you know if you have nothing to show them! That’s where most people struggle and miss out!

You might have worked on several problems, but if you can’t present and explain your work, how on earth would someone know what you are capable of? That’s where these projects will help you. Think of the time spent on these projects as your training sessions. I guarantee, the more time you spend, the better you’ll become!

The data sets in the list below are handpicked. I’ve made sure to give you a taste of a variety of problems from different domains and of different sizes. I believe everyone must learn to work smartly on large data sets, hence large data sets are included. Also, I’ve made sure all the data sets are open and free to access.



Useful Information

To help you decide where to start, I’ve divided the data sets into 3 levels, namely:

  1. Beginner Level: This level comprises data sets which are fairly easy to work with and don’t require complex data science techniques. You can solve them using basic regression / classification algorithms. Also, these data sets have enough open tutorials to get you going. In this list, I’ve also provided tutorials to help you get started.
  2. Intermediate Level: This level comprises data sets which are more challenging. It consists of mid-sized & large data sets which require some serious pattern recognition skills. Also, feature engineering will make a difference here. There is no limit on the ML techniques you can use; everything under the sun can be put to use.
  3. Advanced Level: This level is best suited for people who understand advanced topics like neural networks, deep learning, recommender systems etc. High dimensional data are also featured here. Also, this is the time to get creative: see the creativity the best data scientists bring to their work and code.


Table of Contents

  1. Beginner Level
    • Iris Data
    • Titanic Data
    • Loan Prediction Data
    • Bigmart Sales Data
    • Boston Housing Data
  2. Intermediate Level
    • Human Activity Recognition Data
    • Black Friday Data
    • Siam Competition Data
    • Trip History Data
    • Million Song Data
    • Census Income Data
    • Movie Lens Data
  3. Advanced Level
    • Identify your Digits
    • Yelp Data
    • ImageNet Data
    • KDD Cup 1999
    • Chicago Crime Data


Beginner Level

1. Iris Data Set

This is probably the most versatile, easy and resourceful data set in pattern recognition literature. Nothing could be simpler than the Iris data set for learning classification techniques. If you are totally new to data science, this is your starting point. The data has only 150 rows & 4 columns.

Problem: Predict the flower class based on available attributes.

Start: Get Data | Tutorial: Get Here
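
To make the task concrete, here is a minimal sketch of a nearest-neighbour classifier in plain Python. It assumes a tiny hand-copied sample of six rows, used only for illustration (load the full 150-row file for real work), and the function name predict_species is hypothetical. It is one simple way to approach the problem, not the method used in the linked tutorial.

```python
import math
from collections import Counter

# Illustrative sample only: (sepal length, sepal width, petal length, petal width)
# in cm, paired with the flower class. Assumed values; use the full file in practice.
SAMPLES = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]


def predict_species(features, samples=SAMPLES, k=1):
    """Return the majority class among the k samples closest to `features`."""
    # Sort the known rows by Euclidean distance to the new flower.
    nearest = sorted(samples, key=lambda row: math.dist(features, row[0]))[:k]
    # Let the k closest rows vote on the class.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(predict_species((5.0, 3.4, 1.5, 0.2)))
print(predict_species((6.5, 3.0, 5.8, 2.2), k=3))
```

With only four numeric columns and well-separated classes, even a method this simple gives you something to measure more advanced algorithms against.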


2. Titanic Data Set

This is another of the most quoted data sets in the global data science community. With several tutorials and help guides available, this project should give you enough of a kick to pursue data science more deeply. With a healthy mix of variables comprising categories, numbers and text, this data set has enough scope to support crazy ideas! This is a classification problem. The data has 891 rows & 12 columns.

Problem: Predict the survival of passengers on the Titanic.

Start: Get Data | Tutorial: Get Here


3. Loan Prediction Data Set

Among all industries, the insurance domain makes the largest use of analytics & data science methods. This data set gives you a taste of working on data sets from insurance companies: what challenges are faced, what strategies are used, which variables influence the outcome etc. This is a classification problem. The data has 615 rows and 13 columns.

Problem: Predict if a loan will get approved or not.

Start: Get Data | Tutorial: Get Here


4. Bigmart Sales Data Set

Retail is another industry which extensively uses analytics to optimize business processes. Tasks like product placement, inventory management, customized offers, product bundling etc. are being smartly handled using data science techniques. As the name suggests, this data comprises the transaction records of a sales store. This is a regression problem. The data has 8523 rows and 12 variables.

Problem: Predict the sales.

Start: Get Data | Tutorial: Get Here


5. Boston Housing Data Set

This is another popular data set used in pattern recognition literature. The data set comes from the real estate industry in Boston (US). This is a regression problem. The data has 506 rows and 14 columns. Thus, it’s a fairly small data set where you can attempt any technique without worrying about your laptop’s memory.

Problem: Predict the median value of owner occupied homes

Start: Get Data | Tutorial: Get Here
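
If regression is new to you, here is a minimal sketch of fitting a straight line by ordinary least squares in plain Python. The numbers are made up for illustration and are not taken from the Boston data, and the function name fit_line is hypothetical. On the real data set you would work with many predictors and a proper library, but the underlying idea is the same.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up toy data: average number of rooms vs. median home value.
rooms = [4, 5, 6, 7]
value = [10, 20, 30, 40]

slope, intercept = fit_line(rooms, value)
print(slope, intercept)
```

Because the data set is so small, you can refit models like this as often as you like and compare them quickly.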


Intermediate Level

1. Human Activity Recognition

This data set is collected from recordings of 30 human subjects captured via smartphones with embedded inertial sensors. Many machine learning courses use this data for student practice. It’s your turn now. This is a multi-class classification problem. The data set has 10299 rows and 561 columns.

Problem: Predict the activity category of a human

Start: Get Data


2. Black Friday Data Set

This data set comprises sales transactions captured at a retail store. It’s a classic data set for exercising your feature engineering skills and the day-to-day understanding you have from your own shopping experience. It’s a regression problem. The data set has 550069 rows and 12 columns.

Problem: Predict purchase amount.

Start: Get Data


3. Text Mining Data Set

This data set is originally from the SIAM competition 2007. The data set comprises aviation safety reports describing problem(s) which occurred on certain flights. It is a multi-classification, high dimensional problem. It has 21519 rows and 30438 columns.

Problem: Classify the documents according to their labels

Start: Get Data | Get Information


4. Trip History Data Set

This data set comes from a bike sharing service in the US. It requires you to exercise your pro data munging skills. The data is provided quarter-wise from 2010 (Q4) onwards. Each file has 7 columns. It is a classification problem.

Problem: Predict the class of user

Start: Get Data


5. Million Song Data Set

Didn’t you know analytics can be used in the entertainment industry too? Try it yourself now. This data set puts forward a regression task. It consists of 515345 observations and 90 variables. However, this is just a tiny subset of the original Million Song database. You should use the data linked below.

Problem: Predict release year of the song

Start: Get Data


6. Census Income Data Set

It’s an imbalanced classification problem and a classic machine learning problem. Machine learning is used extensively to solve imbalanced problems such as cancer detection, fraud detection etc. It’s time to get your hands dirty. The data set has 48842 rows and 14 columns. For guidance, you can check my imbalanced data project.

Problem: Predict the income class of the US population

Start: Get Data
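
Why does imbalance matter? The sketch below, in plain Python, shows how a model that always predicts the majority class can look accurate while never finding the rare class. The labels are made up (a 10 / 90 split chosen for easy arithmetic, not the real class ratio of this data set), and the function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn


def summarize(actual, predicted, positive=1):
    """Return accuracy, precision and recall; undefined ratios become 0.0."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    return {
        "accuracy": (tp + tn) / len(actual),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels: 1 is the rare class, 0 is the common class.
actual = [1] * 10 + [0] * 90
always_majority = [0] * 100  # a "model" that never predicts the rare class

print(summarize(actual, always_majority))
# {'accuracy': 0.9, 'precision': 0.0, 'recall': 0.0}
```

So when you evaluate your models on this data set, look at precision and recall for the rare class as well as overall accuracy.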


7. Movie Lens Data Set

This data set allows you to build a recommendation engine. Have you created one before? It’s one of the most popular & quoted data sets in the data science industry. It is available in various sizes. Here I’ve used a fairly small one. It has 1 million ratings from 6000 users on 4000 movies.

Problem: Recommend new movies to users

Start: Get Data
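
If you have never built a recommendation engine, the sketch below shows one simple idea, user-based collaborative filtering, in plain Python. The ratings are made up and are not MovieLens data, and the function names are hypothetical. It scores each unseen movie by the ratings of other users, weighted by how similar those users are to you. Treat it as a starting point under those assumptions rather than a production approach.

```python
import math

# Made-up ratings (user -> {movie: rating on a 1-5 scale}); not MovieLens data.
RATINGS = {
    "ana": {"Alien": 5, "Heat": 4, "Up": 1},
    "ben": {"Alien": 5, "Heat": 5, "Up": 1, "Jaws": 5},
    "cara": {"Alien": 1, "Heat": 2, "Up": 5, "Babe": 5},
}


def cosine_similarity(a, b):
    """Cosine similarity of two users over the movies both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings=RATINGS, top_n=1):
    """Rank movies the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        for movie, rating in other_ratings.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("ana"))
```

Here "ana" rates movies much like "ben" does, so the movie only "ben" has seen ranks first. On the real data you would also want to correct for users who rate everything high or low, for example by subtracting each user's mean rating.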


Advanced Level

1. Identify your Digits Data Set

This data set allows you to study, analyze and recognize elements in images. That’s exactly how your camera detects your face, using image recognition! It’s your turn to build and test that technique. It’s a digit recognition problem. This data set has 7000 images of 28 x 28 pixels, totalling 31MB.

Problem: Identify digits from an image

Start: Get Data


2. Yelp Data Set

This data set is a part of round 8 of The Yelp Dataset Challenge. It comprises nearly 200,000 images, provided in 3 JSON files of ~2GB. These images provide information about local businesses in 10 cities across 4 countries. You are required to find insights from the data by exploring cultural trends, seasonal trends, category inference, text mining, social graph mining etc.

Problem: Find insights from images

Start: Get Data


3. Image Net Data Set

ImageNet offers a variety of problems encompassing object detection, localization, classification and scene parsing. All the images are freely available. You can search for any type of image and build your project around it. As of now, this image database has 14,197,122 images of multiple shapes, sizing up to 140GB.

Problem: The problem to solve depends on the type of images you download

Start: Get Data


4. KDD 1999 Data Set

How could I miss KDD Cup? Originally, KDD brought the taste of data mining competitions to the world. Don’t you want to see what data sets they used to offer? I assure you, it’ll be an enriching experience. This data poses a classification problem. It has 4M rows and 48 columns in a ~1.2GB file.

Problem: Build a network intrusion detector that classifies connections as good or bad.

Start: Get Data


5. Chicago Crime Data Set

The ability to handle large data sets is expected of every data scientist these days. Companies no longer prefer to work on samples; they use the full data. This data set gives you much-needed hands-on experience of handling large data sets on your local machine. The problem is easy, but data management is the key! This data set has 6M observations. It’s a multi-class classification problem.

Problem: Predict the type of crime.

Start: Get Data | To download data, click on Export -> CSV
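
One data management trick that helps with files of this size is to stream the CSV row by row instead of loading it all into memory. Here is a minimal sketch in plain Python. The three sample rows are made up, the column name "Primary Type" is an assumption about the exported file (check the header of your own download), and the function name count_by_column is hypothetical.

```python
import csv
import io
from collections import Counter


def count_by_column(lines, column):
    """Count values of one column while reading the CSV one row at a time."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column]] += 1
    return counts


# Stand-in for the exported file; the rows and the column name are assumptions.
sample = io.StringIO(
    "ID,Primary Type\n"
    "1,THEFT\n"
    "2,BATTERY\n"
    "3,THEFT\n"
)

print(count_by_column(sample, "Primary Type").most_common(1))
# [('THEFT', 2)]
```

With the real export you would pass an open file, for example open(path, newline=""), in place of the in-memory sample. Only one row is held in memory at a time, so the size of the file matters far less.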


End Notes

Out of the 17 data sets listed above, you should start by finding the one that matches your skills. Say, if you are a beginner in machine learning, avoid taking up advanced level data sets. Don’t bite off more than you can chew and don’t feel overwhelmed by how much you still have to do. Instead, focus on making step-wise progress.

Once you complete 2-3 projects, showcase them on your resume and your GitHub profile (most important!). Lots of recruiters these days hire candidates by stalking GitHub profiles. Your aim shouldn’t be to do all the projects, but to pick selected ones based on the data set, domain or data set size, whichever excites you the most. If you want me to solve any of the above problems and create a complete project like this, let me know.

Did you find this article useful? Have you already built any project on these data sets? Do share your experience, learnings and suggestions in the comments below.

You can test your skills and knowledge. Check out Live Competitions and compete with the best Data Scientists from all over the world.


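Source: 17 Free Data Science Projects To Boost Your Knowledge & Skills

Why is R the best data science language?


Recently, R has become very popular among people who do data analysis. According to the statistics, it is one of the fastest-growing programming languages of the past 10 years. In fact, it is still the language recommended to anyone starting out in data science, and it is a very popular, best-in-class data language. Why is R the best data science language today?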

R consistently ranks among the best languages
One thing I want you to understand is that right now, R is one of the most highly regarded, highly ranked, and fastest growing languages in existence.

In many ways, R is the data language. In data science, it’s the language to beat (with only 1 or 2 serious contenders).

To understand why this is true, let’s look at the results of several important surveys and programming language rankings to see where R shakes out.

IEEE: R ranks #5

The world’s “largest association of technical professionals,” the IEEE, has created a ranking of programming languages for several years.

This IEEE ranking system uses a set of 12 metrics, including things like Google search volume, Google trends, Twitter hits, Github repositories, Hacker News posts, and more.

Using this methodology, they rank several dozen programming languages and place them into several categories.

In their review of the “Top Programming Languages” of 2016, R climbed to #5.

The IEEE methodology is quite comprehensive, so this is a strong indicator of R’s strength compared to other languages, and the relative value of learning R.

TIOBE: R ranks high with consistent upward trend

Another ranking system, the TIOBE index, creates a similar score and rank for various programming languages.

If we look at R’s performance on the TIOBE index, we can see a solid upward trend for almost a decade.

Keep in mind that the TIOBE index is structured to be “an indicator of the popularity of programming languages. The index is updated once a month. The ratings are based on the number of skilled engineers world-wide, courses and third party vendors. Popular search engines such as Google, Bing, Yahoo!, Wikipedia, Amazon, YouTube and Baidu are used to calculate the ratings.”

For December 2016, R has an overall rank of 17 (among all programming languages). Its maximum rank was #12 in May of 2015.

This suggests that currently, learning R is still an excellent option if you want to learn data science. It may arguably be the best option. (To be clear, Python ranks higher on the TIOBE index, but it’s harder to separate out web and software dev uses of Python from the strictly data-related uses of Python, so it may not be an apples to apples comparison.)

Redmonk: R is #12

Another frequently cited language ranking system is the Redmonk Programming Language Rankings, which are derived from popularity on GitHub (lines of code) and popularity on Stack Overflow (number of tags).

As of November 2016, R ranks number 12 among all programming languages.

Moreover, R has shown a consistent upward trend for several years:

Out of all the back half of the Top 20 languages, R has shown the most consistent upwards movement over time. From its position of 17 back in 2012, it has made steady gains over time, but had seemed to stall at 13 having stuck there for three consecutive quarters. This time around, however, R took over #12 from Perl which in turn dropped to #13. There’s still an enormous amount of Perl in circulation, but the fact that the more specialized R has unseated the language once considered the glue of the web says as much about Perl as it does about R. Which is irrelevant to R advocates, of course. Whatever the cause, R’s relatively unique Top 20 path is one for fans of the language to cheer.

– RedMonk Programming Language Rankings: June 2016
(emphasis mine)

O’Reilly: R is arguably the most common data programming language

Finally, O’Reilly media has conducted a data science survey for the last several years, and they use the survey data to analyze data science trends. Among other things, they analyzed tool usage to identify which tools are most commonly used by data scientists.

In the 2016 survey report, R was the most common programming language (if we exclude SQL, which isn’t a programming language in the sense that I’m using it here). 57% of all respondents used R (compared to 54% using Python).

(As a side note, fully 70% of respondents used SQL. If you’re looking for another tool to learn after R, I’d suggest SQL.)

They also surveyed people to identify data visualization tools. They found that ggplot2 was the most common visualization tool. I’ll explain why I love ggplot2 in an upcoming blog post, but if we’re only tracking popularity, the O’Reilly survey suggests that ggplot2 is highly used (if not best in class).

R is excellent for learning data science
Beyond popularity, another reason that R is an excellent data science programming language is that it is excellent for learning data science.

R is a true “data language”

Part of the reason for this is the nature of the language itself.

R was ultimately created with statistics and data in mind. The R-Project describes R as a “[programming] language and environment for statistical computing” (emphasis mine).

R is a language that has statistics and data built into its DNA, so to speak.

In this sense, R is nearly unique among programming languages. It is a language that has been built for statistics. It’s been designed for data.

This has advantages when you’re learning data science, because almost any statistical test or technique can be found somewhere within base R or one of its packages.

The best books and resources use R

Related to the fact that R is a “statistical computing” language is the fact that many of the best books and learning materials have adopted R as the language of choice.

This is important. If you’re a beginner, and you’re just getting started in data science, you’ll have a lot to learn. To truly master data science, you’ll need to learn several sub-areas like probability, statistics, data visualization, data manipulation, and machine learning. All of these skill areas have theoretical foundations (which you’ll need to learn) but also practical techniques that you’ll need to execute by writing code.

That means that:

  • You need a language that has strong capabilities in each of these areas (visualization, manipulation, machine learning (AKA statistical learning), etc.)
  • You need a language for which there are high quality training materials in these skill areas.

While there are many data-related books and courses out there, many of the best ones are centered on the R programming language.

Learn Probability with R

For example, two excellent books on probability use R for their “hands on” programming examples.

The first is Probability with Applications and R. This book is very approachable, readable, and well organized.

The second is Introduction to Probability which was developed from highly regarded statistics lectures at Harvard.

These are just two examples. If you dig deeper, you’ll find that among probability books that use a programming language, many (if not most) of them use R.

Learn frequentist statistics with R

The same can be said for statistics books.

Because R has statistics “built into its DNA,” many statistics textbooks use R as a learning tool.

For an introductory look at frequentist statistics, here’s one excellent book:

  • Statistics: an Introduction using R

Again, if you do a quick search on Amazon and look at many intro stats books, you’ll find that if they use any programming language as a teaching tool, they are more likely to use R than almost any other language.

Learn Bayesian statistics with R

This becomes even more pronounced if you want a hands-on book for learning Bayesian statistics.

If you want to learn Bayesian stats and Bayesian analysis, nearly all of the books use R. There are some exceptions, like a few books that teach Bayesian analysis in C or Python, but overwhelmingly the best books that teach Bayesian statistics use R.

If you’re interested in Bayesian stats, check out these:

  • Introduction to Bayesian Statistics
  • Statistical Rethinking
  • Doing Bayesian Data Analysis

If you’re interested in Bayesian methods, these books are “best in class,” and they all use R.

Learn Data Visualization in R

When you’re learning data visualization, there’s a slightly larger range of programming languages to choose from, but I still maintain that most of the best learning materials use R.

If you’re learning data visualization, I highly recommend the work of Nathan Yau. His blog, flowingdata.com, frequently has data visualization tutorials for the R programming language. (I also recommend his book Data Points as a companion, though it teaches principles as opposed to programming language syntax.)

I also highly recommend several books by Hadley Wickham. First, if you’re interested in data visualization in R, you need to own the book ggplot2. It not only teaches you the syntax of this critical R data visualization library, but it will also reshape how you think about visualizing your data.

I also recommend R for Data Science. This book provides a great introduction to data visualization, but additionally teaches you a broad set of data tools in R. It’s excellent, and a “must own” R book.

Learn machine learning with R

Finally, if you want to get started with machine learning, many of the best machine learning books use R.

Although I will acknowledge that there’s more diversity among ML books with regard to their programming language, I still maintain that many of the best ones use R.

Here are two excellent introductions to machine learning that teach ML using the R programming language.

  • An Introduction to Statistical Learning
  • Applied Predictive Modeling

Both books are rigorous while still being approachable. They will teach you a little bit of theory (but not overwhelm you with math) while also showing you practical techniques.

Without question, these are the two books that I recommend most often for a beginner who wants to learn machine learning, and they both use R.

If you want to learn data science, R is excellent

Ultimately, the point here is that R is an excellent language for learning data science, because many of the best books (and other training materials) use R as the programming language of choice.

So if you’re a beginner in data science, I think that R is the best language – in large part – because of the quantity and quality of data-science learning materials.

A quick note on Python
There are other options, but the only one I’ll address here is Python.

As far as data science programming languages go, Python is the only serious alternative to R right now. (Other alternatives lack a well-developed package ecosystem or are not free/open source.)

I won’t explain my full thoughts on Python here, but I will say that it’s an excellent language. I love Python.

Having said that, for data science beginners, I still think that R is a slightly better choice, largely for the reasons I outlined above.

Again, I think that many of the best textbooks and training materials for foundational data science concepts (probability, statistics, Bayesian statistics, machine learning) are R-based books. That’s not to say that there aren’t excellent data science books that use Python, but I still think that there is a higher average quality among the R-based texts.

The other issue with Python is that many students get caught up in software development. That is, instead of learning statistics, data visualization, data manipulation, probability, etc., they end up spending their time learning about data structures, loops, flow-control, object oriented programming, and web frameworks. These skill areas can complement the core data science toolkit, but they are not data science topics in the sense that I’m using the term here. In fact, I recommend that most beginners learn software development concepts after learning basic data science subjects like data manipulation, visualization, analysis, etc.

Even though most beginners should learn software development principles later, many beginners who start with Python get sidetracked into these software development and web development areas. I think this happens, because in many ways, Python is geared towards these subjects. Most books on Python are not really data science books per se, but instead books on programming, development, etc. So a beginning data science student opens up a Python book intending to learn data science, but they end up going down the software/web development rabbit hole, and don’t come out for a few months (or years).

As much as I love Python, I think this is a risk for beginners. I think it’s better to start with R as it has statistics and data science more “built into its DNA.” With R, it’s easier to learn the foundations, and harder to get sidetracked.

Recap: Learn R if you want to learn data science
What you should take away is that for learning data science, R is arguably the best option. In terms of popularity, R is very highly ranked, and on an upward trajectory. Moreover, many of the best data science books and training materials use R.

If you want to get started learning data science, I recommend the following:

  • Learn R
  • Specifically, learn ggplot2, dplyr, tidyr, lubridate, and other tidyverse tools for data visualization and manipulation
  • Learn to use these tools together to analyze data
  • Once you have some background in these essential R packages, bulk up on probability, stats, and machine learning (I recommend the texts that I talk about in this blog post)

Source: Why R is the best data science language to learn today – SHARP SIGHT LABS