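R vs. Python for Data Science

In the contest for the best data science tool, Python and R each have their strengths and weaknesses. Choosing one over the other depends on your use case, the cost of learning, and the other common tools you need.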


 

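According to DataCamp, one of the questions its learners ask most often is whether they should use R or Python for their everyday data analysis work. Although the site mainly offers interactive R tutorials, its answer is always that the choice depends on the type of data analysis challenge the learner is facing.

As you know, both Python and R are popular programming languages for statistics. R was developed with statisticians in mind and offers powerful data visualization capabilities, while Python is praised for a syntax that beginners find easy to understand.

This post looks at the differences between R and Python and at where each stands in the world of data science and statistics. See the accompanying infographic:

“Data Science Wars: R vs Python” by DataCamp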

 

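Introducing R

Robert Gentleman and Ross Ihaka of the University of Auckland in New Zealand created the open source language R in 1995 as an implementation of the S programming language. Their goal was to develop a language focused on providing a more user-friendly way to do data analysis, statistics, and graphical models. At first R was used mainly in academia and research, but businesses are now using it more and more. That makes R one of the fastest-growing statistical languages in the corporate world.

One of R's biggest strengths is its huge community, which provides support through mailing lists, user-contributed documentation, and a very active Stack Overflow group. There is also CRAN, a huge repository of R packages to which users can easily contribute. These packages are collections of R functions and data that give you immediate access to the latest techniques and functionality without having to develop everything from scratch.

If you are an experienced programmer, you will probably not have much trouble getting used to R. Beginners, however, may find its steep learning curve a barrier to entry. Fortunately, there are now plenty of excellent learning resources available for free through MOOCs (Coursera, Udacity, edX, K-MOOC, and others).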

 

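Introducing Python

Created by Guido van Rossum in 1991, Python emphasizes productivity and code readability. Programmers who want to do data analysis or apply statistical techniques are among the main users of Python for statistical purposes.

The closer you work to an engineering environment, the more likely you are to prefer Python. It is a flexible language that is great for doing something novel. With its focus on readability and simplicity, its learning curve is relatively gentle.

Like R, Python has packages. PyPI is the Python Package Index, a repository of libraries to which users can contribute. Python also has a great community, but because it is a general-purpose language, that community is somewhat more scattered. Even so, Python is claiming an increasingly dominant position in data science: expectations are growing, and more innovative data science applications are starting out here.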

 

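R and Python by the Numbers

You can find many articles online that compare the adoption and popularity of R and Python. These figures give a good indication of how the two languages are evolving within the overall computer science ecosystem, but comparing them side by side is not easy. The main reason is that you will find R only in a data science setting, whereas Python, as a general-purpose language, is widely used in many fields, such as web development.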

 

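When and How to Use R

R is mainly used when a data analysis task requires standalone computing or analysis on individual servers. It is well suited to exploratory work, and it is handy for almost any type of data analysis because its huge number of packages and readily usable tests give you the tools you need to get up and running quickly. R can even be part of a big data solution.

A good first step when getting started with R is to install RStudio, an excellent IDE. Once that is done, take a look at the following popular packages:

  • dplyr, plyr, and data.table make it easy to manipulate data
  • stringr is useful for manipulating strings
  • zoo is useful for working with regular and irregular time series
  • ggvis, lattice, and ggplot2 are widely used for data visualization
  • caret is useful for machine learning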

 

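When and How to Use Python

If your data analysis work has to be integrated with a web application, or your statistical code has to be built into a production database, Python is a good choice. Because it is a full-fledged programming language, it is a great tool for implementing algorithms for production use.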
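As a minimal sketch of what statistical code sitting next to a database can look like, the following uses only Python's standard library. The in-memory SQLite database is a stand-in for a production database, and the `orders` table, its rows, and the `amount_summary` helper are made up for illustration.

```python
import sqlite3
import statistics

# Hypothetical stand-in for a production database: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    # Made-up rows, for illustration only.
    [("a", 10.0), ("a", 14.0), ("b", 3.0), ("b", 5.0), ("b", 7.0)],
)


def amount_summary(connection, customer):
    """Return count, mean, and sample standard deviation of one customer's amounts."""
    rows = connection.execute(
        "SELECT amount FROM orders WHERE customer = ?", (customer,)
    ).fetchall()
    amounts = [row[0] for row in rows]
    return {
        "n": len(amounts),
        "mean": statistics.mean(amounts),
        "stdev": statistics.stdev(amounts),  # needs at least two values
    }


summary_b = amount_summary(conn, "b")
print(summary_b)
```

The query and the statistics live in the same function, so the same code could in principle be called from a web request handler or a scheduled job; a real deployment would swap the SQLite connection for the production database driver.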

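In the past, the immaturity of Python's data analysis packages was a problem, but this has improved greatly over the years. To use Python for data analysis, install NumPy and SciPy (for scientific computing) and pandas (for data manipulation). Also take a look at matplotlib for creating graphics, and get to know scikit-learn for machine learning.

Unlike R, Python has no single clear-cut IDE. Take a look at Spyder, IPython Notebook, and Rodeo to see which one suits you best.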

 

 

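R and Python: The Numbers in Data Science

Recent polls that focus on the programming languages used for data analysis show R as the clear winner. A similar pattern appears if you focus specifically on the Python and R data analysis communities.

Despite the figures above, there are signs that more people are switching from R to Python. In addition, a growing number of data analysts use a combination of the two languages as appropriate, and this approach is recommended for data scientists as well.

If you want to build a career in data science, you should be proficient in both languages. Job trends show that demand for both skills has risen and that salaries are well above average.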

 

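Pros and Cons of R

Pro 1: A picture is worth a thousand words.

Visualized data can be understood more efficiently and effectively than raw numbers alone. R and visualization are a perfect match. Commonly used visualization packages include ggplot2, ggvis, googleVis, and rCharts.

Pro 2: The R ecosystem

R has a rich ecosystem of cutting-edge packages and an active community. Packages are available from CRAN, Bioconductor, and GitHub. You can search through all R packages at Rdocumentation.

Pro 3: R, the lingua franca of data science

R is a language developed by statisticians for statisticians. They can communicate ideas and concepts through R code and packages. You do not necessarily need a computer science background to get started. R is also increasingly being adopted in industry, outside academia.

Con 1: Slow speed

R was developed to make statisticians' lives easier, not to make life easy for your computer. R is often slow because of poorly written code, but alternative R implementations such as pqR, Renjin, FastR, and Riposte aim to improve performance.

Con 2: Steep learning curve

R's learning curve is not trivial, especially if you are coming from a GUI for your statistical analysis. Even finding packages can take a long time if you are not familiar with them.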

 

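Pros and Cons of Python

Pro 1: IPython Notebook

IPython Notebook makes it easier to work with Python and data. You can easily share notebooks with colleagues without their having to install anything. This greatly reduces the overhead of organizing code, output, and notes files, leaving you more time for real work.

Pro 2: Python as a general-purpose language

Python is an easy, intuitive general-purpose language. Above all, it has a relatively flat learning curve and speeds up how quickly you can write a program. In short, you spend less time writing code, and with a testing framework you can make that code reusable and dependable.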
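To make the point about testing frameworks concrete, here is a small sketch using `unittest` from Python's standard library. The `trimmed_mean` function and its test cases are hypothetical examples, not part of any package mentioned in this post.

```python
import unittest


def trimmed_mean(values, trim=1):
    """Mean after dropping the `trim` smallest and `trim` largest values."""
    if trim < 0:
        raise ValueError("trim must be non-negative")
    if len(values) <= 2 * trim:
        raise ValueError("not enough values to trim")
    kept = sorted(values)[trim:len(values) - trim]
    return sum(kept) / len(kept)


class TrimmedMeanTest(unittest.TestCase):
    def test_drops_extremes(self):
        # 1 and 100 are discarded, leaving 2, 3, and 4.
        self.assertEqual(trimmed_mean([100, 2, 3, 4, 1]), 3.0)

    def test_rejects_short_input(self):
        # Two values with trim=1 would leave nothing to average.
        with self.assertRaises(ValueError):
            trimmed_mean([1, 2], trim=1)


# Run the tests explicitly so the snippet also works when pasted into a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TrimmedMeanTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Once a helper like this is covered by tests, it can be reused across analyses with some confidence that a later change has not silently altered its behavior.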

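Pro 3: Python as a multi-purpose language

Python brings people with different backgrounds together. As a common, easy-to-understand language that programmers already know and statisticians can easily pick up, it lets you build a single tool that integrates with every part of your workflow.

Con 1: Visualization

Visualization is an important criterion when choosing data analysis software. Python has some nice visualization libraries, such as Seaborn, Bokeh, and Pygal, but there are arguably too many options to choose from. Moreover, compared with R, visualizations in Python are generally more convoluted to produce, and the results are less polished.

Con 2: Python as a challenger to R

Python is a challenger to R. It does not offer an alternative to the hundreds of essential R packages. It is catching up, but it is still unclear whether it will make people give up R.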

 

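And the winner is…

It's up to you! As a data scientist, it is your job to pick the language that best fits the task at hand. Here are some questions that can help you:

  1. What problem do you want to solve?
  2. What is the net cost of learning a language?
  3. What tools are commonly used in your field?
  4. What other tools are available, and how do they relate to the commonly used ones?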

 

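About DataCamp

DataCamp is an online interactive education platform that offers courses in data science and in R and Python programming. Each course is built around a specific data science topic and combines video instruction with in-browser coding challenges so that you learn by doing. You can start any course for free, whenever and wherever you want.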

 


 

Source: R vs Python for Data Science: The Winner is …

Data Scientist Skill Set – Data Science Central

1         Background

Data science is first and foremost a talent-based discipline and capability. Platforms, tools and IT infrastructure play an important but secondary role. Nevertheless, software and technology companies around the globe spend significant amounts of money talking business managers into buying or licensing their products, which often results in unsatisfying outcomes that do not come close to realizing the full potential of data science.

Talent is key – but unfortunately very rare and hard to identify. If you are trying to hire a data scientist these days, you face the serious risk of recruiting someone with the wrong or an insufficient skill set. On top of that, talent is even more crucial for small or medium-sized companies whose data science teams are likely to stay relatively small. Wasting one or two headcounts on the wrong profiles might render an entire team inefficient.

The demand for data scientists has risen dramatically in recent years [1, 2, 3, 4, 5]:

  • New technologies significantly improved our ability to manage and process data, including new types of data as well as large quantities of data.
  • A shift in mindset took place in business environments [6] regarding the utilization of data: from data as a reporting and business analytics necessity towards a valuable resource to enable smart decision making.
  • Last but not least exciting new intellectual developments have taken place in relevant related academic disciplines like machine learning [7, 8] or natural language processing.

Due to high demand, the term ‘data scientist’ has developed into a recruiting buzzword which is widely abused these days. Experienced lead data scientists share a painful experience when trying to fill a vacant position: out of a hundred applicants, typically only a handful match the requirements to qualify for an interview. Some candidates already feel qualified to call themselves ‘data scientist’ after finishing a six-week online course on a statistical computing language. Unqualified individuals often end up being hired by managers who themselves lack data science experience – leading to disappointment, frustration and an erosion of the term ‘data science’.

2         Who is a Data Scientist?

The data scientist skill set described in the following is based on the idea that it fundamentally rests on three pillars, each representing a skill set mostly orthogonal to the remaining two.

Following this idea, a solid data scientist needs to have the following three well-established skill sets:

  1. Technical skills,
  2. Analytical skills and
  3. Business skills.

Although technical skills are often the focus of data science role descriptions, they represent only the basis of a data scientist’s skill set. Analytical skills are much harder to acquire (and to test) but represent the crucial core of a data scientist’s ability to solve business problems utilizing scientific approaches. Business skills enable a data scientist to thrive in corporate environments.

2.1        Technical skills | Basis

Technical skills are the basis of a data scientist’s skill set. They include coding skills in languages such as R or Python and the ability to handle various computational architectures, including different types of databases and operating systems, but also other skills such as parallel computing or high performance computing.

The ability to handle data is a necessity for data scientists. It includes data management, data consolidation, data cleansing and data modelling amongst others. As there is often a high demand for these skills in corporate environments, it comes with the risk of focusing data scientists on data management tasks – thus distracting them from their actual work.

Almost more important than a candidate’s current technical skill set is their mindset. A key factor is intellectual agility, which gives candidates the ability to adapt to new computational environments in a short amount of time. This includes learning new coding languages, dealing with new types of databases or data structures, or keeping up with current technological developments like moving from relational databases to object-analytical approaches.

A data scientist with a static technical skill set will not thrive for long, as the discipline requires constant adaptation and learning. Strong candidates show a healthy appetite for developing their technical skills. When a candidate focuses on a tool discussion during an interview, it can be an indication of a narrow technical comfort zone with firm constraints.

Unfortunately, data science job profiles are often narrowly focused on technical skills. This is caused by a) the misperception that a successful data scientist’s secret lies exclusively in the ability to handle a specific set of tools and b) a lack of knowledge on the hiring manager’s end as to what the right skill set looks like in the first place. Focusing on technical skills when evaluating candidates poses a significant risk.

2.2        Analytical skills | Core

Scientific problem solving is an essential part of data science. Analytical skills represent the ability to succeed at this complex and highly non-linear discipline. Establishing thorough analytical skills requires a high amount of commitment and dedication (which is a limiting factor contributing to the global shortage of data scientists).

Analytical skills include expertise in academic disciplines like computer science, machine learning, advanced statistics, probability theory, causal inference, artificial intelligence, feature extraction and others (including strong mathematical skills). The list can be extended almost infinitely [9, 10, 11] and has been subject to many debates.

Covering all potentially useful analytical disciplines is a lifetime achievement for any data scientist and not a requirement for a successful candidate. Rather, a data scientist needs to have a healthy mix of analytical skills to succeed. For instance, an expert on Markov chains and an expert on Bayesian networks might both be able to develop a solution for the very same business problem, although utilizing their respective strengths and thus fundamentally different methods.

Analytical skills are typically developed through pursuing excellence in a highly quantitative academic field such as computer science, theoretical physics, computational math or bioinformatics. These skills are trained in academic institutions through exposure to hard, unsolved research problems that require a high level of intellectual curiosity and dedication to tackle and eventually solve. This is typically done over the course of a PhD.

Mastering a quantitative research question that nobody else has solved before is a non-linear process inevitably accompanied by failing over and over again. However, this process of scientific problem solving shapes the analytical mind and builds the expertise to later succeed in data science. It typically consists of iterative cycles of

  1. implementing and adapting an analytical approach
  2. applying it and observing it fail, then
  3. investigating the problems and
  4. building an understanding why it failed and where the limitations of the approach lie
  5. to come up with a better more refined approach.

These iterations are accompanied by key learnings and represent small steps towards the project goal, thus effectively zig-zagging towards the final solution.

A key requirement for analytical excellence is the right mindset: a data scientist needs to have an intrinsic, high level of curiosity and a strong appetite for intellectual challenges. Data scientists need to be able to pick up new methods and mathematical techniques in a short amount of time and then apply them to a problem at hand – often within the limited time frame of an ongoing project.

A good way to test analytical skills during an interview process is to provide potential candidates with a business problem and real data, and then ask them to spend a few hours working on it remotely. Discussing the code they wrote, the approach they chose, the solution they built and the insights they generated is a great way to evaluate their potential and at the same time give the candidates a first feeling for their potential new tasks.

2.3        Business Skills | Enablement

Business skills enable data scientists to thrive in a corporate environment.

It is important for data scientists to communicate effectively with business users in business language while avoiding a shift towards a conversation that is too technical. Healthy data science projects start and end with the discussion of a business problem supported by a valid business case.

Data scientists need to have a good understanding of business processes as it will be required to make sure the solution they build can be integrated and ultimately consumed by the respective business users. Careful and smart change management almost always plays a role in data science projects as well. A solid portion of entrepreneurship and out-of-the-box thinking helps data scientists to consider business problems from new angles utilizing analytical methods that their business partners do not know about. Last but not least, many big and successful data science projects that ultimately lead to significant impact were achieved through ‘connecting the dots’ by data scientists who built up internal knowledge by working on different projects across departments and functions.

Candidates who come with strong technical and analytical skills are often highly intelligent individuals looking for intellectual challenges. Even if they have no experience in an industry or in navigating a corporate environment, they can pick up the required business skills in a short amount of time – given that they have a healthy appetite for solving business cases. Building strong analytical or technical skills takes orders of magnitude longer.

When trying to determine whether a candidate has an intrinsic interest in business questions or whether he or she would rather prefer to work in an academic setting, it can help to ask yourself the following questions:

  • How well can the candidate explain data science methods like deep learning to business users?
  • When discussing a business problem can the candidate communicate effectively in business terms while thinking about potential mathematical or technical approaches?
  • Will the business users who collaborate with the data scientist in the future respect him or her as a partner at eye level?
  • Would you feel comfortable sending the candidate on their own to present to your manager?
  • Do you think the candidate will succeed in your business environment?

3         Recruiting

Data science requires a mix of different skills. In the end, this mix needs to be adapted to the requirements and the situation at hand, and to the business problems that represent the biggest potential value for your company. Big data, for instance, is a strong buzzword, but in many companies data is under-utilized to a degree that a data science team can focus on low-hanging fruit for one or two years in the form of small and structured data sets and at the same time already have a strong business impact.

A key characteristic of candidates that has not been mentioned so far and which can be hard to evaluate is attitude. Hiring data scientists for business consultant positions will require a different mindset and attitude than hiring for integration into an analytics unit or even to supplement a business team.

4         References

[1] NY Times, Data Science: The Numbers of Our Lives by Claire Cain Miller http://nyti.ms/1TfCFmX
[2] TechCrunch: How To Stem The Global Shortage Of Data Scientists http://tcrn.ch/1TUIqsB
[3] Bloomberg: Help Wanted: Black Belts in Data http://bloom.bg/1Xt8bTO
[4] McKinsey on US opportunities for growth http://bit.ly/1WAonmD
[5] McKinsey on big data and data science http://bit.ly/1VXQJHD
[6] Big Data at Work: Dispelling the Myths, Uncovering the Opportunities; Thomas H. Davenport; Harvard Business Review Press (2014)
[7] Andrew Ng on Deep Learning http://bit.ly/1Tg3g74
[8] Andrew Ng on Deep Learning Applications http://bit.ly/1Wza02H
[9] Data scientist Venn diagram by Drew Conway http://bit.ly/1Xd6MAn
[10] Swami Chandrasekaran’s data scientist skill map: http://bit.ly/1ZUGUIF
[11] Forbes: The best machine learning engineers have these 9 traits in common. http://onforb.es/1VXR9Og

 

Source: Data Scientist Skill Set – Data Science Central

15 Big Data and Analytics Predictions for 2017 from the Experts – CIO Korea


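Big data and analytics technologies, along with social, mobile, and cloud, are known as leading drivers of transformation in the digital era. If the star of the 2016 market was big data technology driving stronger BI, 2017 will be a year to watch for innovation in data and analytics. Here are 15 expert predictions for big data and analytics this year. John Schroeder, founder and senior executive at MapR Technologies, 2017

Source: 15 Big Data and Analytics Predictions for 2017 from the Experts – CIO Korea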

Why Is R the Best Data Science Language?


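R has recently become very popular among people who do data analysis. Statistics show that it has been one of the fastest-growing programming languages of the past ten years. In fact, it is still the language to recommend if you are starting out in data science, and it is a very popular, best-in-class data language. Why is R the best data science language right now?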

R consistently ranks among the best languages
One thing I want you to understand is that right now, R is one of the most highly regarded, highly ranked, and fastest growing languages in existence.

In many ways, R is the data language. In data science, it’s the language to beat (with only 1 or 2 serious contenders).

To understand why this is true, let’s look at the results of several important surveys and programming language rankings to see where R shakes out.

IEEE: R ranks #5

The world’s “largest association of technical professionals,” the IEEE, has created a ranking of programming languages for several years.

This IEEE ranking system uses a set of 12 metrics, including things like Google search volume, Google Trends, Twitter hits, GitHub repositories, Hacker News posts, and more.

Using this methodology, they rank several dozen programming languages and place them into several categories.
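The IEEE's exact weighting is not described here, so the following is only a rough sketch of the general idea behind a composite ranking: scale each metric to a common range, weight it, and sort by the total. The language names, metric names, counts, and weights are all placeholders, and max-normalization is an assumed choice rather than the IEEE's actual method.

```python
def composite_ranking(metrics, weights):
    """Rank items by a weighted sum of max-normalized metric values.

    metrics: {item: {metric_name: raw_value}}
    weights: {metric_name: weight}
    """
    # Largest raw value per metric, used to scale every metric to the 0..1 range.
    maxima = {
        name: max(values[name] for values in metrics.values())
        for name in weights
    }
    scores = {
        item: sum(
            weights[name] * values[name] / maxima[name]
            for name in weights
            if maxima[name] > 0  # skip metrics where every item scored zero
        )
        for item, values in metrics.items()
    }
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


# Placeholder languages and made-up counts, for illustration only.
example_metrics = {
    "lang_a": {"search_volume": 900, "repositories": 200},
    "lang_b": {"search_volume": 450, "repositories": 400},
    "lang_c": {"search_volume": 300, "repositories": 100},
}
example_weights = {"search_volume": 0.6, "repositories": 0.4}

ranking = composite_ranking(example_metrics, example_weights)
for position, (name, score) in enumerate(ranking, start=1):
    print(position, name, round(score, 3))
```

Changing the weights can reorder the list, which is one reason different ranking systems place the same language at different positions.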

In their review of the “Top Programming Languages” of 2016, R climbed to #5.

The IEEE methodology is quite comprehensive, so this is a strong indicator of R’s strength compared to other languages, and the relative value of learning R.

TIOBE: R ranks high with consistent upward trend

Another ranking system, the TIOBE index, creates a similar score and rank for various programming languages.

If we look at R’s performance on the TIOBE index, we can see a solid upward trend for almost a decade.

Keep in mind that the TIOBE index is structured to be “an indicator of the popularity of programming languages. The index is updated once a month. The ratings are based on the number of skilled engineers world-wide, courses and third party vendors. Popular search engines such as Google, Bing, Yahoo!, Wikipedia, Amazon, YouTube and Baidu are used to calculate the ratings.”

For December 2016, R has an overall rank of 17 (among all programming languages). Its maximum rank was #12 in May of 2015.

This suggests that currently, learning R is still an excellent option if you want to learn data science. It may arguably be the best option. (To be clear, Python ranks higher on the TIOBE index, but it’s harder to separate out web and software dev uses of Python from the strictly data-related uses of Python, so it may not be an apples to apples comparison.)

Redmonk: R is #12

Another frequently cited language ranking system is the RedMonk Programming Language Rankings, which are derived from popularity on GitHub (lines of code) and popularity on Stack Overflow (number of tags).

As of November 2016, R ranks number 13 among all programming languages.

Moreover, R has shown a consistent upward trend for several years:

Out of all the back half of the Top 20 languages, R has shown the most consistent upwards movement over time. From its position of 17 back in 2012, it has made steady gains over time, but had seemed to stall at 13 having stuck there for three consecutive quarters. This time around, however, R took over #12 from Perl which in turn dropped to #13. There’s still an enormous amount of Perl in circulation, but the fact that the more specialized R has unseated the language once considered the glue of the web says as much about Perl as it does about R. Which is irrelevant to R advocates, of course. Whatever the cause, R’s relatively unique Top 20 path is one for fans of the language to cheer.

– RedMonk Programming Language Rankings: June 2016
(emphasis mine)

O’Reilly: R is arguably the most common data programming language

Finally, O’Reilly Media has conducted a data science survey for the last several years, and they use the survey data to analyze data science trends. Among other things, they analyzed tool usage to identify which tools are most commonly used by data scientists.

In the 2016 survey report, R was the most common programming language (if we exclude SQL, which isn’t a programming language in the sense that I’m using it here). 57% of all respondents used R (compared to 54% using Python).

(As a side note, fully 70% of respondents used SQL. If you’re looking for another tool to learn after R, I’d suggest SQL.)

They also surveyed people to identify data visualization tools. They found that ggplot2 was the most common visualization tool. I’ll explain why I love ggplot2 in an upcoming blog post, but if we’re only tracking popularity, the O’Reilly survey suggests that ggplot2 is highly used (if not best in class).

R is excellent for learning data science
Beyond popularity, another reason that R is an excellent data science programming language is that it is excellent for learning data science.

R is a true “data language”

Part of the reason for this is the nature of the language itself.

R was ultimately created with statistics and data in mind. The R-Project describes R as a “[programming] language and environment for statistical computing” (emphasis mine).

R is a language that has statistics and data built into its DNA, so to speak.

In this sense, R is nearly unique among programming languages. It is a language that has been built for statistics. It’s been designed for data.

This has advantages when you’re learning data science, because almost any statistical test or technique can be found somewhere within base R or one of its packages.

The best books and resources use R

Related to the fact that R is a “statistical computing” language is the fact that many of the best books and learning materials have adopted R as the language of choice.

This is important. If you’re a beginner, and you’re just getting started in data science, you’ll have a lot to learn. To truly master data science, you’ll need to learn several sub-areas like probability, statistics, data visualization, data manipulation, and machine learning. All of these skill areas have theoretical foundations (which you’ll need to learn) but also practical techniques that you’ll need to execute by writing code.

That means that:

  • You need a language that has strong capabilities in each of these areas (visualization, manipulation, machine learning (AKA statistical learning), etc.)
  • You need a language for which there are high-quality training materials in these skill areas.

There are many data-related books and courses out there, but many of the best ones are centered on the R programming language.

Learn Probability with R

For example, two excellent books on probability use R for their “hands on” programming examples.

The first is Probability with Applications and R. This book is very approachable, readable, and well organized.

The second is Introduction to Probability which was developed from highly regarded statistics lectures at Harvard.

These are just two examples. If you dig deeper, you’ll find that among probability books that use a programming language, many (if not most) of them use R.

Learn frequentist statistics with R

The same can be said for statistics books.

Because R has statistics “built into its DNA,” many statistics textbooks use R as a learning tool.

For an introductory look at frequentist statistics, here’s one excellent book:

Statistics: an Introduction using R

Again, if you do a quick search on Amazon, and look at many intro stats books, you’ll find that if they use any programming language as a teaching tool, they are more likely to use R than almost any other language.

Learn Bayesian statistics with R

This becomes even more pronounced if you want a hands-on book for learning Bayesian statistics.

If you want to learn Bayesian stats and Bayesian analysis, nearly all of the books use R. There are some exceptions, like a few books that teach Bayesian analysis in C or Python, but overwhelmingly the best books that teach Bayesian statistics use R.

If you’re interested in Bayesian stats, check out these:

  • Introduction to Bayesian Statistics
  • Statistical Rethinking
  • Doing Bayesian Data Analysis

If you’re interested in Bayesian methods, these books are “best in class,” and they all use R.

Learn Data Visualization in R

When you’re learning data visualization, there’s a slightly larger range of programming languages to choose from, but I still maintain that most of the best learning materials use R.

If you’re learning data visualization, I highly recommend the work of Nathan Yau. His blog, flowingdata.com, frequently has data visualization tutorials for the R programming language. (I also recommend his book Data Points as a companion, though it teaches principles as opposed to programming language syntax.)

I also highly recommend several books by Hadley Wickham. First, if you’re interested in data visualization in R, you need to own the book ggplot2. It not only teaches you the syntax of this critical R data visualization library, but it will also reshape how you think about visualizing your data.

I also recommend R for Data Science. This book provides a great introduction to data visualization, but additionally teaches you a broad set of data tools in R. It’s excellent, and a “must own” R book.

Learn machine learning with R

Finally, if you want to get started with machine learning, many of the best machine learning books use R.

Although I will acknowledge that there’s more diversity among ML books with regard to their programming language, I still maintain that many of the best ones use R.

Here are two excellent introductions to machine learning that teach ML using the R programming language.

  • An Introduction to Statistical Learning
  • Applied Predictive Modeling

These books are both rigorous while still being approachable. They will teach you a little bit of theory (but not overwhelm you with math) while also showing you practical techniques.

Without question, these are the two books that I recommend most often for a beginner who wants to learn machine learning, and they both use R.

If you want to learn data science, R is excellent

Ultimately, the point here is that R is an excellent language for learning data science, because many of the best books (and other training materials) use R as the programming language of choice.

So if you’re a beginner in data science, I think that R is the best language – in large part – because of the quantity and quality of data-science learning materials.

A quick note on Python
There are other options, but the only one I’ll address here is Python.

As far as data science programming languages go, Python is the only serious alternative to R right now. (Other alternatives lack a well-developed package ecosystem or are not free/open source.)

I won’t explain my full thoughts on Python here, but I will say that it’s an excellent language. I love Python.

Having said that, for data science beginners, I still think that R is a slightly better choice, largely for the reasons I outlined above.

Again, I think that many of the best textbooks and training materials for foundational data science concepts (probability, statistics, Bayesian statistics, machine learning) are R-based books. That’s not to say that there aren’t excellent data science books that use Python, but I still think that there is a higher average quality among the R-based texts.

The other issue with Python is that many students get caught up in software development. That is, instead of learning statistics, data visualization, data manipulation, probability, etc., they end up spending their time learning about data structures, loops, flow control, object-oriented programming, and web frameworks. These skill areas can complement the core data science toolkit, but they are not data science topics in the sense that I’m using the term here. In fact, I recommend that most beginners learn software development concepts after learning basic data science subjects like data manipulation, visualization, analysis, etc.

Even though most beginners should learn software development principles later, many beginners who start with Python get sidetracked into these software development and web development areas. I think this happens, because in many ways, Python is geared towards these subjects. Most books on Python are not really data science books per se, but instead books on programming, development, etc. So a beginning data science student opens up a Python book intending to learn data science, but they end up going down the software/web development rabbit hole, and don’t come out for a few months (or years).

As much as I love Python, I think this is a risk for beginners. I think it’s better to start with R as it has statistics and data science more “built into its DNA.” With R, it’s easier to learn the foundations, and harder to get sidetracked.

Recap: Learn R if you want to learn data science
What you should take away is that for learning data science, R is arguably the best option. In terms of popularity, R is very highly ranked, and on an upward trajectory. Moreover, many of the best data science books and training materials use R.

If you want to get started learning data science, I recommend the following:

  • Learn R
  • Specifically, learn ggplot2, dplyr, tidyr, lubridate, and other tidyverse tools for data visualization and manipulation
  • Learn to use these tools together to analyze data
  • Once you have some background in these essential R packages, bulk up on probability, stats, and machine learning (I recommend the texts that I talk about in this blog post)

Discover how to master R
Do you want to rapidly master R?

Sign up for the email list at Sharp Sight.

Our posts are devoted to helping you rapidly master R, one of the best programming languages in the world, and possibly the best data science language you can learn.


Source: Why R is the best data science language to learn today – SHARP SIGHT LABS

Introducing Google Data Studio, a Cloud Reporting Tool


Data Studio, Google Cloud's reporting tool, is now free.

It is a reporting tool that can visualize data from a variety of sources, such as BigQuery, MySQL, and YouTube. See the link below.

 

Source: Easily Build Custom Reports and Dashboards – Google Data Studio – Google