[Book Introduction] Efficient R Programming

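In the crowded market for data science and R language books, Lovelace and Gillespie's Efficient R Programming (2016) stands out. Over ten comprehensive chapters, the authors cover the main principles of developing efficient R programs. Unless you are a member of the R core development team, you will find this book useful, whether you are a novice R programmer or an advanced data scientist or engineer. The book is full of useful tips and techniques that help improve both the efficiency of your R programs and the efficiency of your development process. Although I have used R daily for more than four years, every chapter of this book gave me new insights into how to improve my R code while solidifying my understanding of techniques I had already learned. Each chapter of Efficient R Programming is devoted to a single topic, includes a “Top five tips” list, covers a variety of packages and techniques, and contains useful exercises and problem sets for consolidating key insights.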

In Chapter 1. Introduction, the authors orient the audience to the key characteristics of R that affect its efficiency, compared to other programming languages. Importantly, the authors address R efficiency not just in the expected sense of algorithmic speed and complexity, but broaden its scope to include programmer productivity and how it relates to programming idioms, IDEs, coding conventions, and community support – all things that can improve the efficiency of writing and maintaining code. This is doubly important for a language like R, which is notoriously flexible in its ability to solve problems in multiple ways. The first chapter concludes by introducing the reader to two valuable packages: (1) microbenchmark, an accurate benchmarking tool with nanosecond precision; and (2) profvis, a handy tool for profiling larger chunks of code. These two packages are repeatedly used throughout the remainder of the book to illustrate key concepts and highlight efficient techniques.

In Chapter 2. Efficient Setup, the reader is introduced to techniques for setting up a development environment that facilitates an efficient workflow. Here the authors cover choices in operating system, R version, R start-up, alternative R interpreters, and how to maintain up-to-date packages with tools like packrat and installr. I found their overview of the R startup process particularly useful, as the authors taught me how to modify my .Renviron and .Rprofile files to cache external API keys and customize my R environment, for example by adding alias shortcuts to commonly used functions. The chapter concludes by discussing how to set up and customize the RStudio environment (e.g., modifying code editing preferences, editing keyboard shortcuts, and turning off restoring .RData to help prevent bugs), which can greatly improve individual efficiency.
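As a rough illustration of the startup customizations mentioned above, here is a minimal sketch of what an .Rprofile fragment could look like. The alias names `h` and `s` and the variable `EXAMPLE_API_KEY` are hypothetical, chosen only for this example; they are not taken from the book.

```r
# Hypothetical .Rprofile fragment (names are illustrative only).

# Short aliases for commonly used functions.
h <- utils::head
s <- base::summary

# An API key would normally live in .Renviron as a line such as
#   EXAMPLE_API_KEY=abc123
# Sys.setenv() stands in for that file here so the sketch runs on its own.
Sys.setenv(EXAMPLE_API_KEY = "abc123")

# Read the key back without hard-coding it in analysis scripts.
api_key <- Sys.getenv("EXAMPLE_API_KEY")
```

Keeping secrets in .Renviron rather than in scripts means the key stays out of version control while remaining available to every R session.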

Chapter 3. Efficient Programming introduces the reader to efficient programming by discussing “big picture” programming techniques and how they relate to the R language. This chapter will most likely be beneficial to established programmers who are new to R, as well as to data scientists and analysts who have limited exposure to programming in a production environment. In this chapter the authors introduce the “golden rule of R programming” before delving into the usual suspects of inefficient R code. Usefully, the book illustrates multiple ways of performing the same task (e.g., data selection) with different code snippets, and highlights the performance differences through benchmarked results. Here we learn about the pitfalls of growing vectors, the benefits of vectorization, and the proper use of factors. The chapter wraps up with the requisite overview of the apply function family, before discussing the use of variable caching (package memoise) and byte compilation as important techniques in writing fast, responsive R code.
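To make the point about growing vectors and vectorization concrete, here is a small base R sketch. It is my own illustration rather than code from the book, and the function names `grow`, `prealloc`, and `vectorised` are made up for this example.

```r
# Three ways to compute the squares of 1..n (illustrative sketch).

# 1. Growing a vector: every c() call copies the vector built so far.
grow <- function(n) {
  out <- NULL
  for (i in seq_len(n)) out <- c(out, i^2)
  out
}

# 2. Preallocating: the result vector is created once and filled in place.
prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2
  out
}

# 3. Vectorised: a single call operates on the whole vector at once.
vectorised <- function(n) seq_len(n)^2

n <- 10000
res_grow <- grow(n)
res_pre <- prealloc(n)
res_vec <- vectorised(n)

# Timings vary by machine, so no expected values are shown here.
timings <- list(
  grow = system.time(grow(n)),
  prealloc = system.time(prealloc(n)),
  vectorised = system.time(vectorised(n))
)
```

All three functions return the same values; only the cost of getting there differs, which is the kind of comparison the chapter makes with benchmarked results.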

Chapter 4. Efficient Workflow will be of primary use to junior-level programmers, analysts, and project managers who haven’t had enough time or practice to develop their own efficient workflows. This chapter discusses the importance of project planning, audience, and scope before delving into common tools that facilitate project management. In my opinion, one of the best aspects of R is the huge, maddeningly broad number of packages that are available on CRAN and GitHub. The authors provide useful advice and techniques for identifying the packages that will be of most use to your project. A brief discussion on the use of R Markdown and knitr concludes this chapter.

Chapter 5. Efficient Input/Output is devoted to efficient read/write operations. Anybody who has ever struggled with loading a big file into R for analysis will appreciate this discussion and the packages covered in this chapter. The rio package, which can handle a wide variety of common data file types, provides a useful starting point for exploratory work on a new project. Other packages that are discussed (including readr and data.table) provide more efficient I/O than the functions in base R. The chapter ends with a discussion of two new file formats and their associated packages (feather and RProtoBuf), which can be used for fast, efficient, cross-language serialized data I/O.

Chapter 6. Efficient Data Carpentry introduces what are, in my opinion, the most useful R tools for data munging – what Lovelace and Gillespie prefer to call by the more admirable term “data carpentry.” This chapter could more aptly be titled the “Tidyverse” or the “Hadleyverse”, for most of the tools discussed in this chapter were developed by prolific R package writer, Hadley Wickham. Sections of the chapter are devoted to each of the primary packages of the tidyverse: tibble, a more useful and user-friendly data.frame; tidyr, used for reshaping data between short and long forms; stringr, which provides a consistent API over obtuse regex functions; dplyr, used for efficient data processing including filtering, sorting, mutating, joining, and summarizing; and of course magrittr, for piping all these operations together with %>%. A brief section on package data.table rounds out the discussion on efficient data carpentry.

Chapter 7. Efficient Optimization begins with the requisite optimization quote by computer scientist Donald Knuth:

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.

In this chapter, the authors introduce profvis, and they illustrate the utility of this package by showing how it can be used to identify bottlenecks in a Monte Carlo simulation of a Monopoly game. The authors next examine alternative methods in base R that can be used for greater efficiency. These include discussion of if() vs. ifelse(), sorting operations, AND (&) and OR (|) vs. && and ||, row/column operations, and sparse matrices. The authors then apply these tricks to the Monopoly code to show a 20-fold decrease in run time. The chapter concludes with a discussion and examples of parallelization, and the use of Rcpp as an R interface to underlying fast and efficient C++ code.
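The distinction between if() and ifelse(), and between the elementwise and short-circuit logical operators, can be sketched in a few lines of base R. This is my own minimal example, not the book's Monopoly code, and the helper names `describe` and `has_values` are hypothetical.

```r
x <- c(-2, 0, 3)

# ifelse() is vectorised: it returns one result per element of the test.
labels <- ifelse(x > 0, "positive", "non-positive")

# if() works on a single TRUE/FALSE value, so it suits scalar decisions.
describe <- function(value) {
  if (value > 0) "positive" else "non-positive"
}
single <- describe(x[3])

# & and | compare element by element and return a logical vector.
in_range <- (x > -1) & (x < 2)

# && and || look at single values and short-circuit: here the second
# condition is never evaluated when the first one is FALSE.
has_values <- function(v) !is.null(v) && length(v) > 0
empty_check <- has_values(NULL)
full_check <- has_values(x)
```

Using && in scalar control flow avoids evaluating conditions that are not needed, while & remains the right tool for filtering whole vectors.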

I found the chapter Efficient Hardware to be the least useful in the book (spoiler alert: add more RAM or migrate to cloud-based services), though the chapter on Efficient Collaboration will be particularly useful for novice data scientists lacking real-world experience developing data artifacts and production applications in a distributed, collaborative environment. In this chapter, the authors discuss the importance of coding style, code comments, version control, and code review. The final chapter, Efficient Learning, will find appreciative readers among those just getting started with R (and if this describes you, I would suggest that you start with this chapter first!). Here the authors discuss using and navigating R’s excellent internal help utility, as well as the importance of vignettes and source code in learning and understanding. After briefly introducing swirl, the book concludes with a discussion of online resources, including Stack Overflow; the authors thankfully provide the newbie with important information on how to ask the right questions and the importance of providing a great reproducible R example.

In summary, Lovelace and Gillespie’s Efficient R Programming does an admirable job of illustrating the key techniques and packages for writing efficient R programs. The book will appeal to a wide audience from advanced R programmers to those just starting out. In my opinion, the book hits that pragmatic sweet spot between breadth and depth, and it usefully contains links to external resources for those wishing to delve deeper into a specific topic. After reading this book, I immediately went to work refactoring a Shiny dashboard application I am developing and several internal R packages I maintain for our data science team. In a matter of a few short hours, I witnessed a 5 to 10-fold performance increase in these applications just by implementing a couple of new techniques. I was particularly impressed with the greatly improved end-user performance and the ease with which I implemented intelligent caching with the memoise package for a consumer decision tree application I am developing. If you care deeply about writing beautiful, clean, efficient code and bringing your data science to the next level, I highly recommend adding Efficient R Programming to your arsenal.

The book is published by O’Reilly Media and is available online at the authors’ website, as well as through Safari.

 

Source: Review of Efficient R Programming | R-bloggers

The Hitchhiker’s Guide to Ggplot2 in R

First of all, you can get the book here.

This is a book that may look complete, but changes in R packages always demand changes in the examples contained within the book. This is why the electronic format is perfect for the purpose of this work. Trapping it inside a dead tree book is ultimately a waste of time and resources in my own view.

Aside from being my first book, this is also my first collaborative work. I wrote it in a 50-50 collaboration with Jodie Burchell. Jodie is an amazing data scientist. I highly recommend reading her blog Standard Error where you can find really good material on Reproducible Research and more.

This is a technical book. The scope of the book is to go straight to the point and the writing style is similar to a recipe with detailed instructions. It is assumed that you know the basics of R and that you want to learn how to create beautiful plots.

Each chapter will explain how to create a different type of plot, and will take you step-by-step from a basic plot to a highly customised graph. The chapters’ order is by degree of difficulty.

Every chapter is independent from the others. You can read the whole book or go to a section of interest and we are sure that it will be easy to understand the instructions and reproduce our examples without reading the first chapters.

In total this book contains 237 pages (letter paper size) of different recipes to obtain an acceptable aesthetic result. You can download the book for free (yes, really!) from Leanpub.

How did the book start?

Almost a year ago I finished writing the eleventh tutorial in a series on using ggplot2 I created with Jodie Burchell.

I asked Jodie to co-author some blog entries when I found her blog and realised that it reflected my own interest in Data Science. The book grew out of those entries on our blogs.

A few weeks later those tutorials evolved into the shape of an ebook. The reason was that what we had started to write was unexpectedly successful. We even had retweets from important people in the R community such as Hadley Wickham. Finally the book was released on Leanpub.

We also included a pack that contains the Rmd files that we used to generate every chart that is displayed in the book.

Why Leanpub?

Leanpub is a platform where you can easily write your book by using MS Word among other writing software and it even has GitHub and Dropbox integration. We went for R Markdown with LaTeX output, and that means that Leanpub is both easy to use and flexible at the same time.

Even more, Leanpub enables the audience to download your books for free, if you allow it, or you can define a price range with a suggested price indication. The website gives the authors a royalty of 90% minus 50 cents per sale (compared to other platforms this is convenient for the authors). You can also sell your books with additional exercises, lessons in video, etc.

For example, last year I updated all the examples contained in the book just a few days after ggplot2 2.2 was released, and my readers received a notification email right after I uploaded the new version. Readers can download newer versions of the book for free, whether or not they paid for it.

If that’s not enough, Leanpub allows you to create bundles and sell your books as a set, or you can charge a different price for your book plus additional material such as R Markdown notebooks, instructional videos and more.

What did I learn from my first book?

At the moment I am teaching Data Visualization, and from my students I learned that good visualizations come after they learn the visualization concepts. Coding clearly helps, but coding comes after the fundamentals.

It would be better to teach visualization fundamentals first rather than in parallel with coding, and this applies especially when part of your audience has never written code before.

I got a lot of feedback from my students last term. That was really helpful for improving the book and dividing some steps into smaller pieces to make the Grammar of Graphics easier to understand.

The interested reader may find some remarkable books that can be read before mine. I highly recommend:

Those are really good books that show the fundamentals of Data Visualisation and provide the key concepts and rules needed to communicate effectively with data.

Source: The Hitchhiker’s Guide to Ggplot2 in R | R-bloggers

[Book Introduction] A free introduction to statistics and data science with R | ModernDive

Source code: GitHub source code

Source: ModernDive: A free introduction to statistics and data science with R | R-bloggers

R Packages | Hadley Wickham


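Published in April 2015, this book drew attention because its author is Hadley Wickham.

Hadley Wickham

He has developed so many important R packages that it is hard to think about R without him, and he is a true data scientist respected by countless data analysts. He is currently Chief Scientist at RStudio and an adjunct professor of statistics at the University of Auckland, Stanford University, and Rice University. With the goal of building tools that make data science easier, faster, and more fun, he has made major contributions to the development of key R packages. The main R packages he has created are as follows.

DATA SCIENCE
– ggplot2 for visualising data.
– dplyr for manipulating data.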
– tidyr for tidying data.
– stringr for working with strings.
– lubridate for working with date/times.

DATA IMPORT
– readr for reading .csv and fwf files.
– readxl for reading .xls and .xlsx files.
– haven for SAS, SPSS, and Stata files.
– httr for talking to web APIs.
– rvest for scraping websites.
– xml2 for importing XML files.

SOFTWARE ENGINEERING
– devtools for general package development.
– roxygen2 for in-line documentation.
– testthat for unit testing.

 

Source: Welcome · R packages

The Art of Data Science | Roger D. Peng and Elizabeth Matsui


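Roger D. Peng is a professor of biostatistics at the Johns Hopkins University Bloomberg School of Public Health. He is also well known to us as a co-founder of and lecturer in the Johns Hopkins Data Science Specialization on Coursera. In 2016 he received the Mortimer Spiegelman Award from the American Public Health Association, which honors a statistician who has made outstanding contributions to public health.

He recently published a book on bookdown.org, which I am sharing here. He also publishes free books that help with learning R on online publishing sites such as leanpub.com, which helps many people who work in data analysis.

Elizabeth Matsui is a professor of pediatrics, epidemiology, and environmental health sciences at Johns Hopkins University and a pediatric allergist/immunologist. She directs a data management and analysis center with Dr. Peng that supports epidemiologic studies and clinical trials, and she is a co-founder of Skybrude Consulting, LLC, a data science consulting firm.

Before reading the book, take a look at the sample to see which topics it covers.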

Source: The Art of Data Science

R Week 2016 | PACKT Books


R was made for data, packed with features that make it the premier language of data professionals. A powerful choice for machine learning, scientific computing, and data analysis, R was built with statisticians in mind. This week, dive into the power of R with 50% off our top titles, and our amazing any 5-for-$50 bundle deal. From machine learning to statistical computing – do more with R all this week.


Source: R Week 2016 | PACKT Books

Fundamentals of R Programming and Statistical Analysis


A comprehensive guide to working on statistical data with the R language

Source: Fundamentals of R Programming and Statistical Analysis [Video]