R에서 데이터 프로젝트를 실행하기위한 주요한 힌트

R에서 데이터 프로젝트를 실행하기위한 주요한 힌트
R과 작업 할 때 힌트와 모범 사례에 대해 더 많이 묻습니다. 얼마나 깊고 전문적인지에 따라 어려운 작업이 될 수 있습니다. 그래서 나는 지난 수년 동안 나를 도왔던 11 점 (총리, 식스) 점수를 균형있게 유지하려고 노력했습니다. 마지막으로, 거기에 좋은 조언이 많이 있습니다. (많은 의미가 있습니다.) 꼭 체크 아웃해야합니다. 아래의 빠른 참조 섹션에서 몇 가지 예를 참조하십시오.
 

01. 항상 R 프로젝트를 사용하세요

인간 문명은 관례에 따라 세워졌습니다. R 스크립트에도 스크립트가 있습니다. 예를 들어

Any resident R script is written assuming that it will be run from a fresh R process with working directory set to the project directory. It creates everything it needs, in its own workspace or folder, and it touches nothing it did not create.모든 상주 R 스크립트는 작업 디렉토리가 프로젝트 디렉토리로 설정된 새 R 프로세스에서 실행될 것이라는 가정하에 작성됩니다. 그것은 자신의 작업 공간이나 폴더에 필요한 모든 것을 생성하며, 생성하지 않은 것은 아무것도 건드리지 않습니다.

You want your project to work. Not only now, but also in 5 years, even if folder and file paths have changed. Also, you want it to work when your collaborator runs it from their computer. Projects create convention that makes it possible.

대부분 작성된 프로젝트가 현재뿐만 아니라 폴더 및 파일 경로가 변경된 경우에도 5 년 후에 또한 공동 작업자가 자신의 컴퓨터에서 실행될 때 작동되기를 원할겁니다. 프로젝트는이를 가능하게하는 컨벤션을 만듭니다.

기본적으로 다음과 같은 환경을 만듭니다.

  1. 모든 소스코드와 결과는 하나의 설정 위치에 저장됩니다 (더 이상setwd()를 사용하지 마세요!)
  2. 상대 파일 경로가 생성 되므로 재현성이 향상됩니다.
  3. 깔끔한 R 환경은 당신이 그것을 열 때마다 만들어집니다 (더 이상rm(list = ls()))를 사용하지 마세요!)

 

확신이 안든다면 Jenny Bryan이 제공 한 보다 철저한 설명을 읽으십시오.

이제 새로운 프로젝트를 시작하십시오! 그렇지 않으면 컴퓨터가 화재로 위험에 처할 수 있습니다 …^^

02. Describe the purpose of your code / project.

Before you even load the packages or import the data, state the purpose and content of your script in hashed code.

### Data import and cleaning from sources A, B and C. 
### This script is part of Segmentation project. 

Naturally, you can update it along the way, but having such description in place from day one will guarantee that whoever has access to your script will instantly know what he/she is looking at. Also, if you leave it for later, the chances are that you’ll forget to do it altogether and believe me, in X months time when you go back to your code it won’t be that clear what were you trying to do.

03. Load all necessary packages in the beginning of your script & consider calling functions in a package via ::.

Ok, I know how it is: you load some basic packages to start with, but then the analysis takes you to wondrous and wonderful places that require more specialized libraries. So you install and load them along the way… WRONG! Whenever you have to load a new package, go back to the beginning of your script and load them there. This way whenever you (or someone else) have to run the script again, they will have ALL necessary libraries to run it and won’t get stuck somewhere in the middle of the execution because the function they called is not recognized. It’s coding. Small things matter.

Also, if you don’t want to load the whole package just for the sake of using a single function, consider specifying package directly via ::. Use this text cleaning exercise as an example:

library(readr)
library(dplyr)

tweets <- read_csv("tweets.csv")

all_tweets <- tweets %>% 
  rename(author = handle) %>% 
  select(author, text) %>% 
  mutate(text = qdapRegex::rm_url(text)) %>% #removes URLs from text
  na.omit()

Here, I’m expecting to use readr and dplyr packages regularly and I’m using qdapRegex library only once – to remove URL’s from my tweet data. In this case I call rm_url() once and I specify the package that it comes from. More about it here. Easy peasy.

04. Name your code sections and use them for quick navigation.

Your code will grow. Sometimes it will turn into an R script equivalent of the Bible, at least in terms of volume. To keep it organized, keep your code nice and tidy by assigning relevant code chunks to different sections that can be later folded/unfolded; you can also easily navigate yourself through chunks by using ‘drop-up’ menu at the bottom of the script screen.

You create a new code section by writing #### or ---- at the end of any comment that is to become a new code section.

See the image below for a simplified example:

05. Make your life easier and mark your code when you need to.

Did you know that if you click anywhere left from line numbers in RStudio, it will create a red mark? I think this is one of the most underused features in RStudio, and a very useful one, too! Imagine having to define a function using another part of a very long R script? Instead of finding a relevant part of the script and then spending seconds or minutes on finding the function that you were about to define, you first mark that function, go to the relevant bit of code and then have no problems finding the function to be defined again, as the red mark will tell you where to look. BTW, it works only in already saved R scripts and it doesn’t work in RMarkdown files.

06. Write your code as if someone was going to use it without communicating with you. From. Day. One.

I know, I know. We all mean to do it, but we rarely do. I always promise myself that I’ll be thoroughly commenting my code from the very beginning of the project but even these days I find it difficult to do it consistently. Why? Because:

  • analysis itself is more important (I’m telling myself).
  • I know what I’m doing.
  • I (usually) have no direct collaborators that are using the code.

But how short-sighted those arguments are! The reality is that

  • even most precious and relevant piece of analysis is useless if you or others don’t understand it (more about it below)
  • you know what you’re doing at the moment, but it won’t feel the same way in a month or two when you have moved on to another project but someone asked you an innocent question about how you defined that key variable… Our memory is fallible. Make sure you don’t have to rely on it with every single piece of code you produce
  • even if you don’t have active collaborators at the time of doing analysis, someone will have to use your code sooner or later. You’re not going to be in the same job/role/position for the rest of your life. You’re creating legacy that ONE DAY someone will use, no matter how far away it seems right now.

What makes good, reproducible code?

  • generous and clear comments
  • logical and efficient code
  • code that is ideally timed and tested

07. Name your files like a Pro.

Jenny Bryan is probably the first very high profile R-user I know who’s been actively highlighting the importance of things that not many people talk about: file naming is one of them. I’m not going to reinvent the excellent wheel that she has already invented, so I’m only going to summarize her advice here. But for more detail, please, please, have a look at Jenny’s slides.

So, what do all good file names have in common? They are:

MACHINE READABLE

This means that file names are

  • regular expression and globbing friendly. Meaning? You can search them using key words (with regex and/or stringr package). In order to make that easier/possible, remember to avoid spaces, punctuation, accented characters, case sensitivity.

This way it will be easy to search for files later and/or narrow file lists based on names.

Jenny’s example:

  • easy to compute on using delimiters: this means that the file names should have a consistent name structure, where each section of the name has its role and is separated from other name sections with delimiters. This way it will be easy to extract information from file names, e.g. by splitting.

See Jenny’s example below for clarification:

HUMAN READABLE

Human readable means exactly what is says on the tin:

Easy to figure out what the heck something is, based on its name, as simple as that 🙂

I should add, it’s easy to figure it out also for someone who doesn’t know your work very well (this is a clear reminder of the previous point). I love the example below:

THEY PLAY WELL WITH DEFAULT ORDERING

The reality is that your computer will sort your files for you, no matter whether you like/need it or not. Therefore:

  • put something numeric in your file name first – if the order of sourcing files doesn’t matter, stating when the file was created is always useful. Otherwise, you can indicate the logical order of the files (I’ll come back to that in point 9).
  • use the YYYY-MM-DD format for dates (it’s ISO 8601 standard!) EVEN when you’re American 🙂
  • left pad other numbers with zeroes – otherwise you’ll end up with 10 before 1 and so on.

Again, one of Jenny’s examples:

I couldn’t believe how much these little hacks improved the flow of my work. Paying attention to file names was also appreciated by my coworkers, I’m sure.

08. If you have to copy/paste excerpt of code 3 or more times, write a function for it.

.. as Hadley Wickham said many many times in his and Charlotte’s course on DataCamp. It doesn’t only teach you to write more elegant and efficient code, but also it makes it more readable for yourself and others.

To learn how to write good functions, you can have a sneak peak into the first chapter of the DataCamp course for free (the rest is available under paid subscription) or read this DataCamp tutorial.

09. With big, complex data projects use project pipeline.

I’m not sure if project pipeline is an official name for what I want to talk about, but for a sake of the argument, let’s call it a project pipeline 😉 Namely, sometimes running a full project from one script – with even clearest and informatively named code sections – is simply difficult, if not unfeasible. Particularly when big(ish) data is involved: imagine trying to run a hefty model on data that requires importing and cleaning, where import into R alone takes about an hour (real life scenario!). You wouldn’t possibly want to do it every time you open the project, right?

So what you do instead is write one script for data import and save the data.frame with useful data as an .RData file (using, e.g. save(data1, file = "data.RData"). Then start a new script for, let’s say, data cleaning where you load previously imported data.frame from the .RData file (using load("data.RData")). Then you clean the data and save it as yet another .RData file. Then you write the third script where you load the clean data.frame from the second .RData file and you use it to run your model. Jenny Bryan’s advice on file naming comes in handy here, as you want to name and order your scripts or their outputs logically to avoid surprises in the future.

This way you create a clear structure of building blocks, as well as of inputs and outputs. Additionally, once the data has been imported and cleaned, you can jump straight away into next steps of analysis/modelling without wasting any time on the first two steps.

I started using this approach in big and complex projects after reading this SO answerand I never looked back.

10. Never save your workspace.

Again, there were others that already said it before me and said it much better than I would have. So there you go:

Loading a saved workspace turns your R script from a program, where everything happens logically according to the plan that is the code, to something akin to a cardboard box taken down from the attic, full of assorted pages and notebooks that may or may not be what they seem to be. You end up having to put an inordinate trust in your old self. I don’t know about your old selves, dear reader, but if they are anything like mine, don’t save your workspace.

For details, see Martin Johnsson’s excellent blog post that I quoted above.

11. Before publishing/sharing your code, run it in the fresh workspace.

This almost goes without saying, but if I got a £1 for every time I forgot to do it… Basically, stuff happens when you write your code: you work with multiple files, maybe you loaded a package in one of them but forgot to do it in the rest of them? If that’s the case, you’ll be able to run all scripts in this particular session but if you send someone the code without the imported function… they will fail to run it. The same will happen if you try to run the script on its own, on your own machine.

Anyway, this doesn’t need much explaining. Simply make sure that you re-run your code in a fresh session before you take it further. Amen.

That’s it for now.

So.. these are my ‘prime hints’ – if someone had told me about them 2 years ago, my #rstats life would have been much much easier. By no means is this list complete or exhaustive, but my intention was to highlight the points that I personally found most useful in my daily fun with R. Hope it will help someone!

QUICK REFERENCE

General good practice

  1. Five Tips to Improve Your R Code (DataCamp)
  2. Workflow for statistical analysis and report writing (SO)
  3. R Best Practices: R you writing the R way! (By Milind Paradkar)
  4. Good practices in R programming (Indiana University)

Good practice for package building

  1. Good Practices for Writing R Packages (by Roman Tsegelskyi)
  2. R package development “best practices” (by stevenpollack)
  3. goodpractice – Advice on R Package Building (by MangoTheCat)

 

원본: Prime Hints For Running A Data Project In R

About The Author

THE-R

한국HP 엔터프라이즈 컨설팅 PM 경력 현 (주)아이리치그린 대표이사 현 데이터진흥원 빅데이터 기획 멘토

댓글 남기기

태그

error: Content is protected !!
%d 블로거가 이것을 좋아합니다: