Introduction to Machine Learning with R

This article contains the slides from my Introduction to Machine Learning with R workshop, held on June 28, 2018 at the University of Heidelberg, Germany.

The workshop covered the basics of machine learning. Using an example dataset, we worked through a standard workflow in R with the caret package and h2o:

  • Reading in data
  • Exploratory data analysis
  • Finding missing values
  • Feature engineering
  • Training/test split
  • Model training with random forests, gradient boosting, neural nets, etc.
  • Hyperparameter tuning


All analyses were performed in R using RStudio. For detailed session information, including R version, operating system, and package versions, see the sessionInfo() output at the end of this document.

All figures were created with ggplot2.

  • Required libraries
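The libraries could be loaded up front; the exact set below is an assumption based on the workflow described in this post:

```r
# Assumed package set for this workflow (install from CRAN if missing)
library(tidyverse)  # data handling and ggplot2 graphics
library(mice)       # imputation of missing values
library(caret)      # unified model training interface
library(corrplot)   # correlation plots
```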


Data preparation

The dataset

The dataset I am using in these example analyses is the Breast Cancer Wisconsin (Diagnostic) Dataset. The data was downloaded from the UC Irvine Machine Learning Repository.

The dataset looks at two predictor classes:

  • malignant or
  • benign breast mass.

The features characterise cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses:

  • Sample ID (code number)
  • Clump thickness
  • Uniformity of cell size
  • Uniformity of cell shape
  • Marginal adhesion
  • Single epithelial cell size
  • Number of bare nuclei
  • Bland chromatin
  • Number of normal nuclei
  • Mitosis
  • Classes, i.e. diagnosis
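Reading in the raw UCI file could look like the sketch below (the file name is an assumption; the raw file has no header row and codes missing values as "?"):

```r
# read the comma-separated UCI file; "?" marks missing values
bc_data <- read.table("breast-cancer-wisconsin.data.txt",
                      header = FALSE, sep = ",", na.strings = "?")
colnames(bc_data) <- c("sample_code_number", "clump_thickness",
                       "uniformity_of_cell_size", "uniformity_of_cell_shape",
                       "marginal_adhesion", "single_epithelial_cell_size",
                       "bare_nuclei", "bland_chromatin", "normal_nucleoli",
                       "mitosis", "classes")

# recode the diagnosis: 2 = benign, 4 = malignant
bc_data$classes <- ifelse(bc_data$classes == 2, "benign",
                          ifelse(bc_data$classes == 4, "malignant", NA))
```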



Missing data





Missing values can be imputed with the mice package.
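A minimal imputation sketch with mice, assuming the nine features sit in columns 2 to 10 of a data frame bc_data (as in the UCI file):

```r
library(mice)

# impute the feature columns; printFlag = FALSE suppresses progress output
imputed <- mice(bc_data[, 2:10], printFlag = FALSE)

# combine the response with the first completed dataset
bc_data_imp <- cbind(classes = bc_data$classes, complete(imputed, 1))
```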

More info and tutorial with code: https://shirinsplayground.netlify.com/2018/04/flu_prediction/

Data exploration

  • Response variable for classification


More info on dealing with unbalanced classes: https://shiring.github.io/machine_learning/2017/04/02/unbalanced

  • Response variable for regression


  • Features


  • Correlation graphs



Principal Component Analysis
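PCA can be run with base R's prcomp; a sketch assuming a complete (imputed) feature matrix bc_data_imp with the response in the first column:

```r
library(ggplot2)

# center and scale the features before extracting principal components
pca <- prcomp(bc_data_imp[, -1], center = TRUE, scale. = TRUE)

# plot the first two components, colored by diagnosis
pca_df <- data.frame(pca$x, classes = bc_data_imp$classes)
ggplot(pca_df, aes(x = PC1, y = PC2, color = classes)) +
  geom_point()
```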


Multidimensional Scaling


t-SNE dimensionality reduction
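A t-SNE embedding can be computed with the Rtsne package; a sketch under the same assumptions as above:

```r
library(Rtsne)
library(ggplot2)

set.seed(42)
tsne <- Rtsne(as.matrix(bc_data_imp[, -1]), perplexity = 30,
              check_duplicates = FALSE)

# Rtsne returns the 2D embedding in $Y
tsne_df <- data.frame(tsne$Y, classes = bc_data_imp$classes)
ggplot(tsne_df, aes(x = X1, y = X2, color = classes)) +
  geom_point()
```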


Machine Learning packages for R



Training, validation and test data
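With caret, a stratified training/test split can be sketched like this (object names are assumptions carried through the later examples):

```r
library(caret)

set.seed(42)
# stratified split: 70% training, 30% test
index <- createDataPartition(bc_data_imp$classes, p = 0.7, list = FALSE)
train_data <- bc_data_imp[index, ]
test_data  <- bc_data_imp[-index, ]
```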


Decision trees
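A single decision tree can be grown and plotted with rpart; a sketch assuming the train_data split from above:

```r
library(rpart)
library(rpart.plot)

set.seed(42)
fit <- rpart(classes ~ .,
             data = train_data,
             method = "class",  # classification tree
             control = rpart.control(xval = 10, minbucket = 2, cp = 0))
rpart.plot(fit)
```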



Random Forests

Random Forests predictions are based on the generation of multiple classification trees. They can be used for both classification and regression tasks. Here, I show a classification task.


When you specify savePredictions = TRUE, you can access the cross-validation results with model_rf$pred.
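A sketch of the training call with repeated cross-validation and saved predictions (resampling settings are one reasonable choice, not the only one):

```r
library(caret)

set.seed(42)
model_rf <- train(classes ~ .,
                  data = train_data,
                  method = "rf",
                  preProcess = c("scale", "center"),
                  trControl = trainControl(method = "repeatedcv",
                                           number = 10,
                                           repeats = 5,
                                           savePredictions = TRUE))

# hold-out predictions from each cross-validation resample
head(model_rf$pred)
```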





Dealing with unbalanced data

Luckily, caret makes it very easy to incorporate over- and under-sampling techniques with cross-validation resampling. We can simply add the sampling option to our trainControl and choose "down" for under-sampling (also called down-sampling). The rest stays the same as with our original model.
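The only change relative to the earlier call is the sampling argument; a sketch:

```r
library(caret)

set.seed(42)
model_rf_down <- train(classes ~ .,
                       data = train_data,
                       method = "rf",
                       preProcess = c("scale", "center"),
                       trControl = trainControl(method = "repeatedcv",
                                                number = 10,
                                                repeats = 5,
                                                savePredictions = TRUE,
                                                sampling = "down"))
```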




Feature Importance




  • predicting test data
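Predicting the held-out test set and summarizing performance could be sketched as:

```r
predictions <- predict(model_rf, newdata = test_data)

# compare predicted vs. true classes
confusionMatrix(predictions, as.factor(test_data$classes))
```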





Extreme gradient boosting trees

Extreme gradient boosting (XGBoost) is a faster and improved implementation of gradient boosting for supervised learning.

“XGBoost uses a more regularized model formalization to control over-fitting, which gives it better performance.” Tianqi Chen, developer of xgboost

XGBoost is a tree ensemble model, meaning it sums the predictions of a set of classification and regression trees (CART). In that respect, XGBoost is similar to Random Forests, but it uses a different approach to model training. It can be used for both classification and regression tasks. Here, I show a classification task.
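Through caret, XGBoost trees are available as method "xgbTree"; a sketch mirroring the random forest setup:

```r
library(caret)

set.seed(42)
model_xgb <- train(classes ~ .,
                   data = train_data,
                   method = "xgbTree",   # caret's interface to xgboost
                   preProcess = c("scale", "center"),
                   trControl = trainControl(method = "repeatedcv",
                                            number = 10,
                                            repeats = 5,
                                            savePredictions = TRUE))
```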




  • Feature Importance


  • predicting test data





Available models in caret
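The full list of models caret supports can be queried directly:

```r
library(caret)

# all model codes caret can train
names(getModelInfo())

# tuning parameters for a specific model
modelLookup("rf")
```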


Feature Selection

Performing feature selection on the whole dataset would lead to prediction bias; we therefore need to run the whole modeling process on the training data alone!

  • Correlation

Correlations between all features are calculated and visualised with the corrplot package. I am then removing all features with a correlation higher than 0.7, keeping the feature with the lower mean.
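A sketch of this filtering step with corrplot and caret's findCorrelation (assuming the response is the first column of train_data):

```r
library(caret)
library(corrplot)

corr_mat <- cor(train_data[, -1])
corrplot(corr_mat, order = "hclust")

# indices of features with pairwise correlation above 0.7;
# findCorrelation drops the member with the larger mean absolute correlation
highly_cor <- findCorrelation(corr_mat, cutoff = 0.7)
train_data_cor <- cbind(classes = train_data$classes,
                        train_data[, -1][, -highly_cor])
```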







  • Recursive Feature Elimination (RFE)

Another way to choose features is with Recursive Feature Elimination. RFE uses a Random Forest algorithm to test combinations of features and rate each with an accuracy score. The combination with the highest score is usually preferred.
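With caret this is the rfe() function; a sketch assuming nine candidate features:

```r
library(caret)

set.seed(7)
results_rfe <- rfe(x = train_data[, -1],
                   y = as.factor(train_data$classes),
                   sizes = 1:9,  # feature-subset sizes to evaluate
                   rfeControl = rfeControl(functions = rfFuncs,  # random forest
                                           method = "cv",
                                           number = 10))

# features chosen by the best-scoring combination
predictors(results_rfe)
```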





  • Genetic Algorithm (GA)

The Genetic Algorithm (GA) has been developed based on evolutionary principles of natural selection: It aims to optimize a population of individuals with a given set of genotypes by modeling selection over time. In each generation (i.e. iteration), each individual’s fitness is calculated based on their genotypes. Then, the fittest individuals are chosen to produce the next generation. This subsequent generation of individuals will have genotypes resulting from (re-) combinations of the parental alleles. These new genotypes will again determine each individual’s fitness. This selection process is iterated for a specified number of generations and (ideally) leads to fixation of the fittest alleles in the gene pool.

This concept of optimization can be applied to non-evolutionary models as well, like feature selection processes in machine learning.
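caret implements GA-based feature selection as gafs(); a sketch (the small number of generations is only to keep the run short):

```r
library(caret)

set.seed(27)
model_ga <- gafs(x = train_data[, -1],
                 y = as.factor(train_data$classes),
                 iters = 10,   # number of generations; kept small for speed
                 gafsControl = gafsControl(functions = rfGA,  # random forest fitness
                                           method = "cv",
                                           number = 10))
model_ga$optVariables
```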

Hyperparameter tuning with caret

  • Cartesian Grid
  • mtry: Number of variables randomly sampled as candidates at each split.

  • Random Search
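Both tuning strategies can be sketched with the same train() call; only the grid/search settings differ:

```r
library(caret)

# Cartesian grid: try every listed value of mtry
grid <- expand.grid(mtry = 1:9)
set.seed(42)
model_rf_grid <- train(classes ~ ., data = train_data,
                       method = "rf",
                       tuneGrid = grid,
                       trControl = trainControl(method = "repeatedcv",
                                                number = 10, repeats = 5))

# Random search: sample tuneLength candidate values instead
set.seed(42)
model_rf_random <- train(classes ~ ., data = train_data,
                         method = "rf",
                         tuneLength = 10,
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 10, repeats = 5,
                                                  search = "random"))
```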

Grid search with h2o

The R package h2o provides a convenient interface to H2O, which is an open-source machine learning and deep learning platform. H2O distributes a wide range of common machine learning algorithms for classification, regression and deep learning.

Training, validation and test data


Random Forest
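A sketch of the h2o workflow: splitting into training/validation/test frames and running a grid search over random forest hyperparameters (the hyperparameter values are illustrative choices):

```r
library(h2o)
h2o.init()

# assumes bc_data from earlier; split 70/15/15
bc_h2o <- as.h2o(bc_data)
splits <- h2o.splitFrame(bc_h2o, ratios = c(0.7, 0.15), seed = 42)
train <- splits[[1]]; valid <- splits[[2]]; test <- splits[[3]]

y <- "classes"
x <- setdiff(colnames(bc_h2o), y)

# grid search over a few random forest hyperparameters
rf_grid <- h2o.grid(algorithm = "randomForest",
                    x = x, y = y,
                    training_frame = train,
                    validation_frame = valid,
                    hyper_params = list(ntrees = c(50, 100, 200),
                                        max_depth = c(10, 20, 30)),
                    seed = 42)

# rank the grid models by validation log loss
h2o.getGrid(rf_grid@grid_id, sort_by = "logloss", decreasing = FALSE)
```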