This package contains a set of functions related to exploratory data analysis, data preparation, and model performance. It is used by people coming from business, research, and teaching (professors and students).
funModeling is intimately related to the Data Science Live Book -Open Source- (2017) in the sense that most of its functionality is used to explain different topics addressed by the book.
📗 The paperback version is being prepared; get notified through the newsletter or Twitter.
Opening the black-box
Some functions have in-line comments so the user can open the black box and learn how they were developed, or tune or improve any of them.
All the functions are well documented, explaining all the parameters with the help of many short examples. R documentation can be accessed by: help("name_of_the_function").
Important changes in the latest version, 1.6.7 (relevant only if you were using previous versions):
As of the latest version, 1.6.7 (Jan 21-2018), the parameters str_input, str_target and str_score are renamed to input, target and score, respectively. The functionality remains the same.
If you were using the old parameter names in production, they will keep working until the next release. This means that, for now, you can use either str_input or input, for example.
The other important change was in discretize_get_bins, which is detailed later in this document.
About this quick-start
This quick-start is focused only on the functions. All explanations around them, and the how and when to use them, can be accessed by following the “Read more here.” links below each section, which redirect you to the book.
Below are most of the funModeling functions, divided by category.
Exploratory data analysis
df_status: Dataset health status
Use case: analyze the zeros, missing values (NA), infinity, data type, and number of unique values for a given dataset.
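To make these metrics concrete, here is a minimal base R sketch that computes the same kind of per-column summary on a toy data frame. It is not the package's implementation: the data values and the output column names are assumptions chosen for illustration.

```r
# Toy data frame (hypothetical values, for illustration only)
toy <- data.frame(
  age    = c(25, 0, NA, 40, 0),
  income = c(1000, Inf, 2500, NA, 1000),
  city   = c("A", "B", "A", NA, "B"),
  stringsAsFactors = FALSE
)

# One row of health metrics per column, similar in spirit to df_status
health_status <- data.frame(
  variable = names(toy),
  q_zeros  = sapply(toy, function(x) sum(x == 0, na.rm = TRUE)),
  q_na     = sapply(toy, function(x) sum(is.na(x))),
  q_inf    = sapply(toy, function(x) sum(is.infinite(x))),
  type     = sapply(toy, function(x) class(x)[1]),
  unique   = sapply(toy, function(x) length(unique(x[!is.na(x)]))),
  row.names = NULL
)
print(health_status)
```

With the package installed, the equivalent call would simply be df_status on the data frame.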
path_out indicates the path directory; if it has a value, then the plot is exported in jpeg.
If input is empty, then it runs for all numeric variables (skipping the categorical ones).
input must be numeric and target must be categorical.
target can be multi-class (not only binary).
categ_analysis: Quantitative analysis for binary outcome
Profiles a binary target based on a categorical input variable, reporting the representativeness (perc_rows) and the accuracy (perc_target) for each value of the input variable; for example, the rate of flu infection per country.
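The flu-per-country example can be sketched in base R as follows. This is only an illustration of the idea, not the output of categ_analysis: the data is made up, and target_rate is a hypothetical column name for the infection rate within each country.

```r
# Toy data (hypothetical): flu infection by country
flu <- data.frame(
  country = c("FR", "FR", "FR", "FR", "UK", "UK", "UK", "UK", "UK", "UK"),
  has_flu = c(1, 1, 0, 0, 1, 0, 0, 0, 0, 0)
)

# Rows per country, share of all rows, and flu rate within each country
q_rows      <- tapply(flu$has_flu, flu$country, length)
perc_rows   <- q_rows / nrow(flu)
target_rate <- tapply(flu$has_flu, flu$country, mean)

profile <- data.frame(
  country     = names(q_rows),
  q_rows      = as.vector(q_rows),
  perc_rows   = as.vector(perc_rows),
  target_rate = as.vector(target_rate)
)
print(profile)
```

A category with a high rate but a very small perc_rows deserves caution, since the rate is estimated from few rows.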
discretize_get_bins + discretize_df: Convert numeric variables to categoric
We need two functions: discretize_get_bins, which returns the thresholds for each variable, and then discretize_df, which takes the result from the first function and converts the desired variables. The binning criterion is equal frequency.
Example converting only two variables from a dataset.
# Load the package (provides the heart_disease data and the discretize functions)
library(funModeling)
# Step 1: Getting the thresholds for the desired variables: "max_heart_rate" and "oldpeak"
d_bins=discretize_get_bins(data=heart_disease, input=c("max_heart_rate", "oldpeak"), n_bins=5)
##  "Variables processed: max_heart_rate, oldpeak"
# Step 2: Applying the threshold to get the final processed data frame
heart_disease_discretized=discretize_df(data=heart_disease, data_bins=d_bins, stringsAsFactors=TRUE)
##  "Variables processed: max_heart_rate, oldpeak"
The following image illustrates the result. Please note that the variable name remains the same.
This two-step procedure is designed to be used in production with new data.
The lower bound of the first bin and the upper bound of the last bin will be -Inf and Inf, respectively.
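The reason for the two steps and the infinite outer bounds can be seen in a small base R sketch. The helpers get_bins and apply_bins are hypothetical stand-ins written for this illustration, not the package functions; they assume equal-frequency cut points taken from quantiles.

```r
# Step 1 (hypothetical helper): equal-frequency cut points from training data,
# with the outer bounds replaced by -Inf and Inf
get_bins <- function(x, n_bins) {
  probs <- seq(0, 1, length.out = n_bins + 1)
  cuts  <- unique(quantile(x, probs = probs, na.rm = TRUE, names = FALSE))
  c(-Inf, cuts[-c(1, length(cuts))], Inf)
}

# Step 2 (hypothetical helper): apply stored cut points to any vector
apply_bins <- function(x, bins) {
  cut(x, breaks = bins, include.lowest = TRUE)
}

train <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
bins  <- get_bins(train, n_bins = 5)

# New data with values outside the training range still gets a bin
new_data <- c(-50, 3, 7.5, 1000)
binned   <- apply_bins(new_data, bins)
print(bins)
print(table(binned))
```

Because the thresholds are stored separately from the data, the same bins can be reused later on unseen rows, and the infinite bounds prevent out-of-range values from becoming NA.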
A fix in the latest funModeling release (1.6.7) may change the output in certain scenarios. Please check the results if you were using version 1.6.6. More info about this here.
Unlike discretize_get_bins, this function doesn’t insert -Inf and Inf as the min and max value respectively.
range01: Scales variable into the 0 to 1 range
Convert a numeric vector into a scale from 0 to 1 with 0 as the minimum and 1 as the maximum.
# checking results
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.13 0.17 0.26 1.00
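The scaling itself is standard min-max normalization. A minimal base R sketch follows; scale_01 is a hypothetical re-implementation written for this example, not the source of range01.

```r
# Hypothetical re-implementation of min-max scaling (not the package source)
scale_01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

x <- c(10, 15, 20, 30)
scaled <- scale_01(x)
print(scaled)
```

Note that this sketch returns NaN for a constant vector, because the range in the denominator is zero.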
Outliers data preparation
hampel_outlier and tukey_outlier: Gets outliers threshold
Both functions return a two-value vector with the thresholds beyond which values are considered outliers. The functions tukey_outlier and hampel_outlier are used internally in prep_outliers.
prep_outliers takes a data frame and returns the same data frame with the transformations applied to the variables specified in the input parameter. It also works with a single vector.
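As a rough illustration of what such thresholds look like, the sketch below uses the textbook forms of both methods in base R. The function names, the multipliers (k = 3 for Hampel, k = 1.5 for Tukey) and the unscaled MAD are assumptions made here; the package's own defaults may differ, so check help("hampel_outlier") and help("tukey_outlier").

```r
# Hampel-style thresholds: median +/- k * MAD
# (k = 3 and the unscaled MAD are assumptions made for this sketch)
hampel_thresholds <- function(x, k = 3) {
  med <- median(x, na.rm = TRUE)
  dev <- mad(x, constant = 1, na.rm = TRUE)
  c(bottom_threshold = med - k * dev, top_threshold = med + k * dev)
}

# Tukey-style thresholds: quartiles -/+ k * IQR
# (k = 1.5 is the classic boxplot fence; the package default may differ)
tukey_thresholds <- function(x, k = 1.5) {
  q   <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(bottom_threshold = q[1] - k * iqr, top_threshold = q[2] + k * iqr)
}

x <- c(2, 4, 4, 5, 6, 7, 8, 9, 100)
print(hampel_thresholds(x))
print(tukey_thresholds(x))
```

Both methods rely on robust statistics (median, MAD, quartiles), so the extreme value 100 barely affects the thresholds that end up flagging it.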
Example considering two variables as input:
# Get threshold according to Hampel's method
## bottom_threshold top_threshold
## 86 220
# Apply function to stop outliers at the threshold values
data_prep=prep_outliers(data = heart_disease, input = c('max_heart_rate','resting_blood_pressure'), method = "hampel", type='stop')
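What "stopping" outliers at the thresholds means can be sketched in base R as a simple clamp. The helper stop_at_thresholds and the sample values are hypothetical; only the thresholds 86 and 220 come from the output above.

```r
# Hypothetical helper: clamp values to [bottom, top], leaving NA untouched
stop_at_thresholds <- function(x, bottom, top) {
  pmin(pmax(x, bottom), top)
}

# Illustrative values; 86 and 220 are the thresholds shown in the output above
heart_rate <- c(71, 90, 150, 202, 240, NA)
stopped    <- stop_at_thresholds(heart_rate, bottom = 86, top = 220)
print(rbind(before = heart_rate, after = stopped))
```

Values inside the thresholds are unchanged, so only the tails of the distribution are affected.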
Checking the before and after for variable max_heart_rate:
After computing the scores or probabilities for the class we want to predict, we pass them to the gain_lift function, which returns a data frame with performance metrics.
# Create machine learning model and get its scores for positive case
fit_glm=glm(has_heart_disease ~ age + oldpeak, data=heart_disease, family = binomial)
heart_disease$score=predict(fit_glm, newdata=heart_disease, type='response')
# Calculate performance metrics
gain_lift(data=heart_disease, score='score', target='has_heart_disease')
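For readers who want to see how cumulative gain and lift are typically derived from scores, here is a base R sketch on simulated data. It follows the common decile-based definition and is an assumption about the general technique, not a copy of the package's code; the data and column names are made up.

```r
set.seed(42)

# Simulated scores and outcomes (hypothetical data): higher score, more positives
n      <- 1000
score  <- runif(n)
target <- rbinom(n, size = 1, prob = score)

# Sort by descending score and split into 10 equal-sized buckets
ord     <- order(score, decreasing = TRUE)
bucket  <- ceiling(seq_len(n) / (n / 10))
pos_cum <- cumsum(as.vector(tapply(target[ord], bucket, sum)))

gain_table <- data.frame(
  population = seq(10, 100, by = 10),        # cumulative % of cases
  gain       = 100 * pos_cum / sum(target)   # cumulative % of positives captured
)
# Lift: how many times better than picking cases at random
gain_table$lift <- gain_table$gain / gain_table$population
print(round(gain_table, 2))
```

By construction, the last row always shows a gain of 100 and a lift of 1, since the full population contains every positive case.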
THE-R, 2018-01-25: Exploratory data analysis and data preparation using the funModeling package