The SeaClass R Package · R Views

The Operations Technology and Advanced Analytics Group (OTAAG) at Seagate Technology has decided to share an internal project that helps accelerate development of classification models. The interactive SeaClass tool is contained in an R-based package built using shiny and other CRAN packages commonly used for binary classification. The package is free to use and develop further, but any analysis mistakes are the sole responsibility of the user. Check out the demo video below:

Origin Story

Manufacturing data sets used for pass/fail analysis are typically highly imbalanced, with many more passing cases than failing cases. In some situations, failure examples may be scarce or nonexistent. In these extreme cases, complex modeling techniques are likely inadvisable. Data scientists commonly consider data scarcity, class imbalance, and data dimensionality when choosing among competing candidate approaches, such as anomaly detection, simple models, and complex models. Standard approaches are easily identified within each of these analysis categories and can serve as reasonable initial baselines for a classification analysis.

The SeaClass application was created to generate template R code for the commonly encountered classification problems described above. The application allows data analysts to explore multiple models quickly with essentially no programming required. SeaClass provides an option to download the corresponding R code for further model training/testing. This workflow enables our data analysts to jump-start their modeling, saving time and initial hassles.

The Advanced Analytics group decided to open-source the package for several reasons. Firstly, we encourage other contributors to suggest improvements to the SeaClass application. Additionally, we are hopeful our application will inspire other code-generating projects within the R community. Lastly, our group benefits greatly from open-source tools, and it’s nice to give back every once in a while.

Package Overview

The SeaClass R package provides tools for analyzing classification problems. In particular, specialized tools are available for addressing the problem of imbalanced data sets. The SeaClass application provides an easy-to-use interface that requires only minimal R programming knowledge to get started, and can be launched using the RStudio Addins menu. The application allows the user to explore numerous methods by simply clicking on the available options and interacting with the generated results. The user can choose to download the code for any procedures they wish to explore further. SeaClass was designed to jump-start the analysis process for both novice and advanced R users. See screenshots below for one demonstration.

Installation Instructions

The SeaClass application depends on multiple R packages. To install SeaClass and its dependencies, run:

install.packages('devtools')
devtools::install_github('ChrisDienes/SeaClass')

Usage Instructions

Step 1. Begin by loading and preparing your data in R. Some general advice:

  • Your data set must be stored as an R data frame.
  • The data set must contain a binary response variable (0/1, PASS/FAIL, A/B, etc.).
  • All other variables must be predictor variables.
  • Predictor variables can be numeric, categorical, or factors.
  • Including too many predictors may slow down the application and weaken model performance.
  • Categorical predictors are often ignored when the number of levels exceeds 10, since they tend to have undue influence.
  • Missing values are not allowed and will be flagged. Please remove or impute NAs before starting the app.
  • Keep the number of observations (rows) to a medium or small size.
  • Data sets with many rows (>10,000) or many columns (>30) may slow down the app’s interactive responses.
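
The advice above can be turned into a quick pre-flight check. The following is a minimal base R sketch, not part of SeaClass: the helper name `check_seaclass_input` and its arguments are hypothetical, and the default limits simply mirror the numbers quoted in the list (10 levels, 10,000 rows, 30 columns).

```r
# Hypothetical helper (not part of SeaClass): checks a data frame against
# the general advice listed above and returns a character vector of issues.
check_seaclass_input <- function(df, response, max_levels = 10,
                                 max_rows = 10000, max_cols = 30) {
  stopifnot(is.data.frame(df), response %in% names(df))
  issues <- character(0)

  # The response must take exactly two distinct (non-missing) values.
  y <- df[[response]]
  if (length(unique(y[!is.na(y)])) != 2) {
    issues <- c(issues, "response is not binary")
  }

  # Missing values are not allowed anywhere in the data.
  if (anyNA(df)) {
    issues <- c(issues, "data contains missing values")
  }

  # Flag character/factor predictors with more than max_levels distinct values.
  predictors <- setdiff(names(df), response)
  is_cat <- vapply(df[predictors],
                   function(v) is.character(v) || is.factor(v), logical(1))
  n_levels <- vapply(df[predictors][is_cat],
                     function(v) length(unique(v)), integer(1))
  many <- names(n_levels)[n_levels > max_levels]
  if (length(many) > 0) {
    issues <- c(issues, paste("more than", max_levels, "levels in:",
                              paste(many, collapse = ", ")))
  }

  # Large data sets may slow down the interactive app.
  if (nrow(df) > max_rows || ncol(df) > max_cols) {
    issues <- c(issues, "data set is large; the app may respond slowly")
  }
  issues
}

# Example: a clean, imbalanced data set produces no issues.
demo <- data.frame(Y = rep(c(1, 0), c(100, 900)),
                   matrix(rnorm(10000), ncol = 10))
check_seaclass_input(demo, response = "Y")  # character(0)
```

An empty result only means none of these simple checks fired; it does not guarantee that the app will accept the data.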

Step 2. After data preparation, start the application either by selecting SeaClass from the RStudio Addins drop-down menu or by calling the SeaClass() function from the R console. For example:

library(SeaClass)

### Make some fake data:
X <- matrix(rnorm(10000,0,1),ncol=10,nrow=1000)
X[1:100,1:2] <- X[1:100,1:2] + 3
Y <- c(rep(1,100), rep(0,900))
Fake_Data <- data.frame(Y = Y , X)

### Load the SeaClass rare failure data:
data("rareFailData")

### Start the interactive GUI:
SeaClass()

If the application fails to load, you may need to specify the path to your preferred browser first. For example:

options(browser = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe")

Step 3. The user has various options for configuring their analysis within the GUI. Once the analysis runs, the user can view the results, interact with the results (module-dependent), save the underlying R script, or start over. Additional help is provided within the application. See the screenshots above for one depiction of these steps.

Step 4. Besides the SeaClass function, the package contains several other functions. For example:

### List available functions:
ls("package:SeaClass")
### Note this is a sample data set:
# data(rareFailData)
### Note code_output is a support function for SeaClass, not for general use.

### View help:
?accuracy_threshold

### Run example from help file:
### General Use: ###
set.seed(123)
x <- c(rnorm(100,0,1),rnorm(100,2,1))
group <- c(rep(0,100),rep(2,100))
accuracy_threshold(x=x, group=group, pos_class=2)
accuracy_threshold(x=x, group=group, pos_class=0)
### Bagged Example ###
set.seed(123)
replicate_function = function(index){accuracy_threshold(x=x[index], group=group[index], pos_class=2)[[2]]}
sample_cuts <- replicate(100, {
  sample_index = sample.int(n=length(x),replace=TRUE)
  replicate_function(index=sample_index)
})
bagged_scores <- sapply(x, function(x) mean(x > sample_cuts))
unbagged_cut    <- accuracy_threshold(x=x, group=group, pos_class=2)[[2]]
unbagged_scores <- ifelse(x > unbagged_cut, 1, 0)
# Compare AUC:
PRROC::roc.curve(scores.class0 = bagged_scores,weights.class0 = ifelse(group==2,1,0))[[2]]
PRROC::roc.curve(scores.class0 = unbagged_scores,weights.class0 = ifelse(group==2,1,0))[[2]]
bagged_prediction <- ifelse(bagged_scores > 0.50, 2, 0)
unbagged_prediction <- ifelse(x > unbagged_cut, 2, 0)
# Compare Confusion Matrix:
table(bagged_prediction, group)
table(unbagged_prediction, group)
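
In the example above, the second element of the `accuracy_threshold` result is used as a cut point on `x`. To illustrate the general idea of an accuracy-based threshold, here is a minimal base R sketch. It is not the SeaClass implementation: the function name `best_accuracy_cut` is hypothetical, and it assumes the rule "predict the positive class when `x` is greater than the cut", so it does not handle the case where the positive class has the smaller values.

```r
# Hypothetical helper, not the SeaClass implementation: scan candidate cuts
# and keep the one with the highest accuracy for the rule
# "x > cut  =>  positive class".
best_accuracy_cut <- function(x, group, pos_class) {
  is_pos <- group == pos_class
  xs <- sort(unique(x))
  # Candidate cuts: one below all data, midpoints between adjacent
  # distinct values, and one above all data.
  cuts <- c(xs[1] - 1,
            (head(xs, -1) + tail(xs, -1)) / 2,
            xs[length(xs)] + 1)
  # Accuracy of the thresholding rule at each candidate cut.
  acc <- vapply(cuts, function(cut) mean((x > cut) == is_pos), numeric(1))
  best <- which.max(acc)  # first maximum if there are ties
  list(accuracy = acc[best], cut = cuts[best])
}

# Same simulated data as the help-file example above.
set.seed(123)
x <- c(rnorm(100, 0, 1), rnorm(100, 2, 1))
group <- c(rep(0, 100), rep(2, 100))
best_accuracy_cut(x, group, pos_class = 2)
```

Because the cut is chosen on the same data it is evaluated on, the reported accuracy is optimistic. Averaging over resampled cuts, as the bagged example does, is one way to soften a single hard threshold.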

Source: The SeaClass R Package · R Views
