The SeaClass R Package

The Operations Technology and Advanced Analytics Group (OTAAG) at Seagate Technology has decided to share an internal project that helps accelerate development of classification models. The interactive SeaClass tool is contained in an R-based package built using shiny and other CRAN packages commonly used for binary classification. The package is free to use and develop further, but any analysis mistakes are the sole responsibility of the user. Check out the demo video below:

Origin Story

Manufacturing data sets used for pass/fail analysis are typically highly imbalanced, with many more passing cases than failing cases. In some situations, failure examples may be scarce or nonexistent. In these extreme cases, complex modeling techniques are likely inadvisable. Data scientists commonly consider data scarcity, class imbalance, and data dimensionality when discriminating between competing candidate approaches, such as anomaly detection, simple models, and complex models. Standard approaches are easily identified within each of these analysis categories, and can serve as reasonable initial classification baselines.

The SeaClass application was created to generate template R code for the commonly encountered classification problems described above. The application allows data analysts to explore multiple models quickly with essentially no programming required. SeaClass provides an option to download the corresponding R code for further model training/testing. This workflow enables our data analysts to jump-start their modeling, saving time and initial hassles.

The Advanced Analytics group decided to open-source the package for several reasons. Firstly, we encourage other contributors to suggest improvements to the SeaClass application. Additionally, we are hopeful our application will inspire other code-generating projects within the R community. Lastly, our group benefits greatly from open-source tools, and it’s nice to give back every once in a while.

Package Overview

The SeaClass R package provides tools for analyzing classification problems. In particular, specialized tools are available for addressing the problem of imbalanced data sets. The SeaClass application provides an easy-to-use interface that requires only minimal R programming knowledge to get started, and can be launched using the RStudio Addins menu. The application allows the user to explore numerous methods by simply clicking on the available options and interacting with the generated results. The user can choose to download the code for any procedure they wish to explore further. SeaClass was designed to jump-start the analysis process for both novice and advanced R users. See the screenshots below for one demonstration.

Installation Instructions

The SeaClass application depends on multiple R packages. To install SeaClass and its dependencies, run:

install.packages('devtools')
devtools::install_github('ChrisDienes/SeaClass')

Usage Instructions

Step 1. Begin by loading and preparing your data in R. Some general advice:

  • Your data set must be saved as an R data frame object.
  • The data set must contain a binary response variable (0/1, PASS/FAIL, A/B, etc.)
  • All other variables must be predictor variables.
  • Predictor variables can be numeric, categorical, or factors.
  • Including too many predictors may slow down the application and weaken performance.
  • Categorical predictors are often ignored when the number of levels exceeds 10, since high-cardinality factors tend to exert undue influence on the fit.
  • Missing values are not allowed and will produce an error. Please remove or impute NAs prior to starting the app (see the preparation sketch after this list).
  • Keep the number of observations (rows) to a medium or small size.
  • Data sets with many rows (>10,000) or many columns (>30) may slow down the app’s interactive responses.
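
A minimal preparation sketch along these lines (my_data and FailFlag are hypothetical placeholders for your data frame and its binary response):

### Hypothetical sketch: my_data is your data frame, FailFlag its binary response
my_data <- read.csv("my_data.csv")             # load data as a data frame
my_data$FailFlag <- factor(my_data$FailFlag)   # store the binary response as a factor
sum(is.na(my_data))                            # count missing values
my_data <- na.omit(my_data)                    # simplest option: drop rows with NAs
dim(my_data)                                   # confirm a manageable number of rows and columns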

Step 2. After data preparation, start the application by either loading SeaClass from the RStudio Addins drop-down menu, or by loading the SeaClass function from the command line. For example:

library(SeaClass)

### Make some fake data:
X <- matrix(rnorm(10000,0,1),ncol=10,nrow=1000)
X[1:100,1:2] <- X[1:100,1:2] + 3
Y <- c(rep(1,100), rep(0,900))
Fake_Data <- data.frame(Y = Y , X)

### Load the SeaClass rare failure data:
data("rareFailData")

### Start the interactive GUI:
SeaClass()

If the application fails to load, you may need to specify your favorite browser path first. For example:

options(browser = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe")

Step 3. The user has various options for configuring their analysis within the GUI. Once the analysis runs, the user can view the results, interact with the results (module-dependent), save the underlying R script, or start over. Additional help is provided within the application. See above screenshots for one depiction of these steps.

Step 4. Besides the SeaClass function, several other functions are contained within the library. For example:

### List available functions:
ls("package:SeaClass")
### Note this is a sample data set:
# data(rareFailData)
### Note code_output is a support function for SeaClass, not for general use.

### View help:
?accuracy_threshold

### Run example from help file:
### General Use: ###
set.seed(123)
x <- c(rnorm(100,0,1),rnorm(100,2,1))
group <- c(rep(0,100),rep(2,100))
accuracy_threshold(x=x, group=group, pos_class=2)
accuracy_threshold(x=x, group=group, pos_class=0)
### Bagged Example ###
set.seed(123)
replicate_function <- function(index){accuracy_threshold(x=x[index], group=group[index], pos_class=2)[[2]]}
sample_cuts <- replicate(100, {
  sample_index <- sample.int(n=length(x), replace=TRUE)
  replicate_function(index=sample_index)
})
bagged_scores <- sapply(x, function(xi) mean(xi > sample_cuts))
unbagged_cut    <- accuracy_threshold(x=x, group=group, pos_class=2)[[2]]
unbagged_scores <- ifelse(x > unbagged_cut, 1, 0)
# Compare AUC:
PRROC::roc.curve(scores.class0 = bagged_scores,weights.class0 = ifelse(group==2,1,0))[[2]]
PRROC::roc.curve(scores.class0 = unbagged_scores,weights.class0 = ifelse(group==2,1,0))[[2]]
bagged_prediction <- ifelse(bagged_scores > 0.50, 2, 0)
unbagged_prediction <- ifelse(x > unbagged_cut, 2, 0)
# Compare Confusion Matrix:
table(bagged_prediction, group)
table(unbagged_prediction, group)

Source: The SeaClass R Package · R Views

[Package] forecast 8.0


Source: forecast 8.0 | Hyndsight

In what is now a roughly annual event, the forecast package has been updated on CRAN with a new version, this time 8.0.

A few of the more important new features are described below.

Check residuals

A common task when building forecasting models is to check that the residuals satisfy certain assumptions (that they are uncorrelated, normally distributed, etc.). The new function checkresiduals makes this very easy: it produces a time plot, an ACF plot, and a histogram with a superimposed normal curve, and performs a Ljung-Box test on the residuals with an appropriate number of lags and degrees of freedom.

fit <- auto.arima(WWWusage)
checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  residuals
## Q* = 7.8338, df = 8, p-value = 0.4499
## 
## Model df: 2.   Total lags used: 10

This should work for all the modelling functions in the package, as well as some of the time series modelling functions in the stats package.
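
For instance, the same diagnostics can be produced for a model fitted with stats::arima() rather than auto.arima() (a small sketch illustrating the claim above):

### Residual checks for a stats::arima() fit:
fit2 <- arima(WWWusage, order = c(1, 1, 1))
checkresiduals(fit2)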

Different types of residuals

Usually, residuals are computed as the difference between observations and the corresponding one-step forecasts. But for some models, residuals are computed differently; for example, a multiplicative ETS model or a model with a Box-Cox transformation. So the residuals() function now has an additional argument to deal with this situation.

“Innovation residuals” correspond to the white noise process that drives the evolution of the time series model. “Response residuals” are the difference between the observations and the fitted values (as with GLMs). For homoscedastic models, the innovation residuals and the one-step response residuals are identical. “Regression residuals” are also available for regression models with ARIMA errors, and are equal to the original data minus the effect of the regression variables. If there are no regression variables, the errors will be identical to the original series (possibly adjusted to have zero mean).

library(ggplot2)
fit <- ets(woolyrnq)
res <- cbind(Residuals = residuals(fit), 
             Response.residuals = residuals(fit, type='response'))
autoplot(res, facets=TRUE)

Some new graphs

The geom_histogram() function in the ggplot2 package is nice, but it does not have a good default binwidth. So I added the gghistogram function, which provides a quick histogram with good defaults. You can also overlay a normal density curve or a kernel density estimate.

gghistogram(lynx)

The ggseasonplot function is useful for studying seasonal patterns and how they change over time. It now has a polar argument to create graphs like this.

ggseasonplot(USAccDeaths, polar=TRUE)

I often want to add a time series line to an existing plot. Base graphics has lines(), which works well when a time series is passed as an argument. So I added autolayer, which is similar (but more general). It is an S3 method like autoplot, and adds a layer to an existing ggplot object. autolayer will eventually form part of the next release of ggplot2, but for now it is available in the forecast package. There are methods provided for ts and forecast objects:

WWWusage %>% ets %>% forecast(h=20) -> fc
autoplot(WWWusage, series="Data") + 
  autolayer(fc, series="Forecast") + 
  autolayer(fitted(fc), series="Fitted")

Cross-validation

The tsCV and CVar functions have been added. These were discussed in a previous post.
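
As a quick illustration (a minimal sketch; tsCV() takes a forecast function and returns the forecast error at each origin):

### One-step time-series cross-validation with a naive forecast:
e <- tsCV(WWWusage, forecastfunction = naive, h = 1)
sqrt(mean(e^2, na.rm = TRUE))  # cross-validated RMSE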

Bagged ETS

The baggedETS function has been added, which implements the procedure discussed in Bergmeir et al (2016) for bagging ETS forecasts.
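
A minimal usage sketch (note that an ETS model is fitted to each bootstrapped series, so this can take a while to run):

### Bagged ETS forecasts:
fit <- baggedETS(WWWusage)
fc <- forecast(fit, h = 20)
autoplot(fc)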

head and tail of time series

I’ve long found it annoying that head and tail do not work on multiple time series. So I added some functions to the package so they now work.
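
For example (a small sketch combining two built-in monthly series into a multiple time series):

### head and tail now work on mts objects:
z <- cbind(mdeaths, fdeaths)
head(z)
tail(z)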

Imports and Dependencies

The pipe operator from the magrittr package is now imported. So you don’t need to load the magrittr package to use it.

There are now no packages that are loaded with forecast – everything required is imported. This makes the start up much cleaner (no more annoying messages from all those packages being loaded). Instead, some random tips are occasionally printed when you load the forecast package (much like ggplot2 does).

There is quite a bit more — see the Changelog for a list.

Modeling brain responses to visual stimuli – the superheat package


This case study examines data collected from a functional magnetic resonance imaging (fMRI) experiment performed on a single individual by the Gallant neuroscience lab at UC Berkeley (see the related Gallant lab papers: reference 1 and reference 2).

fMRI measures oxygenated blood flow in the brain, which can be considered an indirect measure of neural activity (the two processes are highly correlated). The measurements obtained from an fMRI experiment correspond to the aggregate response of hundreds of thousands of neurons within cube-like voxels of the brain. Subdividing the brain into 3D voxels is analogous to subdividing an image into 2D pixels.

The data

The data consist of fMRI measurements of approximately 1,300 voxels in the visual cortex of a single individual, taken in response to viewings of 1,750 different images (e.g., a picture of a baby, a house, or a drawing).

To visualize what is going on, the figure below shows an individual viewing an image while in an fMRI machine.

An individual viewing an image while in an fMRI machine


Each image is a 128 × 128 pixel grayscale image; it can be represented as a vector of length 128² = 16,384, which can be reduced to a vector of length 10,921 via a Gabor wavelet transformation.

The raw data are stored in the Collaborative Research in Computational Neuroscience (CRCNS) data sharing repository, where they can be found. To access the data, you must request a CRCNS account and describe what you intend to do with the data, but this is quite straightforward.

Loading Superheat

Installing the superheat package from github is easy, assuming you have the devtools package installed in R. Simply type the following command:

# install devtools if you don't have it already
install.packages("devtools")
# install the development version of superheat
devtools::install_github("rlbarter/superheat")

Assuming that you didn’t run into any unfortunate errors when installing the package, you can load the package into R in the normal way.

library(superheat)

Viewing the raw images

# some useful libraries
library(ggplot2)
library(dplyr)
library(gridExtra)
library(knitr)
library(RColorBrewer)

It is helpful to have an idea of what kind of images the subject was viewing. The raw images are contained in the Stimuli.mat file, and are separated into a set of 1,750 training images (used to train our models) and 120 validation images (used to evaluate model performance).

The code below loads the data and extracts the training and validation images.

# a library for loading matlab files
library(R.matlab)
# load in the stimuli images
stimuli <- readMat("raw_data/Stimuli.mat")
# extract training stimuli array
train.stimuli <- stimuli[[1]]
# extract validation stimuli array
val.stimuli <- stimuli[[2]]
# remove the original stimuli object
rm(stimuli)

We display four of the training images below using our superheat package.

# view some of the images
im1 <- superheat(train.stimuli[1, 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = F)
im2 <- superheat(train.stimuli[3, 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = F)
im3 <- superheat(train.stimuli[10, 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = F)
im4 <- superheat(train.stimuli[15, 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = F)
# place the images in a 2x2 grid
grid.arrange(im1$plot, im2$plot, im3$plot, im4$plot, ncol = 2)

Preparing the data for analysis

Despite best efforts, sadly not all of the data is publicly available. The Gabor wavelet filters, for example, are not available on the CRCNS website (see above), but the dedicated reader can try to generate their own Gabor wavelets using the raw images contained in Stimuli.mat (see above).

Gabor wavelet features

I, however, have access to a file, fMRIdata.RData, which contains the Gabor wavelet features for the training and validation set images. These Gabor feature matrices are contained in the fit_feat and val_feat objects from the fMRIdata.RData file (again, this is unfortunately not publicly available).

# load in the gabor wavelet filters
load("processed_data/fMRIdata.RData") # not available in github
# fit_feat contains the gabor wavelet features for the 1750 training images
train.feat <- fit_feat
dim(train.feat)
## [1]  1750 10921
# val_feat contains the gabor wavelet features for the 120 validation images
val.feat <- val_feat
dim(val.feat)
## [1]   120 10921
# Remove the other objects that correspond to the responses of a subset of the voxels
rm(fit_feat)
rm(val_feat)
rm(resp_dat)
rm(loc_dat)

The voxel responses

The voxel responses to each image were collected for two subjects, “S1” and “S2”. We will restrict our analysis to predicting the voxel responses in the V1 region for Subject 1 only (see the image below, taken from Matthew Schmolesky’s Webvision).

Loading in the voxel responses (from the EstimatedResponses.mat file) involves first converting the .mat file to a format readable by R. Specifically, I had to convert the data to a version 6 MATLAB file in Octave:

# In Octave, run:
>> dat = load("EstimatedResponses.mat")
>> save("-V6", "EstimatedResponsesV6.mat", "dat")

Having converted the original MATLAB file to a version 6 MATLAB file, we can load it into R using the R.matlab package.

# load in the Version 6 matlab file
voxel.response <- readMat("processed_data/EstimatedResponsesV6.mat")$dat
dimnames(voxel.response)[[1]]
## [1] "dataTrnS1" "dataTrnS2" "dataValS1" "dataValS2" "roiS1"     "roiS2"    
## [7] "voxIdxS1"  "voxIdxS2"

We can then filter through the objects contained in this file to extract only the responses for the 1331 V1 voxels to the training and validation images.

# extract useful objects
# the columns of train.resp are the 1,750 training images
# the rows are the 25,915 voxels
train.resp <- voxel.response[[1]]
# the columns of val.resp are the 120 validation images
# the rows are the 25,915 voxels
val.resp <- voxel.response[[3]]
# extract the V1 voxels
V1.vox.index <- which(voxel.response[[5]][, 1] == 1)
# there are 1331 V1 voxels
length(V1.vox.index)
## [1] 1331
# filter train.resp and val.resp to the V1 voxels only
train.resp <- train.resp[V1.vox.index, ]
val.resp <- val.resp[V1.vox.index, ]
# remove the remaining data
rm(voxel.response)

Cleaning the data

The final data objects in our workspace are

ls()
##  [1] "im1"            "im2"            "im3"            "im4"           
##  [5] "train.feat"     "train.resp"     "train.stimuli"  "V1.vox.index"  
##  [9] "val.feat"       "val.resp"       "val.stimuli"    "voxel.response"

where

  • train.stimuli: a 1750×128×128 array corresponding to the 1,750 raw training images, each of dimension 128×128.
  • val.stimuli: a 120×128×128 array corresponding to the 120 raw validation images, each of dimension 128×128.
  • train.feat: a 1750×10921 matrix corresponding to the 10,921 Gabor wavelet features for each of the 1,750 training images.
  • val.feat: a 120×10921 matrix corresponding to the 10,921 Gabor wavelet features for each of the 120 validation images.
  • train.resp: a 1331×1750 matrix containing the responses of the 1,331 voxels in the V1 region to each of the 1,750 training images.
  • val.resp: a 1331×120 matrix containing the responses of the 1,331 voxels in the V1 region to each of the 120 validation images.

Missing values

Note that of the 1,331 voxels in the V1 region, 37 of them have at least 40% missing responses. So we will remove these voxels.

# identify the proportion of missing values for each voxel
missing.variables <- apply(train.resp, 1, function(x) sum(is.na(x)) / length(x))
# count the number of voxels with at least some missingness
length(missing.variables[missing.variables > 0])
## [1] 37
# remove these voxels from the training and validation sets
train.resp <- t(train.resp[which(missing.variables == 0), ])
val.resp <- t(val.resp[which(missing.variables == 0), ])
save(train.feat, train.resp, file = "processed_data/fmri_training_data.RData")
save(val.feat, val.resp, file = "processed_data/fmri_validation_data.RData")
# number of voxels remaining after removing missing values
ncol(train.resp)
## [1] 1294

We now thus have 1,294 voxels in the V1 region.

Modeling

For each of our models, our goal is to predict the response to the viewing of an image for each of the 1,294 voxels (cubic region in the brain).

That is, we have 1,294 separate models (one for each voxel), where the predictors/variables correspond to the 10,921 Gabor features from the training images.

Feature selection

To make the problem computationally feasible, we decided to filter the 10,921 Gabor features to the 500 that were most correlated with each voxel.

To identify the correlation of each Gabor feature with each voxel response, we ran the following code (also contained in the code/voxel_cor.R file that can be found here) on the statistics department computer cluster at UC Berkeley. You could probably run it on your laptop, but I would advise against it unless you are incredibly patient.

library(parallel)

nCores <- 24  # to set manually 
cl <- makeCluster(nCores) 

# export the necessary variables
clusterExport(cl, c("train.feat", "train.resp", "glmnet"), 
              envir=environment()) 
# calculate the correlation of each variable with each voxel
cor.vox <- parLapply(cl, data.frame(train.resp), function(voxel) {
  apply(train.feat, 2, function(feature) cor(voxel, feature))
})
# save the results
save(cor.vox, file = "results/voxel_cor.RData")

Next, to identify the 500 Gabor features that are most correlated with each voxel, we can simply load the correlation data saved above and run the following code.

load("results/voxel_cor.RData")
# identify the 500 most correlated features for each variable
top.features <- lapply(cor.vox, function(cor) {
  order(cor, decreasing = TRUE)[1:500]
})

Lasso + OLS model

We now fit a lasso linear model to predict the voxel responses to the image stimuli.

The 1750×500 design matrix (feature matrix) can be represented as follows:

$$X = \begin{bmatrix} x_{1,1} & x_{1,2} & \dots & x_{1,500} \\ x_{2,1} & x_{2,2} & \dots & x_{2,500} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1750,1} & x_{1750,2} & \dots & x_{1750,500} \end{bmatrix}$$

where $x_{i,j}$ is the $j$th Gabor wavelet feature value for the $i$th image.

We have 1,294 response vectors (as we fit 1,294 separate models), each of which contains the responses of an individual voxel to the 1,750 training images. The response vector for voxel $v \in \{1, \dots, 1294\}$ can be written as:

$$y^v = \begin{bmatrix} y_1^v \\ y_2^v \\ \vdots \\ y_{1750}^v \end{bmatrix}$$

To fit these 1,294 models, the following code was run using the Statistics Department cluster at UC Berkeley. You will not be able to run this code on your laptop.

library(glmnet)
library(parallel)
# set up the cluster for parallelization
nCores <- 24  # to set manually 
cl <- makeCluster(nCores) 

# load the CV parameter selection function
source("code/select_lambda.R")

# fit the lasso model to each voxel using the 500 most correlated features
# for the voxel.
clusterExport(cl, c("train.feat", "train.resp", "glmnet", "top.features"), 
              envir=environment()) 
lasso.list <- parLapply(cl, 1:ncol(train.resp), function(voxel) {
  glmnet(x = train.feat[, top.features[[voxel]]], 
         y = as.vector(train.resp[ , voxel])) 
})
# extract the lambda sequence from each voxel
# we want to avoid the ends of the sequence
# note that some sequences terminate early so providing a 
# universal endpoint for the vector sometimes produces an error
lambda.seq <- lapply(lasso.list, function(model) {
  model$lambda[5:length(model$lambda)]
  })

# use CV to select the best lambda for each voxel-model
# export the necessary variables
clusterExport(cl, c("train.feat", 
                    "train.resp", "top.features", 
                    "cv.glmnet", "lambda.seq"), 
              envir=environment()) 
# the ith entry of selected.lambda corresponds to the 
# best lambda value for voxel i
selected.lambda <- selectLambda(cl = cl, x.mat = train.feat, 
                       y.mat = train.resp, features = top.features, 
                       lambda.seq = lambda.seq)
# save the results
save(lasso.list, selected.lambda, top.features, 
     file = "results/lasso_results_top500.RData")
# stop the cluster
stopCluster(cl)

Having run the above code on the cluster and saved the results in the file lasso_results_top500.RData, we can load the results into the current R session and calculate the corresponding voxel response predictions.

First, we can extract the model coefficients for each model (since we saved the results for each possible lambda, we need to filter to the model with the “best” lambda).

load("results/lasso_results_top500.RData")
# extract the best model from lasso.list (above) corresponding to the best lambda
beta <- lapply(1:length(lasso.list), function(voxel) {
  model.index <- which(lasso.list[[voxel]]$lambda == selected.lambda[voxel])
  as.matrix(lasso.list[[voxel]]$beta)[, model.index]
})

We can also identify which features were selected (had non-zero lasso coefficients) for each voxel.

# identify the selected features for each voxel
selected.features.list <- lapply(1:length(top.features), function(voxel) {
  # identify which features were selected for the voxel
  # start with the features used to train the lasso model
  voxel.features <- top.features[[voxel]]
  # obtain the coefficients for these features
  voxel.features.selected <- beta[[voxel]] != 0
  # filter the features index and beta vector to the non-zero coefficients
  voxel.features <- voxel.features[voxel.features.selected]
})

Next, we re-fit a classical un-regularized OLS model for each voxel based on the selected (non-zero) coefficients from the corresponding lasso model.

# fit an OLS model for each voxel based on the selected features
ols.list <- lapply(1:ncol(train.resp), function(voxel) {
  voxel.features <- selected.features.list[[voxel]]
  # filter the feature matrix to these selected features
  voxel.train <- as.data.frame(train.feat[, voxel.features])
  colnames(voxel.train) <- paste0("X", 1:ncol(voxel.train))
  voxel.train$y <- train.resp[, voxel]
  # fit an OLS model
  lm(y ~ ., data = voxel.train)
})

Evaluating the model on the validation set

The code below extracts the predicted voxel response for the validation set.

# calculate the predicted voxel responses for each image in the VALIDATION set
predictions <- lapply(1:ncol(val.resp), function(voxel) { 
  # extract the selected Gabor features for the current voxel
  voxel.val <- as.data.frame(val.feat[, selected.features.list[[voxel]]])
  colnames(voxel.val) <- paste0("X", 1:ncol(voxel.val))
  # predict the voxel response for the validation images 
  predict.lm(ols.list[[voxel]], voxel.val)
})

Below we calculate and plot the correlation of the true voxel responses with the predicted voxel responses for each of the 1,294 voxels.

# calculate the correlation between the predictions and the true responses for each voxel
prediction.cor <- sapply(1:ncol(val.resp), function(voxel) {
  cor(predictions[[voxel]], val.resp[ ,voxel])
})
# convert to data frame
prediction.cor <- data.frame(voxel = 1:length(prediction.cor), cor = prediction.cor)
# plot a histogram of the correlations between the true and predicted voxel responses
ggplot(prediction.cor) + 
  geom_histogram(aes(x = cor), col = "white", binwidth = 0.02) +
  scale_y_continuous(name = "Number of voxels") +
  scale_x_continuous(name = "Correlation of predicted and true voxel response") 

A histogram of the correlation of the true voxel response on the validation images and the predicted voxel response on the validation images for each of the 1294 voxels.


We see that there appear to be two groups of voxels: those whose true and predicted correlations are quite low (around 0.2), and another group whose correlations are around 0.6.

Visualizing voxel performance across all voxels

We will now use superheat to simultaneously evaluate the performance of the lasso models in predicting the responses of each of the 1,294 voxels to the validation images.

First we will cluster the images and voxels into two groups for visualization purposes.

set.seed(384653)
# calculate row clusters
image.clusters <- kmeans(val.resp, centers = 2)$cluster
# calculate column clusters
voxel.clusters <- kmeans(t(val.resp), centers = 2)$cluster
save(voxel.clusters, file = "results/voxel_clusters.RData")

Below we plot some example images from each cluster. The following presents four images from the first cluster.

set.seed(23847)
four.images.cl1 <- sample(which(image.clusters == 1), 4)
cl1.im1 <- superheat(val.stimuli[four.images.cl1[1], 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = FALSE)
cl1.im2 <- superheat(val.stimuli[four.images.cl1[2], 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = FALSE)
cl1.im3 <- superheat(val.stimuli[four.images.cl1[3], 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = FALSE)
cl1.im4 <- superheat(val.stimuli[four.images.cl1[4], 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = FALSE)
grid.arrange(cl1.im1$plot, cl1.im2$plot, cl1.im3$plot, cl1.im4$plot, ncol = 2)

The next four images are from the second cluster.

set.seed(23847)
four.images.cl2 <- sample(which(image.clusters == 2), 4)
cl2.im1 <- superheat(val.stimuli[four.images.cl2[1], 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = FALSE)
cl2.im2 <- superheat(val.stimuli[four.images.cl2[2], 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = FALSE)
cl2.im3 <- superheat(val.stimuli[four.images.cl2[3], 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = FALSE)
cl2.im4 <- superheat(val.stimuli[four.images.cl2[4], 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = FALSE)
grid.arrange(cl2.im1$plot, cl2.im2$plot, cl2.im3$plot, cl2.im4$plot, ncol = 2)

Next, we plot a heatmap of the validation response matrix, that is, the response of each voxel to each validation image. Above each column in the heatmap (each column corresponds to a voxel), we plot the correlation of the observed voxel response with the predicted voxel response.

superheat(val.resp, 
          yt = prediction.cor$cor,
          yt.axis.name = "Correlation between\npredicted and true\nvoxel responses",
          
          heat.pal = brewer.pal(5, "RdBu"),
          
          yt.obs.col = rep("slategray4", ncol(val.resp)),
          yt.point.alpha = 0.6,
          yt.axis.name.size = 12,
          yt.plot.size = 0.7,
          yt.point.size = 2,
          
          membership.rows = image.clusters,
          membership.cols = voxel.clusters,
          
          left.label = "none",
          bottom.label = "none",
          grid.hline.col = "white",
          grid.vline.col = "white",
          grid.hline.size = 2,
          grid.vline.size = 2,
          
          row.title = "Validation images (120)",
          column.title = "Voxels (1,294)",
          title = "(a)")

The heatmap above is very noisy as a lot of information is being crammed into a small number of pixels. It is thus often much easier to “smooth” the heatmap within its clusters to highlight the “big picture”.

superheat(val.resp, 
          
           X.text = matrix(c("lower\nvoxel\nresponse",
                            "higher\nvoxel\nresponse",
                            "neutral\nvoxel\nresponse", 
                            "neutral\nvoxel\nresponse"), ncol = 2),
          X.text.col = matrix(c("white", "white", "black", "black"), ncol = 2),
          
          smooth.heat = T,
          
          heat.pal = brewer.pal(5, "RdBu"),
          
          yt = prediction.cor$cor,
          yt.axis.name = "Correlation between\npredicted and true\nvoxel responses",
          yt.plot.type = "boxplot",
          yt.cluster.col = "slategray4",
          yt.axis.name.size = 12,
          yt.plot.size = 0.7,
          
          membership.rows = image.clusters,
          membership.cols = voxel.clusters,
          
          left.label = "none",
          bottom.label = "none",
          grid.hline.col = "white",
          grid.vline.col = "white",
          grid.hline.size = 2,
          grid.vline.size = 2,
          
          row.title = "Validation images (120)",
          column.title = "Voxels (1,294)",
          title = "(b)")
## [1] -0.30 -0.20 -0.04  0.09  0.20

What we can see is that the two clusters of voxels behave very differently:

  • the first cluster of voxels is much more active in response to viewings of images from the second cluster of images than to images from the first cluster.
  • the second cluster of voxels does not appear to exhibit a strong difference in response to the two clusters of images.

Furthermore, the predicted responses for the first cluster of voxels are much more correlated with the true responses for these voxels, with an average correlation of around 0.5. The predicted responses to the second cluster of voxels, on the other hand, have approximately zero correlation with the true responses exhibited by these voxels.

Exploring the voxel clusters

Notice that we found two clusters of voxels that respond to the visual stimuli differently: the first cluster of voxels is highly sensitive to visual stimuli, whereas the second cluster is not.

Our goal is to explore the physical locations of the voxels. We can load in the locations data that can be found on Yuval Benjamini’s website.

load("raw_data/v1_locations.RData")
# v1 locations
dim(v1_locations)
## [1] 1331    3
head(v1_locations)
##      [,1] [,2] [,3]
## [1,]   34   22    3
## [2,]   34   23    3
## [3,]   34   24    3
## [4,]   34   25    3
## [5,]   34   26    3
## [6,]   34   20    4

Note that v1_locations appears to hold the (x, y, z)-locations of each V1 voxel. However, recall that we had a number of voxels with mostly missing values. We remove these from our location data frame below.

v1.locations <- as.data.frame(v1_locations[which(missing.variables == 0), ])
colnames(v1.locations) <- c("x", "y", "z")
v1.locations$cluster <- factor(voxel.clusters)
rm(v1_locations)

Next, we can plot the voxels in space.

library(plotly)
voxel.clusters <- factor(paste("cluster", voxel.clusters))
plot_ly(v1.locations, x = ~x, y = ~y, z = ~z, 
        color = ~cluster, 
        type = "scatter3d",
        mode = "markers",
        colors = c('#e41a1c', '#377eb8'),
        marker = list(line = list(color = 'black', width = 1.5))) %>%
  layout(scene = list(xaxis = list(title = 'x'),
                   yaxis = list(title = 'y'),
                   zaxis = list(title = 'z')))

 

Source: Superheat

Quant Trading with R


These are training materials for “Quant Trading with R”.

  1. R programming basics
  • Installing R and RStudio, and the user interface
  • Basic R command structure
  • Basic R data types (vectors, matrices, lists, data frames)
  • Basic R syntax (conditionals, loops)
  • Extended R data types (time series types, xts)
  • Using the quantmod package
  • An example back-test of technical analysis on stock prices

[Source] Quant Trading with R training materials | Author: 아마퀀트

Big Data Infographics


“Big Data Infographics” is a corner for grasping big data at a glance. Various data and information are presented visually to make big data easier to understand.