DataExplorer: 최소의 코드로 빠른 데이터 탐색 구현

Forbes 기사에 따르면, 데이터 정제 및 정리는 가장 시간이 많이 걸리고 재미없는 데이터 과학 작업이라고 합니다. DataExplorer 패키지를 이용하면 전체 리소스의 80% 까지 최소화할 수 있습니다. 이와 더불어, 사용자에게 매우 친숙한 디자인이과 대부분의 경우 하나의 함수 호출만으로 데이터 탐색을 즐길 수 있습니다!

 

데이터 조작은 data.table에 의해 제공되므로 대용량 데이터 세트를 포함하는 작업은 대개 몇 초 내에 완료됩니다. 또한 이 패키지는 입력 데이터 클래스로 융통성이 높아 어떤 data.frame과 같은 객체를 포함 할 수 있습니다. 그러나 특정 함수는 update-by-reference 특징으로 인해 data.table 클래스 객체를 입력으로 요구합니다.

이제 몇 가지 코드를 살펴 볼까요?


mlbench 라이브러리에서BostonHousing 데이터 세트를 가져옵니다.

library(mlbench)
data("BostonHousing", package = "mlbench")

초기 시각화

데이터에 대해 잘 알지 못해도 처음 세 가지 작업은 거의 항상 다음과 같습니다.

library(DataExplorer)
plot_missing(BostonHousing) ## 누락 된 값이 있습니까? 누락 된 데이터 프로필은 무엇입니까?
plot_bar(BostonHousing) ## 각 이산 변수의 범주별 빈도는 어떻게 생겼습니까?
plot_histogram(BostonHousing) ## 각 연속 변수의 분포는 어떻습니까?

흥미로운 통찰력은별로 없지만, plot_missing,plot_bar 및 plot_histogram 결과는 아래와 같습니다.

 

Histogram

면밀히 조사하자면, 변수 rad는 이산적인 것처럼 보입니다. crim, zn, indusbinto bin도 그룹화하려고 합니다.

## 인자로`rad`를 설정하십시오.
BostonHousing$rad <- as.factor(BostonHousing$rad)

## 새 이산 변수 만들기
for (col in c("crim", "zn", "indus", "b")) 
  BostonHousing[[paste0(col, "_d")]] <- as.factor(ggplot2::cut_interval(BostonHousing[[col]], 2))

## 모든 이산 변수에 대한 플롯 바 차트
plot_bar(BostonHousing)

Bar

이 시점에서 우리는 데이터 배포에 대해 훨씬 더 잘 이해하고 있습니다. 이제 medv (1000 달러짜리 주택 소유 주택의 중간값)에 관심이 있고 그것을 예측할 모델을 만들고 싶다고 가정 해보십시오. 다른 모든 변수와 비교해 봅시다.

plot_boxplot(BostonHousing, by = "medv")    

Boxplot

plot_scatterplot(
  subset(BostonHousing, select = -c(crim, zn, indus, b)), 
  by = "medv", size = 0.5)

Scatterplot_1
Scatterplot_2

plot_correlation(BostonHousing)

Correlation

이것은 데이터를 3 행의 코드만으로 상관 관계를 분석하는 방법입니다.

기능 공학

기능 공학은 더 나은 모델을 구축하는 데 중요한 단계입니다. DataExplorer는 프로세스를 쉽게 수행 할 수있는 두 가지 기능을 제공합니다. 모두 번개처럼 빠르기 때문에 입력 객체로data.table이 필요합니다. 그러나 data.table 구문으로 코딩하려는 느낌이 들지 않으면 다음 프로세스를 채택 할 수 있습니다.

## 먼저 데이터를`data.table`으로 설정하십시오.
your_data <- data.table(your_data)

## DataExplorer 기능 적용
group_category(your_data, ...)
drop_columns(your_data, ...)
set_missing(your_data, ...)

## 데이터를 다시 원래 객체로 설정
class(your_data) <- "original_object_name"

BostonHousing 데이터 세트로 돌아 갑시다. 이 섹션의 나머지 부분에서는 데이터가 이미data.table로 변환되었다고 가정합니다.

library(data.table)
BostonHousingDT <- data.table(BostonHousing)

그 변화된 연속 변수를 기억 하나요? 그 변수들을 제거해 보시죠.

drop_columns(BostonHousingDT, c("crim", "zn", "indus", "b"))

 

 

참고 :data.table은 참조로 업데이트되므로 반환 된 객체를 다시 할당 할 필요없이 원본 객체가 업데이트됩니다.

이산 변수 rad를 살펴 보겠습니다.

 

plot_bar(BostonHousingDT$rad)

Rad_bar

I think categories other than 4, 5 and 24 are too sparse, and might skew my model fit. How could I group all the sparse categories together?

4, 5, 24 이외의 범주가 너무 희박하고 이 모델에 적합하지 않을 수도 있다고 생각합니다. 카테고리를 어떻게 그룹화 할 수 있습니까?

group_category(BostonHousingDT, "rad", 0.25, update = FALSE)

#    rad cnt       pct   cum_pct
# 1:  24 132 0.2608696 0.2608696
# 2:   5 115 0.2272727 0.4881423
# 3:   4 110 0.2173913 0.7055336

rad의 하단 25 %로 그룹화 한 것처럼 보이는 것은 우리가 필요한 것을 줄 것입니다.

group_category(BostonHousingDT, "rad", 0.25, update = TRUE)
plot_bar(BostonHousingDT$rad)

Grouped_rad_bar

In addition to categorical frequency, you may also play with the measure argument to group by the sum of a different variable. See ?group_category for more example use cases.

범주 별 빈도 이외에 다른 변수의 합계로 그룹화하기 위해 measure 인수를 사용할 수도 있습니다. 더 많은 예제 사용 사례는?group_category를 참조하십시오.

데이터 보고서

데이터 보고서를 생성하려면 다음을 수행하십시오.

create_report(BostonHousing)

 

소스: DataExplorer: Fast Data Exploration With Minimum Code (Revolutions)

Coindeskr 팩키지와 Shiny를 활용한 비트코인 가격 추적기 구축

인정합시다. 비트코인으로 전 세계가 미쳐 버렸습니다. Satoshi Nakamoto가 소개 한 최초의 암호화(double-spend 문제를 해결하는 최초의 디지털 화폐) 인 비트코인(BTC)은 잘 설립 된 회사 (심지어 몇몇 국가)보다 커졌습니다. 따라서 많은 비트코인 매니아와 투자자는 시장을 더 잘 읽고 그에 따라 움직일 수 있도록 일일 가격을 추적하려고합니다.

 

이 자습서는 R 사용자가 Coindeskr, Shiny 및 Dygraphs의 세 가지 패키지를 사용하여 자신의 일일 비트코인 가격 추적을 만들도록 돕기위한 것입니다.

CoindeskrCoindesk에서 제공하는 역사적인 비트코인 가격을 포함하여 Bitcoin Price Index를 추출하기 위해 coindesk API에 액세스하는 데 도움이됩니다.

Shiny 구조 및 스크립트 명명

Every Shiny app contains two parts – the UI part and the Server part. This post follows single file layout of designing the Shiny app where the shiny app contains one single file app.R that contains two functions ui and serverin the same code.

Start a new R Script app.R in a new folder (of desired name) and proceed further with the following code.

모든 Shiny 앱은 UI 및 Server 라는 두 부분으로 구성됩니다. 이 게시물은 Shiny 애플리케이션을 설계하는 단일 파일 레이아웃을 다음과 같은 두 가지 기능인 ui와 server가 하나의 파일 app.R에 포함되어 있습니다.

새 R 스크립트 app.R을 새 폴더 (원하는 이름)에 시작하고 다음 코드를 계속 진행하십시오.

패키지 설치 및 불러오기

#install.packages('shiny')
#install.packages('coindeskr')
#install.packages('dygraphs')

library(shiny) #shiny App 패키지
library(coindeskr) #Coindesk API 패키지 
library(dygraphs) #대화식 시계열 그래프 패키지

먼저 위의 코드에서 처럼 Shiny, Coindeskr 및 dygraphs라는 필수 패키지를 불러드립니다. (위의 패키지가 없는 경우 먼저 설치하시기 바랍니다)

지난 31 일간 비트코인 가격 추출

coindeskr의 편리한 함수인 get_last31days_price()를 사용하여 비트코인의 USD 가격을 지난 31 일 동안 추출하고 데이터 프레임 last31에 저장합니다.

last31 <- get_last31days_price() 

앱에 표시 될 UI 요소

A Bitcoin Price Tracker should not only display the graph of the last one month price information but also should give us some highlights or summary, hence let us also display the minimum and maximum price of Bitcoin in the given time period and dates of when it was minimum and maximum.

비트코인 가격 추적기는 최근 1 개월 가격 정보의 그래프를 표시 할뿐만 아니라 우리에게 몇 가지 하이라이트 또는 요약을 제공해야 합니다. 따라서 주어진 기간 및 해당 날짜의 비트코인의 최소.최대 가격을 표시하십시오

UI는 다음을 포함해야합니다.

  • 앱 제목
  • 최소 비트코인 가격 및 이에 상응하는 날짜
  • 최대비트코인 가격 및 이에 상응하는 날짜
  • 지난 31 일 동안 비트코인 가격의 추세를 볼 수있는 대화형 시계열 그래프

첫 번째 세 요소는 기본 Shiny 기능만으로 코딩 할 수 있지만 마지막 대화형 시계열 그래프에서는 dygraphOutput() 함수를 사용하여 배경 객체를 표시 할 수 있습니다.

ui <- shinyUI(
  fluidPage(
  titlePanel('Bitcoin USD Price for Last 31 days'),
  mainPanel(
    h3('Minimum'),
    h3(htmlOutput('minprice')),
    h3('Maximum'),
    h3(htmlOutput('maxprice')),
    dygraphOutput("btcprice")
  )
))

Shiny App의 서버 코드

여기서 우리는 데이터 프레임 last31에서 정보를 추출해야하며 코드의 UI 부분에 정의된 해당 output ID – minprice, maxprice 및 btcprice에 피드를 제공해야합니다. 최소 비트코인 가격과 최대 Bitcoin 가격을 추출하기 위해 min() 및 max() 함수가 각각 사용되며 which.min() 및 which.max()가 사용되며 최소 및 최대 가격 값의 rownames를 반환합니다. 여기서 rownames 동등한 날짜를 포함한다. 앞에서 언급했듯이 dygraphs 패키지는 renderDygraph() 함수를 제공하여 dygraph() 함수를 사용하여 만든 대화 형 (html) 시계열 그래프를 표시합니다.

server <- function(input,output){
  
  output$minprice <- renderText(
    paste('Price : $', min(last31), '<br>Date :', rownames(last31)[which.min(last31$Price)] )
  )
  
  
  output$maxprice <- renderText(
    paste('Price : $', max(last31), '<br>Date :', rownames(last31)[which.max(last31$Price)] )
  )
  output$btcprice <- renderDygraph(
    dygraph(data = last31, main = "Bitcoin USD Price for Last 31 days") %>% 
      dyHighlight(highlightCircleSize = 5, 
                  highlightSeriesBackgroundAlpha = 0.2,
                  hideOnMouseOut = FALSE, highlightSeriesOpts = list(strokeWidth = 3)) %>%
      dyRangeSelector()
  )
}

비트코인 가격 추적 Shiny 앱 :

코드가 준비되면 RStudio의 오른쪽 위에있는 Run App 버튼을 사용하여 Shiny 앱을 (평소와 같이) 실행할 수 있습니다. 코드가 아직 준비되지 않았다면 다음 코드를 사용하여 Github에서 직접 shiny 앱을 실행하십시오. 필요한 모든 패키지 (shiny, coindeskr, dygraphs가 이미 설치되어있어야 합니다).

shiny::runGitHub('amrrs/Bitcoin_price_tracker_Daily')

비트코인 가격 추적 Shiny 앱 스크린 샷 :

이제 비트코인 일일 가격 추적기가 준비되었습니다! 위에서 사용 된 코드는 여기 github에서 사용할 수 있습니다. Shiny에 관심이 있다면 Datacamp에 개설된 Building Web Applications in R with Shiny Course 에서 더 많은 것을 배울 수 있습니다.

 

소스: Building a Daily Bitcoin Price Tracker with Coindeskr and Shiny in R | R-bloggers

The SeaClass R Package · R Views

The SeaClass R Package

The Operations Technology and Advanced Analytics Group (OTAAG) at Seagate Technology has decided to share an internal project that helps accelerate development of classification models. The interactive SeaClass tool is contained in an R-based package built using shiny and other CRAN packages commonly used for binary classification. The package is free to use and develop further, but any analysis mistakes are the sole responsibility of the user. Check out the demo video below:

Origin Story

Manufacturer data sets used for pass/fail analysis are typically highly imbalanced with many more passing cases than failing cases. In some situations, the failure examples may be scarce or nonexistent. In these extreme cases complex modeling techniques are likely inadvisable. Data scientists commonly consider data scarcity, class imbalance, and data dimensionality when discriminating between competing candidate approaches, such as anomaly detection, simple models, and complex models. Standard approaches are easily identified within each of these analysis categories, and can be exploited as reasonable initial classification analysis baselines.

The SeaClass application was created to generate template R code for the commonly encountered classification problems described above. The application allows data analysts to explore multiple models quickly with essentially no programming required. SeaClass provides an option to download the corresponding R code for further model training/testing. This workflow enables our data analysts to jump-start their modeling, saving time and initial hassles.

The Advanced Analytics group decided to open-source the package for several reasons. Firstly, we encourage other contributors to suggest improvements to the SeaClass application. Additionally, we are hopeful our application will inspire other code-generating projects within the R community. Lastly, our group benefits greatly from open-source tools, and it’s nice to give back every once in a while.

Package Overview

The SeaClass R package provides tools for analyzing classification problems. In particular, specialized tools are available for addressing the problem of imbalanced data sets. The SeaClass application provides an easy-to-use interface that requires only minimal R programming knowledge to get started, and can be launched using the RStudio Addins menu. The application allows the user to explore numerous methods by simply clicking on the available options and interacting with the generated results. The user can choose to download the codes for any procedures they wish to explore further. SeaClass was designed to jump-start the analysis process for both novice and advanced R users. See screenshots below for one demonstration.

Installation Instructions

The SeaClass application depends on multiple R packages. To install SeaClass and its dependencies, run:

install.packages('devtools')
devtools::install_github('ChrisDienes/SeaClass')

Usage Instructions

Step 1. Begin by loading and preparing your data in R. Some general advice:

  • Your data set must be saved as an R data frame object.
  • The data set must contain a binary response variable (0/1, PASS/FAIL, A/B, etc.)
  • All other variables must be predictor variables.
  • Predictor variables can be numeric, categorical, or factors.
  • Including too many predictors may slow down the application and weaken performance.
  • Categorical predictors are often ignored when the number of levels exceeds 10, since they tend to have improper influences.
  • Missing values are not allowed and will throw a flag. Please remove or impute NAs prior to starting the app.
  • Keep the number of observations (rows) to a medium or small size.
  • Data sets with many rows (>10,000) or many columns (>30) may slow down the app’s interactive responses.

Step 2. After data preparation, start the application by either loading SeaClass from the RStudio Addins drop-down menu, or by loading the SeaClass function from the command line. For example:

library(SeaClass)

### Make some fake data:
X <- matrix(rnorm(10000,0,1),ncol=10,nrow=1000)
X[1:100,1:2] <- X[1:100,1:2] + 3
Y <- c(rep(1,100), rep(0,900))
Fake_Data <- data.frame(Y = Y , X)

### Load the SeaClass rare failure data:
data("rareFailData")

### Start the interactive GUI:
SeaClass()

If the application fails to load, you may need to specify your favorite browser path first. For example:

options(browser = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe")

Step 3. The user has various options for configuring their analysis within the GUI. Once the analysis runs, the user can view the results, interact with the results (module-dependent), save the underlying R script, or start over. Additional help is provided within the application. See above screenshots for one depiction of these steps.

Step 4. Besides the SeaClass function, several other functions are contained within the library. For example:

### List available functions:
ls("package:SeaClass")
### Note this is a sample data set:
# data(rareFailData)
### Note code_output is a support function for SeaClass, not for general use.

### View help:
?accuracy_threshold

### Run example from help file:
### General Use: ###
set.seed(123)
x <- c(rnorm(100,0,1),rnorm(100,2,1))
group <- c(rep(0,100),rep(2,100))
accuracy_threshold(x=x, group=group, pos_class=2)
accuracy_threshold(x=x, group=group, pos_class=0)
### Bagged Example ###
set.seed(123)
replicate_function = function(index){accuracy_threshold(x=x[index], group=group[index], pos_class=2)[[2]]}
sample_cuts <- replicate(100, {
  sample_index = sample.int(n=length(x),replace=TRUE)
  replicate_function(index=sample_index)
})
bagged_scores <- sapply(x, function(x) mean(x > sample_cuts))
unbagged_cut    <- accuracy_threshold(x=x, group=group, pos_class=2)[[2]]
unbagged_scores <- ifelse(x > unbagged_cut, 1, 0)
# Compare AUC:
PRROC::roc.curve(scores.class0 = bagged_scores,weights.class0 = ifelse(group==2,1,0))[[2]]
PRROC::roc.curve(scores.class0 = unbagged_scores,weights.class0 = ifelse(group==2,1,0))[[2]]
bagged_prediction <- ifelse(bagged_scores > 0.50, 2, 0)
unbagged_prediction <- ifelse(x > unbagged_cut, 2, 0)
# Compare Confusion Matrix:
table(bagged_prediction, group)
table(unbagged_prediction, group)

소스: The SeaClass R Package · R Views

[패키지] forecast 8.0

,

In what is now a roughly annual event, the forecast package has been updated on CRAN with a new version, this time 8.0. A few of the more important new features are described below. Check residuals…

소스: forecast 8.0 | Hyndsight

 

In what is now a roughly annual event, the forecast package has been updated on CRAN with a new version, this time 8.0.

A few of the more important new features are described below.

Check residuals

A common task when building forecasting models is to check that the residuals satisfy some assumptions (that they are uncorrelated, normally distributed, etc.). The new function checkresiduals makes this very easy: it produces a time plot, an ACF, a histogram with super-imposed normal curve, and does a Ljung-Box test on the residuals with appropriate number of lags and degrees of freedom.

fit <- auto.arima(WWWusage)
checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  residuals
## Q* = 7.8338, df = 8, p-value = 0.4499
## 
## Model df: 2.   Total lags used: 10

This should work for all the modelling functions in the package, as well as some of the time series modelling functions in the stats package.

Different types of residuals

Usually, residuals are computed as the difference between observations and the corresponding one-step forecasts. But for some models, residuals are computed differently; for example, a multiplicative ETS model or a model with a Box-Cox transformation. So the residuals() function now has an additional argument to deal with this situation.

“Innovation residuals”” correspond to the white noise process that drives the evolution of the time series model. “Response residuals” are the difference between the observations and the fitted values (as with GLMs). For homoscedastic models, the innovation residuals and the one-step response residuals are identical. “Regression residuals” are also available for regression models with ARIMA errors, and are equal to the original data minus the effect of the regression variables. If there are no regression variables, the errors will be identical to the original series (possibly adjusted to have zero mean).

library(ggplot2)
fit <- ets(woolyrnq)
res <- cbind(Residuals = residuals(fit), 
             Response.residuals = residuals(fit, type='response'))
autoplot(res, facets=TRUE)

Some new graphs

The geom_histogram() function in the ggplot2 package is nice, but it does not have a good default bandwidth. So I added the gghistogram function which provides a quick histogram with good defaults. You can also overlay a normal density curve or a kernel density estimate.

gghistogram(lynx)

The ggseasonplot function is useful for studying seasonal patterns and how they change over time. It now has a polar argument to create graphs like this.

ggseasonplot(USAccDeaths, polar=TRUE)

I often want to add a time series line to an existing plot. Base graphics has line() which works well when a time series is passed as an argument. So I added autolayer which is similar (but more general). It is an S3 method like autoplot, and adds a layer to an existing ggplot object. autolayer will eventually form part of the next release of ggplot2, but for now it is available in the forecast package. There are methods provided for ts and forecast objects:

WWWusage %>% ets %>% forecast(h=20) -&gt; fc
autoplot(WWWusage, series="Data") + 
  autolayer(fc, series="Forecast") + 
  autolayer(fitted(fc), series="Fitted")

Cross-validation

The tsCV and CVar functions have been added. These were discussed in a previous post.

Bagged ETS

The baggedETS function has been added, which implements the procedure discussed in Bergmeir et al (2016) for bagging ETS forecasts.

head and tail of time series

I’ve long found it annoying that head and tail do not work on multiple time series. So I added some functions to the package so they now work.

Imports and Dependencies

The pipe operator from the magrittr package is now imported. So you don’t need to load the magrittr package to use it.

There are now no packages that are loaded with forecast – everything required is imported. This makes the start up much cleaner (no more annoying messages from all those packages being loaded). Instead, some random tips are occasionally printed when you load the forecast package (much like ggplot2 does).

There is quite a bit more — see the Changelog for a list.

시각적 자극에 대한 뇌 반응 모델링 – superheat 패키지

, , ,

이 사례 연구는 UC Berkeley의 Gallant 신경 과학 연구소 (Gallant lab 관련 논문 참조)에서 한 개인에 대해 수행한 기능적 자기 공명 영상 (fMRI) 실험에서 수집 한 데이터를 검사합니다. edu / mlpapers / paper_files / NIPS2008_0963.pdf “> 참조1 참조2 ).

fMRI는 신경 활동의 간접적인 척도로 간주 될 수있는 뇌의 산소가 공급 된 혈액 흐름을 측정합니다 (두 과정은 상호 연관성이 높음). fMRI 실험에서 얻은 측정치는 뇌의 큐브와 같은 화소 내에서 수십만개 뉴런의 총 응답과 일치합니다. 두뇌를 3D 화소로 세분화하는 것은 이미지를 2D 픽셀로 세분화하는 것과 유사합니다.

The data

이 데이터에는 1,750 가지의 이미지 (예 : 아기, 집 또는 그림의 영상)에 대한 반응으로 한 개인의 시각 피질에있는 약 1,300개 화소에 대한 fMRI 측정 값 입니다.

진행 상황을 시각화하기 위해 아래 그림은 fMRI 시스템에서 개별 이미지를 보는 모습입니다.

An individual viewing an image while in an fMRI machine

fMRI 기기에서 이미지를 보는 개인

각각의 이미지는 길이 1282 = 16,3841282 = 16,384의 벡터로 표현 될 수 있지만 가보 (Gabor) 웨이블릿 변환을 통해 길이 10,92110,921로 감소 될 수있는 128 × 128128 × 128 픽셀 그레이 스케일 이미지이다.

원시 데이터는 Computational Neuroscience (CRCNS) 데이터 공유 저장소의 협업 연구에 저장되며 여기에서 찾을 수 있습니다. 데이터에 액세스하려면 데이터로 수행 할 작업을 설명하는 CRCNS 계정을 요청해야하지만 이는 매우 간단합니다.

Loading Superheat

Installing the superheat package from github is easy, assuming you have the devtools package installed in R. Simply type the following command:

# install devtools if you don't have it already
install.packages("devtools")
# install the development version of superheat
devtools::install_github("rlbarter/superheat")

Assuming that you didn’t run into any unfortunate errors when installing the package, you can load the package into R in the normal way.

library(superheat)

Viewing the raw images

# some useful libraries
library(ggplot2)
library(dplyr)
library(gridExtra)
library(knitr)
library(RColorBrewer)

It is helpful to have an idea of what kind of images the subject was viewing. The raw images are contained in the Stimuli.mat file, and is separated into a set of 1,750 training images (used to train our models) and 120 validation images (used to evaluate model performance).

The code below loads in the data and extracts the training and validation images

# a library for loading matlab files
library(R.matlab)
# load in the stimuli images
stimuli <- readMat("raw_data/Stimuli.mat")
# extract training stimuli array
train.stimuli <- stimuli[[1]]
# extract validation stimuli array
val.stimuli <- stimuli[[2]]
# remove the original stimuli object
rm(stimuli)

We display four of the training images below using our superheat package.

# view some of the images
im1 <- superheat(train.stimuli[1, 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = F)
im2 <- superheat(train.stimuli[3, 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = F)
im3 <- superheat(train.stimuli[10, 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = F)
im4 <- superheat(train.stimuli[15, 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = F)
# place the images in a 2x2 grid
grid.arrange(im1$plot, im2$plot, im3$plot, im4$plot, ncol = 2)

Preparing the data for analysis

Despite best efforts, sadly not all of the data is publicly available. The Gabor wavelet filters, for example, are not available on the CRCNS website (see above), but the dedicated reader can try to generate their own Gabor wavelets using the raw images contained in Stimuli.mat (see above).

Gabor wavelet features

I, however, have access to a file, fMRIdata.RData which contains the Gabor wavelet features for the training and validation set images. These Gabor feature matrices are contained in the fit_feat and val_feat objects from the fMRIdata.Rdata file (again, this is unfortunately not publicly available).

# load in the gabor wavelet filters
load("processed_data/fMRIdata.RData") # not available in github
# fit_feat contains the gabor wavelet features for the 1750 training images
train.feat <- fit_feat
dim(train.feat)
## [1]  1750 10921
# val_feat contains the gabor wavelet features for the 120 validation images
val.feat <- val_feat
dim(val.feat)
## [1]   120 10921
# Remove the other objects that correspond to the responses of a subset of the voxels
rm(fit_feat)
rm(val_feat)
rm(resp_dat)
rm(loc_dat)

The voxel responses

The voxel responses to each image was collected for two subjects, “S1” and “S2”. We will restrict our analysis to predicting the voxel response in the V1 region for Subject 1 only (see the image below taken from Matthew Schmolesky’s Webvision).

Loading in the voxel responses (from the EstimatedResponses.mat file) involves first converting the .mat file to a format readable by R. Specifically I had to convert the data to version 6 MATLAB file in Octave:

# In Octave, run:
>> dat = load("EstimatedResponses.mat")
>> save("-V6", "EstimatedResponsesV6.mat", "dat")

Having converted the original MATLAB file to version 6 MATLAB file, we can load it into R using the R.matlab package.

# load in the Version 6 matlab file
voxel.response <- readMat("processed_data/EstimatedResponsesV6.mat")$dat
dimnames(voxel.response)[[1]]
## [1] "dataTrnS1" "dataTrnS2" "dataValS1" "dataValS2" "roiS1"     "roiS2"    
## [7] "voxIdxS1"  "voxIdxS2"

We can then filter through the objects contained in this file to extract only the responses for the 1331 V1 voxels to the training and validation images.

# extract useful objects
# the columns of train.resp are the 1,750 training images
# the rows are the 25,915 voxels
train.resp <- voxel.response[[1]]
# the columns of val.resp are the 120 training images
# the rows are the 25,915 voxels
val.resp <- voxel.response[[3]]
# extract the V1 voxels
V1.vox.index <- which(voxel.response[[5]][, 1] == 1)
# there are 1331 V1 voxels
length(V1.vox.index)
## [1] 1331
# filter train.resp and val.resp to the V1 voxels only
train.resp <- train.resp[V1.vox.index, ]
val.resp <- val.resp[V1.vox.index, ]
# remove the remaining data
rm(voxel.response)

Cleaning the data

The final data objects in our workspace are

ls()
##  [1] "im1"            "im2"            "im3"            "im4"           
##  [5] "train.feat"     "train.resp"     "train.stimuli"  "V1.vox.index"  
##  [9] "val.feat"       "val.resp"       "val.stimuli"    "voxel.response"

where

  • train.stimuli: a 1750×128×1281750×128×128 array corresponding to the 1,750 raw training images each of dimension 128×128128×128.
  • val.stimuli: a 120×128×128120×128×128 array corresponding to the 120 raw validation images each of dimension 128×128128×128.
  • train.feat: a 1750×109211750×10921 matrix corresponding to the 10,921 Gabor wavelet features for each of the 1,750 training images.
  • val.feat: a 120×10921120×10921 matrix corresponding to the 10,921 Gabor wavelet features for each of the 120 validation images.
  • train.resp: a 1331×17501331×1750 matrix containing the responses of the 1,331 voxels in the V1 region to each of the 1,750 training images.
  • val.resp: a 1331×1201331×120 matrix containing the responses of the 1,331 voxels in the V1 region to each of the 120 validation images.

Missing values

Note that of the 1,331 voxels in the V1 region, 37 of them have at least 40% missing responses. So we will remove these voxels.

# identify the proportion of missing values for each voxel
missing.variables <- apply(train.resp, 1, function(x) sum(is.na(x)) / length(x))
# print the proportion of missingness for the voxels with at 
# least some missingness.
length(missing.variables[missing.variables > 0])
## [1] 37
# remove these voxels from the training and validation sets
train.resp <- t(train.resp[which(missing.variables == 0), ])
val.resp <- t(val.resp[which(missing.variables == 0), ])
save(train.feat, train.resp, file = "processed_data/fmri_training_data.RData")
save(val.feat, val.resp, file = "processed_data/fmri_validation_data.RData")
# number of voxels remaining after removing missing values
ncol(train.resp)
## [1] 1294

We now thus have 1,294 voxels in the V1 region.

Modeling

For each of our models, our goal is to predict the response to the viewing of an image for each of the 1,294 voxels (cubic region in the brain).

That is, we have 1,294 separate models (one for each voxel), where the predictors/variables correspond to the 10,921 Gabor features from the training images.

Feature selection

To make the problem computationally feasible, we decided to filter the 10,921 Gabor features to the 500 that were most correlated with each voxel.

To identify the correlation of each Gabor feature with each voxel response, we ran the following code (also contained in the code/voxel_cor.R file that can be found here) on the statistics department computer cluster at UC Berkeley. You could probably run it on your laptop, but I would advise against it unless you are incredibly patient.

library(parallel)

nCores <- 24  # to set manually 
cl <- makeCluster(nCores) 

# export the necessary variables
clusterExport(cl, c("train.feat", "train.resp", "glmnet"), 
              envir=environment()) 
# calcualte the correlation of each variable with each voxel
cor.vox <- parLapply(cl, data.frame(train.resp), function(voxel) {
  apply(train.feat, 2, function(feature) cor(voxel, feature))
})
# save the results
save(cor.vox, file = "results/voxel_cor.RData")

Next, to identify the 500 Gabor features that are most correlated with each voxel, we can simply load the correlation data saved above and run the following code.

load("results/voxel_cor.RData")
# identify the 500 most correlated features for each variable
top.features <- lapply(cor.vox, function(cor) {
  order(cor, decreasing = TRUE)[1:500]
})

Lasso + OLS model

We now fit a lasso linear model to predict the voxel responses to the image stimuli.

The 1750×5001750×500 design matrix (feature matrix) can be represented as follows:

X=⎡⎣⎢⎢⎢⎢x1,1x2,1x1750,1x1,2x2,2x1750,2............x1,500x2,500x1750,500⎤⎦⎥⎥⎥⎥X=[x1,1×1,2…x1,500×2,1×2,2…x2,500⋮⋮…⋮x1750,1×1750,2…x1750,500]

where xi,jxi,j is the jjth Gabor wavelet feature value for the iith image.

We have 1,294 response vectors (as we fit 1,294 separate models) each of which contains the response for an individual voxel to the 17501750 training images. The response vector for voxel v{1,...,1294}v∈{1,…,1294} can be written as:

yv=⎡⎣⎢⎢⎢⎢⎢yv1yv2yv1750⎤⎦⎥⎥⎥⎥⎥yv=[y1vy2v⋮y1750v]

To fit these 1,294 models, the following code was run using the Statistics Department cluster at UC Berkeley. You will not be able to run this code on your laptop.

library(glmnet)
library(parallel)
# set up the cluster for parallelization
nCores <- 24  # to set manually 
cl <- makeCluster(nCores) 

# load the CV parameter selection function
source("code/select_lambda.R")

# fit the lasso model to each voxel using the 1000 most correlated features
# for the voxel.
clusterExport(cl, c("train.feat", "train.resp", "glmnet", "top.features"), 
              envir=environment()) 
lasso.list <- parLapply(cl, 1:ncol(train.resp), function(voxel) {
  glmnet(x = train.feat[, top.features[[voxel]]], 
         y = as.vector(train.resp[ , voxel])) 
})
# extract the lambda sequence from each voxel
# we want to avoid the ends of the sequence
# note that some sequences terminate early so providing a 
# universal endpoint for the vector sometimes produces an error
lambda.seq <- lapply(lasso.list, function(model) {
  model$lambda[5:length(model$lambda)]
  })

# use CV to select the best lambda for each voxel-model
# export the necessary variables
clusterExport(cl, c("train.feat", 
                    "train.resp", "top.features", 
                    "cv.glmnet", "lambda.seq"), 
              envir=environment()) 
# the ith entry of selected.lambda corresponds to the 
# best lambda value for voxel i
selected.lambda <- selectLambda(cl = cl, x.mat = train.feat, 
                       y.mat = train.resp, features = top.features, 
                       lambda.seq = lambda.seq)
# save the results
save(lasso.list, selected.lambda, top.features, 
     file = "results/lasso_results_top500.RData")
# stop the cluster
stopCluster(cl)

Having run the above code on the cluster and saved the results in the file lasso_results_top500.RData, we can load the results into the current R session and calculate the corresponding voxel response predictions.

First, we can extract the model coefficients for each model (since we saved the results for each possible lambda, we need to filter to the model with the “best” lambda).

load("results/lasso_results_top500.RData")
# extract the best model from lasso.list (above) corresponding to the best lambda
beta <- lapply(1:length(lasso.list), function(voxel) {
  model.index <- which(lasso.list[[voxel]]$lambda == selected.lambda[voxel])
  as.matrix(lasso.list[[voxel]]$beta)[, model.index]
})

We can also identify which feature was selected (had a non-zero lasso coefficient) by each voxel.

# identify the selected features for each voxel
selected.features.list <- lapply(1:length(top.features), function(voxel) {
  # idenitfy which features were selected for the voxel
  # start with the features used to train the lasso model
  voxel.features <- top.features[[voxel]]
  # obtain the coefficients for these feautures
  voxel.features.selected <- beta[[voxel]] != 0
  # filter the features index and beta vector to the non-zero coefficients
  voxel.features <- voxel.features[voxel.features.selected]
})

Next, for each voxel we re-fit a classical un-regularized OLS model for each voxel based on the selected (non-zero) coefficients from the corresponding lasso model.

# fit an OLS model for each voxel based on the selected features
ols.list <- lapply(1:ncol(train.resp), function(voxel) {
  voxel.features <- selected.features.list[[voxel]]
  # filter the feature matrix to these selected features
  voxel.train <- as.data.frame(train.feat[, voxel.features])
  colnames(voxel.train) <- paste0("X", 1:ncol(voxel.train))
  voxel.train$y <- train.resp[, voxel]
  # fit an OLS model
  lm(y ~ ., data = voxel.train)
})

Evaluating the model on the validation set

The code below extracts the predicted voxel response for the validation set.

# calculate the predicted voxel responses for each image in the TRAINING set
predictions <- lapply(1:ncol(val.resp), function(voxel) { 
  # extract the selected Gabor features for the current voxel
  voxel.val <- as.data.frame(val.feat[, selected.features.list[[voxel]]])
  colnames(voxel.val) <- paste0("X", 1:ncol(voxel.val))
  # predict the voxel response for the validation images 
  predict.lm(ols.list[[voxel]], voxel.val)
})

Below we calculate and plot the correlation of the true voxel responses with the predicted voxel responses for each of the 1,294 voxels.

# calculate the correlation between the predictions and the true responses for each voxel
prediction.cor <- sapply(1:ncol(val.resp), function(voxel) {
  cor(predictions[[voxel]], val.resp[ ,voxel])
})
# convert to data frame
prediction.cor <- data.frame(voxel = 1:length(prediction.cor), cor = prediction.cor)
# plot a histogram of the correlations between the true and predicted voxel responses
ggplot(prediction.cor) + 
  geom_histogram(aes(x = cor), col = "white", binwidth = 0.02) +
  scale_y_continuous(name = "Number of voxels") +
  scale_x_continuous(name = "Correlation of predicted and true voxel response") 

A histogram of the correlation of the true voxel response on the validation images and the predicted voxel response on the validation images for each of the 1294 voxels.

A histogram of the correlation of the true voxel response on the validation images and the predicted voxel response on the validation images for each of the 1294 voxels.

We see that there appear to be two groups of voxels: those whose whose true and predicted correlations are quite low (around 0.2), and another group whose correlations are around 0.6.

Visualizing voxel performance accross all voxels

We will now use superheat to simultaneously evaluate the performance of the lasso models for predicting the responses to the validation images of each of the 1294 voxels.

First we will cluster the images and voxels into two groups for visualization purposes.

set.seed(384653)
# calculate row clusters
image.clusters <- kmeans(val.resp, centers = 2)$cluster
# calcualte column clusters
voxel.clusters <- kmeans(t(val.resp), centers = 2)$cluster
save(voxel.clusters, file = "results/voxel_clusters.RData")

Below we plot some examples of images from each cluster clusters. The following presents four images from the first cluster

set.seed(23847)
four.images.cl1 <- sample(which(image.clusters == 1), 4)
cl1.im1 <- superheat(val.stimuli[four.images.cl1[1], 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = FALSE)
cl1.im2 <- superheat(val.stimuli[four.images.cl1[2], 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = FALSE)
cl1.im3 <- superheat(val.stimuli[four.images.cl1[3], 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = FALSE)
cl1.im4 <- superheat(val.stimuli[four.images.cl1[4], 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = FALSE)
grid.arrange(cl1.im1$plot, cl1.im2$plot, cl1.im3$plot, cl1.im4$plot, ncol = 2)

While the next four images are from the second cluster.

set.seed(23847)
four.images.cl2 <- sample(which(image.clusters == 2), 4)
cl2.im1 <- superheat(val.stimuli[four.images.cl2[1], 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = FALSE)
cl2.im2 <- superheat(val.stimuli[four.images.cl2[2], 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = FALSE)
cl2.im3 <- superheat(val.stimuli[four.images.cl2[3], 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = FALSE)
cl2.im4 <- superheat(val.stimuli[four.images.cl2[4], 128:1, ], 
          heat.pal = c("black", "white"),
          legend = FALSE,
          print.plot = FALSE)
grid.arrange(cl2.im1$plot, cl2.im2$plot, cl2.im3$plot, cl2.im4$plot, ncol = 2)

Next, we plot a heatmap of the validation response matrix, that is, the response of each voxel to each validation image. Above each column in the heatmap (each column corresponds to a voxel), we plot the correlation of the observed voxel response with the predicted voxel response.

superheat(val.resp, 
          yt = prediction.cor$cor,
          yt.axis.name = "Correlation between\npredicted and true\nvoxel responses",
          
          heat.pal = brewer.pal(5, "RdBu"),
          
          yt.obs.col = rep("slategray4", ncol(val.resp)),
          yt.point.alpha = 0.6,
          yt.axis.name.size = 12,
          yt.plot.size = 0.7,
          yt.point.size = 2,
          
          membership.rows = image.clusters,
          membership.cols = voxel.clusters,
          
          left.label = "none",
          bottom.label = "none",
          grid.hline.col = "white",
          grid.vline.col = "white",
          grid.hline.size = 2,
          grid.vline.size = 2,
          
          row.title = "Validation images (120)",
          column.title = "Voxels (1,294)",
          title = "(a)")

The heatmap above is very noisy as a lot of information is being crammed into a small number of pixels. It is thus often much easier to “smooth” the heatmap within its clusters to highlight the “big picture”.

superheat(val.resp, 
          
           X.text = matrix(c("lower\nvoxel\nresponse",
                            "higher\nvoxel\nresponse",
                            "neutral\nvoxel\nresponse", 
                            "neutral\nvoxel\nresponse"), ncol = 2),
          X.text.col = matrix(c("white", "white", "black", "black"), ncol = 2),
          
          smooth.heat = T,
          
          heat.pal = brewer.pal(5, "RdBu"),
          
          yt = prediction.cor$cor,
          yt.axis.name = "Correlation between\npredicted and true\nvoxel responses",
          yt.plot.type = "boxplot",
          yt.cluster.col = "slategray4",
          yt.axis.name.size = 12,
          yt.plot.size = 0.7,
          
          membership.rows = image.clusters,
          membership.cols = voxel.clusters,
          
          left.label = "none",
          bottom.label = "none",
          grid.hline.col = "white",
          grid.vline.col = "white",
          grid.hline.size = 2,
          grid.vline.size = 2,
          
          row.title = "Validation images (120)",
          column.title = "Voxels (1,294)",
          title = "(b)")
## [1] -0.30 -0.20 -0.04  0.09  0.20

What we can see is that the two clusters of voxels behave very differently:

  • the first cluster of voxels is much more active in response to viewings of images from the second cluster of images than to images from the first cluster.
  • the second cluster of voxels does not appear to exhibit a strong difference in response to the two clusters of images.

Furthermore, the predicted responses for the first cluster of voxels are much more correlated with the true responses for these voxels, with an average correlation of around 0.5. The predicted responses to the second cluster of voxels, on the other hand, have approximately zero correlation with the true responses exhibited by these voxels.

Exploring the voxel clusters

Notice that we found two clusters of voxels that respond to the visual stimuli differently: the first cluster of voxels is highly sensitive to visual stimuli, whereas the second cluster is not.

Our goal is to explore the physical locations of the voxels. We can load in the locations data that can be found on Yuval Benjamini’s website.

load("raw_data/v1_locations.RData")
# v1 locations
dim(v1_locations)
## [1] 1331    3
head(v1_locations)
##      [,1] [,2] [,3]
## [1,]   34   22    3
## [2,]   34   23    3
## [3,]   34   24    3
## [4,]   34   25    3
## [5,]   34   26    3
## [6,]   34   20    4

Note that v1_locations appears to hold the (x, y, z)-locations of each V1 voxel. However, recall that we had a number of voxels with mostly missing values. We remove these from our location data frame below.

v1.locations <- as.data.frame(v1_locations[which(missing.variables == 0), ])
colnames(v1.locations) <- c("x", "y", "z")
v1.locations$cluster <- factor(voxel.clusters)
rm(v1_locations)

Next, we can plot the voxels in space.

library(plotly)
voxel.clusters <- factor(paste("cluster", voxel.clusters))
plot_ly(v1.locations, x = ~x, y = ~y, z = ~z, 
        color = ~cluster, 
        type = "scatter3d",
        mode = "markers",
        colors = c('#e41a1c', '#377eb8'),
        marker = list(line = list(color = 'black', width = 1.5))) %>%
  layout(scene = list(xaxis = list(title = 'x'),
                   yaxis = list(title = 'y'),
                   zaxis = list(title = 'z')))

 

library(plotly)
voxel.clusters <- factor(paste("cluster", voxel.clusters))
plot_ly(v1.locations, x = ~x, y = ~y, z = ~z, 
        color = ~cluster, 
        type = "scatter3d",
        mode = "markers",
        colors = c('#e41a1c', '#377eb8'),
        marker = list(line = list(color = 'black', width = 1.5))) %>%
  layout(scene = list(xaxis = list(title = 'x'),
                   yaxis = list(title = 'y'),
                   zaxis = list(title = 'z')))

 

소스: Superheat

R을 활용한 퀀트트레이드

, ,

“R을 활용한 퀀트 트레이딩” 교육 자료입니다.

[출처] 17. R을 활용한 퀀트 트레이딩 교육자료|작성자 아마퀀트

  1. R 프로그래밍 기초
  • R 및 RStudio 설치 및 사용자 인터페이스
  • R 기본 명령어 구조
  • R 기본 자료형 (벡터, 행렬, 리스트, 데이터 프레임)
  • R 기본 문법 (조건문, 반복문)
  • R 확장 자료형 (시계열 지료형, xts)
  • quantmod 패키지 사용법
  • 주가의 기술적 분석 백 테스트 예시

[출처] R을 활용한 퀀트 트레이딩 교육자료|작성자 아마퀀트

빅데이터 인포그래픽스

, ,

빅데이터를 한눈에 파악해 볼 수 있는 코너인 “빅데이터 인포그래픽스” 입니다. 각종 데이터나 정보를 시각적으로 표현해서 빅데이터에 대한 이해를 좀더 쉽게 했습니다.