Load and transform dataset

(a). Install the golubEsets package from Bioconductor (via the biocLite installer) in order to access its data. golubEsets contains the raw data used by Todd Golub in the original paper.


In [1]:
## Most code in this cell is commented out, since running it every time is unnecessary and time-consuming.
#options(repos='http://cran.rstudio.com/')
#source("http://bioconductor.org/biocLite.R")
#biocLite("golubEsets")
suppressMessages(library(golubEsets))

(b). Load the training and testing data from the golubEsets library, and transpose them so that observations are in rows.


In [2]:
# Training data: predictors (samples x genes) and response (ALL/AML label)
data(Golub_Train)
golub_train_p = t(exprs(Golub_Train))
golub_train_r = pData(Golub_Train)[, "ALL.AML"]
# Testing data: predictors and response
data(Golub_Test)
golub_test_p = t(exprs(Golub_Test))
golub_test_r = pData(Golub_Test)[, "ALL.AML"]
# Show summary: dimensions and class counts of both sets
rbind(Train = dim(golub_train_p), Test = dim(golub_test_p))
cbind(Train = table(golub_train_r), Test = table(golub_test_r))


Train   38  7129
Test    34  7129

      Train  Test
ALL      27    20
AML      11    14

(c). Preprocess the data as in the paper: thresholding, filtering, logarithmic transformation and normalization. The number of predictors is reduced to 3051 after preprocessing.

Most details of step 1(c) are not included in the original paper. We combine the information in paper 2 and paper 9 with a reproduction by Robert Gentleman, who confirmed in his work that the thresholding and filtering procedure is the same as in the original paper. Note also that the testing data must be normalized with the mean and standard deviation of the training data, as mentioned in Appendix A of paper 2. At the end of this step, 3051 predictors are left. The resulting dataset is the same as the $72\times 3051$ Golub dataset available online.
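To make the thresholding and filtering rule concrete, here is a minimal sketch on a made-up matrix of 4 samples by 3 genes. The values and the names `toy`, `clamp` and `keep_gene` are illustrative assumptions and are not part of the Golub data. Intensities are clamped to the range [100, 16000], and a gene is kept only if its max/min ratio exceeds 5 and its max - min difference exceeds 500.

```r
# Toy expression matrix: rows are samples, columns are genes (made-up values).
# matrix() fills column by column, so each row of numbers below is one gene.
toy <- matrix(c( 50,  120,  90,   300,   # gene1: small spread after flooring
                200, 1500, 800, 20000,   # gene2: large spread, one value above the ceiling
                400,  700, 900,   650),  # gene3: ratio max/min is too small
              nrow = 4,
              dimnames = list(paste0("s", 1:4), paste0("gene", 1:3)))

# Thresholding: floor at `lo`, ceiling at `hi` (hypothetical helper).
clamp <- function(x, lo = 100, hi = 16000) {
  x[x < lo] <- lo
  x[x > hi] <- hi
  x
}

# Filtering rule applied to one gene (one column) at a time.
keep_gene <- function(x, r = 5, d = 500) {
  (max(x) / min(x) > r) && (max(x) - min(x) > d)
}

toy_clamped  <- clamp(toy)
keep         <- apply(toy_clamped, 2, keep_gene)
toy_filtered <- toy_clamped[, keep, drop = FALSE]
print(keep)
```

In this toy case only `gene2` passes both conditions, so `toy_filtered` has a single column; `drop = FALSE` keeps it a matrix.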


In [3]:
# Thresholding
golub_train_pp = golub_train_p
golub_train_pp[golub_train_pp<100] = 100
golub_train_pp[golub_train_pp>16000] = 16000

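# Filtering: keep a gene only if, in the thresholded training data,
# max/min > r and max - min > d
golub_filter = function(x, r = 5, d = 500){
    minval = min(x)
    maxval = max(x)
    (maxval/minval > r) && (maxval - minval > d)
}
index = apply(golub_train_pp, 2, golub_filter)
# Column positions of the retained genes (reused for the test data below)
golub_index = seq_along(index)[index]
golub_train_pp = golub_train_pp[, golub_index]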

# Apply the same thresholds and the training-derived gene selection to the test data
golub_test_pp = golub_test_p
golub_test_pp[golub_test_pp<100] = 100
golub_test_pp[golub_test_pp>16000] = 16000
golub_test_pp = golub_test_pp[, golub_index]

# Log Transformation
golub_train_p_trans = log10(golub_train_pp)
golub_test_p_trans = log10(golub_test_pp)

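# Normalization: standardize each gene with the training mean and sd
# (the same training statistics are applied to the test data)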
train_m = colMeans(golub_train_p_trans)
train_sd = apply(golub_train_p_trans, 2, sd)
golub_train_p_trans = t((t(golub_train_p_trans)-train_m)/train_sd)
golub_test_p_trans  = t((t(golub_test_p_trans)-train_m)/train_sd)
golub_train_3051 = golub_train_p_trans
golub_train_response = golub_train_r
golub_test_3051 = golub_test_p_trans
golub_test_response = golub_test_r
save(golub_train_3051, golub_train_response, golub_test_3051, golub_test_response, file = "../transformed data/golub3051.rda")
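
The normalization above standardizes the test data with the training mean and standard deviation. As a minimal sketch of the same idea on simulated data (the matrices `train` and `test` below are made up for illustration), base R's `scale()` with explicit `center` and `scale` arguments gives the same result as the transpose idiom used in the cell above.

```r
set.seed(1)
train <- matrix(rnorm(20, mean = 3), nrow = 5)  # 5 training samples, 4 genes (simulated)
test  <- matrix(rnorm(12, mean = 3), nrow = 3)  # 3 test samples, same 4 genes (simulated)

# Per-gene statistics are computed from the training data only.
train_m  <- colMeans(train)
train_sd <- apply(train, 2, sd)

# scale() subtracts `center` and divides by `scale`, column by column.
train_z <- scale(train, center = train_m, scale = train_sd)
test_z  <- scale(test,  center = train_m, scale = train_sd)

# The transpose idiom from the notebook cell, applied to the test data.
test_z2 <- t((t(test) - train_m) / train_sd)

# Training columns end up with mean 0 and sd 1; test columns generally do not.
print(round(colMeans(train_z), 10))
print(apply(train_z, 2, sd))
print(all.equal(test_z2, test_z, check.attributes = FALSE))
```

Reusing the training statistics keeps information from the test set out of the preprocessing, at the cost that the standardized test columns are not exactly centered.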