This file contains the code used for implementing the classifer in paper Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring.

Data Set: Raw/original data used in the Golub etc. paper.

- Train Data: 38 bone marrow samples(27ALL, 11AML).
- Test Data: 34 samples(24 bone marrow, 10 peripheral blood samples, 20ALL, 14AML).
- Predictors: 7129 gene expression levels represent 6817 genes.

Main Purpose: Golub Classifier: class prediction, using the built classifiers to predict the class of a new tumor sample.

Reproduce using R

Most of the steps below strictly follow the procedures in the class prediction part in the Golub Paper and we get highly resemble result as in the paper. To reproduce the reproduced result, a seed is set in step 2 below.

Step 1 Load and transform dataset.

Step 1(a): Install Bioconductor biocLite package in order to access the golubEsets library. golubEsets contains the raw data used by Todd Golub in the original paper.


In [1]:
## Most code is commented in this cell since it is unnecessary and time-consuming to run it everytime.
# options(repos='http://cran.rstudio.com/') 
# source("http://bioconductor.org/biocLite.R")
# biocLite("golubEsets")
suppressMessages(library(golubEsets))

Step 1(b): Load the training, testing data from library golubEsets. Also transpose the data to make observations as rows.


In [2]:
#Training data predictor and response
data(Golub_Train)
golub_train_p = t(exprs(Golub_Train))
golub_train_r =pData(Golub_Train)[, "ALL.AML"]
#Testing data predictor
data(Golub_Test)
golub_test_p = t(exprs(Golub_Test))
golub_test_r = pData(Golub_Test)[, "ALL.AML"]
#Show summary
rbind(Train = dim(golub_train_p), Test = dim(golub_test_p))
cbind(Train = table(golub_train_r),Test = table(golub_test_r))


Train38 7129
Test34 7129
TrainTest
ALL2720
AML1114

Step 1(c): Perform data preprocessing: thresholding, filtering, logarithmic transformation and normalization as in the paper. The predictor is reduced to 3051 after preprocessing.

Most details of step 1(c) are not included in the original paper. We combine the information in paper 2, paper 9 and also a reproduce work done by Robert Gentleman, who confirmed in his work the procedure of thresholding and filtering is the same as in the original paper. One also need to notice that we should use the mean and standard deviation in the training data to normalize the testing data as mentioned in the Appendix A of the paper 2. At the end of this step, there are 3051 predictors left. The resulting dataset are same as the $72\times 3051$ Golub dataset available online.


In [3]:
# Thresholding
golub_train_pp = golub_train_p
golub_train_pp[golub_train_pp<100] = 100
golub_train_pp[golub_train_pp>16000] = 16000

# Filtering
golub_filter = function(x, r = 5, d=500){
    minval = min(x)
    maxval = max(x)
    (maxval/minval>r)&&(maxval-minval>d)
}
index = apply(golub_train_pp, 2, golub_filter)
golub_index = (1:7129)[index]
golub_train_pp = golub_train_pp[, golub_index]

golub_test_pp = golub_test_p
golub_test_pp[golub_test_pp<100] = 100
golub_test_pp[golub_test_pp>16000] = 16000
golub_test_pp = golub_test_pp[, golub_index]

# Log Transformation
golub_train_p_trans = log10(golub_train_pp)
golub_test_p_trans = log10(golub_test_pp)

# Normalization
train_m = colMeans(golub_train_p_trans)
train_sd = apply(golub_train_p_trans, 2, sd)
golub_train_p_trans = t((t(golub_train_p_trans)-train_m)/train_sd)
golub_test_p_trans  = t((t(golub_test_p_trans)-train_m)/train_sd)
golub_train_3051 = golub_train_p_trans
golub_train_response = golub_train_r
golub_test_3051 = golub_test_p_trans
golub_test_response = golub_test_r
save(golub_train_3051, golub_train_response, golub_test_3051, golub_test_response, file = "../transformed data/golub3051.rda")
set.seed(201703)

In [4]:
cbind(train = dim(golub_train_p_trans),test = dim(golub_test_p_trans))


traintest
38 34
30513051

Step2: Build classifier/predictor.

Step 2(a): Neighbourhood analysis, permutation test and "signal-to-noise" ratio calculation: $P(g, c) = (\mu_{AML}-\mu_{ALL})/(\sigma_{AML} + \sigma_{ALL})$ as in notes 16 and 17. As we can see from the table at the end, neighbourhood analysis further reduce the number of predictors to 740 using 0.01 significant level permutation test.


In [5]:
# Neighbourhood analysis
##signal-to-noise ratio/PS in the paper
get_p = function(train_d, train_r){
    tr_m_aml =  colMeans(train_d[train_r == "AML",])
    tr_sd_aml = apply(train_d[train_r == "AML",], 2, sd)
    tr_m_all = colMeans(train_d[train_r == "ALL",])
    tr_sd_all = apply(train_d[train_r == "ALL",], 2, sd)
    p = (tr_m_aml-tr_m_all)/(tr_sd_aml+tr_sd_all)
    return(p)
}

In [6]:
nna = matrix(0, 400, 3051)
set.seed(201702)
# Permutation test
for(i in 1:400){
    t_r = sample(golub_train_r)
    nna[i, ] = get_p(golub_train_p_trans, t_r)
}

In [7]:
# Predictor selection based on the result of Neighbourhood analysis
nna_q = apply(nna, 2, quantile, prob = c(0.005, 0.995))
p = get_p(golub_train_p_trans, golub_train_r)
# With 0.01 significant level
index_1 = (1:3051)[p>=nna_q[2,] | p<=nna_q[1,]]
golub_train_p_trans = golub_train_p_trans[, index_1]
train_m_aml = colMeans(golub_train_p_trans[golub_train_r == "AML",])
train_m_all = colMeans(golub_train_p_trans[golub_train_r =="ALL",])
golub_test_p_trans =golub_test_p_trans[, index_1]
p = p[index_1]

In [8]:
cbind(train = dim(golub_train_p_trans),test = dim(golub_test_p_trans))


traintest
38 34
740740

Step 2(b): Select the informative 50 genes as in the original paper. We follow the note 18 in selecting the genes.


In [9]:
cl_index = c(head(order(p), 25), head(order(p, decreasing = T), 25))
p_50 = p[cl_index]
b = (train_m_aml[cl_index]+train_m_all[cl_index])/2
train_cl = golub_train_p_trans[, cl_index]
test_cl = golub_test_p_trans[, cl_index]
golub_train_50 = train_cl
golub_test_50 = test_cl
save(golub_train_50, golub_train_response, golub_test_50, golub_test_response, file = "../transformed data/golub50gene.rda")

Step 3: Predict on training and testing data.

The prediction steps are the same as in note 18, 19 and 20.


In [10]:
#train
train_vote = t(p_50*t(sweep(train_cl, 2, b)))
train_V1 = apply(train_vote, 1, function(x) sum(x[x>0]))
train_V2 = abs(apply(train_vote, 1, function(x) sum(x[x<=0])))
train_pred = (train_V1-train_V2)/(train_V1+train_V2)
train_pred_r = ifelse(abs(train_pred)>0.3, ifelse(train_pred>0, "AML", "ALL"), "Uncertain")
train_table = table(Train_Predict = train_pred_r, Train_Actual = golub_train_r)
train_table
#test
test_vote = t(p_50*t(sweep(test_cl, 2, b)))
test_V1 = apply(test_vote, 1, function(x) sum(x[x>0]))
test_V2 = abs(apply(test_vote, 1, function(x) sum(x[x<=0])))
test_pred = (test_V1-test_V2)/(test_V1+test_V2)
test_pred_r = ifelse(abs(test_pred)>0.3, ifelse(test_pred>0, "AML", "ALL"), "Uncertain")
test_table = table(Test_Predict = test_pred_r, Test_Actual = golub_test_r)
test_table


             Train_Actual
Train_Predict ALL AML
    ALL        26   0
    AML         0  11
    Uncertain   1   0
            Test_Actual
Test_Predict ALL AML
   ALL        19   0
   AML         0  12
   Uncertain   1   2

Summary of Reproduce result

In step 1, data are loaded and preprocessed. However, most details of step 1(c) are not included in the original paper. We combine the information in paper 2, paper 9 and also a reproduce work done by Robert Gentleman, who confirmed in his work the procedure of thresholding and filtering. One also need to notice that we should use the mean and standard deviation in the training data to normalize the testing data as mentioned in the notes of the orignal paper. At the end of step 1, there are 3051 predictors left. The resulting dataset are same as the $72\times 3051$ Golub dataset available online.

In step 2, we first do the neighborhood analysis mentioned in the paper. However, detailed implementation is based on our understanding. (It is basically a permutation test and the seed is set to be 201702.) The method to get the critical values is simplified and hence the bound may be loose. The analysis further reduce the number of predictors, which you can check in the dataset dimensions. We then follow the procedure in the paper and select 50 most related genes as classifiers and subset our train and test data for classification sake.

In step 3, we perform the classification or prediction for both training data and testing data. As we can see in the two contigency table, the result is slightly better than in the original paper, since the bootstrap actually have some randomness.

Result Comparison: Train Prediction: In the paper, they correctly classify 36/38 with remaining two as uncertain. We correctly classify 37/38 with one as uncertain. Test Prediction: In the paper, they correctly classify 29/34 and we correctly classify 31/34.

However, in the original paper, they did leave one out cross validation to get the train accuracy and we didn't.

Compare classification with other feature selection method

  • Based on Paper 3 feature selection

In [11]:
# test_paper3 test_response train_paper3 train_response loaded
load("../transformed data/paper3.rda")
rbind( Train = dim(train_paper3), Test = dim(test_paper3))


train_paper3_aml = colMeans(train_paper3[train_response == "AML",])
train_paper3_all = colMeans(train_paper3[train_response =="ALL",])
b_p3 = (train_paper3_aml+train_paper3_all)/2
p_p3 = get_p(train_paper3, train_response)

#train
train_vote_p3 = t(p_p3*t(sweep(train_paper3, 2, b_p3)))
train_V1_p3 = apply(train_vote_p3, 1, function(x) sum(x[x>0]))
train_V2_p3 = abs(apply(train_vote_p3, 1, function(x) sum(x[x<=0])))
train_pred_p3 = (train_V1_p3-train_V2_p3)/(train_V1_p3+train_V2_p3)
train_pred_r_p3 = ifelse(abs(train_pred_p3)>0.3, ifelse(train_pred_p3>0, "AML", "ALL"), "Uncertain")
train_table_p3 = table(Train_Predict = train_pred_r_p3, Train_Actual = train_response)
train_table_p3
#test
test_vote_p3 = t(p_p3*t(sweep(test_paper3, 2, b_p3)))
test_V1_p3 = apply(test_vote_p3, 1, function(x) sum(x[x>0]))
test_V2_p3 = abs(apply(test_vote_p3, 1, function(x) sum(x[x<=0])))
test_pred_p3 = (test_V1_p3-test_V2_p3)/(test_V1_p3+test_V2_p3)
test_pred_r_p3 = ifelse(abs(test_pred_p3)>0.3, ifelse(test_pred_p3>0, "AML", "ALL"), "Uncertain")
test_table_p3 = table(Test_Predict = test_pred_r_p3, Test_Actual = test_response)
test_table_p3


Train3850
Test3450
             Train_Actual
Train_Predict ALL AML
    ALL        27   0
    AML         0   9
    Uncertain   0   2
            Test_Actual
Test_Predict ALL AML
   ALL        20   1
   AML         0  12
   Uncertain   0   1
  • Based on paper 6 feature reduction

In [12]:
# pca_train, pca_test, pls_train, pls_test loaded
load("../transformed data/paper6.rda")
rbind(Train_pca = dim(pca_train), Test_pca = dim(pca_test), Train_pls = dim(pls_train), Test_pls = dim(pls_test))


Train_pca384
Test_pca344
Train_pls384
Test_pls344

As we could see from the table above, paper6 performs dimension reduction and there is only three resulting features(components). However, we could still perform classification using those three features.


In [13]:
# pca
train_p6_pca_aml = colMeans(pca_train[pca_train$response== "AML",-1])
train_p6_pca_all = colMeans(pca_train[ pca_train$response=="ALL",-1])
b_p6_pca = (train_p6_pca_aml+train_p6_pca_all)/2
p_p6_pca = get_p(pca_train[, -1], pca_train$response)
#train
train_vote_p6_pca = t(p_p6_pca*t(sweep(pca_train[, -1], 2, b_p6_pca)))
train_V1_p6_pca = apply(train_vote_p6_pca, 1, function(x) sum(x[x>0]))
train_V2_p6_pca = abs(apply(train_vote_p6_pca, 1, function(x) sum(x[x<=0])))
train_pred_p6_pca = (train_V1_p6_pca-train_V2_p6_pca)/(train_V1_p6_pca+train_V2_p6_pca)
train_pred_r_p6_pca = ifelse(abs(train_pred_p6_pca)>0.3, ifelse(train_pred_p6_pca>0, "AML", "ALL"), "Uncertain")
train_table_p6_pca = table(Train_Predict = train_pred_r_p6_pca, Train_Actual = pca_train$response)
train_table_p6_pca
#test
test_vote_p6_pca = t(p_p6_pca*t(sweep(pca_test[, -1], 2, b_p6_pca)))
test_V1_p6_pca = apply(test_vote_p6_pca, 1, function(x) sum(x[x>0]))
test_V2_p6_pca = abs(apply(test_vote_p6_pca, 1, function(x) sum(x[x<=0])))
test_pred_p6_pca = (test_V1_p6_pca-test_V2_p6_pca)/(test_V1_p6_pca+test_V2_p6_pca)
test_pred_r_p6_pca = ifelse(abs(test_pred_p6_pca)>0.3, ifelse(test_pred_p6_pca>0, "AML", "ALL"), "Uncertain")
test_table_p6_pca = table(Test_Predict = test_pred_r_p6_pca, Test_Actual = pca_test$response)
test_table_p6_pca


             Train_Actual
Train_Predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_Predict ALL AML
         ALL  20   1
         AML   0  13

In [14]:
# pls
train_p6_pls_aml = colMeans(pls_train[pls_train$response== "AML",-1])
train_p6_pls_all = colMeans(pls_train[pls_train$response=="ALL",-1])
b_p6_pls = (train_p6_pls_aml+train_p6_pls_all)/2
p_p6_pls = get_p(pls_train[, -1], pls_train$response)
#train
train_vote_p6_pls = t(p_p6_pls*t(sweep(pls_train[, -1], 2, b_p6_pls)))
train_V1_p6_pls = apply(train_vote_p6_pls, 1, function(x) sum(x[x>0]))
train_V2_p6_pls = abs(apply(train_vote_p6_pls, 1, function(x) sum(x[x<=0])))
train_pred_p6_pls = (train_V1_p6_pls-train_V2_p6_pls)/(train_V1_p6_pls+train_V2_p6_pls)
train_pred_r_p6_pls = ifelse(abs(train_pred_p6_pls)>0.3, ifelse(train_pred_p6_pls>0, "AML", "ALL"), "Uncertain")
train_table_p6_pls = table(Train_Predict = train_pred_r_p6_pls, Train_Actual = pls_train$response)
train_table_p6_pls
#test
test_vote_p6_pls = t(p_p6_pls*t(sweep(pls_test[, -1], 2, b_p6_pls)))
test_V1_p6_pls = apply(test_vote_p6_pls, 1, function(x) sum(x[x>0]))
test_V2_p6_pls = abs(apply(test_vote_p6_pls, 1, function(x) sum(x[x<=0])))
test_pred_p6_pls = (test_V1_p6_pls - test_V2_p6_pls)/(test_V1_p6_pls + test_V2_p6_pls)
test_pred_r_p6_pls = ifelse(abs(test_pred_p6_pls)>0.3, ifelse(test_pred_p6_pls>0, "AML", "ALL"), "Uncertain")
test_table_p6_pls = table(Test_Predict = test_pred_r_p6_pls, Test_Actual = pls_test$response)
test_table_p6_pls


             Train_Actual
Train_Predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_Predict ALL AML
         ALL  20   1
         AML   0  13
  • Based on Paper 9 feature selection

In [15]:
# test_BW_predictor test_r train_BW_predictor train_r loaded
load("../transformed data/paper9.rda")
rbind(Train = dim(train_BW_predictor), Test = dim(test_BW_predictor))

train_p9_aml = colMeans(train_BW_predictor[train_r == "AML",])
train_p9_all = colMeans(train_BW_predictor[train_r =="ALL",])
b_p9 = (train_p9_aml+train_p9_all)/2
p_p9 = get_p(train_BW_predictor, train_r)

#train
train_vote_p9 = t(p_p9*t(sweep(train_BW_predictor, 2, b_p9)))
train_V1_p9 = apply(train_vote_p9, 1, function(x) sum(x[x>0]))
train_V2_p9 = abs(apply(train_vote_p9, 1, function(x) sum(x[x<=0])))
train_pred_p9 = (train_V1_p9-train_V2_p9)/(train_V1_p9+train_V2_p9)
train_pred_r_p9 = ifelse(abs(train_pred_p9)>0.3, ifelse(train_pred_p9>0, "AML", "ALL"), "Uncertain")
train_table_p9 = table(Train_Predict = train_pred_r_p9, Train_Actual = train_r)
train_table_p9
#test
test_vote_p9 = t(p_p9*t(sweep(test_BW_predictor, 2, b_p9)))
test_V1_p9 = apply(test_vote_p9, 1, function(x) sum(x[x>0]))
test_V2_p9 = abs(apply(test_vote_p9, 1, function(x) sum(x[x<=0])))
test_pred_p9 = (test_V1_p9-test_V2_p9)/(test_V1_p9+test_V2_p9)
test_pred_r_p9 = ifelse(abs(test_pred_p9)>0.3, ifelse(test_pred_p9>0, "AML", "ALL"), "Uncertain")
test_table_p9 = table(Test_Predict = test_pred_r_p9, Test_Actual = test_r)
test_table_p9


Train4850
Test2450
             Train_Actual
Train_Predict ALL AML
    ALL        28   0
    AML         0  17
    Uncertain   2   1
            Test_Actual
Test_Predict ALL AML
   ALL        16   0
   AML         0   7
   Uncertain   1   0
  • Based on paper29 feature selection

In [16]:
# test_kmeans train_kmeans golub_test_r golub_train_r loaded
load("../transformed data/paper29.rda")
rbind(Train = dim(train_kmeans), Test = dim(test_kmeans))

train_p29_aml = colMeans(train_kmeans[golub_train_r == "AML",])
train_p29_all = colMeans(train_kmeans[golub_train_r =="ALL",])
b_p29 = (train_p29_aml+train_p29_all)/2
p_p29 = get_p(train_kmeans, golub_train_r)

#train
train_vote_p29 = t(p_p29*t(sweep(train_kmeans, 2, b_p29)))
train_V1_p29 = apply(train_vote_p29, 1, function(x) sum(x[x>0]))
train_V2_p29 = abs(apply(train_vote_p29, 1, function(x) sum(x[x<=0])))
train_pred_p29 = (train_V1_p29-train_V2_p29)/(train_V1_p29+train_V2_p29)
train_pred_r_p29 = ifelse(abs(train_pred_p29)>0.3, ifelse(train_pred_p29>0, "AML", "ALL"), "Uncertain")
train_table_p29 = table(Train_Predict = train_pred_r_p29, Train_Actual = golub_train_r)
train_table_p29
#test
test_vote_p29 = t(p_p29*t(sweep(test_kmeans, 2, b_p29)))
test_V1_p29 = apply(test_vote_p29, 1, function(x) sum(x[x>0]))
test_V2_p29 = abs(apply(test_vote_p29, 1, function(x) sum(x[x<=0])))
test_pred_p29 = (test_V1_p29-test_V2_p29)/(test_V1_p29+test_V2_p29)
test_pred_r_p29 = ifelse(abs(test_pred_p29)>0.3, ifelse(test_pred_p29>0, "AML", "ALL"), "Uncertain")
test_table_p29 = table(Test_Predict = test_pred_r_p29, Test_Actual = golub_test_r)
test_table_p29


Train3850
Test3450
             Train_Actual
Train_Predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_Predict ALL AML
   ALL        18   1
   AML         0   6
   Uncertain   2   7

As we can see the result above, for paper29 selection, classification result is not as good as others and even a little worse than the bn result in paper29 notebook.