This file contains the code used for implementing the classifer in paper Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring.

Data Set: Raw/original data used in the Golub etc. paper.

- Train Data: 38 bone marrow samples(27ALL, 11AML).
- Test Data: 34 samples(24 bone marrow, 10 peripheral blood samples, 20ALL, 14AML).
- Predictors: 7129 gene expression levels represent 6817 genes.

Main Purpose: Golub Classifier: class prediction, using the built classifiers to predict the class of a new tumor sample.

Reproduce using R

Most of the steps below strictly follow the procedures in the class prediction part in the Golub Paper and we get highly resemble result as in the paper. To reproduce the reproduced result, a seed is set in step 2 below.

Step 1 Load and transform dataset.

Step 1(a): Install Bioconductor biocLite package in order to access the golubEsets library. golubEsets contains the raw data used by Todd Golub in the original paper.



In [1]:

    
## Most code is commented in this cell since it is unnecessary and time-consuming to run it everytime.
# options(repos='http://cran.rstudio.com/') 
# source("http://bioconductor.org/biocLite.R")
# biocLite("golubEsets")
suppressMessages(library(golubEsets))

Step 1(b): Load the training, testing data from library golubEsets. Also transpose the data to make observations as rows.



In [2]:

    
#Training data predictor and response
data(Golub_Train)
golub_train_p = t(exprs(Golub_Train))
golub_train_r =pData(Golub_Train)[, "ALL.AML"]
#Testing data predictor
data(Golub_Test)
golub_test_p = t(exprs(Golub_Test))
golub_test_r = pData(Golub_Test)[, "ALL.AML"]
#Show summary
rbind(Train = dim(golub_train_p), Test = dim(golub_test_p))
cbind(Train = table(golub_train_r),Test = table(golub_test_r))









    






	Train 38  7129
	Test 34  7129









    





Train Test

	ALL 27 20
	AML 11 14

Step 1(c): Perform data preprocessing: thresholding, filtering, logarithmic transformation and normalization as in the paper. The predictor is reduced to 3051 after preprocessing.

Most details of step 1(c) are not included in the original paper. We combine the information in paper 2, paper 9 and also a reproduce work done by Robert Gentleman, who confirmed in his work the procedure of thresholding and filtering is the same as in the original paper. One also need to notice that we should use the mean and standard deviation in the training data to normalize the testing data as mentioned in the Appendix A of the paper 2. At the end of this step, there are 3051 predictors left. The resulting dataset are same as the $72\times 3051$ Golub dataset available online.



In [3]:

    
# Thresholding
golub_train_pp = golub_train_p
golub_train_pp[golub_train_pp<100] = 100
golub_train_pp[golub_train_pp>16000] = 16000

# Filtering
golub_filter = function(x, r = 5, d=500){
    minval = min(x)
    maxval = max(x)
    (maxval/minval>r)&&(maxval-minval>d)
}
index = apply(golub_train_pp, 2, golub_filter)
golub_index = (1:7129)[index]
golub_train_pp = golub_train_pp[, golub_index]

golub_test_pp = golub_test_p
golub_test_pp[golub_test_pp<100] = 100
golub_test_pp[golub_test_pp>16000] = 16000
golub_test_pp = golub_test_pp[, golub_index]

# Log Transformation
golub_train_p_trans = log10(golub_train_pp)
golub_test_p_trans = log10(golub_test_pp)

# Normalization
train_m = colMeans(golub_train_p_trans)
train_sd = apply(golub_train_p_trans, 2, sd)
golub_train_p_trans = t((t(golub_train_p_trans)-train_m)/train_sd)
golub_test_p_trans  = t((t(golub_test_p_trans)-train_m)/train_sd)
golub_train_3051 = golub_train_p_trans
golub_train_response = golub_train_r
golub_test_3051 = golub_test_p_trans
golub_test_response = golub_test_r
save(golub_train_3051, golub_train_response, golub_test_3051, golub_test_response, file = "../transformed data/golub3051.rda")
set.seed(201703)



In [4]:

    
cbind(train = dim(golub_train_p_trans),test = dim(golub_test_p_trans))









    





train test

	  38   34
	3051 3051

Step2: Build classifier/predictor.

Step 2(a): Neighbourhood analysis, permutation test and "signal-to-noise" ratio calculation: $P(g, c) = (\mu_{AML}-\mu_{ALL})/(\sigma_{AML} + \sigma_{ALL})$ as in notes 16 and 17. As we can see from the table at the end, neighbourhood analysis further reduce the number of predictors to 740 using 0.01 significant level permutation test.



In [5]:

    
# Neighbourhood analysis
##signal-to-noise ratio/PS in the paper
get_p = function(train_d, train_r){
    tr_m_aml =  colMeans(train_d[train_r == "AML",])
    tr_sd_aml = apply(train_d[train_r == "AML",], 2, sd)
    tr_m_all = colMeans(train_d[train_r == "ALL",])
    tr_sd_all = apply(train_d[train_r == "ALL",], 2, sd)
    p = (tr_m_aml-tr_m_all)/(tr_sd_aml+tr_sd_all)
    return(p)
}



In [6]:

    
nna = matrix(0, 400, 3051)
set.seed(201702)
# Permutation test
for(i in 1:400){
    t_r = sample(golub_train_r)
    nna[i, ] = get_p(golub_train_p_trans, t_r)
}



In [7]:

    
# Predictor selection based on the result of Neighbourhood analysis
nna_q = apply(nna, 2, quantile, prob = c(0.005, 0.995))
p = get_p(golub_train_p_trans, golub_train_r)
# With 0.01 significant level
index_1 = (1:3051)[p>=nna_q[2,] | p<=nna_q[1,]]
golub_train_p_trans = golub_train_p_trans[, index_1]
train_m_aml = colMeans(golub_train_p_trans[golub_train_r == "AML",])
train_m_all = colMeans(golub_train_p_trans[golub_train_r =="ALL",])
golub_test_p_trans =golub_test_p_trans[, index_1]
p = p[index_1]



In [8]:

    
cbind(train = dim(golub_train_p_trans),test = dim(golub_test_p_trans))









    





train test

	 38  34
	740 740

Step 2(b): Select the informative 50 genes as in the original paper. We follow the note 18 in selecting the genes.



In [9]:

    
cl_index = c(head(order(p), 25), head(order(p, decreasing = T), 25))
p_50 = p[cl_index]
b = (train_m_aml[cl_index]+train_m_all[cl_index])/2
train_cl = golub_train_p_trans[, cl_index]
test_cl = golub_test_p_trans[, cl_index]
golub_train_50 = train_cl
golub_test_50 = test_cl
save(golub_train_50, golub_train_response, golub_test_50, golub_test_response, file = "../transformed data/golub50gene.rda")

Step 3: Predict on training and testing data.

The prediction steps are the same as in note 18, 19 and 20.



In [10]:

    
#train
train_vote = t(p_50*t(sweep(train_cl, 2, b)))
train_V1 = apply(train_vote, 1, function(x) sum(x[x>0]))
train_V2 = abs(apply(train_vote, 1, function(x) sum(x[x<=0])))
train_pred = (train_V1-train_V2)/(train_V1+train_V2)
train_pred_r = ifelse(abs(train_pred)>0.3, ifelse(train_pred>0, "AML", "ALL"), "Uncertain")
train_table = table(Train_Predict = train_pred_r, Train_Actual = golub_train_r)
train_table
#test
test_vote = t(p_50*t(sweep(test_cl, 2, b)))
test_V1 = apply(test_vote, 1, function(x) sum(x[x>0]))
test_V2 = abs(apply(test_vote, 1, function(x) sum(x[x<=0])))
test_pred = (test_V1-test_V2)/(test_V1+test_V2)
test_pred_r = ifelse(abs(test_pred)>0.3, ifelse(test_pred>0, "AML", "ALL"), "Uncertain")
test_table = table(Test_Predict = test_pred_r, Test_Actual = golub_test_r)
test_table









    





             Train_Actual
Train_Predict ALL AML
    ALL        26   0
    AML         0  11
    Uncertain   1   0






    





            Test_Actual
Test_Predict ALL AML
   ALL        19   0
   AML         0  12
   Uncertain   1   2

Summary of Reproduce result

In step 1, data are loaded and preprocessed. However, most details of step 1(c) are not included in the original paper. We combine the information in paper 2, paper 9 and also a reproduce work done by Robert Gentleman, who confirmed in his work the procedure of thresholding and filtering. One also need to notice that we should use the mean and standard deviation in the training data to normalize the testing data as mentioned in the notes of the orignal paper. At the end of step 1, there are 3051 predictors left. The resulting dataset are same as the $72\times 3051$ Golub dataset available online.

In step 2, we first do the neighborhood analysis mentioned in the paper. However, detailed implementation is based on our understanding. (It is basically a permutation test and the seed is set to be 201702.) The method to get the critical values is simplified and hence the bound may be loose. The analysis further reduce the number of predictors, which you can check in the dataset dimensions. We then follow the procedure in the paper and select 50 most related genes as classifiers and subset our train and test data for classification sake.

In step 3, we perform the classification or prediction for both training data and testing data. As we can see in the two contigency table, the result is slightly better than in the original paper, since the bootstrap actually have some randomness.

Result Comparison: Train Prediction: In the paper, they correctly classify 36/38 with remaining two as uncertain. We correctly classify 37/38 with one as uncertain. Test Prediction: In the paper, they correctly classify 29/34 and we correctly classify 31/34.

However, in the original paper, they did leave one out cross validation to get the train accuracy and we didn't.

Compare classification with other feature selection method

Based on Paper 3 feature selection



In [11]:

    
# test_paper3 test_response train_paper3 train_response loaded
load("../transformed data/paper3.rda")
rbind( Train = dim(train_paper3), Test = dim(test_paper3))


train_paper3_aml = colMeans(train_paper3[train_response == "AML",])
train_paper3_all = colMeans(train_paper3[train_response =="ALL",])
b_p3 = (train_paper3_aml+train_paper3_all)/2
p_p3 = get_p(train_paper3, train_response)

#train
train_vote_p3 = t(p_p3*t(sweep(train_paper3, 2, b_p3)))
train_V1_p3 = apply(train_vote_p3, 1, function(x) sum(x[x>0]))
train_V2_p3 = abs(apply(train_vote_p3, 1, function(x) sum(x[x<=0])))
train_pred_p3 = (train_V1_p3-train_V2_p3)/(train_V1_p3+train_V2_p3)
train_pred_r_p3 = ifelse(abs(train_pred_p3)>0.3, ifelse(train_pred_p3>0, "AML", "ALL"), "Uncertain")
train_table_p3 = table(Train_Predict = train_pred_r_p3, Train_Actual = train_response)
train_table_p3
#test
test_vote_p3 = t(p_p3*t(sweep(test_paper3, 2, b_p3)))
test_V1_p3 = apply(test_vote_p3, 1, function(x) sum(x[x>0]))
test_V2_p3 = abs(apply(test_vote_p3, 1, function(x) sum(x[x<=0])))
test_pred_p3 = (test_V1_p3-test_V2_p3)/(test_V1_p3+test_V2_p3)
test_pred_r_p3 = ifelse(abs(test_pred_p3)>0.3, ifelse(test_pred_p3>0, "AML", "ALL"), "Uncertain")
test_table_p3 = table(Test_Predict = test_pred_r_p3, Test_Actual = test_response)
test_table_p3









    






	Train 38 50
	Test 34 50









    





             Train_Actual
Train_Predict ALL AML
    ALL        27   0
    AML         0   9
    Uncertain   0   2






    





            Test_Actual
Test_Predict ALL AML
   ALL        20   1
   AML         0  12
   Uncertain   0   1

Based on paper 6 feature reduction



In [12]:

    
# pca_train, pca_test, pls_train, pls_test loaded
load("../transformed data/paper6.rda")
rbind(Train_pca = dim(pca_train), Test_pca = dim(pca_test), Train_pls = dim(pls_train), Test_pls = dim(pls_test))









    






	Train_pca 38 4 
	Test_pca 34 4 
	Train_pls 38 4 
	Test_pls 34 4

As we could see from the table above, paper6 performs dimension reduction and there is only three resulting features(components). However, we could still perform classification using those three features.



In [13]:

    
# pca
train_p6_pca_aml = colMeans(pca_train[pca_train$response== "AML",-1])
train_p6_pca_all = colMeans(pca_train[ pca_train$response=="ALL",-1])
b_p6_pca = (train_p6_pca_aml+train_p6_pca_all)/2
p_p6_pca = get_p(pca_train[, -1], pca_train$response)
#train
train_vote_p6_pca = t(p_p6_pca*t(sweep(pca_train[, -1], 2, b_p6_pca)))
train_V1_p6_pca = apply(train_vote_p6_pca, 1, function(x) sum(x[x>0]))
train_V2_p6_pca = abs(apply(train_vote_p6_pca, 1, function(x) sum(x[x<=0])))
train_pred_p6_pca = (train_V1_p6_pca-train_V2_p6_pca)/(train_V1_p6_pca+train_V2_p6_pca)
train_pred_r_p6_pca = ifelse(abs(train_pred_p6_pca)>0.3, ifelse(train_pred_p6_pca>0, "AML", "ALL"), "Uncertain")
train_table_p6_pca = table(Train_Predict = train_pred_r_p6_pca, Train_Actual = pca_train$response)
train_table_p6_pca
#test
test_vote_p6_pca = t(p_p6_pca*t(sweep(pca_test[, -1], 2, b_p6_pca)))
test_V1_p6_pca = apply(test_vote_p6_pca, 1, function(x) sum(x[x>0]))
test_V2_p6_pca = abs(apply(test_vote_p6_pca, 1, function(x) sum(x[x<=0])))
test_pred_p6_pca = (test_V1_p6_pca-test_V2_p6_pca)/(test_V1_p6_pca+test_V2_p6_pca)
test_pred_r_p6_pca = ifelse(abs(test_pred_p6_pca)>0.3, ifelse(test_pred_p6_pca>0, "AML", "ALL"), "Uncertain")
test_table_p6_pca = table(Test_Predict = test_pred_r_p6_pca, Test_Actual = pca_test$response)
test_table_p6_pca









    





             Train_Actual
Train_Predict ALL AML
          ALL  27   0
          AML   0  11






    





            Test_Actual
Test_Predict ALL AML
         ALL  20   1
         AML   0  13



In [14]:

    
# pls
train_p6_pls_aml = colMeans(pls_train[pls_train$response== "AML",-1])
train_p6_pls_all = colMeans(pls_train[pls_train$response=="ALL",-1])
b_p6_pls = (train_p6_pls_aml+train_p6_pls_all)/2
p_p6_pls = get_p(pls_train[, -1], pls_train$response)
#train
train_vote_p6_pls = t(p_p6_pls*t(sweep(pls_train[, -1], 2, b_p6_pls)))
train_V1_p6_pls = apply(train_vote_p6_pls, 1, function(x) sum(x[x>0]))
train_V2_p6_pls = abs(apply(train_vote_p6_pls, 1, function(x) sum(x[x<=0])))
train_pred_p6_pls = (train_V1_p6_pls-train_V2_p6_pls)/(train_V1_p6_pls+train_V2_p6_pls)
train_pred_r_p6_pls = ifelse(abs(train_pred_p6_pls)>0.3, ifelse(train_pred_p6_pls>0, "AML", "ALL"), "Uncertain")
train_table_p6_pls = table(Train_Predict = train_pred_r_p6_pls, Train_Actual = pls_train$response)
train_table_p6_pls
#test
test_vote_p6_pls = t(p_p6_pls*t(sweep(pls_test[, -1], 2, b_p6_pls)))
test_V1_p6_pls = apply(test_vote_p6_pls, 1, function(x) sum(x[x>0]))
test_V2_p6_pls = abs(apply(test_vote_p6_pls, 1, function(x) sum(x[x<=0])))
test_pred_p6_pls = (test_V1_p6_pls - test_V2_p6_pls)/(test_V1_p6_pls + test_V2_p6_pls)
test_pred_r_p6_pls = ifelse(abs(test_pred_p6_pls)>0.3, ifelse(test_pred_p6_pls>0, "AML", "ALL"), "Uncertain")
test_table_p6_pls = table(Test_Predict = test_pred_r_p6_pls, Test_Actual = pls_test$response)
test_table_p6_pls









    





             Train_Actual
Train_Predict ALL AML
          ALL  27   0
          AML   0  11






    





            Test_Actual
Test_Predict ALL AML
         ALL  20   1
         AML   0  13

Based on Paper 9 feature selection



In [15]:

    
# test_BW_predictor test_r train_BW_predictor train_r loaded
load("../transformed data/paper9.rda")
rbind(Train = dim(train_BW_predictor), Test = dim(test_BW_predictor))

train_p9_aml = colMeans(train_BW_predictor[train_r == "AML",])
train_p9_all = colMeans(train_BW_predictor[train_r =="ALL",])
b_p9 = (train_p9_aml+train_p9_all)/2
p_p9 = get_p(train_BW_predictor, train_r)

#train
train_vote_p9 = t(p_p9*t(sweep(train_BW_predictor, 2, b_p9)))
train_V1_p9 = apply(train_vote_p9, 1, function(x) sum(x[x>0]))
train_V2_p9 = abs(apply(train_vote_p9, 1, function(x) sum(x[x<=0])))
train_pred_p9 = (train_V1_p9-train_V2_p9)/(train_V1_p9+train_V2_p9)
train_pred_r_p9 = ifelse(abs(train_pred_p9)>0.3, ifelse(train_pred_p9>0, "AML", "ALL"), "Uncertain")
train_table_p9 = table(Train_Predict = train_pred_r_p9, Train_Actual = train_r)
train_table_p9
#test
test_vote_p9 = t(p_p9*t(sweep(test_BW_predictor, 2, b_p9)))
test_V1_p9 = apply(test_vote_p9, 1, function(x) sum(x[x>0]))
test_V2_p9 = abs(apply(test_vote_p9, 1, function(x) sum(x[x<=0])))
test_pred_p9 = (test_V1_p9-test_V2_p9)/(test_V1_p9+test_V2_p9)
test_pred_r_p9 = ifelse(abs(test_pred_p9)>0.3, ifelse(test_pred_p9>0, "AML", "ALL"), "Uncertain")
test_table_p9 = table(Test_Predict = test_pred_r_p9, Test_Actual = test_r)
test_table_p9









    






	Train 48 50
	Test 24 50









    





             Train_Actual
Train_Predict ALL AML
    ALL        28   0
    AML         0  17
    Uncertain   2   1






    





            Test_Actual
Test_Predict ALL AML
   ALL        16   0
   AML         0   7
   Uncertain   1   0

Based on paper29 feature selection



In [16]:

    
# test_kmeans train_kmeans golub_test_r golub_train_r loaded
load("../transformed data/paper29.rda")
rbind(Train = dim(train_kmeans), Test = dim(test_kmeans))

train_p29_aml = colMeans(train_kmeans[golub_train_r == "AML",])
train_p29_all = colMeans(train_kmeans[golub_train_r =="ALL",])
b_p29 = (train_p29_aml+train_p29_all)/2
p_p29 = get_p(train_kmeans, golub_train_r)

#train
train_vote_p29 = t(p_p29*t(sweep(train_kmeans, 2, b_p29)))
train_V1_p29 = apply(train_vote_p29, 1, function(x) sum(x[x>0]))
train_V2_p29 = abs(apply(train_vote_p29, 1, function(x) sum(x[x<=0])))
train_pred_p29 = (train_V1_p29-train_V2_p29)/(train_V1_p29+train_V2_p29)
train_pred_r_p29 = ifelse(abs(train_pred_p29)>0.3, ifelse(train_pred_p29>0, "AML", "ALL"), "Uncertain")
train_table_p29 = table(Train_Predict = train_pred_r_p29, Train_Actual = golub_train_r)
train_table_p29
#test
test_vote_p29 = t(p_p29*t(sweep(test_kmeans, 2, b_p29)))
test_V1_p29 = apply(test_vote_p29, 1, function(x) sum(x[x>0]))
test_V2_p29 = abs(apply(test_vote_p29, 1, function(x) sum(x[x<=0])))
test_pred_p29 = (test_V1_p29-test_V2_p29)/(test_V1_p29+test_V2_p29)
test_pred_r_p29 = ifelse(abs(test_pred_p29)>0.3, ifelse(test_pred_p29>0, "AML", "ALL"), "Uncertain")
test_table_p29 = table(Test_Predict = test_pred_r_p29, Test_Actual = golub_test_r)
test_table_p29









    






	Train 38 50
	Test 34 50









    





             Train_Actual
Train_Predict ALL AML
          ALL  27   0
          AML   0  11






    





            Test_Actual
Test_Predict ALL AML
   ALL        18   1
   AML         0   6
   Uncertain   2   7

As we can see the result above, for paper29 selection, classification result is not as good as others and even a little worse than the bn result in paper29 notebook.