This file contains the code used for implementing the classifer in paper Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring.
Data Set: Raw/original data used in the Golub etc. paper.
- Train Data: 38 bone marrow samples(27ALL, 11AML).
- Test Data: 34 samples(24 bone marrow, 10 peripheral blood samples, 20ALL, 14AML).
- Predictors: 7129 gene expression levels represent 6817 genes.
Main Purpose: Golub Classifier: class prediction, using the built classifiers to predict the class of a new tumor sample.
Step 1(a): Install Bioconductor biocLite package in order to access the golubEsets library. golubEsets contains the raw data used by Todd Golub in the original paper.
In [1]:
## Most code is commented in this cell since it is unnecessary and time-consuming to run it everytime.
# options(repos='http://cran.rstudio.com/')
# source("http://bioconductor.org/biocLite.R")
# biocLite("golubEsets")
suppressMessages(library(golubEsets))
Step 1(b): Load the training, testing data from library golubEsets. Also transpose the data to make observations as rows.
In [2]:
#Training data predictor and response
data(Golub_Train)
golub_train_p = t(exprs(Golub_Train))
golub_train_r =pData(Golub_Train)[, "ALL.AML"]
#Testing data predictor
data(Golub_Test)
golub_test_p = t(exprs(Golub_Test))
golub_test_r = pData(Golub_Test)[, "ALL.AML"]
#Show summary
rbind(Train = dim(golub_train_p), Test = dim(golub_test_p))
cbind(Train = table(golub_train_r),Test = table(golub_test_r))
Step 1(c): Perform data preprocessing: thresholding, filtering, logarithmic transformation and normalization as in the paper. The predictor is reduced to 3051 after preprocessing.
Most details of step 1(c) are not included in the original paper. We combine the information in paper 2, paper 9 and also a reproduce work done by Robert Gentleman, who confirmed in his work the procedure of thresholding and filtering is the same as in the original paper. One also need to notice that we should use the mean and standard deviation in the training data to normalize the testing data as mentioned in the Appendix A of the paper 2. At the end of this step, there are 3051 predictors left. The resulting dataset are same as the $72\times 3051$ Golub dataset available online.
In [3]:
# Thresholding
golub_train_pp = golub_train_p
golub_train_pp[golub_train_pp<100] = 100
golub_train_pp[golub_train_pp>16000] = 16000
# Filtering
golub_filter = function(x, r = 5, d=500){
minval = min(x)
maxval = max(x)
(maxval/minval>r)&&(maxval-minval>d)
}
index = apply(golub_train_pp, 2, golub_filter)
golub_index = (1:7129)[index]
golub_train_pp = golub_train_pp[, golub_index]
golub_test_pp = golub_test_p
golub_test_pp[golub_test_pp<100] = 100
golub_test_pp[golub_test_pp>16000] = 16000
golub_test_pp = golub_test_pp[, golub_index]
# Log Transformation
golub_train_p_trans = log10(golub_train_pp)
golub_test_p_trans = log10(golub_test_pp)
# Normalization
train_m = colMeans(golub_train_p_trans)
train_sd = apply(golub_train_p_trans, 2, sd)
golub_train_p_trans = t((t(golub_train_p_trans)-train_m)/train_sd)
golub_test_p_trans = t((t(golub_test_p_trans)-train_m)/train_sd)
golub_train_3051 = golub_train_p_trans
golub_train_response = golub_train_r
golub_test_3051 = golub_test_p_trans
golub_test_response = golub_test_r
save(golub_train_3051, golub_train_response, golub_test_3051, golub_test_response, file = "../transformed data/golub3051.rda")
set.seed(201703)
In [4]:
cbind(train = dim(golub_train_p_trans),test = dim(golub_test_p_trans))
Step 2(a): Neighbourhood analysis, permutation test and "signal-to-noise" ratio calculation: $P(g, c) = (\mu_{AML}-\mu_{ALL})/(\sigma_{AML} + \sigma_{ALL})$ as in notes 16 and 17. As we can see from the table at the end, neighbourhood analysis further reduce the number of predictors to 740 using 0.01 significant level permutation test.
In [5]:
# Neighbourhood analysis
##signal-to-noise ratio/PS in the paper
get_p = function(train_d, train_r){
tr_m_aml = colMeans(train_d[train_r == "AML",])
tr_sd_aml = apply(train_d[train_r == "AML",], 2, sd)
tr_m_all = colMeans(train_d[train_r == "ALL",])
tr_sd_all = apply(train_d[train_r == "ALL",], 2, sd)
p = (tr_m_aml-tr_m_all)/(tr_sd_aml+tr_sd_all)
return(p)
}
In [6]:
nna = matrix(0, 400, 3051)
set.seed(201702)
# Permutation test
for(i in 1:400){
t_r = sample(golub_train_r)
nna[i, ] = get_p(golub_train_p_trans, t_r)
}
In [7]:
# Predictor selection based on the result of Neighbourhood analysis
nna_q = apply(nna, 2, quantile, prob = c(0.005, 0.995))
p = get_p(golub_train_p_trans, golub_train_r)
# With 0.01 significant level
index_1 = (1:3051)[p>=nna_q[2,] | p<=nna_q[1,]]
golub_train_p_trans = golub_train_p_trans[, index_1]
train_m_aml = colMeans(golub_train_p_trans[golub_train_r == "AML",])
train_m_all = colMeans(golub_train_p_trans[golub_train_r =="ALL",])
golub_test_p_trans =golub_test_p_trans[, index_1]
p = p[index_1]
In [8]:
cbind(train = dim(golub_train_p_trans),test = dim(golub_test_p_trans))
Step 2(b): Select the informative 50 genes as in the original paper. We follow the note 18 in selecting the genes.
In [9]:
cl_index = c(head(order(p), 25), head(order(p, decreasing = T), 25))
p_50 = p[cl_index]
b = (train_m_aml[cl_index]+train_m_all[cl_index])/2
train_cl = golub_train_p_trans[, cl_index]
test_cl = golub_test_p_trans[, cl_index]
golub_train_50 = train_cl
golub_test_50 = test_cl
save(golub_train_50, golub_train_response, golub_test_50, golub_test_response, file = "../transformed data/golub50gene.rda")
In [10]:
#train
train_vote = t(p_50*t(sweep(train_cl, 2, b)))
train_V1 = apply(train_vote, 1, function(x) sum(x[x>0]))
train_V2 = abs(apply(train_vote, 1, function(x) sum(x[x<=0])))
train_pred = (train_V1-train_V2)/(train_V1+train_V2)
train_pred_r = ifelse(abs(train_pred)>0.3, ifelse(train_pred>0, "AML", "ALL"), "Uncertain")
train_table = table(Train_Predict = train_pred_r, Train_Actual = golub_train_r)
train_table
#test
test_vote = t(p_50*t(sweep(test_cl, 2, b)))
test_V1 = apply(test_vote, 1, function(x) sum(x[x>0]))
test_V2 = abs(apply(test_vote, 1, function(x) sum(x[x<=0])))
test_pred = (test_V1-test_V2)/(test_V1+test_V2)
test_pred_r = ifelse(abs(test_pred)>0.3, ifelse(test_pred>0, "AML", "ALL"), "Uncertain")
test_table = table(Test_Predict = test_pred_r, Test_Actual = golub_test_r)
test_table
In step 1, data are loaded and preprocessed. However, most details of step 1(c) are not included in the original paper. We combine the information in paper 2, paper 9 and also a reproduce work done by Robert Gentleman, who confirmed in his work the procedure of thresholding and filtering. One also need to notice that we should use the mean and standard deviation in the training data to normalize the testing data as mentioned in the notes of the orignal paper. At the end of step 1, there are 3051 predictors left. The resulting dataset are same as the $72\times 3051$ Golub dataset available online.
In step 2, we first do the neighborhood analysis mentioned in the paper. However, detailed implementation is based on our understanding. (It is basically a permutation test and the seed is set to be 201702.) The method to get the critical values is simplified and hence the bound may be loose. The analysis further reduce the number of predictors, which you can check in the dataset dimensions. We then follow the procedure in the paper and select 50 most related genes as classifiers and subset our train and test data for classification sake.
In step 3, we perform the classification or prediction for both training data and testing data. As we can see in the two contigency table, the result is slightly better than in the original paper, since the bootstrap actually have some randomness.
Result Comparison: Train Prediction: In the paper, they correctly classify 36/38 with remaining two as uncertain. We correctly classify 37/38 with one as uncertain. Test Prediction: In the paper, they correctly classify 29/34 and we correctly classify 31/34.
However, in the original paper, they did leave one out cross validation to get the train accuracy and we didn't.
In [11]:
# test_paper3 test_response train_paper3 train_response loaded
load("../transformed data/paper3.rda")
rbind( Train = dim(train_paper3), Test = dim(test_paper3))
train_paper3_aml = colMeans(train_paper3[train_response == "AML",])
train_paper3_all = colMeans(train_paper3[train_response =="ALL",])
b_p3 = (train_paper3_aml+train_paper3_all)/2
p_p3 = get_p(train_paper3, train_response)
#train
train_vote_p3 = t(p_p3*t(sweep(train_paper3, 2, b_p3)))
train_V1_p3 = apply(train_vote_p3, 1, function(x) sum(x[x>0]))
train_V2_p3 = abs(apply(train_vote_p3, 1, function(x) sum(x[x<=0])))
train_pred_p3 = (train_V1_p3-train_V2_p3)/(train_V1_p3+train_V2_p3)
train_pred_r_p3 = ifelse(abs(train_pred_p3)>0.3, ifelse(train_pred_p3>0, "AML", "ALL"), "Uncertain")
train_table_p3 = table(Train_Predict = train_pred_r_p3, Train_Actual = train_response)
train_table_p3
#test
test_vote_p3 = t(p_p3*t(sweep(test_paper3, 2, b_p3)))
test_V1_p3 = apply(test_vote_p3, 1, function(x) sum(x[x>0]))
test_V2_p3 = abs(apply(test_vote_p3, 1, function(x) sum(x[x<=0])))
test_pred_p3 = (test_V1_p3-test_V2_p3)/(test_V1_p3+test_V2_p3)
test_pred_r_p3 = ifelse(abs(test_pred_p3)>0.3, ifelse(test_pred_p3>0, "AML", "ALL"), "Uncertain")
test_table_p3 = table(Test_Predict = test_pred_r_p3, Test_Actual = test_response)
test_table_p3
In [12]:
# pca_train, pca_test, pls_train, pls_test loaded
load("../transformed data/paper6.rda")
rbind(Train_pca = dim(pca_train), Test_pca = dim(pca_test), Train_pls = dim(pls_train), Test_pls = dim(pls_test))
As we could see from the table above, paper6 performs dimension reduction and there is only three resulting features(components). However, we could still perform classification using those three features.
In [13]:
# pca
train_p6_pca_aml = colMeans(pca_train[pca_train$response== "AML",-1])
train_p6_pca_all = colMeans(pca_train[ pca_train$response=="ALL",-1])
b_p6_pca = (train_p6_pca_aml+train_p6_pca_all)/2
p_p6_pca = get_p(pca_train[, -1], pca_train$response)
#train
train_vote_p6_pca = t(p_p6_pca*t(sweep(pca_train[, -1], 2, b_p6_pca)))
train_V1_p6_pca = apply(train_vote_p6_pca, 1, function(x) sum(x[x>0]))
train_V2_p6_pca = abs(apply(train_vote_p6_pca, 1, function(x) sum(x[x<=0])))
train_pred_p6_pca = (train_V1_p6_pca-train_V2_p6_pca)/(train_V1_p6_pca+train_V2_p6_pca)
train_pred_r_p6_pca = ifelse(abs(train_pred_p6_pca)>0.3, ifelse(train_pred_p6_pca>0, "AML", "ALL"), "Uncertain")
train_table_p6_pca = table(Train_Predict = train_pred_r_p6_pca, Train_Actual = pca_train$response)
train_table_p6_pca
#test
test_vote_p6_pca = t(p_p6_pca*t(sweep(pca_test[, -1], 2, b_p6_pca)))
test_V1_p6_pca = apply(test_vote_p6_pca, 1, function(x) sum(x[x>0]))
test_V2_p6_pca = abs(apply(test_vote_p6_pca, 1, function(x) sum(x[x<=0])))
test_pred_p6_pca = (test_V1_p6_pca-test_V2_p6_pca)/(test_V1_p6_pca+test_V2_p6_pca)
test_pred_r_p6_pca = ifelse(abs(test_pred_p6_pca)>0.3, ifelse(test_pred_p6_pca>0, "AML", "ALL"), "Uncertain")
test_table_p6_pca = table(Test_Predict = test_pred_r_p6_pca, Test_Actual = pca_test$response)
test_table_p6_pca
In [14]:
# pls
train_p6_pls_aml = colMeans(pls_train[pls_train$response== "AML",-1])
train_p6_pls_all = colMeans(pls_train[pls_train$response=="ALL",-1])
b_p6_pls = (train_p6_pls_aml+train_p6_pls_all)/2
p_p6_pls = get_p(pls_train[, -1], pls_train$response)
#train
train_vote_p6_pls = t(p_p6_pls*t(sweep(pls_train[, -1], 2, b_p6_pls)))
train_V1_p6_pls = apply(train_vote_p6_pls, 1, function(x) sum(x[x>0]))
train_V2_p6_pls = abs(apply(train_vote_p6_pls, 1, function(x) sum(x[x<=0])))
train_pred_p6_pls = (train_V1_p6_pls-train_V2_p6_pls)/(train_V1_p6_pls+train_V2_p6_pls)
train_pred_r_p6_pls = ifelse(abs(train_pred_p6_pls)>0.3, ifelse(train_pred_p6_pls>0, "AML", "ALL"), "Uncertain")
train_table_p6_pls = table(Train_Predict = train_pred_r_p6_pls, Train_Actual = pls_train$response)
train_table_p6_pls
#test
test_vote_p6_pls = t(p_p6_pls*t(sweep(pls_test[, -1], 2, b_p6_pls)))
test_V1_p6_pls = apply(test_vote_p6_pls, 1, function(x) sum(x[x>0]))
test_V2_p6_pls = abs(apply(test_vote_p6_pls, 1, function(x) sum(x[x<=0])))
test_pred_p6_pls = (test_V1_p6_pls - test_V2_p6_pls)/(test_V1_p6_pls + test_V2_p6_pls)
test_pred_r_p6_pls = ifelse(abs(test_pred_p6_pls)>0.3, ifelse(test_pred_p6_pls>0, "AML", "ALL"), "Uncertain")
test_table_p6_pls = table(Test_Predict = test_pred_r_p6_pls, Test_Actual = pls_test$response)
test_table_p6_pls
In [15]:
# test_BW_predictor test_r train_BW_predictor train_r loaded
load("../transformed data/paper9.rda")
rbind(Train = dim(train_BW_predictor), Test = dim(test_BW_predictor))
train_p9_aml = colMeans(train_BW_predictor[train_r == "AML",])
train_p9_all = colMeans(train_BW_predictor[train_r =="ALL",])
b_p9 = (train_p9_aml+train_p9_all)/2
p_p9 = get_p(train_BW_predictor, train_r)
#train
train_vote_p9 = t(p_p9*t(sweep(train_BW_predictor, 2, b_p9)))
train_V1_p9 = apply(train_vote_p9, 1, function(x) sum(x[x>0]))
train_V2_p9 = abs(apply(train_vote_p9, 1, function(x) sum(x[x<=0])))
train_pred_p9 = (train_V1_p9-train_V2_p9)/(train_V1_p9+train_V2_p9)
train_pred_r_p9 = ifelse(abs(train_pred_p9)>0.3, ifelse(train_pred_p9>0, "AML", "ALL"), "Uncertain")
train_table_p9 = table(Train_Predict = train_pred_r_p9, Train_Actual = train_r)
train_table_p9
#test
test_vote_p9 = t(p_p9*t(sweep(test_BW_predictor, 2, b_p9)))
test_V1_p9 = apply(test_vote_p9, 1, function(x) sum(x[x>0]))
test_V2_p9 = abs(apply(test_vote_p9, 1, function(x) sum(x[x<=0])))
test_pred_p9 = (test_V1_p9-test_V2_p9)/(test_V1_p9+test_V2_p9)
test_pred_r_p9 = ifelse(abs(test_pred_p9)>0.3, ifelse(test_pred_p9>0, "AML", "ALL"), "Uncertain")
test_table_p9 = table(Test_Predict = test_pred_r_p9, Test_Actual = test_r)
test_table_p9
In [16]:
# test_kmeans train_kmeans golub_test_r golub_train_r loaded
load("../transformed data/paper29.rda")
rbind(Train = dim(train_kmeans), Test = dim(test_kmeans))
train_p29_aml = colMeans(train_kmeans[golub_train_r == "AML",])
train_p29_all = colMeans(train_kmeans[golub_train_r =="ALL",])
b_p29 = (train_p29_aml+train_p29_all)/2
p_p29 = get_p(train_kmeans, golub_train_r)
#train
train_vote_p29 = t(p_p29*t(sweep(train_kmeans, 2, b_p29)))
train_V1_p29 = apply(train_vote_p29, 1, function(x) sum(x[x>0]))
train_V2_p29 = abs(apply(train_vote_p29, 1, function(x) sum(x[x<=0])))
train_pred_p29 = (train_V1_p29-train_V2_p29)/(train_V1_p29+train_V2_p29)
train_pred_r_p29 = ifelse(abs(train_pred_p29)>0.3, ifelse(train_pred_p29>0, "AML", "ALL"), "Uncertain")
train_table_p29 = table(Train_Predict = train_pred_r_p29, Train_Actual = golub_train_r)
train_table_p29
#test
test_vote_p29 = t(p_p29*t(sweep(test_kmeans, 2, b_p29)))
test_V1_p29 = apply(test_vote_p29, 1, function(x) sum(x[x>0]))
test_V2_p29 = abs(apply(test_vote_p29, 1, function(x) sum(x[x<=0])))
test_pred_p29 = (test_V1_p29-test_V2_p29)/(test_V1_p29+test_V2_p29)
test_pred_r_p29 = ifelse(abs(test_pred_p29)>0.3, ifelse(test_pred_p29>0, "AML", "ALL"), "Uncertain")
test_table_p29 = table(Test_Predict = test_pred_r_p29, Test_Actual = golub_test_r)
test_table_p29
As we can see the result above, for paper29 selection, classification result is not as good as others and even a little worse than the bn result in paper29 notebook.