This file contains the code used to implement the classifiers in the paper Tissue Classification with Gene Expression Profiles.
Data Set: Raw/original data used in the Golub et al. paper.
- Train Data: 38 bone marrow samples (27 ALL, 11 AML).
- Test Data: 34 samples (24 bone marrow, 10 peripheral blood; 20 ALL, 14 AML).
- Predictors: 7129 gene expression levels representing 6817 genes.
Main Purpose: Multiple classifiers: NN, SVM, AdaBoost.
Install the Bioconductor biocLite package in order to access the golubEsets library, which contains the raw data used by Todd Golub in the original paper.
Load the training and testing data from the golubEsets library, and transpose the expression matrices so that observations are rows.
In [1]:
## The code below is commented out since it is unnecessary and time-consuming to run every time. Run it if needed.
# options(repos='http://cran.rstudio.com/')
# source("http://bioconductor.org/biocLite.R")
# biocLite("golubEsets")
suppressMessages(library(golubEsets))
#Training data
data(Golub_Train)
golub_train_p = t(exprs(Golub_Train))
golub_train_r = pData(Golub_Train)[, "ALL.AML"]
golub_train_l = ifelse(golub_train_r == "AML", 1, 0)
#Testing data
data(Golub_Test)
golub_test_p = t(exprs(Golub_Test))
golub_test_r = pData(Golub_Test)[, "ALL.AML"]
golub_test_l = ifelse(golub_test_r == "AML", 1, 0)
#Show summary
rbind(Train = dim(golub_train_p), Test = dim(golub_test_p))
cbind(Train = table(golub_train_r),Test = table(golub_test_r))
Since a bootstrap is used, we set a seed so that this reproduction is itself reproducible. We follow the steps in Section 4 of the paper; running the bootstrap may take some time. We select genes with a TNoM score below 14 that also have a bootstrap p-value below 0.01.
In [2]:
set.seed(201703)
# TNoM score: the minimum number of misclassifications achievable by a
# threshold classifier on the labels sorted by the gene's expression
r_score = function(slabel){
    total = length(slabel)
    n = sum(slabel == 0)
    p = sum(slabel == 1)
    temp = min(n, p)                 # score of the trivial split (everything on one side)
    d = ifelse(n < p, 1, -1)
    for(i in 1:(total - 1)){
        count = sum(slabel[1:i] == 0)
        t_t = min(n - 2 * count + i, p + 2 * count - i)
        t_d = ifelse((n - 2 * count + i) < (p + 2 * count - i), 1, -1)
        if(t_t < temp){
            temp = t_t
            d = t_d
        }
    }
    c(temp, d)                       # best score and its direction
}
# Significance: approximate the p-value of the observed TNoM score by comparing
# it with the scores of 200 random permutations of the labels
r_bootstrap = function(gene, label){
    total = length(label)
    index = order(gene)
    s_l = label[index]               # labels sorted by this gene's expression
    score = r_score(s_l)
    dist_score = numeric(200)
    for(i in 1:200){
        temp = sample(1:total)       # random permutation of the sample order
        t = r_score(label[temp])
        dist_score[i] = t[1]
    }
    prob = mean(dist_score < score[1])
    c(score[1], score[2], prob)
}
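As a quick sanity check of the score function (our own addition with toy label vectors, not part of the original notebook), a gene whose sorted labels separate the two classes perfectly should get a TNoM score of 0, while a thoroughly interleaved labelling scores close to min(n, p):
# toy sanity check with made-up label vectors
r_score(c(0, 0, 0, 1, 1, 1))[1]   # perfectly separable labels: score 0
r_score(c(0, 1, 0, 1, 0, 1))[1]   # interleaved labels: score 2, close to the worst case of 3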
In [3]:
# perform the calculation; this may take a while because of the loops in the functions above
a = apply(golub_train_p, 2, r_bootstrap, label = golub_train_l)
In [4]:
# select informative genes (TNoM score < 14 and bootstrap p-value < 0.01) and subset the train and test data
index = (1:7129)[a[1, ] < 14 & a[3, ] < 0.01]
# keep the 50 informative genes with the smallest TNoM scores
b = order(a[1, index])[1:50]
# data subsetting
train_cl = golub_train_p[, index]
test_cl = golub_test_p[, index]
train_paper3 = train_cl[, b]
train_response = golub_train_r
test_paper3 = test_cl[, b]
test_response = golub_test_r
save(train_paper3, train_response, test_paper3, test_response, train_cl, test_cl, file = "../transformed data/paper3.rda")
Step 3(a) Nearest Neighbor Classification
Pearson correlation is used as the measure of distance in the nearest-neighbor classification: a test sample receives the label of the training sample it is most correlated with. The paper reports a 91.6% classification rate, while we obtain 33/34 = 97% on the test data, as shown below.
In [5]:
# Build the classifier: 1-nearest-neighbor with Pearson correlation as the similarity measure
cl_nn = function(new_s, train, train_label){
    # correlation of the new sample with every training sample (rows of train)
    corr = apply(train, 1, cor, new_s)
    train_label[corr == max(corr)]
}
# prediction
nn_train_pr = apply(train_cl,1, cl_nn, train_cl, golub_train_r)
nn_test_pr = apply(test_cl,1, cl_nn, train_cl, golub_train_r)
# show result of prediction
table(Train_Predict = nn_train_pr, Train_Actual = golub_train_r)
table(Test_Predict = nn_test_pr, Test_Actual = golub_test_r)
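To read the accuracy rate directly rather than off the confusion tables, a small helper (our own convenience function, not part of the original notebook) can be applied to any of the predictions in this notebook:
# hypothetical helper: fraction of correct predictions
acc = function(pred, actual) mean(as.character(pred) == as.character(actual))
acc(nn_test_pr, golub_test_r)   # expected 33/34, about 0.97, as quoted above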
Step 3(b) SVM
We use the R package e1071, which already provides an svm implementation. Because we use the e1071 function, we do not have unclassified observations as in the paper. They report an accuracy rate of 93.3% for the linear kernel and 94.4% for the quadratic kernel. We obtain 31/34 = 91.2% correctly classified with the linear kernel and 32/34 = 94.1% with the quadratic kernel.
In [6]:
## The next two lines are commented out. If you don't have the packages in your env, you need to run them first.
#options(repos='http://cran.rstudio.com/')
#install.packages("e1071")
suppressMessages(library(e1071))
# need to set.seed for reproducibility
set.seed(201702)
In [7]:
# build the data for R functions
r_train = data.frame(train_cl, Y = factor(golub_train_l))
r_test = data.frame(test_cl, Y = factor(golub_test_l))
# build svm with linear kernel
svm_linear = svm(Y~., data = r_train)
svm_l_trpr = predict(svm_linear, r_train)
svm_l_tepr = predict(svm_linear, newdata = r_test)
# Result summary
table(Train_Predicted = svm_l_trpr, Train_Actual = golub_train_l)
table(Test_Predicted = svm_l_tepr, Test_Actual = golub_test_l)
In [8]:
# build svm with quadratic kernel; e1071's polynomial kernel is (gamma * u'v + coef0)^degree
svm_quad = svm(Y~., data = r_train, kernel = "polynomial", degree = 2, gamma = 0.01, coef0 = 100)
svm_q_trpr = predict(svm_quad, r_train)
svm_q_tepr = predict(svm_quad, newdata = r_test)
# Result_summary
table(Train_Predicted = svm_q_trpr, Train_Actual = golub_train_l)
table(Test_Predicted = svm_q_tepr, Test_Actual = golub_test_l)
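Assuming the acc helper defined earlier (our own addition), the kernel-specific test accuracies quoted above can be checked the same way:
acc(svm_l_tepr, golub_test_l)   # linear kernel: expected 31/34, about 0.912
acc(svm_q_tepr, golub_test_l)   # quadratic kernel: expected 32/34, about 0.941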
Step 3(c) AdaBoost
We use the R package fastAdaboost, which uses decision trees as weak classifiers, whereas the paper uses decision stumps as weak learners. They report an accuracy rate of 95.8%, and we obtain 31/34 = 91.2%.
In [9]:
## The next two lines are commented out. If you don't have the packages in your env, you need to run them first.
# options(repos='http://cran.rstudio.com/')
# install.packages("fastAdaboost")
suppressMessages(library(fastAdaboost))
In [10]:
# build the classifier with 100 boosting iterations
ada_cl = adaboost(Y~., data = r_train, 100)
# prediction and result
ada_train_pr = predict(ada_cl, r_train)
ada_test_pr = predict(ada_cl, newdata = r_test)
table(Train_Predict = ada_train_pr$class, Train_Actual = golub_train_l)
table(Test_Predict = ada_test_pr$class, Test_Actual = golub_test_l)
In this notebook, we reproduce paper 3 with three algorithms: NN, SVM, and AdaBoost. One thing worth noting is that for the last one we appear not to reproduce the 1,000- and 10,000-iteration results. We did run them, but since they are identical to the 100-iteration result, we do not include them here.
Also, the paper uses LOOCV while we do not.
Prediction Comparison: A comparison of prediction results is included for each algorithm in step 3. We obtain roughly the same accuracy rates as each method in the paper. Besides random factors such as seeds, there are two reasons we cannot completely reproduce their results. First, they use leave-one-out cross-validation to calculate the accuracy rate while we do not. Second, we use functions from existing packages instead of reimplementing each method strictly as in the paper.
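As an illustration of the first point, a leave-one-out cross-validation estimate for the correlation-based NN classifier on the training data could be computed as sketched below. This is our own minimal sketch, not part of the original analysis; taking the first label with [1] is a simplifying assumption for breaking ties.
# LOOCV sketch (our addition): classify each training sample using the other 37
loocv_pred = sapply(1:nrow(train_cl), function(i){
    as.character(cl_nn(train_cl[i, ], train_cl[-i, , drop = FALSE], golub_train_r[-i]))[1]
})
mean(loocv_pred == as.character(golub_train_r))   # LOOCV accuracy estimate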
In [11]:
# golub_test_50 golub_test_response golub_train_50 golub_train_response loaded
load("../transformed data/golub50gene.rda")
# prediction
nn_train_pr_p1 = apply(golub_train_50,1, cl_nn, golub_train_50, golub_train_response)
nn_test_pr_p1 = apply(golub_test_50,1, cl_nn, golub_train_50, golub_train_response)
# show result of prediction
table(Train_Predict = nn_train_pr_p1, Train_Actual = golub_train_response)
table(Test_Predict = nn_test_pr_p1, Test_Actual = golub_test_response)
In [12]:
# pca_train, pca_test, pls_train, pls_test loaded
load("../transformed data/paper6.rda")
# prediction
nn_train_pr_p6 = apply(pca_train[, -1],1, cl_nn, pca_train[,-1], pca_train$response)
nn_test_pr_p6 = apply(pca_test[, -1],1, cl_nn, pca_train[, -1], pca_train$response)
# show result of prediction
table(Train_Predict = nn_train_pr_p6, Train_Actual = pca_train$response)
table(Test_Predict = nn_test_pr_p6, Test_Actual = pca_test$response)
In [13]:
# test_BW_predictor test_r train_BW_predictor train_r loaded
load("../transformed data/paper9.rda")
# prediction
nn_train_pr_p9 = apply(train_BW_predictor,1, cl_nn, train_BW_predictor, train_r)
nn_test_pr_p9 = apply(test_BW_predictor,1, cl_nn, train_BW_predictor, train_r)
# show result of prediction
table(Train_Predict = nn_train_pr_p9, Train_Actual = train_r)
table(Test_Predict = nn_test_pr_p9, Test_Actual = test_r)
In [14]:
# test_kmeans train_kmeans golub_test_r golub_train_r loaded
load("../transformed data/paper29.rda")
# prediction
nn_train_pr_p29 = apply(train_kmeans,1, cl_nn, train_kmeans, golub_train_r)
nn_test_pr_p29 = apply(test_kmeans,1, cl_nn, train_kmeans, golub_train_r)
# show result of prediction
table(Train_Predict = nn_train_pr_p29, Train_Actual = golub_train_r)
table(Test_Predict = nn_test_pr_p29, Test_Actual = golub_test_r)
In [15]:
# build the data for R functions
r_train_p1 = data.frame(golub_train_50, Y = factor(golub_train_response))
r_test_p1 =data.frame( golub_test_50, Y = factor(golub_test_response))
# build svm with linear kernel
svm_linear_p1 = svm(Y~., data = r_train_p1)
svm_l_trpr_p1 = predict(svm_linear_p1, r_train_p1)
svm_l_tepr_p1 = predict(svm_linear_p1, newdata = r_test_p1)
# Result summary
table(Train_Predicted = svm_l_trpr_p1, Train_Actual = golub_train_response)
table(Test_Predicted = svm_l_tepr_p1, Test_Actual = golub_test_response)
In [16]:
#build svm with quadratic kernel
svm_quad_p1 = svm(Y~., data = r_train_p1, kernel = "polynomial", degree = 2, gamma =0.01, coef0 = 100)
svm_q_trpr_p1 = predict(svm_quad_p1, r_train_p1)
svm_q_tepr_p1 = predict(svm_quad_p1, newdata = r_test_p1)
# Result_summary
table(Train_Predicted = svm_q_trpr_p1, Train_Actual = golub_train_response)
table(Test_Predicted = svm_q_tepr_p1, Test_Actual = golub_test_response)
In [17]:
# build the data for R functions
r_train_p6 = data.frame(pca_train[, -1], Y = factor(pca_train$response))
r_test_p6 =data.frame( pca_test[, -1], Y = factor(pca_test$response))
# build svm with linear kernel
svm_linear_p6 = svm(Y~., data = r_train_p6)
svm_l_trpr_p6 = predict(svm_linear_p6, r_train_p6)
svm_l_tepr_p6 = predict(svm_linear_p6, newdata = r_test_p6)
# Result summary
table(Train_Predicted = svm_l_trpr_p6, Train_Actual = pca_train$response)
table(Test_Predicted = svm_l_tepr_p6, Test_Actual = pca_test$response)
#build svm with quadratic kernel
svm_quad_p6 = svm(Y~., data = r_train_p6, kernel = "polynomial", degree = 2, gamma =0.01, coef0 = 100)
svm_q_trpr_p6 = predict(svm_quad_p6, r_train_p6)
svm_q_tepr_p6 = predict(svm_quad_p6, newdata = r_test_p6)
# Result_summary
table(Train_Predicted = svm_q_trpr_p6, Train_Actual = pca_train$response)
table(Test_Predicted = svm_q_tepr_p6, Test_Actual = pca_test$response)
In [18]:
# build the data for R functions
r_train_p6_pls = data.frame(pls_train[, -1], Y = factor(pls_train$response))
r_test_p6_pls =data.frame(pls_test[, -1], Y = factor(pls_test$response))
# build svm with linear kernel
svm_linear_p6_pls = svm(Y~., data = r_train_p6_pls)
svm_l_trpr_p6_pls = predict(svm_linear_p6_pls, r_train_p6_pls)
svm_l_tepr_p6_pls = predict(svm_linear_p6_pls, newdata = r_test_p6_pls)
# Result summary
table(Train_Predicted = svm_l_trpr_p6_pls, Train_Actual = pls_train$response)
table(Test_Predicted = svm_l_tepr_p6_pls, Test_Actual = pls_test$response)
#build svm with quadratic kernel
svm_quad_p6_pls = svm(Y~., data = r_train_p6_pls, kernel = "polynomial", degree = 2, gamma =0.01, coef0 = 100)
svm_q_trpr_p6_pls = predict(svm_quad_p6_pls, r_train_p6_pls)
svm_q_tepr_p6_pls = predict(svm_quad_p6_pls, newdata = r_test_p6_pls)
# Result_summary
table(Train_Predicted = svm_q_trpr_p6_pls, Train_Actual = pls_train$response)
table(Test_Predicted = svm_q_tepr_p6_pls, Test_Actual = pls_test$response)
In [19]:
# build the data for R functions
r_train_p9 = data.frame(train_BW_predictor, Y = factor(train_r))
r_test_p9 = data.frame( test_BW_predictor, Y = factor(test_r))
# build svm with linear kernel
svm_linear_p9 = svm(Y~., data = r_train_p9)
svm_l_trpr_p9 = predict(svm_linear_p9, r_train_p9)
svm_l_tepr_p9 = predict(svm_linear_p9, newdata = r_test_p9)
# Result summary
table(Train_Predicted = svm_l_trpr_p9, Train_Actual = train_r)
table(Test_Predicted = svm_l_tepr_p9, Test_Actual = test_r)
#build svm with quadratic kernel
svm_quad_p9 = svm(Y~., data = r_train_p9, kernel = "polynomial", degree = 2, gamma =0.01, coef0 = 100)
svm_q_trpr_p9 = predict(svm_quad_p9, r_train_p9)
svm_q_tepr_p9 = predict(svm_quad_p9, newdata = r_test_p9)
# Result_summary
table(Train_Predicted = svm_q_trpr_p9, Train_Actual = train_r)
table(Test_Predicted = svm_q_tepr_p9, Test_Actual = test_r)
In [20]:
# build the data for R functions
r_train_p29 = data.frame(train_kmeans, Y = factor(golub_train_r))
r_test_p29 = data.frame( test_kmeans, Y = factor(golub_test_r))
# build svm with linear kernel
svm_linear_p29 = svm(Y~., data = r_train_p29)
svm_l_trpr_p29 = predict(svm_linear_p29, r_train_p29)
svm_l_tepr_p29 = predict(svm_linear_p29, newdata = r_test_p29)
# Result summary
table(Train_Predicted = svm_l_trpr_p29, Train_Actual = golub_train_r)
table(Test_Predicted = svm_l_tepr_p29, Test_Actual = golub_test_r)
#build svm with quadratic kernel
svm_quad_p29 = svm(Y~., data = r_train_p29, kernel = "polynomial", degree = 2, gamma =0.01, coef0 = 100)
svm_q_trpr_p29 = predict(svm_quad_p29, r_train_p29)
svm_q_tepr_p29 = predict(svm_quad_p29, newdata = r_test_p29)
# Result_summary
table(Train_Predicted = svm_q_trpr_p29, Train_Actual = golub_train_r)
table(Test_Predicted = svm_q_tepr_p29, Test_Actual = golub_test_r)
In [21]:
# build the classifier iter 100
ada_cl_p1 = adaboost(Y~., data = r_train_p1, 100)
# prediction and result
ada_train_pr_p1 = predict(ada_cl_p1, r_train_p1)
ada_test_pr_p1 = predict(ada_cl_p1, newdata = r_test_p1)
table(Train_Predict = ada_train_pr_p1$class, Train_Actual = golub_train_response)
table(Test_Predict = ada_test_pr_p1$class, Test_Actual = golub_test_response)
In [22]:
## pca selection
# build the classifier iter 100
ada_cl_p6 = adaboost(Y~., data = r_train_p6, 100)
# prediction and result
ada_train_pr_p6 = predict(ada_cl_p6, r_train_p6)
ada_test_pr_p6 = predict(ada_cl_p6, newdata = r_test_p6)
table(Train_Predict = ada_train_pr_p6$class, Train_Actual = pca_train$response)
table(Test_Predict = ada_test_pr_p6$class, Test_Actual = pca_test$response)
In [23]:
## pls selection
# build the classifier iter 100
ada_cl_p6_pls = adaboost(Y~., data = r_train_p6_pls, 100)
# prediction and result
ada_train_pr_p6_pls = predict(ada_cl_p6_pls, r_train_p6_pls)
ada_test_pr_p6_pls = predict(ada_cl_p6_pls, newdata = r_test_p6_pls)
table(Train_Predict = ada_train_pr_p6_pls$class, Train_Actual = pls_train$response)
table(Test_Predict = ada_test_pr_p6_pls$class, Test_Actual = pls_test$response)
In [24]:
# build the classifier iter 100
ada_cl_p9 = adaboost(Y~., data = r_train_p9, 100)
# prediction and result
ada_train_pr_p9 = predict(ada_cl_p9, r_train_p9)
ada_test_pr_p9 = predict(ada_cl_p9, newdata = r_test_p9)
table(Train_Predict = ada_train_pr_p9$class, Train_Actual = train_r)
table(Test_Predict = ada_test_pr_p9$class, Test_Actual = test_r)
In [25]:
# build the classifier iter 100
ada_cl_p29 = adaboost(Y~., data = r_train_p29, 100)
# prediction and result
ada_train_pr_p29 = predict(ada_cl_p29, r_train_p29)
ada_test_pr_p29 = predict(ada_cl_p29, newdata = r_test_p29)
table(Train_Predict = ada_train_pr_p29$class, Train_Actual = golub_train_r)
table(Test_Predict = ada_test_pr_p29$class, Test_Actual = golub_test_r)