This file contains the code used to implement the classifiers in the paper "Tissue Classification with Gene Expression Profiles".

Data Set: Raw/original data used in the Golub et al. paper.

- Train Data: 38 bone marrow samples (27 ALL, 11 AML).
- Test Data: 34 samples (24 bone marrow, 10 peripheral blood; 20 ALL, 14 AML).
- Predictors: 7129 gene expression levels representing 6817 genes.

Main Purpose: classification with several methods, namely NN, SVM, and AdaBoost.

Reproduction using R

Step 1 Load and transform the dataset

Install the Bioconductor biocLite installer in order to access the golubEsets library, which contains the raw data used by Todd Golub in the original paper.

Load the training and testing data from the golubEsets library, and transpose the expression matrices so that observations are rows.
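Note: on current Bioconductor releases the biocLite installer has been retired. If the commented commands in the next cell fail, the modern equivalent (an aside on our part, not part of the original workflow) is:

# install.packages("BiocManager")      # modern replacement for the biocLite installer
# BiocManager::install("golubEsets")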


In [1]:
## The code below is commented out since it is unnecessary and time-consuming to run it every time. Run it if needed.
# options(repos='http://cran.rstudio.com/') 
# source("http://bioconductor.org/biocLite.R")
# biocLite("golubEsets")
suppressMessages(library(golubEsets))
# Training data
data(Golub_Train)
golub_train_p = t(exprs(Golub_Train))
golub_train_r = pData(Golub_Train)[, "ALL.AML"]
golub_train_l = ifelse(golub_train_r == "AML", 1, 0)

# Testing data
data(Golub_Test)
golub_test_p = t(exprs(Golub_Test))
golub_test_r = pData(Golub_Test)[, "ALL.AML"]
golub_test_l = ifelse(golub_test_r == "AML", 1, 0)

# Show summary
rbind(Train = dim(golub_train_p), Test = dim(golub_test_p))
cbind(Train = table(golub_train_r), Test = table(golub_test_r))


      [,1] [,2]
Train   38 7129
Test    34 7129

    Train Test
ALL    27   20
AML    11   14

Step 2 Feature selection using TNoM score

Because the significance step resamples labels, we set a seed to make this reproduction itself reproducible. We follow the steps in Section 4 of the paper; the resampling may take some time to run. A gene's TNoM score is the minimum number of misclassifications achievable by any threshold rule on that gene's expression values, so lower is more informative. We select genes with a TNoM score less than 14 that also have a resampling p-value less than 0.01.


In [2]:
set.seed(201703)
# TNoM score: the minimum number of misclassifications achievable by any
# threshold split of the labels sorted by expression, plus the direction of that split
r_score = function(slabel){
    total = length(slabel)
    n = sum(slabel == 0)
    p = sum(slabel == 1)
    temp = min(n, p)   # errors of the trivial (no-split) rule
    d = ifelse(n < p, 1, -1)
    for(i in 1:(total-1)){
        count = sum(slabel[1:i] == 0)
        t_t = min(n-2*count+i, p+2*count-i)
        t_d = ifelse((n-2*count+i)<(p+2*count-i),1,-1)
        if(t_t < temp){
            temp = t_t
            d = t_d
        }
    }
    c(temp, d)  # return the best score and its direction (not the last split tried)
}

# Significance: estimate a p-value by permuting the labels 200 times and
# taking the fraction of permuted scores below the observed score
r_bootstrap = function(gene, label){
    total = length(label)
    index = order(gene)   # sort labels by expression value
    s_l = label[index]
    score = r_score(s_l)
    dist_score = numeric(200)
    for(i in 1:200){
        temp = sample(1:total)   # random label permutation
        t = r_score(label[temp])
        dist_score[i] = t[1]
    }
    prob = mean(dist_score < score[1])
    c(score[1], score[2], prob)
}
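As a quick sanity check of r_score (our own toy example, not part of the original analysis): a gene whose sorted labels separate the classes perfectly should score 0, while interleaved labels score higher.

r_score(c(0, 0, 0, 1, 1))  # perfectly separable labels: returns c(0, 1)
r_score(c(1, 0, 1, 0, 1))  # interleaved labels: returns c(2, 1), two misclassifications at best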

In [3]:
# perform the calculation; this may take a while given the loops in the functions above
a = apply(golub_train_p, 2, r_bootstrap, label = golub_train_l)

In [4]:
# select informative genes (TNoM score < 14 and permutation p-value < 0.01),
# then keep the 50 with the smallest scores
index = (1:7129)[a[1,]<14 & a[3,]<0.01]
b = order(a[1,index])[1:50]

# data subsetting
train_cl = golub_train_p[, index]
test_cl = golub_test_p[, index]
train_paper3 = train_cl[, b]
train_response = golub_train_r
test_paper3 = test_cl[, b]
test_response = golub_test_r
save(train_paper3, train_response, test_paper3, test_response, train_cl, test_cl, file = "../transformed data/paper3.rda")

Step 3 Classification

Step 3(a) Nearest Neighbor Classification

Pearson correlation is used as the measure of similarity in the nearest neighbor classifier. The paper reports a 91.6% classification rate, while we obtain 33/34 ≈ 97% on the test data, as shown below.


In [5]:
# Build the classifier: assign the label of the training sample with the
# highest Pearson correlation to the new sample
cl_nn = function(new_s, train, train_label){
    # use Pearson correlation as the similarity measure
    corr = apply(train, 1, cor, new_s)
    # note: returns every label tied at the maximum correlation
    train_label[corr==max(corr)]
}

# prediction (on the training set each sample's nearest neighbor is itself,
# so training accuracy is trivially perfect)
nn_train_pr = apply(train_cl, 1, cl_nn, train_cl, golub_train_r)
nn_test_pr = apply(test_cl, 1, cl_nn, train_cl, golub_train_r)

# show result of prediction
table(Train_Predict = nn_train_pr, Train_Actual = golub_train_r)
table(Test_Predict = nn_test_pr, Test_Actual = golub_test_r)


             Train_Actual
Train_Predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_Predict ALL AML
         ALL  20   1
         AML   0  13
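To verify the quoted rate directly from the predictions above (a one-line check using the objects just computed):

mean(nn_test_pr == golub_test_r)  # 33/34, approximately 0.97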

Step 3(b) SVM

Use the R package e1071, which provides an svm implementation. Because we rely on e1071's svm function, we have no unclassified observations, unlike the paper. The paper reports accuracy rates of 93.3% for a linear kernel and 94.4% for a quadratic kernel. Note that svm() uses a radial kernel unless kernel = "linear" is specified; with the default call below we correctly classify 28/34 ≈ 82.4% of the test samples, and 31/34 ≈ 91.2% with the quadratic (polynomial degree-2) kernel, as the confusion matrices show.


In [6]:
## The next two lines are commented out. If the package is not installed in your environment, run them first.
#options(repos='http://cran.rstudio.com/') 
#install.packages("e1071")
suppressMessages(library(e1071))
# need to set.seed for reproducibility
set.seed(201702)

In [7]:
# build the data for R functions
r_train = data.frame(train_cl, Y = factor(golub_train_l))
r_test = data.frame(test_cl, Y = factor(golub_test_l))

# fit svm with the default (radial) kernel; pass kernel = "linear" for a strictly linear fit
svm_linear = svm(Y~., data = r_train)
svm_l_trpr = predict(svm_linear, r_train)
svm_l_tepr = predict(svm_linear, newdata = r_test)

# Result summary
table(Train_Predicted  = svm_l_trpr, Train_Actual = golub_train_l)
table(Test_Predicted = svm_l_tepr, Test_Actual = golub_test_l)


               Train_Actual
Train_Predicted  0  1
              0 27  0
              1  0 11
              Test_Actual
Test_Predicted  0  1
             0 16  2
             1  4 12
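As noted above, svm() defaults to a radial kernel, so the fit in this cell is not strictly linear. A genuinely linear fit (shown for reference; we did not rerun it here) would be:

# svm_lin = svm(Y~., data = r_train, kernel = "linear")
# table(Test_Predicted = predict(svm_lin, newdata = r_test), Test_Actual = golub_test_l)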

In [8]:
#build svm with quadratic kernel
svm_quad = svm(Y~., data = r_train, kernel = "polynomial", degree = 2,  gamma =0.01, coef0 = 100)
svm_q_trpr = predict(svm_quad, r_train)
svm_q_tepr = predict(svm_quad, newdata = r_test)

# Result_summary
table(Train_Predicted  = svm_q_trpr, Train_Actual = golub_train_l)
table(Test_Predicted = svm_q_tepr, Test_Actual = golub_test_l)


               Train_Actual
Train_Predicted  0  1
              0 27  0
              1  0 11
              Test_Actual
Test_Predicted  0  1
             0 20  3
             1  0 11

Step 3(c) AdaBoost

Use the R package fastAdaboost, which uses decision trees as weak classifiers, whereas the paper uses decision stumps as weak learners. The paper reports an accuracy rate of 95.8%; we obtain 31/34 ≈ 91.2%.


In [9]:
## The next two lines are commented out. If the package is not installed in your environment, run them first.
# options(repos='http://cran.rstudio.com/') 
# install.packages("fastAdaboost")
suppressMessages(library(fastAdaboost))

In [10]:
# build the classifier with 100 boosting iterations
ada_cl = adaboost(Y~., data = r_train, 100)

# prediction and result
ada_train_pr = predict(ada_cl, r_train)
ada_test_pr = predict(ada_cl, newdata = r_test)
table(Train_Predict = ada_train_pr$class, Train_Actual = golub_train_l)
table(Test_Predict = ada_test_pr$class, Test_Actual = golub_test_l)


             Train_Actual
Train_Predict  0  1
            0 27  0
            1  0 11
            Test_Actual
Test_Predict  0  1
           0 18  1
           1  2 13
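More boosting rounds are requested through the third (nIter) argument of adaboost. Per the summary below, 1,000 and 10,000 iterations produced the same predictions as 100 in our runs, so only the 100-iteration run is shown. For example:

# ada_cl_1000 = adaboost(Y~., data = r_train, 1000)   # commented out: slow, and matched the 100-iteration results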

Summary of Reproduction Results

In this notebook, we reproduce paper 3 with three algorithms: NN, SVM, and AdaBoost. One thing worth noting: for the last one we may appear not to have reproduced the 1,000- and 10,000-iteration results. We did run them, but since they are identical to the 100-iteration results we do not include them here.

Prediction Comparison: prediction comparisons are included with each algorithm in Step 3. We obtain roughly the same accuracy rates as the paper for each method. Beyond random factors such as seeds, two further factors keep us from reproducing their results exactly. First, they use leave-one-out cross-validation to calculate accuracy rates while we do not (a sketch of how this could look follows below). Second, we use functions from existing packages instead of reimplementing each method strictly as in the paper.
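As a sketch of what the paper's LOOCV would look like with our NN classifier (our own construction; variable names are ours and assume the objects defined in Step 3):

loocv_pr = sapply(1:nrow(train_cl), function(i)
    cl_nn(train_cl[i, ], train_cl[-i, , drop = FALSE], golub_train_r[-i]))  # assumes no correlation ties
mean(loocv_pr == golub_train_r)  # leave-one-out accuracy on the training set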

Compare classification with other feature selection methods

NN

  • Golub selection

In [11]:
# golub_test_50 golub_test_response golub_train_50 golub_train_response loaded
load("../transformed data/golub50gene.rda")

# prediction
nn_train_pr_p1 = apply(golub_train_50,1, cl_nn, golub_train_50, golub_train_response)
nn_test_pr_p1 = apply(golub_test_50,1, cl_nn, golub_train_50, golub_train_response)

# show result of prediction
table(Train_Predict = nn_train_pr_p1, Train_Actual = golub_train_response)
table(Test_Predict = nn_test_pr_p1, Test_Actual = golub_test_response)


             Train_Actual
Train_Predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_Predict ALL AML
         ALL  19   0
         AML   1  14
  • paper 6 selection

In [12]:
# pca_train, pca_test, pls_train, pls_test loaded
load("../transformed data/paper6.rda")

# prediction
nn_train_pr_p6 = apply(pca_train[, -1],1, cl_nn, pca_train[,-1], pca_train$response)
nn_test_pr_p6 = apply(pca_test[, -1],1, cl_nn, pca_train[, -1], pca_train$response)

# show result of prediction
table(Train_Predict = nn_train_pr_p6, Train_Actual = pca_train$response)
table(Test_Predict = nn_test_pr_p6, Test_Actual = pca_test$response)


             Train_Actual
Train_Predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_Predict ALL AML
         ALL  18   1
         AML   2  13
  • paper 9 selection

In [13]:
# test_BW_predictor test_r train_BW_predictor train_r loaded
load("../transformed data/paper9.rda")

# prediction
nn_train_pr_p9 = apply(train_BW_predictor,1, cl_nn, train_BW_predictor, train_r)
nn_test_pr_p9 = apply(test_BW_predictor,1, cl_nn, train_BW_predictor, train_r)

# show result of prediction
table(Train_Predict = nn_train_pr_p9, Train_Actual = train_r)
table(Test_Predict = nn_test_pr_p9, Test_Actual = test_r)


             Train_Actual
Train_Predict ALL AML
          ALL  30   0
          AML   0  18
            Test_Actual
Test_Predict ALL AML
         ALL  17   0
         AML   0   7
  • paper 29 selection

In [14]:
# test_kmeans train_kmeans golub_test_r golub_train_r loaded
load("../transformed data/paper29.rda")

# prediction
nn_train_pr_p29 = apply(train_kmeans,1, cl_nn, train_kmeans, golub_train_r)
nn_test_pr_p29 = apply(test_kmeans,1, cl_nn, train_kmeans, golub_train_r)

# show result of prediction
table(Train_Predict = nn_train_pr_p29, Train_Actual = golub_train_r)
table(Test_Predict = nn_test_pr_p29, Test_Actual = golub_test_r)


             Train_Actual
Train_Predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_Predict ALL AML
         ALL  17   0
         AML   3  14

SVM

  • Golub selection

In [15]:
# build the data for R functions
r_train_p1 = data.frame(golub_train_50, Y = factor(golub_train_response))
r_test_p1 = data.frame(golub_test_50, Y = factor(golub_test_response))

# fit svm with the default (radial) kernel
svm_linear_p1 = svm(Y~., data = r_train_p1)
svm_l_trpr_p1 = predict(svm_linear_p1, r_train_p1)
svm_l_tepr_p1 = predict(svm_linear_p1, newdata = r_test_p1)

# Result summary
table(Train_Predicted  = svm_l_trpr_p1, Train_Actual = golub_train_response)
table(Test_Predicted = svm_l_tepr_p1, Test_Actual = golub_test_response)


               Train_Actual
Train_Predicted ALL AML
            ALL  27   0
            AML   0  11
              Test_Actual
Test_Predicted ALL AML
           ALL  20   1
           AML   0  13

In [16]:
#build svm with quadratic kernel
svm_quad_p1 = svm(Y~., data = r_train_p1, kernel = "polynomial", degree = 2,  gamma =0.01, coef0 = 100)
svm_q_trpr_p1 = predict(svm_quad_p1, r_train_p1)
svm_q_tepr_p1 = predict(svm_quad_p1, newdata = r_test_p1)

# Result_summary
table(Train_Predicted  = svm_q_trpr_p1, Train_Actual = golub_train_response)
table(Test_Predicted = svm_q_tepr_p1, Test_Actual = golub_test_response)


               Train_Actual
Train_Predicted ALL AML
            ALL  27   0
            AML   0  11
              Test_Actual
Test_Predicted ALL AML
           ALL  20   1
           AML   0  13
  • paper 6

In [17]:
# build the data for R functions
r_train_p6 = data.frame(pca_train[, -1], Y = factor(pca_train$response))
r_test_p6 = data.frame(pca_test[, -1], Y = factor(pca_test$response))

# fit svm with the default (radial) kernel
svm_linear_p6 = svm(Y~., data = r_train_p6)
svm_l_trpr_p6 = predict(svm_linear_p6, r_train_p6)
svm_l_tepr_p6 = predict(svm_linear_p6, newdata = r_test_p6)

# Result summary
table(Train_Predicted  = svm_l_trpr_p6, Train_Actual = pca_train$response)
table(Test_Predicted = svm_l_tepr_p6, Test_Actual = pca_test$response)

#build svm with quadratic kernel
svm_quad_p6 = svm(Y~., data = r_train_p6, kernel = "polynomial", degree = 2,  gamma =0.01, coef0 = 100)
svm_q_trpr_p6 = predict(svm_quad_p6, r_train_p6)
svm_q_tepr_p6 = predict(svm_quad_p6, newdata = r_test_p6)

# Result_summary
table(Train_Predicted  = svm_q_trpr_p6, Train_Actual = pca_train$response)
table(Test_Predicted = svm_q_tepr_p6, Test_Actual = pca_test$response)


               Train_Actual
Train_Predicted ALL AML
            ALL  27   0
            AML   0  11
              Test_Actual
Test_Predicted ALL AML
           ALL  19   1
           AML   1  13
               Train_Actual
Train_Predicted ALL AML
            ALL  27   0
            AML   0  11
              Test_Actual
Test_Predicted ALL AML
           ALL  20   1
           AML   0  13

In [18]:
# build the data for R functions
r_train_p6_pls = data.frame(pls_train[, -1], Y = factor(pls_train$response))
r_test_p6_pls = data.frame(pls_test[, -1], Y = factor(pls_test$response))

# fit svm with the default (radial) kernel
svm_linear_p6_pls = svm(Y~., data = r_train_p6_pls)
svm_l_trpr_p6_pls = predict(svm_linear_p6_pls, r_train_p6_pls)
svm_l_tepr_p6_pls = predict(svm_linear_p6_pls, newdata = r_test_p6_pls)

# Result summary
table(Train_Predicted  = svm_l_trpr_p6_pls, Train_Actual = pls_train$response)
table(Test_Predicted = svm_l_tepr_p6_pls, Test_Actual = pls_test$response)

#build svm with quadratic kernel
svm_quad_p6_pls = svm(Y~., data = r_train_p6_pls, kernel = "polynomial", degree = 2,  gamma =0.01, coef0 = 100)
svm_q_trpr_p6_pls = predict(svm_quad_p6_pls, r_train_p6_pls)
svm_q_tepr_p6_pls = predict(svm_quad_p6_pls, newdata = r_test_p6_pls)

# Result_summary
table(Train_Predicted  = svm_q_trpr_p6_pls, Train_Actual = pls_train$response)
table(Test_Predicted = svm_q_tepr_p6_pls, Test_Actual = pls_test$response)


               Train_Actual
Train_Predicted ALL AML
            ALL  27   0
            AML   0  11
              Test_Actual
Test_Predicted ALL AML
           ALL  20   1
           AML   0  13
               Train_Actual
Train_Predicted ALL AML
            ALL  27   0
            AML   0  11
              Test_Actual
Test_Predicted ALL AML
           ALL  20   1
           AML   0  13
  • paper 9

In [19]:
# build the data for R functions
r_train_p9 = data.frame(train_BW_predictor, Y = factor(train_r))
r_test_p9 = data.frame(test_BW_predictor, Y = factor(test_r))

# fit svm with the default (radial) kernel
svm_linear_p9 = svm(Y~., data = r_train_p9)
svm_l_trpr_p9 = predict(svm_linear_p9, r_train_p9)
svm_l_tepr_p9 = predict(svm_linear_p9, newdata = r_test_p9)

# Result summary
table(Train_Predicted  = svm_l_trpr_p9, Train_Actual = train_r)
table(Test_Predicted = svm_l_tepr_p9, Test_Actual = test_r)

#build svm with quadratic kernel
svm_quad_p9 = svm(Y~., data = r_train_p9, kernel = "polynomial", degree = 2,  gamma =0.01, coef0 = 100)
svm_q_trpr_p9 = predict(svm_quad_p9, r_train_p9)
svm_q_tepr_p9 = predict(svm_quad_p9, newdata = r_test_p9)

# Result_summary
table(Train_Predicted  = svm_q_trpr_p9, Train_Actual = train_r)
table(Test_Predicted = svm_q_tepr_p9, Test_Actual = test_r)


               Train_Actual
Train_Predicted ALL AML
            ALL  30   0
            AML   0  18
              Test_Actual
Test_Predicted ALL AML
           ALL  17   0
           AML   0   7
               Train_Actual
Train_Predicted ALL AML
            ALL  30   0
            AML   0  18
              Test_Actual
Test_Predicted ALL AML
           ALL  17   0
           AML   0   7
  • paper 29

In [20]:
# build the data for R functions
r_train_p29 = data.frame(train_kmeans, Y = factor(golub_train_r))
r_test_p29 = data.frame(test_kmeans, Y = factor(golub_test_r))

# fit svm with the default (radial) kernel
svm_linear_p29 = svm(Y~., data = r_train_p29)
svm_l_trpr_p29 = predict(svm_linear_p29, r_train_p29)
svm_l_tepr_p29 = predict(svm_linear_p29, newdata = r_test_p29)

# Result summary
table(Train_Predicted  = svm_l_trpr_p29, Train_Actual = golub_train_r)
table(Test_Predicted = svm_l_tepr_p29, Test_Actual = golub_test_r)

#build svm with quadratic kernel
svm_quad_p29 = svm(Y~., data = r_train_p29, kernel = "polynomial", degree = 2,  gamma =0.01, coef0 = 100)
svm_q_trpr_p29 = predict(svm_quad_p29, r_train_p29)
svm_q_tepr_p29 = predict(svm_quad_p29, newdata = r_test_p29)

# Result_summary
table(Train_Predicted  = svm_q_trpr_p29, Train_Actual = golub_train_r)
table(Test_Predicted = svm_q_tepr_p29, Test_Actual = golub_test_r)


               Train_Actual
Train_Predicted ALL AML
            ALL  27   0
            AML   0  11
              Test_Actual
Test_Predicted ALL AML
           ALL  20   8
           AML   0   6
               Train_Actual
Train_Predicted ALL AML
            ALL  27   0
            AML   0  11
              Test_Actual
Test_Predicted ALL AML
           ALL  20   3
           AML   0  11

AdaBoost

  • paper 1

In [21]:
# build the classifier with 100 boosting iterations
ada_cl_p1 = adaboost(Y~., data = r_train_p1, 100)

# prediction and result
ada_train_pr_p1 = predict(ada_cl_p1, r_train_p1)
ada_test_pr_p1 = predict(ada_cl_p1, newdata = r_test_p1)
table(Train_Predict = ada_train_pr_p1$class, Train_Actual = golub_train_response)
table(Test_Predict = ada_test_pr_p1$class, Test_Actual = golub_test_response)


             Train_Actual
Train_Predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_Predict ALL AML
         ALL  18   1
         AML   2  13
  • paper 6

In [22]:
## pca selection
# build the classifier with 100 boosting iterations
ada_cl_p6 = adaboost(Y~., data = r_train_p6, 100)

# prediction and result
ada_train_pr_p6 = predict(ada_cl_p6, r_train_p6)
ada_test_pr_p6 = predict(ada_cl_p6, newdata = r_test_p6)
table(Train_Predict = ada_train_pr_p6$class, Train_Actual = pca_train$response)
table(Test_Predict = ada_test_pr_p6$class, Test_Actual = pca_test$response)


             Train_Actual
Train_Predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_Predict ALL AML
         ALL  20   1
         AML   0  13

In [23]:
## pls selection
# build the classifier with 100 boosting iterations
ada_cl_p6_pls = adaboost(Y~., data = r_train_p6_pls, 100)

# prediction and result
ada_train_pr_p6_pls = predict(ada_cl_p6_pls, r_train_p6_pls)
ada_test_pr_p6_pls = predict(ada_cl_p6_pls, newdata = r_test_p6_pls)
table(Train_Predict = ada_train_pr_p6_pls$class, Train_Actual = pls_train$response)
table(Test_Predict = ada_test_pr_p6_pls$class, Test_Actual = pls_test$response)


             Train_Actual
Train_Predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_Predict ALL AML
         ALL  20   1
         AML   0  13
  • paper 9

In [24]:
# build the classifier with 100 boosting iterations
ada_cl_p9 = adaboost(Y~., data = r_train_p9, 100)

# prediction and result
ada_train_pr_p9 = predict(ada_cl_p9, r_train_p9)
ada_test_pr_p9 = predict(ada_cl_p9, newdata = r_test_p9)
table(Train_Predict = ada_train_pr_p9$class, Train_Actual = train_r)
table(Test_Predict = ada_test_pr_p9$class, Test_Actual = test_r)


             Train_Actual
Train_Predict ALL AML
          ALL  30   0
          AML   0  18
            Test_Actual
Test_Predict ALL AML
         ALL  16   0
         AML   1   7
  • paper 29

In [25]:
# build the classifier with 100 boosting iterations
ada_cl_p29 = adaboost(Y~., data = r_train_p29, 100)

# prediction and result
ada_train_pr_p29 = predict(ada_cl_p29, r_train_p29)
ada_test_pr_p29 = predict(ada_cl_p29, newdata = r_test_p29)
table(Train_Predict = ada_train_pr_p29$class, Train_Actual = golub_train_r)
table(Test_Predict = ada_test_pr_p29$class, Test_Actual = golub_test_r)


             Train_Actual
Train_Predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_Predict ALL AML
         ALL  20   2
         AML   0  12
