This file contains the code used for implementing the classifer in paper Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression data. This notebook could be pretty slow as a lot of function implemented are not time efficient enough in R (will try to optimize it if possible in the future).

Data Set: Merge the train and test data in the original paper and split the merged data to form train and test during learning.

- Merge Data: 72 samples(47 ALL, 25 AML, 62 bone marrow, 10 peripheral blood samples)
- Predictors: 7129 gene expression levels represent 6817 genes.

Main Purpose: FLDA, DLDA, DQDA, Decision Tree, Adaboost, Bagging(CPD), NN.

Reproduce using R

Step 1 Load and preprocess data.


In [1]:
# The implementation below is very similar to the one in paper1 notehbook and details chould be checked there.

## The code below is commented out since it is unnecessary and time-consuming to run it everytime. Run it if needed.
#options(repos='http://cran.rstudio.com/') 
#source("http://bioconductor.org/biocLite.R")
#biocLite("golubEsets")
#install.packages("tree")
#install.packages("fastAdaboost")
#install.packages("sparsediscrim", dependencies = T)
suppressMessages(library(sparsediscrim))
suppressMessages(library(tree))
suppressMessages(library(golubEsets))
suppressMessages(library(fastAdaboost))

# load data from golubEsets
data(Golub_Merge)
golub_merge_p = t(exprs(Golub_Merge))
golub_merge_r =pData(Golub_Merge)[, "ALL.AML"]
golub_merge_l = ifelse(golub_merge_r == "AML", 1, 0)

#show summary
dim(golub_merge_p) 
table(golub_merge_r)

# Thresholding
golub_merge_pp = golub_merge_p
golub_merge_pp[golub_merge_pp<100] = 100
golub_merge_pp[golub_merge_pp>16000] = 16000

# Filtering
golub_filter = function(x, r = 5, d=500){
    minval = min(x)
    maxval = max(x)
    (maxval/minval>r)&&(maxval-minval>d)
}
merge_index = apply(golub_merge_pp, 2, golub_filter)
golub_merge_index = (1:7129)[merge_index]
golub_merge_pp = golub_merge_pp[, golub_merge_index]

# Base 10 logarithmic transformation
golub_merge_p_trans = log10(golub_merge_pp)

#show summary again
dim(golub_merge_p_trans)
table(golub_merge_r)

total3571_predictor = golub_merge_p_trans
total3571_response = golub_merge_r
save(total3571_predictor, total3571_response, file = "../transformed data/golub3571.rda")
# Further standardization to mean 0 variance 1.
scale_golub_merge = scale(golub_merge_p_trans)
set.seed(201703)


  1. 72
  2. 7129
golub_merge_r
ALL AML 
 47  25 
  1. 72
  2. 3571
golub_merge_r
ALL AML 
 47  25 

As we see in the result above, the dataset is indeed reduce to $72 \times 3571$ and this is also the origin of the $72 \times 3571$ dataset available online.

Step 2 Build Classification

Step 2(a): General settings as specified in the paper and helper functions for BW(Feature selection) and train-test split.


In [2]:
# Settings as specified in the paper
p = 40 # number of genes for FLDA
B = 50 # Aggregation predictors
N = 200 # repeat classification N times
d = c(0.05, 0.1,0.25, 0.5, 0.75, 1) # CPD parameter

In [3]:
# Split train test as specified in the paper
mysplit = function(n){
    sample(1:n, floor(n/3))
}
# implement function for calculating BW as stated in the paper(the ratio of between-group to within group sums of squares)
BW = function(predictor, response){
    overall = colMeans(predictor)
    ALL_mean = apply(predictor, 2, function(x) mean(x[response == "ALL"]))
    AML_mean = apply(predictor, 2, function(x) mean(x[response == "AML"]))
    numerator = sum(response == "ALL")*(ALL_mean-overall)^2+sum(response == "AML")*(AML_mean-overall)^2
    denumerator = colSums((t(t(predictor[response == "ALL", ])-ALL_mean))^2)+colSums((t(t(predictor[response == "AML", ])-AML_mean))^2)
    numerator/denumerator
}
                     
# randomly feature select once for comparison for furthur study
id = mysplit(nrow(scale_golub_merge))
train_p = scale_golub_merge[-id,]
train_r = golub_merge_r[-id]
test_p = scale_golub_merge[id,]
test_r = golub_merge_r[id]
temp_bw = order(BW(train_p, train_r), decreasing = T)[1:50]
train_BW_predictor = train_p[,temp_bw]
test_BW_predictor = test_p[,temp_bw]
save(train_BW_predictor, train_r, test_BW_predictor, test_r, file = "../transformed data/paper9.rda")
  • Nearest Neighbor

As the distance measure defined in the paper are not general distance measure used by R knn function. We reimplement KNN in R and this cell could be a little bit slow.


In [4]:
k = seq(1, 21, 2)
# Distance measure used in the paper
Distance = function(predictor, test){
    1- apply(predictor, 1, cor, test)
}
# NN classification process
nn = function(test, pk, learning, response){
     distance = Distance(learning, test)
     index = order(distance)[1:pk]
     cl = ifelse(sum(response[index] == "AML")>sum(response[index]=="ALL"), "AML", "ALL")
     cl
}
# leave-one-cross-validation to tune k
mycv= function(pk,learning,response){
    error = 0
    for(i in 1:nrow(learning)){
        cl = nn(learning[i,], pk, learning[-i, ], response[-i])
        error = error+(cl == response[i])
    }
    error
}
error_count = numeric(N)
for(i in 1:N){
    # split data
    nn_index = mysplit(nrow(scale_golub_merge))
    nn_train_p = scale_golub_merge[-nn_index,]
    nn_train_r = golub_merge_r[-nn_index]
    nn_test_p = scale_golub_merge[nn_index,]
    nn_test_r = golub_merge_r[nn_index]
    # gene selection/feature selection
    temp_bw = order(BW(nn_train_p, nn_train_r), decreasing = T)[1:p]
    nn_train_p = nn_train_p[,temp_bw]
    nn_test_p = nn_test_p[,temp_bw]
    # cross-validation to choose k
    choose_k = sapply(k,mycv, learning = nn_train_p, response = nn_train_r)
    # nn classification
    nn_r = apply(nn_test_p,1, nn, k[which.min(choose_k)], nn_train_p, nn_train_r)
    error_count[i] = sum(nn_r != nn_test_r)
}
resultNN = c(Median = median(error_count), Upper_quartile = quantile(error_count, 0.75))
  • Desicion Tree(CART): Single CART tree with pruning by 10-fold CV

This part is accomplished using tree package in R as used in the paper. Since not much is mentioned in the paper about the set up of the decision tree, the default depth in tree are used. Also 10 fold cross-validation and prune.tree are used as specified in the paper.


In [5]:
cbine_data = data.frame(response = factor(golub_merge_r), scale_golub_merge)
tree_mdl = tree(response~.,data = cbine_data)
cv_error = replicate(N, ceiling(cv.tree(tree_mdl, , prune.tree, K = 10, method = "misclass")$dev[1]/10))
resultDT = c(Median = median(cv_error), Upper_quartile = quantile(cv_error, 0.75))
  • Bagging

Implemented without using packages in R.


In [6]:
# initialize the result vector
bg_error = numeric(N)
# implement helper function for each bagging
my_baghelper = function(train, test){
    bg = sample(nrow(train), replace = T)
    temp_md = tree(response~., data = train[bg, ])
    predict(temp_md, test, type = "class")
}

# perform N time classification(bootstrap)
for(i in 1:N){
    bg_index = mysplit(nrow(cbine_data))
    bg_train = cbine_data[-bg_index,]
    bg_test = cbine_data[bg_index,]
    
    # gene selection
    temp_bw = order(BW(bg_train[, -1], bg_train$response), decreasing = T)[1:p]
    bg_train_p = data.frame(response = bg_train$response, bg_train[,temp_bw+1])
    bg_test_p= data.frame(response = bg_test$response, bg_test[,temp_bw+1])
    
    t1 = replicate(B, my_baghelper(bg_train_p, bg_test_p))
    pred = apply(t1, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
    bg_error[i] = sum(pred != bg_test_p$response)
}
resultBag = c(Median = median(bg_error), Upper_quartile = quantile(bg_error, 0.75))
  • Bagging with CPD

In [7]:
d = 0.75
# implement CPD
CPD = function(d, x1, x2){
    a = runif(nrow(x1), 0, d)
    a*x1+(1-a)*x2
}
# helper function for each bagging with CPD
my_cpdhelper = function(train, test){
    id1 = sample(nrow(train), replace = T)
    id2 = sample(nrow(train), replace = T)
    temp = CPD(d, train[id1, -1], train[id2,-1])
    temp_md = tree(response~., data = data.frame(temp, response = train$response[id1]))
    predict(temp_md, test, type = "class")
}
#initialize the error vector
cpd_error = numeric(N)
# repeat N times
for(i in 1:N){
    cpd_index = mysplit(nrow(cbine_data))
    cpd_train = cbine_data[-cpd_index,]
    cpd_test = cbine_data[cpd_index,]
    
    # gene selection
    temp_bw = order(BW(cpd_train[, -1], cpd_train$response), decreasing = T)[1:p]
    cpd_train_t = data.frame(response = cpd_train$response, cpd_train[,temp_bw+1])
    cpd_test_t= data.frame(response = cpd_test$response, cpd_test[,temp_bw+1])
   
    t1 = replicate(B, my_cpdhelper(cpd_train_t, cpd_test_t))
    pred = apply(t1, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
    cpd_error[i] = sum(pred != cpd_test_t$response)
}
resultCPD = c(Median = median(cpd_error), Upper_quartile = quantile(cpd_error, 0.75))
  • AdaBoost

Using adaboost function from fastAdaboost package in R.


In [8]:
ada_error = numeric(N)
for(i in 1:N){
    ad_index = mysplit(nrow(cbine_data))
    ad_train = cbine_data[-ad_index,]
    ad_test = cbine_data[ad_index,]
    
    # gene selection
    temp_bw = order(BW(ad_train[, -1], ad_train$response), decreasing = T)[1:p]
    ad_train_t = data.frame(response = ad_train$response, ad_train[,temp_bw+1])
    ad_test_t= data.frame(response = ad_test$response, ad_test[,temp_bw+1])
    
    ada_cl = adaboost(response~., data = ad_train_t, 50)
    ada_test_pr = predict(ada_cl, ad_test_t)$class
   ada_error[i] = sum(ada_test_pr != ad_test_t$response)
}
resultAda = c(Median = median(ada_error), Upper_quartile = quantile(ada_error, 0.75))
  • FLDA

As the current LDA is very similar to the FLDA, lda function in MASS package is used to accomplish this part.


In [9]:
flda_error = numeric(N)
for(i in 1:200){
    flda_index = mysplit(nrow(cbine_data))
    flda_train = cbine_data[-flda_index,]
    flda_test = cbine_data[flda_index,]
    
    # gene selection
    temp_bw = order(BW(flda_train[, -1], flda_train$response), decreasing = T)[1:p]
    flda_train_t = data.frame(response = flda_train$response, flda_train[,temp_bw+1])
    flda_test_t= data.frame(response = flda_test$response, flda_test[,temp_bw+1])
    
    flda_md = MASS::lda(response~., data = flda_train_t)
    flda_pred = predict(flda_md, flda_test_t)$class
    flda_error[i] = sum(flda_pred != flda_test_t$response)
}
resultFLDA = c(Median = median(flda_error), Upper_quartile = quantile(flda_error, 0.75))
  • DLDA

Accomplished using R dlda in sparsediscrim packages.


In [10]:
dlda_error = numeric(N)
for(i in 1:200){
    dlda_index = mysplit(nrow(cbine_data))
    dlda_train = cbine_data[-dlda_index,]
    dlda_test = cbine_data[dlda_index,]
    
    # gene selection
    temp_bw = order(BW(dlda_train[, -1], dlda_train$response), decreasing = T)[1:p]
    dlda_train_t = data.frame(response = dlda_train$response, dlda_train[,temp_bw+1])
    dlda_test_t= data.frame(response = dlda_test$response, dlda_test[,temp_bw+1])
    
    dlda_md = dlda(response~., data = dlda_train_t)
    dlda_pred = predict(dlda_md, dlda_test_t[, -1])$class
    dlda_error[i] = sum(dlda_pred != dlda_test_t$response)
}
resultDLDA = c(Median = median(dlda_error), Upper_quartile = quantile(dlda_error, 0.75))
  • DQDA

In [11]:
dqda_error = numeric(N)
for(i in 1:200){
    dqda_index = mysplit(nrow(cbine_data))
    dqda_train = cbine_data[-dqda_index,]
    dqda_test = cbine_data[dqda_index,]
    
    # gene selection
    temp_bw = order(BW(dqda_train[, -1], dqda_train$response), decreasing = T)[1:p]
    dqda_train_t = data.frame(response = dqda_train$response, dqda_train[,temp_bw+1])
    dqda_test_t= data.frame(response = dqda_test$response, dqda_test[,temp_bw+1])
    
    dqda_md = dlda(response~., data = dqda_train_t)
    dqda_pred = predict(dqda_md, dqda_test_t[,-1])$class
    dqda_error[i] = sum(dqda_pred != dqda_test_t$response)
}
resultDQDA = c(Median = median(dqda_error), Upper_quartile = quantile(dqda_error, 0.75))

Summary of Reproduce result

In this notebook, we reproduce paper 9 and have several algorithms reproduce, such as FLDA, DLDA, DQDA etc. We made some implicit assumptions where details are not specified in the paper and when using existing packages in R. However, most of our work follows procedures exactly as described in the paper.

Prediction Comparison: Summary of our reproduce result is shown below. We also include the result in the paper below our result.


In [12]:
rbind(FLDA = resultFLDA, DLDA = resultDLDA, DQDA = resultDQDA,  DT = resultDT, Bag = resultBag ,Boost = resultAda, CPD = resultCPD, NN = resultNN)


MedianUpper_quartile.75%
FLDA45
DLDA11
DQDA11
DT22
Bag23
Boost12
CPD23
NN12
Classifier Median quartile Upper quartile
FLDA 3 4
DLDA 0 1
DQDA 1 2
DT 3 4
Bag 2 2
Boost 1 2
CPD 1 2
NN 1 1

As we can see from the above two tables, we basically have the same result as presented in the paper. Although our result is a slightly worse.

Compare classification with other feature selection method

NN in paper9

  • paper1

In [5]:
# golub_test_50 golub_test_response golub_train_50 golub_train_response loaded
load("../transformed data/golub50gene.rda")
# cross-validation to choose k
k = seq(1, 21, 2)
choose_k_p1 = sapply(k,mycv, learning = golub_train_50, response= golub_train_response)
# nn classification
nn_train_p1 = apply(golub_train_50 ,1, nn, k[which.min(choose_k_p1)], golub_train_50,golub_train_response)
nn_test_p1 = apply(golub_test_50 ,1, nn, k[which.min(choose_k_p1)], golub_train_50,golub_train_response)
table(Train_Predict = nn_train_p1, Train_Actual = golub_train_response)
table(Test_Predict = nn_test_p1, Test_Actual = golub_test_response)


             Train_Actual
Train_Predict ALL AML
          ALL  26   0
          AML   1  11
            Test_Actual
Test_Predict ALL AML
         ALL  19   0
         AML   1  14
  • paper 3

In [6]:
# test_paper3 test_response train_paper3 train_response loaded
load("../transformed data/paper3.rda")
# cross-validation to choose k
choose_k_p3 = sapply(k,mycv, learning = train_paper3, response= train_response)
# nn classification
nn_train_p3 = apply(train_paper3 ,1, nn, k[which.min(choose_k_p3)], train_paper3, train_response)
nn_test_p3 = apply(test_paper3 ,1, nn, k[which.min(choose_k_p3)], train_paper3, train_response)
table(Train_Predict = nn_train_p3, Train_Actual = train_response)
table(Test_Predict = nn_test_p3, Test_Actual = test_response)


             Train_Actual
Train_Predict ALL AML
          ALL  27   2
          AML   0   9
            Test_Actual
Test_Predict ALL AML
         ALL  20   3
         AML   0  11
  • paper 6

In [37]:
# pca_train, pca_test, pls_train, pls_test loaded
load("../transformed data/paper6.rda")

## PCA
# cross-validation to choose k
pca_tr = as.matrix(pca_train[, -1])
pca_te = as.matrix(pca_test[, -1])
choose_k_p6_pca = sapply(k,mycv, learning = pca_tr, response= pca_train$response)
# nn classification
nn_train_p6 = apply(pca_tr ,1, nn, k[which.min(choose_k_p6_pca)], pca_tr, pca_train$response)
nn_test_p6 = apply(pca_te ,1, nn, k[which.min(choose_k_p6_pca)], pca_tr, pca_train$response)
table(Train_Predict = nn_train_p6, Train_Actual = pca_train$response)
table(Test_Predict = nn_test_p6, Test_Actual = pca_test$response)


             Train_Actual
Train_Predict ALL AML
          ALL  23   0
          AML   4  11
            Test_Actual
Test_Predict ALL AML
         ALL  15   0
         AML   5  14

In [39]:
## PLS
# cross-validation to choose k
pls_tr = as.matrix(pls_train[, -1])
pls_te = as.matrix(pls_test[, -1])
choose_k_p6_pls = sapply(k,mycv, learning = pls_tr, response= pls_train$response)
# nn classification
nn_train_p6_pls = apply(pls_tr ,1, nn, k[which.min(choose_k_p6_pls)], pls_tr, pls_train$response)
nn_test_p6_pls = apply(pls_te ,1, nn, k[which.min(choose_k_p6_pls)], pls_tr, pls_train$response)
table(Train_Predict = nn_train_p6_pls, Train_Actual = pls_train$response)
table(Test_Predict = nn_test_p6_pls, Test_Actual = pls_test$response)


             Train_Actual
Train_Predict ALL AML
          ALL  25   0
          AML   2  11
            Test_Actual
Test_Predict ALL AML
         ALL  19   0
         AML   1  14
  • paper29

In [40]:
# test_kmeans train_kmeans golub_test_r golub_train_r loaded
load("../transformed data/paper29.rda")
# cross-validation to choose k
choose_k_p29 = sapply(k,mycv, learning = train_kmeans, response= golub_train_r)
# nn classification
nn_train_p29 = apply(train_kmeans ,1, nn, k[which.min(choose_k_p29)], train_kmeans, golub_train_r)
nn_test_p29 = apply(test_kmeans ,1, nn, k[which.min(choose_k_p29)], train_kmeans, golub_train_r)
table(Train_Predict = nn_train_p29, Train_Actual = golub_train_r)
table(Test_Predict = nn_test_p29, Test_Actual = golub_test_r)


             Train_Actual
Train_Predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_Predict ALL AML
         ALL  20   1
         AML   0  13

Decision Tree

  • paper 1

In [53]:
# golub_test_50 golub_test_response golub_train_50 golub_train_response loaded
cbine_data_p1 = data.frame(response = factor(golub_train_response), golub_train_50)
tree_mdl_p1 = tree(response~.,data = cbine_data_p1)
tree_tr_p1 = predict(tree_mdl_p1, data.frame(golub_train_50), type = "class")
tree_te_p1 = predict(tree_mdl_p1, data.frame(golub_test_50), type = "class")
table(Train_predict = tree_tr_p1, Train_Actual = golub_train_response)
table(Test_predict = tree_te_p1, Test_Actual = golub_test_response)


             Train_Actual
Train_predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_predict ALL AML
         ALL  18   1
         AML   2  13
  • paper 3

In [54]:
# test_paper3 test_response train_paper3 train_response loaded
cbine_data_p3 = data.frame(response = factor(train_response), train_paper3)
tree_mdl_p3 = tree(response~.,data = cbine_data_p3)
tree_tr_p3 = predict(tree_mdl_p3, data.frame(train_paper3), type = "class")
tree_te_p3 = predict(tree_mdl_p3, data.frame(test_paper3), type = "class")
table(Train_predict = tree_tr_p3, Train_Actual = train_response)
table(Test_predict = tree_te_p3, Test_Actual = test_response)


             Train_Actual
Train_predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_predict ALL AML
         ALL  18   1
         AML   2  13
  • paper 6

In [56]:
# pca_train, pca_test
cbine_data_pca = data.frame(response = factor(pca_train$response), pca_train[,-1])
tree_mdl_pca = tree(response~.,data = cbine_data_pca)
tree_tr_pca = predict(tree_mdl_pca, pca_train[,-1], type = "class")
tree_te_pca = predict(tree_mdl_pca, pca_test[,-1], type = "class")
table(Train_predict = tree_tr_pca, Train_Actual = pca_train[, 1])
table(Test_predict = tree_te_pca, Test_Actual = pca_test[, 1])


             Train_Actual
Train_predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_predict ALL AML
         ALL  20   1
         AML   0  13

In [59]:
# pls_train, pls_test loaded
cbine_data_pls = data.frame(response = factor(pls_train$response), pls_train[,-1])
tree_mdl_pls = tree(response~.,data = cbine_data_pls)
tree_tr_pls = predict(tree_mdl_pls, pls_train[,-1], type = "class")
tree_te_pls = predict(tree_mdl_pls, pls_test[,-1], type = "class")
table(Train_predict = tree_tr_pls, Train_Actual = pls_train[, 1])
table(Test_predict = tree_te_pls, Test_Actual = pls_test[, 1])


             Train_Actual
Train_predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_predict ALL AML
         ALL  20   1
         AML   0  13
  • paper 29

In [61]:
# test_kmeans train_kmeans golub_test_r golub_train_r loaded
cbine_data_p29 = data.frame(response = factor(golub_train_r), train_kmeans)
tree_mdl_p29 = tree(response~.,data = cbine_data_p29)
tree_tr_p29 = predict(tree_mdl_p29, data.frame(train_kmeans), type = "class")
tree_te_p29 = predict(tree_mdl_p29, data.frame(test_kmeans), type = "class")
table(Train_predict = tree_tr_p29, Train_Actual = golub_train_r)
table(Test_predict = tree_te_p29, Test_Actual = golub_test_r)


             Train_Actual
Train_predict ALL AML
          ALL  26   0
          AML   1  11
            Test_Actual
Test_predict ALL AML
         ALL  19   8
         AML   1   6

Bagging

  • paper 1

In [68]:
t_tr_p1 = replicate(B, my_baghelper(cbine_data_p1, data.frame(golub_train_50)))
pred_tr_p1 = apply(t_tr_p1, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
t_te_p1 = replicate(B, my_baghelper(cbine_data_p1, data.frame(golub_test_50)))
pred_te_p1 = apply(t_te_p1, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
table(Train_predict = pred_tr_p1, Train_Actual = golub_train_response)
table(Test_predict = pred_te_p1, Test_Actual = golub_test_response)


             Train_Actual
Train_predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_predict ALL AML
         ALL  18   1
         AML   2  13
  • paper 3

In [70]:
t_tr_p3 = replicate(B, my_baghelper(cbine_data_p3, data.frame(train_paper3)))
pred_tr_p3 = apply(t_tr_p3, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
t_te_p3 = replicate(B, my_baghelper(cbine_data_p3, data.frame(test_paper3)))
pred_te_p3 = apply(t_te_p3, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
table(Train_predict = pred_tr_p3, Train_Actual = train_response)
table(Test_predict = pred_te_p3, Test_Actual = test_response)


             Train_Actual
Train_predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_predict ALL AML
         ALL  18   1
         AML   2  13
  • paper 6

In [75]:
# pca
t_tr_p6 = replicate(B, my_baghelper(pca_train, pca_train))
pred_tr_p6 = apply(t_tr_p6, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
t_te_p6 = replicate(B, my_baghelper(pca_train, pca_test))
pred_te_p6 = apply(t_te_p6, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
table(Train_predict = pred_tr_p6, Train_Actual =pca_train[,1])
table(Test_predict = pred_te_p6, Test_Actual = pca_test[, 1])


             Train_Actual
Train_predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_predict ALL AML
         ALL  20   1
         AML   0  13

In [76]:
# pls
t_tr_p6_pls = replicate(B, my_baghelper(pls_train, pls_train))
pred_tr_p6_pls = apply(t_tr_p6_pls, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
t_te_p6_pls = replicate(B, my_baghelper(pls_train, pls_test))
pred_te_p6_pls = apply(t_te_p6_pls, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
table(Train_predict = pred_tr_p6_pls, Train_Actual =pls_train[,1])
table(Test_predict = pred_te_p6_pls, Test_Actual = pls_test[, 1])


             Train_Actual
Train_predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_predict ALL AML
         ALL  20   1
         AML   0  13
  • paper 29

In [79]:
t_tr_p29 = replicate(B, my_baghelper(cbine_data_p29, data.frame(train_kmeans)))
pred_tr_p29 = apply(t_tr_p29, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
t_te_p29 = replicate(B, my_baghelper(cbine_data_p29,data.frame(test_kmeans)))
pred_te_p29 = apply(t_te_p29, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
table(Train_predict = pred_tr_p29, Train_Actual = golub_train_response)
table(Test_predict = pred_te_p29, Test_Actual = golub_test_response)


             Train_Actual
Train_predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_predict ALL AML
         ALL  20   8
         AML   0   6

Bagging with CPD

  • paper 1

In [81]:
c_tr_p1 = replicate(B, my_cpdhelper(cbine_data_p1, data.frame(golub_train_50)))
c_pred_tr_p1 = apply(c_tr_p1, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
c_te_p1 = replicate(B, my_cpdhelper(cbine_data_p1, data.frame(golub_test_50)))
c_pred_te_p1 = apply(c_te_p1, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
table(Train_predict = c_pred_tr_p1, Train_Actual = golub_train_response)
table(Test_predict = c_pred_te_p1, Test_Actual = golub_test_response)


             Train_Actual
Train_predict ALL AML
          ALL  27   1
          AML   0  10
            Test_Actual
Test_predict ALL AML
         ALL  20   7
         AML   0   7
  • paper 3

In [83]:
c_tr_p3 = replicate(B, my_cpdhelper(cbine_data_p3, data.frame(train_paper3)))
c_pred_tr_p3 = apply(c_tr_p3, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
c_te_p3 = replicate(B, my_cpdhelper(cbine_data_p3, data.frame(test_paper3)))
c_pred_te_p3 = apply(c_te_p3, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
table(Train_predict = c_pred_tr_p3, Train_Actual = train_response)
table(Test_predict = c_pred_te_p3, Test_Actual = test_response)


             Train_Actual
Train_predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_predict ALL AML
         ALL  20   5
         AML   0   9
  • paper 6

In [86]:
# pca
c_tr_p6 = replicate(B, my_cpdhelper(pca_train, pca_train))
c_pred_tr_p6 = apply(c_tr_p6, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
c_te_p6 = replicate(B, my_cpdhelper(pca_train, pca_test))
c_pred_te_p6 = apply(c_te_p6, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
table(Train_predict = c_pred_tr_p6, Train_Actual =pca_train[,1])
table(Test_predict = c_pred_te_p6, Test_Actual = pca_test[, 1])


             Train_Actual
Train_predict ALL AML
          ALL  27   9
          AML   0   2
            Test_Actual
Test_predict ALL AML
         ALL  20   6
         AML   0   8

In [87]:
# pls
c_tr_p6_pls = replicate(B, my_cpdhelper(pls_train, pls_train))
c_pred_tr_p6_pls = apply(c_tr_p6_pls, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
c_te_p6_pls = replicate(B, my_cpdhelper(pls_train, pls_test))
c_pred_te_p6_pls = apply(c_te_p6_pls, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
table(Train_predict = c_pred_tr_p6_pls, Train_Actual =pls_train[,1])
table(Test_predict = c_pred_te_p6_pls, Test_Actual = pls_test[, 1])


             Train_Actual
Train_predict ALL AML
          ALL  27   2
          AML   0   9
            Test_Actual
Test_predict ALL AML
         ALL  20   3
         AML   0  11
  • paper 29

In [89]:
c_tr_p29 = replicate(B, my_cpdhelper(cbine_data_p29, data.frame(train_kmeans)))
c_pred_tr_p29 = apply(c_tr_p29, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
c_te_p29 = replicate(B, my_cpdhelper(cbine_data_p29,data.frame(test_kmeans)))
c_pred_te_p29 = apply(c_te_p29, 1, function(x) ifelse(sum(x == "AML")>sum(x =="ALL"), "AML", "ALL"))
table(Train_predict = c_pred_tr_p29, Train_Actual = golub_train_response)
table(Test_predict = c_pred_te_p29, Test_Actual = golub_test_response)


             Train_Actual
Train_predict ALL AML
          ALL  27   3
          AML   0   8
            Test_Actual
Test_predict ALL AML
         ALL  20   5
         AML   0   9

Adaboost

  • paper1

In [91]:
ada_cl_p1 = adaboost(response~., data = cbine_data_p1, 50)
ada_train_pr_p1 = predict(ada_cl_p1, data.frame(golub_train_50))$class
ada_test_pr_p1 = predict(ada_cl_p1, data.frame(golub_test_50))$class
table(Train_predict = ada_train_pr_p1, Train_Actual = golub_train_response)
table(Test_predict = ada_test_pr_p1, Test_Actual = golub_test_response)


             Train_Actual
Train_predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_predict ALL AML
         ALL  18   1
         AML   2  13
  • paper 3

In [93]:
ada_cl_p3 = adaboost(response~., data = cbine_data_p3, 50)
ada_train_pr_p3 = predict(ada_cl_p3, data.frame(train_paper3))$class
ada_test_pr_p3 = predict(ada_cl_p3, data.frame(test_paper3))$class
table(Train_predict =ada_train_pr_p3, Train_Actual = train_response)
table(Test_predict = ada_test_pr_p3, Test_Actual = test_response)


             Train_Actual
Train_predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_predict ALL AML
         ALL  18   1
         AML   2  13
  • paper 6

In [94]:
# pca
ada_cl_p6 = adaboost(response~., data = cbine_data_pca, 50)
ada_train_pr_p6 = predict(ada_cl_p6, pca_train)$class
ada_test_pr_p6 = predict(ada_cl_p6, pca_test)$class
table(Train_predict = ada_train_pr_p6, Train_Actual =pca_train[,1])
table(Test_predict =ada_test_pr_p6, Test_Actual = pca_test[, 1])


             Train_Actual
Train_predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_predict ALL AML
         ALL  20   1
         AML   0  13

In [95]:
# pls
ada_cl_p6_pls = adaboost(response~., data = cbine_data_pls, 50)
ada_train_pr_p6_pls = predict(ada_cl_p6_pls, pls_train)$class
ada_test_pr_p6_pls = predict(ada_cl_p6_pls, pls_test)$class
table(Train_predict = ada_train_pr_p6_pls, Train_Actual =pls_train[,1])
table(Test_predict =ada_test_pr_p6_pls, Test_Actual = pls_test[, 1])


             Train_Actual
Train_predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_predict ALL AML
         ALL  20   1
         AML   0  13
  • paper 29

In [97]:
ada_cl_p29 = adaboost(response~., data = cbine_data_p29, 50)
ada_train_pr_p29 = predict(ada_cl_p29, data.frame(train_kmeans))$class
ada_test_pr_p29 = predict(ada_cl_p29, data.frame(test_kmeans))$class
table(Train_predict = c_pred_tr_p29, Train_Actual = golub_train_response)
table(Test_predict = c_pred_te_p29, Test_Actual = golub_test_response)


             Train_Actual
Train_predict ALL AML
          ALL  27   3
          AML   0   8
            Test_Actual
Test_predict ALL AML
         ALL  20   5
         AML   0   9

FLDA

  • paper 1

In [99]:
flda_md_p1 = MASS::lda(response~., data = cbine_data_p1)
flda_tr_p1 = predict(flda_md_p1, data.frame(golub_train_50))$class
flda_te_p1 = predict(flda_md_p1, data.frame(golub_test_50))$class
table(Train_predict = flda_tr_p1, Train_Actual = golub_train_response)
table(Test_predict = flda_te_p1, Test_Actual = golub_test_response)


Warning message in lda.default(x, grouping, ...):
“variables are collinear”
             Train_Actual
Train_predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_predict ALL AML
         ALL  19   3
         AML   1  11
  • paper 3

In [100]:
flda_md_p3 = MASS::lda(response~., data = cbine_data_p3)
flda_tr_p3 = predict(flda_md_p3, data.frame(train_paper3))$class
flda_te_p3 = predict(flda_md_p3, data.frame(test_paper3))$class
table(Train_predict = flda_tr_p3, Train_Actual = train_response)
table(Test_predict = flda_te_p3, Test_Actual = test_response)


Warning message in lda.default(x, grouping, ...):
“variables are collinear”
             Train_Actual
Train_predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_predict ALL AML
         ALL  18   2
         AML   2  12
  • paper 6

In [101]:
# pca
flda_md_p6 = MASS::lda(response~., data = cbine_data_pca)
flda_tr_p6 = predict(flda_md_p6, pca_train)$class
flda_te_p6 = predict(flda_md_p6, pca_test)$class
table(Train_predict = flda_tr_p6, Train_Actual =pca_train[,1])
table(Test_predict = flda_te_p6, Test_Actual = pca_test[, 1])


             Train_Actual
Train_predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_predict ALL AML
         ALL  20   1
         AML   0  13

In [102]:
# pls
flda_md_p6_pls = MASS::lda(response~., data = cbine_data_pls)
flda_tr_p6_pls = predict(flda_md_p6, pls_train)$class
flda_te_p6_pls = predict(flda_md_p6, pls_test)$class
table(Train_predict = flda_tr_p6_pls, Train_Actual =pls_train[,1])
table(Test_predict = flda_te_p6_pls, Test_Actual = pls_test[, 1])


             Train_Actual
Train_predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_predict ALL AML
         ALL  20   1
         AML   0  13
  • paper 29

In [104]:
flda_md_p29 = MASS::lda(response~., data = cbine_data_p29)
flda_tr_p29 = predict(flda_md_p29, data.frame(train_kmeans))$class
flda_te_p29 = predict(flda_md_p29, data.frame(test_kmeans))$class
table(Train_predict = flda_tr_p29, Train_Actual = golub_train_response)
table(Test_predict = flda_te_p29, Test_Actual = golub_test_response)


Warning message in lda.default(x, grouping, ...):
“variables are collinear”
             Train_Actual
Train_predict ALL AML
          ALL  27   0
          AML   0  11
            Test_Actual
Test_predict ALL AML
         ALL  20   3
         AML   0  11