This file contains the code used for implementing the classifer in paper Acute Leukemia Classification using Bayesian Networks. There are several assumptions made during the reproducing implementation as the paper omits lots of realization details. The paper uses the same preprocessing method as in paper 1 and also use SNR as the feature selection criterion. In addition to SNR score, the paper also adopts kmeans during feature selection.

Data Set:

- Train Data: 38 bone marrow samples(27ALL, 11AML).
- Test Data: 34 samples(24 bone marrow, 10 peripheral blood samples, 20ALL, 14AML).

Main Purpose: SNR/KMeans,SNR + Bayesian Network

Reproduce using R

Step 1 Load data

The paper states nothing about data preprocessing. However, the scale of each predictor is important when calculating SNR. Hence, we just assume the preprocessing is the same as in the Golub. paper. Also, the SNR mentioned in the paper is essentially the same as the PS in the Golub paper.


In [1]:
## Most code is commented in this cell since it is unnecessary and time-consuming to run it everytime.
# options(repos='http://cran.rstudio.com/') 
# source("http://bioconductor.org/biocLite.R")
# biocLite("golubEsets")
suppressMessages(library(golubEsets))
suppressMessages(library(bnlearn))
#Training data predictor and response
data(Golub_Train)
golub_train_p = t(exprs(Golub_Train))
golub_train_r =pData(Golub_Train)[, "ALL.AML"]
#Testing data predictor
data(Golub_Test)
golub_test_p = t(exprs(Golub_Test))
golub_test_r = pData(Golub_Test)[, "ALL.AML"]
#Show summary
rbind(Train = dim(golub_train_p), Test = dim(golub_test_p))
cbind(Train = table(golub_train_r),Test = table(golub_test_r))
# Thresholding
golub_train_pp = golub_train_p
golub_train_pp[golub_train_pp<100] = 100
golub_train_pp[golub_train_pp>16000] = 16000

# Filtering
golub_filter = function(x, r = 5, d=500){
    minval = min(x)
    maxval = max(x)
    (maxval/minval>r)&&(maxval-minval>d)
}
index = apply(golub_train_pp, 2, golub_filter)
golub_index = (1:7129)[index]
golub_train_pp = golub_train_pp[, golub_index]

golub_test_pp = golub_test_p
golub_test_pp[golub_test_pp<100] = 100
golub_test_pp[golub_test_pp>16000] = 16000
golub_test_pp = golub_test_pp[, golub_index]

# Log Transformation
golub_train_p_trans = log10(golub_train_pp)
golub_test_p_trans = log10(golub_test_pp)

# Normalization
train_m = colMeans(golub_train_p_trans)
train_sd = apply(golub_train_p_trans, 2, sd)
golub_train_p_trans = t((t(golub_train_p_trans)-train_m)/train_sd)
golub_test_p_trans  = t((t(golub_test_p_trans)-train_m)/train_sd)
rbind(Train = dim(golub_train_p_trans), Test = dim(golub_test_p_trans))
cbind(Train = table(golub_train_r),Test = table(golub_test_r))
set.seed(201703)


Train38 7129
Test34 7129
TrainTest
ALL2720
AML1114
Train38 3051
Test34 3051
TrainTest
ALL2720
AML1114

Step 2 Feature selection

As the details about this is not found in the paper, we make an assumption and use the genes selected using train data also as predictors selected for the test data. Also, for kmeans clustering feature selection, we select the top1 genes from each cluster using absolute SNR. Below are the helper functions for feature selection.


In [2]:
#SNR:signal-to-noise ratio
get_SNR = function(train_d, train_r){
    tr_m_aml =  colMeans(train_d[train_r == "AML",])
    tr_sd_aml = apply(train_d[train_r == "AML",], 2, sd)
    tr_m_all = colMeans(train_d[train_r == "ALL",])
    tr_sd_all = apply(train_d[train_r == "ALL",], 2, sd)
    p = (tr_m_aml-tr_m_all)/(tr_sd_aml+tr_sd_all)
    return(p)
}

In [3]:
# Kmeans clustering and then SNR ranking selection
get_kmeans = function(k, train_d, train_r){
    cl = kmeans(t(train_d), k, iter.max=50)$cluster
    result = numeric(k)
    for(i in 1:k){
        id = (cl == i)
        oid = (1:ncol(train_d))[id]
        iSNR = get_SNR(t(t(train_d)[id,]),train_r)
        temp = which.max(abs(iSNR))
        result[i] = oid[temp]
    }
    return(result)
}

In [4]:
# direct SNR ranking selection
get_direct = function(k, train_d, train_r){
    SNR = get_SNR(train_d, train_r)
    result = c(head(order(SNR, decreasing = F), k-k%/%2), head(order(SNR, decreasing = T), k%/%2))
    return(result)
}

Step 3 Bayesion Network Classfier

For each feature selection methods, we check both 50 and 90 genes selection below.

50 genes classification


In [5]:
#get the data after gene selection for each method
k = c(5,10,20,30,50,70,90)
kmeans_id_50 = get_kmeans(k[5], golub_train_p_trans, golub_train_r)
train_kmeans_50 = data.frame(golub_train_p_trans[,kmeans_id_50], class = golub_train_r)
test_kmeans_50 = data.frame(golub_test_p_trans[,kmeans_id_50])
direct_id_50 = get_direct(k[5], golub_train_p_trans, golub_train_r)
train_direct_50 = data.frame(golub_train_p_trans[, direct_id_50], class = golub_train_r)
test_direct_50 = data.frame(golub_test_p_trans[, direct_id_50])
train_kmeans = golub_train_p_trans[,kmeans_id_50]
test_kmeans = golub_test_p_trans[,kmeans_id_50]
save(train_kmeans,golub_train_r, test_kmeans,golub_test_r, file = "../transformed data/paper29.rda")

In [6]:
# kmeans method
km_50 = empty.graph(c("class", colnames(data.frame(golub_train_p_trans[,kmeans_id_50]))))
#km_set = cbind( from = colnames(data.frame(golub_train_p_trans[,kmeans_id])), to = rep("class", k[5]))
arcs(km_50) = matrix(c(rep("class", k[5]), 
                    colnames(data.frame(golub_train_p_trans[,kmeans_id_50]))), 
                  ncol = 2, byrow = F, dimnames = list(c(), c("from", "to")))
#plot(km)
km_fitted_50 = bn.fit(km_50, train_kmeans_50)
km_predict_train_50 = predict(km_fitted_50, node = "class", method="bayes-lw", train_kmeans_50)
km_predict_test_50 = predict(km_fitted_50, node = "class", method="bayes-lw", test_kmeans_50)
table(Predict_train = km_predict_train_50, True = train_kmeans_50$class)
table(Predict_test = km_predict_test_50, True = golub_test_r)


             True
Predict_train ALL AML
          ALL  27   0
          AML   0  11
            True
Predict_test ALL AML
         ALL  19  11
         AML   1   3

In [7]:
# direct method
dr_50 = empty.graph(c("class", colnames(data.frame(golub_train_p_trans[, direct_id_50]))))
arcs(dr_50) = matrix(c(rep("class", k[5]), colnames(data.frame(golub_train_p_trans[,direct_id_50]))), 
                  ncol = 2, byrow = F, dimnames = list(c(), c("from", "to")))
dr_fitted_50 = bn.fit(dr_50, train_direct_50)
dr_predict_train_50 = predict(dr_fitted_50, "class", method = "bayes-lw", train_direct_50)
dr_predict_test_50 = predict(dr_fitted_50, "class", method = "bayes-lw", test_direct_50)
table(Predict_train = dr_predict_train_50, True = train_direct_50$class)
table(Predict_test = dr_predict_test_50, True = golub_test_r)


             True
Predict_train ALL AML
          ALL  27   0
          AML   0  11
            True
Predict_test ALL AML
         ALL  19   8
         AML   1   6

90 genes classification


In [8]:
kmeans_id_90 = get_kmeans(k[7], golub_train_p_trans, golub_train_r)
train_kmeans_90 = data.frame(golub_train_p_trans[,kmeans_id_90], class = golub_train_r)
test_kmeans_90 = data.frame(golub_test_p_trans[,kmeans_id_90])
direct_id_90 = get_direct(k[7], golub_train_p_trans, golub_train_r)
train_direct_90 = data.frame(golub_train_p_trans[, direct_id_90], class = golub_train_r)
test_direct_90 = data.frame(golub_test_p_trans[, direct_id_90])

In [9]:
# kmeans method
km_90 = empty.graph(c("class", colnames(data.frame(golub_train_p_trans[,kmeans_id_90]))))
#km_set = cbind( from = colnames(data.frame(golub_train_p_trans[,kmeans_id])), to = rep("class", k[5]))
arcs(km_90) = matrix(c(rep("class", k[7]), 
                    colnames(data.frame(golub_train_p_trans[,kmeans_id_90]))), 
                  ncol = 2, byrow = F, dimnames = list(c(), c("from", "to")))
#plot(km)
km_fitted_90 = bn.fit(km_90, train_kmeans_90)
km_predict_train_90 = predict(km_fitted_90, node = "class", method="bayes-lw", train_kmeans_90)
km_predict_test_90 = predict(km_fitted_90, node = "class", method="bayes-lw", test_kmeans_90)
table(Predict_train = km_predict_train_90, True = train_kmeans_90$class)
table(Predict_test = km_predict_test_90, True = golub_test_r)

# direct method
dr_90 = empty.graph(c("class", colnames(data.frame(golub_train_p_trans[, direct_id_90]))))
arcs(dr_90) = matrix(c(rep("class", k[7]), colnames(data.frame(golub_train_p_trans[,direct_id_90]))), 
                  ncol = 2, byrow = F, dimnames = list(c(), c("from", "to")))
dr_fitted_90 = bn.fit(dr_90, train_direct_90)
dr_predict_train_90 = predict(dr_fitted_90, "class", method = "bayes-lw", train_direct_90)
dr_predict_test_90 = predict(dr_fitted_90, "class", method = "bayes-lw", test_direct_90)
table(Predict_train = dr_predict_train_90, True = train_direct_90$class)
table(Predict_test = dr_predict_test_90, True = golub_test_r)


             True
Predict_train ALL AML
          ALL  27   0
          AML   0  11
            True
Predict_test ALL AML
         ALL  19   5
         AML   1   9
             True
Predict_train ALL AML
          ALL  27   0
          AML   0  11
            True
Predict_test ALL AML
         ALL  19   8
         AML   1   6

Summary of Reproduce result

The 50-gene and 90-gene classification results in the paper is show below. However, we didn't get as high accuracy as in the paper.

  • Our results(Test results)
# of features With Kmeans Withouth Kmeans
50 22/34 25/34
90 28/34 25/34
  • Paper Results(Test results)
# of features With Kmeans Without Kmeans
50 31/34 29/34
90 32/34 32/34

As we can see above, our results are quite different from the paper's results. One of the reasons is that the paper omits too many details about their implementation or settings. Although it includes general introduction to kmeans and bayesian network, it still cannot enable us to reproduce their results. Things it should state includes but not limitted to preprocessing steps, number of genes selected from each cluster, bayesian network settings etc.