This file contains the code used for implementing the classifer in paper Acute Leukemia Classification using Bayesian Networks. There are several assumptions made during the reproducing implementation as the paper omits lots of realization details. The paper uses the same preprocessing method as in paper 1 and also use SNR as the feature selection criterion. In addition to SNR score, the paper also adopts kmeans during feature selection.
Data Set:
- Train Data: 38 bone marrow samples(27ALL, 11AML).
- Test Data: 34 samples(24 bone marrow, 10 peripheral blood samples, 20ALL, 14AML).
Main Purpose: SNR/KMeans,SNR + Bayesian Network
The paper states nothing about data preprocessing. However, the scale of each predictor is important when calculating SNR. Hence, we just assume the preprocessing is the same as in the Golub. paper. Also, the SNR mentioned in the paper is essentially the same as the PS in the Golub paper.
In [1]:
## Most code is commented in this cell since it is unnecessary and time-consuming to run it everytime.
# options(repos='http://cran.rstudio.com/')
# source("http://bioconductor.org/biocLite.R")
# biocLite("golubEsets")
suppressMessages(library(golubEsets))
suppressMessages(library(bnlearn))
#Training data predictor and response
data(Golub_Train)
golub_train_p = t(exprs(Golub_Train))
golub_train_r =pData(Golub_Train)[, "ALL.AML"]
#Testing data predictor
data(Golub_Test)
golub_test_p = t(exprs(Golub_Test))
golub_test_r = pData(Golub_Test)[, "ALL.AML"]
#Show summary
rbind(Train = dim(golub_train_p), Test = dim(golub_test_p))
cbind(Train = table(golub_train_r),Test = table(golub_test_r))
# Thresholding
golub_train_pp = golub_train_p
golub_train_pp[golub_train_pp<100] = 100
golub_train_pp[golub_train_pp>16000] = 16000
# Filtering
golub_filter = function(x, r = 5, d=500){
minval = min(x)
maxval = max(x)
(maxval/minval>r)&&(maxval-minval>d)
}
index = apply(golub_train_pp, 2, golub_filter)
golub_index = (1:7129)[index]
golub_train_pp = golub_train_pp[, golub_index]
golub_test_pp = golub_test_p
golub_test_pp[golub_test_pp<100] = 100
golub_test_pp[golub_test_pp>16000] = 16000
golub_test_pp = golub_test_pp[, golub_index]
# Log Transformation
golub_train_p_trans = log10(golub_train_pp)
golub_test_p_trans = log10(golub_test_pp)
# Normalization
train_m = colMeans(golub_train_p_trans)
train_sd = apply(golub_train_p_trans, 2, sd)
golub_train_p_trans = t((t(golub_train_p_trans)-train_m)/train_sd)
golub_test_p_trans = t((t(golub_test_p_trans)-train_m)/train_sd)
rbind(Train = dim(golub_train_p_trans), Test = dim(golub_test_p_trans))
cbind(Train = table(golub_train_r),Test = table(golub_test_r))
set.seed(201703)
As the details about this is not found in the paper, we make an assumption and use the genes selected using train data also as predictors selected for the test data. Also, for kmeans clustering feature selection, we select the top1 genes from each cluster using absolute SNR. Below are the helper functions for feature selection.
In [2]:
#SNR:signal-to-noise ratio
get_SNR = function(train_d, train_r){
tr_m_aml = colMeans(train_d[train_r == "AML",])
tr_sd_aml = apply(train_d[train_r == "AML",], 2, sd)
tr_m_all = colMeans(train_d[train_r == "ALL",])
tr_sd_all = apply(train_d[train_r == "ALL",], 2, sd)
p = (tr_m_aml-tr_m_all)/(tr_sd_aml+tr_sd_all)
return(p)
}
In [3]:
# Kmeans clustering and then SNR ranking selection
get_kmeans = function(k, train_d, train_r){
cl = kmeans(t(train_d), k, iter.max=50)$cluster
result = numeric(k)
for(i in 1:k){
id = (cl == i)
oid = (1:ncol(train_d))[id]
iSNR = get_SNR(t(t(train_d)[id,]),train_r)
temp = which.max(abs(iSNR))
result[i] = oid[temp]
}
return(result)
}
In [4]:
# direct SNR ranking selection
get_direct = function(k, train_d, train_r){
SNR = get_SNR(train_d, train_r)
result = c(head(order(SNR, decreasing = F), k-k%/%2), head(order(SNR, decreasing = T), k%/%2))
return(result)
}
In [5]:
#get the data after gene selection for each method
k = c(5,10,20,30,50,70,90)
kmeans_id_50 = get_kmeans(k[5], golub_train_p_trans, golub_train_r)
train_kmeans_50 = data.frame(golub_train_p_trans[,kmeans_id_50], class = golub_train_r)
test_kmeans_50 = data.frame(golub_test_p_trans[,kmeans_id_50])
direct_id_50 = get_direct(k[5], golub_train_p_trans, golub_train_r)
train_direct_50 = data.frame(golub_train_p_trans[, direct_id_50], class = golub_train_r)
test_direct_50 = data.frame(golub_test_p_trans[, direct_id_50])
train_kmeans = golub_train_p_trans[,kmeans_id_50]
test_kmeans = golub_test_p_trans[,kmeans_id_50]
save(train_kmeans,golub_train_r, test_kmeans,golub_test_r, file = "../transformed data/paper29.rda")
In [6]:
# kmeans method
km_50 = empty.graph(c("class", colnames(data.frame(golub_train_p_trans[,kmeans_id_50]))))
#km_set = cbind( from = colnames(data.frame(golub_train_p_trans[,kmeans_id])), to = rep("class", k[5]))
arcs(km_50) = matrix(c(rep("class", k[5]),
colnames(data.frame(golub_train_p_trans[,kmeans_id_50]))),
ncol = 2, byrow = F, dimnames = list(c(), c("from", "to")))
#plot(km)
km_fitted_50 = bn.fit(km_50, train_kmeans_50)
km_predict_train_50 = predict(km_fitted_50, node = "class", method="bayes-lw", train_kmeans_50)
km_predict_test_50 = predict(km_fitted_50, node = "class", method="bayes-lw", test_kmeans_50)
table(Predict_train = km_predict_train_50, True = train_kmeans_50$class)
table(Predict_test = km_predict_test_50, True = golub_test_r)
In [7]:
# direct method
dr_50 = empty.graph(c("class", colnames(data.frame(golub_train_p_trans[, direct_id_50]))))
arcs(dr_50) = matrix(c(rep("class", k[5]), colnames(data.frame(golub_train_p_trans[,direct_id_50]))),
ncol = 2, byrow = F, dimnames = list(c(), c("from", "to")))
dr_fitted_50 = bn.fit(dr_50, train_direct_50)
dr_predict_train_50 = predict(dr_fitted_50, "class", method = "bayes-lw", train_direct_50)
dr_predict_test_50 = predict(dr_fitted_50, "class", method = "bayes-lw", test_direct_50)
table(Predict_train = dr_predict_train_50, True = train_direct_50$class)
table(Predict_test = dr_predict_test_50, True = golub_test_r)
In [8]:
kmeans_id_90 = get_kmeans(k[7], golub_train_p_trans, golub_train_r)
train_kmeans_90 = data.frame(golub_train_p_trans[,kmeans_id_90], class = golub_train_r)
test_kmeans_90 = data.frame(golub_test_p_trans[,kmeans_id_90])
direct_id_90 = get_direct(k[7], golub_train_p_trans, golub_train_r)
train_direct_90 = data.frame(golub_train_p_trans[, direct_id_90], class = golub_train_r)
test_direct_90 = data.frame(golub_test_p_trans[, direct_id_90])
In [9]:
# kmeans method
km_90 = empty.graph(c("class", colnames(data.frame(golub_train_p_trans[,kmeans_id_90]))))
#km_set = cbind( from = colnames(data.frame(golub_train_p_trans[,kmeans_id])), to = rep("class", k[5]))
arcs(km_90) = matrix(c(rep("class", k[7]),
colnames(data.frame(golub_train_p_trans[,kmeans_id_90]))),
ncol = 2, byrow = F, dimnames = list(c(), c("from", "to")))
#plot(km)
km_fitted_90 = bn.fit(km_90, train_kmeans_90)
km_predict_train_90 = predict(km_fitted_90, node = "class", method="bayes-lw", train_kmeans_90)
km_predict_test_90 = predict(km_fitted_90, node = "class", method="bayes-lw", test_kmeans_90)
table(Predict_train = km_predict_train_90, True = train_kmeans_90$class)
table(Predict_test = km_predict_test_90, True = golub_test_r)
# direct method
dr_90 = empty.graph(c("class", colnames(data.frame(golub_train_p_trans[, direct_id_90]))))
arcs(dr_90) = matrix(c(rep("class", k[7]), colnames(data.frame(golub_train_p_trans[,direct_id_90]))),
ncol = 2, byrow = F, dimnames = list(c(), c("from", "to")))
dr_fitted_90 = bn.fit(dr_90, train_direct_90)
dr_predict_train_90 = predict(dr_fitted_90, "class", method = "bayes-lw", train_direct_90)
dr_predict_test_90 = predict(dr_fitted_90, "class", method = "bayes-lw", test_direct_90)
table(Predict_train = dr_predict_train_90, True = train_direct_90$class)
table(Predict_test = dr_predict_test_90, True = golub_test_r)
The 50-gene and 90-gene classification results in the paper is show below. However, we didn't get as high accuracy as in the paper.
# of features | With Kmeans | Withouth Kmeans |
---|---|---|
50 | 22/34 | 25/34 |
90 | 28/34 | 25/34 |
# of features | With Kmeans | Without Kmeans |
---|---|---|
50 | 31/34 | 29/34 |
90 | 32/34 | 32/34 |
As we can see above, our results are quite different from the paper's results. One of the reasons is that the paper omits too many details about their implementation or settings. Although it includes general introduction to kmeans and bayesian network, it still cannot enable us to reproduce their results. Things it should state includes but not limitted to preprocessing steps, number of genes selected from each cluster, bayesian network settings etc.