Feature selection

The paper does not describe this step in detail, so we assume that the genes selected on the training data are also used as the predictors for the test data. For the k-means-based feature selection, we select the top gene from each cluster, ranked by absolute SNR (signal-to-noise ratio). The helper functions for feature selection are below.
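
For reference, the SNR computed by `get_SNR` for a gene $g$ can be written as follows. The symbols $\mu$ and $\sigma$ are notation introduced here for the per-class mean and sample standard deviation over the training samples; they do not come from the paper.

$$
\mathrm{SNR}_g = \frac{\mu_g^{\mathrm{AML}} - \mu_g^{\mathrm{ALL}}}{\sigma_g^{\mathrm{AML}} + \sigma_g^{\mathrm{ALL}}}
$$

A positive value means the gene is expressed more highly in AML than in ALL on average. The selection step ranks genes by $|\mathrm{SNR}_g|$, so strongly separating genes in either direction are candidates.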


In [1]:
load("DP.rda")

In [2]:
set.seed(201703)
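# SNR: signal-to-noise ratio of each gene (column) between the AML and ALL classes
get_SNR = function(train_d, train_r){
    # drop = FALSE keeps the matrix shape even if only one row or column remains
    d_aml = train_d[train_r == "AML", , drop = FALSE]
    d_all = train_d[train_r == "ALL", , drop = FALSE]
    tr_m_aml = colMeans(d_aml)
    tr_sd_aml = apply(d_aml, 2, sd)
    tr_m_all = colMeans(d_all)
    tr_sd_all = apply(d_all, 2, sd)
    p = (tr_m_aml - tr_m_all)/(tr_sd_aml + tr_sd_all)
    return(p)
}
# K-means clustering of the genes, then keep the gene with the largest |SNR| in each cluster
get_kmeans = function(k, train_d, train_r){
    # Genes are the columns of train_d, so transpose to cluster genes rather than samples
    cl = kmeans(t(train_d), k, iter.max=50)$cluster
    result = numeric(k)
    for(i in seq_len(k)){
        oid = which(cl == i)  # column indices of the genes in cluster i
        # drop = FALSE avoids a shape error when a cluster contains a single gene
        iSNR = get_SNR(train_d[, oid, drop = FALSE], train_r)
        temp = which.max(abs(iSNR))
        result[i] = oid[temp]
    }
    return(result)
}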

# Select genes on the training data (only k = 50, i.e. k[5], is used here),
# then apply the same gene indices to both the training and the test data
k = c(5,10,20,30,50,70,90)
kmeans_id_50 = get_kmeans(k[5], golub_train_p_trans, golub_train_r)
train_kmeans_50 = data.frame(golub_train_p_trans[,kmeans_id_50], class = golub_train_r)
test_kmeans_50 = data.frame(golub_test_p_trans[,kmeans_id_50])
train_kmeans = golub_train_p_trans[,kmeans_id_50]
test_kmeans = golub_test_p_trans[,kmeans_id_50]
save(train_kmeans,golub_train_r, test_kmeans,golub_test_r,train_kmeans_50, test_kmeans_50, file = "../transformed data/paper29.rda")
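
The assumption stated at the top (genes are chosen on the training data only, and the same column indices are then applied to the test data) can be illustrated on a toy example. This is a minimal sketch, not the notebook's pipeline: the data values, `toy_train`, `toy_test`, `toy_labels`, and `snr_toy` are all made up for illustration, and for brevity it ranks all genes by |SNR| directly instead of picking one gene per k-means cluster.

```r
# Toy training data: 6 samples (rows) x 4 genes (columns); values are invented
toy_train <- matrix(
  c(1, 2, 1, 9, 8, 9,   # g1: much higher in AML
    5, 4, 6, 5, 6, 4,   # g2: no class difference
    2, 3, 2, 3, 2, 3,   # g3: weak class difference
    7, 7, 8, 1, 2, 1),  # g4: much higher in ALL
  nrow = 6, dimnames = list(NULL, paste0("g", 1:4))
)
toy_labels <- c("ALL", "ALL", "ALL", "AML", "AML", "AML")

# Hypothetical stand-alone SNR helper, same formula as get_SNR
snr_toy <- function(x, y) {
  d_aml <- x[y == "AML", , drop = FALSE]
  d_all <- x[y == "ALL", , drop = FALSE]
  (colMeans(d_aml) - colMeans(d_all)) /
    (apply(d_aml, 2, sd) + apply(d_all, 2, sd))
}

# Rank genes on the TRAINING data only and keep the two with the largest |SNR|
snr <- snr_toy(toy_train, toy_labels)
top_id <- order(abs(snr), decreasing = TRUE)[1:2]

# Apply the same column indices to the training and the test data
toy_test <- matrix(seq_len(8), nrow = 2,
                   dimnames = list(NULL, paste0("g", 1:4)))
train_sel <- toy_train[, top_id, drop = FALSE]
test_sel  <- toy_test[, top_id, drop = FALSE]

print(round(snr, 3))
print(colnames(test_sel))
```

The test labels are never consulted during selection, so the test set stays untouched until prediction time.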
