Feature Selection Result Comparison

The dataset produced by preprocessing and feature selection for each paper we implemented is saved in an .rda file. Paper 6 uses PCA/PLS dimension reduction, so its output consists of derived components rather than selected genes and cannot be compared with the others. In this notebook, we analyse the genes chosen by the other four papers (Papers 1, 3, 9, and 29) and see how similar or different they are. For each paper, we set the seed to 201703 before feature selection.

Load Data

We also have two other datasets, which are the output of the data preprocessing method applied in the Golub paper. The dataset with 3051 predictors is preprocessed using the training data only; the genes kept for the training data are then selected in the test data as well. The other dataset has 3571 predictors, since the preprocessing is done on the merged data.



In [2]:

    
load("golub50gene.rda") #Paper 1
load("paper3.rda") #Paper 3
load("paper6.rda") #Paper 6 (PCA/PLS, not compared below)
load("paper9.rda") #Paper 9
load("paper29.rda") #Paper 29
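
The difference between the two preprocessing strategies described above (filter on the training data and reuse the kept genes for the test data, versus filter on the merged data) can be sketched on toy data. The helper `variable_genes`, its thresholds, and the toy matrices are assumptions made for illustration; they are not taken from the .rda files or from the notebook's own preprocessing code.

```r
# Hypothetical filter: TRUE for genes (columns) whose expression varies enough.
# The thresholds are illustrative assumptions, not values from this notebook.
variable_genes <- function(x, min_ratio = 5, min_diff = 500) {
  hi <- apply(x, 2, max)
  lo <- apply(x, 2, min)
  (hi / lo > min_ratio) & (hi - lo > min_diff)
}

set.seed(1)
genes <- paste0("g", 1:20)
# Toy expression matrices (samples in rows, genes in columns), all values positive
train <- matrix(runif(38 * 20, 100, 16000), nrow = 38, dimnames = list(NULL, genes))
test  <- matrix(runif(34 * 20, 100, 16000), nrow = 34, dimnames = list(NULL, genes))
# Make the last ten genes nearly constant so that the filter removes them
train[, 11:20] <- 1000 + train[, 11:20] / 1000
test[, 11:20]  <- 1000 + test[, 11:20] / 1000

# Strategy A: decide on the training data, then keep the same genes in the test data
keep_train <- variable_genes(train)
train_a <- train[, keep_train, drop = FALSE]
test_a  <- test[, keep_train, drop = FALSE]

# Strategy B: decide on the merged data
merged <- rbind(train, test)
keep_merged <- variable_genes(merged)
merged_b <- merged[, keep_merged, drop = FALSE]

c(train_only = ncol(train_a), merged = ncol(merged_b))
```

With a max/min style filter on positive values, adding samples can only widen a gene's range, so every gene that passes on the training data also passes on the merged data. This is consistent with the merged dataset having more predictors (3571) than the train-only one (3051).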

Check the data loaded



In [3]:

    
ls()

	'golub_test_50'
	'golub_test_r'
	'golub_test_response'
	'golub_train_50'
	'golub_train_r'
	'golub_train_response'
	'pca_test'
	'pca_train'
	'pls_test'
	'pls_train'
	'test_BW_predictor'
	'test_kmeans'
	'test_paper3'
	'test_r'
	'test_response'
	'train_BW_predictor'
	'train_kmeans'
	'train_paper3'
	'train_r'
	'train_response'
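
Because `load()` places every object in the global workspace, the listing above does not show which file each object came from. A minimal sketch of one way to check this is to load a file into its own environment; the helper name `objects_in_rda` and the temporary toy file are made up for this illustration.

```r
# Hypothetical helper: list the objects stored in one .rda file
# without adding them to the global workspace.
objects_in_rda <- function(path) {
  env <- new.env()
  load(path, envir = env)  # the loaded objects live in env only
  sort(ls(env))
}

# Toy demonstration with a temporary file (file and object names are made up)
tmp <- tempfile(fileext = ".rda")
toy_train <- matrix(0, 2, 2)
toy_test  <- matrix(1, 2, 2)
save(toy_train, toy_test, file = tmp)
objects_in_rda(tmp)
```

Applied to this notebook, one could call `lapply(c("golub50gene.rda", "paper3.rda", "paper6.rda", "paper9.rda", "paper29.rda"), objects_in_rda)`, assuming the files are in the working directory.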

Check the dimensions of the training data from each paper



In [4]:

    
rbind(paper1 = dim(golub_train_50),
      paper3 = dim(train_paper3),
      paper9 = dim(train_BW_predictor),
      paper29 = dim(train_kmeans))

	paper1 38 50
	paper3 38 50
	paper9 48 50
	paper29 38 50



In [5]:

    
golub_50_col = colnames(golub_train_50)
paper3_col = colnames(train_paper3)
paper9_col = colnames(train_BW_predictor)
paper29_col = colnames(train_kmeans)

Check the similarity between the different selections

Paper 1 and Paper 3 share 26 of their 50 selected genes. Paper 1 ranks genes by prediction strength (later defined as the signal-to-noise ratio), while Paper 3 uses the TNoM score (TNoM stands for threshold number of misclassifications).



In [6]:

    
length(intersect(golub_50_col, paper3_col))
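
As a rough illustration of the two scores named above, the sketch below computes a Golub-style signal-to-noise score and a TNoM score on toy data. The function names `snr_score` and `tnom_score` are hypothetical, and the exact variants used in the paper implementations may differ (for example in how ties or gene ranking are handled); this is a minimal sketch under those assumptions.

```r
# Hypothetical helper: signal-to-noise score for each gene,
# (mean of class 1 - mean of class 2) / (sd of class 1 + sd of class 2).
# x: numeric matrix, samples in rows, genes in columns; y: two-class labels.
snr_score <- function(x, y) {
  y <- factor(y)
  stopifnot(nlevels(y) == 2, nrow(x) == length(y))
  a <- x[y == levels(y)[1], , drop = FALSE]
  b <- x[y == levels(y)[2], , drop = FALSE]
  (colMeans(a) - colMeans(b)) / (apply(a, 2, sd) + apply(b, 2, sd))
}

# Hypothetical helper: TNoM score for one gene, i.e. the smallest number of
# misclassified samples over all thresholds and both directions of the rule.
# v: expression values of one gene; y: two-class labels.
tnom_score <- function(v, y) {
  y <- factor(y)
  stopifnot(nlevels(y) == 2, length(v) == length(y))
  ord <- order(v)
  is_first <- as.integer(y[ord] == levels(y)[1])
  n <- length(v)
  k <- 0:n                            # number of samples below the threshold
  below_first <- c(0, cumsum(is_first))
  # Errors when samples below the threshold are called class 1, the rest class 2
  err <- (k - below_first) + (sum(is_first) - below_first)
  # A threshold cannot separate tied values
  valid <- c(TRUE, diff(v[ord]) > 0, TRUE)
  min(err[valid], (n - err)[valid])   # second term: the rule reversed
}

# Toy data: one gene, six samples
x <- matrix(c(1, 2, 3, 5, 6, 7), ncol = 1, dimnames = list(NULL, "gene1"))
y <- c("ALL", "ALL", "ALL", "AML", "AML", "AML")
snr_score(x, y)
tnom_score(x[, 1], y)
```

A gene that separates the two classes well has a signal-to-noise score far from zero and a TNoM score close to zero, so the two methods rank genes on different scales and need not agree on the top 50.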

Paper 1 and Paper 9 share 25 genes. Paper 9 uses preprocessing and feature selection methods similar to those of Paper 1. However, Paper 9 performs the preprocessing on the entire merged data, while Paper 1 performs it on the training data only. Paper 9 then randomly splits the dataset into training and test sets and uses the training set to select feature genes.



In [7]:

    
length(intersect(golub_50_col, paper9_col))

Paper 3 and Paper 9 have 19 genes in common.



In [8]:

    
length(intersect(paper3_col, paper9_col))

Paper 29 intersection with the others

Paper 29 shares 9 genes with Paper 1, 6 with Paper 3, and 3 with Paper 9.



In [9]:

    
length(intersect(golub_50_col, paper29_col))
length(intersect(paper3_col, paper29_col))
length(intersect(paper9_col, paper29_col))

Which genes did all the methods select?



In [13]:

    
all_contain = intersect(intersect(golub_50_col, paper29_col), intersect( paper3_col, paper9_col))
length(all_contain)
all_contain

3

	'U22376_cds2_s_at'
	'M31523_at'
	'M27891_at'

Which genes did Papers 1, 3, and 9 all select (leaving out Paper 29)?



In [15]:

    
first_contain = intersect(golub_50_col, intersect( paper3_col, paper9_col))
length(first_contain)
first_contain

16

	'U22376_cds2_s_at'
	'M31523_at'
	'L47738_at'
	'J05243_at'
	'M92287_at'
	'M11722_at'
	'M27891_at'
	'X95735_at'
	'M23197_at'
	'M84526_at'
	'D88422_at'
	'M16038_at'
	'U46499_at'
	'L09209_s_at'
	'X62654_rna1_at'
	'M96326_rna1_at'
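
The nested `intersect()` calls above can also be written with `Reduce()`, which scales to any number of gene lists. The sketch below uses toy gene names, and the helper names `common_to_all` and `selection_count` are hypothetical; it also counts how many lists contain each gene, which is one way to find genes chosen by most, but not all, methods.

```r
# Hypothetical helper: genes present in every list
common_to_all <- function(sets) Reduce(intersect, sets)

# Hypothetical helper: for each gene, the number of lists that contain it
selection_count <- function(sets) {
  sort(table(unlist(lapply(sets, unique))), decreasing = TRUE)
}

# Toy gene lists standing in for golub_50_col, paper3_col, and so on
sets <- list(A = c("g1", "g2", "g3"),
             B = c("g2", "g3", "g4"),
             C = c("g3", "g5"))

common_to_all(sets)
counts <- selection_count(sets)
names(counts)[counts >= 2]  # genes selected by at least two of the three lists
```

With the notebook's objects, the list would be `list(golub_50_col, paper3_col, paper9_col, paper29_col)`.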

Summary of the comparison

	Paper1	Paper3	Paper9	Paper29
Paper1	50
Paper3	26	50
Paper9	25	19	50
Paper29	9	6	3	50
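
The table above was assembled from the individual `intersect()` calls. As a hedged alternative, the whole matrix of pairwise overlaps could be computed in one step; `overlap_matrix` is a hypothetical helper name, and the gene lists below are toy data rather than the notebook's results.

```r
# Hypothetical helper: symmetric matrix of pairwise overlap sizes.
# sets: named list of character vectors (one vector of gene names per paper).
overlap_matrix <- function(sets) {
  sapply(sets, function(a) {
    sapply(sets, function(b) length(intersect(a, b)))
  })
}

# Toy gene lists
sets <- list(A = c("g1", "g2", "g3"),
             B = c("g2", "g3", "g4"),
             C = c("g3", "g5"))
m <- overlap_matrix(sets)
m
```

In this notebook the call would be `overlap_matrix(list(Paper1 = golub_50_col, Paper3 = paper3_col, Paper9 = paper9_col, Paper29 = paper29_col))`; the diagonal then holds the number of genes selected by each paper.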


