Feature Selection Result Comparison

The dataset produced by preprocessing and feature selection in each implemented paper is saved in an .rda file. Because paper 6 uses PCA/PLS dimension reduction rather than gene selection, its output dataset cannot be compared with the others. In this notebook we analyse the genes selected by the other four papers (1, 3, 9, and 29) and see how similar or different the selections are. For every paper, the seed is set to 201703 before feature selection.

Load Data

We also have two other datasets, both of which are the output of the preprocessing method applied in the Golub paper. The dataset with 3051 predictors was preprocessed using the train data only, and the genes kept for the train data were then selected in the test data as well. The other dataset has 3571 predictors because the preprocessing was done on the merged train and test data.
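The difference between the two preprocessing strategies can be sketched on synthetic data. This is a minimal illustration, not the notebook's actual preprocessing code: the helper `golub_filter`, the toy matrix `expr`, and the thresholds (floor 100, ceiling 16000, max/min > 5, max - min > 500, log10) are assumptions based on values commonly cited for this dataset.

```r
# Hypothetical helper: threshold, filter and log-transform an expression
# matrix with samples in rows and genes in columns. The default thresholds
# are assumed, not taken from this notebook.
golub_filter <- function(x, floor = 100, ceiling = 16000,
                         min_ratio = 5, min_diff = 500) {
  x <- pmin(pmax(x, floor), ceiling)          # clip values to [floor, ceiling]
  gmax <- apply(x, 2, max)                    # per-gene maximum
  gmin <- apply(x, 2, min)                    # per-gene minimum
  keep <- (gmax / gmin > min_ratio) & (gmax - gmin > min_diff)
  list(keep = keep, data = log10(x[, keep, drop = FALSE]))
}

# Synthetic data: 72 samples, 200 genes (toy values only)
set.seed(201703)
expr <- matrix(rexp(72 * 200, rate = 1 / 800), nrow = 72,
               dimnames = list(NULL, paste0("gene", 1:200)))
train_idx <- 1:38

# Strategy A: filter on the train data, then keep the same genes in test
fit_a   <- golub_filter(expr[train_idx, ])
train_a <- fit_a$data
test_a  <- log10(pmin(pmax(expr[-train_idx, fit_a$keep, drop = FALSE],
                           100), 16000))

# Strategy B: filter on the merged data before any split
fit_b <- golub_filter(expr)

c(train_only = ncol(train_a), merged = ncol(fit_b$data))
```

Adding samples can only raise a gene's maximum and lower its minimum, so every gene that passes the filter on the train data also passes it on the merged data. That is consistent with the merged dataset keeping more predictors (3571 versus 3051).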


In [2]:
load("golub50gene.rda") #Paper 1
load("paper3.rda")
load("paper6.rda")
load("paper9.rda")
load("paper29.rda")

Check the data loaded


In [3]:
ls()


  1. 'golub_test_50'
  2. 'golub_test_r'
  3. 'golub_test_response'
  4. 'golub_train_50'
  5. 'golub_train_r'
  6. 'golub_train_response'
  7. 'pca_test'
  8. 'pca_train'
  9. 'pls_test'
  10. 'pls_train'
  11. 'test_BW_predictor'
  12. 'test_kmeans'
  13. 'test_paper3'
  14. 'test_r'
  15. 'test_response'
  16. 'train_BW_predictor'
  17. 'train_kmeans'
  18. 'train_paper3'
  19. 'train_r'
  20. 'train_response'

Check the dimensions of the train data from the different papers


In [4]:
rbind(paper1 = dim(golub_train_50),
      paper3 = dim(train_paper3),
      paper9 = dim(train_BW_predictor),
      paper29 = dim(train_kmeans))


paper1   38  50
paper3   38  50
paper9   48  50
paper29  38  50

In [5]:
golub_50_col = colnames(golub_train_50)
paper3_col = colnames(train_paper3)
paper9_col = colnames(train_BW_predictor)
paper29_col = colnames(train_kmeans)

Check the similarity between the different selections

  • Paper 1 and Paper 3 share 26 of their 50 selected genes. Paper 1 ranks genes by prediction strength (later defined as the signal-to-noise ratio), while Paper 3 uses the TNoM score (TNoM stands for threshold number of misclassifications).

In [6]:
length(intersect(golub_50_col, paper3_col))


26
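
The two ranking criteria named above can be sketched for a single gene as follows. This is an illustrative sketch, not the code used to implement either paper: `snr` and `tnom` are hypothetical names, the class labels are assumed to be a two-level factor, and the TNoM version assumes no tied expression values.

```r
# Signal-to-noise ratio for one gene: difference of the class means
# divided by the sum of the class standard deviations.
snr <- function(x, y) {
  a <- x[y == levels(y)[1]]
  b <- x[y == levels(y)[2]]
  (mean(a) - mean(b)) / (sd(a) + sd(b))
}

# TNoM score for one gene: the smallest number of misclassified samples
# achievable by any single threshold on the expression value.
tnom <- function(x, y) {
  is_first <- (y == levels(y)[1])[order(x)]   # labels sorted by expression
  left_first  <- c(0, cumsum(is_first))       # first-class samples below cut
  left_second <- c(0, cumsum(!is_first))      # second-class samples below cut
  n_first  <- sum(is_first)
  n_second <- sum(!is_first)
  # Rule A predicts the first class below the cut, rule B the second class
  err_a <- left_second + (n_first - left_first)
  err_b <- left_first + (n_second - left_second)
  min(err_a, err_b)
}

# Toy example with made-up expression values
y <- factor(c("ALL", "ALL", "ALL", "AML", "AML", "AML"))
clean_gene <- c(1, 2, 3, 10, 11, 12)   # perfectly separates the classes
noisy_gene <- c(1, 2, 10, 3, 11, 12)   # one sample on the wrong side

snr(clean_gene, y)
tnom(clean_gene, y)
tnom(noisy_gene, y)
```

A large absolute signal-to-noise ratio and a small TNoM score both indicate a gene that separates the two classes well, but they rank genes differently, which is one reason the two selections only partly overlap.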
  • Paper 1 and Paper 9 share 25 genes. Paper 9 uses preprocessing and feature selection methods similar to those of Paper 1. However, Paper 9 preprocesses the entire merged data, while Paper 1 preprocesses the train data only. Paper 9 then randomly splits the dataset into train and test sets and uses the train set to select feature genes.

In [7]:
length(intersect(golub_50_col, paper9_col))


25
  • Paper 3 and Paper 9 have 19 genes in common.

In [8]:
length(intersect(paper3_col, paper9_col))


19
  • Paper 29 overlap with the others: it shares 9 genes with Paper 1, 6 with Paper 3, and 3 with Paper 9.

In [9]:
length(intersect(golub_50_col, paper29_col))
length(intersect(paper3_col, paper29_col))
length(intersect(paper9_col, paper29_col))


9
6
3
  • Which genes did all four methods select?

In [13]:
all_contain = intersect(intersect(golub_50_col, paper29_col), intersect( paper3_col, paper9_col))
length(all_contain)
all_contain


3
  1. 'U22376_cds2_s_at'
  2. 'M31523_at'
  3. 'M27891_at'
  • Which genes did Papers 1, 3, and 9 all select?

In [15]:
first_contain = intersect(golub_50_col, intersect( paper3_col, paper9_col))
length(first_contain)
first_contain


16
  1. 'U22376_cds2_s_at'
  2. 'M31523_at'
  3. 'L47738_at'
  4. 'J05243_at'
  5. 'M92287_at'
  6. 'M11722_at'
  7. 'M27891_at'
  8. 'X95735_at'
  9. 'M23197_at'
  10. 'M84526_at'
  11. 'D88422_at'
  12. 'M16038_at'
  13. 'U46499_at'
  14. 'L09209_s_at'
  15. 'X62654_rna1_at'
  16. 'M96326_rna1_at'
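
The nested `intersect` calls above can be generalised with `Reduce`, and a vote count shows how many methods picked each gene. The sketch below uses made-up gene names; in this notebook the list would hold `golub_50_col`, `paper3_col`, `paper9_col` and `paper29_col`. It assumes each vector contains no duplicated names.

```r
# Toy gene sets (hypothetical names, for illustration only)
sets <- list(
  paper1  = c("g1", "g2", "g3", "g4"),
  paper3  = c("g1", "g2", "g5"),
  paper9  = c("g1", "g2", "g3"),
  paper29 = c("g1", "g6")
)

# Genes selected by every method: fold intersect() over the list
common_all <- Reduce(intersect, sets)

# Number of methods that selected each gene, most popular first
votes <- sort(table(unlist(sets)), decreasing = TRUE)

# Genes selected by at least three of the four methods
at_least_3 <- names(votes)[votes >= 3]

common_all
votes
at_least_3
```

The vote table is a compact alternative to computing a separate intersection for every subset of methods.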

Summary of the comparison

          Paper1  Paper3  Paper9  Paper29
Paper1        50
Paper3        26      50
Paper9        25      19      50
Paper29        9       6       3       50
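
A table like the one above can be computed in one step rather than assembled by hand. The following is a sketch with made-up gene sets; `overlap_matrix` is a hypothetical helper, and in this notebook the list would hold the four column-name vectors defined earlier.

```r
# Hypothetical helper: pairwise overlap counts between named gene sets
overlap_matrix <- function(sets) {
  sapply(sets, function(a) {
    sapply(sets, function(b) length(intersect(a, b)))
  })
}

# Toy gene sets (hypothetical names, for illustration only)
sets <- list(
  paper1  = c("g1", "g2", "g3", "g4"),
  paper3  = c("g1", "g2", "g5"),
  paper9  = c("g1", "g2", "g3"),
  paper29 = c("g1", "g6")
)

m <- overlap_matrix(sets)

# Keep only the lower triangle and diagonal, as in the summary table
lower <- m
lower[upper.tri(lower)] <- NA
print(lower, na.print = "")
```

The diagonal holds the size of each set, so with the real selections it would read 50 for every paper.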
