The dataset produced by preprocessing and feature selection for each implemented paper is saved in an .rda file. Because paper 6 uses PCA/PLS dimension reduction, its output consists of derived components rather than selected genes, so it cannot be compared with the others. In this notebook, we analyse the genes chosen by the other papers (1, 3, 9, and 29) and see how similar or different they are. For each paper, we set the seed to 201703 before feature selection.
Load Data
We also have two other datasets, both produced by the preprocessing method used in the Golub paper. The dataset with 3051 predictors was preprocessed using the training data only, and the same genes were then selected for the test data. The other dataset has 3571 predictors because the preprocessing was applied to the merged data.
In [2]:
load("golub50gene.rda") #Paper 1
load("paper3.rda")
load("paper6.rda")
load("paper9.rda")
load("paper29.rda")
Check the data loaded
In [3]:
ls()
Check the dimensions of the training data from the different papers
In [4]:
rbind(paper1 = dim(golub_train_50),
      paper3 = dim(train_paper3),
      paper9 = dim(train_BW_predictor),
      paper29 = dim(train_kmeans))
In [5]:
golub_50_col = colnames(golub_train_50)
paper3_col = colnames(train_paper3)
paper9_col = colnames(train_BW_predictor)
paper29_col = colnames(train_kmeans)
Check the similarity between the different selections
In [6]:
length(intersect(golub_50_col, paper3_col))
In [7]:
length(intersect(golub_50_col, paper9_col))
In [8]:
length(intersect(paper3_col, paper9_col))
In [9]:
length(intersect(golub_50_col, paper29_col))
length(intersect(paper3_col, paper29_col))
length(intersect(paper9_col, paper29_col))
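The pairwise `length(intersect(...))` calls above could also be gathered into a single matrix. The following is only a minimal sketch: `overlap_matrix` is a hypothetical helper, and the toy gene names are made up for illustration rather than taken from the .rda files.

```r
# Hypothetical helper: pairwise overlap counts for a named list of
# gene-name vectors. Entry [i, j] is the number of shared names.
overlap_matrix <- function(sets) {
  n <- length(sets)
  m <- matrix(0L, n, n, dimnames = list(names(sets), names(sets)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      m[i, j] <- length(intersect(sets[[i]], sets[[j]]))
    }
  }
  m
}

# Toy gene names (illustrative only, not the real selections)
toy_sets <- list(
  A = c("g1", "g2", "g3", "g4"),
  B = c("g2", "g3", "g5"),
  C = c("g3", "g6")
)
overlap_toy <- overlap_matrix(toy_sets)
print(overlap_toy)
```

With the notebook's objects, the same helper could presumably be called as `overlap_matrix(list(paper1 = golub_50_col, paper3 = paper3_col, paper9 = paper9_col, paper29 = paper29_col))`.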
In [13]:
all_contain = intersect(intersect(golub_50_col, paper29_col), intersect( paper3_col, paper9_col))
length(all_contain)
all_contain
In [15]:
first_contain = intersect(golub_50_col, intersect( paper3_col, paper9_col))
length(first_contain)
first_contain
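Nested `intersect` calls become harder to read as more selections are added. One possible alternative, sketched here with a hypothetical helper `common_genes` and made-up gene names, is to fold `intersect` over a list with `Reduce`.

```r
# Hypothetical helper: names present in every vector of the list.
common_genes <- function(sets) Reduce(intersect, sets)

# Toy gene names (illustrative only, not the real selections)
toy_sets <- list(
  A = c("g1", "g2", "g3", "g4"),
  B = c("g2", "g3", "g5"),
  C = c("g3", "g6")
)

shared_all <- common_genes(toy_sets)               # shared by A, B and C
shared_ab  <- common_genes(toy_sets[c("A", "B")])  # shared by A and B only
print(shared_all)
print(shared_ab)
```

Applied to the notebook's objects, `common_genes(list(golub_50_col, paper3_col, paper9_col))` should give the same set as `first_contain`.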
Summary of the comparison
| | Paper1 | Paper3 | Paper9 | Paper29 |
|---|---|---|---|---|
| Paper1 | 50 | | | |
| Paper3 | 26 | 50 | | |
| Paper9 | 25 | 19 | 50 | |
| Paper29 | 9 | 6 | 3 | 50 |
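The overlap counts in the summary table can optionally be put on a 0 to 1 scale. The sketch below re-enters the counts from the table by hand and computes the Jaccard index, |A ∩ B| / |A ∪ B|. The choice of Jaccard is an assumption made here for illustration and is not part of the original analysis.

```r
papers <- c("Paper1", "Paper3", "Paper9", "Paper29")

# Overlap counts copied from the summary table (lower triangle + diagonal)
overlap <- matrix(NA_real_, 4, 4, dimnames = list(papers, papers))
diag(overlap) <- 50
overlap["Paper3", "Paper1"] <- 26
overlap["Paper9", c("Paper1", "Paper3")] <- c(25, 19)
overlap["Paper29", c("Paper1", "Paper3", "Paper9")] <- c(9, 6, 3)

# Mirror the lower triangle into the upper triangle
overlap[upper.tri(overlap)] <- t(overlap)[upper.tri(overlap)]

# Jaccard index: |A and B| / (|A| + |B| - |A and B|)
sizes <- diag(overlap)
jaccard <- overlap / (outer(sizes, sizes, "+") - overlap)
print(round(jaccard, 3))
```

Because every selection has 50 genes, the ranking of pairs is the same as in the raw counts. The normalised values would mainly matter if the selections differed in size.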