Summary of papers checked

| Paper ID | Dataset | Feature Selection | Classifier | Validation/Test |
|---|---|---|---|---|
| 1 | $72(Train\ 38, Test\ 34)\times 6817$ $(47 ALL, 25 AML)$ | Informative genes | Golub classifier: weighted vote + prediction strength (difference between votes) | LOOCV on initial data and on independent test data |
| 2 | $72(Train\ 38, Test\ 34)\times 6817$ $(47 ALL, 25 AML)$ | Relative class separation (same as informative genes) | Golub classifier: weighted vote + prediction strength (difference between votes) | LOOCV on initial data and on independent test data |
| 3 | $72\times 7129$ $(47 ALL, 25 AML)$ | Threshold Number of Misclassifications (TNoM) | Nearest neighbor; SVM (linear and quadratic kernel); boosting (100, 1000, 10000 iterations) | LOOCV on whole set |
| 4 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | Same as Golub paper | Linear-kernel SVM (all features; top 25, 250, 500, 1000 features); perceptron (details lacking) | Tune by CV on Train set, test on Test set; also reorganized the dataset to explore more splits |
| 5 | $72\times 7070$ $(47 ALL, 25 AML)$ | MVR (median vote relevance), NBGR (naive Bayes global relevance), MAR (Golub-paper relevance); ranked using all 72 observations | SVM (linear and radial kernel) | LOOCV; Train/Test 38/34; Train/Test 34/38 |
| 6 | $72(Train\ 38, Test\ 34)\times 6817$ $(47 ALL, 25 AML)$ | Dimension reduction: PCA, PLS (partial least squares) (from p = 50) | Logistic and quadratic discriminant analysis | LOOCV on Train, test on Test set; re-randomization with equal 36/36 split |
| 7 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | Recursive feature elimination (RFE) | SVM (linear kernel) | LOOCV/Test: success rate (at zero rejection) and acceptance rate (at zero error) for various numbers of features |
| 8 | $72(Train\ 38, Test\ 34)\times 6817$ $(47 ALL, 25 AML)$ | Almost the same as paper 6 | Almost the same as paper 6 | Almost the same as paper 6 |
| 9 | $72\times 6817$ $(47 ALL, 25 AML)$ -> $72\times 3571$ | BW: ratio of between-group to within-group sums of squares | Linear and quadratic discriminant analysis (FLDA, DLDA, DQDA); Golub classifier; classification trees (CV, bagged, boosted, boosted with CPD); nearest neighbors | 2:1 Train/Test random splits |
| 10 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | | Single C4.5 decision tree, bagged C4.5, AdaBoost C4.5 (as implemented in WEKA) | Test accuracy |
| 11 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | Adaptive effective dimension reduction (MAVE) of Xia et al. (2002) | MAVE-LD, DLDA, DQDA, MAVE-NPLD | LOOCV/Test accuracy with 50, 100, 200 genes |
| 12 | $72\times 7129$ $(47 ALL, 25 AML)$ | Classifier-feedback approach + disjoint PCA | Soft Independent Modeling of Class Analogy (SIMCA) | Test accuracy |
| 13 | $72\times 7129$ $(47 ALL, 25 AML)$ | This is a clustering paper. | | |
| 14 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ -> $72\times 3571$ | Nonparametric scoring method | LogitBoost, AdaBoost, nearest neighbor, classification tree | LOOCV; tune on Train and test on Test |
| 15 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | Ratio of between-class to within-class sums of squares for each gene | MSVM (linear and Gaussian kernel) | Misclassification on Test |
| 16 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | Ideal feature construction | Maximal Margin Linear Programming (MAMA) | LOOCV on Train / Test misclassification |
| 17 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | Univariate ranking (UR), recursive feature elimination (RFE) | Penalized logistic regression, SVM | 10-fold CV on Train / Test error |
| 18 | $72\times 7129$ $(47 ALL, 25 AML)$ | Information gain, twoing rule, sum minority, max minority, Gini index, sum of variances, one-dimensional SVM, t-statistics | SVM, KNN, naive Bayes, J4.8 decision tree | Test accuracy |
| 19 | $72\times 5327$ $(47 ALL, 25 AML)$ | (1) ratio of between-category to within-category sums of squares (BW) (Dudoit et al., 2002); (2–3) signal-to-noise (S2N) scores (Golub et al., 1999) applied in an OVR (S2N-OVR) and an OVO (S2N-OVO) fashion; (4) Kruskal–Wallis nonparametric one-way ANOVA (KW) (Jones, 1997); (5) no feature selection | MC-SVM, neural network, KNN, PNN | 10-fold CV accuracy |
| 20 | $72(Train\ 38, Test\ 34)\times 3571$ $(47 ALL, 25 AML)$ | BSS/WSS criterion (Dudoit et al., 2002), Wilcoxon rank-based statistics, soft-thresholding method (Tibshirani et al., 2002) | Fisher's linear discriminant analysis (FLDA); diagonal linear and quadratic discriminant analysis (DLDA, DQDA); logistic regression (LOGISTIC); generalized partial least squares (GPLS); k nearest neighbors (kNN); CART and aggregated classifiers (BAG, BOOST, LogitBoost, RandomForest); single- and multi-layer neural networks (NN-1, NN-3); SVM (linear, radial); flexible discriminant analysis (FDA-POL, FDA-MARS); penalized discriminant analysis (PDA); mixture discriminant analysis (MDA-Linear, MDA-MARS); shrunken centroids method (Predictive Analysis of Microarrays, PAM) | Mean test-set error |
| 21 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | No explicit feature selection | TSP (top scoring pairs), C4.5 decision trees (DT), naive Bayes (NB), k-nearest neighbor (k-NN), SVM, prediction analysis of microarrays (PAM) | LOOCV accuracy on Train; test accuracy reported |
| 22 | $38\times 3051$ $(27 ALL, 11 AML)$ | CV, F-ratio | Without variable selection: random forest, diagonal linear discriminant analysis (DLDA), K nearest neighbor (KNN), SVM with linear kernel; with variable selection: shrunken centroids (SC), SC.l and SC.s, nearest neighbor | Bootstrap leave-out error |
| 23 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | RMIMR (Rough Maximum Interaction-Maximum Relevance) | SVM and naive Bayes | LOOCV error rate |
| 24 | $72\times 7129$ $(47 ALL, 25 AML)$ | Independent component analysis (ICA) | SVM, PCA+FDA, P-RR, P-PCR, P-ICR, PAM | Train/Test accuracy |
| 25 | $38\times 3051$ $(27 ALL, 11 AML)$ | Family-wise error rate, BBF (Based Bayes error Filter) | KNN, SVM | LOOCV |
| 26 | $72(Train\ 38, Test\ 34)\times 3051$ $(47 ALL, 25 AML)$ | The q genes (out of p) with the most significant expression difference between arrays with y = 1 and arrays with y = 0 | Parametric bootstrap model | Bootstrap mean prediction error |
| 27 | $62\times 7129$ | Stepwise regression-based feature selection, ICA-based feature transformation | Naive Bayes | Hold-out accuracy |
| 28 | $72\times 7129$ $(47 ALL, 25 AML)$ | Information gain attribute evaluator, Relief attribute evaluator, and correlation-based feature selection (CFS) | Mixed-integer linear programming based hyper-box enclosure (HBE) approach | Test/LOOCV/10-fold CV accuracy |
| 29 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | Signal-to-noise ratio, K-means clustering | Bayesian network | Classification accuracy |
| 30 | $34\times 7129$ $(20 ALL, 14 AML)$ | Gene selection is the paper's focus | K-means clustering | Accuracy, specificity, sensitivity |

Reasons for exclusion

1. Repeated algorithm (papers 2, 4, 5, 7, 8, 17, 20)
2. Encapsulated in software (papers 4, 5, 10, 11, 18, 19, 21, 26)
3. Unfamiliar methods (papers 11, 12, 14, 16, 22, 23, 26, 28)
4. Clustering (paper 13)
5. Uses the Golub data for a multiclass problem (papers 15, 18)
6. Does not use the whole Golub dataset, or is unclear about the dataset (papers 22, 25, 27, 30)
7. Parameters or settings not clearly specified in the paper (papers 28, 29, 30)

Paper 1: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring

  • Data:
    • Train: 38*7129 (27 ALL / 11 AML)
    • Test: 34*7129 (20 ALL / 14 AML)
  • Preprocessing:
    • Thresholding
    • Filtering
    • Transformation
    • Normalization
  • Feature Selection:
    • Neighborhood analysis
    • Informative gene selection
  • Steps (a code sketch follows this list):
    • Preprocess Train/Test data using Train data statistics
    • Feature selection using Train data
    • Perform LOOCV on Train data: in each fold, build the model and record the LOOCV accuracy
    • Predict on Test data using the genes selected and the model built on Train data, and record the Test accuracy
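
A minimal sketch of this pipeline (the code referenced above), assuming the thresholding/filtering parameters commonly reported for this dataset (floor 100, ceiling 16000, keep genes with max/min > 5 and max − min > 500; see Dudoit et al., 2002); `X_train`, `y_train`, `X_test` are hypothetical preloaded arrays:

```python
import numpy as np

def golub_preprocess(X_train, X_test):
    """Threshold, filter, and log-transform, using Train statistics only.
    Parameter values are those commonly reported for this dataset
    (assumption); the per-array normalization step is omitted."""
    Xtr = np.clip(X_train, 100, 16000)
    Xte = np.clip(X_test, 100, 16000)
    mx, mn = Xtr.max(axis=0), Xtr.min(axis=0)
    keep = (mx / mn > 5) & (mx - mn > 500)  # filter computed on Train only
    return np.log10(Xtr[:, keep]), np.log10(Xte[:, keep])

def s2n_scores(X, y):
    """Signal-to-noise score per gene: (mu1 - mu0) / (sd1 + sd0)."""
    X1, X0 = X[y == 1], X[y == 0]
    return (X1.mean(0) - X0.mean(0)) / (X1.std(0, ddof=1) + X0.std(0, ddof=1))

def informative_genes(X, y, k=50):
    """The k/2 genes most correlated with each class (50 in the paper)."""
    order = np.argsort(s2n_scores(X, y))
    return np.concatenate([order[:k // 2], order[-(k // 2):]])

def weighted_vote(X_train, y_train, x_new, genes):
    """Weighted vote: gene g casts v_g = w_g * (x_g - b_g), with w_g its S2N
    score and b_g the midpoint of the two class means; the sign of the total
    vote decides the class, and prediction strength is the vote margin."""
    w = s2n_scores(X_train, y_train)[genes]
    mu1 = X_train[y_train == 1][:, genes].mean(0)
    mu0 = X_train[y_train == 0][:, genes].mean(0)
    v = w * (x_new[genes] - (mu1 + mu0) / 2)
    v1, v0 = v[v > 0].sum(), -v[v < 0].sum()  # vote mass for class 1 / class 0
    return int(v1 > v0), abs(v1 - v0) / (v1 + v0)
```

Calling `weighted_vote` once per held-out training sample gives the LOOCV accuracy, and once per test sample the Test accuracy; the paper additionally treats predictions whose prediction strength falls below 0.3 as uncertain.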

Paper 3: Tissue Classification with Gene Expression Profiles

  • Data:
    • Merged: 72*7129 (47 ALL / 25 AML)
  • Preprocessing:
    • Preprocessing is not specified; we use the Golub preprocessing instead.
  • Feature Selection:
    • TNoM score
    • Significance: Monte Carlo permutation test
  • Steps (a TNoM sketch follows this list):
    • Perform LOOCV from the start on the merged Golub data
    • Preprocessing and feature selection on the 71 held-in samples
    • Train on the 71 held-in samples, predict on the left-out sample, and record the LOOCV accuracy
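
A minimal sketch of the TNoM score with a Monte Carlo permutation p-value, assuming the usual definition (the smallest number of misclassifications achievable by any single-gene threshold rule); `x` and `y` are hypothetical arrays holding one gene's expression values and the 0/1 labels:

```python
import numpy as np

def tnom_score(x, y):
    """Scan every cut point in the sorted expression values; for each cut and
    each orientation of the rule, count misclassified samples; return the min."""
    y_sorted = np.asarray(y)[np.argsort(x)]
    n1 = int(y_sorted.sum())
    n0 = len(y_sorted) - n1
    left1 = np.concatenate(([0], np.cumsum(y_sorted)))  # class-1 count below each cut
    left0 = np.arange(len(y_sorted) + 1) - left1        # class-0 count below each cut
    errs_a = left1 + (n0 - left0)  # rule: predict 0 below the cut, 1 above
    errs_b = left0 + (n1 - left1)  # rule: predict 1 below the cut, 0 above
    return int(np.minimum(errs_a, errs_b).min())

def tnom_pvalue(x, y, n_perm=1000, seed=0):
    """Monte Carlo permutation test: how often does a random relabeling
    separate the samples at least as well as the observed labels?"""
    rng = np.random.default_rng(seed)
    s = tnom_score(x, y)
    null = [tnom_score(x, rng.permutation(y)) for _ in range(n_perm)]
    return (np.sum(np.asarray(null) <= s) + 1) / (n_perm + 1)
```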

Paper 6: Classification of Acute Leukemia Based on DNA Microarray Gene Expressions Using Partial Least Squares

  • Data:
    • Train: 38*7129 (27 ALL / 11 AML)
    • Test: 34*7129 (20 ALL / 14 AML)
  • Preprocessing:
    • Thresholding algorithm for rescaling; however, we use the Golub one instead.
  • Feature Selection:
    • Same 50 genes as in the Golub paper
  • Dimension Reduction:
    • PCA
    • PLS
  • Steps (a code sketch follows this list):
    • Select the same genes as in the Golub paper
    • Perform dimension reduction on the selected genes
    • LOOCV on Train data: build logistic and quadratic discriminant models and record the LOOCV accuracy
    • Reduce the Test data using the loadings from Train, predict with the model built on Train data, and record the Test accuracy
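
A minimal sketch with scikit-learn, assuming `X_tr`/`X_te` already hold the 50 selected genes and `y_tr`/`y_te` the 0/1 labels as numpy arrays; the component count `K = 3` is a hypothetical choice, not the paper's setting. PLS is refit inside every LOOCV fold so the left-out sample never influences the loadings:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

K = 3  # number of PLS components (hypothetical choice)

# LOOCV on the training set: refit PLS and the classifier in each fold.
hits = 0
for tr, te in LeaveOneOut().split(X_tr):
    pls = PLSRegression(n_components=K).fit(X_tr[tr], y_tr[tr])
    clf = LogisticRegression().fit(pls.transform(X_tr[tr]), y_tr[tr])
    hits += clf.predict(pls.transform(X_tr[te]))[0] == y_tr[te][0]
print("LOOCV accuracy:", hits / len(y_tr))

# Fit on the full training set, project the test set with the Train loadings.
pls = PLSRegression(n_components=K).fit(X_tr, y_tr)
qda = QuadraticDiscriminantAnalysis().fit(pls.transform(X_tr), y_tr)
print("Test accuracy:", qda.score(pls.transform(X_te), y_te))
```

For the PCA variant, swap in `sklearn.decomposition.PCA(n_components=K)` fitted on the Train data only, with the same transform-then-classify pattern.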

Paper 9: Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data

  • Data:
    • Merged: 72*7129 (47 ALL / 25 AML)
  • Preprocessing:
    • Thresholding
    • Filtering
    • Transformation
    • Normalization
  • Feature Selection:
    • BW
  • Steps (a code sketch follows this list):
    • Preprocess the whole merged data
    • Repeat the steps below 200 times:
      • Split the merged data into Train/Test with ratio 2:1
      • Feature selection and model training on the Train data
      • Test on the Test data using the features selected and the model built in the previous step
    • Average the Test accuracies over the 200 splits
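
A minimal sketch of the BW gene ranking inside repeated 2:1 splits, assuming the preprocessed `X` (72 samples by filtered genes) and labels `y` are already loaded; the gene count 40 and `k = 3` neighbors are hypothetical choices, and KNN stands in for the paper's full panel of classifiers:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def bw_ratio(X, y):
    """BW statistic per gene: between-group / within-group sums of squares."""
    overall = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        bss += len(Xc) * (Xc.mean(axis=0) - overall) ** 2
        wss += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return bss / wss

accs = []
for _ in range(200):  # 200 random 2:1 splits, as in the paper
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, stratify=y)
    genes = np.argsort(bw_ratio(X_tr, y_tr))[-40:]  # top genes by BW, on Train only
    clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr[:, genes], y_tr)
    accs.append(clf.score(X_te[:, genes], y_te))
print("mean Test accuracy over 200 splits:", np.mean(accs))
```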

Paper 29: Acute Leukemia Classification using Bayesian Networks

  • Data:
    • Train: 38*7129 (27 ALL / 11 AML)
    • Test: 34*7129 (20 ALL / 14 AML)
  • Preprocessing:
    • Thresholding
    • Filtering
    • Transformation
    • Normalization
  • Feature Selection:
    • Signal-to-noise ratio (SNR)
    • K-means clustering
  • Steps (a code sketch follows this list):
    • Same preprocessing and feature selection on Train data as in the Golub paper
    • Build a Bayesian network on Train data
    • Predict on Test data using the network built on Train data
    • Record the Test accuracy
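
A minimal sketch of the SNR + K-means feature-selection stage only (the Bayesian network itself is outside scikit-learn). How the paper combines the two is not spelled out above, so this is one plausible reading: rank genes by |SNR| on the Train data, cluster the top genes, and keep one representative per cluster to reduce redundancy; the counts 100 and 10 are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

def snr(X, y):
    """Signal-to-noise ratio per gene: (mu1 - mu0) / (sd1 + sd0)."""
    X1, X0 = X[y == 1], X[y == 0]
    return (X1.mean(0) - X0.mean(0)) / (X1.std(0, ddof=1) + X0.std(0, ddof=1))

scores = np.abs(snr(X_tr, y_tr))
top = np.argsort(scores)[-100:]  # top 100 genes by |SNR| (hypothetical count)

# Cluster the top genes by their expression profiles across the Train samples.
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_tr[:, top].T)

# Keep the highest-|SNR| gene from each cluster as its representative.
reps = []
for c in range(km.n_clusters):
    members = np.where(km.labels_ == c)[0]           # positions within `top`
    best = members[np.argmax(scores[top][members])]  # strongest member
    reps.append(top[best])
```

The resulting `reps` indices would then feed the Bayesian network classifier built on the Train data.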
