Paper ID | Dataset Description | Feature Selection | Classifier | Validation/Test |
---|---|---|---|---|
1 | $72(Train\ 38, Test\ 34)\times 6817$ $(47 ALL, 25 AML)$ | Informative genes (signal-to-noise ranking; see the weighted-vote sketch after the table) | Golub classifier: weighted vote + prediction strength (difference between votes) | LOOCV on the initial data and evaluation on independent test data |
2 | $72(Train\ 38, Test\ 34)\times 6817$ $(47 ALL, 25 AML)$ | Relative class separation (same as informative genes) | Golub classifier: weighted vote + prediction strength (difference between votes) | LOOCV on the initial data and evaluation on independent test data |
3 | $72\times 7129$ $(47 ALL, 25 AML)$ | The Threshold Number of Misclassifications (TNoM; see the sketch after the table) | Nearest neighbor, SVM (linear kernel, quadratic kernel), boosting (100, 1000, 10000 iterations) | LOOCV on the whole set |
4 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | Same as the Golub paper | Linear-kernel SVM (all features; top 25, 250, 500, 1000 features), perceptron (few details given) | Tuned by CV on the train set, tested on the test set; the data set was also reorganized to explore more splits |
5 | $72 \times 7070$ $(47 ALL, 25 AML)$ | MVR (median vote relevance), NBGR (naive Bayes global relevance), MAR (Golub paper relevance); genes ranked using all 72 observations | SVM (linear kernel, radial kernel) | LOOCV; Train/Test splits of 38/34 and 34/38 |
6 | $72(Train\ 38, Test\ 34)\times 6817$ $(47 ALL, 25 AML)$ | Dimension reduction: PCA, PLS (partial least squares) (from p = 50) | Logistic and quadratic discriminant analysis | LOOCV on Train, evaluation on the test set; re-randomization with an equal 36/36 split |
7 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | Recursive Feature Elimination (see the RFE sketch after the table) | SVM (linear kernel) | LOOCV/Test: success rate (at zero rejection) and acceptance rate (at zero error) with varying numbers of features |
8 | $72(Train\ 38, Test\ 34)\times 6817$ $(47 ALL, 25 AML)$ | Almost the same as paper 6 | Almost the same as paper 6 | Almost the same as paper 6 |
9 | $72\times 6817$ $(47 ALL, 25 AML)$ -> $72\times 3571$ | BW: ratio of between-group to within-group sums of squares (see the BSS/WSS sketch after the table) | Linear and quadratic discriminant analysis (FLDA, DLDA, DQDA), Golub classification, classification trees (CV, Bag, Boosted, Boosted with CPD), nearest neighbors | 2:1 train/test random split |
10 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | Not stated | Single C4.5 (decision tree), bagged C4.5, AdaBoost C4.5 (as implemented in WEKA) | Test accuracy |
11 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | adaptive effective dimension reduction approach (MAVE) of Xia et al. (2002) | MAVE-LD, DLDA, DQDA, MAVE-NPLD | LOOCV/Test accuracy with 50, 100, 200 genes |
12 | $72\times 7129$ $(47 ALL, 25 AML)$ | Classifier feedback approach + disjoint PCA | Soft Independent Modeling of Class Analogy (SIMCA) classification | Test accuracy |
13 | $72\times 7129$ $(47 ALL, 25 AML)$ | This is a clustering paper; no classification was performed. | | |
14 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ -> $72\times 3571$ | Nonparametric scoring method | LogitBoost, AdaBoost, nearest neighbor, classification tree | Tuned by LOOCV on Train; evaluated on Test |
15 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | Ratio of between-class to within-class sums of squares for each gene | MSVM (linear and Gaussian kernels) | Misclassification on Test |
16 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | Ideal feature construction | Maximal margin linear programming (MAMA) | LOOCV on Train/Test: misclassification |
17 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | Univariate ranking (UR), recursive feature elimination (RFE) | Penalized logistic regression, SVM | 10-fold CV on Train/Test: error |
18 | $72\times 7129$ $(47 ALL, 25 AML)$ | Information gain, twoing rule, sum minority, max minority, Gini index, sum of variances, one-dimensional SVM and t-statistics | SVM, KNN, naive Bayes, J4.8 decision tree | Test accuracy |
19 | $72\times 5327$ $(47 ALL, 25 AML)$ | (1) ratio of between-category to within-category sums of squares (BW) (Dudoit et al., 2002); (2–3) signal-to-noise (S2N) scores (Golub et al., 1999) applied in a one-versus-rest (S2N-OVR) and a one-versus-one (S2N-OVO) fashion; (4) Kruskal–Wallis non-parametric one-way ANOVA (KW) (Jones, 1997); (5) no feature selection | MC-SVM, neural network, KNN, PNN | 10-fold CV accuracy |
20 | $72(Train\ 38, Test\ 34)\times 3571$ $(47 ALL, 25 AML)$ | BSS/WSS criterion (Dudoit et al., 2002), Wilcoxon rank-based statistics and the soft-thresholding method (Tibshirani et al., 2002) | Fisher’s linear discriminant analysis (FLDA), diagonal linear and quadratic discriminant analysis (DLDA, DQDA), logistic regression (LOGISTIC), generalized partial least squares (GPLS), k-nearest neighbor (kNN), CART and aggregating classifiers (BAG, BOOST, LogitBoost, RandomForest), single- and multi-layer neural networks (NN-1, NN-3), support vector machines (SVM-linear, SVM-radial), flexible discriminant analysis (FDA-POL, FDA-MARS), penalized discriminant analysis (PDA), mixture discriminant analysis (MDA-Linear, MDA-MARS), shrunken centroids method (or Predictive Analysis of Microarrays, PAM) | Mean test-set error |
21 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | No explicit feature selection | TSP (top scoring pairs), C4.5 decision trees (DT), Naïve Bayes (NB), k-nearest neighbor (k-NN), support vector machines (SVM) and prediction analysis of microarrays (PAM) | LOOCV accuracy on Train; test accuracy reported |
22 | $38\times 3051$ $(27 ALL, 11 AML)$ | CV, F-ratio | Without variable selection: random forest, diagonal linear discriminant analysis (DLDA), k-nearest neighbor (KNN), SVM with linear kernel; with variable selection: shrunken centroids (SC), SC.l and SC.s, nearest neighbor | Bootstrap leave-out error |
23 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | RMIMR (Rough Maximum Interaction-Maximum Relevance) | SVM and naive Bayes | LOOCV error rate |
24 | $72\times 7129$ $(47 ALL, 25 AML)$ | Independent component analysis (ICA) | SVM, PCA+FDA, P-RR, P-PCR, P-ICR, PAM | Train/Test accuracy |
25 | $38\times 3051$ $(27 ALL, 11 AML)$ | Family-wise error rate, BBF (Based Bayes error Filter) | KNN, SVM | LOOCV |
26 | $72(Train\ 38, Test\ 34)\times 3051$ $(47 ALL, 25 AML)$ | The q genes with the most significant expression difference between arrays with y=1 and arrays with y=0, out of all p genes | Parametric bootstrap model | Bootstrap mean prediction error |
27 | $62\times 7129$ | Stepwise regression-based feature selection, ICA-based feature transformation | Naive Bayes | Hold-out accuracy |
28 | $72\times 7129$ $(47 ALL, 25 AML)$ | Information gain attribute evaluator, Relief attribute evaluator and correlation-based feature selection (CFS) | Mixed-integer linear programming based hyper-box enclosure (HBE) approach | Test/LOOCV/10-fold CV accuracy reported |
29 | $72(Train\ 38, Test\ 34)\times 7129$ $(47 ALL, 25 AML)$ | Signal-to-noise ratio, K-means clustering | Bayesian network | Classification accuracy reported |
30 | $34\times 7129$ $(20 ALL, 14 AML)$ | Clustering used for gene selection | K-means clustering | Accuracy, specificity, sensitivity |
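
Several of the papers above (1, 2, 19, 29) rank genes by Golub's signal-to-noise score $P(g) = (\mu_1 - \mu_0)/(\sigma_1 + \sigma_0)$ and classify with the weighted vote. Below is a minimal NumPy sketch of that scheme; the variable names and the 50-gene cut-off are illustrative, and genes are selected by absolute score rather than Golub's balanced 25-per-class choice.

```python
import numpy as np

def s2n_scores(X, y):
    """Signal-to-noise score P(g) = (mu_1 - mu_0) / (sigma_1 + sigma_0) per gene."""
    mu1, mu0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    s1, s0 = X[y == 1].std(axis=0), X[y == 0].std(axis=0)
    return (mu1 - mu0) / (s1 + s0)

def weighted_vote(X_train, y_train, x_new, n_genes=50):
    """Classify one sample; returns (predicted class, prediction strength)."""
    p = s2n_scores(X_train, y_train)
    genes = np.argsort(np.abs(p))[-n_genes:]            # most informative genes
    mid = 0.5 * (X_train[y_train == 1].mean(axis=0) +
                 X_train[y_train == 0].mean(axis=0))    # class-mean midpoint
    votes = p[genes] * (x_new[genes] - mid[genes])      # signed vote per gene
    v_pos, v_neg = votes[votes > 0].sum(), -votes[votes < 0].sum()
    ps = abs(v_pos - v_neg) / (v_pos + v_neg)           # prediction strength
    return (1 if v_pos > v_neg else 0), ps
```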
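Paper 3 scores each gene by the Threshold Number of Misclassifications: the fewest errors any single-gene threshold rule can make on the training labels. A brute-force sketch of that definition (quadratic in the sample count, which is harmless at n = 72):

```python
import numpy as np

def tnom(x, y):
    """TNoM for one gene: x is its expression across samples, y holds 0/1 labels."""
    ys = np.asarray(y)[np.argsort(x)]   # labels ordered by expression level
    n = len(ys)
    best = n
    for cut in range(n + 1):            # threshold after sorted position `cut`
        # errors of the rule "predict 1 below the cut, 0 above"
        err = (ys[:cut] == 0).sum() + (ys[cut:] == 1).sum()
        best = min(best, err, n - err)  # n - err covers the reversed rule
    return best
```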
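Papers 9, 15, 19 and 20 all use variants of the between- to within-group sums of squares ratio of Dudoit et al. (2002). A sketch, assuming X is a samples × genes matrix:

```python
import numpy as np

def bss_wss(X, y):
    """BSS/WSS ratio per gene (columns of X); larger means better class separation."""
    overall = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        bss += len(Xk) * (Xk.mean(axis=0) - overall) ** 2   # between-group term
        wss += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)    # within-group term
    return bss / wss

# e.g. keep the 50 top-ranked genes (the cut-off is arbitrary here):
# top = np.argsort(bss_wss(X, y))[-50:]
```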
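Paper 7's Recursive Feature Elimination trains a linear SVM, discards the genes with the smallest absolute weights, and repeats. scikit-learn's RFE implements this loop; the random placeholder data, the 50-gene target and the halving step below are illustrative choices, not the paper's exact settings.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(38, 7129))          # placeholder train matrix (38 x 7129)
y = rng.integers(0, 2, size=38)          # placeholder 0/1 labels

# drop half of the features per round until 50 genes remain
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=50, step=0.5)
rfe.fit(X, y)
selected = np.flatnonzero(rfe.support_)  # indices of the retained genes
```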
Notes on the papers above:

1. Algorithm repeated in a different article (papers 2, 4, 5, 7, 8, 17, 20)
2. Encapsulated in software (papers 4, 5, 10, 11, 18, 19, 21, 26)
3. Unfamiliar or impenetrable methods (papers 11, 12, 14, 16, 22, 23, 26, 28)
4. Clustering was carried out rather than classification (paper 13)
5. The Golub data was used for a multiclass rather than a binary problem (papers 15, 18)
6. Started from a processed version of the Golub data (without documentation), e.g. not using the entire dataset or not clearly identifying the input dataset (papers 22, 25, 27, 30)
7. Parameters or settings not clearly specified in the paper, rendering replication difficult (papers 28 and 30)