Paper ID | Title | Published Year |
---|---|---|
1 | Molecular classification of cancer: class discovery and class prediction by gene expression monitoring | 1999/10/15 |
2 | Class Prediction and Discovery Using Gene Expression Data | 2000 |
3 | Tissue Classification with Gene Expression Profiles | 2000/08/01 |
4 | Support vector machine classification and validation of cancer tissue samples using microarray expression data | 2000/10/01 |
5 | Identifying marker genes in transcription profiling data using a mixture of feature relevance experts | 2001/03/08 |
6 | Classification of Acute Leukemia Based on DNA Microarray Gene Expressions Using Partial Least Squares | 2002 |
7 | Gene Selection for Cancer Classification using Support Vector Machines | 2002/01/01 |
8 | Tumor classification by partial least squares using microarray gene expression data | 2002/01/01 |
9 | Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data | 2002/03/01 |
10 | Ensemble machine learning on gene expression data for cancer classification | 2003 |
11 | Effective dimension reduction methods for tumor classification using gene expression data | 2003/03/22 |
12 | PCA disjoint models for multiclass cancer analysis using gene expression data | 2003/03/22 |
13 | Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions | 2003/04/01 |
14 | Boosting for tumor classification with gene expression data | 2003/06/12 |
15 | Classification of multiple cancer types by multicategory support vector machines using gene expression data | 2003/06/12 |
16 | Optimization models for cancer classification: extracting gene interaction information from microarray expression data | 2004/03/22 |
17 | Classification of gene microarrays by penalized logistic regression | 2004/07 |
18 | A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression | 2004/10/12 |
19 | A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis | 2005/03/01 |
20 | An extensive comparison of recent classification tools applied to microarray data | 2005/04/01 |
21 | Simple decision rules for classifying human cancers from gene expression profiles | 2005/10/15 |
22 | Gene selection and classification of microarray data using random forest | 2006 |
23 | Gene Selection Using Rough Set Theory | 2006/07/24 |
24 | Independent component analysis-based penalized discriminant method for tumor classification using gene expression data | 2006/08/01 |
25 | Gene selection for classification of microarray data based on the Bayes error | 2007 |
26 | Logistic regression for disease classification using microarray data: model selection in a large p and small n case | 2007/08/01 |
27 | A sequential feature extraction approach for naïve bayes classification of microarray data | 2009/08 |
28 | Optimization Based Tumor Classification from Microarray Gene Expression Data | 2011/02/04 |
29 | Acute Leukemia Classification using Bayesian Networks | 2012/10 |
30 | A novel approach to select significant genes of leukemia cancer data using K-Means clustering | 2013/02 |
Paper ID | Dataset (samples $\times$ features) | Classifier | Results | Note |
---|---|---|---|---|
1 | $72\times 6817$ $(47 ALL, 25 AML)$ 1 | Golub classifier: informative genes + weighted vote | (50 genes) Train: 36 correct, 2 uncertain; Test: 29 correct, 5 uncertain | |
2 | $72\times 6817$ $(47 ALL, 25 AML)$ | Golub classifier: informative genes + weighted vote | (50 genes) Train: 36 correct, 2 uncertain; Test: 29 correct, 5 uncertain | Detailed explanation of 1 |
3 | $72\times 7129$ $(47 ALL, 25 AML)$ 2 | Nearest neighbor; SVM (linear kernel, quadratic kernel); boosting (100, 1000, 10000 iterations) | Accuracy $\geq 90\%$, ROC curves, prediction error | |
4 | $72\times 7129$ $(47 ALL, 25 AML)$ | SVM (top 25, 250, 500, 1000 features) | # of correct classifications reported (too long to list here) | |
5 | $72\times 7070$ $(47 ALL, 25 AML)$ | MVR (median vote relevance), NBGR (naive Bayes global relevance), MAR (Golub paper relevance) + SVM | # of correct classifications reported | Mainly focuses on the feature selection criterion |
6 | $72\times 6817$ $(47 ALL, 25 AML)$ | Dimension reduction: PCA, PLS (partial least squares); classification: logistic and quadratic discrimination | Average accuracy rate reported | Same 50 genes as in the Golub paper, but the train and test samples are re-randomized |
7 | $72\times 7129$ $(47 ALL, 25 AML)$ | SVM | Multiple gene subsets selected; error rate/success rate, rejection rate/acceptance rate, external margin, median margin reported | |
8 | $72\times 6817$ $(47 ALL, 25 AML)$ | Almost the same as 6 | | |
9 | $72\times 6817$ $(47 ALL, 25 AML)$ -> $72\times 3571$ | Linear and quadratic discriminant analysis (4), classification trees (4), nearest neighbors | Quartiles of the number of misclassified tumor samples reported for each classifier | 40 genes used, test set size 24 |
10 | $72\times 7129$ $(47 ALL, 25 AML)$ | Single C4.5 (decision tree), bagged C4.5, AdaBoost C4.5 | Accuracy, precision (positive predictive accuracy), sensitivity, specificity reported/plotted | |
11 | $72\times 7129$ $(47 ALL, 25 AML)$ | MAVE-LD, DLDA, DQDA, MAVE-NPLD | # of correct classifications and error rate reported | |
12 | $72\times 7129$ $(47 ALL, 25 AML)$ | Disjoint PCA, SIMCA classification, classifier-feedback feature selection | Correctly classified and misclassified samples reported | |
13 | $72\times 7129$ $(47 ALL, 25 AML)$ | Spectral biclustering methods | Correctly partitions the patients, with only 1 ambiguous case | |
14 | $72\times 7129$ $(47 ALL, 25 AML)$ -> $72\times 3571$ | LogitBoost, AdaBoost, nearest neighbor, classification tree | Error rate reported | |
15 | $72\times 7129$ $(47 ALL, 25 AML)$ | 2 types of preprocessing + 2 kernel functions + 2 tuning methods | Test errors (counts) reported | |
16 | $72\times 7129$ $(47 ALL, 25 AML)$ | MAMA | # of misclassifications and prediction rate reported | |
17 | $72\times 7129$ $(47 ALL, 25 AML)$ | Feature selection: UR, RFE; classifier: penalized logistic regression | Error rate reported, plus estimates of the probability distribution | |
18 | $72\times 7129$ $(47 ALL, 25 AML)$ | SVM, KNN, naive Bayes, J4.8 DT | Classification accuracy plotted | This paper also performs 3-class and 4-class classification |
19 | $72\times 5327$ $(47 ALL, 25 AML)$ | MC-SVM, neural network, KNN | Accuracy, relative classifier information reported | Also compares results without gene selection |
20 | $72\times 3571$ $(47 ALL, 25 AML)$ | Gene selection: BSS/WSS, soft-thresholding, rank-based; classifiers: FLDA, DLDA, DQDA, KNN, logistic, GPLS, etc. | Mean error rate reported | Compares a large number of classifiers |
21 | $72\times 7129$ $(47 ALL, 25 AML)$ | TSP (top scoring pairs), KNN, PAM, DT, NB, SVM | LOOCV accuracy and test accuracy reported | |
22 | $38\times 3051$ $(27 ALL, 11 AML)$ | SVM, KNN, DLDA, SC, NN, RF | Error rate reported | Also discusses gene selection for RF |
23 | $72\times 7129$ $(47 ALL, 25 AML)$ | SVM, NB | Accuracy plotted | |
24 | $72\times 7129$ $(47 ALL, 25 AML)$ | SVM, PCA+FDA, P-RR, P-PCR, P-ICR, PAM | Accuracy reported | |
25 | $38\times 3051$ $(27 ALL, 11 AML)$ | KNN, SVM | Error rate reported | Mainly about BBF gene selection rather than classification |
26 | $72\times 3051$ $(47 ALL, 25 AML)$ 3 | Penalized logistic regression | Prediction error reported | Mainly discusses a parametric bootstrap model for a more accurate prediction-error estimate |
27 | $72\times 7129$ $(47 ALL, 25 AML)$ | NB, FS+NB, FS+ICA+NB, FS+CCICA+NB | Boxplots of accuracy rates reported | Stepwise feature selection |
28 | $72\times 7129$ $(47 ALL, 25 AML)$ | HBE, BayesNet, LibSVM, SMO, Logistic, RBF network, IBk, J48, Random Forest | Accuracy rate reported | |
29 | $72\times 7129$ $(47 ALL, 25 AML)$ | Bayes network | Classification rate reported | |
30 | $34\times 7129$ $(20 ALL, 14 AML)$ | K-means clustering | Accuracy, specificity, sensitivity reported | The paper does not explicitly say the data are from Golub, but the dimensions indicate that they are |
Footnotes
Acute lymphocytic leukemia (ALL), also called acute lymphoblastic leukemia, is a cancer that starts from early versions of white blood cells called lymphocytes in the bone marrow. The term "acute" means that the leukemia can progress quickly and, if not treated in time, would probably be fatal within a few months. "Lymphocytic" means it develops from early (immature) forms of lymphocytes, a type of white blood cell. It is different from acute myeloid leukemia (AML), which develops in other blood cell types found in the bone marrow. Using machine learning methods, the two types of leukemia can be classified quickly and with high accuracy, and a great deal of work has been done on this topic.
A generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias by Golub et al. (1999) [1]. They proposed a class discovery procedure that automatically distinguishes between AML and ALL without prior knowledge of these classes. That paper is also the origin of the well-known Golub gene expression dataset. Since then, a large body of work has used this dataset to validate feature selection procedures, classifiers, etc., as summarized in the tables above.
There are two datasets in the paper: training data and test data. The Golub gene expression dataset contains both of them, as well as a merged dataset combining the two. The training data consist of 38 bone marrow samples (27 ALL, 11 AML) obtained from acute leukemia patients at the time of diagnosis. There are 7129 probes in the experiment covering 6817 genes, i.e. each sample has 7129 expression values for 6817 genes. The test data are an independent collection of 34 leukemia samples, 24 from bone marrow and 10 from peripheral blood; 20 of them are ALL samples and the remaining 14 are AML samples. More details about the dataset can be found in paper 1 or in the linked description golubEsets.
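As a starting point for reproduction, here is a minimal loading sketch in Python. It assumes the training and test expression matrices have been exported to CSV files (the file and column names `golub_train.csv`, `golub_test.csv`, and `label` are hypothetical placeholders; in practice the matrices can be exported from the golubEsets data mentioned above).

```python
# Minimal sketch of loading the Golub data from CSV exports.
# File and column names are placeholders, not part of the original data release.
import numpy as np
import pandas as pd

# Rows = samples, columns = 7129 probe values plus a "label" column ("ALL"/"AML").
train = pd.read_csv("golub_train.csv")   # 38 samples: 27 ALL, 11 AML
test = pd.read_csv("golub_test.csv")     # 34 samples: 20 ALL, 14 AML

X_train = train.drop(columns="label").to_numpy()
y_train = (train["label"] == "AML").astype(int).to_numpy()
X_test = test.drop(columns="label").to_numpy()
y_test = (test["label"] == "AML").astype(int).to_numpy()

print(X_train.shape, X_test.shape)       # expected: (38, 7129) (34, 7129)
```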
Since the range of the gene expression values in the dataset is large and many values are negative, several transformations are usually applied before building a classifier. In paper 2, the values are manually restricted to lie above a positive threshold and then log-transformed. Paper 9 proposed a transformation procedure that has been widely used by researchers since: three preprocessing steps (thresholding, filtering, and a base-10 logarithmic transformation), which reduce the combined training and test data to 3571 predictors (dataset). However, after preprocessing with this procedure we are left with 3051 predictors, and the resulting dataset is available as a library/package.
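The sketch below implements these three steps under assumed cutoffs (floor 100, ceiling 16000, keep genes with max/min > 5 and max - min > 500), which are the values commonly cited for this procedure; the exact numbers should be checked against paper 9. Computing the filter on the 38 training samples only, rather than on all 72 samples, is one way the 3571 versus 3051 discrepancy can arise.

```python
# Sketch of the thresholding / filtering / log10 preprocessing attributed to
# paper 9. Cutoff values are assumptions based on commonly cited settings.
import numpy as np

def preprocess(X_train, X_test, floor=100.0, ceiling=16000.0,
               fold_change=5.0, abs_change=500.0):
    # 1. Thresholding: clip every expression value into [floor, ceiling].
    Xtr = np.clip(X_train, floor, ceiling)
    Xte = np.clip(X_test, floor, ceiling)

    # 2. Filtering: keep genes whose max/min ratio and max - min range,
    #    taken over the training samples, exceed the cutoffs. Using all
    #    72 samples here instead would give a different gene set.
    gmax, gmin = Xtr.max(axis=0), Xtr.min(axis=0)
    keep = (gmax / gmin > fold_change) & (gmax - gmin > abs_change)

    # 3. Base-10 logarithmic transformation of the retained genes.
    return np.log10(Xtr[:, keep]), np.log10(Xte[:, keep]), keep
```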
Since the dataset has far more predictors than observations, research on it is not restricted to finding effective classifiers; the feature selection criteria matter just as much. In the original paper, a 50-gene classifier is used, with the genes selected by correlation with the class distinction. Many other criteria and classifiers have been studied in the later papers, and we will try to reproduce them in our study; a sketch of the original weighted-vote classifier is given below.
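The following is an illustrative reimplementation of the "informative genes + weighted vote" scheme as described in Golub et al.: genes are ranked by the signal-to-noise statistic $P(g) = (\mu_{ALL} - \mu_{AML})/(\sigma_{ALL} + \sigma_{AML})$, half of the informative genes are taken per class, and predictions with low prediction strength are marked uncertain. The 0.3 prediction-strength cutoff follows the paper; the rest is a sketch of the published description, not the authors' code.

```python
# Illustrative sketch of the Golub weighted-vote classifier (paper 1).
import numpy as np

def signal_to_noise(X, y):
    # Class 0 vs class 1 signal-to-noise ratio per gene.
    mu0, mu1 = X[y == 0].mean(0), X[y == 1].mean(0)
    sd0, sd1 = X[y == 0].std(0), X[y == 1].std(0)
    return (mu0 - mu1) / (sd0 + sd1)

def weighted_vote(X_train, y_train, X_test, n_genes=50, ps_cut=0.3):
    p = signal_to_noise(X_train, y_train)
    # Half of the informative genes favour class 0, half favour class 1.
    idx = np.concatenate([np.argsort(p)[-n_genes // 2:],   # most positive
                          np.argsort(p)[:n_genes // 2]])   # most negative
    a = p[idx]                                              # vote weights
    b = 0.5 * (X_train[y_train == 0][:, idx].mean(0)
               + X_train[y_train == 1][:, idx].mean(0))     # class midpoints
    preds = []
    for x in X_test:
        v = a * (x[idx] - b)                  # signed vote of each gene
        v_pos, v_neg = v[v > 0].sum(), -v[v < 0].sum()
        ps = abs(v_pos - v_neg) / (v_pos + v_neg)  # prediction strength
        label = 0 if v_pos > v_neg else 1
        preds.append(label if ps >= ps_cut else -1)   # -1 marks "uncertain"
    return np.array(preds)
```

Applied to the preprocessed training and test matrices from the sketches above, this reproduces the kind of "correct / uncertain" counts reported in rows 1 and 2 of the table.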