List of articles classifying the ALL/AML dataset

The ID assigned to each article here is used throughout the study. The 5 articles we analyzed are italicized.

ID	Title	Date
1	Molecular classification of cancer: class discovery and class prediction by gene expression monitoring	1999/10/15
2	Class Prediction and Discovery Using Gene Expression Data	2000
3	Tissue Classification with Gene Expression Profiles	2000/08/01
4	Support vector machine classification and validation of cancer tissue samples using microarray expression data	2000/10/01
5	Identifying marker genes in transcription profiling data using a mixture of feature relevance experts	2001/03/08
6	Classification of Acute Leukemia Based on DNA Microarray Gene Expressions Using Partial Least Squares	2002
7	Gene Selection for Cancer Classification using Support Vector Machines	2002/01/01
8	Tumor classification by partial least squares using microarray gene expression data	2002/01/01
9	Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data	2002/03/01
10	Ensemble machine learning on gene expression data for cancer classification	2003
11	Effective dimension reduction methods for tumor classification using gene expression data	2003/03/22
12	PCA disjoint models for multiclass cancer analysis using gene expression data	2003/03/22
13	Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions	2003/04/01
14	Boosting for tumor classification with gene expression data	2003/06/12
15	Classification of multiple cancer types by multicategory support vector machines using gene expression data	2003/06/12
16	Optimization models for cancer classification: extracting gene interaction information from microarray expression data	2004/03/22
17	Classification of gene microarrays by penalized logistic regression	2004/07
18	A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression	2004/10/12
19	A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis	2005/03/01
20	An extensive comparison of recent classification tools applied to microarray data	2005/04/01
21	Simple decision rules for classifying human cancers from gene expression profiles	2005/10/15
22	Gene selection and classification of microarray data using random forest	2006
23	Gene Selection Using Rough Set Theory	2006/07/24
24	Independent component analysis-based penalized discriminant method for tumor classification using gene expression data	2006/08/01
25	Gene selection for classification of microarray data based on the Bayes error	2007
26	Logistic regression for disease classification using microarray data: model selection in a large p and small n case	2007/08/01
27	A sequential feature extraction approach for naïve bayes classification of microarray data	2009/08
28	Optimization Based Tumor Classification from Microarray Gene Expression Data	2011/02/04
29	Acute Leukemia Classification using Bayesian Networks	2012/10
30	A novel approach to select significant genes of leukemia cancer data using K-Means clustering	2013/02

Description of the ALL/AML Dataset

Paper 1 (Golub et al.) presents two datasets: a training set and a test set. The training set consists of 38 bone marrow samples (27 ALL, 11 AML) obtained from acute leukemia patients at the time of diagnosis. The experiment has 7129 probes for 6817 genes, i.e. each sample has 7129 expression values covering 6817 genes. The test set is an independent collection of 34 leukemia samples, 24 from bone marrow and 10 from peripheral blood; 20 of them are ALL and the remaining 14 are AML. More details about the dataset can be found in paper 1 or in the linked description, golubEsets.
Because the expression values in the dataset span a wide range and many of them are negative, several transformations are usually applied before building a classifier. Paper 2 manually restricted the values to lie above a positive threshold and then applied a log transformation. Paper 9 proposed a preprocessing procedure that later researchers have widely adopted. It has three steps: thresholding (a floor and a ceiling), filtering, and a base-10 logarithmic transformation. With it, the authors reduced the training and test data to 3571 predictors (dataset). However, when we apply the procedure ourselves, we are left with 3051 predictors; the resulting dataset is available at library/package.
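The three steps of the paper 9 procedure can be sketched as follows. This is a minimal illustration, not the authors' code. The cutoffs (floor 100, ceiling 16000, keep a gene only if max/min > 5 and max - min > 500) are the values usually quoted for this procedure; they are not stated in the text above and should be treated as assumptions. The function name `preprocess` and the toy data are hypothetical.

```python
import math

def preprocess(expr, floor=100.0, ceiling=16000.0, min_ratio=5.0, min_diff=500.0):
    """Threshold, filter and log10-transform a genes x samples table.

    expr is a list of gene rows, each a list of per-sample expression values.
    Returns (kept_indices, transformed_rows). All cutoffs are assumed defaults.
    """
    kept, out = [], []
    for i, row in enumerate(expr):
        # Step 1, thresholding: clamp every value into [floor, ceiling].
        # This also removes the negative values, so the log is defined.
        clipped = [min(max(v, floor), ceiling) for v in row]
        lo, hi = min(clipped), max(clipped)
        # Step 2, filtering: drop genes that barely vary across samples.
        if hi / lo <= min_ratio or hi - lo <= min_diff:
            continue
        # Step 3, base-10 logarithmic transformation.
        kept.append(i)
        out.append([math.log10(v) for v in clipped])
    return kept, out

# Toy table: 3 genes x 4 samples (made-up numbers).
toy = [
    [-50.0, 120.0, 90.0, 110.0],    # nearly flat after clamping, dropped
    [80.0, 2500.0, 40.0, 20000.0],  # varies strongly, kept
    [300.0, 900.0, 450.0, 700.0],   # max/min is 3, dropped
]
kept, transformed = preprocess(toy)
```

The number of genes that survive the filter depends on which samples are passed in, so running it on a different subset of samples will generally give a different predictor count.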
Since the dataset has far more predictors than observations, research on it focuses not only on finding an effective classifier but also on the feature selection criterion. The original paper 1 used correlation with the class distinction to select 50 genes for training the classifier.
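A correlation-based gene ranking of this kind can be sketched as below. The score used here, the difference of class means divided by the sum of class standard deviations, is an assumption about the exact form of the correlation measure, as is the choice to take half of the selected genes from each end of the ranking. The names `signal_to_noise` and `select_genes` and the toy data are hypothetical.

```python
from statistics import mean, stdev

def signal_to_noise(values, labels):
    """Score one gene: (mean_ALL - mean_AML) / (sd_ALL + sd_AML).

    Assumes each class has at least two samples and the gene is not
    constant within both classes (otherwise the denominator is zero).
    """
    a = [v for v, y in zip(values, labels) if y == "ALL"]
    b = [v for v, y in zip(values, labels) if y == "AML"]
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))

def select_genes(expr, labels, k=50):
    """Return indices of k genes: k/2 highest in ALL, then k/2 highest in AML.

    k is assumed to be even and at least 2.
    """
    ranked = sorted((signal_to_noise(row, labels), i) for i, row in enumerate(expr))
    half = k // 2
    aml_high = [i for _, i in ranked[:half]]         # most negative scores
    all_high = [i for _, i in ranked[-half:]][::-1]  # most positive scores
    return all_high + aml_high

# Toy table: 3 genes x 6 samples (made-up numbers).
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expr = [
    [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],  # higher in ALL
    [1.0, 2.0, 3.0, 6.0, 7.0, 8.0],  # higher in AML
    [4.0, 5.0, 6.0, 4.5, 5.0, 5.5],  # same mean in both classes
]
selected = select_genes(expr, labels, k=2)
```

To avoid leaking information from the test samples, the scores would be computed on the training samples only, and the same gene indices then applied to the test set.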