List of articles classifying the ALL/AML dataset

The ID assigned to each article here is used throughout the study. The five articles we analyzed are italicized.

ID Title Publication date
1 Molecular classification of cancer: class discovery and class prediction by gene expression monitoring 1999/10/15
2 Class Prediction and Discovery Using Gene Expression Data 2000
3 Tissue Classification with Gene Expression Profiles 2000/08/01
4 Support vector machine classification and validation of cancer tissue samples using microarray expression data 2000/10/01
5 Identifying marker genes in transcription profiling data using a mixture of feature relevance experts 2001/03/08
6 Classification of Acute Leukemia Based on DNA Microarray Gene Expressions Using Partial Least Squares 2002
7 Gene Selection for Cancer Classification using Support Vector Machines 2002/01/01
8 Tumor classification by partial least squares using microarray gene expression data 2002/01/01
9 Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data 2002/03/01
10 Ensemble machine learning on gene expression data for cancer classification 2003
11 Effective dimension reduction methods for tumor classification using gene expression data 2003/03/22
12 PCA disjoint models for multiclass cancer analysis using gene expression data 2003/03/22
13 Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions 2003/04/01
14 Boosting for tumor classification with gene expression data 2003/06/12
15 Classification of multiple cancer types by multicategory support vector machines using gene expression data 2003/06/12
16 Optimization models for cancer classification: extracting gene interaction information from microarray expression data 2004/03/22
17 Classification of gene microarrays by penalized logistic regression 2004/07
18 A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression 2004/10/12
19 A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis 2005/03/01
20 An extensive comparison of recent classification tools applied to microarray data 2005/04/01
21 Simple decision rules for classifying human cancers from gene expression profiles 2005/10/15
22 Gene selection and classification of microarray data using random forest 2006
23 Gene Selection Using Rough Set Theory 2006/07/24
24 Independent component analysis-based penalized discriminant method for tumor classification using gene expression data 2006/08/01
25 Gene selection for classification of microarray data based on the Bayes error 2007
26 Logistic regression for disease classification using microarray data: model selection in a large p and small n case 2007/08/01
27 A sequential feature extraction approach for naïve bayes classification of microarray data 2009/08
28 Optimization Based Tumor Classification from Microarray Gene Expression Data 2011/02/04
29 Acute Leukemia Classification using Bayesian Networks 2012/10
30 A novel approach to select significant genes of leukemia cancer data using K-Means clustering 2013/02

Description of the ALL/AML Dataset

  • Paper 1 (Golub et al.) presents two datasets: training data and test data. The training data consist of 38 bone marrow samples (27 ALL, 11 AML) obtained from acute leukemia patients at the time of diagnosis. The experiment has 7129 probes for 6817 genes, i.e. each sample has 7129 expression values covering 6817 genes. The test data are an independent collection of 34 leukemia samples (24 bone marrow and 10 peripheral blood); 20 of them are ALL and the remaining 14 are AML. More details about the dataset can be found in paper 1 or in the linked description of golubEsets.

  • Since the range of gene expression values in the dataset is large and many of the values are negative, several transformations are usually applied before building a classifier. In paper 2, the authors manually restricted the values to lie above a positive threshold and then applied a log transformation. Paper 9 proposed a transformation procedure (floor and ceiling thresholding followed by a log transformation) that has been widely used by researchers since. It has three preprocessing steps (thresholding, filtering, and base-10 logarithmic transformation), which reduced the combined training and test dataset to 3571 predictors (dataset). However, after preprocessing with this procedure, we are left with 3051 predictors; the resulting dataset is available at library/package.

  • Since the dataset has far more predictors than observations, research on it focuses not only on finding an effective classifier but also on the feature selection criterion. In the original paper 1, the authors used correlation to select 50 genes for the classifier training step.
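
The preprocessing and gene selection steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in any of the listed papers. The numeric defaults (floor 100, ceiling 16000, max/min ratio 5, max-min difference 500) are not stated in this document; they are assumed here as the values commonly quoted for the paper 9 procedure. The "correlation" score is assumed to be a signal-to-noise statistic, (mean_ALL - mean_AML) / (sd_ALL + sd_AML), with half of the selected genes taken from each end of the ranking. The function names are hypothetical.

```python
import math
from statistics import mean, stdev


def preprocess(expr, floor=100.0, ceiling=16000.0, min_ratio=5.0, min_diff=500.0):
    """Threshold, filter and log10-transform a genes x samples table.

    expr is a list of rows, one per gene, each a list of expression values.
    The default cut-offs are assumptions, not values taken from this document.
    Returns (kept_indices, transformed_rows).
    """
    kept, rows = [], []
    for i, row in enumerate(expr):
        # Step 1, thresholding: clamp every value into [floor, ceiling].
        clipped = [min(max(v, floor), ceiling) for v in row]
        lo, hi = min(clipped), max(clipped)
        # Step 2, filtering: drop genes that barely vary across samples.
        if hi / lo <= min_ratio or hi - lo <= min_diff:
            continue
        # Step 3, base-10 logarithmic transformation.
        kept.append(i)
        rows.append([math.log10(v) for v in clipped])
    return kept, rows


def signal_to_noise(row, labels):
    """Assumed correlation score: (mean_ALL - mean_AML) / (sd_ALL + sd_AML)."""
    all_vals = [v for v, y in zip(row, labels) if y == "ALL"]
    aml_vals = [v for v, y in zip(row, labels) if y == "AML"]
    spread = stdev(all_vals) + stdev(aml_vals)
    if spread == 0:
        return 0.0  # constant within both classes; treat as uninformative
    return (mean(all_vals) - mean(aml_vals)) / spread


def select_genes(rows, labels, n_genes=50):
    """Pick n_genes rows: half most AML-associated, half most ALL-associated.

    Assumes len(rows) >= n_genes; returns indices into rows.
    """
    scores = [signal_to_noise(r, labels) for r in rows]
    order = sorted(range(len(rows)), key=lambda i: scores[i])
    half = n_genes // 2
    return order[:half] + order[-half:]


# Toy usage with made-up numbers (4 genes, 6 samples), not real leukemia data.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expr = [
    [50, 80, 120, 90, 60, 70],            # nearly flat after thresholding
    [200, 250, 220, 3000, 3500, 3200],    # higher in AML
    [5000, 5200, 4800, 300, 350, 320],    # higher in ALL
    [-100, 20000, 150, 400, 90, 16000],   # noisy, no clear class pattern
]
kept, rows = preprocess(expr)
selected = [kept[i] for i in select_genes(rows, labels, n_genes=2)]
print("genes kept after filtering:", kept)
print("genes selected:", selected)
```

With the real data, `expr` would hold the 7129 probe rows and `labels` the 38 training labels. To avoid leaking test information, the filter and the gene ranking would normally be computed on the training samples only and then applied unchanged to the test samples.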