Paper ID | Title | Published Year |
---|---|---|
1 | Molecular classification of cancer: class discovery and class prediction by gene expression monitoring | 1999/10/15 |
2 | Class Prediction and Discovery Using Gene Expression Data | 2000 |
3 | Tissue Classification with Gene Expression Profiles | 2000/08/01 |
4 | Support vector machine classification and validation of cancer tissue samples using microarray expression data | 2000/10/01 |
5 | Identifying marker genes in transcription profiling data using a mixture of feature relevance experts | 2001/03/08 |
6 | Classification of Acute Leukemia Based on DNA Microarray Gene Expressions Using Partial Least Squares | 2002 |
7 | Gene Selection for Cancer Classification using Support Vector Machines | 2002/01/01 |
8 | Tumor classification by partial least squares using microarray gene expression data | 2002/01/01 |
9 | Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data | 2002/03/01 |
10 | Ensemble machine learning on gene expression data for cancer classification | 2003 |
11 | Effective dimension reduction methods for tumor classification using gene expression data | 2003/03/22 |
12 | PCA disjoint models for multiclass cancer analysis using gene expression data | 2003/03/22 |
13 | Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions | 2003/04/01 |
14 | Boosting for tumor classification with gene expression data | 2003/06/12 |
15 | Classification of multiple cancer types by multicategory support vector machines using gene expression data | 2003/06/12 |
16 | Optimization models for cancer classification: extracting gene interaction information from microarray expression data | 2004/03/22 |
17 | Classification of gene microarrays by penalized logistic regression | 2004/07 |
18 | A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression | 2004/10/12 |
19 | A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis | 2005/03/01 |
20 | An extensive comparison of recent classification tools applied to microarray data | 2005/04/01 |
21 | Simple decision rules for classifying human cancers from gene expression profiles | 2005/10/15 |
22 | Gene selection and classification of microarray data using random forest | 2006 |
23 | Gene Selection Using Rough Set Theory | 2006/07/24 |
24 | Independent component analysis-based penalized discriminant method for tumor classification using gene expression data | 2006/08/01 |
25 | Gene selection for classification of microarray data based on the Bayes error | 2007 |
26 | Logistic regression for disease classification using microarray data: model selection in a large p and small n case | 2007/08/01 |
27 | A sequential feature extraction approach for naïve bayes classification of microarray data | 2009/08 |
28 | Optimization Based Tumor Classification from Microarray Gene Expression Data | 2011/02/04 |
29 | Acute Leukemia Classification using Bayesian Networks | 2012/10 |
30 | A novel approach to select significant genes of leukemia cancer data using K-Means clustering | 2013/02 |
Paper ID | Dataset (samples $\times$ features) | Classifier | Results | Note |
---|---|---|---|---|
1 | $72\times 6817$ $(47 ALL, 25 AML)$ 1 | Golub classifier: informative genes + weighted vote | (50 genes) Train: 36 correct, 2 uncertain; Test: 29 correct, 5 uncertain | |
2 | $72\times 6817$ $(47 ALL, 25 AML)$ | Golub classifier: informative genes + weighted vote | (50 genes) Train: 36 correct, 2 uncertain; Test: 29 correct, 5 uncertain | Detailed explanation of 1 |
3 | $72\times 7129$ $(47 ALL, 25 AML)$ 2 | Nearest neighbor; SVM (linear kernel, quadratic kernel); boosting (100, 1000, 10000 iterations) | Accuracy $\geq 90\%$, ROC curves, prediction error | |
4 | $72\times 7129$ $(47 ALL, 25 AML)$ | SVM (top 25, 250, 500, 1000 features) | # of correct classifications reported (too long to list here) | |
5 | $72\times 7070$ $(47 ALL, 25 AML)$ | MVR (median vote relevance), NBGR (naive Bayes global relevance), MAR (Golub paper relevance) + SVM | # of correct classifications reported | Mainly focuses on the feature selection criterion |
6 | $72\times 6817$ $(47 ALL, 25 AML)$ | Dimension reduction: PCA, PLS (partial least squares); classification: logistic and quadratic discrimination | Average accuracy rate reported | Same 50 genes as in the Golub paper, but the train and test samples are re-randomized |
7 | $72\times 7129$ $(47 ALL, 25 AML)$ | SVM | Multiple gene subsets selected; error rate/success rate, rejection rate/acceptance rate, external margin, median margin reported | |
8 | $72\times 6817$ $(47 ALL, 25 AML)$ | Almost the same as 6 | | |
9 | $72\times 6817$ $(47 ALL, 25 AML)$ -> $72\times 3571$ | Linear and quadratic discriminant analysis (4), classification trees (4), nearest neighbors | Quartiles of the number of misclassified tumor samples reported for each classifier | 40 genes used, test set size 24 |
10 | $72\times 7129$ $(47 ALL, 25 AML)$ | Single C4.5 (decision tree), bagged C4.5, AdaBoost C4.5 | Accuracy, precision (positive predictive accuracy), sensitivity, specificity reported/plotted | |
11 | $72\times 7129$ $(47 ALL, 25 AML)$ | MAVE-LD, DLDA, DQDA, MAVE-NPLD | # of correct classifications and error rate reported | |
12 | $72\times 7129$ $(47 ALL, 25 AML)$ | Disjoint PCA, SIMCA classification, classifier-feedback feature selection | Correctly classified and misclassified samples reported | |
13 | $72\times 7129$ $(47 ALL, 25 AML)$ | Spectral biclustering methods | Correctly partitions the patients, with only 1 ambiguous case | |
14 | $72\times 7129$ $(47 ALL, 25 AML)$ -> $72\times 3571$ | LogitBoost, AdaBoost, nearest neighbor, classification tree | Error rate reported | |
15 | $72\times 7129$ $(47 ALL, 25 AML)$ | 2 types of preprocessing + 2 kernel functions + 2 tuning methods | Test errors (counts) reported | |
16 | $72\times 7129$ $(47 ALL, 25 AML)$ | MAMA | # of misclassifications and prediction rate reported | |
17 | $72\times 7129$ $(47 ALL, 25 AML)$ | Feature selection: UR, RFE; classifier: penalized logistic regression | Error rate reported, plus estimates of the probability distribution | |
18 | $72\times 7129$ $(47 ALL, 25 AML)$ | SVM, KNN, naive Bayes, J4.8 DT | Classification accuracy plotted | This paper also performs 3-class and 4-class classification |
19 | $72\times 5327$ $(47 ALL, 25 AML)$ | MC-SVM, neural network, KNN | Accuracy, relative classifier information reported | Also compares results without gene selection |
20 | $72\times 3571$ $(47 ALL, 25 AML)$ | Gene selection: BSS/WSS, soft-thresholding, rank-based; classifiers: FLDA, DLDA, DQDA, KNN, logistic, GPLS, etc. | Mean error rate reported | Compares a large number of classifiers |
21 | $72\times 7129$ $(47 ALL, 25 AML)$ | TSP (top scoring pairs), KNN, PAM, DT, NB, SVM | LOOCV accuracy and test accuracy reported | |
22 | $38\times 3051$ $(27 ALL, 11 AML)$ | SVM, KNN, DLDA, SC, NN, RF | Error rate reported | Also discusses gene selection for RF |
23 | $72\times 7129$ $(47 ALL, 25 AML)$ | SVM, NB | Accuracy plotted | |
24 | $72\times 7129$ $(47 ALL, 25 AML)$ | SVM, PCA+FDA, P-RR, P-PCR, P-ICR, PAM | Accuracy reported | |
25 | $38\times 3051$ $(27 ALL, 11 AML)$ | KNN, SVM | Error rate reported | Mainly about BBF gene selection rather than classification |
26 | $72\times 3051$ $(47 ALL, 25 AML)$ 3 | Penalized logistic regression | Prediction error reported | Mainly discusses a parametric bootstrap model for a more accurate prediction-error estimate |
27 | $72\times 7129$ $(47 ALL, 25 AML)$ | NB, FS+NB, FS+ICA+NB, FS+CCICA+NB | Boxplots of accuracy rates reported | Stepwise feature selection |
28 | $72\times 7129$ $(47 ALL, 25 AML)$ | HBE, BayesNet, LibSVM, SMO, Logistic, RBF network, IBk, J48, Random Forest | Accuracy rate reported | |
29 | $72\times 7129$ $(47 ALL, 25 AML)$ | Bayes network | Classification rate reported | |
30 | $34\times 7129$ $(20 ALL, 14 AML)$ | K-means clustering | Accuracy, specificity, sensitivity reported | The paper does not explicitly say the data are from Golub, but the dimensions indicate that they are |
Footnotes
Acute lymphocytic leukemia (ALL), also called acute lymphoblastic leukemia, is a cancer that starts from early versions of white blood cells called lymphocytes in the bone marrow. The term "acute" means that the leukemia can progress quickly and, if not treated in time, would probably be fatal within a few months. "Lymphocytic" means it develops from early (immature) forms of lymphocytes, a type of white blood cell. It is different from acute myeloid leukemia (AML), which develops in other blood cell types found in the bone marrow. Using machine learning methods, the two types of leukemia can be classified quickly and with high accuracy, and a great deal of work has been done on this topic.
A generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias by Golub et al. (1999) [1]. They proposed a class discovery procedure that automatically distinguishes between AML and ALL without prior knowledge of these classes. That paper is also the origin of the well-known Golub gene expression dataset. Since then, a large body of work has used this dataset to validate feature selection procedures, classifiers, etc., as summarized in the tables above.
There are two datasets in the paper: training data and test data. The Golub gene expression dataset contains both of them, as well as a merged dataset combining the two. The training data consist of 38 bone marrow samples (27 ALL, 11 AML) obtained from acute leukemia patients at the time of diagnosis. There are 7129 probes in the experiment covering 6817 genes, i.e. each sample has 7129 expression values for 6817 genes. The test data are an independent collection of 34 leukemia samples, 24 from bone marrow and 10 from peripheral blood; 20 of them are ALL samples and the remaining 14 are AML samples. More details about the dataset can be found in paper 1 or in the linked description golubEsets.
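As a starting point for reproduction, here is a minimal loading sketch in Python. It assumes the training and test expression matrices have been exported to CSV files (the file and column names `golub_train.csv`, `golub_test.csv`, and `label` are hypothetical placeholders; in practice the matrices can be exported from the golubEsets data mentioned above).

```python
# Minimal sketch of loading the Golub data from CSV exports.
# File and column names are placeholders, not part of the original data release.
import numpy as np
import pandas as pd

# Rows = samples, columns = 7129 probe values plus a "label" column ("ALL"/"AML").
train = pd.read_csv("golub_train.csv")   # 38 samples: 27 ALL, 11 AML
test = pd.read_csv("golub_test.csv")     # 34 samples: 20 ALL, 14 AML

X_train = train.drop(columns="label").to_numpy()
y_train = (train["label"] == "AML").astype(int).to_numpy()
X_test = test.drop(columns="label").to_numpy()
y_test = (test["label"] == "AML").astype(int).to_numpy()

print(X_train.shape, X_test.shape)       # expected: (38, 7129) (34, 7129)
```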
Since the range of the gene expression values in the dataset is large and many values are negative, several transformations are usually applied before building a classifier. In paper 2, the values are manually restricted to lie above a positive threshold and then log-transformed. Paper 9 proposed a transformation procedure that has been widely used by researchers since: three preprocessing steps (thresholding, filtering, and a base-10 logarithmic transformation), which reduce the combined training and test data to 3571 predictors (dataset). However, after preprocessing with this procedure we are left with 3051 predictors, and the resulting dataset is available as a library/package.
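The sketch below implements these three steps under assumed cutoffs (floor 100, ceiling 16000, keep genes with max/min > 5 and max - min > 500), which are the values commonly cited for this procedure; the exact numbers should be checked against paper 9. Computing the filter on the 38 training samples only, rather than on all 72 samples, is one way the 3571 versus 3051 discrepancy can arise.

```python
# Sketch of the thresholding / filtering / log10 preprocessing attributed to
# paper 9. Cutoff values are assumptions based on commonly cited settings.
import numpy as np

def preprocess(X_train, X_test, floor=100.0, ceiling=16000.0,
               fold_change=5.0, abs_change=500.0):
    # 1. Thresholding: clip every expression value into [floor, ceiling].
    Xtr = np.clip(X_train, floor, ceiling)
    Xte = np.clip(X_test, floor, ceiling)

    # 2. Filtering: keep genes whose max/min ratio and max - min range,
    #    taken over the training samples, exceed the cutoffs. Using all
    #    72 samples here instead would give a different gene set.
    gmax, gmin = Xtr.max(axis=0), Xtr.min(axis=0)
    keep = (gmax / gmin > fold_change) & (gmax - gmin > abs_change)

    # 3. Base-10 logarithmic transformation of the retained genes.
    return np.log10(Xtr[:, keep]), np.log10(Xte[:, keep]), keep
```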
Since the dataset has far more predictors than observations, research on it is not restricted to finding effective classifiers; the feature selection criteria matter just as much. In the original paper, a 50-gene classifier is used, with the genes selected by correlation with the class distinction. Many other criteria and classifiers have been studied in the later papers, and we will try to reproduce them in our study; a sketch of the original weighted-vote classifier is given below.
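The following is an illustrative reimplementation of the "informative genes + weighted vote" scheme as described in Golub et al.: genes are ranked by the signal-to-noise statistic $P(g) = (\mu_{ALL} - \mu_{AML})/(\sigma_{ALL} + \sigma_{AML})$, half of the informative genes are taken per class, and predictions with low prediction strength are marked uncertain. The 0.3 prediction-strength cutoff follows the paper; the rest is a sketch of the published description, not the authors' code.

```python
# Illustrative sketch of the Golub weighted-vote classifier (paper 1).
import numpy as np

def signal_to_noise(X, y):
    # Class 0 vs class 1 signal-to-noise ratio per gene.
    mu0, mu1 = X[y == 0].mean(0), X[y == 1].mean(0)
    sd0, sd1 = X[y == 0].std(0), X[y == 1].std(0)
    return (mu0 - mu1) / (sd0 + sd1)

def weighted_vote(X_train, y_train, X_test, n_genes=50, ps_cut=0.3):
    p = signal_to_noise(X_train, y_train)
    # Half of the informative genes favour class 0, half favour class 1.
    idx = np.concatenate([np.argsort(p)[-n_genes // 2:],   # most positive
                          np.argsort(p)[:n_genes // 2]])   # most negative
    a = p[idx]                                              # vote weights
    b = 0.5 * (X_train[y_train == 0][:, idx].mean(0)
               + X_train[y_train == 1][:, idx].mean(0))     # class midpoints
    preds = []
    for x in X_test:
        v = a * (x[idx] - b)                  # signed vote of each gene
        v_pos, v_neg = v[v > 0].sum(), -v[v < 0].sum()
        ps = abs(v_pos - v_neg) / (v_pos + v_neg)  # prediction strength
        label = 0 if v_pos > v_neg else 1
        preds.append(label if ps >= ps_cut else -1)   # -1 marks "uncertain"
    return np.array(preds)
```

Applied to the preprocessed training and test matrices from the sketches above, this reproduces the kind of "correct / uncertain" counts reported in rows 1 and 2 of the table.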