ALL/AML Classifier

Research Description

Summary of the related research

  • Index of the papers discussed
| Paper ID | Title | Published Year |
| --- | --- | --- |
| 1 | Molecular classification of cancer: class discovery and class prediction by gene expression monitoring | 1999/10/15 |
| 2 | Class Prediction and Discovery Using Gene Expression Data | 2000 |
| 3 | Tissue Classification with Gene Expression Profiles | 2000/08/01 |
| 4 | Support vector machine classification and validation of cancer tissue samples using microarray expression data | 2000/10/01 |
| 5 | Identifying marker genes in transcription profiling data using a mixture of feature relevance experts | 2001/03/08 |
| 6 | Classification of Acute Leukemia Based on DNA Microarray Gene Expressions Using Partial Least Squares | 2002 |
| 7 | Gene Selection for Cancer Classification using Support Vector Machines | 2002/01/01 |
| 8 | Tumor classification by partial least squares using microarray gene expression data | 2002/01/01 |
| 9 | Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data | 2002/03/01 |
| 10 | Ensemble machine learning on gene expression data for cancer classification | 2003 |
| 11 | Effective dimension reduction methods for tumor classification using gene expression data | 2003/03/22 |
| 12 | PCA disjoint models for multiclass cancer analysis using gene expression data | 2003/03/22 |
| 13 | Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions | 2003/04/01 |
| 14 | Boosting for tumor classification with gene expression data | 2003/06/12 |
| 15 | Classification of multiple cancer types by multicategory support vector machines using gene expression data | 2003/06/12 |
| 16 | Optimization models for cancer classification: extracting gene interaction information from microarray expression data | 2004/03/22 |
| 17 | Classification of gene microarrays by penalized logistic regression | 2004/07 |
| 18 | A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression | 2004/10/12 |
| 19 | A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis | 2005/03/01 |
| 20 | An extensive comparison of recent classification tools applied to microarray data | 2005/04/01 |
| 21 | Simple decision rules for classifying human cancers from gene expression profiles | 2005/10/15 |
| 22 | Gene selection and classification of microarray data using random forest | 2006 |
| 23 | Gene Selection Using Rough Set Theory | 2006/07/24 |
| 24 | Independent component analysis-based penalized discriminant method for tumor classification using gene expression data | 2006/08/01 |
| 25 | Gene selection for classification of microarray data based on the Bayes error | 2007 |
| 26 | Logistic regression for disease classification using microarray data: model selection in a large p and small n case | 2007/08/01 |
| 27 | A sequential feature extraction approach for naïve bayes classification of microarray data | 2009/08 |
| 28 | Optimization Based Tumor Classification from Microarray Gene Expression Data | 2011/02/04 |
| 29 | Acute Leukemia Classification using Bayesian Networks | 2012/10 |
| 30 | A novel approach to select significant genes of leukemia cancer data using K-Means clustering | 2013/02 |

  • Detailed List
| Paper ID | Dataset | Described Classifier | Results | Note |
| --- | --- | --- | --- | --- |
| 1 | $72\times 6817$ (47 ALL, 25 AML) (footnote 1) | Golub classifier: informative genes + weighted vote (50 genes) | Train: 36 correct, 2 uncertain; Test: 29 correct, 5 uncertain | |
| 2 | $72\times 6817$ (47 ALL, 25 AML) | Golub classifier: informative genes + weighted vote (50 genes) | Train: 36 correct, 2 uncertain; Test: 29 correct, 5 uncertain | Detailed explanation of 1 |
| 3 | $72\times 7129$ (47 ALL, 25 AML) (footnote 2) | Nearest neighbor; SVM (linear and quadratic kernels); boosting (100, 1000, 10000 iterations) | Accuracy $\geq 90\%$, ROC curves, prediction error | |
| 4 | $72\times 7129$ (47 ALL, 25 AML) | SVM (top 25, 250, 500, 1000 features) | Number of correct classifications reported (too long to list here) | |
| 5 | $72\times 7070$ (47 ALL, 25 AML) | MVR (median vote relevance), NBGR (naive Bayes global relevance), MAR (Golub paper relevance) + SVM | Number of correct classifications reported | Mainly focuses on the feature selection criterion |
| 6 | $72\times 6817$ (47 ALL, 25 AML) | Dimension reduction: PCA, PLS (partial least squares); classification: logistic and quadratic discrimination | Average accuracy rate reported | Same 50 genes as in the Golub paper, but with re-randomization of the train and test samples |
| 7 | $72\times 7129$ (47 ALL, 25 AML) | SVM | Multiple gene subsets selected; error rate/success rate, rejection rate/acceptance rate, external margin, median margin reported | |
| 8 | $72\times 6817$ (47 ALL, 25 AML) | Almost the same as 6 | | |
| 9 | $72\times 6817$ (47 ALL, 25 AML) → $72\times 3571$ | Linear and quadratic discriminant analysis (4), classification trees (4), nearest neighbors | Quartiles of the number of misclassified tumor samples reported for each classifier | 40 genes used, test set size 24 |
| 10 | $72\times 7129$ (47 ALL, 25 AML) | Single C4.5 (decision tree), bagged C4.5, AdaBoost C4.5 | Accuracy, precision (positive predictive accuracy), sensitivity, specificity reported/plotted | |
| 11 | $72\times 7129$ (47 ALL, 25 AML) | MAVE-LD, DLDA, DQDA, MAVE-NPLD | Number of correct classifications and error rate reported | |
| 12 | $72\times 7129$ (47 ALL, 25 AML) | Disjoint PCA, SIMCA classification, classifier-feedback feature selection | Correctly classified and misclassified samples reported | |
| 13 | $72\times 7129$ (47 ALL, 25 AML) | Spectral biclustering | Correctly partitions the patients, with only 1 ambiguous case | |
| 14 | $72\times 7129$ (47 ALL, 25 AML) → $72\times 3571$ | LogitBoost, AdaBoost, nearest neighbor, classification tree | Error rate reported | |
| 15 | $72\times 7129$ (47 ALL, 25 AML) | 2 types of preprocessing + 2 kernel functions + 2 tuning methods | Test errors (counts) reported | |
| 16 | $72\times 7129$ (47 ALL, 25 AML) | MAMA | Number of misclassifications and prediction rate reported | |
| 17 | $72\times 7129$ (47 ALL, 25 AML) | Feature selection: UR, RFE; classifier: penalized logistic regression | Error rate reported, also estimates of the probability distribution | |
| 18 | $72\times 7129$ (47 ALL, 25 AML) | SVM, KNN, naive Bayes, J4.8 decision tree | Classification accuracy plotted | Also performs 3-class and 4-class classification |
| 19 | $72\times 5327$ (47 ALL, 25 AML) | MC-SVM, neural network, KNN | Accuracy and relative classifier information reported | Also compares results with and without gene selection |
| 20 | $72\times 3571$ (47 ALL, 25 AML) | Gene selection: BSS/WSS, soft-thresholding, rank-based; classifiers: FLDA, DLDA, DQDA, KNN, logistic, GPLS, etc. | Mean error rate reported | Compares a large number of classifiers |
| 21 | $72\times 7129$ (47 ALL, 25 AML) | TSP (top scoring pairs), KNN, PAM, DT, NB, SVM | LOOCV accuracy and test accuracy reported | |
| 22 | $38\times 3051$ (27 ALL, 11 AML) | SVM, KNN, DLDA, SC, NN, RF | Error rate reported | Also discusses gene selection for RF |
| 23 | $72\times 7129$ (47 ALL, 25 AML) | SVM, NB | Accuracy plotted | |
| 24 | $72\times 7129$ (47 ALL, 25 AML) | SVM, PCA+FDA, P-RR, P-PCR, P-ICR, PAM | Accuracy reported | |
| 25 | $38\times 3051$ (27 ALL, 11 AML) | KNN, SVM | Error rate reported | Mainly about BBF gene selection rather than classification |
| 26 | $72\times 3051$ (47 ALL, 25 AML) (footnote 3) | Penalized logistic regression | Prediction error reported | Mainly discusses a parametric bootstrap model to obtain a more accurate prediction error estimate |
| 27 | $72\times 7129$ (47 ALL, 25 AML) | NB, FS+NB, FS+ICA+NB, FS+CCICA+NB | Boxplots of accuracy rate reported | Stepwise feature selection |
| 28 | $72\times 7129$ (47 ALL, 25 AML) | HBE, BayesNet, LibSVM, SMO, Logistic, RBF network, IBk, J48, Random Forest | Accuracy rate reported | |
| 29 | $72\times 7129$ (47 ALL, 25 AML) | Bayesian network | Classification rate reported | |
| 30 | $34\times 7129$ (20 ALL, 14 AML) | K-means clustering | Accuracy, specificity, sensitivity reported | Does not explicitly state that the data are from Golub, but the dimensions indicate so |

Footnotes

  1. $72\times 6817$ (47 ALL, 25 AML): Train: $38\times 6817$ (27 ALL, 11 AML); Test: $34\times 6817$ (20 ALL, 14 AML)
  2. $72\times 7129$ (47 ALL, 25 AML): Train: $38\times 7129$ (27 ALL, 11 AML); Test: $34\times 7129$ (20 ALL, 14 AML)
  3. $72\times 3051$ (47 ALL, 25 AML): Train: $38\times 3051$ (27 ALL, 11 AML, using the available GeneLogit library); Test: $34\times 3051$

Summary of the Leukemia Dataset

  • Acute lymphocytic leukemia (ALL), also called acute lymphoblastic leukemia, is a cancer that starts from early versions of white blood cells called lymphocytes in the bone marrow. The term "acute" means that the leukemia can progress quickly and, if not treated in time, would probably be fatal within a few months. "Lymphocytic" means it develops from early (immature) forms of lymphocytes, a type of white blood cell. It is different from acute myeloid leukemia (AML), which develops in other blood cell types found in the bone marrow. Using machine learning methods, the two types of leukemia can be classified quickly and with high accuracy, and a great deal of work has been done on this topic.

  • A generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias by Golub et al. (1999)[1]. They proposed a class discovery procedure that automatically distinguishes between AML and ALL without prior knowledge of these classes. That paper is also the origin of the well-known Golub gene expression dataset. Since then, a large body of work has used this dataset to validate feature selection procedures, classifiers, etc., as summarized in the tables above.

  • There are two datasets in the paper, training data and test data. The Golub gene expression dataset contains both of them, as well as a merged dataset combining the two. The training data consist of 38 bone marrow samples (27 ALL, 11 AML) obtained from acute leukemia patients at the time of diagnosis. The array contains 7129 probes covering 6817 genes, i.e. each sample has 7129 expression values measuring 6817 genes. The test data are an independent collection of 34 leukemia samples (24 bone marrow and 10 peripheral blood samples), of which 20 are ALL and the remaining 14 are AML. More details about the dataset can be found in paper 1 or in the linked description of golubEsets. A loading and sanity-check sketch is given after this list.

  • Since the range of the gene expression values in the dataset is large and there are many negative values, several transformations are usually applied before building the classifier. In paper 2, the values are manually restricted to lie above some positive threshold and then log-transformed. Paper 9 proposed a transformation procedure that has been widely used by researchers since: three preprocessing steps of thresholding, filtering and base-10 logarithmic transformation, which reduce the combined training and test dataset to only 3571 predictors. However, after preprocessing with this procedure we are left with 3051 predictors, and that resulting dataset is available as a library/package. A sketch of the preprocessing steps is given after this list.

  • Since the dataset has far more predictors than observations, research on it is not restricted to finding effective classifiers but also concerns feature selection criteria. In the original paper, a 50-gene classifier is used, with genes selected by their correlation with the class distinction. Many other criteria and classifiers have been studied in later papers, and we will try to reproduce them in our study. A sketch of the original gene selection and weighted-vote classifier is given below.
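
As a concrete starting point for the reproduction work, the sketch below loads the train/test split described above and checks the documented dimensions. The CSV file names, the samples-by-genes layout, and the `label` column are illustrative assumptions, not part of the original paper; the data can equally be obtained from the golubEsets package and exported in this shape.

```python
# Minimal loading sketch for the Golub train/test split described above.
# ASSUMPTION: the expression matrices have been exported to CSV with samples
# in rows, the 7129 probe values in columns, and a final "label" column
# containing "ALL" or "AML". The file names below are placeholders.
import pandas as pd

def load_split(path):
    df = pd.read_csv(path)
    y = (df.pop("label") == "AML").astype(int).to_numpy()  # ALL -> 0, AML -> 1
    X = df.to_numpy(dtype=float)
    return X, y

X_train, y_train = load_split("golub_train.csv")
X_test, y_test = load_split("golub_test.csv")

# Dimensions documented in the original paper: 38 train (27 ALL, 11 AML),
# 34 test (20 ALL, 14 AML), 7129 probe values per sample.
assert X_train.shape == (38, 7129) and X_test.shape == (34, 7129)
assert (y_train == 0).sum() == 27 and (y_test == 0).sum() == 20
```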
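The preprocessing of paper 9 can be sketched as follows, continuing from the loading code above. The specific cut-offs (floor 100, ceiling 16000, keep genes with max/min > 5 and max − min > 500) are not stated in this summary; they are the values commonly quoted for that procedure and are used here as assumptions to be checked against the paper.

```python
import numpy as np

def preprocess_golub(X_train, X_test, floor=100, ceiling=16000,
                     ratio_cut=5, diff_cut=500):
    """Thresholding, filtering and base-10 log transform (paper 9 style).

    The cut-off values are assumptions taken from the commonly cited
    description of the procedure; adjust them if the paper differs.
    """
    # 1. Thresholding: clip raw intensities to [floor, ceiling].
    X = np.clip(np.vstack([X_train, X_test]), floor, ceiling)

    # 2. Filtering: keep genes whose max/min ratio and max - min difference
    #    across all samples exceed the cut-offs.
    gmax, gmin = X.max(axis=0), X.min(axis=0)
    keep = (gmax / gmin > ratio_cut) & (gmax - gmin > diff_cut)

    # 3. Base-10 logarithmic transformation of the retained genes.
    X = np.log10(X[:, keep])

    n_train = X_train.shape[0]
    return X[:n_train], X[n_train:], keep

Xtr, Xte, kept = preprocess_golub(X_train, X_test)
print(kept.sum(), "genes retained")  # ~3571 expected on the combined 72 samples
```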
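Finally, a minimal sketch of the "informative genes + weighted vote" classifier from the original paper, assuming the usual description of the scheme: a per-gene signal-to-noise score $(\mu_{ALL}-\mu_{AML})/(\sigma_{ALL}+\sigma_{AML})$, the top 25 genes correlated with each class, and a vote $p_g (x_g - b_g)$ with $b_g$ the midpoint of the class means. The original paper additionally flags predictions with low prediction strength as uncertain; that threshold and other normalization details should be checked against paper 1 rather than taken from this sketch.

```python
import numpy as np

def golub_classifier(X_train, y_train, X_test, n_genes=50):
    """Informative genes + weighted vote, following the usual description
    of the Golub et al. scheme (an approximation, not the authors' code)."""
    mu0, mu1 = X_train[y_train == 0].mean(axis=0), X_train[y_train == 1].mean(axis=0)
    sd0, sd1 = X_train[y_train == 0].std(axis=0), X_train[y_train == 1].std(axis=0)
    p = (mu0 - mu1) / (sd0 + sd1)   # signal-to-noise score of each gene
    b = (mu0 + mu1) / 2             # per-gene boundary (midpoint of class means)

    half = n_genes // 2
    idx = np.concatenate([np.argsort(p)[-half:],  # genes most correlated with ALL (class 0)
                          np.argsort(p)[:half]])  # genes most correlated with AML (class 1)

    votes = p[idx] * (X_test[:, idx] - b[idx])          # weighted vote of each selected gene
    v_all = np.where(votes > 0, votes, 0).sum(axis=1)   # total vote for ALL
    v_aml = np.where(votes < 0, -votes, 0).sum(axis=1)  # total vote for AML
    strength = np.abs(v_all - v_aml) / (v_all + v_aml)  # prediction strength
    pred = (v_aml > v_all).astype(int)                  # 1 = AML under the coding above
    return pred, strength

# Example call on the arrays produced by the previous two sketches:
pred, ps = golub_classifier(Xtr, y_train, Xte, n_genes=50)
print("test accuracy:", (pred == y_test).mean())
```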