Predict disease classes using genetic microarray data

João Oda

Intro

A gene contains the information necessary to synthesize a protein. When the corresponding protein is being produced within the cell, we say that the gene is active. The activation level of a gene, i.e. the amount of the corresponding protein being synthesized, is also called the expression level of that gene.

Gene expression can be measured using techniques such as microarray and RNA-Seq; this information yields an expression profile that can be used to identify cell types and states.

In this project we use data mining techniques on microarray data, with expression information for 7070 genes, to classify cell types/states into 5 classes, labelled EPD, JPA, MED, MGL and RHB.

This project, originally from KDnuggets, was presented to me in a data mining course at university. This version is implemented entirely in Python, using the pandas and scikit-learn libraries.

Dependencies


In [0]:
import pandas as pd
import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
import scipy.stats as st

from sklearn import tree
from sklearn import naive_bayes
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from tempfile import mkdtemp

Data

Reading and formatting the training data.


In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/ojon/MD_Proj/master/pp5i_train.gr.csv')
df.set_index('SNO', inplace=True)
df = df.transpose()
df.reset_index(drop=True, inplace=True)

In [0]:
y = pd.read_csv('https://raw.githubusercontent.com/ojon/MD_Proj/master/pp5i_train_class.txt')
df = pd.concat([df, y], axis=1)

In [0]:
myRandomSeed = 72
df = df.sample(frac=1,random_state=myRandomSeed).reset_index(drop=True)

In [5]:
df.shape


Out[5]:
(69, 7071)

In [6]:
df.head()


Out[6]:
A28102_at AB000114_at AB000115_at AB000220_at AB000409_at AB000449_at AB000450_at AB000460_at AB000462_at AB000464_at ... U58516_at U73738_at X06956_at X16699_at X83863_at Z17240_at L49218_f_at M71243_f_at Z78285_f_at Class
0 26 26 14 85 161 34 -9 206 34 140 ... 138 1 29 1 153 42 7 26 -2 MED
1 23 24 9 174 197 3 -11 194 94 197 ... 180 -1 9 -2 91 23 -4 35 -4 MED
2 11 54 5 22 180 26 -46 448 109 331 ... 51 6 -25 3 747 41 -8 28 15 MED
3 35 27 19 33 179 57 10 235 52 141 ... 112 -14 8 -11 90 42 -7 36 -20 RHB
4 68 21 35 30 131 8 22 140 37 81 ... 364 -1 34 0 297 39 1 27 -1 JPA

5 rows × 7071 columns


In [0]:
X = df.drop('Class',axis=1)

In [0]:
y = df['Class']

Data Cleaning

Threshold both train and test data to a minimum value of 20 and a maximum of 16,000.


In [0]:
X.clip(upper=16000, lower=20, inplace=True)

Remove from the training data genes whose variation across samples is 2 or less (measured here as the difference between the maximum and minimum values).


In [10]:
X.shape


Out[10]:
(69, 7070)

In [0]:
X = X.loc[:,X.max() - X.min() > 2]

In [12]:
X.shape


Out[12]:
(69, 6746)

Feature Selection

For feature selection we perform, for each class, a t-test between the samples of that class and the samples of all remaining classes. For each class the genes with the highest absolute t-values are selected. We calculate the t-values as:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

This is the so-called Welch's t-test, used when the two population variances are not assumed to be equal.
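
As a quick sanity check (this cell is not in the original notebook and the numbers are purely illustrative), the formula can be compared against the scipy implementation on two small made-up samples.


In [0]:
# two small illustrative samples with unequal variances
x1 = np.array([120.0, 150.0, 130.0])
x2 = np.array([30.0, 45.0, 50.0, 40.0])

# Welch's t-statistic computed directly from the formula above
tManual = (x1.mean() - x2.mean()) / np.sqrt(x1.var(ddof=1) / len(x1) +
                                            x2.var(ddof=1) / len(x2))

# the same statistic from scipy; equal_var=False selects Welch's test
tScipy = st.ttest_ind(x1, x2, equal_var=False)[0]

tManual, tScipy  # the two values agree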

T-test transformer

A transformer implemented to be compatible with scikit-learn pipelines. It selects features by t-score, keeping the top w features with the highest absolute t-scores for each class. The t-test implementation from the scipy library is used.


In [0]:
class TtestScoreSelection(BaseEstimator, TransformerMixin):
    def __init__(self, w=3):
        self.w = w

    def fit(self, X, y=None):
        X = check_array(X)
        self.input_shape_ = X.shape

        # Check that X and y have correct shape
        X, y = check_X_y(X, y)

        # Store the classes seen during fit
        self.labels_ = unique_labels(y)
        self.tValuesDF = pd.DataFrame(columns=self.labels_)
        self.sortedIndexes = pd.DataFrame(columns=self.labels_)

        for label in self.labels_:
            sample1 = X[y == label]
            sample2 = X[y != label]
            # In case both samples have zero variance, put arbitrary values in sample 1
            # so the t-test does not return NaN for those columns
            zeroVarBothColsIdx = (np.var(sample1, axis=0) + np.var(sample2, axis=0)) == 0
            sample1[:, zeroVarBothColsIdx] = 10e6
            sample1[0, zeroVarBothColsIdx] = 1
            # perform Welch's t-test (unequal variances)
            t = st.ttest_ind(sample1, sample2, equal_var=False)
            # set the t-values to 0 for the columns where both samples had zero variance
            t[0][zeroVarBothColsIdx] = 0
            self.tValuesDF[label] = np.abs(t[0])
            self.sortedIndexes[label] = self.tValuesDF.sort_values(by=label,
                                                                   ascending=False).index

        # Return the transformer
        return self

    def transform(self, X):
        # Check is fit had been called
        check_is_fitted(self, ['input_shape_'])

        # Input validation
        X = check_array(X)

        # union of the feature indexes ranked in the top w for each class
        self.selCols = np.unique(self.sortedIndexes[:][0:self.w].values.flatten())

        return X[:, self.selCols]
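
A quick standalone check of the transformer (not part of the original cells), assuming X and y from above; the value w=5 here is arbitrary.


In [0]:
# fit the selector on the training data and keep the union of the
# per-class top-5 genes; the resulting number of columns depends on the data
selector = TtestScoreSelection(w=5)
XSel = selector.fit(X, y).transform(X)
XSel.shape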

Model Selection

We evaluate a set of classifiers and hyperparameters:


In [0]:
cachedir = mkdtemp()

In [0]:
pipe = Pipeline([('featureSelection', TtestScoreSelection(w=10)),
                 ('classify', KNeighborsClassifier(n_neighbors=1))],
                memory=cachedir)

N_GENES = [2,4,6,8,10,12,15,20,25,30]
N_LAYERS = [(32,),(64,),(128,)]

tuned_parameters = [
                    {'featureSelection__w': N_GENES,
                     'classify': [KNeighborsClassifier()],
                     'classify__n_neighbors': [1,3,5,7] 
                    },                    
                    {'featureSelection__w': N_GENES,
                     'classify': [tree.DecisionTreeClassifier()],
                     'classify__criterion':['gini','entropy'],
                     'classify__min_samples_leaf': [1,3,5],
                     'classify__max_depth': [3,6,9],
                     'classify__presort': [True]
                    },
                    {'featureSelection__w': N_GENES,
                     'classify': [MLPClassifier()],
                     'classify__hidden_layer_sizes': N_LAYERS,
                     'classify__activation': ['logistic'],
                     'classify__alpha':[0.05, 0.01, 0.005, 0.001],                      
                     'classify__max_iter':[1000],
                     'classify__solver': ['lbfgs'],
                     'classify__verbose': [True]                                    
                    },
                    {'featureSelection__w': N_GENES,
                    'classify': [naive_bayes.GaussianNB()]
                    }
                   ]
kfolds = KFold(n_splits=5, shuffle=True, random_state=myRandomSeed)
model = GridSearchCV(pipe, tuned_parameters, cv=kfolds, return_train_score=True)

Both feature selection and model training are performed inside the cross-validation loop, in order to avoid data leakage.
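
For illustration (a sketch, not in the original notebook, with an arbitrary w and number of neighbors), the same leakage-free evaluation can be reproduced for a single fixed configuration with cross_val_score on the whole pipeline; this is roughly what GridSearchCV does internally for each parameter setting.


In [0]:
# the selector is refit on the training part of each fold, so no
# information from the held-out fold leaks into the feature selection
singleConfig = Pipeline([('featureSelection', TtestScoreSelection(w=10)),
                         ('classify', KNeighborsClassifier(n_neighbors=3))])
cross_val_score(singleConfig, X, y, cv=kfolds).mean()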


In [0]:
model.fit(X,y)
results = pd.DataFrame(model.cv_results_)

In [17]:
results.sort_values(by='mean_test_score', ascending=False).head()


Out[17]:
mean_fit_time mean_score_time mean_test_score mean_train_score param_classify param_classify__activation param_classify__alpha param_classify__criterion param_classify__hidden_layer_sizes param_classify__max_depth ... split2_test_score split2_train_score split3_test_score split3_train_score split4_test_score split4_train_score std_fit_time std_score_time std_test_score std_train_score
246 1.839278 0.001893 0.942029 1.000000 MLPClassifier(activation='logistic', alpha=0.0... logistic 0.05 NaN (128,) NaN ... 0.928571 1.000000 0.928571 1.000000 1.0 1.000000 0.040034 0.000047 0.05339 0.000000
12 0.071554 0.001438 0.927536 0.978312 KNeighborsClassifier(algorithm='auto', leaf_si... NaN NaN NaN NaN NaN ... 0.928571 0.981818 0.928571 0.981818 1.0 0.964286 0.000992 0.000045 0.04467 0.007013
26 0.070669 0.001475 0.927536 0.978247 KNeighborsClassifier(algorithm='auto', leaf_si... NaN NaN NaN NaN NaN ... 0.928571 0.981818 0.928571 0.981818 1.0 0.982143 0.000956 0.000058 0.04467 0.007306
25 0.070634 0.001356 0.927536 0.981883 KNeighborsClassifier(algorithm='auto', leaf_si... NaN NaN NaN NaN NaN ... 0.928571 0.981818 0.928571 1.000000 1.0 0.982143 0.000760 0.000020 0.04467 0.011500
261 0.638752 0.001593 0.927536 1.000000 MLPClassifier(activation='logistic', alpha=0.0... logistic 0.01 NaN (64,) NaN ... 0.928571 1.000000 0.928571 1.000000 1.0 1.000000 0.012281 0.000229 0.04467 0.000000

5 rows × 33 columns


In [18]:
model.best_estimator_


Out[18]:
Pipeline(memory='/tmp/tmphx_aoffq',
     steps=[('featureSelection', TtestScoreSelection(w=15)), ('classify', MLPClassifier(activation='logistic', alpha=0.05, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(128,), learning_rate='constant',
       learning_rate_init=0.001, ...=True, solver='lbfgs', tol=0.0001, validation_fraction=0.1,
       verbose=True, warm_start=False))])

Considering overall accuracy (not always the most appropriate metric), the best model is a multilayer perceptron with one hidden layer of 128 neurons.
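
The winning hyperparameter combination can be inspected directly on the fitted search object (a small check not included in the original cells; the exact output depends on the run).


In [0]:
# hyperparameters and mean cross-validated score of the best pipeline configuration
model.best_params_, model.best_score_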

Test

We now make predictions for the test set, using the best model refitted on all of the training data.


In [0]:
testSet = pd.read_csv('https://raw.githubusercontent.com/ojon/MD_Proj/master/pp5i_test.gr.csv')

In [0]:
testSet.set_index('SNO', inplace=True)
testX = testSet.transpose()
testX.reset_index(drop=True, inplace=True)
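
The cleaning step earlier thresholds both train and test data; a minimal sketch of the corresponding clipping for the test features (an optional step, not shown in the original cells).


In [0]:
# mirror the training-data thresholding: clip test values to [20, 16000]
testX = testX.clip(lower=20, upper=16000)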

In [0]:
finalResult = pd.DataFrame()
finalResult['predicted'] = model.predict(testX)

In [22]:
finalResult


Out[22]:
predicted
0 MGL
1 MED
2 RHB
3 MED
4 EPD
5 RHB
6 MED
7 MED
8 MED
9 MED
10 MED
11 EPD
12 MED
13 MED
14 RHB
15 MED
16 MED
17 MED
18 MED
19 MED
20 MED
21 RHB
22 MED