Predict disease classes using genetic microarray data

João Oda

Intro

A gene contains the information necessary to synthesize a protein. When the corresponding protein is being produced within the cell, we say that the gene is active. The activation level of a gene, i.e. the amount of the corresponding protein being synthesized, is also called the expression level of that gene.

Gene expression can be measured using techniques such as microarray and RNA-Seq; this information yields an expression profile that can be used to identify cell types and states.

In this project we use data mining techniques on microarray data, with expression information for 7070 genes, to classify cell types/states into 5 classes, labelled EPD, JPA, MED, MGL and RHB.

This project, originally from KDnuggets, was presented to me in a data mining course at university. This version is implemented entirely in Python, using the pandas and scikit-learn libraries.

Dependencies


In [0]:
import pandas as pd
import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
import scipy.stats as st

from sklearn import tree
from sklearn import naive_bayes
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from tempfile import mkdtemp

Data

Reading and formatting the training data.


In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/ojon/MD_Proj/master/pp5i_train.gr.csv')
df.set_index('SNO', inplace=True)
df = df.transpose()
df.reset_index(drop=True, inplace=True)

In [0]:
y = pd.read_csv('https://raw.githubusercontent.com/ojon/MD_Proj/master/pp5i_train_class.txt')
df = pd.concat([df, y], axis=1)

In [0]:
myRandomSeed = 72
df = df.sample(frac=1,random_state=myRandomSeed).reset_index(drop=True)

In [5]:
df.shape


Out[5]:
(69, 7071)

In [6]:
df.head()


Out[6]:
A28102_at AB000114_at AB000115_at AB000220_at AB000409_at AB000449_at AB000450_at AB000460_at AB000462_at AB000464_at ... U58516_at U73738_at X06956_at X16699_at X83863_at Z17240_at L49218_f_at M71243_f_at Z78285_f_at Class
0 26 26 14 85 161 34 -9 206 34 140 ... 138 1 29 1 153 42 7 26 -2 MED
1 23 24 9 174 197 3 -11 194 94 197 ... 180 -1 9 -2 91 23 -4 35 -4 MED
2 11 54 5 22 180 26 -46 448 109 331 ... 51 6 -25 3 747 41 -8 28 15 MED
3 35 27 19 33 179 57 10 235 52 141 ... 112 -14 8 -11 90 42 -7 36 -20 RHB
4 68 21 35 30 131 8 22 140 37 81 ... 364 -1 34 0 297 39 1 27 -1 JPA

5 rows × 7071 columns


In [0]:
X = df.drop('Class',axis=1)

In [0]:
y = df['Class']

Data Cleaning

Threshold both train and test data to a minimum value of 20 and a maximum of 16,000.


In [0]:
X.clip(upper=16000, lower=20, inplace=True)

Remove from the training data genes whose variation across samples is 2 or less (measured here as the difference between the maximum and minimum values).


In [10]:
X.shape


Out[10]:
(69, 7070)

In [0]:
X = X.loc[:,X.max() - X.min() > 2]

In [12]:
X.shape


Out[12]:
(69, 6746)

Feature Selection

For feature selection we perform, for each class, a t-test between the samples of that class and the samples of all remaining classes. For each class the genes with the highest absolute t-values are selected. We calculate the t-values as:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

This is the so-called Welch's t-test, used when the two population variances are not assumed to be equal.
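
As a quick sanity check (this cell is not in the original notebook and the numbers are purely illustrative), the formula can be compared against the scipy implementation on two small made-up samples.


In [0]:
# two small illustrative samples with unequal variances
x1 = np.array([120.0, 150.0, 130.0])
x2 = np.array([30.0, 45.0, 50.0, 40.0])

# Welch's t-statistic computed directly from the formula above
tManual = (x1.mean() - x2.mean()) / np.sqrt(x1.var(ddof=1) / len(x1) +
                                            x2.var(ddof=1) / len(x2))

# the same statistic from scipy; equal_var=False selects Welch's test
tScipy = st.ttest_ind(x1, x2, equal_var=False)[0]

tManual, tScipy  # the two values agree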

T-test transformer

A transformer implemented to be compatible with scikit-learn pipelines. It selects features by t-score, keeping the top w features with the highest absolute t-scores for each class. The t-test implementation from the scipy library is used.


In [0]:
class TtestScoreSelection(BaseEstimator, TransformerMixin):
    def __init__(self, w=3):
        self.w = w

    def fit(self, X, y=None):
        X = check_array(X)
        self.input_shape_ = X.shape

        # Check that X and y have correct shape
        X, y = check_X_y(X, y)

        # Store the classes seen during fit
        self.labels_ = unique_labels(y)
        self.tValuesDF = pd.DataFrame(columns=self.labels_)
        self.sortedIndexes = pd.DataFrame(columns=self.labels_)

        for label in self.labels_:
            sample1 = X[y == label]
            sample2 = X[y != label]
            # In case both samples have zero variance, put arbitrary values in sample 1
            # so the t-test does not return NaN for those columns
            zeroVarBothColsIdx = (np.var(sample1, axis=0) + np.var(sample2, axis=0)) == 0
            sample1[:, zeroVarBothColsIdx] = 10e6
            sample1[0, zeroVarBothColsIdx] = 1
            # perform Welch's t-test (unequal variances)
            t = st.ttest_ind(sample1, sample2, equal_var=False)
            # set the t-values to 0 for the columns where both samples had zero variance
            t[0][zeroVarBothColsIdx] = 0
            self.tValuesDF[label] = np.abs(t[0])
            self.sortedIndexes[label] = self.tValuesDF.sort_values(by=label,
                                                                   ascending=False).index

        # Return the transformer
        return self

    def transform(self, X):
        # Check is fit had been called
        check_is_fitted(self, ['input_shape_'])

        # Input validation
        X = check_array(X)

        # union of the feature indexes ranked in the top w for each class
        self.selCols = np.unique(self.sortedIndexes[:][0:self.w].values.flatten())

        return X[:, self.selCols]
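
A quick standalone check of the transformer (not part of the original cells), assuming X and y from above; the value w=5 here is arbitrary.


In [0]:
# fit the selector on the training data and keep the union of the
# per-class top-5 genes; the resulting number of columns depends on the data
selector = TtestScoreSelection(w=5)
XSel = selector.fit(X, y).transform(X)
XSel.shape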

Model Selection

We evaluate a set of classifiers and hyperparameters:


In [0]:
cachedir = mkdtemp()

In [0]:
pipe = Pipeline([('featureSelection', TtestScoreSelection(w=10)),
                 ('classify', KNeighborsClassifier(n_neighbors=1))],
                memory=cachedir)

N_GENES = [2,4,6,8,10,12,15,20,25,30]
N_LAYERS = [(32,),(64,),(128,)]

tuned_parameters = [
                    {'featureSelection__w': N_GENES,
                     'classify': [KNeighborsClassifier()],
                     'classify__n_neighbors': [1,3,5,7] 
                    },                    
                    {'featureSelection__w': N_GENES,
                     'classify': [tree.DecisionTreeClassifier()],
                     'classify__criterion':['gini','entropy'],
                     'classify__min_samples_leaf': [1,3,5],
                     'classify__max_depth': [3,6,9],
                     'classify__presort': [True]
                    },
                    {'featureSelection__w': N_GENES,
                     'classify': [MLPClassifier()],
                     'classify__hidden_layer_sizes': N_LAYERS,
                     'classify__activation': ['logistic'],
                     'classify__alpha':[0.05, 0.01, 0.005, 0.001],                      
                     'classify__max_iter':[1000],
                     'classify__solver': ['lbfgs'],
                     'classify__verbose': [True]                                    
                    },
                    {'featureSelection__w': N_GENES,
                    'classify': [naive_bayes.GaussianNB()]
                    }
                   ]
kfolds = KFold(n_splits=5, shuffle=True, random_state=myRandomSeed)
model = GridSearchCV(pipe, tuned_parameters, cv=kfolds, return_train_score=True)

Both feature selection and model training are performed inside the cross-validation loop, in order to avoid data leakage.
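
For illustration (a sketch, not in the original notebook, with an arbitrary w and number of neighbors), the same leakage-free evaluation can be reproduced for a single fixed configuration with cross_val_score on the whole pipeline; this is roughly what GridSearchCV does internally for each parameter setting.


In [0]:
# the selector is refit on the training part of each fold, so no
# information from the held-out fold leaks into the feature selection
singleConfig = Pipeline([('featureSelection', TtestScoreSelection(w=10)),
                         ('classify', KNeighborsClassifier(n_neighbors=3))])
cross_val_score(singleConfig, X, y, cv=kfolds).mean()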


In [0]:
model.fit(X,y)
results = pd.DataFrame(model.cv_results_)

In [17]:
results.sort_values(by='mean_test_score', ascending=False).head()


Out[17]:
mean_fit_time mean_score_time mean_test_score mean_train_score param_classify param_classify__activation param_classify__alpha param_classify__criterion param_classify__hidden_layer_sizes param_classify__max_depth ... split2_test_score split2_train_score split3_test_score split3_train_score split4_test_score split4_train_score std_fit_time std_score_time std_test_score std_train_score
246 1.839278 0.001893 0.942029 1.000000 MLPClassifier(activation='logistic', alpha=0.0... logistic 0.05 NaN (128,) NaN ... 0.928571 1.000000 0.928571 1.000000 1.0 1.000000 0.040034 0.000047 0.05339 0.000000
12 0.071554 0.001438 0.927536 0.978312 KNeighborsClassifier(algorithm='auto', leaf_si... NaN NaN NaN NaN NaN ... 0.928571 0.981818 0.928571 0.981818 1.0 0.964286 0.000992 0.000045 0.04467 0.007013
26 0.070669 0.001475 0.927536 0.978247 KNeighborsClassifier(algorithm='auto', leaf_si... NaN NaN NaN NaN NaN ... 0.928571 0.981818 0.928571 0.981818 1.0 0.982143 0.000956 0.000058 0.04467 0.007306
25 0.070634 0.001356 0.927536 0.981883 KNeighborsClassifier(algorithm='auto', leaf_si... NaN NaN NaN NaN NaN ... 0.928571 0.981818 0.928571 1.000000 1.0 0.982143 0.000760 0.000020 0.04467 0.011500
261 0.638752 0.001593 0.927536 1.000000 MLPClassifier(activation='logistic', alpha=0.0... logistic 0.01 NaN (64,) NaN ... 0.928571 1.000000 0.928571 1.000000 1.0 1.000000 0.012281 0.000229 0.04467 0.000000

5 rows × 33 columns


In [18]:
model.best_estimator_


Out[18]:
Pipeline(memory='/tmp/tmphx_aoffq',
     steps=[('featureSelection', TtestScoreSelection(w=15)), ('classify', MLPClassifier(activation='logistic', alpha=0.05, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(128,), learning_rate='constant',
       learning_rate_init=0.001, ...=True, solver='lbfgs', tol=0.0001, validation_fraction=0.1,
       verbose=True, warm_start=False))])

Considering overall accuracy (not always the most appropriate metric), the best model is a multilayer perceptron with one hidden layer of 128 neurons.
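
The winning hyperparameter combination can be inspected directly on the fitted search object (a small check not included in the original cells; the exact output depends on the run).


In [0]:
# hyperparameters and mean cross-validated score of the best pipeline configuration
model.best_params_, model.best_score_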

Test

We now make predictions for the test set, using the best model refitted on all of the training data.


In [0]:
testSet = pd.read_csv('https://raw.githubusercontent.com/ojon/MD_Proj/master/pp5i_test.gr.csv')

In [0]:
testSet.set_index('SNO', inplace=True)
testX = testSet.transpose()
testX.reset_index(drop=True, inplace=True)
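
The cleaning step earlier thresholds both train and test data; a minimal sketch of the corresponding clipping for the test features (an optional step, not shown in the original cells).


In [0]:
# mirror the training-data thresholding: clip test values to [20, 16000]
testX = testX.clip(lower=20, upper=16000)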

In [0]:
finalResult = pd.DataFrame()
finalResult['predicted'] = model.predict(testX)

In [22]:
finalResult


Out[22]:
predicted
0 MGL
1 MED
2 RHB
3 MED
4 EPD
5 RHB
6 MED
7 MED
8 MED
9 MED
10 MED
11 EPD
12 MED
13 MED
14 RHB
15 MED
16 MED
17 MED
18 MED
19 MED
20 MED
21 RHB
22 MED