A gene contains the information necessary to synthetize a protein. When the correspondent protein is being produced with in the cell, we say that gene is active. The activation level of the gene, i.e. the amount of correspondent protein that is being synthetize is also called the amount of expression of that gene.
The gene expression can be measure using techniques like microarray and RNA-Seq, these informations leads to a profile that can be used to indentifies cell types and states.
In this project we are going to use data mining techiques to look at microarray data, where we have expresion information of 7070 genes and classify cell types/states among 5 classes, labelled EPD, JPA, MED, MGL, RHB.
This project, originally from kdnuggets, was presented to me in a data mining course at the university. This is a version which implementation is performed completely in Python, using the pandas and scikit-learn library.
In [0]:
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
import scipy.stats as st
from sklearn import tree
from sklearn import naive_bayes
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from tempfile import mkdtemp
Reading and formating the training data.
In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/ojon/MD_Proj/master/pp5i_train.gr.csv')
df.set_index('SNO', inplace=True)
df = df.transpose()
df.reset_index(drop=True, inplace=True)
In [0]:
y = pd.read_csv('https://raw.githubusercontent.com/ojon/MD_Proj/master/pp5i_train_class.txt')
df = pd.concat([df, y], axis=1)
In [0]:
myRandomSeed = 72
df = df.sample(frac=1,random_state=myRandomSeed).reset_index(drop=True)
In [5]:
df.shape
Out[5]:
In [6]:
df.head()
Out[6]:
In [0]:
X = df.drop('Class',axis=1)
In [0]:
y = df['Class']
Threshold both train and test data to a minimum value of 20, maximum of 16,000.
In [0]:
X.clip(upper=16000, lower=20, inplace=True)
Remove from train data genes with fold differences across samples less or equal than 2.
In [10]:
X.shape
Out[10]:
In [0]:
X = X.loc[:,X.max() - X.min() > 2]
In [12]:
X.shape
Out[12]:
For feature seletion we perform a t-test for each class among the samples of that class and the samples in other remaining classes. For each class the genes with highest absolte t-values are selected. We calculate the t-values as:
$$t = \frac{\bar{X_1} + \bar{X_2}}{\sqrt{\frac{s_1^2}{n_1} +\frac{s_2^2}{n_2}}}$$The so called Welch's t-test. Used when the two population variances are not assumed to be equal.
A transformer implemented in a way to be compatible with sklearn.pipelines. It does the selection of features using tscores, selecting the top w features with highest t-scores for each class. The implementation of the t-test from scypy library is used.
In [0]:
class TtestScoreSelection(BaseEstimator, TransformerMixin):
def __init__(self, w=3):
self.w = w
def fit(self, X, y=None):
X = check_array(X)
self.input_shape_ = X.shape
# Check that X and y have correct shape
X, y = check_X_y(X, y)
# Store the classes seen during fit
self.labels_ = unique_labels(y)
self.tValuesDF = pd.DataFrame(columns=self.labels_)
self.sortedIndexes = pd.DataFrame(columns=self.labels_)
for label in self.labels_:
sample1 = X[y == label]
sample2 = X[y != label]
# In canse of both samples have variance 0 I put some arbitrary values on sample 1
zeroVarBothColsIdx = (np.var(sample1,axis=0) + np.var(sample2,axis=0)) == 0
sample1[:, zeroVarBothColsIdx] = 10e6
sample1[0, zeroVarBothColsIdx] = 1
#perform the t test
t = st.ttest_ind(sample1, sample2, equal_var=False)
# I set t-values to 0 for the columns where both samples had 0 variance
t[0][zeroVarBothColsIdx] = 0
self.tValuesDF[label] = np.abs(t[0])
self.sortedIndexes[label] = self.tValuesDF.sort_values(by=label,
ascending=False).index
# Return the transformer
return self
def transform(self, X):
# Check is fit had been called
check_is_fitted(self, ['input_shape_'])
# Input validation
X = check_array(X)
# union of indexes from the top w columns for each label
self.selCols = np.unique(self.sortedIndexes[:][0:self.w].values.flatten())
return X[:, self.selCols]
In [0]:
cachedir = mkdtemp()
In [0]:
pipe = Pipeline([('featureSelection', TtestScoreSelection(w=10)),
('classify', KNeighborsClassifier(n_neighbors=1))],
memory=cachedir)
N_GENES = [20,4,6,8,10,12,15,20,25,30]
N_LAYERS = [(32,),(64,),(128,)]
tuned_parameters = [
{'featureSelection__w': N_GENES,
'classify': [KNeighborsClassifier()],
'classify__n_neighbors': [1,3,5,7]
},
{'featureSelection__w': N_GENES,
'classify': [tree.DecisionTreeClassifier()],
'classify__criterion':['gini','entropy'],
'classify__min_samples_leaf': [1,3,5],
'classify__max_depth': [3,6,9],
'classify__presort': [True]
},
{'featureSelection__w': N_GENES,
'classify': [MLPClassifier()],
'classify__hidden_layer_sizes': N_LAYERS,
'classify__activation': ['logistic'],
'classify__alpha':[0.05, 0.01, 0.005, 0.001],
'classify__max_iter':[1000],
'classify__solver': ['lbfgs'],
'classify__verbose': [True]
},
{'featureSelection__w': N_GENES,
'classify': [naive_bayes.GaussianNB()]
}
]
kfolds = KFold(n_splits=5, shuffle=True, random_state=myRandomSeed)
model = GridSearchCV(pipe, tuned_parameters, cv=kfolds, return_train_score=True)
The process of feature selection and the model training are done both using cross-validation, in order to avoid data leakage.
In [0]:
model.fit(X,y)
results = pd.DataFrame(model.cv_results_)
In [17]:
results.sort_values(by='mean_test_score', ascending=False).head()
Out[17]:
In [18]:
model.best_estimator_
Out[18]:
In [0]:
testSet = pd.read_csv('https://raw.githubusercontent.com/ojon/MD_Proj/master/pp5i_test.gr.csv')
In [0]:
testSet.set_index('SNO', inplace=True)
testX = testSet.transpose()
testX.reset_index(drop=True, inplace=True)
In [0]:
finalResult = pd.DataFrame()
finalResult['predicted'] = model.predict(testX)
In [22]:
finalResult
Out[22]: