In this notebook we apply several machine learning methods to the DNA binding-site data and compare various aspects of the underlying paradigms.
We review a number of algorithms: Support Vector Machines and their variants (C-SVM, Support Vector Regression, etc.), tree-based methods (decision tree, random forest, extremely randomized trees, etc.) and other ensemble methods (AdaBoost).
In [1]:
## matrix and vector tools
import pandas as pd
from pandas import DataFrame as df
from pandas import Series
import numpy as np
In [181]:
## sklearn
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import VarianceThreshold
In [3]:
# matplotlib et al.
from matplotlib import pyplot as plt
%matplotlib inline
In [96]:
dna = df.from_csv('../../data/training_data_binding_site_prediction/dna_big.csv')
In [97]:
## encode the class label as boolean and numeric columns
dna = dna.reset_index(drop=False)
dna['class_bool'] = dna['class'] == '+'
dna['class_num'] = dna.class_bool.apply(lambda x: 1 if x else 0)
In [98]:
## add protein ID and corresponding position columns (parsed from ID_pos)
dna['ID'] = dna.ID_pos.apply(lambda x: ''.join(x.split('_')[:-1]))
dna['pos'] = dna.ID_pos.apply(lambda x: x.split('_')[-1])
In [99]:
## data columns
dna.columns
Out[99]:
In [8]:
## print available features
for feature in dna.columns[:-6]:
    print feature
In [152]:
dna
Out[152]:
In [95]:
## create column-wise normalized data-set
dna_norm = dna.copy()
for col in dna_norm[dna_norm.columns[1:][:-6]].columns:
    dna_norm[col] = (dna_norm[col] - dna_norm[col].mean()) / (dna_norm[col].std() + .00001)
In [153]:
dna_norm
Out[153]:
We want to see whether certain evolutionary patterns have any influence on the DNA-binding mechanism. We apply Recursive Feature Elimination (RFE) to rank all of our PSSM-based features according to their predictive power, using a linear SVM (that is, an SVM with a linear kernel function).
In [10]:
# extract features (PSSM columns only) and target for the first 1000 rows
X = dna[dna.columns[1:][:-6]]
X = X[[x for x in X.columns.tolist() if 'pssm' in x]]
X = X.iloc[range(1000)]
y = dna['class_bool']
y = y[range(1000)]
# apply RFE on linear c-SVM
estimator = SVC(kernel="linear")
selector = RFE(estimator, 5, step=1)
selector = selector.fit(X, y)
print selector.ranking_
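The ranking above is easier to interpret when mapped back to column names. A minimal sketch that lists the five features RFE selected, based on the `selector` and `X` from the cell above (the variable name `top_features` is ours):
In [ ]:
## names of the features RFE kept (rank 1); sketch based on the cell above
top_features = [c for c, keep in zip(X.columns, selector.support_) if keep]
print top_features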
In [11]:
# repeat the previous routine over the whole data set, in chunks of 1000 rows
pssm_rank = pd.DataFrame()
cat = dna['class']
for i in range(dna.index.size / 1000):
    this_cat = cat[range(i * 1000, (i + 1) * 1000)]
    if this_cat.unique().size > 1:
        X = dna[dna.columns[1:][:-6]]
        X = X[[c for c in X.columns.tolist() if 'pssm' in c]]
        X = X.iloc[range(i * 1000, (i + 1) * 1000)]
        y = dna['class_bool']
        y = y[range(i * 1000, (i + 1) * 1000)]
        estimator = SVC(kernel="linear")
        selector = RFE(estimator, 5, step=1)
        selector = selector.fit(X, y)
        print selector.ranking_
        pssm_rank[str(i)] = selector.ranking_
pssm_rank.index = [c for c in X.columns.tolist() if 'pssm' in c]
In [12]:
## sort PSSM features by their predictive power (average RFE rank)
rank_av = [np.mean(pssm_rank.ix[i]) for i in pssm_rank.index]
arg_rank_av = np.argsort(rank_av)
pssm_rank_sorted = pssm_rank.ix[pssm_rank.index[arg_rank_av]]
pssm_rank_sorted['RANK_AV'] = np.sort(rank_av)
In [13]:
pssm_rank_sorted
Out[13]:
In [14]:
# plot the average rank of all PSSM features
plt.hist([np.mean(pssm_rank.ix[i]) for i in pssm_rank.index], bins=60, alpha=.5)
plt.title("Histogram of Average PSSM Feature Ranks (RFE on linear SVM)")
fig = plt.gcf()
fig.set_size_inches(10, 6)
Some PSSM features deviate from the normal distribution one would expect in a neutral (uninformative) setting due to the Central Limit Theorem (CLT).
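To back up this visual impression, one could run a normality test on the average ranks. A minimal sketch using scipy's D'Agostino-Pearson test on `rank_av` from the cell above (the 0.05 threshold is an arbitrary choice, not a value from this notebook):
In [ ]:
## normality test on the average RFE ranks (sketch)
from scipy import stats

stat, pval = stats.normaltest(rank_av)
print "normaltest statistic: %.3f, p-value: %.3g" % (stat, pval)
if pval < 0.05:
    print "average ranks deviate significantly from a normal distribution"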
We train a C-SVM classifier (SVC with an RBF kernel) on different feature sets: the raw features, the column-wise normalized features, and a hand-picked feature subset.
In [102]:
X = dna[dna.columns[1:][:-6]]
y = dna.class_num
In [103]:
## train c-SVM
clf_svm1 = SVC(kernel='rbf', C=0.7)
clf_svm1.fit(X[dna.fold == 0], y[dna.fold == 0])
Out[103]:
In [104]:
## predict class
pred = clf_svm1.predict(dna[dna.fold == 1][dna.columns[1:][:-6]])
In [105]:
truth = dna[dna.fold == 1]['class_num']
tp = pred[(np.array(pred) == 1) & (np.array(truth) == 1)].size
tn = pred[(np.array(pred) == 0) & (np.array(truth) == 0)].size
fp = pred[(np.array(pred) == 1) & (np.array(truth) == 0)].size
fn = pred[(np.array(pred) == 0) & (np.array(truth) == 1)].size
cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
"\t(+)-truth\t%d\t\t%d\n" +\
"\t(-)-truth\t%d\t\t%d"
print cm % (tp, fn, fp, tn)
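Since this confusion-matrix printout is repeated for every model below, it could be wrapped in a small helper. A minimal sketch (the function name `print_confusion` is ours, not part of the original notebook):
In [ ]:
## helper that computes and prints the confusion matrix (sketch)
def print_confusion(pred, truth):
    pred, truth = np.array(pred), np.array(truth)
    tp = ((pred == 1) & (truth == 1)).sum()
    tn = ((pred == 0) & (truth == 0)).sum()
    fp = ((pred == 1) & (truth == 0)).sum()
    fn = ((pred == 0) & (truth == 1)).sum()
    cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
         "\t(+)-truth\t%d\t\t%d\n" +\
         "\t(-)-truth\t%d\t\t%d"
    print cm % (tp, fn, fp, tn)

print_confusion(pred, truth)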
In [165]:
print "Size of (-)- and (+)-sets:\n\t(+)\t %d\n\t(-)\t%d" % (truth[truth == 1].index.size, truth[truth == 0].index.size)
In [80]:
X_norm = dna_norm[dna_norm.columns[1:][:-6]]
y = dna_norm.class_num
In [81]:
## train c-SVM
clf_svm2 = SVC(kernel='rbf', C=0.7)
clf_svm2.fit(X_norm[dna_norm.fold == 0], y[dna_norm.fold == 0])
Out[81]:
In [108]:
## predict class
pred2 = clf_svm2.predict(dna_norm[dna_norm.fold == 1][dna_norm.columns[1:][:-6]])
In [109]:
truth = dna_norm[dna_norm.fold == 1]['class_num']
tp = pred2[(np.array(pred2) == 1) & (np.array(truth) == 1)].size
tn = pred2[(np.array(pred2) == 0) & (np.array(truth) == 0)].size
fp = pred2[(np.array(pred2) == 1) & (np.array(truth) == 0)].size
fn = pred2[(np.array(pred2) == 0) & (np.array(truth) == 1)].size
cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
"\t(+)-truth\t%d\t\t%d\n" +\
"\t(-)-truth\t%d\t\t%d"
print cm % (tp, fn, fp, tn)
Normalizing each feature column (zero mean, unit variance) puts all features on a comparable scale, which makes the RBF-kernel SVM better conditioned and in turn improves the results.
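The manual column-wise normalization above is essentially standard scaling. A minimal sketch of an equivalent approach using sklearn's StandardScaler (the variable names are ours):
In [ ]:
## equivalent column-wise standardization with sklearn (sketch)
from sklearn.preprocessing import StandardScaler

feature_cols = dna.columns[1:][:-6]
scaler = StandardScaler()
dna_scaled = dna.copy()
dna_scaled[feature_cols] = scaler.fit_transform(dna[feature_cols])
## note: to avoid information leakage one would fit the scaler on the
## training fold only and then transform both folds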
In [88]:
## hand-pick features
features = [x for x in dna.columns[1:][:-6] if 'pssm' in x] +\
[x for x in dna.columns[1:][:-6] if 'glbl_aa_comp' in x] +\
[x for x in dna.columns[1:][:-6] if 'glbl_sec' in x] +\
[x for x in dna.columns[1:][:-6] if 'glbl_acc' in x] +\
[x for x in dna.columns[1:][:-6] if 'chemprop_mass' in x] +\
[x for x in dna.columns[1:][:-6] if 'chemprop_hyd' in x] +\
[x for x in dna.columns[1:][:-6] if 'chemprop_cbeta' in x] +\
[x for x in dna.columns[1:][:-6] if 'chemprop_charge' in x] +\
[x for x in dna.columns[1:][:-6] if 'inf_PP' in x] +\
[x for x in dna.columns[1:][:-6] if 'isis_bin' in x] +\
[x for x in dna.columns[1:][:-6] if 'isis_raw' in x] +\
[x for x in dna.columns[1:][:-6] if 'profbval_raw' in x] +\
[x for x in dna.columns[1:][:-6] if 'profphd_sec_raw' in x] +\
[x for x in dna.columns[1:][:-6] if 'profphd_sec_bin' in x] +\
[x for x in dna.columns[1:][:-6] if 'profphd_acc_bin' in x] +\
[x for x in dna.columns[1:][:-6] if 'profphd_normalize' in x] +\
[x for x in dna.columns[1:][:-6] if 'pfam_within_domain' in x] +\
[x for x in dna.columns[1:][:-6] if 'pfam_dom_cons' in x]
In [89]:
X_norm = dna_norm[features]
y = dna_norm.class_num
In [90]:
## train c-SVM
clf_svm3 = SVC(kernel='rbf', C=0.7)
clf_svm3.fit(X_norm[dna_norm.fold == 0], y[dna_norm.fold == 0])
Out[90]:
In [106]:
## predict class
pred3 = clf_svm3.predict(X_norm[dna_norm.fold == 1])
In [107]:
truth = dna_norm[dna_norm.fold == 1]['class_num']
tp = pred3[(np.array(pred3) == 1) & (np.array(truth) == 1)].size
tn = pred3[(np.array(pred3) == 0) & (np.array(truth) == 0)].size
fp = pred3[(np.array(pred3) == 1) & (np.array(truth) == 0)].size
fn = pred3[(np.array(pred3) == 0) & (np.array(truth) == 1)].size
cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
"\t(+)-truth\t%d\t\t%d\n" +\
"\t(-)-truth\t%d\t\t%d"
print cm % (tp, fn, fp, tn)
In [111]:
X = dna[dna.columns[1:][:-6]]
y = dna.class_num
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
In [151]:
# compute cross validated accuracy of the model
clf_t1 = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
random_state=0)
scores = cross_val_score(clf_t1, X, y, cv=5)
print scores
print scores.mean()
In [116]:
clf_t1.fit(X[dna.fold == 0], y[dna.fold == 0])
Out[116]:
In [119]:
pred_t1 = clf_t1.predict(X[dna.fold == 1])
In [126]:
truth = dna[dna.fold == 1]['class_num']
tp = pred_t1[(np.array(pred_t1) == 1) & (np.array(truth) == 1)].size
tn = pred_t1[(np.array(pred_t1) == 0) & (np.array(truth) == 0)].size
fp = pred_t1[(np.array(pred_t1) == 1) & (np.array(truth) == 0)].size
fn = pred_t1[(np.array(pred_t1) == 0) & (np.array(truth) == 1)].size
cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
"\t(+)-truth\t%d\t\t%d\n" +\
"\t(-)-truth\t%d\t\t%d"
print cm % (tp, fn, fp, tn)
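For interpretability, the decision rules the tree has learned can be exported and inspected. A minimal sketch using sklearn's export_graphviz on the fitted clf_t1 (the output file name, depth limit, and class name strings are our choices):
In [ ]:
## export the fitted decision tree to Graphviz format for inspection (sketch)
from sklearn.tree import export_graphviz

export_graphviz(clf_t1, out_file='dna_tree.dot',
                feature_names=X.columns.tolist(),
                class_names=['non-binding', 'binding'],
                max_depth=3, filled=True)
## render with: dot -Tpng dna_tree.dot -o dna_tree.png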
In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.
[SKL]
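The two sources of randomness described above map directly onto constructor arguments: bootstrap sampling is controlled by bootstrap and the size of the random feature subset per split by max_features. A minimal sketch (the 'sqrt' setting mirrors the common default and is not a value tuned here):
In [ ]:
## explicit control over the randomness described above (sketch)
rf_explicit = RandomForestClassifier(n_estimators=10,
                                     bootstrap=True,       # sample rows with replacement
                                     max_features='sqrt',  # random feature subset per split
                                     random_state=0)
print cross_val_score(rf_explicit, X, y, cv=5).mean()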
In [149]:
# compute cross validated accuracy of the model
clf_t2 = RandomForestClassifier(n_estimators=10, max_depth=None,
min_samples_split=2, random_state=0)
scores = cross_val_score(clf_t2, X, y, cv=5)
print scores
print scores.mean()
In [128]:
clf_t2.fit(X[dna.fold == 0], y[dna.fold == 0])
Out[128]:
In [130]:
pred_t2 = clf_t2.predict(X[dna.fold == 1])
In [131]:
truth = dna[dna.fold == 1]['class_num']
tp = pred_t2[(np.array(pred_t2) == 1) & (np.array(truth) == 1)].size
tn = pred_t2[(np.array(pred_t2) == 0) & (np.array(truth) == 0)].size
fp = pred_t2[(np.array(pred_t2) == 1) & (np.array(truth) == 0)].size
fn = pred_t2[(np.array(pred_t2) == 0) & (np.array(truth) == 1)].size
cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
"\t(+)-truth\t%d\t\t%d\n" +\
"\t(-)-truth\t%d\t\t%d"
print cm % (tp, fn, fp, tn)
In extremely randomized trees (see ExtraTreesClassifier and ExtraTreesRegressor classes), randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule. This usually allows to reduce the variance of the model a bit more, at the expense of a slightly greater increase in bias:
[SKL]
In [150]:
# compute cross validated accuracy of the model
clf_t3 = ExtraTreesClassifier(n_estimators=10, max_depth=None,
min_samples_split=2, random_state=0)
scores = cross_val_score(clf_t3, X, y, cv=5)
print scores
print scores.mean()
In [134]:
clf_t3.fit(X[dna.fold == 0], y[dna.fold == 0])
Out[134]:
In [135]:
pred_t3 = clf_t3.predict(X[dna.fold == 1])
In [136]:
truth = dna[dna.fold == 1]['class_num']
tp = pred_t3[(np.array(pred_t3) == 1) & (np.array(truth) == 1)].size
tn = pred_t3[(np.array(pred_t3) == 0) & (np.array(truth) == 0)].size
fp = pred_t3[(np.array(pred_t3) == 1) & (np.array(truth) == 0)].size
fn = pred_t3[(np.array(pred_t3) == 0) & (np.array(truth) == 1)].size
cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
"\t(+)-truth\t%d\t\t%d\n" +\
"\t(-)-truth\t%d\t\t%d"
print cm % (tp, fn, fp, tn)
In [172]:
features = [x for x in dna.columns[1:][:-6] if 'pssm' in x] +\
[x for x in dna.columns[1:][:-6] if 'glbl_aa_comp' in x] +\
[x for x in dna.columns[1:][:-6] if 'glbl_sec' in x] +\
[x for x in dna.columns[1:][:-6] if 'glbl_acc' in x] +\
[x for x in dna.columns[1:][:-6] if 'chemprop_mass' in x] +\
[x for x in dna.columns[1:][:-6] if 'chemprop_hyd' in x] +\
[x for x in dna.columns[1:][:-6] if 'chemprop_cbeta' in x] +\
[x for x in dna.columns[1:][:-6] if 'chemprop_charge' in x] +\
[x for x in dna.columns[1:][:-6] if 'inf_PP' in x] +\
[x for x in dna.columns[1:][:-6] if 'isis_bin' in x] +\
[x for x in dna.columns[1:][:-6] if 'isis_raw' in x] +\
[x for x in dna.columns[1:][:-6] if 'profbval_raw' in x] +\
[x for x in dna.columns[1:][:-6] if 'profphd_sec_raw' in x] +\
[x for x in dna.columns[1:][:-6] if 'profphd_sec_bin' in x] +\
[x for x in dna.columns[1:][:-6] if 'profphd_acc_bin' in x] +\
[x for x in dna.columns[1:][:-6] if 'profphd_normalize' in x] +\
[x for x in dna.columns[1:][:-6] if 'pfam_within_domain' in x] +\
[x for x in dna.columns[1:][:-6] if 'pfam_dom_cons' in x]
In [173]:
X = dna[features]
y = dna.class_num
In [174]:
# compute cross validated accuracy of the model
clf_t4 = RandomForestClassifier(n_estimators=10, max_depth=None,
min_samples_split=2, random_state=0)
scores = cross_val_score(clf_t4, X, y, cv=5)
print scores
print scores.mean()
In [175]:
clf_t4.fit(X[dna.fold == 0], y[dna.fold == 0])
Out[175]:
In [176]:
pred_t4 = clf_t4.predict(X[dna.fold == 1])
In [177]:
truth = dna[dna.fold == 1]['class_num']
tp = pred_t4[(np.array(pred_t4) == 1) & (np.array(truth) == 1)].size
tn = pred_t4[(np.array(pred_t4) == 0) & (np.array(truth) == 0)].size
fp = pred_t4[(np.array(pred_t4) == 1) & (np.array(truth) == 0)].size
fn = pred_t4[(np.array(pred_t4) == 0) & (np.array(truth) == 1)].size
cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
"\t(+)-truth\t%d\t\t%d\n" +\
"\t(-)-truth\t%d\t\t%d"
print cm % (tp, fn, fp, tn)
While there is a significant accuracy improvement going from a single Decision Tree to a Random Forest, the Extremely Randomized Trees only improve the accuracy marginally. Likewise, manually hand-picking the features does not seem to improve the accuracy.
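To make this comparison explicit, the cross-validated accuracies of the three tree-based models could be collected side by side. A minimal sketch (assumes the classifiers defined above; the variable names are ours):
In [ ]:
## collect 5-fold cross-validated accuracies of the tree-based models (sketch)
X_all = dna[dna.columns[1:][:-6]]
y_all = dna.class_num
models = {'decision_tree': clf_t1,
          'random_forest': clf_t2,
          'extra_trees': clf_t3}
cv_acc = Series({name: cross_val_score(m, X_all, y_all, cv=5).mean()
                 for name, m in models.items()})
print cv_acc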
The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction.
[SKL]
In [179]:
X = dna[dna.columns[1:][:-6]]
y = dna.class_num
In [184]:
# compute cross validated accuracy of the model
ada = AdaBoostClassifier(n_estimators=100)
scores = cross_val_score(ada, X, y, cv=5)
print scores
print scores.mean()
In [185]:
ada.fit(X[dna.fold == 0], y[dna.fold == 0])
Out[185]:
In [187]:
pred_ada = ada.predict(X[dna.fold == 1])
In [188]:
truth = dna[dna.fold == 1]['class_num']
tp = pred_ada[(np.array(pred_ada) == 1) & (np.array(truth) == 1)].size
tn = pred_ada[(np.array(pred_ada) == 0) & (np.array(truth) == 0)].size
fp = pred_ada[(np.array(pred_ada) == 1) & (np.array(truth) == 0)].size
fn = pred_ada[(np.array(pred_ada) == 0) & (np.array(truth) == 1)].size
cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
"\t(+)-truth\t%d\t\t%d\n" +\
"\t(-)-truth\t%d\t\t%d"
print cm % (tp, fn, fp, tn)
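The weak learners AdaBoostClassifier uses default to decision stumps (depth-1 trees); a deeper base estimator can be supplied via base_estimator if the stumps underfit. A minimal sketch (max_depth=3 is an arbitrary choice, not a value tuned here):
In [ ]:
## AdaBoost with a deeper weak learner than the default stump (sketch)
ada_deep = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=3),  # instead of the default depth-1 stump
    n_estimators=100)
print cross_val_score(ada_deep, X, y, cv=5).mean()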
Some approaches that might improve the results:
I. Quantitative feature selection using RFE/RFA over the complete feature space
Problem: the feature space might be too large for conventional off-the-shelf algorithms.
Possible hacks (see the first sketch at the end of this section):
-- Bagging of features (55+ feature groups vs. 500+ individual features)
-- Removing similar features before RFE (elimination via cosine similarity et al.?)
-- Dimensionality reduction (t-SNE, PCA et al.?)
II. Regularization: might work, considering that the system is not entirely overdetermined, that many features are not actually informative, and that the models tend to overcomplicate; see the second sketch at the end of this section.
In general, a combination of I and II would make sense.
Regression counterparts (SVR, Random Forest and AdaBoost regressors) could also be explored.
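First sketch: removing highly similar features via pairwise cosine similarity and then compressing the remainder with PCA, as suggested under the possible hacks above. The similarity threshold 0.95 and the number of PCA components are arbitrary choices, not values from this notebook:
In [ ]:
## drop near-duplicate features via pairwise cosine similarity, then apply PCA (sketch)
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA

feature_cols = [c for c in dna_norm.columns[1:][:-6]]
X_full = dna_norm[feature_cols].values

## pairwise cosine similarity between feature columns
sim = cosine_similarity(X_full.T)

## greedily keep a feature only if it is not too similar to an already kept one
keep = []
for j in range(sim.shape[0]):
    if all(abs(sim[j, k]) < 0.95 for k in keep):
        keep.append(j)
reduced_cols = [feature_cols[j] for j in keep]
print "kept %d of %d features" % (len(reduced_cols), len(feature_cols))

## further compress with PCA before feeding RFE or an SVM
pca = PCA(n_components=50)
X_pca = pca.fit_transform(dna_norm[reduced_cols])
print X_pca.shape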
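Second sketch: approach II via an L1-regularized linear SVM, which drives the weights of uninformative features to zero. The value of C is a placeholder to be tuned by cross-validation:
In [ ]:
## L1-regularized linear SVM as a built-in feature selector (sketch)
from sklearn.svm import LinearSVC

X_norm_all = dna_norm[dna_norm.columns[1:][:-6]]
y_all = dna_norm.class_num

clf_l1 = LinearSVC(penalty='l1', dual=False, C=0.1)
clf_l1.fit(X_norm_all[dna_norm.fold == 0], y_all[dna_norm.fold == 0])

## number of features with non-zero weight and held-out accuracy
print (clf_l1.coef_ != 0).sum(), "features kept out of", X_norm_all.shape[1]
print "fold-1 accuracy: %.3f" % clf_l1.score(X_norm_all[dna_norm.fold == 1],
                                             y_all[dna_norm.fold == 1])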