Binding Site Prediction

In this notebook we apply several machine learning methods and compare different aspects of machine learning paradigms:

  • Zero-knowledge vs. domain-knowledge based prediction
  • Single algorithms vs. ensemble methods
  • Prediction over normalized vs. non-normalized space

We also review several machine learning algorithms: Support Vector Machines and their variants (C-SVM, support vector regression, etc.), tree-based methods (decision trees, random forests, extremely randomized trees, etc.), and other ensemble methods (AdaBoost).
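As a minimal sketch of the single-algorithm vs. ensemble comparison, the snippet below scores one decision tree against three ensembles with 5-fold cross-validation. The data come from `make_classification` and are a synthetic stand-in, not the binding-site dataset; the hyperparameters (`n_estimators=50`, `cv=5`) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the binding-site data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "extra trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds.
    cv_scores[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print("%-14s %.3f" % (name, cv_scores[name]))
```

The same loop can be pointed at the real feature matrix once it is loaded; which model scores best depends on the data.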

## matrix and vector tools

import pandas as pd
from pandas import DataFrame as df
from pandas import Series
import numpy as np

## sklearn

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
from sklearn.svm import SVC

from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

from sklearn.feature_selection import VarianceThreshold

# matplotlib et al.

from matplotlib import pyplot as plt

%matplotlib inline

Data Import and pre-processing

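# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 is the closest equivalent
dna = pd.read_csv('../../data/training_data_binding_site_prediction/dna_big.csv', index_col=0)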

## embed class

dna = dna.reset_index(drop=False)
dna['class_bool'] = dna['class'] == '+'
dna['class_num'] = dna['class_bool'].astype(int)

## added protein ID and corresponding position

dna['ID'] = dna.ID_pos.apply(lambda x: ''.join(x.split('_')[:-1]))
dna['pos'] = dna.ID_pos.apply(lambda x: x.split('_')[-1])
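The split of `ID_pos` into a protein ID and a residue position can be illustrated on a few identifiers. This is a sketch using hypothetical rows in the same `<protein ID>_<position>` form. It splits once from the right, so an ID that itself contains an underscore would stay intact, whereas joining the pieces with `''` would drop the underscore. Whether such IDs occur in the real data is not known from the text. It also casts the position to an integer, which the cell above does not do.

```python
import pandas as pd

# Hypothetical identifiers in the same "<protein ID>_<position>" form as ID_pos.
demo = pd.DataFrame({"ID_pos": ["Q6NS38_1", "Q6NS38_12", "P11746_286"]})

# Split once, from the right: everything before the last underscore is the ID.
parts = demo["ID_pos"].str.rsplit("_", n=1, expand=True)
demo["ID"] = parts[0]
demo["pos"] = parts[1].astype(int)  # numeric, so positions sort as 1, 2, ..., 10

print(demo)
```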

## data columns


Index([u'ID_pos', u'-2_A_pssm', u'-2_R_pssm', u'-2_N_pssm', u'-2_D_pssm',
       u'-2_C_pssm', u'-2_Q_pssm', u'-2_E_pssm', u'-2_G_pssm', u'-2_H_pssm',
       u'intermediate_composition3', u'buried_composition1',
       u'buried_composition2', u'buried_composition3', u'fold', u'class',
       u'class_bool', u'class_num', u'ID', u'pos'],
      dtype='object', length=524)

## print available features
for feature in dna.columns[:-6]:
    print(feature)


ID_pos -2_A_pssm -2_R_pssm -2_N_pssm -2_D_pssm -2_C_pssm -2_Q_pssm -2_E_pssm -2_G_pssm -2_H_pssm ... intermediate_composition3 buried_composition1 buried_composition2 buried_composition3 fold class class_bool class_num ID pos
0 Q6NS38_1 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 1
1 Q6NS38_2 0.047426 0.017986 0.006693 0.002473 0.017986 0.047426 0.006693 0.006693 0.017986 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 2
2 Q6NS38_3 0.119203 0.017986 0.119203 0.999665 0.002473 0.047426 0.731059 0.017986 0.017986 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 3
3 Q6NS38_4 0.731059 0.997527 0.119203 0.268941 0.006693 0.731059 0.119203 0.119203 0.500000 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 4
4 Q6NS38_5 0.731059 0.268941 0.119203 0.006693 0.017986 0.268941 0.268941 0.047426 0.047426 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 5
5 Q6NS38_6 0.119203 0.880797 0.017986 0.017986 0.047426 0.047426 0.268941 0.119203 0.017986 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 6
6 Q6NS38_7 0.731059 0.119203 0.119203 0.119203 0.047426 0.119203 0.017986 0.017986 0.017986 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 7
7 Q6NS38_8 0.119203 0.880797 0.119203 0.119203 0.268941 0.982014 0.268941 0.119203 0.268941 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 8
8 Q6NS38_9 0.500000 0.982014 0.119203 0.119203 0.006693 0.268941 0.268941 0.952574 0.017986 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 9
9 Q6NS38_10 0.982014 0.268941 0.017986 0.500000 0.017986 0.047426 0.268941 0.047426 0.047426 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 10
10 Q6NS38_11 0.268941 0.880797 0.500000 0.268941 0.017986 0.731059 0.731059 0.119203 0.119203 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 11
11 Q6NS38_12 0.880797 0.268941 0.268941 0.119203 0.119203 0.268941 0.268941 0.731059 0.880797 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 12
12 Q6NS38_13 0.500000 0.268941 0.268941 0.952574 0.500000 0.119203 0.268941 0.952574 0.500000 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 13
13 Q6NS38_14 0.731059 0.731059 0.268941 0.500000 0.047426 0.731059 0.731059 0.119203 0.880797 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 14
14 Q6NS38_15 0.500000 0.500000 0.268941 0.268941 0.047426 0.731059 0.268941 0.119203 0.119203 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 15
15 Q6NS38_16 0.880797 0.952574 0.119203 0.268941 0.268941 0.268941 0.500000 0.880797 0.268941 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 16
16 Q6NS38_17 0.500000 0.880797 0.268941 0.119203 0.119203 0.731059 0.500000 0.119203 0.119203 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 17
17 Q6NS38_18 0.500000 0.880797 0.500000 0.500000 0.006693 0.880797 0.500000 0.268941 0.500000 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 18
18 Q6NS38_19 0.268941 0.731059 0.500000 0.880797 0.119203 0.880797 0.982014 0.119203 0.268941 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 19
19 Q6NS38_20 0.731059 0.500000 0.731059 0.880797 0.017986 0.268941 0.880797 0.268941 0.500000 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 20
20 Q6NS38_21 0.268941 0.880797 0.119203 0.268941 0.006693 0.982014 0.880797 0.047426 0.500000 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 21
21 Q6NS38_22 0.500000 0.880797 0.119203 0.268941 0.006693 0.500000 0.952574 0.500000 0.268941 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 22
22 Q6NS38_23 0.268941 0.500000 0.119203 0.119203 0.002473 0.880797 0.731059 0.047426 0.731059 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 23
23 Q6NS38_24 0.500000 0.731059 0.268941 0.268941 0.017986 0.500000 0.268941 0.268941 0.119203 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 24
24 Q6NS38_25 0.731059 0.731059 0.500000 0.731059 0.017986 0.268941 0.500000 0.731059 0.119203 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 25
25 Q6NS38_26 0.500000 0.500000 0.500000 0.880797 0.006693 0.880797 0.880797 0.268941 0.500000 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 26
26 Q6NS38_27 0.500000 0.731059 0.500000 0.500000 0.017986 0.731059 0.880797 0.731059 0.268941 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 27
27 Q6NS38_28 0.500000 0.268941 0.119203 0.500000 0.006693 0.500000 0.500000 0.119203 0.731059 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 28
28 Q6NS38_29 0.880797 0.500000 0.268941 0.268941 0.731059 0.731059 0.500000 0.268941 0.119203 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 29
29 Q6NS38_30 0.731059 0.500000 0.268941 0.268941 0.268941 0.268941 0.268941 0.500000 0.268941 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 30
85634 P11746_257 0.119203 0.500000 0.731059 0.119203 0.017986 0.997527 0.268941 0.268941 0.880797 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 257
85635 P11746_258 0.500000 0.119203 0.880797 0.500000 0.006693 0.880797 0.119203 0.880797 0.993307 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 258
85636 P11746_259 0.268941 0.047426 0.047426 0.119203 0.006693 0.731059 0.017986 0.119203 0.880797 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 259
85637 P11746_260 0.731059 0.119203 0.500000 0.119203 0.002473 0.500000 0.119203 0.119203 0.880797 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 260
85638 P11746_261 0.500000 0.268941 0.952574 0.119203 0.000911 0.500000 0.119203 0.268941 0.952574 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 261
85639 P11746_262 0.731059 0.047426 0.119203 0.119203 0.006693 0.731059 0.119203 0.880797 0.731059 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 262
85640 P11746_263 0.880797 0.017986 0.731059 0.268941 0.047426 0.500000 0.006693 0.731059 0.500000 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 263
85641 P11746_264 0.731059 0.017986 0.731059 0.047426 0.047426 0.731059 0.119203 0.500000 0.880797 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 264
85642 P11746_265 0.268941 0.047426 0.017986 0.047426 0.000911 0.500000 0.119203 0.047426 0.268941 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 265
85643 P11746_266 0.268941 0.119203 0.731059 0.119203 0.047426 0.731059 0.119203 0.047426 0.993307 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 266
85644 P11746_267 0.119203 0.047426 0.119203 0.006693 0.017986 0.500000 0.047426 0.119203 0.500000 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 267
85645 P11746_268 0.268941 0.500000 0.993307 0.500000 0.268941 0.952574 0.268941 0.268941 0.880797 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 268
85646 P11746_269 0.880797 0.017986 0.268941 0.119203 0.047426 0.731059 0.119203 0.268941 0.993307 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 269
85647 P11746_270 0.731059 0.500000 0.500000 0.500000 0.119203 0.880797 0.880797 0.047426 0.731059 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 270
85648 P11746_271 0.119203 0.731059 0.731059 0.268941 0.119203 0.993307 0.119203 0.880797 0.952574 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 271
85649 P11746_272 0.268941 0.119203 0.993307 0.268941 0.047426 0.880797 0.268941 0.268941 0.952574 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 272
85650 P11746_273 0.880797 0.268941 0.268941 0.119203 0.002473 0.880797 0.268941 0.047426 0.119203 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 273
85651 P11746_274 0.880797 0.017986 0.119203 0.268941 0.017986 0.731059 0.119203 0.268941 0.268941 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 274
85652 P11746_275 0.268941 0.500000 0.119203 0.119203 0.006693 0.731059 0.017986 0.119203 0.982014 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 275
85653 P11746_276 0.268941 0.119203 0.500000 0.268941 0.002473 0.997527 0.500000 0.268941 0.731059 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 276
85654 P11746_277 0.119203 0.500000 0.731059 0.500000 0.002473 0.997527 0.500000 0.119203 0.880797 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 277
85655 P11746_278 0.268941 0.500000 0.500000 0.268941 0.006693 0.731059 0.119203 0.731059 0.880797 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 278
85656 P11746_279 0.119203 0.268941 0.268941 0.047426 0.017986 0.500000 0.017986 0.119203 0.500000 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 279
85657 P11746_280 0.119203 0.268941 0.731059 0.268941 0.006693 0.993307 0.268941 0.268941 0.952574 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 280
85658 P11746_281 0.268941 0.047426 0.500000 0.500000 0.006693 0.952574 0.952574 0.119203 0.880797 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 281
85659 P11746_282 0.047426 0.006693 0.268941 0.500000 0.002473 0.982014 0.500000 0.006693 0.500000 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 282
85660 P11746_283 0.119203 0.500000 0.731059 0.047426 0.000911 0.997527 0.119203 0.047426 0.731059 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 283
85661 P11746_284 0.119203 0.880797 0.119203 0.017986 0.006693 0.997527 0.500000 0.047426 0.731059 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 284
85662 P11746_285 0.119203 0.268941 0.982014 0.268941 0.006693 0.731059 0.119203 0.993307 0.500000 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 285
85663 P11746_286 0.119203 0.047426 0.500000 0.017986 0.006693 0.999089 0.268941 0.119203 0.982014 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 286

Data Pre-processing: normalization

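We apply Student's t-statistic normalization (standardization by the sample mean and sample standard deviation) to each feature column:

$$X' = \frac{X - \bar{X}}{s}$$

where $\bar{X}$ is the column mean and $s$ is the column's sample standard deviation. The code adds a small constant to $s$ to avoid division by zero for constant columns.

The normalized set is used alongside the non-normalized dataset for comparison.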

## create column-wise normalized data-set

dna_norm = dna.copy()

# skip the first column (ID_pos) and the last six non-feature columns
for col in dna_norm.columns[1:-6]:
    dna_norm[col] = (dna_norm[col] - dna_norm[col].mean()) / (dna_norm[col].std() + 1e-5)

ID_pos -2_A_pssm -2_R_pssm -2_N_pssm -2_D_pssm -2_C_pssm -2_Q_pssm -2_E_pssm -2_G_pssm -2_H_pssm ... intermediate_composition3 buried_composition1 buried_composition2 buried_composition3 fold class class_bool class_num ID pos
0 Q6NS38_1 0.471432 0.333399 0.356656 0.424321 0.536658 0.167189 0.331158 0.821829 0.114952 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 1
1 Q6NS38_2 -1.263070 -1.249199 -1.354682 -1.189555 -0.911272 -1.383960 -1.273175 -0.970122 -1.344461 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 2
2 Q6NS38_3 -0.987983 -1.249199 -0.964372 2.045131 -0.957874 -1.383960 1.082607 -0.929099 -1.344461 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 3
3 Q6NS38_4 1.356970 1.966934 -0.964372 -0.325185 -0.945197 0.959118 -0.907270 -0.561426 0.114952 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 4
4 Q6NS38_5 1.356970 -0.425236 -0.964372 -1.175865 -0.911272 -0.624740 -0.420290 -0.822158 -1.255325 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 5
5 Q6NS38_6 -0.987983 1.583672 -1.315504 -1.139232 -0.822838 -1.383960 -0.420290 -0.561426 -1.344461 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 6
6 Q6NS38_7 1.356970 -0.916874 -0.964372 -0.810906 -0.822838 -1.137952 -1.236446 -0.929099 -1.344461 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 7
7 Q6NS38_8 -0.987983 1.583672 -0.964372 -0.810906 -0.157423 1.819240 -0.420290 -0.561426 -0.584634 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 8
8 Q6NS38_9 0.471432 1.915998 -0.964372 -0.810906 -0.945197 -0.624740 -0.420290 2.465816 -1.344461 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 9
9 Q6NS38_10 2.318762 -0.425236 -1.315504 0.424321 -0.911272 -1.383960 -0.420290 -0.822158 -1.255325 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 10
10 Q6NS38_11 -0.414106 1.583672 0.356656 -0.325185 -0.911272 0.959118 1.082607 -0.561426 -1.038003 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 11
11 Q6NS38_12 1.930846 -0.425236 -0.444912 -0.810906 -0.607225 -0.624740 -0.420290 1.661155 1.267906 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 12
12 Q6NS38_13 0.471432 -0.425236 -0.444912 1.892379 0.536658 -1.137952 -0.420290 2.465816 0.114952 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 13
13 Q6NS38_14 1.356970 1.092035 -0.444912 0.424321 -0.822838 0.959118 1.082607 -0.561426 1.267906 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 14
14 Q6NS38_15 0.471432 0.333399 -0.444912 -0.325185 -0.822838 0.959118 -0.420290 -0.561426 -1.038003 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 15
15 Q6NS38_16 1.930846 1.819338 -0.964372 -0.325185 -0.157423 -0.624740 0.331158 2.205084 -0.584634 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 16
16 Q6NS38_17 0.471432 1.583672 -0.444912 -0.810906 -0.607225 0.959118 0.331158 -0.561426 -1.038003 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 17
17 Q6NS38_18 0.471432 1.583672 0.356656 0.424321 -0.945197 1.472330 0.331158 -0.017497 0.114952 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 18
18 Q6NS38_19 -0.414106 1.092035 0.356656 1.659549 -0.607225 1.472330 1.898763 -0.561426 -0.584634 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 19
19 Q6NS38_20 1.356970 0.333399 1.158225 1.659549 -0.911272 -0.624740 1.569586 -0.017497 0.114952 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 20
20 Q6NS38_21 -0.414106 1.583672 -0.964372 -0.325185 -0.945197 1.819240 1.569586 -0.822158 0.114952 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 21
21 Q6NS38_22 0.471432 1.583672 -0.964372 -0.325185 -0.945197 0.167189 1.803020 0.821829 -0.584634 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 22
22 Q6NS38_23 -0.414106 0.333399 -0.964372 -0.810906 -0.957874 1.472330 1.082607 -0.822158 0.814537 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 23
23 Q6NS38_24 0.471432 1.092035 -0.444912 -0.325185 -0.911272 0.167189 -0.420290 -0.017497 -1.038003 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 24
24 Q6NS38_25 1.356970 1.092035 0.356656 1.173828 -0.911272 -0.624740 0.331158 1.661155 -1.038003 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 25
25 Q6NS38_26 0.471432 0.333399 0.356656 1.659549 -0.945197 1.472330 1.569586 -0.017497 0.114952 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 26
26 Q6NS38_27 0.471432 1.092035 0.356656 0.424321 -0.911272 0.959118 1.569586 1.661155 -0.584634 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 27
27 Q6NS38_28 0.471432 -0.425236 -0.964372 0.424321 -0.945197 0.167189 0.331158 -0.561426 0.814537 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 28
28 Q6NS38_29 1.930846 0.333399 -0.444912 -0.325185 1.230739 0.959118 0.331158 -0.017497 -1.038003 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 29
29 Q6NS38_30 1.356970 0.333399 -0.444912 -0.325185 -0.157423 -0.624740 -0.420290 0.821829 -0.584634 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 30
85634 P11746_257 -0.987983 0.333399 1.158225 -0.810906 -0.911272 1.872411 -0.420290 -0.017497 1.267906 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 257
85635 P11746_258 0.471432 -0.916874 1.677685 0.424321 -0.945197 1.472330 -0.907270 2.205084 1.608557 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 258
85636 P11746_259 -0.414106 -1.152540 -1.213375 -0.810906 -0.945197 0.959118 -1.236446 -0.561426 1.267906 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 259
85637 P11746_260 1.356970 -0.916874 0.356656 -0.810906 -0.957874 0.167189 -0.907270 -0.561426 1.267906 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 260
85638 P11746_261 0.471432 -0.425236 1.926688 -0.810906 -0.962565 0.167189 -0.907270 -0.017497 1.485228 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 261
85639 P11746_262 1.356970 -1.152540 -0.964372 -0.810906 -0.945197 0.959118 -0.907270 2.205084 0.814537 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 262
85640 P11746_263 1.930846 -1.249199 1.158225 -0.325185 -0.822838 0.167189 -1.273175 1.661155 0.114952 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 263
85641 P11746_264 1.356970 -1.249199 1.158225 -1.043736 -0.822838 0.959118 -0.907270 0.821829 1.267906 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 264
85642 P11746_265 -0.414106 -1.152540 -1.315504 -1.043736 -0.962565 0.167189 -0.907270 -0.822158 -0.584634 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 265
85643 P11746_266 -0.414106 -0.916874 1.158225 -0.810906 -0.822838 0.959118 -0.907270 -0.822158 1.608557 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 266
85644 P11746_267 -0.987983 -1.152540 -0.964372 -1.175865 -0.911272 0.167189 -1.140703 -0.561426 0.114952 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 267
85645 P11746_268 -0.414106 0.333399 2.067995 0.424321 -0.157423 1.718338 -0.420290 -0.017497 1.267906 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 268
85646 P11746_269 1.930846 -1.249199 -0.444912 -0.810906 -0.822838 0.959118 -0.907270 -0.017497 1.608557 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 269
85647 P11746_270 1.356970 0.333399 0.356656 0.424321 -0.607225 1.472330 1.569586 -0.822158 0.814537 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 270
85648 P11746_271 -0.987983 1.092035 1.158225 -0.325185 -0.607225 1.857947 -0.907270 2.205084 1.485228 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 271
85649 P11746_272 -0.414106 -0.916874 2.067995 -0.325185 -0.822838 1.472330 -0.420290 -0.017497 1.485228 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 272
85650 P11746_273 1.930846 -0.425236 -0.444912 -0.810906 -0.957874 1.472330 -0.420290 -0.822158 -1.038003 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 273
85651 P11746_274 1.930846 -1.249199 -0.964372 -0.325185 -0.911272 0.959118 -0.907270 -0.017497 -0.584634 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 274
85652 P11746_275 -0.414106 0.333399 -0.964372 -0.810906 -0.945197 0.959118 -1.236446 -0.561426 1.574364 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 275
85653 P11746_276 -0.414106 -0.916874 0.356656 -0.325185 -0.957874 1.872411 0.331158 -0.017497 0.814537 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 276
85654 P11746_277 -0.987983 0.333399 1.158225 0.424321 -0.957874 1.872411 0.331158 -0.561426 1.267906 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 277
85655 P11746_278 -0.414106 0.333399 0.356656 -0.325185 -0.945197 0.959118 -0.907270 1.661155 1.267906 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 278
85656 P11746_279 -0.987983 -0.425236 -0.444912 -1.043736 -0.911272 0.167189 -1.236446 -0.561426 0.114952 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 279
85657 P11746_280 -0.987983 -0.425236 1.158225 -0.325185 -0.945197 1.857947 -0.420290 -0.017497 1.485228 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 280
85658 P11746_281 -0.414106 -1.152540 0.356656 0.424321 -0.945197 1.718338 1.803020 -0.561426 1.267906 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 281
85659 P11746_282 -1.263070 -1.286279 -0.444912 0.424321 -0.957874 1.819240 0.331158 -0.970122 0.114952 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 282
85660 P11746_283 -0.987983 0.333399 1.158225 -1.043736 -0.962565 1.872411 -0.907270 -0.822158 0.814537 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 283
85661 P11746_284 -0.987983 1.583672 -0.964372 -1.139232 -0.945197 1.872411 0.331158 -0.822158 0.814537 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 284
85662 P11746_285 -0.987983 -0.425236 2.028817 -0.325185 -0.945197 0.959118 -0.907270 2.613780 0.114952 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 285
85663 P11746_286 -0.987983 -1.152540 0.356656 -1.139232 -0.945197 1.877763 -0.420290 -0.561426 1.574364 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 286
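The column-wise normalization used in this section can be checked on a toy table. The sketch below assumes two made-up columns on different scales and confirms that each standardized column ends up with mean close to 0 and sample standard deviation close to 1. The column names and sizes are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
# Toy feature table (assumption): two columns on very different scales.
toy = pd.DataFrame({
    "a_pssm": rng.uniform(0.0, 1.0, size=200),
    "b_count": rng.normal(50.0, 10.0, size=200),
})

eps = 1e-5  # same small constant as in the notebook's normalization code
# pandas .std() uses the sample standard deviation (ddof=1) by default.
toy_norm = (toy - toy.mean()) / (toy.std() + eps)

print(toy_norm.mean().round(6))
print(toy_norm.std().round(3))
```

Because of the added constant, the resulting standard deviation is slightly below 1 rather than exactly 1.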

Analysis I: Measuring the variation of significance among PSSM features

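We want to see whether certain evolutionary patterns influence the DNA-binding mechanism. We apply Recursive Feature Elimination (RFE) to rank all our PSSM-based features by predictive power, using a linear SVM (that is, an SVM with a linear kernel) as the estimator. In the resulting ranking, a value of 1 marks a feature retained in the final subset; larger values mark features that were eliminated earlier.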


  • Guyon, I., Weston, J., Barnhill, S., & Vapnik, V., “Gene selection for cancer classification using support vector machines”, Mach. Learn., 46(1-3), 389–422, 2002.
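A minimal sketch of RFE with a linear SVM, run on synthetic data rather than the PSSM features (the data, sample count, and feature count are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic data (assumption): 20 features, of which 5 carry signal.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_redundant=0, random_state=0
)

# A linear kernel exposes coef_, which RFE uses to decide which feature to drop.
estimator = SVC(kernel="linear")
selector = RFE(estimator, n_features_to_select=5, step=1)
selector.fit(X_demo, y_demo)

# ranking_ is 1 for every retained feature; higher values were eliminated earlier.
print(selector.ranking_)
print(selector.support_)
```

With `step=1` one feature is removed per iteration, so 20 features and 5 survivors give ranks 1 (five times) and 2 through 16.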

# extract PSSM features and labels for the first 1000 rows
X = dna[dna.columns[1:-6]]
X = X[[c for c in X.columns if 'pssm' in c]]
X = X.iloc[:1000]
y = dna['class_bool'].iloc[:1000]

# apply RFE with a linear C-SVM, keeping 5 features and dropping one per step
estimator = SVC(kernel="linear")
selector = RFE(estimator, n_features_to_select=5, step=1)
selector = selector.fit(X, y)

print(selector.ranking_)

[ 5 36 72 79 57 41 49 61 65 64 76 40 44 33 95 78 69 51 68 48 26 82 86 38 35
  8 55 52 43  1  1  2  3 39 28 96 37 93 77  1 15 89 46 10 12 19 25 75 24 14
 67 30 23 32 74 94 17 21 66 34  9 22 85 84 80 20 45 31 47 13 88 73 11 81 54
 58 27 90 62  6 91 59 63 56 50  7  1 83 92  4 29 42 16 87 18 70 71 53 60  1]

# repeat the previous routine over the whole dataset, in chunks of 1000 rows

pssm_rank = pd.DataFrame()
cat = dna['class']
pssm_cols = [c for c in dna.columns[1:-6] if 'pssm' in c]

# integer division: rows beyond the last full chunk are not used
for i in range(dna.index.size // 1000):
    rows = slice(i * 1000, (i + 1) * 1000)
    this_cat = cat.iloc[rows]
    # the classifier needs both classes present in the chunk
    if this_cat.unique().size > 1:
        X = dna[pssm_cols].iloc[rows]
        y = dna['class_bool'].iloc[rows]

        estimator = SVC(kernel="linear")
        selector = RFE(estimator, n_features_to_select=5, step=1)
        selector = selector.fit(X, y)
        print(selector.ranking_)
        pssm_rank[str(i)] = selector.ranking_
pssm_rank.index = pssm_cols

[ 5 36 72 79 57 41 49 61 65 64 76 40 44 33 95 78 69 51 68 48 26 82 86 38 35
  8 55 52 43  1  1  2  3 39 28 96 37 93 77  1 15 89 46 10 12 19 25 75 24 14
 67 30 23 32 74 94 17 21 66 34  9 22 85 84 80 20 45 31 47 13 88 73 11 81 54
 58 27 90 62  6 91 59 63 56 50  7  1 83 92  4 29 42 16 87 18 70 71 53 60  1]
[59  9 48  7 65 33  4 15 76 49 14 13 64 51 19 28 31 45 95 11 55 82  6  2 66
 17 32 84 61 56 10 86 91 68 29 57 36 26 78 60 22 93 85 38 92 43 39 21 46 35
  1 37  5 70 52  3  8 83 88 41 23 54 12 20 90 94 77 47 30  1 18 62 75 79 71
 25 72 69 81  1 53 34 24 67 50 96 58 80 27  1 40 89 44 63 42 16 74 73 87  1]
[26 67  2  1 60 56  8 72 22 25  7 23 81 34 70 24 71 38 93 88 69 42 28 78 61
 85 27 39 76 57 18 91 36 84 16 94 83 35 58 54 66 30 95 86 65 48  4  1 10 49
 59 45 79 41  1 75 33  1  9 29 17 12 37 63 89 47  3 14 53 74 43 13 50 46 52
 19 96 68 77 11  5 80 55 21 82 90 20  1 51 73 62 15 64 40 44  6 87 31 92 32]
[53 50 38 24 73 58  1 52 49 36 23 92  8 69 74 87 65 18 14 35 10 17 42 86 33
 71 68 41  1 61  4 16 55 40 72  6 89 76 47 37 12 43 46 96 44 26  2 93 66 32
 62 85 31 15 91  1 51 45 67 94 21  7 77 64 30 84 34 88 56 29 83 82 60 57 11
 20 22 79  9 13 90 28 78 54  1 63 39 19  3 95 75 27 80 25  1 70 48 59 81  5]
[17 64 53 33 93 84 62 31 85 73 87 90  5 39 10  1 44 40 91 74 20 76 19 14 29
 89 82 96 55 81 11 16 67 69  9 18 58  8 70 30 79 80 65 72 15  6 35 83 21 51
 54 61  1  2  7  1 71  1 24 23 27 45 92 52  3 78 77 86 43 94 34 59  1 28 25
 26 66 38 63 22 75 13 49 37 32 42 60 88 41 36 95 46 57 12 68 47 48  4 50 56]
[16 29 28 20 76  6  1 13 87 38 89 54 37 56 55 35 64 53 39 90  5 25 23 73 96
 42  4  8 59 34 93  9 17 12 68 66 86 10 85 27 47 63 57 74 78 80 11 70 51 60
 52  3 46 92 30  1 19 21 33 48 88 91 67  7 75 49 61 65 44  1 58 50 26  1 95
 32 18 41  2 81 40 14 77 72 82 15 45 36 24  1 83 84 31 71 79 22 62 43 69 94]
[ 4 22 38 25 18 19 10 30 85 95 27  7 67 68 92  1 54 63 61 84 20 90 73  6 59
 32 28 26 82  1  1  5 51 49 11 37  2 66 79  1 52 13 21 17 78 91 39 36 23 93
 50 15 60 65 69 24 83 89 87 42  3 48 12 86 94 75 33 47 64 35 31 46 96 71 58
  1 72 40 41 80 43  8 74 34 53 76 70 16 88 56 44  9 55 81 45 14 29 62 77 57]
[53 71 93 92  6 15  1 70 51 66 80 16 75 38 61  5 84 17 58 72 52 78 83 59  4
  1 12 87 55  1 96  9 46  1 77 82  8 89  3 34 63 37 76 57 25 32 29 86 81 79
 60 64 11 14 62 31 36 67 56 35 10 74 47 95 54  2  1 65 26 33 43 48 91 44  7
 73 49 94 90 50 13 24 28 41 39 19 21 88 42 69 30 23 18 27 40 45 20 85 22 68]
[93 69 47 16 64 37 38 18 60 14 61 94 42 23 58 92 81 33 59 95 62 24 84 57 79
  2 56 86 49 54 32  1 40 83 89 77 67 12 87 51  9  8 76  1 75 52 71 39  1 31
 78 10 63 91 66 19 21 26 27 30 85  1 70  4 11 73 96  3 17 34 45  1 44 90 53
 35 36 20 72 74 46 29 65  6 80 88 25 55 41 50 48 28 82  7 43  5 13 15 22 68]
[55  6 83 79 34 15  5 76  4 26  3 17 19 14 56 35 74 46 29 30 42 77 24 23 45
 47  2 94 25 60  1  1 80 87 71  1 52 16 41 31 20 43 78 67 75 68 10 21 65 69
 39 11 88 38 92 48 91 81 32 93 62 22 44 89 73 85 63 53 70 96 12 82 13 95 33
 86 64 90 18 51 54 84  8 40 27  1 72  9  7 61 50  1 49 57 66 36 37 28 58 59]
[79 25 95 10 54 12 22 87 81 86 78 61 40 57 83  7  1 49 47  2  4 92 55  8 33
 28  9 19 29 30 82 24 20 73 53 85 60 88 64 68  1 84 18 94 45 39  6 34 80 38
 46 31 11 62 59 51  1 50 65 89  5 70 15 43 23 44 17 36 41 56 58 48 32 26 27
  1 35 72 37 13 67  3 91 66 69 21 77 90 76 16 63  1 52 96 42 93 71 75 14 74]
[68 23 41 50 44 90 14 29 30  1  1 93 11 20  8 92 35 91 59 38 26  2 76 89 65
  9 55 53 61 32 10 12  1 13 74 31 73 94 51 84 63 66 87 81 33 27 49 28 40 21
 17 43 22 86 70 62 57 67 34 36 83  1 24 60 39  1 77 80 25 58 52  4  6  5 96
 19 45 88 75 95 56  3 48 47 16 82 46 85 64 71 79 42 15 78 37 69 18 72  7 54]
[48 88  5 33 11  6 41 76 43  3 77 72 65  9  7  1 91 78 75 79 45  1 30 12 17
 52 24 60 93 21 34 25 13 58  1  1 56 19 89 27 44 28 16 51 90 86  4 40 70 92
 54 31 26 10 47 50 87  8 71 42 29 14 95 57 55 32 85 53 82 96  2 66 74 81  1
 15 37 39 46 59 80 23 68 18 83 67 22 69 84 61 38 94 36 64 20 62 63 49 73 35]
[76 77 16 15 88 82 54 40 20 29 14 56 93 24 66 13 12 58 61 50 22 33  1  1 45
 79  4  3 81 39  1 96 23 78 47 18 19 60 65 21 11 34 91 48 68  1 67 49 72 43
 36  2 95 32 53 90 52 31 25 44 74 41 83 27 85  8 26 59 46 57 37 73 89 71 30
 94 55  7 42  5 17 92 75  6 84 38  1 35 80 69 10 51 87 86  9 70 63 28 64 62]
[96 47 27 26 78  5  8 39 95 53  3  9 65 84 43 64 46 70 60 29 17 51 24 81 22
 14  4 19 77  1 58 15 87 11 62 42 20 79 12 86 68 40 83 69 74 35  1 94 49 89
 21  1 23 56 91 41  6 55 75 44 92 28 54 16 36 71  1 57 30 76 32  1 31 34  7
 50 59 52 33 25 90 93 38 88 48 85 10 80 82 73  2 18 13 37 72 45 67 61 63 66]
[38 85 87 58 21 23 18 80 42 40 37 14 16 66 89 39 28 62 60 13  2  6 45 36 26
 90 15 82 63 22  1 69 43 27 53  1 20 86  9  1 17 19 68 32 59 10  8 96 49  4
 25 91 47 75 65 95 31 92 52  1  3 24 64 48 29 44  7 94 93 56 88 77 81 35 72
  5 51 57 55 12 79 74 11 61 33 73 50 84 34 71 83 67 78 41 30  1 46 76 54 70]
[69 92 48 61 55 46 18 14 51 75 43 45 47  3 12 54 90 32  9 84 70  1 89  1 10
  1  1 86 52 85 16 19 20 22 29 35 39 34 15 57 66 49 87 64 28  6  8 63 80 71
 53 50 44 27  2  4  5  7 59 65 95 37 67 26 73 58 13 91 81 41 76 17 72 79 38
 40 94  1 24 42 74 23 62 82 60 56 21 96 31 78 11 36 30 83 68 33 88 25 93 77]
[29 16 13 88 61 46 49 11  6 31 30 87 73 86 78 71 18 39 58 32 44  1 96 23 10
  1 83 14 53 21 20 70  2 79 72  7 51 64 43 48 42  5 91 69 22 80  3 68 50 34
 36 57  1 55 89  4 67 93 56 26 12 84 65 54  1 77 19 25 82 35 74  9 24 17 59
 75 92  8 52 33 37 38 47  1 60 40 81 90 27 41 62 15 76 94 28 45 95 63 66 85]
[67  8 16 71 25 38 17 15  7 58 79 68 74 39 90  1 45 49 80 46 35  1 11 48 53
 28 12 52  4 18  1 20 23 37  1  9 51 94 43 34 55 84 13 10 40 31 59 27 82 91
 26 47 22 21 88 62 19 33 32 78 72 63 14 89 92 64  1 56 75 69 85 95 57  2 73
  5 81 24 41 70 30 93 54 66 77 86 44 87 60  3  6 42 96 36 65 29 76 50 61 83]
[ 3 47 17 71 11 19 37 80 14 15 48 96 88 53 20 66 34 77 52 24 13 32  2 29 25
 30  1 87 81  1  1 69 36 60 64 79 68 91 62 12 57 90 26 40  1 61  6 73 54  5
 95 42  7 44 86 27 46 50 65  1 41 28  4 94 18 33 21 49 93 22 72 75 74 45 58
 39 84  8 43 89 70 55  9 56 51 83 23 31 35 38 59 16 85 76 82 10 63 78 92 67]
[63 74 16 73 58 57  1 25 56  9  1 86 67 95 49 26 84 27 68  7  3 66 76  2 46
 20 13 17 62 29  1 19 59 18  8 12 54 85 83 91 34 10 41 24 87 88 39 11 77 22
  1 23  5 15 75 38 44 92 14 21 35 89 96 28 78 90 50 55 82 53  1 93 52 36 51
 33 32 81 47 60 30 80 65 64 61 70 79 45 69 71 48 37  6 40  4 94 31 43 42 72]
[70 22 78 23 95 83  1 51 15 47 21 37 77 18 71 81 27 65 96 38  1 30 93 29  1
 10  2 87  7 20 13  1 19 41 82  1 88 84 55 89 42 74 85  4 92 94 48 11  9 54
  6 59 25 57 66 16 28 90 58 40 60 62 44  8 80 31 32 67 56 52  3 72 50 46  5
 33 34 12 64 53 91 17 86 61 68 63 73 24 49 76 79 26 69 14 39 36 35 45 43 75]
[ 2 84 63 25 58 43  4 62 52  1 75 79 74 87 35 27 13 20 53 26 22 78 50 39 48
 83 10 28 65 77 46 32 15  9 76 70 21 16 17 66 47 12 85 23  1 68 24 19 40 38
 14 81 37 60 11 45 93 59 54 94 91 33 18 73 90 31 41  8 96 61 42 34 69 29  3
 55  6 44 72 95 89 49  5 64 67 80 88 56 57  1  1 71 36 92  1 51 86 82 30  7]
[12 43 44 39 52 13 30 15 81  7 83  3 59 68 84 47 25 35 82  9  1 86 42  1 26
  5 10 40 88 85  1 61 76 63 14 16  1 54 21 75 94 67 33 58 62 19 60 29 87 74
 77 32 78 24 79 55 28 18 20  8 90 64 72 57 66 22 93 23 71 53 31 95  4 73 17
 27 34 41  6 92 91 70 37 96 48 11 69 89 45 65 80 49 36  2 38 46 51 56  1 50]
[73 49 50 52 30 19 18 29 85  1  2 54 91 70 34 94 36 40 77 57  1 79 23 67 43
 41  1 46 24 66  1  4 75 25  3 72  1 64 86 65 44 82 74 33 13 78 12 88 11 69
 63 89 80 27 47 62  8 26 48 76 28 71 90 96 38 51 16 45 95 14  9 59 81 20 10
 58 22 39 17 15 21 35 42  7 56 55  5 84 93 68  6 61 92 32 87 83 53 60 31 37]
[72 12 39 29 20 51 16 36 35 84 71 48 69 87 96 89 22 32 41 33 42 91 52  4  1
 13 73 40 11 94 83 67 25 37 95 93  1  8  6 34 60 77 53 14 38 74 55  2 68  1
  3 45 43 31 63 65 23 46 57  1 66 44 78  1 76 62 18 85 79 50 49 19 64  7 26
 56 59 92 10 90 15 28 24 17 47 70 75 54 27 88 21 81  9 82 80 58 61 30  5 86]
[ 8  2 59 86 92 11 85 15 53 33 39  1 18 84 77 88 87 70 51 58  1 68  3  1 45
 83 57 52 73 38  4 65 61 55 64 40 78 89 43 44 49 66 46 62 25 93 34 22 82 28
 37 60 30 91 50 41 95 81 17 23 94 67 29 21  7 69 26  1 14 90  1 36  6 80 24
 31 63 79  9 74 56 32 10 19 54 75 20 16 35 76 42 12 71 13  5 47 48 72 27 96]
[15 37  3 25 45 86 30 84 90 73  2 63  5 46 52 70 35 79 75 16 64  9 48 58 11
 18  1 56 68 39  1 20 44 21 62  1  1 34 22 38 81 40 27 66 95 36 23 43 76 78
 14 85 96 54 51 80 41 10 53 49 61 67 26 31 93 32 33 88  8 87 28  7 82 50 42
 57 19  1 89 83 47 91 77 94 74 60 24 59  4 92 72 17 71 12 29 13 55 69 65  6]
[80 22 56 60 41 10 76 73 25 95 24 23 27 15 66 20 94 14 59  8 67  1 33 34  3
 90  1 37 31 64 52 69 57 17  4 53 72 21 70  2 50 40 84 87 58 96 88 49 38 85
 46 89  1 92 35 77 11  6 12  1 18 30 13 29 16 65 83 91 54 61 45 75 47 86 79
 51 81 32 28  1 39  7 42 93 26 55 48 36 78 74 63  9 62 71 82 43 44 19 68  5]
[87 72  3  1 75 91  1 88 57 10  1 55  9 22 35 64 27 17 23 85 30 93 45  4 95
 46 14 48 89 71 70  1  5 54 80 47 92 81 74 12 59 49 66 31 21 78 32 76 41 50
 40 63 96 39 33 42 37 16 73 28  2 61 29 20 62 56 19 13 79 67 34 60 15 24 18
  6 52 82 43 11 36 94 38  8 53 58 90 26 77  7 69 86 25 83  1 68 44 65 84 51]
[82 24 13 41 94  4 56 64 16 78 95  5 93 27 54 31 30 29 52 47 14  7  3  1  9
  8  1 68 75 11  1 43 90  2  1 37 10 61 83 77 59 67 96 84 18 85 15 23 17 81
 19 65 57 46 79 58 76 62 49 48 88 26 51 86 44 28 72  6 42 21 34 20 87 74 32
 66 35 70 36 92 80 91 45 40 55 63 12  1 53 38 69 73 71 60 50 22 25 89 33 39]
[77 70 64 83 59 50 16 61 13 41 24 15 88 39 72 92 43 28 12 44  6  2  8 27 19
  9  1 26  5 62  1  7 96 36 91 82 49 20 18 21 22 53 56 94 14 47 45 46 30 68
 29 54 60 73 78  4  3 11 55 75 69 79 84 23 76 40 63 93 52 31 25 81 37 51 85
 48 35 38 66 32 65 87 95  1 86 90 42 10 58 33  1 17 57 71  1 74 80 67 89 34]
[15 36 53 57 42 58 20 85 32 51 79 40 82 18 47 64 14 45 68 87 10  1 96 19  1
 94 90 55 12 91  5 17 88 52 72 93  1 81 43 56  8  6 89 33 25 41 78 71 34 83
 31  4 59 73 69  7 21 63 76 39 95  1 60  9 80 61 86 66 54 48 49  2 26 27  1
 13 30 35 28 67 24 92 50 38 16 37 65 75 84 46 77 11 22 62 74 23 29  3 70 44]
[11 17 43 93 78 91 33 34 90 50 73 18 87 79 23  1 32 31 29 72  1 63 46 47 74
 84 92 28 62 64 20 96 55  3 41  1 12 44  4 54 30 60 76 15 42 52 51 82 59 89
 69 25 57  6 35  1 13  9 58 10  2 86  8 24 16 61 21 40 88 19 77  1 71 27 75
  5  7 39 80 22 48 65 68 45 36 14 94 56 53 95 83 66 38 49 85 37 67 70 81 26]
[ 5 52 82 53 65 83 34 69 26 57 29 36 81 89 17 45  1 39 64 63 10 43 62 95 49
 84 14 68 50 79  6 15 32 60  3 77 16 47 33 74 31 24 70 78 76 73 22 19 72  4
 38 23 92 48 35 94 13 75 40  7 80 46 91 86 61 18 11 42 85  1  1 12 59  9 54
 56 90 96  8  2 88 25 66 87 58 44 21 30 37  1 71 41 55 28 27 67 93 51 20  1]
[ 6 68 48 50 64 77 30 74 15 76 17 59 57 78 28 36 49 18 53 40  2  1 38 43 95
  1 22 69 58 60  1 61 75 45 89 46 65 16 47  7 33 12 62 88 79 34 42 93 44 35
 54 39  3 37 85 32 31 92 90 96 19 84 55 67 80  1 56 83 41 26 11  1 70 27 82
 24 86 91 14 13 94 23  8 20 71 21  4 81 29 10  5 87 25 63 52 73  9 66 72 51]
[10  4 37 15  3  2 96 35 18 60 77 42 95 92 87 16 33 53 75 58 13 72 31 47  6
 21  1 93 82 20  1  1 85 63 90 79  9 64 26 88 86 23 62 73 28 38  8 94 80 22
 12 40 45 61 19 46 49 27 17 54 81 51 41 89 24 57 32 39 84 56 29 59 25 74 55
 48 52 67 66 78 50  7 76 34 11 44  5  1 65 36  1 14 43 69 91 30 83 71 70 68]
[77 57 52 91 82 21 14 72  6 39  1 62 35 59 47 34 84 45 29 78 15  3 66 33  5
 43  2 25 56 74  1 68 71 67  8 37 12 41 69 88 61 53 19 51 23 55 16 76 90  1
 54 85 96 32 81 63 11 44 24  1 70 46 42 40 20 65 83 26 10 13  1 36 27  7 89
 75 22 31 48 28 87 18 50 38 73 64  9 94 17 79 60 93 80 92 95 58  4 86 30 49]
[52 76 67 96 93 38 63 82 44 95 14 51 70 11 71 20 87 69 43 21  3 19 79 34 16
 18  1 31 54 91  1  1  9 77  7  2 62 72 83 65 81 88 61 60 29  6 92 45 48 57
 55 59 28  1 35 47 46 32  1 15 75 86 58  5 84 33 64 68 50 80 36 90 37 78 12
 24 25 73 74 30  4 49 26  8 53 22 39 85 89 42 56 23 41 13 27 10 40 17 66 94]
[56 55 70 46 27 71  2 91 18 66 45  3  1  1 34 94 48 32 72 41 29 44 15 53  7
  4  1 77  9 61  8 90 38 21 16 68 78 13 25 60  5 73 22 33 11 47  1 93 64 50
 49 39 31  1 69 75 74  6 37 51 52 65 79 92 88 83 59 62 30 10 57 86 80 19 67
 54 76 20 82 12 28 17 95 23 63 87 24 26 14 58 36 96 40 43 81 85 35 84 42 89]
[88 54 89 87 47 78 13 20 46 96 50 85 76 23 29 11 25 65  1 32 74 45 68 10 66
 39 49  6 69 27 26  9 41 33 93 81 92  2 35 95 24 55 38 42 19 18 61 16 28 58
  4 56  1 44 84 62 15 71 82 34 14 63 79 30 91 53 94 37 67 12 22 60 21 73 80
 43 59 77 40 83  1  1  7 70 86  8 75 90 31  1 51  3 72 52 36 48 57 64 17  5]
[ 9 76 57  3 46 62 14 29 23 16  7 96 92 31 20 52 13 84 79 17 56  4 74  5 61
 91 66 10  8 45  1 38 39 22 41 68 63 35 30 44 70 69 36 15 53 34 25 11 12 51
 19 88 60  1 81 47 95 64  1 54 27 55 50 86 83 94 73  1 78 49  2 82 48 21 42
 72 71 80 32 37 24 75 59 65 89 77 67  1 58  6 28 85 40 18 87 33 90 93 43 26]
[89 29 46 72 93 48  4 31 96 34 23 24 83  9 54 45 50 78 76 84 19  1 42 15  8
 58  1 88 55 51  1 56 52 20  2 70  7 47 69 14 25 62 32 49 77 28  1 30 71 68
  1 90 10 92 39 13 79 73 85 86 75 91 80 95 41 82  6 21 67 57 74 38 22 16 18
 87  3 33 81  5 65 63 35 64 44 27 26 17 66 12 36 40 37 59 61 60 53 94 43 11]
[18 40 61 27 94 36  1 70 72 48 17  3 96 43 39 60 71 35 91 88 59 82 42 32 87
  8  5 64 19 52 30  4  1 13 12 22 54 84  9 51 81 38 45 44 83 10 90 29 89  6
 79 37 86 34 74 53 46 31 11 55  7 62 95 85 73  2 28 25 77 16 76 63 80 65 58
 57 47 69 23 75 49 93 78 21 68 67  1 92 15 50 26  1  1 20 66 56 33 41 14 24]
[93 25  5 75 50 21 23 61 67 37 32  1 81 24 83 58 48 44 35 40  1 57  4 56 27
 17  3 46 41 74  1  7 62 14 33 20 10 47 15 73 66 22  8 68 18 64 49 55 19 86
  2 43 13 11 60 36 92 16 12  6  1 88  9 80 72 95 31 52 76 30 28 65 82 96 34
 42 87 51 71 63 39 79 85 69 91 38 26 53 78 77 89 45 90 94 84  1 29 70 54 59]
[72 32 81 57 33 91  1 37 29 90 61 83  1 95 36 58 89 55 54 56 52  9 43 62 63
 92  8 48 13  1 41 84 12 74 79  6 60 78 75  1 47  1 65 35 80  5 19 93  4 14
 15 76 39 87 53 24 77 46 86  2 51 17 50 49 85 68 16 94 25 26 66 40 34 88 28
 67 23 22 73 27 42 10 82 20 38 59 18 69  7  3 45 11 71 64 96 44 21 31 30 70]
[73 62 45 90 88 82  5 11 16  6 24  8 77 31 69 95  1 63 28  4 46 84 89 32 13
  2  7 94  1  9 61 85 79 56 52 15 86 70 92 91 39 75 83 60 25 93 71 55 50 26
 43 65 36 67 51 10 47 42 34 87 66 29 35 49 54 37 22 78 74 12  3 27  1 33 64
 14 76 96 17 72 44 68 19 30 23 38 80 40 57  1 58 20 48 41 53 18 21 81 59  1]
[94 20 68 61 96  3 35 78 49  1  1 72 69 43 95 42 12 51 65 54 28  1 71 81 84
  6 21 26 62 18 48 66 50 70 46 16  2 90 17 15 67  7 83  4 19 11 77 88 82 33
 34 41 85  9 89  1 86 56 73 93 58 22 87 52 59 38  8 45 40 75 37 32 64 63 80
  1 74 55 39 91 44  5 36 92 30 57 14 60 25 13 47 10 29 23 76 31 53 79 24 27]
[55 38 53 95  1 70 11 69 65 76 49 14 79 59 30 86 83 18 75 87 35 62 68 40  1
 71  1 94 80 93 23  1 81 41 51 88 67 58 73  7 32 21 60  3 12 43  5 31  2 96
 63  4 91 57 34 19 61 20 56 10 84 39  8  9 74 48 24 77 90 29 54 28 27 42 78
 17 72 33 36 13 64 85 47 37 92 22  1 46 50 25 16  6 82 52 66 44 45 15 89 26]
[17 84 45 42 44 35  2 34 70 56 21  3 79 52 23 73 74 47 89 92 69 38 66 65 41
 83 15 19 90 32  4 16 64 95 93 96 80 59 94 39  1 20 43 58 40 87 10 26 54 57
 78 22 91 82 85 88 27 77  6  9 61 31 24 71 68 11 53 30 25 33  1 55  1 51 63
  7 46 37 14  1 62 67 81 72 29 50 49 36  1 76 28 60 86 18  8 48  5 12 13 75]
[94  9 73 14 51 40 15 86 13  3  1 93 47 56 96 49 50 37  8 67 69 16 10 45  1
 78  2 81 91 22  1 76 79 12 31  1 30 43 11 23 24 64 55 44 29 66 70 54 39 57
  5 19 82 58 90 74 71 18 27 85 21 28 95 41 36 92 46 84 68 83  4 77 38 88 25
 80 32 17 35 53 52 60 42 72 87 33 75 34 26 48 62 59 63 61  6 65 89  1 20  7]
[17 71 48 45 34 81  9 20 86 77 76 64 52 47 63 37 49 26 22 23 88 54  2 25  3
  1 24 21 69 62 41  8 33 56  7 78 12 36  4 91 95 18  1 79 29 75  1 57 32 16
 92 74 59 28 60 80 27 89 72 30 53 90 31 50 70 68  5  1 44 40 38 43 65 46 11
 35 10 73 96 84 87 61 13 58 83 19 14 82 93  6 66 15 67 42 51 39 55 85  1 94]
[19 25 81 72 69 44 55 18 60  1 34 79 15 73 26 71 66 68 61 40 10 33 84 94 90
  1  1 62 27 38 36  3 14 85  6 46  7 35 64 20 49 52 80 63 29 82 41 78 70 28
 21 30 77 42 67 13 16 96 37 56 53 59 31 17 91 95  4 88 93  8 45  2 48 12 54
 32 51 58 75  9 65 89 23 50 57 47 74 24 11  1  1  5 43 92 86 76 87 39 83 22]
[91 18 43 15 85 63 58 40 20 56 75 34 12 13 55 21 22 66  1 33 93 49 96 57  1
  9 53 92  2 83  3 10 52 46 36 38 69 79 47 78 80 71 59 31 25 88 82 42  5 90
 48 14 26 62 89 84 24 51 68 23 81 37 72 54  1 41 60 77 86 32  6 19 35 74 11
 67 27 95 17 64 50 73 16 76 70 39  1  4  8 65 29  1 61 94 28  7 44 87 45 30]
[34 17 84 81  6 38 35 90 20 80 41  4 82 93 76 69 71 56 78 43  1 30 32  9 75
 70 39 36 83 42  3 79 48 53 16  5 45 72 95 91 89 37 52 27 61 77 18 94 51 68
  8 31 26 54 46 19 11 67 65 88 73 15 57 28 62 40 23  1 12  1 29 87  1 86 55
 14 74 49 22  2 66 58 64 63  7 96 21 25 33 59 24 50 13 44 47 92  1 85 10 60]
[85 62  6  1 83 30 31 67 64 32  7 88 63 43 37 96 12 74 70 15 16 13 80  5 36
 76 10 20  4 33  1 40 14 79 91 41 39 27 51 77 82 75 90 48 55 86  1 72 54 78
  1  3 81 34 92 89 18 35 53 93 95 47 45 17 22 68 11 50 42 73 65 49 24 58  8
 19 66 46 57 38 87 94 61  1 21 29 71 25 23 52  2  9 28 56 84 60 26 44 59 69]
[14  1 61 29 91 67 90  9  7 23 20 43 55 37 45 86  4 87  1 56  6  1  8 76 95
 58  1 17 11 22 82 88 75  3 94 53 33 71  1 69 66 31 60 74 83  5 18 68 59 52
 21 32 34 41 28 80 47 10 12 48 85 92 78 16 89 72 79 84 39 65 13 30 54 27 44
 26 25 42 24 57 63 40 64 96 51 35 62 93 36 73 81 49 46 70 77 19 50 15 38  2]
[48  1 79 38 45 44 49  6 11  5 41 61 57 68 12 27  1 30 92 66 88  7 75  1 83
  2 43 22 77  4 72 89  1 78 93 71 16 36 94  3 69 26 13 14 52 73 81 24 32 25
 87 29 70 46 90 76  1  9 60 28 23 62 91 85 74 80 67  8 17 54 55 10 33 65 35
 82 21 47 95 96 15 37 19 20 58 86 59 42 53 31 64 34 40 50 84 39 51 63 18 56]
[84  5 79 42 14  6 87 96 64 61 90 16 15 92 85 17 20 82 80 71  8 38 59 57 23
  4 39 28 95 34 76  7 93 73 77 29 83 19 18  1 33 24 65 72 48  1 81 12 75 44
 43 86 74 60 46 22 25 10 11 67 45 56 70 63 78 58  3 30 53 27  1 37 66 91 13
  1 62 21 47 26 40 32 51 50 49  2 69  9 35 55 31 94 68 88  1 41 89 36 52 54]
[37  8 57 61 49 48 22 17 19  4 88 10 41  1  1 39 13 56  5 78 64  6  1  1 70
 14 60 46 59 40 80 31 30 76 69 95 58 82 26 42 51 81 28 36 87 86 27  3 85 89
 63 62 77 90 73 65 84 53 96 91 68  7 33 12 15 20  1 71 44 54 52 32 67 75 25
 35 94 43 16 55 29 50 66 38 11 24 18 34 21 74  9 72 79 45 23  2 47 93 83 92]
[ 3 85 26 50 70 45 80 73 40 42  1 13 21 66 39 81 16 84 12  2 57  1 48 63 91
 24  4 38 46 89  1 76 60  8 10  1 20 28 27 33 23 95 29 90 32 83 72 86 14 11
 88 44  6 18 15 71 43 79 93 67 49  5  7 19 41 74 94 55 65 37 82 64 35 36 78
 62 54 56 92 59 52 31 25 17 69 30 75 53 77 96 87 68 61  1 22 34 58 47  9 51]
[ 3 40 49 14 48 64 46 11 15 22 28 62 21 55 37  5 93 71 47 86  1  4 25  7 19
 52 51 72 30 63 60 78 18 84 23  1 58 44 68 82  6 89 24  9 26 66 29 70 92 61
 73 77 90 59 74  2 33 50 20 80 79 83 35  1 42 91 81 53 36 32 76 12 31 75 41
  1 94 10 17 34 85 96 65 13 95 69 54  1 87 57 45 38 43 27 67 16 39 88 56  8]
[ 6 62 81  7 86 63 10 27 16 94  1 44  1 37 46 52 39  8 28 76 21 78 51  1 55
 68 15 77 96 95  1  9 24 73 26 20 19 90 71 18 40 30  4  1 70 67 83 22 13 33
 35 32 91 92  5 41 61 14 34 74 88 93  2  3 38 31 59 47 84 79 50 65 29 25 85
 49 48 36 43 54 45 66 12 11 17 57 56 69 72 89 60 23 87 80 58 53 42 75 64 82]
[17 66 34 67 75 19 76 20 36 96 79 15 91 44 25 35 12 74 53 71  3 29  1 83 60
  2 10  1  1 18 61 21 31 27 80 70  5  4  8 62 64  7 30 90 57 28 58 24 86 93
 22  6 47 11 65  9 95 42 52 49 54 84 81 26 87 45 94 92 63 50 33 59 37 82 69
 73 23 48 89 85 40 72 14 55 32 13  1 39  1 41 51 38 43 78 68 16 77 46 88 56]
[ 7 80 52 38 23 81 62 53 91 17 22 31 29  6 66 87  1 30  9 13  4 10 25  8 48
 95 61 32 75 50  1 82 57 20 93 24 56 43 76 65 33 60  1 59 28 73 16 64 96 72
 27 36 63 88 71  1 85 19 77 41 47 14 83 34 18 84 35 89 42 94 26 12 92 79 68
 39  5 51 40 69 70  3 78 55 86 11 90 58 45 46 67  1  2 54 49 15 74 21 37 44]
[58 18 30 73 71 11 10 86 77  2 48 20 81 66 42 46 79 26 23 55 67 33 89 65 32
  9  1 17 41 34  1 85 74 70 27 60 78 69 75 68 25 15 21  6 29  3  4 53 31 82
 24 39 94 83 95 47 43 93 91 22 45 80 44 13 19  8  5 87 92 64 14 16 57 96 84
 40 12 52 28 63 62 76 56 37 54 88 36 61 59 51  1 72 50  1 35  7 49 38  1 90]
[44 28 45 23 37 65  3 42 69  1 41  2 75 26 61  1 89 12 27  1 87 51  9  7 17
  1 29 83  8  4 96  1 39 72 59 10 13 31 16 64 11 84 18 67 94 19 78 22  6 80
 86 52 20 53 47 36 73 76 71 38 85 79 50 48 43 34 91 55 63 15 58  5 90 54 24
 25 92 95 82 88 35 60 14 77 68 32 46 62 57 66 74 33 21 70 81 40 93 49 30 56]
[36  7 13  1 24 18 85  4  1 35 22 93 77 88 39 63 69 57 43 92 26 17  3 12 11
 15 73 31 10 49 65 14 16 60 61 62 29  5 82 51 64 55 71 83  9 68 38 42 96  6
 25 21  2  8 76 45 19 44 67 87 52 78 70 20 32 47 84 56 89  1  1 50 86 90 79
 66 48 95 23  1 53 75 74 72 40 80 46 41 54 37 34 30 58 33 28 91 81 59 27 94]
[20  1 24 62 57  1 36 78 79 59 94 35 47 52 49 86 56 71 80 67  3  1 64 32 40
  1  7 83 60 30  1 29 21 43 75  2 39 76 73 13 42 48 87 88  9 15  4 95 28 45
 85 14  5 33 72 96 70 58 63 90 54 31 81 89 92 12 66 74 46 69 23  6 41 65  8
 53 10 34 68 26 50 11 55 16 38 93 91 84 77 25 27 22 61 51 37 17 18 44 82 19]
[43 38 93 94 29  1 52 36 27  9 74  1 56 39 35 79 88 89 64 69 13  2 70 37 40
  5  3 32 78 65 49 19  4 60 66 44 15 80 23 41 51 33 90 57 24  1 58 31 83 12
 20  1 28 46 73 63 45 76 11 72 50  7 87 85  6 30 81  8 59 47 48 61 92 95 84
 75 71 82 91 54 21 10 42 62 68 67 22 14  1 55 53 96 77 34 16 17 25 18 86 26]
[50 37 68 63 22 35  1 27 79 64 16  1 89 85  1 34 70 32 71 17 29 15 20 19 52
  7  2 10 66 77  4  1 69 13 90 31 30 14 65 88 56 41 12 72 43 92  1 74 42 23
  9 11 61 54 36 55 39 45 78 62 38 75 24 83 80  8 86 91 93 53 76 60 25 47  5
 67 57 81 21 94 44 40 73 58 82 96 18 95 33 46 87 26 28 49  3 59 48  6 51 84]
[89 80 79 86 52 78 11  7 53 39 15 29 51 19 38 69 42 56 20 61 66  1  3 65 33
  1 13 37 55  5 45 49  6  1 90 83 12 30 81 41 93 87 44 96 62 76 25 75 77 10
 94 27 17 68 26 72 35 23 40  9 58 82 22 64 74 59 54 50 88 28 46 34 31 60 16
  4  8 91 32 21 43 67 48 85  1 71 57 95 84  2 70 47 73 63  1 36 24 18 92 14]
[40 78 67 51 75 79 71 52 61 57 32 27 31 38 47  1 29  2 39 80  1 30 34 96 42
 53  9 59 21 22  8 56 41 23 54 82 33 49 64  7 17 50 87 13 55 84 14 70 10 62
 24 81 20 18 37 76  4  3 83 72 91 68 35 36 44 95  1 88 16 48 94  6 46  1 12
 63  5  1 74 92 85 90 58 66 86 45 65 26 15 69 43 89 77 28 73 25 19 11 60 93]
[58 36 53  4 91 50 45 56 63 67 13 72  5 70 80 30 83 12 10  7 40 20 39 23 76
 28  6 21 64 66 16  2 31 90  8 41 14 55 59  9 73 29 61 93 54 86 57 52 22  3
  1 25 81 92 65 34 71 75 32 51 42 79 37 18 35 95 62  1 24 60 82 89 19 74 33
 44 46 69 84  1 27 49 11 85 68 48 47 96 15  1 94 38 78 88 87 26 43 77 17  1]
[32  5 24 95 31  6 62 92 84 28  7 66 45 37 33  1 87 47 78 81 16  1 17 26 52
 48  1 57 11 65  9 25 68 75 51  1 40 61 74 60 39 86 54 53  3 59 20 77 58 18
 85 73 70 15 79 80 91 10 13 83  4 72 43 42  2 22 63 30 23 41 82 89  1 35 44
 76 14 94 34 71 64 56 38 96 46 50 67 12 90 93 69 88 21 19  8 36 29 49 55 27]
[90 57 74 16 55 78  8 63 32 48  7 88 70 40 81 28 79 53 39 47 56 85 77 37  1
 10  1 38 94 50  2  3 67 66 52  1 58 60 84 17 93 76 26 31 36 19  1 29 18 73
  1 75 44 45 23 21 35 43 86 87 11 15 46 42 27 62 41 49 51 13 68 14 89 69 61
 24 25 95 82 12 72 59 54 80  6 83 30 20 65 34  5 96  4 64 92 22 71  9 91 33]
[52 72 41 31 56 48 95 57 26  1  1 47 65 37 38 60 25 92 76 55  1  4 61 21 45
  1 29 68 19 28  9  1 80 15 84 53 74 50 89 27 77 35 54  5 81 93 78 42 36  2
  3 11 12 13 86 39 24 75 46 63 69 59 71 10 73 64 66 87 51 18  8 20 94 16 40
 88 91 70 17 34 58 22 83  7 85 82 67 23 14 33 44  6 49 32 90 79 30 96 62 43]
[73 54 37 64 18 75  5 91 94 17 14 46 13  4 81 35 21 68 95 74 11 50 84 15 87
  6  1 78 67 62  1  3 86 47 32  9 42 88 80 34 53 96 22 63 30 71 12 82 79 66
 19 51 70 45 36 52 92 90 24 31 83 93 60 77 29 69  1 16 48 65  2  1 58 28 89
 25 39 76 44 59 57 10 55 61 20 33  7 72 41 26 43  8  1 38 49 56 23 85 40 27]
[75 62  3 34 84 67  1  6 13  1 58 61 36 81 21  7  5 89  2  1 42 41 20 46 83
 80 10 96 91 14 54 71 47 76 17 45 37 25 85 40 70 35 33 90 26 38  9 93 78 23
 24 69 15 19 16  8 63 49  1 52 18 56 88 32 59 92 31 82 73 66 60 53 87 79 48
  4 68 64  1 65 28 72 51 44 43 22 50 11 39 27 74 95 55 12 29 30 94 57 86 77]
[41 10 75 69 15 61 30 43  9 40 70 55 60 23 32 66 65 58 29 72 35 21 59 14 91
 82  1  6 11 57 31 48  3  1  1  1 45 52 36 22 28 25 33 78 95 86 12 24 76 44
 37 38 92  2 64 46 85 13 19 83 47 51 34 63 94 53 20 27 90 96 74 50 84  1 79
  8 73 77 42 87 26 16  4 80 93 89 88 54 17 71 18 39 68 67 62  5 56 49 81  7]
[56 65 66 32 12 63 16 68 95 62 26 46 71 70 30 82 85 41  9 51 25  5 19 20 24
 77  3 47 18 17  4 94 13  1 90 50 58 27 28 91 15 39 10 33 64  1  8 96 36 59
 87  6 34  1 40 88 44 73 75 67 21 31 42 89 69 35  1 84 79 60 74 53 57  1  7
 83 48 72 52 93 37 38 49 76 80 61 14 86 29 23  2 43 45 92 81 54 55 78 11 22]
[ 6 93 60 59 95 89 17 51 91 92 13 37 38 72 41 70 94 30 73 87  3  1 40 56 77
 66  1 53 86 85  2 64 90 28  1 31  4 52 29 67 26 49 96 44 45 78 12  9 65 81
 10 42 55 82  5 47 20 46 54 36  1 16 32 69 39 57 88 48 83 25 24 58 23 50 21
  1  7 76 43 22  8 74 79 62 14 84 35 68 63 19 75 33 80 71 15 11 34 61 27 18]
[66 53 38 26 91 22 82 57 78 32  3 83 11 34 45 70 84 42 27 67 13 90 12 23 28
 31  1 46 60 93 49 51 64  1 58 29 30 72 18  1  8 81 16 54 71 50 36  1 73 52
 59 80 41  9 61 95 43 65 55 68  5 92 37 79 20 21 96 74 40  1  2 77 56 19 63
 94 14 47 87 69 88 76 35 44 89 25 75  4 10  7 85 24 33 17 62 48 39 15 86  6]
[54 21 16 53 93 61 15 25 45 44 14 58 86 10 42 70 43  9 46 74 20 88  1  1  4
 81 34 83 95 91 17 72 89 76 39 26 19  1  6 82 52 64 18 94 75 65 12 78 51 23
 69 41 22 37 63 77  5 57 68  8 80 40 11 36 35  7 62 92 84 96 50 87 66  2 79
 67  3  1 55  1 48 13 90 30 31 60 28 71 47 59 85 32 33 29 27 49 24 38 73 56]

## sort PSSM features by their predictive power

rank_av = [np.mean(pssm_rank.ix[i]) for i in pssm_rank.index]
arg_rank_av = np.argsort(rank_av)
pssm_rank_sorted = pssm_rank.ix[pssm_rank.index[arg_rank_av]]
pssm_rank_sorted['RANK_AV'] = np.sort(rank_av)

0 1 2 3 4 5 6 7 8 9 ... 76 77 78 79 80 81 82 83 84 RANK_AV
-1_E_pssm 55 32 27 68 82 4 28 12 56 2 ... 1 29 1 10 1 3 1 1 34 21.369048
-1_L_pssm 1 10 18 4 11 93 1 96 32 1 ... 2 9 1 54 31 4 2 49 17 24.119048
-1_A_pssm 26 55 69 10 20 5 20 52 62 42 ... 56 1 11 42 35 25 3 13 20 29.714286
-2_E_pssm 49 4 8 1 62 1 10 1 38 5 ... 8 95 5 1 30 16 17 82 15 30.500000
0_E_pssm 25 39 4 2 35 11 39 29 71 10 ... 1 78 12 9 12 8 12 36 12 33.642857
-1_D_pssm 38 2 78 86 14 73 6 59 57 23 ... 37 21 15 46 14 20 56 23 1 35.083333
-1_R_pssm 82 82 42 17 76 25 90 78 24 77 ... 85 4 50 41 21 5 1 90 88 35.440476
-2_L_pssm 76 14 7 23 87 89 27 80 61 3 ... 7 1 14 58 70 26 13 3 14 36.988095
-1_Q_pssm 8 17 85 71 89 42 32 1 2 47 ... 10 1 6 80 82 77 66 31 81 37.190476
-1_K_pssm 2 86 91 16 16 9 5 9 1 1 ... 3 1 3 71 48 94 64 51 72 37.428571
0_L_pssm 67 1 59 62 54 52 50 60 78 39 ... 1 3 19 24 37 87 10 59 69 40.190476
1_L_pssm 88 18 43 83 34 58 31 43 45 12 ... 68 8 2 60 74 74 24 2 50 40.690476
-1_T_pssm 37 36 83 89 58 86 2 8 67 52 ... 58 74 42 37 45 58 4 30 19 40.940476
-1_C_pssm 35 66 61 33 29 96 59 4 79 45 ... 1 45 87 83 91 24 77 28 4 41.059524
1_S_pssm 58 25 19 20 26 32 1 73 35 86 ... 24 88 25 4 8 83 1 94 67 41.273810
2_K_pssm 42 89 15 27 46 84 9 23 28 1 ... 96 6 8 95 39 43 33 24 32 41.309524
2_S_pssm 70 16 6 70 47 22 14 45 5 36 ... 22 79 56 30 5 54 11 48 49 41.321429
-1_S_pssm 96 57 94 6 18 66 37 82 77 1 ... 1 53 9 45 1 50 31 29 26 41.488095
-2_I_pssm 64 49 25 36 73 38 95 66 14 26 ... 48 1 17 1 40 62 92 32 44 43.107143
-2_R_pssm 36 9 67 50 64 29 22 71 69 6 ... 57 72 54 62 10 65 93 53 21 43.238095
2_E_pssm 1 58 20 39 60 45 70 21 25 72 ... 30 67 7 50 88 14 35 75 28 43.547619
0_K_pssm 30 37 45 85 61 3 15 64 10 11 ... 75 11 51 69 38 6 42 80 41 43.678571
-1_N_pssm 86 6 28 42 19 23 73 83 84 24 ... 77 61 84 20 59 19 40 12 1 43.940476
0_T_pssm 17 8 33 51 71 19 83 36 21 91 ... 35 24 92 63 85 44 20 43 5 44.202381
1_E_pssm 45 77 3 34 77 61 33 1 96 63 ... 41 66 1 31 20 1 88 96 62 44.416667
-2_K_pssm 40 13 23 92 90 54 7 16 94 17 ... 88 47 46 61 55 46 37 83 58 44.523810
-1_F_pssm 39 68 84 40 69 12 49 1 83 87 ... 66 15 47 76 1 1 28 1 76 45.059524
0_F_pssm 32 70 41 15 2 92 65 14 91 38 ... 45 13 45 19 2 1 82 9 37 45.071429
1_P_pssm 54 71 52 11 25 95 58 7 53 33 ... 61 40 89 48 79 7 21 63 79 45.119048
1_I_pssm 13 1 74 29 94 1 35 33 34 96 ... 13 18 65 66 96 60 25 1 96 45.273810
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2_T_pssm 71 74 87 48 48 62 29 20 13 37 ... 71 30 23 94 56 55 34 39 24 50.035714
-1_Y_pssm 77 78 58 47 70 85 79 3 87 41 ... 84 89 80 85 36 28 29 18 6 50.273810
2_M_pssm 16 44 64 80 57 31 55 18 82 49 ... 4 49 1 55 68 45 80 33 33 50.297619
1_A_pssm 9 23 17 21 27 88 3 10 85 62 ... 11 69 83 18 47 21 1 5 80 50.464286
-1_H_pssm 43 61 76 1 55 59 82 55 49 25 ... 94 19 67 91 11 18 86 60 95 50.476190
0_H_pssm 24 46 10 66 21 51 23 81 1 65 ... 18 36 79 78 76 36 65 73 51 50.714286
2_G_pssm 83 80 1 19 88 36 16 88 55 9 ... 20 23 72 11 54 86 68 4 71 50.726190
-2_V_pssm 48 11 88 35 74 90 84 72 95 30 ... 47 55 74 1 72 51 87 67 74 50.761905
2_Y_pssm 60 87 92 81 50 69 77 22 22 58 ... 91 62 40 86 81 11 27 86 73 50.964286
1_R_pssm 22 54 12 7 45 91 48 74 1 22 ... 15 59 93 56 51 31 16 92 40 51.154762
-2_Y_pssm 68 95 93 14 91 39 61 58 59 29 ... 39 76 95 2 29 9 73 27 46 51.202381
-1_W_pssm 93 26 35 76 8 10 66 89 12 16 ... 60 50 88 25 52 27 52 72 1 51.321429
0_D_pssm 10 38 86 96 72 74 17 57 1 67 ... 31 5 63 90 78 33 44 54 94 51.452381
2_F_pssm 87 63 40 25 12 71 81 27 7 57 ... 64 32 38 12 67 92 71 17 29 51.583333
1_N_pssm 85 12 37 77 92 67 12 47 70 44 ... 46 71 60 88 34 42 32 37 11 51.595238
0_G_pssm 75 21 1 93 83 70 36 86 39 21 ... 29 42 82 93 24 96 9 1 78 51.642857
1_Q_pssm 20 94 47 84 78 49 75 2 73 85 ... 62 64 69 92 53 35 57 21 7 51.809524
0_N_pssm 46 85 95 46 65 57 21 76 76 78 ... 26 54 22 33 33 10 96 16 18 52.011905
2_A_pssm 91 53 5 90 75 40 43 13 46 54 ... 72 58 57 28 26 37 8 88 48 52.130952
0_Q_pssm 19 43 48 26 6 80 91 32 52 68 ... 19 93 71 38 86 1 78 50 65 52.333333
2_W_pssm 53 73 31 59 4 43 62 85 15 28 ... 9 96 85 57 49 78 61 15 38 52.821429
2_Q_pssm 7 96 90 63 42 15 76 19 88 1 ... 83 82 33 22 89 61 84 25 60 53.011905
1_G_pssm 31 47 14 88 86 65 47 65 3 53 ... 49 87 16 82 27 84 48 74 92 53.440476
-2_M_pssm 44 64 81 8 5 37 67 75 42 19 ... 70 65 13 36 60 71 38 11 86 54.059524
2_C_pssm 50 50 82 1 32 82 53 39 80 27 ... 6 85 20 43 93 80 14 89 31 54.142857
1_C_pssm 80 90 89 30 3 75 94 54 11 73 ... 27 73 29 59 94 69 39 20 35 54.345238
1_W_pssm 90 69 68 79 38 41 40 94 20 90 ... 95 70 76 64 77 72 76 47 1 54.345238
-2_C_pssm 57 65 60 73 93 76 18 6 64 34 ... 55 56 18 84 15 12 95 91 93 54.714286
0_P_pssm 74 52 1 91 7 30 69 62 66 92 ... 23 86 36 16 64 40 5 61 63 57.738095
1_H_pssm 47 30 53 56 43 44 64 26 17 70 ... 51 51 48 73 90 79 83 40 84 57.940476

100 rows × 85 columns

# plot average rank of all HSSP values

plt.hist([np.mean(pssm_rank.ix[i]) for i in pssm_rank.index], bins=60, alpha=.5)
plt.title("Histogram of Average HSSP Features Rank (RFE on linear SVM)")
fig = plt.gcf()
fig.set_size_inches(10, 6)

Some PSSM features have an average rank that deviates from the normal distribution expected under the Central Limit Theorem (CLT) when features carry no information (a neutral-information setting).
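As a rough illustration of the CLT argument, the following sketch simulates rankings that carry no information and looks at the spread of the resulting average ranks. The layout is an assumption read off the printed output above: 100 features per ranking, five of them tied at rank 1, the remaining ranks running from 2 to 96, and 84 rankings averaged into RANK_AV. The rankings are also assumed to be independent of each other, which the real RFE runs need not be.

```python
import random
import statistics

N_FEATURES = 100   # assumption: entries per printed ranking
N_SELECTED = 5     # assumption: features tied at rank 1 in each ranking
N_RUNS = 84        # assumption: number of rankings averaged into RANK_AV


def null_ranking(rng):
    """One uninformative ranking: five ties at rank 1, then ranks 2..96, shuffled."""
    ranks = [1] * N_SELECTED + list(range(2, N_FEATURES - N_SELECTED + 2))
    rng.shuffle(ranks)
    return ranks


def simulate_mean_ranks(seed=0):
    """Average rank of every feature over N_RUNS uninformative rankings."""
    rng = random.Random(seed)
    totals = [0] * N_FEATURES
    for _ in range(N_RUNS):
        for feature, rank in enumerate(null_ranking(rng)):
            totals[feature] += rank
    return [total / float(N_RUNS) for total in totals]


mean_ranks = simulate_mean_ranks()
null_mean = statistics.mean(mean_ranks)
null_sd = statistics.stdev(mean_ranks)

observed = 21.369048  # RANK_AV of -1_E_pssm in the table above
print("null mean rank: %.2f, null sd: %.2f" % (null_mean, null_sd))
print("z-score of the observed top feature: %.1f" % ((observed - null_mean) / null_sd))
```

Under these assumptions, an average rank far from the simulated mean in units of the simulated standard deviation is hard to explain by chance alone.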

Analysis II: Support Vector Machine

We apply C-SVM to two models:

  • all features model
  • handpicked features model


  • “Support-vector networks”, C. Cortes, V. Vapnik - Machine Learning, 20, 273-297 (1995).
  • “Automatic Capacity Tuning of Very Large VC-dimension Classifiers”, I. Guyon, B. Boser, V. Vapnik - Advances in neural information processing 1993.

SVM on all features

Without Normalization

X = dna[dna.columns[1:][:-6]]
y = dna.class_num

## train c-SVM

clf_svm1 = SVC(kernel='rbf', C=0.7).fit(X[dna.fold == 0], y[dna.fold == 0])

SVC(C=0.7, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

## predict class

pred = clf_svm1.predict(dna[dna.fold == 1][dna.columns[1:][:-6]])

truth = dna[dna.fold == 1]['class_num']

tp = pred[(np.array(pred) == 1) & (np.array(truth) == 1)].size
tn = pred[(np.array(pred) == 0) & (np.array(truth) == 0)].size
fp = pred[(np.array(pred) == 1) & (np.array(truth) == 0)].size
fn = pred[(np.array(pred) == 0) & (np.array(truth) == 1)].size

cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
            "\t(+)-truth\t%d\t\t%d\n" +\
            "\t(-)-truth\t%d\t\t%d"
print cm % (tp, fn, fp, tn)

Confusion Matrix:
	X		(+)-pred	(-)-pred
	(+)-truth	0		2561
	(-)-truth	0		26832

print "Size of (-)- and (+)-sets:\n\t(+)\t %d\n\t(-)\t%d" % (truth[truth == 1].index.size, truth[truth == 0].index.size)

Size of (-)- and (+)-sets:
	(+)	 2561
	(-)	26832

With normalization

X_norm = dna_norm[dna_norm.columns[1:][:-6]]
y = dna_norm.class_num

## train c-SVM

clf_svm2 = SVC(kernel='rbf', C=0.7).fit(X_norm[dna_norm.fold == 0], y[dna_norm.fold == 0])

SVC(C=0.7, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

## predict class

pred2 = clf_svm2.predict(dna_norm[dna_norm.fold == 1][dna_norm.columns[1:][:-6]])

truth = dna_norm[dna_norm.fold == 1]['class_num']

tp = pred2[(np.array(pred2) == 1) & (np.array(truth) == 1)].size
tn = pred2[(np.array(pred2) == 0) & (np.array(truth) == 0)].size
fp = pred2[(np.array(pred2) == 1) & (np.array(truth) == 0)].size
fn = pred2[(np.array(pred2) == 0) & (np.array(truth) == 1)].size

cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
            "\t(+)-truth\t%d\t\t%d\n" +\
            "\t(-)-truth\t%d\t\t%d"
print cm % (tp, fn, fp, tn)

Confusion Matrix:
	X		(+)-pred	(-)-pred
	(+)-truth	323		2238
	(-)-truth	156		26676

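Normalizing each feature puts all features on a comparable scale, which simplifies the problem for the RBF kernel and improves the result: the classifier no longer assigns every residue to the (-) class and now recovers 323 of the 2561 (+) residues.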
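One way to see why scaling matters for an RBF kernel is sketched below. It assumes a z-score normalization per feature (how dna_norm was actually built is not shown in this section), and it uses hypothetical toy rows in which one feature has a much larger numeric range than the other. The gamma value mirrors gamma='auto' from the repr above, which in that scikit-learn version meant 1 / n_features.

```python
import math


def rbf_kernel(a, b, gamma):
    """k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def zscore_columns(rows):
    """Standardize every column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / float(len(c)) for c in cols]
    # A constant column gets sd 1.0 to avoid division by zero.
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / float(len(c))) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


# Hypothetical rows: feature 0 is score-like (small range),
# feature 1 is mass-like (large range).
rows = [[0.1, 110.0], [0.9, 115.0], [0.2, 180.0], [0.8, 185.0]]
gamma = 1.0 / 2  # 1 / n_features

raw_similarity = rbf_kernel(rows[0], rows[1], gamma)
norm_rows = zscore_columns(rows)
scaled_similarity = rbf_kernel(norm_rows[0], norm_rows[1], gamma)

print("kernel on raw features:    %.6f" % raw_similarity)
print("kernel on scaled features: %.6f" % scaled_similarity)
```

On raw features the large-range column dominates the squared distance, so the kernel value between distinct points collapses towards zero. After scaling, both columns contribute on comparable terms.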

SVM on PSSM + other putative useful features

We now test the SVM with a domain-knowledge (non-zero-knowledge) approach. We scoured the complete list of features produced by PredictProtein and included those that might influence DNA/RNA binding.

## hand-pick features

features = [x for x in dna.columns[1:][:-6] if 'pssm' in x] +\
            [x for x in dna.columns[1:][:-6] if 'glbl_aa_comp' in x] +\
            [x for x in dna.columns[1:][:-6] if 'glbl_sec' in x] +\
            [x for x in dna.columns[1:][:-6] if 'glbl_acc' in x] +\
            [x for x in dna.columns[1:][:-6] if 'chemprop_mass' in x] +\
            [x for x in dna.columns[1:][:-6] if 'chemprop_hyd' in x] +\
            [x for x in dna.columns[1:][:-6] if 'chemprop_cbeta' in x] +\
            [x for x in dna.columns[1:][:-6] if 'chemprop_charge' in x] +\
            [x for x in dna.columns[1:][:-6] if 'inf_PP' in x] +\
            [x for x in dna.columns[1:][:-6] if 'isis_bin' in x] +\
            [x for x in dna.columns[1:][:-6] if 'isis_raw' in x] +\
            [x for x in dna.columns[1:][:-6] if 'profbval_raw' in x] +\
            [x for x in dna.columns[1:][:-6] if 'profphd_sec_raw' in x] +\
            [x for x in dna.columns[1:][:-6] if 'profphd_sec_bin' in x] +\
            [x for x in dna.columns[1:][:-6] if 'profphd_acc_bin' in x] +\
            [x for x in dna.columns[1:][:-6] if 'profphd_normalize' in x] +\
            [x for x in dna.columns[1:][:-6] if 'pfam_within_domain' in x] +\
            [x for x in dna.columns[1:][:-6] if 'pfam_dom_cons' in x]
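The same selection can be written more compactly by looping over the substrings. The sketch below is an equivalent formulation under the assumption that the column names are plain strings. The helper name select_features and the toy column list are hypothetical; the substrings are the ones used above.

```python
PATTERNS = [
    'pssm', 'glbl_aa_comp', 'glbl_sec', 'glbl_acc',
    'chemprop_mass', 'chemprop_hyd', 'chemprop_cbeta', 'chemprop_charge',
    'inf_PP', 'isis_bin', 'isis_raw', 'profbval_raw',
    'profphd_sec_raw', 'profphd_sec_bin', 'profphd_acc_bin',
    'profphd_normalize', 'pfam_within_domain', 'pfam_dom_cons',
]


def select_features(columns, patterns=PATTERNS):
    """Columns containing each substring, grouped in the order of `patterns`.

    Like the concatenated list comprehensions, a column that matches two
    substrings appears twice.
    """
    return [c for p in patterns for c in columns if p in c]


# Hypothetical column names, for illustration only.
toy_columns = ['-2_A_pssm', 'glbl_sec_H', 'unrelated_feature',
               'isis_raw', '0_chemprop_mass']
selected = select_features(toy_columns)
print(selected)
```

In the notebook this would be called as select_features(dna.columns[1:][:-6]).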

X_norm = dna_norm[features]
y = dna_norm.class_num

## train c-SVM

clf_svm3 = SVC(kernel='rbf', C=0.7).fit(X_norm[dna_norm.fold == 0], y[dna_norm.fold == 0])

SVC(C=0.7, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

## predict class

pred3 = clf_svm3.predict(X_norm[dna_norm.fold == 1])

truth = dna_norm[dna_norm.fold == 1]['class_num']

tp = pred3[(np.array(pred3) == 1) & (np.array(truth) == 1)].size
tn = pred3[(np.array(pred3) == 0) & (np.array(truth) == 0)].size
fp = pred3[(np.array(pred3) == 1) & (np.array(truth) == 0)].size
fn = pred3[(np.array(pred3) == 0) & (np.array(truth) == 1)].size

cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
            "\t(+)-truth\t%d\t\t%d\n" +\
            "\t(-)-truth\t%d\t\t%d"
print cm % (tp, fn, fp, tn)

Confusion Matrix:
	X		(+)-pred	(-)-pred
	(+)-truth	166		2395
	(-)-truth	99		26733

Analysis III: Tree-based Classifiers

We picked three tree-based algorithms: Decision Tree (DT), Random Forest (RF) and Extremely Randomized Trees (ERT). From left to right, each algorithm introduces more randomness, and therefore more bias, than the previous one.

y = dna.class_num

Decision Tree

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.


L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
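A single decision rule of the kind described above can be sketched as a threshold on one feature, chosen to minimize the weighted Gini impurity of the two resulting groups. The code below is a simplified illustration on hypothetical toy data, not scikit-learn's implementation. A full tree applies this search recursively and across all features.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / float(len(labels))
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best rule `value <= threshold`."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # no split: impurity of the parent node
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / float(len(pairs))
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical feature values and class labels.
values = [-3.0, -1.0, 0.0, 2.0, 4.0, 5.0]
labels = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(values, labels)
print("rule: value <= %s, weighted Gini after split: %.3f" % (threshold, impurity))
```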

# compute cross validated accuracy of the model

clf_t1 = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
                                random_state=0)

scores = cross_val_score(clf_t1, X, y, cv=5)

print scores
print scores.mean()

[ 0.84656239  0.84001634  0.85005545  0.86370535  0.85284847]

clf_t1.fit(X[dna.fold == 0], y[dna.fold == 0])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=0, splitter='best')

pred_t1 = clf_t1.predict(X[dna.fold == 1])

truth = dna[dna.fold == 1]['class_num']

tp = pred_t1[(np.array(pred_t1) == 1) & (np.array(truth) == 1)].size
tn = pred_t1[(np.array(pred_t1) == 0) & (np.array(truth) == 0)].size
fp = pred_t1[(np.array(pred_t1) == 1) & (np.array(truth) == 0)].size
fn = pred_t1[(np.array(pred_t1) == 0) & (np.array(truth) == 1)].size

cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
            "\t(+)-truth\t%d\t\t%d\n" +\
            "\t(-)-truth\t%d\t\t%d"
print cm % (tp, fn, fp, tn)

Confusion Matrix:
	X		(+)-pred	(-)-pred
	(+)-truth	760		1801
	(-)-truth	3237		23595

Random Forest

In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.



  • Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001.
  • Breiman, “Arcing Classifiers”, Annals of Statistics 1998.
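The two sources of randomness described above can be sketched in a few lines. This is a minimal illustration, not scikit-learn's implementation: it only draws the bootstrap sample and the candidate feature subset and does not grow any tree. The subset size of sqrt(n_features) is an assumption that mirrors max_features='auto' for classifiers in the scikit-learn version used here.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, rng):
    """Pick about sqrt(n_features) distinct candidate features for one node split."""
    k = max(1, int(n_features ** 0.5))
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)

rows = bootstrap_indices(1000, rng)
unique_fraction = len(set(rows)) / 1000.0
print("distinct training rows seen by one tree: %.3f" % unique_fraction)

candidate_features = random_feature_subset(100, rng)
print("candidate features at one node: %s" % candidate_features)
```

Because each tree sees only part of the rows and each split only part of the features, the trees differ from each other, and averaging them is what lowers the variance.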

# compute cross validated accuracy of the model

clf_t2 = RandomForestClassifier(n_estimators=10, max_depth=None,
                                 min_samples_split=2, random_state=0)
scores = cross_val_score(clf_t2, X, y, cv=5)
print scores
print scores.mean()

[ 0.90253298  0.9058542   0.90632113  0.90917581  0.908417  ]

clf_t2.fit(X[dna.fold == 0], y[dna.fold == 0])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)

pred_t2 = clf_t2.predict(X[dna.fold == 1])

truth = dna[dna.fold == 1]['class_num']

tp = pred_t2[(np.array(pred_t2) == 1) & (np.array(truth) == 1)].size
tn = pred_t2[(np.array(pred_t2) == 0) & (np.array(truth) == 0)].size
fp = pred_t2[(np.array(pred_t2) == 1) & (np.array(truth) == 0)].size
fn = pred_t2[(np.array(pred_t2) == 0) & (np.array(truth) == 1)].size

cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
            "\t(+)-truth\t%d\t\t%d\n" +\
            "\t(-)-truth\t%d\t\t%d"
print cm % (tp, fn, fp, tn)

Confusion Matrix:
	X		(+)-pred	(-)-pred
	(+)-truth	245		2316
	(-)-truth	224		26608

Extremely Randomized Tree

In extremely randomized trees (see the ExtraTreesClassifier and ExtraTreesRegressor classes), randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly generated thresholds is picked as the splitting rule. This usually reduces the variance of the model a bit more, at the expense of a slightly greater increase in bias.



  • P. Geurts, D. Ernst., and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3-42, 2006.
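The difference between the two splitting rules can be illustrated on a single feature. The following is a minimal sketch on synthetic data, not the scikit-learn implementation: the helper names `gini` and `split_impurity` are hypothetical, and a real extra-trees node draws one random threshold per candidate feature and then keeps the best of those.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of 0/1 labels."""
    if labels.size == 0:
        return 0.0
    p = labels.mean()
    return 2.0 * p * (1.0 - p)

def split_impurity(x, y, threshold):
    """Weighted Gini impurity after splitting on x <= threshold."""
    left = y[x <= threshold]
    right = y[x > threshold]
    return (left.size * gini(left) + right.size * gini(right)) / float(y.size)

# Synthetic single feature with a noisy class boundary near zero
rng = np.random.RandomState(0)
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Random-forest style: scan all midpoints, keep the most discriminative one
xs = np.sort(x)
candidates = (xs[:-1] + xs[1:]) / 2.0
best_threshold = min(candidates, key=lambda t: split_impurity(x, y, t))

# Extra-trees style: a single threshold drawn uniformly over the feature range
random_threshold = rng.uniform(x.min(), x.max())

best_score = split_impurity(x, y, best_threshold)
random_score = split_impurity(x, y, random_threshold)
print(best_score, random_score)
```

The exhaustive search can never do worse than the random draw on the training data, which is why the random rule trades a little bias for less variance across trees.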

# compute cross validated accuracy of the model

clf_t3 = ExtraTreesClassifier(n_estimators=10, max_depth=None,
                              min_samples_split=2, random_state=0)
scores = cross_val_score(clf_t3, X, y, cv=5)

print scores
print scores.mean()

[ 0.90480915  0.90813051  0.90859744  0.90835863  0.91028485]

clf_t3.fit(X[dna.fold == 0], y[dna.fold == 0])

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
           verbose=0, warm_start=False)

pred_t3 = clf_t3.predict(X[dna.fold == 1])

truth = dna[dna.fold == 1]['class_num']

tp = pred_t3[(np.array(pred_t3) == 1) & (np.array(truth) == 1)].size
tn = pred_t3[(np.array(pred_t3) == 0) & (np.array(truth) == 0)].size
fp = pred_t3[(np.array(pred_t3) == 1) & (np.array(truth) == 0)].size
fn = pred_t3[(np.array(pred_t3) == 0) & (np.array(truth) == 1)].size

cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
            "\t(+)-truth\t%d\t\t%d\n" +\
            "\t(-)-truth\t%d\t\t%d"
print cm % (tp, fn, fp, tn)

Confusion Matrix:
	X		(+)-pred	(-)-pred
	(+)-truth	233		2328
	(-)-truth	236		26596

Random Forest on selected features

# substrings identifying the hand-picked feature groups
feature_keys = ['pssm', 'glbl_aa_comp', 'glbl_sec', 'glbl_acc',
                'chemprop_mass', 'chemprop_hyd', 'chemprop_cbeta',
                'chemprop_charge', 'inf_PP', 'isis_bin', 'isis_raw',
                'profbval_raw', 'profphd_sec_raw', 'profphd_sec_bin',
                'profphd_acc_bin', 'profphd_normalize',
                'pfam_within_domain', 'pfam_dom_cons']

# collect matching columns group by group, in the order listed above
candidates = dna.columns[1:][:-6]
features = [x for key in feature_keys for x in candidates if key in x]

X = dna[features]
y = dna.class_num

# compute cross validated accuracy of the model

clf_t4 = RandomForestClassifier(n_estimators=10, max_depth=None,
                                 min_samples_split=2, random_state=0)
scores = cross_val_score(clf_t4, X, y, cv=5)
print scores
print scores.mean()

[ 0.90253298  0.9058542   0.90632113  0.90917581  0.908417  ]

clf_t4.fit(X[dna.fold == 0], y[dna.fold == 0])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)

pred_t4 = clf_t4.predict(X[dna.fold == 1])

truth = dna[dna.fold == 1]['class_num']

tp = pred_t4[(np.array(pred_t4) == 1) & (np.array(truth) == 1)].size
tn = pred_t4[(np.array(pred_t4) == 0) & (np.array(truth) == 0)].size
fp = pred_t4[(np.array(pred_t4) == 1) & (np.array(truth) == 0)].size
fn = pred_t4[(np.array(pred_t4) == 0) & (np.array(truth) == 1)].size

cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
            "\t(+)-truth\t%d\t\t%d\n" +\
            "\t(-)-truth\t%d\t\t%d"
print cm % (tp, fn, fp, tn)

Confusion Matrix:
	X		(+)-pred	(-)-pred
	(+)-truth	190		2371
	(-)-truth	257		26575

While there is a significant accuracy improvement going from Decision Tree to Random Forest, Extremely Randomized Trees improve on Random Forest only marginally (mean cross-validated accuracy of about 0.908 versus 0.906). Likewise, manually hand-picking the features does not seem to improve performance: on the held-out fold the number of true positives drops from 245 to 190.

IV: Other Ensemble Learning Methods


The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction.



  • Y. Freund, and R. Schapire, “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting”, 1997.
  • J. Zhu, H. Zou, S. Rosset, T. Hastie. “Multi-class AdaBoost”, 2009.
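The reweighting and weighted vote described above can be sketched in a few lines. This is a minimal illustration on synthetic data, assuming the classic discrete AdaBoost with decision stumps and labels in {-1, +1}; it is not the SAMME.R variant used by AdaBoostClassifier, and all function names are hypothetical.

```python
import numpy as np

def stump_predict(X, j, t, polarity):
    """Predict +1 where polarity * (X[:, j] - t) > 0, else -1."""
    return np.where(polarity * (X[:, j] - t) > 0, 1, -1)

def fit_stump(X, y, w):
    """Exhaustively find the stump with the lowest weighted error."""
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for polarity in (1, -1):
                err = w[stump_predict(X, j, t, polarity) != y].sum()
                if err < best[3]:
                    best = (j, t, polarity, err)
    return best

def adaboost_fit(X, y, n_rounds=20):
    """Fit stumps on repeatedly reweighted data; return (alpha, stump) tuples."""
    w = np.full(X.shape[0], 1.0 / X.shape[0])
    ensemble = []
    for _ in range(n_rounds):
        j, t, polarity, err = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # vote weight of this stump
        pred = stump_predict(X, j, t, polarity)
        w = w * np.exp(-alpha * y * pred)       # up-weight misclassified samples
        w /= w.sum()
        ensemble.append((alpha, j, t, polarity))
    return ensemble

def adaboost_predict(X, ensemble):
    """Weighted majority vote over all stumps."""
    score = sum(a * stump_predict(X, j, t, p) for a, j, t, p in ensemble)
    return np.where(score >= 0, 1, -1)

# Synthetic data with a diagonal boundary that no single stump can fit
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 2))
y_demo = np.where(X_demo[:, 0] + X_demo[:, 1] > 0, 1, -1)

ensemble = adaboost_fit(X_demo, y_demo, n_rounds=20)
single_acc = (adaboost_predict(X_demo, ensemble[:1]) == y_demo).mean()
boosted_acc = (adaboost_predict(X_demo, ensemble) == y_demo).mean()
print(single_acc, boosted_acc)
```

Comparing the training accuracy of the first stump alone with that of the full vote shows how combining weak learners sharpens the decision boundary.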

X = dna[dna.columns[1:][:-6]]
y = dna.class_num

# compute cross validated accuracy of the model

ada = AdaBoostClassifier(n_estimators=100)
scores = cross_val_score(ada, X, y, cv=5)

print scores
print scores.mean()

[ 0.90673515  0.90194362  0.89406409  0.90777492  0.89691805]

ada.fit(X[dna.fold == 0], y[dna.fold == 0])

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=100, random_state=None)

pred_ada = ada.predict(X[dna.fold == 1])

truth = dna[dna.fold == 1]['class_num']

tp = pred_ada[(np.array(pred_ada) == 1) & (np.array(truth) == 1)].size
tn = pred_ada[(np.array(pred_ada) == 0) & (np.array(truth) == 0)].size
fp = pred_ada[(np.array(pred_ada) == 1) & (np.array(truth) == 0)].size
fn = pred_ada[(np.array(pred_ada) == 0) & (np.array(truth) == 1)].size

cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
            "\t(+)-truth\t%d\t\t%d\n" +\
            "\t(-)-truth\t%d\t\t%d"
print cm % (tp, fn, fp, tn)

Confusion Matrix:
	X		(+)-pred	(-)-pred
	(+)-truth	665		1896
	(-)-truth	800		26032


  • xNA binding is a hard(er than expected) classification problem
  • Good accuracy and precision, but only okay recall, for most basic ML algorithms
  • Most features are not i.i.d.
  • Manual selection of features doesn't improve performance
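
Accuracy, precision and recall all follow directly from the four confusion-matrix counts. As a minimal sketch, the snippet below plugs in the counts printed for the first Random Forest in this section; the helper name `summarize` is hypothetical, and the all-negative baseline is included only as a reference point for the class imbalance.

```python
def summarize(tp, fn, fp, tn):
    """Derive standard metrics from the four confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / float(total),
        "precision": tp / float(tp + fp) if (tp + fp) else 0.0,
        "recall": tp / float(tp + fn) if (tp + fn) else 0.0,
        # accuracy obtained by always predicting the negative class
        "all_negative_baseline": (fp + tn) / float(total),
    }

# Counts copied from the Random Forest confusion matrix in this section
metrics = summarize(tp=245, fn=2316, fp=224, tn=26608)
for name in sorted(metrics):
    print("%s: %.3f" % (name, metrics[name]))
```

The same call can be repeated with the counts of the other classifiers to compare them on more than accuracy alone.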

Some solutions that might work:

I. Quantitative feature selection using RFE/RFA over the complete feature space

Problem: Feature spaces might be too large for conventional canned algorithms.

Possible Hacks:

-- Bagging of features (55+ feature groups vs. 500+ features)

-- Removing similar features before RFE (elimination via cosine similarity et al.?)

-- Dimensionality reduction (t-SNE, PCA et al.?)
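
As one possible reading of the second hack, near-duplicate columns could be filtered out greedily before running RFE. The sketch below is an assumption rather than a tested part of this notebook: the function name `drop_similar_features`, the 0.95 threshold and the synthetic matrix are all illustrative.

```python
import numpy as np

def drop_similar_features(X, threshold=0.95):
    """Return indices of columns kept after greedy cosine-similarity filtering.

    A column is kept only if its absolute cosine similarity to every
    previously kept column is below `threshold`.
    """
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0                 # avoid division by zero
    unit = X / norms
    sim = np.abs(unit.T.dot(unit))          # pairwise |cosine similarity|
    kept = []
    for j in range(X.shape[1]):
        if all(sim[j, k] < threshold for k in kept):
            kept.append(j)
    return kept

# Synthetic example: column 1 is almost a rescaled copy of column 0
rng = np.random.RandomState(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X_demo = np.column_stack([a, 2.0 * a + 0.01 * rng.normal(size=500), b])

kept = drop_similar_features(X_demo, threshold=0.95)
print(kept)
```

With the real data the kept indices would be used to subset the feature columns before passing them to RFE.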

II. Regularization: this might work, considering that the system is not entirely overdetermined, that many features are not actually informative, and that the problem tends toward overly complex models.

In general, a combination of I and II would make sense.
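
To illustrate how regularization can discard uninformative features, here is a minimal sketch using an L1-penalized logistic regression on synthetic data. The choice of model, the value `C=0.05` and the data generated by `make_classification` are assumptions for illustration and not results from this dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 50 features, only 5 of them informative
X_demo, y_demo = make_classification(n_samples=500, n_features=50,
                                     n_informative=5, n_redundant=0,
                                     random_state=0)

# In scikit-learn a smaller C means stronger regularization
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.05)
l1_model.fit(X_demo, y_demo)

# The L1 penalty drives many coefficients exactly to zero
n_kept = int(np.count_nonzero(l1_model.coef_))
print("non-zero coefficients: %d of %d" % (n_kept, X_demo.shape[1]))
```

The surviving non-zero coefficients could then serve as a starting feature set for the selection step in I.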

For continuous class value in [0, 1] (for submission)

SVM, Random Forest and AdaBoost regressors.
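
A minimal sketch of the regressor route, assuming the 0/1 labels are used directly as regression targets. The data here are synthetic and `RandomForestRegressor` stands in for any of the three options; the split into `train` and `test` index arrays is illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the feature matrix and the 0/1 class labels
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     n_informative=5, random_state=0)
train, test = np.arange(300), np.arange(300, 400)

reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X_demo[train], y_demo[train])

# Each tree predicts an average of 0/1 labels, so the forest output already
# lies in [0, 1]; the clip is only a safeguard when swapping in other regressors.
scores = np.clip(reg.predict(X_demo[test]), 0.0, 1.0)
print(scores[:5])
```

SVR or AdaBoostRegressor can be dropped in for `reg`, in which case the clipping step actually matters because their outputs are not bounded.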
