Binding Site Prediction

In this notebook we apply several machine learning methods and compare different aspects of machine learning paradigms:

  • Zero-knowledge vs. domain-knowledge based prediction
  • Single algorithms vs. ensemble methods
  • Prediction over normalized vs. non-normalized space

We also review several machine learning algorithms, such as Support Vector Machines and their variants (C-SVM, support vector regression, etc.), tree-based methods (decision trees, random forests, extremely randomized trees, etc.), and other ensemble methods (AdaBoost).
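For reference, the scikit-learn estimators behind these algorithm families can be instantiated as follows. This is only a minimal sketch: the hyperparameter values shown are illustrative placeholders, not the settings used later in the notebook.

from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier

models = {
    'linear C-SVM': SVC(kernel='linear', C=1.0),                # single algorithm
    'SVR': SVR(kernel='rbf'),                                   # regression variant of the SVM
    'decision tree': DecisionTreeClassifier(max_depth=None),    # single tree
    'random forest': RandomForestClassifier(n_estimators=100),  # bagged ensemble
    'extra trees': ExtraTreesClassifier(n_estimators=100),      # extremely randomized trees
    'AdaBoost': AdaBoostClassifier(n_estimators=100),           # boosted ensemble
}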


In [1]:
## matrix and vector tools

import pandas as pd
from pandas import DataFrame as df
from pandas import Series
import numpy as np

In [181]:
## sklearn

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
from sklearn.svm import SVC

from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

from sklearn.feature_selection import VarianceThreshold

In [3]:
# matplotlib et al.

from matplotlib import pyplot as plt

%matplotlib inline



Data Import and pre-processing


In [96]:
dna = df.from_csv('../../data/training_data_binding_site_prediction/dna_big.csv')

In [97]:
## encode the class label ('+'/'-') as boolean and numeric columns

dna = dna.reset_index(drop=False)
dna['class_bool'] = dna['class'] == '+'
dna['class_num'] = dna.class_bool.apply(lambda x: 1 if x else 0)
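A quick sanity check on the label encoding (a sketch, not part of the original run; the resulting counts are not shown in this notebook) makes the balance between positive ('+') and negative ('-') residues visible:

## inspect the class balance of the encoded labels
print(dna['class'].value_counts())   # '+' vs. '-' residue counts
print(dna['class_num'].mean())       # fraction of positive residues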

In [98]:
## add protein ID and residue position columns (parsed from ID_pos)

dna['ID'] = dna.ID_pos.apply(lambda x: ''.join(x.split('_')[:-1]))
dna['pos'] = dna.ID_pos.apply(lambda x: x.split('_')[-1])
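For illustration, an ID_pos value such as 'Q6NS38_12' (seen in the data below) splits into the protein accession and the residue position; note that the position is kept as a string:

## example of the ID_pos parsing used above
parts = 'Q6NS38_12'.split('_')
print(''.join(parts[:-1]))   # 'Q6NS38' -> protein ID
print(parts[-1])             # '12'     -> residue position (string)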

In [99]:
## data columns

dna.columns


Out[99]:
Index([u'ID_pos', u'-2_A_pssm', u'-2_R_pssm', u'-2_N_pssm', u'-2_D_pssm',
       u'-2_C_pssm', u'-2_Q_pssm', u'-2_E_pssm', u'-2_G_pssm', u'-2_H_pssm',
       ...
       u'intermediate_composition3', u'buried_composition1',
       u'buried_composition2', u'buried_composition3', u'fold', u'class',
       u'class_bool', u'class_num', u'ID', u'pos'],
      dtype='object', length=524)

In [8]:
## print available features
for feature in dna.columns[:-6]:
    
    print feature


ID_pos
-2_A_pssm
-2_R_pssm
-2_N_pssm
-2_D_pssm
-2_C_pssm
-2_Q_pssm
-2_E_pssm
-2_G_pssm
-2_H_pssm
-2_I_pssm
-2_L_pssm
-2_K_pssm
-2_M_pssm
-2_F_pssm
-2_P_pssm
-2_S_pssm
-2_T_pssm
-2_W_pssm
-2_Y_pssm
-2_V_pssm
-1_A_pssm
-1_R_pssm
-1_N_pssm
-1_D_pssm
-1_C_pssm
-1_Q_pssm
-1_E_pssm
-1_G_pssm
-1_H_pssm
-1_I_pssm
-1_L_pssm
-1_K_pssm
-1_M_pssm
-1_F_pssm
-1_P_pssm
-1_S_pssm
-1_T_pssm
-1_W_pssm
-1_Y_pssm
-1_V_pssm
0_A_pssm
0_R_pssm
0_N_pssm
0_D_pssm
0_C_pssm
0_Q_pssm
0_E_pssm
0_G_pssm
0_H_pssm
0_I_pssm
0_L_pssm
0_K_pssm
0_M_pssm
0_F_pssm
0_P_pssm
0_S_pssm
0_T_pssm
0_W_pssm
0_Y_pssm
0_V_pssm
1_A_pssm
1_R_pssm
1_N_pssm
1_D_pssm
1_C_pssm
1_Q_pssm
1_E_pssm
1_G_pssm
1_H_pssm
1_I_pssm
1_L_pssm
1_K_pssm
1_M_pssm
1_F_pssm
1_P_pssm
1_S_pssm
1_T_pssm
1_W_pssm
1_Y_pssm
1_V_pssm
2_A_pssm
2_R_pssm
2_N_pssm
2_D_pssm
2_C_pssm
2_Q_pssm
2_E_pssm
2_G_pssm
2_H_pssm
2_I_pssm
2_L_pssm
2_K_pssm
2_M_pssm
2_F_pssm
2_P_pssm
2_S_pssm
2_T_pssm
2_W_pssm
2_Y_pssm
2_V_pssm
-2_A_perc
-2_R_perc
-2_N_perc
-2_D_perc
-2_C_perc
-2_Q_perc
-2_E_perc
-2_G_perc
-2_H_perc
-2_I_perc
-2_L_perc
-2_K_perc
-2_M_perc
-2_F_perc
-2_P_perc
-2_S_perc
-2_T_perc
-2_W_perc
-2_Y_perc
-2_V_perc
-1_A_perc
-1_R_perc
-1_N_perc
-1_D_perc
-1_C_perc
-1_Q_perc
-1_E_perc
-1_G_perc
-1_H_perc
-1_I_perc
-1_L_perc
-1_K_perc
-1_M_perc
-1_F_perc
-1_P_perc
-1_S_perc
-1_T_perc
-1_W_perc
-1_Y_perc
-1_V_perc
0_A_perc
0_R_perc
0_N_perc
0_D_perc
0_C_perc
0_Q_perc
0_E_perc
0_G_perc
0_H_perc
0_I_perc
0_L_perc
0_K_perc
0_M_perc
0_F_perc
0_P_perc
0_S_perc
0_T_perc
0_W_perc
0_Y_perc
0_V_perc
1_A_perc
1_R_perc
1_N_perc
1_D_perc
1_C_perc
1_Q_perc
1_E_perc
1_G_perc
1_H_perc
1_I_perc
1_L_perc
1_K_perc
1_M_perc
1_F_perc
1_P_perc
1_S_perc
1_T_perc
1_W_perc
1_Y_perc
1_V_perc
2_A_perc
2_R_perc
2_N_perc
2_D_perc
2_C_perc
2_Q_perc
2_E_perc
2_G_perc
2_H_perc
2_I_perc
2_L_perc
2_K_perc
2_M_perc
2_F_perc
2_P_perc
2_S_perc
2_T_perc
2_W_perc
2_Y_perc
2_V_perc
-2_infPP
-1_infPP
0_infPP
1_infPP
2_infPP
-2_relW
-1_relW
0_relW
1_relW
2_relW
-2_isis_plus
-1_isis_plus
0_isis_plus
1_isis_plus
2_isis_plus
-2_isis_minus
-1_isis_minus
0_isis_minus
1_isis_minus
2_isis_minus
-2_isis_raw
-1_isis_raw
0_isis_raw
1_isis_raw
2_isis_raw
-2_profbval_raw1
-1_profbval_raw1
0_profbval_raw1
1_profbval_raw1
2_profbval_raw1
-2_profbval_raw2
-1_profbval_raw2
0_profbval_raw2
1_profbval_raw2
2_profbval_raw2
-2_md_plus
-1_md_plus
0_md_plus
1_md_plus
2_md_plus
-2_md_minus
-1_md_minus
0_md_minus
1_md_minus
2_md_minus
-2_md_raw
-1_md_raw
0_md_raw
1_md_raw
2_md_raw
-2_md_ri
-1_md_ri
0_md_ri
1_md_ri
2_md_ri
-2_helix
-1_helix
0_helix
1_helix
2_helix
-2_strand
-1_strand
0_strand
1_strand
2_strand
-2_loop
-1_loop
0_loop
1_loop
2_loop
-2_OtH
-1_OtH
0_OtH
1_OtH
2_OtH
-2_OtE
-1_OtE
0_OtE
1_OtE
2_OtE
-2_OtL
-1_OtL
0_OtL
1_OtL
2_OtL
-2_ri_sec
-1_ri_sec
0_ri_sec
1_ri_sec
2_ri_sec
-2_e
-1_e
0_e
1_e
2_e
-2_i
-1_i
0_i
1_i
2_i
-2_b
-1_b
0_b
1_b
2_b
-2_rel_acc
-1_rel_acc
0_rel_acc
1_rel_acc
2_rel_acc
-2_ri_acc
-1_ri_acc
0_ri_acc
1_ri_acc
2_ri_acc
-2_A_psic
-2_R_psic
-2_N_psic
-2_D_psic
-2_C_psic
-2_Q_psic
-2_E_psic
-2_G_psic
-2_H_psic
-2_I_psic
-2_L_psic
-2_K_psic
-2_M_psic
-2_F_psic
-2_P_psic
-2_S_psic
-2_T_psic
-2_W_psic
-2_Y_psic
-2_V_psic
-1_A_psic
-1_R_psic
-1_N_psic
-1_D_psic
-1_C_psic
-1_Q_psic
-1_E_psic
-1_G_psic
-1_H_psic
-1_I_psic
-1_L_psic
-1_K_psic
-1_M_psic
-1_F_psic
-1_P_psic
-1_S_psic
-1_T_psic
-1_W_psic
-1_Y_psic
-1_V_psic
0_A_psic
0_R_psic
0_N_psic
0_D_psic
0_C_psic
0_Q_psic
0_E_psic
0_G_psic
0_H_psic
0_I_psic
0_L_psic
0_K_psic
0_M_psic
0_F_psic
0_P_psic
0_S_psic
0_T_psic
0_W_psic
0_Y_psic
0_V_psic
1_A_psic
1_R_psic
1_N_psic
1_D_psic
1_C_psic
1_Q_psic
1_E_psic
1_G_psic
1_H_psic
1_I_psic
1_L_psic
1_K_psic
1_M_psic
1_F_psic
1_P_psic
1_S_psic
1_T_psic
1_W_psic
1_Y_psic
1_V_psic
2_A_psic
2_R_psic
2_N_psic
2_D_psic
2_C_psic
2_Q_psic
2_E_psic
2_G_psic
2_H_psic
2_I_psic
2_L_psic
2_K_psic
2_M_psic
2_F_psic
2_P_psic
2_S_psic
2_T_psic
2_W_psic
2_Y_psic
2_V_psic
-2_psic_numSeq
-1_psic_numSeq
0_psic_numSeq
1_psic_numSeq
2_psic_numSeq
-2_pfam_within_domain
-1_pfam_within_domain
0_pfam_within_domain
1_pfam_within_domain
2_pfam_within_domain
-2_pfam_domain_conservation
-1_pfam_domain_conservation
0_pfam_domain_conservation
1_pfam_domain_conservation
2_pfam_domain_conservation
-2_pfam_residue_fit
-1_pfam_residue_fit
0_pfam_residue_fit
1_pfam_residue_fit
2_pfam_residue_fit
-2_pfam_pp
-1_pfam_pp
0_pfam_pp
1_pfam_pp
2_pfam_pp
-2_chemprop_mass
-1_chemprop_mass
0_chemprop_mass
1_chemprop_mass
2_chemprop_mass
-2_chemprop_vol
-1_chemprop_vol
0_chemprop_vol
1_chemprop_vol
2_chemprop_vol
-2_chemprop_hyd
-1_chemprop_hyd
0_chemprop_hyd
1_chemprop_hyd
2_chemprop_hyd
-2_chemprop_cbeta
-1_chemprop_cbeta
0_chemprop_cbeta
1_chemprop_cbeta
2_chemprop_cbeta
-2_chemprop_hbreaker
-1_chemprop_hbreaker
0_chemprop_hbreaker
1_chemprop_hbreaker
2_chemprop_hbreaker
-2_chemprop_charge
-1_chemprop_charge
0_chemprop_charge
1_chemprop_charge
2_chemprop_charge
-2_prosite
-1_prosite
0_prosite
1_prosite
2_prosite
A_composition
R_composition
N_composition
D_composition
C_composition
Q_composition
E_composition
G_composition
H_composition
I_composition
L_composition
K_composition
M_composition
F_composition
P_composition
S_composition
T_composition
W_composition
Y_composition
V_composition
length_category1
length_category2
length_category3
length_category4
helix_composition1
helix_composition2
helix_composition3
strand_composition1
strand_composition2
strand_composition3
loop_composition1
loop_composition2
loop_composition3
exposed_composition1
exposed_composition2
exposed_composition3
intermediate_composition1
intermediate_composition2
intermediate_composition3
buried_composition1
buried_composition2
buried_composition3

In [152]:
dna


Out[152]:
ID_pos -2_A_pssm -2_R_pssm -2_N_pssm -2_D_pssm -2_C_pssm -2_Q_pssm -2_E_pssm -2_G_pssm -2_H_pssm ... intermediate_composition3 buried_composition1 buried_composition2 buried_composition3 fold class class_bool class_num ID pos
0 Q6NS38_1 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 1
1 Q6NS38_2 0.047426 0.017986 0.006693 0.002473 0.017986 0.047426 0.006693 0.006693 0.017986 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 2
2 Q6NS38_3 0.119203 0.017986 0.119203 0.999665 0.002473 0.047426 0.731059 0.017986 0.017986 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 3
3 Q6NS38_4 0.731059 0.997527 0.119203 0.268941 0.006693 0.731059 0.119203 0.119203 0.500000 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 4
4 Q6NS38_5 0.731059 0.268941 0.119203 0.006693 0.017986 0.268941 0.268941 0.047426 0.047426 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 5
5 Q6NS38_6 0.119203 0.880797 0.017986 0.017986 0.047426 0.047426 0.268941 0.119203 0.017986 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 6
6 Q6NS38_7 0.731059 0.119203 0.119203 0.119203 0.047426 0.119203 0.017986 0.017986 0.017986 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 7
7 Q6NS38_8 0.119203 0.880797 0.119203 0.119203 0.268941 0.982014 0.268941 0.119203 0.268941 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 8
8 Q6NS38_9 0.500000 0.982014 0.119203 0.119203 0.006693 0.268941 0.268941 0.952574 0.017986 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 9
9 Q6NS38_10 0.982014 0.268941 0.017986 0.500000 0.017986 0.047426 0.268941 0.047426 0.047426 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 10
10 Q6NS38_11 0.268941 0.880797 0.500000 0.268941 0.017986 0.731059 0.731059 0.119203 0.119203 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 11
11 Q6NS38_12 0.880797 0.268941 0.268941 0.119203 0.119203 0.268941 0.268941 0.731059 0.880797 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 12
12 Q6NS38_13 0.500000 0.268941 0.268941 0.952574 0.500000 0.119203 0.268941 0.952574 0.500000 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 13
13 Q6NS38_14 0.731059 0.731059 0.268941 0.500000 0.047426 0.731059 0.731059 0.119203 0.880797 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 14
14 Q6NS38_15 0.500000 0.500000 0.268941 0.268941 0.047426 0.731059 0.268941 0.119203 0.119203 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 15
15 Q6NS38_16 0.880797 0.952574 0.119203 0.268941 0.268941 0.268941 0.500000 0.880797 0.268941 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 16
16 Q6NS38_17 0.500000 0.880797 0.268941 0.119203 0.119203 0.731059 0.500000 0.119203 0.119203 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 17
17 Q6NS38_18 0.500000 0.880797 0.500000 0.500000 0.006693 0.880797 0.500000 0.268941 0.500000 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 18
18 Q6NS38_19 0.268941 0.731059 0.500000 0.880797 0.119203 0.880797 0.982014 0.119203 0.268941 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 19
19 Q6NS38_20 0.731059 0.500000 0.731059 0.880797 0.017986 0.268941 0.880797 0.268941 0.500000 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 20
20 Q6NS38_21 0.268941 0.880797 0.119203 0.268941 0.006693 0.982014 0.880797 0.047426 0.500000 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 21
21 Q6NS38_22 0.500000 0.880797 0.119203 0.268941 0.006693 0.500000 0.952574 0.500000 0.268941 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 22
22 Q6NS38_23 0.268941 0.500000 0.119203 0.119203 0.002473 0.880797 0.731059 0.047426 0.731059 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 23
23 Q6NS38_24 0.500000 0.731059 0.268941 0.268941 0.017986 0.500000 0.268941 0.268941 0.119203 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 24
24 Q6NS38_25 0.731059 0.731059 0.500000 0.731059 0.017986 0.268941 0.500000 0.731059 0.119203 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 25
25 Q6NS38_26 0.500000 0.500000 0.500000 0.880797 0.006693 0.880797 0.880797 0.268941 0.500000 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 26
26 Q6NS38_27 0.500000 0.731059 0.500000 0.500000 0.017986 0.731059 0.880797 0.731059 0.268941 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 27
27 Q6NS38_28 0.500000 0.268941 0.119203 0.500000 0.006693 0.500000 0.500000 0.119203 0.731059 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 28
28 Q6NS38_29 0.880797 0.500000 0.268941 0.268941 0.731059 0.731059 0.500000 0.268941 0.119203 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 29
29 Q6NS38_30 0.731059 0.500000 0.268941 0.268941 0.268941 0.268941 0.268941 0.500000 0.268941 ... 0.0 1.0 0.5 0.0 2 - False 0 Q6NS38 30
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
85634 P11746_257 0.119203 0.500000 0.731059 0.119203 0.017986 0.997527 0.268941 0.268941 0.880797 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 257
85635 P11746_258 0.500000 0.119203 0.880797 0.500000 0.006693 0.880797 0.119203 0.880797 0.993307 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 258
85636 P11746_259 0.268941 0.047426 0.047426 0.119203 0.006693 0.731059 0.017986 0.119203 0.880797 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 259
85637 P11746_260 0.731059 0.119203 0.500000 0.119203 0.002473 0.500000 0.119203 0.119203 0.880797 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 260
85638 P11746_261 0.500000 0.268941 0.952574 0.119203 0.000911 0.500000 0.119203 0.268941 0.952574 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 261
85639 P11746_262 0.731059 0.047426 0.119203 0.119203 0.006693 0.731059 0.119203 0.880797 0.731059 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 262
85640 P11746_263 0.880797 0.017986 0.731059 0.268941 0.047426 0.500000 0.006693 0.731059 0.500000 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 263
85641 P11746_264 0.731059 0.017986 0.731059 0.047426 0.047426 0.731059 0.119203 0.500000 0.880797 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 264
85642 P11746_265 0.268941 0.047426 0.017986 0.047426 0.000911 0.500000 0.119203 0.047426 0.268941 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 265
85643 P11746_266 0.268941 0.119203 0.731059 0.119203 0.047426 0.731059 0.119203 0.047426 0.993307 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 266
85644 P11746_267 0.119203 0.047426 0.119203 0.006693 0.017986 0.500000 0.047426 0.119203 0.500000 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 267
85645 P11746_268 0.268941 0.500000 0.993307 0.500000 0.268941 0.952574 0.268941 0.268941 0.880797 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 268
85646 P11746_269 0.880797 0.017986 0.268941 0.119203 0.047426 0.731059 0.119203 0.268941 0.993307 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 269
85647 P11746_270 0.731059 0.500000 0.500000 0.500000 0.119203 0.880797 0.880797 0.047426 0.731059 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 270
85648 P11746_271 0.119203 0.731059 0.731059 0.268941 0.119203 0.993307 0.119203 0.880797 0.952574 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 271
85649 P11746_272 0.268941 0.119203 0.993307 0.268941 0.047426 0.880797 0.268941 0.268941 0.952574 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 272
85650 P11746_273 0.880797 0.268941 0.268941 0.119203 0.002473 0.880797 0.268941 0.047426 0.119203 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 273
85651 P11746_274 0.880797 0.017986 0.119203 0.268941 0.017986 0.731059 0.119203 0.268941 0.268941 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 274
85652 P11746_275 0.268941 0.500000 0.119203 0.119203 0.006693 0.731059 0.017986 0.119203 0.982014 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 275
85653 P11746_276 0.268941 0.119203 0.500000 0.268941 0.002473 0.997527 0.500000 0.268941 0.731059 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 276
85654 P11746_277 0.119203 0.500000 0.731059 0.500000 0.002473 0.997527 0.500000 0.119203 0.880797 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 277
85655 P11746_278 0.268941 0.500000 0.500000 0.268941 0.006693 0.731059 0.119203 0.731059 0.880797 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 278
85656 P11746_279 0.119203 0.268941 0.268941 0.047426 0.017986 0.500000 0.017986 0.119203 0.500000 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 279
85657 P11746_280 0.119203 0.268941 0.731059 0.268941 0.006693 0.993307 0.268941 0.268941 0.952574 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 280
85658 P11746_281 0.268941 0.047426 0.500000 0.500000 0.006693 0.952574 0.952574 0.119203 0.880797 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 281
85659 P11746_282 0.047426 0.006693 0.268941 0.500000 0.002473 0.982014 0.500000 0.006693 0.500000 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 282
85660 P11746_283 0.119203 0.500000 0.731059 0.047426 0.000911 0.997527 0.119203 0.047426 0.731059 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 283
85661 P11746_284 0.119203 0.880797 0.119203 0.017986 0.006693 0.997527 0.500000 0.047426 0.731059 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 284
85662 P11746_285 0.119203 0.268941 0.982014 0.268941 0.006693 0.731059 0.119203 0.993307 0.500000 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 285
85663 P11746_286 0.119203 0.047426 0.500000 0.017986 0.006693 0.999089 0.268941 0.119203 0.982014 ... 0.0 0.5 0.0 0.0 0 - False 0 P11746 286

85664 rows × 524 columns

Data Pre-processing: normalization

We apply column-wise z-score normalization (standardization) to each feature column:

$$X' = \frac{X - \bar{X}}{s}$$

where $\bar{X}$ is the column mean and $s$ is the column standard deviation. The normalized set is used in parallel with the non-normalized dataset for comparison.


In [95]:
## create a column-wise normalized copy of the data-set (feature columns only)

dna_norm = dna.copy()

for col in dna_norm[dna_norm.columns[1:][:-6]].columns:
    # the small constant avoids division by zero for (near-)constant columns
    dna_norm[col] = (dna_norm[col] - dna_norm[col].mean()) / (dna_norm[col].std() + .00001)
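As a quick sanity check (a sketch, not part of the original run), every normalized feature column should now have a mean close to 0 and a standard deviation close to 1 (slightly below 1 because of the small constant in the denominator):

## verify the column-wise standardization
feature_cols = dna_norm.columns[1:][:-6]
print(dna_norm[feature_cols].mean().abs().max())   # should be near 0
print(dna_norm[feature_cols].std().max())          # should be near 1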

In [153]:
dna_norm


Out[153]:
ID_pos -2_A_pssm -2_R_pssm -2_N_pssm -2_D_pssm -2_C_pssm -2_Q_pssm -2_E_pssm -2_G_pssm -2_H_pssm ... intermediate_composition3 buried_composition1 buried_composition2 buried_composition3 fold class class_bool class_num ID pos
0 Q6NS38_1 0.471432 0.333399 0.356656 0.424321 0.536658 0.167189 0.331158 0.821829 0.114952 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 1
1 Q6NS38_2 -1.263070 -1.249199 -1.354682 -1.189555 -0.911272 -1.383960 -1.273175 -0.970122 -1.344461 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 2
2 Q6NS38_3 -0.987983 -1.249199 -0.964372 2.045131 -0.957874 -1.383960 1.082607 -0.929099 -1.344461 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 3
3 Q6NS38_4 1.356970 1.966934 -0.964372 -0.325185 -0.945197 0.959118 -0.907270 -0.561426 0.114952 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 4
4 Q6NS38_5 1.356970 -0.425236 -0.964372 -1.175865 -0.911272 -0.624740 -0.420290 -0.822158 -1.255325 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 5
5 Q6NS38_6 -0.987983 1.583672 -1.315504 -1.139232 -0.822838 -1.383960 -0.420290 -0.561426 -1.344461 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 6
6 Q6NS38_7 1.356970 -0.916874 -0.964372 -0.810906 -0.822838 -1.137952 -1.236446 -0.929099 -1.344461 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 7
7 Q6NS38_8 -0.987983 1.583672 -0.964372 -0.810906 -0.157423 1.819240 -0.420290 -0.561426 -0.584634 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 8
8 Q6NS38_9 0.471432 1.915998 -0.964372 -0.810906 -0.945197 -0.624740 -0.420290 2.465816 -1.344461 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 9
9 Q6NS38_10 2.318762 -0.425236 -1.315504 0.424321 -0.911272 -1.383960 -0.420290 -0.822158 -1.255325 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 10
10 Q6NS38_11 -0.414106 1.583672 0.356656 -0.325185 -0.911272 0.959118 1.082607 -0.561426 -1.038003 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 11
11 Q6NS38_12 1.930846 -0.425236 -0.444912 -0.810906 -0.607225 -0.624740 -0.420290 1.661155 1.267906 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 12
12 Q6NS38_13 0.471432 -0.425236 -0.444912 1.892379 0.536658 -1.137952 -0.420290 2.465816 0.114952 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 13
13 Q6NS38_14 1.356970 1.092035 -0.444912 0.424321 -0.822838 0.959118 1.082607 -0.561426 1.267906 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 14
14 Q6NS38_15 0.471432 0.333399 -0.444912 -0.325185 -0.822838 0.959118 -0.420290 -0.561426 -1.038003 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 15
15 Q6NS38_16 1.930846 1.819338 -0.964372 -0.325185 -0.157423 -0.624740 0.331158 2.205084 -0.584634 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 16
16 Q6NS38_17 0.471432 1.583672 -0.444912 -0.810906 -0.607225 0.959118 0.331158 -0.561426 -1.038003 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 17
17 Q6NS38_18 0.471432 1.583672 0.356656 0.424321 -0.945197 1.472330 0.331158 -0.017497 0.114952 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 18
18 Q6NS38_19 -0.414106 1.092035 0.356656 1.659549 -0.607225 1.472330 1.898763 -0.561426 -0.584634 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 19
19 Q6NS38_20 1.356970 0.333399 1.158225 1.659549 -0.911272 -0.624740 1.569586 -0.017497 0.114952 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 20
20 Q6NS38_21 -0.414106 1.583672 -0.964372 -0.325185 -0.945197 1.819240 1.569586 -0.822158 0.114952 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 21
21 Q6NS38_22 0.471432 1.583672 -0.964372 -0.325185 -0.945197 0.167189 1.803020 0.821829 -0.584634 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 22
22 Q6NS38_23 -0.414106 0.333399 -0.964372 -0.810906 -0.957874 1.472330 1.082607 -0.822158 0.814537 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 23
23 Q6NS38_24 0.471432 1.092035 -0.444912 -0.325185 -0.911272 0.167189 -0.420290 -0.017497 -1.038003 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 24
24 Q6NS38_25 1.356970 1.092035 0.356656 1.173828 -0.911272 -0.624740 0.331158 1.661155 -1.038003 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 25
25 Q6NS38_26 0.471432 0.333399 0.356656 1.659549 -0.945197 1.472330 1.569586 -0.017497 0.114952 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 26
26 Q6NS38_27 0.471432 1.092035 0.356656 0.424321 -0.911272 0.959118 1.569586 1.661155 -0.584634 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 27
27 Q6NS38_28 0.471432 -0.425236 -0.964372 0.424321 -0.945197 0.167189 0.331158 -0.561426 0.814537 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 28
28 Q6NS38_29 1.930846 0.333399 -0.444912 -0.325185 1.230739 0.959118 0.331158 -0.017497 -1.038003 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 29
29 Q6NS38_30 1.356970 0.333399 -0.444912 -0.325185 -0.157423 -0.624740 -0.420290 0.821829 -0.584634 ... -0.396931 0.638864 0.573303 -0.124084 2 - False 0 Q6NS38 30
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
85634 P11746_257 -0.987983 0.333399 1.158225 -0.810906 -0.911272 1.872411 -0.420290 -0.017497 1.267906 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 257
85635 P11746_258 0.471432 -0.916874 1.677685 0.424321 -0.945197 1.472330 -0.907270 2.205084 1.608557 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 258
85636 P11746_259 -0.414106 -1.152540 -1.213375 -0.810906 -0.945197 0.959118 -1.236446 -0.561426 1.267906 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 259
85637 P11746_260 1.356970 -0.916874 0.356656 -0.810906 -0.957874 0.167189 -0.907270 -0.561426 1.267906 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 260
85638 P11746_261 0.471432 -0.425236 1.926688 -0.810906 -0.962565 0.167189 -0.907270 -0.017497 1.485228 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 261
85639 P11746_262 1.356970 -1.152540 -0.964372 -0.810906 -0.945197 0.959118 -0.907270 2.205084 0.814537 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 262
85640 P11746_263 1.930846 -1.249199 1.158225 -0.325185 -0.822838 0.167189 -1.273175 1.661155 0.114952 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 263
85641 P11746_264 1.356970 -1.249199 1.158225 -1.043736 -0.822838 0.959118 -0.907270 0.821829 1.267906 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 264
85642 P11746_265 -0.414106 -1.152540 -1.315504 -1.043736 -0.962565 0.167189 -0.907270 -0.822158 -0.584634 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 265
85643 P11746_266 -0.414106 -0.916874 1.158225 -0.810906 -0.822838 0.959118 -0.907270 -0.822158 1.608557 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 266
85644 P11746_267 -0.987983 -1.152540 -0.964372 -1.175865 -0.911272 0.167189 -1.140703 -0.561426 0.114952 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 267
85645 P11746_268 -0.414106 0.333399 2.067995 0.424321 -0.157423 1.718338 -0.420290 -0.017497 1.267906 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 268
85646 P11746_269 1.930846 -1.249199 -0.444912 -0.810906 -0.822838 0.959118 -0.907270 -0.017497 1.608557 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 269
85647 P11746_270 1.356970 0.333399 0.356656 0.424321 -0.607225 1.472330 1.569586 -0.822158 0.814537 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 270
85648 P11746_271 -0.987983 1.092035 1.158225 -0.325185 -0.607225 1.857947 -0.907270 2.205084 1.485228 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 271
85649 P11746_272 -0.414106 -0.916874 2.067995 -0.325185 -0.822838 1.472330 -0.420290 -0.017497 1.485228 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 272
85650 P11746_273 1.930846 -0.425236 -0.444912 -0.810906 -0.957874 1.472330 -0.420290 -0.822158 -1.038003 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 273
85651 P11746_274 1.930846 -1.249199 -0.964372 -0.325185 -0.911272 0.959118 -0.907270 -0.017497 -0.584634 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 274
85652 P11746_275 -0.414106 0.333399 -0.964372 -0.810906 -0.945197 0.959118 -1.236446 -0.561426 1.574364 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 275
85653 P11746_276 -0.414106 -0.916874 0.356656 -0.325185 -0.957874 1.872411 0.331158 -0.017497 0.814537 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 276
85654 P11746_277 -0.987983 0.333399 1.158225 0.424321 -0.957874 1.872411 0.331158 -0.561426 1.267906 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 277
85655 P11746_278 -0.414106 0.333399 0.356656 -0.325185 -0.945197 0.959118 -0.907270 1.661155 1.267906 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 278
85656 P11746_279 -0.987983 -0.425236 -0.444912 -1.043736 -0.911272 0.167189 -1.236446 -0.561426 0.114952 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 279
85657 P11746_280 -0.987983 -0.425236 1.158225 -0.325185 -0.945197 1.857947 -0.420290 -0.017497 1.485228 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 280
85658 P11746_281 -0.414106 -1.152540 0.356656 0.424321 -0.945197 1.718338 1.803020 -0.561426 1.267906 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 281
85659 P11746_282 -1.263070 -1.286279 -0.444912 0.424321 -0.957874 1.819240 0.331158 -0.970122 0.114952 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 282
85660 P11746_283 -0.987983 0.333399 1.158225 -1.043736 -0.962565 1.872411 -0.907270 -0.822158 0.814537 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 283
85661 P11746_284 -0.987983 1.583672 -0.964372 -1.139232 -0.945197 1.872411 0.331158 -0.822158 0.814537 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 284
85662 P11746_285 -0.987983 -0.425236 2.028817 -0.325185 -0.945197 0.959118 -0.907270 2.613780 0.114952 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 285
85663 P11746_286 -0.987983 -1.152540 0.356656 -1.139232 -0.945197 1.877763 -0.420290 -0.561426 1.574364 ... -0.396931 -1.565229 -1.513792 -0.124084 0 - False 0 P11746 286

85664 rows × 524 columns

Analysis I: Measuring the variation of significance among PSSM features

We want to see whether certain evolutionary conservation patterns influence the DNA-binding mechanism. We apply Recursive Feature Elimination (RFE) to rank all our PSSM-based features by their predictive power using a linear SVM (that is, an SVM with a linear kernel). RFE repeatedly fits the estimator and discards the least important feature at each step until the desired number of features remains.

Reference:

  • Guyon, I., Weston, J., Barnhill, S., & Vapnik, V., “Gene selection for cancer classification using support vector machines”, Mach. Learn., 46(1-3), 389–422, 2002.

In [10]:
# extract the PSSM feature matrix and target labels; use the first 1000 residues as a quick subset
X = dna[dna.columns[1:][:-6]]
X = X[[x for x in X.columns.tolist() if 'pssm' in x]]
X = X.iloc[range(1000)]
y = dna['class_bool']
y = y[range(1000)]

# apply RFE with a linear C-SVM, keeping the 5 top-ranked features
estimator = SVC(kernel="linear")
selector = RFE(estimator, 5, step=1)
selector = selector.fit(X, y)

print selector.ranking_


[ 5 36 72 79 57 41 49 61 65 64 76 40 44 33 95 78 69 51 68 48 26 82 86 38 35
  8 55 52 43  1  1  2  3 39 28 96 37 93 77  1 15 89 46 10 12 19 25 75 24 14
 67 30 23 32 74 94 17 21 66 34  9 22 85 84 80 20 45 31 47 13 88 73 11 81 54
 58 27 90 62  6 91 59 63 56 50  7  1 83 92  4 29 42 16 87 18 70 71 53 60  1]
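The fitted selector also exposes which features survived the elimination. A minimal sketch (not part of the original run, assuming a pandas version with Series.sort_values) that maps the rank-1 features and the full ranking back to column names:

## features kept by RFE (rank 1) and a named ranking
print(X.columns[selector.support_].tolist())
feature_ranks = Series(selector.ranking_, index=X.columns).sort_values()
print(feature_ranks.head(10))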

In [11]:
# repeat the previous routine over the whole dataset, in chunks of 1000 residues

pssm_rank = pd.DataFrame()
cat = dna['class']

for i in range(dna.index.size / 1000):
    
    this_cat = cat[range(i * 1000, (i + 1) * 1000)]
    if this_cat.unique().size > 1:  # skip chunks that contain only one class
    
        X = dna[dna.columns[1:][:-6]]
        X = X[[c for c in X.columns.tolist() if 'pssm' in c]]
        X = X.iloc[range(i * 1000, (i + 1) * 1000)]
        y = dna['class_bool']
        y = y[range(i * 1000, (i + 1) * 1000)]

        estimator = SVC(kernel="linear")
        selector = RFE(estimator, 5, step=1)
        selector = selector.fit(X, y)
        print selector.ranking_
    
        pssm_rank[str(i)] = selector.ranking_
    
pssm_rank.index = [c for c in X.columns.tolist() if 'pssm' in c]


[ 5 36 72 79 57 41 49 61 65 64 76 40 44 33 95 78 69 51 68 48 26 82 86 38 35
  8 55 52 43  1  1  2  3 39 28 96 37 93 77  1 15 89 46 10 12 19 25 75 24 14
 67 30 23 32 74 94 17 21 66 34  9 22 85 84 80 20 45 31 47 13 88 73 11 81 54
 58 27 90 62  6 91 59 63 56 50  7  1 83 92  4 29 42 16 87 18 70 71 53 60  1]
[59  9 48  7 65 33  4 15 76 49 14 13 64 51 19 28 31 45 95 11 55 82  6  2 66
 17 32 84 61 56 10 86 91 68 29 57 36 26 78 60 22 93 85 38 92 43 39 21 46 35
  1 37  5 70 52  3  8 83 88 41 23 54 12 20 90 94 77 47 30  1 18 62 75 79 71
 25 72 69 81  1 53 34 24 67 50 96 58 80 27  1 40 89 44 63 42 16 74 73 87  1]
[26 67  2  1 60 56  8 72 22 25  7 23 81 34 70 24 71 38 93 88 69 42 28 78 61
 85 27 39 76 57 18 91 36 84 16 94 83 35 58 54 66 30 95 86 65 48  4  1 10 49
 59 45 79 41  1 75 33  1  9 29 17 12 37 63 89 47  3 14 53 74 43 13 50 46 52
 19 96 68 77 11  5 80 55 21 82 90 20  1 51 73 62 15 64 40 44  6 87 31 92 32]
[53 50 38 24 73 58  1 52 49 36 23 92  8 69 74 87 65 18 14 35 10 17 42 86 33
 71 68 41  1 61  4 16 55 40 72  6 89 76 47 37 12 43 46 96 44 26  2 93 66 32
 62 85 31 15 91  1 51 45 67 94 21  7 77 64 30 84 34 88 56 29 83 82 60 57 11
 20 22 79  9 13 90 28 78 54  1 63 39 19  3 95 75 27 80 25  1 70 48 59 81  5]
[17 64 53 33 93 84 62 31 85 73 87 90  5 39 10  1 44 40 91 74 20 76 19 14 29
 89 82 96 55 81 11 16 67 69  9 18 58  8 70 30 79 80 65 72 15  6 35 83 21 51
 54 61  1  2  7  1 71  1 24 23 27 45 92 52  3 78 77 86 43 94 34 59  1 28 25
 26 66 38 63 22 75 13 49 37 32 42 60 88 41 36 95 46 57 12 68 47 48  4 50 56]
[16 29 28 20 76  6  1 13 87 38 89 54 37 56 55 35 64 53 39 90  5 25 23 73 96
 42  4  8 59 34 93  9 17 12 68 66 86 10 85 27 47 63 57 74 78 80 11 70 51 60
 52  3 46 92 30  1 19 21 33 48 88 91 67  7 75 49 61 65 44  1 58 50 26  1 95
 32 18 41  2 81 40 14 77 72 82 15 45 36 24  1 83 84 31 71 79 22 62 43 69 94]
[ 4 22 38 25 18 19 10 30 85 95 27  7 67 68 92  1 54 63 61 84 20 90 73  6 59
 32 28 26 82  1  1  5 51 49 11 37  2 66 79  1 52 13 21 17 78 91 39 36 23 93
 50 15 60 65 69 24 83 89 87 42  3 48 12 86 94 75 33 47 64 35 31 46 96 71 58
  1 72 40 41 80 43  8 74 34 53 76 70 16 88 56 44  9 55 81 45 14 29 62 77 57]
[53 71 93 92  6 15  1 70 51 66 80 16 75 38 61  5 84 17 58 72 52 78 83 59  4
  1 12 87 55  1 96  9 46  1 77 82  8 89  3 34 63 37 76 57 25 32 29 86 81 79
 60 64 11 14 62 31 36 67 56 35 10 74 47 95 54  2  1 65 26 33 43 48 91 44  7
 73 49 94 90 50 13 24 28 41 39 19 21 88 42 69 30 23 18 27 40 45 20 85 22 68]
[93 69 47 16 64 37 38 18 60 14 61 94 42 23 58 92 81 33 59 95 62 24 84 57 79
  2 56 86 49 54 32  1 40 83 89 77 67 12 87 51  9  8 76  1 75 52 71 39  1 31
 78 10 63 91 66 19 21 26 27 30 85  1 70  4 11 73 96  3 17 34 45  1 44 90 53
 35 36 20 72 74 46 29 65  6 80 88 25 55 41 50 48 28 82  7 43  5 13 15 22 68]
[55  6 83 79 34 15  5 76  4 26  3 17 19 14 56 35 74 46 29 30 42 77 24 23 45
 47  2 94 25 60  1  1 80 87 71  1 52 16 41 31 20 43 78 67 75 68 10 21 65 69
 39 11 88 38 92 48 91 81 32 93 62 22 44 89 73 85 63 53 70 96 12 82 13 95 33
 86 64 90 18 51 54 84  8 40 27  1 72  9  7 61 50  1 49 57 66 36 37 28 58 59]
[79 25 95 10 54 12 22 87 81 86 78 61 40 57 83  7  1 49 47  2  4 92 55  8 33
 28  9 19 29 30 82 24 20 73 53 85 60 88 64 68  1 84 18 94 45 39  6 34 80 38
 46 31 11 62 59 51  1 50 65 89  5 70 15 43 23 44 17 36 41 56 58 48 32 26 27
  1 35 72 37 13 67  3 91 66 69 21 77 90 76 16 63  1 52 96 42 93 71 75 14 74]
[68 23 41 50 44 90 14 29 30  1  1 93 11 20  8 92 35 91 59 38 26  2 76 89 65
  9 55 53 61 32 10 12  1 13 74 31 73 94 51 84 63 66 87 81 33 27 49 28 40 21
 17 43 22 86 70 62 57 67 34 36 83  1 24 60 39  1 77 80 25 58 52  4  6  5 96
 19 45 88 75 95 56  3 48 47 16 82 46 85 64 71 79 42 15 78 37 69 18 72  7 54]
[48 88  5 33 11  6 41 76 43  3 77 72 65  9  7  1 91 78 75 79 45  1 30 12 17
 52 24 60 93 21 34 25 13 58  1  1 56 19 89 27 44 28 16 51 90 86  4 40 70 92
 54 31 26 10 47 50 87  8 71 42 29 14 95 57 55 32 85 53 82 96  2 66 74 81  1
 15 37 39 46 59 80 23 68 18 83 67 22 69 84 61 38 94 36 64 20 62 63 49 73 35]
[76 77 16 15 88 82 54 40 20 29 14 56 93 24 66 13 12 58 61 50 22 33  1  1 45
 79  4  3 81 39  1 96 23 78 47 18 19 60 65 21 11 34 91 48 68  1 67 49 72 43
 36  2 95 32 53 90 52 31 25 44 74 41 83 27 85  8 26 59 46 57 37 73 89 71 30
 94 55  7 42  5 17 92 75  6 84 38  1 35 80 69 10 51 87 86  9 70 63 28 64 62]
[96 47 27 26 78  5  8 39 95 53  3  9 65 84 43 64 46 70 60 29 17 51 24 81 22
 14  4 19 77  1 58 15 87 11 62 42 20 79 12 86 68 40 83 69 74 35  1 94 49 89
 21  1 23 56 91 41  6 55 75 44 92 28 54 16 36 71  1 57 30 76 32  1 31 34  7
 50 59 52 33 25 90 93 38 88 48 85 10 80 82 73  2 18 13 37 72 45 67 61 63 66]
[38 85 87 58 21 23 18 80 42 40 37 14 16 66 89 39 28 62 60 13  2  6 45 36 26
 90 15 82 63 22  1 69 43 27 53  1 20 86  9  1 17 19 68 32 59 10  8 96 49  4
 25 91 47 75 65 95 31 92 52  1  3 24 64 48 29 44  7 94 93 56 88 77 81 35 72
  5 51 57 55 12 79 74 11 61 33 73 50 84 34 71 83 67 78 41 30  1 46 76 54 70]
[69 92 48 61 55 46 18 14 51 75 43 45 47  3 12 54 90 32  9 84 70  1 89  1 10
  1  1 86 52 85 16 19 20 22 29 35 39 34 15 57 66 49 87 64 28  6  8 63 80 71
 53 50 44 27  2  4  5  7 59 65 95 37 67 26 73 58 13 91 81 41 76 17 72 79 38
 40 94  1 24 42 74 23 62 82 60 56 21 96 31 78 11 36 30 83 68 33 88 25 93 77]
[29 16 13 88 61 46 49 11  6 31 30 87 73 86 78 71 18 39 58 32 44  1 96 23 10
  1 83 14 53 21 20 70  2 79 72  7 51 64 43 48 42  5 91 69 22 80  3 68 50 34
 36 57  1 55 89  4 67 93 56 26 12 84 65 54  1 77 19 25 82 35 74  9 24 17 59
 75 92  8 52 33 37 38 47  1 60 40 81 90 27 41 62 15 76 94 28 45 95 63 66 85]
[67  8 16 71 25 38 17 15  7 58 79 68 74 39 90  1 45 49 80 46 35  1 11 48 53
 28 12 52  4 18  1 20 23 37  1  9 51 94 43 34 55 84 13 10 40 31 59 27 82 91
 26 47 22 21 88 62 19 33 32 78 72 63 14 89 92 64  1 56 75 69 85 95 57  2 73
  5 81 24 41 70 30 93 54 66 77 86 44 87 60  3  6 42 96 36 65 29 76 50 61 83]
[ 3 47 17 71 11 19 37 80 14 15 48 96 88 53 20 66 34 77 52 24 13 32  2 29 25
 30  1 87 81  1  1 69 36 60 64 79 68 91 62 12 57 90 26 40  1 61  6 73 54  5
 95 42  7 44 86 27 46 50 65  1 41 28  4 94 18 33 21 49 93 22 72 75 74 45 58
 39 84  8 43 89 70 55  9 56 51 83 23 31 35 38 59 16 85 76 82 10 63 78 92 67]
[63 74 16 73 58 57  1 25 56  9  1 86 67 95 49 26 84 27 68  7  3 66 76  2 46
 20 13 17 62 29  1 19 59 18  8 12 54 85 83 91 34 10 41 24 87 88 39 11 77 22
  1 23  5 15 75 38 44 92 14 21 35 89 96 28 78 90 50 55 82 53  1 93 52 36 51
 33 32 81 47 60 30 80 65 64 61 70 79 45 69 71 48 37  6 40  4 94 31 43 42 72]
[70 22 78 23 95 83  1 51 15 47 21 37 77 18 71 81 27 65 96 38  1 30 93 29  1
 10  2 87  7 20 13  1 19 41 82  1 88 84 55 89 42 74 85  4 92 94 48 11  9 54
  6 59 25 57 66 16 28 90 58 40 60 62 44  8 80 31 32 67 56 52  3 72 50 46  5
 33 34 12 64 53 91 17 86 61 68 63 73 24 49 76 79 26 69 14 39 36 35 45 43 75]
[ 2 84 63 25 58 43  4 62 52  1 75 79 74 87 35 27 13 20 53 26 22 78 50 39 48
 83 10 28 65 77 46 32 15  9 76 70 21 16 17 66 47 12 85 23  1 68 24 19 40 38
 14 81 37 60 11 45 93 59 54 94 91 33 18 73 90 31 41  8 96 61 42 34 69 29  3
 55  6 44 72 95 89 49  5 64 67 80 88 56 57  1  1 71 36 92  1 51 86 82 30  7]
[12 43 44 39 52 13 30 15 81  7 83  3 59 68 84 47 25 35 82  9  1 86 42  1 26
  5 10 40 88 85  1 61 76 63 14 16  1 54 21 75 94 67 33 58 62 19 60 29 87 74
 77 32 78 24 79 55 28 18 20  8 90 64 72 57 66 22 93 23 71 53 31 95  4 73 17
 27 34 41  6 92 91 70 37 96 48 11 69 89 45 65 80 49 36  2 38 46 51 56  1 50]
[73 49 50 52 30 19 18 29 85  1  2 54 91 70 34 94 36 40 77 57  1 79 23 67 43
 41  1 46 24 66  1  4 75 25  3 72  1 64 86 65 44 82 74 33 13 78 12 88 11 69
 63 89 80 27 47 62  8 26 48 76 28 71 90 96 38 51 16 45 95 14  9 59 81 20 10
 58 22 39 17 15 21 35 42  7 56 55  5 84 93 68  6 61 92 32 87 83 53 60 31 37]
[72 12 39 29 20 51 16 36 35 84 71 48 69 87 96 89 22 32 41 33 42 91 52  4  1
 13 73 40 11 94 83 67 25 37 95 93  1  8  6 34 60 77 53 14 38 74 55  2 68  1
  3 45 43 31 63 65 23 46 57  1 66 44 78  1 76 62 18 85 79 50 49 19 64  7 26
 56 59 92 10 90 15 28 24 17 47 70 75 54 27 88 21 81  9 82 80 58 61 30  5 86]
[ 8  2 59 86 92 11 85 15 53 33 39  1 18 84 77 88 87 70 51 58  1 68  3  1 45
 83 57 52 73 38  4 65 61 55 64 40 78 89 43 44 49 66 46 62 25 93 34 22 82 28
 37 60 30 91 50 41 95 81 17 23 94 67 29 21  7 69 26  1 14 90  1 36  6 80 24
 31 63 79  9 74 56 32 10 19 54 75 20 16 35 76 42 12 71 13  5 47 48 72 27 96]
[15 37  3 25 45 86 30 84 90 73  2 63  5 46 52 70 35 79 75 16 64  9 48 58 11
 18  1 56 68 39  1 20 44 21 62  1  1 34 22 38 81 40 27 66 95 36 23 43 76 78
 14 85 96 54 51 80 41 10 53 49 61 67 26 31 93 32 33 88  8 87 28  7 82 50 42
 57 19  1 89 83 47 91 77 94 74 60 24 59  4 92 72 17 71 12 29 13 55 69 65  6]
[80 22 56 60 41 10 76 73 25 95 24 23 27 15 66 20 94 14 59  8 67  1 33 34  3
 90  1 37 31 64 52 69 57 17  4 53 72 21 70  2 50 40 84 87 58 96 88 49 38 85
 46 89  1 92 35 77 11  6 12  1 18 30 13 29 16 65 83 91 54 61 45 75 47 86 79
 51 81 32 28  1 39  7 42 93 26 55 48 36 78 74 63  9 62 71 82 43 44 19 68  5]
[87 72  3  1 75 91  1 88 57 10  1 55  9 22 35 64 27 17 23 85 30 93 45  4 95
 46 14 48 89 71 70  1  5 54 80 47 92 81 74 12 59 49 66 31 21 78 32 76 41 50
 40 63 96 39 33 42 37 16 73 28  2 61 29 20 62 56 19 13 79 67 34 60 15 24 18
  6 52 82 43 11 36 94 38  8 53 58 90 26 77  7 69 86 25 83  1 68 44 65 84 51]
[82 24 13 41 94  4 56 64 16 78 95  5 93 27 54 31 30 29 52 47 14  7  3  1  9
  8  1 68 75 11  1 43 90  2  1 37 10 61 83 77 59 67 96 84 18 85 15 23 17 81
 19 65 57 46 79 58 76 62 49 48 88 26 51 86 44 28 72  6 42 21 34 20 87 74 32
 66 35 70 36 92 80 91 45 40 55 63 12  1 53 38 69 73 71 60 50 22 25 89 33 39]
[77 70 64 83 59 50 16 61 13 41 24 15 88 39 72 92 43 28 12 44  6  2  8 27 19
  9  1 26  5 62  1  7 96 36 91 82 49 20 18 21 22 53 56 94 14 47 45 46 30 68
 29 54 60 73 78  4  3 11 55 75 69 79 84 23 76 40 63 93 52 31 25 81 37 51 85
 48 35 38 66 32 65 87 95  1 86 90 42 10 58 33  1 17 57 71  1 74 80 67 89 34]
[15 36 53 57 42 58 20 85 32 51 79 40 82 18 47 64 14 45 68 87 10  1 96 19  1
 94 90 55 12 91  5 17 88 52 72 93  1 81 43 56  8  6 89 33 25 41 78 71 34 83
 31  4 59 73 69  7 21 63 76 39 95  1 60  9 80 61 86 66 54 48 49  2 26 27  1
 13 30 35 28 67 24 92 50 38 16 37 65 75 84 46 77 11 22 62 74 23 29  3 70 44]
[11 17 43 93 78 91 33 34 90 50 73 18 87 79 23  1 32 31 29 72  1 63 46 47 74
 84 92 28 62 64 20 96 55  3 41  1 12 44  4 54 30 60 76 15 42 52 51 82 59 89
 69 25 57  6 35  1 13  9 58 10  2 86  8 24 16 61 21 40 88 19 77  1 71 27 75
  5  7 39 80 22 48 65 68 45 36 14 94 56 53 95 83 66 38 49 85 37 67 70 81 26]
[ 5 52 82 53 65 83 34 69 26 57 29 36 81 89 17 45  1 39 64 63 10 43 62 95 49
 84 14 68 50 79  6 15 32 60  3 77 16 47 33 74 31 24 70 78 76 73 22 19 72  4
 38 23 92 48 35 94 13 75 40  7 80 46 91 86 61 18 11 42 85  1  1 12 59  9 54
 56 90 96  8  2 88 25 66 87 58 44 21 30 37  1 71 41 55 28 27 67 93 51 20  1]
[ 6 68 48 50 64 77 30 74 15 76 17 59 57 78 28 36 49 18 53 40  2  1 38 43 95
  1 22 69 58 60  1 61 75 45 89 46 65 16 47  7 33 12 62 88 79 34 42 93 44 35
 54 39  3 37 85 32 31 92 90 96 19 84 55 67 80  1 56 83 41 26 11  1 70 27 82
 24 86 91 14 13 94 23  8 20 71 21  4 81 29 10  5 87 25 63 52 73  9 66 72 51]
[10  4 37 15  3  2 96 35 18 60 77 42 95 92 87 16 33 53 75 58 13 72 31 47  6
 21  1 93 82 20  1  1 85 63 90 79  9 64 26 88 86 23 62 73 28 38  8 94 80 22
 12 40 45 61 19 46 49 27 17 54 81 51 41 89 24 57 32 39 84 56 29 59 25 74 55
 48 52 67 66 78 50  7 76 34 11 44  5  1 65 36  1 14 43 69 91 30 83 71 70 68]
[77 57 52 91 82 21 14 72  6 39  1 62 35 59 47 34 84 45 29 78 15  3 66 33  5
 43  2 25 56 74  1 68 71 67  8 37 12 41 69 88 61 53 19 51 23 55 16 76 90  1
 54 85 96 32 81 63 11 44 24  1 70 46 42 40 20 65 83 26 10 13  1 36 27  7 89
 75 22 31 48 28 87 18 50 38 73 64  9 94 17 79 60 93 80 92 95 58  4 86 30 49]
[52 76 67 96 93 38 63 82 44 95 14 51 70 11 71 20 87 69 43 21  3 19 79 34 16
 18  1 31 54 91  1  1  9 77  7  2 62 72 83 65 81 88 61 60 29  6 92 45 48 57
 55 59 28  1 35 47 46 32  1 15 75 86 58  5 84 33 64 68 50 80 36 90 37 78 12
 24 25 73 74 30  4 49 26  8 53 22 39 85 89 42 56 23 41 13 27 10 40 17 66 94]
[56 55 70 46 27 71  2 91 18 66 45  3  1  1 34 94 48 32 72 41 29 44 15 53  7
  4  1 77  9 61  8 90 38 21 16 68 78 13 25 60  5 73 22 33 11 47  1 93 64 50
 49 39 31  1 69 75 74  6 37 51 52 65 79 92 88 83 59 62 30 10 57 86 80 19 67
 54 76 20 82 12 28 17 95 23 63 87 24 26 14 58 36 96 40 43 81 85 35 84 42 89]
[88 54 89 87 47 78 13 20 46 96 50 85 76 23 29 11 25 65  1 32 74 45 68 10 66
 39 49  6 69 27 26  9 41 33 93 81 92  2 35 95 24 55 38 42 19 18 61 16 28 58
  4 56  1 44 84 62 15 71 82 34 14 63 79 30 91 53 94 37 67 12 22 60 21 73 80
 43 59 77 40 83  1  1  7 70 86  8 75 90 31  1 51  3 72 52 36 48 57 64 17  5]
[ 9 76 57  3 46 62 14 29 23 16  7 96 92 31 20 52 13 84 79 17 56  4 74  5 61
 91 66 10  8 45  1 38 39 22 41 68 63 35 30 44 70 69 36 15 53 34 25 11 12 51
 19 88 60  1 81 47 95 64  1 54 27 55 50 86 83 94 73  1 78 49  2 82 48 21 42
 72 71 80 32 37 24 75 59 65 89 77 67  1 58  6 28 85 40 18 87 33 90 93 43 26]
[89 29 46 72 93 48  4 31 96 34 23 24 83  9 54 45 50 78 76 84 19  1 42 15  8
 58  1 88 55 51  1 56 52 20  2 70  7 47 69 14 25 62 32 49 77 28  1 30 71 68
  1 90 10 92 39 13 79 73 85 86 75 91 80 95 41 82  6 21 67 57 74 38 22 16 18
 87  3 33 81  5 65 63 35 64 44 27 26 17 66 12 36 40 37 59 61 60 53 94 43 11]
[18 40 61 27 94 36  1 70 72 48 17  3 96 43 39 60 71 35 91 88 59 82 42 32 87
  8  5 64 19 52 30  4  1 13 12 22 54 84  9 51 81 38 45 44 83 10 90 29 89  6
 79 37 86 34 74 53 46 31 11 55  7 62 95 85 73  2 28 25 77 16 76 63 80 65 58
 57 47 69 23 75 49 93 78 21 68 67  1 92 15 50 26  1  1 20 66 56 33 41 14 24]
[93 25  5 75 50 21 23 61 67 37 32  1 81 24 83 58 48 44 35 40  1 57  4 56 27
 17  3 46 41 74  1  7 62 14 33 20 10 47 15 73 66 22  8 68 18 64 49 55 19 86
  2 43 13 11 60 36 92 16 12  6  1 88  9 80 72 95 31 52 76 30 28 65 82 96 34
 42 87 51 71 63 39 79 85 69 91 38 26 53 78 77 89 45 90 94 84  1 29 70 54 59]
[72 32 81 57 33 91  1 37 29 90 61 83  1 95 36 58 89 55 54 56 52  9 43 62 63
 92  8 48 13  1 41 84 12 74 79  6 60 78 75  1 47  1 65 35 80  5 19 93  4 14
 15 76 39 87 53 24 77 46 86  2 51 17 50 49 85 68 16 94 25 26 66 40 34 88 28
 67 23 22 73 27 42 10 82 20 38 59 18 69  7  3 45 11 71 64 96 44 21 31 30 70]
[73 62 45 90 88 82  5 11 16  6 24  8 77 31 69 95  1 63 28  4 46 84 89 32 13
  2  7 94  1  9 61 85 79 56 52 15 86 70 92 91 39 75 83 60 25 93 71 55 50 26
 43 65 36 67 51 10 47 42 34 87 66 29 35 49 54 37 22 78 74 12  3 27  1 33 64
 14 76 96 17 72 44 68 19 30 23 38 80 40 57  1 58 20 48 41 53 18 21 81 59  1]
[94 20 68 61 96  3 35 78 49  1  1 72 69 43 95 42 12 51 65 54 28  1 71 81 84
  6 21 26 62 18 48 66 50 70 46 16  2 90 17 15 67  7 83  4 19 11 77 88 82 33
 34 41 85  9 89  1 86 56 73 93 58 22 87 52 59 38  8 45 40 75 37 32 64 63 80
  1 74 55 39 91 44  5 36 92 30 57 14 60 25 13 47 10 29 23 76 31 53 79 24 27]
[55 38 53 95  1 70 11 69 65 76 49 14 79 59 30 86 83 18 75 87 35 62 68 40  1
 71  1 94 80 93 23  1 81 41 51 88 67 58 73  7 32 21 60  3 12 43  5 31  2 96
 63  4 91 57 34 19 61 20 56 10 84 39  8  9 74 48 24 77 90 29 54 28 27 42 78
 17 72 33 36 13 64 85 47 37 92 22  1 46 50 25 16  6 82 52 66 44 45 15 89 26]
[17 84 45 42 44 35  2 34 70 56 21  3 79 52 23 73 74 47 89 92 69 38 66 65 41
 83 15 19 90 32  4 16 64 95 93 96 80 59 94 39  1 20 43 58 40 87 10 26 54 57
 78 22 91 82 85 88 27 77  6  9 61 31 24 71 68 11 53 30 25 33  1 55  1 51 63
  7 46 37 14  1 62 67 81 72 29 50 49 36  1 76 28 60 86 18  8 48  5 12 13 75]
[94  9 73 14 51 40 15 86 13  3  1 93 47 56 96 49 50 37  8 67 69 16 10 45  1
 78  2 81 91 22  1 76 79 12 31  1 30 43 11 23 24 64 55 44 29 66 70 54 39 57
  5 19 82 58 90 74 71 18 27 85 21 28 95 41 36 92 46 84 68 83  4 77 38 88 25
 80 32 17 35 53 52 60 42 72 87 33 75 34 26 48 62 59 63 61  6 65 89  1 20  7]
[17 71 48 45 34 81  9 20 86 77 76 64 52 47 63 37 49 26 22 23 88 54  2 25  3
  1 24 21 69 62 41  8 33 56  7 78 12 36  4 91 95 18  1 79 29 75  1 57 32 16
 92 74 59 28 60 80 27 89 72 30 53 90 31 50 70 68  5  1 44 40 38 43 65 46 11
 35 10 73 96 84 87 61 13 58 83 19 14 82 93  6 66 15 67 42 51 39 55 85  1 94]
[19 25 81 72 69 44 55 18 60  1 34 79 15 73 26 71 66 68 61 40 10 33 84 94 90
  1  1 62 27 38 36  3 14 85  6 46  7 35 64 20 49 52 80 63 29 82 41 78 70 28
 21 30 77 42 67 13 16 96 37 56 53 59 31 17 91 95  4 88 93  8 45  2 48 12 54
 32 51 58 75  9 65 89 23 50 57 47 74 24 11  1  1  5 43 92 86 76 87 39 83 22]
[91 18 43 15 85 63 58 40 20 56 75 34 12 13 55 21 22 66  1 33 93 49 96 57  1
  9 53 92  2 83  3 10 52 46 36 38 69 79 47 78 80 71 59 31 25 88 82 42  5 90
 48 14 26 62 89 84 24 51 68 23 81 37 72 54  1 41 60 77 86 32  6 19 35 74 11
 67 27 95 17 64 50 73 16 76 70 39  1  4  8 65 29  1 61 94 28  7 44 87 45 30]
[34 17 84 81  6 38 35 90 20 80 41  4 82 93 76 69 71 56 78 43  1 30 32  9 75
 70 39 36 83 42  3 79 48 53 16  5 45 72 95 91 89 37 52 27 61 77 18 94 51 68
  8 31 26 54 46 19 11 67 65 88 73 15 57 28 62 40 23  1 12  1 29 87  1 86 55
 14 74 49 22  2 66 58 64 63  7 96 21 25 33 59 24 50 13 44 47 92  1 85 10 60]
[85 62  6  1 83 30 31 67 64 32  7 88 63 43 37 96 12 74 70 15 16 13 80  5 36
 76 10 20  4 33  1 40 14 79 91 41 39 27 51 77 82 75 90 48 55 86  1 72 54 78
  1  3 81 34 92 89 18 35 53 93 95 47 45 17 22 68 11 50 42 73 65 49 24 58  8
 19 66 46 57 38 87 94 61  1 21 29 71 25 23 52  2  9 28 56 84 60 26 44 59 69]
[14  1 61 29 91 67 90  9  7 23 20 43 55 37 45 86  4 87  1 56  6  1  8 76 95
 58  1 17 11 22 82 88 75  3 94 53 33 71  1 69 66 31 60 74 83  5 18 68 59 52
 21 32 34 41 28 80 47 10 12 48 85 92 78 16 89 72 79 84 39 65 13 30 54 27 44
 26 25 42 24 57 63 40 64 96 51 35 62 93 36 73 81 49 46 70 77 19 50 15 38  2]
[48  1 79 38 45 44 49  6 11  5 41 61 57 68 12 27  1 30 92 66 88  7 75  1 83
  2 43 22 77  4 72 89  1 78 93 71 16 36 94  3 69 26 13 14 52 73 81 24 32 25
 87 29 70 46 90 76  1  9 60 28 23 62 91 85 74 80 67  8 17 54 55 10 33 65 35
 82 21 47 95 96 15 37 19 20 58 86 59 42 53 31 64 34 40 50 84 39 51 63 18 56]
[84  5 79 42 14  6 87 96 64 61 90 16 15 92 85 17 20 82 80 71  8 38 59 57 23
  4 39 28 95 34 76  7 93 73 77 29 83 19 18  1 33 24 65 72 48  1 81 12 75 44
 43 86 74 60 46 22 25 10 11 67 45 56 70 63 78 58  3 30 53 27  1 37 66 91 13
  1 62 21 47 26 40 32 51 50 49  2 69  9 35 55 31 94 68 88  1 41 89 36 52 54]
[37  8 57 61 49 48 22 17 19  4 88 10 41  1  1 39 13 56  5 78 64  6  1  1 70
 14 60 46 59 40 80 31 30 76 69 95 58 82 26 42 51 81 28 36 87 86 27  3 85 89
 63 62 77 90 73 65 84 53 96 91 68  7 33 12 15 20  1 71 44 54 52 32 67 75 25
 35 94 43 16 55 29 50 66 38 11 24 18 34 21 74  9 72 79 45 23  2 47 93 83 92]
[ 3 85 26 50 70 45 80 73 40 42  1 13 21 66 39 81 16 84 12  2 57  1 48 63 91
 24  4 38 46 89  1 76 60  8 10  1 20 28 27 33 23 95 29 90 32 83 72 86 14 11
 88 44  6 18 15 71 43 79 93 67 49  5  7 19 41 74 94 55 65 37 82 64 35 36 78
 62 54 56 92 59 52 31 25 17 69 30 75 53 77 96 87 68 61  1 22 34 58 47  9 51]
[ 3 40 49 14 48 64 46 11 15 22 28 62 21 55 37  5 93 71 47 86  1  4 25  7 19
 52 51 72 30 63 60 78 18 84 23  1 58 44 68 82  6 89 24  9 26 66 29 70 92 61
 73 77 90 59 74  2 33 50 20 80 79 83 35  1 42 91 81 53 36 32 76 12 31 75 41
  1 94 10 17 34 85 96 65 13 95 69 54  1 87 57 45 38 43 27 67 16 39 88 56  8]
[ 6 62 81  7 86 63 10 27 16 94  1 44  1 37 46 52 39  8 28 76 21 78 51  1 55
 68 15 77 96 95  1  9 24 73 26 20 19 90 71 18 40 30  4  1 70 67 83 22 13 33
 35 32 91 92  5 41 61 14 34 74 88 93  2  3 38 31 59 47 84 79 50 65 29 25 85
 49 48 36 43 54 45 66 12 11 17 57 56 69 72 89 60 23 87 80 58 53 42 75 64 82]
[17 66 34 67 75 19 76 20 36 96 79 15 91 44 25 35 12 74 53 71  3 29  1 83 60
  2 10  1  1 18 61 21 31 27 80 70  5  4  8 62 64  7 30 90 57 28 58 24 86 93
 22  6 47 11 65  9 95 42 52 49 54 84 81 26 87 45 94 92 63 50 33 59 37 82 69
 73 23 48 89 85 40 72 14 55 32 13  1 39  1 41 51 38 43 78 68 16 77 46 88 56]
[ 7 80 52 38 23 81 62 53 91 17 22 31 29  6 66 87  1 30  9 13  4 10 25  8 48
 95 61 32 75 50  1 82 57 20 93 24 56 43 76 65 33 60  1 59 28 73 16 64 96 72
 27 36 63 88 71  1 85 19 77 41 47 14 83 34 18 84 35 89 42 94 26 12 92 79 68
 39  5 51 40 69 70  3 78 55 86 11 90 58 45 46 67  1  2 54 49 15 74 21 37 44]
[58 18 30 73 71 11 10 86 77  2 48 20 81 66 42 46 79 26 23 55 67 33 89 65 32
  9  1 17 41 34  1 85 74 70 27 60 78 69 75 68 25 15 21  6 29  3  4 53 31 82
 24 39 94 83 95 47 43 93 91 22 45 80 44 13 19  8  5 87 92 64 14 16 57 96 84
 40 12 52 28 63 62 76 56 37 54 88 36 61 59 51  1 72 50  1 35  7 49 38  1 90]
[44 28 45 23 37 65  3 42 69  1 41  2 75 26 61  1 89 12 27  1 87 51  9  7 17
  1 29 83  8  4 96  1 39 72 59 10 13 31 16 64 11 84 18 67 94 19 78 22  6 80
 86 52 20 53 47 36 73 76 71 38 85 79 50 48 43 34 91 55 63 15 58  5 90 54 24
 25 92 95 82 88 35 60 14 77 68 32 46 62 57 66 74 33 21 70 81 40 93 49 30 56]
[36  7 13  1 24 18 85  4  1 35 22 93 77 88 39 63 69 57 43 92 26 17  3 12 11
 15 73 31 10 49 65 14 16 60 61 62 29  5 82 51 64 55 71 83  9 68 38 42 96  6
 25 21  2  8 76 45 19 44 67 87 52 78 70 20 32 47 84 56 89  1  1 50 86 90 79
 66 48 95 23  1 53 75 74 72 40 80 46 41 54 37 34 30 58 33 28 91 81 59 27 94]
[20  1 24 62 57  1 36 78 79 59 94 35 47 52 49 86 56 71 80 67  3  1 64 32 40
  1  7 83 60 30  1 29 21 43 75  2 39 76 73 13 42 48 87 88  9 15  4 95 28 45
 85 14  5 33 72 96 70 58 63 90 54 31 81 89 92 12 66 74 46 69 23  6 41 65  8
 53 10 34 68 26 50 11 55 16 38 93 91 84 77 25 27 22 61 51 37 17 18 44 82 19]
[43 38 93 94 29  1 52 36 27  9 74  1 56 39 35 79 88 89 64 69 13  2 70 37 40
  5  3 32 78 65 49 19  4 60 66 44 15 80 23 41 51 33 90 57 24  1 58 31 83 12
 20  1 28 46 73 63 45 76 11 72 50  7 87 85  6 30 81  8 59 47 48 61 92 95 84
 75 71 82 91 54 21 10 42 62 68 67 22 14  1 55 53 96 77 34 16 17 25 18 86 26]
[50 37 68 63 22 35  1 27 79 64 16  1 89 85  1 34 70 32 71 17 29 15 20 19 52
  7  2 10 66 77  4  1 69 13 90 31 30 14 65 88 56 41 12 72 43 92  1 74 42 23
  9 11 61 54 36 55 39 45 78 62 38 75 24 83 80  8 86 91 93 53 76 60 25 47  5
 67 57 81 21 94 44 40 73 58 82 96 18 95 33 46 87 26 28 49  3 59 48  6 51 84]
[89 80 79 86 52 78 11  7 53 39 15 29 51 19 38 69 42 56 20 61 66  1  3 65 33
  1 13 37 55  5 45 49  6  1 90 83 12 30 81 41 93 87 44 96 62 76 25 75 77 10
 94 27 17 68 26 72 35 23 40  9 58 82 22 64 74 59 54 50 88 28 46 34 31 60 16
  4  8 91 32 21 43 67 48 85  1 71 57 95 84  2 70 47 73 63  1 36 24 18 92 14]
[40 78 67 51 75 79 71 52 61 57 32 27 31 38 47  1 29  2 39 80  1 30 34 96 42
 53  9 59 21 22  8 56 41 23 54 82 33 49 64  7 17 50 87 13 55 84 14 70 10 62
 24 81 20 18 37 76  4  3 83 72 91 68 35 36 44 95  1 88 16 48 94  6 46  1 12
 63  5  1 74 92 85 90 58 66 86 45 65 26 15 69 43 89 77 28 73 25 19 11 60 93]
[58 36 53  4 91 50 45 56 63 67 13 72  5 70 80 30 83 12 10  7 40 20 39 23 76
 28  6 21 64 66 16  2 31 90  8 41 14 55 59  9 73 29 61 93 54 86 57 52 22  3
  1 25 81 92 65 34 71 75 32 51 42 79 37 18 35 95 62  1 24 60 82 89 19 74 33
 44 46 69 84  1 27 49 11 85 68 48 47 96 15  1 94 38 78 88 87 26 43 77 17  1]
[32  5 24 95 31  6 62 92 84 28  7 66 45 37 33  1 87 47 78 81 16  1 17 26 52
 48  1 57 11 65  9 25 68 75 51  1 40 61 74 60 39 86 54 53  3 59 20 77 58 18
 85 73 70 15 79 80 91 10 13 83  4 72 43 42  2 22 63 30 23 41 82 89  1 35 44
 76 14 94 34 71 64 56 38 96 46 50 67 12 90 93 69 88 21 19  8 36 29 49 55 27]
[90 57 74 16 55 78  8 63 32 48  7 88 70 40 81 28 79 53 39 47 56 85 77 37  1
 10  1 38 94 50  2  3 67 66 52  1 58 60 84 17 93 76 26 31 36 19  1 29 18 73
  1 75 44 45 23 21 35 43 86 87 11 15 46 42 27 62 41 49 51 13 68 14 89 69 61
 24 25 95 82 12 72 59 54 80  6 83 30 20 65 34  5 96  4 64 92 22 71  9 91 33]
[52 72 41 31 56 48 95 57 26  1  1 47 65 37 38 60 25 92 76 55  1  4 61 21 45
  1 29 68 19 28  9  1 80 15 84 53 74 50 89 27 77 35 54  5 81 93 78 42 36  2
  3 11 12 13 86 39 24 75 46 63 69 59 71 10 73 64 66 87 51 18  8 20 94 16 40
 88 91 70 17 34 58 22 83  7 85 82 67 23 14 33 44  6 49 32 90 79 30 96 62 43]
[73 54 37 64 18 75  5 91 94 17 14 46 13  4 81 35 21 68 95 74 11 50 84 15 87
  6  1 78 67 62  1  3 86 47 32  9 42 88 80 34 53 96 22 63 30 71 12 82 79 66
 19 51 70 45 36 52 92 90 24 31 83 93 60 77 29 69  1 16 48 65  2  1 58 28 89
 25 39 76 44 59 57 10 55 61 20 33  7 72 41 26 43  8  1 38 49 56 23 85 40 27]
[75 62  3 34 84 67  1  6 13  1 58 61 36 81 21  7  5 89  2  1 42 41 20 46 83
 80 10 96 91 14 54 71 47 76 17 45 37 25 85 40 70 35 33 90 26 38  9 93 78 23
 24 69 15 19 16  8 63 49  1 52 18 56 88 32 59 92 31 82 73 66 60 53 87 79 48
  4 68 64  1 65 28 72 51 44 43 22 50 11 39 27 74 95 55 12 29 30 94 57 86 77]
[41 10 75 69 15 61 30 43  9 40 70 55 60 23 32 66 65 58 29 72 35 21 59 14 91
 82  1  6 11 57 31 48  3  1  1  1 45 52 36 22 28 25 33 78 95 86 12 24 76 44
 37 38 92  2 64 46 85 13 19 83 47 51 34 63 94 53 20 27 90 96 74 50 84  1 79
  8 73 77 42 87 26 16  4 80 93 89 88 54 17 71 18 39 68 67 62  5 56 49 81  7]
[56 65 66 32 12 63 16 68 95 62 26 46 71 70 30 82 85 41  9 51 25  5 19 20 24
 77  3 47 18 17  4 94 13  1 90 50 58 27 28 91 15 39 10 33 64  1  8 96 36 59
 87  6 34  1 40 88 44 73 75 67 21 31 42 89 69 35  1 84 79 60 74 53 57  1  7
 83 48 72 52 93 37 38 49 76 80 61 14 86 29 23  2 43 45 92 81 54 55 78 11 22]
[ 6 93 60 59 95 89 17 51 91 92 13 37 38 72 41 70 94 30 73 87  3  1 40 56 77
 66  1 53 86 85  2 64 90 28  1 31  4 52 29 67 26 49 96 44 45 78 12  9 65 81
 10 42 55 82  5 47 20 46 54 36  1 16 32 69 39 57 88 48 83 25 24 58 23 50 21
  1  7 76 43 22  8 74 79 62 14 84 35 68 63 19 75 33 80 71 15 11 34 61 27 18]
[66 53 38 26 91 22 82 57 78 32  3 83 11 34 45 70 84 42 27 67 13 90 12 23 28
 31  1 46 60 93 49 51 64  1 58 29 30 72 18  1  8 81 16 54 71 50 36  1 73 52
 59 80 41  9 61 95 43 65 55 68  5 92 37 79 20 21 96 74 40  1  2 77 56 19 63
 94 14 47 87 69 88 76 35 44 89 25 75  4 10  7 85 24 33 17 62 48 39 15 86  6]
[54 21 16 53 93 61 15 25 45 44 14 58 86 10 42 70 43  9 46 74 20 88  1  1  4
 81 34 83 95 91 17 72 89 76 39 26 19  1  6 82 52 64 18 94 75 65 12 78 51 23
 69 41 22 37 63 77  5 57 68  8 80 40 11 36 35  7 62 92 84 96 50 87 66  2 79
 67  3  1 55  1 48 13 90 30 31 60 28 71 47 59 85 32 33 29 27 49 24 38 73 56]

In [12]:
## sort PSSM features by their predictive power (average RFE rank across data chunks)

rank_av = [np.mean(pssm_rank.ix[i]) for i in pssm_rank.index]
arg_rank_av = np.argsort(rank_av)
pssm_rank_sorted = pssm_rank.ix[pssm_rank.index[arg_rank_av]]
pssm_rank_sorted['RANK_AV'] = np.sort(rank_av)
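An equivalent, more compact way to compute the same average ranking with pandas built-ins (a sketch; it should reproduce the RANK_AV column computed above, assuming a pandas version with Series.sort_values):

## same averaging and sorting expressed with pandas built-ins
rank_av_series = pssm_rank.mean(axis=1).sort_values()
print(rank_av_series.head())   # lowest (best) average ranks first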

In [13]:
pssm_rank_sorted


Out[13]:
0 1 2 3 4 5 6 7 8 9 ... 76 77 78 79 80 81 82 83 84 RANK_AV
-1_E_pssm 55 32 27 68 82 4 28 12 56 2 ... 1 29 1 10 1 3 1 1 34 21.369048
-1_L_pssm 1 10 18 4 11 93 1 96 32 1 ... 2 9 1 54 31 4 2 49 17 24.119048
-1_A_pssm 26 55 69 10 20 5 20 52 62 42 ... 56 1 11 42 35 25 3 13 20 29.714286
-2_E_pssm 49 4 8 1 62 1 10 1 38 5 ... 8 95 5 1 30 16 17 82 15 30.500000
0_E_pssm 25 39 4 2 35 11 39 29 71 10 ... 1 78 12 9 12 8 12 36 12 33.642857
-1_D_pssm 38 2 78 86 14 73 6 59 57 23 ... 37 21 15 46 14 20 56 23 1 35.083333
-1_R_pssm 82 82 42 17 76 25 90 78 24 77 ... 85 4 50 41 21 5 1 90 88 35.440476
-2_L_pssm 76 14 7 23 87 89 27 80 61 3 ... 7 1 14 58 70 26 13 3 14 36.988095
-1_Q_pssm 8 17 85 71 89 42 32 1 2 47 ... 10 1 6 80 82 77 66 31 81 37.190476
-1_K_pssm 2 86 91 16 16 9 5 9 1 1 ... 3 1 3 71 48 94 64 51 72 37.428571
0_L_pssm 67 1 59 62 54 52 50 60 78 39 ... 1 3 19 24 37 87 10 59 69 40.190476
1_L_pssm 88 18 43 83 34 58 31 43 45 12 ... 68 8 2 60 74 74 24 2 50 40.690476
-1_T_pssm 37 36 83 89 58 86 2 8 67 52 ... 58 74 42 37 45 58 4 30 19 40.940476
-1_C_pssm 35 66 61 33 29 96 59 4 79 45 ... 1 45 87 83 91 24 77 28 4 41.059524
1_S_pssm 58 25 19 20 26 32 1 73 35 86 ... 24 88 25 4 8 83 1 94 67 41.273810
2_K_pssm 42 89 15 27 46 84 9 23 28 1 ... 96 6 8 95 39 43 33 24 32 41.309524
2_S_pssm 70 16 6 70 47 22 14 45 5 36 ... 22 79 56 30 5 54 11 48 49 41.321429
-1_S_pssm 96 57 94 6 18 66 37 82 77 1 ... 1 53 9 45 1 50 31 29 26 41.488095
-2_I_pssm 64 49 25 36 73 38 95 66 14 26 ... 48 1 17 1 40 62 92 32 44 43.107143
-2_R_pssm 36 9 67 50 64 29 22 71 69 6 ... 57 72 54 62 10 65 93 53 21 43.238095
2_E_pssm 1 58 20 39 60 45 70 21 25 72 ... 30 67 7 50 88 14 35 75 28 43.547619
0_K_pssm 30 37 45 85 61 3 15 64 10 11 ... 75 11 51 69 38 6 42 80 41 43.678571
-1_N_pssm 86 6 28 42 19 23 73 83 84 24 ... 77 61 84 20 59 19 40 12 1 43.940476
0_T_pssm 17 8 33 51 71 19 83 36 21 91 ... 35 24 92 63 85 44 20 43 5 44.202381
1_E_pssm 45 77 3 34 77 61 33 1 96 63 ... 41 66 1 31 20 1 88 96 62 44.416667
-2_K_pssm 40 13 23 92 90 54 7 16 94 17 ... 88 47 46 61 55 46 37 83 58 44.523810
-1_F_pssm 39 68 84 40 69 12 49 1 83 87 ... 66 15 47 76 1 1 28 1 76 45.059524
0_F_pssm 32 70 41 15 2 92 65 14 91 38 ... 45 13 45 19 2 1 82 9 37 45.071429
1_P_pssm 54 71 52 11 25 95 58 7 53 33 ... 61 40 89 48 79 7 21 63 79 45.119048
1_I_pssm 13 1 74 29 94 1 35 33 34 96 ... 13 18 65 66 96 60 25 1 96 45.273810
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2_T_pssm 71 74 87 48 48 62 29 20 13 37 ... 71 30 23 94 56 55 34 39 24 50.035714
-1_Y_pssm 77 78 58 47 70 85 79 3 87 41 ... 84 89 80 85 36 28 29 18 6 50.273810
2_M_pssm 16 44 64 80 57 31 55 18 82 49 ... 4 49 1 55 68 45 80 33 33 50.297619
1_A_pssm 9 23 17 21 27 88 3 10 85 62 ... 11 69 83 18 47 21 1 5 80 50.464286
-1_H_pssm 43 61 76 1 55 59 82 55 49 25 ... 94 19 67 91 11 18 86 60 95 50.476190
0_H_pssm 24 46 10 66 21 51 23 81 1 65 ... 18 36 79 78 76 36 65 73 51 50.714286
2_G_pssm 83 80 1 19 88 36 16 88 55 9 ... 20 23 72 11 54 86 68 4 71 50.726190
-2_V_pssm 48 11 88 35 74 90 84 72 95 30 ... 47 55 74 1 72 51 87 67 74 50.761905
2_Y_pssm 60 87 92 81 50 69 77 22 22 58 ... 91 62 40 86 81 11 27 86 73 50.964286
1_R_pssm 22 54 12 7 45 91 48 74 1 22 ... 15 59 93 56 51 31 16 92 40 51.154762
-2_Y_pssm 68 95 93 14 91 39 61 58 59 29 ... 39 76 95 2 29 9 73 27 46 51.202381
-1_W_pssm 93 26 35 76 8 10 66 89 12 16 ... 60 50 88 25 52 27 52 72 1 51.321429
0_D_pssm 10 38 86 96 72 74 17 57 1 67 ... 31 5 63 90 78 33 44 54 94 51.452381
2_F_pssm 87 63 40 25 12 71 81 27 7 57 ... 64 32 38 12 67 92 71 17 29 51.583333
1_N_pssm 85 12 37 77 92 67 12 47 70 44 ... 46 71 60 88 34 42 32 37 11 51.595238
0_G_pssm 75 21 1 93 83 70 36 86 39 21 ... 29 42 82 93 24 96 9 1 78 51.642857
1_Q_pssm 20 94 47 84 78 49 75 2 73 85 ... 62 64 69 92 53 35 57 21 7 51.809524
0_N_pssm 46 85 95 46 65 57 21 76 76 78 ... 26 54 22 33 33 10 96 16 18 52.011905
2_A_pssm 91 53 5 90 75 40 43 13 46 54 ... 72 58 57 28 26 37 8 88 48 52.130952
0_Q_pssm 19 43 48 26 6 80 91 32 52 68 ... 19 93 71 38 86 1 78 50 65 52.333333
2_W_pssm 53 73 31 59 4 43 62 85 15 28 ... 9 96 85 57 49 78 61 15 38 52.821429
2_Q_pssm 7 96 90 63 42 15 76 19 88 1 ... 83 82 33 22 89 61 84 25 60 53.011905
1_G_pssm 31 47 14 88 86 65 47 65 3 53 ... 49 87 16 82 27 84 48 74 92 53.440476
-2_M_pssm 44 64 81 8 5 37 67 75 42 19 ... 70 65 13 36 60 71 38 11 86 54.059524
2_C_pssm 50 50 82 1 32 82 53 39 80 27 ... 6 85 20 43 93 80 14 89 31 54.142857
1_C_pssm 80 90 89 30 3 75 94 54 11 73 ... 27 73 29 59 94 69 39 20 35 54.345238
1_W_pssm 90 69 68 79 38 41 40 94 20 90 ... 95 70 76 64 77 72 76 47 1 54.345238
-2_C_pssm 57 65 60 73 93 76 18 6 64 34 ... 55 56 18 84 15 12 95 91 93 54.714286
0_P_pssm 74 52 1 91 7 30 69 62 66 92 ... 23 86 36 16 64 40 5 61 63 57.738095
1_H_pssm 47 30 53 56 43 44 64 26 17 70 ... 51 51 48 73 90 79 83 40 84 57.940476

100 rows × 85 columns


In [14]:
# plot histogram of the average rank of all PSSM features

plt.hist([np.mean(pssm_rank.ix[i]) for i in pssm_rank.index], bins=60, alpha=.5)
plt.title("Histogram of Average PSSM Feature Rank (RFE on linear SVM)")
fig = plt.gcf()
fig.set_size_inches(10, 6)


Some PSSM features deviate from the normal distribution of average ranks that the Central Limit Theorem (CLT) would predict in a neutral-information setting, suggesting that those features carry predictive signal.
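
To make this claim quantitative, one could run a normality test on the average ranks. A minimal sketch, assuming scipy is available (it is not imported in this notebook) and reusing rank_av from In [12]:

from scipy import stats

# D'Agostino-Pearson test of the null hypothesis that the average ranks
# are normally distributed; a small p-value supports the deviation claim
stat, pval = stats.normaltest(rank_av)
print "normality test p-value: %.3g" % pval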

Analysis II: Support Vector Machine

We use c-SVM on two models:

  • all features model
  • handpicked features model

Reference

  • “Support-vector networks”, C. Cortes, V. Vapnik - Machine Learning, 20, 273-297 (1995).
  • “Automatic Capacity Tuning of Very Large VC-dimension Classifiers”, I. Guyon, B. Boser, V. Vapnik - Advances in Neural Information Processing Systems (1993).

SVM on all features

Without Normalization


In [102]:
## feature matrix: all columns except ID_pos and the six appended meta columns

X = dna[dna.columns[1:][:-6]]
y = dna.class_num

In [103]:
## train c-SVM

clf_svm1 = SVC(kernel='rbf', C=0.7)
clf_svm1.fit(X[dna.fold == 0], y[dna.fold == 0])


Out[103]:
SVC(C=0.7, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [104]:
## predict class

pred = clf_svm1.predict(dna[dna.fold == 1][dna.columns[1:][:-6]])

In [105]:
truth = dna[dna.fold == 1]['class_num']

tp = pred[(np.array(pred) == 1) & (np.array(truth) == 1)].size
tn = pred[(np.array(pred) == 0) & (np.array(truth) == 0)].size
fp = pred[(np.array(pred) == 1) & (np.array(truth) == 0)].size
fn = pred[(np.array(pred) == 0) & (np.array(truth) == 1)].size


cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
            "\t(+)-truth\t%d\t\t%d\n" +\
            "\t(-)-truth\t%d\t\t%d"
    
print cm % (tp, fn, fp, tn)


Confusion Matrix:
	X		(+)-pred	(-)-pred
	(+)-truth	0		2561
	(-)-truth	0		26832
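
The confusion-matrix block above is repeated verbatim for every model in this notebook; a small helper like the following sketch (not part of the original notebook) would remove the duplication:

def confusion_counts(pred, truth):
    """Return (tp, tn, fp, fn) for binary 0/1 predictions and labels."""
    pred, truth = np.array(pred), np.array(truth)
    tp = np.sum((pred == 1) & (truth == 1))
    tn = np.sum((pred == 0) & (truth == 0))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    return tp, tn, fp, fn

# usage: tp, tn, fp, fn = confusion_counts(pred, truth)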

In [165]:
print "Size of (-)- and (+)-sets:\n\t(+)\t %d\n\t(-)\t%d" % (truth[truth == 1].index.size, truth[truth == 0].index.size)


Size of (-)- and (+)-sets:
	(+)	 2561
	(-)	26832

With Normalization
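
dna_norm is prepared earlier in the notebook and is not reproduced here; a minimal sketch of one way to build it, assuming simple per-column min-max scaling of the feature columns, would be:

# assumed construction of dna_norm (per-feature min-max scaling);
# the normalization actually used earlier in the notebook may differ
feature_cols = dna.columns[1:][:-6]
dna_norm = dna.copy()
for col in feature_cols:
    lo, hi = dna[col].min(), dna[col].max()
    if hi > lo:
        dna_norm[col] = (dna[col] - lo) / float(hi - lo)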


In [80]:
X_norm = dna_norm[dna_norm.columns[1:][:-6]]
y = dna_norm.class_num

In [81]:
## train c-SVM

clf_svm2 = SVC(kernel='rbf', C=0.7)
clf_svm2.fit(X_norm[dna_norm.fold == 0], y[dna_norm.fold == 0])


Out[81]:
SVC(C=0.7, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [108]:
## predict class

pred2 = clf_svm2.predict(dna_norm[dna_norm.fold == 1][dna_norm.columns[1:][:-6]])

In [109]:
truth = dna_norm[dna_norm.fold == 1]['class_num']

tp = pred2[(np.array(pred2) == 1) & (np.array(truth) == 1)].size
tn = pred2[(np.array(pred2) == 0) & (np.array(truth) == 0)].size
fp = pred2[(np.array(pred2) == 1) & (np.array(truth) == 0)].size
fn = pred2[(np.array(pred2) == 0) & (np.array(truth) == 1)].size

cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
            "\t(+)-truth\t%d\t\t%d\n" +\
            "\t(-)-truth\t%d\t\t%d"
    
print cm % (tp, fn, fp, tn)


Confusion Matrix:
	X		(+)-pred	(-)-pred
	(+)-truth	323		2238
	(-)-truth	156		26676

Without normalization, the RBF c-SVM degenerates into predicting the majority (non-binding) class for every residue. Normalizing each feature dimension puts the features on a comparable scale, which simplifies the problem for the kernel and recovers a modest number of true positives.

SVM on PSSM + other putative useful features

We now test the performance of the SVM using a domain-knowledge (non-zero-knowledge) approach. We scoured the complete list of features computed by PredictProtein and included those that might plausibly influence DNA/RNA binding.


In [88]:
## hand-pick features

feature_cols = dna.columns[1:][:-6]
feature_groups = ['pssm', 'glbl_aa_comp', 'glbl_sec', 'glbl_acc',
                  'chemprop_mass', 'chemprop_hyd', 'chemprop_cbeta',
                  'chemprop_charge', 'inf_PP', 'isis_bin', 'isis_raw',
                  'profbval_raw', 'profphd_sec_raw', 'profphd_sec_bin',
                  'profphd_acc_bin', 'profphd_normalize',
                  'pfam_within_domain', 'pfam_dom_cons']

features = [x for group in feature_groups for x in feature_cols if group in x]

In [89]:
X_norm = dna_norm[features]
y = dna_norm.class_num

In [90]:
## train c-SVM

clf_svm3 = SVC(kernel='rbf', C=0.7)
clf_svm3.fit(X_norm[dna_norm.fold == 0], y[dna_norm.fold == 0])


Out[90]:
SVC(C=0.7, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [106]:
## predict class

pred3 = clf_svm3.predict(X_norm[dna_norm.fold == 1])

In [107]:
truth = dna_norm[dna_norm.fold == 1]['class_num']

tp = pred3[(np.array(pred3) == 1) & (np.array(truth) == 1)].size
tn = pred3[(np.array(pred3) == 0) & (np.array(truth) == 0)].size
fp = pred3[(np.array(pred3) == 1) & (np.array(truth) == 0)].size
fn = pred3[(np.array(pred3) == 0) & (np.array(truth) == 1)].size

cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
            "\t(+)-truth\t%d\t\t%d\n" +\
            "\t(-)-truth\t%d\t\t%d"
    
print cm % (tp, fn, fp, tn)


Confusion Matrix:
	X		(+)-pred	(-)-pred
	(+)-truth	166		2395
	(-)-truth	99		26733

Analysis III: Tree-based Classifiers

We picked three tree-based algorithms: Decision Tree (DT), Random Forest (RF) and Extremely Randomized Trees (ERT). From left to right, each algorithm injects more randomness (and hence slightly more bias) into the model than its predecessor, trading this against lower variance.


In [111]:
X = dna[dna.columns[1:][:-6]]
y = dna.class_num

Decision Tree

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Reference

L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.


In [151]:
# compute cross validated accuracy of the model

clf_t1 = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
                             random_state=0)

scores = cross_val_score(clf_t1, X, y, cv=5)

print scores
print scores.mean()


[ 0.84656239  0.84001634  0.85005545  0.86370535  0.85284847]
0.850637599853

In [116]:
clf_t1.fit(X[dna.fold == 0], y[dna.fold == 0])


Out[116]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=0, splitter='best')

In [119]:
pred_t1 = clf_t1.predict(X[dna.fold == 1])

In [126]:
truth = dna[dna.fold == 1]['class_num']

tp = pred_t1[(np.array(pred_t1) == 1) & (np.array(truth) == 1)].size
tn = pred_t1[(np.array(pred_t1) == 0) & (np.array(truth) == 0)].size
fp = pred_t1[(np.array(pred_t1) == 1) & (np.array(truth) == 0)].size
fn = pred_t1[(np.array(pred_t1) == 0) & (np.array(truth) == 1)].size

cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
            "\t(+)-truth\t%d\t\t%d\n" +\
            "\t(-)-truth\t%d\t\t%d"
    
print cm % (tp, fn, fp, tn)


Confusion Matrix:
	X		(+)-pred	(-)-pred
	(+)-truth	760		1801
	(-)-truth	3237		23595

Random Forest

In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.

[SKL]

Reference:

  • Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001.
  • Breiman, “Arcing Classifiers”, Annals of Statistics 1998.

In [149]:
# compute cross validated accuracy of the model

clf_t2 = RandomForestClassifier(n_estimators=10, max_depth=None,
                                 min_samples_split=2, random_state=0)
scores = cross_val_score(clf_t2, X, y, cv=5)
print scores
print scores.mean()


[ 0.90253298  0.9058542   0.90632113  0.90917581  0.908417  ]
0.90646022366

In [128]:
clf_t2.fit(X[dna.fold == 0], y[dna.fold == 0])


Out[128]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)

In [130]:
pred_t2 = clf_t2.predict(X[dna.fold == 1])

In [131]:
truth = dna[dna.fold == 1]['class_num']

tp = pred_t2[(np.array(pred_t2) == 1) & (np.array(truth) == 1)].size
tn = pred_t2[(np.array(pred_t2) == 0) & (np.array(truth) == 0)].size
fp = pred_t2[(np.array(pred_t2) == 1) & (np.array(truth) == 0)].size
fn = pred_t2[(np.array(pred_t2) == 0) & (np.array(truth) == 1)].size

cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
            "\t(+)-truth\t%d\t\t%d\n" +\
            "\t(-)-truth\t%d\t\t%d"
    
print cm % (tp, fn, fp, tn)


Confusion Matrix:
	X		(+)-pred	(-)-pred
	(+)-truth	245		2316
	(-)-truth	224		26608

Extremely Randomized Tree

In extremely randomized trees (see ExtraTreesClassifier and ExtraTreesRegressor classes), randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule. This usually allows to reduce the variance of the model a bit more, at the expense of a slightly greater increase in bias:

[SKL]

Reference:

  • P. Geurts, D. Ernst., and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3-42, 2006.

In [150]:
# compute cross validated accuracy of the model

clf_t3 = ExtraTreesClassifier(n_estimators=10, max_depth=None,
                              min_samples_split=2, random_state=0)
scores = cross_val_score(clf_t3, X, y, cv=5)

print scores
print scores.mean()


[ 0.90480915  0.90813051  0.90859744  0.90835863  0.91028485]
0.9080361155

In [134]:
clf_t3.fit(X[dna.fold == 0], y[dna.fold == 0])


Out[134]:
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
           verbose=0, warm_start=False)

In [135]:
pred_t3 = clf_t3.predict(X[dna.fold == 1])

In [136]:
truth = dna[dna.fold == 1]['class_num']

tp = pred_t3[(np.array(pred_t3) == 1) & (np.array(truth) == 1)].size
tn = pred_t3[(np.array(pred_t3) == 0) & (np.array(truth) == 0)].size
fp = pred_t3[(np.array(pred_t3) == 1) & (np.array(truth) == 0)].size
fn = pred_t3[(np.array(pred_t3) == 0) & (np.array(truth) == 1)].size

cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
            "\t(+)-truth\t%d\t\t%d\n" +\
            "\t(-)-truth\t%d\t\t%d"
    
print cm % (tp, fn, fp, tn)


Confusion Matrix:
	X		(+)-pred	(-)-pred
	(+)-truth	233		2328
	(-)-truth	236		26596

Random Forest on selected features


In [172]:
## same hand-picked feature groups as in In [88]

feature_cols = dna.columns[1:][:-6]
feature_groups = ['pssm', 'glbl_aa_comp', 'glbl_sec', 'glbl_acc',
                  'chemprop_mass', 'chemprop_hyd', 'chemprop_cbeta',
                  'chemprop_charge', 'inf_PP', 'isis_bin', 'isis_raw',
                  'profbval_raw', 'profphd_sec_raw', 'profphd_sec_bin',
                  'profphd_acc_bin', 'profphd_normalize',
                  'pfam_within_domain', 'pfam_dom_cons']

features = [x for group in feature_groups for x in feature_cols if group in x]

In [173]:
X = dna[features]
y = dna.class_num

In [174]:
# compute cross validated accuracy of the model

clf_t4 = RandomForestClassifier(n_estimators=10, max_depth=None,
                                 min_samples_split=2, random_state=0)
scores = cross_val_score(clf_t4, X, y, cv=5)
print scores
print scores.mean()


[ 0.90253298  0.9058542   0.90632113  0.90917581  0.908417  ]
0.90646022366

In [175]:
clf_t4.fit(X[dna.fold == 0], y[dna.fold == 0])


Out[175]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)

In [176]:
pred_t4 = clf_t4.predict(X[dna.fold == 1])

In [177]:
truth = dna[dna.fold == 1]['class_num']

tp = pred_t4[(np.array(pred_t4) == 1) & (np.array(truth) == 1)].size
tn = pred_t4[(np.array(pred_t4) == 0) & (np.array(truth) == 0)].size
fp = pred_t4[(np.array(pred_t4) == 1) & (np.array(truth) == 0)].size
fn = pred_t4[(np.array(pred_t4) == 0) & (np.array(truth) == 1)].size

cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
            "\t(+)-truth\t%d\t\t%d\n" +\
            "\t(-)-truth\t%d\t\t%d"
    
print cm % (tp, fn, fp, tn)


Confusion Matrix:
	X		(+)-pred	(-)-pred
	(+)-truth	190		2371
	(-)-truth	257		26575

While there is a significant accuracy improvement going from the Decision Tree to the Random Forest, the Extremely Randomized Trees only improve accuracy marginally. Likewise, manually handpicking the features does not seem to improve performance.

Analysis IV: Other Ensemble Learning Methods

AdaBoost

The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction.

[SKL]

Reference:

  • Y. Freund, and R. Schapire, “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting”, 1997.
  • J. Zhu, H. Zou, S. Rosset, T. Hastie. “Multi-class AdaBoost”, 2009.

In [179]:
X = dna[dna.columns[1:][:-6]]
y = dna.class_num

In [184]:
# compute cross validated accuracy of the model

ada = AdaBoostClassifier(n_estimators=100)
scores = cross_val_score(ada, X, y, cv=5)

print scores
print scores.mean()


[ 0.90673515  0.90194362  0.89406409  0.90777492  0.89691805]
0.901487164628

In [185]:
ada.fit(X[dna.fold == 0], y[dna.fold == 0])


Out[185]:
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=100, random_state=None)

In [187]:
pred_ada = ada.predict(X[dna.fold == 1])

In [188]:
truth = dna[dna.fold == 1]['class_num']

tp = pred_ada[(np.array(pred_ada) == 1) & (np.array(truth) == 1)].size
tn = pred_ada[(np.array(pred_ada) == 0) & (np.array(truth) == 0)].size
fp = pred_ada[(np.array(pred_ada) == 1) & (np.array(truth) == 0)].size
fn = pred_ada[(np.array(pred_ada) == 0) & (np.array(truth) == 1)].size

cm = "Confusion Matrix:\n\tX\t\t(+)-pred\t(-)-pred\n" +\
            "\t(+)-truth\t%d\t\t%d\n" +\
            "\t(-)-truth\t%d\t\t%d"
    
print cm % (tp, fn, fp, tn)


Confusion Matrix:
	X		(+)-pred	(-)-pred
	(+)-truth	665		1896
	(-)-truth	800		26032

Conclusions:

  • xNA binding is a hard(er than expected) classification problem
  • Good accuracy and reasonable precision, but poor recall for most basic ML algorithms
  • Most features are not i.i.d.
  • Manual selection of features does not improve performance

Some solutions that might work:

I. Quantitative feature selection using RFE/RFA over the complete feature space

Problem: the feature space might be too large for conventional, canned algorithms.

Possible Hacks:

-- Bagging of features (55+ feature groups vs. 500+ features)

-- Removing similar features before RFE (elimination via cosine similarity et al.; see the sketch below)

-- Dimensionality reduction (t-SNE, PCA, etc.)
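
A rough sketch of the similarity-based elimination mentioned above (the 0.95 cut-off is an arbitrary placeholder, not a tuned value):

from sklearn.metrics.pairwise import cosine_similarity

# pairwise cosine similarity between feature columns
# (on the full matrix this is memory-hungry; a row subsample would also do)
sim = cosine_similarity(X.values.T)

# greedily keep a feature only if it is not near-collinear with one already kept
keep = []
for i in range(sim.shape[0]):
    if all(sim[i, j] < 0.95 for j in keep):
        keep.append(i)

X_reduced = X[X.columns[keep]]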

II. Regularization: might work, considering that the system is not entirely overdetermined, many features are not actually informative, and the models tend to overfit.

In general, a combination of I and II would make sense; a sketch follows below.
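
A minimal sketch of such a combination (PCA for dimensionality reduction followed by an L2-regularized linear SVM; n_components and C are untuned placeholders):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

pipe = Pipeline([
    ('scale', StandardScaler()),       # put every feature on a comparable scale
    ('pca', PCA(n_components=50)),     # compress the ~500-dimensional feature space
    ('svm', LinearSVC(C=0.1)),         # smaller C = stronger regularization
])

scores = cross_val_score(pipe, X, y, cv=5)
print scores.mean()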

For continuous class values in [0, 1] (for submission)

SVM, Random Forest and AdaBoost regressors.
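
A minimal sketch for one of these options (a Random Forest regressor trained on the 0/1 labels; its predictions are averages over trees and therefore already lie in [0, 1]):

from sklearn.ensemble import RandomForestRegressor

rf_reg = RandomForestRegressor(n_estimators=10, random_state=0)
rf_reg.fit(X[dna.fold == 0], y[dna.fold == 0])

score = rf_reg.predict(X[dna.fold == 1])   # continuous binding propensity in [0, 1]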

Reference