Before we start

This notebook, presentation, functions and data can be retrieved from GitHub:

git clone https://github.com/Kleurenprinter/prompred.git






The raw code for this IPython notebook is by default hidden for easier reading. To toggle on/off the raw code, click here.

CODE SHOW : turned ON
CODE SHOW : turned OFF

In [2]:
import matplotlib.pyplot as plt
from prompred import *
import numpy as np
import math
import sklearn
import warnings
import pandas as pd
from IPython.core.display import HTML
warnings.filterwarnings('ignore')

%matplotlib inline


---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-2-eefdf818453c> in <module>()
      1 import matplotlib.pyplot as plt
----> 2 from prompred import *
      3 import numpy as np
      4 import math
      5 import sklearn

ImportError: No module named 'prompred'

The application of machine learning techniques in promoter engineering of prokaryotic microorganisms

Jim Clauwaert

Research update

December 2016

Machine Learning

Promoter Engineering

  • Predict the strength of a promoter sequence used in E.coli using machine learning techniques.

  • Make predictions of the promoter strength in function of the sigma factor used by the RNA polymerase.

Data

Mutalik Database

254 promoters

  • 137 randomized promoters: randomization within 35- and 10-box
  • 117 modulated promoters: promoters built from a set of fragments

sequence length: $[35,49]$

- varying spacer lengths 
- up-element

Insulated promoters


In [4]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 90000;


Feature creation and selection

nucleotides with respictive positions


In [1]:
dfDatasetOrder = pd.read_csv("../../data/mut_rand_mod_lib.csv")
dfDataset = dfDatasetOrder.reindex(np.random.permutation(dfDatasetOrder.index))
dfDataset.iloc[:5, 0:2]


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-2816a3868f6d> in <module>()
----> 1 dfDatasetOrder = pd.read_csv("../../data/mut_rand_mod_lib.csv")
      2 dfDataset = dfDatasetOrder.reindex(np.random.permutation(dfDatasetOrder.index))
      3 dfDataset.iloc[:5, 0:2]

NameError: name 'pd' is not defined

Creation of two reference regions

Creation of dummy variables for every position


In [6]:
#sequence range
seqRange = [-47,1]

#regions of interest (wrt 35- and 10-box)
ROI =  [[-12,14],[-8,12]]

#
labels, positionBox, spacer = regionSelect(dfDataset, ROI, seqRange)
print(dfDataset['sequence'][0],"\n",positionBox.values[0])


AAAAAGAGTATTGACTTAAAGTCTAACCTATAGGATACTTACAGCCAT 
 [1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0
 0 0 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 1
 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 1 0
 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0
 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0
 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0]

Data Preprocessing

Sequence annotation

  • add reference regions to used sequences

In [7]:
dfDataset.iloc[:5,[1,2,4,5]].style.set_properties(**{"font-size":"12px"})


Out[7]:
sequence mean_score 35boxstart 10boxstart
63 AAAAAATTTATTTGCTTTCGCATCTTTTTGTACCTATAATGTGTGGAT 425.41 11 34
217 TTGACAATTAATCATCCGGCTCGTAGGGTTTGTGGA 164.05 0 23
11 AAAAAGAGTATTGACTTCAGGAAAATTTTTCTGTATAATAGATTCAT 523.35 10 33
136 TTTTTCCTTAATCATCGGCTCGTATAATGTGTGGA 3.69 0 22
236 TTCCACCTTAATCATCCGGCTCGTATAATGTGTGGA 33.51 0 23

Labels

  • create homogenous distribution to keep estimation confidence interval over full range at a minimum

In [8]:
yData = dfDataset['mean_score']
yRooted = [math.sqrt(math.sqrt(u)) for u in yData]
plt.figure(num=None, figsize=(10, 4), dpi=80, facecolor='w', edgecolor='k')
plt.subplot(121)
plt.hist(yData,10,normed=1)
plt.title('Histogram $fluorescence$ values')
plt.subplot(122)
plt.hist(yRooted,10,normed=1)
plt.title('Histogram $\sqrt[4]{fluorescence}$ values')


Out[8]:
<matplotlib.text.Text at 0x7f029db3a4a8>

Code Implementation

The raw code for this IPython notebook is by default hidden for easier reading. To toggle on/off the raw code, click here.

CODE SHOW : turned ON
CODE SHOW : turned OFF

Supported models

  • Ordinary Least Squares: OLS
      Parameters: coef0
  • Ridge Regressen: ridge
      Parameters: alpha, coef0
  • Lasso Regression: lasso
      Parameters: alpha, coef0
  • Random Forests (Classification + Regression) : forestReg, forestClass
      Parameters: max_depth, max_features, min_samples
  • Support vectors (Classification + Regression) : SVR, SVC
      Parameters: alpha, gamma, coef0
      Kernels: poly, RBF, sigmoid, ...
  • Ridge regression kernels: ridge
      Parameters: alpha, gamma, coef0
      Kernels: poly, RBF, sigmoid, ...

K-fold cross validation

Nested K-fold cross validation


In [9]:
# Model specification
parModel = {"regType":'ridge', "poly":3, "kernel":"poly", "coef0":1}
# To be evaluated parameter(s)
parLabel = ['alpha']
parRange = [15] 
# Define kfold validation parameters
testSize = 0.2 
k = 5 
kInner = 5
#
X = positionBox.values
y = yRooted
# Run function
scoresParCV, optimalParCV = KfoldCV(X,y,k,parModel,parLabel[0],parRange[0])



In [10]:
meanScores = np.mean(np.ndarray.max(scoresParCV,axis=1))
print("K FOLD CV \n---------- \n\n Maximum Score: ",np.max(scoresParCV), "\n Mean optimal score: ", meanScores ,"\n sd optimal scores: ", math.sqrt(np.sum(np.power((np.ndarray.max(scoresParCV,axis=1)-meanScores),2)))   , "\n Optimal parEval:\n", optimalParCV, "\n parEval Scores:\n", scoresParCV,"\n\n\n")\


K FOLD CV 
---------- 

 Maximum Score:  0.799712244586 
 Mean optimal score:  0.71828884522 
 sd optimal scores:  0.1535514781534831 
 Optimal parEval:
 [ 100.   10.   10.   10.   10.] 
 parEval Scores:
 [[ 0.31989756  0.31989758  0.31989776  0.31989961  0.31991811  0.32010276
   0.32192307  0.33795852  0.42008236  0.58898159  0.68370929  0.29640506
  -1.65266186 -4.44410483 -5.28106803]
 [ 0.51080289  0.51080291  0.51080306  0.51080455  0.51081945  0.51096811
   0.51241678  0.52401693  0.56517912  0.61233096  0.59169948  0.31607843
  -1.12683883 -3.4371721  -4.15089014]
 [ 0.70372468  0.7037247   0.70372491  0.70372698  0.70374768  0.70395398
   0.70595244  0.72118814  0.76496797  0.79971224  0.70280537  0.3204901
  -0.97382186 -3.33005372 -4.08065618]
 [ 0.522472    0.52247202  0.5224723   0.52247508  0.52250286  0.52277968
   0.5254527   0.54579531  0.61776066  0.71057502  0.6684023   0.35829674
  -1.14365655 -3.53183989 -4.2556942 ]
 [ 0.63368022  0.63368024  0.63368047  0.63368276  0.63370564  0.63393391
   0.63616824  0.65459056  0.72010534  0.78511671  0.78160052  0.42718334
  -1.6473034  -4.83777626 -5.79585222]] 




In [11]:
scoresParNCV, optimalParNCV, scoresNCV = nestedKfoldCV(X,y,k,kInner,parModel,parLabel[0],parRange[0])



In [12]:
print("NESTED K FOLD CV \n----------------- \n\n Maximum Score: ",np.max(scoresParNCV), "\n Mean optimal score: ", np.mean(scoresParNCV) ,"\n sd optimal scores: ", math.sqrt(np.sum(np.power((np.ndarray.max(scoresParCV,axis=1)-np.mean(scoresParNCV)),2))) , "\n Optimal parEval:\n", optimalParNCV, "\n parEval Scores:\n", scoresParNCV,"\n\n\n")


NESTED K FOLD CV 
----------------- 

 Maximum Score:  0.799712244586 
 Mean optimal score:  0.71828884522 
 sd optimal scores:  0.1535514781534831 
 Optimal parEval:
 [[ 100.  100.  100.  100.  100.]
 [  10.   10.   10.   10.   10.]
 [  10.   10.   10.   10.   10.]
 [  10.   10.   10.   10.   10.]
 [  10.   10.   10.   10.   10.]] 
 parEval Scores:
 [ 0.68370929  0.61233096  0.79971224  0.71057502  0.78511671] 



Validation

The raw code for this IPython notebook is by default hidden for easier reading. To toggle on/off the raw code, click here.

CODE SHOW : turned ON
CODE SHOW : turned OFF

Something about rank

Spearman's rank correlation coefficient

Anderson promoter library

  • 19 well-recognized promoters
  • recovered by Chris Anderson
  • sequence range: $ [-35,1]$
  • single nucleotide variations

In [13]:
dfDatasetTest = pd.read_csv("data/anderson_lib.csv")
dfDatasetTest['sequence'] = dfDatasetTest['sequence'].str.upper()
dfDatasetTest.iloc[:5,:3]


Out[13]:
ID sequence mean_score
0 BBa_J23100 TTGACGGCTAGCTCAGTCCTAGGTACAGTGCTAGC 1.00
1 BBa_J23101 TTTACAGCTAGCTCAGTCCTAGGTATTATGCTAGC 0.70
2 BBa_J23102 TTGACAGCTAGCTCAGTCCTAGGTACTGTGCTAGC 0.86
3 BBa_J23103 CTGATAGCTAGCTCAGTCCTAGGGATTATGCTAGC 0.01
4 BBa_J23104 TTGACAGCTAGCTCAGTCCTAGGTATTGTGCTAGC 0.72

In [14]:
labelsTest, positionBoxTest, spacerTest = regionSelect(dfDatasetTest, ROI, seqRange)
Xtest = positionBoxTest.values

parInput = {"regType":'ridge', "poly":3, "kernel":'poly', "gamma":0.1, "alpha": 10000, "coef0":1} 
reg = selectRegression(**parInput)
reg.fit(X,y)
rankPredict = reg.predict(Xtest)

#print(np.transpose(np.vstack((dfDatasetTest['sequence'].values,dfDatasetTest['mean_score'].values,rankPredict))))
print(stats.spearmanr(dfDatasetTest['mean_score'].values,rankPredict))
plt.plot(dfDatasetTest['mean_score'].values,rankPredict, 'ro')
plt.xlabel('True label')
plt.ylabel('Predicted label')


SpearmanrResult(correlation=0.80728709394205445, pvalue=2.9393274121457126e-05)
Out[14]:
<matplotlib.text.Text at 0x7f028dfa0da0>

  • 18 promoters
  • sequence range: $ [-40,-1]$
  • single nucleotide variations focused in and around TATA- and TTGACA-box

In [15]:
dfDatasetTest = pd.read_csv("data/brewster_lib.csv")
dfDatasetTest.iloc[:5,:3].style.set_properties(**{"font-size":"12px"})


Out[15]:
ID sequence mean_score
0 UV5 TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGG 41.8
1 WT CAGGCTTTACACTTTATGCTTCCGGCTCGTATGTTGTGTGG 53.45
2 WTDL10 CAGGCATTACACTTTATGCTTCCGGCTCGTATGTTGTGTGG 57.83
3 WTDL20 CAGGCTTAAGACTTTATGCTTCCGGCTCGTATGTTGTGTGG 69.03
4 WTDL20v2 CAGGCCTTAGACTTTATGCTTCCGGCTCGTATGTTGTGTGG 69.93

In [16]:
labelsTest, positionBoxTest, spacerTest = regionSelect(dfDatasetTest, ROI, seqRange)
Xtest = positionBoxTest.values

rankPredict = reg.predict(Xtest)
#print(np.transpose(np.vstack((dfDatasetTest['sequence'].values,dfDatasetTest['mean_score'].values,rankPredict))))
print(stats.spearmanr(dfDatasetTest['mean_score'].values,rankPredict))
plt.plot(dfDatasetTest['mean_score'].values,rankPredict, 'ro')
plt.xlabel('True label')
plt.ylabel('Predicted label')


SpearmanrResult(correlation=-0.91124871001031982, pvalue=1.4607836787089575e-07)
Out[16]:
<matplotlib.text.Text at 0x7f028df8f320>

  • 36 promoters
  • sequence range: $ [-52,+7]$
  • high level of variation over whole sequence

In [65]:
dfDatasetTest = pd.read_csv("data/hammer_lib.csv")
dfDatasetTest.iloc[:5,:3].style.set_properties(**{"font-size":"10px"})


Out[65]:
ID sequence mean_score
0 CP18=CP27 CATTTTGCAGTTTATTCTTGACATTGTGTGCTTCGGGTGTATAATACTAAGTACTGTT 0.2
1 CP37 CATCATTAAGTTTATTCTTCACATTGGCCGGAATTGTTGTATAATACCTTAGTACTGTT 16
2 CP17 CATTCTGGAGTTTATTCTTGACCGCTCAGTATGCAGTGGTATAATAGTACAGTACTGTT 10
3 CP2 CATTTGCTAGTTTATTCTTGACATGAAGCGTGCCTAATGGTATATTACTTGAGTACTGTT 2.8
4 CP4 GATGTTTTAGTTTATTCTTGACACCGTATCGTGCGCGTGATATAATCGGGATCCTTAAGA 1.3

In [42]:
labelsTest, positionBoxTest, spacerTest = regionSelect(dfDatasetTest, ROI, seqRange)
Xtest = positionBoxTest.values
rankPredict = reg.predict(Xtest)

print(stats.spearmanr(dfDatasetTest['mean_score'].values,rankPredict))
plt.plot(dfDatasetTest['mean_score'].values,rankPredict, 'ro')


SpearmanrResult(correlation=0.34255922865570765, pvalue=0.040839010065422274)
Out[42]:
[<matplotlib.lines.Line2D at 0x7fde8c791e48>]

  • 52 promoters
  • sequence range: $ [-53,+4]$
  • high level of variation over whole sequence

In [46]:
dfDatasetTest = pd.read_csv("data/inbio_lib.csv")
dfDatasetTest.iloc[:5,:3].style.set_properties(**{"font-size":"9px"})


Out[46]:
ID sequence mean_score
0 p1 TTAATCAATATTAGTTTCTTGACATCTTCTCCCGCATATGATATAATTAGTGAGTTACCATGGCGGCCGCTCTAGAAGAAGCTTGGGATCCGTCGACCTCGAATTCGGAGGAAACAAAGATG 3779.86
1 p11 TTACTGGACAAAATTTTCTTGACATATTTTTCGTTATATGATATAATGATGAGGTTTCCATGGCGGCCGCTCTAGAAGAAGCTTGGGATCCGTCGACCTCGAATTCGGAGGAAACAAAGATG 787.37
2 p12 TTAGAAGGAATTTGGTTCTTACATATTCGCACGCATGTGATATAATGGGGTTATTTCATGGCGGCCGCTSTAGAAGAAGCTTGGGATCCGTCGACCTSGAATTCGGAGGAAACAAAGATG 154.69
3 p13 CATACAGTTAAATTCTTCTTGACATATTCGGGTAAATATGGTATAATGGGAGCATATCCATGGCGGCCGCTCTAGAAGAAGCTTGGGATCCGTCGACCTCGAATTCGGAGGAAACAAAGATG 3149.04
4 p14 CTTCATTCTATAAGTTTCTTGACATCTTGGCCGGCATATGGTATAATAGGGAAATTTCCATGGCGGCCGCTCTAGAAGAAGCTTGGGATCCGTCGACCTCGAATTCGGAGGAAACAAAGATG 7606.83

In [44]:
labelsTest, positionBoxTest, spacerTest = regionSelect(dfDatasetTest, ROI, seqRange)
Xtest = positionBoxTest.values
rankPredict = reg.predict(Xtest)

print(stats.spearmanr(dfDatasetTest['mean_score'].values,rankPredict))
plt.plot(dfDatasetTest['mean_score'].values,rankPredict, 'ro')
plt.xlabel('True label')
plt.ylabel('Predicted label')


SpearmanrResult(correlation=0.024503212316951642, pvalue=0.86310286460841112)
Out[44]:
<matplotlib.text.Text at 0x7fde8c778c50>

Remarks

  • Model is only strong in predicting changes within and in close range of the -35 and -10 regions
  • Training data only encompasses the $[-47,1]$ region
  • Simple features are used, giving a simple yet robust model
  • Model has been optimized (used loss function) to minimize difference between predicted and true label
  • Model cannot be further optimized with multiple dataset

Future work and focus

Ranked based optimization methods

  • rank dependent loss function
  • optimization over different datasets.

In [49]:
import pandas as pd
dfDatasetTest = pd.read_csv("data/pairedDBanalysis.csv")
dfDatasetTest.iloc[:5]


Out[49]:
Unnamed: 0 ID_1 ID_2 sequence_1 sequence_2 score_1 score_2 rank 35boxstart_1 35boxstart_2 10boxstart_1 10boxstart_2
0 0 apFAB29 apFAB30 AAAAAGAGTATTGACTTAAAGTCTAACCTATAGGATACTTACAGCCAT AAAAAGAGTATTGACTTAAAGTCTAACCTATAGGTATAATGTGTGGAT 565.95 731.46 -1 10 10 34 34
1 1 apFAB29 apFAB31 AAAAAGAGTATTGACTTAAAGTCTAACCTATAGGATACTTACAGCCAT AAAAAGAGTATTGACTTAAAGTCTAACCTATAGGTATAATAGATTCAT 565.95 781.83 -1 10 10 34 34
2 2 apFAB29 apFAB32 AAAAAGAGTATTGACTTAAAGTCTAACCTATAGGATACTTACAGCCAT AAAAAGAGTATTGACTTAAAGTCTAACCTATAGGCATAATTATTTCAT 565.95 645.52 -1 10 10 34 34
3 3 apFAB29 apFAB33 AAAAAGAGTATTGACTTAAAGTCTAACCTATAGGATACTTACAGCCAT AAAAAGAGTATTGACTTAAAGTCTAACCTATAGGTAGATTTAACGTAT 565.95 340.48 1 10 10 34 34
4 4 apFAB29 apFAB34 AAAAAGAGTATTGACTTAAAGTCTAACCTATAGGATACTTACAGCCAT AAAAAGAGTATTGACTATTAATCATCCGGCTCGATACTTACAGCCAT 565.95 14.63 1 10 10 34 33

Neural Networks

  • protein DNA interactions
  • deep neural networks and convolution


In [ ]: