Profile hidden Markov models using PyPore and YAHMM

Author: Jacob Schreiber (jmschreiber91@gmail.com)

Hidden Markov models (HMMs) are powerful tools for performing sequence analysis. In contrast to other techniques, like a Support Vector Machine or a Random Forest, which take in data where each thing to be classified always has the same number of features, a great strength of HMMs is that the data of many different lengths can be interpreted by the same model.

For example, detecting the transmembrane helicies in a 7 transmembrane helix protein was done by Bystroff and Krogh in their 2008 tutorial Hidden Markov Models for Prediction of Protein Features. The goal is to take in a 7 transmembrane domain (7TMD) protein sequence, and predict where the transmembrane helicies may be. This is done by a two-state HMM, with the states being either "transmembrane helix" or "not."

We can try to recreate this model using YAHMM. First, lets import the usual suspects.


In [1]:
from yahmm import *
import numpy as np
import seaborn as sns
from matplotlib.pyplot import *
from IPython.display import *
%matplotlib inline

Now, lets create the model. It is a two state model, with emissions across the amino acids. The transition and emission probabilities are taken from the paper. This code is adapted from Tamas Nagy's notebook where he tries to do the same thing.


In [2]:
# Define your model object and name it
model = Model( name="7TMD" )

# Take the distributions from Table 1 of the paper across all amino acids
aa = list( 'IFLVMWAYGCTSPHNQDEKR' )
transmembrane = np.array([ 1826, 1370, 2562, 1751, 616, 414, 1657, 615, 1243,
                    289, 755, 806, 423, 121, 250, 141, 104, 110, 78, 83 ]) / 15214.
background = np.array([2187, 1854, 4156, 2935, 1201, 819, 3382, 1616,
                       3352, 960, 2852, 3410, 2640, 1085, 2279, 2054,
                       2551, 2983, 2651, 2933]) / 47900.

# Define the two states
tm = State( DiscreteDistribution( dict( zip( aa, transmembrane ) ) ), name="tm" )
null = State( DiscreteDistribution( dict( zip( aa, background ) ) ), name="null" )

# Take transitions determined in the paper
model.add_transition( model.start, null, 1.00 )
model.add_transition( tm, null, 0.046)
model.add_transition( tm, tm, 0.954)
model.add_transition( null, null, 0.984)
model.add_transition( null, tm, 0.016)

# Finalize the model
model.bake( verbose=True )
model.draw( node_size=4000, labels={tm: 'tm', null:'null', model.start:'start', model.end:'end'}, font_size=20 )


Now lets try out the model on the ADA1D protein used in the tutorial. We can run Viterbi decoding and posterior decoding on the sequence in order to try to identify where the transmembrane helicies are. The Viterbi path can do this by finding the most likely states for each of the amino acids, and classify them as either part of a transmembrane helix or not. The posterior looks at the posterior, as calculated by the forward-backward algorithm, and looks at the probability of each sequence member being emitted by the transmembrane helix state.


In [3]:
# Get the sequence of protein ADA1D
ada1d = list('MTFRDLLSVSFEGPRPDSSAGGSSAGGGGGSAGGAAPSEGPAVGGVPGGAGGGGGVVGAG\
SGEDNRSSAGEPGSAGAGGDVNGTAAVGGLVVSAQGVGVGVFLAAFILMAVAGNLLVILS\
VACNRHLQTVTNYFIVNLAVADLLLSATVLPFSATMEVLGFWAFGRAFCDVWAAVDVLCC\
TASILSLCTISVDRYVGVRHSLKYPAIMTERKAAAILALLWVVALVVSVGPLLGWKEPVP\
PDERFCGITEEAGYAVFSSVCSFYLPMAVIVVMYCRVYVVARSTTRSLEAGVKRERGKAS\
EVVLRIHCRGAATGADGAHGMRSAKGHTFRSSLSVRLLKFSREKKAAKTLAIVVGVFVLC\
WFPFFFVLPLGSLFPQLKPSEGVFKVIFWLGYFNSCVNPLIYPCSSREFKRAFLRLLRCQ\
CRRRRRRRPLWRVYGHHWRASTSGLRQDCAPSSGDAPPGAPLALTALPDPDPEPPGTPEM\
QAPVASRRKPPSAFREWRLLGPFRRPTTQLRAKVSSLSHKIRAGGAQRAEAACAQRSEVE\
AVSLGVPHEVAEGATCQAYELADYSNLRETDI')

# Calculate the Viterbi prediction to transmembrane domains
logp, path = model.viterbi( ada1d )
viterbi_tmds = np.array( map( lambda x: x[1].name == 'tm', path[1:-1] ) )

starts = np.nonzero( viterbi_tmds & ~np.roll(viterbi_tmds, 1) )[0]
ends = np.nonzero( viterbi_tmds & ~np.roll(viterbi_tmds, -1) )[0]
transmems = zip(starts, ends)

# Set up plotting
colors = sns.color_palette('deep', n_colors=6, desat=0.5)

figure( figsize=(20,6) )

axhline(y=1.1, c=colors[0], alpha=0.7, label="Viterbi Path")
axhline(y=0.5, c=colors[1], alpha=0.7, label="Actual Helix Locations")
xlim([1, len(ada1d)+1])
ylim([0,1.2])
ylabel('probability')
axis = gca()

# Plot viterbi predicted TMDs
for start, end in transmems:
    axis.add_patch( Rectangle((start+1, 1.075), end-start+1, 0.05, facecolor=colors[0], alpha=0.7) )

# Actual transmembranes from http://beta.uniprot.org/uniprot/P25100
for start, end in [(96,121),(134,159),(170,192),(214,238),
                   (252,275),(349,373),(381,405)]:
    axis.add_patch( Rectangle((start+1, .475), end-start+1, 0.05, facecolor=colors[1], alpha=0.7) )

# Get indices
indices = { state.name: i for i, state in enumerate( model.states ) }

# Use posterior decoding to attempt to calculate the transmembrane domains
trans, ems = model.forward_backward( ada1d )
ems = np.exp( ems )

plot( xrange(1, len(ada1d)+1), ems[:, indices['tm']] / ems.sum( axis=1 ), c=colors[2], alpha=0.7, label="Posterior Probability" )
legend( fontsize=18 )


Out[3]:
<matplotlib.legend.Legend at 0x16669c50>

It looks like the posterior decoding was significantly better than the Viterbi path. An improvement to the Viterbi path would be to build a 15 state model with interlaced transmembrane helix and null states, to ensure that the Viterbi path found the 7 most likely regions to be transmembrane helicies, as opposed to trying to guess an undetermined number. The posterior decoding seems to do alright, but seems to predict around 8 helicies instead of the 7.

However, this is not a profile HMM. A Profile HMM models a reference sequence, and allows certain deviations from the reference based on what features you expect to see in the observed sequences. A typical biosequence alignment allows for mismatches by having pseudocounts on each match state, allows for insertions into the observed sequence with another symbol-emitting state, and deletions from the reference by a silent state which jumps over that match state. The previous HMM was not close to this, as a profile HMM will grow with respect to length of the reference, while the previous HMM would always be two states.

Picture from Biological Sequence Analysis: Probabilistic Methods of Proteins and Nucleic Acids by Durbin, Eddy, Krogh, and Mitchison

See http://nbviewer.ipython.org/github/jmschrei/yahmm/blob/master/examples/Global%20Sequence%20Alignment.ipynb for a more detailed analysis of using this model to perform sequence alignment.

However, not all tasks are alignment to a reference! Sometimes you want to perform classification based on adding forks to the profile, and classifying based on which fork was taken at that point in the model. This can be excessively complicated to code, especially if you begin to add more features than just INDELs. This brings us to the module structure, which is the idea that you model the repeating subunit, which takes in a single symbol to be modelled and has silent state 'ports' on either side.


In [4]:
SVG(filename="C:\\Users\\jmschrei\\Desktop\\IPython\\BINF ProfileHMM GSAHMM Boards.svg")


Out[4]:
M I D I M I D S ... ... ... M I D M I D E M I D I M I D S ... ... ... S1 S2 M I D E1 E2 S1 S2 M I D E1 E2 I S ... ... S1 S2 M I D E1 E2 S1 S2 M I D E1 E2 S1 S2 M I D E1 E2 ... ... ... ... ... ... S1 S2 M I D E1 E2 S1 S2 M I D E1 E2 S1 S2 M I D E1 E2 b c d a

We can code this module subunit easily using YAHMM and PyPore. PyPore has a class named HMMBoard, which will build a module with n silent state ports on either side. The user will then fill in what can happen in each one of these modules. Each module should represent a single symbol in the reference, so that it only needs to be fed a single distribution representing the probability of various symbols at some position in the profile being modelled.

The coding is done by defining a function which takes in some arbitrary distribution, a name, and a distribution over the insertions. You can have whatever states you want, and whatever edges connecting them which work for you. This creates an example of the above module structure for a traditional global sequence alignment of biosequences.


In [5]:
def GlobalAlignmentModule( distribution, name, insert ):
    '''
    This is the repeating subunit for a typical global sequence alignment HMM.
    '''
    
    # Create the initial board with n ports on either side, in this case 2.
    board = HMMBoard( 2, name )
    
    # Define the three states in the repeating subunit, the match, the insert, and the delete
    delete = State( None, name="D:{}".format( name ) )
    insert = State( insert, name="I:{}".format( name ) )
    match = State( distribution, name="M:{}".format( name ) )
    
    # Define the transitions
    board.add_transition( board.s1, delete, 1.00 )
    board.add_transition( board.s2, match, 1.00 )
    
    board.add_transition( match, board.e2, 0.9 )
    board.add_transition( match, board.e1, 0.05 )
    board.add_transition( match, insert, 0.05 )
    
    board.add_transition( delete, insert, 0.15 )
    board.add_transition( delete, board.e1, 0.15 )
    board.add_transition( delete, board.e2, 0.7 )
    
    board.add_transition( insert, insert, 0.7 )
    board.add_transition( insert, board.e1, 0.15 )
    board.add_transition( insert, board.e2, 0.15 )
    
    # Return the model, unbaked
    return board

Now we need to find a way to join these modules together to form a linear profile. ModularProfileModel does a good job of this, by taking in the module function, the list of distributions, a name, and a distribution for the insert state, and returning a prepared model. What it does is simply builds a module for each distribution in the profile, and then connects the ports together appropriately with probability 1.0 transitions. This makes building Profile HMMs as easy as defining the repeating subunit. If we compare the results of this model to the model from the Global Alignment notebook referenced earlier, we can test to see if we're getting the same log probabilities for the same sequence as the entirely hand-built model. We can see that we basically do. The middle two values are slightly off because the transition probabilities to and from the initial insert before the profile starts are slightly different, and in my opinion more correct in the ModularProfileModel class. I just didn't want to change the other example, since it was still a valid example.


In [6]:
from PyPore.hmm import *

insert = DiscreteDistribution( {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25 } )
distributions = [ DiscreteDistribution({ "A": 0.95, 'C': 0.01, 'G': 0.01, 'T': 0.02 }), 
                  DiscreteDistribution({ "A": 0.003, 'C': 0.99, 'G': 0.003, 'T': 0.004 }),
                  DiscreteDistribution({ "A": 0.01, 'C': 0.01, 'G': 0.01, 'T': 0.97 }) ]


model = ModularProfileModel( GlobalAlignmentModule, distributions, "ProfileHMM", insert=insert )

for sequence in map( list, ('ACT', 'GGC', 'GAT', 'ACC') ):
    logp, path = model.viterbi( sequence )
    print "Sequence: '{}'  -- Log Probability: {} -- Path: {}".format(
        ''.join( sequence ), logp, " ".join( state.name for idx, state in path[1:-1] ) )


Sequence: 'ACT'  -- Log Probability: -0.513244900357 -- Path: M:0 M:1 M:2 b2e2
Sequence: 'GGC'  -- Log Probability: -12.8116898336 -- Path: D:0 I:0 I:0 M:1 D:2 b2e2
Sequence: 'GAT'  -- Log Probability: -9.94172694734 -- Path: D:0 I:0 I:0 D:1 M:2 b2e2
Sequence: 'ACC'  -- Log Probability: -5.08795587886 -- Path: M:0 M:1 M:2 b2e2

However, we can now make this reference arbitrarily long without needing to write any more code! As we change n, we get a more complex profile, but we don't need to actually write any more code. Basically, no matter how long your profile is, all you need to do is pass in your list of distributions and the function will take care of the rest.


In [7]:
n = 25

# Build a profile of length N, where one nucleotide is chosen at a time for the profile,
# and the rest remain at probability 0.01.
profile = [{ 'A': 0.01, 'C': 0.01, 'G': 0.01, 'T': 0.01 } for i in xrange(n)]
for i in xrange( n ):
    nucleotide = random.choice(['A', 'C', 'G', 'T'])
    profile[i][nucleotide] = 0.97
distributions = map( DiscreteDistribution, profile ) 

model = ModularProfileModel( GlobalAlignmentModule, distributions, "LongProfileHMM", insert )

seq = model.sample()
print "Sequence: {}".format( ''.join( seq ) )

logp, path = model.viterbi( seq )
print "Log Probability: {}".format( logp ) 
print "Path: {}".format( ' '.join( state.name for i, state in path[1:-1] ) )


Sequence: GTTATGGCAGTATAGTTAGTTGCAGTCT
Log Probability: -33.8383231789
Path: M:0 I:0 I:0 I:0 M:1 M:2 M:3 M:4 M:5 M:6 D:7 M:8 M:9 M:10 M:11 M:12 M:13 D:14 M:15 M:16 M:17 M:18 M:19 M:20 M:21 M:22 I:22 I:22 I:22 D:23 M:24 b24e2

While this module may not model the features we want to, it still could be used to model nanopore data as a series of numbers instead of characters. All we need to do is change the distributions from DiscreteDistribution to whatever number generating distributions we want. Since nanopore data is basically a Gaussian distribution, we can use NormalDistribution. Lets try on an arbitrarily created example.


In [8]:
distributions = [ NormalDistribution( 5, 1 ),
                  NormalDistribution( 10, 1 ),
                  NormalDistribution( 23, 1 ),
                  NormalDistribution( 16, 1 ),
                  NormalDistribution( 18, 1 ) ]
insert = UniformDistribution( 0, 90 )

model = ModularProfileModel( GlobalAlignmentModule, distributions, "ContinuousProfileHMM", insert=insert )

seq = model.sample()
print "Sequence: {}".format( ' '.join( map( str, map( round, *[seq, [2]*len(seq) ] ) ) ) )

logp, path = model.viterbi( seq )
print "Log Probability: {}".format( logp ) 
print "Path: {}".format( ' '.join( state.name for i, state in path[1:-1] ) )

print
seq = [5, 10, 23, 22, 23, 17, 18]
print "Sequence: {}".format( ' '.join( map( str, map( round, *[seq, [2]*len(seq) ] ) ) ) )

logp, path = model.viterbi( seq )
print "Log Probability: {}".format( logp ) 
print "Path: {}".format( ' '.join( state.name for i, state in path[1:-1] ) )


Sequence: 5.33 10.84 22.86 77.04 57.73 79.08
Log Probability: -24.8469114833
Path: M:0 M:1 M:2 D:3 I:3 I:3 I:3 D:4 b4e2

Sequence: 5.0 10.0 23.0 22.0 23.0 17.0 18.0
Log Probability: -19.8706417873
Path: M:0 M:1 I:1 I:1 M:2 M:3 M:4 b4e2

This model is good for biosequences, but not so good for nanopore data. Inserts can occur in the middle of segments in the form of transient spikes, and backslips can occur depending on the enzyme used. The enzyme activity plays a large role in the things which the model needs to include. $\Phi$29 may need backslips and transient spike handling, but another enzyme may have longer or more frequent backslips. Lastly, the segmenter needs to also be modelled, as oversegmentation and undersegmentation will always be a problem. In order to handle these things, the following model has been created.

Lets try to build another module, which takes into account some features of $\Phi$29. We should model backslips, transient insertions, and the possibility of oversegmentation. We see (1) transient spikes in teal, which present as a short spike before a return to the previous ionic current level, and can be solved by the teal arrow from the insert state back to the match state, (2) oversegmentation caused by a mishap with the segmenter, which can be patched up by adding a self loop on the match state, and (3) backslips in a sequence of ionic current which is supposed to repeat the 45-40-35 cycle over and over by fluctuating between 45 and 35 once. Backslips can be solved by adding a track on the bottom of silent states, which indicate the ability to move backward through a sequence in addition to forward.


In [9]:
SVG( filename="C:\\Users\\jmschrei\\Desktop\\IPython\\IPython Simple NanoporeHMM.svg")


Out[9]:
40 ms 70 60 50 40 30 50 40 30 20 250 ms 25 ms 50 45 40 35 30 S1 S2 S3 S3 E1 E2 E3 E3 I I D D M2 M2 B B

This repeating subunit is not terribly complicated, and can be written as a repeating subunit in the following manner:


In [10]:
def NanoporeGlobalAlignmentModule( distribution, name, insert ):
    '''
    Creates a single board from a distribution. This will create a module which
    is based off the traditional global sequence alignment hmm which contains
    inserts, deletes, and matches, but adds a self loop for matches, a loop back
    to a match from an insert, and a backslip state as well.
    '''

    # Create the board object
    board = HMMBoard( 3, name )
    board.directions = [ '>', '>', '<' ]

    # Create the four states in the module
    insert = State( insert, name="I:{}".format( name ) )
    match = State( distribution, name="M:{}".format( name ) )
    delete = State( None, name="D:{}".format( name ) )
    backslip = State( None, name="B:{}".format( name ) )

    # Add transitions between these states.
    board.add_transition( board.s1, delete, 1.0 )
    board.add_transition( board.s2, match, 1.0 )
    board.add_transition( board.e3, backslip, 1.0 )

    # Add transitions from the backslip state 
    board.add_transition( backslip, match, 0.85 )
    board.add_transition( backslip, board.s3, 0.15 )

    # Add transitions from the delete state
    board.add_transition( delete, board.e1, 0.1 )
    board.add_transition( delete, board.e2, 0.8 )
    board.add_transition( delete, insert, 0.1 )

    # Add transitions from the match state
    board.add_transition( match, board.s3, 0.033 )
    board.add_transition( match, match, 0.4 )
    board.add_transition( match, board.e2, 0.5 )
    board.add_transition( match, insert, 0.033 )
    board.add_transition( match, board.e1, 0.34 )

    # Add transitions from the insert state
    board.add_transition( insert, insert, 0.50 )
    board.add_transition( insert, match, 0.20 )
    board.add_transition( insert, board.e1, 0.05 )
    board.add_transition( insert, board.e2, 0.25)

    # Return the board
    return board

Once we have the repeating subunit created, we use almost the exact same code to turn it into a model and test it. The only differences are that when printing out the sequence, instead of using ''.join( seq ), I round all the values so that they don't needlessly display to 8 significant figures or more.


In [11]:
insert = UniformDistribution( 0, 50 ) 
distributions = [ NormalDistribution( 15, 0.3 ),
                  NormalDistribution( 23, 0.3 ),
                  NormalDistribution( 14, 0.4 ),
                  NormalDistribution( 17, 0.4 ),
                  NormalDistribution( 26, 0.4 ),
                  NormalDistribution( 22, 0.3 ) ]

model = ModularProfileModel( NanoporeGlobalAlignmentModule, distributions, "Nanopore", insert )

seq = model.sample()
print "Sequence: {}".format( map( round, *[ seq, [2]*len(seq) ]) )

logp, path = model.viterbi( seq )
print "Log Probability: {}".format( logp ) 
print "Path: {}".format( ' '.join( state.name for i, state in path[1:-1] ) )


Sequence: [15.01, 14.14, 13.62, 13.82, 13.98, 13.99, 14.05, 30.53, 32.12, 22.12]
Log Probability: -24.0604430121
Path: M:0 D:1 M:2 M:2 M:2 M:2 M:2 M:2 D:3 I:3 I:3 D:4 M:5 b5e2

So now we see that we can get the same functionality when using continuous symbols as discrete characters. The point of the modular format was to easily allow forks, though. So lets try building a model which contains forks. It is as simple as using a dictionary of distributions where the keys are the name of the fork, and the values are the distribution in that fork at that position. No additional code is needed on the model side. In order to show that it worked as intended, we can calculate the Viterbi path of both model sequences to show that they do indeed go down different paths.


In [12]:
distributions = [ NormalDistribution( 15, 0.3 ),
                  NormalDistribution( 23, 0.3 ),
                  NormalDistribution( 14, 0.4 ),
                  NormalDistribution( 17, 0.4 ),
                  NormalDistribution( 26, 0.4 ),
                  NormalDistribution( 22, 0.3 ),
                  {'A': NormalDistribution( 27, 0.3 ), 'B': NormalDistribution( 24, 0.4 ) },
                  {'A': NormalDistribution( 28, 0.3 ), 'B': NormalDistribution( 26, 0.3 ) },
                  NormalDistribution( 25, 0.2 ),
                  NormalDistribution( 18, 0.3 ) ]

model = ModularProfileModel( NanoporeGlobalAlignmentModule, distributions, "TestModel", insert )

seq_A = [ 15, 23, 14, 17, 26, 22, 27, 28, 25, 18 ]
seq_B = [ 15, 23, 14, 17, 26, 22, 24, 26, 25, 18 ]

logp_A, path_A = model.viterbi( seq_A )
logp_B, path_B = model.viterbi( seq_B )

print "\n".join( "{}\t{}".format( a.name, b.name ) for (i, a), (j, b) in zip( path_A[1:-1], path_B[1:-1] ) )


M:0	M:0
M:1	M:1
M:2	M:2
M:3	M:3
M:4	M:4
M:5	M:5
b5e2	b5e2
M:A:7	M:B:7
M:A:8	M:B:8
M:8	M:8
M:9	M:9
b9e2	b9e2

By running the model sequence for each fork through the model, we get pretty much what we expected. There is an additional silent state in the middle of the Viterbi path because at a fork, each silent state port has multiple transitions instead of the single probability 1.00 transition, and so cannot be merged. This can actually be useful if you need a marker as to when a sequence enters a fork.

Since all of these boards are just YAHMM models, and putting boards together doesn't make it less of a YAHMM model, we can still do all of the things that a YAHMM model can, such as sampling from the model, and training. If we sample a bunch of random sequences, and then train on those sequences in order to try to get a better model, we can use the traditional YAHMM methods without needing to write any new code ourselves.


In [13]:
sequences = [ model.sample() for i in xrange( 30 ) ]
model.train( sequences, max_iterations=5, use_pseudocount=True )


Training improvement: -0.160916259102
Total Training Improvement:  -0.160916259102
Out[13]:
-0.16091625910174479

When we look at the posterior, we can calculate soft calls instead of the hard calls of the Viterbi algorithm or Maximum a Posteriori. We can do this by calculating the emission weight matrix according to forward-backward, and then summing across each of the match states in that fork. This gives us the probability of any segments aligning to these states, and if we normalize by the total counts, we can get percentages that an event corresponds to each of the models. It is often useful to see that an event is 99% A and 1% B, for example, because sometimes saying something is 52% A and 48% B is not good enough for a classification, but a hard decision rule like the Viterbi algorithm may simply call that 'A' with you none the wiser.


In [14]:
def classify( sequence, model ):
    '''
    Take in a sequence, and use Mean a Posteriori in order to return soft calls as to
    which fork was taken.
    '''
    
    indices = { state.name: i for i, state in enumerate( model.states ) }
    trans, ems = model.forward_backward( sequence )
    
    pA = np.exp( ems[ :, np.array( [indices[ 'M:A:7' ], indices['M:A:8']] ) ] ).sum()
    pB = np.exp( ems[ :, np.array( [indices[ 'M:B:7' ], indices['M:B:8']] ) ] ).sum()
    
    return pA / (pA+pB), pB / (pA+pB)

seq = model.sample()
pA, pB = classify( model.sample(), model )

print "Sequence: {}".format( map( round, *[ seq, [2]*len(seq) ] ) )
print "P(A|Sequence): {}    P(B|Sequence): {}".format( pA, pB )


Sequence: [15.02, 23.27, 17.02, 25.49, 22.05, 22.14, 24.12, 24.6, 25.15]
P(A|Sequence): 0.0124319144361    P(B|Sequence): 0.987568085564

It is useful to look at these numbers and see the percentages for each event. But maybe it would also be useful to see these values plotted across the entire profile. This should hopefully show spikes as the event went across a specific fork, and low values otherwise. If this is not the case, there may be a problem with the model.


In [15]:
def classify_and_plot( sequence, model ):
    '''
    Take in a sequence, and use Mean a Posteriori in order to plot soft calls across a sequence.
    '''
    
    indices = { state.name : i for i, state in enumerate( model.states ) }
    trans, ems = model.forward_backward( sequence )
    
    pA = np.exp( ems[ :, np.array( [indices[ 'M:A:7' ], indices['M:A:8']] ) ] ).sum( axis=1 )
    pB = np.exp( ems[ :, np.array( [indices[ 'M:B:7' ], indices['M:B:8']] ) ] ).sum( axis=1 )
    
    figure( figsize=(15, 4) )
    plot( pA, linewidth=2, alpha=0.66, c='c', label='A' )
    plot( pB, linewidth=2, alpha=0.66, c='m', label='B' )
    legend()
    show()

classify_and_plot( model.sample(), model )


This gives us a powerful tool for making classifications between things which differ in various areas based almost entirely on these forks-- which are effortless to put in now! By simply using a form of posterior decoding, we only need to identify which states are unique to each fork, say that going through those states indicates that a classification, and use that. This also gives a convenient way to show dispersion of classes, in that an event may be 98% A and 2% B instead of saying that it's just "A". This does not matter in most cases, but in some cases saying that something is "A", or saying it is "52% A", can make a huge difference in your confidence in that classification.

Lets show this working on some real nanopore data, using a more sophisticated nanopore alignment module. In addition to the features mentioned before, two additional features are added.

The first is a match 'model' instead of a single match state. The two match states have the same underlying distribution, but differing self loop probabilities. This is in order to handle a segmenter phenomena which causes it to oversegment severely sometimes on high variance regions. This caused the probability of seeing a segment oversegmented into 6 or 7 segments more likely than the probability of seeing it segmented into 3. This match model turns the number of segments expected into a double exponential instead of a single exponential.

The second is fluctuation between two adjacent states. This is an observed, but unexplained phenomena when using $\Phi$29 to mediate strand movement. This means that the probability of seeing a second backslip increases when the first backslip is seen. The simple model before can't handle this, and would expect the distribution of backslips to be a simple exponential. This creates a slower-decaying exponential in order to handle this phenomena.


In [16]:
SVG( "C:\\Users\\jmschrei\\Desktop\\IPython\\IPython Intermediate NanoporeHMM.svg" )


Out[16]:
40 ms 70 60 50 40 30 50 40 30 20 250 ms 25 ms 50 45 40 35 30 55 50 45 40 35 30 S1 S2 S3 S4 S5 E1 E2 E5 M2 I D S S M2 B M2 M2 E3 E4 500 ms

In [17]:
def Phi29GlobalAlignmentModule( distribution, name, insert=UniformDistribution( 0, 90 ) ):
    """
    Create a module which represents a full slice of the PSSM. Take in
    the distribution which should be represented at that position, and
    create a board with 7 ports on either side.
    """

    def match_model( distribution, name ):
        """
        Build a small match model, allowing for oversegmentation where the
        distribution representing number of segments is a mixture of two
        exponentials.
        """

        model = Model( name=str(name) )

        match = State( distribution, name="M:{}".format( name ) ) # Match without oversegmentation
        match_os = State( distribution, name="MO:{}".format( name ) ) # Match with oversegmentation

        model.add_states( [ match, match_os ] )

        model.add_transition( model.start, match, 0.95 )
        model.add_transition( model.start, match_os, 0.05 )

        model.add_transition( match, match, 0.10 )
        model.add_transition( match, model.end, 0.90 )

        model.add_transition( match_os, match_os, 0.80 )
        model.add_transition( match_os, model.end, 0.20 )
        return model
    
    # Create the board object
    board = HMMBoard( n=5, name=str(name) )
    board.directions = ['>', '>', '<', '>', '<']

    # Create the states in the model
    delete = State( None, name="D:{}".format( name ) )
    match = match_model( distribution, name=name )
    insert = State( insert, name="I:{}".format( name ) )
    match_s = State( distribution, name="MS:{}".format( name ) )
    match_e = State( distribution, name="ME:{}".format( name ) )

    # Add the match model
    board.add_model( match )
    
    # Add all the individual states
    board.add_states( [delete, insert, match_s, match_e] )

    # Add all transitions from the ports to the states
    board.add_transition( board.s1, delete, 1.00 )
    board.add_transition( board.s2, match.start, 1.00 )
    board.add_transition( board.e3, match_e, 1.00 )
    board.add_transition( board.s4, match_s, 1.00 )
    board.add_transition( board.e5, match.start, 0.90 )
    board.add_transition( board.e5, match_e, 0.05 )
    board.add_transition( board.e5, match.start, 0.05 )

    board.add_transition( delete, board.e1, 0.1 )
    board.add_transition( delete, insert, 0.1 )
    board.add_transition( delete, board.e2, 0.8 )

    board.add_transition( insert, match.start, 0.10 )
    board.add_transition( insert, insert, 0.50 )
    board.add_transition( insert, board.e1, 0.05 )
    board.add_transition( insert, board.e2, 0.35 )

    board.add_transition( match.end, insert, 0.01 )
    board.add_transition( match.end, board.e1, 0.01 )
    board.add_transition( match.end, board.e2, .97 )
    board.add_transition( match.end, board.s5, .01 )

    board.add_transition( match_s, board.s3, 0.80 )
    board.add_transition( match_s, match_s, 0.20 )

    board.add_transition( match_e, board.e2, 0.10 )
    board.add_transition( match_e, match_e, 0.10 )
    board.add_transition( match_e, board.e4, 0.80 )
    return board

In order to validate this repeating subunit, we turn to some real nanopore data. I chose to run it against the data from my PNAS paper Error Rates for Nanopore Discrimination Among Cytosine, Methylcytosine, and Hydroxymethylcytosine Along Individual DNA Strands. Briefly, DNA strands bearing either a single cytosine, methylcytosine, or hydroxymethylcytosine and a corresponding label were embedded in an otherwise-identical sequence were run through the nanopore. Machine learning methods were used to estimate an error rate for making the classification based on the differences in ionic current created as each of these cytosine variants passed through the nanopore, and validated against the unambiguous label. The data collected and tagging was done entirely by hand. Error rates depended on surrounding sequence context, as the nanopore used reads ~4 nucleotides at a time. The GCGC context performed the best with a ~2% error rate, and the CCGG context performed the worst with a ~12% error rate. Lets look at some of the CCGG data to see if an automated system can do decently on it.

The model used was a profile with three forks. The strands used would go through in reverse direction as the $\Phi$29 enzyme stripped off a 'blocking oligomer' which was complimentary to the template strand. This caused the first fork in the model, as the cytosine variant went through the nanopore backwards. Then, synthesis would begin and the strand was pulled forward, causing the cytosine variant to go through the nanopore again. Lastly, the label would come through after the cytosine variant, causing the third fork.

~5 examples of each cytosine variant and label were collected and analyzed by hand in order to get seed data for this method, and saved in an excel file. Lets unpack that and build the initial profile.


In [18]:
import pandas as pd

data = pd.read_excel( "C:\\Users\\jmschrei\\Desktop\\automation\\data.xlsx", "Sheet1" )
dists = {}

for name, frame in data.groupby( 'Label' ):
    means, stds = frame.mean(axis=0), frame.std(axis=0)
    dists[name] = [ NormalDistribution( m, s ) for m, s in zip( means, stds ) if not np.isnan( m ) ]

labels = ['T', 'CAT', 'X' ]
cytosines = [ 'C', 'mC', 'hmC' ]

# Initialize the profile with a single reverse CAT repeat
profile = []
profile.extend( dists['CAT'][::-1] )

# Create the unzipping fork 
for i in xrange( 9 ):
    profile.append( { 'C': dists['C'][8-i], 'mC': dists['mC'][8-i], 'hmC': dists['hmC'][8-i] } )

# Add the middle, including mirror from reverse CATs to actual CATs
profile.extend( dists['CAT'][::-1] )
profile.extend( dists['CAT'][::-1] )
profile.extend( dists['CAT'] )
profile.extend( dists['CAT'] )

# Create the synthesis fork
for i in xrange( 9 ):
    profile.append( { 'C': dists['C'][i], 'mC': dists['mC'][i], 'hmC': dists['hmC'][i] } )

# Add the CAT between the synthesis and label forks 
profile.extend( dists['CAT'] )

# Create the label forks
for i in xrange( 6 ):
    profile.append( { 'T': dists['T'][i], 'X': dists['X'][i], 'CAT': dists['CAT'][i%3] } )

# Add the poly-CAT tail ending
profile.extend( dists['CAT'] )
profile.extend( dists['CAT'] )
profile.extend( dists['CAT'] )
profile.extend( dists['CAT'] )

Lets go through each one of the events in a model, and look at the posterior probabilities of each of the cytosine variants across an event, and the posterior probabilities of each of the labels. In this case, C corresponds to T, mC corresponds to CAT, and hmC corresponds to X. A miscall is when these do not line up. Percentages for each events can easily be calculated by showing the probability of going down a single one of those forks.


In [19]:
from PyPore.DataTypes import *
from PyPore.parsers import *

def analyze_event( model, sequence ):
    
    indices = { state.name: i for i, state in enumerate( model.states ) }
    C = [ "M:C:{}".format( i ) for i in xrange( 25, 34 ) ]
    mC = [ "M:mC:{}".format( i ) for i in xrange( 25, 34 ) ]
    hmC = [ "M:hmC:{}".format( i ) for i in xrange( 25, 34 ) ]
    
    X = [ "M:X:{}".format( i ) for i in xrange( 37, 43 ) ]
    T = [ "M:T:{}".format( i ) for i in xrange( 37, 43 ) ]
    CAT = [ "M:CAT:{}".format( i ) for i in xrange( 37, 43 ) ]

    C = np.array( map( indices.__getitem__, C ) )
    mC = np.array( map( indices.__getitem__, mC ) )
    hmC = np.array( map( indices.__getitem__, hmC ) )

    X = np.array( map( indices.__getitem__, X ) )
    T = np.array( map( indices.__getitem__, T ) )
    CAT = np.array( map( indices.__getitem__, CAT ) )

    trans, ems = model.forward_backward( sequence, tie=False )

    pC = np.exp( ems[ :, C ] ).sum( axis=1 )
    pmC = np.exp( ems[ :, mC ] ).sum( axis=1 )
    phmC = np.exp( ems[ :, hmC ] ).sum( axis=1 )

    pX = np.exp( ems[ :, X ] ).sum( axis=1 )
    pT = np.exp( ems[ :, T ] ).sum( axis=1 )
    pCAT = np.exp( ems[ :, CAT ] ).sum( axis=1 )
    
    c_norm = pC.sum() + pmC.sum() + phmC.sum()
    l_norm = pX.sum() + pT.sum() + pCAT.sum()
    
    plot( pC, alpha=0.66, label='C', c='b' )
    plot( pmC, alpha=0.66, label='mC', c='r' )
    plot( phmC, alpha=0.66, label='hmC', c='c' )
    plot( pX, alpha=0.66, label='X', c='g' )
    plot( pT, alpha=0.66, label='T', c='m' )
    plot( pCAT, alpha=0.66, label='CAT', c='#FF0000' )
    ylabel( 'Probability' )
    xlabel( 'Sequence Position' )
    legend()
    
    return np.array( [pC.sum(), pmC.sum(), phmC.sum() ] ) / c_norm, np.array( [pT.sum(), pCAT.sum(), pX.sum() ] ) / l_norm

file = File( "C:\\Users\\jmschrei\\Desktop\\automation\\14418004-s04.abf" )
file.parse( lambda_event_parser(  threshold=90, rules=[ lambda event: event.duration > 0.5*file.second, lambda event: event.min > 0. ] ) )
model = ModularProfileModel( Phi29GlobalAlignmentModule, profile, "NanoporeHMM-54", UniformDistribution( 0, 90 ) )

for event in file.events:
    event.filter()
    event.parse( SpeedyStatSplit( prior_segments_per_second=40., sampling_freq=1.e5, cutoff_freq=2000. ) )
    sequence = [ segment.mean for segment in event.segments ]
    
    figure( figsize=(20, 6) )
    
    subplot(211)
    event.plot( color='cycle' )
    
    subplot(212)
    (C, mC, hmC), (T, CAT, X) = analyze_event( model, sequence )
    
    show()
    
    print "{}% C, {}% mC, {}% hmC".format( round( C*100, 2), round( mC*100, 2 ), round( hmC*100, 2 ) )


99.03% C, 0.0% mC, 0.97% hmC
95.32% C, 3.61% mC, 1.07% hmC
0.0% C, 5.49% mC, 94.51% hmC
5.16% C, 10.74% mC, 84.1% hmC
0.0% C, 100.0% mC, 0.0% hmC
57.04% C, 30.67% mC, 12.29% hmC
0.0% C, 99.99% mC, 0.01% hmC
0.0% C, 100.0% mC, 0.0% hmC
0.0% C, 99.75% mC, 0.25% hmC
0.02% C, 0.38% mC, 99.59% hmC
0.53% C, 98.7% mC, 0.77% hmC
0.0% C, 99.99% mC, 0.01% hmC
0.0% C, 100.0% mC, 0.0% hmC
0.01% C, 58.38% mC, 41.61% hmC
75.14% C, 22.46% mC, 2.4% hmC
0.06% C, 41.62% mC, 58.32% hmC
0.01% C, 0.0% mC, 99.99% hmC
0.0% C, 100.0% mC, 0.0% hmC
3.57% C, 0.19% mC, 96.23% hmC
0.0% C, 0.28% mC, 99.72% hmC
12.16% C, 1.37% mC, 86.48% hmC
0.0% C, 0.24% mC, 99.76% hmC
0.0% C, 0.0% mC, 100.0% hmC
0.0% C, 100.0% mC, 0.0% hmC
0.0% C, 96.29% mC, 3.71% hmC
0.0% C, 7.89% mC, 92.11% hmC
0.02% C, 0.7% mC, 99.28% hmC
0.0% C, 57.31% mC, 42.69% hmC
0.0% C, 100.0% mC, 0.0% hmC
0.0% C, 100.0% mC, 0.0% hmC
0.0% C, 100.0% mC, 0.0% hmC
0.0% C, 100.0% mC, 0.0% hmC
0.0% C, 4.07% mC, 95.93% hmC
0.0% C, 8.86% mC, 91.14% hmC
4.12% C, 71.07% mC, 24.81% hmC
16.81% C, 82.13% mC, 1.06% hmC
0.0% C, 0.0% mC, 100.0% hmC
99.9% C, 0.0% mC, 0.1% hmC
4.61% C, 4.71% mC, 90.67% hmC
0.0% C, 0.0% mC, 100.0% hmC
0.0% C, 97.22% mC, 2.78% hmC
0.0% C, 0.0% mC, 100.0% hmC
0.0% C, 100.0% mC, 0.0% hmC
0.0% C, 5.53% mC, 94.47% hmC
0.0% C, 0.0% mC, 100.0% hmC
0.0% C, 100.0% mC, 0.0% hmC
0.17% C, 99.38% mC, 0.45% hmC

That's pretty cool. Using the build in functionality of PyPore, very little work was needed! Far more work was needed to unpack the data from the excel file than to build this complicated model with a 5-port repeating subunit, and three forks three times. Once the model was built, only some knowledge of posterior decoding was needed in order to get this thing working.

As a disclaimed, this method was not the final or most accurate way of making this classification, and that module structure was not the final one used. Both will be revealed in an upcoming (hopefully) paper!

Package Information

YAHMM and PyPore are both freely available under the MIT license at https://github.com/jmschrei/yahmm and https://github.com/jmschrei/PyPore. Feel free to contribute or comment on the libraries!

Installing YAHMM is easy, but simply calling pip install yahmm. Dependencies are numpy, scipy, matplotlib, networkx, and Cython. If you do not have these, and pip fails when building the prerequisites, the Anaconda distribution is a good way to get many scientific packages for Python, and has been shown on multiple platforms to allow YAHMM to install.

PyPore is has not yet been added to pip, but will be soon. Currently the easiest way to get it is to clone it from GitHub, or download the zip.

If you have questions or comments, my email is jmschreiber91@gmail.com.


In [19]: