Tied States Hidden Markov Model

authors:
Jacob Schreiber [jmchreiber91@gmail.com],
Nicholas Farn [nicholasfarn@gmail.com]

An example of using tied states to represent the same distribution across multiple states. This example is a toy example derived from biology, where we will look at DNA sequences.

The fake structure we will pretend exists is:

start -> background -> CG island -> background -> poly-T region

DNA is comprised of four nucleotides, A, C, G, and T. Lets say that in the background sequence, all of these occur at the same frequency. In the CG island, the nucleotides C and G occur more frequently. In the poly T region, T occurs most frequently.

We need the graph structure, because we fake know that the sequence must return to the background distribution between the CG island and the poly-T region. However, we also fake know that both background distributions need to be the same.


In [1]:
from pomegranate import *
import random
import numpy as np

random.seed(0)

Lets start off with an example without tied states and see what happens.


In [2]:
untiedmodel = HiddenMarkovModel( "No Tied States" )

Here we'll define the four states.


In [3]:
background_one = State( DiscreteDistribution({'A': 0.25, 'C':0.25, 'G': 0.25, 'T':0.25 }), name="B1" )
CG_island = State( DiscreteDistribution({'A': 0.1, 'C':0.4, 'G': 0.4, 'T':0.1 }), name="CG" )
background_two = State( DiscreteDistribution({'A': 0.25, 'C':0.25, 'G': 0.25, 'T':0.25 }), name="B2" )
poly_T = State( DiscreteDistribution({'A': 0.1, 'C':0.1, 'G': 0.1, 'T':0.7 }), name="PT" )

Then add the starting transitions.


In [4]:
untiedmodel.add_transition( untiedmodel.start, background_one, 1. )

The transition matrix.


In [5]:
untiedmodel.add_transition( background_one, background_one, 0.9 )
untiedmodel.add_transition( background_one, CG_island, 0.1 )
untiedmodel.add_transition( CG_island, CG_island, 0.8 )
untiedmodel.add_transition( CG_island, background_two, 0.2 )
untiedmodel.add_transition( background_two, background_two, 0.8 )
untiedmodel.add_transition( background_two, poly_T, 0.2 )
untiedmodel.add_transition( poly_T, poly_T, 0.7 )

And finally the ending transitions.


In [6]:
untiedmodel.add_transition( poly_T, untiedmodel.end, 0.3)

Finishing with the method "bake" to finalize the structure of our model.


In [7]:
untiedmodel.bake( verbose=True )

Now let's define the following sequences. Keep in mind training must by done on a list of lists, not on a string in order to allow strings of any length.


In [8]:
sequences = [ numpy.array(list("TAGCACATCGCAGCGCATCACGCGCGCTAGCATATAAGCACGATCAGCACGACTGTTTTT")),
	      numpy.array(list("TAGAATCGCTACATAGACGCGCGCTCGCCGCGCTCGATAAGCTACGAACACGATTTTTTA")),
	      numpy.array(list("GATAGCTACGACTACGCGACTCACGCGCGCGCTCCGCATCAGACACGAATATAGATAAGATATTTTTT")) ]

Lets check our distributions before training our model.


In [9]:
result = "\n".join( "{}: {}".format( state.name, state.distribution )
	for state in untiedmodel.states if not state.is_silent() )

print(result)


B1: {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "A" :0.25,
            "C" :0.25,
            "G" :0.25,
            "T" :0.25
        }
    ],
    "frozen" :false
}
B2: {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "A" :0.25,
            "C" :0.25,
            "G" :0.25,
            "T" :0.25
        }
    ],
    "frozen" :false
}
CG: {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "A" :0.1,
            "C" :0.4,
            "G" :0.4,
            "T" :0.1
        }
    ],
    "frozen" :false
}
PT: {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "A" :0.1,
            "C" :0.1,
            "G" :0.1,
            "T" :0.7
        }
    ],
    "frozen" :false
}

Now lets train our model.


In [10]:
untiedmodel.fit( sequences, stop_threshold=0.01 )


Out[10]:
{
    "class" : "HiddenMarkovModel",
    "name" : "No Tied States",
    "start" : {
        "class" : "State",
        "distribution" : null,
        "name" : "No Tied States-start",
        "weight" : 1.0
    },
    "end" : {
        "class" : "State",
        "distribution" : null,
        "name" : "No Tied States-end",
        "weight" : 1.0
    },
    "states" : [
        {
            "class" : "State",
            "distribution" : {
                "class" : "Distribution",
                "dtype" : "str",
                "name" : "DiscreteDistribution",
                "parameters" : [
                    {
                        "A" : 0.31639790507943677,
                        "C" : 0.2920431428628597,
                        "G" : 0.21191370125516387,
                        "T" : 0.1796452508025395
                    }
                ],
                "frozen" : false
            },
            "name" : "B1",
            "weight" : 1.0
        },
        {
            "class" : "State",
            "distribution" : {
                "class" : "Distribution",
                "dtype" : "str",
                "name" : "DiscreteDistribution",
                "parameters" : [
                    {
                        "A" : 0.4178729606164158,
                        "C" : 0.22068400964255355,
                        "G" : 0.19558677075666792,
                        "T" : 0.16585625898436263
                    }
                ],
                "frozen" : false
            },
            "name" : "B2",
            "weight" : 1.0
        },
        {
            "class" : "State",
            "distribution" : {
                "class" : "Distribution",
                "dtype" : "str",
                "name" : "DiscreteDistribution",
                "parameters" : [
                    {
                        "A" : 0.061102410527201965,
                        "C" : 0.5058121809386029,
                        "G" : 0.3384433263771247,
                        "T" : 0.09464208215707043
                    }
                ],
                "frozen" : false
            },
            "name" : "CG",
            "weight" : 1.0
        },
        {
            "class" : "State",
            "distribution" : {
                "class" : "Distribution",
                "dtype" : "str",
                "name" : "DiscreteDistribution",
                "parameters" : [
                    {
                        "A" : 0.09055188417721106,
                        "C" : 4.501540436607585e-09,
                        "G" : 0.01138728389704469,
                        "T" : 0.8980608274242039
                    }
                ],
                "frozen" : false
            },
            "name" : "PT",
            "weight" : 1.0
        },
        {
            "class" : "State",
            "distribution" : null,
            "name" : "No Tied States-start",
            "weight" : 1.0
        },
        {
            "class" : "State",
            "distribution" : null,
            "name" : "No Tied States-end",
            "weight" : 1.0
        }
    ],
    "end_index" : 5,
    "start_index" : 4,
    "silent_index" : 4,
    "edges" : [
        [
            4,
            0,
            1.0,
            1.0,
            null
        ],
        [
            0,
            0,
            0.9489674911696251,
            0.9,
            null
        ],
        [
            0,
            2,
            0.05103250883037496,
            0.1,
            null
        ],
        [
            2,
            2,
            0.9256192712029668,
            0.8,
            null
        ],
        [
            2,
            1,
            0.07438072879703318,
            0.2,
            null
        ],
        [
            1,
            1,
            0.9570959686922387,
            0.8,
            null
        ],
        [
            1,
            3,
            0.04290403130776134,
            0.2,
            null
        ],
        [
            3,
            3,
            0.8417505801851788,
            0.7,
            null
        ],
        [
            3,
            5,
            0.15824941981482132,
            0.3,
            null
        ]
    ],
    "distribution ties" : []
}

And check our new distributions after training.


In [11]:
result = "\n".join( "{}: {}".format( state.name, state.distribution )
	for state in untiedmodel.states if not state.is_silent() )
    
print(result)


B1: {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "A" :0.31639790507943677,
            "C" :0.2920431428628597,
            "G" :0.21191370125516387,
            "T" :0.1796452508025395
        }
    ],
    "frozen" :false
}
B2: {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "A" :0.4178729606164158,
            "C" :0.22068400964255355,
            "G" :0.19558677075666792,
            "T" :0.16585625898436263
        }
    ],
    "frozen" :false
}
CG: {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "A" :0.061102410527201965,
            "C" :0.5058121809386029,
            "G" :0.3384433263771247,
            "T" :0.09464208215707043
        }
    ],
    "frozen" :false
}
PT: {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "A" :0.09055188417721106,
            "C" :4.501540436607585e-09,
            "G" :0.01138728389704469,
            "T" :0.8980608274242039
        }
    ],
    "frozen" :false
}

Now we can try our example with tied states.


In [12]:
tiedmodel = HiddenMarkovModel( "Tied States" )

Lets redefine the four states.


In [13]:
background = DiscreteDistribution({'A': 0.25, 'C':0.25, 'G': 0.25, 'T':0.25 })

background_one = State( background, name="B1" )
CG_island = State( DiscreteDistribution({'A': 0.1, 
	'C':0.4, 'G': 0.4, 'T':0.1 }), name="CG" )
background_two = State( background, name="B2" )
poly_T = State( DiscreteDistribution({'A': 0.1, 
	'C':0.1, 'G': 0.1, 'T':0.7 }), name="PT" )

Then add the starting transitions.


In [14]:
tiedmodel.add_transition( tiedmodel.start, background_one, 1. );

Then the tranisiton matrix.


In [15]:
tiedmodel.add_transition( background_one, background_one, 0.9 )
tiedmodel.add_transition( background_one, CG_island, 0.1 )
tiedmodel.add_transition( CG_island, CG_island, 0.8 )
tiedmodel.add_transition( CG_island, background_two, 0.2 )
tiedmodel.add_transition( background_two, background_two, 0.8 )
tiedmodel.add_transition( background_two, poly_T, 0.2 )
tiedmodel.add_transition( poly_T, poly_T, 0.7 )

Finally adding the ending transitions.


In [16]:
tiedmodel.add_transition( poly_T, tiedmodel.end, 0.3 )

We "bake" the model to finalize its structure.


In [17]:
tiedmodel.bake( verbose=True )

Now let's use the following sequences to train our model.


In [18]:
sequences = [ numpy.array(list("TAGCACATCGCAGCGCATCACGCGCGCTAGCATATAAGCACGATCAGCACGACTGTTTTT")),
			  numpy.array(list("TAGAATCGCTACATAGACGCGCGCTCGCCGCGCTCGATAAGCTACGAACACGATTTTTTA")),
			  numpy.array(list("GATAGCTACGACTACGCGACTCACGCGCGCGCTCCGCATCAGACACGAATATAGATAAGATATTTTTT")) ]

But before that let's check the distributions in our model.


In [19]:
result = "\n".join( "{}: {}".format( state.name, state.distribution ) 
	for state in tiedmodel.states if not state.is_silent() )

print(result)


B1: {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "A" :0.25,
            "C" :0.25,
            "G" :0.25,
            "T" :0.25
        }
    ],
    "frozen" :false
}
B2: {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "A" :0.25,
            "C" :0.25,
            "G" :0.25,
            "T" :0.25
        }
    ],
    "frozen" :false
}
CG: {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "A" :0.1,
            "C" :0.4,
            "G" :0.4,
            "T" :0.1
        }
    ],
    "frozen" :false
}
PT: {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "A" :0.1,
            "C" :0.1,
            "G" :0.1,
            "T" :0.7
        }
    ],
    "frozen" :false
}

Now let's train our model.


In [20]:
tiedmodel.fit( sequences, stop_threshold=0.01 )


Out[20]:
{
    "class" : "HiddenMarkovModel",
    "name" : "Tied States",
    "start" : {
        "class" : "State",
        "distribution" : null,
        "name" : "Tied States-start",
        "weight" : 1.0
    },
    "end" : {
        "class" : "State",
        "distribution" : null,
        "name" : "Tied States-end",
        "weight" : 1.0
    },
    "states" : [
        {
            "class" : "State",
            "distribution" : {
                "class" : "Distribution",
                "dtype" : "str",
                "name" : "DiscreteDistribution",
                "parameters" : [
                    {
                        "A" : 0.3825505694630474,
                        "C" : 0.23803901803476255,
                        "G" : 0.20043821449422436,
                        "T" : 0.1789721980079656
                    }
                ],
                "frozen" : false
            },
            "name" : "B1",
            "weight" : 1.0
        },
        {
            "class" : "State",
            "distribution" : {
                "class" : "Distribution",
                "dtype" : "str",
                "name" : "DiscreteDistribution",
                "parameters" : [
                    {
                        "A" : 0.3825505694630474,
                        "C" : 0.23803901803476255,
                        "G" : 0.20043821449422436,
                        "T" : 0.1789721980079656
                    }
                ],
                "frozen" : false
            },
            "name" : "B2",
            "weight" : 1.0
        },
        {
            "class" : "State",
            "distribution" : {
                "class" : "Distribution",
                "dtype" : "str",
                "name" : "DiscreteDistribution",
                "parameters" : [
                    {
                        "A" : 0.10879783006206004,
                        "C" : 0.4779841622676619,
                        "G" : 0.3129317572182933,
                        "T" : 0.10028625045198475
                    }
                ],
                "frozen" : false
            },
            "name" : "CG",
            "weight" : 1.0
        },
        {
            "class" : "State",
            "distribution" : {
                "class" : "Distribution",
                "dtype" : "str",
                "name" : "DiscreteDistribution",
                "parameters" : [
                    {
                        "A" : 0.09294944999091662,
                        "C" : 1.4280989721539988e-17,
                        "G" : 0.0060865515333644584,
                        "T" : 0.9009639984757188
                    }
                ],
                "frozen" : false
            },
            "name" : "PT",
            "weight" : 1.0
        },
        {
            "class" : "State",
            "distribution" : null,
            "name" : "Tied States-start",
            "weight" : 1.0
        },
        {
            "class" : "State",
            "distribution" : null,
            "name" : "Tied States-end",
            "weight" : 1.0
        }
    ],
    "end_index" : 5,
    "start_index" : 4,
    "silent_index" : 4,
    "edges" : [
        [
            4,
            0,
            1.0,
            1.0,
            null
        ],
        [
            0,
            0,
            0.9331385232071054,
            0.9,
            null
        ],
        [
            0,
            2,
            0.06686147679289454,
            0.1,
            null
        ],
        [
            2,
            2,
            0.9433476039763057,
            0.8,
            null
        ],
        [
            2,
            1,
            0.05665239602369436,
            0.2,
            null
        ],
        [
            1,
            1,
            0.9580129624347591,
            0.8,
            null
        ],
        [
            1,
            3,
            0.04198703756524091,
            0.2,
            null
        ],
        [
            3,
            3,
            0.8397947514093073,
            0.7,
            null
        ],
        [
            3,
            5,
            0.16020524859069266,
            0.3,
            null
        ]
    ],
    "distribution ties" : [
        [
            0,
            1
        ],
        [
            1,
            0
        ]
    ]
}

Now let's check our new distributions.


In [21]:
result = "\n".join( "{}: {}".format( state.name, state.distribution ) 
	for state in tiedmodel.states if not state.is_silent() )

print(result)


B1: {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "A" :0.3825505694630474,
            "C" :0.23803901803476255,
            "G" :0.20043821449422436,
            "T" :0.1789721980079656
        }
    ],
    "frozen" :false
}
B2: {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "A" :0.3825505694630474,
            "C" :0.23803901803476255,
            "G" :0.20043821449422436,
            "T" :0.1789721980079656
        }
    ],
    "frozen" :false
}
CG: {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "A" :0.10879783006206004,
            "C" :0.4779841622676619,
            "G" :0.3129317572182933,
            "T" :0.10028625045198475
        }
    ],
    "frozen" :false
}
PT: {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "A" :0.09294944999091662,
            "C" :1.4280989721539988e-17,
            "G" :0.0060865515333644584,
            "T" :0.9009639984757188
        }
    ],
    "frozen" :false
}

Notice that states B1 and B2 are the same after training with tied states, not so without tied states.