How to gapfill a genome scale metabolic model

Getting started

Installing libraries

Before you start, you will need to install a couple of libraries:

The ModelSeedDatabase has all the biochemistry we'll need. You can install that with git clone.

The PyFBA library has detailed installation instructions. Don't be scared, its mostly just pip install.

(Optional) Also, get the SEED Servers as you can get a lot of information from them. You can install the git python repo from github. Make sure that the SEED_Servers_Python is in your PYTHONPATH.

We start with importing some modules that we are going to use.

We import sys so that we can use standard out and standard error if we have some error messages. We import copy so that we can make a deep copy of data structures for later comparisons.

Then we import the PyFBA module to get started.



In [23]:

    
import sys
import copy

import PyFBA

Running an SBML model

If you have run your genome through RAST, you can download the SBML model and use that directly.

We have provided an SBML model of Citrobacter sedlakii that you can download and use. You can right/ctrl click on this link and save the SBML file in the same location you are running this iPython notebook.

We use this SBML model to demonstrate the key points of the FBA approach: defining the reactions, including the boundary, or drainflux, reactions; the compounds, including the drain compounds; the media; and the reaction bounds.

We'll take it step by step!

We start by parsing the model:



In [2]:

    
sbml = PyFBA.parse.parse_sbml_file('Citrobacter_sedlakii.sbml')

Find all the reactions and identify those that are boundary reactions

We need a set of reactions to run in the model. In this case, we are going to run all the reactions in our SBML file. However, you can change this set if you want to knock out reactions, add reactions, or generally modify the model. We store those in the reactions_to_run set.

The boundary reactions are compounds that are secreted but then need to be removed from the model. We usually include a consumption of those compounds that is open ended, as if they are draining away. We store those reactions in the uptake_secretion_reactions dictionary.



In [3]:

    
# get a dict of reactions. The key is the reaction ID, and the value is a metabolism.reaction.Reaction object
reactions = sbml.reactions
reactions_to_run = set()
uptake_secretion_reactions = {}
biomass_equation = None
for r in reactions:
    if 'biomass_equation' in reactions[r].name.lower():
        biomass_equation = reactions[r]
        continue
    is_boundary = False
    for c in reactions[r].all_compounds():
        if c.uptake_secretion:
            is_boundary = True
    if is_boundary:
        reactions[r].is_uptake_secretion = True
        uptake_secretion_reactions[r] = reactions[r]
    else:
        reactions_to_run.add(r)

At this point, we can take a look at how many reactions are in the model:



In [4]:

    
print("There are {} reactions in the model".format(len(reactions)))
print("There are {} uptake/secretion reactions in the model".format(len(uptake_secretion_reactions)))
print("There are {} reactions to be run in the model".format(len(reactions_to_run)))









    



There are 1574 reactions in the model
There are 174 uptake/secretion reactions in the model
There are 1399 reactions to be run in the model

Find all the compounds in the model, and filter out those that are secreted

We need to filter out uptake and secretion compounds from our list of all compounds before we can make a stoichiometric matrix



In [5]:

    
# Get a dict of compounds. 
# The key is the string representation of the compound and the value is a metabolite.compound.Compound object
all_compounds = sbml.compounds
# filter for compounds that are boundary compounds
filtered_compounds = {}
for c in all_compounds:
    if not all_compounds[c].uptake_secretion:
        filtered_compounds[c] = all_compounds[c]

Again, we can see how many compounds there are in the model



In [6]:

    
print("There are {} total compounds in the model".format(len(all_compounds)))
print("There are {} compounds that are not involved in uptake and secretion".format(len(filtered_compounds)))









    



There are 1475 total compounds in the model
There are 1301 compounds that are not involved in uptake and secretion

And now we have the size of our stoichiometric matrix! Notice that the stoichiometric matrix is composed of the reactions that we are going to run and the compounds that are in those reactions (but not the uptake/secretion reactions and compounds).



In [7]:

    
print("The stoichiometric matrix will be {} reactions by {} compounds".format(len(reactions_to_run), len(filtered_compounds)))









    



The stoichiometric matrix will be 1399 reactions by 1301 compounds

Read the media file, and correct the media names

In our media directory, we have a lot of different media formulations, most of which we use with the Genotype-Phenotype project. For this example, we are going to use Lysogeny Broth (LB). There are many different formulations of LB, but we have included the recipe created by the folks at Argonne so that it is comparable with their analysis. You can download ArgonneLB.txt and put it in the same directory as this iPython notebook to run it.

Once we have read the file we need to correct the names in the compounds. Sometimes when compound names are exported to the SBML file they are modified slightly. This just corrects those names.



In [8]:

    
# read the media file
media = PyFBA.parse.read_media_file('ArgonneLB.txt')
# correct the names
media = PyFBA.parse.correct_media_names(media, all_compounds)

Set the reaction bounds for uptake/secretion compounds

The uptake and secretion compounds typically have reaction bounds that allow them to be consumed (i.e. diffuse away from the cell) but not produced. However, our media components can also increase in concentration (i.e. diffuse to the cell) and thus the bounds are set higher. Whenever you change the growth media, you also need to adjust the reaction bounds to ensure that the media can be consumed!



In [9]:

    
# adjust the lower bounds of uptake secretion reactions for things that are not in the media
for u in uptake_secretion_reactions:
    is_media_component = False
    for c in uptake_secretion_reactions[u].all_compounds():
        if c in media:
            is_media_component = True
    if not is_media_component:
        reactions[u].lower_bound = 0.0
        uptake_secretion_reactions[u].lower_bound = 0.0

Run the FBA

Now that we have constructed our model, we can run the FBA!



In [10]:

    
status, value, growth = PyFBA.fba.run_fba(filtered_compounds, reactions, reactions_to_run, media, biomass_equation,
                                          uptake_secretion_reactions)
print("The FBA completed with value; {} and growth: {}".format(value, growth))









    



The FBA completed with value; 281.841757437 and growth: True

Running a basic model

The SBML model is great if you have built a model elsewhere, what about if you want to build a model from a genome.

We typically start with an assigned_functions file from RAST. The easiest way to find that is in the RAST directory by choosing Genome Directory from the Downloads menu on the job details page.

For this example, here is an assigned_functions file from our Citrobacter model that you can download to the same directory as this iPython notebook. Notice that it has two columns, the first column is the protein ID (using SEED standard IDs that start with fig|, and then have the taxonomy ID and version number of the genome, and then peg to indicate protein encoding gene, rna to indicate RNA, crispr_spacer to indicate crispr spacers or other acronym, followed by the feature number. After the tab is the functional role of that feature. Download that file to use in this test.

We start by converting this assigned_functions file to a list of reactions.



In [11]:

    
# assigned functions is a dict of peg id and functional role
assigned_functions = PyFBA.parse.read_assigned_functions('citrobacter.assigned_functions')
# get a list of unique functional roles
roles = set(assigned_functions.values())
print("There are {} unique roles in this genome".format(len(roles)))









    



There are 3606 unique roles in this genome

Convert those roles to reactions. We start with a dict of roles and reactions, but we only need a list of unique reactions, so we convert the keys to a set.



In [13]:

    
roles_to_reactions = PyFBA.filters.roles_to_reactions(roles)
reactions_to_run = set()
for role in roles_to_reactions:
    reactions_to_run.update(roles_to_reactions[role])
print("There are {} unique reactions associated with this genome".format(len(reactions_to_run)))









    



There are 1305 unique reactions associated with this genome

Read all the reactions and compounds in our database

We read all the reactions, compounds, and enzymes in the ModelSEEDDatabase into three data structures. Each one is a dictionary with a string representation of the object as the key and the object as the value.

We modify the reactions specifically for Gram negative models (there are also options for Gram positive models, Mycobacterial models, general microbial models, and plant models).



In [14]:

    
compounds, reactions, enzymes = PyFBA.parse.model_seed.compounds_reactions_enzymes('gramnegative')

Update reactions to run, making sure that all reactions are in the list!

There are a very reactions that come from functional roles that do not appear in the reactions list. We're working on tracking these down, but for now we just check that all reaction IDs in reactions_to_run are in reactions, too.



In [16]:

    
tempset = set()
for r in reactions_to_run:
    if r in reactions:
        tempset.add(r)
    else:
        sys.stderr.write("Reaction ID {} is not in our reactions list. Skipped\n".format(r))
reactions_to_run = tempset









    



Reaction ID rxn37218 is not in our reactions list. Skipped

Test whether these reactions grow on ArgonneLB

We can test whether this set of reactions grows on ArgonneLB media. The media is the same one we used above, and you can download the ArgonneLB.txt and text file and put it in the same directory as this iPython notebook to run it.

(Note: unlike above, we don't need to convert the media components, because the media and compounds come from the same source.)



In [17]:

    
media = PyFBA.parse.read_media_file('ArgonneLB.txt')
print("Our media has {} components.".format(len(media)))









    



Our media has 65 components.

Define a biomass equation

The biomass equation is the part that says whether the model will grow! This is a metabolism.reaction.Reaction object.



In [18]:

    
biomass_equation = PyFBA.metabolism.biomass_equation('gramnegative')

Run the FBA

With the reactions, compounds, reactions_to_run, media, and biomass model, we can test whether the model grows on this media



In [19]:

    
status, value, growth = PyFBA.fba.run_fba(compounds, reactions, reactions_to_run, media, biomass_equation, verbose=True)
print("Initial run has " + str(value) + " --> Growth: " + str(growth))









    



Initial run has 0.0 --> Growth: False






    



In parsing the bounds we found 65 media uptake and secretion reactions and 110 other u/s reactions
Length of the media: 65
Number of reactions to run: 1304
Number of compounds in SM: 1276
Number of reactions in SM: 1480
Revised number of total reactions: 34871
Number of total compounds: 45616
SMat dimensions: 1276 x 1480

Gap fill the model

Since the model does not grow on ArgonneLB we need to gap fill it to ensure growth. There are several ways that we can gap fill, and we will work through them until we get growth.

As you will see, we update the reactions_to_run list each time, and keep the media and everything else consistent. Then we just need to run the FBA like we have done above and see if we get growth.

We also keep a copy of the original reactions_to_run, and a list with all the reactions that we are adding, so once we are done we can go back and bisect the reactions that are added.



In [24]:

    
added_reactions=[]
original_reactions_to_run = copy.copy(reactions_to_run)

Essential reactions

There are ~100 reactions that are in every model we have tested, and we construe these to be essential for all models, so we typically add these first!



In [25]:

    
essential_reactions = PyFBA.gapfill.suggest_essential_reactions()
added_reactions.append(essential_reactions)
reactions_to_run.update(essential_reactions)



In [28]:

    
status, value, growth = PyFBA.fba.run_fba(compounds, reactions, reactions_to_run, media, biomass_equation)
print("FBA has " + str(value) + " --> Growth: " + str(growth))









    



FBA has 0.0 --> Growth: False

Media import reactions

We need to make sure that the cell can import everything that is in the media ... otherwise it won't be able to grow. Obviously only do this step if you are sure that the cell can grow on the media you are testing.



In [32]:

    
media_reactions = PyFBA.gapfill.suggest_from_media(compounds, reactions, reactions_to_run, media)
added_reactions.append(media_reactions)
reactions_to_run.update(media_reactions)



In [33]:

    
status, value, growth = PyFBA.fba.run_fba(compounds, reactions, reactions_to_run, media, biomass_equation)
print("FBA has " + str(value) + " --> Growth: " + str(growth))









    



FBA has 1.11493060109e-12 --> Growth: False

Subsystems

The reactions connect us to subsystems, and this test ensures that all the subsystem are complete. We add reactions required to complete the subsystem.



In [34]:

    
subsystem_reactions = PyFBA.gapfill.suggest_reactions_from_subsystems(reactions, reactions_to_run, threshold=0.5)
added_reactions.append(subsystem_reactions)
reactions_to_run.update(subsystem_reactions)



In [35]:

    
status, value, growth = PyFBA.fba.run_fba(compounds, reactions, reactions_to_run, media, biomass_equation)
print("FBA has " + str(value) + " --> Growth: " + str(growth))









    



FBA has 4.51193262361e-28 --> Growth: False

Orphan compounds

Orphan compounds are those compounds which are only in one reaction. They are either produced, or trying to be consumed. We need to add reaction(s) that complete the network of those compounds.

You can change the maximum number of reactions that a compound is in to be considered an orphan (try increasing it to 2 or 3).



In [37]:

    
orphan_reactions = PyFBA.gapfill.suggest_by_compound(compounds, reactions, reactions_to_run, max_reactions=1)
added_reactions.append(orphan_reactions)
reactions_to_run.update(orphan_reactions)



In [38]:

    
status, value, growth = PyFBA.fba.run_fba(compounds, reactions, reactions_to_run, media, biomass_equation)
print("FBA has " + str(value) + " --> Growth: " + str(growth))









    



---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-38-eeee196c27f7> in <module>()
----> 1 status, value, growth = PyFBA.fba.run_fba(compounds, reactions, reactions_to_run, media, biomass_equation)
      2 print("FBA has " + str(value) + " --> Growth: " + str(growth))

/data/PyFBA/PyFBA/fba/run_fba.pyc in run_fba(compounds, reactions, reactions_to_run, media, biomass_equation, uptake_secretion, verbose)
     31 
     32     cp, rc, reactions = PyFBA.fba.create_stoichiometric_matrix(reactions_to_run, reactions, compounds, media, biomass_equation,
---> 33                                                      uptake_secretion, verbose=False)
     34 
     35     rbvals = PyFBA.fba.reaction_bounds(reactions, rc, media, verbose=verbose)

/data/PyFBA/PyFBA/fba/create_stoichiometric_matrix.pyc in create_stoichiometric_matrix(reactions_to_run, reactions, compounds, media, biomass_equation, uptake_secretion, verbose)
    119 
    120     # load the data into the model
--> 121     PyFBA.lp.load(data, cp, rc)
    122 
    123     # now set the objective function.It is the biomass_equation

/data/PyFBA/PyFBA/lp/glpk_solver.pyc in load(matrix, rowheaders, colheaders, verbose)
     63     if verbose > 4:
     64         sys.stderr.write("Matrix: " + str(temp) + "\n")
---> 65     solver.matrix = temp
     66 
     67     # name the rows and columns

MemoryError:



In [41]:

    
reactions_to_run = original_reactions_to_run
reactions_to_run.update(essential_reactions)
reactions_to_run.update(media_reactions)
reactions_to_run.update(subsystem_reactions)
print("There are {} reactions to run".format(len(reactions_to_run)))
print("Orphans is {}".format(len(orphan_reactions)))









    



There are 1780 reactions to run
Orphans is 3290

We also gap fill on closely related organisms. We assume that an organism is most likely to have reactions in its genome that are similar to those in closely related organisms.



In [42]:

    
with open('our_reactions.txt', 'w') as out:
    out.write("\n".join(reactions_to_run))



In [ ]:



In [ ]:

    
reactions_from_other_orgs = PyFBA.gapfill.suggest_from_roles(args.c, reactions, True)

How to gapfill a genome scale metabolic model

Getting started

Installing libraries

Running an SBML model

Find all the reactions and identify those that are boundary reactions

Find all the compounds in the model, and filter out those that are secreted

Read the media file, and correct the media names

Set the reaction bounds for uptake/secretion compounds

Run the FBA

Running a basic model

Read all the reactions and compounds in our database

Update reactions to run, making sure that all reactions are in the list!

Test whether these reactions grow on ArgonneLB

Define a biomass equation

Run the FBA

Gap fill the model

Essential reactions

Media import reactions

Subsystems

Orphan compounds

Reactions from closely related organisms