Medusa has two families of methods for generating ensembles: expansion and degradation. Expansion approaches currently consist of gapfilling algorithms. Degradation approaches include random degradation (useful for benchmarking new gapfilling methods) and omics integration algorithms that constrain the network (e.g. transcriptomics integration; not currently implemented).
The most common network expansion approach for metabolic networks is algorithmic gapfilling, where the goal is to identify reactions to add to a network that allow a feasible solution. An example of this is adding a minimal number of reactions to enable biomass production in a model for an organism in a specific condition (e.g. SMILEY [1]). See the gapfilling documentation in cobrapy for the formulation of this problem.
Adding the minimum number of reactions to satisfy a biological function is just one formulation of the gapfilling problem. An alternative is to reformulate the problem to add the minimum amount of flux through candidate reactions. This has the advantage of being an entirely continuous problem, rather than the integer problem posed by SMILEY, so the time to find a solution is usually 1-2 orders of magnitude shorter.
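Schematically, and using notation introduced here rather than taken from the cobrapy documentation, the continuous variant can be written as a flux-minimization problem over the candidate reactions:

$$
\begin{aligned}
\min_{v} \quad & \sum_{r \in \mathcal{U}} |v_r| \\
\text{s.t.} \quad & S v = 0 \\
& v_{\text{objective}} \geq v_{\text{min}} \\
& lb_r \leq v_r \leq ub_r \quad \forall r
\end{aligned}
$$

where U is the set of candidate reactions from the universal database, S is the stoichiometric matrix of the draft network plus the candidates, and v_min is the required flux through the original objective (e.g. biomass production). Candidate reactions carrying flux in the optimal solution form the gapfill solution; the absolute values can be linearized by splitting reversible candidates into irreversible pairs, keeping the problem a linear program.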
In medusa, implementations of both gapfilling strategies are available, but we recommend the continuous approach, which we demonstrate below.
The key inputs for gapfilling are a cobra.Model
object representing the GENRE you are filling gaps in and a second cobra.Model
object containing reactions that form a universal reaction database (sometimes called a universal reaction bag). Additionally, context-specific information, such as the environmental conditions in which a phenotype was observed, may be needed to constrain the model during gapfilling.
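As a rough sketch of what these inputs look like outside of the tutorial setting (the file names and exchange reaction IDs below are placeholders, not files or identifiers shipped with medusa):

import cobra

# the draft reconstruction (GENRE) whose gaps we want to fill
draft_model = cobra.io.read_sbml_model("my_draft_model.xml")

# the universal reaction database ("reaction bag") providing candidate reactions
universal = cobra.io.read_sbml_model("universal_reactions.xml")

# context-specific information, e.g. the growth medium in which a phenotype was
# observed, expressed as a dict of exchange reaction IDs -> maximum uptake rates;
# the exchange reactions must already exist in the model for this to be valid
draft_model.medium = {"EX_carbon_source_e": 10.0,   # placeholder exchange IDs
                      "EX_o2_e": 20.0}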
Let's use test data available in medusa
for gapfilling. The approach we'll take to generate multiple solutions involves iteratively gapfilling across multiple media conditions to generate a single gapfilled model. We repeat the process with the original model but with a shuffled version of the media conditions (changing the order in which media conditions are used during gapfilling), each time generating a new solution. You can see examples of this approach in Biggs & Papin [2] and Medlock & Papin [3].
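Conceptually, each ensemble member comes from one cycle of a loop like the one sketched below. This is only an illustration of the strategy, not medusa's implementation; gapfill_in_medium stands in for a hypothetical single-condition gapfilling routine that returns the reactions it added.

import random

def build_one_member(base_model, universal, media_dicts, gapfill_in_medium):
    # one cycle: gapfill sequentially across all media conditions in a shuffled order,
    # carrying earlier additions forward so that the ordering can change the outcome
    model = base_model.copy()
    added = []
    conditions = list(media_dicts)
    random.shuffle(conditions)
    for condition in conditions:
        new_reactions = gapfill_in_medium(model, universal, media_dicts[condition])
        model.add_reactions(new_reactions)
        added.extend(new_reactions)
    return added

# repeating this for many cycles yields the alternative gapfilled networks
# that become the members of the ensemble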
In [1]:
# Load the test model for Staphylococcus aureus, originally generated with ModelSEED
import medusa
from medusa.test import create_test_model
model = create_test_model('Saureus_seed')
# Load the biolog data from Plata et al., Nature 2015
from medusa.test import load_biolog_plata
biolog_base_composition, biolog_base_dict, biolog_thresholded = load_biolog_plata()
biolog_base_composition
Out[1]:
Here, biolog_base_composition
describes the media components that are present in every biolog condition (Note: if you are using these data for your own purposes, keep in mind that we added Heme and H2S2O3 due to common issues encountered in models. These are not actually in the biolog medium).
The biolog_base_dict
is a dictionary version of this, which we'll use as direct input to the models when setting model.medium.
In [2]:
biolog_base_dict
Out[2]:
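Note that model.medium in cobrapy is keyed by exchange reaction IDs rather than metabolite IDs, so before a dictionary like this can be applied to a model its keys need an 'EX_' prefix. The line below is just an illustration; the gapfilling setup later in this notebook builds its media dictionaries the same way.

# only valid once all of these exchange reactions actually exist in the model
biolog_base_medium = {'EX_' + met_id: uptake for met_id, uptake in biolog_base_dict.items()}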
The actual growth/no growth data is in biolog_thresholded
, which is a pandas DataFrame
with organism species/genus as rows, and biolog media conditions as columns represented by the ModelSEED metabolite ID for the single carbon/nitrogen source present. The original source of these data is [4]; there, you can find the non-thresholded values if curious. Here, we've thresholded the growth data using the same threshold reported in the paper (>=10 relative units of tetrazolium dye).
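If you start from the raw measurements instead, the thresholding is a one-liner; biolog_raw here is a hypothetical DataFrame of non-thresholded dye values with the same layout as biolog_thresholded.

# growth call used in the paper: >= 10 relative units of tetrazolium dye
biolog_growth_calls = biolog_raw >= 10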
In [3]:
# Just inspect the first 5 species
biolog_thresholded.head(5)
Out[3]:
Now we'll extract the positive growth conditions for the species we're interested in (Staphylococcus aureus).
In [4]:
test_mod_pheno = biolog_thresholded.loc['Staphylococcus aureus']
test_mod_pheno = list(test_mod_pheno[test_mod_pheno].index)
test_mod_pheno
Out[4]:
In order to gapfill this model, we have to make sure that the biolog media components are in the model, and that there are exchange reactions for each of these metabolites. To make this process more convenient, we'll load the universal reaction database now, which we will also use later in the process. The universal model is large, and the load_universal_modelseed
function does some extra processing of the model, so loading it may take a few minutes. First we'll check for changes that need to be made:
In [5]:
# load the universal reaction database
from medusa.test import load_universal_modelseed
from cobra.core import Reaction
universal = load_universal_modelseed()

# check for biolog base components in the model and record
# the metabolites/exchanges that need to be added
add_mets = []
add_exchanges = []
for met in list(biolog_base_dict.keys()):
    try:
        model.metabolites.get_by_id(met)
    except KeyError:
        print('no ' + met)
        add_met = universal.metabolites.get_by_id(met).copy()
        add_mets.append(add_met)

model.add_metabolites(add_mets)

for met in list(biolog_base_dict.keys()):
    # Search for exchange reactions and add any that are missing
    try:
        model.reactions.get_by_id('EX_' + met)
    except KeyError:
        add_met = universal.metabolites.get_by_id(met)
        ex_rxn = Reaction('EX_' + met)
        ex_rxn.name = "Exchange reaction for " + met
        ex_rxn.lower_bound = -1000
        ex_rxn.upper_bound = 1000
        ex_rxn.add_metabolites({add_met: -1})
        add_exchanges.append(ex_rxn)

model.add_reactions(add_exchanges)
Next, we need to do the same for the single carbon/nitrogen sources in the biolog data. When performing this workflow on your own GENRE, you may want to check that all of the media components that enable growth have suitable transporters in the universal model (or already in the draft model).
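One rough way to run that transporter check is to look for any reaction, in either the draft model or the universal database, that connects the extracellular metabolite to an intracellular counterpart. The sketch below assumes ModelSEED-style IDs in which swapping an '_e' suffix for '_c' gives the cytosolic form; adjust the suffixes to your models' compartment conventions.

def has_transport_candidate(met_id, *models):
    # True if any reaction in any of the models links the extracellular metabolite
    # to its (assumed) cytosolic counterpart
    cytosolic_id = met_id.replace('_e', '_c')
    for m in models:
        try:
            extracellular = m.metabolites.get_by_id(met_id)
        except KeyError:
            continue
        for rxn in extracellular.reactions:
            if any(met.id == cytosolic_id for met in rxn.metabolites):
                return True
    return False

# flag growth-supporting carbon/nitrogen sources with no obvious transport reaction anywhere
print([m for m in test_mod_pheno if not has_transport_candidate(m, model, universal)])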
In [6]:
# Find metabolites from the biolog data that are missing in the test model
# and add them from the universal
missing_mets = []
missing_exchanges = []
media_dicts = {}
for met_id in test_mod_pheno:
    try:
        model.metabolites.get_by_id(met_id)
    except KeyError:
        print(met_id + " was not in model, adding met and exchange reaction")
        met = universal.metabolites.get_by_id(met_id).copy()
        missing_mets.append(met)
        ex_rxn = Reaction('EX_' + met_id)
        ex_rxn.name = "Exchange reaction for " + met_id
        ex_rxn.lower_bound = -1000
        ex_rxn.upper_bound = 1000
        ex_rxn.add_metabolites({met: -1})
        missing_exchanges.append(ex_rxn)
    # build the medium for this condition: the biolog base composition plus the
    # single carbon/nitrogen source, keyed by exchange reaction ID
    media_dicts[met_id] = {'EX_' + k: v for k, v in biolog_base_dict.items()}
    media_dicts[met_id]['EX_' + met_id] = 1000

model.add_metabolites(missing_mets)
model.add_reactions(missing_exchanges)
Now, let's fill some gaps using the iterative_gapfill_from_binary_phenotypes function. For simplicity, we'll just take the first 5 conditions and perform gapfilling for 10 cycles, which should yield an ensemble with 10 members. We set lower_bound = 0.05, which requires the model to produce at least 0.05 units of flux through the previous objective function (here, biomass production), which is now imposed as a constraint (i.e. v_biomass >= 0.05). inclusion_threshold is the minimum flux a candidate reaction must carry to be included in the gapfill solution; a threshold is needed because of the limits of numerical precision. Generally a small number (e.g. < 1E-8) is a good choice. However, some gapfill solutions may contain reactions carrying non-zero flux below this threshold; if that happens, Medusa will raise an error letting you know that it failed to validate the gapfill solution and that you should try lowering the threshold.
In [7]:
from medusa.reconstruct.expand import iterative_gapfill_from_binary_phenotypes
# select a subset of the biolog conditions to perform gapfilling with
sources = list(media_dicts.keys())
sub_dict = {sources[0]: media_dicts[sources[0]],
            sources[1]: media_dicts[sources[1]],
            sources[2]: media_dicts[sources[2]],
            sources[3]: media_dicts[sources[3]],
            sources[4]: media_dicts[sources[4]]}
num_cycles = 10
lower_bound = 0.05
flux_cutoff = 1E-10
ensemble = iterative_gapfill_from_binary_phenotypes(model, universal, sub_dict, num_cycles,
                                                    lower_bound=lower_bound,
                                                    inclusion_threshold=flux_cutoff,
                                                    exchange_reactions=False,
                                                    demand_reactions=False,
                                                    exchange_prefix='EX')
In [8]:
print(len(ensemble.members))
print(ensemble.members)
In [9]:
# Check out the features that vary across the ensemble
print(len(ensemble.features))
print([feature.base_component.id for feature in ensemble.features])
[1]: Reed et al., "Systems approach to refining genome annotation", PNAS 2006
[2]: Biggs & Papin, "Managing uncertainty in metabolic network structure and improving predictions using EnsembleFBA", PLoS Computational Biology 2017
[3]: Medlock & Papin, "Guiding the refinement of biochemical knowledgebases with ensembles of metabolic networks and semi-supervised learning", bioRxiv 2018
[4]: Plata et al., "Long-term phenotypic evolution of bacteria", Nature 2015