KBase (http://kbase.us) is a powerful resource for obtaining genome-scale network reconstructions for microbial genomes. These reconstructions are distributed as SBML files, which must be processed prior to reverse ecology analysis. This notebook describes how to obtain reconstructions from KBase, and how to process them.
Briefly, genomes (as fasta files containing unannotated contigs) can be pushed from a computer to KBase. Once there, a KBase Narrative (IPython notebook) can be used to build reconstructions for your genomes. The script code/pushToKBase/loadGenomes.pl pushes these genomes to the KBase narratives created for this project. For more details, follow the instructions in code/pushToKBase/README.md.
Genomes were pushed to four separate narratives:
SAGs: KBase Narrative, workspace ID: joshamilton:1452727482251
Mendota MAGs: KBase Narrative, workspace ID: joshamilton:1452793144933
Trout Bog MAGs: KBase Narrative, workspace ID: joshamilton:1452798604003
MAGs from other research groups: KBase Narrative, workspace ID: joshamilton:1452801292037
Within each narrative, the "Annotate Contigs" and "Build Metabolic Model" KBase apps were run for each genome. Annotated genomes (Genbank format) and models (SBML and tsv formats) were downloaded for each genome and stored in:
data/refGenomes - genomes and annotations in a variety of formats
models/rawModels - metabolic models in SBML and tsv formats
Once the genomes were downloaded, the Genbank-formatted genomes were converted to fasta nucleotide (ffn), fasta amino acid (faa), and gff format, using the scripts described below. All scripts are located in the code/pushToKBase directory.
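The Genbank-to-fasta conversion can be sketched as follows. This is a minimal illustration, not the project's actual conversion script: it assumes each CDS feature has already been parsed (e.g. with Biopython's SeqIO) into a plain dictionary, and the field names here are hypothetical.

```python
def features_to_fasta(features, kind):
    """Render parsed CDS features as FASTA text.

    `features` is a list of dicts with keys 'locus_tag', 'nt_seq'
    (nucleotide sequence), and 'aa_seq' (translation). `kind` selects
    'ffn' (nucleotide) or 'faa' (amino acid) output.
    """
    key = {'ffn': 'nt_seq', 'faa': 'aa_seq'}[kind]
    lines = []
    for feat in features:
        lines.append('>' + feat['locus_tag'])
        lines.append(feat[key])
    return '\n'.join(lines) + '\n'

# Example: two hypothetical CDS features from one genome
cds = [
    {'locus_tag': 'gene_001', 'nt_seq': 'ATGAAATGA', 'aa_seq': 'MK*'},
    {'locus_tag': 'gene_002', 'nt_seq': 'ATGCCCTGA', 'aa_seq': 'MP*'},
]
ffn = features_to_fasta(cds, 'ffn')  # nucleotide FASTA
faa = features_to_fasta(cds, 'faa')  # amino acid FASTA
```

In practice each format (ffn, faa, gff) is just a different serialization of the same parsed features; the gff case additionally requires feature coordinates.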
Reconstructions from KBase require further processing before they are suitable for use in reverse ecology.
A note about post-processing: when KBase detects that one or more subunits of a complex are present, it creates a "full" GPR by adding 'Unknown' genes for the missing subunits. At the time of analysis, COBRApy lacked functions to remove these genes. As such, these models should not be used to perform simulations that rely on GPRs.
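The caveat above can be checked mechanically. Below is a minimal sketch; the GPR strings and the 'Unknown' gene-ID prefix are assumptions based on the description above, not a confirmed KBase naming convention.

```python
def gpr_has_unknown(gene_reaction_rule, token='Unknown'):
    """Return True if a GPR string contains placeholder genes.

    Treats the GPR as a boolean expression over gene IDs and flags
    any gene whose ID starts with the placeholder token.
    """
    genes = gene_reaction_rule.replace('(', ' ').replace(')', ' ').split()
    return any(g.startswith(token) for g in genes if g not in ('and', 'or'))

# A complex where KBase filled in a missing subunit:
flagged = gpr_has_unknown('(geneA and Unknown_1)')  # True
clean = gpr_has_unknown('geneA or geneB')           # False
```

A scan like this over all reactions would identify which models carry placeholder genes before attempting any GPR-dependent simulation.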
As output, the code returns processed SBML files in the 'processedDataDir' folder. It also returns a summary of the model sizes, in the 'summaryStatsDir' folder.
The first chunk of code imports the Python packages necessary for this analysis, and the second calls the function which processes each SBML file and preps it for analysis.
In [ ]:
# Import Python modules
# These custom-written modules should have been included with the package
# distribution.
from reverseEcology import sbmlFunctions as sf
# Define local folder structure for data input and processing.
rawModelDir = '../models/rawModels'
processedDataDir = '../models/processedModels'
summaryStatsDir = '../data/modelStats'
In [ ]:
sf.processSBMLforRE(rawModelDir, processedDataDir, summaryStatsDir)
The next step is to prune currency metabolites from the metabolic network, in order for the network's directed graph to better reflect physiological metabolic transformations. The approach is similar to one outlined by Ma et al [1,2], and later adopted by Borenstein et al [3].
Briefly, "currency metabolites" are defined as metabolites which serve to transfer functional groups (such as ATP), as well as the functional groups themselves (such as phosphorous). To identify such metabolites in KBase, we scanned the ModelSEED reaction database for metabolites listed in Ma et al [1,2], with the addition of cytochromes and quinones (H+ transfer) and acetyl-CoA/CoA (acetate transfer). The full set of such metabolites are included in the reverseEcology Python package.
The function below updates the stoichiometry of all reactions containing these metabolites. The metabolites are defined in three files included with the package:
packageData/currencyRemovePairs.txt
packageData/currencyAminoPairs.txt
packageData/currencyRemoveSingletons.txt
The script below loops over the set of processed models and removes these metabolites from their associated reactions.
In [ ]:
# Import Python modules
# These custom-written modules should have been included with the package
# distribution.
from reverseEcology import sbmlFunctions as sf
# Define local folder structure for data input and processing.
modelDir = '../models/processedModels'
summaryStatsDir = '../data/modelStats'
sf.pruneCurrencyMetabs(modelDir, summaryStatsDir)
The first step in reverse ecology analysis is to convert the SBML representation of the metabolic network to a graph. The network is represented as a directed graph, where nodes denote compounds and edges denote reactions. A directed edge from A to B indicates that compound A is a substrate in a reaction which produces compound B. That is, for a given reaction, all the nodes that represent its substrates are connected by directed edges to all the nodes that represent its products.
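The substrate-to-product edge construction described above can be sketched directly. This is an illustration of the graph-building rule, not the package's `dirListToAdjacencyList` function; it assumes reactions are given as (substrates, products) pairs of metabolite lists.

```python
def reactions_to_digraph(reactions):
    """Build a directed graph (adjacency dict) from reactions.

    Each reaction is a (substrates, products) pair; every substrate
    gets a directed edge to every product, so nodes are compounds
    and edges are reactions.
    """
    adj = {}
    for substrates, products in reactions:
        for s in substrates:
            adj.setdefault(s, set()).update(products)
        for p in products:
            adj.setdefault(p, set())  # ensure sink nodes exist
    return adj

# A -> B, then B + C -> D
rxns = [(['A'], ['B']), (['B', 'C'], ['D'])]
graph = reactions_to_digraph(rxns)
# graph['A'] == {'B'}; graph['B'] == {'D'}; graph['C'] == {'D'}
```

A reversible reaction would simply contribute edges in both directions, i.e. it would appear twice in the reaction list with substrates and products swapped.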
The code below converts the SBML representations to metabolic network graphs.
In [ ]:
# Import Python modules
# These custom-written modules should have been included with the package
# distribution.
from reverseEcology import metadataFunctions as mf
from reverseEcology import sbmlFunctions as sf
# Define local folder structure for data input and processing.
modelDir = '../models/processedModels'
summaryStatsDir = '../data/modelStats'
# Import the list of models
dirList = mf.getDirList(modelDir)
modelStatArray = sf.dirListToAdjacencyList(dirList, modelDir, summaryStatsDir)
As shown by Borenstein et al [3], reverse ecology analysis is sensitive to genome incompleteness. To overcome this issue, we decided to merge metabolic models of all those genomes belonging to a particular clade.
The process begins with a single genome from that clade. For the next genome from that clade, unique metabolic pathways are identified. Those unique pathways are appended to the original graph, giving a metabolic network graph which contains the content of both genomes. The process is repeated, with unique metabolic pathways being appended to the composite network graph until all genomes have been exhausted.
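At the graph level, the iterative merge described above amounts to a union of nodes and edges across the clade's genomes. The sketch below illustrates that idea under the adjacency-dict representation; it is not the package's `createMergedGraph` function.

```python
def merge_graphs(graphs):
    """Merge per-genome graphs into one composite clade graph.

    Starting from the first genome's graph, nodes and edges unique to
    each subsequent genome are appended, yielding the union of all
    metabolic content.
    """
    merged = {}
    for graph in graphs:
        for node, successors in graph.items():
            merged.setdefault(node, set()).update(successors)
    return merged

# Two genomes from one clade, each with a partial pathway
genome1 = {'A': {'B'}, 'B': set()}
genome2 = {'B': {'C'}, 'C': set()}
clade = merge_graphs([genome1, genome2])
# clade contains the full A -> B -> C pathway
```

Because the union only ever adds content, the composite graph mitigates false absences caused by genome incompleteness, at the cost of possibly including reactions no single member genome carries alone.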
Reverse ecology analysis will then be performed on these merged models.
The code below creates the necessary folder structure. Then, it reads in the taxonomy {lineage, clade, tribe} for each genome and aggregates all genomes belonging to the same tribe.
In [ ]:
# Import Python modules
# These custom-written modules should have been included with the package
# distribution.
from reverseEcology import metadataFunctions as mf
from reverseEcology import graphFunctions as gf
# Define local folder structure for data input and processing.
genomeModelDir = '../models/processedModels'
mergedModelDir = '../models/merged'
externalDataDir = '../data/externalData'
taxonomyFile = externalDataDir + '/taxonomy.csv'
In [ ]:
mergeLevel = 'Clade'
cladeSampleDict = mf.importTaxonomy(taxonomyFile, mergeLevel)
gf.createMergedGraph(cladeSampleDict, mergedModelDir, genomeModelDir)