Reverse Ecology and Metatranscriptomics of Uncultivated Freshwater Actinobacteria

Genome Annotation

KBase (http://kbase.us) is a powerful resource for obtaining genome-scale network reconstructions for microbial genomes. These reconstructions are distributed as SBML files, which must be processed prior to reverse ecology analysis. This notebook describes how to obtain reconstructions from KBase, and how to process them.

Obtaining and Preparing SBML Files

Briefly, genomes (as fasta files containing unannotated contigs) can be pushed from a computer to KBase. Once there, a KBase Narrative (iPython notebook) can be used to build reconstructions for your genomes. The script code\pushToKBase\loadGenomes.pl pushes these genomes to the KBase narratives created for this project. For more details, follow the instructions in code/pushToKbase/README.md.

Genomes were pushed to four separate narratives:

Within each narrative, the "Annotate Contigs" and "Build Metabolic Model" KBase apps were run for each genome. Annotated genomes (Genbank format), and models (SBML and tsv formats) were downloaded for each genome and stored in:

  • data/refGenomes - genomes and annotations in a variety for formats
  • models/rawModels - metabolic models in SBML and 'tsv' formats

Once the genomes were downloaded, the Genbank-formatted genomes were converted to fasta nucleotide (ffn), fasta amino acid (ffa), and gff format, using the scripts described below. All scripts are located in the code/pushToKBase directory:

  • No script given - For each genome, this script concatenates Genbank files for each indiviual scaffold, giving a single Genbank file for the genome.
  • kBaseGenbankToFasta.py - Converts Genbank files to fasta nucleotide (ffn), fasta amino acid (ffa) format.
  • kBaseGenbankToGff.py - Converts Genbank files to gff format.
  • cleanUpGFF.pl - Performs some additional processing on the gff files.
  • No script given - KBase SBML and 'tsv' file names were simplified.

Processing SBML Files

Reconstructions from KBase require further processing before they are suitable for use in reverse ecology. These processing steps include:

  1. Reformat gene locus tags, for compatability with the CobraPy software package
  2. Remove biomass, exchange, transport, spontaneous, DNA/RNA biosynthesis reactions and their corresponding genes
  3. Import metabolite formulas
  4. Check mass- and charge-balancing of reactions in the reconstruction
  5. Reformat reaction and metabolite names, for compatability with the cobrapy software package

A note about post-processing: When KBase detects that one or more subunits of a complex are present, it creates a "full" GPR by adding 'Unknown' genes for the other subunits. At the time of analysis, CobraPy lakced functions to remove these genes. As such, these models should not be used to perform any simulations which rely on GPRs.

As output, the code returns processed SBML files in the 'processedDataDir' folder. It also returns a summary of the model sizes, in the 'summaryStatsDir' folder.

The first chunk of code imports the Python packages necessary for this analysis, and the second calls the function which processes each SBML file and preps it for analysis.


In [ ]:
# Import Python modules 
# These custom-written modules should have been included with the package
# distribution. 
from reverseEcology import sbmlFunctions as sf

# Define local folder structure for data input and processing.
rawModelDir = '../models/rawModels'
processedDataDir = '../models/processedModels'
summaryStatsDir = '../data/modelStats'

In [ ]:
sf.processSBMLforRE(rawModelDir, processedDataDir, summaryStatsDir)

Pruning Currency Metabolites

The next step is to prune currency metabolites from the metabolic network, in order for the network's directed graph to better reflect physiological metabolic transformations. The approach is similar to one outlined by Ma et al [1,2], and later adopted by Borenstein et al [3].

Briefly, "currency metabolites" are defined as metabolites which serve to transfer functional groups (such as ATP), as well as the functional groups themselves (such as phosphorous). To identify such metabolites in KBase, we scanned the ModelSEED reaction database for metabolites listed in Ma et al [1,2], with the addition of cytochromes and quinones (H+ transfer) and acetyl-CoA/CoA (acetate transfer). The full set of such metabolites are included in the reverseEcology Python package.

The function below updates the stoichiometry of all reactions containing these metabolites.

  1. First, all pairs of currency metabolites are removed. The set of such pairs is included in the reverseEcology Python package, in the file packageData\currencyRemovePairs.txt.
  2. Some currency pairs involved in amino acid metabolism are subject to additional scrutiny, and removed only if a free amino group does not participate in the reaction. This ensures that reactions which synthesize these compounds are retained. The set of such pairs is included in the reverseEcology Python package, in the file packageData\currencyAminoPairs.txt.
  3. Finally, all metabolites which represent free forms of functional groups are removed (H+, NH4+, CO2, O2, H2O, etc). The set of such pairs is included in the reverseEcology Python package, in the file packageData\currencyRemoveSingletons.txt.

The script below loops over the set of processed models and removes these metabolites from their associated reactions.


In [ ]:
# Import Python modules 
# These custom-written modules should have been included with the package
# distribution. 
from reverseEcology import sbmlFunctions as sf

# Define local folder structure for data input and processing.
modelDir = '../models/processedModels'
summaryStatsDir = '../data/modelStats'

sf.pruneCurrencyMetabs(modelDir, summaryStatsDir)

Genome Merging

Converting KBase SBML files to Graphs

The first step in reverse ecology analysis is to convert the SBML representation of the metabolic network to a graph. The network is represented as a directed graph, where nodes denote compounds and edges denote reactions. A directed edge from A to B indicates that compound A is a substrate in a reaction which produces compound B. That is, for a given reaction, all the nodes that represent its substrates are connected by directed edges to all the nodes that represent its products.

The code below converts the SBML representations to metabolic network graphs.


In [ ]:
# Import Python modules 
# These custom-written modules should have been included with the package
# distribution. 
from reverseEcology import metadataFunctions as mf
from reverseEcology import sbmlFunctions as sf

# Define local folder structure for data input and processing.
modelDir = '../models/processedModels'
summaryStatsDir = '../data/modelStats'

# Import the list of models
dirList = mf.getDirList(modelDir)

modelStatArray = sf.dirListToAdjacencyList(dirList, modelDir, summaryStatsDir)

Merging Network Graphs Belonging to the Same Clade

As shown by Borenstein et al [3], reverse ecology analysis is sensitive to genome incompleteness. To overcome this issue, we decided to merge metabolic models of all those genomes belonging to a particular clade.

The process begins with a single genome from that clade. For the next genome from that clade, unique metabolic pathways are identified. Those unique pathways are appended to the original graph, giving a metabolic network graph which contains the content of both genomes. The process is repeated, with unique metabolic pathways being appended to the composite network graph until all genomes have been exhausted.

Reverse ecology analysis will then be performed on these merged models.

The code below creates the necessary folder structure. Then, it reads in the taxonomy {lineage, clade, tribe} for each genomes and aggregates all genomes belonging to the same tribe.


In [ ]:
# Import Python modules 
# These custom-written modules should have been included with the package
# distribution. 
from reverseEcology import metadataFunctions as mf
from reverseEcology import graphFunctions as gf

# Define local folder structure for data input and processing.
genomeModelDir = '../models/processedModels'
mergedModelDir = '../models/merged'
externalDataDir = '../data/externalData'
taxonomyFile= externalDataDir+'/taxonomy.csv'

In [ ]:
mergeLevel = 'Clade'
cladeSampleDict =  mf.importTaxonomy(taxonomyFile, mergeLevel)
gf.createMergedGraph(cladeSampleDict, mergedModelDir, genomeModelDir)

References

  1. Ma, H., & Zeng, A. P. (2003). Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinformatics, 19(2), 270–277. http://doi.org/10.1093/bioinformatics/19.2.270
  2. Ma, H. W., & Zeng, A.-P. (2003). The connectivity structure, giant strong component and centrality of metabolic networks. Bioinformatics, 19(11), 1423–1430. http://doi.org/10.1093/bioinformatics/btg177
  3. Borenstein, E., Kupiec, M., Feldman, M. W., & Ruppin, E. (2008). Large-scale reconstruction and phylogenetic analysis of metabolic environments. Proceedings of the National Academy of Sciences, 105(38), 14482–14487. http://doi.org/10.1073/pnas.0806162105

In [ ]: