Reverse Ecology and Metatranscriptomics of Uncultivated Freshwater Actinobacteria

Overview

The previous notebook predicted seed compounds for acI-A, acI-B, and acI-C composite genomes. An organism's seed set contains all of the metabolites which cannot be synthesized by its metabolic network. They may represent auxotrophies, or compounds which can be degraded. In the latter case, genes associated with the degradation of these compounds should be expressed.

However, seed compounds were computed from the compound metabolic network graph of a clade, and individual reactions in the network graph will be associated with genes from genes from multiple genomes. To overcome this obstacle, we decided to map metatranscriptome samples to the "pan-genome" of each clade.

To construct the pan-genome, we used our reference genome collection to define acI COGs (clusters of orthologous groups), and defined the pan-genome of a clade as the union of all COGs present in at least one genome. We then used BBMap to map metatranscriptome reads to our reference genome collection, and counted the unique reads which map to each actinobacterial COG.

Protein Clustering

We used OrthoMCL to identify clusters of orthologous genes (COGs) in the set of 36 freshwater acI genomes. OrthoMCL is an algorithm for grouping proteins into orthologous gene families based on sequence similarity. OrthoMCL takes as input a set of protein sequences and returns a list of COGs and the proteins which belong to each COG. The OrthoMCL pipeline consists of the following steps:

Format faa files to be compatible with OrthoMCL (script 01faaParser.py).
Run all-vs-all BLAST on the concatenated set of protein sequences (script 02parallelBlast).
Initialize the MySQL server to store OrthoMCL output and run OrthoMCL (scripts setupMySql.sh and runOrthoMCL.sh).
Rearrange the OrthoMCL output into a user-friendly format (script 05parseCOGs).

Detailed instructions for running these scripts can be found in code/orthoMCL/README.md. The output of these scripts are described below and located in data/orthoMCL.

cogTable.csv

A table listing the locus tags associated with each (genome, COG) pair.

	AAA023D18	AAA023J06	AAA024D14
group00000	AAA023D18.genome.CDS.1002; AAA023D18.genome.CDS.925; AAA023D18.genome.CDS.939	AAA023J06.genome.CDS.1227; AAA023J06.genome.CDS.862
group00001	AAA023D18.genome.CDS.800	AAA023J06.genome.CDS.798	AAA024D14.genome.CDS.945; AAA024D14.genome.CDS.1601

For example, in genome AAA023D18, the following genes belong to cog00000: AAA023D18.genome.CDS.1002, AAA023D18.genome.CDS.925, and AAA023D18.genome.CDS.939.

annotTable.csv

A table listing the annotations associated with each (genome, COG) pair.

	AAA023D18	AAA023J06	AAA024D14
group00000	Short-chain dehydrogenase/reductase in hypothetical Actinobacterial gene cluster; hypothetical protein; 3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100)	3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100); 3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100)
group00001	DNA gyrase subunit A (EC 5.99.1.3)	DNA gyrase subunit A (EC 5.99.1.3)	DNA gyrase subunit A (EC 5.99.1.3); Topoisomerase IV subunit A (EC 5.99.1.-)

annotSummary.csv

This table provides a list of all annotations associated with the genes in a COG. It can be further manually parsed to reveal the distribution of annotations associated with a COG.

genomes/genomeCOGs.txt

This file contains a list of (gene, COG) pairs, giving the COG associated with each gene in the genome. One such file is created per genome. For example,

AAA023D18.genome.CDS.834,group00573
AAA023D18.genome.CDS.1427,group00799

Metatranscriptomic Mapping

Downloading Metatranscriptome Samples

Metatranscriptomes used in this analysis are available from NCBI and can be downloaded as follows:

Processing Metatranscriptome Samples

Briefly, metatranscriptomic reads were quality filtered and merged using FLASH, and rRNA reads were removed using SortMeRNA. For additional details, check out the Github repo associated with the sampling operation.

Once processed, the reads were placed in data\metranscriptomes.samplefastq and mapped as follows.

Mapping Metatranscriptomic Reads

This section describes the steps to map, count, and normalize metatranscriptomic reads to the reference genomes using BBMap. In this approach, each metatranscriptomic sample is mapped to a single fasta file containing all acI reference genomes. Then, we aggregate all reads which map to a single (clade, COG) pairing and count the total number of reads.

Creation of composite genome

Downstream counting of reads requires the fasta and gff files to be compatible. The gff files created previously use the KBase contig naming scheme: >genome.contig.#. The script code\pushToKbase\kBaseGenbankToFasta generates KBase-formated fna files (refGenomes\fna), from which we created a single fasta file containg all acI genomes. This genome and its gff file are located in data\refGenomes\concat.

Mapping

For convenience, the script code\mapping\01readMapping.pl will execute the pipeline with a single command, which maps each metatranscriptome to the concatenated acI genome:

bbmap.sh t=numThreads in=sample.fastq outm=sample-genome.sam ref=genome.fna ambig=random minid=0.95 nodisk sam=1.3

With this command, reads which map equally well to multiple sites are mapped assigned to a site at random (ambig-random). An 95% percent identity cutoff (minid=0.85) was chosen as it represents a well-established criteria for identifying microbial species.

Clade-Level Gene Expression

Read Counting

Read counting is performed via the script code\mapping\02aggregateUniqueReads.pl, which counts the number of unique reads which map to each (clade, COG) pairing. Briefly, the script works as follows:

First, count the total number of reads which map to each (genome, gene) pairing. This counting is performed with HTSeq-Count.
Second, use this information to count the total number of unique reads which map to each (clade, COG) pairing. Here, cogTable.csv is used to link each (genome, gene) pair to a (clade, COG) pair. See the illustrated cogTable above for an idea of what this looks like.

The output of these scripts are two sets of files, MT-acI.CDS.out and MT-acI.COG.out, containing the counts for each (genome, gene) and (clade, COG) pair, respectively. The results are stored in data\mapping\htseq.

Expression Normalization

Gene expression was calculated on a normalized basis using RPKM. RPKM stands for 'Reads per Kilobase of Transcript per Million Mapped Reads' and is calculated as follows:

Reads were counted using htseq-count as described above.
Kilobase of gene length, this is obtained from the genome annotation. Read counts are normalized to kilobase of gene length to compare expression within a sample: genes which are longer will map more reads.
Million mapped reads, this is obtained by summing the reads which map to every gene within the genome. This value is normally used as a proxy for sequencing depth. If two identical samples were sequenced to different depths, the sample with deeper sequencing will map more reads.

Normalized RPKM values are averaged across all samples and expressed on a log2 basis.

Within each genome, RPKM values are also reported as a percentile rank, and the consensus annotation for that COG is also reported.

Normalization computations are performed via the script code\mapping\03readNormalization and the results stored in results\expression\clade.COG.norm.



In [ ]: