The previous notebook predicted seed compounds for acI-A, acI-B, and acI-C composite genomes. An organism's seed set contains all of the metabolites which cannot be synthesized by its metabolic network. They may represent auxotrophies, or compounds which can be degraded. In the latter case, genes associated with the degradation of these compounds should be expressed.
However, seed compounds were computed from the compound metabolic network graph of a clade, and individual reactions in the network graph will be associated with genes from genes from multiple genomes. To overcome this obstacle, we decided to map metatranscriptome samples to the "pan-genome" of each clade.
To construct the pan-genome, we used our reference genome collection to define acI COGs (clusters of orthologous groups), and defined the pan-genome of a clade as the union of all COGs present in at least one genome. We then used BBMap to map metatranscriptome reads to our reference genome collection, and counted the unique reads which map to each actinobacterial COG.
We used OrthoMCL to identify clusters of orthologous genes (COGs) in the set of 36 freshwater acI genomes. OrthoMCL is an algorithm for grouping proteins into orthologous gene families based on sequence similarity. OrthoMCL takes as input a set of protein sequences and returns a list of COGs and the proteins which belong to each COG. The OrthoMCL pipeline consists of the following steps:
faa
files to be compatible with OrthoMCL (script 01faaParser.py
).02parallelBlast
).setupMySql.sh
and runOrthoMCL.sh
).05parseCOGs
). Detailed instructions for running these scripts can be found in code/orthoMCL/README.md
. The output of these scripts are described below and located in data/orthoMCL
.
A table listing the locus tags associated with each (genome, COG) pair.
AAA023D18 | AAA023J06 | AAA024D14 | |
---|---|---|---|
group00000 | AAA023D18.genome.CDS.1002; AAA023D18.genome.CDS.925; AAA023D18.genome.CDS.939 | AAA023J06.genome.CDS.1227; AAA023J06.genome.CDS.862 | |
group00001 | AAA023D18.genome.CDS.800 | AAA023J06.genome.CDS.798 | AAA024D14.genome.CDS.945; AAA024D14.genome.CDS.1601 |
For example, in genome AAA023D18, the following genes belong to cog00000: AAA023D18.genome.CDS.1002, AAA023D18.genome.CDS.925, and AAA023D18.genome.CDS.939.
A table listing the annotations associated with each (genome, COG) pair.
AAA023D18 | AAA023J06 | AAA024D14 | |
---|---|---|---|
group00000 | Short-chain dehydrogenase/reductase in hypothetical Actinobacterial gene cluster; hypothetical protein; 3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100) | 3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100); 3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100) | |
group00001 | DNA gyrase subunit A (EC 5.99.1.3) | DNA gyrase subunit A (EC 5.99.1.3) | DNA gyrase subunit A (EC 5.99.1.3); Topoisomerase IV subunit A (EC 5.99.1.-) |
This table provides a list of all annotations associated with the genes in a COG. It can be further manually parsed to reveal the distribution of annotations associated with a COG.
This file contains a list of (gene, COG) pairs, giving the COG associated with each gene in the genome. One such file is created per genome. For example,
AAA023D18.genome.CDS.834,group00573
AAA023D18.genome.CDS.1427,group00799
Briefly, metatranscriptomic reads were quality filtered and merged using FLASH, and rRNA reads were removed using SortMeRNA. For additional details, check out the Github repo associated with the sampling operation.
Once processed, the reads were placed in data\metranscriptomes.samplefastq and mapped as follows.
This section describes the steps to map, count, and normalize metatranscriptomic reads to the reference genomes using BBMap. In this approach, each metatranscriptomic sample is mapped to a single fasta
file containing all acI reference genomes. Then, we aggregate all reads which map to a single (clade, COG) pairing and count the total number of reads.
Downstream counting of reads requires the fasta
and gff
files to be compatible. The gff
files created previously use the KBase contig naming scheme: >genome.contig.#
. The script code\pushToKbase\kBaseGenbankToFasta
generates KBase-formated fna files (refGenomes\fna
), from which we created a single fasta
file containg all acI genomes. This genome and its gff file are located in data\refGenomes\concat
.
For convenience, the script code\mapping\01readMapping.pl
will execute the pipeline with a single command, which maps each metatranscriptome to the concatenated acI genome:
bbmap.sh t=numThreads in=sample.fastq outm=sample-genome.sam ref=genome.fna ambig=random minid=0.95 nodisk sam=1.3
With this command, reads which map equally well to multiple sites are mapped assigned to a site at random (ambig-random
). An 95% percent identity cutoff (minid=0.85
) was chosen as it represents a well-established criteria for identifying microbial species.
Read counting is performed via the script code\mapping\02aggregateUniqueReads.pl
, which counts the number of unique reads which map to each (clade, COG) pairing. Briefly, the script works as follows:
cogTable.csv
is used to link each (genome, gene) pair to a (clade, COG) pair. See the illustrated cogTable above for an idea of what this looks like.The output of these scripts are two sets of files, MT-acI.CDS.out
and MT-acI.COG.out
, containing the counts for each (genome, gene) and (clade, COG) pair, respectively. The results are stored in data\mapping\htseq
.
Gene expression was calculated on a normalized basis using RPKM. RPKM stands for 'Reads per Kilobase of Transcript per Million Mapped Reads' and is calculated as follows:
Reads were counted using htseq-count as described above.
Kilobase of gene length, this is obtained from the genome annotation. Read counts are normalized to kilobase of gene length to compare expression within a sample: genes which are longer will map more reads.
Million mapped reads, this is obtained by summing the reads which map to every gene within the genome. This value is normally used as a proxy for sequencing depth. If two identical samples were sequenced to different depths, the sample with deeper sequencing will map more reads.
Normalized RPKM values are averaged across all samples and expressed on a log2 basis.
Within each genome, RPKM values are also reported as a percentile rank, and the consensus annotation for that COG is also reported.
Normalization computations are performed via the script code\mapping\03readNormalization
and the results stored in results\expression\clade.COG.norm
.
In [ ]: