Reverse Ecology and Metatranscriptomics of Uncultivated Freshwater Actinobacteria

Preliminaries

This figures requires the user to have run our complete pipeline, including mapping of metatranscriptomic reads. This pipeline encompasses the following files:

code/01-genomeAnnotationAndModelProcessing.ipynb
code/02-calculateSeedCompounds.ipynb
code/03-integrateREwithMTs.ipynb

Figure 3c: Metatranscriptomics of Clade acI-C

Overview

In the outer circle, this panel shows the Average Log2 RPKM for each COG in the composite acI-C genome. The inner circles show the presence/absence of each COG in each of the acI-C genomes. The visualization was constructed using Anvio.

To create the visualization, the following command is run from with anvio:

anvi-interactive -p profile.db -t tree.txt -d view_data.txt -A additional_view_data.txt --manual --title "Average RPKM of the acI-C Composite Genome"

where the files are:

* profile.db - profile file
* tree.txt - phylogenetic tree giving clustering of the COGs
* view_data.txt - presence/absence of COG in each genome
* additional_view_data.txt - Average Log2 RPKM for each COG

Thus, creating the visualization requires the following steps:

Creation of view_data.txt
Creation of additional_view_data.txt
Creation of tree.txt
Visualization!

The visualization was then manually touched up using Adobe Illustrator.

Step 0: Import Packages and Initialize Variables



In [1]:

    
%reset
################################################################################
### Import packages
################################################################################

import os
import pandas as pd
from scipy.spatial import distance as dist
from scipy.cluster import hierarchy as sch

################################################################################
### Define folder structure
################################################################################

externalDataDir = '../data/externalData'
taxonFile = externalDataDir+'/taxonomy.csv'
orthomclDir = '../data/orthoMCL'
resultsDir = '../results/'
exprDir = resultsDir+'/expression'

figureDir = '../figures/fig3-workflow'

taxonLevel = 'Clade'
clade = 'acI-C'

# Check that figureDir exists, results will be placed there
if not os.path.exists(figureDir):
    os.makedirs(figureDir)

################################################################################
### Create list of genomes in the specified clade
################################################################################

# Define a function to import taxonomy files
def importTaxonomy(taxonFile, level):

# Read in the taxonomic classification
    taxonClass = pd.DataFrame.from_csv(taxonFile, sep=',')
    taxonClass = taxonClass.dropna()
    
# Extract the unique tribes found in the dataset
    groupList = pd.unique(taxonClass[level].values)
    groupList.sort(axis=0)
    groupList = [ group for group in groupList if not group.startswith('Unknown') ]
    
# For each tribe, return the list of samples. Creates a dict and adds an entry
# for each tribe.
    groupSampleDict = {}

    for group in groupList:

# Identify the samples belonging to this tribe
        samples = taxonClass.loc[taxonClass[level] == group]
        samples = [sample for sample in samples.index]
        groupSampleDict[group] = samples
        
    return groupSampleDict
    
genomeSampleDict = importTaxonomy(taxonFile, taxonLevel)

genomeList = genomeSampleDict[clade]









    



Once deleted, variables cannot be recovered. Proceed (y/[n])? y

Step 1: Creation of view_data.txt



In [2]:

    
# Determine the set of COGs found within the specified clade
# Read in the cog table. Subset columns belonging to the clade. Subset rows
# with COGs in that clade.
cogTable = pd.read_csv(orthomclDir+'/cogTable.csv', delimiter=',', index_col=0)
cogTable = cogTable[genomeList]
cogTable = cogTable[~pd.isnull(cogTable).all(axis=1)]

# Replace CDS with '1' and 'nan' with 0
cogTable = cogTable.fillna(0)        
cogTable = cogTable.replace(to_replace='.+', value='1', regex=True)

# Make index a column and rearrange. Anvio requires the first row/column of 
# view data file to say 'contig'
cogTable['contig'] = cogTable.index
cogTable = cogTable[['contig']+genomeList]

cogTable.to_csv(figureDir+'/view_data.txt', sep='\t', index=False)

Step 2: Creation of additional_view_data.txt



In [3]:

    
# Determine the set of COGs found within the specified clade
# Read in the cog table. Subset columns belonging to the clade. Subset rows
# with COGs in that clade.
cogTable = pd.read_csv(orthomclDir+'/cogTable.csv', delimiter=',', index_col=0)
cogTable = cogTable[genomeList]
cogTable = cogTable[~pd.isnull(cogTable).all(axis=1)]

# Establish data table of RPKM values
rpkmTable = pd.read_csv(exprDir+'/'+clade+'.norm', delimiter=',', index_col=1)
rpkmTable = rpkmTable['Log2 Avg RPKM']

# Create empty dataframe and populate with values from rpkmTable
addlViewDataDF = pd.DataFrame(0, index=cogTable.index, columns=['log2_avg_rpkm'], dtype=float)
for cog in addlViewDataDF.index:
    if cog in rpkmTable.index:
        addlViewDataDF.set_value(cog, 'log2_avg_rpkm', rpkmTable.loc[cog])
        
addlViewDataDF.to_csv(figureDir+'/additional_view_data.txt', sep='\t')

Step 3: Creation of tree.txt



In [4]:

    
# Import the observations
obsDF = pd.read_csv(figureDir+'/view_data.txt', sep='\t', index_col=0)

leafNames = obsDF.index.tolist()
obsMatrix = obsDF.values

# Compute the distance matrix
distMatrix = dist.pdist(obsMatrix, metric='euclidean')

# Compute the linkage matrix
linkMatrix = sch.linkage(distMatrix, method='single', metric='euclidean')

# Export the linakge matrix as a newick file
# Stolen from StackOverflow: http://stackoverflow.com/questions/28222179/save-dendrogram-to-newick-format
def getNewick(node, newick, parentdist, leaf_names):
    if node.is_leaf():
        return "%s:%.2f%s" % (leaf_names[node.id], parentdist - node.dist, newick)
    else:
        if len(newick) > 0:
            newick = "):%.2f%s" % (parentdist - node.dist, newick)
        else:
            newick = ");"
        newick = getNewick(node.get_left(), newick, node.dist, leaf_names)
        newick = getNewick(node.get_right(), ",%s" % (newick), node.dist, leaf_names)
        newick = "(%s" % (newick)
        return newick

tree = sch.to_tree(linkMatrix, False)
newickTree = getNewick(tree, "", tree.dist, leafNames)

# Write to file
with open(figureDir+'/tree.txt', 'w') as outFileHandle:
    outFileHandle.write(newickTree)

Step 4: Visualization

The final step must be performed manually. In the terminal, navigate to the figureDir defined above. Launch Anvi'o via the command anvio. Then, type the command

anvi-interactive -p profile.db -t tree.txt -d view_data.txt -A additional_view_data.txt --manual --title "Average RPKM of the acI-C Composite Genome"

Change the DrawAngle to 360, press Draw in the Anvi'o window and save to svg format.