In the outer circle, this panel shows the Average Log2 RPKM for each COG in the composite acI-C genome. The inner circles show the presence/absence of each COG in each of the acI-C genomes. The visualization was constructed using Anvio.
To create the visualization, the following command is run from with anvio:
anvi-interactive -p profile.db -t tree.txt -d view_data.txt -A additional_view_data.txt --manual --title "Average RPKM of the acI-C Composite Genome"
where the files are:
* profile.db - profile file
* tree.txt - phylogenetic tree giving clustering of the COGs
* view_data.txt - presence/absence of COG in each genome
* additional_view_data.txt - Average Log2 RPKM for each COG
Thus, creating the visualization requires the following steps:
The visualization was then manually touched up using Adobe Illustrator.
In [1]:
%reset
################################################################################
### Import packages
################################################################################
import os
import pandas as pd
from scipy.spatial import distance as dist
from scipy.cluster import hierarchy as sch
################################################################################
### Define folder structure
################################################################################
externalDataDir = '../data/externalData'
taxonFile = externalDataDir+'/taxonomy.csv'
orthomclDir = '../data/orthoMCL'
resultsDir = '../results/'
exprDir = resultsDir+'/expression'
figureDir = '../figures/fig3-workflow'
taxonLevel = 'Clade'
clade = 'acI-C'
# Check that figureDir exists, results will be placed there
if not os.path.exists(figureDir):
os.makedirs(figureDir)
################################################################################
### Create list of genomes in the specified clade
################################################################################
# Define a function to import taxonomy files
def importTaxonomy(taxonFile, level):
# Read in the taxonomic classification
taxonClass = pd.DataFrame.from_csv(taxonFile, sep=',')
taxonClass = taxonClass.dropna()
# Extract the unique tribes found in the dataset
groupList = pd.unique(taxonClass[level].values)
groupList.sort(axis=0)
groupList = [ group for group in groupList if not group.startswith('Unknown') ]
# For each tribe, return the list of samples. Creates a dict and adds an entry
# for each tribe.
groupSampleDict = {}
for group in groupList:
# Identify the samples belonging to this tribe
samples = taxonClass.loc[taxonClass[level] == group]
samples = [sample for sample in samples.index]
groupSampleDict[group] = samples
return groupSampleDict
genomeSampleDict = importTaxonomy(taxonFile, taxonLevel)
genomeList = genomeSampleDict[clade]
In [2]:
# Determine the set of COGs found within the specified clade
# Read in the cog table. Subset columns belonging to the clade. Subset rows
# with COGs in that clade.
cogTable = pd.read_csv(orthomclDir+'/cogTable.csv', delimiter=',', index_col=0)
cogTable = cogTable[genomeList]
cogTable = cogTable[~pd.isnull(cogTable).all(axis=1)]
# Replace CDS with '1' and 'nan' with 0
cogTable = cogTable.fillna(0)
cogTable = cogTable.replace(to_replace='.+', value='1', regex=True)
# Make index a column and rearrange. Anvio requires the first row/column of
# view data file to say 'contig'
cogTable['contig'] = cogTable.index
cogTable = cogTable[['contig']+genomeList]
cogTable.to_csv(figureDir+'/view_data.txt', sep='\t', index=False)
In [3]:
# Determine the set of COGs found within the specified clade
# Read in the cog table. Subset columns belonging to the clade. Subset rows
# with COGs in that clade.
cogTable = pd.read_csv(orthomclDir+'/cogTable.csv', delimiter=',', index_col=0)
cogTable = cogTable[genomeList]
cogTable = cogTable[~pd.isnull(cogTable).all(axis=1)]
# Establish data table of RPKM values
rpkmTable = pd.read_csv(exprDir+'/'+clade+'.norm', delimiter=',', index_col=1)
rpkmTable = rpkmTable['Log2 Avg RPKM']
# Create empty dataframe and populate with values from rpkmTable
addlViewDataDF = pd.DataFrame(0, index=cogTable.index, columns=['log2_avg_rpkm'], dtype=float)
for cog in addlViewDataDF.index:
if cog in rpkmTable.index:
addlViewDataDF.set_value(cog, 'log2_avg_rpkm', rpkmTable.loc[cog])
addlViewDataDF.to_csv(figureDir+'/additional_view_data.txt', sep='\t')
In [4]:
# Import the observations
obsDF = pd.read_csv(figureDir+'/view_data.txt', sep='\t', index_col=0)
leafNames = obsDF.index.tolist()
obsMatrix = obsDF.values
# Compute the distance matrix
distMatrix = dist.pdist(obsMatrix, metric='euclidean')
# Compute the linkage matrix
linkMatrix = sch.linkage(distMatrix, method='single', metric='euclidean')
# Export the linakge matrix as a newick file
# Stolen from StackOverflow: http://stackoverflow.com/questions/28222179/save-dendrogram-to-newick-format
def getNewick(node, newick, parentdist, leaf_names):
if node.is_leaf():
return "%s:%.2f%s" % (leaf_names[node.id], parentdist - node.dist, newick)
else:
if len(newick) > 0:
newick = "):%.2f%s" % (parentdist - node.dist, newick)
else:
newick = ");"
newick = getNewick(node.get_left(), newick, node.dist, leaf_names)
newick = getNewick(node.get_right(), ",%s" % (newick), node.dist, leaf_names)
newick = "(%s" % (newick)
return newick
tree = sch.to_tree(linkMatrix, False)
newickTree = getNewick(tree, "", tree.dist, leafNames)
# Write to file
with open(figureDir+'/tree.txt', 'w') as outFileHandle:
outFileHandle.write(newickTree)
The final step must be performed manually. In the terminal, navigate to the figureDir
defined above. Launch Anvi'o via the command anvio
. Then, type the command
anvi-interactive -p profile.db -t tree.txt -d view_data.txt -A additional_view_data.txt --manual --title "Average RPKM of the acI-C Composite Genome"
Change the DrawAngle
to 360, press Draw
in the Anvi'o window and save to svg format.