Background

Example notebook for the visualiztion of metagenomic data using MinHash signatures calculated with sourmash compute, classified with sourmash gather, and compared with sourmash compare.

  • Signatures were computed with a --scaled 10000 and -k 31
sourmash compute --scaled 10000 -k 31 < filename >
- Signatures used in the example below can be found in the data directory 

sourmash gather -k 31 genbank-k31.sbt.json < filename >
  • Metagenomes were compared using sourmash compare
sourmash compare -k 31 < filename >

1) Import data visualiztion tools


In [1]:
#Import matplotlib
%matplotlib inline

In [2]:
#Import pandas, seaborn, and ipython display
import pandas as pd 
import seaborn as sns
from IPython.display import display, HTML

2) Convert sourmash output (i.e. csv) to dataframe from visualization


In [3]:
#Read in taxonmic classification results from sourmash with pandas 

#Dataframe name, read in csv file
mg_1_table = pd.read_csv("../data/mg_1")
mg_2_table = pd.read_csv("../data/mg_2")
mg_3_table = pd.read_csv("../data/mg_3")
mg_4_table = pd.read_csv("../data/mg_4")
mg_5_table = pd.read_csv("../data/mg_5")
mg_6_table = pd.read_csv("../data/mg_6")
mg_7_table = pd.read_csv("../data/mg_7")
mg_8_table = pd.read_csv("../data/mg_8")

#Display taxonomic classification results for 8 metagenomes 
#Display data frames as tabels with display()
#Remove dataframe by commenting out using the "#" symbol

#Display all dataframes

display(mg_1_table)
display(mg_2_table)
display(mg_3_table)
display(mg_4_table)
display(mg_5_table)
display(mg_6_table)
display(mg_7_table)
display(mg_8_table)


intersect_bp f_orig_query f_match f_unique_to_query name filename md5
0 5270000.0 0.528586 1.000000 0.528586 CU928164.2 Escherichia coli IAI39 chromosome, ... ../database/genbank-k31.sbt.json 312d479a6ac8faeeeafc460f08abcd67
1 4930000.0 0.494483 0.474645 0.234704 CP001855.1 Escherichia coli O83:H1 str. NRG 85... ../database/genbank-k31.sbt.json 3f6082c99612ef0622995b4b5c5fdd22
2 1920000.0 0.192578 0.969697 0.192578 CP000025.1 Campylobacter jejuni RM1221, comple... ../database/genbank-k31.sbt.json a38dd480662095f725894111f06057b3
3 1670000.0 0.167503 0.263473 0.044132 FPEE01000083.1 Campylobacter jejuni isolate 11... ../database/genbank-k31.sbt.json 37369bda9c5cc7a52e74de091608d311
intersect_bp f_orig_query f_match f_unique_to_query name filename md5
0 5150000.0 0.344021 0.998062 0.344021 CP012728.1 Bacillus anthracis strain PR10-4, c... ../database/genbank-k31.sbt.json 9c676987aacdf4b98f4a54a86ce81c09
1 4210000.0 0.281229 0.762681 0.281229 MYIF01001000.1 Salmonella enterica strain BCW_... ../database/genbank-k31.sbt.json 2d316a43b5f41173c5d5cb9e5c4dc392
2 4190000.0 0.279893 0.771372 0.259185 BA000007.2 Escherichia coli O157:H7 str. Sakai... ../database/genbank-k31.sbt.json 3a1b4e04038cd5886ba40c2ca8e166b9
3 1670000.0 0.111556 1.000000 0.111556 FPEE01000083.1 Campylobacter jejuni isolate 11... ../database/genbank-k31.sbt.json 37369bda9c5cc7a52e74de091608d311
4 4070000.0 0.271877 0.012245 0.004008 MZNJ01000001.1 Salmonella enterica subsp. ente... ../database/genbank-k31.sbt.json 908c1e7301363ef4ac5658d4f7650597
intersect_bp f_orig_query f_match f_unique_to_query name filename md5
0 5150000.0 0.473346 0.998062 0.473346 CP012728.1 Bacillus anthracis strain PR10-4, c... ../database/genbank-k31.sbt.json 9c676987aacdf4b98f4a54a86ce81c09
1 4060000.0 0.373162 0.828571 0.373162 MZNJ01000001.1 Salmonella enterica subsp. ente... ../database/genbank-k31.sbt.json 908c1e7301363ef4ac5658d4f7650597
2 1670000.0 0.153493 1.000000 0.153493 FPEE01000083.1 Campylobacter jejuni isolate 11... ../database/genbank-k31.sbt.json 37369bda9c5cc7a52e74de091608d311
intersect_bp f_orig_query f_match f_unique_to_query name filename md5
0 5150000.0 0.344021 0.998062 0.344021 CP012728.1 Bacillus anthracis strain PR10-4, c... ../database/genbank-k31.sbt.json 9c676987aacdf4b98f4a54a86ce81c09
1 4210000.0 0.281229 0.762681 0.281229 MYIF01001000.1 Salmonella enterica strain BCW_... ../database/genbank-k31.sbt.json 2d316a43b5f41173c5d5cb9e5c4dc392
2 4190000.0 0.279893 0.771372 0.259185 BA000007.2 Escherichia coli O157:H7 str. Sakai... ../database/genbank-k31.sbt.json 3a1b4e04038cd5886ba40c2ca8e166b9
3 1670000.0 0.111556 1.000000 0.111556 FPEE01000083.1 Campylobacter jejuni isolate 11... ../database/genbank-k31.sbt.json 37369bda9c5cc7a52e74de091608d311
4 4070000.0 0.271877 0.012245 0.004008 MZNJ01000001.1 Salmonella enterica subsp. ente... ../database/genbank-k31.sbt.json 908c1e7301363ef4ac5658d4f7650597
intersect_bp f_orig_query f_match f_unique_to_query name filename md5
0 5270000.0 0.528586 1.000000 0.528586 CU928164.2 Escherichia coli IAI39 chromosome, ... ../database/genbank-k31.sbt.json 312d479a6ac8faeeeafc460f08abcd67
1 4930000.0 0.494483 0.474645 0.234704 CP001855.1 Escherichia coli O83:H1 str. NRG 85... ../database/genbank-k31.sbt.json 3f6082c99612ef0622995b4b5c5fdd22
2 1920000.0 0.192578 0.969697 0.192578 CP000025.1 Campylobacter jejuni RM1221, comple... ../database/genbank-k31.sbt.json a38dd480662095f725894111f06057b3
3 1670000.0 0.167503 0.263473 0.044132 FPEE01000083.1 Campylobacter jejuni isolate 11... ../database/genbank-k31.sbt.json 37369bda9c5cc7a52e74de091608d311
intersect_bp f_orig_query f_match f_unique_to_query name filename md5
0 5330000.0 0.349508 1.0 0.349508 BX470251.1 Photorhabdus luminescens subsp. lau... ../database/genbank-k31.sbt.json f38b30cc34c8341c45dde18e84fc1a4f
1 4190000.0 0.274754 1.0 0.274754 CP001114.1 Deinococcus deserti VCD115, complet... ../database/genbank-k31.sbt.json e916a176f7035b4f1962c5c49e56818d
2 2970000.0 0.194754 1.0 0.194754 CP002985.1 Acidithiobacillus ferrivorans SS3, ... ../database/genbank-k31.sbt.json d85347d6bfe69560e5cd266463290765
3 2760000.0 0.180984 1.0 0.180984 FM211688.1 Listeria monocytogenes L99 serovar ... ../database/genbank-k31.sbt.json f406fb0fe4840a35d87293d8c0e26a2e
intersect_bp f_orig_query f_match f_unique_to_query name filename md5
0 5270000.0 0.474775 1.000000 0.474775 CU928164.2 Escherichia coli IAI39 chromosome, ... ../database/genbank-k31.sbt.json 312d479a6ac8faeeeafc460f08abcd67
1 4070000.0 0.366667 0.808163 0.356757 MZNJ01000001.1 Salmonella enterica subsp. ente... ../database/genbank-k31.sbt.json 908c1e7301363ef4ac5658d4f7650597
2 1870000.0 0.168468 0.944444 0.168468 CP000025.1 Campylobacter jejuni RM1221, comple... ../database/genbank-k31.sbt.json a38dd480662095f725894111f06057b3
intersect_bp f_orig_query f_match f_unique_to_query name filename md5
0 5270000.0 0.295735 1.000000 0.295735 CU928164.2 Escherichia coli IAI39 chromosome, ... ../database/genbank-k31.sbt.json 312d479a6ac8faeeeafc460f08abcd67
1 5150000.0 0.289001 0.998062 0.289001 CP012728.1 Bacillus anthracis strain PR10-4, c... ../database/genbank-k31.sbt.json 9c676987aacdf4b98f4a54a86ce81c09
2 2970000.0 0.166667 1.000000 0.166667 CP002985.1 Acidithiobacillus ferrivorans SS3, ... ../database/genbank-k31.sbt.json d85347d6bfe69560e5cd266463290765
3 2760000.0 0.154882 1.000000 0.154882 FM211688.1 Listeria monocytogenes L99 serovar ... ../database/genbank-k31.sbt.json f406fb0fe4840a35d87293d8c0e26a2e
4 1670000.0 0.093715 1.000000 0.093715 FPEE01000083.1 Campylobacter jejuni isolate 11... ../database/genbank-k31.sbt.json 37369bda9c5cc7a52e74de091608d311

Terms

  • intersect_bp - baspairs in shared by the query and the match

  • f_orig_query - fraction of the query

  • f_match - fraction of the match found

  • f_unique_to_query - fraction of the query that is unique to the match

  • name - name of the match

  • filename - search database used

  • md5 - unique identifier for data used to generate the signature

3) Compare taxa across metagenomes


In [4]:
#Combined output into a single file named all_gather_results.csv
!head -1 ../data/mg_1 \
> all_gather_results.csv; tail -n +2 -q ../data/mg_{1..8} >> all_gather_results.csv

sns.set(style="darkgrid")

#Ploting the frequency of detection of each match across the 8 metagenomes 
dx = pd.read_csv('all_gather_results.csv', header = 0)
dx['name'].value_counts().plot(kind="barh", fontsize=16, figsize=(12,12))

#plt.savefig('<file name>.pdf', bbox_inches='tight')


Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x11496dac8>

In [5]:
#Ploting average of the fraction of match detected across all metagenomes
newdx = dx[['f_match', 'name']].copy()
newdx
newdx_byname = newdx.set_index('name')
newdx_byname.groupby(level=0).mean().plot(kind="barh", fontsize=16, figsize=(12,12))

#plt.savefig('<insert name>.pdf', bbox_inches='tight')


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x116e4ad68>

3) Compare metagenomes with sourmash compare


In [6]:
#Calculate jaccard distance using sourmash compare and generate results in a csv named mg_compare

#Path to sourmash install, "compare", path to signatures, output format, output filename
!~/dev/sourmash/sourmash compare ../data/mg_*sig --csv mg_compare


loading ../data/mg_1.fna.gz.sig
loading ../data/mg_2.fna.gz.sig
loading ../data/mg_3.fna.gz.sig
loading ../data/mg_4.fna.gz.sig
loading ../data/mg_5.fna.gz.sig
loading ../data/mg_6.fna.gz.sig
loading ../data/mg_7.fna.gz.sig
loading ../data/mg_8.fna.gz.sig
0-         mg_1.fna.gz	[ 1.     0.189  0.093  0.189  1.     0.     0.513  0.333]
1-         mg_2.fna.gz	[ 0.189  1.     0.727  1.     0.189  0.     0.376  0.367]
2-         mg_3.fna.gz	[ 0.093  0.727  1.     0.727  0.093  0.     0.313  0.318]
3-         mg_4.fna.gz	[ 0.189  1.     0.727  1.     0.189  0.     0.376  0.367]
4-         mg_5.fna.gz	[ 1.     0.189  0.093  0.189  1.     0.     0.513  0.333]
5-         mg_6.fna.gz	[ 0.    0.    0.    0.    0.    1.    0.    0.21]
6-         mg_7.fna.gz	[ 0.513  0.376  0.313  0.376  0.513  0.     1.     0.287]
7-         mg_8.fna.gz	[ 0.333  0.367  0.318  0.367  0.333  0.21   0.287  1.   ]
min similarity in matrix: 0.000

4) Visualize metagenome comparisons


In [7]:
#Generate similarity matrix with hierchical clustering 
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(context="paper", font="monospace")

sns.set(font_scale=1.4)

#Define clustermap color scheme 
cmap = sns.cubehelix_palette(8, start=2, rot=0, dark=0, light=.95, as_cmap=True)

# Load the datset 
df = pd.read_csv("mg_compare", header=0)

# Draw the clustermap using seaborn
o = sns.clustermap(df, vmax=1, vmin=0, square=True, linewidths=.005, cmap=cmap)


#Bold labels and rotate 
plt.setp(o.ax_heatmap.get_yticklabels(), rotation=0, fontweight="bold")
plt.setp(o.ax_heatmap.get_xticklabels(), rotation=90, fontweight="bold")

#Set context with seaborn 
sns.set(context="paper",font="monospace")

#Save figure 
#plt.savefig(<filename>.pdf)



In [ ]: