Example notebook for visualizing metagenomic data using MinHash signatures calculated with sourmash compute, classified with sourmash gather, and compared with sourmash compare.
sourmash compute --scaled 10000 -k 31 < filename >
- Signatures used in the example below can be found in the data directory
sourmash gather -k 31 genbank-k31.sbt.json < filename >
sourmash compare -k 31 < filename >
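For intuition, the idea behind a scaled MinHash signature and the similarity that compare reports can be sketched in a few lines of Python. This is a toy illustration, not sourmash's implementation: the hash function (BLAKE2b from hashlib), the function names, and the omission of reverse-complement handling are all assumptions made for this example.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space used in this toy example


def kmers(seq, k):
    """Yield every overlapping k-mer of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def toy_hash(kmer):
    """Map a k-mer to a 64-bit integer (a stand-in for sourmash's real hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_sketch(seq, k, scaled):
    """Keep only the hashes that fall in the lowest 1/scaled of the hash space."""
    threshold = MAX_HASH // scaled
    return {h for h in map(toy_hash, kmers(seq, k)) if h < threshold}


def jaccard(a, b):
    """Jaccard similarity: shared hashes divided by all distinct hashes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

With scaled=1 every distinct k-mer hash is kept; larger values keep roughly 1/scaled of them, which is what keeps signatures for large metagenomes small enough to compare quickly.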
In [1]:
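#Render matplotlib plots inline and import pyplot (needed for the optional plt.savefig calls)
%matplotlib inline
import matplotlib.pyplot as plt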
In [2]:
#Import pandas, seaborn, and the IPython display helpers
import pandas as pd
import seaborn as sns
from IPython.display import display, HTML
In [3]:
#Read in taxonomic classification results from sourmash gather with pandas
#Each line: dataframe name = pd.read_csv(path to the gather CSV file)
mg_1_table = pd.read_csv("../data/mg_1")
mg_2_table = pd.read_csv("../data/mg_2")
mg_3_table = pd.read_csv("../data/mg_3")
mg_4_table = pd.read_csv("../data/mg_4")
mg_5_table = pd.read_csv("../data/mg_5")
mg_6_table = pd.read_csv("../data/mg_6")
mg_7_table = pd.read_csv("../data/mg_7")
mg_8_table = pd.read_csv("../data/mg_8")
#Display taxonomic classification results for the 8 metagenomes
#display() renders each dataframe as a table
#To skip a dataframe, comment out its line with the "#" symbol
display(mg_1_table)
display(mg_2_table)
display(mg_3_table)
display(mg_4_table)
display(mg_5_table)
display(mg_6_table)
display(mg_7_table)
display(mg_8_table)
intersect_bp - base pairs shared by the query and the match
f_orig_query - fraction of the original query that belongs to the match
f_match - fraction of the match found in the query
f_unique_to_query - fraction of the query that is unique to the match
name - name of the match
filename - search database used
md5 - unique identifier for data used to generate the signature
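One way to use these columns: since f_unique_to_query counts only the part of the query assigned to that match alone, the values for one metagenome can be summed to estimate how much of it was classified. A minimal sketch with pandas follows; the table is made-up toy data, not real gather output, and fraction_classified is a hypothetical helper introduced only for this example.

```python
import pandas as pd


def fraction_classified(gather_table):
    """Sum f_unique_to_query to estimate the classified fraction of a query."""
    return gather_table["f_unique_to_query"].sum()


# Made-up values for illustration only; real tables come from sourmash gather.
toy = pd.DataFrame({
    "name": ["genome A", "genome B"],
    "f_unique_to_query": [0.25, 0.125],
})
print(fraction_classified(toy))
```

In this notebook the same helper could be applied to any of the loaded tables, for example fraction_classified(mg_1_table).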
In [4]:
#Combine the gather output for all 8 metagenomes into a single file named all_gather_results.csv
#Write the header once, then append the data rows from each file
!head -1 ../data/mg_1 > all_gather_results.csv
!tail -n +2 -q ../data/mg_{1..8} >> all_gather_results.csv
sns.set(style="darkgrid")
#Plot how often each match is detected across the 8 metagenomes
dx = pd.read_csv('all_gather_results.csv', header = 0)
dx['name'].value_counts().plot(kind="barh", fontsize=16, figsize=(12,12))
#plt.savefig('<file name>.pdf', bbox_inches='tight')
Out[4]:
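The head/tail step in the cell above depends on shell features such as brace expansion. A possible alternative, sketched here under the assumption that every gather CSV shares the same header, is to combine the files in pandas. The helper name combine_gather_csvs and the extra "source" column are hypothetical additions, not part of sourmash output.

```python
import pandas as pd


def combine_gather_csvs(paths):
    """Read each gather CSV and stack them, recording which file each row came from."""
    tables = []
    for path in paths:
        table = pd.read_csv(path)
        table["source"] = str(path)  # hypothetical extra column, not from sourmash
        tables.append(table)
    return pd.concat(tables, ignore_index=True)


# Example call using the paths from this notebook:
# dx = combine_gather_csvs([f"../data/mg_{i}" for i in range(1, 9)])
```

Keeping the source file alongside each row makes it possible to group or color results by metagenome later without re-reading the individual files.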
In [5]:
#Plot the average fraction of each match detected across all metagenomes
newdx = dx[['f_match', 'name']].copy()
newdx
newdx_byname = newdx.set_index('name')
newdx_byname.groupby(level=0).mean().plot(kind="barh", fontsize=16, figsize=(12,12))
#plt.savefig('<insert name>.pdf', bbox_inches='tight')
Out[5]:
In [6]:
#Calculate Jaccard similarity using sourmash compare and write the results to a CSV file named mg_compare
#Path to sourmash install, "compare", path to signatures, output format, output filename
!~/dev/sourmash/sourmash compare ../data/mg_*sig --csv mg_compare
In [7]:
#Generate similarity matrix with hierarchical clustering
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(context="paper", font="monospace")
sns.set(font_scale=1.4)
#Define clustermap color scheme
cmap = sns.cubehelix_palette(8, start=2, rot=0, dark=0, light=.95, as_cmap=True)
# Load the dataset
df = pd.read_csv("mg_compare", header=0)
# Draw the clustermap using seaborn
o = sns.clustermap(df, vmax=1, vmin=0, square=True, linewidths=.005, cmap=cmap)
#Bold labels and rotate
plt.setp(o.ax_heatmap.get_yticklabels(), rotation=0, fontweight="bold")
plt.setp(o.ax_heatmap.get_xticklabels(), rotation=90, fontweight="bold")
#Set context with seaborn
sns.set(context="paper",font="monospace")
#Save figure
#plt.savefig('<filename>.pdf')
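Because the comparison CSV is read with only a header row, the dataframe's rows are numbered rather than named, so the y-axis of the heatmap shows integers instead of signature names. A small sketch of one way to label both axes follows. It assumes the file holds a square matrix with a single header row of signature names and rows in the same order as the columns, which is worth checking against your own mg_compare file; load_compare_matrix is a hypothetical helper.

```python
import pandas as pd


def load_compare_matrix(path):
    """Load a square comparison CSV and label the rows with the column names."""
    matrix = pd.read_csv(path, header=0)
    if matrix.shape[0] != matrix.shape[1]:
        raise ValueError("expected a square matrix")
    matrix.index = matrix.columns  # assumes rows are in the same order as columns
    return matrix


# Example call for this notebook:
# df = load_compare_matrix("mg_compare")
```

The resulting dataframe can be passed to sns.clustermap in place of df, so that rows and columns carry the same labels after clustering.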