This notebook is for visualizing matrices created with sourmash compare.
Sourmash compare quantifies how much different metagenomes resemble each other. A Jaccard distance of 1 will be reported for metagenomes that are identical to each other, and a Jaccard distance of 0 will be reported for metagenomes that do not share any of their k-mer content. All other degrees of similarity will be captured within the range of 0 to 1.
In this example, the complete Shakya et al. 2013 metagenome is being compared to small, medium, and large subsamples of itself after conservative or aggressive read filtering. The datasets used in this example are named according to their metagenome content and relative degree of read filtering:
SRR606249 = Accession number for the complete Shakya et al. 2013 metagenome
subset50 = 50% of the complete Shakya et al. 2013 metagenome
subset25 = 25% of the complete Shakya et al. 2013 metagenome
subset10 = 10% of the complete Shakya et al. 2013 metagenome
pe.trim2 = Conservative read filtering
pe.trim30 = Aggressive read filtering
Compare signatures and generate a cluster map
Determine which samples are most similar
Create MDS and TSNE plots
First, set the backend of matplotlib to the 'inline' backend. With this backend, the output of plotting commands is displayed inline within frontends like the Jupyter notebook, directly below the code cell that produced it. The resulting plots will then also be stored in the notebook document.
In [1]:
%matplotlib inline
Next, import the compare module
In [2]:
import compare
Then load and visulalize the table
In [3]:
from compare import load_sourmash_csv
# File name
load_sourmash_csv('SRR606249.pe.trim2and30_comparison.k51.csv')
Out[3]:
In [4]:
from compare import create_cluster_map
#Input file name, output image name, title
create_cluster_map("SRR606249.pe.trim2and30_comparison.k51.csv", "Yep.png", 'Shakya Complete and Subsampled with Variable Quality Trimming and K = 51')
In [5]:
from compare import sort_by_similarity
# Input file name, output file name
sort_by_similarity("SRR606249.pe.trim2and30_comparison.k51.csv", "sorted.SRR606249.pe.trim2and30_comparison.k51.csv")
Out[5]:
In [6]:
from compare import create_tsne
#Input file name, output image name
create_tsne("SRR606249.pe.trim2and30_comparison.k51.csv", "yes.png]")
Out[6]:
In [7]:
from compare import create_mds_plot
#Input file name, output image name
create_mds_plot("SRR606249.pe.trim2and30_comparison.k51.csv", "yep.png")
Out[7]:
In [8]:
import pandas as pd
import seaborn as sns
import numpy
from matplotlib import pyplot
from sklearn import manifold
import os.path
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import plotly
import plotly.plotly as py
df_csv = pd.read_csv("SRR606249.pe.trim2and30_comparison.k51.csv")
#def create_mds_plot(filename, save_fig):
m = numpy.loadtxt(open("SRR606249.pe.trim2and30_comparison.k51.csv"), delimiter="," , skiprows=1)
from sklearn.manifold import mds
from sklearn.preprocessing import StandardScaler
data_std = StandardScaler().fit_transform(m)
from sklearn.decomposition import PCA
pca = PCA(n_components=8, svd_solver='full')
data_pca = pca.fit_transform(data_std)
mds = manifold.MDS(n_components=2, max_iter=3000, eps=1e-9,
dissimilarity="euclidean", n_jobs=1).fit_transform(m)
df = pd.DataFrame(mds)
df.columns=['t1','t2']
# Rename index with column names - path
x = dict([(i,os.path.basename(i)) for i in df_csv.columns])
dfnew = df_csv.rename(index=str, columns=x)
dfnew ['']= dfnew.columns
output = dfnew.set_index('')
#output
#df['labels'] = output.columns
#df
#Convert to df to dic
new_output = output.to_dict()
new_output
#df_new
#pyplot.scatter(df.t1, df.t2, label=df['labels'])
py.plot(data=new_output)
In this example, the complete Shakya et al. 2013 metagenome was compared to small, medium, and large subsamples of itself after conservative or aggressive read filtering. We observed that the larger the subsampled percentage, the higher the similarity was between the subsampled metagenome and the complete Shakya et al. 2013 metagenome. To a lesser extent, we also observed more similarity between metagenomes that underwent the same degree of conservative or aggressive read filtering. This is consistent with the expected behavior of sourmash compare to report larger Jaccard distances for metagenomes that share a higher degree of their k-mer content.
More research needs to be done to interpret Jaccard distances calculated across diverse metagenomes that were sequenced with different chemistries and depths of coverage; however, this represents a baseline capability to compare the content of multiple metagenomes in a computationally efficient and reference database-independent manner. Given the high percentage of sequences in environmental metagenomes that have no known reference genome, it is advantageous to use this kind of approach rather than limiting metagenome comparisons to the sequences with known taxonomic classifications.