Summary:

This notebook is for visualizing matrices created with sourmash compare.

Sourmash compare quantifies how much different metagenomes resemble each other. A Jaccard distance of 1 will be reported for metagenomes that are identical to each other, and a Jaccard distance of 0 will be reported for metagenomes that do not share any of their k-mer content. All other degrees of similarity will be captured within the range of 0 to 1.

Example Use Case:

In this example, the complete Shakya et al. 2013 metagenome is being compared to small, medium, and large subsamples of itself after conservative or aggressive read filtering. The datasets used in this example are named according to their metagenome content and relative degree of read filtering:

  • SRR606249 = Accession number for the complete Shakya et al. 2013 metagenome

  • subset50 = 50% of the complete Shakya et al. 2013 metagenome

  • subset25 = 25% of the complete Shakya et al. 2013 metagenome

  • subset10 = 10% of the complete Shakya et al. 2013 metagenome

  • pe.trim2 = Conservative read filtering

  • pe.trim30 = Aggressive read filtering

Objectives:

  • Compare signatures and generate a cluster map

  • Determine which samples are most similar

  • Create MDS and TSNE plots

First, set the backend of matplotlib to the 'inline' backend. With this backend, the output of plotting commands is displayed inline within frontends like the Jupyter notebook, directly below the code cell that produced it. The resulting plots will then also be stored in the notebook document.


In [1]:
%matplotlib inline

Next, import the compare module


In [2]:
import compare

Then load and visulalize the table


In [3]:
from compare import load_sourmash_csv
# File name
load_sourmash_csv('SRR606249.pe.trim2and30_comparison.k51.csv')


Out[3]:
/data/SRR606249.pe.trim2.fq.gz /data/SRR606249.pe.trim30.fq.gz /data/SRR606249_subset10.pe.trim2.fq.gz /data/SRR606249_subset10.pe.trim30.fq.gz /data/SRR606249_subset25.pe.trim2.fq.gz /data/SRR606249_subset25.pe.trim30.fq.gz /data/SRR606249_subset50.pe.trim2.fq.gz /data/SRR606249_subset50.pe.trim30.fq.gz
0 1.000000 0.519851 0.407297 0.327220 0.584408 0.423174 0.776402 0.478097
1 0.519851 1.000000 0.600313 0.629410 0.651675 0.813978 0.593865 0.919683
2 0.407297 0.600313 1.000000 0.803394 0.542693 0.635679 0.472325 0.626031
3 0.327220 0.629410 0.803394 1.000000 0.515495 0.688278 0.412614 0.663707
4 0.584408 0.651675 0.542693 0.515495 1.000000 0.724107 0.608354 0.652656
5 0.423174 0.813978 0.635679 0.688278 0.724107 1.000000 0.521258 0.831044
6 0.776402 0.593865 0.472325 0.412614 0.608354 0.521258 1.000000 0.615781
7 0.478097 0.919683 0.626031 0.663707 0.652656 0.831044 0.615781 1.000000

In [4]:
from compare import create_cluster_map
#Input file name, output image name, title 
create_cluster_map("SRR606249.pe.trim2and30_comparison.k51.csv", "Yep.png", 'Shakya Complete and Subsampled with Variable Quality Trimming and K = 51')



In [5]:
from compare import sort_by_similarity
# Input file name, output file name 
sort_by_similarity("SRR606249.pe.trim2and30_comparison.k51.csv", "sorted.SRR606249.pe.trim2and30_comparison.k51.csv")


Out[5]:
Jaccard_similarity
SRR606249.pe.trim2.fq.gz SRR606249.pe.trim2.fq.gz 1.000000
SRR606249.pe.trim30.fq.gz SRR606249.pe.trim30.fq.gz 1.000000
SRR606249_subset50.pe.trim2.fq.gz SRR606249_subset50.pe.trim2.fq.gz 1.000000
SRR606249_subset25.pe.trim30.fq.gz SRR606249_subset25.pe.trim30.fq.gz 1.000000
SRR606249_subset25.pe.trim2.fq.gz SRR606249_subset25.pe.trim2.fq.gz 1.000000
SRR606249_subset10.pe.trim30.fq.gz SRR606249_subset10.pe.trim30.fq.gz 1.000000
SRR606249_subset10.pe.trim2.fq.gz SRR606249_subset10.pe.trim2.fq.gz 1.000000
SRR606249_subset50.pe.trim30.fq.gz SRR606249_subset50.pe.trim30.fq.gz 1.000000
SRR606249.pe.trim30.fq.gz 0.919683
SRR606249.pe.trim30.fq.gz SRR606249_subset50.pe.trim30.fq.gz 0.919683
SRR606249_subset25.pe.trim30.fq.gz SRR606249_subset50.pe.trim30.fq.gz 0.831044
SRR606249_subset50.pe.trim30.fq.gz SRR606249_subset25.pe.trim30.fq.gz 0.831044
SRR606249_subset25.pe.trim30.fq.gz SRR606249.pe.trim30.fq.gz 0.813978
SRR606249.pe.trim30.fq.gz SRR606249_subset25.pe.trim30.fq.gz 0.813978
SRR606249_subset10.pe.trim2.fq.gz SRR606249_subset10.pe.trim30.fq.gz 0.803394
SRR606249_subset10.pe.trim30.fq.gz SRR606249_subset10.pe.trim2.fq.gz 0.803394
SRR606249_subset50.pe.trim2.fq.gz SRR606249.pe.trim2.fq.gz 0.776402
SRR606249.pe.trim2.fq.gz SRR606249_subset50.pe.trim2.fq.gz 0.776402
SRR606249_subset25.pe.trim30.fq.gz SRR606249_subset25.pe.trim2.fq.gz 0.724107
SRR606249_subset25.pe.trim2.fq.gz SRR606249_subset25.pe.trim30.fq.gz 0.724107
SRR606249_subset25.pe.trim30.fq.gz SRR606249_subset10.pe.trim30.fq.gz 0.688278
SRR606249_subset10.pe.trim30.fq.gz SRR606249_subset25.pe.trim30.fq.gz 0.688278
SRR606249_subset50.pe.trim30.fq.gz SRR606249_subset10.pe.trim30.fq.gz 0.663707
SRR606249_subset10.pe.trim30.fq.gz SRR606249_subset50.pe.trim30.fq.gz 0.663707
SRR606249_subset50.pe.trim30.fq.gz SRR606249_subset25.pe.trim2.fq.gz 0.652656
SRR606249_subset25.pe.trim2.fq.gz SRR606249_subset50.pe.trim30.fq.gz 0.652656
SRR606249.pe.trim30.fq.gz 0.651675
SRR606249.pe.trim30.fq.gz SRR606249_subset25.pe.trim2.fq.gz 0.651675
SRR606249_subset10.pe.trim2.fq.gz SRR606249_subset25.pe.trim30.fq.gz 0.635679
SRR606249_subset25.pe.trim30.fq.gz SRR606249_subset10.pe.trim2.fq.gz 0.635679
... ... ...
SRR606249_subset50.pe.trim2.fq.gz SRR606249_subset50.pe.trim30.fq.gz 0.615781
SRR606249_subset50.pe.trim30.fq.gz SRR606249_subset50.pe.trim2.fq.gz 0.615781
SRR606249_subset50.pe.trim2.fq.gz SRR606249_subset25.pe.trim2.fq.gz 0.608354
SRR606249_subset25.pe.trim2.fq.gz SRR606249_subset50.pe.trim2.fq.gz 0.608354
SRR606249_subset10.pe.trim2.fq.gz SRR606249.pe.trim30.fq.gz 0.600313
SRR606249.pe.trim30.fq.gz SRR606249_subset10.pe.trim2.fq.gz 0.600313
SRR606249_subset50.pe.trim2.fq.gz SRR606249.pe.trim30.fq.gz 0.593865
SRR606249.pe.trim30.fq.gz SRR606249_subset50.pe.trim2.fq.gz 0.593865
SRR606249.pe.trim2.fq.gz SRR606249_subset25.pe.trim2.fq.gz 0.584408
SRR606249_subset25.pe.trim2.fq.gz SRR606249.pe.trim2.fq.gz 0.584408
SRR606249_subset10.pe.trim2.fq.gz 0.542693
SRR606249_subset10.pe.trim2.fq.gz SRR606249_subset25.pe.trim2.fq.gz 0.542693
SRR606249_subset25.pe.trim30.fq.gz SRR606249_subset50.pe.trim2.fq.gz 0.521258
SRR606249_subset50.pe.trim2.fq.gz SRR606249_subset25.pe.trim30.fq.gz 0.521258
SRR606249.pe.trim30.fq.gz SRR606249.pe.trim2.fq.gz 0.519851
SRR606249.pe.trim2.fq.gz SRR606249.pe.trim30.fq.gz 0.519851
SRR606249_subset10.pe.trim30.fq.gz SRR606249_subset25.pe.trim2.fq.gz 0.515495
SRR606249_subset25.pe.trim2.fq.gz SRR606249_subset10.pe.trim30.fq.gz 0.515495
SRR606249.pe.trim2.fq.gz SRR606249_subset50.pe.trim30.fq.gz 0.478097
SRR606249_subset50.pe.trim30.fq.gz SRR606249.pe.trim2.fq.gz 0.478097
SRR606249_subset50.pe.trim2.fq.gz SRR606249_subset10.pe.trim2.fq.gz 0.472325
SRR606249_subset10.pe.trim2.fq.gz SRR606249_subset50.pe.trim2.fq.gz 0.472325
SRR606249.pe.trim2.fq.gz SRR606249_subset25.pe.trim30.fq.gz 0.423174
SRR606249_subset25.pe.trim30.fq.gz SRR606249.pe.trim2.fq.gz 0.423174
SRR606249_subset10.pe.trim30.fq.gz SRR606249_subset50.pe.trim2.fq.gz 0.412614
SRR606249_subset50.pe.trim2.fq.gz SRR606249_subset10.pe.trim30.fq.gz 0.412614
SRR606249_subset10.pe.trim2.fq.gz SRR606249.pe.trim2.fq.gz 0.407297
SRR606249.pe.trim2.fq.gz SRR606249_subset10.pe.trim2.fq.gz 0.407297
SRR606249_subset10.pe.trim30.fq.gz SRR606249.pe.trim2.fq.gz 0.327220
SRR606249.pe.trim2.fq.gz SRR606249_subset10.pe.trim30.fq.gz 0.327220

64 rows × 1 columns


In [6]:
from compare import create_tsne
#Input file name, output image name 
create_tsne("SRR606249.pe.trim2and30_comparison.k51.csv", "yes.png]")


Out[6]:
<matplotlib.collections.PathCollection at 0x1154cfb38>

In [7]:
from compare import create_mds_plot
#Input file name, output image name
create_mds_plot("SRR606249.pe.trim2and30_comparison.k51.csv", "yep.png")


/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/manifold/mds.py:411: UserWarning: The MDS API has changed. ``fit`` now constructs an dissimilarity matrix from data. To use a custom dissimilarity matrix, set ``dissimilarity='precomputed'``.
  warnings.warn("The MDS API has changed. ``fit`` now constructs an"
Out[7]:
<matplotlib.collections.PathCollection at 0x115506438>

In [8]:
import pandas as pd
import seaborn as sns
import numpy
from matplotlib import pyplot
from sklearn import manifold
import os.path
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp

import plotly 
import plotly.plotly as py

df_csv = pd.read_csv("SRR606249.pe.trim2and30_comparison.k51.csv")

#def create_mds_plot(filename, save_fig):
m = numpy.loadtxt(open("SRR606249.pe.trim2and30_comparison.k51.csv"), delimiter="," , skiprows=1)
from sklearn.manifold import mds
        
from sklearn.preprocessing import StandardScaler
data_std = StandardScaler().fit_transform(m)

from sklearn.decomposition import PCA
pca = PCA(n_components=8, svd_solver='full')
data_pca = pca.fit_transform(data_std)

mds = manifold.MDS(n_components=2, max_iter=3000, eps=1e-9,
                   dissimilarity="euclidean", n_jobs=1).fit_transform(m)



df = pd.DataFrame(mds)
df.columns=['t1','t2']

# Rename index with column names - path 
x = dict([(i,os.path.basename(i)) for i in df_csv.columns])
dfnew = df_csv.rename(index=str, columns=x)

dfnew ['']= dfnew.columns
output = dfnew.set_index('')
#output

#df['labels'] = output.columns
#df

#Convert to df to dic 
new_output = output.to_dict()
new_output
#df_new

#pyplot.scatter(df.t1, df.t2, label=df['labels'])

py.plot(data=new_output)


/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/manifold/mds.py:411: UserWarning:

The MDS API has changed. ``fit`` now constructs an dissimilarity matrix from data. To use a custom dissimilarity matrix, set ``dissimilarity='precomputed'``.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-b0915c06ea62> in <module>()
     51 #pyplot.scatter(df.t1, df.t2, label=df['labels'])
     52 
---> 53 py.plot(data=new_output)

TypeError: plot() missing 1 required positional argument: 'figure_or_data'

Conclusions:

In this example, the complete Shakya et al. 2013 metagenome was compared to small, medium, and large subsamples of itself after conservative or aggressive read filtering. We observed that the larger the subsampled percentage, the higher the similarity was between the subsampled metagenome and the complete Shakya et al. 2013 metagenome. To a lesser extent, we also observed more similarity between metagenomes that underwent the same degree of conservative or aggressive read filtering. This is consistent with the expected behavior of sourmash compare to report larger Jaccard distances for metagenomes that share a higher degree of their k-mer content.

More research needs to be done to interpret Jaccard distances calculated across diverse metagenomes that were sequenced with different chemistries and depths of coverage; however, this represents a baseline capability to compare the content of multiple metagenomes in a computationally efficient and reference database-independent manner. Given the high percentage of sequences in environmental metagenomes that have no known reference genome, it is advantageous to use this kind of approach rather than limiting metagenome comparisons to the sequences with known taxonomic classifications.