This notebook is for visualizing assemblies made with SPAdes and MEGAHIT.
In this example, the complete Shakya et al. 2013 metagenome is compared to small, medium, and large subsamples of itself after conservative or aggressive read filtering. The datasets are named according to their metagenome content and relative degree of read filtering:
...notebook data analysis components...
In this example, the complete Shakya et al. 2013 metagenome was compared to small, medium, and large subsamples of itself after conservative or aggressive read filtering. We observed that the larger the subsampled fraction, the more similar the subsampled metagenome was to the complete Shakya et al. 2013 metagenome. To a lesser extent, we also observed greater similarity between metagenomes that underwent the same degree of conservative or aggressive read filtering. This is consistent with the expected behavior of sourmash compare, which reports higher Jaccard similarity for metagenomes that share more of their k-mer content.
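As a rough illustration of what that comparison measures, here is a minimal Python sketch of Jaccard similarity computed over exact k-mer sets. The sequences, k-mer size, and function names below are made up for illustration; sourmash compare estimates the same quantity from scaled MinHash sketches of the real read data rather than from exact k-mer sets.

# Toy illustration of the Jaccard similarity that sourmash compare estimates.
# Exact k-mer sets of two short made-up sequences are compared; sourmash itself
# works on scaled MinHash sketches of the real reads, so this only sketches the idea.
def kmers(seq, k=5):
    # Return the set of all k-mers in a sequence (k=5 just for this toy example)
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    # Jaccard similarity: size of the intersection divided by size of the union
    return len(a & b) / len(a | b)

seq_full = "ACGTACGTGGCTAGCTAGGCTTACGATCGT"   # stand-in for the complete metagenome
seq_sub = "ACGTACGTGGCTAGCTAG"                # stand-in for a subsample of it
print(jaccard(kmers(seq_full), kmers(seq_sub)))  # closer to 1.0 means more shared k-mers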
More research is needed to interpret Jaccard similarities calculated across diverse metagenomes that were sequenced with different chemistries and depths of coverage; however, this represents a baseline capability to compare the content of multiple metagenomes in a computationally efficient, reference database-independent manner. Given the high percentage of sequences in environmental metagenomes that have no known reference genome, this kind of approach is advantageous compared to limiting metagenome comparisons to the sequences with known taxonomic classifications.
Thanks,
Krista
# Fetch each Prokka annotation TSV listed in prokka_tsv_names.txt from the
# OSF project dm938 into the current directory
for i in $(cat prokka_tsv_names.txt)
do
    osf -u philliptbrooks@gmail.com \
        -p dm938 \
        fetch osfstorage/functional_inference/contig_annotations/${i} ${PWD}/${i}
    echo ${i}
done
In [1]:
# Render matplotlib figures inline in the notebook
%matplotlib inline
In [2]:
# Import pandas and matplotlib's pyplot interface
import pandas as pd
import matplotlib.pyplot as plt
In [3]:
ls
In [4]:
# Read in the transposed QUAST report (tab-separated) for each assembly
dfA = pd.read_csv('megahit_output_podar_metaG_100_quast_report/transposed_report.tsv', sep='\t', header=0)
dfB = pd.read_csv('spades_output_podar_metaG_100_quast_report/transposed_report.tsv', sep='\t', header=0)
dfC = pd.read_csv('megahit_output_podar_metaG_sub_50_quast_report/transposed_report.tsv', sep='\t', header=0)
dfD = pd.read_csv('spades_output_podar_metaG_50_quast_report/transposed_report.tsv', sep='\t', header=0)
dfE = pd.read_csv('megahit_output_podar_metaG_sub_25_quast_report/transposed_report.tsv', sep='\t', header=0)
dfF = pd.read_csv('spades_output_podar_metaG_sub_25_quast_report/transposed_report.tsv', sep='\t', header=0)
dfG = pd.read_csv('megahit_output_podar_metaG_sub_10_quast_report/transposed_report.tsv', sep='\t', header=0)
dfH = pd.read_csv('spades_output_podar_metaG_sub_10_quast_report/transposed_report.tsv', sep='\t', header=0)
In [5]:
# Assembly names for each of the dataframes, in the same order as the frames list below
new_index = ['Megahit_Shakya_Full', 'SPAdes_Shakya_Full', 'MegaHit_Shakya_Subset_50', 'SPAdes_Shakya_Subset_50',
             'MegaHit_Shakya_Subset_25', 'SPAdes_Shakya_Subset_25', 'MegaHit_Shakya_Subset_10',
             'SPAdes_Shakya_Subset_10']
# Concatenate all of the dataframes into a single table
frames = [dfA, dfB, dfC, dfD, dfE, dfF, dfG, dfH]
result = pd.concat(frames)
# Label each row with its assembly name and use the names as the index
result["names"] = new_index
output = result.set_index("names")
output
Out[5]:
In [6]:
# Export the combined assembly statistics as a CSV
output.to_csv('Assembly_stats_comparison.csv')
In [7]:
# numpy for array handling; pylab provides the bare plot() calls used below
import numpy
from pylab import *
In [8]:
ls
In [9]:
# Per-contig counts from mapping the complete Shakya reads to the MEGAHIT assembly contigs
counts1 = [ x.split()[1] for x in open('megahit_salmon_counts_podar_100/SRR606249.quant.counts') ]
counts1 = [ float(x) for x in counts1[1:] ]   # drop the header line, convert to floats
counts1.sort(reverse=True)
counts1 = numpy.array(counts1)
plot(counts1, '*')
Out[9]:
In [10]:
# Per-contig counts from mapping the complete Shakya reads to the SPAdes assembly contigs
counts2 = [ x.split()[1] for x in open('spades_salmon_counts_podar_100/SRR606249.quant.counts') ]
counts2 = [ float(x) for x in counts2[1:] ]   # drop the header line, convert to floats
counts2.sort(reverse=True)
counts2 = numpy.array(counts2)
plot(counts2, '*')
Out[10]:
In [11]:
plt.figure()
# Left panel: counts from the MEGAHIT assembly of the complete Shakya metagenome
plt.subplot(121)
counts1 = [ x.split()[1] for x in open('megahit_salmon_counts_podar_100/SRR606249.quant.counts') ]
counts1 = [ float(x) for x in counts1[1:] ]
counts1.sort(reverse=True)
counts1 = numpy.array(counts1)
plot(counts1, '*')
# Right panel: counts from the SPAdes assembly of the complete Shakya metagenome
plt.subplot(122)
counts2 = [ x.split()[1] for x in open('spades_salmon_counts_podar_100/SRR606249.quant.counts') ]
counts2 = [ float(x) for x in counts2[1:] ]
counts2.sort(reverse=True)
counts2 = numpy.array(counts2)
plot(counts2, '*')
plt.show()
In [12]:
# Plot the sorted per-contig counts of the MEGAHIT assembly against those of the
# SPAdes assembly, comparing the two count distributions rank by rank
counts1 = [ x.split()[1] for x in open('megahit_salmon_counts_podar_100/SRR606249.quant.counts') ]
counts1 = numpy.sort(numpy.array([ float(x) for x in counts1[1:] ]))[::-1]
counts2 = [ x.split()[1] for x in open('spades_salmon_counts_podar_100/SRR606249.quant.counts') ]
counts2 = numpy.sort(numpy.array([ float(x) for x in counts2[1:] ]))[::-1]
# The assemblies have different numbers of contigs, so truncate to a common length
n = min(len(counts1), len(counts2))
plot(counts1[:n], counts2[:n], '*')
Out[12]: