Summary:

This notebook is for visualizing assemblies made with SPAdes and MEGAHIT.

Example Use Case:

In this example, the complete Shakya et al. 2013 metagenome is being compared to small, medium, and large subsamples of itself after conservative or aggressive read filtering. The datasets used in this example are named according to their metagenome content and relative degree of read filtering:

  • SRR606249 = Accession number for the complete Shakya et al. 2013 metagenome
  • subset50 = 50% of the complete Shakya et al. 2013 metagenome
  • subset25 = 25% of the complete Shakya et al. 2013 metagenome
  • subset10 = 10% of the complete Shakya et al. 2013 metagenome
  • pe.trim2 = Conservative read filtering
  • pe.trim30 = Aggressive read filtering
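Subsamples like these can be generated by randomly drawing a fraction of the reads. A minimal pure-Python sketch, assuming 4-line-per-record FASTQ input (the file names in the example call are hypothetical):

```python
import random

def subsample_fastq(path, out_path, fraction, seed=42):
    """Write approximately `fraction` of the reads in a FASTQ file."""
    rng = random.Random(seed)
    with open(path) as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]  # one FASTQ record
            if not record[0]:  # end of file
                break
            if rng.random() < fraction:
                fout.writelines(record)

# e.g. subsample_fastq("SRR606249.pe.trim2.fq", "subset25.fq", 0.25)
```

In practice a dedicated tool (e.g. seqtk) is faster for large files, but the logic is the same: keep each read with probability equal to the target fraction.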

Objectives:

...notebook data analysis components...

Conclusions:

In this example, the complete Shakya et al. 2013 metagenome was compared to small, medium, and large subsamples of itself after conservative or aggressive read filtering. We observed that the larger the subsampled percentage, the higher the similarity between the subsampled metagenome and the complete Shakya et al. 2013 metagenome. To a lesser extent, we also observed more similarity between metagenomes that underwent the same degree of conservative or aggressive read filtering. This is consistent with the expected behavior of sourmash compare, which reports higher Jaccard similarities for metagenomes that share more of their k-mer content.

More research needs to be done to interpret Jaccard similarities calculated across diverse metagenomes that were sequenced with different chemistries and depths of coverage; however, this represents a baseline capability for comparing the content of multiple metagenomes in a computationally efficient, reference-database-independent manner. Given the high percentage of sequences in environmental metagenomes that have no known reference genome, this kind of approach is preferable to limiting metagenome comparisons to sequences with known taxonomic classifications.
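The comparison sourmash approximates can be illustrated directly: the Jaccard similarity of two samples' k-mer sets. A toy sketch of the exact (non-MinHash) computation, using made-up sequences and a small k for readability:

```python
def kmers(seq, k=31):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two k-mer sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

full = kmers("ACGTACGTGGCCAATT", k=4)   # 12 distinct 4-mers
subset = kmers("ACGTACGT", k=4)         # 4 distinct 4-mers, all in `full`
print(jaccard(full, subset))            # 4 shared of 12 total -> 1/3
```

A subsample shares all of its k-mers with the full set, so its similarity to the full metagenome grows with the subsampled fraction, matching the trend observed above.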

Thanks,

Krista

Data download:

Input files were fetched from OSF project dm938 with the osf command-line client:

for i in $(cat prokka_tsv_names.txt)
do 
    osf -u philliptbrooks@gmail.com \
        -p dm938 \
        fetch osfstorage/functional_inference/contig_annotations/${i} ${PWD}/${i} 
    echo ${i}
done

In [1]:
# Render matplotlib figures inline in the notebook
%matplotlib inline

In [2]:
# Import pandas and matplotlib's pyplot
import pandas as pd 
import matplotlib.pyplot as plt

In [3]:
ls


Assembly_stats_comparison.csv
README.md
bandage_output_megahit_100.png
megahit_output_podar_metaG_100_quast_report/
megahit_output_podar_metaG_sub_10_quast_report/
megahit_output_podar_metaG_sub_25_quast_report/
megahit_output_podar_metaG_sub_50_quast_report/
megahit_salmon_counts_podar_100/
metagenome_assembly_comparison.ipynb
spades_output_podar_metaG_100_quast_report/
spades_output_podar_metaG_50_quast_report/
spades_output_podar_metaG_sub_10_quast_report/
spades_output_podar_metaG_sub_25_quast_report/
spades_salmon_counts_podar_100/

In [4]:
# Read in CSVs
dfA = pd.read_csv('megahit_output_podar_metaG_100_quast_report/transposed_report.tsv', sep='\t', header=0)
dfB = pd.read_csv('spades_output_podar_metaG_100_quast_report/transposed_report.tsv', sep='\t', header=0)
dfC = pd.read_csv('megahit_output_podar_metaG_sub_50_quast_report/transposed_report.tsv', sep='\t', header=0)
dfD = pd.read_csv('spades_output_podar_metaG_50_quast_report/transposed_report.tsv', sep='\t', header=0)
dfE = pd.read_csv('megahit_output_podar_metaG_sub_25_quast_report/transposed_report.tsv', sep='\t', header=0)
dfF = pd.read_csv('spades_output_podar_metaG_sub_25_quast_report/transposed_report.tsv', sep='\t', header=0)
dfG = pd.read_csv('megahit_output_podar_metaG_sub_10_quast_report/transposed_report.tsv', sep='\t', header=0)
dfH = pd.read_csv('spades_output_podar_metaG_sub_10_quast_report/transposed_report.tsv', sep='\t', header=0)

In [5]:
# Names for each of the assemblies, in the same order as the frames below
new_index = ['Megahit_Shakya_Full', 'SPAdes_Shakya_Full', 'MegaHit_Shakya_Subset_50',
             'SPAdes_Shakya_Subset_50', 'MegaHit_Shakya_Subset_25', 'SPAdes_Shakya_Subset_25',
             'MegaHit_Shakya_Subset_10', 'SPAdes_Shakya_Subset_10']

# Concatenate all of the dataframes into one table
frames = [dfA, dfB, dfC, dfD, dfE, dfF, dfG, dfH]
result = pd.concat(frames)

# Use the sample names as the index
result["names"] = new_index
output = result.set_index("names")
output


Out[5]:
Assembly # contigs (>= 0 bp) # contigs (>= 1000 bp) # contigs (>= 5000 bp) # contigs (>= 10000 bp) # contigs (>= 25000 bp) # contigs (>= 50000 bp) Total length (>= 0 bp) Total length (>= 1000 bp) Total length (>= 5000 bp) ... Total length (>= 50000 bp) # contigs Largest contig Total length GC (%) N50 N75 L50 L75 # N's per 100 kbp
names
Megahit_Shakya_Full final.contigs 25029 14831 6031 3462 1693 859 203649893 198135753 177197479 ... 102472583 19841 1067762 201670522 51.43 51470 13245 827 2774 0.0
SPAdes_Shakya_Full contigs 28005 13955 5455 3187 1543 861 200949060 195267555 174994373 ... 109622618 18162 1599574 198291683 51.49 62753 14848 673 2335 0.0
MegaHit_Shakya_Subset_50 final.contigs 40294 21154 6153 3329 1502 705 198523578 187734386 154679774 ... 79362820 31471 1063575 195102189 51.07 31786 6630 1165 4699 0.0
SPAdes_Shakya_Subset_50 contigs 44943 19672 5573 3131 1467 745 197860387 186428185 155475173 ... 87152748 29080 1154114 193179362 51.16 40111 7735 957 3853 0.0
MegaHit_Shakya_Subset_25 final.contigs 76301 30326 5529 2589 893 389 182471352 157052834 106971166 ... 43343403 54078 708009 173832988 50.27 9928 2349 2612 12592 0.0
SPAdes_Shakya_Subset_25 contigs 95353 29638 5383 2488 905 381 187469894 157123594 108377446 ... 46312125 53676 726297 174094973 50.30 10506 2372 2357 12018 0.0
MegaHit_Shakya_Subset_10 final.contigs 108780 24962 3182 1233 326 107 127510334 84162742 43820080 ... 9239419 61521 258402 109369857 49.68 2895 1065 6091 22892 0.0
SPAdes_Shakya_Subset_10 contigs 175588 26044 3110 1284 345 114 149097425 86600130 44629318 ... 10228486 66655 335357 114498408 49.64 2672 1020 6658 25325 0.0

8 rows × 22 columns
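The table above could also be built more compactly by letting pd.concat attach the sample names via its keys= parameter, instead of listing the names twice. A sketch (the demo frames are stand-ins; with the real data each value would be a pd.read_csv of a transposed_report.tsv as in In [4]):

```python
import pandas as pd

def combine_reports(frames_by_name):
    """Concatenate report frames with the sample names as the index."""
    combined = pd.concat(list(frames_by_name.values()),
                         keys=list(frames_by_name.keys()),
                         names=['names'])
    return combined.droplevel(1)  # drop the per-frame row numbers

demo = combine_reports({
    'Megahit_Shakya_Full': pd.DataFrame({'N50': [51470]}),
    'SPAdes_Shakya_Full': pd.DataFrame({'N50': [62753]}),
})
```

Keeping each name next to its file path in one dict avoids the order-mismatch risk of maintaining two parallel lists.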


In [6]:
# Export as a csv
output.to_csv('Assembly_stats_comparison.csv')

In [7]:
import numpy
# pylab exposes plot() and friends at the top level for the cells below
from pylab import *

In [8]:
ls


Assembly_stats_comparison.csv
README.md
bandage_output_megahit_100.png
megahit_output_podar_metaG_100_quast_report/
megahit_output_podar_metaG_sub_10_quast_report/
megahit_output_podar_metaG_sub_25_quast_report/
megahit_output_podar_metaG_sub_50_quast_report/
megahit_salmon_counts_podar_100/
metagenome_assembly_comparison.ipynb
spades_output_podar_metaG_100_quast_report/
spades_output_podar_metaG_50_quast_report/
spades_output_podar_metaG_sub_10_quast_report/
spades_output_podar_metaG_sub_25_quast_report/
spades_salmon_counts_podar_100/

In [9]:
# Counts from MEGAHIT assembly of Shakya complete mapping to contigs 
counts1 = [ x.split()[1] for x in open('megahit_salmon_counts_podar_100/SRR606249.quant.counts')]

counts1 = [ float(x) for x in counts1[1:] ]
counts1.sort(reverse=True)
counts1 = numpy.array(counts1)
plot(counts1, '*')


Out[9]:
[<matplotlib.lines.Line2D at 0x1138f59e8>]

In [10]:
# Counts from SPAdes assembly of Shakya complete mapping to contigs 
counts2 = [ x.split()[1] for x in open('spades_salmon_counts_podar_100/SRR606249.quant.counts')]
counts2 = [ float(x) for x in counts2[1:] ]  # drop the header line, convert to float
counts2.sort(reverse=True)
counts2 = numpy.array(counts2)
plot(counts2, '*')


Out[10]:
[<matplotlib.lines.Line2D at 0x113727e48>]

In [11]:
plt.figure()

# sp1
plt.subplot(121)

counts1 = [ x.split()[1] for x in open('megahit_salmon_counts_podar_100/SRR606249.quant.counts')]

counts1 = [ float(x) for x in counts1[1:] ]
counts1.sort(reverse=True)
counts1 = numpy.array(counts1)
plot(counts1, '*')

# sp2
plt.subplot(122)
# Counts from MEGAHIT assembly of Shakya complete mapping to contigs 
counts2 = [ x.split()[1] for x in open('megahit_salmon_counts_podar_100/SRR606249.quant.counts')]

counts2 = [ float(x) for x in counts1[1:] ]
counts2.sort(reverse=True)
counts2 = numpy.array(counts1)
plot(counts2, '*')

plt.show()



In [12]:
counts1 = [ x.split()[1] for x in open('megahit_salmon_counts_podar_100/SRR606249.quant.counts')]
counts1 = numpy.array([ float(x) for x in counts1[1:] ])

counts2 = [ x.split()[1] for x in open('spades_salmon_counts_podar_100/SRR606249.quant.counts')]
counts2 = numpy.array([ float(x) for x in counts2[1:] ])

# The two assemblies have different numbers of contigs, so truncate to the
# shorter list before plotting one set of counts against the other
n = min(len(counts1), len(counts2))
plot(counts1[:n], counts2[:n], '*')


Out[12]:
[<matplotlib.lines.Line2D at 0x1162bba90>]
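The three cells above repeat the same parsing and are easy to get wrong (note the counts1/counts2 mix-ups in the originals). The pattern could be factored into one helper; a sketch assuming the two-column (name, count) .quant.counts format with a single header line:

```python
import numpy

def load_counts(path):
    """Read the count column of a .quant.counts file, sorted descending."""
    with open(path) as fh:
        next(fh)  # skip the header line
        counts = [float(line.split()[1]) for line in fh]
    counts.sort(reverse=True)
    return numpy.array(counts)

# counts1 = load_counts('megahit_salmon_counts_podar_100/SRR606249.quant.counts')
# counts2 = load_counts('spades_salmon_counts_podar_100/SRR606249.quant.counts')
```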

In [ ]: