Summary:

This notebook is for visualizing assemblies made with SPAdes and MEGAHIT.

Example Use Case:

In this example, the complete Shakya et al. 2013 metagenome is being compared to small, medium, and large subsamples of itself after conservative or aggressive read filtering. The datasets used in this example are named according to their metagenome content and relative degree of read filtering:

  • SRR606249 = Accession number for the complete Shakya et al. 2013 metagenome
  • subset50 = 50% of the complete Shakya et al. 2013 metagenome
  • subset25 = 25% of the complete Shakya et al. 2013 metagenome
  • subset10 = 10% of the complete Shakya et al. 2013 metagenome
  • pe.trim2 = Conservative read filtering
  • pe.trim30 = Aggressive read filtering
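Subsamples like these can be generated by randomly drawing a fraction of the reads. A minimal pure-Python sketch, assuming 4-line-per-record FASTQ input (the file names in the example call are hypothetical):

```python
import random

def subsample_fastq(path, out_path, fraction, seed=42):
    """Write approximately `fraction` of the reads in a FASTQ file."""
    rng = random.Random(seed)
    with open(path) as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]  # one FASTQ record
            if not record[0]:  # end of file
                break
            if rng.random() < fraction:
                fout.writelines(record)

# e.g. subsample_fastq("SRR606249.pe.trim2.fq", "subset25.fq", 0.25)
```

In practice a dedicated tool (e.g. seqtk) is faster for large files, but the logic is the same: keep each read with probability equal to the target fraction.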

Objectives:

...notebook data analysis components...

Conclusions:

In this example, the complete Shakya et al. 2013 metagenome was compared to small, medium, and large subsamples of itself after conservative or aggressive read filtering. We observed that the larger the subsampled percentage, the higher the similarity between the subsampled metagenome and the complete Shakya et al. 2013 metagenome. To a lesser extent, we also observed more similarity between metagenomes that underwent the same degree of conservative or aggressive read filtering. This is consistent with the expected behavior of sourmash compare, which reports higher Jaccard similarities for metagenomes that share more of their k-mer content.

More research needs to be done to interpret Jaccard similarities calculated across diverse metagenomes that were sequenced with different chemistries and depths of coverage; however, this represents a baseline capability for comparing the content of multiple metagenomes in a computationally efficient, reference-database-independent manner. Given the high percentage of sequences in environmental metagenomes that have no known reference genome, this kind of approach is preferable to limiting metagenome comparisons to sequences with known taxonomic classifications.
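The comparison sourmash approximates can be illustrated directly: the Jaccard similarity of two samples' k-mer sets. A toy sketch of the exact (non-MinHash) computation, using made-up sequences and a small k for readability:

```python
def kmers(seq, k=31):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two k-mer sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

full = kmers("ACGTACGTGGCCAATT", k=4)   # 12 distinct 4-mers
subset = kmers("ACGTACGT", k=4)         # 4 distinct 4-mers, all in `full`
print(jaccard(full, subset))            # 4 shared of 12 total -> 1/3
```

A subsample shares all of its k-mers with the full set, so its similarity to the full metagenome grows with the subsampled fraction, matching the trend observed above.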

Thanks,

Krista

Data download:

Input files were fetched from OSF project dm938 with the osf command-line client:

for i in $(cat prokka_tsv_names.txt)
do 
    osf -u philliptbrooks@gmail.com \
        -p dm938 \
        fetch osfstorage/functional_inference/contig_annotations/${i} ${PWD}/${i} 
    echo ${i}
done

In [1]:
# Render matplotlib figures inline in the notebook
%matplotlib inline

In [2]:
# Import pandas and matplotlib's pyplot
import pandas as pd 
import matplotlib.pyplot as plt

In [3]:
ls


Assembly_stats_comparison.csv
README.md
bandage_output_megahit_100.png
megahit_output_podar_metaG_100_quast_report/
megahit_output_podar_metaG_sub_10_quast_report/
megahit_output_podar_metaG_sub_25_quast_report/
megahit_output_podar_metaG_sub_50_quast_report/
megahit_salmon_counts_podar_100/
metagenome_assembly_comparison.ipynb
spades_output_podar_metaG_100_quast_report/
spades_output_podar_metaG_50_quast_report/
spades_output_podar_metaG_sub_10_quast_report/
spades_output_podar_metaG_sub_25_quast_report/
spades_salmon_counts_podar_100/

In [4]:
# Read in CSVs
dfA = pd.read_csv('megahit_output_podar_metaG_100_quast_report/transposed_report.tsv', sep='\t', header=0)
dfB = pd.read_csv('spades_output_podar_metaG_100_quast_report/transposed_report.tsv', sep='\t', header=0)
dfC = pd.read_csv('megahit_output_podar_metaG_sub_50_quast_report/transposed_report.tsv', sep='\t', header=0)
dfD = pd.read_csv('spades_output_podar_metaG_50_quast_report/transposed_report.tsv', sep='\t', header=0)
dfE = pd.read_csv('megahit_output_podar_metaG_sub_25_quast_report/transposed_report.tsv', sep='\t', header=0)
dfF = pd.read_csv('spades_output_podar_metaG_sub_25_quast_report/transposed_report.tsv', sep='\t', header=0)
dfG = pd.read_csv('megahit_output_podar_metaG_sub_10_quast_report/transposed_report.tsv', sep='\t', header=0)
dfH = pd.read_csv('spades_output_podar_metaG_sub_10_quast_report/transposed_report.tsv', sep='\t', header=0)

In [5]:
# Names for each of the assemblies, in the same order as the frames below
new_index = ['Megahit_Shakya_Full', 'SPAdes_Shakya_Full', 'MegaHit_Shakya_Subset_50',
             'SPAdes_Shakya_Subset_50', 'MegaHit_Shakya_Subset_25', 'SPAdes_Shakya_Subset_25',
             'MegaHit_Shakya_Subset_10', 'SPAdes_Shakya_Subset_10']

# Concatenate all of the dataframes into one table
frames = [dfA, dfB, dfC, dfD, dfE, dfF, dfG, dfH]
result = pd.concat(frames)

# Use the sample names as the index
result["names"] = new_index
output = result.set_index("names")
output


Out[5]:
Assembly # contigs (>= 0 bp) # contigs (>= 1000 bp) # contigs (>= 5000 bp) # contigs (>= 10000 bp) # contigs (>= 25000 bp) # contigs (>= 50000 bp) Total length (>= 0 bp) Total length (>= 1000 bp) Total length (>= 5000 bp) ... Total length (>= 50000 bp) # contigs Largest contig Total length GC (%) N50 N75 L50 L75 # N's per 100 kbp
names
Megahit_Shakya_Full final.contigs 25029 14831 6031 3462 1693 859 203649893 198135753 177197479 ... 102472583 19841 1067762 201670522 51.43 51470 13245 827 2774 0.0
SPAdes_Shakya_Full contigs 28005 13955 5455 3187 1543 861 200949060 195267555 174994373 ... 109622618 18162 1599574 198291683 51.49 62753 14848 673 2335 0.0
MegaHit_Shakya_Subset_50 final.contigs 40294 21154 6153 3329 1502 705 198523578 187734386 154679774 ... 79362820 31471 1063575 195102189 51.07 31786 6630 1165 4699 0.0
SPAdes_Shakya_Subset_50 contigs 44943 19672 5573 3131 1467 745 197860387 186428185 155475173 ... 87152748 29080 1154114 193179362 51.16 40111 7735 957 3853 0.0
MegaHit_Shakya_Subset_25 final.contigs 76301 30326 5529 2589 893 389 182471352 157052834 106971166 ... 43343403 54078 708009 173832988 50.27 9928 2349 2612 12592 0.0
SPAdes_Shakya_Subset_25 contigs 95353 29638 5383 2488 905 381 187469894 157123594 108377446 ... 46312125 53676 726297 174094973 50.30 10506 2372 2357 12018 0.0
MegaHit_Shakya_Subset_10 final.contigs 108780 24962 3182 1233 326 107 127510334 84162742 43820080 ... 9239419 61521 258402 109369857 49.68 2895 1065 6091 22892 0.0
SPAdes_Shakya_Subset_10 contigs 175588 26044 3110 1284 345 114 149097425 86600130 44629318 ... 10228486 66655 335357 114498408 49.64 2672 1020 6658 25325 0.0

8 rows × 22 columns
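The table above could also be built more compactly by letting pd.concat attach the sample names via its keys= parameter, instead of listing the names twice. A sketch (the demo frames are stand-ins; with the real data each value would be a pd.read_csv of a transposed_report.tsv as in In [4]):

```python
import pandas as pd

def combine_reports(frames_by_name):
    """Concatenate report frames with the sample names as the index."""
    combined = pd.concat(list(frames_by_name.values()),
                         keys=list(frames_by_name.keys()),
                         names=['names'])
    return combined.droplevel(1)  # drop the per-frame row numbers

demo = combine_reports({
    'Megahit_Shakya_Full': pd.DataFrame({'N50': [51470]}),
    'SPAdes_Shakya_Full': pd.DataFrame({'N50': [62753]}),
})
```

Keeping each name next to its file path in one dict avoids the order-mismatch risk of maintaining two parallel lists.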


In [6]:
# Export as a csv
output.to_csv('Assembly_stats_comparison.csv')

In [7]:
import numpy
# pylab exposes plot() and friends at the top level for the cells below
from pylab import *

In [8]:
ls


Assembly_stats_comparison.csv
README.md
bandage_output_megahit_100.png
megahit_output_podar_metaG_100_quast_report/
megahit_output_podar_metaG_sub_10_quast_report/
megahit_output_podar_metaG_sub_25_quast_report/
megahit_output_podar_metaG_sub_50_quast_report/
megahit_salmon_counts_podar_100/
metagenome_assembly_comparison.ipynb
spades_output_podar_metaG_100_quast_report/
spades_output_podar_metaG_50_quast_report/
spades_output_podar_metaG_sub_10_quast_report/
spades_output_podar_metaG_sub_25_quast_report/
spades_salmon_counts_podar_100/

In [9]:
# Counts from MEGAHIT assembly of Shakya complete mapping to contigs 
counts1 = [ x.split()[1] for x in open('megahit_salmon_counts_podar_100/SRR606249.quant.counts')]

counts1 = [ float(x) for x in counts1[1:] ]
counts1.sort(reverse=True)
counts1 = numpy.array(counts1)
plot(counts1, '*')


Out[9]:
[<matplotlib.lines.Line2D at 0x1138f59e8>]

In [10]:
# Counts from SPAdes assembly of Shakya complete mapping to contigs 
counts2 = [ x.split()[1] for x in open('spades_salmon_counts_podar_100/SRR606249.quant.counts')]
counts2 = [ float(x) for x in counts2[1:] ]  # drop the header line, convert to float
counts2.sort(reverse=True)
counts2 = numpy.array(counts2)
plot(counts2, '*')


Out[10]:
[<matplotlib.lines.Line2D at 0x113727e48>]

In [11]:
plt.figure()

# sp1
plt.subplot(121)

counts1 = [ x.split()[1] for x in open('megahit_salmon_counts_podar_100/SRR606249.quant.counts')]

counts1 = [ float(x) for x in counts1[1:] ]
counts1.sort(reverse=True)
counts1 = numpy.array(counts1)
plot(counts1, '*')

# sp2
plt.subplot(122)
# Counts from MEGAHIT assembly of Shakya complete mapping to contigs 
counts2 = [ x.split()[1] for x in open('megahit_salmon_counts_podar_100/SRR606249.quant.counts')]

counts2 = [ float(x) for x in counts1[1:] ]
counts2.sort(reverse=True)
counts2 = numpy.array(counts1)
plot(counts2, '*')

plt.show()



In [12]:
counts1 = [ x.split()[1] for x in open('megahit_salmon_counts_podar_100/SRR606249.quant.counts')]
counts1 = numpy.array([ float(x) for x in counts1[1:] ])

counts2 = [ x.split()[1] for x in open('spades_salmon_counts_podar_100/SRR606249.quant.counts')]
counts2 = numpy.array([ float(x) for x in counts2[1:] ])

# The two assemblies have different numbers of contigs, so truncate to the
# shorter list before plotting one set of counts against the other
n = min(len(counts1), len(counts2))
plot(counts1[:n], counts2[:n], '*')


Out[12]:
[<matplotlib.lines.Line2D at 0x1162bba90>]
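The three cells above repeat the same parsing and are easy to get wrong (note the counts1/counts2 mix-ups in the originals). The pattern could be factored into one helper; a sketch assuming the two-column (name, count) .quant.counts format with a single header line:

```python
import numpy

def load_counts(path):
    """Read the count column of a .quant.counts file, sorted descending."""
    with open(path) as fh:
        next(fh)  # skip the header line
        counts = [float(line.split()[1]) for line in fh]
    counts.sort(reverse=True)
    return numpy.array(counts)

# counts1 = load_counts('megahit_salmon_counts_podar_100/SRR606249.quant.counts')
# counts2 = load_counts('spades_salmon_counts_podar_100/SRR606249.quant.counts')
```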

In [ ]: