Choosing Some Data Sets for Exemplar Data

Rob Edwards, Dec 9th

This notebook selects some data sets from the WGS metagenomes that we have identified.

As background, Ben provided a list of 141,790 SRR IDs that they identified as WGS metagenomes based on the annotations.

We developed partie to automatically detect WGS metagenomes from the SRA and have run it on 996,770 SRA datasets so far. Partie measures a few things, notably the frequency of unique k-mers (i.e. k-mers that appear only once). If you have a 16S amplicon data set, most of the k-mers appear many times since the 16S gene is so repetitive. In contrast, if you have a random WGS metagenome, most of the k-mers are unique.
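
To make that concrete, here is a toy sketch of the unique k-mer fraction (an illustration only; partie's actual implementation differs):

In [ ]:
# Count the fraction of distinct k-mers that occur exactly once.
import random
from collections import Counter

def percent_unique_kmers(reads, k=21):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return 100.0 * singletons / len(counts)

random.seed(42)
# amplicon-like: the same region sequenced over and over
amplicon = ["ACGTGCATGCATGCACGTGCATGCATGCA"] * 100
# shotgun-like: random reads, so nearly every 21-mer is a singleton
shotgun = ["".join(random.choice("ACGT") for _ in range(100)) for _ in range(100)]
print(percent_unique_kmers(amplicon))  # low: every k-mer repeats
print(percent_unique_kmers(shotgun))   # high: close to 100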

Second, we compare the reads to databases of 16S genes, prokaryotic genes, and phages.
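
Conceptually (again just an illustration, not partie's code), each of those comparisons reduces to the fraction of reads that hit the reference set:

In [ ]:
# Illustrative only: a percent_phage-style value is hits over total reads.
def percent_hits(hits, total_reads):
    return 100.0 * hits / total_reads if total_reads else 0.0

print(percent_hits(1612, 100000))  # 1.612% of reads match, e.g., phage genomes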

Next, we also consider the size of the dataset, and typically rule out anything less than 5 Mbp. This is because there are samples with a single 150 bp read, and we can't accurately annotate them. However, this may be a bit low and we should perhaps increase it, because many, many datasets in the 5 Mbp range are reconstructed genomes. But there are also plenty of real metagenomes that are that size.
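
As a sketch, that filter is just a boolean mask over the size column (assuming a dataframe shaped like the one we load below):

In [ ]:
import pandas as pd

MIN_SIZE_BP = 5_000_000  # the 5 Mbp cutoff discussed above; perhaps too low

def filter_by_size(df, min_bp=MIN_SIZE_BP):
    """Drop datasets smaller than min_bp."""
    return df[df['Size (bp)'] >= min_bp]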

Note that in the file below we have two partie columns: Partie Automatic Annotation is our partie prediction without filtering by size, and Partie Classification is the prediction after filtering by size. I don't bother including the original submitter's designation, but could add it if requested.

I merged these two data sets (a toy sketch of the merge follows the note below), and at the moment we have:

Number   What           What that means
87,600   AGREE          partie and NCBI say they are WGS
54,076   DISAGREE       partie says it is not WGS
80,372   NCBI MISSED    NCBI hasn't processed yet
    15   PARTIE MISSED  partie hasn't processed it because the data is not publicly available

Note: these numbers are constantly changing as I am updating partie and NCBI is processing more data. The actual counts are provided below once we read the file.
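
As an illustration of the merge, here is a toy version with made-up SRR IDs and labels (the real files and column values differ):

In [ ]:
import pandas as pd

# hypothetical inputs: partie's calls keyed by SRR, and Ben's NCBI-derived list
partie = pd.DataFrame({'SRR': ['SRR1', 'SRR2', 'SRR3'],
                       'Partie Classification': ['WGS', 'AMPLICON', 'WGS']})
ncbi = pd.DataFrame({'SRR': ['SRR1', 'SRR2', 'SRR4']})

merged = partie.merge(ncbi, on='SRR', how='outer', indicator=True)

def status(row):
    if row['_merge'] == 'left_only':
        return 'NCBI MISSED'    # partie ran it; NCBI hasn't processed it yet
    if row['_merge'] == 'right_only':
        return 'PARTIE MISSED'  # NCBI lists it; partie hasn't run on it
    return 'AGREE' if row['Partie Classification'] == 'WGS' else 'DISAGREE'

merged['Status'] = merged.apply(status, axis=1)
print(merged[['SRR', 'Status']])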

This data is in the file wgs_datasets.tsv.gz that I have shared on Slack and is in this directory (so you can run this notebook!). Note that it is gzipped to save space, but pandas doesn't care: read_csv handles compressed files transparently.

I am still blown away that even though we have processed almost 1,000,000 datasets with partie, there are 40,503 we haven't seen yet. Those are processing and will be done shortly.

To generate the test data for the pre-hackathon, I'm using the 61,421 metagenomes from the AGREE dataset. These are the ones that we are sure are metagenomes.

Let's use pandas to take a look at this data!


In [1]:
# set up the environment. It doesn't matter, but I usually organize all my imports at the top.
import os
import sys

from math import log, e, sqrt

%matplotlib inline
import pandas as pd
import seaborn as sb
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# this is a neat trick for getting markdown in our output. It is totally superfluous :)
# see https://stackoverflow.com/questions/23271575/printing-bold-colored-etc-text-in-ipython-qtconsole
# for the inspiration
from IPython.display import Markdown, display
def printmd(string, color=None):
    colorstr = "<span style='color:{}'>{}</span>".format(color, string)
    display(Markdown(colorstr))
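
For example:

In [ ]:
printmd('**This text renders bold and green**', color='green')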

Read the data

This just creates a dataframe from the file


In [3]:
df = pd.read_csv('wgs_datasets.tsv.gz', compression='gzip', sep='\t', header=0,
                 dtype={'SRR': str, 'Status': str, 'NCBI Classification': str,
                        'Partie Classification': str, 'percent_unique_kmer': float,
                        'percent_16S': float, 'percent_phage': float,
                        'percent_Prokaryote': float,
                        'Partie Automatic Annotation': str, 'Size (bp)': int})
df.describe()


Out[3]:
percent_unique_kmer percent_16S percent_phage percent_Prokaryote Size (bp)
count 220189.000000 220189.000000 220189.000000 220189.000000 2.220480e+05
mean 69.150160 9.052121 1.611870 23.709796 2.197607e+09
std 32.283897 24.816231 8.021131 29.433564 6.141813e+09
min 0.000000 0.000000 0.000000 0.000000 0.000000e+00
25% 53.154789 0.000000 0.000000 0.570000 4.968207e+07
50% 83.265600 0.110000 0.010000 10.900000 4.569872e+08
75% 94.749232 0.260000 0.180000 35.080000 1.804921e+09
max 100.000000 100.000000 100.000000 100.000000 2.730726e+11

Adding a log size column

It's size data, and it's not going to plot clearly, so we add a column with the log of the size so we can see what the distributions look like. Note that we use log10 so the values are easier to interpret.


In [4]:
# vectorized log10; the +1 avoids log(0) for empty datasets
df['log size'] = np.log10(df['Size (bp)'] + 1)

How we compare

Do we agree or not?


In [5]:
df['Status'].value_counts()


Out[5]:
AGREE          87600
NCBI MISSED    80372
DISAGREE       54076
Name: Status, dtype: int64

Size distributions

Here is a quick histogram of the sizes of all the data, just so we get an idea.


In [6]:
df.hist(by="Status", column='log size', sharex=True, sharey=True)


Out[6]:
[figure: 2x2 grid of 'log size' histograms, one panel per Status]

In [7]:
df.hist(by="Partie Classification", column='log size', sharex=True)


Out[7]:
[figure: 'log size' histograms, one panel per Partie Classification]

Choose the random metagenomes

We just filter here for rows where we both agree that they are metagenomes. This is a robust set to work with, although once we have a pipeline running and we know how much computation we need, we should probably just run everything and see what falls out.

dfa is a data frame where we agree, obvs.


In [8]:
dfa = df[df['Status'] == "AGREE"]
dfa.describe()


Out[8]:
percent_unique_kmer percent_16S percent_phage percent_Prokaryote Size (bp) log size
count 87600.000000 87600.000000 87600.000000 87600.000000 8.760000e+04 87600.000000
mean 87.457616 1.308648 1.057385 20.637408 1.178075e+09 8.556748
std 14.083073 7.050435 4.813508 20.535379 1.473818e+09 0.976517
min 0.400864 0.000000 0.000000 0.000000 0.000000e+00 0.000000
25% 83.153437 0.060000 0.000000 4.180000 1.143643e+08 8.058291
50% 92.710567 0.140000 0.020000 15.420000 6.139474e+08 8.788131
75% 97.060062 0.220000 0.140000 30.140000 1.574446e+09 9.197128
max 100.000000 100.000000 99.970000 100.000000 2.660719e+10 10.424999

Sample 1 - A Random Selection

Let's randomly sample 1,000 datasets from our list where we AGREE that they are WGS samples.

s1 is sample 1.


In [9]:
s1 = dfa.sample(1000)
s1.describe()


Out[9]:
percent_unique_kmer percent_16S percent_phage percent_Prokaryote Size (bp) log size
count 1000.000000 1000.000000 1000.000000 1000.000000 1.000000e+03 1000.000000
mean 86.616445 1.306690 1.028940 21.187310 1.170206e+09 8.522147
std 14.394792 7.124295 4.728729 20.566357 1.536446e+09 0.981353
min 1.054877 0.000000 0.000000 0.000000 1.500000e+02 2.178977
25% 81.170772 0.060000 0.000000 4.080000 9.681108e+07 7.985925
50% 92.217166 0.150000 0.020000 16.930000 5.694686e+08 8.755470
75% 96.876754 0.220000 0.130000 31.112500 1.553505e+09 9.191312
max 100.000000 100.000000 57.090000 100.000000 1.132143e+10 10.053901

Plot the data.

Remember that we have a cutoff of 5 Mbp (5 x 10^6 bp), which explains the lower limit here.


In [10]:
s1.hist(column=['Size (bp)', 'log size'])


Out[10]:
[figure: side-by-side histograms of 'Size (bp)' and 'log size' for the random sample]

These are those IDs, written to a file called random_selection.txt.

Remember that the sampling here is random, so each time you run it you'll get a different output! (Pass random_state to sample() if you need a reproducible selection.)


In [11]:
with open("random_selection.txt", 'w') as out:
    out.write("\n".join(s1['SRR']))

Sample 2 - A Size-Selected Selection

This selection has three different sets, and I've chosen them to be small, medium, and large, a third of each.

small : The 25th percentile is 4.5 x 10^7, so we take 2 x 10^7 to 7 x 10^7.

medium : The 50th percentile is 4 x 10^8, so we take 1.5 x 10^8 to 6.5 x 10^8.

large : The 75th percentile is 9 x 10^8, so we take 6.5 x 10^8 to 1.5 x 10^9.

From each set, we choose 333 samples at random.

Note: when I wrote this description and code, these were the percentiles (shown in the output from the first describe() command above). However, I'm adding a bit more data, so they will change slightly, but these ranges are probably still appropriate.

s2s, s2m, and s2l are sample 2, small, medium, and large.


In [12]:
s2s = dfa[(dfa['Size (bp)'] > 2e7) & (dfa['Size (bp)'] < 7e7)].sample(333)
s2s.describe()


Out[12]:
percent_unique_kmer percent_16S percent_phage percent_Prokaryote Size (bp) log size
count 333.000000 333.000000 333.000000 333.000000 3.330000e+02 333.000000
mean 81.632995 0.850631 0.257838 19.526787 4.203816e+07 7.598143
std 15.377918 6.431853 2.059734 17.815105 1.411548e+07 0.151307
min 1.297722 0.000000 0.000000 0.000000 2.007900e+07 7.302742
25% 69.683673 0.060000 0.000000 7.330000 2.995458e+07 7.476463
50% 83.995852 0.120000 0.020000 14.370000 4.073370e+07 7.609954
75% 95.832181 0.200000 0.060000 28.330000 5.336042e+07 7.727219
max 99.653612 73.680000 26.620000 93.570000 6.985060e+07 7.844170

In [13]:
s2m = dfa[(dfa['Size (bp)'] > 1.5e8) & (dfa['Size (bp)'] < 6.5e8)].sample(333)
s2m.describe()


Out[13]:
percent_unique_kmer percent_16S percent_phage percent_Prokaryote Size (bp) log size
count 333.000000 333.000000 333.000000 333.000000 3.330000e+02 333.000000
mean 89.790225 0.247057 0.249279 21.965285 4.425031e+08 8.624230
std 10.874524 0.741782 0.948176 22.532429 1.261778e+08 0.146153
min 51.805587 0.000000 0.000000 0.030000 1.538449e+08 8.187083
25% 86.810181 0.080000 0.000000 5.420000 3.600193e+08 8.556326
50% 94.501335 0.170000 0.010000 13.660000 4.556504e+08 8.658632
75% 97.263432 0.230000 0.110000 31.450000 5.411954e+08 8.733354
max 99.629820 7.820000 13.250000 97.260000 6.492502e+08 8.812412

In [14]:
s2l = dfa[(dfa['Size (bp)'] > 6.5e8) & (dfa['Size (bp)'] < 1.5e9)].sample(333)
s2l.describe()


Out[14]:
percent_unique_kmer percent_16S percent_phage percent_Prokaryote Size (bp) log size
count 333.000000 333.000000 333.000000 333.000000 3.330000e+02 333.000000
mean 88.888272 0.435165 0.231081 20.721772 9.721343e+08 8.974138
std 13.462267 2.669578 0.758849 19.601196 2.483251e+08 0.107804
min 6.278427 0.000000 0.000000 0.000000 6.518720e+08 8.814162
25% 86.871641 0.080000 0.000000 5.640000 7.767546e+08 8.890284
50% 93.353438 0.160000 0.010000 16.210000 8.640000e+08 8.936514
75% 97.379258 0.210000 0.090000 28.670000 1.190928e+09 9.075886
max 99.444556 30.890000 5.330000 94.520000 1.488826e+09 9.172844

Join the size-selected samples

Make one big data frame. Note that we use pd.concat, which returns a new frame; the old DataFrame.append did the same but was removed in pandas 2.0.


In [15]:
# concatenate the three size bins into one data frame
# (DataFrame.append was removed in pandas 2.0; pd.concat is the current idiom)
dfs = pd.concat([s2s, s2m, s2l])
dfs.describe()


Out[15]:
percent_unique_kmer percent_16S percent_phage percent_Prokaryote Size (bp) log size
count 999.000000 999.000000 999.000000 999.000000 9.990000e+02 999.000000
mean 86.770497 0.510951 0.246066 20.737948 4.855585e+08 8.398837
std 13.843553 4.047151 1.379167 20.081901 4.136769e+08 0.599909
min 1.297722 0.000000 0.000000 0.000000 2.007900e+07 7.302742
25% 79.982523 0.080000 0.000000 6.090000 5.337368e+07 7.727327
50% 92.439168 0.150000 0.020000 14.560000 4.556504e+08 8.658632
75% 97.058517 0.210000 0.080000 29.465000 7.761238e+08 8.889931
max 99.653612 73.680000 26.620000 97.260000 1.488826e+09 9.172844

In [16]:
dfs.hist(column=['Size (bp)', 'log size'])


Out[16]:
[figure: side-by-side histograms of 'Size (bp)' and 'log size' for the size-selected set]

In [17]:
with open("size_selection.txt", 'w') as out:
    out.write("\n".join(dfs['SRR']))

Sample 3 - Size and Phage Selection

In order to maximize our success of finding phages, we should use the data we have! As part of partie, we calculate the percent of reads that map to phage genomes.

We can use that signal to improve our chances of finding phages in the metagenomes – especially in this trial phase.

Let's take our size selections again, but this time sort by percent phage and take the top 333, rather than a random selection of metagenomes.

In the random selection above, our percent phage ranges from 0% to 54%. Here it ranges from 5% to 100%. Hopefully we can find a phage or two!


In [18]:
s3s = dfa[(dfa['Size (bp)'] > 2e7) & (dfa['Size (bp)'] < 7e7)].sort_values(by='percent_phage', ascending=False).head(333)
s3m = dfa[(dfa['Size (bp)'] > 1.5e8) & (dfa['Size (bp)'] < 6.5e8)].sort_values(by='percent_phage', ascending=False).head(333)
s3l = dfa[(dfa['Size (bp)'] > 6.5e8) & (dfa['Size (bp)'] < 1.5e9)].sort_values(by='percent_phage', ascending=False).head(333)

dfp = pd.concat([s3s, s3m, s3l])  # concatenate the three bins, as above
dfp.describe()


Out[18]:
percent_unique_kmer percent_16S percent_phage percent_Prokaryote Size (bp) log size
count 999.000000 999.000000 999.000000 999.000000 9.990000e+02 999.000000
mean 84.637763 2.301261 7.689830 45.319990 4.732711e+08 8.385690
std 14.494037 5.548649 7.673417 30.770878 4.103529e+08 0.598404
min 13.638484 0.000000 1.700000 0.000000 2.002338e+07 7.301537
25% 76.849251 0.150000 3.480000 17.040000 5.343003e+07 7.727785
50% 89.353225 0.220000 4.620000 35.030000 4.077483e+08 8.610392
75% 95.815232 0.550000 8.385000 79.280000 7.747141e+08 8.889141
max 99.908387 39.670000 90.480000 98.330000 1.499848e+09 9.176047

Size Distribution

This looks about the same as the random selection. We could test it, but I don't think it's that important - we are just trying to optimize the workflow.


In [19]:
dfp.hist(column=['Size (bp)', 'log size'])


Out[19]:
[figure: side-by-side histograms of 'Size (bp)' and 'log size' for the phage- and size-selected set]

Write the list out


In [20]:
with open("phage_size_selection.txt", 'w') as out:
    out.write("\n".join(dfp['SRR']))