Choosing Some Data Sets for Exemplar Data

Rob Edwards, Dec 9th

This notebook selects some data sets from the WGS metagenomes that we have identified.

As background, Ben provided a list of 141,790 SRR IDs that they identified as WGS metagenomes based on the annotations.

We developed partie to automatically detect WGS metagenomes from the SRA and have run it on 996,770 SRA datasets so far. Partie measures a few things, notably the frequency of unique k-mers (i.e. k-mers that appear only once). If you have a 16S amplicon data set, most of the k-mers appear many times since the 16S gene is so repetitive. In contrast, if you have a random WGS metagenome, most of the k-mers are unique.
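
To make that concrete, here is a toy sketch of the unique k-mer fraction (an illustration only; partie's actual implementation differs):

In [ ]:
# Count the fraction of distinct k-mers that occur exactly once.
import random
from collections import Counter

def percent_unique_kmers(reads, k=21):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return 100.0 * singletons / len(counts)

random.seed(42)
# amplicon-like: the same region sequenced over and over
amplicon = ["ACGTGCATGCATGCACGTGCATGCATGCA"] * 100
# shotgun-like: random reads, so nearly every 21-mer is a singleton
shotgun = ["".join(random.choice("ACGT") for _ in range(100)) for _ in range(100)]
print(percent_unique_kmers(amplicon))  # low: every k-mer repeats
print(percent_unique_kmers(shotgun))   # high: close to 100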

Second, we compare the reads to databases of 16S genes, prokaryotic genes, and phages.
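
Conceptually (again just an illustration, not partie's code), each of those comparisons reduces to the fraction of reads that hit the reference set:

In [ ]:
# Illustrative only: a percent_phage-style value is hits over total reads.
def percent_hits(hits, total_reads):
    return 100.0 * hits / total_reads if total_reads else 0.0

print(percent_hits(1612, 100000))  # 1.612% of reads match, e.g., phage genomes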

Next, we also consider the size of the dataset, and typically rule out anything less than 5 Mbp. This is because there are samples with a single 150 bp read, and we can't accurately annotate them. However, this may be a bit low and we should perhaps increase it, because many, many datasets in the 5 Mbp range are reconstructed genomes. But there are also plenty of real metagenomes that are that size.
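
As a sketch, that filter is just a boolean mask over the size column (assuming a dataframe shaped like the one we load below):

In [ ]:
import pandas as pd

MIN_SIZE_BP = 5_000_000  # the 5 Mbp cutoff discussed above; perhaps too low

def filter_by_size(df, min_bp=MIN_SIZE_BP):
    """Drop datasets smaller than min_bp."""
    return df[df['Size (bp)'] >= min_bp]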

Note that in the file below we have two partie columns: Partie Automatic Annotation is our partie prediction without filtering by size, and Partie Classification is the prediction after filtering by size. I don't bother including the original submitter's designation, but could add it if requested.

I merged these two data sets (a toy sketch of the merge follows the note below), and at the moment we have:

Number   What           What that means
87,600   AGREE          partie and NCBI say they are WGS
54,076   DISAGREE       partie says it is not WGS
80,372   NCBI MISSED    NCBI hasn't processed yet
    15   PARTIE MISSED  partie hasn't processed it because the data is not publicly available

Note: these numbers are constantly changing as I am updating partie and NCBI is processing more data. The actual counts are provided below once we read the file.
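
As an illustration of the merge, here is a toy version with made-up SRR IDs and labels (the real files and column values differ):

In [ ]:
import pandas as pd

# hypothetical inputs: partie's calls keyed by SRR, and Ben's NCBI-derived list
partie = pd.DataFrame({'SRR': ['SRR1', 'SRR2', 'SRR3'],
                       'Partie Classification': ['WGS', 'AMPLICON', 'WGS']})
ncbi = pd.DataFrame({'SRR': ['SRR1', 'SRR2', 'SRR4']})

merged = partie.merge(ncbi, on='SRR', how='outer', indicator=True)

def status(row):
    if row['_merge'] == 'left_only':
        return 'NCBI MISSED'    # partie ran it; NCBI hasn't processed it yet
    if row['_merge'] == 'right_only':
        return 'PARTIE MISSED'  # NCBI lists it; partie hasn't run on it
    return 'AGREE' if row['Partie Classification'] == 'WGS' else 'DISAGREE'

merged['Status'] = merged.apply(status, axis=1)
print(merged[['SRR', 'Status']])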

This data is in the file wgs_datasets.tsv.gz that I have shared on Slack and is in this directory (so you can run this notebook!). Note that it is gzipped to save space, but pandas doesn't care: read_csv handles compressed files transparently.

I am still blown away that even though we have processed almost 1,000,000 datasets with partie, there are 40,503 we haven't seen yet. Those are processing and will be done shortly.

To generate the test data for the pre-hackathon, I'm using the 61,421 metagenomes from the AGREE dataset. These are the ones that we are sure are metagenomes.

Let's use pandas to take a look at this data!


In [1]:
# set up the environment. It doesn't matter, but I usually organize all my imports at the top.
import os
import sys

from math import log, e, sqrt

%matplotlib inline
import pandas as pd
import seaborn as sb
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# this is a neat trick for getting markdown in our output. It is totally superfluous :)
# see https://stackoverflow.com/questions/23271575/printing-bold-colored-etc-text-in-ipython-qtconsole
# for the inspiration
from IPython.display import Markdown, display
def printmd(string, color=None):
    colorstr = "<span style='color:{}'>{}</span>".format(color, string)
    display(Markdown(colorstr))
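
For example:

In [ ]:
printmd('**This text renders bold and green**', color='green')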

Read the data

This just creates a dataframe from the file


In [3]:
df = pd.read_csv('wgs_datasets.tsv.gz', compression='gzip', sep='\t', header=0,
                 dtype={'SRR': str, 'Status': str, 'NCBI Classification': str,
                        'Partie Classification': str, 'percent_unique_kmer': float,
                        'percent_16S': float, 'percent_phage': float,
                        'percent_Prokaryote': float,
                        'Partie Automatic Annotation': str, 'Size (bp)': int})
df.describe()


Out[3]:
percent_unique_kmer percent_16S percent_phage percent_Prokaryote Size (bp)
count 220189.000000 220189.000000 220189.000000 220189.000000 2.220480e+05
mean 69.150160 9.052121 1.611870 23.709796 2.197607e+09
std 32.283897 24.816231 8.021131 29.433564 6.141813e+09
min 0.000000 0.000000 0.000000 0.000000 0.000000e+00
25% 53.154789 0.000000 0.000000 0.570000 4.968207e+07
50% 83.265600 0.110000 0.010000 10.900000 4.569872e+08
75% 94.749232 0.260000 0.180000 35.080000 1.804921e+09
max 100.000000 100.000000 100.000000 100.000000 2.730726e+11

Adding a log size column

It's size data, and it's not going to plot clearly, so we add a column with the log of the size so we can see what the distributions look like. Note that we use log10 so the values are easier to interpret.


In [4]:
# vectorized log10; the +1 avoids log(0) for empty datasets
df['log size'] = np.log10(df['Size (bp)'] + 1)

How we compare

Do we agree or not?


In [5]:
df['Status'].value_counts()


Out[5]:
AGREE          87600
NCBI MISSED    80372
DISAGREE       54076
Name: Status, dtype: int64

Size distributions

Here is a quick histogram of the sizes of all the data, just so we get an idea.


In [6]:
df.hist(by="Status", column='log size', sharex=True, sharey=True)


Out[6]:
[figure: 2x2 grid of 'log size' histograms, one panel per Status]

In [7]:
df.hist(by="Partie Classification", column='log size', sharex=True)


Out[7]:
[figure: 'log size' histograms, one panel per Partie Classification]

Choose the random metagenomes

We just filter here for rows where we both agree that they are metagenomes. This is a robust set to work with, although once we have a pipeline running and we know how much computation we need, we should probably just run everything and see what falls out.

dfa is a data frame where we agree, obvs.


In [8]:
dfa = df[df['Status'] == "AGREE"]
dfa.describe()


Out[8]:
percent_unique_kmer percent_16S percent_phage percent_Prokaryote Size (bp) log size
count 87600.000000 87600.000000 87600.000000 87600.000000 8.760000e+04 87600.000000
mean 87.457616 1.308648 1.057385 20.637408 1.178075e+09 8.556748
std 14.083073 7.050435 4.813508 20.535379 1.473818e+09 0.976517
min 0.400864 0.000000 0.000000 0.000000 0.000000e+00 0.000000
25% 83.153437 0.060000 0.000000 4.180000 1.143643e+08 8.058291
50% 92.710567 0.140000 0.020000 15.420000 6.139474e+08 8.788131
75% 97.060062 0.220000 0.140000 30.140000 1.574446e+09 9.197128
max 100.000000 100.000000 99.970000 100.000000 2.660719e+10 10.424999

Sample 1 - A Random Selection

Let's randomly sample 1,000 datasets from our list where we AGREE that they are WGS samples.

s1 is sample 1.


In [9]:
s1 = dfa.sample(1000)
s1.describe()


Out[9]:
percent_unique_kmer percent_16S percent_phage percent_Prokaryote Size (bp) log size
count 1000.000000 1000.000000 1000.000000 1000.000000 1.000000e+03 1000.000000
mean 86.616445 1.306690 1.028940 21.187310 1.170206e+09 8.522147
std 14.394792 7.124295 4.728729 20.566357 1.536446e+09 0.981353
min 1.054877 0.000000 0.000000 0.000000 1.500000e+02 2.178977
25% 81.170772 0.060000 0.000000 4.080000 9.681108e+07 7.985925
50% 92.217166 0.150000 0.020000 16.930000 5.694686e+08 8.755470
75% 96.876754 0.220000 0.130000 31.112500 1.553505e+09 9.191312
max 100.000000 100.000000 57.090000 100.000000 1.132143e+10 10.053901

Plot the data.

Remember that we have a cutoff of 5 Mbp (5 x 10^6 bp), which explains the lower limit here.


In [10]:
s1.hist(column=['Size (bp)', 'log size'])


Out[10]:
[figure: side-by-side histograms of 'Size (bp)' and 'log size' for the random sample]

These are those IDs, written to a file called random_selection.txt.

Remember that the sampling here is random, so each time you run it you'll get a different output! (Pass random_state to sample() if you need a reproducible selection.)


In [11]:
with open("random_selection.txt", 'w') as out:
    out.write("\n".join(s1['SRR']))

Sample 2 - A Size-Selected Selection

This selection has three different sets, and I've chosen them to be small, medium, and large, a third of each.

small : The 25th percentile is 4.5 x 10^7, so we take 2 x 10^7 to 7 x 10^7.

medium : The 50th percentile is 4 x 10^8, so we take 1.5 x 10^8 to 6.5 x 10^8.

large : The 75th percentile is 9 x 10^8, so we take 6.5 x 10^8 to 1.5 x 10^9.

From each set, we choose 333 samples at random.

Note: when I wrote this description and code, these were the percentiles (shown in the output from the first describe() command above). However, I'm adding a bit more data, so they will change slightly, but these ranges are probably still appropriate.

s2s, s2m, and s2l are sample 2, small, medium, and large.


In [12]:
s2s = dfa[(dfa['Size (bp)'] > 2e7) & (dfa['Size (bp)'] < 7e7)].sample(333)
s2s.describe()


Out[12]:
percent_unique_kmer percent_16S percent_phage percent_Prokaryote Size (bp) log size
count 333.000000 333.000000 333.000000 333.000000 3.330000e+02 333.000000
mean 81.632995 0.850631 0.257838 19.526787 4.203816e+07 7.598143
std 15.377918 6.431853 2.059734 17.815105 1.411548e+07 0.151307
min 1.297722 0.000000 0.000000 0.000000 2.007900e+07 7.302742
25% 69.683673 0.060000 0.000000 7.330000 2.995458e+07 7.476463
50% 83.995852 0.120000 0.020000 14.370000 4.073370e+07 7.609954
75% 95.832181 0.200000 0.060000 28.330000 5.336042e+07 7.727219
max 99.653612 73.680000 26.620000 93.570000 6.985060e+07 7.844170

In [13]:
s2m = dfa[(dfa['Size (bp)'] > 1.5e8) & (dfa['Size (bp)'] < 6.5e8)].sample(333)
s2m.describe()


Out[13]:
percent_unique_kmer percent_16S percent_phage percent_Prokaryote Size (bp) log size
count 333.000000 333.000000 333.000000 333.000000 3.330000e+02 333.000000
mean 89.790225 0.247057 0.249279 21.965285 4.425031e+08 8.624230
std 10.874524 0.741782 0.948176 22.532429 1.261778e+08 0.146153
min 51.805587 0.000000 0.000000 0.030000 1.538449e+08 8.187083
25% 86.810181 0.080000 0.000000 5.420000 3.600193e+08 8.556326
50% 94.501335 0.170000 0.010000 13.660000 4.556504e+08 8.658632
75% 97.263432 0.230000 0.110000 31.450000 5.411954e+08 8.733354
max 99.629820 7.820000 13.250000 97.260000 6.492502e+08 8.812412

In [14]:
s2l = dfa[(dfa['Size (bp)'] > 6.5e8) & (dfa['Size (bp)'] < 1.5e9)].sample(333)
s2l.describe()


Out[14]:
percent_unique_kmer percent_16S percent_phage percent_Prokaryote Size (bp) log size
count 333.000000 333.000000 333.000000 333.000000 3.330000e+02 333.000000
mean 88.888272 0.435165 0.231081 20.721772 9.721343e+08 8.974138
std 13.462267 2.669578 0.758849 19.601196 2.483251e+08 0.107804
min 6.278427 0.000000 0.000000 0.000000 6.518720e+08 8.814162
25% 86.871641 0.080000 0.000000 5.640000 7.767546e+08 8.890284
50% 93.353438 0.160000 0.010000 16.210000 8.640000e+08 8.936514
75% 97.379258 0.210000 0.090000 28.670000 1.190928e+09 9.075886
max 99.444556 30.890000 5.330000 94.520000 1.488826e+09 9.172844

Join the size-selected samples

Make one big data frame. Note that we use pd.concat, which returns a new frame; the old DataFrame.append did the same but was removed in pandas 2.0.


In [15]:
# concatenate the three size bins into one data frame
# (DataFrame.append was removed in pandas 2.0; pd.concat is the current idiom)
dfs = pd.concat([s2s, s2m, s2l])
dfs.describe()


Out[15]:
percent_unique_kmer percent_16S percent_phage percent_Prokaryote Size (bp) log size
count 999.000000 999.000000 999.000000 999.000000 9.990000e+02 999.000000
mean 86.770497 0.510951 0.246066 20.737948 4.855585e+08 8.398837
std 13.843553 4.047151 1.379167 20.081901 4.136769e+08 0.599909
min 1.297722 0.000000 0.000000 0.000000 2.007900e+07 7.302742
25% 79.982523 0.080000 0.000000 6.090000 5.337368e+07 7.727327
50% 92.439168 0.150000 0.020000 14.560000 4.556504e+08 8.658632
75% 97.058517 0.210000 0.080000 29.465000 7.761238e+08 8.889931
max 99.653612 73.680000 26.620000 97.260000 1.488826e+09 9.172844

In [16]:
dfs.hist(column=['Size (bp)', 'log size'])


Out[16]:
[figure: side-by-side histograms of 'Size (bp)' and 'log size' for the size-selected set]

In [17]:
with open("size_selection.txt", 'w') as out:
    out.write("\n".join(dfs['SRR']))

Sample 3 - Size and Phage Selection

In order to maximize our success of finding phages, we should use the data we have! As part of partie, we calculate the percent of reads that map to phage genomes.

We can use that signal to improve our chances of finding phages in the metagenomes – especially in this trial phase.

Let's take our size selections again, but this time sort by percent phage and take the top 333, rather than a random selection of metagenomes.

In the random selection above, our percent phage ranges from 0% to 54%. Here it ranges from 5% to 100%. Hopefully we can find a phage or two!


In [18]:
s3s = dfa[(dfa['Size (bp)'] > 2e7) & (dfa['Size (bp)'] < 7e7)].sort_values(by='percent_phage', ascending=False).head(333)
s3m = dfa[(dfa['Size (bp)'] > 1.5e8) & (dfa['Size (bp)'] < 6.5e8)].sort_values(by='percent_phage', ascending=False).head(333)
s3l = dfa[(dfa['Size (bp)'] > 6.5e8) & (dfa['Size (bp)'] < 1.5e9)].sort_values(by='percent_phage', ascending=False).head(333)

dfp = pd.concat([s3s, s3m, s3l])  # concatenate the three bins, as above
dfp.describe()


Out[18]:
percent_unique_kmer percent_16S percent_phage percent_Prokaryote Size (bp) log size
count 999.000000 999.000000 999.000000 999.000000 9.990000e+02 999.000000
mean 84.637763 2.301261 7.689830 45.319990 4.732711e+08 8.385690
std 14.494037 5.548649 7.673417 30.770878 4.103529e+08 0.598404
min 13.638484 0.000000 1.700000 0.000000 2.002338e+07 7.301537
25% 76.849251 0.150000 3.480000 17.040000 5.343003e+07 7.727785
50% 89.353225 0.220000 4.620000 35.030000 4.077483e+08 8.610392
75% 95.815232 0.550000 8.385000 79.280000 7.747141e+08 8.889141
max 99.908387 39.670000 90.480000 98.330000 1.499848e+09 9.176047

Size Distribution

This looks about the same as the random selection. We could test it, but I don't think it's that important - we are just trying to optimize the workflow.


In [19]:
dfp.hist(column=['Size (bp)', 'log size'])


Out[19]:
[figure: side-by-side histograms of 'Size (bp)' and 'log size' for the phage- and size-selected set]

Write the list out


In [20]:
with open("phage_size_selection.txt", 'w') as out:
    out.write("\n".join(dfp['SRR']))