Rob Edwards, Dec 9th
This notebook selects some data sets from the WGS metagenomes that we have identified.
As background, Ben provided a list of 141,790 SRR IDs that they have identified as WGS metagenomes based on the annotations.
We developed partie to automatically detect WGS metagenomes from the SRA and have run it on 996,770 SRA datasets so far. Partie measures a few things, notably the frequency of unique k-mers (i.e. k-mers that appear only once). If you have a 16S amplicon data set, most of the k-mers appear many times since the 16S gene is so repetitive. In contrast, if you have a random WGS metagenome, a much larger fraction of the k-mers is unique (there is a toy sketch of this calculation after these notes).
Second, we compare to databases of 16S genes, prokaryotic genes, and phages.
Next, we also consider the length of the dataset, and typically rule out anything less than 5 MB. This is because there are samples with a single 150 bp read, and we can't accurately annotate them. However, this cutoff may be a bit low and we should perhaps increase it, because many, many datasets in the 5 MB range are reconstructed genomes. But there are also plenty of real metagenomes that are that size.
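To make the unique k-mer idea concrete, here is a toy sketch (this is not partie itself; the reads, the choice of k=21, and the example data are all made up for illustration):

from collections import Counter
import random

def unique_kmer_fraction(reads, k=21):
    # fraction of distinct k-mers that occur exactly once across all reads
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

# a highly repetitive (amplicon-like) read set gives a low unique fraction,
# while a random (WGS-like) read set gives a high one
random.seed(42)
amplicon_like = ["ACGTACGTACGTACGTACGTACGTACGTACGT"] * 100
wgs_like = ["".join(random.choice("ACGT") for _ in range(150)) for _ in range(100)]
print(unique_kmer_fraction(amplicon_like), unique_kmer_fraction(wgs_like))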
Note that in the file below we have two columns: Automatic Annotation is our partie prediction without filtering by size, and Partie Classification is after filtering by size. I don't bother including the original submitter's designation, but could add it if requested.
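If you want to see how often the size filter changes the call, a quick cross-tabulation works once the file has been read (the column names below are the ones used in the read_csv call later in this notebook):

# compare the unfiltered annotation with the size-filtered classification
pd.crosstab(df['Partie Automatic Annotation'], df['Partie Classification'])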
I merged these two data sets, and at the moment we have:
| Number | What | What that means |
| --- | --- | --- |
| 87,600 | AGREE | partie and NCBI say they are WGS |
| 54,076 | DISAGREE | partie says it is not WGS |
| 80,372 | NCBI MISSED | NCBI hasn't processed yet |
| 15 | PARTIE MISSED | partie hasn't processed because the data is not publicly available |
Note: these numbers are constantly changing as I am updating partie and NCBI is processing more data. The actual counts are provided below once we read the file.
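Just to make the categories concrete, here is a purely illustrative sketch of how a Status column like this could be assigned; the frame name partie_df, the set ncbi_srrs, and the 'WGS METAGENOME' label are all assumptions for illustration, not the code that actually built the file.

def assign_status(row, ncbi_srrs):
    # illustrative logic only: the labels and column values are assumed, not taken from the real pipeline
    if pd.isnull(row['Partie Classification']):
        return 'PARTIE MISSED'                     # partie has not processed this run yet
    is_wgs = row['Partie Classification'] == 'WGS METAGENOME'
    if row['SRR'] in ncbi_srrs:
        return 'AGREE' if is_wgs else 'DISAGREE'   # both lists cover this run
    return 'NCBI MISSED'                           # partie has seen it, but NCBI has not called it yet

# partie_df['Status'] = partie_df.apply(assign_status, axis=1, ncbi_srrs=ncbi_srrs)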
This data is in the file wgs_datasets.tsv.gz that I have shared on slack and that is in this directory (so you can run this notebook!). Note that it is a gzip file to save space, but pandas doesn't care.
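For what it's worth, pandas infers gzip compression from the .gz suffix, so the explicit compression='gzip' in the read_csv call below is belt-and-braces; a minimal version, without the explicit dtypes, would be:

df = pd.read_csv('wgs_datasets.tsv.gz', sep='\t')  # compression defaults to 'infer' and picks up the .gz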
I am still blown away that even though we have processed almost 1,000,000 datasets with partie there are 40,503 we hadn't seen yet. Those are processing and will be done shortly.
To generate the test data for the pre-hackathon, I'm using the 61,421 metagenomes from the AGREE dataset. These are ones that we are sure are metagenomes.
Let's use pandas to take a look at this data!
In [1]:
# set up the environment. It doesn't matter, but I usually organize all my imports at the top.
# import our environment
import os
import sys
from math import log, e, sqrt
%matplotlib inline
import pandas as pd
import seaborn as sb
import numpy as np
import matplotlib.pyplot as plt
In [2]:
# this is a neat trick for getting markdown in our output. It is totally superfluous :)
# see https://stackoverflow.com/questions/23271575/printing-bold-colored-etc-text-in-ipython-qtconsole
# for the inspiration
from IPython.display import Markdown, display
def printmd(string, color=None):
    colorstr = "<span style='color:{}'>{}</span>".format(color, string)
    display(Markdown(colorstr))
In [3]:
df = pd.read_csv('wgs_datasets.tsv.gz', compression='gzip', sep='\t', header=0,
                 dtype={'SRR': str, 'Status': str, 'NCBI Classification': str,
                        'Partie Classification': str, 'percent_unique_kmer': float,
                        'percent_16S': float, 'percent_phage': float,
                        'percent_Prokaryote': float, 'Partie Automatic Annotation': str,
                        'Size (bp)': int})
df.describe()
Out[3]:
In [4]:
df['log size']=df.apply(lambda x: log(x['Size (bp)'] + 1)/log(10), axis=1)
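An equivalent vectorized form gives the same result without the row-wise apply:

df['log size'] = np.log10(df['Size (bp)'] + 1)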
In [5]:
df['Status'].value_counts()
Out[5]:
In [6]:
df.hist(by="Status", column='log size', sharex=True, sharey=True)
Out[6]:
In [7]:
df.hist(by="Partie Classification", column='log size', sharex=True)
Out[7]:
We just filter here for the rows where we both agree that they are metagenomes. This is a robust set to work with, although once we have a pipeline running and we know how much computation we need, we should probably just do everything and see what falls out. dfa is the data frame where we agree, obvs.
In [8]:
dfa = df[df['Status'] == "AGREE"]
dfa.describe()
Out[8]:
In [9]:
s1 = dfa.sample(1000)
s1.describe()
Out[9]:
In [10]:
s1.hist(column=['Size (bp)', 'log size'])
Out[10]:
In [11]:
with open("random_selection.txt", 'w') as out:
    out.write("\n".join(s1['SRR']))
This selection has three different sets, and I've chosen them to be small, medium, and large, a third of each.

small: The 25th percentile is 4.5 × 10^7, so we take 2 × 10^7 to 7 × 10^7.

medium: The 50th percentile is 4 × 10^8, so we take 1.5 × 10^8 to 6.5 × 10^8.

large: The 75th percentile is 9 × 10^8, so we take 6.5 × 10^8 to 1.5 × 10^9.
From each set, we choose 333 samples at random.
Note: When I wrote this description and code, these were the percentiles (shown in the output from the first describe() command). However, I'm adding a bit more data, so they will change slightly, but these are probably still appropriate.
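If the percentiles have drifted by the time you run this, they are easy to recompute from the AGREE set:

# recompute the size quartiles used to define the small/medium/large bins
dfa['Size (bp)'].quantile([0.25, 0.5, 0.75])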
s2s, s2m, and s2l are sample 2: small, medium, and large.
In [12]:
s2s = dfa[(dfa['Size (bp)'] > 2e7) & (dfa['Size (bp)'] < 7e7)].sample(333)
s2s.describe()
Out[12]:
In [13]:
s2m = dfa[(dfa['Size (bp)'] > 1.5e8) & (dfa['Size (bp)'] < 6.5e8)].sample(333)
s2m.describe()
Out[13]:
In [14]:
s2l = dfa[(dfa['Size (bp)'] > 6.5e8) & (dfa['Size (bp)'] < 1.5e9)].sample(333)
s2l.describe()
Out[14]:
In [15]:
dfs = pd.concat([s2s, s2m, s2l])  # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
dfs.describe()
Out[15]:
In [16]:
dfs.hist(column=['Size (bp)', 'log size'])
Out[16]:
In [17]:
with open("size_selection.txt", 'w') as out:
    out.write("\n".join(dfs['SRR']))
In order to maximize our chances of finding phages, we should use the data we have! As part of partie, we calculate the percent of reads that map to phage genomes.
We can use that to hopefully improve our success at finding phages in the metagenomes, especially in this trial phase.
Let's take our size selections again, and this time sort by phage counts and use the top 333 rather than a random selection of metagenomes.
In the random selection above our percent phage ranges from 0 - 54%. Here our percent phage ranges from 5 - 100%. Hopefully we can find a phage or two!
In [18]:
s3s = dfa[(dfa['Size (bp)'] > 2e7) & (dfa['Size (bp)'] < 7e7)].sort_values(by='percent_phage', ascending=False).head(333)
s3m = dfa[(dfa['Size (bp)'] > 1.5e8) & (dfa['Size (bp)'] < 6.5e8)].sort_values(by='percent_phage', ascending=False).head(333)
s3l = dfa[(dfa['Size (bp)'] > 6.5e8) & (dfa['Size (bp)'] < 1.5e9)].sort_values(by='percent_phage', ascending=False).head(333)
dfp = pd.concat([s3s, s3m, s3l])  # again pd.concat rather than the removed DataFrame.append
dfp.describe()
Out[18]:
In [19]:
dfp.hist(column=['Size (bp)', 'log size'])
Out[19]:
In [20]:
with open("phage_size_selection.txt", 'w') as out:
    out.write("\n".join(dfp['SRR']))