Trying to generate a plot showing the distribution of metagenome sizes by year, and hoping to do this by pulling data with the EBI restful API.

Some links that work:


In [1]:
import numpy as np
import matplotlib
matplotlib.use('agg')
import pandas as pd
%pylab inline


Populating the interactive namespace from numpy and matplotlib

Start with xml describing all studies where taxon is metagenome, and get those study ids.


In [2]:
!grep PRIMARY_ID ena.xml | wc -l


    2448

In [3]:
!grep PRIMARY_ID ena.xml | head


          <PRIMARY_ID>DRP000009</PRIMARY_ID>
          <PRIMARY_ID>DRP000160</PRIMARY_ID>
          <PRIMARY_ID>DRP000316</PRIMARY_ID>
          <PRIMARY_ID>DRP000319</PRIMARY_ID>
          <PRIMARY_ID>DRP000379</PRIMARY_ID>
          <PRIMARY_ID>DRP000446</PRIMARY_ID>
          <PRIMARY_ID>DRP000458</PRIMARY_ID>
          <PRIMARY_ID>DRP000463</PRIMARY_ID>
          <PRIMARY_ID>DRP000487</PRIMARY_ID>
          <PRIMARY_ID>DRP000498</PRIMARY_ID>

In [4]:
study_ids = !grep PRIMARY_ID ena.xml
print len(study_ids)
print study_ids[:10]
study_ids = [e.strip()[12:21] for e in study_ids]
print len(study_ids)
print study_ids[:10]


2448
['          <PRIMARY_ID>DRP000009</PRIMARY_ID>', '          <PRIMARY_ID>DRP000160</PRIMARY_ID>', '          <PRIMARY_ID>DRP000316</PRIMARY_ID>', '          <PRIMARY_ID>DRP000319</PRIMARY_ID>', '          <PRIMARY_ID>DRP000379</PRIMARY_ID>', '          <PRIMARY_ID>DRP000446</PRIMARY_ID>', '          <PRIMARY_ID>DRP000458</PRIMARY_ID>', '          <PRIMARY_ID>DRP000463</PRIMARY_ID>', '          <PRIMARY_ID>DRP000487</PRIMARY_ID>', '          <PRIMARY_ID>DRP000498</PRIMARY_ID>']
2448
['DRP000009', 'DRP000160', 'DRP000316', 'DRP000319', 'DRP000379', 'DRP000446', 'DRP000458', 'DRP000463', 'DRP000487', 'DRP000498']

Pull data on read counts per run on those studies, as well as year the study was deposited


In [5]:
data = []
for e in study_ids:
    url = """'http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession="%s"&result=read_run&fields=study_accession,sample_accession,run_accession,instrument_model,read_count,first_public'""" % e
    current_data = !curl $url
    data += [l.strip().split('\t') for l in current_data if l.startswith(e)]

In [6]:
columns = 'study_accession\tsample_accession\trun_accession\tinstrument_model\tread_count\tfirst_public'.split('\t')
print columns


['study_accession', 'sample_accession', 'run_accession', 'instrument_model', 'read_count', 'first_public']

Load the results into a pandas DataFrame


In [7]:
df = pd.DataFrame(data, columns=columns)
df


Out[7]:
study_accession sample_accession run_accession instrument_model read_count first_public
0 DRP000009 DRS000009 DRR000019 454 GS FLX 11052 2011-03-31
1 DRP000009 DRS000009 DRR000020 454 GS FLX 486919 2011-03-31
2 DRP000009 DRS000009 DRR000021 454 GS FLX 518041 2011-03-31
3 DRP000160 DRS000201 DRR000358 454 GS 20 1589403 2011-01-05
4 DRP000316 DRS000421 DRR000770 454 GS FLX Titanium 2014-06-17
5 DRP000316 DRS000422 DRR000771 454 GS FLX Titanium 2014-06-17
6 DRP000316 DRS000423 DRR000772 454 GS FLX Titanium 2014-06-17
7 DRP000316 DRS000424 DRR000773 454 GS FLX Titanium 2014-06-17
8 DRP000316 DRS000425 DRR000774 454 GS FLX Titanium 2014-06-17
9 DRP000316 DRS000426 DRR000775 454 GS FLX Titanium 2014-06-17
10 DRP000319 DRS000505 DRR000792 454 GS FLX Titanium 2014-06-17
11 DRP000319 DRS000505 DRR000793 454 GS FLX Titanium 2014-06-17
12 DRP000319 DRS000505 DRR000794 454 GS FLX Titanium 2014-06-17
13 DRP000319 DRS000505 DRR000795 454 GS FLX Titanium 2014-06-17
14 DRP000319 DRS000505 DRR000796 454 GS FLX Titanium 2014-06-17
15 DRP000319 DRS000505 DRR000797 454 GS FLX Titanium 2014-06-17
16 DRP000319 DRS000505 DRR000798 454 GS FLX Titanium 2014-06-17
17 DRP000319 DRS000505 DRR000799 454 GS FLX Titanium 2014-06-17
18 DRP000319 DRS000505 DRR000800 454 GS FLX Titanium 2014-06-17
19 DRP000319 DRS000505 DRR000801 454 GS FLX Titanium 2014-06-17
20 DRP000319 DRS000505 DRR000802 454 GS FLX Titanium 2014-06-17
21 DRP000319 DRS000505 DRR000803 454 GS FLX Titanium 2014-06-17
22 DRP000319 DRS000505 DRR000804 454 GS FLX Titanium 2014-06-17
23 DRP000319 DRS000505 DRR000805 454 GS FLX Titanium 2014-06-17
24 DRP000319 DRS000505 DRR000806 454 GS FLX Titanium 2014-06-17
25 DRP000319 DRS000505 DRR000807 454 GS FLX Titanium 2014-06-17
26 DRP000319 DRS000505 DRR000808 454 GS FLX Titanium 2014-06-17
27 DRP000319 DRS000505 DRR000809 454 GS FLX Titanium 2014-06-17
28 DRP000319 DRS000505 DRR000810 454 GS FLX Titanium 2014-06-17
29 DRP000319 DRS000505 DRR000811 454 GS FLX Titanium 2014-06-17
30 DRP000319 DRS000505 DRR000812 454 GS FLX Titanium 2014-06-17
31 DRP000319 DRS000505 DRR000813 454 GS FLX Titanium 2014-06-17
32 DRP000319 DRS000505 DRR000814 454 GS FLX Titanium 2014-06-17
33 DRP000319 DRS000505 DRR000815 454 GS FLX Titanium 2014-06-17
34 DRP000319 DRS000505 DRR000816 454 GS FLX Titanium 2014-06-17
35 DRP000319 DRS000505 DRR000817 454 GS FLX Titanium 2014-06-17
36 DRP000319 DRS000505 DRR000818 454 GS FLX Titanium 2014-06-17
37 DRP000319 DRS000505 DRR000819 454 GS FLX Titanium 2014-06-17
38 DRP000319 DRS000505 DRR000820 454 GS FLX Titanium 2014-06-17
39 DRP000319 DRS000505 DRR000821 454 GS FLX Titanium 2014-06-17
40 DRP000319 DRS000505 DRR000822 454 GS FLX Titanium 2014-06-17
41 DRP000319 DRS000505 DRR000823 454 GS FLX Titanium 2014-06-17
42 DRP000319 DRS000505 DRR000824 454 GS FLX Titanium 2014-06-17
43 DRP000319 DRS000505 DRR000825 454 GS FLX Titanium 2014-06-17
44 DRP000319 DRS000505 DRR000826 454 GS FLX Titanium 2014-06-17
45 DRP000319 DRS000505 DRR000827 454 GS FLX Titanium 2014-06-17
46 DRP000319 DRS000505 DRR000828 454 GS FLX Titanium 2014-06-17
47 DRP000319 DRS000505 DRR000829 454 GS FLX Titanium 2014-06-17
48 DRP000319 DRS000505 DRR000830 454 GS FLX Titanium 2014-06-17
49 DRP000319 DRS000505 DRR000831 454 GS FLX Titanium 2014-06-17
50 DRP000319 DRS000505 DRR000832 454 GS FLX Titanium 2014-06-17
51 DRP000379 DRS000637 DRR000979 454 GS FLX Titanium 130077 2014-06-17
52 DRP000446 DRS000850 DRR001355 Illumina Genome Analyzer II 1040584 2012-10-22
53 DRP000446 DRS000851 DRR001356 Illumina Genome Analyzer II 1561501 2012-10-22
54 DRP000446 DRS000852 DRR001357 Illumina Genome Analyzer II 682871 2012-10-22
55 DRP000446 DRS000853 DRR001358 Illumina Genome Analyzer II 957240 2012-10-22
56 DRP000446 DRS000854 DRR001359 Illumina Genome Analyzer II 1134848 2012-10-22
57 DRP000446 DRS000855 DRR001360 Illumina Genome Analyzer II 2163337 2012-10-22
58 DRP000458 DRS000992 DRR001438 454 GS 20 9845 2012-04-15
59 DRP000458 DRS000993 DRR001439 454 GS 20 8381 2012-04-15
... ... ... ... ... ...

45087 rows × 6 columns

Plot read count per run distribution by year (less interesting then read count per study, which is below)


In [8]:
years = []
for d in df['first_public']:
    if d:
        years.append(int(d.split('-')[0]))
    else:
        years.append(None)

read_counts = []
for rc in df['read_count']:
    if rc:
        read_counts.append(int(rc))
    else:
        read_counts.append(None)
        
df['year-public'] = years
df['read_count'] = read_counts
df.dropna(subset=('year-public', 'read_count'), inplace=True)

#df.to_csv('read_count_by_sample.csv')

ax = df.boxplot(column="read_count", by="year-public")
ax.set_ylim(0, 1000000)


Out[8]:
(0, 1000000)

Collapse samples by study


In [97]:
s = pd.DataFrame([df.groupby(by="study_accession", as_index=True)['read_count'].mean(),
                  df.groupby(by="study_accession", as_index=True)['year-public'].max(),
                  df.groupby(by="study_accession", as_index=True)['sample_accession'].count()])
s.index = ("Mean number of reads per sample (per study)", "Year of study publication", "Number of samples (per study)")
# uncomment to write results to file:
#s.to_csv('run_count_by_study.csv')

In [81]:
s


Out[81]:
study_accession DRP000009 DRP000160 DRP000379 DRP000446 DRP000458 DRP000463 DRP000487 DRP000498 DRP000524 DRP000543 DRP000544 DRP000563 DRP000568 DRP000569 DRP000572 DRP000618 DRP000648 DRP000671 DRP000736 DRP000759
Mean number of reads per sample (per study) 338670.666667 1589403 130077 1256730.166667 9113 7886458.666667 16385.8 26072.5 8700364 30083 20336 7688.5 10718150.538462 14827792 183695 1111278 311859 3466.833333 265662 216.292958 ...
Year of study publication 2011.000000 2011 2014 2012.000000 2012 2014.000000 2012.0 2011.0 2012 2012 2012 2014.0 2012.000000 2014 2012 2012 2012 2014.000000 2014 2014.000000 ...
Number of samples (per study) 3.000000 1 1 6.000000 2 3.000000 30.0 14.0 1 1 1 4.0 13.000000 6 1 1 1 6.000000 1 355.000000 ...

3 rows × 801 columns

Plot number of samples per study by year


In [95]:
ax = s.T.boxplot(column="Number of samples (per study)", by="Year of study publication", grid=False)
ax.set_ylim(0, 250)
ax.set_xlabel("Year of study publication")
ax.set_title("")
ax.set_ylabel("Number of samples per study")


Out[95]:
<matplotlib.text.Text at 0x10bade390>

In [96]:
ax = s.T.boxplot(column="Mean number of reads per sample (per study)", by="Year of study publication", grid=False)
ax.set_ylim(0, 10000000)
ax.set_xlabel("Year of study publication")
ax.set_title("")
ax.set_ylabel("Mean number of reads per sample (per study)")


Out[96]:
<matplotlib.text.Text at 0x10bc1b650>

In [89]:
s.T.groupby("Year of study publication").max()


Out[89]:
Mean number of reads per sample (per study) Number of samples (per study)
Year of study publication
2010 2.315884e+07 814
2011 1.275763e+08 224
2012 5.314684e+08 1276
2013 5.961895e+07 1067
2014 1.708313e+08 2809

5 rows × 2 columns