Genomics reference data: the eternal problem

This is the Python code I used to analyse the survey data. You can find Genomics_Reference_Data-Form_responses.tsv on my GitHub together with this file. The file on GitHub already has the personal information removed.

Load the responses file and remove personal data


In [1]:
import csv

responses = []
with open('Genomics_Reference_Data-Form_responses.tsv', 'rb') as fh:
    reader = csv.reader(fh, delimiter='\t')
    for row in reader:
        # Remove columns 2 and 3 (name and email address of the participant)
        responses.append(row[:1] + row[3:])

# Save the responses file without personal information to share on GitHub :-)
with open('Genomics_Reference_Data-Form_responses_no_personal.tsv', 'wb') as fh:
    writer = csv.writer(fh, delimiter='\t')
    for row in responses:
        writer.writerow(row)
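
For comparison, the same anonymization could be done directly in pandas. This is only a sketch, assuming (as above) that the second and third columns hold the participant's name and email:

import pandas as pd

raw = pd.read_csv('Genomics_Reference_Data-Form_responses.tsv', sep='\t')
# Drop the two personal-data columns by position (name and email address)
raw.drop(raw.columns[[1, 2]], axis=1).to_csv(
    'Genomics_Reference_Data-Form_responses_no_personal.tsv', sep='\t', index=False)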

Load the data and take a first look:


In [2]:
import pandas as pd

# Dates are in the format DD/MM/YYYY HH:MM:SS, so you need to tell pandas about it,
# or it will treat the date and the hour as separate columns
responses = pd.read_csv('Genomics_Reference_Data-Form_responses_no_personal.tsv', sep='\t',
                 parse_dates={'timestamp': [0]})
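
One caveat, assuming the timestamps really are day-first as the comment above says: pandas parses ambiguous dates as MM/DD by default, so strictly speaking the call should also pass dayfirst=True. A sketch:

responses = pd.read_csv('Genomics_Reference_Data-Form_responses_no_personal.tsv', sep='\t',
                        parse_dates={'timestamp': [0]}, dayfirst=True)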

In [3]:
responses


Out[3]:
timestamp Do you work only with one species, or several species' genomes? What kind of reference data do you use for your research? Where do you fetch your data from? How do you fetch the reference data? Do you use any of these tools for downloading reference data? How do you keep your reference data up to date? Comments / Questions How do you structure the reference data What motivates you to use your own structure for your reference data? Where do you store your reference data? Personal data Untitled Question [Row 2]
0 2014-07-07 17:45:29 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/) I download the data manually myself NaN Staying consistent with versions of references... NaN I keep the same structure than the original so... NaN Local servers NaN NaN
1 2014-07-07 17:53:52 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I download the data manually myself NaN Generally only want to use one version for a g... NaN NaN To unify data from different sources on a comm... Local servers NaN NaN
2 2014-07-07 18:02:02 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline Cloudbiolinux (http://cloudbiolinux.org/), Cos... I regularly check for new updates and fetch th... I'd really enjoy reading the summary results f... I use a different structure Just convenience (works better with our pipeli... A combination of local and cloud storage NaN NaN
3 2014-07-07 18:03:42 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline NaN NaN NaN I use a different structure To unify data from different sources on a comm... Local servers NaN NaN
4 2014-07-07 18:07:59 Only one specie Reference genome (FASTA files), Index files (B... UCSC (https://genome.ucsc.edu/), NCBI (http://... Combination of manual work and automated pipeline NaN I don't NaN I use a different structure I think it's more logical, To unify data from ... Local servers NaN NaN
5 2014-07-07 18:26:52 Several species Reference genome (FASTA files), Index files (B... UCSC (https://genome.ucsc.edu/) Combination of manual work and automated pipeline NaN I regularly check for new updates and fetch th... NaN I use a different structure Just convenience (works better with our pipeli... Local servers NaN NaN
6 2014-07-07 19:05:14 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I use an automated pipeline for fetching the data Cloudbiolinux (http://cloudbiolinux.org/) I use one of the aforementioned tools to autom... NaN I use a different structure To unify data from different sources on a comm... A combination of local and cloud storage NaN NaN
7 2014-07-07 19:07:12 Only one specie Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline NaN I regularly check for new updates and fetch th... NaN I use a different structure I think it's more logical, To unify data from ... A combination of local and cloud storage NaN NaN
8 2014-07-07 19:08:36 Only one specie Reference genome (FASTA files), Variant Callin... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I download the data manually myself NaN I regularly check for new updates and fetch th... We use versioning and we stick to only one set... I keep the same structure than the original so... NaN Local servers NaN NaN
9 2014-07-07 19:08:48 Several species Reference genome (FASTA files), Variant Callin... NCBI (http://www.ncbi.nlm.nih.gov/genome) Combination of manual work and automated pipeline NaN I regularly check for new updates and fetch th... NaN I keep the same structure than the original so... NaN Local servers NaN NaN
10 2014-07-07 19:52:20 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline Cloudbiolinux (http://cloudbiolinux.org/) I regularly check for new updates and fetch th... NaN I keep the same structure than the original so... NaN A combination of local and cloud storage NaN NaN
11 2014-07-07 20:50:48 Several species Reference genome (FASTA files), Index files (B... UCSC (https://genome.ucsc.edu/), various plant... Combination of manual work and automated pipeline NaN NaN NaN I use a different structure not using my own structure - borrowed it from ... Local servers NaN NaN
12 2014-07-07 21:11:50 Several species Reference genome (FASTA files) NCBI (http://www.ncbi.nlm.nih.gov/genome) I use an automated pipeline for fetching the data NaN I regularly check for new updates and fetch th... NaN I keep the same structure than the original so... NaN A combination of local and cloud storage NaN NaN
13 2014-07-07 21:42:29 Only one specie Reference genome (FASTA files), Index files (B... UCSC (https://genome.ucsc.edu/), NCBI (http://... Combination of manual work and automated pipeline NaN I regularly check for new updates and fetch th... NaN I use a different structure Just convenience (works better with our pipeli... A combination of local and cloud storage NaN NaN
14 2014-07-07 23:31:51 Only one specie Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I use an automated pipeline for fetching the data NaN I use one of the aforementioned tools to autom... NaN I keep the same structure than the original so... NaN A combination of local and cloud storage NaN NaN
15 2014-07-07 23:40:30 Several species Reference genome (FASTA files), Index files (B... NCBI (http://www.ncbi.nlm.nih.gov/genome), Phy... I download the data manually myself NaN I regularly check for new updates and fetch th... NaN I keep the same structure than the original so... NaN Local servers NaN NaN
16 2014-08-07 00:03:36 Several species Reference genome (FASTA files), Genetic Variat... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline Bioconductor only when need, prefer working with one versio... The result of the analysis might change a lot ... I use a different structure I think it's more logical, To unify data from ... Local servers NaN NaN
17 2014-08-07 01:07:29 Several species Reference genome (FASTA files), Variant Callin... NCBI (http://www.ncbi.nlm.nih.gov/genome) Combination of manual work and automated pipeline NaN NaN NaN I use a different structure To unify data from different sources on a comm... Local servers NaN NaN
18 2014-08-07 09:05:51 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/) I download the data manually myself NaN I infrequently check for new updates and fetch... At Babraham we mostly worked on human and mous... I use a different structure Just convenience (works better with our pipeli... Local servers NaN NaN
19 2014-08-07 09:09:19 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/) I use an automated pipeline for fetching the data NaN I regularly check for new updates and fetch th... NaN I keep the same structure than the original so... NaN Local servers NaN NaN
20 2014-09-07 05:02:27 Only one specie Reference genome (FASTA files), Variant Callin... ENSEMBL, BioMart (http://www.ensembl.org/), NC... Combination of manual work and automated pipeline Cloudbiolinux (http://cloudbiolinux.org/) cron+wget/rsync for most sources NaN I use a different structure To unify data from different sources on a comm... A combination of local and cloud storage NaN NaN
21 2014-09-07 07:06:17 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I use an automated pipeline for fetching the data Cloudbiolinux (http://cloudbiolinux.org/) I use one of the aforementioned tools to autom... NaN I use a different structure To unify data from different sources on a comm... Local servers NaN NaN
22 2014-09-07 14:32:37 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline In house stuff, but looking for stable alterna... I use one of the aforementioned tools to autom... Any way you will make the results available? I keep the same structure than the original so... NaN Local servers NaN NaN
23 2014-09-07 15:38:57 Several species Reference genome (FASTA files), Variant Callin... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I download the data manually myself NaN NaN NaN I use a different structure Just convenience (works better with our pipeli... Local servers NaN NaN
24 2014-09-07 22:59:39 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline NaN I regularly check for new updates and fetch th... NaN I use a different structure To unify data from different sources on a comm... Local servers NaN NaN
25 2014-09-07 23:05:57 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I download the data manually myself NaN I regularly check for new updates and fetch th... NaN I keep the same structure than the original so... NaN Local servers NaN NaN
26 2014-09-07 23:09:31 Only one specie Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/) I use an automated pipeline for fetching the data custom I use one of the aforementioned tools to autom... NaN I keep the same structure than the original so... NaN Local servers NaN NaN
27 2014-09-07 23:19:21 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I download the data manually myself NaN when i remember or someone needs a newer version It's definitely a pain :) I use a different structure I think it's more logical Local servers NaN NaN
28 2014-09-07 23:31:00 Several species Reference genome (FASTA files), Index files (B... IGenome (http://support.illumina.com/sequencin... Combination of manual work and automated pipeline rsync whole directory structure from igenomes do manual check : Most of the time we stick to... Bacterial genomes : I downloaded from ncbi and... I keep the same structure than the original so... NaN Local servers NaN NaN
29 2014-10-07 09:18:30 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline wget, curl, Rcurl (R package), biomaRt (R pack... custom tool NaN I use a different structure To unify data from different sources on a comm... Local servers NaN NaN
30 2014-10-07 10:20:31 Several species Reference genome (FASTA files), Gene Transfer ... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline NaN I regularly check for new updates and fetch th... NaN I keep the same structure than the original so... NaN Local servers NaN NaN
31 2014-10-07 13:30:17 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline NaN regular downloads quarterly NaN I use a different structure To allow all bioinformaticians in the company ... Local servers NaN NaN
32 2014-10-07 14:07:18 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline NaN I regularly check for new updates and fetch th... NaN I use a different structure Just convenience (works better with our pipeli... Local servers NaN NaN
33 2014-10-07 14:12:13 Several species Reference genome (FASTA files), Index files (B... UCSC (https://genome.ucsc.edu/), IGenome (http... I download the data manually myself NaN I regularly check for new updates and fetch th... NaN I use a different structure I think it's more logical Local servers NaN NaN
34 2014-10-07 15:29:40 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline NaN I regularly check for new updates and fetch th... NaN I keep the same structure than the original so... NaN Local servers NaN NaN
35 2014-10-07 16:33:27 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I download the data manually myself NaN Update as projects demand NaN I use a different structure To unify data from different sources on a comm... Local servers NaN NaN
36 2014-11-07 13:39:35 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I download the data manually myself NaN I update data when I start a new project, but ... NaN I use a different structure To unify data from different sources on a comm... Local servers NaN NaN
37 2014-11-07 18:22:09 Several species Reference genome (FASTA files), Index files (B... UCSC (https://genome.ucsc.edu/), IGenome (http... I download the data manually myself NaN I regularly check for new updates and fetch th... NaN I use a different structure Just convenience (works better with our pipeli... Local servers NaN NaN
38 2014-12-07 08:11:11 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I use an automated pipeline for fetching the data NaN custom cron scripts NaN I use a different structure To unify data from different sources on a comm... Local servers NaN NaN
39 2014-12-07 19:43:26 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline NaN following the corresponding mailing lists NaN I keep the same structure than the original so... NaN Local servers NaN NaN
40 2014-12-07 22:06:51 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline NaN check at start of new projects NaN I keep the same structure than the original so... NaN Local servers NaN NaN

Participation


In [4]:
import datetime

participation = responses.groupby(responses['timestamp'].map(lambda x: datetime.date(x.year, x.month, x.day)))
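
On newer pandas versions (0.15 and later, if I remember correctly) the lambda can be replaced with the .dt accessor; a sketch of the equivalent grouping:

participation = responses.groupby(responses['timestamp'].dt.date)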

In [5]:
participation.count()['timestamp']


Out[5]:
timestamp
2014-07-07    16
2014-08-07     4
2014-09-07     9
2014-10-07     7
2014-11-07     2
2014-12-07     3
Name: timestamp, dtype: int64

In [6]:
participation.count()['timestamp'].plot(kind='bar', title='Survey participation', grid=False)


Out[6]:
[Bar chart: Survey participation]

The questions

Do you work only with one species, or several species' genomes?


In [7]:
species = responses['Do you work only with one species, or several species\' genomes?']

In [8]:
species.value_counts().plot(kind='pie', autopct='%1.1f%%', title='Do you work with a single or multiple species', figsize=(7,7))


Out[8]:
[Pie chart: Do you work with a single or multiple species]

What kind of reference data do you use for your research?

We will group the data in terms of how many groups use each kind of reference data.


In [9]:
ref_data_groups = responses['What kind of reference data do you use for your research?']

In [10]:
import re

ref_data = {}
for group in ref_data_groups:
    # Nasty parsing due to the poor design of the survey, shame on me...
    # Remove everything within parentheses, then split on commas
    group = re.sub(r'\([^)]*\)', '', group)
    for key in group.split(','):
        if key in ref_data:
            ref_data[key] += 1
        else:
            ref_data[key] = 1
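
This remove-parentheses-then-split-on-commas dance is repeated for several multi-choice questions below, so it could live in a small helper. A sketch (count_multiselect is a name I made up, and unlike the cells below it strips whitespace on both sides):

from collections import Counter

def count_multiselect(column):
    # Count each option in a comma-separated multi-choice column,
    # removing parenthesized fragments first (they may contain commas)
    counts = Counter()
    for answer in column.dropna():
        answer = re.sub(r'\([^)]*\)', '', answer)
        counts.update(option.strip() for option in answer.split(','))
    return counts

# e.g. ref_data = count_multiselect(responses['What kind of reference data do you use for your research?'])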

In [11]:
df_ref_data = pd.DataFrame.from_dict(ref_data, orient='index')
df_ref_data


Out[11]:
0
Variant Calling data 26
Structural variantion data 1
Annotation data 22
Genetic Variation data 23
Reference genome 41
Index files 33
Complete Genomics 1
GATK bundle 1
hand-made annotation tab-delim files 1
Gene Transfer data 26
BED-detail files 1

In [12]:
df_ref_data = df_ref_data.sort(columns=0)
plot = df_ref_data.plot(kind='barh', legend=False, grid=False, title='Number of groups using each kind of reference data',
                 figsize=(10,10))


Where do you fetch your reference data from?


In [13]:
ref_data_locations = {}
for loc in responses['Where do you fetch your data from?']:
    # Nasty parsing due to the poor design of the survey, shame on me...
    # Remove everything within parentheses, then split on commas
    loc = re.sub(r'\([^)]*\)', '', loc)
    for key in loc.split(','):
        key = key.lstrip()
        if key in ref_data_locations:
            ref_data_locations[key] += 1
        else:
            ref_data_locations[key] = 1
del ref_data_locations['etc.']
ref_data_locations


Out[13]:
{'BeeBase': 1,
 'BioMart ': 30,
 'ENSEMBL': 30,
 'IGenome ': 16,
 'IMG': 1,
 'JGI...': 1,
 'MG-RAST': 1,
 'NCBI ': 25,
 'Phytozome': 2,
 'RGD': 1,
 'TAIR': 1,
 'UCSC ': 31,
 'http://genomeinabottle.org/': 1,
 'various plant genome databases': 1}

In [14]:
df_ref_loc = pd.DataFrame.from_dict(ref_data_locations, orient='index')
df_ref_loc.sort(columns=0, ascending=False)


Out[14]:
0
UCSC 31
BioMart 30
ENSEMBL 30
NCBI 25
IGenome 16
Phytozome 2
IMG 1
MG-RAST 1
JGI... 1
TAIR 1
http://genomeinabottle.org/ 1
various plant genome databases 1
RGD 1
BeeBase 1

In [15]:
df_ref_loc = df_ref_loc.sort(columns=0)
plot = df_ref_loc.plot(kind='barh', legend=False, grid=False, title='Where do you fetch your data from?',
                 figsize=(10,10))


How do you fetch the reference data?


In [27]:
fetching_options = {}
for key in responses['How do you fetch the reference data?']:
    key = key.lstrip()
    if key in fetching_options:
        fetching_options[key] += 1
    else:
        fetching_options[key] = 1
fetching_options


Out[27]:
{'Combination of manual work and automated pipeline': 22,
 'I download the data manually myself': 12,
 'I use an automated pipeline for fetching the data': 7}
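
Since this question is single-choice, pandas' built-in value_counts (already used for the species question above) would give the same counts in one line:

responses['How do you fetch the reference data?'].value_counts()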

In [28]:
fetching_options = pd.DataFrame.from_dict(fetching_options, orient='index')
fetching_options.sort(columns=0, ascending=False)


Out[28]:
0
Combination of manual work and automated pipeline 22
I download the data manually myself 12
I use an automated pipeline for fetching the data 7

In [158]:
pie = fetching_options.plot(kind='pie', autopct='%1.1f%%', title='How do you fetch the reference data?', subplots=True, figsize=(7,7))



In [41]:
rel = responses.groupby(["Do you work only with one species, or several species' genomes?", "How do you fetch the reference data?"])

In [50]:
rel.count()


Out[50]:
timestamp What kind of reference data do you use for your research? Where do you fetch your data from? Do you use any of these tools for downloading reference data? How do you keep your reference data up to date? Comments / Questions How do you structure the reference data What motivates you to use your own structure for your reference data? Where do you store your reference data? Personal data Untitled Question [Row 2]
Do you work only with one species, or several species' genomes? How do you fetch the reference data?
Only one specie Combination of manual work and automated pipeline 4 4 4 1 4 0 4 4 4 0 0
I download the data manually myself 1 1 1 0 1 1 1 0 1 0 0
I use an automated pipeline for fetching the data 2 2 2 1 2 0 2 0 2 0 0
Several species Combination of manual work and automated pipeline 18 18 18 6 15 4 18 10 18 0 0
I download the data manually myself 11 11 11 0 10 2 10 8 11 0 0
I use an automated pipeline for fetching the data 5 5 5 2 5 0 5 3 5 0 0
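
The same relationship can be read more easily from a contingency table; pd.crosstab builds one directly from the two answer columns (a sketch):

pd.crosstab(responses["Do you work only with one species, or several species' genomes?"],
            responses['How do you fetch the reference data?'])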

How do you structure the reference data


In [163]:
structure = {}
for key in responses['How do you structure the reference data'].dropna():
    key = key.lstrip()
    if key in structure:
        structure[key] += 1
    else:
        structure[key] = 1
structure


Out[163]:
{'I keep the same structure than the original source': 16,
 'I use a different structure': 24}

In [164]:
structure = pd.DataFrame.from_dict(structure, orient='index')
structure.plot(kind='pie', autopct='%1.1f%%', title='How do you structure the reference data?', subplots=True, figsize=(7,7))


Out[164]:
[Pie chart: How do you structure the reference data?]

What motivates you to use your own structure for your reference data?


In [142]:
motivation = {}
for loc in responses['What motivates you to use your own structure for your reference data?'].dropna():
    # Nasty parsing due to the poor design of the survey, shame on me...
    # Remove everything within parentheses, then split on commas
    loc = re.sub(r'\([^)]*\)', '', loc)
    for key in loc.split(','):
        key = key.lstrip()
        if key in motivation:
            motivation[key] += 1
        else:
            motivation[key] = 1
motivation


Out[142]:
{'2bit for sequence': 1,
 'Additional indexing': 1,
 "Historical - that's the way we've always done it": 1,
 "I think it's more logical": 9,
 'Just convenience ': 7,
 'Query support': 1,
 'To allow all bioinformaticians in the company to use the same reference': 1,
 'To unify data from different sources on a common structure': 19,
 'bed-detail for gene structure annotations': 1,
 'conversion': 1,
 'etc.': 1,
 'lookup performance': 1,
 'not using my own structure - borrowed it from UCSC': 1,
 'to overcome inconsistencies from different sources': 1}

In [143]:
del motivation['etc.']
# These three fragments are pieces of one free-text answer that was split on commas
del motivation['2bit for sequence']
del motivation['conversion']
del motivation['bed-detail for gene structure annotations']
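
The manual deletes could also be written as a single dict comprehension over a set of fragments to discard (a sketch):

fragments = {'etc.', '2bit for sequence', 'conversion',
             'bed-detail for gene structure annotations'}
motivation = {k: v for k, v in motivation.items() if k not in fragments}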

In [144]:
motivation


Out[144]:
{'Additional indexing': 1,
 "Historical - that's the way we've always done it": 1,
 "I think it's more logical": 9,
 'Just convenience ': 7,
 'Query support': 1,
 'To allow all bioinformaticians in the company to use the same reference': 1,
 'To unify data from different sources on a common structure': 19,
 'lookup performance': 1,
 'not using my own structure - borrowed it from UCSC': 1,
 'to overcome inconsistencies from different sources': 1}

In [145]:
motivation = pd.DataFrame.from_dict(motivation, orient='index')

In [131]:
motivation.sort(columns=0, ascending=False)


Out[131]:
0
To unify data from different sources on a common structure 19
I think it's more logical 9
Just convenience 7
To allow all bioinformaticians in the company to use the same reference 1
Historical - that's the way we've always done it 1
Query support 1
lookup performance 1
Additional indexing 1
to overcome inconsistencies from different sources 1
not using my own structure - borrowed it from UCSC 1

How do you keep your reference data up to date?


In [146]:
update = {}
for loc in responses['How do you keep your reference data up to date?'].dropna():
    # Nasty parsing due to the poor design of the survey, shame on me...
    # Remove everything within parentheses, then split on commas
    loc = re.sub(r'\([^)]*\)', '', loc)
    for key in loc.split(','):
        key = key.lstrip()
        if key in update:
            update[key] += 1
        else:
            update[key] = 1
update


Out[146]:
{'Generally only want to use one version for a given project.  Check for new versions if applicable when starting something new.': 1,
 "I don't": 1,
 'I infrequently check for new updates and fetch them manually': 1,
 'I regularly check for new updates and fetch them manually': 17,
 'I update data when I start a new project': 1,
 'I use one of the aforementioned tools to automatically check for new versions of the reference data': 5,
 'Staying consistent with versions of references  is more important than having the latest references.': 1,
 'Update as projects demand': 1,
 'but usually not in between': 1,
 'check at start of new projects': 1,
 'cron+wget/rsync for most sources': 1,
 'custom cron scripts': 1,
 'custom tool': 1,
 'do manual check : Most of the time we stick to stable version e.g. we are still using hg19 ': 1,
 'following the corresponding mailing lists': 1,
 'only when need': 1,
 'prefer working with one version across the project ': 1,
 'regular downloads quarterly': 1,
 'when i remember or someone needs a newer version': 1}

In [147]:
del update['but usually not in between']

In [150]:
for k in update:
    print(k)


check at start of new projects
I infrequently check for new updates and fetch them manually
do manual check : Most of the time we stick to stable version e.g. we are still using hg19 
I don't
regular downloads quarterly
custom cron scripts
I use one of the aforementioned tools to automatically check for new versions of the reference data
custom tool
following the corresponding mailing lists
Generally only want to use one version for a given project.  Check for new versions if applicable when starting something new.
cron+wget/rsync for most sources
I regularly check for new updates and fetch them manually
when i remember or someone needs a newer version
prefer working with one version across the project 
only when need
I update data when I start a new project
Staying consistent with versions of references  is more important than having the latest references.
Update as projects demand

Do you use any of these tools for downloading reference data?


In [176]:
tools = {}
for loc in responses['Do you use any of these tools for downloading reference data?'].dropna():
    # Nasty parsing due to the poor design of the survey, shame on me...
    # Remove everything within parentheses, then split on commas
    loc = re.sub(r'\([^)]*\)', '', loc)
    for key in loc.split(','):
        key = key.lstrip()
        if key in tools:
            tools[key] += 1
        else:
            tools[key] = 1
tools


Out[176]:
{'Bioconductor': 1,
 'Cloudbiolinux ': 5,
 'Cosmid ': 1,
 'In house stuff': 1,
 'Rcurl ': 1,
 'arvados.org': 1,
 'biomaRt ': 1,
 'but looking for stable alternatives': 1,
 'curl': 1,
 'custom': 1,
 'rsync whole directory structure from igenomes': 1,
 'wget': 1}

In [177]:
del tools['but looking for stable alternatives']

In [178]:
tools = pd.DataFrame.from_dict(tools, orient='index')
tools.plot(kind='pie', autopct='%1.1f%%', title='Do you use any of these tools for downloading reference data?', subplots=True, figsize=(7,7), legend=False)


Out[178]:
[Pie chart: Do you use any of these tools for downloading reference data?]
