Genomics reference data: the eternal problem

This is the Python code I used to analyse the survey data. You can find Genomics_Reference_Data-Form_responses.tsv on my GitHub together with this file. The file on GitHub already has the personal information removed.

Load the responses file and remove personal data


In [1]:
import csv

responses = []
with open('Genomics_Reference_Data-Form_responses.tsv', 'rb') as fh:
    reader = csv.reader(fh, delimiter='\t')
    for row in reader:
        # Remove columns 2 and 3 (name and email address of the participant)
        responses.append(row[:1] + row[3:])

# Save the responses file without personal information to share on GitHub :-)
with open('Genomics_Reference_Data-Form_responses_no_personal.tsv', 'wb') as fh:
    writer = csv.writer(fh, delimiter='\t')
    for row in responses:
        writer.writerow(row)
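
For comparison, the same anonymization could be done directly in pandas. This is only a sketch, assuming (as above) that the second and third columns hold the participant's name and email:

import pandas as pd

raw = pd.read_csv('Genomics_Reference_Data-Form_responses.tsv', sep='\t')
# Drop the two personal-data columns by position (name and email address)
raw.drop(raw.columns[[1, 2]], axis=1).to_csv(
    'Genomics_Reference_Data-Form_responses_no_personal.tsv', sep='\t', index=False)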

Load the data and take a first look:


In [2]:
import pandas as pd

# Dates are in the format DD/MM/YYYY HH:MM:SS, so you need to tell pandas about it,
# or it will treat the date and the hour as separate columns
responses = pd.read_csv('Genomics_Reference_Data-Form_responses_no_personal.tsv', sep='\t',
                 parse_dates={'timestamp': [0]})
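
One caveat, assuming the timestamps really are day-first as the comment above says: pandas parses ambiguous dates as MM/DD by default, so strictly speaking the call should also pass dayfirst=True. A sketch:

responses = pd.read_csv('Genomics_Reference_Data-Form_responses_no_personal.tsv', sep='\t',
                        parse_dates={'timestamp': [0]}, dayfirst=True)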

In [3]:
responses


Out[3]:
timestamp Do you work only with one species, or several species' genomes? What kind of reference data do you use for your research? Where do you fetch your data from? How do you fetch the reference data? Do you use any of these tools for downloading reference data? How do you keep your reference data up to date? Comments / Questions How do you structure the reference data What motivates you to use your own structure for your reference data? Where do you store your reference data? Personal data Untitled Question [Row 2]
0 2014-07-07 17:45:29 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/) I download the data manually myself NaN Staying consistent with versions of references... NaN I keep the same structure than the original so... NaN Local servers NaN NaN
1 2014-07-07 17:53:52 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I download the data manually myself NaN Generally only want to use one version for a g... NaN NaN To unify data from different sources on a comm... Local servers NaN NaN
2 2014-07-07 18:02:02 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline Cloudbiolinux (http://cloudbiolinux.org/), Cos... I regularly check for new updates and fetch th... I'd really enjoy reading the summary results f... I use a different structure Just convenience (works better with our pipeli... A combination of local and cloud storage NaN NaN
3 2014-07-07 18:03:42 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline NaN NaN NaN I use a different structure To unify data from different sources on a comm... Local servers NaN NaN
4 2014-07-07 18:07:59 Only one specie Reference genome (FASTA files), Index files (B... UCSC (https://genome.ucsc.edu/), NCBI (http://... Combination of manual work and automated pipeline NaN I don't NaN I use a different structure I think it's more logical, To unify data from ... Local servers NaN NaN
5 2014-07-07 18:26:52 Several species Reference genome (FASTA files), Index files (B... UCSC (https://genome.ucsc.edu/) Combination of manual work and automated pipeline NaN I regularly check for new updates and fetch th... NaN I use a different structure Just convenience (works better with our pipeli... Local servers NaN NaN
6 2014-07-07 19:05:14 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I use an automated pipeline for fetching the data Cloudbiolinux (http://cloudbiolinux.org/) I use one of the aforementioned tools to autom... NaN I use a different structure To unify data from different sources on a comm... A combination of local and cloud storage NaN NaN
7 2014-07-07 19:07:12 Only one specie Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline NaN I regularly check for new updates and fetch th... NaN I use a different structure I think it's more logical, To unify data from ... A combination of local and cloud storage NaN NaN
8 2014-07-07 19:08:36 Only one specie Reference genome (FASTA files), Variant Callin... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I download the data manually myself NaN I regularly check for new updates and fetch th... We use versioning and we stick to only one set... I keep the same structure than the original so... NaN Local servers NaN NaN
9 2014-07-07 19:08:48 Several species Reference genome (FASTA files), Variant Callin... NCBI (http://www.ncbi.nlm.nih.gov/genome) Combination of manual work and automated pipeline NaN I regularly check for new updates and fetch th... NaN I keep the same structure than the original so... NaN Local servers NaN NaN
10 2014-07-07 19:52:20 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline Cloudbiolinux (http://cloudbiolinux.org/) I regularly check for new updates and fetch th... NaN I keep the same structure than the original so... NaN A combination of local and cloud storage NaN NaN
11 2014-07-07 20:50:48 Several species Reference genome (FASTA files), Index files (B... UCSC (https://genome.ucsc.edu/), various plant... Combination of manual work and automated pipeline NaN NaN NaN I use a different structure not using my own structure - borrowed it from ... Local servers NaN NaN
12 2014-07-07 21:11:50 Several species Reference genome (FASTA files) NCBI (http://www.ncbi.nlm.nih.gov/genome) I use an automated pipeline for fetching the data NaN I regularly check for new updates and fetch th... NaN I keep the same structure than the original so... NaN A combination of local and cloud storage NaN NaN
13 2014-07-07 21:42:29 Only one specie Reference genome (FASTA files), Index files (B... UCSC (https://genome.ucsc.edu/), NCBI (http://... Combination of manual work and automated pipeline NaN I regularly check for new updates and fetch th... NaN I use a different structure Just convenience (works better with our pipeli... A combination of local and cloud storage NaN NaN
14 2014-07-07 23:31:51 Only one specie Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I use an automated pipeline for fetching the data NaN I use one of the aforementioned tools to autom... NaN I keep the same structure than the original so... NaN A combination of local and cloud storage NaN NaN
15 2014-07-07 23:40:30 Several species Reference genome (FASTA files), Index files (B... NCBI (http://www.ncbi.nlm.nih.gov/genome), Phy... I download the data manually myself NaN I regularly check for new updates and fetch th... NaN I keep the same structure than the original so... NaN Local servers NaN NaN
16 2014-08-07 00:03:36 Several species Reference genome (FASTA files), Genetic Variat... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline Bioconductor only when need, prefer working with one versio... The result of the analysis might change a lot ... I use a different structure I think it's more logical, To unify data from ... Local servers NaN NaN
17 2014-08-07 01:07:29 Several species Reference genome (FASTA files), Variant Callin... NCBI (http://www.ncbi.nlm.nih.gov/genome) Combination of manual work and automated pipeline NaN NaN NaN I use a different structure To unify data from different sources on a comm... Local servers NaN NaN
18 2014-08-07 09:05:51 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/) I download the data manually myself NaN I infrequently check for new updates and fetch... At Babraham we mostly worked on human and mous... I use a different structure Just convenience (works better with our pipeli... Local servers NaN NaN
19 2014-08-07 09:09:19 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/) I use an automated pipeline for fetching the data NaN I regularly check for new updates and fetch th... NaN I keep the same structure than the original so... NaN Local servers NaN NaN
20 2014-09-07 05:02:27 Only one specie Reference genome (FASTA files), Variant Callin... ENSEMBL, BioMart (http://www.ensembl.org/), NC... Combination of manual work and automated pipeline Cloudbiolinux (http://cloudbiolinux.org/) cron+wget/rsync for most sources NaN I use a different structure To unify data from different sources on a comm... A combination of local and cloud storage NaN NaN
21 2014-09-07 07:06:17 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I use an automated pipeline for fetching the data Cloudbiolinux (http://cloudbiolinux.org/) I use one of the aforementioned tools to autom... NaN I use a different structure To unify data from different sources on a comm... Local servers NaN NaN
22 2014-09-07 14:32:37 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline In house stuff, but looking for stable alterna... I use one of the aforementioned tools to autom... Any way you will make the results available? I keep the same structure than the original so... NaN Local servers NaN NaN
23 2014-09-07 15:38:57 Several species Reference genome (FASTA files), Variant Callin... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I download the data manually myself NaN NaN NaN I use a different structure Just convenience (works better with our pipeli... Local servers NaN NaN
24 2014-09-07 22:59:39 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline NaN I regularly check for new updates and fetch th... NaN I use a different structure To unify data from different sources on a comm... Local servers NaN NaN
25 2014-09-07 23:05:57 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I download the data manually myself NaN I regularly check for new updates and fetch th... NaN I keep the same structure than the original so... NaN Local servers NaN NaN
26 2014-09-07 23:09:31 Only one specie Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/) I use an automated pipeline for fetching the data custom I use one of the aforementioned tools to autom... NaN I keep the same structure than the original so... NaN Local servers NaN NaN
27 2014-09-07 23:19:21 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I download the data manually myself NaN when i remember or someone needs a newer version It's definitely a pain :) I use a different structure I think it's more logical Local servers NaN NaN
28 2014-09-07 23:31:00 Several species Reference genome (FASTA files), Index files (B... IGenome (http://support.illumina.com/sequencin... Combination of manual work and automated pipeline rsync whole directory structure from igenomes do manual check : Most of the time we stick to... Bacterial genomes : I downloaded from ncbi and... I keep the same structure than the original so... NaN Local servers NaN NaN
29 2014-10-07 09:18:30 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline wget, curl, Rcurl (R package), biomaRt (R pack... custom tool NaN I use a different structure To unify data from different sources on a comm... Local servers NaN NaN
30 2014-10-07 10:20:31 Several species Reference genome (FASTA files), Gene Transfer ... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline NaN I regularly check for new updates and fetch th... NaN I keep the same structure than the original so... NaN Local servers NaN NaN
31 2014-10-07 13:30:17 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline NaN regular downloads quarterly NaN I use a different structure To allow all bioinformaticians in the company ... Local servers NaN NaN
32 2014-10-07 14:07:18 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline NaN I regularly check for new updates and fetch th... NaN I use a different structure Just convenience (works better with our pipeli... Local servers NaN NaN
33 2014-10-07 14:12:13 Several species Reference genome (FASTA files), Index files (B... UCSC (https://genome.ucsc.edu/), IGenome (http... I download the data manually myself NaN I regularly check for new updates and fetch th... NaN I use a different structure I think it's more logical Local servers NaN NaN
34 2014-10-07 15:29:40 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline NaN I regularly check for new updates and fetch th... NaN I keep the same structure than the original so... NaN Local servers NaN NaN
35 2014-10-07 16:33:27 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I download the data manually myself NaN Update as projects demand NaN I use a different structure To unify data from different sources on a comm... Local servers NaN NaN
36 2014-11-07 13:39:35 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I download the data manually myself NaN I update data when I start a new project, but ... NaN I use a different structure To unify data from different sources on a comm... Local servers NaN NaN
37 2014-11-07 18:22:09 Several species Reference genome (FASTA files), Index files (B... UCSC (https://genome.ucsc.edu/), IGenome (http... I download the data manually myself NaN I regularly check for new updates and fetch th... NaN I use a different structure Just convenience (works better with our pipeli... Local servers NaN NaN
38 2014-12-07 08:11:11 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... I use an automated pipeline for fetching the data NaN custom cron scripts NaN I use a different structure To unify data from different sources on a comm... Local servers NaN NaN
39 2014-12-07 19:43:26 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline NaN following the corresponding mailing lists NaN I keep the same structure than the original so... NaN Local servers NaN NaN
40 2014-12-07 22:06:51 Several species Reference genome (FASTA files), Index files (B... ENSEMBL, BioMart (http://www.ensembl.org/), UC... Combination of manual work and automated pipeline NaN check at start of new projects NaN I keep the same structure than the original so... NaN Local servers NaN NaN

Participation


In [4]:
import datetime

participation = responses.groupby(responses['timestamp'].map(lambda x: datetime.date(x.year, x.month, x.day)))
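
On newer pandas versions (0.15 and later, if I remember correctly) the lambda can be replaced with the .dt accessor; a sketch of the equivalent grouping:

participation = responses.groupby(responses['timestamp'].dt.date)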

In [5]:
participation.count()['timestamp']


Out[5]:
timestamp
2014-07-07    16
2014-08-07     4
2014-09-07     9
2014-10-07     7
2014-11-07     2
2014-12-07     3
Name: timestamp, dtype: int64

In [6]:
participation.count()['timestamp'].plot(kind='bar', title='Survey participation', grid=False)


Out[6]:
[Bar chart: Survey participation]

The questions

Do you work only with one species, or several species' genomes?


In [7]:
species = responses['Do you work only with one species, or several species\' genomes?']

In [8]:
species.value_counts().plot(kind='pie', autopct='%1.1f%%', title='Do you work with a single or multiple species', figsize=(7,7))


Out[8]:
[Pie chart: Do you work with a single or multiple species]

What kind of reference data do you use for your research?

We will group the data in terms of how many groups use each kind of reference data.


In [9]:
ref_data_groups = responses['What kind of reference data do you use for your research?']

In [10]:
import re

ref_data = {}
for group in ref_data_groups:
    # Nasty parsing due to the poor design of the survey, shame on me...
    # Remove everything within parentheses, then split on commas
    group = re.sub(r'\([^)]*\)', '', group)
    for key in group.split(','):
        if key in ref_data:
            ref_data[key] += 1
        else:
            ref_data[key] = 1
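
This remove-parentheses-then-split-on-commas dance is repeated for several multi-choice questions below, so it could live in a small helper. A sketch (count_multiselect is a name I made up, and unlike the cells below it strips whitespace on both sides):

from collections import Counter

def count_multiselect(column):
    # Count each option in a comma-separated multi-choice column,
    # removing parenthesized fragments first (they may contain commas)
    counts = Counter()
    for answer in column.dropna():
        answer = re.sub(r'\([^)]*\)', '', answer)
        counts.update(option.strip() for option in answer.split(','))
    return counts

# e.g. ref_data = count_multiselect(responses['What kind of reference data do you use for your research?'])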

In [11]:
df_ref_data = pd.DataFrame.from_dict(ref_data, orient='index')
df_ref_data


Out[11]:
0
Variant Calling data 26
Structural variantion data 1
Annotation data 22
Genetic Variation data 23
Reference genome 41
Index files 33
Complete Genomics 1
GATK bundle 1
hand-made annotation tab-delim files 1
Gene Transfer data 26
BED-detail files 1

In [12]:
df_ref_data = df_ref_data.sort(columns=0)
plot = df_ref_data.plot(kind='barh', legend=False, grid=False, title='Number of groups using each kind of reference data',
                 figsize=(10,10))


Where do you fetch your reference data from?


In [13]:
ref_data_locations = {}
for loc in responses['Where do you fetch your data from?']:
    # Nasty parsing due to the poor design of the survey, shame on me...
    # Remove everything within parentheses, then split on commas
    loc = re.sub(r'\([^)]*\)', '', loc)
    for key in loc.split(','):
        key = key.lstrip()
        if key in ref_data_locations:
            ref_data_locations[key] += 1
        else:
            ref_data_locations[key] = 1
del ref_data_locations['etc.']
ref_data_locations


Out[13]:
{'BeeBase': 1,
 'BioMart ': 30,
 'ENSEMBL': 30,
 'IGenome ': 16,
 'IMG': 1,
 'JGI...': 1,
 'MG-RAST': 1,
 'NCBI ': 25,
 'Phytozome': 2,
 'RGD': 1,
 'TAIR': 1,
 'UCSC ': 31,
 'http://genomeinabottle.org/': 1,
 'various plant genome databases': 1}

In [14]:
df_ref_loc = pd.DataFrame.from_dict(ref_data_locations, orient='index')
df_ref_loc.sort(columns=0, ascending=False)


Out[14]:
0
UCSC 31
BioMart 30
ENSEMBL 30
NCBI 25
IGenome 16
Phytozome 2
IMG 1
MG-RAST 1
JGI... 1
TAIR 1
http://genomeinabottle.org/ 1
various plant genome databases 1
RGD 1
BeeBase 1

In [15]:
df_ref_loc = df_ref_loc.sort(columns=0)
plot = df_ref_loc.plot(kind='barh', legend=False, grid=False, title='Where do you fetch your data from?',
                 figsize=(10,10))


How do you fetch the reference data?


In [27]:
fetching_options = {}
for key in responses['How do you fetch the reference data?']:
    key = key.lstrip()
    if key in fetching_options:
        fetching_options[key] += 1
    else:
        fetching_options[key] = 1
fetching_options


Out[27]:
{'Combination of manual work and automated pipeline': 22,
 'I download the data manually myself': 12,
 'I use an automated pipeline for fetching the data': 7}
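
Since this question is single-choice, pandas' built-in value_counts (already used for the species question above) would give the same counts in one line:

responses['How do you fetch the reference data?'].value_counts()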

In [28]:
fetching_options = pd.DataFrame.from_dict(fetching_options, orient='index')
fetching_options.sort(columns=0, ascending=False)


Out[28]:
0
Combination of manual work and automated pipeline 22
I download the data manually myself 12
I use an automated pipeline for fetching the data 7

In [158]:
pie = fetching_options.plot(kind='pie', autopct='%1.1f%%', title='How do you fetch the reference data?', subplots=True, figsize=(7,7))



In [41]:
rel = responses.groupby(["Do you work only with one species, or several species' genomes?", "How do you fetch the reference data?"])

In [50]:
rel.count()


Out[50]:
timestamp What kind of reference data do you use for your research? Where do you fetch your data from? Do you use any of these tools for downloading reference data? How do you keep your reference data up to date? Comments / Questions How do you structure the reference data What motivates you to use your own structure for your reference data? Where do you store your reference data? Personal data Untitled Question [Row 2]
Do you work only with one species, or several species' genomes? How do you fetch the reference data?
Only one specie Combination of manual work and automated pipeline 4 4 4 1 4 0 4 4 4 0 0
I download the data manually myself 1 1 1 0 1 1 1 0 1 0 0
I use an automated pipeline for fetching the data 2 2 2 1 2 0 2 0 2 0 0
Several species Combination of manual work and automated pipeline 18 18 18 6 15 4 18 10 18 0 0
I download the data manually myself 11 11 11 0 10 2 10 8 11 0 0
I use an automated pipeline for fetching the data 5 5 5 2 5 0 5 3 5 0 0
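
The same relationship can be read more easily from a contingency table; pd.crosstab builds one directly from the two answer columns (a sketch):

pd.crosstab(responses["Do you work only with one species, or several species' genomes?"],
            responses['How do you fetch the reference data?'])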

How do you structure the reference data


In [163]:
structure = {}
for key in responses['How do you structure the reference data'].dropna():
    key = key.lstrip()
    if key in structure:
        structure[key] += 1
    else:
        structure[key] = 1
structure


Out[163]:
{'I keep the same structure than the original source': 16,
 'I use a different structure': 24}

In [164]:
structure = pd.DataFrame.from_dict(structure, orient='index')
structure.plot(kind='pie', autopct='%1.1f%%', title='How do you structure the reference data?', subplots=True, figsize=(7,7))


Out[164]:
[Pie chart: How do you structure the reference data?]

What motivates you to use your own structure for your reference data?


In [142]:
motivation = {}
for loc in responses['What motivates you to use your own structure for your reference data?'].dropna():
    # Nasty parsing due to the poor design of the survey, shame on me...
    # Remove everything within parentheses, then split on commas
    loc = re.sub(r'\([^)]*\)', '', loc)
    for key in loc.split(','):
        key = key.lstrip()
        if key in motivation:
            motivation[key] += 1
        else:
            motivation[key] = 1
motivation


Out[142]:
{'2bit for sequence': 1,
 'Additional indexing': 1,
 "Historical - that's the way we've always done it": 1,
 "I think it's more logical": 9,
 'Just convenience ': 7,
 'Query support': 1,
 'To allow all bioinformaticians in the company to use the same reference': 1,
 'To unify data from different sources on a common structure': 19,
 'bed-detail for gene structure annotations': 1,
 'conversion': 1,
 'etc.': 1,
 'lookup performance': 1,
 'not using my own structure - borrowed it from UCSC': 1,
 'to overcome inconsistencies from different sources': 1}

In [143]:
del motivation['etc.']
# These three fragments are pieces of one free-text answer that was split on commas
del motivation['2bit for sequence']
del motivation['conversion']
del motivation['bed-detail for gene structure annotations']
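
The manual deletes could also be written as a single dict comprehension over a set of fragments to discard (a sketch):

fragments = {'etc.', '2bit for sequence', 'conversion',
             'bed-detail for gene structure annotations'}
motivation = {k: v for k, v in motivation.items() if k not in fragments}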

In [144]:
motivation


Out[144]:
{'Additional indexing': 1,
 "Historical - that's the way we've always done it": 1,
 "I think it's more logical": 9,
 'Just convenience ': 7,
 'Query support': 1,
 'To allow all bioinformaticians in the company to use the same reference': 1,
 'To unify data from different sources on a common structure': 19,
 'lookup performance': 1,
 'not using my own structure - borrowed it from UCSC': 1,
 'to overcome inconsistencies from different sources': 1}

In [145]:
motivation = pd.DataFrame.from_dict(motivation, orient='index')

In [131]:
motivation.sort(columns=0, ascending=False)


Out[131]:
0
To unify data from different sources on a common structure 19
I think it's more logical 9
Just convenience 7
To allow all bioinformaticians in the company to use the same reference 1
Historical - that's the way we've always done it 1
Query support 1
lookup performance 1
Additional indexing 1
to overcome inconsistencies from different sources 1
not using my own structure - borrowed it from UCSC 1

How do you keep your reference data up to date?


In [146]:
update = {}
for loc in responses['How do you keep your reference data up to date?'].dropna():
    # Nasty parsing due to the poor design of the survey, shame on me...
    # Remove everything within parentheses, then split on commas
    loc = re.sub(r'\([^)]*\)', '', loc)
    for key in loc.split(','):
        key = key.lstrip()
        if key in update:
            update[key] += 1
        else:
            update[key] = 1
update


Out[146]:
{'Generally only want to use one version for a given project.  Check for new versions if applicable when starting something new.': 1,
 "I don't": 1,
 'I infrequently check for new updates and fetch them manually': 1,
 'I regularly check for new updates and fetch them manually': 17,
 'I update data when I start a new project': 1,
 'I use one of the aforementioned tools to automatically check for new versions of the reference data': 5,
 'Staying consistent with versions of references  is more important than having the latest references.': 1,
 'Update as projects demand': 1,
 'but usually not in between': 1,
 'check at start of new projects': 1,
 'cron+wget/rsync for most sources': 1,
 'custom cron scripts': 1,
 'custom tool': 1,
 'do manual check : Most of the time we stick to stable version e.g. we are still using hg19 ': 1,
 'following the corresponding mailing lists': 1,
 'only when need': 1,
 'prefer working with one version across the project ': 1,
 'regular downloads quarterly': 1,
 'when i remember or someone needs a newer version': 1}

In [147]:
del update['but usually not in between']

In [150]:
for k in update:
    print(k)


check at start of new projects
I infrequently check for new updates and fetch them manually
do manual check : Most of the time we stick to stable version e.g. we are still using hg19 
I don't
regular downloads quarterly
custom cron scripts
I use one of the aforementioned tools to automatically check for new versions of the reference data
custom tool
following the corresponding mailing lists
Generally only want to use one version for a given project.  Check for new versions if applicable when starting something new.
cron+wget/rsync for most sources
I regularly check for new updates and fetch them manually
when i remember or someone needs a newer version
prefer working with one version across the project 
only when need
I update data when I start a new project
Staying consistent with versions of references  is more important than having the latest references.
Update as projects demand

Do you use any of these tools for downloading reference data?


In [176]:
tools = {}
for loc in responses['Do you use any of these tools for downloading reference data?'].dropna():
    # Nasty parsing due to the poor design of the survey, shame on me...
    # Remove everything within parentheses, then split on commas
    loc = re.sub(r'\([^)]*\)', '', loc)
    for key in loc.split(','):
        key = key.lstrip()
        if key in tools:
            tools[key] += 1
        else:
            tools[key] = 1
tools


Out[176]:
{'Bioconductor': 1,
 'Cloudbiolinux ': 5,
 'Cosmid ': 1,
 'In house stuff': 1,
 'Rcurl ': 1,
 'arvados.org': 1,
 'biomaRt ': 1,
 'but looking for stable alternatives': 1,
 'curl': 1,
 'custom': 1,
 'rsync whole directory structure from igenomes': 1,
 'wget': 1}

In [177]:
del tools['but looking for stable alternatives']

In [178]:
tools = pd.DataFrame.from_dict(tools, orient='index')
tools.plot(kind='pie', autopct='%1.1f%%', title='Do you use any of these tools for downloading reference data?', subplots=True, figsize=(7,7), legend=False)


Out[178]:
[Pie chart: Do you use any of these tools for downloading reference data?]
