This is the Python code I used to analyse the data. You can find Genomics_Reference_Data-Form_responses.tsv on my GitHub together with this file. The file on GitHub already has the personal information removed.
In [1]:
import csv
responses = []
# On Python 3 the csv module expects text mode with newline='' (not 'rb'/'wb')
with open('Genomics_Reference_Data-Form_responses.tsv', 'r', newline='') as fh:
    _responses = csv.reader(fh, delimiter='\t')
    for row in _responses:
        # Remove columns 2 and 3 (name and email address of the participant)
        _row = row[:1] + row[3:]
        responses.append(_row)
# Save the responses file without personal information to share on GitHub :-)
with open('Genomics_Reference_Data-Form_responses_no_personal.tsv', 'w', newline='') as fh:
    _responses = csv.writer(fh, delimiter='\t')
    for row in responses:
        _responses.writerow(row)
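The column-dropping step above can be tried without touching the real file. This is a minimal sketch, assuming a made-up two-line TSV held in memory; `drop_columns` and the sample row are hypothetical names and data, not part of the survey.

```python
import csv
import io

def drop_columns(rows, drop):
    """Yield each row without the 0-based column indices listed in `drop`."""
    drop = set(drop)
    for row in rows:
        yield [cell for i, cell in enumerate(row) if i not in drop]

# Made-up sample standing in for the real responses file
sample = (
    "Timestamp\tName\tEmail\tAnswer\n"
    "01/02/2014 10:00:00\tJane\tjane@example.org\tSeveral\n"
)

reader = csv.reader(io.StringIO(sample), delimiter='\t')
out = io.StringIO()
writer = csv.writer(out, delimiter='\t', lineterminator='\n')
# Indices 1 and 2 are the second and third columns (name and email)
writer.writerows(drop_columns(reader, [1, 2]))
cleaned = out.getvalue()
print(cleaned)
```

Passing the indices explicitly makes it easier to check that only the name and email columns are removed.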
Load and study the data, first exploration:
In [2]:
import pandas as pd
# Dates are in the format DD/MM/YY H, so you need to tell Pandas about it, or it will treat
# the date and hour as separate columns
responses = pd.read_csv('Genomics_Reference_Data-Form_responses_no_personal.tsv', sep='\t',
                        parse_dates={'timestamp': [0]})
In [3]:
responses
Out[3]:
In [4]:
import datetime
participation = responses.groupby(responses['timestamp'].map(lambda x: datetime.date(x.year, x.month, x.day)))
In [5]:
participation.count()['timestamp']
Out[5]:
In [6]:
participation.count()['timestamp'].plot(kind='bar', title='Survey participation', grid=False)
Out[6]:
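The per-day participation count can also be sketched with the standard library alone. The timestamps below are invented, and the `'%d/%m/%Y %H:%M:%S'` layout is an assumption about how the form exports dates; adjust `FORMAT` to whatever the file actually contains.

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps; the real ones live in the first column of the TSV
stamps = ['03/02/2014 09:15:00', '03/02/2014 17:40:12', '04/02/2014 08:01:30']
FORMAT = '%d/%m/%Y %H:%M:%S'  # assumed day-first layout

# Truncate each timestamp to its calendar day and count the occurrences
per_day = Counter(datetime.strptime(s, FORMAT).date() for s in stamps)

for day, n in sorted(per_day.items()):
    print(day.isoformat(), n)
```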
In [7]:
species = responses['Do you work only with one species, or several species\' genomes?']
In [8]:
species.value_counts().plot(kind='pie', autopct='%1.1f%%', title='Do you work with single or multiple species', figsize=(7,7))
Out[8]:
What kind of reference data do you use for your research?
I will group the data in terms of: how many groups use each kind of reference data?
In [9]:
ref_data_groups = responses['What kind of reference data do you use for your research?']
In [10]:
import re
ref_data = {}
for group in ref_data_groups:
    # Nasty parsing due to the wrong design of the survey, shame on me...
    # Remove everything within parentheses and then split by comma
    group = re.sub(r'\([^)]*\)', '', group)
    data = group.split(',')
    for key in data:
        if key in ref_data:
            ref_data[key] += 1
        else:
            ref_data[key] = 1
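The same parse-and-count pattern recurs for several questions, so it could be factored into one helper. This is a minimal sketch using `collections.Counter`; `count_choices` and the sample answers are hypothetical, and unlike the cell above it strips whitespace on both sides of each choice and skips missing answers.

```python
import re
from collections import Counter

def count_choices(answers):
    """Count comma-separated choices, ignoring any text in parentheses."""
    counts = Counter()
    for answer in answers:
        if not isinstance(answer, str):
            continue  # skip missing answers such as None or NaN
        answer = re.sub(r'\([^)]*\)', '', answer)
        parts = (part.strip() for part in answer.split(','))
        counts.update(part for part in parts if part)
    return counts

# Invented answers in the style of a multiple-choice form field
sample = [
    'Genome sequence (FASTA), Gene annotations (GTF, GFF)',
    'Genome sequence',
    None,
]
counts = count_choices(sample)
print(counts.most_common())
```

Stripping both sides matters because removing a parenthesised suffix leaves a trailing space, which would otherwise make `'Genome sequence '` and `'Genome sequence'` count as different keys.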
In [11]:
df_ref_data = pd.DataFrame.from_dict(ref_data, orient='index')
df_ref_data
Out[11]:
In [12]:
# DataFrame.sort() was removed from pandas; sort_values() replaces it
df_ref_data = df_ref_data.sort_values(by=0)
plot = df_ref_data.plot(kind='barh', legend=False, grid=False, title='Number of groups using each kind of reference data',
                        figsize=(10,10))
Where do you fetch your reference data from?
In [13]:
ref_data_locations = {}
for loc in responses['Where do you fetch your data from?']:
    # Nasty parsing due to the wrong design of the survey, shame on me...
    # Remove everything within parentheses and then split by comma
    loc = re.sub(r'\([^)]*\)', '', loc)
    data = loc.split(',')
    for key in data:
        key = key.lstrip()
        if key in ref_data_locations:
            ref_data_locations[key] += 1
        else:
            ref_data_locations[key] = 1
del ref_data_locations['etc.']
ref_data_locations
Out[13]:
In [14]:
df_ref_loc = pd.DataFrame.from_dict(ref_data_locations, orient='index')
df_ref_loc.sort_values(by=0, ascending=False)
Out[14]:
In [15]:
df_ref_loc = df_ref_loc.sort_values(by=0)
plot = df_ref_loc.plot(kind='barh', legend=False, grid=False, title='Where do you fetch your data from?',
                       figsize=(10,10))
How do you fetch the reference data?
In [27]:
fetching_options = {}
for key in responses['How do you fetch the reference data?']:
    key = key.lstrip()
    if key in fetching_options:
        fetching_options[key] += 1
    else:
        fetching_options[key] = 1
fetching_options
Out[27]:
In [28]:
fetching_options = pd.DataFrame.from_dict(fetching_options, orient='index')
fetching_options.sort_values(by=0, ascending=False)
Out[28]:
In [158]:
pie = fetching_options.plot(kind='pie', autopct='%1.1f%%', title='How do you fetch the reference data?', subplots=True, figsize=(7,7))
In [41]:
rel = responses.groupby(["Do you work only with one species, or several species' genomes?", "How do you fetch the reference data?"])
In [50]:
rel.count()
Out[50]:
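The grouping above relates two answers per participant. The same cross-tabulation can be sketched without pandas by counting answer pairs; the rows below are invented and only illustrate the shape of the result.

```python
from collections import Counter

# Hypothetical (species answer, fetching answer) pairs, one per participant
rows = [
    ('One species', 'Manually'),
    ('Several species', 'Scripts'),
    ('Several species', 'Scripts'),
    ('One species', 'Scripts'),
]

# Each distinct pair becomes one cell of the cross-tabulation
pairs = Counter(rows)

for (species, method), n in sorted(pairs.items()):
    print('{:<16} {:<10} {}'.format(species, method, n))
```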
How do you structure the reference data?
In [163]:
structure = {}
for key in responses['How do you structure the reference data'].dropna():
    key = key.lstrip()
    if key in structure:
        structure[key] += 1
    else:
        structure[key] = 1
structure
Out[163]:
In [164]:
structure = pd.DataFrame.from_dict(structure, orient='index')
structure.plot(kind='pie', autopct='%1.1f%%', title='How do you structure the reference data?', subplots=True, figsize=(7,7))
Out[164]:
What motivates you to use your own structure for your reference data?
In [142]:
motivation = {}
for loc in responses['What motivates you to use your own structure for your reference data?'].dropna():
    # Nasty parsing due to the wrong design of the survey, shame on me...
    # Remove everything within parentheses and then split by comma
    loc = re.sub(r'\([^)]*\)', '', loc)
    data = loc.split(',')
    for key in data:
        key = key.lstrip()
        if key in motivation:
            motivation[key] += 1
        else:
            motivation[key] = 1
motivation
Out[142]:
In [143]:
del(motivation['etc.'])
# These entries belong to the same question
del(motivation['2bit for sequence'])
del(motivation['conversion'])
del(motivation['bed-detail for gene structure annotations'])
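Deleting keys one by one raises `KeyError` if the cell is run twice or if a fragment is absent. A more forgiving variant, sketched here on an invented dictionary, uses `dict.pop` with a default; the keys and counts are illustrative only.

```python
# Hypothetical counts, including leftovers produced by splitting on commas
counts = {'etc.': 3, 'Easier to use': 5, 'conversion': 1}

# Fragments to discard; 'not present' shows that missing keys are tolerated
unwanted = ['etc.', 'conversion', 'not present']

for key in unwanted:
    counts.pop(key, None)  # no KeyError when the key is already gone

print(counts)
```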
In [144]:
motivation
Out[144]:
In [145]:
motivation = pd.DataFrame.from_dict(motivation, orient='index')
In [131]:
motivation.sort_values(by=0, ascending=False)
Out[131]:
How do you keep your reference data up to date?
In [146]:
update = {}
for loc in responses['How do you keep your reference data up to date?'].dropna():
    # Nasty parsing due to the wrong design of the survey, shame on me...
    # Remove everything within parentheses and then split by comma
    loc = re.sub(r'\([^)]*\)', '', loc)
    data = loc.split(',')
    for key in data:
        key = key.lstrip()
        if key in update:
            update[key] += 1
        else:
            update[key] = 1
update
Out[146]:
In [147]:
del(update['but usually not in between'])
In [150]:
for k in update.keys():
    print(k)
Do you use any of these tools for downloading reference data?
In [176]:
tools = {}
for loc in responses['Do you use any of these tools for downloading reference data?'].dropna():
    # Nasty parsing due to the wrong design of the survey, shame on me...
    # Remove everything within parentheses and then split by comma
    loc = re.sub(r'\([^)]*\)', '', loc)
    data = loc.split(',')
    for key in data:
        key = key.lstrip()
        if key in tools:
            tools[key] += 1
        else:
            tools[key] = 1
tools
Out[176]:
In [177]:
del(tools['but looking for stable alternatives'])
In [178]:
tools = pd.DataFrame.from_dict(tools, orient='index')
tools.plot(kind='pie', autopct='%1.1f%%', title='Do you use any of these tools for downloading reference data?', subplots=True, figsize=(7,7), legend=False)
Out[178]: