In this example, we will create a flotilla study from the Illumina Bodymap 2.0 data. This is a nice dataset because it is simple yet rich: 16 different human tissues, with one sample per tissue.
These data are available from the European Bioinformatics Institute (EBI) website here (you can browse their other datasets on their Data library website). There's a link on the "Expression Atlas" view of the data to download the processed data, which is the link I use below. So let's get this data!
In [40]:
! curl http://www.ebi.ac.uk/gxa/experiments/E-MTAB-513.tsv > E-MTAB-513.tsv
! curl http://www.ebi.ac.uk/arrayexpress/files/E-MTAB-513/E-MTAB-513.sdrf.txt > E-MTAB-513.sdrf.txt
What does this look like? Let's look at the top of the file with head.
In [39]:
! head E-MTAB-513.tsv
In [41]:
! head E-MTAB-513.sdrf.txt
We'll use the pandas data analysis library to read the data. But we'll need a few extra arguments:
- skiprows=3, which skips the first three rows of the file, since they're a header describing the data rather than the data itself.
- index_col=[0, 1], which says that the index columns/row names are the 0th and 1st columns, since we're counting from zero in the computer world.
Let's import pandas and all the other packages we'll need.
In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import flotilla
In [4]:
expression = pd.read_table('E-MTAB-513.tsv', skiprows=3, index_col=[0, 1])
expression.head()
Out[4]:
Interesting side note: you'll notice that "animal ovary" is the only tissue labeled as "animal." That's because the Experimental Factor Ontology (EFO) defines "animal ovary" and "plant ovary" as separate terms. Apparently, in the Venn diagram of the organs in animals versus the organs in plants, the ovary is one of the few things that overlaps! Cool! Thanks to Nick Semenkovich for figuring this out.
For flotilla, we follow the machine learning convention of matrices in the format $(\text{samples}) \times (\text{features})$, i.e. one row per sample and one column per gene. Since this data is $(\text{features}) \times (\text{samples})$, we simply need to transpose the matrix with .T.
In [5]:
expression = expression.T
expression.head()
Out[5]:
Now, let's also replace all our NAs with 0s. This makes sense because if a gene was not detected, its expression value should be 0. The header of the original data file said:
Genes matching: 'protein_coding' exactly, specifically expressed in any Organism part above the expression level cutoff: 0.5 in experiment E-MTAB-513
So by replacing NAs with 0, we're effectively forcing everything with expression less than 0.5 down to 0. Out of curiosity, let's see what the minimum value left over after this 0.5 filter is:
In [8]:
expression.min().min()
Out[8]:
Cool, pretty close to 0.5. Now let's replace all NAs with 0, with fillna(0).
In [9]:
expression = expression.fillna(0)
Finally, we will add 1 and log-transform the data so it's closer to normally distributed, since gene expression data is known to be approximately log-normal.
In [10]:
expression = np.log2(expression + 1)
The other thing that will make this data simpler to work with is making the columns of the data use only the unique ENSEMBL IDs, like "ENSG00000280433". First, let's check whether the common gene names are actually unique. We'll expand the column names out into a list of tuples using expression.columns.tolist().
In [11]:
ensembl_ids = pd.Index([a for a, b in expression.columns.tolist()])
gene_names = pd.Index([b for a, b in expression.columns.tolist()])
In [12]:
len(ensembl_ids.unique())
Out[12]:
In [13]:
len(gene_names.unique())
Out[13]:
So there are fewer unique gene names than ENSEMBL IDs, meaning we should use the ENSEMBL IDs as the unique identifiers. We'll do this by resetting the columns of expression, and creating metadata about the expression features, stored as expression_feature_data.
First, let's reassign the columns to the ensembl_ids we created before.
In [14]:
expression.columns = ensembl_ids
Now let's create the expression_feature_data DataFrame, and add a 'gene_name' column for renaming the features.
In [15]:
expression_feature_data = pd.DataFrame(index=ensembl_ids)
expression_feature_data['gene_name'] = gene_names
expression_feature_data.head()
Out[15]:
For flotilla, every project is required to have metadata. We'll create one from scratch using the sample names from the expression data.
In [16]:
metadata = pd.DataFrame(index=expression.index)
metadata.head()
Out[16]:
The first category we'll use is pretty straightforward: it'll just be the name of the tissue. We can use the index as the 'phenotype'.
In [18]:
metadata['phenotype'] = metadata.index
metadata.head()
Out[18]:
Next, let's add some categories to the data, grouping together tissue types that share structure or function. My awesome MD/PhD friend Cynthia Hsu (you have to search for her name on the webpage) came up with these categories.
In [19]:
# All of these tissue types are part of the reproductive system
metadata['reproductive'] = metadata.phenotype.isin(['animal ovary', 'testis'])
# All of these tissue types generate hormones
metadata['hormonal'] = metadata.phenotype.isin(['animal ovary', 'testis', 'adrenal gland', 'thyroid'])
# These tissues are part of the immune system
metadata['immune'] = metadata.phenotype.isin(['leukocyte', 'thyroid', 'lymph node'])
# These tissues are fatty
metadata['fatty'] = metadata.phenotype.isin(['adipose tissue', 'brain', 'breast'])
# These tissues contain either smooth (involuntary) or skeletal (voluntary) muscle
metadata['muscle'] = metadata.phenotype.isin(['colon', 'heart', 'prostate', 'skeletal muscle'])
# These tissues' main function is to filter blood in some way
metadata['filtration'] = metadata.phenotype.isin(['colon', 'kidney', 'liver'])
# These tissues have high blood flow to them, compared to other tissues
metadata['high_blood_flow'] = metadata.phenotype.isin(['brain', 'colon', 'kidney', 'liver', 'lung'])
Now we need to choose colors for the data. Since we have 16 samples, none of the usual ColorBrewer palettes will work, because the largest one, Paired, has only 12 colors. So we'll use a human-friendly version of the hue-lightness-saturation (hls) scale, called "husl". Read more at the seaborn color tutorial.
In [20]:
colors = sns.color_palette('husl', len(expression.index))
sns.palplot(colors)
Let's create an iterator so we can easily loop over this list of colors without having to reference indices.
In [21]:
colors_iter = iter(colors)
Finally, let's build the phenotype-to-color mapping as a dictionary.
In [22]:
phenotype_to_color = {phenotype: next(colors_iter) for phenotype in metadata.phenotype}
phenotype_to_color
Out[22]:
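Now we have everything we need to create the flotilla Study: the sample metadata, the expression matrix, the feature metadata, and the phenotype-to-color mapping.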
In [26]:
study = flotilla.Study(metadata, expression_data=expression,
metadata_phenotype_to_color=phenotype_to_color,
expression_feature_data=expression_feature_data,
expression_feature_rename_col='gene_name',
species='hg19')
study.expression.feature_data.head()
Out[26]:
In [27]:
study.interactive_pca()
Out[27]:
In [33]:
study.plot_gene('RBFOX1')
In [34]:
study.plot_gene('ADIPOQ')
In [35]:
study.plot_gene('OCA2')
In [37]:
study.plot_gene('MAPT')
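Finally, let's poke around the original ArrayExpress sample annotation (the SDRF file we downloaded earlier) to see what other metadata is available for these samples.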
In [42]:
era_metadata = pd.read_table("E-MTAB-513.sdrf.txt")
In [44]:
from pprint import pprint
In [45]:
pprint(sorted(era_metadata.columns.tolist()))
In [46]:
era_metadata['Comment[biosource provider]']
Out[46]:
In [47]:
era_metadata['Scan Name']
Out[47]:
In [48]:
era_metadata['Material Type']
Out[48]:
In [49]:
era_metadata['Material Type.1']
Out[49]:
In [50]:
era_metadata.Description
Out[50]:
In [55]:
study.interactive_clustermap()
In [ ]:
study.save('bodymap2')
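Once the study is saved, it can be reloaded in a later session without repeating any of the steps above. A minimal sketch, assuming flotilla's embark function and the 'bodymap2' name we saved under:
In [ ]:
# Reload the saved datapackage by name (assumes the 'bodymap2' name used in study.save above)
study = flotilla.embark('bodymap2')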