Create a flotilla study from the Illumina Bodymap 2.0 data

By: Olga Botvinnik

In this example, we will create a flotilla study from the Illumina Bodymap 2.0 data. This is a nice dataset because it is simple yet complex. It has several nice features:

  • Already processed to publically available gene expression values
  • 16 human tissue types, so its relatable to people across biological disciplines, and to non-scientists (everyone knows what a brain is)
  • Can group the tissue types as meta-types

These data are available from the European Bioinformatics Institute (EBI) website here (you can browse their other datasets on their Data library website). There's a link on the "Expression Atlas" view of the data to download the processed data, which is the link I use below. So let's get this data!

Download the data and supplement


In [40]:
! curl http://www.ebi.ac.uk/gxa/experiments/E-MTAB-513.tsv > E-MTAB-513.tsv
! curl http://www.ebi.ac.uk/arrayexpress/files/E-MTAB-513/E-MTAB-513.sdrf.txt>  E-MTAB-513.sdrf.txt


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 39860  100 39860    0     0  32219      0  0:00:01  0:00:01 --:--:-- 32249

What does this look like? Let's look at the top of the file with head.


In [39]:
! head E-MTAB-513.tsv


# Expression Atlas version: 0.1.4-SNAPSHOT
# Query: Genes matching: 'protein_coding' exactly, specifically expressed in any Organism part above the expression level cutoff: 0.5 in experiment E-MTAB-513
# Timestamp: Sun, 11-Jan-2015 23:58:59
Gene ID	Gene Name	adipose tissue	adrenal gland	animal ovary	brain	breast	colon	heart	kidney	leukocyte	liver	lung	lymph node	prostate	skeletal muscle	testis	thyroid
ENSG00000000003	TSPAN6	21	5	21	5	16	12	2	13		31	11	5	13		18	15
ENSG00000000005	TNMD	16	5			44	1									0.6	
ENSG00000000419	DPM1	20	31	33	25	26	25	23	35	29	17	25	27	32	18	42	44
ENSG00000000457	SCYL3	1	3	4	2	3	1	1	3	4	2	1	3	3	0.9	3	3
ENSG00000000460	C1orf112	0.7	0.8	0.8		0.6			0.6	0.6	0.7		0.9	0.9		5	0.7
ENSG00000000938	FGR	9	7	1	1	2	1	2	3	147	2	26	11	2	0.7	0.8	2

In [41]:
! head E-MTAB-513.sdrf.txt


Source Name	Material Type	Description	Characteristics[organism]	Characteristics[age]	Unit[TimeUnit]	Characteristics[organism part]	Characteristics[sex]	Characteristics[ethnic group]	Comment[biosource provider]	Protocol REF	Extract Name	Material Type	Comment[LIBRARY_LAYOUT]	Comment[ORIENTATION]	Comment[NOMINAL_LENGTH]	Comment[NOMINAL_SDEV]	Comment[LIBRARY_SOURCE]	Comment[LIBRARY_STRATEGY]	Comment[LIBRARY_SELECTION]	Comment[spiked_in]	Comment[insertSize]	Comment[LIBRARYPREP]	Protocol REF	Performer	Assay Name	Technology Type	Comment[ENA_EXPERIMENT]	Comment[CYCLE_COUNT]	Comment[SPOT_LENGTH]	Comment[SEQUENCE_LENGTH]	Comment[READ_INDEX_0_READ_TYPE]	Comment[READ_INDEX_0_READ_CLASS]	Comment[READ_INDEX_0_BASE_COORD]	Comment[READ_INDEX_1_READ_TYPE]	Comment[READ_INDEX_1_READ_CLASS]	Comment[READ_INDEX_1_BASE_COORD]	Protocol REF	Scan Name	Comment[FASTQ_URI]	Comment[ENA_RUN]	Comment[SUBMITTED_FILE_NAME]	Comment[quality_scoring_system]	Comment[quality_encoding]	Comment[ascii_offset]	Factor Value[organism part]
HCT20142	organism part	1x75 single mRNA-Seq	Homo sapiens	60	year	kidney	female	Caucasian	"Human kidney total RNA, lot 0908002"	P-MTAB-19502	HCT20142_SINGLE	total_RNA	SINGLE	  	  	  	TRANSCRIPTOMIC	RNA-Seq	cDNA	~ 0.5% phiX DNA	~210 bps	mRNA-Seq	P-MTAB-19506	ILLUMINA-CA	A20G14AAX1_s6	sequencing assay	ERX011219	75	75	75	  	  	  	  	  	  	P-MTAB-19507	75bp_mRNA_Seq_FCA_s_6_sequence.txt.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR030/ERR030893/ERR030893.fastq.gz	ERR030893	75bp_mRNA_Seq_FCA_s_6_sequence.txt.gz	phred	ascii	@	kidney
HCT20142	organism part	2x50 PE mRNA-seq READ1	Homo sapiens	60	year	kidney	female	Caucasian	"Human kidney total RNA, lot 0908002"	P-MTAB-19502	HCT20142_PAIRED	total_RNA	PAIRED	5'-3'-3'-5'	184	0	TRANSCRIPTOMIC	RNA-Seq	cDNA	~ 0.5% phiX DNA	~210 bps	mRNA-Seq	P-MTAB-19506	ILLUMINA-CA	B20G06AAX1_s6	sequencing assay	ERX011182	50	100	100	Forward	Application	1	Reverse	Application Run	51	P-MTAB-19507	50bp_PE_mRNA_Seq_FCB_s_6_1_sequence.txt.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR030/ERR030885/ERR030885_1.fastq.gz	ERR030885	50bp_PE_mRNA_Seq_FCB_s_6_1_sequence.txt.gz	phred	ascii	@	kidney
HCT20142	organism part	2x50 PE mRNA-seq READ2	Homo sapiens	60	year	kidney	female	Caucasian	"Human kidney total RNA, lot 0908002"	P-MTAB-19502	HCT20142_PAIRED	total_RNA	PAIRED	5'-3'-3'-5'	184	0	TRANSCRIPTOMIC	RNA-Seq	cDNA	~ 0.5% phiX DNA	~210 bps	mRNA-Seq	P-MTAB-19506	ILLUMINA-CA	B20G06AAX1_s6	sequencing assay	ERX011182	50	100	100	Forward	Application	1	Reverse	Application Run	51	P-MTAB-19507	50bp_PE_mRNA_Seq_FCB_s_6_2_sequence.txt.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR030/ERR030885/ERR030885_2.fastq.gz	ERR030885	50bp_PE_mRNA_Seq_FCB_s_6_2_sequence.txt.gz	phred	ascii	@	kidney
HCT20143	organism part	1x75 single mRNA-Seq	Homo sapiens	77	year	heart	male	Caucasian	"Human heart total RNA, lot 07040023"	P-MTAB-19502	HCT20143_SINGLE	total_RNA	SINGLE	  	  	  	TRANSCRIPTOMIC	RNA-Seq	cDNA	~ 0.5% phiX DNA	~210 bps	mRNA-Seq	P-MTAB-19506	ILLUMINA-CA	A20G14AAX1_s7	sequencing assay	ERX011183	75	75	75	  	  	  	  	  	  	P-MTAB-19507	75bp_mRNA_Seq_FCA_s_7_sequence.txt.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR030/ERR030894/ERR030894.fastq.gz	ERR030894	75bp_mRNA_Seq_FCA_s_7_sequence.txt.gz	phred	ascii	@	heart
HCT20143	organism part	2x50 PE mRNA-seq READ1	Homo sapiens	77	year	heart	male	Caucasian	"Human heart total RNA, lot 07040023"	P-MTAB-19502	HCT20143_PAIRED	total_RNA	PAIRED	5'-3'-3'-5'	188	0	TRANSCRIPTOMIC	RNA-Seq	cDNA	~ 0.5% phiX DNA	~210 bps	mRNA-Seq	P-MTAB-19506	ILLUMINA-CA	B20G06AAX1_s7	sequencing assay	ERX011197	50	100	100	Forward	Application	1	Reverse	Application Run	51	P-MTAB-19507	50bp_PE_mRNA_Seq_FCB_s_7_1_sequence.txt.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR030/ERR030886/ERR030886_1.fastq.gz	ERR030886	50bp_PE_mRNA_Seq_FCB_s_7_1_sequence.txt.gz	phred	ascii	@	heart
HCT20143	organism part	2x50 PE mRNA-seq READ2	Homo sapiens	77	year	heart	male	Caucasian	"Human heart total RNA, lot 07040023"	P-MTAB-19502	HCT20143_PAIRED	total_RNA	PAIRED	5'-3'-3'-5'	188	0	TRANSCRIPTOMIC	RNA-Seq	cDNA	~ 0.5% phiX DNA	~210 bps	mRNA-Seq	P-MTAB-19506	ILLUMINA-CA	B20G06AAX1_s7	sequencing assay	ERX011197	50	100	100	Forward	Application	1	Reverse	Application Run	51	P-MTAB-19507	50bp_PE_mRNA_Seq_FCB_s_7_2_sequence.txt.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR030/ERR030886/ERR030886_2.fastq.gz	ERR030886	50bp_PE_mRNA_Seq_FCB_s_7_2_sequence.txt.gz	phred	ascii	@	heart
HCT20144	organism part	1x75 single mRNA-Seq	Homo sapiens	37	year	liver	male	Caucasian	"Human liver total RNA, lot 040000124"	P-MTAB-19502	HCT20144_SINGLE	total_RNA	SINGLE	  	  	  	TRANSCRIPTOMIC	RNA-Seq	cDNA	~ 0.5% phiX DNA	~210 bps	mRNA-Seq	P-MTAB-19506	ILLUMINA-CA	A20G14AAX1_s8	sequencing assay	ERX011211	75	75	75	  	  	  	  	  	  	P-MTAB-19507	75bp_mRNA_Seq_FCA_s_8_sequence.txt.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR030/ERR030895/ERR030895.fastq.gz	ERR030895	75bp_mRNA_Seq_FCA_s_8_sequence.txt.gz	phred	ascii	@	liver
HCT20144	organism part	2x50 PE mRNA-seq READ1	Homo sapiens	37	year	liver	male	Caucasian	"Human liver total RNA, lot 040000124"	P-MTAB-19502	HCT20144_PAIRED	total_RNA	PAIRED	5'-3'-3'-5'	174	0	TRANSCRIPTOMIC	RNA-Seq	cDNA	~ 0.5% phiX DNA	~210 bps	mRNA-Seq	P-MTAB-19506	ILLUMINA-CA	B20G06AAX1_s8	sequencing assay	ERX011229	50	100	100	Forward	Application	1	Reverse	Application Run	51	P-MTAB-19507	50bp_PE_mRNA_Seq_FCB_s_8_1_sequence.txt.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR030/ERR030887/ERR030887_1.fastq.gz	ERR030887	50bp_PE_mRNA_Seq_FCB_s_8_1_sequence.txt.gz	phred	ascii	@	liver
HCT20144	organism part	2x50 PE mRNA-seq READ2	Homo sapiens	37	year	liver	male	Caucasian	"Human liver total RNA, lot 040000124"	P-MTAB-19502	HCT20144_PAIRED	total_RNA	PAIRED	5'-3'-3'-5'	174	0	TRANSCRIPTOMIC	RNA-Seq	cDNA	~ 0.5% phiX DNA	~210 bps	mRNA-Seq	P-MTAB-19506	ILLUMINA-CA	B20G06AAX1_s8	sequencing assay	ERX011229	50	100	100	Forward	Application	1	Reverse	Application Run	51	P-MTAB-19507	50bp_PE_mRNA_Seq_FCB_s_8_2_sequence.txt.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR030/ERR030887/ERR030887_2.fastq.gz	ERR030887	50bp_PE_mRNA_Seq_FCB_s_8_2_sequence.txt.gz	phred	ascii	@	liver

We'll use the pandas data analysis library to read the data. But we'll need a few caveats.

  1. The first three lines of the data don't match the tabular format, so we'll need to ignore them with skiprows=3.
  2. The row names indicating the gene name are the first two columns, so we'll need to specify that with index_col=[0, 1], which says that the index columns/row names are the 0th and 1st columns, since we're counting from zero in the computer world.

Let's import pandas and all the other packages we'll need.


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import flotilla


:0: FutureWarning: IPython widgets are experimental and may change in the future.

In [4]:
expression = pd.read_table('E-MTAB-513.tsv', skiprows=3, index_col=[0, 1])
expression.head()


Out[4]:
adipose tissue adrenal gland animal ovary brain breast colon heart kidney leukocyte liver lung lymph node prostate skeletal muscle testis thyroid
Gene ID Gene Name
ENSG00000000003 TSPAN6 21.0 5.0 21.0 5 16.0 12 2 13.0 NaN 31.0 11 5.0 13.0 NaN 18.0 15.0
ENSG00000000005 TNMD 16.0 5.0 NaN NaN 44.0 1 NaN NaN NaN NaN NaN NaN NaN NaN 0.6 NaN
ENSG00000000419 DPM1 20.0 31.0 33.0 25 26.0 25 23 35.0 29.0 17.0 25 27.0 32.0 18.0 42.0 44.0
ENSG00000000457 SCYL3 1.0 3.0 4.0 2 3.0 1 1 3.0 4.0 2.0 1 3.0 3.0 0.9 3.0 3.0
ENSG00000000460 C1orf112 0.7 0.8 0.8 NaN 0.6 NaN NaN 0.6 0.6 0.7 NaN 0.9 0.9 NaN 5.0 0.7

5 rows × 16 columns

Interesting side note: you'll notice that "animal ovary" is the only thing indicated as "animal." This is because in the Experimental Factor Ontology (EFO), they are defined as "animal ovary." This is because the EFO defines plant ovary separately. Apparently, in the venn diagram of the organs in animals versus the organs in plants, ovary is one of the few things that overlap! Cool! Thanks to Nick Semenkovich for figuring this out.

For flotilla, we follow the machine learning standard of using matrices in the format $(\text{samples}) \times (\text{features})$, like this

Since this data is $(\text{features}) \times (\text{samples})$, we simply need to transpose the matrix with .T


In [5]:
expression = expression.T
expression.head()


Out[5]:
Gene ID ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460 ENSG00000000938 ENSG00000000971 ENSG00000001036 ENSG00000001084 ENSG00000001167 ENSG00000001460 ENSG00000001461 ENSG00000001497 ENSG00000001561 ENSG00000001617 ENSG00000001626 ENSG00000001629 ENSG00000001630 ENSG00000001631 ENSG00000002016
Gene Name TSPAN6 TNMD DPM1 SCYL3 C1orf112 FGR CFH FUCA2 GCLC NFYA STPG1 NIPAL3 LAS1L ENPP4 SEMA3F CFTR ANKIB1 CYP51A1 KRIT1 RAD52
adipose tissue 21 16 20 1 0.7 9 18 27 5 6 0.7 2 6 13 17.0 NaN 6 0.7 4 1 ...
adrenal gland 5 5 31 3 0.8 7 26 24 8 5 0.9 3 8 8 6.0 NaN 7 NaN 7 4 ...
animal ovary 21 NaN 33 4 0.8 1 99 30 6 11 2.0 3 7 4 2.0 0.7 11 1.0 9 3 ...
brain 5 NaN 25 2 NaN 1 4 13 10 6 2.0 23 6 31 0.9 NaN 14 0.8 5 2 ...
breast 16 44 26 3 0.6 2 23 27 11 6 1.0 4 4 13 3.0 NaN 9 NaN 5 2 ...

5 rows × 17412 columns

Now, let's also replace all our NAs with 0s. This makes sense because if a gene is not detected, then its expression value will be 0. From the header file of the original data, it said:

Genes matching: 'protein_coding' exactly, specifically expressed in any Organism part above the expression level cutoff: 0.5 in experiment E-MTAB-513

So we're forcing everything with expression less than 0.5 down to 0. Out of curiousity, let's see what was the minimum value left over after they did this 0.5 filter:


In [8]:
expression.min().min()


Out[8]:
0.59999999999999998

Cool, pretty close to 0.5. Now let's replace all NAs with 0, with fillna(0).


In [9]:
expression = expression.fillna(0)

Finally, we will add 1 and log-transform the data, so it's closer to normally distributed. Gene expression data is known to be log-normal.


In [10]:
expression = np.log2(expression + 1)

Create metadata about the expression features

The other thing that will make this data simpler to work with is making the columns of the data just use the crazy unique ENSEMBL id like "ENSG00000280433". First, let's check to see if the common names are actually unique. We'll expand out the column names to a list of tuples using expression.columns.tolist()


In [11]:
ensembl_ids = pd.Index([a for a, b in expression.columns.tolist()])
gene_names = pd.Index([b for a, b in expression.columns.tolist()])

In [12]:
len(ensembl_ids.unique())


Out[12]:
17412

In [13]:
len(gene_names.unique())


Out[13]:
17396

So there's fewer unique gene names, meaning we should use the ENSEMBL ids for the unique IDs. We'll do this by resetting the columns of expression, and creating metadata about the expression features, stored as expression_feature_data.

First, let's reassign the columns as the ensembl_ids we created from before.


In [14]:
expression.columns = ensembl_ids

Now let's create the expression_feature_data DataFrame, and add a column of 'gene_name' for the renamed feature.


In [15]:
expression_feature_data = pd.DataFrame(index=ensembl_ids)
expression_feature_data['gene_name'] = gene_names
expression_feature_data.head()


Out[15]:
gene_name
ENSG00000000003 TSPAN6
ENSG00000000005 TNMD
ENSG00000000419 DPM1
ENSG00000000457 SCYL3
ENSG00000000460 C1orf112

5 rows × 1 columns

Preparing the metadata

For flotilla, every project is required to have metadata. We'll create one from scratch using the sample names from the expression data.


In [16]:
metadata = pd.DataFrame(index=expression.index)
metadata.head()


Out[16]:
Index([u'adipose tissue', u'adrenal gland', u'animal ovary', u'brain', u'breast'], dtype='object') Empty DataFrame

5 rows × 0 columns

The first category we'll use is pretty straightforward, it'll just be the name of the tissue. We can just use the index as the 'phenotype'.


In [18]:
metadata['phenotype'] = metadata.index
metadata.head()


Out[18]:
phenotype
adipose tissue adipose tissue
adrenal gland adrenal gland
animal ovary animal ovary
brain brain
breast breast

5 rows × 1 columns

Next, let's add some categories on the data, grouping different tissue types together that have the same structure or function. My awesome MD/PhD friend Cynthia Hsu (you have to search for her name on the webpage) came up with these categories.


In [19]:
# All of these tissue types are part of the reproductive system
metadata['reproductive'] = metadata.phenotype.isin(['animal ovary', 'testis'])

# ALl of these tissue types generate hormones
metadata['hormonal'] = metadata.phenotype.isin(['animal ovary', 'testis', 'adrenal gland', 'thyroid'])

# These tissues are part of in the immune system
metadata['immune'] = metadata.phenotype.isin(['leukocyte', 'thyroid', 'lymph node'])

# These tissues are fatty
metadata['fatty'] = metadata.phenotype.isin(['adipose tissue', 'brain', 'breast'])

# These tissues contain either smooth (involuntary) or skeletal (voluntary) muscle
metadata['muscle'] = metadata.phenotype.isin(['colon', 'heart', 'prostate', 'skeletal muscle'])

# These tissues' main function is to filter blood in some way
metadata['filtration'] = metadata.phenotype.isin(['colon', 'kidney', 'liver'])

# These tissues have high blood flow to them, compared to other tissues
metadata['high_blood_flow'] = metadata.phenotype.isin(['brain', 'colon', 'kidney', 'liver', 'lung'])

Assign colors to the phenotypes

Now we need to choose colors for the data. Since we have 16 samples, none of the usual ColorBrewer colors apply because the maximum color set there is Paired, which is 12 colors. So we'll use a human-friendly version of the hue-lightness-saturation (hls) scale, called "husl". Read more at the seaborn color tutorial.


In [20]:
colors = sns.color_palette('husl', len(expression.index))
sns.palplot(colors)


Let's create an iterator so we can easily loop over this list of colors without having to reference indices.


In [21]:
colors_iter = iter(colors)

Finally, let's make the phenotype to color mapping as a dictionary.


In [22]:
phenotype_to_color = {phenotype: colors_iter.next() for phenotype in metadata.phenotype}
phenotype_to_color


Out[22]:
{'adipose tissue': (0.9677975592919913,
  0.44127456009157356,
  0.5358103155058701),
 'adrenal gland': (0.9688417625390765,
  0.46710871459052145,
  0.1965441952393453),
 'animal ovary': (0.8087954113106306, 0.5634700050056693, 0.19502642696727285),
 'brain': (0.7008633391290917, 0.6080365980075504, 0.19419512204856468),
 'breast': (0.5920891529639701, 0.6418467016378244, 0.1935069134991043),
 'colon': (0.4225883781014591, 0.677943504931845, 0.19271544738133076),
 'heart': (0.19783576093349015, 0.6955516966063037, 0.3995301037444499),
 'kidney': (0.20518528131112984, 0.6851497738530601, 0.5562527763557912),
 'leukocyte': (0.21044753832183283, 0.6773105080456748, 0.6433941168468681),
 'liver': (0.21576108198845112, 0.6690446872415565, 0.7201192992055431),
 'lung': (0.22335772267769388, 0.6565792317435265, 0.8171355503265633),
 'lymph node': (0.3531380715309417, 0.6201408220829481, 0.9586195235634788),
 'prostate': (0.6423044349219739, 0.5497680051256467, 0.9582651433656727),
 'skeletal muscle': (0.8397010947263905,
  0.4529020995703274,
  0.9578638063653008),
 'testis': (0.9603888539940703, 0.3814317878772117, 0.8683117650835491),
 'thyroid': (0.9645179518697552, 0.41602112206844516, 0.708820872610067)}

Make a flotilla study!


In [26]:
study = flotilla.Study(metadata, expression_data=expression, 
                       metadata_phenotype_to_color=phenotype_to_color, 
                       expression_feature_data=expression_feature_data, 
                       expression_feature_rename_col='gene_name',
                       species='hg19')
study.expression.feature_data.head()


2015-01-11 13:50:44	Initializing Study
2015-01-11 13:50:44	Initializing Predictor configuration manager for Study
2015-01-11 13:50:44	Predictor ExtraTreesClassifier is of type <class 'sklearn.ensemble.forest.ExtraTreesClassifier'>
2015-01-11 13:50:44	Added ExtraTreesClassifier to default predictors
2015-01-11 13:50:44	Predictor ExtraTreesRegressor is of type <class 'sklearn.ensemble.forest.ExtraTreesRegressor'>
2015-01-11 13:50:44	Added ExtraTreesRegressor to default predictors
2015-01-11 13:50:44	Predictor GradientBoostingClassifier is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'>
2015-01-11 13:50:44	Added GradientBoostingClassifier to default predictors
2015-01-11 13:50:44	Predictor GradientBoostingRegressor is of type <class 'sklearn.ensemble.gradient_boosting.GradientBoostingRegressor'>
2015-01-11 13:50:44	Added GradientBoostingRegressor to default predictors
2015-01-11 13:50:44	Loading metadata
2015-01-11 13:50:44	Loading species metadata from ~/flotilla_packages
https://s3-us-west-2.amazonaws.com/flotilla-projects/ercc/ERCC_Controls.txt has not been downloaded before.
	Downloading now to /home/jovyan/flotilla_projects/hg19/ERCC_Controls.txt
No phenotype to marker (matplotlib plotting symbol) was provided, so each phenotype will be plotted as a circle in visualizations.
Error loading species hg19 data 
2015-01-11 13:50:49	Loading expression data
2015-01-11 13:50:49	Initializing expression
2015-01-11 13:50:49	Done initializing expression
2015-01-11 13:50:50	Successfully initialized a Study object!
Out[26]:
biomark_neural_panel confident_rbp gencode_id gene_name gene_status gene_type havana_gene level rbp ribosomal ribosomal_subunit synapse tag transcript_id transcript_name transcript_status transcript_type transcription_factor
ENSG00000000003 False False ENSG00000000003.10 TSPAN6 KNOWN protein_coding OTTHUMG00000022002.1 2 False False False False None ENSG00000000003.10 TSPAN6 KNOWN protein_coding False
ENSG00000000005 False False ENSG00000000005.5 TNMD KNOWN protein_coding OTTHUMG00000022001.1 2 False False False False None ENSG00000000005.5 TNMD KNOWN protein_coding False
ENSG00000000419 False False ENSG00000000419.8 DPM1 KNOWN protein_coding OTTHUMG00000032742.2 2 False False False False None ENSG00000000419.8 DPM1 KNOWN protein_coding False
ENSG00000000457 False False ENSG00000000457.9 SCYL3 KNOWN protein_coding OTTHUMG00000035941.4 2 False False False False None ENSG00000000457.9 SCYL3 KNOWN protein_coding False
ENSG00000000460 False False ENSG00000000460.12 C1orf112 KNOWN protein_coding OTTHUMG00000035821.6 2 False False False False None ENSG00000000460.12 C1orf112 KNOWN protein_coding False

5 rows × 18 columns


In [27]:
study.interactive_pca()


featurewise : False
y_pc : 2
data_type : expression
std_multiplier : 2.0
most_variant_features : False
show_point_labels : False
plot_violins : False
sample_subset : all_samples
feature_subset : variant
bokeh : False
color_samples_by : phenotype
x_pc : 1
list_link : 
Out[27]:
<function flotilla.visualize.ipython_interact.do_interact>

In [33]:
study.plot_gene('RBFOX1')



In [34]:
study.plot_gene('ADIPOQ')



In [35]:
study.plot_gene('OCA2')



In [37]:
study.plot_gene('MAPT')



In [42]:
era_metadata = pd.read_table("E-MTAB-513.sdrf.txt")

In [44]:
from pprint import pprint

In [45]:
pprint(sorted(era_metadata.columns.tolist()))


['Assay Name',
 'Characteristics[age]',
 'Characteristics[ethnic group]',
 'Characteristics[organism part]',
 'Characteristics[organism]',
 'Characteristics[sex]',
 'Comment[CYCLE_COUNT]',
 'Comment[ENA_EXPERIMENT]',
 'Comment[ENA_RUN]',
 'Comment[FASTQ_URI]',
 'Comment[LIBRARYPREP]',
 'Comment[LIBRARY_LAYOUT]',
 'Comment[LIBRARY_SELECTION]',
 'Comment[LIBRARY_SOURCE]',
 'Comment[LIBRARY_STRATEGY]',
 'Comment[NOMINAL_LENGTH]',
 'Comment[NOMINAL_SDEV]',
 'Comment[ORIENTATION]',
 'Comment[READ_INDEX_0_BASE_COORD]',
 'Comment[READ_INDEX_0_READ_CLASS]',
 'Comment[READ_INDEX_0_READ_TYPE]',
 'Comment[READ_INDEX_1_BASE_COORD]',
 'Comment[READ_INDEX_1_READ_CLASS]',
 'Comment[READ_INDEX_1_READ_TYPE]',
 'Comment[SEQUENCE_LENGTH]',
 'Comment[SPOT_LENGTH]',
 'Comment[SUBMITTED_FILE_NAME]',
 'Comment[ascii_offset]',
 'Comment[biosource provider]',
 'Comment[insertSize]',
 'Comment[quality_encoding]',
 'Comment[quality_scoring_system]',
 'Comment[spiked_in]',
 'Description',
 'Extract Name',
 'Factor Value[organism part]',
 'Material Type',
 'Material Type.1',
 'Performer',
 'Protocol REF',
 'Protocol REF.1',
 'Protocol REF.2',
 'Scan Name',
 'Source Name',
 'Technology Type',
 'Unit[TimeUnit]']

In [46]:
era_metadata['Comment[biosource provider]']


Out[46]:
0                Human kidney total RNA, lot 0908002
1                Human kidney total RNA, lot 0908002
2                Human kidney total RNA, lot 0908002
3                Human heart total RNA, lot 07040023
4                Human heart total RNA, lot 07040023
5                Human heart total RNA, lot 07040023
6               Human liver total RNA, lot 040000124
7               Human liver total RNA, lot 040000124
8               Human liver total RNA, lot 040000124
9                  Human lung total RNA, lot 0904002
10                 Human lung total RNA, lot 0904002
11                 Human lung total RNA, lot 0904002
12    Human lymph node total RNA, lot 026P010305032B
13    Human lymph node total RNA, lot 026P010305032B
14    Human lymph node total RNA, lot 026P010305032B
...
49      
50      
51      
52      
53      
54      
55      
56      
57      
58      
59      
60      
61      
62      
63      
Name: Comment[biosource provider], Length: 64, dtype: object

In [47]:
era_metadata['Scan Name']


Out[47]:
0          75bp_mRNA_Seq_FCA_s_6_sequence.txt.gz
1     50bp_PE_mRNA_Seq_FCB_s_6_1_sequence.txt.gz
2     50bp_PE_mRNA_Seq_FCB_s_6_2_sequence.txt.gz
3          75bp_mRNA_Seq_FCA_s_7_sequence.txt.gz
4     50bp_PE_mRNA_Seq_FCB_s_7_1_sequence.txt.gz
5     50bp_PE_mRNA_Seq_FCB_s_7_2_sequence.txt.gz
6          75bp_mRNA_Seq_FCA_s_8_sequence.txt.gz
7     50bp_PE_mRNA_Seq_FCB_s_8_1_sequence.txt.gz
8     50bp_PE_mRNA_Seq_FCB_s_8_2_sequence.txt.gz
9     50bp_PE_mRNA_Seq_FCA_s_8_1_sequence.txt.gz
10    50bp_PE_mRNA_Seq_FCA_s_8_2_sequence.txt.gz
11         75bp_mRNA_Seq_FCB_s_1_sequence.txt.gz
12    50bp_PE_mRNA_Seq_FCA_s_7_1_sequence.txt.gz
13    50bp_PE_mRNA_Seq_FCA_s_7_2_sequence.txt.gz
14         75bp_mRNA_Seq_FCB_s_2_sequence.txt.gz
...
49    100bp_Stranded_RNA_Seq_FCA_s_2_sequence.txt.gz
50    100bp_Stranded_RNA_Seq_FCA_s_3_sequence.txt.gz
51    100bp_Stranded_RNA_Seq_FCB_s_1_sequence.txt.gz
52    100bp_Stranded_RNA_Seq_FCB_s_2_sequence.txt.gz
53    100bp_Stranded_RNA_Seq_FCA_s_4_sequence.txt.gz
54    100bp_Stranded_RNA_Seq_FCA_s_5_sequence.txt.gz
55    100bp_Stranded_RNA_Seq_FCA_s_6_sequence.txt.gz
56    100bp_Stranded_RNA_Seq_FCB_s_3_sequence.txt.gz
57    100bp_Stranded_RNA_Seq_FCB_s_4_sequence.txt.gz
58    100bp_Stranded_RNA_Seq_FCA_s_7_sequence.txt.gz
59    100bp_Stranded_RNA_Seq_FCA_s_8_sequence.txt.gz
60    100bp_Stranded_RNA_Seq_FCB_s_5_sequence.txt.gz
61    100bp_Stranded_RNA_Seq_FCB_s_6_sequence.txt.gz
62    100bp_Stranded_RNA_Seq_FCB_s_7_sequence.txt.gz
63    100bp_Stranded_RNA_Seq_FCB_s_8_sequence.txt.gz
Name: Scan Name, Length: 64, dtype: object

In [48]:
era_metadata['Material Type']


Out[48]:
0     organism part
1     organism part
2     organism part
3     organism part
4     organism part
5     organism part
6     organism part
7     organism part
8     organism part
9     organism part
10    organism part
11    organism part
12    organism part
13    organism part
14    organism part
...
49    organism part
50    organism part
51    organism part
52    organism part
53    organism part
54    organism part
55    organism part
56    organism part
57    organism part
58    organism part
59    organism part
60    organism part
61    organism part
62    organism part
63    organism part
Name: Material Type, Length: 64, dtype: object

In [49]:
era_metadata['Material Type.1']


Out[49]:
0     total_RNA
1     total_RNA
2     total_RNA
3     total_RNA
4     total_RNA
5     total_RNA
6     total_RNA
7     total_RNA
8     total_RNA
9     total_RNA
10    total_RNA
11    total_RNA
12    total_RNA
13    total_RNA
14    total_RNA
...
49    molecular_mixture
50    molecular_mixture
51    molecular_mixture
52    molecular_mixture
53    molecular_mixture
54    molecular_mixture
55    molecular_mixture
56    molecular_mixture
57    molecular_mixture
58    molecular_mixture
59    molecular_mixture
60    molecular_mixture
61    molecular_mixture
62    molecular_mixture
63    molecular_mixture
Name: Material Type.1, Length: 64, dtype: object

In [50]:
era_metadata.Description


Out[50]:
0       1x75 single mRNA-Seq
1     2x50 PE mRNA-seq READ1
2     2x50 PE mRNA-seq READ2
3       1x75 single mRNA-Seq
4     2x50 PE mRNA-seq READ1
5     2x50 PE mRNA-seq READ2
6       1x75 single mRNA-Seq
7     2x50 PE mRNA-seq READ1
8     2x50 PE mRNA-seq READ2
9     2x50 PE mRNA-seq READ1
10    2x50 PE mRNA-seq READ2
11      1x75 single mRNA-Seq
12    2x50 PE mRNA-seq READ1
13    2x50 PE mRNA-seq READ2
14      1x75 single mRNA-Seq
...
49    1x100 stranded total transcriptome preps; the ...
50    1x100 stranded total transcriptome preps; the ...
51    1x100 stranded total transcriptome preps; the ...
52    1x100 stranded total transcriptome preps; the ...
53    1x100 stranded total transcriptome preps; the ...
54    1x100 stranded total transcriptome preps; the ...
55    1x100 stranded total transcriptome preps; the ...
56    1x100 stranded total transcriptome preps; the ...
57    1x100 stranded total transcriptome preps; the ...
58    1x100 stranded total transcriptome preps; the ...
59    1x100 stranded total transcriptome preps; the ...
60    1x100 stranded total transcriptome preps; the ...
61    1x100 stranded total transcriptome preps; the ...
62    1x100 stranded total transcriptome preps; the ...
63    1x100 stranded total transcriptome preps; the ...
Name: Description, Length: 64, dtype: object

In [55]:
study.interactive_clustermap()


data_type : expression
metric : euclidean
sample_subset : all_samples
feature_subset : biomark_neural_panel
fig_height : 
scale_fig_by_data : True
fig_width : 
list_link : 
method : average

In [ ]:
study.save('bodymap2')