Access Ensembl BioMart using the biomaRt package

We use rpy2 and its R magics in an IPython Notebook to call the biomaRt package in R.

Usage:

Run Setup
Select a mart & dataset
Demo
1. All genes on Y chromosome
2. Annotate a gene list

Ref:

Blog post: Some basics of biomaRt
Table of Assemblies: http://www.ensembl.org/info/website/archives/assembly.html

Setup



In [1]:

    
import pandas as pd
%load_ext rpy2.ipython



In [16]:

    
%%R
library(biomaRt)



In [4]:

    
%load_ext version_information
%version_information pandas, rpy2









    Out[4]:




Software  Version
Python    2.7.12 64bit [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
IPython   4.1.2
OS        Linux 2.6.32 431.3.1.el6.x86_64 x86_64 with centos 6.8 Final
pandas    0.18.0
rpy2      2.7.4
Sun Nov 13 23:12:13 2016 EST

Tutorial

What marts are available?

Current build (not working when this notebook was last run; the server response could not be parsed as XML):



In [15]:

    
%%R
marts = listMarts()
head(marts)









    



Space required after the Public Identifier
SystemLiteral " or ' expected
SYSTEM or PUBLIC, the URI is missing
Opening and ending tag mismatch: hr line 7 and body
Opening and ending tag mismatch: body line 4 and html
Premature end of data in tag html line 2
Error: 1: Space required after the Public Identifier
2: SystemLiteral " or ' expected
3: SYSTEM or PUBLIC, the URI is missing
4: Opening and ending tag mismatch: hr line 7 and body
5: Opening and ending tag mismatch: body line 4 and html
6: Premature end of data in tag html line 2

Sometimes you need to specify a particular genome build (e.g., GTEx v6 used GENCODE v19, which was based on GRCh37.p13 = Ensembl 74):



In [4]:

    
%%R
marts.v74 = listMarts(host="dec2013.archive.ensembl.org")
head(marts.v74)









    





               biomart               version
1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 74
2     ENSEMBL_MART_SNP  Ensembl Variation 74
3 ENSEMBL_MART_FUNCGEN Ensembl Regulation 74
4    ENSEMBL_MART_VEGA               Vega 54
5                pride        PRIDE (EBI UK)
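The archive host used above appears to follow a month-plus-year naming pattern (dec2013 for Ensembl 74). A minimal Python sketch of that pattern follows. The helper name archive_host is hypothetical, and the pattern itself is an assumption: not every month has an archive, so check the Table of Assemblies linked in the references before relying on a generated host name.

```python
# Hypothetical helper: build an Ensembl archive host name from a release date.
# Assumes the "<mon><year>.archive.ensembl.org" pattern seen for Ensembl 74.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def archive_host(year, month):
    """Return the archive host for a release published in the given month."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    return "{}{}.archive.ensembl.org".format(MONTHS[month - 1], year)

# Ensembl 74 corresponds to the December 2013 archive used in this notebook.
host_v74 = archive_host(2013, 12)
print(host_v74)
```

The resulting string is what would be passed as the host argument of listMarts or useMart.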

What datasets are available?



In [4]:

    
%%R
datasets = listDatasets(useMart("ensembl"))
head(datasets)









    



Space required after the Public Identifier
SystemLiteral " or ' expected
SYSTEM or PUBLIC, the URI is missing
Opening and ending tag mismatch: hr line 7 and body
Opening and ending tag mismatch: body line 4 and html
Premature end of data in tag html line 2
Error: 1: Space required after the Public Identifier
2: SystemLiteral " or ' expected
3: SYSTEM or PUBLIC, the URI is missing
4: Opening and ending tag mismatch: hr line 7 and body
5: Opening and ending tag mismatch: body line 4 and html
6: Premature end of data in tag html line 2






    



/cbhomes/cychen/anaconda/lib/python2.7/site-packages/rpy2-2.6.0-py2.7-linux-x86_64.egg/rpy2/robjects/functions.py:106: UserWarning: Error: 1: Space required after the Public Identifier
2: SystemLiteral " or ' expected
3: SYSTEM or PUBLIC, the URI is missing
4: Opening and ending tag mismatch: hr line 7 and body
5: Opening and ending tag mismatch: body line 4 and html
6: Premature end of data in tag html line 2

  res = super(Function, self).__call__(*new_args, **new_kwargs)

Select a mart & dataset



In [7]:

    
%%R
mart.hsa = useMart("ensembl", "hsapiens_gene_ensembl")









    



Space required after the Public Identifier
SystemLiteral " or ' expected
SYSTEM or PUBLIC, the URI is missing
Opening and ending tag mismatch: hr line 7 and body
Opening and ending tag mismatch: body line 4 and html
Premature end of data in tag html line 2
Error: 1: Space required after the Public Identifier
2: SystemLiteral " or ' expected
3: SYSTEM or PUBLIC, the URI is missing
4: Opening and ending tag mismatch: hr line 7 and body
5: Opening and ending tag mismatch: body line 4 and html
6: Premature end of data in tag html line 2

To query an older archive, pass the archive host when calling useMart, e.g.,



In [8]:

    
%%R
mart74.hsa = useMart("ENSEMBL_MART_ENSEMBL", "hsapiens_gene_ensembl", host="dec2013.archive.ensembl.org")

We will use the v74 mart (Ensembl 74) in the rest of this notebook.



In [9]:

    
%%R
mart.hsa = mart74.hsa

What attributes and filters can I use?

Attributes are the identifiers that you want to retrieve. For example HGNC gene ID, chromosome name, Ensembl transcript ID.
Filters are the identifiers that you supply in a query. Some but not all of the filter names may be the same as the attribute names.
Values are the filter identifiers themselves. For example the values of the filter “HGNC symbol” could be 3 genes “TP53”, “SRY” and “KIAA1199”.
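To make the three terms concrete, here is a minimal Python sketch that assembles attributes, filters, and values into the kind of XML query document that BioMart web services accept. It is an illustration, not biomaRt's own code: the function name build_query is hypothetical, and the Query settings (virtualSchemaName, formatter, datasetConfigVersion) are assumed defaults. Nothing is sent over the network.

```python
import xml.etree.ElementTree as ET

def build_query(dataset, attributes, filters):
    """Build a BioMart-style XML query string (hypothetical helper).

    dataset    -- dataset name, e.g. "hsapiens_gene_ensembl"
    attributes -- list of attribute names to retrieve
    filters    -- dict mapping a filter name to a list of values
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",  # assumed default
        "formatter": "TSV",              # assumed: tab-separated results
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",   # assumed
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    # One Filter element per filter; multiple values are comma-joined.
    for name in sorted(filters):
        ET.SubElement(ds, "Filter",
                      {"name": name, "value": ",".join(filters[name])})
    # One Attribute element per column to retrieve.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query).decode("ascii")

xml_query = build_query(
    "hsapiens_gene_ensembl",
    attributes=["ensembl_gene_id", "hgnc_symbol"],
    filters={"hgnc_symbol": ["TP53", "SRY", "KIAA1199"]},
)
print(xml_query)
```

Here the filter is hgnc_symbol, its values are the three gene symbols, and the attributes are the two columns requested in the result.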



In [13]:

    
%%R
attributes <- listAttributes(mart.hsa)
head(attributes)









    





                   name           description
1       ensembl_gene_id       Ensembl Gene ID
2 ensembl_transcript_id Ensembl Transcript ID
3    ensembl_peptide_id    Ensembl Protein ID
4       ensembl_exon_id       Ensembl Exon ID
5           description           Description
6       chromosome_name       Chromosome Name



In [14]:

    
%%R
filters <- listFilters(mart.hsa)
head(filters)









    





             name     description
1 chromosome_name Chromosome name
2           start Gene Start (bp)
3             end   Gene End (bp)
4      band_start      Band Start
5        band_end        Band End
6    marker_start    Marker Start

You can search for specific attributes by running grep() on the name. For example, if you’re looking for Affymetrix microarray probeset IDs:



In [15]:

    
%%R
head(attributes[grep("affy", attributes$name),])









    





                  name                  description
91        affy_hc_g110        Affy HC G110 probeset
92       affy_hg_focus       Affy HG FOCUS probeset
93 affy_hg_u133_plus_2 Affy HG U133-PLUS-2 probeset
94     affy_hg_u133a_2     Affy HG U133A_2 probeset
95       affy_hg_u133a       Affy HG U133A probeset
96       affy_hg_u133b       Affy HG U133B probeset
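The same kind of search can be done on the Python side. A minimal sketch using only the standard library follows; the rows are typed in from the listAttributes() outputs above rather than fetched, and find_attributes is a hypothetical helper name.

```python
import re

# (name, description) pairs copied from the listAttributes() outputs above.
attributes = [
    ("ensembl_gene_id", "Ensembl Gene ID"),
    ("chromosome_name", "Chromosome Name"),
    ("affy_hc_g110", "Affy HC G110 probeset"),
    ("affy_hg_focus", "Affy HG FOCUS probeset"),
    ("affy_hg_u133_plus_2", "Affy HG U133-PLUS-2 probeset"),
]

def find_attributes(rows, pattern):
    """Return the rows whose name matches the regular expression."""
    regex = re.compile(pattern)
    return [row for row in rows if regex.search(row[0])]

for name, description in find_attributes(attributes, "affy"):
    print("{}\t{}".format(name, description))
```

Like grep() in R, the pattern is a regular expression, so "^ensembl" would match only names that start with "ensembl".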

Demo

All genes on Y chromosome

Query in R:



In [16]:

    
%%R -o df
df = getBM(attributes=c("ensembl_gene_id", "hgnc_symbol", "chromosome_name"), 
           filters="chromosome_name",
           values="Y",
           mart=mart.hsa)
head(df)









    





  ensembl_gene_id hgnc_symbol chromosome_name
1 ENSG00000226555       AGKP1               Y
2 ENSG00000228787  NLGN4Y-AS1               Y
3 ENSG00000236131     MED13P1               Y
4 ENSG00000227949     CYCSP46               Y
5 ENSG00000224518                           Y
6 ENSG00000234620     HDHD1P1               Y

The same data frame is accessible in Python, because the -o df flag exports it from R:



In [17]:

    
df.head()









    Out[17]:






  
    
      
  ensembl_gene_id hgnc_symbol chromosome_name
0 ENSG00000226555       AGKP1               Y
1 ENSG00000228787  NLGN4Y-AS1               Y
2 ENSG00000236131     MED13P1               Y
3 ENSG00000227949     CYCSP46               Y
4 ENSG00000224518                           Y

Annotate a gene list



In [18]:

    
genes = ["ENSG00000135245", "ENSG00000240758", "ENSG00000225490"]



In [22]:

    
%%R -i genes -o df
df = getBM(attributes=c("ensembl_gene_id", "hgnc_symbol", "external_gene_id", "chromosome_name", "gene_biotype", "description"),
              filters="ensembl_gene_id",
              values=genes,
              mart=mart.hsa)
df









    





  ensembl_gene_id hgnc_symbol external_gene_id chromosome_name
1 ENSG00000135245      HILPDA           HILPDA               7
2 ENSG00000225490                 RP4-610C12.3              20
3 ENSG00000240758                RP11-155G14.6               7
          gene_biotype
1       protein_coding
2              lincRNA
3 processed_transcript
                                                                description
1 hypoxia inducible lipid droplet-associated [Source:HGNC Symbol;Acc:28859]
2                                                                          
3



In [23]:

    
df









    Out[23]:






  
    
      
  ensembl_gene_id hgnc_symbol external_gene_id chromosome_name         gene_biotype description
0 ENSG00000135245      HILPDA           HILPDA               7       protein_coding hypoxia inducible lipid droplet-associated [So...
1 ENSG00000225490                 RP4-610C12.3              20              lincRNA
2 ENSG00000240758                RP11-155G14.6               7 processed_transcript
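Note that the query returned the rows in a different order from the input list: ENSG00000225490 comes before ENSG00000240758. If later steps rely on the original order, the rows can be re-aligned to the gene list. A minimal sketch in plain Python follows, with the records typed in from the output above (columns trimmed) and a hypothetical helper name align_to_genes. It assumes at most one row per gene; with one-to-many attributes only the last row per gene would be kept.

```python
genes = ["ENSG00000135245", "ENSG00000240758", "ENSG00000225490"]

# Rows in the order returned by the query above (columns trimmed for brevity).
records = [
    {"ensembl_gene_id": "ENSG00000135245",
     "external_gene_id": "HILPDA", "gene_biotype": "protein_coding"},
    {"ensembl_gene_id": "ENSG00000225490",
     "external_gene_id": "RP4-610C12.3", "gene_biotype": "lincRNA"},
    {"ensembl_gene_id": "ENSG00000240758",
     "external_gene_id": "RP11-155G14.6", "gene_biotype": "processed_transcript"},
]

def align_to_genes(gene_ids, rows, key="ensembl_gene_id"):
    """Return one row per input gene, in input order (None if not annotated)."""
    by_id = {row[key]: row for row in rows}  # later duplicates overwrite earlier
    return [by_id.get(gene_id) for gene_id in gene_ids]

for gene_id, row in zip(genes, align_to_genes(genes, records)):
    name = row["external_gene_id"] if row else "NA"
    print("{}\t{}".format(gene_id, name))
```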
