Access Ensembl BioMart using the biomaRt R package

We use rpy2 and its R magics in the IPython Notebook to call the biomaRt package in R and hand the results back to Python as pandas data frames.

Usage:

  1. Run Setup
  2. Select a mart & dataset
  3. Demo
    1. All genes on Y chromosome
    2. Annotate a gene list

Ref:

Setup


In [1]:
import pandas as pd
%load_ext rpy2.ipython

In [16]:
%%R
library(biomaRt)

In [4]:
%load_ext version_information
%version_information pandas, rpy2


Out[4]:
Software  Version
Python    2.7.12 64bit [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
IPython   4.1.2
OS        Linux 2.6.32 431.3.1.el6.x86_64 x86_64 with centos 6.8 Final
pandas    0.18.0
rpy2      2.7.4
Sun Nov 13 23:12:13 2016 EST

Tutorial

What marts are available?

Current build (when this notebook was last run, the query failed with the XML parsing errors shown below):


In [15]:
%%R
marts = listMarts()
head(marts)


Space required after the Public Identifier
SystemLiteral " or ' expected
SYSTEM or PUBLIC, the URI is missing
Opening and ending tag mismatch: hr line 7 and body
Opening and ending tag mismatch: body line 4 and html
Premature end of data in tag html line 2
Error: 1: Space required after the Public Identifier
2: SystemLiteral " or ' expected
3: SYSTEM or PUBLIC, the URI is missing
4: Opening and ending tag mismatch: hr line 7 and body
5: Opening and ending tag mismatch: body line 4 and html
6: Premature end of data in tag html line 2

Sometimes you need to specify a particular genome build (e.g., GTEx v6 used GENCODE v19, which was based on GRCh37.p13 = Ensembl 74):


In [4]:
%%R
marts.v74 = listMarts(host="dec2013.archive.ensembl.org")
head(marts.v74)


               biomart               version
1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 74
2     ENSEMBL_MART_SNP  Ensembl Variation 74
3 ENSEMBL_MART_FUNCGEN Ensembl Regulation 74
4    ENSEMBL_MART_VEGA               Vega 54
5                pride        PRIDE (EBI UK)

What datasets are available?


In [4]:
%%R
datasets = listDatasets(useMart("ensembl"))
head(datasets)


Space required after the Public Identifier
SystemLiteral " or ' expected
SYSTEM or PUBLIC, the URI is missing
Opening and ending tag mismatch: hr line 7 and body
Opening and ending tag mismatch: body line 4 and html
Premature end of data in tag html line 2
Error: 1: Space required after the Public Identifier
2: SystemLiteral " or ' expected
3: SYSTEM or PUBLIC, the URI is missing
4: Opening and ending tag mismatch: hr line 7 and body
5: Opening and ending tag mismatch: body line 4 and html
6: Premature end of data in tag html line 2
/cbhomes/cychen/anaconda/lib/python2.7/site-packages/rpy2-2.6.0-py2.7-linux-x86_64.egg/rpy2/robjects/functions.py:106: UserWarning: Error: 1: Space required after the Public Identifier
2: SystemLiteral " or ' expected
3: SYSTEM or PUBLIC, the URI is missing
4: Opening and ending tag mismatch: hr line 7 and body
5: Opening and ending tag mismatch: body line 4 and html
6: Premature end of data in tag html line 2

  res = super(Function, self).__call__(*new_args, **new_kwargs)

Select a mart & dataset


In [7]:
%%R
mart.hsa = useMart("ensembl", "hsapiens_gene_ensembl")


Space required after the Public Identifier
SystemLiteral " or ' expected
SYSTEM or PUBLIC, the URI is missing
Opening and ending tag mismatch: hr line 7 and body
Opening and ending tag mismatch: body line 4 and html
Premature end of data in tag html line 2
Error: 1: Space required after the Public Identifier
2: SystemLiteral " or ' expected
3: SYSTEM or PUBLIC, the URI is missing
4: Opening and ending tag mismatch: hr line 7 and body
5: Opening and ending tag mismatch: body line 4 and html
6: Premature end of data in tag html line 2

To query an archived release, pass the archive host when calling useMart, e.g.,


In [8]:
%%R
mart74.hsa = useMart("ENSEMBL_MART_ENSEMBL", "hsapiens_gene_ensembl", host="dec2013.archive.ensembl.org")

We will use the Ensembl 74 mart as our example for the rest of this notebook.


In [9]:
%%R
mart.hsa = mart74.hsa

What attributes and filters can I use?

  • Attributes are the fields that you want to retrieve, for example HGNC gene ID, chromosome name, or Ensembl transcript ID.
  • Filters are the fields that you restrict a query on. Some, but not all, filter names are the same as attribute names.
  • Values are the entries to match for a given filter. For example, the values of the filter “HGNC symbol” could be the three genes “TP53”, “SRY” and “KIAA1199”.
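
To make the three terms concrete, here is a minimal local sketch in pandas. It does not contact BioMart: toy_table is a hand-made stand-in whose rows are copied from query outputs shown later in this notebook, and toy_get_bm is a hypothetical helper (not part of biomaRt or rpy2) that only mimics the roles the three arguments play in a getBM call.

```python
import pandas as pd

# Toy stand-in for a BioMart dataset (rows copied from outputs in this notebook).
toy_table = pd.DataFrame({
    "ensembl_gene_id": ["ENSG00000226555", "ENSG00000228787", "ENSG00000135245"],
    "hgnc_symbol": ["AGKP1", "NLGN4Y-AS1", "HILPDA"],
    "chromosome_name": ["Y", "Y", "7"],
})


def toy_get_bm(table, attributes, filters, values):
    """Hypothetical local analogue of getBM: keep matching rows, then pick columns."""
    if isinstance(values, str):
        values = [values]                # accept a single value or a list of values
    keep = table[filters].isin(values)   # filter + values decide which rows are kept
    return table.loc[keep, attributes].reset_index(drop=True)  # attributes = columns


y_genes = toy_get_bm(toy_table,
                     attributes=["ensembl_gene_id", "hgnc_symbol"],
                     filters="chromosome_name",
                     values="Y")
print(y_genes)
```

In short: the filter names the column to test, the values say what to match, and the attributes say which columns come back.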

In [13]:
%%R
attributes <- listAttributes(mart.hsa)
head(attributes)


                   name           description
1       ensembl_gene_id       Ensembl Gene ID
2 ensembl_transcript_id Ensembl Transcript ID
3    ensembl_peptide_id    Ensembl Protein ID
4       ensembl_exon_id       Ensembl Exon ID
5           description           Description
6       chromosome_name       Chromosome Name

In [14]:
%%R
filters <- listFilters(mart.hsa)
head(filters)


             name     description
1 chromosome_name Chromosome name
2           start Gene Start (bp)
3             end   Gene End (bp)
4      band_start      Band Start
5        band_end        Band End
6    marker_start    Marker Start

You can search for specific attributes by running grep() on the name. For example, if you’re looking for Affymetrix microarray probeset IDs:


In [15]:
%%R
head(attributes[grep("affy", attributes$name),])


                  name                  description
91        affy_hc_g110        Affy HC G110 probeset
92       affy_hg_focus       Affy HG FOCUS probeset
93 affy_hg_u133_plus_2 Affy HG U133-PLUS-2 probeset
94     affy_hg_u133a_2     Affy HG U133A_2 probeset
95       affy_hg_u133a       Affy HG U133A probeset
96       affy_hg_u133b       Affy HG U133B probeset
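
The same search can be done on the Python side once the attribute table is a pandas data frame (for example, if it were exported with the -o flag, which this notebook does not do for attributes). The sketch below is self-contained: attributes_df is a small hand-made stand-in with rows copied from the outputs above, not the full table.

```python
import pandas as pd

# Small stand-in for the attribute table (rows copied from the outputs above).
attributes_df = pd.DataFrame({
    "name": ["ensembl_gene_id", "chromosome_name", "affy_hc_g110", "affy_hg_focus"],
    "description": ["Ensembl Gene ID", "Chromosome Name",
                    "Affy HC G110 probeset", "Affy HG FOCUS probeset"],
})

# Keep rows whose name contains "affy", like grep("affy", attributes$name) in R.
affy = attributes_df[attributes_df["name"].str.contains("affy")]
print(affy)
```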

Demo

All genes on Y chromosome

Query in R:


In [16]:
%%R -o df
df = getBM(attributes=c("ensembl_gene_id", "hgnc_symbol", "chromosome_name"), 
           filters="chromosome_name",
           values="Y",
           mart=mart.hsa)
head(df)


  ensembl_gene_id hgnc_symbol chromosome_name
1 ENSG00000226555       AGKP1               Y
2 ENSG00000228787  NLGN4Y-AS1               Y
3 ENSG00000236131     MED13P1               Y
4 ENSG00000227949     CYCSP46               Y
5 ENSG00000224518                           Y
6 ENSG00000234620     HDHD1P1               Y

The result is exported with the -o flag, so it is also accessible in Python:


In [17]:
df.head()


Out[17]:
   ensembl_gene_id  hgnc_symbol  chromosome_name
0  ENSG00000226555        AGKP1                Y
1  ENSG00000228787   NLGN4Y-AS1                Y
2  ENSG00000236131      MED13P1                Y
3  ENSG00000227949      CYCSP46                Y
4  ENSG00000224518                             Y
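
Some genes come back without an HGNC symbol (row 4 above). A minimal sketch of how such gaps might be handled in pandas follows. It assumes the missing symbols arrive as empty strings, which is how they appear in the printed output but has not been verified for every rpy2 version. y_df is rebuilt by hand from the rows shown above so that the block runs on its own.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the first five rows shown above.
y_df = pd.DataFrame({
    "ensembl_gene_id": ["ENSG00000226555", "ENSG00000228787", "ENSG00000236131",
                        "ENSG00000227949", "ENSG00000224518"],
    "hgnc_symbol": ["AGKP1", "NLGN4Y-AS1", "MED13P1", "CYCSP46", ""],
    "chromosome_name": ["Y"] * 5,
})

# Assumption: missing symbols are empty strings; turn them into NaN.
y_df["hgnc_symbol"] = y_df["hgnc_symbol"].replace("", np.nan)

n_missing = y_df["hgnc_symbol"].isna().sum()   # genes without a symbol
named = y_df.dropna(subset=["hgnc_symbol"])    # genes that do have a symbol
print(n_missing)
print(named)
```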

Annotate a gene list


In [18]:
genes = ["ENSG00000135245", "ENSG00000240758", "ENSG00000225490"]

In [22]:
%%R -i genes -o df
df = getBM(attributes=c("ensembl_gene_id", "hgnc_symbol", "external_gene_id",
                        "chromosome_name", "gene_biotype", "description"),
           filters="ensembl_gene_id",
           values=genes,
           mart=mart.hsa)
df


  ensembl_gene_id hgnc_symbol external_gene_id chromosome_name
1 ENSG00000135245      HILPDA           HILPDA               7
2 ENSG00000225490                 RP4-610C12.3              20
3 ENSG00000240758                RP11-155G14.6               7
          gene_biotype
1       protein_coding
2              lincRNA
3 processed_transcript
                                                                description
1 hypoxia inducible lipid droplet-associated [Source:HGNC Symbol;Acc:28859]
2                                                                          
3                                                                          

In [23]:
df


Out[23]:
   ensembl_gene_id  hgnc_symbol  external_gene_id  chromosome_name          gene_biotype  description
0  ENSG00000135245       HILPDA            HILPDA                7        protein_coding  hypoxia inducible lipid droplet-associated [So...
1  ENSG00000225490                   RP4-610C12.3               20               lincRNA
2  ENSG00000240758                  RP11-155G14.6                7  processed_transcript
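
Note that the rows above come back in a different order from the input list (ENSG00000225490 precedes ENSG00000240758). If the annotation needs to line up with the original gene list, one option is to reindex on the gene ID, sketched below. The annot frame is rebuilt by hand from the output above so that the block runs on its own; in the notebook, df would take its place.

```python
import pandas as pd

genes = ["ENSG00000135245", "ENSG00000240758", "ENSG00000225490"]

# Hand-built copy of the query result shown above, in the returned order.
annot = pd.DataFrame({
    "ensembl_gene_id": ["ENSG00000135245", "ENSG00000225490", "ENSG00000240758"],
    "external_gene_id": ["HILPDA", "RP4-610C12.3", "RP11-155G14.6"],
    "chromosome_name": ["7", "20", "7"],
    "gene_biotype": ["protein_coding", "lincRNA", "processed_transcript"],
})

# Reorder rows to follow the input list; IDs with no annotation become NaN rows.
ordered = annot.set_index("ensembl_gene_id").reindex(genes).reset_index()
print(ordered)
```

Reindexing also makes unmatched IDs visible as rows of NaN rather than silently dropping them.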