Using BioMart to map identifiers between Ensembl and other databases

Motivation: This is a recurring question on the Ensembl website (http://www.ensembl.org/Help/Faq?id=125), so let us see how to answer it with BioServices:

How do I convert IDs? I have ENSG... IDs and I would like HGNC symbols and EntrezGene IDs along with matching Affymetrix platform HC G110 probes.

The solution given on the Ensembl web page is to use BioMart, starting from a list of ENSG identifiers (e.g., ENSG00000162367 and ENSG00000187048).

The instructions on the Ensembl web page are to use the BioMart web interface as follows: enter the list of genes, then export the IDs from multiple databases. You need to fill in the input boxes:

  • Database: Ensembl genes
  • Dataset: Homo sapiens genes
  • Filters: GENE: in the ID list limit box, select Ensembl Gene ID(s) as the header and enter the gene IDs.
  • Attributes:
    • Under References, select HGNC symbol and EntrezGene ID.
    • Scroll down to Microarray Attributes and select Affy HC G110.

Below, we take the programmatic approach with BioServices.

In [68]:
from bioservices import biomart
#reload(biomart)

In [41]:
s = biomart.BioMart()

In [42]:
# First, we need to know the datasets of the ensembl database.
# On the Ensembl web page the database is called "Ensembl genes", but "ensembl" appears to be enough.
datasets = s.datasets("ensembl")

In [43]:
# On the web page, the suggested dataset is called "Homo sapiens genes".
# Can we find it in the list of datasets?
for d in datasets:
    if 'sapiens' in d.lower():
        print(d)


hsapiens_gene_ensembl

In [44]:
# Again, the name differs slightly from the one on the web page, but this should be the dataset we want.

In [45]:
# What are the filters?
filters = s.filters('hsapiens_gene_ensembl')

filters is a dictionary with many keys, such as ensembl_gene_id. Each key maps to a list describing the filter (label, description, type, accepted operators, and so on).


In [46]:
filters['ensembl_gene_id']


Out[46]:
[u'Ensembl Gene ID(s) [e.g. ENSG00000139618]',
 u'[]',
 u'Filter to include genes with supplied list of Ensembl Gene IDs',
 u'filters',
 u'id_list',
 u'=,in',
 u'hsapiens_gene_ensembl__gene__main',
 u'stable_id_1023']
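
Since the filter dictionary is large, it can help to search it by keyword rather than scanning it by eye. The following is a minimal sketch, not part of BioServices: search_filters is a hypothetical helper, and the filters dictionary here is a stand-in that contains only the single entry shown above, so that the example runs without network access.

```python
# Stand-in for the dictionary returned by s.filters('hsapiens_gene_ensembl').
# Only the 'ensembl_gene_id' entry is copied from the output above; the real
# dictionary holds many more keys.
filters = {
    "ensembl_gene_id": [
        "Ensembl Gene ID(s) [e.g. ENSG00000139618]",
        "[]",
        "Filter to include genes with supplied list of Ensembl Gene IDs",
        "filters",
        "id_list",
        "=,in",
        "hsapiens_gene_ensembl__gene__main",
        "stable_id_1023",
    ],
}


def search_filters(filters, keyword):
    """Hypothetical helper: return the sorted filter names whose name or
    descriptive fields contain `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    return sorted(
        name
        for name, fields in filters.items()
        if keyword in name.lower()
        or any(keyword in str(field).lower() for field in fields)
    )


matches = search_filters(filters, "gene id")
print(matches)
```

With the real dictionary, the same call would be made on the object returned by s.filters('hsapiens_gene_ensembl').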

Similarly, attributes have to be provided. What are they? Check the content of attributes.keys().


In [69]:
attributes = s.attributes('hsapiens_gene_ensembl')
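
The web form labels (HGNC symbol, EntrezGene ID, Affy HC G110) do not match the attribute keys exactly, so one way to locate them is to search the keys by substring. This is a small sketch under assumptions: find_attributes is a hypothetical helper, and the attributes dictionary is a stand-in holding only the four names used in this notebook, with placeholder values.

```python
# Stand-in for s.attributes('hsapiens_gene_ensembl'). Only the keys matter
# here; the values are placeholders, not what BioServices returns.
attributes = {
    "affy_hc_g110": None,
    "entrezgene": None,
    "hgnc_symbol": None,
    "ensembl_gene_id": None,
}


def find_attributes(attributes, *keywords):
    """Hypothetical helper: map each keyword to the sorted attribute names
    that contain it (case-insensitive)."""
    return {
        kw: sorted(name for name in attributes if kw.lower() in name.lower())
        for kw in keywords
    }


hits = find_attributes(attributes, "affy", "hgnc", "entrez")
for keyword in ("affy", "hgnc", "entrez"):
    print(keyword, hits[keyword])
```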

In [70]:
# Let us first build the XML request.
# Note that the list of identifiers must be passed as a single comma-separated string.
s.new_query()
s.add_dataset_to_xml('hsapiens_gene_ensembl')
s.add_attribute_to_xml('affy_hc_g110')
s.add_attribute_to_xml('entrezgene')
s.add_attribute_to_xml('hgnc_symbol')
s.add_attribute_to_xml('ensembl_gene_id')
s.add_filter_to_xml('ensembl_gene_id', 'ENSG00000162367,ENSG00000187048')
xml = s.get_xml()
print(xml)


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query  virtualSchemaName = "default" formatter = "TSV"
header = "0" uniqueRows = "0" count = ""
datasetConfigVersion = "0.6" >
    <Dataset name = "hsapiens_gene_ensembl" interface = "default" >

        <Filter name = "ensembl_gene_id" value = "ENSG00000162367,ENSG00000187048"/>
        <Attribute name = "affy_hc_g110" />
        <Attribute name = "entrezgene" />
        <Attribute name = "hgnc_symbol" />
        <Attribute name = "ensembl_gene_id" />
    </Dataset>
</Query>
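
When the identifiers start out as a Python list, the comma-separated filter value can be built with str.join. The sketch below also re-reads the query text with the standard library to recover the attribute order; because the query uses header = "0", the result has no header row, and the output further down suggests that columns follow the order in which the attributes were added. The XML literal is a copy of the printed query with the XML declaration and DOCTYPE lines left out; nothing here calls BioServices.

```python
import xml.etree.ElementTree as ET

# Identifiers as a Python list, joined into the single string the filter expects
gene_ids = ["ENSG00000162367", "ENSG00000187048"]
filter_value = ",".join(gene_ids)

# Copy of the query printed above (declaration and DOCTYPE omitted)
xml_text = """
<Query virtualSchemaName="default" formatter="TSV"
       header="0" uniqueRows="0" count="" datasetConfigVersion="0.6">
    <Dataset name="hsapiens_gene_ensembl" interface="default">
        <Filter name="ensembl_gene_id" value="ENSG00000162367,ENSG00000187048"/>
        <Attribute name="affy_hc_g110"/>
        <Attribute name="entrezgene"/>
        <Attribute name="hgnc_symbol"/>
        <Attribute name="ensembl_gene_id"/>
    </Dataset>
</Query>
"""

dataset = ET.fromstring(xml_text.strip()).find("Dataset")

# Attribute names in document order, usable later as column names
columns = [attr.get("name") for attr in dataset.findall("Attribute")]

# Identifiers recovered from the filter, to check the round trip
parsed_ids = dataset.find("Filter").get("value").split(",")

print(filter_value)
print(columns)
```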

In [71]:
# Now we send the request itself
res = s.query(xml)
print(res)


560_s_at	6886	TAL1	ENSG00000162367
560_s_at	6886	TAL1	ENSG00000162367
	6886	TAL1	ENSG00000162367
	6886	TAL1	ENSG00000162367
	6886	TAL1	ENSG00000162367
	6886	TAL1	ENSG00000162367
1391_s_at	1579	CYP4A11	ENSG00000187048
1391_s_at	1579	CYP4A11	ENSG00000187048
1391_s_at	1579	CYP4A11	ENSG00000187048
	1579	CYP4A11	ENSG00000187048
	1579	CYP4A11	ENSG00000187048
	1579	CYP4A11	ENSG00000187048
	1579	CYP4A11	ENSG00000187048
	1579	CYP4A11	ENSG00000187048
	1579	CYP4A11	ENSG00000187048


In [101]:
# We can parse the results with pandas for nicer rendering and more convenient data handling
import pandas as pd
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3
# The query was built with header = "0", so the columns are named by hand
df = pd.read_csv(StringIO(res), sep="\t", header=None)
df.columns = ['affyhc_g110', 'entrezgene', 'hgnc_symbol', 'ensembl_gene_id']
df = df.drop_duplicates()
df = df.set_index('ensembl_gene_id')
# df.loc['ENSG00000162367']['hgnc_symbol']
df


Out[101]:
                affyhc_g110  entrezgene hgnc_symbol
ensembl_gene_id
ENSG00000162367    560_s_at        6886        TAL1
ENSG00000162367         NaN        6886        TAL1
ENSG00000187048   1391_s_at        1579     CYP4A11
ENSG00000187048         NaN        1579     CYP4A11
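
If pandas is not available, a similar lookup table can be sketched with the standard library alone. The example below embeds a few rows copied from the query output (rather than calling the service) and collapses them into one record per Ensembl gene ID, collecting the non-empty probe IDs into a set. It assumes each gene has a single HGNC symbol and EntrezGene ID, which holds for these two genes but is not guaranteed in general.

```python
import csv
import io

# A few rows copied from the query output above (tab-separated, no header)
res = (
    "560_s_at\t6886\tTAL1\tENSG00000162367\n"
    "560_s_at\t6886\tTAL1\tENSG00000162367\n"
    "\t6886\tTAL1\tENSG00000162367\n"
    "1391_s_at\t1579\tCYP4A11\tENSG00000187048\n"
    "\t1579\tCYP4A11\tENSG00000187048\n"
    "\t1579\tCYP4A11\tENSG00000187048\n"
)

# Column names in the order the attributes were added to the query
columns = ["affy_hc_g110", "entrezgene", "hgnc_symbol", "ensembl_gene_id"]

mapping = {}
for row in csv.reader(io.StringIO(res), delimiter="\t"):
    record = dict(zip(columns, row))
    # Create the gene entry on first sight (assumes one symbol/ID per gene)
    gene = mapping.setdefault(
        record["ensembl_gene_id"],
        {
            "hgnc_symbol": record["hgnc_symbol"],
            "entrezgene": record["entrezgene"],
            "probes": set(),
        },
    )
    # Rows without a probe have an empty first field; skip those
    if record["affy_hc_g110"]:
        gene["probes"].add(record["affy_hc_g110"])

for gene_id in sorted(mapping):
    info = mapping[gene_id]
    print(gene_id, info["hgnc_symbol"], info["entrezgene"], sorted(info["probes"]))
```

Using a set for the probes removes the duplicated rows in the same pass, which plays the role of drop_duplicates() in the pandas version.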
