Using BioMart to query mapping identifiers between Ensembl and other databases

Motivation: Converting identifiers is a recurrent question on the Ensembl website (http://www.ensembl.org/Help/Faq?id=125), so let us see how to answer the following question using BioServices:

How do I convert IDs? I have ENSG... IDs and I would like HGNC symbols and EntrezGene IDs along with matching Affymetrix platform HC G110 probes.

The solution given on the Ensembl web page is to use BioMart, starting from a list of ENSG identifiers (e.g., ENSG00000162367 and ENSG00000187048).

The Ensembl instructions use the BioMart web interface: enter the list of genes and export the matching IDs from several databases. You need to fill in the following input boxes:

Database: Ensembl genes
Dataset: Homo sapiens genes
Filters: GENE: ID list limit box: select Ensembl Gene ID(s) as the header and enter the gene IDs.
Attributes:
- References, select HGNC symbol and EntrezGene ID.
- Scroll down to Microarray Attributes to select Affy HC G110.

Below, we take the programmatic approach with BioServices.



In [70]:

    
# Create the BioMart service object.
# (Assumed setup: `s` is not defined anywhere in this section.)
from bioservices import BioMart
s = BioMart()

# Build the XML request.
# Multiple identifiers are passed to the filter as a single comma-separated string.
s.new_query()
s.add_dataset_to_xml('hsapiens_gene_ensembl')
s.add_attribute_to_xml('affy_hc_g110')
s.add_attribute_to_xml('entrezgene')
s.add_attribute_to_xml('hgnc_symbol')
s.add_attribute_to_xml('ensembl_gene_id')
s.add_filter_to_xml('ensembl_gene_id', 'ENSG00000162367,ENSG00000187048')
xml = s.get_xml()
print(xml)

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query  virtualSchemaName = "default" formatter = "TSV"
header = "0" uniqueRows = "0" count = ""
datasetConfigVersion = "0.6" >
    <Dataset name = "hsapiens_gene_ensembl" interface = "default" >

        <Filter name = "ensembl_gene_id" value = "ENSG00000162367,ENSG00000187048"/>
        <Attribute name = "affy_hc_g110" />
        <Attribute name = "entrezgene" />
        <Attribute name = "hgnc_symbol" />
        <Attribute name = "ensembl_gene_id" />
    </Dataset>
</Query>
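
To make the structure of the request explicit, here is a minimal sketch that builds an equivalent XML query with the Python standard library only. The helper name `build_biomart_query` and its arguments are hypothetical (they are not part of BioServices); the element and attribute names are copied from the XML printed above. Whitespace and attribute quoting differ slightly from the BioServices output.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, attributes, filters):
    """Return a BioMart XML query string.

    Hypothetical helper for illustration; not part of BioServices.
    `filters` maps a filter name to a list of values.
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "0",
        "count": "",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    for name, values in filters.items():
        # Several filter values are sent as one comma-separated string
        ET.SubElement(ds, "Filter",
                      {"name": name, "value": ",".join(values)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE Query>\n' + body


gene_ids = ["ENSG00000162367", "ENSG00000187048"]
xml_query = build_biomart_query(
    "hsapiens_gene_ensembl",
    ["affy_hc_g110", "entrezgene", "hgnc_symbol", "ensembl_gene_id"],
    {"ensembl_gene_id": gene_ids},
)
print(xml_query)
```

Keeping the identifiers in a Python list and joining them only when the filter is written avoids hand-editing a long comma-separated string.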



In [71]:

    
# Now send the request; the result is returned as tab-separated text
res = s.query(xml)
print(res)

560_s_at	6886	TAL1	ENSG00000162367
560_s_at	6886	TAL1	ENSG00000162367
	6886	TAL1	ENSG00000162367
	6886	TAL1	ENSG00000162367
	6886	TAL1	ENSG00000162367
	6886	TAL1	ENSG00000162367
1391_s_at	1579	CYP4A11	ENSG00000187048
1391_s_at	1579	CYP4A11	ENSG00000187048
1391_s_at	1579	CYP4A11	ENSG00000187048
	1579	CYP4A11	ENSG00000187048
	1579	CYP4A11	ENSG00000187048
	1579	CYP4A11	ENSG00000187048
	1579	CYP4A11	ENSG00000187048
	1579	CYP4A11	ENSG00000187048
	1579	CYP4A11	ENSG00000187048



In [101]:

    
# Parse the result with pandas for a nicer rendering and more convenient data handling
import pandas as pd
from io import StringIO  # Python 3 replacement for the old StringIO module

# The response has no header row, so the column names are set by hand,
# in the same order as the attributes in the query
df = pd.read_csv(StringIO(res), sep="\t", header=None)
df.columns = ['affyhc_g110', 'entrezgene', 'hgnc_symbol', 'ensembl_gene_id']
df = df.drop_duplicates()
df = df.set_index('ensembl_gene_id')
# Example lookup: df.loc['ENSG00000162367', 'hgnc_symbol']
df

Out[101]:

                 affyhc_g110  entrezgene  hgnc_symbol
ensembl_gene_id
ENSG00000162367     560_s_at        6886         TAL1
ENSG00000162367          NaN        6886         TAL1
ENSG00000187048    1391_s_at        1579      CYP4A11
ENSG00000187048          NaN        1579      CYP4A11
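
The table above still holds two rows per gene, one with a probe and one without. If a single record per gene is more convenient, the rows can be collapsed. The following is a minimal sketch using only the standard library; the sample text is copied from the query result shown earlier, and the names `columns` and `mapping` are illustrative assumptions, not part of BioServices.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the query result (tab-separated, no header)
res = (
    "560_s_at\t6886\tTAL1\tENSG00000162367\n"
    "\t6886\tTAL1\tENSG00000162367\n"
    "1391_s_at\t1579\tCYP4A11\tENSG00000187048\n"
    "\t1579\tCYP4A11\tENSG00000187048\n"
)

# Same order as the attributes in the query
columns = ["affy_hc_g110", "entrezgene", "hgnc_symbol", "ensembl_gene_id"]

# gene ID -> attribute name -> set of distinct values
mapping = defaultdict(lambda: defaultdict(set))
for row in csv.reader(io.StringIO(res), delimiter="\t"):
    record = dict(zip(columns, row))
    gene = record.pop("ensembl_gene_id")
    for key, value in record.items():
        if value:  # skip empty fields (rows without a probe)
            mapping[gene][key].add(value)

for gene in sorted(mapping):
    print(gene, {k: sorted(v) for k, v in mapping[gene].items()})
```

Sets are used because one gene may map to several probes, and duplicates in the raw result are then dropped automatically.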





See also