GA4GH 1000 Genomes Reference Service Example

This example illustrates how to access the available reference sequences offered by a GA4GH instance.

Initialize the client

In this step we create a client object that communicates with the server. It is initialized with the server's URL.

In [1]:
from ga4gh.client import client
c = client.HttpClient("")

Search reference sets

Reference sets collect together named reference sequences as part of released assemblies. The API provides methods for accessing reference sequences.

The Thousand Genomes data presented here are mapped to GRCh37, and so this server makes that reference genome available. Datasets and reference genomes are decoupled in the data model, so it is possible to use the same reference set in multiple datasets.
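
The decoupling described above can be pictured with a small sketch. The record types and the dataset names below are hypothetical stand-ins, not the client's own classes; the only point is that several datasets can carry the same reference set id.

```python
from collections import namedtuple

# Hypothetical stand-ins for records in the data model; only the linking field matters.
ReferenceSet = namedtuple("ReferenceSet", ["id", "name"])
VariantSet = namedtuple("VariantSet", ["dataset_id", "reference_set_id"])

grch37 = ReferenceSet(id="WyJOQ0JJMzciXQ", name="NCBI37")

# Two variant sets from different (made-up) datasets point at the same reference set.
variant_sets = [
    VariantSet(dataset_id="dataset-a", reference_set_id=grch37.id),
    VariantSet(dataset_id="dataset-b", reference_set_id=grch37.id),
]

# Collect every dataset that is mapped to this reference set.
datasets_using_grch37 = sorted(
    vs.dataset_id for vs in variant_sets if vs.reference_set_id == grch37.id
)
print(datasets_using_grch37)
```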

Here, we list the details of the Reference Set.

In [2]:
for reference_set in c.search_reference_sets():
    ncbi37 = reference_set
    print "name: {}".format(ncbi37.name)
    print "ncbi_taxon_id: {}".format(ncbi37.ncbi_taxon_id)
    print "description: {}".format(ncbi37.description)
    print "source_uri: {}".format(ncbi37.source_uri)

name: NCBI37
ncbi_taxon_id: 9606
description: NCBI37 assembly of the human genome

Obtaining individual Reference Sets by ID

The API can also fetch an individual reference set when its id is known. This server currently offers only one reference set, though more may be added in the future.

In [3]:
reference_set = c.get_reference_set(ncbi37.id)
print reference_set

id: "WyJOQ0JJMzciXQ"
name: "NCBI37"
md5checksum: "54e0bb53844059bb7152618fc927cfa9"
ncbi_taxon_id: 9606
description: "NCBI37 assembly of the human genome"
source_uri: ""

Search References

The previous call returned the identifier needed to find the references that belong to the NCBI37 reference set. We use this unique identifier to constrain the search for named sequences. There are 86 of them, so we show only the first few.

In [4]:
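counter = 0
for reference in c.search_references(reference_set.id):
    # Remember chromosome 1 for the later calls.
    if reference.name == "1":
        base_id_ref = reference
    counter += 1
    # Stop after the first five references.
    if counter > 5:
        break
    print reference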

id: "WyJOQ0JJMzciLCIxIl0"
length: 249250621
md5checksum: "1b22b98cdeb4a9304cb5d48026a85128"
name: "1"
ncbi_taxon_id: 9606

id: "WyJOQ0JJMzciLCIyIl0"
length: 243199373
md5checksum: "a0d9851da00400dec1098a9255ac712e"
name: "2"
ncbi_taxon_id: 9606

id: "WyJOQ0JJMzciLCIzIl0"
length: 198022430
md5checksum: "fdfd811849cc2fadebc929bb925902e5"
name: "3"
ncbi_taxon_id: 9606

id: "WyJOQ0JJMzciLCI0Il0"
length: 191154276
md5checksum: "23dccd106897542ad87d2765d28a19a1"
name: "4"
ncbi_taxon_id: 9606

id: "WyJOQ0JJMzciLCI1Il0"
length: 180915260
md5checksum: "0740173db9ffd264d728f32784845cd7"
name: "5"
ncbi_taxon_id: 9606
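
A manual counter is one way to cut a long result stream short. As an alternative sketch, `itertools.islice` takes the first few items from any iterator. The generator below is a made-up stand-in for the search call so that the example runs without a server; its item names are placeholders, not real reference names.

```python
from itertools import islice

def fake_search_references():
    """Stand-in generator that yields 86 placeholder reference names."""
    for i in range(1, 87):
        yield "reference-{}".format(i)

# Take only the first five results; the rest of the stream is never consumed.
first_five = list(islice(fake_search_references(), 5))
for name in first_five:
    print(name)
```

Because `islice` stops pulling items once it has five, a search that is fetched lazily would not be iterated further than needed.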

Get Reference by ID

Reference sequence messages, like those above, can be retrieved directly by their identifier. This identifier points to chromosome 1 in this server instance.

In [5]:
reference = c.get_reference(base_id_ref.id)
print reference

id: "WyJOQ0JJMzciLCIxIl0"
length: 249250621
md5checksum: "1b22b98cdeb4a9304cb5d48026a85128"
name: "1"
ncbi_taxon_id: 9606
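
As a side observation, the identifier strings printed in this example decode as unpadded URL-safe base64 wrapping a JSON array. The sketch below shows this for the two ids seen so far. This appears to be an implementation detail of this server rather than part of the API, so client code should still treat ids as opaque; the helper name `decode_compound_id` is made up for illustration.

```python
import base64
import json

def decode_compound_id(encoded):
    """Decode an unpadded URL-safe base64 string that wraps a JSON array."""
    # Restore the '=' padding that base64 decoding expects.
    padding = "=" * (-len(encoded) % 4)
    raw = base64.urlsafe_b64decode(encoded + padding)
    return json.loads(raw.decode("utf-8"))

set_parts = decode_compound_id("WyJOQ0JJMzciXQ")       # reference set id
ref_parts = decode_compound_id("WyJOQ0JJMzciLCIxIl0")  # chromosome 1 id
print(set_parts)
print(ref_parts)
```

The second id decodes to the reference set name followed by the reference name, which matches the statement that it points to chromosome 1.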

List Reference Bases

Using the reference id from above, we can construct a query to list the bases present on a sequence between start and end offsets.

In [6]:
reference_bases = c.list_reference_bases(reference.id, start=15000, end=16000)
print reference_bases
print len(reference_bases)
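
To make the offsets concrete, here is a minimal local sketch. It assumes the GA4GH convention of 0-based start and exclusive end, so a query with start=15000 and end=16000 should cover end - start = 1000 bases. The function `list_bases` and the toy sequence are made up for illustration and do not call the server.

```python
def list_bases(sequence, start, end):
    """Return the bases in the 0-based, half-open interval [start, end)."""
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("offsets must satisfy 0 <= start <= end <= length")
    return sequence[start:end]

# Made-up sequence standing in for a chromosome (20,000 bases).
toy_sequence = "ACGT" * 5000

window = list_bases(toy_sequence, start=15000, end=16000)
print(len(window))
print(window[:8])
```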

For documentation on the service, and more information go to.