GA4GH 1000 Genome Examples

Variant data from the Thousand Genomes project is being served by the GA4GH reference server. In this example we show how to use the GA4GH client to access the data.

Initialize the client

In this step we create a client object which will be used to communicate with the server. It is initialized using the URL.


In [1]:
import ga4gh.client as client
c = client.HttpClient("http://1kgenomes.ga4gh.org/")

We will continue to refer to this client object for accessing the remote server.

Access the dataset

Here, we issue or first API call to get a listing of datasets hosted by the server. The API call returns an iterator, which is iterated on once to get the 1kgenomes dataset.


In [2]:
dataset = c.searchDatasets().next()

In [3]:
print(dataset)


Dataset({"description": "Variants from the 1000 Genomes project and GENCODE genes annotations", "name": "1kgenomes", "id": "WyIxa2dlbm9tZXMiXQ"})

Access the reference set

The GA4GH Genomics API provides methods for accessing the bases of a reference. The Thousand Genomes data presented here are mapped to GRCh37.

To access these data we first list the reference sets.


In [4]:
referenceSet = c.searchReferenceSets().next()
print(referenceSet)


ReferenceSet({"name": "NCBI37", "sourceURI": "ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz", "assemblyId": null, "sourceAccessions": [], "ncbiTaxonId": 9606, "isDerived": false, "id": "WyJOQ0JJMzciXQ", "md5checksum": "54e0bb53844059bb7152618fc927cfa9", "description": "Variants from the 1000 Genomes project and GENCODE genes annotations"})

With the reference set saved to a variable we will now request for the available references. This is the list of contigs for which we can request reference bases.


In [5]:
references = [r for r in c.searchReferences(referenceSetId = referenceSet.id)]
print(', '.join(sorted([reference.name for reference in references])))


1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, GL000191.1, GL000192.1, GL000193.1, GL000194.1, GL000195.1, GL000196.1, GL000197.1, GL000198.1, GL000199.1, GL000200.1, GL000201.1, GL000202.1, GL000203.1, GL000204.1, GL000205.1, GL000206.1, GL000207.1, GL000208.1, GL000209.1, GL000210.1, GL000211.1, GL000212.1, GL000213.1, GL000214.1, GL000215.1, GL000216.1, GL000217.1, GL000218.1, GL000219.1, GL000220.1, GL000221.1, GL000222.1, GL000223.1, GL000224.1, GL000225.1, GL000226.1, GL000227.1, GL000228.1, GL000229.1, GL000230.1, GL000231.1, GL000232.1, GL000233.1, GL000234.1, GL000235.1, GL000236.1, GL000237.1, GL000238.1, GL000239.1, GL000240.1, GL000241.1, GL000242.1, GL000243.1, GL000244.1, GL000245.1, GL000246.1, GL000247.1, GL000248.1, GL000249.1, MT, NC_007605, X, Y, hs37d5

Here, we print the names of the available references. These reference names are used in the variants/search API. By selecting one of the references we can craft a ListBases request. Here, we ask for the 1000 bases between 10,000 and 11,000 on the first chromosome.

List reference bases


In [6]:
chr1 = filter(lambda x: x.name == "1", references)[0]
bases = c.listReferenceBases(chr1.id, start=10000, end=11000)
print(bases)
print(len(bases))


TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCGCCCGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAGAGTACCACCGAAATCTGTGCAGAGGACAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTTGCAAAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGACACATGCTAGCGCGTCGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGACACATGCTACCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCACCGCGCCGGCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAGAGACGC
1000

List Variant Sets

The Thousand Genomes project sequenced the genome of over 2500 participants and released variation data in VCF format. The GA4GH reference server hosts those variant sets, and in this step we will list the available containers.


In [7]:
release = None
functional = None
for variantSet in c.searchVariantSets(datasetId=dataset.id):
    if variantSet.name == "release":
        release = variantSet
    else:
        functional = variantSet

There are two variant sets currently being made available by this server instance. release contains the calls for the each participant and functional_annotation provides details of the effects of these variants created using the Variant Effect Predictor.

Search variants

Now that we have stored identifiers for the variant sets hosted by this server, we can craft search requests to find the individual variants. The GA4GH genomics API closely follows VCF by providing the calls for each variant in the variant record itself. Let's get a single variant record to examine.


In [8]:
exampleVariant = c.searchVariants(variantSetId=release.id, start=10000, end=11000, referenceName=chr1.name).next()
print("Variant name: {}".format(exampleVariant.names[0]))
print("Start: {}, End: {}".format(exampleVariant.start, exampleVariant.end))
print("Reference bases: {}".format(exampleVariant.referenceBases))
print("Alternate bases: {}".format(exampleVariant.alternateBases))
print("Number of calls: {}".format(len(exampleVariant.calls)))


Variant name: rs367896724
Start: 10176, End: 10177
Reference bases: A
Alternate bases: [u'AC']
Number of calls: 2504

Variant calls

Here, we select a few of the important fields of the variant record. Let's now examine the calls returned with this variant. The calls are what tell us which participants were believed to have presented the given variant.


In [9]:
print(exampleVariant.calls[0])


Call({"info": {}, "genotype": [1, 0], "callSetId": "WyIxa2dlbm9tZXMiLCJ2cyIsInJlbGVhc2UiLCJIRzAwMDk2Il0", "phaseset": "True", "genotypeLikelihood": [], "callSetName": "HG00096"})

This tells us that for the participant HG00096 the variant in question was observed on the first haplotype ("genotype": [1, 0]). We can perform summary statistics on the calls field to measure allele frequency. For this demonstration we will count the presence of the variant on either haplotype to determine how often it was seen amongst the participants.


In [10]:
total = 0
count = 0
for call in exampleVariant.calls:
    total += 1
    count += call.genotype[0] or call.genotype[1]
print("{}/{} participants with this variant".format(count, total))
print(float(count) / float(total))


1810/2504 participants with this variant
0.722843450479

Variant annotations

To get at the detailed variant annotations, we must first find the variant annotation set.


In [11]:
annotationSet = c.searchVariantAnnotationSets(variantSetId=functional.id).next()

We can now search for the range that includes our example variant to discover relevant annotations.


In [12]:
annotation = c.searchVariantAnnotations(
    variantAnnotationSetId=annotationSet.id,
    start=exampleVariant.start,
    end=exampleVariant.end,
    referenceName=chr1.name).next()
print(annotation.transcriptEffects[0].effects[0].term)


upstream_gene_variant

Here we have found the annotation for our example variant and have found it has the upstream_gene_variant consequence.

Sequence annotations

Finally, we can learn more about the site of our example variant by querying the sequence annotations API, which is serving the Gencode Genes data.


In [13]:
gencode = c.searchFeatureSets(datasetId=dataset.id).next()
print(gencode)


FeatureSet({"info": {}, "name": "gencode_v24lift37", "sourceURI": null, "referenceSetId": "WyJOQ0JJMzciXQ", "id": "WyIxa2dlbm9tZXMiLCJnZW5jb2RlX3YyNGxpZnQzNyJd", "datasetId": "WyIxa2dlbm9tZXMiXQ"})

We can now craft search requests for features to find the nearest gene:


In [14]:
gene = c.searchFeatures(
    featureSetId=gencode.id,
    start=10000,
    end=12000,
    referenceName="chr1",
    featureTypes=['gene']).next()
print("Gene name: {}".format(gene.attributes.vals['gene_name'][0]))
print("Start: {}, End: {}".format(gene.start, gene.end))


Gene name: DDX11L1
Start: 11869, End: 14409