GA4GH 1000 Genomes Examples

Variant data from the Thousand Genomes project is being served by the GA4GH reference server. In this example we show how to use the GA4GH client to access the data.

Initialize the client

In this step we create a client object which will be used to communicate with the server. It is initialized using the URL.


In [2]:
from ga4gh.client import client
c = client.HttpClient("http://1kgenomes.ga4gh.org")

We will continue to refer to this client object for accessing the remote server.

Access the dataset

Here, we issue or first API call to get a listing of datasets hosted by the server. The API call returns an iterator, which is iterated on once to get the 1kgenomes dataset.


In [3]:
dataset = c.search_datasets().next()

In [4]:
print(dataset)


id: "WyIxa2dlbm9tZXMiXQ"
name: "1kgenomes"
description: "Variants from the 1000 Genomes project and GENCODE genes annotations"

Access the reference set

The GA4GH Genomics API provides methods for accessing the bases of a reference. The Thousand Genomes data presented here are mapped to GRCh37.

To access these data we first list the reference sets.


In [5]:
reference_set = c.search_reference_sets().next()
print(reference_set)


id: "WyJOQ0JJMzciXQ"
name: "NCBI37"
md5checksum: "54e0bb53844059bb7152618fc927cfa9"
ncbi_taxon_id: 9606
description: "NCBI37 assembly of the human genome"
source_uri: "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz"

With the reference set saved to a variable we will now request for the available references. This is the list of contigs for which we can request reference bases.


In [6]:
references = [r for r in c.search_references(reference_set_id=reference_set.id)]
print(', '.join(sorted([reference.name for reference in references])))


1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, GL000191.1, GL000192.1, GL000193.1, GL000194.1, GL000195.1, GL000196.1, GL000197.1, GL000198.1, GL000199.1, GL000200.1, GL000201.1, GL000202.1, GL000203.1, GL000204.1, GL000205.1, GL000206.1, GL000207.1, GL000208.1, GL000209.1, GL000210.1, GL000211.1, GL000212.1, GL000213.1, GL000214.1, GL000215.1, GL000216.1, GL000217.1, GL000218.1, GL000219.1, GL000220.1, GL000221.1, GL000222.1, GL000223.1, GL000224.1, GL000225.1, GL000226.1, GL000227.1, GL000228.1, GL000229.1, GL000230.1, GL000231.1, GL000232.1, GL000233.1, GL000234.1, GL000235.1, GL000236.1, GL000237.1, GL000238.1, GL000239.1, GL000240.1, GL000241.1, GL000242.1, GL000243.1, GL000244.1, GL000245.1, GL000246.1, GL000247.1, GL000248.1, GL000249.1, MT, NC_007605, X, Y, hs37d5

Here, we print the names of the available references. These reference names are used in the variants/search API. By selecting one of the references we can craft a ListBases request. Here, we ask for the 1000 bases between 10,000 and 11,000 on the first chromosome.

List reference bases


In [7]:
chr1 = filter(lambda x: x.name == "1", references)[0]
bases = c.list_reference_bases(chr1.id, start=10000, end=11000)
print(bases)
print(len(bases))


TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCGCCCGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAGAGTACCACCGAAATCTGTGCAGAGGACAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTTGCAAAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGACACATGCTAGCGCGTCGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGACACATGCTACCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCACCGCGCCGGCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAGAGACGC
1000

List Variant Sets

The Thousand Genomes project sequenced the genome of over 2500 participants and released variation data in VCF format. The GA4GH reference server hosts those variant sets, and in this step we will list the available containers.


In [8]:
release = None
functional = None
for variant_set in c.search_variant_sets(dataset_id=dataset.id):
    if variant_set.name == "phase3-release":
        release = variant_set
    else:
        functional = variant_set

There are two variant sets currently being made available by this server instance. release contains the calls for the each participant and functional_annotation provides details of the effects of these variants created using the Variant Effect Predictor.

Search variants

Now that we have stored identifiers for the variant sets hosted by this server, we can craft search requests to find the individual variants. The GA4GH genomics API closely follows VCF by providing the calls for each variant in the variant record itself. Let's get a single variant record to examine.


In [9]:
all_call_sets = list(c.search_call_sets(release.id))

call_set_ids = []
for call_set in all_call_sets:
    call_set_ids.append(str(call_set.id))

example_variant = c.search_variants(variant_set_id=release.id, start=10000, end=11000, reference_name=chr1.name, call_set_ids=call_set_ids).next()
print("Variant name: {}".format(example_variant.names[0]))
print("Start: {}, End: {}".format(example_variant.start, example_variant.end))
print("Reference bases: {}".format(example_variant.reference_bases))
print("Alternate bases: {}".format(example_variant.alternate_bases))
print("Number of calls: {}".format(len(example_variant.calls)))


Variant name: rs367896724
Start: 10176, End: 10177
Reference bases: A
Alternate bases: [u'AC']
Number of calls: 2504

Variant calls

Here, we select a few of the important fields of the variant record. Let's now examine the calls returned with this variant. The calls are what tell us which participants were believed to have presented the given variant.


In [10]:
print(example_variant.calls[0])


call_set_name: "HG00096"
call_set_id: "WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwiSEcwMDA5NiJd"
genotype: 1
genotype: 0
phaseset: "True"

This tells us that for the participant HG00096 the variant in question was observed on the first haplotype ("genotype": [1, 0]). We can perform summary statistics on the calls field to measure allele frequency. For this demonstration we will count the presence of the variant on either haplotype to determine how often it was seen amongst the participants.


In [11]:
total = 0
count = 0
for call in example_variant.calls:
    total += 1
    count += call.genotype[0] or call.genotype[1]
    
print("{}/{} participants with this variant".format(count, total))
print(float(count) / float(total))


1810/2504 participants with this variant
0.722843450479

Variant annotations

To get at the detailed variant annotations, we must first find the variant annotation set.


In [12]:
annotation_set = c.search_variant_annotation_sets(variant_set_id=functional.id).next()

We can now search for the range that includes our example variant to discover relevant annotations.


In [13]:
annotation = c.search_variant_annotations(
    variant_annotation_set_id=annotation_set.id,
    start=example_variant.start,
    end=example_variant.end,
    reference_name=chr1.name).next()
print(annotation.transcript_effects[0].effects[0].term)


upstream_gene_variant

Here we have found the annotation for our example variant and have found it has the upstream_gene_variant consequence.

Sequence annotations

Finally, we can learn more about the site of our example variant by querying the sequence annotations API, which is serving the Gencode Genes data.


In [14]:
gencode = c.search_feature_sets(dataset_id=dataset.id).next()
print(gencode)


id: "WyIxa2dlbm9tZXMiLCJnZW5jb2RlX3YyNGxpZnQzNyJd"
dataset_id: "WyIxa2dlbm9tZXMiXQ"
reference_set_id: "WyJOQ0JJMzciXQ"
name: "gencode_v24lift37"

We can now craft search requests for features to find the nearest gene:


In [15]:
gene = c.search_features(
    feature_set_id=gencode.id,
    start=10000,
    end=12000,
    reference_name="chr1",
    feature_types=['gene']).next()

print("Gene name: {}".format(gene.attributes.vals['gene_name'].values[0]))
print("Start: {}, End: {}".format(gene.start, gene.end))


Gene name: string_value: "DDX11L1"

Start: 11869, End: 14409

Querying read group set and reads from dataset

One can obtain read group set and their reads, by using the data set id, the group set name as well as the readgroup id, one can also provide a specific observance region, also we can provide a specific gene that we want to observe.

In this case we will use 0 as our starting point and 1,000,000 as our ending point, we will also be observing only the first chromosome. We will only be observing the first 2 subjects as means of demostration.


In [16]:
limit = 2
iterator = 0
read_group_set = c.search_read_group_sets(dataset_id=dataset.id).next()
print("Read group set : {}".format(read_group_set.name))
for read_group in read_group_set.read_groups:
    sequence = c.search_reads(read_group_ids=[read_group.id], start=0, end = 1000000, reference_id=references[0].id).next()
    print("Read group name: {},\nSequence: {}".format(read_group.name, sequence.aligned_sequence))


Read group set : HG03270
Read group name: ERR181329,
Sequence: ATAACCCTAACCATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA
Read group name: ERR184328,
Sequence: TAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGGTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGGTAGGGATAGGGGTAGGGGTAGGGGGAGG
Read group name: ERR184336,
Sequence: TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAGC
Read group name: ERR184344,
Sequence: TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCCTACCCCAACCCCTACCCCAAACCCTACCCCTACCCCTACC