GA4GH 1000 Genome Variant Service Example

This example illustrates how to access the GA4GH variant service.

The GA4GH variant service interchanges variant data similar to that found in VCF. It includes metadata related to variant entries and calls, which are the observed genetic variations pertaining to a given sample.

Initialize the client

In this step we create a client object which will be used to communicate with the server. It is initialized using the URL.


In [1]:
import ga4gh_client.client as client
c = client.HttpClient("http://1kgenomes.ga4gh.org")

Search variant sets method

This call returns variant sets hosted by the API. Observe that we are using the dataset id obtained from the metadata service example.


In [2]:
for variant_set in c.search_variant_sets(dataset_id="WyIxa2dlbm9tZXMiXQ"):
    print "Variant Set: {}".format(variant_set.name)
    print " id: {}".format(variant_set.id)
    print " dataset_id: {}".format(variant_set.dataset_id)
    print " reference_set_id: {}\n".format(variant_set.reference_set_id)


Variant Set: phase3-release
 id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIl0
 dataset_id: WyIxa2dlbm9tZXMiXQ
 reference_set_id: WyJOQ0JJMzciXQ

Variant Set: functional-annotation
 id: WyIxa2dlbm9tZXMiLCJ2cyIsImZ1bmN0aW9uYWwtYW5ub3RhdGlvbiJd
 dataset_id: WyIxa2dlbm9tZXMiXQ
 reference_set_id: WyJOQ0JJMzciXQ

Note:
In the previous call, not all the elements returned are printed to the screen.
Below we will look at the metadata descriptions in a variant set.

Get Variant Set by ID method

The following request returns a Variant Set by id. This identifier is unique to the server instance.


In [3]:
variant_set = c.get_variant_set(variant_set_id="WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIl0")
print "name: {}".format(variant_set.name)
print "dataset_id: {}".format(variant_set.dataset_id)
print "reference_set_id: {}".format(variant_set.reference_set_id)
for metadata in variant_set.metadata[0:3]:
    print metadata


name: phase3-release
dataset_id: WyIxa2dlbm9tZXMiXQ
reference_set_id: WyJOQ0JJMzciXQ
key: "version"
value: "VCFv4.1"
id: "WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwibWV0YWRhdGE6dmVyc2lvbiJd"
type: "String"
number: "1"

key: "INFO.CIEND"
id: "WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwibWV0YWRhdGE6SU5GTy5DSUVORCJd"
type: "Integer"
number: "2"
description: "Confidence interval around END for imprecise variants"

key: "INFO.CIPOS"
id: "WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwibWV0YWRhdGE6SU5GTy5DSVBPUyJd"
type: "Integer"
number: "2"
description: "Confidence interval around POS for imprecise variants"

The metadata fields of a Variant Set message can be used to interpret fields annotated about a variant, which will be addressed later in this document.

Search Variants method

Using the variant_set_id returned in one of the previous calls we constrain our variant search to a single variant set.


In [4]:
counter = 0
for variant in c.search_variants(variant_set_id="WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIl0", reference_name="1", start=10176, end= 40176):
    if counter > 5:
        break
    counter += 1
    print "Variant id: {}...".format(variant.id[0:10])
    print "Variant Set Id: {}".format(variant.variant_set_id)
    print "Names: {}".format(variant.names)
    print "Reference Chromosome: {}".format(variant.reference_name)
    print "Start: {}, End: {}".format(variant.start, variant.end)
    print "Reference Bases: {}".format(variant.reference_bases)
    print "Alternate Bases: {}\n".format(variant.alternate_bases)


Variant id: WyIxa2dlbm...
Variant Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIl0
Names: [u'rs367896724']
Reference Chromosome: 1
Start: 10176, End: 10177
Reference Bases: A
Alternate Bases: [u'AC']

Variant id: WyIxa2dlbm...
Variant Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIl0
Names: [u'rs540431307']
Reference Chromosome: 1
Start: 10234, End: 10235
Reference Bases: T
Alternate Bases: [u'TA']

Variant id: WyIxa2dlbm...
Variant Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIl0
Names: [u'rs555500075']
Reference Chromosome: 1
Start: 10351, End: 10352
Reference Bases: T
Alternate Bases: [u'TA']

Variant id: WyIxa2dlbm...
Variant Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIl0
Names: [u'rs548419688']
Reference Chromosome: 1
Start: 10504, End: 10505
Reference Bases: A
Alternate Bases: [u'T']

Variant id: WyIxa2dlbm...
Variant Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIl0
Names: [u'rs568405545']
Reference Chromosome: 1
Start: 10505, End: 10506
Reference Bases: C
Alternate Bases: [u'G']

Variant id: WyIxa2dlbm...
Variant Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIl0
Names: [u'rs534229142']
Reference Chromosome: 1
Start: 10510, End: 10511
Reference Bases: G
Alternate Bases: [u'A']

The fields shown above are selected named fields in the data model. Note that the data returned is richer, and will be illiustrated in the following example by examining the info field.

Get Variant by ID method

We can get a variant record, if we know its unique identifier, by performing a get variant call. We use one of the variant identifiers obtained above. We will then list all of the data available in the info field.


In [5]:
single_variant = c.get_variant(variant_id=variant.id)
print "idd: {}".format(single_variant.id)
print "Variant Set Id: {}".format(single_variant.variant_set_id)
print "Names: {}".format(single_variant.names)
print "Reference Name: {}".format(single_variant.reference_name)
print "Start: {}, End: {}".format(single_variant.start, single_variant.end)
print "Reference Bases: {}".format(single_variant.reference_bases)
print "Alternate Bases: {}\n".format(single_variant.alternate_bases)
for info in single_variant.info:
    print "Key: {},\tValues: {}".format(info, single_variant.info[info].values[0].string_value)


idd: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwiMSIsIjEwNTM4IiwiMTg2ZjhiZTU3MTg2OWQ3Y2UzMmY4MDNlMGRlMjZlOTUiXQ
Variant Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIl0
Names: [u'rs537182016']
Reference Name: 1
Start: 10538, End: 10539
Reference Bases: C
Alternate Bases: [u'A']

Key: EUR_AF,	Values: 0.0010000000475
Key: SAS_AF,	Values: 0.0010000000475
Key: AC,	Values: 3
Key: AA,	Values: .|||
Key: AF,	Values: 0.000599041988607
Key: AFR_AF,	Values: 0.0
Key: AMR_AF,	Values: 0.00139999995008
Key: AN,	Values: 5008
Key: VT,	Values: SNP
Key: EAS_AF,	Values: 0.0
Key: NS,	Values: 2504
Key: DP,	Values: 9203

Using the variant_set.metadata message we can interpret these annotations.


In [6]:
metadata_dictionary = {}
for metadata in variant_set.metadata:
    metadata_dictionary[metadata.key] = metadata # Load the metadata elements into a dictionary
for key in single_variant.info:
    metadata_entry = metadata_dictionary["INFO." + key]
    print "\nKey: {}     Value: {}     Type: {}".format(
        key,
        single_variant.info[key].values[0].string_value,
        metadata_entry.type)
    print " " + metadata_entry.description


Key: EUR_AF     Value: 0.0010000000475     Type: Float
 Allele frequency in the EUR populations calculated from AC and AN, in the range (0,1)

Key: SAS_AF     Value: 0.0010000000475     Type: Float
 Allele frequency in the SAS populations calculated from AC and AN, in the range (0,1)

Key: AC     Value: 3     Type: Integer
 Total number of alternate alleles in called genotypes

Key: AA     Value: .|||     Type: String
 Ancestral Allele. Format: AA|REF|ALT|IndelType. AA: Ancestral allele, REF:Reference Allele, ALT:Alternate Allele, IndelType:Type of Indel (REF, ALT and IndelType are only defined for indels)

Key: AF     Value: 0.000599041988607     Type: Float
 Estimated allele frequency in the range (0,1)

Key: AFR_AF     Value: 0.0     Type: Float
 Allele frequency in the AFR populations calculated from AC and AN, in the range (0,1)

Key: AMR_AF     Value: 0.00139999995008     Type: Float
 Allele frequency in the AMR populations calculated from AC and AN, in the range (0,1)

Key: AN     Value: 5008     Type: Integer
 Total number of alleles in called genotypes

Key: VT     Value: SNP     Type: String
 indicates what type of variant the line represents

Key: EAS_AF     Value: 0.0     Type: Float
 Allele frequency in the EAS populations calculated from AC and AN, in the range (0,1)

Key: NS     Value: 2504     Type: Integer
 Number of samples with data

Key: DP     Value: 9203     Type: Integer
 Total read depth; only low coverage data were counted towards the DP, exome data were not used

Search Call Sets method

Call sets represent a sample's column in a VCF, or the collection of observed sequences from a particular variation analysis. The Thousand Genomes data presents the variation data of 2504 individuals through 2504 callsets. These Call Sets ca be used to construct queries for individual genomic variations.


In [7]:
counter = 0
list_of_callset_ids = [] # Will use this list near the end to make a search variants query
for call_set in c.search_call_sets(variant_set_id=single_variant.variant_set_id):
    if counter > 3:
        break
    else:
        counter += 1
        list_of_callset_ids.append(call_set.id)
        print "Call Set Name: {}".format(call_set.name)
        print "  id: {}".format(call_set.name)
        print "  bio_sample_id: {}".format(call_set.name)
        print "  variant_set_ids: {}\n".format(call_set.variant_set_ids)


Call Set Name: HG00096
  id: HG00096
  bio_sample_id: HG00096
  variant_set_ids: [u'WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIl0']

Call Set Name: HG00097
  id: HG00097
  bio_sample_id: HG00097
  variant_set_ids: [u'WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIl0']

Call Set Name: HG00099
  id: HG00099
  bio_sample_id: HG00099
  variant_set_ids: [u'WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIl0']

Call Set Name: HG00100
  id: HG00100
  bio_sample_id: HG00100
  variant_set_ids: [u'WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIl0']

Get Call Set by ID method

Using one of the identifiers obtained in the previous call we can get a single Call Set element. This identifier is unique to the server instance.


In [8]:
call_set = c.get_call_set(call_set_id=call_set.id)
print call_set


id: "WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwiSEcwMDEwMSJd"
name: "HG00101"
bio_sample_id: "WyIxa2dlbm9tZXMiLCJiIiwiSEcwMDEwMSJd"
variant_set_ids: "WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIl0"

Making a Search Variants Request with Call Set IDs

By passing in a list of call_set_ids to the variants search method, we can get in return a set of variants with the list of calls that belong to that variant. The genotype field of a Call states whether the variation was observed for a given sample.


In [9]:
for variant_with_calls in  c.search_variants(call_set_ids=list_of_callset_ids, variant_set_id="WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIl0", reference_name="1", start=10176, end= 10502):
    print "Variant Id: {}".format(variant_with_calls.id)
    print "Variant Set Id: {}".format(variant_with_calls.variant_set_id)
    print "Names:{}".format(variant_with_calls.names)
    print "Reference Chromosome: {}".format(variant_with_calls.reference_name)
    print "Start: {}, End: {}".format(variant_with_calls.start, variant_with_calls.end)
    print "Reference Bases: {}".format(variant_with_calls.reference_bases)
    print "Alternate Bases: {}\n".format(variant_with_calls.alternate_bases)
    for call in variant_with_calls.calls:
        print "  Call Set Name: {}".format(call.call_set_name)
        print "  Genotype: {}".format(call.genotype)
        print "  Phase set: {}".format(call.phaseset)
        print "  Call Set Id: {}\n".format(call.call_set_id)


Variant Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwiMSIsIjEwMTc2IiwiZDAxNmM0ZTFhZGNhZDVkMWJjODljMmNhNGFkYmEzYTgiXQ
Variant Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIl0
Names:[u'rs367896724']
Reference Chromosome: 1
Start: 10176, End: 10177
Reference Bases: A
Alternate Bases: [u'AC']

  Call Set Name: HG00096
  Genotype: [1, 0]
  Phase set: True
  Call Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwiSEcwMDA5NiJd

  Call Set Name: HG00097
  Genotype: [0, 1]
  Phase set: True
  Call Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwiSEcwMDA5NyJd

  Call Set Name: HG00099
  Genotype: [0, 1]
  Phase set: True
  Call Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwiSEcwMDA5OSJd

  Call Set Name: HG00100
  Genotype: [1, 0]
  Phase set: True
  Call Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwiSEcwMDEwMCJd

Variant Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwiMSIsIjEwMjM0IiwiMGNlMzUwNzI0NDYxNGMzNzA1ZjVlMmFhMmQxMGFmMjUiXQ
Variant Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIl0
Names:[u'rs540431307']
Reference Chromosome: 1
Start: 10234, End: 10235
Reference Bases: T
Alternate Bases: [u'TA']

  Call Set Name: HG00096
  Genotype: [0, 0]
  Phase set: True
  Call Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwiSEcwMDA5NiJd

  Call Set Name: HG00097
  Genotype: [0, 0]
  Phase set: True
  Call Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwiSEcwMDA5NyJd

  Call Set Name: HG00099
  Genotype: [0, 0]
  Phase set: True
  Call Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwiSEcwMDA5OSJd

  Call Set Name: HG00100
  Genotype: [0, 0]
  Phase set: True
  Call Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwiSEcwMDEwMCJd

Variant Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwiMSIsIjEwMzUxIiwiMGNlMzUwNzI0NDYxNGMzNzA1ZjVlMmFhMmQxMGFmMjUiXQ
Variant Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIl0
Names:[u'rs555500075']
Reference Chromosome: 1
Start: 10351, End: 10352
Reference Bases: T
Alternate Bases: [u'TA']

  Call Set Name: HG00096
  Genotype: [1, 0]
  Phase set: True
  Call Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwiSEcwMDA5NiJd

  Call Set Name: HG00097
  Genotype: [1, 0]
  Phase set: True
  Call Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwiSEcwMDA5NyJd

  Call Set Name: HG00099
  Genotype: [0, 1]
  Phase set: True
  Call Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwiSEcwMDA5OSJd

  Call Set Name: HG00100
  Genotype: [0, 1]
  Phase set: True
  Call Set Id: WyIxa2dlbm9tZXMiLCJ2cyIsInBoYXNlMy1yZWxlYXNlIiwiSEcwMDEwMCJd

We have searched for variants using Call Set IDs to find individual genetic variations. If the list of calls in the search was increased, more calls would have been returned in the variant search.
Getting a variant by ID also returns this information for all the available calls.
For more information and documentation on the Variant service,

https://ga4gh-schemas.readthedocs.io/en/latest/schemas/variant_service.proto.html