Variant data from the Thousand Genomes project is being served by the GA4GH reference server. In this example we show how to use the GA4GH client to access the data.
In this step we create a client object which will be used to communicate with the server. It is initialized using the URL.
In [1]:
import ga4gh.client as client
c = client.HttpClient("http://1kgenomes.ga4gh.org/")
We will continue to refer to this client object for accessing the remote server.
Here, we issue or first API call to get a listing of datasets hosted by the server. The API call returns an iterator, which is iterated on once to get the 1kgenomes
dataset.
In [2]:
dataset = c.searchDatasets().next()
In [3]:
print(dataset)
The GA4GH Genomics API provides methods for accessing the bases of a reference. The Thousand Genomes data presented here are mapped to GRCh37.
To access these data we first list the reference sets.
In [4]:
referenceSet = c.searchReferenceSets().next()
print(referenceSet)
With the reference set saved to a variable we will now request for the available references. This is the list of contigs for which we can request reference bases.
In [5]:
references = [r for r in c.searchReferences(referenceSetId = referenceSet.id)]
print(', '.join(sorted([reference.name for reference in references])))
Here, we print the names of the available references. These reference names are used in the variants/search
API. By selecting one of the references we can craft a ListBases
request. Here, we ask for the 1000 bases between 10,000 and 11,000 on the first chromosome.
In [6]:
chr1 = filter(lambda x: x.name == "1", references)[0]
bases = c.listReferenceBases(chr1.id, start=10000, end=11000)
print(bases)
print(len(bases))
The Thousand Genomes project sequenced the genome of over 2500 participants and released variation data in VCF format. The GA4GH reference server hosts those variant sets, and in this step we will list the available containers.
In [7]:
release = None
functional = None
for variantSet in c.searchVariantSets(datasetId=dataset.id):
if variantSet.name == "release":
release = variantSet
else:
functional = variantSet
There are two variant sets currently being made available by this server instance. release
contains the calls for the each participant and functional_annotation
provides details of the effects of these variants created using the Variant Effect Predictor.
Now that we have stored identifiers for the variant sets hosted by this server, we can craft search requests to find the individual variants. The GA4GH genomics API closely follows VCF by providing the calls for each variant in the variant record itself. Let's get a single variant record to examine.
In [8]:
exampleVariant = c.searchVariants(variantSetId=release.id, start=10000, end=11000, referenceName=chr1.name).next()
print("Variant name: {}".format(exampleVariant.names[0]))
print("Start: {}, End: {}".format(exampleVariant.start, exampleVariant.end))
print("Reference bases: {}".format(exampleVariant.referenceBases))
print("Alternate bases: {}".format(exampleVariant.alternateBases))
print("Number of calls: {}".format(len(exampleVariant.calls)))
Here, we select a few of the important fields of the variant record. Let's now examine the calls returned with this variant. The calls are what tell us which participants were believed to have presented the given variant.
In [9]:
print(exampleVariant.calls[0])
This tells us that for the participant HG00096
the variant in question was observed on the first haplotype ("genotype": [1, 0]
). We can perform summary statistics on the calls field to measure allele frequency. For this demonstration we will count the presence of the variant on either haplotype to determine how often it was seen amongst the participants.
In [10]:
total = 0
count = 0
for call in exampleVariant.calls:
total += 1
count += call.genotype[0] or call.genotype[1]
print("{}/{} participants with this variant".format(count, total))
print(float(count) / float(total))
To get at the detailed variant annotations, we must first find the variant annotation set.
In [11]:
annotationSet = c.searchVariantAnnotationSets(variantSetId=functional.id).next()
We can now search for the range that includes our example variant to discover relevant annotations.
In [12]:
annotation = c.searchVariantAnnotations(
variantAnnotationSetId=annotationSet.id,
start=exampleVariant.start,
end=exampleVariant.end,
referenceName=chr1.name).next()
print(annotation.transcriptEffects[0].effects[0].term)
Here we have found the annotation for our example variant and have found it has the upstream_gene_variant
consequence.
Finally, we can learn more about the site of our example variant by querying the sequence annotations API, which is serving the Gencode Genes data.
In [13]:
gencode = c.searchFeatureSets(datasetId=dataset.id).next()
print(gencode)
We can now craft search requests for features to find the nearest gene:
In [14]:
gene = c.searchFeatures(
featureSetId=gencode.id,
start=10000,
end=12000,
referenceName="chr1",
featureTypes=['gene']).next()
print("Gene name: {}".format(gene.attributes.vals['gene_name'][0]))
print("Start: {}, End: {}".format(gene.start, gene.end))