We'll be using the Google Python API client for interacting with Genomics API. We can install this library, or any other 3rd-party Python libraries from the Python Package Index (PyPI) using the pip
package manager.
There are 50+ Google APIs that you can work against with the Google Python API Client, but we'll focus on the Genomics API in this notebook.
In [1]:
!pip install --upgrade google-api-python-client
Next we construct a Python object that we can use it to make requests.
The following snippet shows how we can authenticate using the service account on the Datalab host. For more detail about authentication from Python, see Using OAuth 2.0 for Server to Server Applications.
In [2]:
from httplib2 import Http
from oauth2client.client import GoogleCredentials
credentials = GoogleCredentials.get_application_default()
http = Http()
credentials.authorize(http)
Out[2]:
Out[2]:
And then we create a client for the Genomics API.
In [3]:
from apiclient.discovery import build
genomics = build('genomics', 'v1', http=http)
Now that we have a Python client for the Genomics API, we can access a variety of different resources. For details about each available resource, see the python client API docs here.
Using our genomics
client, we'll demonstrate fetching a Dataset resource by ID (the 1000 Genomes dataset in this case).
First, we need to construct a request object.
In [4]:
request = genomics.datasets().get(datasetId='10473108253681171589')
Next, we'll send this request to the Genomics API by calling the request.execute()
method.
In [5]:
response = request.execute()
You will need enable the Genomics API for your project if you have not done so previously. Click on this link to enable the API in your project.
The response object returned is simply a Python dictionary. Let's take a look at the properties returned in the response.
In [6]:
for entry in response.items():
print "%s => %s" % entry
Success! We can see the name of the specified Dataset and a few other pieces of metadata.
Accessing other Genomics API resources will follow this same set of steps. The full list of available resources within the API is here. Each resource has details about the different verbs that can be applied (e.g., Dataset methods).
In this portion of the notebook, we implement this same example implemented as a python script. First let's define a few constants to use within the examples that follow.
In [7]:
dataset_id = '10473108253681171589' # This is the 1000 Genomes dataset ID
sample = 'NA12872'
reference_name = '22'
reference_position = 51003835
First find the read group set ID for the sample.
In [8]:
request = genomics.readgroupsets().search(
body={'datasetIds': [dataset_id], 'name': sample},
fields='readGroupSets(id)')
read_group_sets = request.execute().get('readGroupSets', [])
if len(read_group_sets) != 1:
raise Exception('Searching for %s didn\'t return '
'the right number of read group sets' % sample)
read_group_set_id = read_group_sets[0]['id']
Once we have the read group set ID, lookup the reads at the position in which we are interested.
In [9]:
request = genomics.reads().search(
body={'readGroupSetIds': [read_group_set_id],
'referenceName': reference_name,
'start': reference_position,
'end': reference_position + 1,
'pageSize': 1024},
fields='alignments(alignment,alignedSequence)')
reads = request.execute().get('alignments', [])
And we print out the results.
In [10]:
# Note: This is simplistic - the cigar should be considered for real code
bases = [read['alignedSequence'][
reference_position - int(read['alignment']['position']['position'])]
for read in reads]
print '%s bases on %s at %d are' % (sample, reference_name, reference_position)
from collections import Counter
for base, count in Counter(bases).items():
print '%s: %s' % (base, count)
First find the call set ID for the sample.
In [11]:
request = genomics.callsets().search(
body={'variantSetIds': [dataset_id], 'name': sample},
fields='callSets(id)')
resp = request.execute()
call_sets = resp.get('callSets', [])
if len(call_sets) != 1:
raise Exception('Searching for %s didn\'t return '
'the right number of call sets' % sample)
call_set_id = call_sets[0]['id']
Once we have the call set ID, lookup the variants that overlap the position in which we are interested.
In [12]:
request = genomics.variants().search(
body={'callSetIds': [call_set_id],
'referenceName': reference_name,
'start': reference_position,
'end': reference_position + 1},
fields='variants(names,referenceBases,alternateBases,calls(genotype))')
variant = request.execute().get('variants', [])[0]
And we print out the results.
In [13]:
variant_name = variant['names'][0]
genotype = [variant['referenceBases'] if g == 0
else variant['alternateBases'][g - 1]
for g in variant['calls'][0]['genotype']]
print 'the called genotype is %s for %s' % (','.join(genotype), variant_name)