In this notebook we'll show you how you might combine information available in BigQuery tables with sequence-reads that have been imported into Google Genomics. We'll be using the open-access CCLE data for this example.
You'll need to make sure that your project has the necessary APIs enabled, so take a look at the Getting started with Google Genomics page, and be sure to also have a look at this Getting started with the Genomics API tutorial notebook available on github.
We'll be using the Google Python API client so we'll need to install that first using the pip
package manager.
NOTE that Datalab is currently using an older version of the oauth2client (1.4.12) and as a result we need to install an older version of the google-api-python-client that supports it.
In [24]:
!pip install --upgrade google-api-python-client==1.4.2
Next we're going to need to authenticate using the service account on the Datalab host.
In [25]:
from httplib2 import Http
from oauth2client.client import GoogleCredentials
credentials = GoogleCredentials.get_application_default()
http = Http()
credentials.authorize(http)
Out[25]:
Now we can create a client for the Genomics API. NOTE that in order to use the Genomics API, you need to have enabled it for your GCP project.
In [26]:
from apiclient import discovery
ggSvc = discovery.build ( 'genomics', 'v1', http=http )
We're also going to want to work with BigQuery, so we'll need the biguery module. We will also be using the pandas and time modules.
In [27]:
import gcp.bigquery as bq
import pandas as pd
import time
The ISB-CGC group has assembled metadata as well as molecular data from the CCLE project into an open-access BigQuery dataset called isb-cgc:ccle_201602_alpha
. In this notebook we will make use of two tables in this dataset: Mutation_calls
and DataFile_info
. You can explore the entire dataset using the BigQuery web UI.
Let's say that we're interested in cell-lines with BRAF V600 mutations, and in particular we want to see if there is evidence in both the DNA-seq and the RNA-seq data for these mutations. Let's start by making sure that there are some cell-lines with these mutations in our dataset:
In [28]:
%%sql
SELECT CCLE_name, Hugo_Symbol, Protein_Change, Genome_Change
FROM [isb-cgc:ccle_201602_alpha.Mutation_calls]
WHERE ( Hugo_Symbol="BRAF" AND Protein_Change CONTAINS "p.V600" )
ORDER BY Cell_line_primary_name
LIMIT 5
Out[28]:
OK, so let's get the complete list of cell-lines with this particular mutation:
In [29]:
%%sql --module get_mutated_samples
SELECT CCLE_name
FROM [isb-cgc:ccle_201602_alpha.Mutation_calls]
WHERE ( Hugo_Symbol="BRAF" AND Protein_Change CONTAINS "p.V600" )
ORDER BY Cell_line_primary_name
In [30]:
r = bq.Query(get_mutated_samples).results()
list1 = r.to_dataframe()
print " Found %d samples with a BRAF V600 mutation. " % len(list1)
Now we want to know, from the DataFile_info
table, which cell lines have both DNA-seq and RNA-seq data imported into Google Genomics. (To find these samples, we will look for samples that have non-null readgroupset IDs from "DNA" and "RNA" pipelines.)
In [31]:
%%sql --module get_samples_with_data
SELECT
a.CCLE_name AS CCLE_name
FROM (
SELECT
CCLE_name
FROM
[isb-cgc:ccle_201602_alpha.DataFile_info]
WHERE
( Pipeline CONTAINS "DNA"
AND GG_readgroupset_id<>"NULL" ) ) a
JOIN (
SELECT
CCLE_name
FROM
[isb-cgc:ccle_201602_alpha.DataFile_info]
WHERE
( Pipeline CONTAINS "RNA"
AND GG_readgroupset_id<>"NULL" ) ) b
ON
a.CCLE_name = b.CCLE_name
In [32]:
r = bq.Query(get_samples_with_data).results()
list2 = r.to_dataframe()
print " Found %d samples with both DNA-seq and RNA-seq reads. " % len(list2)
Now let's find out which samples are in both of these lists:
In [33]:
list3 = pd.merge ( list1, list2, how='inner', on=['CCLE_name'] )
print " Found %d mutated samples with DNA-seq and RNA-seq data. " % len(list3)
No we're going to take a closer look at the reads from each of these samples. First, we'll need to be able to get the readgroupset IDs for each sample from the BigQuery table. To do this, we'll define a parameterized function:
In [34]:
%%sql --module get_readgroupsetid
SELECT Pipeline, GG_readgroupset_id
FROM [isb-cgc:ccle_201602_alpha.DataFile_info]
WHERE CCLE_name=$c AND GG_readgroupset_id<>"NULL"
Let's take a look at how this will work:
In [35]:
aName = list3['CCLE_name'][0]
print aName
ids = bq.Query(get_readgroupsetid,c=aName).to_dataframe()
print ids
Ok, so we see that for this sample, we have two readgroupset IDs, one based on DNA-seq and one based on RNA-seq. This is what we expect, based on how we chose this list of samples.
Now we'll define a function we can re-use that calls the GA4GH API reads.search method to find all reads that overlap the V600 mutation position. Note that we will query all of the readgroupsets that we get for each sample at the same time by passing in a list of readGroupSetIds. Once we have the reads, we'll organized them into a dictionary based on the local context centered on the mutation hotspot.
In [36]:
chr = "7"
pos = 140453135
width = 11
rgsList = ids['GG_readgroupset_id'].tolist()
def getReads ( rgsList, pos, width):
payload = { "readGroupSetIds": rgsList,
"referenceName": chr,
"start": pos-(width/2),
"end": pos+(width/2),
"pageSize": 2048
}
r = ggSvc.reads().search(body=payload).execute()
context = {}
for a in r['alignments']:
rgsid = a['readGroupSetId']
seq = a['alignedSequence']
seqStartPos = int ( a['alignment']['position']['position'] )
relPos = pos - (width/2) - seqStartPos
if ( relPos >=0 and relPos+width<len(seq) ):
# print rgsid, seq[relPos:relPos+width]
c = seq[relPos:relPos+width]
if (c not in context):
context[c] = {}
context[c][rgsid] = 1
else:
if (rgsid not in context[c]):
context[c][rgsid] = 1
else:
context[c][rgsid] += 1
for c in context:
numReads = 0
for a in context[c]:
numReads += context[c][a]
# write it out only if we have information from two or more readgroupsets
if ( numReads>3 or len(context[c])>1 ):
print " --> ", c, context[c]
Here we define the position (0-based) of the BRAF V600 mutation:
In [37]:
chr = "7"
pos = 140453135
width = 11
OK, now we can loop over all of the samples we found earlier:
In [38]:
for aName in list3['CCLE_name']:
print " "
print " "
print aName
r = bq.Query(get_readgroupsetid,c=aName).to_dataframe()
for i in range(r.shape[0]):
print " ", r['Pipeline'][i], r['GG_readgroupset_id'][i]
rgsList = r['GG_readgroupset_id'].tolist()
getReads ( rgsList, pos, width)
Notice that we consistently see greater read-depth in the DNA-seq data. Also all but the last sample are heterozygous for the V600 mutation, while WM1799_SKIN
is homozygous. (Of course a proper analysis would also take into consideration the cigar information that is available with each read as well.)
In [ ]: