This notebook demonstrates working with genetic variant data stored as publicly accessible Google BigQuery datasets.
Specifically, we will work with the Illumina Platinum Genomes data. The source data was originally in VCF format, which was imported to Google Genomics and exported to BigQuery.
If you want to explore other genomics samples, see https://github.com/googlegenomics/datalab-examples. You can import them into your Google Cloud Datalab instance by uploading them from the notebook list page.
In [1]:
import google.datalab.bigquery as bq
Let's start with a little bit of genomics nomenclature:
In [2]:
variants = bq.Table('genomics-public-data.platinum_genomes.variants')
variants.length
Out[2]:
Wow, that's a lot of records! For a detailed description of the schema, see Understanding the BigQuery Variants Table Schema.
Next, let's take a look at a few SNPs in our data. As mentioned in the nomenclature, SNPs are a particular kind of genetic variant. Let's create an SQL statement that can list all variant records for a single base change:
In [3]:
%%bq query --name single_base
SELECT
reference_name,
start,
reference_bases,
alternate_bases
FROM
`genomics-public-data.platinum_genomes.variants`
WHERE
reference_bases IN ('A','C','G','T') AND
ARRAY_LENGTH(alternate_bases) = 1
LIMIT 100
In [4]:
%bq execute --query single_base
Out[4]:
A SNP can be further classified as either a transition or a transversion. The ratio of transitions to transversions (TiTv ratio) in humans is observed to be approximately 2.1, but this is not uniform across the genome. Let's take a closer look by computing the TiTv ratio in contiguous regions of 100,000 base pairs.
In [5]:
%%bq query --name snps
SELECT
reference_name,
reference_bases,
alternate_bases,
CAST(FLOOR(start / 100000) AS INT64) AS windows,
CONCAT(reference_bases, CONCAT(CAST('->' AS STRING), ARRAY_TO_STRING(alternate_bases, ''))) AS mutation,
ARRAY_LENGTH(alternate_bases) AS num_alts
FROM
`genomics-public-data.platinum_genomes.variants`
WHERE reference_name IN ("1", "chr1")
# Limit to bi-allelic SNP variants
AND reference_bases IN ('A','C','G','T')
AND ARRAY_LENGTH(alternate_bases) = 1
We've updated the above query to include the "window" in which the SNP resides, and added a new field called "mutation".
In [6]:
%bq sample -q snps --count 10
Out[6]:
Next we group and classify the SNPs within their windows.
In [7]:
%%bq query --name windows --subqueries snps
SELECT
reference_name,
windows AS win,
COUNTIF(mutation IN ('A->G', 'G->A', 'C->T', 'T->C')) AS transitions,
COUNTIF(mutation IN ('A->C', 'C->A', 'G->T', 'T->G',
'A->T', 'T->A', 'C->G', 'G->C')) AS transversions,
COUNT(mutation) AS num_variants_in_window
FROM snps
GROUP BY
reference_name,
win
In [8]:
%bq sample -q windows --count 10
Out[8]:
And finally, we compute the per-window TiTv ratio.
In [9]:
%%bq query --name titv --subqueries snps windows
SELECT
reference_name,
win * 100000 AS window_start,
transitions,
transversions,
transitions/transversions AS titv,
num_variants_in_window
FROM windows
ORDER BY
window_start
In [10]:
%bq sample -q titv --count 10
Out[10]:
In [11]:
titvRatios = titv.execute(output_options=bq.QueryOutput.dataframe()).result()
titvRatios[:5]
Out[11]:
Now we can take the ratios and plot them by genomic position to see how the ratio varies depending upon where it was computed within the chromosome. By default, this plot shows chromosome 1 with its gap in the center of the data corresponding to its metacentric contromere.
In [14]:
import seaborn as sns
import matplotlib.pyplot as plt
g = sns.lmplot(x='window_start',
y='titv',
data = titvRatios,
size = 8,
aspect = 2,
scatter_kws = { 'color': 'black', 'alpha': 0.4 },
line_kws = { 'color': 'blue', 'alpha': 0.5, 'lw': 4 },
lowess = True)
plt.xlabel('Genomic Position', fontsize = 14)
plt.ylabel('Ti/Tv Ratio', fontsize = 14)
plt.title('Ti/Tv Ratio computed on %d base pair windows for %s' % (100000, ['1', 'chr1']),
fontsize = 18)
plt.annotate('Centromere', xy = (1.3e8, 1.5), xytext = (1.5e8, 3), size = 14,
arrowprops=dict(facecolor='black', shrink=0.05))
Out[14]:
In [ ]: