This notebook demonstrates conducting a genome-wide association study (GWAS) using the public 1000 Genomes dataset stored in BigQuery. Specifically we will:
- use the %%sql statement to write and execute SQL statements within the notebook
- retrieve a local copy of the top results
- visualize the data with Python libraries
In this experiment, we'll identify variant positions within chromosome 12 that differ significantly between the case and control groups. The case group for the purposes of this notebook consists of individuals from the "EAS" (East Asian) super population; all other individuals serve as the control group. Variant data from the 1000 Genomes dataset is publicly accessible within BigQuery.
In [1]:
import gcp.bigquery as bq

# Reference the public 1000 Genomes variants table and inspect its schema.
variants_table = bq.Table('genomics-public-data:1000_genomes.variants')
variants_table.schema
Out[1]:
We can tally the reference/alternate allele counts for individual variant positions within chromosome 12. The field call.genotype is an integer ranging over [-1, num_alternate_bases], where -1 indicates that the genotype was not called.
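To make the counting concrete, here is a minimal Python sketch (not part of the pipeline) of the per-call tally that the SQL below performs for a single bi-allelic, diploid call:

genotype = [0, 1]  # hypothetical call: one reference allele, one alternate
ref_count = sum(1 for g in genotype if g == 0)  # mirrors SUM(0 = call.genotype)
alt_count = sum(1 for g in genotype if g == 1)  # mirrors SUM(1 = call.genotype)
print(ref_count, alt_count)  # -> 1 1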
In [2]:
%%sql --module allele_counts
SELECT
reference_name,
start,
reference_bases,
alternate_bases,
end,
vt,
call.call_set_name AS call_set_name,
# 1000 genomes phase 1 data is bi-allelic so there is only ever a single alt
SUM(0 = call.genotype) WITHIN RECORD AS ref_count,
SUM(1 = call.genotype) WITHIN RECORD AS alt_count,
FROM
FLATTEN((
SELECT
reference_name,
start,
reference_bases,
alternate_bases,
end,
vt,
call.call_set_name,
call.genotype,
FROM
$variants_table
WHERE
reference_name = '12' -- i.e., chromosome 12
),
call)
Let's verify that our allele counts match our expectations before moving on. For any given row, the reference and alternate counts should sum to 2 in this experiment, since each genome is diploid and each call therefore contributes exactly two alleles.
In [3]:
bq.Query(allele_counts, variants_table=variants_table).sample().to_dataframe()
Out[3]:
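For a programmatic version of this check, we could assert the invariant over a sampled dataframe (a hedged sketch reusing the query above):

df = bq.Query(allele_counts, variants_table=variants_table).sample().to_dataframe()
# Each diploid call should contribute exactly two alleles.
assert ((df['ref_count'] + df['alt_count']) == 2).all()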
Now we can join our allele counts with metadata available in the sample info table. We'll use this sample metadata to split the set of genomes into case and control groups based upon the super population group.
In [4]:
sample_info_table = bq.Table('genomics-public-data:1000_genomes.sample_info')
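As a quick orientation before splitting into groups, we could tally how many genomes fall into each super population (a hedged sketch, assuming bq.Query accepts a raw legacy-SQL string as it does elsewhere in this API):

bq.Query('SELECT super_population, COUNT(*) AS cnt '
         'FROM [genomics-public-data:1000_genomes.sample_info] '
         'GROUP BY super_population').results()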
In [5]:
%%sql --module exp_groups
SELECT
super_population,
('EAS' = super_population) AS is_case,
call_set_name,
reference_name,
start,
reference_bases,
alternate_bases,
end,
vt,
ref_count,
alt_count,
FROM $allele_counts AS allele_counts
JOIN $sample_info_table AS samples
ON allele_counts.call_set_name = samples.sample
In [6]:
bq.Query(exp_groups, allele_counts=allele_counts,
sample_info_table=sample_info_table,
variants_table=variants_table).sample().to_dataframe()
Out[6]:
The variants table contains a few different variant types: structural variants ("SV"), indels ("INDEL"), and SNPs ("SNP").
In [7]:
%%sql
SELECT
vt,
COUNT(*)
FROM $exp_groups
GROUP BY vt
Out[7]:
For the purposes of this experiment, let's limit the variants to only SNPs.
In [8]:
%%sql --module snps
SELECT *
FROM $exp_groups
WHERE vt='SNP'
In [9]:
bq.Query(snps,
exp_groups=exp_groups,
allele_counts=allele_counts,
sample_info_table=sample_info_table,
variants_table=variants_table).sample().to_dataframe()
Out[9]:
Now that we've assigned each call set to either the case or the control group, we can tally up the counts of reference and alternate alleles within each of our assigned case/control groups, for each variant position, like so:
In [10]:
%%sql --module grouped_counts
SELECT
reference_name,
start,
end,
reference_bases,
alternate_bases,
vt,
SUM(ref_count + alt_count) AS allele_count,
SUM(ref_count) AS ref_count,
SUM(alt_count) AS alt_count,
SUM(IF(TRUE = is_case, INTEGER(ref_count + alt_count), 0)) AS case_count,
SUM(IF(FALSE = is_case, INTEGER(ref_count + alt_count), 0)) AS control_count,
SUM(IF(TRUE = is_case, ref_count, 0)) AS case_ref_count,
SUM(IF(TRUE = is_case, alt_count, 0)) AS case_alt_count,
SUM(IF(FALSE = is_case, ref_count, 0)) AS control_ref_count,
SUM(IF(FALSE = is_case, alt_count, 0)) AS control_alt_count,
FROM $snps
GROUP BY
reference_name,
start,
end,
reference_bases,
alternate_bases,
vt
Again, let's validate that the group-level counts make sense (still per variant position).
In [11]:
bq.Query(grouped_counts,
snps=snps,
exp_groups=exp_groups,
allele_counts=allele_counts,
sample_info_table=sample_info_table,
variants_table=variants_table).sample().to_dataframe()
Out[11]:
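These invariants can also be checked programmatically (a hedged sketch over a sampled dataframe from the query above):

df = bq.Query(grouped_counts, snps=snps, exp_groups=exp_groups,
              allele_counts=allele_counts, sample_info_table=sample_info_table,
              variants_table=variants_table).sample().to_dataframe()
# Case and control alleles should partition the total allele count.
assert (df['case_count'] + df['control_count'] == df['allele_count']).all()
assert (df['ref_count'] + df['alt_count'] == df['allele_count']).all()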
We can quantify the statistical significance of each variant position using the Chi-squared test. Furthermore, we can restrict our result set to only statistically significant variant positions by ranking each position by its statistical significance (decreasing) and thresholding the results at p <= 5e-8 (Chi-squared score >= 29.71679 for one degree of freedom).
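As a cross-check on that threshold (a hedged sketch), scipy can recover the critical value from the target p-value:

from scipy.stats import chi2
# Inverse survival function: the Chi-squared value with upper-tail mass 5e-8 at df=1.
print('critical value = %.5f' % chi2.isf(5e-8, df=1))  # ~29.71679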
In [12]:
%%sql --module results
SELECT
reference_name,
start,
end,
reference_bases,
alternate_bases,
vt,
case_count,
control_count,
allele_count,
ref_count,
alt_count,
case_ref_count,
case_alt_count,
control_ref_count,
control_alt_count,
# http://homes.cs.washington.edu/~suinlee/genome560/lecture7.pdf
# https://en.wikipedia.org/wiki/Yates%27s_correction_for_continuity
ROUND(
POW(ABS(case_ref_count - (ref_count/allele_count)*case_count) - 0.5,
2)/((ref_count/allele_count)*case_count) +
POW(ABS(control_ref_count - (ref_count/allele_count)*control_count) - 0.5,
2)/((ref_count/allele_count)*control_count) +
POW(ABS(case_alt_count - (alt_count/allele_count)*case_count) - 0.5,
2)/((alt_count/allele_count)*case_count) +
POW(ABS(control_alt_count - (alt_count/allele_count)*control_count) - 0.5,
2)/((alt_count/allele_count)*control_count),
3) AS chi_squared_score
FROM $grouped_counts
WHERE
# For chi-squared, expected counts must be at least 5 for each group
(ref_count/allele_count)*case_count >= 5.0
AND (ref_count/allele_count)*control_count >= 5.0
AND (alt_count/allele_count)*case_count >= 5.0
AND (alt_count/allele_count)*control_count >= 5.0
HAVING
# Chi-squared critical value for df=1, p-value=5*10^-8 is 29.71679
chi_squared_score >= 29.71679
ORDER BY
chi_squared_score DESC,
allele_count DESC
We now run this query over all the variants within chromosome 12 and return the first few most significant variants.
In [13]:
bq.Query(results,
grouped_counts=grouped_counts,
snps=snps,
exp_groups=exp_groups,
allele_counts=allele_counts,
sample_info_table=sample_info_table,
variants_table=variants_table).sample().to_dataframe()
Out[13]:
Scroll to the right in the above results to see that the positions deemed significant do in fact have significantly different case/control counts for the alternate/reference bases.
Let's compare these BigQuery-computed Chi-squared scores to ones calculated via Python's statistical packages.
In [14]:
import numpy as np
from scipy.stats import chi2_contingency
chi2, p, dof, expected = chi2_contingency(np.array([
[220, 352], # case
[1593, 19] # control
]))
print('Python Chi-sq score = %.3f' % chi2)
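Note that chi2_contingency applies Yates' continuity correction by default for 2x2 tables (correction=True), matching the corrected formula used in the SQL above. The associated p-value is returned as well:

print('p-value = %.3g' % p)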
For comparison, the same test computed separately in R gives:
> chisq.test(rbind(c(220, 352), c(1593, 19)))
Pearson's Chi-squared test with Yates' continuity correction
data: rbind(c(220, 352), c(1593, 19))
X-squared = 1086.505, df = 1, p-value < 2.2e-16
For both the Python and R computations, the score matches the value of 1086.505 computed by BigQuery.
First, how many statistically significant variant positions did we find?
In [15]:
%%sql
SELECT COUNT(*) AS num_significant_snps
FROM $results
Out[15]:
We now have a dataset that is sufficiently small to fit into memory on our instance, so let's pull the top 1000 SNP positions locally. Since we only need a subset of the columns, we can project our data first to remove unneeded columns.
In [16]:
%%sql --module sig_snps_dataset
SELECT * FROM (
SELECT
reference_name,
start,
reference_bases,
alternate_bases,
chi_squared_score
FROM $results
LIMIT 1000
)
ORDER BY start ASC
In [17]:
sig_snps = bq.Query(sig_snps_dataset,
results=results,
grouped_counts=grouped_counts,
snps=snps,
exp_groups=exp_groups,
allele_counts=allele_counts,
sample_info_table=sample_info_table,
variants_table=variants_table).to_dataframe()
sig_snps[:10]
Out[17]:
Let's visualize the distribution of significant SNPs along the length of the chromosome. The y-value of the charts indicates the Chi-squared score: larger values are more significant.
In [18]:
import matplotlib.pyplot as plt
import seaborn as sns  # imported for its nicer default plot styling

# Scatter the significant SNPs along the chromosome, scored by significance.
fig, ax = plt.subplots()
ax.scatter(sig_snps['start'], sig_snps['chi_squared_score'], alpha=0.3, c='red')
ax.set_ylabel('Chi-squared score')
ax.set_xlabel('SNP position (bp)')
Out[18]:
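If p-values are preferred on the y-axis, the Chi-squared scores can be converted with scipy (a hedged sketch; the survival function at one degree of freedom recovers p from each score):

import numpy as np
from scipy.stats import chi2 as chi2_dist  # avoid clobbering the chi2 variable above
# -log10(p) per SNP; larger values are more significant.
neg_log10_p = -np.log10(chi2_dist.sf(sig_snps['chi_squared_score'], df=1))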
Let's zoom in on one region that contains a large number of very significant SNPs:
In [19]:
fig, ax = plt.subplots()
ax.scatter(sig_snps['start'], sig_snps['chi_squared_score'], alpha=0.5, c='red')
ax.set_xlim([3.3e7, 3.5e7])
ax.set_ylabel('Chi-squared score')
ax.set_xlabel('SNP position (bp)')
Out[19]:
We can take our analysis further by mapping selected variant positions back to the chromosome and visualizing call sets and reads. Let's retrieve the top SNP identified when ranked by the Chi-squared score:
In [20]:
%%sql --module top_snp
SELECT start
FROM $sig_snps_dataset
ORDER BY chi_squared_score desc
LIMIT 1
In [21]:
bq.Query(top_snp,
sig_snps_dataset=sig_snps_dataset,
results=results,
grouped_counts=grouped_counts,
snps=snps,
exp_groups=exp_groups,
allele_counts=allele_counts,
sample_info_table=sample_info_table,
variants_table=variants_table).results()
Out[21]:
Grab an arbitrary set of 10 call set IDs for rendering in the genome browser.
In [22]:
%%sql --module callset_ids
SELECT * FROM (
SELECT call.call_set_id AS callset_id
FROM $variants_table
GROUP BY callset_id)
LIMIT 10
In [23]:
callsets_df = bq.Query(callset_ids, variants_table=variants_table).to_dataframe()
callsets = list(callsets_df['callset_id'])
In [24]:
from IPython.display import HTML
def gabrowse(dataset, reference_name, start_position, callset_ids):
    # Build one &callsetId=... query parameter per call set to display.
    callsets_query_params = ''.join('&callsetId=%s&cBackend=GOOGLE' % callset_id
                                    for callset_id in callset_ids)
    # Note: the dataset argument is not currently encoded in the URL.
    url = ('https://gabrowse.appspot.com/#=&backend=GOOGLE&location='
           + reference_name + '%3A' + str(start_position)
           + callsets_query_params)
    return HTML('<iframe src="%s" width=1024 height=800></iframe>' % url)
Now we can render the call sets and reads for the selected SNP position by embedding the GABrowse application directly in our notebook.
In [25]:
gabrowse('1000genomes', '12', 110571373, callsets)
Out[25]:
This notebook illustrated how to conduct a GWAS experiment using variant data stored within the Google Genomics BigQuery tables, retrieve a local copy of the top results, and visualize the data with Python libraries.