This notebook demonstrates identifying tag variants using linkage disequilibrium (LD) data stored as publicly accessible BigQuery datasets. Specifically, we will work with LD calculated on the 1000 Genomes Phase 3 variants. The source variants were imported into Google Genomics, LD calculations were performed, and the resulting dataset was exported to BigQuery using pipelines from the https://github.com/googlegenomics/linkage-disequilibrium project.
For more linkage disequilibrium examples, see https://github.com/googlegenomics/linkage-disequilibrium/tree/master/datalab. For general genomics examples, see https://github.com/googlegenomics/datalab-examples. You can import them into your Datalab instance by uploading them while on the notebook list page.
In [35]:
import gcp.bigquery as bq
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
The initial LD exploration Datalab notebook indicated that LD is best analyzed at a population-specific level. Consequently, we load data for the subpopulations directly.
In [36]:
# Get references to the BigQuery tables of linkage disequilibrium
# in the five superpopulations of the 1000 Genomes Project
# (http://www.1000genomes.org/faq/which-populations-are-part-your-study):
# AMR: Admixed American
# AFR: African
# EUR: European
# SAS: South Asian
# EAS: East Asian
populations = {
  "AFR": ["ACB", "ASW", "ESN", "GWD", "LWK", "MSL", "YRI"],
  "AMR": ["CLM", "MXL", "PEL", "PUR"],
  "EAS": ["CDX", "CHB", "CHS", "JPT", "KHV"],
  "EUR": ["CEU", "FIN", "GBR", "IBS", "TSI"],
}
populations["SAS"] = ["BEB", "GIH", "ITU", "PJL", "STU"]
sub_to_super = {sub: superpop
                for superpop, subs in populations.items()
                for sub in subs}

def get_ld_tablename(population):
  """Returns the name of the BigQuery table with the publicly-available LD data."""
  is_subpopulation = population not in populations
  pop = "sub_pop_%s" % population if is_subpopulation else "super_pop_%s" % population
  return "genomics-public-data:linkage_disequilibrium_1000G_phase_3.%s" % pop

tables = {}
for superpopulation in populations:
  for subpopulation in populations[superpopulation]:
    # Only load the subpopulations for this analysis.
    tables[subpopulation] = bq.Table(get_ld_tablename(subpopulation))
A major use case for LD data is to identify variants that capture the same underlying information. For example, an individual who has been genotyped using a DNA microarray will only have information at a subset of all variant sites in the genome. However, there are many variant sites that have some known phenotypic consequences. For instance, the variant rs4988235 is known to be associated with lactase persistence. An individual who has been genotyped on a microarray that does not directly assay rs4988235 may still be able to infer their likely genotype at that SNP by using LD.
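The r² statistic that underlies this kind of proxy selection can be computed directly from phased haplotype data: it is the squared Pearson correlation between the 0/1 allele indicators at two sites. A minimal sketch (the haplotype vectors below are invented for illustration):

```python
def r_squared(hap_a, hap_b):
    """Squared Pearson correlation between two 0/1 haplotype vectors."""
    n = len(hap_a)
    mean_a = sum(hap_a) / float(n)
    mean_b = sum(hap_b) / float(n)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(hap_a, hap_b)) / n
    var_a = sum((a - mean_a) ** 2 for a in hap_a) / n
    var_b = sum((b - mean_b) ** 2 for b in hap_b) / n
    return (cov * cov) / (var_a * var_b)

# Two sites whose alleles always co-occur are in perfect LD (r^2 = 1),
# so either one can serve as a proxy for the other.
site1 = [0, 0, 1, 1, 0, 1]
site2 = [0, 0, 1, 1, 0, 1]
print(r_squared(site1, site2))  # 1.0
```

The `corr` column in the LD tables holds this correlation, so the queries below square it to obtain r².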
The following BigQuery query will identify variants that may be usable as proxies for a specified input variant of interest.
In [37]:
%%sql --module proxy_variant_query
SELECT
  qname,
  qzeroallele,
  qoneallele,
  tchrom,
  tstart,
  tname,
  tzeroallele,
  toneallele,
  ABS(tstart - qstart) AS distance,
  corr * corr AS rsquared
FROM
  $all_ld_table
WHERE
  qname = $query_rsid AND
  corr * corr >= $min_rsq AND
  ABS(tstart - qstart) <= $max_distance
ORDER BY rsquared DESC, distance
LIMIT 30
In [38]:
lactase_rsid = "rs4988235"
table = tables["CEU"] # For a European individual
lactase_tag_variants = bq.Query(proxy_variant_query,
                                all_ld_table=table,
                                query_rsid=lactase_rsid,
                                min_rsq=0.8,
                                max_distance=500000).to_dataframe()
lactase_tag_variants
Out[38]:
As the result shows, eight target variants are in strong LD with the lactase persistence-associated rs4988235, including rs182549, which is in perfect LD (r² = 1) in the CEU population.
We can generalize the above query to investigate the average number of tag variants for each query variant as a function of distance between variants, in a manner similar to that performed in Figure 4b of the 1000 Genomes Phase 3 paper.
This analysis is performed in multiple stages. The first stage uses the query below to identify the total number of variants in the data set that fall within the specific minor allele frequency (MAF) range of interest. The query uses a subquery with GROUP BY to achieve the same effect as EXACT_COUNT_DISTINCT while requiring fewer resources to complete.
In [39]:
%%sql --module maf_variant_counts
SELECT COUNT(*) AS num_variants
FROM
  (SELECT
     COUNT(qid)
   FROM
     $all_ld_table
   WHERE
     LEAST(num_qone_chroms / num_chroms, (num_chroms - num_qone_chroms) / num_chroms) >= $min_maf AND
     LEAST(num_qone_chroms / num_chroms, (num_chroms - num_qone_chroms) / num_chroms) < $max_maf
   GROUP BY qid)
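The GROUP BY-subquery pattern above can be demonstrated with any SQL engine. Here is a sketch using Python's built-in sqlite3 module (the table name and values are invented): each query variant appears once per LD pair, so grouping collapses the repeats, and counting the grouped rows yields the distinct-variant count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ld (qid TEXT)")
# Each query variant appears once per LD pair, so qids repeat.
conn.executemany("INSERT INTO ld VALUES (?)",
                 [("v1",), ("v1",), ("v2",), ("v3",), ("v3",)])

# Counting the rows of a GROUP BY subquery yields the distinct-qid count.
(n,) = conn.execute(
    "SELECT COUNT(*) FROM (SELECT qid FROM ld GROUP BY qid)").fetchone()
print(n)  # 3
```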
The next query counts the total number of variant pairs in each distance bin that satisfy the correlation threshold.
In [40]:
%%sql --module tag_variation_query
SELECT
  INTEGER(CEIL((tstart - qstart) / $bin_size)) AS distance,
  INTEGER(COUNT(*)) AS num_tag_variants
FROM $all_ld_table
WHERE
  qstart < tstart AND
  (corr * corr) >= $min_rsq AND
  LEAST(num_qone_chroms / num_chroms, (num_chroms - num_qone_chroms) / num_chroms) >= $min_maf AND
  LEAST(num_qone_chroms / num_chroms, (num_chroms - num_qone_chroms) / num_chroms) < $max_maf AND
  tstart - qstart <= $max_dist
GROUP BY distance
ORDER BY distance
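The two derived quantities in the WHERE and SELECT clauses can be mirrored in plain Python: LEAST(p, 1 - p) is the minor allele frequency, and CEIL((tstart - qstart) / bin_size) assigns each pair to a distance bin (the numeric values below are invented):

```python
import math

def minor_allele_frequency(num_one_chroms, num_chroms):
    """LEAST(p, 1 - p): the frequency of the rarer allele."""
    p = num_one_chroms / float(num_chroms)
    return min(p, 1.0 - p)

def distance_bin(qstart, tstart, bin_size=5000):
    """Bin index as in the query; multiply by bin_size to recover a distance."""
    return int(math.ceil((tstart - qstart) / float(bin_size)))

print(minor_allele_frequency(250, 1000))  # 0.25
print(distance_bin(100, 12000))           # 3, i.e. the bin ending at 15000 bp
```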
In [41]:
binsize = 5000
maxdist = 1000000
queries = [("Rare", 0, 0.005), ("Low frequency", 0.005, 0.05), ("Common", 0.05, 1)]
In [42]:
num_maf_variants = {}
for mafbin, minmaf, maxmaf in queries:
  # Key the counts by MAF bin as well as population so that one bin's
  # counts do not overwrite another's.
  num_maf_variants[mafbin] = {}
  for population, table in tables.iteritems():
    # Find the total number of variants in this MAF range.
    varcount = bq.Query(maf_variant_counts,
                        all_ld_table=table,
                        min_maf=minmaf,
                        max_maf=maxmaf).to_dataframe()
    num_maf_variants[mafbin][population] = varcount["num_variants"][0]
The average cumulative number of tag variants at each distance is calculated by taking a running sum over the individual distance bins:
In [43]:
# Create a dictionary keyed by MAF bin and population with value a DataFrame
# of tag variant counts binned by distance from the query variant.
tag_variation = {}
for mafbin, minmaf, maxmaf in queries:
  tag_variation[mafbin] = {}
  for population, table in tables.iteritems():
    # Find the total number of pairs in strong LD at each bin size.
    result = bq.Query(tag_variation_query,
                      all_ld_table=table,
                      bin_size=binsize,
                      min_maf=minmaf,
                      max_maf=maxmaf,
                      max_dist=maxdist,
                      min_rsq=0.8).to_dataframe()
    if len(result.columns):
      # Scale the bins back up to distances.
      result["distance"] *= binsize
      # Transform the counts to averages of cumulative counts.
      num_pop_maf_vars = num_maf_variants[mafbin][population]
      result["avg_tag_variants"] = result["num_tag_variants"].cumsum() / float(num_pop_maf_vars)
      tag_variation[mafbin][population] = result
    else:
      # There are no results. Create an empty DataFrame with the same columns.
      tag_variation[mafbin][population] = pd.DataFrame({"distance": [],
                                                        "num_tag_variants": [],
                                                        "avg_tag_variants": []})
The following function is used to display the tag variation results.
In [44]:
def plot_tag_variation(tag_variation_dict, max_distance=200000):
  """Plots a figure of tag variation in all populations."""
  fig = plt.figure()
  ax = fig.add_subplot(111)
  handles = []
  colormaps = {"AFR": matplotlib.cm.get_cmap("YlOrBr"),
               "AMR": matplotlib.cm.get_cmap("Reds"),
               "EAS": matplotlib.cm.get_cmap("Greens"),
               "EUR": matplotlib.cm.get_cmap("Blues"),
               "SAS": matplotlib.cm.get_cmap("Purples")}
  # Plot data so that the legend groups by superpopulation,
  # ordered from the highest to lowest LD.
  pop_order = [sub_to_super[k] for k, v in sorted(tag_variation_dict.items(),
                                                  key=lambda x: x[1]["avg_tag_variants"].max(),
                                                  reverse=True)]
  population_plot_order = []
  for superpop in pop_order:
    if superpop not in population_plot_order:
      population_plot_order.append(superpop)
  labels = []
  for superpopulation in population_plot_order:
    subpopulations = populations[superpopulation]
    for i, subpopulation in enumerate(subpopulations):
      df = tag_variation_dict[subpopulation]
      df = df.loc[df["distance"] <= max_distance]
      if not len(df["distance"]):
        continue
      color = colormaps[superpopulation](1 - (i / 8.))
      lp, = ax.plot(df["distance"], df["avg_tag_variants"],
                    label=subpopulation,
                    color=color)
      handles.append(lp)
      labels.append(subpopulation)
  plt.legend(handles, labels,
             loc='center left',
             bbox_to_anchor=(1.1, 0.5))
  ax.set_axis_bgcolor("white")
  ax.grid(False)
  ax.set_xlabel("Distance (bp)")
  ax.set_ylabel("Average number of tag variants")
Now we can plot the average number of tag variants per query variant as a function of distance from the query variant:
In [45]:
plot_tag_variation(tag_variation["Common"])
Other LD metrics showed substantial differences when analyzed at the superpopulation level compared to the subpopulation level. In contrast, these subpopulation plots are very similar to the superpopulation-based plots in Figure 4b of the 1000 Genomes paper, with the African populations showing much less average tag variation than the other populations.
This similarity may be due in part to a lesser impact of population stratification on patterns of extremely strong LD.
In [46]:
plot_tag_variation(tag_variation["Low frequency"])
In [47]:
plot_tag_variation(tag_variation["Rare"])