In [0]:
#@title Default title text
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
This notebook demonstrates how one can dive deeper into QC results to explain some unexpected patterns. In this notebook, we will see that a few samples in the Platinum Genomes have a very low number of private variants, and we will figure out why.
Eberle, MA et al. (2017) A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Research 27: 157-164. doi:10.1101/gr.210500.116
Check out the code for the various QC methods to the current working directory. Further down in the notebook we will read the SQL templates from this clone.
In [0]:
!git clone https://github.com/verilylifesciences/variant-qc.git
In [0]:
!pip install --upgrade plotnine jinja2
In [0]:
import jinja2
import numpy as np
import os
import pandas as pd
import plotnine
from plotnine import *
plotnine.options.figure_size = (10, 6)
In [0]:
# Change this to be your project id.
PROJECT_ID = 'your-project-id' #@param
In [0]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
In [0]:
def run_query(sql_template, replacements={}):
if os.path.isfile(sql_template):
sql_template = open(sql_template, "r").read()
sql = jinja2.Template(sql_template).render(replacements)
print('SQL to be executed:\n', sql)
df = pd.io.gbq.read_gbq(sql, project_id=PROJECT_ID, dialect='standard')
print('\nResult shape:\t', df.shape)
return df
In [0]:
# Read the SQL template from the cloned repository in the home directory, perform
# the variable replacements and execute the query.
df = run_query(
sql_template='variant-qc/sql/private_variants.sql',
replacements={
'GENOME_CALL_OR_MULTISAMPLE_VARIANT_TABLE': 'bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823',
'HIGH_QUALITY_CALLS_FILTER': 'NOT EXISTS (SELECT ft FROM UNNEST(c.FILTER) ft WHERE ft NOT IN ("PASS", "."))'
}
)
We can read these values from the CSV created via Sample-Level-QC.Rmd.
In [0]:
df = pd.read_csv("https://storage.googleapis.com/genomics-public-data/platinum-genomes/reports/DeepVariant_Platinum_Genomes_sample_results.csv")[["name", "private_variant_count"]]
df.shape
Out[0]:
In [0]:
df
Out[0]:
Let's take a look at the samples who are more than one standard deviation away from the mean.
In [0]:
df.loc[abs(df.private_variant_count - df.private_variant_count.mean()) > df.private_variant_count.std(), :]
Out[0]:
Next let's see if the sample metadata can be used to help explain the explain the low number of private variants that we see.
In [0]:
metadata_df = run_query(
sql_template="""
SELECT
Sample AS name,
Gender AS sex,
Super_Population AS ancestry,
Relationship AS relationship
FROM
`bigquery-public-data.human_genome_variants.1000_genomes_sample_info`
WHERE
Sample IN ('NA12877', 'NA12878', 'NA12889', 'NA12890', 'NA12891', 'NA12892')
"""
)
In [0]:
joined_results = pd.merge(df, metadata_df, how='left', on='name')
joined_results.shape
Out[0]:
In [0]:
assert(joined_results.shape == (6, 5))
In [0]:
p = (ggplot(joined_results) +
geom_boxplot(aes(x = 'ancestry', y = 'private_variant_count', fill = 'ancestry')) +
theme_minimal()
)
p
Out[0]:
All individuals in this dataset are of the same ancestry, so that does not explain the pattern we see.
We know from the paper that all members of this cohort are from the same family.
In [0]:
run_query("""
SELECT
*
FROM
`bigquery-public-data.human_genome_variants.1000_genomes_pedigree`
WHERE
Individual_ID IN ('NA12877', 'NA12878', 'NA12889', 'NA12890', 'NA12891', 'NA12892')
""")
Out[0]:
In [0]:
p = (ggplot(joined_results) +
geom_text(aes(x = 'name', y = 'private_variant_count', label = 'relationship')) +
theme_minimal()
)
p
Out[0]:
And we can see that the relationship between individuals explains the pattern we see in the private variant counts.