This notebook will introduce the following concepts:
- Using Jupyter notebooks effectively
- Loading genetic data into Hail's MatrixTable
- Exploring data with queries, aggregations, and plots
- Quality control of sequencing data
- Running a genome-wide association study (GWAS)
- Using principal components to control for confounding by ancestry
- Rare-variant burden tests
From https://jupyter.org:
"The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more."
In the last year, the Jupyter development team released JupyterLab, an integrated environment for data, code, and visualizations. If you've used RStudio, this is the closest thing that works in Python (and many other languages!).
Part of what we think is so exciting about Hail is that it has coincided with a larger shift in the data science community.
Three years ago, most computational biologists at Broad analyzed genetic data using command-line tools, and took advantage of research compute clusters by explicitly using scheduling frameworks like LSF or Sun Grid Engine.
Now, they have the option to use Hail in interactive Python notebooks backed by thousands of cores on public compute clouds like Google Cloud, Amazon Web Services, or Microsoft Azure.
In [ ]:
print('Hello, world')
Jupyter has two modes, a navigation mode and an editor mode.
- `UP` / `DOWN` move between cells.
- `ENTER` while a cell is selected will move to editing mode.
- `UP` / `DOWN` move within a cell before moving between cells.
- `ESC` will return to navigation mode.
- `SHIFT + ENTER` will evaluate a cell and return to navigation mode.
In [ ]:
# This is a code cell
my_variable = 5
This is a markdown cell, so even if something looks like code (as below), it won't get executed!
my_variable += 1
In [ ]:
print(my_variable)
Keyboard shortcuts:
- `SHIFT + ENTER` to evaluate a cell
- `ESC` to return to navigation mode
- `y` to turn a markdown cell into code
- `m` to turn a code cell into markdown
- `a` to add a new cell above the currently selected cell
- `b` to add a new cell below the currently selected cell
- `d, d` (repeated) to delete the currently selected cell
- `TAB` to activate code completion

To try this out, create a new cell below this one using `b`, and print `my_variable` by starting with `print(my` and pressing `TAB`!
Jupyter makes it easy to get yourself into trouble by executing cells out-of-order, or multiple times.
For example, if we declare x:
x = 5
Then have a cell that reads:
x += 1
And finally:
print(x)
If you execute these cells in order, exactly once each, you'll see the notebook print 6. However, there is nothing stopping you from then executing the middle cell ten more times, after which rerunning the last cell prints 16!
If you get yourself into trouble in this way, the solution is to restart the kernel (the Python process) and run the cells again from the top.
First, `Kernel > Restart & Clear Output` (accept the dialog).
Second, `Cell > Run All Above`.
In addition to Hail, we import a few methods from the bokeh plotting library. We'll see examples soon!
In [ ]:
import hail as hl
from bokeh.io import output_notebook, show
Now we initialize Hail and set up Bokeh to display inline in the notebook.
In [ ]:
hl.init()
output_notebook()
The workshop materials are designed to work on a small (~20MB) downsampled chunk of the public 1000 Genomes dataset.
You can run these same functions on your computer or on the cloud!
In [ ]:
hl.utils.get_1kg('data/')
It is possible to call command-line utilities from Jupyter by prefixing a line with a `!`:
In [ ]:
! ls -1 data/
The Variant Call Format (VCF) is a common file format for representing genetic data collected on multiple individuals (samples).
Hail's import_vcf function can read this format.
However, VCF is a text format that is easy for humans to read but inefficient for computers to process. The first thing we do is write the data to Hail's native file format, which is much faster to work with!
In [ ]:
hl.import_vcf('data/1kg.vcf.bgz').write('data/1kg.mt', overwrite=True)
We represent genetic data as a Hail MatrixTable, and name our variable mt
to indicate this.
In [ ]:
mt = hl.read_matrix_table('data/1kg.mt')
What is a MatrixTable? Let's describe it!
The describe
method prints the schema, that is, the fields in the dataset and their types.
You can see:
- integers (`int32`, `int64`), e.g. `5`
- floating point numbers (`float32`, `float64`), e.g. `5.5` or `3e-8`
- strings (`str`), e.g. `"Foo"`
- booleans (`bool`), e.g. `True`
- arrays (`array`), e.g. `[1,1,2,3]`
- sets (`set`), e.g. `{1,3}`
- dictionaries (`dict`), e.g. `{'Foo': 5, 'Bar': 10}`
- genomic loci (`locus`), e.g. `[GRCh37] 1:10000` or `[GRCh38] chr1:10024`
- genotype calls (`call`), e.g. `0/2` or `1|0`
In [ ]:
mt.describe()
In [ ]:
mt.count()
In [ ]:
mt.s.show(5)
In [ ]:
mt.locus.show(5)
You can see the names of the fields above. Use `show()` to print the first few values for a few of them, making sure to include at least one row field and at least one entry field. Capitalization is important.

To print fields inside the `info` structure, you must add another dot, e.g. `mt.info.AN`.

What do you notice being printed alongside some of the fields?
In [ ]:
In [ ]:
hl.summarize_variants(mt)
In [ ]:
mt.aggregate_rows(hl.agg.count_where(mt.alleles == ['A', 'T']))
Or if we had flight data:
flight_data.aggregate(
hl.agg.count_where(flight_data.departure_city == 'Boston')
)
The counter
aggregator makes it possible to see distributions of categorical data, like alleles:
In [ ]:
snp_counts = mt.aggregate_rows(
hl.array(hl.agg.counter(mt.alleles)))
snp_counts
By sorting the result in Python, we can recover an interesting bit of biology...
In [ ]:
sorted(snp_counts,
key=lambda x: x[1])
The GQ field in our dataset is an interesting one to explore further, and we can use various pieces of Hail's functionality to do so.
GQ stands for Genotype Quality, and reflects confidence in a genotype call. It is a non-negative integer truncated at 99, and is the phred-scaled probability of the second-most-likely genotype call.
Phred-scaling a value is the following transformation:
$\quad Phred(x) = -10 * \log_{10}(x)$

For example, suppose a genotype call has the following probabilities:

$\quad p_{0/0} = 0.9899$

$\quad p_{0/1} = 0.01$

$\quad p_{1/1} = 0.001$

The most likely call is 0/0 and the second-most-likely call is 0/1, so in this case

$\quad GQ = -10 * \log_{10}(0.01) = 20$
Higher GQ values indicate higher confidence. $GQ=10$ is 90% confidence, $GQ=20$ is 99% confidence, $GQ=30$ is 99.9% confidence, and so on.
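As a quick check of the arithmetic above, here is a minimal sketch in plain Python (not Hail), using the hypothetical probabilities from the example:

import math

# Hypothetical probability of the second-most-likely genotype call (0/1).
p_second = 0.01
gq = -10 * math.log10(p_second)
print(gq)  # 20.0, i.e. roughly 99% confidence in the most likely call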
In [ ]:
mt.aggregate_entries(hl.agg.stats(mt.GQ))
Using our equation above, the mean value indicates about 99.9% confidence. But it's not generally a good idea to draw conclusions just based on a mean and standard deviation...
It is possible to build more complicated queries using small pieces. We can use hl.agg.filter
to compute conditional statistics:
In [ ]:
mt.aggregate_entries(hl.agg.filter(mt.GT.is_het(),
hl.agg.stats(mt.GQ)))
To look at GQ at genotypes that are not heterozygous, we need to add only one character (`~`):
In [ ]:
mt.aggregate_entries(hl.agg.filter(~mt.GT.is_het(),
hl.agg.stats(mt.GQ)))
There are often many ways to accomplish something in Hail. We could have done these both together (and more efficiently!) using hl.agg.group_by
:
In [ ]:
mt.aggregate_entries(hl.agg.group_by(mt.GT,
hl.agg.stats(mt.GQ)))
Of course, the best way to understand a distribution is to look at it!
In [ ]:
p = hl.plot.histogram(
mt.GQ,
bins=100)
show(p)
Hint: try copying some of the cells above and looking at DP, the sequencing depth, as well as GQ. The ratio between the two may also be interesting...
Hint: if you want to plot a filtered GQ distribution, you can use something like:
p = hl.plot.histogram(mt.filter_entries(mt.GT.is_het()).GQ, bins=100)
Remember that you can create a new cell using the keyboard shortcuts `a` or `b` in navigation mode.
In [ ]:
In [ ]:
! head data/1kg_annotations.txt
We can import it as a Hail Table with hl.import_table.
We call it "sa" for "sample annotations".
In [ ]:
sa = hl.import_table('data/1kg_annotations.txt',
impute=True,
key='Sample')
While we can see the names and types of fields in the logging messages, we can also describe
and show
this table:
In [ ]:
sa.describe()
In [ ]:
sa.show()
In [ ]:
mt = mt.annotate_cols(pheno = sa[mt.s])
Understanding what's going on here is a bit more difficult. We need to understand a few pieces:

annotate methods

In Hail, annotate methods refer to adding new fields:
- A MatrixTable's `annotate_cols` adds new column fields.
- A MatrixTable's `annotate_rows` adds new row fields.
- A MatrixTable's `annotate_entries` adds new entry fields.
- A Table's `annotate` adds new row fields.

In the above cell, we are adding a new column field called "pheno". This field should hold the values in our table `sa` associated with the sample ID `s` in our MatrixTable - that is, this is performing a join.
Python uses square brackets to look up values in dictionaries:
d = {'foo': 5, 'bar': 10}
d['foo']
You should think of this in much the same way - for each column of `mt`, we are looking up the fields in `sa` using the sample ID `s`.
In [ ]:
mt.describe()
In [ ]:
We'll start with examples of sample QC.
Hail has the function hl.sample_qc to compute a list of useful statistics about samples from sequencing data.
Click the link above to see the documentation, which lists the fields and their descriptions.
In [ ]:
mt = hl.sample_qc(mt)
In [ ]:
mt.sample_qc.describe()
In [ ]:
p = hl.plot.scatter(x=mt.sample_qc.r_het_hom_var,
y=mt.sample_qc.call_rate)
show(p)
In [ ]:
In [ ]:
mt = mt.filter_cols(mt.sample_qc.dp_stats.mean >= 4)
mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.97)
We explored GQ above, and analysts often set thresholds for GQ to filter entries (genotypes). Another useful metric is allele read balance.
This value is defined by:
$\quad AB = \dfrac{N_{alt}}{N_{ref} + N_{alt}}$
Where $N_{ref}$ is the number of reference reads and $N_{alt}$ is the number of alternate reads.
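As a tiny worked example (plain Python with made-up read counts, not values from this dataset):

# Hypothetical allelic depths: 6 reference reads, 4 alternate reads.
n_ref, n_alt = 6, 4
ab = n_alt / (n_ref + n_alt)
print(ab)  # 0.4 -- close to 0.5, consistent with a heterozygous call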
We want to keep genotypes whose allele balance is consistent with the genotype call: homozygous-reference calls should have AB near 0, heterozygous calls should have AB near 0.5, and homozygous-alternate calls should have AB near 1. Below, we keep hom-ref calls with AB at most 0.1, het calls with AB between 0.25 and 0.75, and hom-var calls with AB at least 0.9.
In [ ]:
# call rate before filtering
mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))
In [ ]:
ab = mt.AD[1] / hl.sum(mt.AD)
filter_condition_ab = (
hl.case()
.when(mt.GT.is_hom_ref(), ab <= 0.1)
.when(mt.GT.is_het(), (ab >= 0.25) & (ab <= 0.75))
.default(ab >= 0.9) # hom-var
)
mt = mt.filter_entries(filter_condition_ab)
In [ ]:
# call rate after filtering
mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))
Hail has the function hl.variant_qc to compute a list of useful statistics about variants from sequencing data.
Once again, click the link above to see the documentation!
In [ ]:
mt = hl.variant_qc(mt)
In [ ]:
mt.variant_qc.describe()
In [ ]:
mt.variant_qc.AF.show()
In [ ]:
mt = mt.filter_rows(hl.min(mt.variant_qc.AF) > 1e-6)
In [ ]:
mt = mt.filter_rows(mt.variant_qc.p_value_hwe > 0.005)
In [ ]:
# final variant and sample count
mt.count()
A GWAS is an independent association test performed per variant in a genetic dataset. We use the same phenotype and covariates, but test the genotypes for each variant separately.
In Hail, the method we use is hl.linear_regression_rows.
We use the phenotype `CaffeineConsumption` as our dependent variable, the number of alternate alleles as our independent variable, and no covariates besides an intercept term (that's the `1.0`).
In [ ]:
gwas = hl.linear_regression_rows(y=mt.pheno.CaffeineConsumption,
x=mt.GT.n_alt_alleles(),
covariates=[1.0])
gwas.describe()
Two of the plots that analysts generally produce are a Manhattan plot and a Q-Q plot.
We'll start with the Manhattan plot:
In [ ]:
p = hl.plot.manhattan(gwas.p_value)
show(p)
In [ ]:
p = hl.plot.qq(gwas.p_value)
show(p)
The Q-Q plot indicates extreme inflation of p-values.
If you've done a GWAS before, you've probably included a few other covariates -- age, sex, and principal components.
Principal components are a measure of genetic ancestry, and can be used to control for population stratification.
We can compute principal components with Hail:
In [ ]:
pca_eigenvalues, pca_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT, compute_loadings=True)
The eigenvalues reflect the amount of variance explained by each principal component:
In [ ]:
pca_eigenvalues
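Since the eigenvalues come back as a plain Python list, a minimal sketch (our own addition, not part of the tutorial) of turning them into per-component fractions of the captured variance is:

# Each component's share of the variance captured by the returned components.
# Note: this normalizes by the sum of the returned eigenvalues only, not the
# total variance of the genotype matrix.
total = sum(pca_eigenvalues)
[ev / total for ev in pca_eigenvalues]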
The scores are the principal components themselves, computed per sample.
In [ ]:
pca_scores.describe()
In [ ]:
pca_scores.show()
The loadings are the contributions to each component for each variant.
In [ ]:
pca_loadings.describe()
We can annotate the principal components back onto `mt`:
In [ ]:
mt = mt.annotate_cols(pca = pca_scores[mt.s])
In [ ]:
p = hl.plot.scatter(mt.pca.scores[0],
mt.pca.scores[1],
label=mt.pheno.SuperPopulation)
show(p)
In [ ]:
gwas = hl.linear_regression_rows(
y=mt.pheno.CaffeineConsumption,
x=mt.GT.n_alt_alleles(),
covariates=[1.0, mt.pheno.isFemale, mt.pca.scores[0], mt.pca.scores[1], mt.pca.scores[2]])
In [ ]:
p = hl.plot.qq(gwas.p_value)
show(p)
In [ ]:
p = hl.plot.manhattan(gwas.p_value)
show(p)
GWAS is a great tool for finding associations between common variants and disease, but a GWAS can't hope to find associations between rare variants and disease. Even if we have sequencing data for 1,000,000 people, we won't have the statistical power to link a mutation found in only a few people to any disease.
But rare variation has lots of information - especially because statistical genetic theory dictates that rarer variants have, on average, stronger effects on disease per allele.
One possible strategy is to group together rare variants with similar predicted consequence. For example, we can group all variants that are predicted to knock out the function of each gene and test the variants for each gene as a group.
We will be running a burden test on our common variant dataset to demonstrate the technical side, but we shouldn't hope to find anything here -- especially because we've only got 10,000 variants!
In [ ]:
! wget https://storage.googleapis.com/hail-tutorial/ensembl_gene_annotations.txt -O data/ensembl_gene_annotations.txt
In [ ]:
gene_ht = hl.import_table('data/ensembl_gene_annotations.txt', impute=True)
In [ ]:
gene_ht.show()
In [ ]:
gene_ht.count()
In [ ]:
gene_ht = gene_ht.transmute(interval = hl.locus_interval(gene_ht['Chromosome'],
gene_ht['Gene start'],
gene_ht['Gene end']))
gene_ht = gene_ht.key_by('interval')
In [ ]:
mt = mt.annotate_rows(gene_info = gene_ht[mt.locus])
In [ ]:
mt.gene_info.show()
There is no `hl.burden_test` function -- instead, a burden test is the composition of two modular pieces of Hail functionality:
- `group_rows_by` / `aggregate`
- `hl.linear_regression_rows`

While this might be a few more lines of code to write than `hl.burden_test`, it means that you can flexibly specify the genotype aggregation however you like. Using other tools, you may have a few ways to aggregate, but if you want to do something different you are out of luck!
In [ ]:
burden_mt = (
mt
.group_rows_by(gene = mt.gene_info['Gene name'])
.aggregate(n_variants = hl.agg.count_where(mt.GT.n_alt_alleles() > 0))
)
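To illustrate the flexibility mentioned above, here is a hypothetical alternative aggregation (our own sketch, not part of the tutorial): summing alternate allele counts per gene rather than counting variant-carrying genotypes.

# Hypothetical alternative: total alternate allele count per sample per gene.
alt_burden_mt = (
    mt
    .group_rows_by(gene = mt.gene_info['Gene name'])
    .aggregate(n_alt = hl.agg.sum(mt.GT.n_alt_alleles()))
)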
In [ ]:
burden_mt.describe()
In [ ]:
burden_results = hl.linear_regression_rows(
y=burden_mt.pheno.CaffeineConsumption,
x=burden_mt.n_variants,
covariates=[1.0,
burden_mt.pheno.isFemale,
burden_mt.pca.scores[0],
burden_mt.pca.scores[1],
burden_mt.pca.scores[2]])
In [ ]:
burden_results.order_by(burden_results.p_value).show()
In [ ]:
In [ ]: