This notebook will introduce the following concepts:
- Using Jupyter notebooks effectively
- Loading genetic data into Hail's MatrixTable
- Exploring data with queries, aggregations, and plots
- Quality control of sequencing data
- Running a genome-wide association study (GWAS)
- Using principal components to control for confounding by ancestry
- Rare-variant burden tests
From https://jupyter.org:
"The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more."
In the last year, the Jupyter development team released JupyterLab, an integrated environment for data, code, and visualizations. If you've used RStudio, this is the closest thing that works in Python (and many other languages!).
Part of what we think is so exciting about Hail is that it has coincided with a larger shift in the data science community.
Three years ago, most computational biologists at Broad analyzed genetic data using command-line tools, and took advantage of research compute clusters by explicitly using scheduling frameworks like LSF or Sun Grid Engine.
Now, they have the option to use Hail in interactive Python notebooks backed by thousands of cores on public compute clouds like Google Cloud, Amazon Web Services, or Microsoft Azure.
In [ ]:
print('Hello, world')
Jupyter has two modes, a navigation mode and an editor mode.
- `UP` / `DOWN` move between cells.
- `ENTER` while a cell is selected will move to editing mode.
- `UP` / `DOWN` move within a cell before moving between cells.
- `ESC` will return to navigation mode.
- `SHIFT + ENTER` will evaluate a cell and return to navigation mode.
In [ ]:
# This is a code cell
my_variable = 5
This is a markdown cell, so even if something looks like code (as below), it won't get executed!
my_variable += 1
In [ ]:
print(my_variable)
Keyboard shortcuts:
- `SHIFT + ENTER` to evaluate a cell
- `ESC` to return to navigation mode
- `y` to turn a markdown cell into code
- `m` to turn a code cell into markdown
- `a` to add a new cell above the currently selected cell
- `b` to add a new cell below the currently selected cell
- `d, d` (repeated) to delete the currently selected cell
- `TAB` to activate code completion

To try this out, create a new cell below this one using `b`, and print `my_variable` by starting with `print(my` and pressing `TAB`!
Jupyter makes it easy to get yourself into trouble by executing cells out-of-order, or multiple times.
For example, if we declare x:
x = 5
Then have a cell that reads:
x += 1
And finally:
print(x)
If you execute these cells in order, exactly once each, you'll see the notebook print 6. However, there is nothing stopping you from then executing the middle cell ten more times, after which rerunning the last cell prints 16!
If you get yourself into trouble in this way, the solution is to restart the kernel (the Python process) and run the cells again from the top.
First, `Kernel > Restart & Clear Output` (accept the dialog).
Second, `Cell > Run All Above`.
In addition to Hail, we import a few methods from the bokeh plotting library. We'll see examples soon!
In [ ]:
import hail as hl
from bokeh.io import output_notebook, show
Now we initialize Hail and set up Bokeh to display inline in the notebook.
In [ ]:
hl.init()
output_notebook()
The workshop materials are designed to work on a small (~20MB) downsampled chunk of the public 1000 Genomes dataset.
You can run these same functions on your computer or on the cloud!
In [ ]:
hl.utils.get_1kg('data/')
It is possible to call command-line utilities from Jupyter by prefixing a line with a `!`:
In [ ]:
! ls -1 data/
The Variant Call Format (VCF) is a common file format for representing genetic data collected on multiple individuals (samples).
Hail's import_vcf function can read this format.
However, VCF is a text format that is easy for humans to read but inefficient for computers to process. The first thing we do is write the data to Hail's native file format, which is much faster to work with!
In [ ]:
hl.import_vcf('data/1kg.vcf.bgz').write('data/1kg.mt', overwrite=True)
We represent genetic data as a Hail MatrixTable, and name our variable mt
to indicate this.
In [ ]:
mt = hl.read_matrix_table('data/1kg.mt')
What is a MatrixTable? Let's describe it!
The describe
method prints the schema, that is, the fields in the dataset and their types.
You can see:
- integers (`int32`, `int64`), e.g. `5`
- floating point numbers (`float32`, `float64`), e.g. `5.5` or `3e-8`
- strings (`str`), e.g. `"Foo"`
- booleans (`bool`), e.g. `True`
- arrays (`array`), e.g. `[1,1,2,3]`
- sets (`set`), e.g. `{1,3}`
- dictionaries (`dict`), e.g. `{'Foo': 5, 'Bar': 10}`
- genomic loci (`locus`), e.g. `[GRCh37] 1:10000` or `[GRCh38] chr1:10024`
- genotype calls (`call`), e.g. `0/2` or `1|0`
In [ ]:
mt.describe()
In [ ]:
mt.count()
In [ ]:
mt.s.show(5)
In [ ]:
mt.locus.show(5)
You can see the names of the fields above. Use `show()` to print the first few values for a few of them, making sure to include at least one row field and at least one entry field. Capitalization is important.

To print fields inside the `info` structure, you must add another dot, e.g. `mt.info.AN`.

What do you notice being printed alongside some of the fields?
In [ ]:
In [ ]:
hl.summarize_variants(mt)
In [ ]:
mt.aggregate_rows(hl.agg.count_where(mt.alleles == ['A', 'T']))
Or if we had flight data:
flight_data.aggregate(
hl.agg.count_where(flight_data.departure_city == 'Boston')
)
The counter
aggregator makes it possible to see distributions of categorical data, like alleles:
In [ ]:
snp_counts = mt.aggregate_rows(
hl.array(hl.agg.counter(mt.alleles)))
snp_counts
By sorting the result in Python, we can recover an interesting bit of biology...
In [ ]:
sorted(snp_counts,
key=lambda x: x[1])
The GQ field in our dataset is an interesting one to explore further, and we can use various pieces of Hail's functionality to do so.
GQ stands for Genotype Quality, and reflects confidence in a genotype call. It is a non-negative integer truncated at 99, and is the phred-scaled probability of the second-most-likely genotype call.
Phred-scaling a value is the following transformation:
$\quad Phred(x) = -10 * \log_{10}(x)$

For example, suppose a genotype call has the following probabilities:

$\quad p_{0/0} = 0.9899$

$\quad p_{0/1} = 0.01$

$\quad p_{1/1} = 0.001$

The most likely call is 0/0 and the second-most-likely call is 0/1, so in this case

$\quad GQ = -10 * \log_{10}(0.01) = 20$
Higher GQ values indicate higher confidence. $GQ=10$ is 90% confidence, $GQ=20$ is 99% confidence, $GQ=30$ is 99.9% confidence, and so on.
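As a quick check of the arithmetic above, here is a minimal sketch in plain Python (not Hail), using the hypothetical probabilities from the example:

import math

# Hypothetical probability of the second-most-likely genotype call (0/1).
p_second = 0.01
gq = -10 * math.log10(p_second)
print(gq)  # 20.0, i.e. roughly 99% confidence in the most likely call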
In [ ]:
mt.aggregate_entries(hl.agg.stats(mt.GQ))
Using our equation above, the mean value indicates about 99.9% confidence. But it's not generally a good idea to draw conclusions just based on a mean and standard deviation...
It is possible to build more complicated queries using small pieces. We can use hl.agg.filter
to compute conditional statistics:
In [ ]:
mt.aggregate_entries(hl.agg.filter(mt.GT.is_het(),
hl.agg.stats(mt.GQ)))
To look at GQ at genotypes that are not heterozygous, we need to add only one character (`~`):
In [ ]:
mt.aggregate_entries(hl.agg.filter(~mt.GT.is_het(),
hl.agg.stats(mt.GQ)))
There are often many ways to accomplish something in Hail. We could have done these both together (and more efficiently!) using hl.agg.group_by
:
In [ ]:
mt.aggregate_entries(hl.agg.group_by(mt.GT,
hl.agg.stats(mt.GQ)))
Of course, the best way to understand a distribution is to look at it!
In [ ]:
p = hl.plot.histogram(
mt.GQ,
bins=100)
show(p)
Hint: try copying some of the cells above and looking at DP, the sequencing depth, as well as GQ. The ratio between the two may also be interesting...
Hint: if you want to plot a filtered GQ distribution, you can use something like:
p = hl.plot.histogram(mt.filter_entries(mt.GT.is_het()).GQ, bins=100)
Remember that you can create a new cell using the keyboard shortcuts `a` or `b` in navigation mode.
In [ ]:
In [ ]:
! head data/1kg_annotations.txt
We can import it as a Hail Table with hl.import_table.
We call it "sa" for "sample annotations".
In [ ]:
sa = hl.import_table('data/1kg_annotations.txt',
impute=True,
key='Sample')
While we can see the names and types of fields in the logging messages, we can also describe
and show
this table:
In [ ]:
sa.describe()
In [ ]:
sa.show()
In [ ]:
mt = mt.annotate_cols(pheno = sa[mt.s])
Understanding what's going on here is a bit more difficult. We need to understand a few pieces:

annotate methods

In Hail, annotate methods refer to adding new fields:
- A MatrixTable's `annotate_cols` adds new column fields.
- A MatrixTable's `annotate_rows` adds new row fields.
- A MatrixTable's `annotate_entries` adds new entry fields.
- A Table's `annotate` adds new row fields.

In the above cell, we are adding a new column field called "pheno". This field should hold the values in our table `sa` associated with the sample ID `s` in our MatrixTable - that is, this is performing a join.
Python uses square brackets to look up values in dictionaries:
d = {'foo': 5, 'bar': 10}
d['foo']
You should think of this in much the same way - for each column of `mt`, we are looking up the fields in `sa` using the sample ID `s`.
In [ ]:
mt.describe()
In [ ]:
We'll start with examples of sample QC.
Hail has the function hl.sample_qc to compute a list of useful statistics about samples from sequencing data.
Click the link above to see the documentation, which lists the fields and their descriptions.
In [ ]:
mt = hl.sample_qc(mt)
In [ ]:
mt.sample_qc.describe()
In [ ]:
p = hl.plot.scatter(x=mt.sample_qc.r_het_hom_var,
y=mt.sample_qc.call_rate)
show(p)
In [ ]:
In [ ]:
mt = mt.filter_cols(mt.sample_qc.dp_stats.mean >= 4)
mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.97)
We explored GQ above, and analysts often set thresholds for GQ to filter entries (genotypes). Another useful metric is allele read balance.
This value is defined by:
$\quad AB = \dfrac{N_{alt}}{N_{ref} + N_{alt}}$
Where $N_{ref}$ is the number of reference reads and $N_{alt}$ is the number of alternate reads.
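As a tiny worked example (plain Python with made-up read counts, not values from this dataset):

# Hypothetical allelic depths: 6 reference reads, 4 alternate reads.
n_ref, n_alt = 6, 4
ab = n_alt / (n_ref + n_alt)
print(ab)  # 0.4 -- close to 0.5, consistent with a heterozygous call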
We want to keep genotypes whose allele balance is consistent with the genotype call: homozygous-reference calls should have AB near 0, heterozygous calls should have AB near 0.5, and homozygous-alternate calls should have AB near 1. Below, we keep hom-ref calls with AB at most 0.1, het calls with AB between 0.25 and 0.75, and hom-var calls with AB at least 0.9.
In [ ]:
# call rate before filtering
mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))
In [ ]:
ab = mt.AD[1] / hl.sum(mt.AD)
filter_condition_ab = (
hl.case()
.when(mt.GT.is_hom_ref(), ab <= 0.1)
.when(mt.GT.is_het(), (ab >= 0.25) & (ab <= 0.75))
.default(ab >= 0.9) # hom-var
)
mt = mt.filter_entries(filter_condition_ab)
In [ ]:
# call rate after filtering
mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))
Hail has the function hl.variant_qc to compute a list of useful statistics about variants from sequencing data.
Once again, click the link above to see the documentation!
In [ ]:
mt = hl.variant_qc(mt)
In [ ]:
mt.variant_qc.describe()
In [ ]:
mt.variant_qc.AF.show()
In [ ]:
mt = mt.filter_rows(hl.min(mt.variant_qc.AF) > 1e-6)
In [ ]:
mt = mt.filter_rows(mt.variant_qc.p_value_hwe > 0.005)
In [ ]:
# final variant and sample count
mt.count()
A GWAS is an independent association test performed per variant in a genetic dataset. We use the same phenotype and covariates, but test the genotypes for each variant separately.
In Hail, the method we use is hl.linear_regression_rows.
We use the phenotype `CaffeineConsumption` as our dependent variable, the number of alternate alleles as our independent variable, and no covariates besides an intercept term (that's the `1.0`).
In [ ]:
gwas = hl.linear_regression_rows(y=mt.pheno.CaffeineConsumption,
x=mt.GT.n_alt_alleles(),
covariates=[1.0])
gwas.describe()
Two of the plots that analysts generally produce are a Manhattan plot and a Q-Q plot.
We'll start with the Manhattan plot:
In [ ]:
p = hl.plot.manhattan(gwas.p_value)
show(p)
In [ ]:
p = hl.plot.qq(gwas.p_value)
show(p)
The Q-Q plot indicates extreme inflation of p-values.
If you've done a GWAS before, you've probably included a few other covariates -- age, sex, and principal components.
Principal components are a measure of genetic ancestry, and can be used to control for population stratification.
We can compute principal components with Hail:
In [ ]:
pca_eigenvalues, pca_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT, compute_loadings=True)
The eigenvalues reflect the amount of variance explained by each principal component:
In [ ]:
pca_eigenvalues
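Since the eigenvalues come back as a plain Python list, a minimal sketch (our own addition, not part of the tutorial) of turning them into per-component fractions of the captured variance is:

# Each component's share of the variance captured by the returned components.
# Note: this normalizes by the sum of the returned eigenvalues only, not the
# total variance of the genotype matrix.
total = sum(pca_eigenvalues)
[ev / total for ev in pca_eigenvalues]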
The scores are the principal components themselves, computed per sample.
In [ ]:
pca_scores.describe()
In [ ]:
pca_scores.show()
The loadings are the contributions to each component for each variant.
In [ ]:
pca_loadings.describe()
We can annotate the principal components back onto `mt`:
In [ ]:
mt = mt.annotate_cols(pca = pca_scores[mt.s])
In [ ]:
p = hl.plot.scatter(mt.pca.scores[0],
mt.pca.scores[1],
label=mt.pheno.SuperPopulation)
show(p)
In [ ]:
gwas = hl.linear_regression_rows(
y=mt.pheno.CaffeineConsumption,
x=mt.GT.n_alt_alleles(),
covariates=[1.0, mt.pheno.isFemale, mt.pca.scores[0], mt.pca.scores[1], mt.pca.scores[2]])
In [ ]:
p = hl.plot.qq(gwas.p_value)
show(p)
In [ ]:
p = hl.plot.manhattan(gwas.p_value)
show(p)
GWAS is a great tool for finding associations between common variants and disease, but a GWAS can't hope to find associations between rare variants and disease. Even if we have sequencing data for 1,000,000 people, we won't have the statistical power to link a mutation found in only a few people to any disease.
But rare variation has lots of information - especially because statistical genetic theory dictates that rarer variants have, on average, stronger effects on disease per allele.
One possible strategy is to group together rare variants with similar predicted consequence. For example, we can group all variants that are predicted to knock out the function of each gene and test the variants for each gene as a group.
We will be running a burden test on our common variant dataset to demonstrate the technical side, but we shouldn't hope to find anything here -- especially because we've only got 10,000 variants!
In [ ]:
! wget https://storage.googleapis.com/hail-tutorial/ensembl_gene_annotations.txt -O data/ensembl_gene_annotations.txt
In [ ]:
gene_ht = hl.import_table('data/ensembl_gene_annotations.txt', impute=True)
In [ ]:
gene_ht.show()
In [ ]:
gene_ht.count()
In [ ]:
gene_ht = gene_ht.transmute(interval = hl.locus_interval(gene_ht['Chromosome'],
gene_ht['Gene start'],
gene_ht['Gene end']))
gene_ht = gene_ht.key_by('interval')
In [ ]:
mt = mt.annotate_rows(gene_info = gene_ht[mt.locus])
In [ ]:
mt.gene_info.show()
There is no `hl.burden_test` function -- instead, a burden test is the composition of two modular pieces of Hail functionality:
- `group_rows_by` / `aggregate`
- `hl.linear_regression_rows`

While this might be a few more lines of code to write than `hl.burden_test`, it means that you can flexibly specify the genotype aggregation however you like. Using other tools, you may have a few ways to aggregate, but if you want to do something different you are out of luck!
In [ ]:
burden_mt = (
mt
.group_rows_by(gene = mt.gene_info['Gene name'])
.aggregate(n_variants = hl.agg.count_where(mt.GT.n_alt_alleles() > 0))
)
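To illustrate the flexibility mentioned above, here is a hypothetical alternative aggregation (our own sketch, not part of the tutorial): summing alternate allele counts per gene rather than counting variant-carrying genotypes.

# Hypothetical alternative: total alternate allele count per sample per gene.
alt_burden_mt = (
    mt
    .group_rows_by(gene = mt.gene_info['Gene name'])
    .aggregate(n_alt = hl.agg.sum(mt.GT.n_alt_alleles()))
)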
In [ ]:
burden_mt.describe()
In [ ]:
burden_results = hl.linear_regression_rows(
y=burden_mt.pheno.CaffeineConsumption,
x=burden_mt.n_variants,
covariates=[1.0,
burden_mt.pheno.isFemale,
burden_mt.pca.scores[0],
burden_mt.pca.scores[1],
burden_mt.pca.scores[2]])
In [ ]:
burden_results.order_by(burden_results.p_value).show()
In [ ]:
In [ ]: