If you've gotten this far, you're probably thinking: "Can't I do all of this in pandas or R?" The two crucial features that Hail adds are scalability and the domain-specific primitives needed to work easily with biological data. Fear not! You've learned most of the basic concepts of Hail and now are ready for the bit that makes it possible to represent and compute on genetic matrices: the MatrixTable.
In the last example of the Table Joins Tutorial, the ratings table had a compound key: movie_id and user_id. The ratings were secretly a movie-by-user matrix!

However, since this matrix is very sparse, it is reasonably represented in a so-called "coordinate form" Table, where each row of the table is an entry of the sparse matrix. For large and dense matrices (like sequencing data), the per-row overhead of coordinate representations is untenable. That's why we built MatrixTable, a 2-dimensional generalization of Table.
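For instance, a coordinate-form Table can be pivoted into a MatrixTable with to_matrix_table. Here's a minimal sketch of the ratings example using a few made-up rows (the data and variable names are just for illustration):

In [ ]:
import hail as hl

# A tiny coordinate-form table: one row per (movie, user) entry of a sparse matrix.
ratings_ht = hl.Table.parallelize(
    [{'movie_id': 1, 'user_id': 1, 'rating': 5},
     {'movie_id': 1, 'user_id': 2, 'rating': 3},
     {'movie_id': 2, 'user_id': 1, 'rating': 4}],
    hl.tstruct(movie_id=hl.tint32, user_id=hl.tint32, rating=hl.tint32))

# The same data as a 2-dimensional object: movies as rows, users as columns,
# with 'rating' becoming an entry field.
ratings_mt = ratings_ht.to_matrix_table(row_key=['movie_id'], col_key=['user_id'])
ratings_mt.describe()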
- Row fields are fields that are stored once per row. These can contain information about the rows, or summary data calculated per row.
- Column fields are stored once per column. These can contain information about the columns, or summary data calculated per column.
- Entry fields are the piece that makes this structure a matrix: there is an entry for each (row, column) pair.
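As a toy illustration of where each kind of field lives, here is a tiny matrix table built with range_matrix_table (everything here is made up purely for demonstration):

In [ ]:
import hail as hl

# A 3x2 matrix table with default row field `row_idx` and column field `col_idx`.
toy = hl.utils.range_matrix_table(n_rows=3, n_cols=2)
toy = toy.annotate_rows(row_info='stored once per row')
toy = toy.annotate_cols(col_info='stored once per column')
# Entry fields combine row- and column-indexed values: one per (row, column) pair.
toy = toy.annotate_entries(entry_value=toy.row_idx * 10 + toy.col_idx)
toy.describe()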
In [ ]:
import hail as hl
from bokeh.io import output_notebook, show
output_notebook()
hl.utils.get_1kg('data/')
In [ ]:
mt = hl.read_matrix_table('data/1kg.mt')
mt.describe()
There are a few things to note:

- There is a single column field, s. This is the sample ID from the VCF. It is also the column key.
- There is a compound row key: locus and alleles.
  - locus has type locus<GRCh37>
  - alleles has type array<str>
- There is a single entry field, GT, of type call. That's a genotype call!

Whereas table expressions could be indexed by nothing or indexed by rows, matrix table expressions have four options: nothing, indexed by row, indexed by column, or indexed by row and column (the entries). Let's see some examples.
In [ ]:
mt.s.describe()
In [ ]:
mt.GT.describe()
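For completeness, here's a quick look at the other two indexing options, using fields we already have: a row-indexed expression, and the globals, which are indexed by nothing (just a sketch):

In [ ]:
mt.locus.describe()    # indexed by row
mt.globals.describe()  # indexed by nothing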
We belabored the operations on tables because they all have natural analogs (sometimes several) on matrix tables; a few of these are sketched after the list. For example:

- count => count_{rows, cols} (and count, which returns both)
- filter => filter_{rows, cols, entries}
- annotate => annotate_{rows, cols, entries} (and globals for both)
- select => select_{rows, cols, entries} (and globals for both)
- transmute => transmute_{rows, cols, entries} (and globals for both)
- group_by => group_{rows, cols}_by
- explode => explode_{rows, cols}
- aggregate => aggregate_{rows, cols, entries}
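Here's a quick sketch of a few of these analogs applied to the mt loaded above (variable names like snps and mean_dp are hypothetical, chosen just for this example):

In [ ]:
# count splits into row and column counts.
n_variants, n_samples = mt.count_rows(), mt.count_cols()

# filter_rows keeps variants satisfying a row-indexed predicate.
snps = mt.filter_rows(hl.is_snp(mt.alleles[0], mt.alleles[1]))

# annotate_cols adds a new column (sample) field; the aggregator runs over rows.
mt_annot = mt.annotate_cols(n_called=hl.agg.count_where(hl.is_defined(mt.GT)))

# aggregate_entries summarizes over all (row, column) pairs.
mean_dp = mt.aggregate_entries(hl.agg.mean(mt.DP))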
Some operations are unique to MatrixTable:

- The row fields can be accessed as a Table with rows.
- The column fields can be accessed as a Table with cols.
- The entry fields of the MatrixTable can be accessed as a coordinate-form Table with entries. Be careful with this! While it's fast to aggregate or query, trying to write this Table to disk could produce files thousands of times larger than the corresponding MatrixTable.

Let's explore mt using these tools. Let's get the size of the dataset.
In [ ]:
mt.count() # (rows, cols)
Let's look at the first few row keys (variants) and column keys (sample IDs).
In [ ]:
mt.rows().select().show()
In [ ]:
mt.s.show()
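The coordinate-form entries view mentioned above works the same way; it's fine for queries like this, but remember not to write it out to disk. A quick sketch:

In [ ]:
# One row per (variant, sample) pair: cheap to query, enormous to write.
mt.entries().select('DP', 'GQ').show(5)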
Let's investigate the genotypes and the call rate. Let's look at the first few genotypes:
In [ ]:
mt.GT.show()
All homozygous reference, which is not surprising. Let's look at the distribution of genotype calls:
In [ ]:
mt.aggregate_entries(hl.agg.counter(mt.GT.n_alt_alleles()))
Let's compute the overall call rate directly, and then plot the distribution of call rate per variant.
In [ ]:
mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))
Here's a nice trick: you can use an aggregator inside annotate_rows and it will aggregate over columns, that is, summarize the values in the row using the aggregator. Let's compute and plot call rate per variant.
In [ ]:
mt2 = mt.annotate_rows(call_rate = hl.agg.fraction(hl.is_defined(mt.GT)))
mt2.describe()
In [ ]:
p = hl.plot.histogram(mt2.call_rate, range=(0,1.0), bins=100,
title='Variant Call Rate Histogram', legend='Call Rate')
show(p)
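Let's do the same for the DP (read depth) field: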
In [ ]:
p = hl.plot.histogram(mt.DP, range=(0,40), bins=40, title='DP Histogram', legend='DP')
show(p)
Now, let's do the same thing for GQ.
The GQ field is the phred-scaled "genotype quality". The formula to convert to a linear-scale confidence (0 to 1) is 10 ** -(mt.GQ / 10). GQ is truncated to lie between 0 and 99.
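As a quick sanity check of that formula, one could attach the linear-scale value as an entry field and summarize it (a sketch; the field name p_err is hypothetical):

In [ ]:
# Convert phred-scaled GQ to the linear scale described above.
mt_gq = mt.annotate_entries(p_err = 10 ** -(mt.GQ / 10))
mt_gq.aggregate_entries(hl.agg.stats(mt_gq.p_err))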
In [ ]:
p = hl.plot.histogram(mt.GQ, range=(0,100), bins=100, title='GQ Histogram', legend='GQ')
show(p)
Whoa! That's a strange distribution! There's a big spike at 100. The rest of the values have roughly the same shape as the DP distribution, but form a Dimetrodon. Use Hail to figure out what's going on!
In [ ]: