## MatrixTable Tutorial

If you've gotten this far, you're probably thinking:

• "Can't I do all of this in `pandas` or `R`?"
• "What does this have to do with biology?"

The two crucial features that Hail adds are scalability and the domain-specific primitives needed to work easily with biological data. Fear not! You've learned most of the basic concepts of Hail and now are ready for the bit that makes it possible to represent and compute on genetic matrices: the MatrixTable.

In the last example of the Table Joins Tutorial, the ratings table had a compound key: `movie_id` and `user_id`. The ratings were secretly a movie-by-user matrix!

However, since this matrix is very sparse, it is reasonably represented in a so-called "coordinate form" `Table`, where each row of the table is an entry of the sparse matrix. For large and dense matrices (like sequencing data), the per-row overhead of coordinate reresentations is untenable. That's why we built `MatrixTable`, a 2-dimensional generalization of `Table`.

### MatrixTable Anatomy

Recall that `Table` has two kinds of fields:

• global fields
• row fields

`MatrixTable` has four kinds of fields:

• global fields
• row fields
• column fields
• entry fields

Row fields are fields that are stored once per row. These can contain information about the rows, or summary data calculated per row.

Column fields are stored once per column. These can contain information about the columns, or summary data calculated per column.

Entry fields are the piece that makes this structure a matrix -- there is an entry for each (row, column) pair.

Like tables, matrix tables can be imported from a variety of formats: VCF, (B)GEN, PLINK, TSV, etc. Matrix tables can also be read from a "native" matrix table format. Let's read a sample of prepared 1KG data.

``````

In [ ]:

import hail as hl
from bokeh.io import output_notebook, show
output_notebook()

hl.utils.get_1kg('data/')

``````
``````

In [ ]:

mt.describe()

``````

There are a few things to note:

• There is a single column field `s`. This is the sample ID from the VCF. It is also the column key.
• There is a compound row key: `locus` and `alleles`.
• `locus` has type `locus<GRCh37>`
• `alleles` has type `array<str>`
• GT has type `call`. That's a genotype call!

Whereas table expressions could be indexed by nothing or indexed by rows, matrix table expression have four options: nothing, indexed by row, indexed by column, or indexed by row and column (the entries). Let's see some examples.

``````

In [ ]:

mt.s.describe()

``````
``````

In [ ]:

mt.GT.describe()

``````

### MatrixTable operations

We belabored the operations on tables because they all have natural analogs (sometimes several) on matrix tables. For example:

• `count` => `count_{rows, cols}` (and `count` which returns both)
• `filter` => `filter_{rows, cols, entries}`
• `annotate` => `annotate_{rows, cols, entries}` (and globals for both)
• `select` => `select_{rows, cols, entries}` (and globals for both)
• `transmute` => `transmute_{rows, cols, entries}` (and globals for both)
• `group_by` => `group_{rows, cols}_by`
• `explode` => `expode_{rows, cols}`
• `aggregate` => `aggregate_{rows, cols, entries}`

Some operations are unique to `MatrixTable`:

• The row fields can be accessed as a `Table` with rows
• The column fields can be accessed as a `Table` with cols.
• The entire field space of a `MatrixTable` can be accessed as a coordinate-form `Table` with entries. Be careful with this! While it's fast to aggregate or query, trying to write this `Table` to disk could produce files thousands of times larger than the corresponding `MatrixTable`.

Let's explore `mt` using these tools. Let's get the size of the dataset.

``````

In [ ]:

mt.count() # (rows, cols)

``````

Let's look at the first few row keys (variants) and column keys (sample IDs).

``````

In [ ]:

mt.rows().select().show()

``````
``````

In [ ]:

mt.s.show()

``````

Let's investigate the genotypes and the call rate. Let's look at the first few genotypes:

``````

In [ ]:

mt.GT.show()

``````

All homozygous reference, which is not surprising. Let's look at the distribution of genotype calls:

``````

In [ ]:

mt.aggregate_entries(hl.agg.counter(mt.GT.n_alt_alleles()))

``````

Let's compute the overall call rate directly, and then plot the distribution of call rate per variant.

``````

In [ ]:

mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))

``````

Here's a nice trick: you can use an aggregator inside `annotate_rows` and it will aggregate over columns, that is, summarize the values in the row using the aggregator. Let's compute and plot call rate per variant.

``````

In [ ]:

mt2 = mt.annotate_rows(call_rate = hl.agg.fraction(hl.is_defined(mt.GT)))
mt2.describe()

``````
``````

In [ ]:

p = hl.plot.histogram(mt2.call_rate, range=(0,1.0), bins=100,
title='Variant Call Rate Histogram', legend='Call Rate')
show(p)

``````

### Exercise: GQ vs DP

In this exercise, you'll use Hail to investigate a strange property of sequencing datasets.

The `DP` field is the sequencing depth (the number of reads).

Let's first plot a histogram of `DP`:

``````

In [ ]:

p = hl.plot.histogram(mt.DP, range=(0,40), bins=40, title='DP Histogram', legend='DP')
show(p)

``````

Now, let's do the same thing for GQ.

The `GQ` field is the phred-scaled "genotype quality". The formula to convert to a linear-scale confidence (0 to 1) is `10 ** -(mt.GQ / 10)`. GQ is truncated to lie between 0 and 99.

``````

In [ ]:

p = hl.plot.histogram(mt.GQ, range=(0,100), bins=100, title='GQ Histogram', legend='GQ')
show(p)

``````

Whoa! That's a strange distribution! There's a big spike at 100. The rest of the values have roughly the same shape as the DP distribution, but form a Dimetrodon. Use Hail to figure out what's going on!

``````

In [ ]:

``````