In the last section, we inspected the structure of the data and displayed a few example values.
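How do we get a deeper feel for the data? One of the most natural things to do is to create a summary of a large number of values. For example, you could ask how many users are in the dataset, what the average age is, or how many times each occupation appears.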
We can answer these questions with aggregation. Aggregation combines many values together to create a summary.
To start, we'll aggregate all the values in a table. (Later, we'll learn how to aggregate over subsets.)
We can do this with the Table.aggregate method.
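A call to aggregate has two parts: the expression to aggregate over (for example, a field of the table) and the aggregator that combines those values into a summary.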
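To make the two parts concrete, here is a plain-Python analogy rather than Hail code. The `rows` list, the `aggregate` helper, and the `mean` aggregator are hypothetical names made up for this sketch; Hail's aggregators operate on distributed tables and are not implemented this way.

```python
# Plain-Python analogy of "expression + aggregator" (not Hail's implementation).
# `rows`, `aggregate`, and `mean` are hypothetical names for illustration only.
rows = [
    {"age": 24, "occupation": "technician"},
    {"age": 53, "occupation": "other"},
    {"age": 23, "occupation": "writer"},
]


def aggregate(rows, expression, aggregator):
    """Evaluate `expression` on every row, then combine the values."""
    values = [expression(row) for row in rows]
    return aggregator(values)


def mean(values):
    """Combine a list of numbers into their arithmetic mean."""
    return sum(values) / len(values)


# Part 1: the expression to aggregate over (here, the age field).
# Part 2: the aggregator that combines the values into a summary.
oldest = aggregate(rows, lambda row: row["age"], max)
average_age = aggregate(rows, lambda row: row["age"], mean)
print(oldest, average_age)
```

Swapping the aggregator while keeping the expression fixed changes the summary without changing which values are summarized, which is the same separation the Hail calls in this section rely on.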
Hail has a large suite of aggregators for summarizing data. Let's see some in action!
In [ ]:
import hail as hl
from bokeh.io import output_notebook, show
output_notebook()
hl.init()
hl.utils.get_movie_lens('data/')
users = hl.read_table('data/users.ht')
In [ ]:users.aggregate(hl.agg.count())
In [ ]:users.count()
In [ ]:users.show()
In [ ]:users.aggregate(hl.agg.stats(users.age))
What about non-numeric data, like the occupation field?
counter is modeled on the Python Counter object: it counts the number of times each distinct value occurs in the collection of values being aggregated.
In [ ]:users.aggregate(hl.agg.counter(users.occupation))
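For comparison, this is how the Python Counter object mentioned above behaves on an ordinary list. The `occupations` list is a made-up sample, not data from the MovieLens table, and the snippet uses only the standard library.

```python
from collections import Counter

# Hypothetical sample values standing in for the occupation field.
occupations = ["writer", "executive", "writer", "student", "writer"]

# Counter maps each distinct value to the number of times it occurs.
counts = Counter(occupations)

print(counts["writer"])        # how many times "writer" occurs
print(counts.most_common(1))   # the most frequent value with its count
```

The result of the counter aggregator can be read the same way: look up a distinct value to get the number of rows in which it appears.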
In [ ]:users.aggregate(hl.agg.filter(users.sex == 'M', hl.agg.count()))
The argument to filter should be a Boolean expression.
In [ ]:users.aggregate(hl.agg.count_where(users.sex == 'M'))
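Wrapping count in filter and calling count_where with the same condition are two ways of asking the same question. A plain-Python analogy, using a made-up `sexes` list rather than the real table, might look like this:

```python
# Hypothetical sample values standing in for the sex field.
sexes = ["M", "F", "M", "M", "F"]

# Analogue of filter(...) wrapped around count():
# keep only the matching values, then count what is left.
filtered_count = len([s for s in sexes if s == "M"])

# Analogue of count_where(...):
# walk over every value and count those where the condition holds.
count_where_result = sum(1 for s in sexes if s == "M")

print(filtered_count, count_where_result)
```

The filter form is the more general of the two, since any aggregator can be placed inside it, not only count.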
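Boolean expressions can be compound, but be sure to wrap each comparison in parentheses, because Python's '&' and '|' operators bind more tightly than comparison operators such as '=='. A single '&' denotes logical AND and a single '|' denotes logical OR.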
In [ ]:users.aggregate(hl.agg.filter((users.occupation == 'writer') | (users.occupation == 'executive'), hl.agg.count()))
In [ ]:users.aggregate(hl.agg.filter((users.sex == 'F') | (users.occupation == 'executive'), hl.agg.count()))
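The need for parentheses comes from Python's operator precedence, which applies to Hail expressions too, since they are written in Python. The sketch below uses plain integers (`a` and `b` are arbitrary illustrative values) only to show how the two spellings are parsed.

```python
# Plain Python integers, used only to show how the operators are parsed.
a, b = 1, 2

# Without parentheses, `|` binds more tightly than `==`, so Python reads this
# as the chained comparison a == (1 | b) == 2, which is 1 == 3 == 2.
without_parens = a == 1 | b == 2

# With parentheses, each comparison is evaluated first and then combined.
with_parens = (a == 1) | (b == 2)

print(without_parens, with_parens)
```

Both comparisons are true on their own, yet only the parenthesized version reports that, which is why every comparison inside a compound filter condition gets its own parentheses.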
In [ ]:
hist = users.aggregate(hl.agg.hist(users.age, 10, 70, 60))
hist
In [ ]:
p = hl.plot.histogram(hist, legend='Age')
show(p)
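As far as I understand the hist call above, its arguments are the expression, the start of the range, the end of the range, and the number of bins, so 10, 70, 60 asks for 60 bins of width 1 between ages 10 and 70. The plain-Python sketch below illustrates that kind of binning. The `histogram` helper and the `ages` sample are hypothetical, and the treatment of out-of-range values and of the right-hand edge is an assumption of this sketch, not a statement about Hail's internals.

```python
def histogram(values, start, end, bins):
    """Count values into `bins` equal-width bins covering [start, end]."""
    width = (end - start) / bins
    bin_freq = [0] * bins
    n_smaller = 0  # values below `start`
    n_larger = 0   # values above `end`
    for v in values:
        if v < start:
            n_smaller += 1
        elif v > end:
            n_larger += 1
        else:
            # Assumption of this sketch: a value equal to `end`
            # is counted in the last bin.
            index = min(int((v - start) / width), bins - 1)
            bin_freq[index] += 1
    bin_edges = [start + i * width for i in range(bins + 1)]
    return bin_edges, bin_freq, n_smaller, n_larger


# Hypothetical sample ages, not taken from the MovieLens table.
ages = [7, 10, 24, 24, 53, 70, 73]
bin_edges, bin_freq, n_smaller, n_larger = histogram(ages, 10, 70, 60)
print(len(bin_edges), sum(bin_freq), n_smaller, n_larger)
```

Note that 60 bins need 61 edges, and that values outside the requested range are tallied separately rather than silently dropped.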
In [ ]:users.aggregate(hl.agg.take(users.occupation, 5))
In [ ]:users.aggregate(hl.agg.take(users.age, 5, ordering=-users.age))
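The ordering argument in the second take call can be read as a sort key: assuming take returns the values whose ordering key is smallest, ordering by the negated age yields the largest ages. A plain-Python analogy, with a made-up `ages` list and the standard-library `heapq` module, could look like this:

```python
import heapq

# Hypothetical sample values standing in for the age field.
ages = [24, 53, 23, 33, 42, 57, 29, 36]

# Analogue of take(age, 3) with no ordering. This sketch simply keeps the
# first three values it sees; that choice is an assumption of the analogy.
first_three = ages[:3]

# Analogue of take(age, 3, ordering=-age): keep the three values with the
# smallest ordering key. With the key -age, those are the three largest ages.
oldest_three = heapq.nsmallest(3, ages, key=lambda age: -age)

print(first_three, oldest_three)
```

Negating a numeric key is a compact way to flip the direction of an ordering without needing a separate "descending" option.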
Warning! Aggregators like counter return Python objects and can fail with out-of-memory errors if you apply them to collections that are too large (e.g. all 50 trillion genotypes in the UK Biobank dataset).