Filtering and Annotation Tutorial

Filter

You can filter the rows of a table with Table.filter. This returns a table of those rows for which the expression evaluates to True.



In [ ]:

    
import hail as hl

hl.utils.get_movie_lens('data/')
users = hl.read_table('data/users.ht')



In [ ]:

    
users.filter(users.occupation == 'programmer').count()

We can also express this query in multiple ways using aggregations:



In [ ]:

    
users.aggregate(hl.agg.filter(users.occupation == 'programmer', hl.agg.count()))



In [ ]:

    
users.aggregate(hl.agg.counter(users.occupation == 'programmer'))[True]

Annotate

You can add new fields to a table with annotate. As an example, let's create a new column called cleaned_occupation that replaces missing entries in the occupation field labeled as 'other' with 'none.'



In [ ]:

    
missing_occupations = hl.set(['other', 'none'])

t = users.annotate(
    cleaned_occupation = hl.cond(missing_occupations.contains(users.occupation),
                                 hl.null('str'),
                                 users.occupation))
t.show()

Compare this to what we had before:



In [ ]:

    
users.show()

Note: annotate is functional: it doesn't mutate users, but returns a new table. This is also true of filter. In fact, all operations in Hail are functional.



In [ ]:

    
users.describe()

Select and Transmute

There are two other annotate methods: select and transmute. select allows you to create new tables from old ones by selecting existing fields, or creating new ones.

First, let's extract the sex and occupation fields:



In [ ]:

    
users.select(users.sex, users.occupation).show()

We can also create a new field that stores the age relative to the average. Note that new fields must be assigned a name (in this case mean_shifted_age):



In [ ]:

    
mean_age = round(users.aggregate(hl.agg.stats(users.age)).mean)
users.select(users.sex, users.occupation, mean_shifted_age = users.age - mean_age).show()

transmute replaces any fields mentioned on the right-hand side with the new fields, but leaves unmentioned fields unchanged. transmute is useful for transforming data into a new form. Compare the following two snippts of code. The second is identical to the first with transmute replacing select.



In [ ]:

    
missing_occupations = hl.set(['other', 'none'])

t = users.select(
    cleaned_occupation = hl.cond(missing_occupations.contains(users.occupation),
                                 hl.null('str'),
                                 users.occupation))
t.show()



In [ ]:

    
missing_occupations = hl.set(['other', 'none'])

t = users.transmute(
    cleaned_occupation = hl.cond(missing_occupations.contains(users.occupation),
                                 hl.null('str'),
                                 users.occupation))
t.show()

Global Fields

Finally, you can add global fields with annotate_globals. Globals are useful for storing metadata about a dataset or storing small data structures like sets and maps.



In [ ]:

    
t = users.annotate_globals(cohort = 5, cloudable = hl.set(['sample1', 'sample10', 'sample15']))
t.describe()



In [ ]:

    
t.cloudable



In [ ]:

    
hl.eval(t.cloudable)

Exercises

Z-score normalize the age field of users.
Convert zip to an integer. Hint: Not all zipcodes are US zipcodes! Use hl.int32 to convert a string to an integer. Use StringExpression.matches to see if a string matches a regular expression.



In [ ]: