Table Tutorial

Table is Hail's distributed analogue of a data frame or SQL table. It will be familiar if you've used R or pandas, but Table differs in three important ways:

  • It is distributed. Hail tables can store far more data than can fit on a single computer.
  • It carries global fields.
  • It is keyed.

A Table has two different kinds of fields:

  • global fields
  • row fields

Importing and Reading

Hail can import data from many sources: TSV and CSV files, JSON files, FAM files, databases, Spark, etc. It can also read (and write) a native Hail format.

You can read a dataset with hl.read_table. It takes a path and returns a Table. ht stands for Hail Table.

We've provided a method to download and import the MovieLens dataset of movie ratings in the Hail native format. Let's read it!

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=

In [ ]:
import hail as hl

In [ ]:

In [ ]:
users = hl.read_table('data/')

Exploring Tables

The describe method prints the structure of a table: the fields and their types.

In [ ]:
users.describe()

You can view the first few rows of the table using show.

10 rows are displayed by default. Try changing the code in the cell below to display a different number of rows.

In [ ]:
users.show()

You can count the rows of a table.

In [ ]:
users.count()

You can access fields of tables with the Python attribute notation table.field, or with index notation table['field']. The latter is useful when the field names are not valid Python identifiers (if a field name includes a space, for example).

In [ ]:
users.occupation

In [ ]:
users['occupation']

users.occupation and users['occupation'] are Hail Expressions.

Let's peek at them using show. Notice that the key is shown as well!

In [ ]:
users.occupation.show()


The movie dataset has two other tables. Load these tables and have a quick look around.

In [ ]: