Background

This notebook walks through the creation of a fastai DataBunch object. This object contains a PyTorch DataLoader for each of the train, validation, and test sets. From the documentation:

Bind train_dl, valid_dl and test_dl in a data object.

It also ensures all the dataloaders are on device and applies to them dl_tfms as batches are drawn (like normalization). path is used internally to store temporary files, collate_fn is passed to the pytorch Dataloader (replacing the one there) to explain how to collate the samples picked for a batch.

Because we are training the language model, we want our dataloader to construct the target variable from the input data. The target variable for a language model is the next word in the sentence. Furthermore, there are other optimizations with regard to sequence length and concatenating texts together that avoid wasteful padding. Luckily the TextLMDataBunch does all this work for us (and more) automatically.
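
To make that concrete, here is a minimal sketch (plain Python, not fastai's actual implementation; the tokens are made up) of how concatenating texts into one stream lets every input window get a target that is simply the same window shifted one token ahead, with no padding needed:

# Toy illustration of language-model targets: texts are concatenated into one
# token stream, and each target sequence is the input sequence shifted one
# token to the right -- no padding is required.
texts = [
    ['xxxfldtitle', 'grab', 'excerpt', 'and', 'image'],
    ['xxxfldtitle', 'gracefully', 'handling', 'ctrl+c'],
]
stream = [tok for text in texts for tok in text]   # one long stream of tokens

seq_len = 3
for i in range(len(stream) - seq_len):
    x = stream[i:i + seq_len]
    y = stream[i + 1:i + seq_len + 1]   # "next word" targets
    print(x, '->', y)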


In [1]:
from fastai.text import TextLMDataBunch as lmdb
from fastai.text.transform import Tokenizer
import pandas as pd
from pathlib import Path

Read in Data

You can download the pre-processed dataframes (saved in pickle format) from Google Cloud Storage:

train_df.pkl (9GB):

https://storage.googleapis.com/issue_label_bot/pre_processed_data/2_partitioned_df/train_df.pkl

valid_df.pkl (1GB):

https://storage.googleapis.com/issue_label_bot/pre_processed_data/2_partitioned_df/valid_df.pkl
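
If you would rather fetch the files from Python than with a browser or wget, something along these lines should work (a sketch using only the standard library; the destination directory matches the paths used in the next cell, and the files are very large):

from pathlib import Path
from urllib.request import urlretrieve

base_url = 'https://storage.googleapis.com/issue_label_bot/pre_processed_data/2_partitioned_df/'
dest = Path('../data/2_partitioned_df/')
dest.mkdir(parents=True, exist_ok=True)

for fname in ['train_df.pkl', 'valid_df.pkl']:
    # train_df.pkl is ~9GB and valid_df.pkl is ~1GB, so this will take a while
    urlretrieve(base_url + fname, str(dest / fname))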


In [2]:
# note: download the data and place it in the right directory before running this code!

valid_df = pd.read_pickle(Path('../data/2_partitioned_df/valid_df.pkl'))
train_df = pd.read_pickle(Path('../data/2_partitioned_df/train_df.pkl'))

In [3]:
print(f'rows in train_df: {train_df.shape[0]:,}')
print(f'rows in valid_df: {valid_df.shape[0]:,}')


rows in train_df: 16,762,799
rows in valid_df: 1,858,033

In [4]:
train_df.head(3)


Out[4]:
text url
0 xxxfldtitle Grab excerpt and image using Open ... https://github.com/markjaquith/page-links-to/i...
1 xxxfldtitle Gracefully handling Ctrl+C ignores... https://github.com/dotnet/corefx/issues/32749
2 xxxfldtitle GradleAspectJ-Android not working ... https://github.com/Archinamon/android-gradle-a...

Create The DataBunch

Instantiate The Tokenizer


In [5]:
# identity function used in place of fastai's default pre-rules
# (a named, top-level function so it can be pickled for multi-process tokenization)
def pass_through(x):
    return x

In [8]:
# we replace pre_rules with a pass-through, since all of the pre-rules have already been applied upstream.
# applying fastai's default pre-rules a second time would corrupt the data.
tokenizer = Tokenizer(pre_rules=[pass_through], n_cpus=31)
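
As a quick sanity check that the pass-through pre-rule behaves as expected, you can tokenize a couple of strings and confirm that no extra clean-up is applied (a small sketch assuming fastai v1's Tokenizer.process_all; the sample strings below are made up):

sample = ['xxxfldtitle Grab excerpt and image using Open Graph',
          'xxxfldtitle Gracefully handling Ctrl+C']
tokens = tokenizer.process_all(sample)   # one list of tokens per document
print(tokens[0][:10])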

Specify path for saving language model artifacts


In [9]:
path = Path('../model/lang_model/')

Create The Language Model Data Bunch

Warning: this step builds the vocabulary and tokenizes the data. The procedure consumes a very large amount of memory; it took about 1 hour on a machine with 72 cores and 400 GB of RAM.
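
If you want to smoke-test the pipeline before committing to the full run, one option is to build the DataBunch from a small random sample first (the 1% fraction and smaller chunksize below are arbitrary choices, not the values used to produce the saved artifact):

# optional: sanity-check the pipeline on ~1% of the data before the full run
small_train = train_df.sample(frac=0.01, random_state=42)
small_valid = valid_df.sample(frac=0.01, random_state=42)
data_small = lmdb.from_df(path=path,
                          train_df=small_train,
                          valid_df=small_valid,
                          text_cols='text',
                          tokenizer=tokenizer,
                          chunksize=100000)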


In [14]:
# Note: use the custom tokenizer defined above so the default pre-rules are not applied again
data_lm = lmdb.from_df(path=path,
                       train_df=train_df,
                       valid_df=valid_df,
                       text_cols='text',
                       tokenizer=tokenizer,
                       chunksize=6000000)
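
Before saving, it can be worth a quick inspection of the result; for example (assuming fastai v1's DataBunch API), checking the vocabulary size and previewing a batch to confirm the text was tokenized as expected:

print(len(data_lm.vocab.itos))   # size of the learned vocabulary
data_lm.show_batch()             # preview a few training sequences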

In [ ]:
data_lm.save() # saves to self.path/data_save.pkl
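
To reload the saved DataBunch later without repeating the expensive tokenization step, recent fastai v1 releases provide load_data (older releases used TextLMDataBunch.load instead), roughly:

from fastai.text import load_data

data_lm = load_data(path, 'data_save.pkl')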

Location of Saved DataBunch

The databunch object is available here:

https://storage.googleapis.com/issue_label_bot/model/lang_model/data_save.pkl

It is a massive file (27GB), so proceed with caution when downloading it.