This notebook walks through the creation of a fastai DataBunch object, which wraps a PyTorch DataLoader for each of the train, validation, and test sets. From the fastai documentation:
Bind train_dl, valid_dl and test_dl in a data object.
It also ensures all the dataloaders are on device and applies to them dl_tfms as batches are drawn (like normalization). path is used internally to store temporary files, collate_fn is passed to the pytorch DataLoader (replacing the one there) to explain how to collate the samples picked for a batch.
Because we are training a language model, we want our dataloader to construct the target variable from the input data: for a language model, the target is simply the next word in the sequence. There are also optimizations around sequence length and concatenating texts together that avoid wasteful padding. Luckily, TextLMDataBunch does all of this work for us (and more) automatically.
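To make the target construction concrete, here is a minimal sketch (plain Python/numpy, not the fastai internals) of how next-token targets can be built by concatenating tokenized texts into one stream and shifting it by one position; the token ids and sequence length below are made up for illustration:

import numpy as np

# Pretend these are three tokenized documents, already mapped to integer ids.
docs = [[5, 12, 7], [9, 3], [4, 8, 2, 6]]

# Concatenate everything into one long stream (no per-document padding needed).
stream = np.concatenate([np.array(d) for d in docs])

# Inputs are the stream; targets are the same stream shifted left by one token.
x, y = stream[:-1], stream[1:]

# Cut the stream into fixed-length sequences for batching.
seq_len = 3
n = (len(x) // seq_len) * seq_len
x_batches = x[:n].reshape(-1, seq_len)
y_batches = y[:n].reshape(-1, seq_len)
print(x_batches)
print(y_batches)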
In [1]:
from fastai.text import TextLMDataBunch as lmdb
from fastai.text.transform import Tokenizer
import pandas as pd
from pathlib import Path
You can download the saved dataframes (in pickle format) from Google Cloud Storage; a download sketch follows the links below:
train_df.pkl (9GB): https://storage.googleapis.com/issue_label_bot/pre_processed_data/2_partitioned_df/train_df.pkl
valid_df.pkl (1GB): https://storage.googleapis.com/issue_label_bot/pre_processed_data/2_partitioned_df/valid_df.pkl
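If you prefer to fetch the files programmatically, something like the following sketch (standard library only) should work; the destination directory is chosen to match the paths used in the next cell, and you will need roughly 10GB of free disk space:

# Sketch: download the pickled dataframes into the directory the notebook expects.
from pathlib import Path
from urllib.request import urlretrieve

base_url = 'https://storage.googleapis.com/issue_label_bot/pre_processed_data/2_partitioned_df/'
dest_dir = Path('../data/2_partitioned_df')
dest_dir.mkdir(parents=True, exist_ok=True)

for fname in ['train_df.pkl', 'valid_df.pkl']:
    # the train file is ~9GB, so this may take a while
    urlretrieve(base_url + fname, str(dest_dir / fname))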
In [2]:
# note: download the data and place it in the right directory before running this code!
valid_df = pd.read_pickle(Path('../data/2_partitioned_df/valid_df.pkl'))
train_df = pd.read_pickle(Path('../data/2_partitioned_df/train_df.pkl'))
In [3]:
print(f'rows in train_df: {train_df.shape[0]:,}')
print(f'rows in valid_df: {valid_df.shape[0]:,}')
In [4]:
train_df.head(3)
Out[4]: (first three rows of train_df; tabular output not shown)
In [5]:
def pass_through(x):
    return x
In [8]:
# the only thing we change is pre_rules, making it a pass-through, since we have already applied all of the pre-rules.
# you don't want to accidentally apply the pre-rules again, otherwise it will corrupt the data.
tokenizer = Tokenizer(pre_rules=[pass_through], n_cpus=31)
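As a quick sanity check (optional; the sample strings here are made up), you can run the tokenizer on a couple of short texts with Tokenizer.process_all and eyeball the result to confirm the default pre-rules are not being applied:

# Sketch: spot-check the tokenizer on a couple of made-up issue texts.
sample = ['the app crashes on startup', 'error when calling save()']
print(tokenizer.process_all(sample))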
Specify path for saving language model artifacts
In [9]:
path = Path('../model/lang_model/')
In [14]:
# Note: we pass our own tokenizer, whose pre-rules are a no-op (see above).
data_lm = lmdb.from_df(path=path,
                       train_df=train_df,
                       valid_df=valid_df,
                       text_cols='text',
                       tokenizer=tokenizer,
                       chunksize=6000000)
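Before saving, it can be helpful to inspect the result; for example, in fastai v1 you can preview a batch of the numericalized text and check the vocabulary size (an optional check, not required for the rest of the pipeline):

# Optional: inspect the resulting DataBunch.
data_lm.show_batch()               # displays a few decoded training sequences
print(len(data_lm.vocab.itos))     # size of the vocabulary built from train_df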
In [ ]:
data_lm.save() # saves to self.path/data_save.pkl
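In a later step the saved object can be read back; assuming a fastai 1.x release that provides load_data, a reload sketch looks like this:

# Sketch: reload the saved DataBunch (fastai 1.x with load_data available).
from fastai.basic_data import load_data
data_lm = load_data(path, 'data_save.pkl')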