ULMFit

Fine-tuning a forward and backward language model to get to 95.4% accuracy on the IMDB movie reviews dataset. This tutorial uses fastai v1.0.53.


In [ ]:
from fastai.text import *

From a language model...

Data collection

This was run on a Titan RTX (24 GB of RAM), so you will probably need to adjust the batch size accordingly. If you divide it by 2, don't forget to divide the learning rate by 2 as well in the following cells. You can also reduce the bptt a little to save some memory.


In [ ]:
bs,bptt=256,80
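
If you are on a GPU with less memory, a minimal sketch of the adjustment described above could look like this (hypothetical values, not part of the original run):

bs,bptt = 128,70   # halve the batch size (and shorten bptt a bit) to fit in less GPU memory
                   # remember to also divide the learning rates in the later cells by 2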

This will download and untar the file containing the IMDB dataset, returning a pathlib Path object pointing to the directory it's in (the default is ~/.fastai/data/imdb). You can specify another folder with the dest argument.


In [ ]:
path = untar_data(URLs.IMDB)

We then gather the data we will use to fine-tune the language model using the data block API. For this step, we want all the texts available (even the ones without labels in the unsup folder), and we won't use the IMDB test split for validation here (we keep that for the classification part later). Instead, we set aside a random 10% of all the texts to build our validation set.

The fastai library will automatically launch the tokenization process with the spacy tokenizer and a few default rules for pre- and post-processing before numericalizing the tokens, with a vocab of maximum size 60,000. Tokens are sorted by their frequency and only the 60,000 most common are kept, the others being replaced by an unknown token. This cell takes a few minutes to run, so we save the result.


In [ ]:
data_lm = (TextList.from_folder(path)
           #Inputs: all the text files in path
            .filter_by_folder(include=['train', 'test', 'unsup']) 
           #We may have other temp folders that contain text files so we only keep what's in train, test and unsup
            .split_by_rand_pct(0.1)
           #We randomly split and keep 10% (10,000 reviews) for validation
            .label_for_lm()           
           #We want to do a language model so we label accordingly
            .databunch(bs=bs, bptt=bptt))
data_lm.save('data_lm.pkl')
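
As a quick sanity check (not in the original notebook), you can inspect the vocabulary that was just built; in fastai v1, data_lm.vocab.itos holds the tokens, special tokens first and the rest sorted by frequency:

len(data_lm.vocab.itos)    # roughly 60,000 tokens at most, plus special tokens like xxunk and xxbos
data_lm.vocab.itos[:10]    # special tokens first, then the most frequent words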

When restarting the notebook, as long as the previous cell was executed once, you can skip it and directly load your data again with the following.


In [ ]:
data_lm = load_data(path, 'data_lm.pkl', bs=bs, bptt=bptt)

Since we are training a language model, all the texts are concatenated together (with a random shuffle between them at each new epoch). The model is trained to guess what the next word in the sentence is.
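
To make that concrete, here is a hedged way to peek at one training batch (assuming the data_lm object built above); each target sequence should simply be the corresponding input sequence shifted by one token:

x,y = next(iter(data_lm.train_dl))
x.shape, y.shape   # both of shape (bs, bptt): for every position, the target is the next token in the stream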


In [ ]:
data_lm.show_batch()


idx text
0 xxmaj the rest of the cast which includes xxmaj beau xxmaj bridges , xxmaj kathy xxmaj bates and xxmaj mary - xxmaj louise xxmaj parker give remarkable clarity and substance to their characters . \n \n xxmaj the direction is subtle and effective . i 've watched this movie several times over the years and would very much recommend it . a beautiful piece of filmmaking . xxbos xxmaj
1 xxup casablanca ( xxmaj warner xxmaj brothers , 1942 ) . xxbos xxmaj this may be the most tension - filled movie i have ever seen . \n \n xxmaj in fact , it 's so nerve - wracking , i have n't been able to watch it again after viewing it two years ago , but i will since i have the xxup dvd . xxmaj there were
2 xxmaj piper , xxmaj prue and xxmaj phoebe bring xxmaj dr. xxmaj griffiths to the xxmaj manor in order to try and save him from xxmaj the xxmaj source 's personal assassin , xxmaj shax . xxmaj whilst xxmaj phoebe looks in the xxmaj book of xxmaj shadows for a spell to vanquish xxmaj shax , xxmaj prue and xxmaj piper are attacked by xxmaj shax and chase him into
3 in play . a tease does n't work unless it actually delivers . xxmaj britney does , so does ' xxmaj you xxmaj are xxmaj alone ' . xxbos xxmaj this is the kind of movie that you rent when you are incredibly tired , or impaired in some other way ... xxmaj the acting in this movie is so bad it seems intentional , and to let you know
4 do n't already have it . xxmaj they do n't expect xxmaj holbeck to succeed . xxmaj that way the xxmaj russians , who had stopped transmitting with xxmaj enigma , just in case , will begin transmitting again . \n \n xxmaj enigma is in the computer in the office of xxmaj dimitri xxmaj vasilikov . xxmaj somehow xxmaj holbeck must gain access , and in order to

For a backward model, the only difference is that we'll have to pass the flag backwards=True.


In [ ]:
data_bwd = load_data(path, 'data_lm.pkl', bs=bs, bptt=bptt, backwards=True)

In [ ]:
data_bwd.show_batch()


idx text
0 xxmaj : comment final my xxmaj \n \n . down xxup song the of pleasure overall the drag they , filming shoddy and angles camera the with but , movie the throughout songs two hear 'll you xxmaj : fans filth xxmaj of xxmaj cradle xxmaj you to xxmaj \n \n . shirt 's man the in blood fake with filled packet juice a 's there like looks
1 character the for nothing felt i . thing whole the during pain in was and movie the watched i . craze movie action big the during set was it xxmaj ? trouble the xxmaj . more much so xxup been have could movie this xxmaj . family his and john xxmaj character main the for felt i and bachman xxmaj as books kings xxmaj from man running the read i
2 characters these of sense make to try to harder even it made mother the by comment " today died father your " explained never the as such screenplay the in holes enormous xxmaj . viewers most of ask to much too was out played structure scene the way the in nature long painfully the that thought certainly most i , intrigue 's it of way by film this in qualities
3 all in all but , long too development character the , standards 's today by laughable were effects special the of quality the xxmaj . movies of types these of expect to come 've i what was out xxmaj watch xxmaj better xxmaj you xxmaj . xxunk xxmaj on these of some see to excited very was i so , disappointed very was i , theaters our from pulled was
4 perhaps xxmaj . will free own their ... by safe the to returned are hobgoblins xxmaj the xxmaj . " ? what xxmaj " wondering watcher the leave will movie this of end the at twist the xxmaj : ahead xxup spoilers xxup warning xxup \n \n . budget big a having film the or ... dollars of millions having with do to anything has never which fantasy wildest

Fine-tuning the forward language model

The idea behind the ULMFit paper is to use transfer learning for this classification task. Our language model isn't randomly initialized; instead, it starts from the weights of a model pretrained on a larger corpus, WikiText-103. The vocabularies of the two datasets are slightly different, so when loading the weights, we take care to put the embedding weights at the right place, and we randomly initialize the embeddings for words in the IMDB vocabulary that weren't in the WikiText-103 vocabulary of our pretrained model.

This is all done by the first line of code, which will download the pretrained model for you on first use. The second line enables Mixed Precision Training, which lets us use a higher batch size by training part of our model in FP16 precision, and also speeds up training by a factor of 2 to 3 on modern GPUs.


In [ ]:
learn = language_model_learner(data_lm, AWD_LSTM)
learn = learn.to_fp16(clip=0.1)

The Learner object we get is frozen by default, which means we only train the embeddings at first (since some of them are random).
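
If you prefer to pick the learning rate yourself rather than taking the value used below, an optional step is to run the learning rate finder first (standard fastai v1 calls, not part of the original notebook):

learn.lr_find()        # mock training with exponentially growing learning rates
learn.recorder.plot()  # plot loss vs. learning rate and pick a value just before the loss blows up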


In [ ]:
learn.fit_one_cycle(1, 2e-2, moms=(0.8,0.7), wd=0.1)


epoch train_loss valid_loss accuracy time
0 4.303632 3.995142 0.292891 06:48

Then we unfreeze the model and fine-tune the whole thing.


In [ ]:
learn.unfreeze()

In [ ]:
learn.fit_one_cycle(10, 2e-3, moms=(0.8,0.7), wd=0.1)


epoch train_loss valid_loss accuracy time
0 4.036638 3.858849 0.309023 08:26
1 3.939585 3.786816 0.317458 08:35
2 3.898183 3.739857 0.322893 08:46
3 3.849097 3.709006 0.326197 08:58
4 3.809948 3.676351 0.329813 08:41
5 3.766907 3.647854 0.333161 08:41
6 3.728683 3.621063 0.335861 08:41
7 3.686090 3.601203 0.338476 08:39
8 3.647661 3.589400 0.339779 08:39
9 3.612635 3.586823 0.340075 08:12
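
As an optional sanity check (not in the original notebook), you can let the fine-tuned language model generate a bit of text with learn.predict; the prompt and settings below are just examples:

learn.predict('I liked this movie because', n_words=40, temperature=0.75)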

Once done, we just save the encoder of the model (everything except the last linear layer that was decoding our final hidden states to words) because this is what we will use for the classifier.


In [ ]:
learn.save_encoder('fwd_enc')

The same but backwards

You can't directly train a bidirectional RNN for language modeling, but you can always ensemble a forward and a backward model. fastai provides both a pretrained forward and a pretrained backward model, so we can repeat the previous steps to fine-tune the pretrained backward model. The language_model_learner function inspects the data object you pass to automatically decide whether it should use the pretrained forward or backward model.


In [ ]:
learn = language_model_learner(data_bwd, AWD_LSTM)
learn = learn.to_fp16(clip=0.1)

Then the training is the same:


In [ ]:
learn.fit_one_cycle(1, 2e-2, moms=(0.8,0.7), wd=0.1)


epoch train_loss valid_loss accuracy time
0 4.335882 4.022331 0.326608 06:14

In [ ]:
learn.unfreeze()

In [ ]:
learn.fit_one_cycle(10, 2e-3, moms=(0.8,0.7), wd=0.1)


epoch train_loss valid_loss accuracy time
0 4.062783 3.886092 0.342082 06:47
1 3.962229 3.810247 0.350727 06:46
2 3.922255 3.766445 0.355493 06:49
3 3.874089 3.730033 0.359623 06:48
4 3.837302 3.702043 0.362982 06:48
5 3.803697 3.675028 0.365694 06:48
6 3.754358 3.650825 0.368455 06:48
7 3.713858 3.629092 0.370696 06:47
8 3.676009 3.617974 0.371980 06:46
9 3.649889 3.615306 0.372268 06:48

In [ ]:
learn.save_encoder('bwd_enc')

... to a classifier

Data Collection

The classifier is a slightly heavier model, so we have to lower the batch size.


In [ ]:
path = untar_data(URLs.IMDB)
bs = 128

We use the data block API again to gather all the texts for classification. This time, we only keep the ones in the train and test folders and label them by the folder they are in. Since this step takes a bit of time, we save the result.


In [ ]:
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
             #grab all the text files in path
             .split_by_folder(valid='test')
             #split by train and valid folder (that only keeps 'train' and 'test' so no need to filter)
             .label_from_folder(classes=['neg', 'pos'])
             #label them all with their folders
             .databunch(bs=bs))

data_clas.save('data_clas.pkl')

As long as the previous cell was executed once, you can skip it and directly do this.


In [ ]:
data_clas = load_data(path, 'data_clas.pkl', bs=bs)

In [ ]:
data_clas.show_batch()


text target
xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules pos
xxbos xxmaj by now you 've probably heard a bit about the new xxmaj disney dub of xxmaj miyazaki 's classic film , xxmaj laputa : xxmaj castle xxmaj in xxmaj the xxmaj sky . xxmaj during late summer of 1998 , xxmaj disney released " xxmaj kiki 's xxmaj delivery xxmaj service " on video which included a preview of the xxmaj laputa dub saying it was due out pos
xxbos xxmaj some have praised _ xxunk _ as a xxmaj disney adventure for adults . i do n't think so -- at least not for thinking adults . \n \n xxmaj this script suggests a beginning as a live - action movie , that struck someone as the type of crap you can not sell to adults anymore . xxmaj the " crack staff " of many older neg
xxbos * * * xxmaj warning - this review contains " plot spoilers , " though nothing could " spoil " this movie any more than it already is . xxmaj it really xxup is that bad . * * * \n \n xxmaj before i begin , i 'd like to let everyone know that this definitely is one of those so - incredibly - bad - that neg
xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the sweetest and most feel - good romantic comedies ever made . xxmaj there 's just no getting around that , and it 's hard to actually put one 's feeling for this film into words . xxmaj it 's not one of those films that tries too hard , nor does it come up with pos

Like before, you only have to add backwards=True to load the data for a backward model.


In [ ]:
data_clas_bwd = load_data(path, 'data_clas.pkl', bs=bs, backwards=True)

In [ ]:
data_clas_bwd.show_batch()


text target
\n \n a- xxup a ppv xxup this give i . winner a was one this but , good very n't were ppv xxup the lately xxmaj ! ppv xxup decent a is there xxunk \n \n rock xxmaj the xxmaj : champion xxmaj wwe xxup new xxmaj and winner xxmaj ! pinned and one xxmaj great xxmaj the by bottomed xxmaj rock xxmaj was but slam xxmaj pos
9 - score xxmaj a - grade xxmaj \n \n ... better is think you what on decision a make can you xxmaj . realistic hardly but inspiring is that legend a about film a is " braveheart xxmaj " . world 's today in apropos very are that themes has that film a is " roy xxmaj rob xxmaj " , simply put to xxmaj . " roy pos
\n \n . you for is definitely movie this then , disaster face they how and 's 1900 early the in characters of life the into insight an you give will that characters interesting like you if xxmaj . editing and , xxunk , sound , costuming including material award xxmaj academy xxmaj are film this of aspects certain xxmaj . picture this in found be can that treats pos
. lower ever seeps it , plot the about think i as but enjoyable is movie -the - 10 of out 4 \n \n . badness much so avoided and feel pulp its of none lost ark xxmaj lost xxmaj the of raiders xxmaj . well as up grow to needs pulp our and sophisticated more bit a become have we xxmaj . anyway , " anymore adults sell neg
. wrestlers of roster great a handle to way poor a is which hogan xxmaj and warrior xxmaj by decimated were heels the of most xxmaj . pinned be to belt conveyor a on waiting almost were wrestlers the as effect detrimental a had obviously time little too and matches many too , overall xxmaj \n \n 10 / 2 . nauseous of point the to ending predictable very neg

Fine-tuning the forward classifier

The classifier needs a little less dropout, so we pass drop_mult=0.5 to multiply all the dropouts by this amount (it's easier than adjusting all five different values manually). We don't load the pretrained model, but instead our fine-tuned encoder from the previous section.


In [ ]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5, pretrained=False)
learn.load_encoder('fwd_enc')

Then we train the model using gradual unfreezing (we start with everything but the classification head frozen, then progressively unfreeze more layer groups until the whole model is training) and discriminative learning rates (the earlier layers get lower learning rates than the head).
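
Concretely, the slice(...) arguments in the cells below produce the discriminative learning rates; as a rough illustration (example values, following the fastai v1 convention for fit_one_cycle):

lrs = slice(lr/(2.6**4), lr)   # the lowest layer group trains at lr/2.6**4, the head at lr,
                               # with the groups in between spaced geometrically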


In [ ]:
lr = 1e-1

In [ ]:
learn.fit_one_cycle(1, lr, moms=(0.8,0.7), wd=0.1)


epoch train_loss valid_loss accuracy time
0 0.246949 0.180387 0.931840 01:14

In [ ]:
learn.freeze_to(-2)
lr /= 2
learn.fit_one_cycle(1, slice(lr/(2.6**4),lr), moms=(0.8,0.7), wd=0.1)


epoch train_loss valid_loss accuracy time
0 0.206164 0.152391 0.945360 01:28

In [ ]:
learn.freeze_to(-3)
lr /= 2
learn.fit_one_cycle(1, slice(lr/(2.6**4),lr), moms=(0.8,0.7), wd=0.1)


epoch train_loss valid_loss accuracy time
0 0.181309 0.141463 0.948080 02:30

In [ ]:
learn.unfreeze()
lr /= 5
learn.fit_one_cycle(2, slice(lr/(2.6**4),lr), moms=(0.8,0.7), wd=0.1)


epoch train_loss valid_loss accuracy time
0 0.123944 0.145212 0.948840 03:18
1 0.072845 0.155692 0.949560 03:00

In [ ]:
learn.save('fwd_clas')

The same but backwards

Then we do the same thing for the backward model; the only things to adjust are the names of the data object and the fine-tuned encoder we load.


In [ ]:
learn_bwd = text_classifier_learner(data_clas_bwd, AWD_LSTM, drop_mult=0.5, pretrained=False)
learn_bwd.load_encoder('bwd_enc')

In [ ]:
learn_bwd.fit_one_cycle(1, lr, moms=(0.8,0.7), wd=0.1)


epoch train_loss valid_loss accuracy time
0 0.306541 0.211134 0.919360 01:16

In [ ]:
learn_bwd.freeze_to(-2)
lr /= 2
learn_bwd.fit_one_cycle(1, slice(lr/(2.6**4),lr), moms=(0.8,0.7), wd=0.1)


epoch train_loss valid_loss accuracy time
0 0.213522 0.156620 0.942960 01:34

In [ ]:
learn_bwd.freeze_to(-3)
lr /= 2
learn_bwd.fit_one_cycle(1, slice(lr/(2.6**4),lr), moms=(0.8,0.7), wd=0.1)


epoch train_loss valid_loss accuracy time
0 0.184735 0.149036 0.946360 02:26

In [ ]:
learn_bwd.unfreeze()
lr /= 5
learn_bwd.fit_one_cycle(2, slice(lr/(2.6**4),lr), moms=(0.8,0.7), wd=0.1)


epoch train_loss valid_loss accuracy time
0 0.142983 0.142495 0.948320 03:01
1 0.093098 0.152614 0.947400 02:44

In [ ]:
learn_bwd.save('bwd_clas')

Ensembling the two models

For our final results, we'll take the average of the predictions of the forward and backward models. Since the samples are sorted by text length for batching, we pass the argument ordered=True to get the predictions in the order of the texts.


In [ ]:
pred_fwd,lbl_fwd = learn.get_preds(ordered=True)

In [ ]:
pred_bwd,lbl_bwd = learn_bwd.get_preds(ordered=True)
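
Optionally (not in the original notebook), you can check how each model does on its own before averaging:

accuracy(pred_fwd, lbl_fwd), accuracy(pred_bwd, lbl_bwd)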

In [ ]:
final_pred = (pred_fwd+pred_bwd)/2

In [ ]:
accuracy(final_pred, lbl_fwd)


Out[ ]:
tensor(0.9539)

And we get the 95.4% accuracy reported in the paper!