Preparing German Wikipedia to train a fast.ai (ULMFiT) model for German

(should work with most other languages, too)

Thomas Viehmann tv@lernapparat.de

The core idea of Howard and Ruder's ULMFiT paper (see also https://nlp.fast.ai/) is to pretrain a language model on a large corpus and then fine-tune it for downstream tasks. Naturally, we also want such a pretrained model for German. And happily, I just launched MathInf, a great mathematical modelling, machine learning and actuarial consulting company, which allows me to do this type of research and make it public.

I have very raw information on my blog (and hope to add a more detailed description soon). I'm making this available early at public request and hope it is useful to you for building great things; it is not yet as clean or well-commented as I would love it to be. I would love to hear from you if you make good use of it!

So we take a Wikipedia dump (dewiki-latest-pages-articles.xml.bz2, downloaded from dumps.wikipedia.org and preprocessed into de_wikipedia_extracted with wikiextractor/WikiExtractor.py -s --json -o de_wikipedia_extracted dewiki-latest-pages-articles.xml.bz2) and make token files out of it.

Note that the German Wikipedia contains more tokens (i.e. words) than the recommended 100M for training the language model. I don't cut off much here, but only do that later when loading the tokens to start the training. That is a bit wasteful but follows a "keep as much data as long as you can" approach.
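
To make that concrete, here is a minimal sketch of what such a later cut-off could look like, once the trn_ids.npy file produced at the end of this notebook exists (an illustration only, not the actual training code from my blog):

trn_ids = np.load(work_path/'trn_ids.npy')  # newer numpy versions may need allow_pickle=True
all_tokens = np.concatenate(trn_ids)        # one long stream of token ids
all_tokens = all_tokens[:100_000_000]       # rough cut-off at about 100M tokens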

Credit for all the good things in this notebook likely belongs to Sylvain Gugger (see his notebook) and Jeremy Howard (see the original IMDb notebook from his great course), whose work I built on; all errors are my own.

Enough talk, here is the data preparation.


In [4]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [5]:
from fastai.text import *
import html
import sklearn.model_selection  # used below for the train/validation split
from matplotlib import pyplot
import numpy
import time

In [6]:
BOS = 'xbos'  # beginning-of-sentence tag
FLD = 'xfld'  # data field tag

LANG='de'
datasetpath =  Path('/home/datasets/nlp/wiki/')
# I ran this: wikiextractor/WikiExtractor.py -s --json -o de_wikipedia_extracted dewiki-latest-pages-articles.xml.bz2 
work_path = Path('~/data/nlp/german_lm/data/de_wiki/tmp/').expanduser()
work_path.mkdir(exist_ok=True)

Standardize format

You can skip this entire section if you already have its results (the train.csv and valid.csv files written below); in that case, continue at Tokenize.


In [4]:
LANG_FILENAMES = [str(f) for f in datasetpath.rglob("de_wikipedia_extracted/*/*")]

In [5]:
len(LANG_FILENAMES), LANG_FILENAMES[:5]


Out[5]:
(5330,
 ['/home/datasets/nlp/wiki/de_wikipedia_extracted/AI/wiki_98',
  '/home/datasets/nlp/wiki/de_wikipedia_extracted/AI/wiki_35',
  '/home/datasets/nlp/wiki/de_wikipedia_extracted/AI/wiki_96',
  '/home/datasets/nlp/wiki/de_wikipedia_extracted/AI/wiki_74',
  '/home/datasets/nlp/wiki/de_wikipedia_extracted/AI/wiki_29'])

In [11]:
LANG_TEXT = []
for fn in tqdm(LANG_FILENAMES):
    for line in open(fn, encoding='utf8'):
        LANG_TEXT.append(json.loads(line))
        
LANG_TEXT = pd.DataFrame(LANG_TEXT)


100%|██████████| 5330/5330 [00:28<00:00, 185.25it/s]

In [12]:
LANG_TEXT.head()


Out[12]:
id text title url
0 434091 Homberg (bei Lauterecken)\n\nHomberg ist eine ... Homberg (bei Lauterecken) https://de.wikipedia.org/wiki?curid=434091
1 434093 Nicole Petignat\n\nNicole Petignat (* 27. Okto... Nicole Petignat https://de.wikipedia.org/wiki?curid=434093
2 434098 Schwezow\n\nSchwezow ist der Familienname folg... Schwezow https://de.wikipedia.org/wiki?curid=434098
3 434102 Schukow\n\nSchukow (russ. "Жуков") bzw. Schuko... Schukow https://de.wikipedia.org/wiki?curid=434102
4 434112 Otto Schmidt\n\nOtto Schmidt ist der Name folg... Otto Schmidt https://de.wikipedia.org/wiki?curid=434112

In [13]:
# Getting rid of the title name in the text field
def split_title_from_text(text):
    words = text.split("\n\n", 1)
    if len(words) == 2:
        return words[1]
    else:
        return words[0]
    
LANG_TEXT['text'] = LANG_TEXT['text'].apply(lambda x: split_title_from_text(x))

In [14]:
LANG_TEXT.head()


Out[14]:
id text title url
0 434091 Homberg ist eine Ortsgemeinde im Landkreis Kus... Homberg (bei Lauterecken) https://de.wikipedia.org/wiki?curid=434091
1 434093 Nicole Petignat (* 27. Oktober 1966 in La Chau... Nicole Petignat https://de.wikipedia.org/wiki?curid=434093
2 434098 Schwezow ist der Familienname folgender Person... Schwezow https://de.wikipedia.org/wiki?curid=434098
3 434102 Schukow (russ. "Жуков") bzw. Schukowa (weiblic... Schukow https://de.wikipedia.org/wiki?curid=434102
4 434112 Otto Schmidt ist der Name folgender Personen:\... Otto Schmidt https://de.wikipedia.org/wiki?curid=434112

Determine the article lengths, then keep at most the one million longest articles and only those with at least 2000 characters.


In [25]:
LANG_TEXT['label'] = 0 # dummy
LANG_TEXT['length'] = LANG_TEXT['text'].str.len()

In [16]:
MAX_ARTICLES = 1_000_000
# keep at most 1 million articles and only those of more than 2000 characters
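# the (100 - 100*MAX_ARTICLES/len(LANG_TEXT))-th length percentile is the threshold
# above which (at most) the MAX_ARTICLES longest articles lie; never go below 2000 characters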
MIN_LENGTH_CHARS = max(2000, int(numpy.percentile(LANG_TEXT['length'], 100-min(100*MAX_ARTICLES/len(LANG_TEXT), 100))))
LANG_TEXT = LANG_TEXT[LANG_TEXT['length'] >= MIN_LENGTH_CHARS] # Chars not words...

In [20]:
LANG_TEXT.to_csv(datasetpath/'wiki_de.csv', header=True, index=False) # I must say, I think the header is good! If in doubt, you should listen to Jeremy though.

In [ ]:
LANG_TEXT = pd.read_csv(datasetpath/'wiki_de.csv')

In [22]:
percentages = range(0,110,10)
print ('Article length percentiles' , ', '.join(['{}%: {}'.format(p, int(q))  for p,q in zip(percentages, numpy.percentile(LANG_TEXT['length'], percentages))]))
print ('Number of articles', len(LANG_TEXT))


Article length percentiles 0%: 2000, 10%: 2197, 20%: 2427, 30%: 2699, 40%: 3033, 50%: 3458, 60%: 4030, 70%: 4857, 80%: 6258, 90%: 9442, 100%: 463748
Number of articles 738850

In [23]:
#LANG_TEXT = LANG_TEXT.sort_values(by=['length'], ascending=False)
LANG_TEXT.head()


Out[23]:
id text title url label length
0 434114 Carl Jacob Burckhardt (* 10. September 1891 in... Carl Jacob Burckhardt https://de.wikipedia.org/wiki?curid=434114 0 5021
1 434117 Roscheid ist eine Ortsgemeinde im Eifelkreis B... Roscheid https://de.wikipedia.org/wiki?curid=434117 0 2562
2 434118 Reipeldingen ist eine Ortsgemeinde im westlich... Reipeldingen https://de.wikipedia.org/wiki?curid=434118 0 2228
3 434122 Lichtenborn ist eine Ortsgemeinde im Eifelkrei... Lichtenborn https://de.wikipedia.org/wiki?curid=434122 0 2776
4 434123 Leidenborn ist eine Ortsgemeinde im Eifelkreis... Leidenborn https://de.wikipedia.org/wiki?curid=434123 0 2164

Splitting 10% for validation.


In [26]:
df_trn,df_val = sklearn.model_selection.train_test_split(LANG_TEXT.pipe(lambda x: x[['label', 'text']]), test_size=0.1)

In [33]:
df_trn.to_csv(work_path/'train.csv', header=False, index=False)
df_val.to_csv(work_path/'valid.csv', header=False, index=False)

I always try to produce notebooks that you can run through in one go, so here is my attempt at freeing the memory taken by data we no longer need.


In [ ]:
del LANG_TEXT
import gc
gc.collect()

Tokenize

Note: keep an eye on your memory. I had all my memory allocated (from keeping several copies of Wikipedia in memory) and was swapping massively during the multiprocessing tokenization. My fix was to restart the notebook after I had finished the section above.
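
If you want a quick check of how much headroom you have before starting the multiprocessing tokenization, a small sketch like the following works (psutil is not used anywhere else in this notebook and may need to be installed separately):

import psutil  # optional; only used for this check

avail_gib = psutil.virtual_memory().available / 2**30
print(f'{avail_gib:.1f} GiB of RAM currently available')
# each of the N_CPUS tokenizer processes holds its share of the current chunk,
# so leave a comfortable multiple of the chunk size free before starting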


In [7]:
chunksize = 4000
N_CPUS = num_cpus() # I like to use all cores here; this needs a patch to fastai

In [8]:
re1 = re.compile(r'  +')
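# fixup() below cleans up escaping artifacts from the extraction/CSV round-trip:
# HTML entities whose leading '&' was stripped, literal '\n' sequences, <unk>
# markers and the ' @.@ ' / ' @-@ ' tokens; it then unescapes any remaining HTML
# entities and collapses runs of spaces (re1).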

def fixup(x):
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>','u_n').replace(' @.@ ','.').replace(
        ' @-@ ','-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x))

In [ ]:


In [18]:
df_trn = pd.read_csv(work_path/'train.csv', header=None, chunksize=chunksize)
df_val = pd.read_csv(work_path/'valid.csv', header=None, chunksize=chunksize)

In [ ]:


In [12]:
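# get_texts: prepend the xbos/xfld markers, run fixup, then tokenize the chunk on
# N_CPUS processes; the labels are just the dummy 0 column written to the csv above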
def get_texts(df, n_lbls=1):
    labels = df.iloc[:,range(n_lbls)].values.astype(np.int64)
    texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)
    for i in range(n_lbls+1, len(df.columns)): texts += f' {FLD} {i-n_lbls} ' + df[i].astype(str)
    texts = texts.apply(fixup).values.astype(str)
    #tok = Tokenizer.proc_all(texts, lang=LANG) # use this if you have memory trouble
    tok = Tokenizer.proc_all_mp(partition(texts, (len(texts)+N_CPUS-1)//N_CPUS), lang=LANG, ncpus=N_CPUS)
    return tok, list(labels)

def get_all(df, name, n_lbls=1):
    time_start = time.time()
    for i, r in enumerate(df):
        print("\r", i, end=" ")
        if i > 0:
            print ('time per chunk {}s'.format(int((time.time() - time_start) / i)), end="")
        tok_, labels_ = get_texts(r, n_lbls)
        #save the partial tokens instead of regrouping them in one big array.
        np.save(work_path/f'{name}_tok{i}.npy', tok_)

In [ ]:
get_all(df_trn,'trn',1)


 34 time per chunk 32s

In [14]:
get_all(df_val,'val',1)


 18 time per chunk 31s
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-14-f8776495fb2d> in <module>()
----> 1 get_all(df_val,'val',1)

<ipython-input-12-808cc3b059f9> in get_all(df, name, n_lbls)
     17         #save the partial tokens instead of regrouping them in one big array.
     18         np.save(work_path/f'{name}_tok{i}.npy', tok_)
---> 19     return tok, labels

NameError: name 'tok' is not defined

(The NameError comes from a stray "return tok, labels" line at the end of an earlier version of get_all, as the traceback shows; the per-chunk token files had already been saved at that point, so the error is harmless. The get_all definition above no longer has that line.)

Numericalize

Build the Counter object from all the split files.


In [7]:
def count_them_all(names):
    cnt = Counter()
    for name in names:
        for file in work_path.glob(f'{name}_tok*'):
            tok = np.load(file)
            cnt_tok = Counter(word for sent in tok for word in sent)
            cnt += cnt_tok
    return cnt

In [8]:
cnt = count_them_all(['trn'])

In [9]:
cnt.most_common(25)


Out[9]:
[('.', 32264518),
 (',', 22918938),
 ('der', 19208785),
 ('die', 15844609),
 ('und', 13947739),
 ('in', 10617819),
 ('"', 9494449),
 ('\n\n', 7067205),
 ('von', 6913635),
 ('den', 5662917),
 ('im', 5452056),
 ('des', 4907024),
 ('das', 4734075),
 ('mit', 4716315),
 (')', 4682017),
 ('(', 4665298),
 ('\n', 4523044),
 ('er', 3776986),
 ('dem', 3722644),
 ('als', 3612743),
 ('wurde', 3601000),
 ('zu', 3561187),
 ('auf', 3324672),
 ('eine', 3083051),
 ('für', 3066354)]

In [10]:
max_vocab = 60000
min_freq = 5

In [11]:
itos = [o for o,c in cnt.most_common(max_vocab) if c > min_freq]
itos.insert(0,'_pad_')
itos.insert(0,'_unk_')
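# itos[0] is now '_unk_' and itos[1] is '_pad_'; tokens occurring at most min_freq times were dropped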

In [15]:
len(itos)
pickle.dump(itos, open(work_path/'itos.pkl', 'wb'))

In [11]:
stoi = collections.defaultdict(int,{s:i for (i,s) in enumerate(itos)})
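# defaultdict(int) maps every token that is not in itos to 0, i.e. to '_unk_'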

Numericalize each partial file.


In [12]:
def numericalize(name):
    results = []
    for file in tqdm(work_path.glob(f'{name}_tok*')):
        tok = np.load(file)
        results.append(np.array([[stoi[word] for word in sent] for sent in tok]))
    return np.concatenate(results)

In [13]:
trn_ids = numericalize('trn')
np.save(work_path/'trn_ids.npy', trn_ids)


166it [03:04,  1.11s/it]

In [14]:
val_ids = numericalize('val')
np.save(work_path/'val_ids.npy', val_ids)


19it [00:25,  1.33s/it]

So now you have great dumps to use with the training program I published on my blog.

As always, I would be honored by your feedback at tv@lernapparat.de. I read and appreciate every mail.

Thomas


In [ ]: