Using FastText via Gensim

This tutorial shows how to use the fastText model in Gensim. There are two ways to do so: Gensim's native implementation of fastText, and Gensim's wrapper around fastText's original C++ code. Here, we'll train word-embedding models, save and load them, and perform similarity operations and vector lookups analogous to those in Word2Vec.

When to use FastText?

The main principle behind fastText is that the morphological structure of a word carries important information about its meaning. Traditional word embeddings ignore this structure because they train a unique embedding for every individual word. This matters most for morphologically rich languages (German, Turkish), in which a single word can have a large number of morphological forms, each of which may occur rarely, making it hard to train good word embeddings.
fastText attempts to solve this by treating each word as the aggregation of its subwords. For the sake of simplicity and language-independence, subwords are taken to be the character ngrams of the word. The vector for a word is simply taken to be the sum of all vectors of its component char-ngrams.
According to a detailed comparison of Word2Vec and FastText in this notebook, fastText does significantly better than the original Word2Vec on syntactic tasks, especially when the training corpus is small. Word2Vec slightly outperforms FastText on semantic tasks, though. The differences shrink as the size of the training corpus increases. Training time for fastText is significantly higher than for the Gensim version of Word2Vec (15min 42s vs 6min 42s on text8, 17 million tokens, 5 epochs, and a vector size of 100).
fastText can be used to obtain vectors for out-of-vocabulary (OOV) words, by summing up vectors for its component char-ngrams, provided at least one of the char-ngrams was present in the training data.
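
The subword decomposition described above can be sketched in a few lines of plain Python. This is an illustration only, not Gensim's or fastText's actual code: the helper name `char_ngrams` is hypothetical, and wrapping the word in `<` and `>` boundary markers is an assumption taken from the fastText paper rather than from this tutorial.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` with lengths min_n..max_n.

    Assumption: the word is padded with '<' and '>' so that prefixes and
    suffixes get n-grams distinct from word-internal character sequences.
    """
    padded = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(padded) - n + 1):
            ngrams.append(padded[start:start + n])
    return ngrams


print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Note that when `max_n` is smaller than `min_n` the outer loop never runs, so the helper returns an empty list and no subword information is available.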

Training models

For the following examples, we'll use the Lee Corpus (which you already have if you've installed gensim) for training our model.

To use the wrapper, you need fastText set up locally to be able to train models. See the installation instructions for fastText if you don't have it installed already.

Using Gensim's implementation of fastText


In [1]:
import gensim
import os
from gensim.models.word2vec import LineSentence
from gensim.models.fasttext import FastText as FT_gensim

# Set file names for train and test data
data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data') + os.sep
lee_train_file = data_dir + 'lee_background.cor'
lee_data = LineSentence(lee_train_file)

model_gensim = FT_gensim(size=100)

# build the vocabulary
model_gensim.build_vocab(lee_data)

# train the model
model_gensim.train(lee_data, total_examples=model_gensim.corpus_count, epochs=model_gensim.iter)

print(model_gensim)


Using TensorFlow backend.
FastText(vocab=1763, size=100, alpha=0.025)

Using wrapper for fastText's C++ code


In [2]:
from gensim.models.wrappers.fasttext import FastText as FT_wrapper

# Set FastText home to the path to the FastText executable
ft_home = '/home/chinmaya/GSOC/Gensim/fastText/fasttext'

# train the model
model_wrapper = FT_wrapper.train(ft_home, lee_train_file)

print(model_wrapper)


FastText(vocab=1763, size=100, alpha=0.025)

Training hyperparameters

Hyperparameters for training the model follow the same pattern as Word2Vec. FastText supports the following parameters from the original word2vec -

 - model: Training architecture. Allowed values: `cbow`, `skipgram` (Default `cbow`)
 - size: Size of embeddings to be learnt (Default 100)
 - alpha: Initial learning rate (Default 0.025)
 - window: Context window size (Default 5)
 - min_count: Ignore words with number of occurrences below this (Default 5)
 - loss: Training objective. Allowed values: `ns`, `hs`, `softmax` (Default `ns`)
 - sample: Threshold for downsampling higher-frequency words (Default 0.001)
 - negative: Number of negative words to sample, for `ns` (Default 5)
 - iter: Number of epochs (Default 5)
 - sorted_vocab: Sort vocab by descending frequency (Default 1)
 - threads: Number of threads to use (Default 12)

FastText also has three parameters of its own -

- min_n: min length of char ngrams (Default 3)
- max_n: max length of char ngrams (Default 6)
- bucket: number of buckets used for hashing ngrams (Default 2000000)

Parameters min_n and max_n control the lengths of the character ngrams that each word is broken down into while training and looking up embeddings. If max_n is set to 0, or to a value less than min_n, no character ngrams are used, and the model effectively reduces to Word2Vec.

To bound the memory requirements of the model being trained, a hashing function is used that maps ngrams to integers in 1 to K. For hashing these character sequences, the Fowler-Noll-Vo hashing function (FNV-1a variant) is employed.
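
To make the hashing step concrete, here is a minimal sketch of the standard 32-bit FNV-1a hash followed by a modulo into K buckets. The function names, the 32-bit width, the UTF-8 encoding and the 0-based bucket index (0 to K-1 rather than 1 to K) are assumptions made for illustration; the actual library code may differ in details such as how non-ASCII bytes are handled.

```python
FNV_OFFSET_BASIS_32 = 2166136261  # standard FNV 32-bit offset basis
FNV_PRIME_32 = 16777619           # standard FNV 32-bit prime


def fnv1a_32(data):
    """Standard 32-bit FNV-1a over a byte string: XOR the byte, then multiply."""
    h = FNV_OFFSET_BASIS_32
    for byte in bytearray(data):
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep the value within 32 bits
    return h


def ngram_bucket(ngram, num_buckets=2000000):
    """Map a character n-gram to a bucket index in [0, num_buckets)."""
    return fnv1a_32(ngram.encode("utf-8")) % num_buckets


print(ngram_bucket("<ni"))
```

Because different ngrams can land in the same bucket, a smaller `bucket` value saves memory at the cost of more ngrams sharing a vector.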

Note: As with Word2Vec, you can continue training a model when using Gensim's native implementation of fastText. Continuing training is not supported for fastText models trained via the wrapper.

Saving/loading models

Models can be saved and loaded via the load and save methods.


In [3]:
# saving a model trained via Gensim's fastText implementation
model_gensim.save('saved_model_gensim')
loaded_model = FT_gensim.load('saved_model_gensim')
print(loaded_model)

# saving a model trained via fastText wrapper
model_wrapper.save('saved_model_wrapper')
loaded_model = FT_wrapper.load('saved_model_wrapper')
print(loaded_model)


FastText(vocab=1763, size=100, alpha=0.025)
FastText(vocab=1763, size=100, alpha=0.025)

The save_word2vec_format method causes the vectors for ngrams to be lost. As a result, a model loaded from that format will behave as a regular word2vec model.

Word vector lookup

Note: Operations like word vector lookups and similarity queries work in exactly the same way for both implementations of fastText, so they are demonstrated here using only the fastText wrapper.

FastText models support vector lookups for out-of-vocabulary words by summing up the vectors of the character ngrams belonging to the word.


In [4]:
print('night' in model_wrapper.wv.vocab)
print('nights' in model_wrapper.wv.vocab)
print(model_wrapper['night'])
print(model_wrapper['nights'])


True
False
[ 0.60971916  0.66131264  0.09225323  0.28898761  0.34161603  0.06163925
 -0.10147806 -0.18834428 -0.26355353  0.46417126  0.20428349  0.08414238
 -0.61960417 -0.2977576  -0.22102182  0.14144184  0.13698931 -0.24608244
 -0.58096874  0.3039414   0.18766184  0.38110724  0.11518024 -0.75747257
 -0.275776   -0.42740449 -0.00725944 -0.24556711  0.41061676  0.05050014
 -0.71367824  0.05223881 -0.07810796  0.22933683  0.43850809  0.06360656
  0.43815458  0.11096461  0.29619065  0.38061273  0.26262566 -0.07368335
  0.33198604 -0.1431711  -0.04876067 -0.35243919  0.18561274 -0.70321769
 -0.16492438 -0.28362423  0.08294757  0.49758917 -0.17844993 -0.02241638
  0.18489315  0.01197879 -0.22931916  0.45774016 -0.40240806 -0.16401663
 -0.07500558  0.06775728  0.14273891  0.39902335  0.1906638   0.14533612
 -0.70275193 -0.64343351 -0.18003808  0.45082757 -0.42847934  0.23554228
  0.03722449 -0.0726353  -0.20106563 -0.85182953  0.16529776  0.2167791
  0.01655668 -0.45087481  0.44368106  0.94318634  0.3191022  -0.78148538
  0.06931634 -0.02454508 -0.07709292  0.00889531  0.41768485 -0.4333123
  0.57354093  0.40387386  0.50435936  0.15307237  0.41140166  0.09306428
 -0.6406759  -0.00130932  0.01818158  0.05408832]
[ 0.57120456  0.61710706  0.08425266  0.28013577  0.30789921  0.08454974
 -0.05984595 -0.14644302 -0.23369177  0.42689164  0.18699257  0.09090185
 -0.57885733 -0.28756606 -0.20198511  0.12675938  0.14102744 -0.22880791
 -0.52516965  0.27686313  0.19865591  0.33872125  0.11230565 -0.74198454
 -0.28486362 -0.40490177 -0.00606945 -0.18761727  0.40040097  0.06941447
 -0.70890718  0.03646363 -0.0598574   0.19175974  0.4242314   0.05878129
  0.41432344  0.10394377  0.2668701   0.38148809  0.2761937  -0.06951485
  0.34113405 -0.12189032 -0.05861677 -0.33032765  0.16585448 -0.65862278
 -0.18381383 -0.28438907  0.08867586  0.46635329 -0.18801565 -0.01610042
  0.1940661   0.03761584 -0.21442287  0.41826423 -0.38097134 -0.15111094
 -0.08636253  0.07374192  0.12731727  0.40068088  0.18576843  0.13244282
 -0.64814759 -0.62510144 -0.17045424  0.44949777 -0.39068545  0.19102012
  0.03177847 -0.06673145 -0.17997442 -0.81052922  0.15459165  0.21476634
 -0.01961387 -0.43806009  0.40781115  0.88663652  0.29360816 -0.74157697
  0.04686275 -0.0396045  -0.06810026  0.00260469  0.40505417 -0.39977569
  0.5443192   0.38472273  0.48665705  0.12033045  0.40395209  0.10123577
 -0.6243847  -0.02460667  0.00828873  0.04089492]

The word vector lookup operation only works if at least one of the component character ngrams is present in the training corpus. For example -


In [5]:
# Raises a KeyError since none of the character ngrams of the word `axe` are present in the training data
model_wrapper['axe']


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-5-6f1222595518> in <module>()
      1 # Raises a KeyError since none of the character ngrams of the word `axe` are present in the training data
----> 2 model_wrapper['axe']

/home/chinmaya/GSOC/Gensim/gensim/gensim/models/word2vec.pyc in __getitem__(self, words)
   1280         Refer to the documentation for `gensim.models.KeyedVectors.__getitem__`
   1281         """
-> 1282         return self.wv.__getitem__(words)
   1283 
   1284     def __contains__(self, word):

/home/chinmaya/GSOC/Gensim/gensim/gensim/models/keyedvectors.pyc in __getitem__(self, words)
    587         if isinstance(words, string_types):
    588             # allow calls like trained_model['office'], as a shorthand for trained_model[['office']]
--> 589             return self.word_vec(words)
    590 
    591         return vstack([self.word_vec(word) for word in words])

/home/chinmaya/GSOC/Gensim/gensim/gensim/models/wrappers/fasttext.pyc in word_vec(self, word, use_norm)
     92                 return word_vec / len(ngrams)
     93             else: # No ngrams of the word are present in self.ngrams
---> 94                 raise KeyError('all ngrams for word %s absent from model' % word)
     95 
     96     def init_sims(self, replace=False):

KeyError: 'all ngrams for word axe absent from model'
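
The traceback above hints at how the lookup works: the vectors of the word's ngrams that the model knows are added up and divided by their count, and a KeyError is raised when none of them is known. The following toy version mirrors that logic under stated assumptions: `char_ngrams`, `oov_vector` and the two-dimensional `toy_ngram_vectors` are hypothetical names and made-up values, not part of Gensim.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the word, padded with '<' and '>' (an assumption)."""
    padded = "<" + word + ">"
    return [padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]


def oov_vector(word, ngram_vectors, min_n=3, max_n=6):
    """Average the vectors of those n-grams of `word` that the model knows."""
    known = [ng for ng in char_ngrams(word, min_n, max_n) if ng in ngram_vectors]
    if not known:
        raise KeyError('all ngrams for word %s absent from model' % word)
    dim = len(ngram_vectors[known[0]])
    total = [0.0] * dim
    for ng in known:
        for k, value in enumerate(ngram_vectors[ng]):
            total[k] += value
    return [value / len(known) for value in total]


# Hand-made n-gram vectors, for illustration only.
toy_ngram_vectors = {"<ni": [1.0, 0.0], "igh": [0.0, 1.0], "ght": [1.0, 1.0]}

print(oov_vector("nights", toy_ngram_vectors))  # built from '<ni', 'igh', 'ght'

try:
    oov_vector("axe", toy_ngram_vectors)
except KeyError as err:
    print("lookup failed:", err)
```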

The in operation works slightly differently from the original word2vec. It tests whether a vector can be obtained for the given word, not whether the word is present in the word vocabulary. To test whether a word is present in the training word vocabulary, check against wv.vocab instead -


In [6]:
# Tests if word present in vocab
print("word" in model_wrapper.wv.vocab)
# Tests if vector present for word
print("word" in model_wrapper)


False
True

Similarity operations

Similarity operations work the same way as word2vec. Out-of-vocabulary words can also be used, provided they have at least one character ngram present in the training data.


In [7]:
print("nights" in model_wrapper.wv.vocab)
print("night" in model_wrapper.wv.vocab)
model_wrapper.similarity("night", "nights")


False
True
Out[7]:
0.9988949391617723

Syntactically similar words generally have high similarity in fastText models, since a large number of the component char-ngrams will be the same. As a result, fastText generally does better at syntactic tasks than Word2Vec. A detailed comparison is provided here.
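
The amount of overlap is easy to see by listing the ngrams two related words share. The sketch below is illustrative only: `char_ngram_set` is a hypothetical helper, and the `<`/`>` boundary markers and the 3 to 6 length range are assumptions matching the defaults listed earlier.

```python
def char_ngram_set(word, min_n=3, max_n=6):
    """Set of character n-grams of the word, padded with '<' and '>'."""
    padded = "<" + word + ">"
    return {padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)}


night = char_ngram_set("night")
nights = char_ngram_set("nights")
shared = night & nights

# Number of n-grams in each word, and how many they have in common.
print(len(night), len(nights), len(shared))
print(sorted(shared))
```

Under these assumptions, "night" and "nights" share 10 of their 14 and 18 ngrams respectively, which is why their vectors end up so close.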

Other similarity operations


In [8]:
# The example training corpus is a toy corpus, so results are not expected to be good; this is a proof of concept only
model_wrapper.most_similar("nights")


Out[8]:
[(u'bowler', 0.9999216198921204),
 (u'flights', 0.999881386756897),
 (u'dozens', 0.9998700618743896),
 (u'each', 0.9998670220375061),
 (u'weather', 0.9998487234115601),
 (u'technology', 0.999805748462677),
 (u'acting', 0.9998006820678711),
 (u'dollars', 0.999785840511322),
 (u'place,', 0.9997731447219849),
 (u'custody', 0.9997485280036926)]

In [9]:
model_wrapper.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])


Out[9]:
0.99936318443348537

In [10]:
model_wrapper.doesnt_match("breakfast cereal dinner lunch".split())


Out[10]:
'dinner'

In [11]:
model_wrapper.most_similar(positive=['baghdad', 'england'], negative=['london'])


Out[11]:
[(u'September', 0.9997114539146423),
 (u'Rafter', 0.9996863007545471),
 (u'New', 0.999636709690094),
 (u'after', 0.9996317625045776),
 (u'day', 0.9996190071105957),
 (u'After', 0.9996107816696167),
 (u'against', 0.9996088743209839),
 (u'Robert', 0.9996023178100586),
 (u'attacks', 0.9995726346969604),
 (u'States', 0.9995641112327576)]

In [12]:
question_file_path = data_dir + 'questions-words.txt'

model_wrapper.accuracy(questions=question_file_path)


family: 0.0% (0/2)
gram3-comparative: 0.0% (0/12)
gram4-superlative: 0.0% (0/12)
gram5-present-participle: 0.0% (0/20)
gram6-nationality-adjective: 0.0% (0/20)
gram7-past-tense: 0.0% (0/20)
gram8-plural: 0.0% (0/12)
total: 0.0% (0/98)
Out[12]:
[{'correct': [], 'incorrect': [], 'section': u'capital-common-countries'},
 {'correct': [], 'incorrect': [], 'section': u'capital-world'},
 {'correct': [], 'incorrect': [], 'section': u'currency'},
 {'correct': [], 'incorrect': [], 'section': u'city-in-state'},
 {'correct': [],
  'incorrect': [(u'HE', u'SHE', u'HIS', u'HER'),
   (u'HIS', u'HER', u'HE', u'SHE')],
  'section': u'family'},
 {'correct': [], 'incorrect': [], 'section': u'gram1-adjective-to-adverb'},
 {'correct': [], 'incorrect': [], 'section': u'gram2-opposite'},
 {'correct': [],
  'incorrect': [(u'GOOD', u'BETTER', u'GREAT', u'GREATER'),
   (u'GOOD', u'BETTER', u'LONG', u'LONGER'),
   (u'GOOD', u'BETTER', u'LOW', u'LOWER'),
   (u'GREAT', u'GREATER', u'LONG', u'LONGER'),
   (u'GREAT', u'GREATER', u'LOW', u'LOWER'),
   (u'GREAT', u'GREATER', u'GOOD', u'BETTER'),
   (u'LONG', u'LONGER', u'LOW', u'LOWER'),
   (u'LONG', u'LONGER', u'GOOD', u'BETTER'),
   (u'LONG', u'LONGER', u'GREAT', u'GREATER'),
   (u'LOW', u'LOWER', u'GOOD', u'BETTER'),
   (u'LOW', u'LOWER', u'GREAT', u'GREATER'),
   (u'LOW', u'LOWER', u'LONG', u'LONGER')],
  'section': u'gram3-comparative'},
 {'correct': [],
  'incorrect': [(u'BIG', u'BIGGEST', u'GOOD', u'BEST'),
   (u'BIG', u'BIGGEST', u'GREAT', u'GREATEST'),
   (u'BIG', u'BIGGEST', u'LARGE', u'LARGEST'),
   (u'GOOD', u'BEST', u'GREAT', u'GREATEST'),
   (u'GOOD', u'BEST', u'LARGE', u'LARGEST'),
   (u'GOOD', u'BEST', u'BIG', u'BIGGEST'),
   (u'GREAT', u'GREATEST', u'LARGE', u'LARGEST'),
   (u'GREAT', u'GREATEST', u'BIG', u'BIGGEST'),
   (u'GREAT', u'GREATEST', u'GOOD', u'BEST'),
   (u'LARGE', u'LARGEST', u'BIG', u'BIGGEST'),
   (u'LARGE', u'LARGEST', u'GOOD', u'BEST'),
   (u'LARGE', u'LARGEST', u'GREAT', u'GREATEST')],
  'section': u'gram4-superlative'},
 {'correct': [],
  'incorrect': [(u'GO', u'GOING', u'LOOK', u'LOOKING'),
   (u'GO', u'GOING', u'PLAY', u'PLAYING'),
   (u'GO', u'GOING', u'RUN', u'RUNNING'),
   (u'GO', u'GOING', u'SAY', u'SAYING'),
   (u'LOOK', u'LOOKING', u'PLAY', u'PLAYING'),
   (u'LOOK', u'LOOKING', u'RUN', u'RUNNING'),
   (u'LOOK', u'LOOKING', u'SAY', u'SAYING'),
   (u'LOOK', u'LOOKING', u'GO', u'GOING'),
   (u'PLAY', u'PLAYING', u'RUN', u'RUNNING'),
   (u'PLAY', u'PLAYING', u'SAY', u'SAYING'),
   (u'PLAY', u'PLAYING', u'GO', u'GOING'),
   (u'PLAY', u'PLAYING', u'LOOK', u'LOOKING'),
   (u'RUN', u'RUNNING', u'SAY', u'SAYING'),
   (u'RUN', u'RUNNING', u'GO', u'GOING'),
   (u'RUN', u'RUNNING', u'LOOK', u'LOOKING'),
   (u'RUN', u'RUNNING', u'PLAY', u'PLAYING'),
   (u'SAY', u'SAYING', u'GO', u'GOING'),
   (u'SAY', u'SAYING', u'LOOK', u'LOOKING'),
   (u'SAY', u'SAYING', u'PLAY', u'PLAYING'),
   (u'SAY', u'SAYING', u'RUN', u'RUNNING')],
  'section': u'gram5-present-participle'},
 {'correct': [],
  'incorrect': [(u'AUSTRALIA', u'AUSTRALIAN', u'FRANCE', u'FRENCH'),
   (u'AUSTRALIA', u'AUSTRALIAN', u'INDIA', u'INDIAN'),
   (u'AUSTRALIA', u'AUSTRALIAN', u'ISRAEL', u'ISRAELI'),
   (u'AUSTRALIA', u'AUSTRALIAN', u'SWITZERLAND', u'SWISS'),
   (u'FRANCE', u'FRENCH', u'INDIA', u'INDIAN'),
   (u'FRANCE', u'FRENCH', u'ISRAEL', u'ISRAELI'),
   (u'FRANCE', u'FRENCH', u'SWITZERLAND', u'SWISS'),
   (u'FRANCE', u'FRENCH', u'AUSTRALIA', u'AUSTRALIAN'),
   (u'INDIA', u'INDIAN', u'ISRAEL', u'ISRAELI'),
   (u'INDIA', u'INDIAN', u'SWITZERLAND', u'SWISS'),
   (u'INDIA', u'INDIAN', u'AUSTRALIA', u'AUSTRALIAN'),
   (u'INDIA', u'INDIAN', u'FRANCE', u'FRENCH'),
   (u'ISRAEL', u'ISRAELI', u'SWITZERLAND', u'SWISS'),
   (u'ISRAEL', u'ISRAELI', u'AUSTRALIA', u'AUSTRALIAN'),
   (u'ISRAEL', u'ISRAELI', u'FRANCE', u'FRENCH'),
   (u'ISRAEL', u'ISRAELI', u'INDIA', u'INDIAN'),
   (u'SWITZERLAND', u'SWISS', u'AUSTRALIA', u'AUSTRALIAN'),
   (u'SWITZERLAND', u'SWISS', u'FRANCE', u'FRENCH'),
   (u'SWITZERLAND', u'SWISS', u'INDIA', u'INDIAN'),
   (u'SWITZERLAND', u'SWISS', u'ISRAEL', u'ISRAELI')],
  'section': u'gram6-nationality-adjective'},
 {'correct': [],
  'incorrect': [(u'GOING', u'WENT', u'PAYING', u'PAID'),
   (u'GOING', u'WENT', u'PLAYING', u'PLAYED'),
   (u'GOING', u'WENT', u'SAYING', u'SAID'),
   (u'GOING', u'WENT', u'TAKING', u'TOOK'),
   (u'PAYING', u'PAID', u'PLAYING', u'PLAYED'),
   (u'PAYING', u'PAID', u'SAYING', u'SAID'),
   (u'PAYING', u'PAID', u'TAKING', u'TOOK'),
   (u'PAYING', u'PAID', u'GOING', u'WENT'),
   (u'PLAYING', u'PLAYED', u'SAYING', u'SAID'),
   (u'PLAYING', u'PLAYED', u'TAKING', u'TOOK'),
   (u'PLAYING', u'PLAYED', u'GOING', u'WENT'),
   (u'PLAYING', u'PLAYED', u'PAYING', u'PAID'),
   (u'SAYING', u'SAID', u'TAKING', u'TOOK'),
   (u'SAYING', u'SAID', u'GOING', u'WENT'),
   (u'SAYING', u'SAID', u'PAYING', u'PAID'),
   (u'SAYING', u'SAID', u'PLAYING', u'PLAYED'),
   (u'TAKING', u'TOOK', u'GOING', u'WENT'),
   (u'TAKING', u'TOOK', u'PAYING', u'PAID'),
   (u'TAKING', u'TOOK', u'PLAYING', u'PLAYED'),
   (u'TAKING', u'TOOK', u'SAYING', u'SAID')],
  'section': u'gram7-past-tense'},
 {'correct': [],
  'incorrect': [(u'BUILDING', u'BUILDINGS', u'CAR', u'CARS'),
   (u'BUILDING', u'BUILDINGS', u'CHILD', u'CHILDREN'),
   (u'BUILDING', u'BUILDINGS', u'MAN', u'MEN'),
   (u'CAR', u'CARS', u'CHILD', u'CHILDREN'),
   (u'CAR', u'CARS', u'MAN', u'MEN'),
   (u'CAR', u'CARS', u'BUILDING', u'BUILDINGS'),
   (u'CHILD', u'CHILDREN', u'MAN', u'MEN'),
   (u'CHILD', u'CHILDREN', u'BUILDING', u'BUILDINGS'),
   (u'CHILD', u'CHILDREN', u'CAR', u'CARS'),
   (u'MAN', u'MEN', u'BUILDING', u'BUILDINGS'),
   (u'MAN', u'MEN', u'CAR', u'CARS'),
   (u'MAN', u'MEN', u'CHILD', u'CHILDREN')],
  'section': u'gram8-plural'},
 {'correct': [], 'incorrect': [], 'section': u'gram9-plural-verbs'},
 {'correct': [],
  'incorrect': [(u'HE', u'SHE', u'HIS', u'HER'),
   (u'HIS', u'HER', u'HE', u'SHE'),
   (u'GOOD', u'BETTER', u'GREAT', u'GREATER'),
   (u'GOOD', u'BETTER', u'LONG', u'LONGER'),
   (u'GOOD', u'BETTER', u'LOW', u'LOWER'),
   (u'GREAT', u'GREATER', u'LONG', u'LONGER'),
   (u'GREAT', u'GREATER', u'LOW', u'LOWER'),
   (u'GREAT', u'GREATER', u'GOOD', u'BETTER'),
   (u'LONG', u'LONGER', u'LOW', u'LOWER'),
   (u'LONG', u'LONGER', u'GOOD', u'BETTER'),
   (u'LONG', u'LONGER', u'GREAT', u'GREATER'),
   (u'LOW', u'LOWER', u'GOOD', u'BETTER'),
   (u'LOW', u'LOWER', u'GREAT', u'GREATER'),
   (u'LOW', u'LOWER', u'LONG', u'LONGER'),
   (u'BIG', u'BIGGEST', u'GOOD', u'BEST'),
   (u'BIG', u'BIGGEST', u'GREAT', u'GREATEST'),
   (u'BIG', u'BIGGEST', u'LARGE', u'LARGEST'),
   (u'GOOD', u'BEST', u'GREAT', u'GREATEST'),
   (u'GOOD', u'BEST', u'LARGE', u'LARGEST'),
   (u'GOOD', u'BEST', u'BIG', u'BIGGEST'),
   (u'GREAT', u'GREATEST', u'LARGE', u'LARGEST'),
   (u'GREAT', u'GREATEST', u'BIG', u'BIGGEST'),
   (u'GREAT', u'GREATEST', u'GOOD', u'BEST'),
   (u'LARGE', u'LARGEST', u'BIG', u'BIGGEST'),
   (u'LARGE', u'LARGEST', u'GOOD', u'BEST'),
   (u'LARGE', u'LARGEST', u'GREAT', u'GREATEST'),
   (u'GO', u'GOING', u'LOOK', u'LOOKING'),
   (u'GO', u'GOING', u'PLAY', u'PLAYING'),
   (u'GO', u'GOING', u'RUN', u'RUNNING'),
   (u'GO', u'GOING', u'SAY', u'SAYING'),
   (u'LOOK', u'LOOKING', u'PLAY', u'PLAYING'),
   (u'LOOK', u'LOOKING', u'RUN', u'RUNNING'),
   (u'LOOK', u'LOOKING', u'SAY', u'SAYING'),
   (u'LOOK', u'LOOKING', u'GO', u'GOING'),
   (u'PLAY', u'PLAYING', u'RUN', u'RUNNING'),
   (u'PLAY', u'PLAYING', u'SAY', u'SAYING'),
   (u'PLAY', u'PLAYING', u'GO', u'GOING'),
   (u'PLAY', u'PLAYING', u'LOOK', u'LOOKING'),
   (u'RUN', u'RUNNING', u'SAY', u'SAYING'),
   (u'RUN', u'RUNNING', u'GO', u'GOING'),
   (u'RUN', u'RUNNING', u'LOOK', u'LOOKING'),
   (u'RUN', u'RUNNING', u'PLAY', u'PLAYING'),
   (u'SAY', u'SAYING', u'GO', u'GOING'),
   (u'SAY', u'SAYING', u'LOOK', u'LOOKING'),
   (u'SAY', u'SAYING', u'PLAY', u'PLAYING'),
   (u'SAY', u'SAYING', u'RUN', u'RUNNING'),
   (u'AUSTRALIA', u'AUSTRALIAN', u'FRANCE', u'FRENCH'),
   (u'AUSTRALIA', u'AUSTRALIAN', u'INDIA', u'INDIAN'),
   (u'AUSTRALIA', u'AUSTRALIAN', u'ISRAEL', u'ISRAELI'),
   (u'AUSTRALIA', u'AUSTRALIAN', u'SWITZERLAND', u'SWISS'),
   (u'FRANCE', u'FRENCH', u'INDIA', u'INDIAN'),
   (u'FRANCE', u'FRENCH', u'ISRAEL', u'ISRAELI'),
   (u'FRANCE', u'FRENCH', u'SWITZERLAND', u'SWISS'),
   (u'FRANCE', u'FRENCH', u'AUSTRALIA', u'AUSTRALIAN'),
   (u'INDIA', u'INDIAN', u'ISRAEL', u'ISRAELI'),
   (u'INDIA', u'INDIAN', u'SWITZERLAND', u'SWISS'),
   (u'INDIA', u'INDIAN', u'AUSTRALIA', u'AUSTRALIAN'),
   (u'INDIA', u'INDIAN', u'FRANCE', u'FRENCH'),
   (u'ISRAEL', u'ISRAELI', u'SWITZERLAND', u'SWISS'),
   (u'ISRAEL', u'ISRAELI', u'AUSTRALIA', u'AUSTRALIAN'),
   (u'ISRAEL', u'ISRAELI', u'FRANCE', u'FRENCH'),
   (u'ISRAEL', u'ISRAELI', u'INDIA', u'INDIAN'),
   (u'SWITZERLAND', u'SWISS', u'AUSTRALIA', u'AUSTRALIAN'),
   (u'SWITZERLAND', u'SWISS', u'FRANCE', u'FRENCH'),
   (u'SWITZERLAND', u'SWISS', u'INDIA', u'INDIAN'),
   (u'SWITZERLAND', u'SWISS', u'ISRAEL', u'ISRAELI'),
   (u'GOING', u'WENT', u'PAYING', u'PAID'),
   (u'GOING', u'WENT', u'PLAYING', u'PLAYED'),
   (u'GOING', u'WENT', u'SAYING', u'SAID'),
   (u'GOING', u'WENT', u'TAKING', u'TOOK'),
   (u'PAYING', u'PAID', u'PLAYING', u'PLAYED'),
   (u'PAYING', u'PAID', u'SAYING', u'SAID'),
   (u'PAYING', u'PAID', u'TAKING', u'TOOK'),
   (u'PAYING', u'PAID', u'GOING', u'WENT'),
   (u'PLAYING', u'PLAYED', u'SAYING', u'SAID'),
   (u'PLAYING', u'PLAYED', u'TAKING', u'TOOK'),
   (u'PLAYING', u'PLAYED', u'GOING', u'WENT'),
   (u'PLAYING', u'PLAYED', u'PAYING', u'PAID'),
   (u'SAYING', u'SAID', u'TAKING', u'TOOK'),
   (u'SAYING', u'SAID', u'GOING', u'WENT'),
   (u'SAYING', u'SAID', u'PAYING', u'PAID'),
   (u'SAYING', u'SAID', u'PLAYING', u'PLAYED'),
   (u'TAKING', u'TOOK', u'GOING', u'WENT'),
   (u'TAKING', u'TOOK', u'PAYING', u'PAID'),
   (u'TAKING', u'TOOK', u'PLAYING', u'PLAYED'),
   (u'TAKING', u'TOOK', u'SAYING', u'SAID'),
   (u'BUILDING', u'BUILDINGS', u'CAR', u'CARS'),
   (u'BUILDING', u'BUILDINGS', u'CHILD', u'CHILDREN'),
   (u'BUILDING', u'BUILDINGS', u'MAN', u'MEN'),
   (u'CAR', u'CARS', u'CHILD', u'CHILDREN'),
   (u'CAR', u'CARS', u'MAN', u'MEN'),
   (u'CAR', u'CARS', u'BUILDING', u'BUILDINGS'),
   (u'CHILD', u'CHILDREN', u'MAN', u'MEN'),
   (u'CHILD', u'CHILDREN', u'BUILDING', u'BUILDINGS'),
   (u'CHILD', u'CHILDREN', u'CAR', u'CARS'),
   (u'MAN', u'MEN', u'BUILDING', u'BUILDINGS'),
   (u'MAN', u'MEN', u'CAR', u'CARS'),
   (u'MAN', u'MEN', u'CHILD', u'CHILDREN')],
  'section': 'total'}]

In [13]:
# Word Movers distance
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()

# Remove stopwords (named stop_words so the imported nltk module is not shadowed).
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
sentence_obama = [w for w in sentence_obama if w not in stop_words]
sentence_president = [w for w in sentence_president if w not in stop_words]

# Compute WMD.
distance = model_wrapper.wmdistance(sentence_obama, sentence_president)
distance


Out[13]:
1.1102867164706653
