In this tutorial, we will learn how to apply Doc2vec using gensim by recreating the results of Le and Mikolov 2014.
Early state-of-the-art document representations were based on the bag-of-words model, which represents each input document as a fixed-length vector. For example, borrowing from the Wikipedia article, the two documents
(1) John likes to watch movies. Mary likes movies too.
(2) John also likes to watch football games.
are used to construct a length 10 list of words
["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games"]
so that we can then represent each document as a fixed-length vector whose elements are the frequencies of the corresponding words in our list
(1) [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
(2) [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
Bag-of-words models are surprisingly effective, but they lose all information about word order. Bag-of-n-grams models, which count word phrases of length n instead of single words, capture some local word order while still producing fixed-length vectors, but they suffer from data sparsity and high dimensionality.
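To make the counting concrete, here is a minimal sketch (not part of the original experiment) that reproduces the two vectors above in plain Python; the document strings and the 10-word list are just the toy examples from this section.
docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]
vocab = ["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games"]

def bow_vector(doc, vocab):
    # Count how often each vocabulary word occurs in the document
    tokens = doc.replace(".", "").split()
    return [tokens.count(term) for term in vocab]

for doc in docs:
    print(bow_vector(doc, vocab))
# -> [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
# -> [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]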
Word2Vec
Word2Vec is a more recent model that embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word-vectors where vectors close together in vector space have similar meanings based on context, and word-vectors distant from each other have differing meanings. For example, strong and powerful would be close together, while strong and Paris would be relatively far apart. There are two versions of this model, based on skip-grams (SG) and continuous-bag-of-words (CBOW), both implemented by the gensim Word2Vec class.
Word2Vec - Skip-gram Model
The skip-gram Word2Vec model takes in pairs (word1, word2) generated by moving a window across the text data, and trains a 1-hidden-layer neural network on the synthetic task: given an input word, predict a probability distribution over the words near it. A virtual one-hot encoding of the input word passes through a 'projection layer' to the hidden layer; these projection weights are later interpreted as the word embeddings. So if the hidden layer has 300 neurons, the network yields 300-dimensional word embeddings.
Word2Vec - Continuous-bag-of-words Model
Continuous-bag-of-words Word2Vec is very similar to the skip-gram model. It is also a 1-hidden-layer neural network, but the synthetic training task now uses the average of multiple input context words, rather than a single word as in skip-gram, to predict the center word. Again, the projection weights that turn one-hot words into averageable vectors, of the same width as the hidden layer, are interpreted as the word embeddings.
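As an aside, here is a hedged sketch of training a skip-gram Word2Vec model in gensim on a tiny made-up corpus; the corpus and parameter values are illustrative only (and only arguments whose names are stable across gensim versions are passed), not part of the experiment below.
from gensim.models import Word2Vec

toy_corpus = [
    ["strong", "powerful", "leader"],
    ["strong", "powerful", "army"],
    ["paris", "france", "capital"],
    ["paris", "eiffel", "tower"],
]
# sg=1 selects the skip-gram architecture; sg=0 would select CBOW
w2v = Word2Vec(toy_corpus, sg=1, window=2, min_count=1, seed=1)
print(w2v.wv.most_similar("strong", topn=2))  # neighbors are meaningless on a corpus this small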
But, Word2Vec doesn't yet get us fixed-size vectors for longer texts.
Doc2Vec
The straightforward approach of averaging each of a text's words' word-vectors creates a quick and crude document-vector that can often be useful. However, Le and Mikolov in 2014 introduced the Paragraph Vector, which usually outperforms such simple-averaging.
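As a point of comparison, here is a hedged sketch of that simple-averaging baseline, assuming w2v is any trained Word2Vec model (for instance the toy one sketched above); out-of-vocabulary words are simply skipped.
import numpy as np

def average_doc_vector(words, keyed_vectors):
    # Crude document vector: the mean of the word-vectors of the in-vocabulary words
    vecs = [keyed_vectors[w] for w in words if w in keyed_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(keyed_vectors.vector_size)

doc_vec = average_doc_vector(["strong", "powerful", "army"], w2v.wv)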
The basic idea is: act as if a document has another floating word-like vector, which contributes to all training predictions and is updated like other word-vectors, but which we will call a doc-vector. Gensim's Doc2Vec class implements this algorithm.
Paragraph Vector - Distributed Memory (PV-DM)
This is the Paragraph Vector model analogous to Word2Vec CBOW. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a center word based on an average of both context word-vectors and the full document's doc-vector.
Paragraph Vector - Distributed Bag of Words (PV-DBOW)
This is the Paragraph Vector model analogous to Word2Vec SG. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a target word just from the full document's doc-vector. (It is also common to combine this with skip-gram training, using both the doc-vector and nearby word-vectors to predict a single target word, but only one word at a time.)
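Before the full IMDB experiment, here is a minimal, hedged sketch of the gensim Doc2Vec API on a throwaway corpus; the documents, tags, and parameters are purely illustrative.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

toy_docs = [
    TaggedDocument(words=["john", "likes", "movies"], tags=[0]),
    TaggedDocument(words=["john", "likes", "football"], tags=[1]),
]
# dm=0 selects PV-DBOW; dm=1 would select PV-DM
toy_model = Doc2Vec(toy_docs, dm=0, min_count=1, epochs=10)
new_vec = toy_model.infer_vector(["mary", "likes", "movies"])  # fixed-size vector for unseen text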
The following python modules are dependencies for this tutorial:
- testfixtures (pip install testfixtures)
- statsmodels (pip install statsmodels)
Let's download the IMDB archive if it is not already downloaded (84 MB). This will be our text data for this tutorial.
The data can be found here: http://ai.stanford.edu/~amaas/data/sentiment/
This cell will only reattempt steps (such as downloading the compressed data) if their output isn't already present, so it is safe to re-run until it completes successfully.
In [1]:
%%time
import locale
import glob
import os.path
import requests
import tarfile
import sys
import codecs
from smart_open import smart_open
import re

dirname = 'aclImdb'
filename = 'aclImdb_v1.tar.gz'
locale.setlocale(locale.LC_ALL, 'C')
all_lines = []

if sys.version > '3':
    control_chars = [chr(0x85)]
else:
    control_chars = [unichr(0x85)]

# Convert text to lower-case and strip punctuation/symbols from words
def normalize_text(text):
    norm_text = text.lower()
    # Replace breaks with spaces
    norm_text = norm_text.replace('<br />', ' ')
    # Pad punctuation with spaces on both sides
    norm_text = re.sub(r"([\.\",\(\)!\?;:])", " \\1 ", norm_text)
    return norm_text

if not os.path.isfile('aclImdb/alldata-id.txt'):
    if not os.path.isdir(dirname):
        if not os.path.isfile(filename):
            # Download IMDB archive
            print("Downloading IMDB archive...")
            url = u'http://ai.stanford.edu/~amaas/data/sentiment/' + filename
            r = requests.get(url)
            with smart_open(filename, 'wb') as f:
                f.write(r.content)
        # if error here, try `tar xfz aclImdb_v1.tar.gz` outside notebook, then re-run this cell
        tar = tarfile.open(filename, mode='r')
        tar.extractall()
        tar.close()
    else:
        print("IMDB archive directory already available without download.")

    # Collect & normalize test/train data
    print("Cleaning up dataset...")
    folders = ['train/pos', 'train/neg', 'test/pos', 'test/neg', 'train/unsup']
    for fol in folders:
        temp = u''
        newline = "\n".encode("utf-8")
        output = fol.replace('/', '-') + '.txt'
        # Is there a better pattern to use?
        txt_files = glob.glob(os.path.join(dirname, fol, '*.txt'))
        print(" %s: %i files" % (fol, len(txt_files)))
        with smart_open(os.path.join(dirname, output), "wb") as n:
            for i, txt in enumerate(txt_files):
                with smart_open(txt, "rb") as t:
                    one_text = t.read().decode("utf-8")
                    for c in control_chars:
                        one_text = one_text.replace(c, ' ')
                    one_text = normalize_text(one_text)
                    all_lines.append(one_text)
                n.write(one_text.encode("utf-8"))
                n.write(newline)

    # Save to disk for instant re-use on any future runs
    with smart_open(os.path.join(dirname, 'alldata-id.txt'), 'wb') as f:
        for idx, line in enumerate(all_lines):
            num_line = u"_*{0} {1}\n".format(idx, line)
            f.write(num_line.encode("utf-8"))

assert os.path.isfile("aclImdb/alldata-id.txt"), "alldata-id.txt unavailable"
print("Success, alldata-id.txt is available for next steps.")
The text data is small enough to be read into memory.
In [2]:
%%time
import gensim
from gensim.models.doc2vec import TaggedDocument
from collections import namedtuple

# this data object class suffices as a `TaggedDocument` (with `words` and `tags`)
# plus adds other state helpful for our later evaluation/reporting
SentimentDocument = namedtuple('SentimentDocument', 'words tags split sentiment')

alldocs = []
with smart_open('aclImdb/alldata-id.txt', 'rb', encoding='utf-8') as alldata:
    for line_no, line in enumerate(alldata):
        tokens = gensim.utils.to_unicode(line).split()
        words = tokens[1:]
        tags = [line_no]  # 'tags = [tokens[0]]' would also work at extra memory cost
        split = ['train', 'test', 'extra', 'extra'][line_no // 25000]  # 25k train, 25k test, 25k extra
        sentiment = [1.0, 0.0, 1.0, 0.0, None, None, None, None][line_no // 12500]  # [12.5K pos, 12.5K neg]*2 then unknown
        alldocs.append(SentimentDocument(words, tags, split, sentiment))

train_docs = [doc for doc in alldocs if doc.split == 'train']
test_docs = [doc for doc in alldocs if doc.split == 'test']

print('%d docs: %d train-sentiment, %d test-sentiment' % (len(alldocs), len(train_docs), len(test_docs)))
Because the native document-order has similar-sentiment documents in large clumps – which is suboptimal for training – we work with a once-shuffled copy of the training set.
In [3]:
from random import shuffle
doc_list = alldocs[:]
shuffle(doc_list)
We approximate the experiment of Le & Mikolov "Distributed Representations of Sentences and Documents" with guidance from Mikolov's example go.sh:
./word2vec -train ../alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1
We vary the following parameter choices:
- cbow=0 means skip-gram, which is equivalent to the paper's 'PV-DBOW' mode; this is matched in gensim with dm=0
- Added to that DBOW model are two DM models: one which averages context vectors (dm_mean) and one which concatenates them (dm_concat, resulting in a much larger, slower, more data-hungry model)
- min_count=2 saves quite a bit of model memory, discarding only words that appear in a single doc (and are thus no more expressive than the unique-to-each-doc doc-vectors themselves)
In [4]:
%%time
from gensim.models import Doc2Vec
import gensim.models.doc2vec
from collections import OrderedDict
import multiprocessing

cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

simple_models = [
    # PV-DBOW plain
    Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2, sample=0,
            epochs=20, workers=cores),
    # PV-DM w/ default averaging; a higher starting alpha may improve CBOW/PV-DM modes
    Doc2Vec(dm=1, vector_size=100, window=10, negative=5, hs=0, min_count=2, sample=0,
            epochs=20, workers=cores, alpha=0.05, comment='alpha=0.05'),
    # PV-DM w/ concatenation - big, slow, experimental mode
    # window=5 (both sides) approximates paper's apparent 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, vector_size=100, window=5, negative=5, hs=0, min_count=2, sample=0,
            epochs=20, workers=cores),
]

for model in simple_models:
    model.build_vocab(alldocs)
    print("%s vocabulary scanned & state initialized" % model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)
Le and Mikolov note that combining a paragraph vector from Distributed Bag of Words (DBOW) and Distributed Memory (DM) improves performance. We will follow suit, pairing the models together for evaluation. Here, we concatenate the paragraph vectors obtained from each model, with the help of a thin wrapper class included in a gensim test module. (Note that this is a separate, later concatenation of output vectors, distinct from the kind of input-window concatenation enabled by the dm_concat=1 mode above.)
In [5]:
from gensim.test.test_doc2vec import ConcatenatedDoc2Vec
models_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[1]])
models_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[2]])
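For intuition, the wrapper's output for a given document tag is (roughly) just the two models' doc-vectors laid end to end; a hedged, hand-rolled equivalent for tag 0 might look like the following, using the docvecs access style the rest of this notebook uses.
import numpy as np

combined = np.concatenate([simple_models[0].docvecs[0], simple_models[1].docvecs[0]])
print(combined.shape)  # (200,) when both models use 100-dimensional vectors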
Let's define some helper methods for evaluating the performance of our Doc2Vec models using their paragraph vectors. We will classify document sentiments with a logistic regression model trained on those paragraph embeddings, and compare the error rates obtained with the document vectors from our various Doc2Vec models.
In [6]:
import numpy as np
import statsmodels.api as sm
from random import sample

def logistic_predictor_from_data(train_targets, train_regressors):
    """Fit a statsmodels logistic predictor on supplied data"""
    logit = sm.Logit(train_targets, train_regressors)
    predictor = logit.fit(disp=0)
    # print(predictor.summary())
    return predictor

def error_rate_for_model(test_model, train_set, test_set,
                         reinfer_train=False, reinfer_test=False,
                         infer_steps=None, infer_alpha=None, infer_subsample=0.2):
    """Report error rate on test_set sentiments, using the supplied model and train_set"""
    train_targets = [doc.sentiment for doc in train_set]
    if reinfer_train:
        train_regressors = [test_model.infer_vector(doc.words, steps=infer_steps, alpha=infer_alpha) for doc in train_set]
    else:
        train_regressors = [test_model.docvecs[doc.tags[0]] for doc in train_set]
    train_regressors = sm.add_constant(train_regressors)
    predictor = logistic_predictor_from_data(train_targets, train_regressors)

    test_data = test_set
    if reinfer_test:
        if infer_subsample < 1.0:
            test_data = sample(test_data, int(infer_subsample * len(test_data)))
        test_regressors = [test_model.infer_vector(doc.words, steps=infer_steps, alpha=infer_alpha) for doc in test_data]
    else:
        test_regressors = [test_model.docvecs[doc.tags[0]] for doc in test_data]
    test_regressors = sm.add_constant(test_regressors)

    # Predict & evaluate
    test_predictions = predictor.predict(test_regressors)
    corrects = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_data])
    errors = len(test_predictions) - corrects
    error_rate = float(errors) / len(test_predictions)
    return (error_rate, errors, len(test_predictions), predictor)
Note that doc-vector training is occurring on all documents of the dataset, which includes all TRAIN/TEST/DEV docs.
We then evaluate each model's sentiment predictive power by the error rate of its logistic-regression classifier on the test documents.
(On a 4-core 2.6Ghz Intel Core i7, these 20 passes training and evaluating 3 main models takes about an hour.)
In [7]:
from collections import defaultdict
error_rates = defaultdict(lambda: 1.0) # To selectively print only best errors achieved
In [8]:
for model in simple_models:
print("Training %s" % model)
%time model.train(doc_list, total_examples=len(doc_list), epochs=model.epochs)
print("\nEvaluating %s" % model)
%time err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)
error_rates[str(model)] = err_rate
print("\n%f %s\n" % (err_rate, model))
In [9]:
for model in [models_by_name['dbow+dmm'], models_by_name['dbow+dmc']]:
print("\nEvaluating %s" % model)
%time err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)
error_rates[str(model)] = err_rate
print("\n%f %s\n" % (err_rate, model))
In [10]:
# Compare error rates achieved, best-to-worst
print("Err_rate Model")
for rate, name in sorted((rate, name) for name, rate in error_rates.items()):
print("%f %s" % (rate, name))
In our testing, contrary to the results of the paper, PV-DBOW alone performs as well as anything else on this problem. Concatenating vectors from different models only sometimes offers a tiny predictive improvement, and generally stays close to the best-performing solo model included.
The best results achieved here are just around 10% error rate, still a long way from the paper's reported 7.42% error rate.
(Other trials not shown, with larger vectors and other changes, also don't come close to the paper's reported value. Others around the net have reported a similar inability to reproduce the paper's best numbers. The PV-DM/C mode improves a bit with many more training epochs – but doesn't reach parity with PV-DBOW.)
In [11]:
doc_id = np.random.randint(simple_models[0].docvecs.count) # Pick random doc; re-run cell for more examples
print('for doc %d...' % doc_id)
for model in simple_models:
    inferred_docvec = model.infer_vector(alldocs[doc_id].words)
    print('%s:\n %s' % (model, model.docvecs.most_similar([inferred_docvec], topn=3)))
(Yes, here the stored vector from 20 epochs of training is usually one of the closest to a freshly-inferred vector for the same words. Defaults for inference may benefit from tuning for each dataset or model parameters.)
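For example, here is a hedged sketch of passing explicit inference parameters; note that the keyword for the number of passes differs across gensim versions (steps in older releases, epochs in newer ones), and these particular values are illustrative rather than recommended.
# More inference passes and an explicit starting learning-rate sometimes give
# more stable vectors than the defaults
tuned_vec = simple_models[0].infer_vector(alldocs[doc_id].words, alpha=0.025, steps=50)
print(simple_models[0].docvecs.most_similar([tuned_vec], topn=3))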
In [18]:
import random
doc_id = np.random.randint(simple_models[0].docvecs.count) # pick random doc, re-run cell for more examples
model = random.choice(simple_models) # and a random model
sims = model.docvecs.most_similar(doc_id, topn=model.docvecs.count) # get *all* similar documents
print(u'TARGET (%d): «%s»\n' % (doc_id, ' '.join(alldocs[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(alldocs[sims[index][0]].words)))
To some extent, in terms of reviewer tone, movie genre, etc., the MOST cosine-similar docs usually seem more like the TARGET than the MEDIAN or LEAST, especially if the MOST has a cosine-similarity > 0.5. Re-run the cell to try another random target document.
In [13]:
word_models = simple_models[:]
In [23]:
import random
from IPython.display import HTML
# pick a random word with a suitable number of occurrences
while True:
    word = random.choice(word_models[0].wv.index2word)
    if word_models[0].wv.vocab[word].count > 10:
        break
# or uncomment the line below, to just pick a word from the relevant domain:
#word = 'comedy/drama'
similars_per_model = [str(model.wv.most_similar(word, topn=20)).replace('), ','),<br>\n') for model in word_models]
similar_table = ("<table><tr><th>" +
"</th><th>".join([str(model) for model in word_models]) +
"</th></tr><tr><td>" +
"</td><td>".join(similars_per_model) +
"</td></tr></table>")
print("most similar words for '%s' (%d occurences)" % (word, simple_models[0].wv.vocab[word].count))
HTML(similar_table)
Out[23]:
Do the DBOW words look meaningless? That's because the gensim DBOW model doesn't train word vectors – they remain at their randomly initialized values – unless you ask for it with the dbow_words=1 initialization parameter. Concurrent word-training slows DBOW mode significantly, and offers little improvement (and sometimes a little worsening) of the error rate on this IMDB sentiment-prediction task, but may be appropriate for other tasks, or if you also need word-vectors.
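For reference, here is a hedged sketch of what such a DBOW-plus-word-training model would look like at initialization; it is not trained in this notebook because of the extra cost.
# dbow_words=1 also trains word-vectors (skip-gram style) alongside the doc-vectors
dbow_plus_words = Doc2Vec(dm=0, dbow_words=1, vector_size=100, negative=5, hs=0,
                          min_count=2, sample=0, epochs=20, workers=cores)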
Words from DM models tend to show meaningfully similar words when there are many examples in the training data (as with 'plot' or 'actor'). (All DM modes inherently involve word-vector training concurrent with doc-vector training.)
In [15]:
# grab the file if not already local
questions_filename = 'questions-words.txt'
if not os.path.isfile(questions_filename):
    # Download the word-analogy questions file from the word2vec repository
    print("Downloading analogy questions file...")
    url = u'https://raw.githubusercontent.com/tmikolov/word2vec/master/questions-words.txt'
    r = requests.get(url)
    with smart_open(questions_filename, 'wb') as f:
        f.write(r.content)
assert os.path.isfile(questions_filename), "questions-words.txt unavailable"
print("Success, questions-words.txt is available for next steps.")
In [16]:
# Note: this analysis takes many minutes
for model in word_models:
    score, sections = model.wv.evaluate_word_analogies('questions-words.txt')
    correct, incorrect = len(sections[-1]['correct']), len(sections[-1]['incorrect'])
    print('%s: %0.2f%% correct (%d of %d)' % (model, float(correct*100)/(correct+incorrect), correct, correct+incorrect))
Even though this is a tiny, domain-specific dataset, it shows some meager capability on the general word analogies – at least for the DM/mean and DM/concat models which actually train word vectors. (The untrained random-initialized words of the DBOW model of course fail miserably.)
In [ ]:
This cell left intentionally erroneous.
Because the bulk-trained vectors had much of their training early, when the model itself was still settling, it is sometimes the case that rather than using the bulk-trained vectors, new vectors re-inferred from the final state of the model serve better as the input/test data for downstream tasks.
Our error_rate_for_model() function already has a non-default option to re-infer vectors before training/testing the classifier, so here we test that option. (This takes as long or longer than the initial bulk training, as inference is only single-threaded.)
In [24]:
for model in simple_models + [models_by_name['dbow+dmm'], models_by_name['dbow+dmc']]:
print("Evaluating %s re-inferred" % str(model))
pseudomodel_name = str(model)+"_reinferred"
%time err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs, reinfer_train=True, reinfer_test=True, infer_subsample=1.0)
error_rates[pseudomodel_name] = err_rate
print("\n%f %s\n" % (err_rate, pseudomodel_name))
In [25]:
# Compare error rates achieved, best-to-worst
print("Err_rate Model")
for rate, name in sorted((rate, name) for name, rate in error_rates.items()):
print("%f %s" % (rate, name))
Here, we do not see much benefit from re-inference. It's more likely to help when the initial training used fewer epochs (10 is also a common value in the literature for larger datasets), or perhaps when working with larger datasets.
In [ ]:
# To get copious logging output from the steps above, enable INFO-level logging
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
rootLogger = logging.getLogger()
rootLogger.setLevel(logging.INFO)
In [ ]:
# To auto-reload edited python modules while developing, enable the autoreload extension
%load_ext autoreload
%autoreload 2