Doc2Vec Tutorial on the Lee Dataset


In [1]:
import gensim
import os
import collections
import random

What is it?

Doc2Vec is an NLP tool for representing documents as a vector and is a generalizing of the Word2Vec method. This tutorial will serve as an introduction to Doc2Vec and present ways to train and assess a Doc2Vec model.

Getting Started

To get going, we'll need to have a set of documents to train our doc2vec model. In theory, a document could be anything from a short 140 character tweet, a single paragraph (i.e., journal article abstract), a news article, or a book. In NLP parlance a collection or set of documents is often referred to as a corpus.

For this tutorial, we'll be training our model using the Lee Background Corpus included in gensim. This corpus contains 314 documents selected from the Australian Broadcasting Corporation’s news mail service, which provides text e-mails of headline stories and covers a number of broad topics.

And we'll test our model by eye using the much shorter Lee Corpus which contains 50 documents.


In [2]:
# Set file names for train and test data
test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])
lee_train_file = test_data_dir + os.sep + 'lee_background.cor'
lee_test_file = test_data_dir + os.sep + 'lee.cor'

Define a Function to Read and Preprocess Text

Below, we define a function to open the train/test file (with latin encoding), read the file line-by-line, pre-process each line using a simple gensim pre-processing tool (i.e., tokenize text into individual words, remove punctuation, set to lowercase, etc), and return a list of words. Note that, for a given file (aka corpus), each continuous line constitutes a single document and the length of each line (i.e., document) can vary. Also, to train the model, we'll need to associate a tag/number with each document of the training corpus. In our case, the tag is simply the zero-based line number.


In [3]:
def read_corpus(fname, tokens_only=False):
    with open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])

In [4]:
train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))

Let's take a look at the training corpus


In [5]:
train_corpus[:2]


Out[5]:
[TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern', 'highlands', 'of', 'new', 'south', 'wales', 'as', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'the', 'town', 'of', 'hill', 'top', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'of', 'sydney', 'has', 'forced', 'the', 'closure', 'of', 'the', 'hume', 'highway', 'at', 'about', 'pm', 'aedt', 'marked', 'deterioration', 'in', 'the', 'weather', 'as', 'storm', 'cell', 'moved', 'east', 'across', 'the', 'blue', 'mountains', 'forced', 'authorities', 'to', 'make', 'decision', 'to', 'evacuate', 'people', 'from', 'homes', 'in', 'outlying', 'streets', 'at', 'hill', 'top', 'in', 'the', 'new', 'south', 'wales', 'southern', 'highlands', 'an', 'estimated', 'residents', 'have', 'left', 'their', 'homes', 'for', 'nearby', 'mittagong', 'the', 'new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'the', 'weather', 'conditions', 'which', 'caused', 'the', 'fire', 'to', 'burn', 'in', 'finger', 'formation', 'have', 'now', 'eased', 'and', 'about', 'fire', 'units', 'in', 'and', 'around', 'hill', 'top', 'are', 'optimistic', 'of', 'defending', 'all', 'properties', 'as', 'more', 'than', 'blazes', 'burn', 'on', 'new', 'year', 'eve', 'in', 'new', 'south', 'wales', 'fire', 'crews', 'have', 'been', 'called', 'to', 'new', 'fire', 'at', 'gunning', 'south', 'of', 'goulburn', 'while', 'few', 'details', 'are', 'available', 'at', 'this', 'stage', 'fire', 'authorities', 'says', 'it', 'has', 'closed', 'the', 'hume', 'highway', 'in', 'both', 'directions', 'meanwhile', 'new', 'fire', 'in', 'sydney', 'west', 'is', 'no', 'longer', 'threatening', 'properties', 'in', 'the', 'cranebrook', 'area', 'rain', 'has', 'fallen', 'in', 'some', 'parts', 'of', 'the', 'illawarra', 'sydney', 'the', 'hunter', 'valley', 'and', 'the', 'north', 'coast', 'but', 'the', 'bureau', 'of', 'meteorology', 'claire', 'richards', 'says', 'the', 'rain', 'has', 'done', 'little', 'to', 'ease', 'any', 'of', 'the', 'hundred', 'fires', 'still', 'burning', 'across', 'the', 'state', 'the', 'falls', 'have', 'been', 'quite', 'isolated', 'in', 'those', 'areas', 'and', 'generally', 'the', 'falls', 'have', 'been', 'less', 'than', 'about', 'five', 'millimetres', 'she', 'said', 'in', 'some', 'places', 'really', 'not', 'significant', 'at', 'all', 'less', 'than', 'millimetre', 'so', 'there', 'hasn', 'been', 'much', 'relief', 'as', 'far', 'as', 'rain', 'is', 'concerned', 'in', 'fact', 'they', 've', 'probably', 'hampered', 'the', 'efforts', 'of', 'the', 'firefighters', 'more', 'because', 'of', 'the', 'wind', 'gusts', 'that', 'are', 'associated', 'with', 'those', 'thunderstorms'], tags=[0]),
 TaggedDocument(words=['indian', 'security', 'forces', 'have', 'shot', 'dead', 'eight', 'suspected', 'militants', 'in', 'night', 'long', 'encounter', 'in', 'southern', 'kashmir', 'the', 'shootout', 'took', 'place', 'at', 'dora', 'village', 'some', 'kilometers', 'south', 'of', 'the', 'kashmiri', 'summer', 'capital', 'srinagar', 'the', 'deaths', 'came', 'as', 'pakistani', 'police', 'arrested', 'more', 'than', 'two', 'dozen', 'militants', 'from', 'extremist', 'groups', 'accused', 'of', 'staging', 'an', 'attack', 'on', 'india', 'parliament', 'india', 'has', 'accused', 'pakistan', 'based', 'lashkar', 'taiba', 'and', 'jaish', 'mohammad', 'of', 'carrying', 'out', 'the', 'attack', 'on', 'december', 'at', 'the', 'behest', 'of', 'pakistani', 'military', 'intelligence', 'military', 'tensions', 'have', 'soared', 'since', 'the', 'raid', 'with', 'both', 'sides', 'massing', 'troops', 'along', 'their', 'border', 'and', 'trading', 'tit', 'for', 'tat', 'diplomatic', 'sanctions', 'yesterday', 'pakistan', 'announced', 'it', 'had', 'arrested', 'lashkar', 'taiba', 'chief', 'hafiz', 'mohammed', 'saeed', 'police', 'in', 'karachi', 'say', 'it', 'is', 'likely', 'more', 'raids', 'will', 'be', 'launched', 'against', 'the', 'two', 'groups', 'as', 'well', 'as', 'other', 'militant', 'organisations', 'accused', 'of', 'targetting', 'india', 'military', 'tensions', 'between', 'india', 'and', 'pakistan', 'have', 'escalated', 'to', 'level', 'not', 'seen', 'since', 'their', 'war'], tags=[1])]

And the testing corpus looks like this:


In [6]:
print(test_corpus[:2])


[['the', 'national', 'executive', 'of', 'the', 'strife', 'torn', 'democrats', 'last', 'night', 'appointed', 'little', 'known', 'west', 'australian', 'senator', 'brian', 'greig', 'as', 'interim', 'leader', 'shock', 'move', 'likely', 'to', 'provoke', 'further', 'conflict', 'between', 'the', 'party', 'senators', 'and', 'its', 'organisation', 'in', 'move', 'to', 'reassert', 'control', 'over', 'the', 'party', 'seven', 'senators', 'the', 'national', 'executive', 'last', 'night', 'rejected', 'aden', 'ridgeway', 'bid', 'to', 'become', 'interim', 'leader', 'in', 'favour', 'of', 'senator', 'greig', 'supporter', 'of', 'deposed', 'leader', 'natasha', 'stott', 'despoja', 'and', 'an', 'outspoken', 'gay', 'rights', 'activist'], ['cash', 'strapped', 'financial', 'services', 'group', 'amp', 'has', 'shelved', 'million', 'plan', 'to', 'buy', 'shares', 'back', 'from', 'investors', 'and', 'will', 'raise', 'million', 'in', 'fresh', 'capital', 'after', 'profits', 'crashed', 'in', 'the', 'six', 'months', 'to', 'june', 'chief', 'executive', 'paul', 'batchelor', 'said', 'the', 'result', 'was', 'solid', 'in', 'what', 'he', 'described', 'as', 'the', 'worst', 'conditions', 'for', 'stock', 'markets', 'in', 'years', 'amp', 'half', 'year', 'profit', 'sank', 'per', 'cent', 'to', 'million', 'or', 'share', 'as', 'australia', 'largest', 'investor', 'and', 'fund', 'manager', 'failed', 'to', 'hit', 'projected', 'per', 'cent', 'earnings', 'growth', 'targets', 'and', 'was', 'battered', 'by', 'falling', 'returns', 'on', 'share', 'markets']]

Notice that the testing corpus is just a list of lists and does not contain any tags.

Training the Model

Instantiate a Doc2Vec Object

Now, we'll instantiate a Doc2Vec model with a vector size with 50 words and iterating over the training corpus 10 times. We set the minimum word count to 2 in order to give higher frequency words more weighting. Model accuracy can be improved by increasing the number of iterations but this generally increases the training time.


In [7]:
model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=10)

Build a Vocabulary


In [8]:
model.build_vocab(train_corpus)

Essentially, the vocabulary is a dictionary (accessible via model.vocab) of all of the unique words extracted from the training corpus along with the count (e.g., model.vocab['penalty'].count for counts for the word penalty).

Time to Train

This should take no more than 2 minutes


In [9]:
%time model.train(train_corpus)


/Users/law978/Git/gensim/gensim/models/word2vec.py:695: UserWarning: C extension not loaded for Word2Vec, training will be slow. Install a C compiler and reinstall gensim for fast training.
  warnings.warn("C extension not loaded for Word2Vec, training will be slow. "
CPU times: user 1min 12s, sys: 597 ms, total: 1min 12s
Wall time: 1min 15s
Out[9]:
551260

Inferring a Vector

One important thing to note is that you can now infer a vector for any piece of text without having to re-train the model by passing a list of words to the model.infer_vector function. This vector can then be compared with other vectors via cosine similarity.


In [10]:
model.infer_vector(['only', 'you', 'can', 'prevent', 'forrest', 'fires'])


Out[10]:
array([-0.04874077,  0.00532171,  0.03997586,  0.02039887,  0.05031801,
        0.01692798,  0.05166846,  0.02999674, -0.01478235,  0.05082989,
       -0.05563538, -0.03152708, -0.10895395, -0.00282841, -0.03549191,
        0.01407297, -0.02129708,  0.04731029,  0.05398854, -0.00647346,
        0.03382157,  0.01671328,  0.0430089 ,  0.08409277, -0.02082613,
        0.00802198, -0.14297496,  0.0373593 ,  0.0259844 , -0.05306018,
       -0.06854553, -0.05353218,  0.0096629 , -0.00472326, -0.02621017,
       -0.0160461 ,  0.10709939,  0.02299457, -0.03324679,  0.09491493,
        0.05415217, -0.02253654, -0.0550766 ,  0.02930273, -0.0288978 ,
        0.03869915,  0.03572316,  0.05450428, -0.01299429, -0.00369742], dtype=float32)

Assessing Model

To assess our new model, we'll first infer new vectors for each document of the training corpus, compare the inferred vectors with the training corpus, and then returning the rank of the document based on self-similarity. Basically, we're pretending as if the training corpus is some new unseen data and then seeing how they compare with the trained model. The expectation is that we've likely overfit our model (i.e., all of the ranks will be less than 2) and so we should be able to find similar documents very easily. Additionally, we'll keep track of the second ranks for a comparison of less similar documents.


In [11]:
ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)
    
    second_ranks.append(sims[1])

Let's count how each document ranks with respect to the training corpus


In [12]:
collections.Counter(ranks)  #96% accuracy


Out[12]:
Counter({0: 292, 1: 8})

Basically, greater than 95% of the inferred documents are found to be most similar to itself and about 5% of the time it is mistakenly most similar to another document. This is great and not entirely surprising. We can take a look at an example:


In [13]:
print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))


Document (299): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/s,d50,hs,w8,mc2):

MOST (299, 0.8678168058395386): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»

MEDIAN (201, 0.055294521152973175): «prime minister ariel sharon said israel might step up its military operations in the west bank and gaza strip following new palestinian suicide bombing in haifa our operations are yielding impressive results but we have not finished our action and because of what is happening we might have to step up our activities sharon told israeli public radio earlier on sunday palestinian suicide bomber tried to blow himself up at bus station in the northern israeli town of haifa injuring several people the militant was seriously wounded in the botched attack and was quickly shot dead by two policemen who feared he was about to activate second bomb mr sharon was speaking during weekly cabinet meeting that was held at the west bank headquarters of the israeli armed forces near the jewish settlement of beit el and the palestinian town of ramallah israeli cabinet ministers were driven to the meeting in an armored bus the radio said mr sharon has sent in his warplanes helicopter gunships and tanks to palestinian police stations across the territories to pile pressure on mr arafat to arrest and sentence islamic extremists behind wave of suicide bombings on israeli cities in the past week two palestinians were killed and more than injured in the air strikes»

LEAST (199, -0.41532400250434875): «the royal commission into the building industry has ended the first day of public hearings in melbourne counsel assisting lionel robbards qc told the commission of culture of fear in the building industry he said some witnesses were afraid to come forward after being physically threatened the construction forestry mining and energy union cfmeu secretary martin kingham will respond to allegations made against the unions when he gives evidence tomorrow it coincides with rally by thousands of building workers outside collins place where the commission is being held»

Notice above that the most similar document is has a similarity score of ~80% (or higher). However, the similarity score for the second ranked documents should be significantly lower (assuming the documents are in fact different) and the reasoning becomes obvious when we examine the text itself


In [14]:
# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(train_corpus))

# Compare and print the most/median/least similar documents from the train corpus
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))


Train Document (68): «after months of delays the company behind plans for multi billion dollar timor sea gas development has reached an agreement with east timor to allow the project to go ahead six months ago phillips petroleum indefinitely delayed its plan to build billion pipeline to bring gas from the bayu undan gasfield to darwin the company blamed the delay on its concerns with east timor tax regime but today phillips has announced it has reached deal with east timor new government which will allow the gas project to go ahead in statement the company says it welcomes the tax and fiscal package offered by east timor council of ministers and says it is an historic day but phillips spokesman says the deal must first be ratified by australia which will share the revenue from the gas project with east timor»

Similar Document (214, 0.4991557002067566): «the united states offered full and direct approval to indonesia invasion of east timor move by then president suharto which consigned the territory to years of oppression official documents released today showed the documents prove conclusively for the first time that the united states gave green light to the invasion the opening salvo in an occupation that cost the lives of up to east timorese general suharto briefed us president gerald ford and his secretary of state henry kissinger on his plans for the former portuguese colony hours before the invasion according to documents collected by george washington university national security archive when mr ford and mr kissinger called in to jakarta on their way back from summit in beijing on december suharto claimed that in the interests of asia and regional stability he had to bring stability to east timor to which portugal was trying to grant autonomy we want your understanding if we deem it necessary to take rapid or drastic action suharto told his visitors according to long classified state department cable mr ford replied we will understand and will not press you on the issue we understand the problem you have and the intentions you have mr kissinger who has denied the subject of timor came up during the talks appeared to be concerned about the domestic political implications of an indonesian invasion it is important that whatever you do succeeds quickly we would be able to influence the reaction in america if whatever happens happens after we return the president will be back on monday at pm jakarta time we understand your problem and the need to move quickly but am only saying that it would be better if it were done after we returned the invasion took place on december the day after the ford suharto meeting mr kissinger has consistently rejected criticism of the ford administration conduct on east timor during launch in for his book diplomacy mr kissinger said at new york hotel it was perhaps regrettable that for us officials the implications of indonesia timor policy were lost in blizzard of geo political issues following the vietnam war»

Testing the Model

Using the same approach above, we'll infer the vector for a randomly chosen test document, and compare the document to our model by eye.


In [15]:
# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(test_corpus))
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))


Test Document (6): «the united states team of monica seles and jan michael gambill scored decisive victory over unseeded france in their first hopman cup match at burswood dome in perth the pair runners up in the million dollar mixed teams event last year both won their singles encounters to give the us an unbeatable lead the year old seles currently ranked eighth recovered from shaky start to overpower virginie razzano who is ranked nd seles had to fight hard to get home in straight sets winning in minutes then the year old gambill ranked st wore down determined arnaud clement th to win in minutes the americans are aiming to go one better than last year when they were beaten by swiss pair martina hingis and roger federer in the final of the eight nation contest gambill said the win was great way to start the tennis year got little tentative at the end but it was great start to my year he said arnaud is great scrapper and am delighted to beat him even though am frankly bit out of shape that is one of the reasons am here will be in shape by the end of the tournament just aim to keep improving in the new year and if do think have chance to beat anyone when am playing well gambill was pressed hard by clement before taking the first set in minutes but the american gained the ascendancy in the second set breaking in the third and fifth games seles said she had expected her clash with razzano to be tough she was top junior player in the world so it was no surprise that she fought so well she said seles said she still had the hunger to strive to regain her position at the top of her sport this is why you play she said but want to try not to peak too early this season seles slow into her stride slipped to in her opening set against razzano but recovered quickly claiming the set after snatching four games in row in the second set seles broke her opponent in the opening game and completed victory with relative ease despite razzano tenacious efforts»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/s,d50,hs,w8,mc2):

MOST (133, 0.5493497848510742): «the hunt for osama bin laden has shifted to the forests around the cave complex of tora bora after swoop through the last caves failed to reveal any sign of the saudi born fugitive us special forces are now combing the forests alongside anti taliban militias up to al qaeda fighters are believed to have scattered into the hills many heading south toward the pakistan border local commanders have warned they will shoot any villager who shelters them us bombing runs have eased off in the past hours as american special forces move to deeper into the forest to coordinate the hunt earlier us forces intercepted voice communication they believed could be bin laden speaking by short range radio to his fighters however senior afghan commander haji zaman said he believed bin laden had left the tora bora area us defence secretary donald rumsfeld on visit to us troops at baghram air base outside kabul said the battle against the taliban and al qaeda was not over there are still pockets of taliban and al qaeda forces that have drifted into the mountains and could reform and there is good deal yet to be done he said»

MEDIAN (38, 0.09136968851089478): «rafter who raised the alarm after most of his party was swept into the franklin river in tasmania south west says for nearly hours he did not know whether his four friends had survived richard romaszko was rafting down the collingwood river when the party was hit by huge water swell just before the junction with the franklin mr romaszko pulled himself to safety and after camping overnight he alerted tour group he was able to use the group satellite phone to raise the alarm second member of the party was found nearby yesterday afternoon while the other three were winched out by helicopter about pm aedt mr romanszko says he went into survival mode when we hit the bottom of the rapid there was big wave that overturned the rafts he said before we knew it we were in the franklin river at least my raft was upside down and the guy who was with me his raft was upside down»

LEAST (123, -0.38924235105514526): «new report has revealed there are fewer young people using homeless services than widely thought the australian institute of health and welfare report shows just over per cent of people aged between and used such services over the past year the main reason for young people seeking assistance was based on family or relationship difficulties the report suggests the older people become the less they use homeless services»

Wrapping Up

That's it! Doc2Vec is a great way to explore relationships between documents.