Doc2Vec Tutorial on the Lee Dataset


In [1]:
import gensim
import os
import collections
import smart_open
import random

What is it?

Doc2Vec is an NLP tool for representing documents as a vector and is a generalizing of the Word2Vec method. This tutorial will serve as an introduction to Doc2Vec and present ways to train and assess a Doc2Vec model.

Getting Started

To get going, we'll need to have a set of documents to train our doc2vec model. In theory, a document could be anything from a short 140 character tweet, a single paragraph (i.e., journal article abstract), a news article, or a book. In NLP parlance a collection or set of documents is often referred to as a corpus.

For this tutorial, we'll be training our model using the Lee Background Corpus included in gensim. This corpus contains 314 documents selected from the Australian Broadcasting Corporation’s news mail service, which provides text e-mails of headline stories and covers a number of broad topics.

And we'll test our model by eye using the much shorter Lee Corpus which contains 50 documents.


In [2]:
# Set file names for train and test data
test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])
lee_train_file = test_data_dir + os.sep + 'lee_background.cor'
lee_test_file = test_data_dir + os.sep + 'lee.cor'

Define a Function to Read and Preprocess Text

Below, we define a function to open the train/test file (with latin encoding), read the file line-by-line, pre-process each line using a simple gensim pre-processing tool (i.e., tokenize text into individual words, remove punctuation, set to lowercase, etc), and return a list of words. Note that, for a given file (aka corpus), each continuous line constitutes a single document and the length of each line (i.e., document) can vary. Also, to train the model, we'll need to associate a tag/number with each document of the training corpus. In our case, the tag is simply the zero-based line number.


In [3]:
def read_corpus(fname, tokens_only=False):
    with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])

In [4]:
train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))

Let's take a look at the training corpus


In [5]:
train_corpus[:2]


Out[5]:
[TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern', 'highlands', 'of', 'new', 'south', 'wales', 'as', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'the', 'town', 'of', 'hill', 'top', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'of', 'sydney', 'has', 'forced', 'the', 'closure', 'of', 'the', 'hume', 'highway', 'at', 'about', 'pm', 'aedt', 'marked', 'deterioration', 'in', 'the', 'weather', 'as', 'storm', 'cell', 'moved', 'east', 'across', 'the', 'blue', 'mountains', 'forced', 'authorities', 'to', 'make', 'decision', 'to', 'evacuate', 'people', 'from', 'homes', 'in', 'outlying', 'streets', 'at', 'hill', 'top', 'in', 'the', 'new', 'south', 'wales', 'southern', 'highlands', 'an', 'estimated', 'residents', 'have', 'left', 'their', 'homes', 'for', 'nearby', 'mittagong', 'the', 'new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'the', 'weather', 'conditions', 'which', 'caused', 'the', 'fire', 'to', 'burn', 'in', 'finger', 'formation', 'have', 'now', 'eased', 'and', 'about', 'fire', 'units', 'in', 'and', 'around', 'hill', 'top', 'are', 'optimistic', 'of', 'defending', 'all', 'properties', 'as', 'more', 'than', 'blazes', 'burn', 'on', 'new', 'year', 'eve', 'in', 'new', 'south', 'wales', 'fire', 'crews', 'have', 'been', 'called', 'to', 'new', 'fire', 'at', 'gunning', 'south', 'of', 'goulburn', 'while', 'few', 'details', 'are', 'available', 'at', 'this', 'stage', 'fire', 'authorities', 'says', 'it', 'has', 'closed', 'the', 'hume', 'highway', 'in', 'both', 'directions', 'meanwhile', 'new', 'fire', 'in', 'sydney', 'west', 'is', 'no', 'longer', 'threatening', 'properties', 'in', 'the', 'cranebrook', 'area', 'rain', 'has', 'fallen', 'in', 'some', 'parts', 'of', 'the', 'illawarra', 'sydney', 'the', 'hunter', 'valley', 'and', 'the', 'north', 'coast', 'but', 'the', 'bureau', 'of', 'meteorology', 'claire', 'richards', 'says', 'the', 'rain', 'has', 'done', 'little', 'to', 'ease', 'any', 'of', 'the', 'hundred', 'fires', 'still', 'burning', 'across', 'the', 'state', 'the', 'falls', 'have', 'been', 'quite', 'isolated', 'in', 'those', 'areas', 'and', 'generally', 'the', 'falls', 'have', 'been', 'less', 'than', 'about', 'five', 'millimetres', 'she', 'said', 'in', 'some', 'places', 'really', 'not', 'significant', 'at', 'all', 'less', 'than', 'millimetre', 'so', 'there', 'hasn', 'been', 'much', 'relief', 'as', 'far', 'as', 'rain', 'is', 'concerned', 'in', 'fact', 'they', 've', 'probably', 'hampered', 'the', 'efforts', 'of', 'the', 'firefighters', 'more', 'because', 'of', 'the', 'wind', 'gusts', 'that', 'are', 'associated', 'with', 'those', 'thunderstorms'], tags=[0]),
 TaggedDocument(words=['indian', 'security', 'forces', 'have', 'shot', 'dead', 'eight', 'suspected', 'militants', 'in', 'night', 'long', 'encounter', 'in', 'southern', 'kashmir', 'the', 'shootout', 'took', 'place', 'at', 'dora', 'village', 'some', 'kilometers', 'south', 'of', 'the', 'kashmiri', 'summer', 'capital', 'srinagar', 'the', 'deaths', 'came', 'as', 'pakistani', 'police', 'arrested', 'more', 'than', 'two', 'dozen', 'militants', 'from', 'extremist', 'groups', 'accused', 'of', 'staging', 'an', 'attack', 'on', 'india', 'parliament', 'india', 'has', 'accused', 'pakistan', 'based', 'lashkar', 'taiba', 'and', 'jaish', 'mohammad', 'of', 'carrying', 'out', 'the', 'attack', 'on', 'december', 'at', 'the', 'behest', 'of', 'pakistani', 'military', 'intelligence', 'military', 'tensions', 'have', 'soared', 'since', 'the', 'raid', 'with', 'both', 'sides', 'massing', 'troops', 'along', 'their', 'border', 'and', 'trading', 'tit', 'for', 'tat', 'diplomatic', 'sanctions', 'yesterday', 'pakistan', 'announced', 'it', 'had', 'arrested', 'lashkar', 'taiba', 'chief', 'hafiz', 'mohammed', 'saeed', 'police', 'in', 'karachi', 'say', 'it', 'is', 'likely', 'more', 'raids', 'will', 'be', 'launched', 'against', 'the', 'two', 'groups', 'as', 'well', 'as', 'other', 'militant', 'organisations', 'accused', 'of', 'targetting', 'india', 'military', 'tensions', 'between', 'india', 'and', 'pakistan', 'have', 'escalated', 'to', 'level', 'not', 'seen', 'since', 'their', 'war'], tags=[1])]

And the testing corpus looks like this:


In [6]:
print(test_corpus[:2])


[['the', 'national', 'executive', 'of', 'the', 'strife', 'torn', 'democrats', 'last', 'night', 'appointed', 'little', 'known', 'west', 'australian', 'senator', 'brian', 'greig', 'as', 'interim', 'leader', 'shock', 'move', 'likely', 'to', 'provoke', 'further', 'conflict', 'between', 'the', 'party', 'senators', 'and', 'its', 'organisation', 'in', 'move', 'to', 'reassert', 'control', 'over', 'the', 'party', 'seven', 'senators', 'the', 'national', 'executive', 'last', 'night', 'rejected', 'aden', 'ridgeway', 'bid', 'to', 'become', 'interim', 'leader', 'in', 'favour', 'of', 'senator', 'greig', 'supporter', 'of', 'deposed', 'leader', 'natasha', 'stott', 'despoja', 'and', 'an', 'outspoken', 'gay', 'rights', 'activist'], ['cash', 'strapped', 'financial', 'services', 'group', 'amp', 'has', 'shelved', 'million', 'plan', 'to', 'buy', 'shares', 'back', 'from', 'investors', 'and', 'will', 'raise', 'million', 'in', 'fresh', 'capital', 'after', 'profits', 'crashed', 'in', 'the', 'six', 'months', 'to', 'june', 'chief', 'executive', 'paul', 'batchelor', 'said', 'the', 'result', 'was', 'solid', 'in', 'what', 'he', 'described', 'as', 'the', 'worst', 'conditions', 'for', 'stock', 'markets', 'in', 'years', 'amp', 'half', 'year', 'profit', 'sank', 'per', 'cent', 'to', 'million', 'or', 'share', 'as', 'australia', 'largest', 'investor', 'and', 'fund', 'manager', 'failed', 'to', 'hit', 'projected', 'per', 'cent', 'earnings', 'growth', 'targets', 'and', 'was', 'battered', 'by', 'falling', 'returns', 'on', 'share', 'markets']]

Notice that the testing corpus is just a list of lists and does not contain any tags.

Training the Model

Instantiate a Doc2Vec Object

Now, we'll instantiate a Doc2Vec model with a vector size with 50 words and iterating over the training corpus 10 times. We set the minimum word count to 2 in order to give higher frequency words more weighting. Model accuracy can be improved by increasing the number of iterations but this generally increases the training time.


In [7]:
model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=10)

Build a Vocabulary


In [8]:
model.build_vocab(train_corpus)

Essentially, the vocabulary is a dictionary (accessible via model.vocab) of all of the unique words extracted from the training corpus along with the count (e.g., model.vocab['penalty'].count for counts for the word penalty).

Time to Train

If the BLAS library is being used, this should take no more than 2 seconds. If the BLAS library is not being used, this should take no more than 2 minutes, so use BLAS if you value your time.


In [9]:
%time model.train(train_corpus)


CPU times: user 1.03 s, sys: 4.04 ms, total: 1.03 s
Wall time: 1.02 s
Out[9]:
554260

Inferring a Vector

One important thing to note is that you can now infer a vector for any piece of text without having to re-train the model by passing a list of words to the model.infer_vector function. This vector can then be compared with other vectors via cosine similarity.


In [10]:
model.infer_vector(['only', 'you', 'can', 'prevent', 'forrest', 'fires'])


Out[10]:
array([ 0.01498613, -0.04319233, -0.09383711, -0.11391422,  0.01824722,
       -0.08820333, -0.10123429, -0.04619413,  0.06180811,  0.00963261,
        0.01016402, -0.04718231,  0.03931789,  0.00525723,  0.03888206,
       -0.03022155,  0.05980202, -0.07421142, -0.00868064,  0.06489716,
       -0.03265649, -0.01281992,  0.18542686, -0.03033197,  0.0381287 ,
       -0.05379254, -0.02662263, -0.07465494, -0.04721382, -0.01964734,
        0.02553383, -0.03766384,  0.02129765,  0.11965464, -0.03847834,
        0.01352374,  0.06914927, -0.02476629,  0.0576441 , -0.04563627,
        0.05643371,  0.031572  ,  0.01302914,  0.01272549,  0.09680226,
       -0.00038328,  0.04555022,  0.04019492, -0.06818081, -0.02148021], dtype=float32)

Assessing Model

To assess our new model, we'll first infer new vectors for each document of the training corpus, compare the inferred vectors with the training corpus, and then returning the rank of the document based on self-similarity. Basically, we're pretending as if the training corpus is some new unseen data and then seeing how they compare with the trained model. The expectation is that we've likely overfit our model (i.e., all of the ranks will be less than 2) and so we should be able to find similar documents very easily. Additionally, we'll keep track of the second ranks for a comparison of less similar documents.


In [11]:
ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)
    
    second_ranks.append(sims[1])

Let's count how each document ranks with respect to the training corpus


In [12]:
collections.Counter(ranks)  #96% accuracy


Out[12]:
Counter({0: 292, 1: 8})

Basically, greater than 95% of the inferred documents are found to be most similar to itself and about 5% of the time it is mistakenly most similar to another document. This is great and not entirely surprising. We can take a look at an example:


In [13]:
print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))


Document (299): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/s,d50,hs,w8,mc2):

MOST (299, 0.8740596175193787): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»

MEDIAN (293, 0.035127244889736176): «george harrison the guitarist songwriter and film producer was widely known as the quiet beatle as the youngest beatle he had to be snuck in underage to venues prior to the band phenomenal success in the early he was responsible for some of the band classic songs such as taxman here comes the sun and something but up against the genius of paul mccartney and john lennon his songs were hard pressed to make it onto vinyl resentment built up and harrison withdrew from the limelight after the beatles broke up he found solo success in with the track my sweet lord although he was successfully sued for plagiarism and had to pay out half million pounds in damages always against beatles reformation he famously declared in that the band would reform when lennon was no longer dead in later years he was beset by lung and throat cancer he was lucky to survive stabbing by an intruder in his uk home in he was known for his love of eastern mysticism motor racing and his second wife olivia who saved his life in the knife attack»

LEAST (14, -0.5578939914703369): «the israeli army has killed three palestinian militants who attacked one of its armoured vehicles in the northern gaza strip the three palestinians opened fire with rifles at the vehicle between the jewish settlements of alei sinai and nitzanit on the northern edge of the gaza strip they were killed by shell fire from tank the sources said during the fire fight an israeli army observation post called in tank fire which killed the three gunmen the killing brings the death toll of months of the palestinian uprising against israeli occupation to people including palestinians and israelis alei sinai was attacked on october when two gunmen from the radical islamic group hamas infiltrated the settlement and opened fire on the residents killing two teenage israelis two palestinian gunmen also killed an israeli settler in alei sinai on december before being killed by the army»

Notice above that the most similar document is has a similarity score of ~80% (or higher). However, the similarity score for the second ranked documents should be significantly lower (assuming the documents are in fact different) and the reasoning becomes obvious when we examine the text itself


In [14]:
# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(train_corpus))

# Compare and print the most/median/least similar documents from the train corpus
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))


Train Document (193): «new study shows that nearly one third of the aboriginal and torres strait islander population in australia have been arrested in the past five years the study conducted by the australian national university for the new south wales bureau of crime statistics is the first to compare the arrest rates of the aboriginal and non aboriginal population it finds that unemployment alcohol and assault rates were the main causes study author boyd hunter says policy both on community and government level must deal with these issues if the arrest rate is to be decreased addressing the supply of alcohol in remote communities is seen as the most likely avenue for reducing rates of abuse alcohol abuse and hence reduce arrest rates in those communities he said»

Similar Document (125, 0.5374129414558411): «the united states space shuttle endeavour has touched down at florida kennedy space centre after day mission bringing home crew that had been on the international space station since august the shuttle carrying outgoing space station commander frank culbertson and russian cosmonauts vladimir dezhurov and mikhail tyurin along with four other astronauts landed at pm local time taking over from the trio are russian commander yuri onufrienko and us astronauts carl walz and dan bursch who travelled to the station aboard endeavour on december earlier monday the seven us and russian astronauts on board endeavour woke up to the tune please come home for christmas by the rock group bon jovi on sunday the endeavour crew deployed small satellite called starshine from canister located in the shuttle payload bay more than students from countries will track the satellite as it orbits earth for the next eight months the students will collect information in order to calculate the density of the upper atmosphere nasa said on saturday endeavour undocked from the space station after making last minute maneuver to dodge piece of soviet era space refuse the endeavour mission the th shuttle trip to the international space station brought some three tonnes of equipment and materials for scientific experiments to the station the trip carried out under extremely tight security was the first since the september attacks on the united states»

Testing the Model

Using the same approach above, we'll infer the vector for a randomly chosen test document, and compare the document to our model by eye.


In [15]:
# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(test_corpus))
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))


Test Document (25): «an islamic high court in northern nigeria rejected an appeal today by single mother sentenced to be stoned to death for having sex out of wedlock clutching her baby daughter amina lawal burst into tears as the judge delivered the ruling lawal was first sentenced in march after giving birth to daughter more than nine months after divorcing»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/s,d50,hs,w8,mc2):

MOST (6, 0.44577494263648987): «the united states team of monica seles and jan michael gambill scored decisive victory over unseeded france in their first hopman cup match at burswood dome in perth the pair runners up in the million dollar mixed teams event last year both won their singles encounters to give the us an unbeatable lead the year old seles currently ranked eighth recovered from shaky start to overpower virginie razzano who is ranked nd seles had to fight hard to get home in straight sets winning in minutes then the year old gambill ranked st wore down determined arnaud clement th to win in minutes the americans are aiming to go one better than last year when they were beaten by swiss pair martina hingis and roger federer in the final of the eight nation contest gambill said the win was great way to start the tennis year got little tentative at the end but it was great start to my year he said arnaud is great scrapper and am delighted to beat him even though am frankly bit out of shape that is one of the reasons am here will be in shape by the end of the tournament just aim to keep improving in the new year and if do think have chance to beat anyone when am playing well gambill was pressed hard by clement before taking the first set in minutes but the american gained the ascendancy in the second set breaking in the third and fifth games seles said she had expected her clash with razzano to be tough she was top junior player in the world so it was no surprise that she fought so well she said seles said she still had the hunger to strive to regain her position at the top of her sport this is why you play she said but want to try not to peak too early this season seles slow into her stride slipped to in her opening set against razzano but recovered quickly claiming the set after snatching four games in row in the second set seles broke her opponent in the opening game and completed victory with relative ease despite razzano tenacious efforts»

MEDIAN (9, 0.07038116455078125): «some roads are closed because of dangerous conditions caused by bushfire smoke motorists are being asked to avoid the hume highway between picton road and the illawarra highway where police have reduced the speed limit from kilometres an hour to in southern sydney picton road is closed between wilton and bulli appin road is closed from appin to bulli tops and all access roads to royal national park are closed motorists are also asked to avoid the illawarra highway between the hume highway and robertson and the great western highway between penrith and springwood because of reduced visibility in north western sydney only local residents are allowed to use wisemans ferry road and upper color road under police escort»

LEAST (180, -0.4328160285949707): «australia has linked million of aid to new agreement with nauru to accept an extra asylum seekers the deal means nauru will take up to asylum seekers under australia pacific solution foreign minister alexander downer signed the understanding today with nauru president rene harris mr downer inspected the nauru camps and says they are are practical and efficient had good look at the sanitation the ablution blocks and thought they were pretty good he said the asylum seekers have various things to do there are volleyball facilities and soccer facilities television is available they can see different channels on tv the catering is good there are three meals day provided»

Wrapping Up

That's it! Doc2Vec is a great way to explore relationships between documents.