
This is experimental code developed by Tomas Mikolov, originally posted in the word2vec Google group.

This is not yet available on PyPI; you need the latest master branch from git.

The input format for doc2vec is still one big text document, but every line should be one document prefixed with a unique id, for example:

_*0 This is sentence 1
_*1 This is sentence 2
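
As a minimal sketch of this format, a line can be built from an id and a text, and split back again, roughly as follows. The helper names `format_line` and `parse_line` are illustrative assumptions and are not part of the word2vec package.

```python
def format_line(doc_id, text):
    """Build one input line: '_*<id>' followed by the text on a single line."""
    # Collapse all whitespace (including newlines) so each document stays on one line
    return '_*%d %s' % (doc_id, ' '.join(text.split()))


def parse_line(line):
    """Split an input line back into (doc_id, text)."""
    label, _, text = line.rstrip('\n').partition(' ')
    if not label.startswith('_*'):
        raise ValueError('line does not start with a document id: %r' % line)
    return int(label[2:]), text


print(format_line(0, 'This is sentence 1'))    # _*0 This is sentence 1
print(parse_line('_*1 This is sentence 2\n'))  # (1, 'This is sentence 2')
```

Collapsing whitespace matters because a stray newline inside a document would otherwise start a new line without an id.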


  1. Install nltk
  2. Download some data:
  3. Untar that data: tar -xvf aclImdb_v1.tar.gz


Merge the data into one big document with an id per line and do some basic preprocessing (word tokenization and lowercasing).

from __future__ import unicode_literals

import os
import nltk

directories = ['train/pos', 'train/neg', 'test/pos', 'test/neg', 'train/unsup']

input_file = open('/Users/drodriguez/Downloads/alldata.txt', 'w')

id_ = 0
for directory in directories:
    rootdir = os.path.join('/Users/drodriguez/Downloads/aclImdb', directory)
    for subdir, dirs, files in os.walk(rootdir):
        for file_ in files:
            with open(os.path.join(subdir, file_), 'r') as f:
                doc_id = '_*%i' % id_
                id_ = id_ + 1

                text = f.read()
                text = text.decode('utf-8')
                tokens = nltk.word_tokenize(text)
                doc = ' '.join(tokens).lower()
                doc = doc.encode('ascii', 'ignore')
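                input_file.write('%s %s\n' % (doc_id, doc))

# Close the output file so all buffered lines are flushed to disk before training
input_file.close()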

%load_ext autoreload
%autoreload 2

import word2vec

word2vec.doc2vec('/Users/drodriguez/Downloads/alldata.txt', '/Users/drodriguez/Downloads/vectors.bin', cbow=0, size=100, window=10, negative=5, hs=0, sample='1e-4', threads=12, iter_=20, min_count=1, verbose=True)

Starting training using file /Users/drodriguez/Downloads/alldata.txt
Vocab size: 355046
Words in train file: 28300990
Alpha: 0.000002  Progress: 100.01%  Words/thread/sec: 92.57k  


It is possible to load the vectors using the same wordvectors class used for a regular word2vec binary file.

%load_ext autoreload
%autoreload 2

import word2vec

model = word2vec.load('/Users/drodriguez/Downloads/vectors.bin')

model.vectors.shape

(355046, 100)

The document vectors are identified by the ids we used in the preprocessing section. For example, document 1 has the vector:

model['_*1']

array([-0.09961915, -0.02504692, -0.00935447, -0.06283784,  0.0496435 ,
        0.08408608,  0.07249928,  0.02332756,  0.04755085,  0.14972655,
       -0.13693954,  0.01296212, -0.05615778,  0.05408363, -0.01922397,
        0.00903398,  0.11205222,  0.02491457,  0.04302743, -0.06734619,
       -0.2004746 , -0.10970256, -0.04777983, -0.05336951, -0.10399633,
       -0.06500414,  0.0393892 , -0.08285502,  0.05692215,  0.01362013,
        0.0013779 , -0.24589944, -0.16099831, -0.11000603, -0.08007748,
       -0.05447361,  0.10116527,  0.06073807,  0.00416331,  0.00434075,
       -0.02536621,  0.12531835, -0.0312396 , -0.03754066,  0.10542928,
       -0.01937485,  0.03270554,  0.03367785, -0.31589472,  0.00840659,
       -0.09368768, -0.11164349, -0.02970047, -0.11497822,  0.06357043,
       -0.16664146,  0.02935979,  0.25292322,  0.01335857, -0.19644944,
        0.08630948,  0.05118916, -0.08062234,  0.03329093, -0.13994266,
        0.07419056, -0.0284326 , -0.04101218, -0.01186225,  0.10280388,
        0.00699921,  0.07681306,  0.0986157 , -0.06155488, -0.17678751,
       -0.0433546 , -0.1698599 ,  0.00764652, -0.11591533, -0.12973167,
       -0.01140277, -0.02404138, -0.06018848, -0.02115276, -0.14684282,
       -0.18135296, -0.03216174, -0.02125036,  0.2539596 ,  0.16910006,
       -0.11961638,  0.03961169,  0.07747228,  0.02761923,  0.07856126,
        0.06564176,  0.05922704,  0.10623101,  0.04387141, -0.14101151])
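
Vectors like this one are usually compared by cosine similarity. The following is a rough, pure-Python sketch of that idea on made-up two-dimensional vectors. The function names and the toy data are assumptions for illustration, and the library's own implementation may differ in detail.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(target, vectors, n=10):
    """Rank the labels in `vectors` (a dict label -> vector) by similarity to `target`."""
    scores = [(label, cosine_similarity(vectors[target], vec))
              for label, vec in vectors.items() if label != target]
    # Highest similarity first, keep the top n
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]


# Toy 2-dimensional vectors, made up for illustration only
toy = {'_*0': [1.0, 0.0], '_*1': [1.0, 1.0], 'word': [0.0, 1.0], '_*2': [-1.0, 0.0]}
print(most_similar('_*0', toy, n=2))
```

Because words and document ids live in the same vector space, a ranking like this can return both kinds of labels.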

We can ask for the words or documents most similar to document 1:

indexes, metrics = model.cosine('_*1')

model.generate_response(indexes, metrics).tolist()

[(u'houselessness', 0.9697854490765448),
 (u'_*62909', 0.8435915200187546),
 (u'_*92297', 0.8382383325156331),
 (u'_*62902', 0.8354628520801568),
 (u'_*31249', 0.8321578405132342),
 (u'_*20758', 0.8302829157776485),
 (u'_*12342', 0.8263513274559964),
 (u'_*32435', 0.8237210585123108),
 (u'_*67836', 0.823590134267539),
 (u'_*31245', 0.8230957438394273)]
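
Note that the results mix ordinary words (such as 'houselessness') with document ids. A small sketch for separating the two, assuming the `_*` prefix is only ever used for document ids (the helper name `split_results` is hypothetical):

```python
def split_results(results, prefix='_*'):
    """Separate (label, score) pairs into document ids and ordinary words."""
    documents = [(label, score) for label, score in results if label.startswith(prefix)]
    words = [(label, score) for label, score in results if not label.startswith(prefix)]
    return documents, words


# First three rows of the output above
results = [('houselessness', 0.9697854490765448),
           ('_*62909', 0.8435915200187546),
           ('_*92297', 0.8382383325156331)]
documents, words = split_results(results)
print(documents)
print(words)
```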

Now it's just a matter of matching each id to the data created in the preprocessing step.
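
One possible way to do that matching is to scan the merged file once and pick out the lines whose id was returned. This is a sketch under the assumption that the file has the `_*<id> <text>` layout written during preprocessing; `load_documents` is a hypothetical helper, not a word2vec function.

```python
def load_documents(path, wanted_ids):
    """Return {doc_id: text} for the requested ids, scanning the merged file once."""
    wanted = set(wanted_ids)
    found = {}
    with open(path) as merged:
        for line in merged:
            # The id is everything before the first space, the text is the rest
            doc_id, _, text = line.rstrip('\n').partition(' ')
            if doc_id in wanted:
                found[doc_id] = text
                if len(found) == len(wanted):
                    break  # stop early once every requested id is found
    return found
```

For example, `load_documents('/Users/drodriguez/Downloads/alldata.txt', ['_*62909', '_*92297'])` would return the tokenized, lowercased text of those two documents, not the raw reviews.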

