doc2vec

This is experimental code developed by Tomas Mikolov and shared in the word2vec Google group: https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ

This is not yet available on PyPI; you need the latest master branch from git.

The input format for doc2vec is still one big text document, but every line should be one document prepended with a unique id, for example:

_*0 This is sentence 1
_*1 This is sentence 2

Requirements

  1. nltk, with the punkt tokenizer models (import nltk; nltk.download('punkt'))
  2. Download some data: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
  3. Untar that data: tar -xvf aclImdb_v1.tar.gz

Preprocess

Merge the data into one big document with an id per line and do some basic preprocessing: word tokenization.


In [1]:
from __future__ import unicode_literals

In [2]:
import os
import nltk

In [3]:
directories = ['train/pos', 'train/neg', 'test/pos', 'test/neg', 'train/unsup']

In [4]:
input_file = open('/Users/drodriguez/Downloads/alldata.txt', 'w')

In [5]:
id_ = 0
for directory in directories:
    rootdir = os.path.join('/Users/drodriguez/Downloads/aclImdb', directory)
    for subdir, dirs, files in os.walk(rootdir):
        for file_ in files:
            with open(os.path.join(subdir, file_), 'r') as f:
                # Give every document a unique id: _*0, _*1, ...
                doc_id = '_*%i' % id_
                id_ += 1

                # Tokenize, lowercase and drop non-ascii characters
                # (Python 2: the file contents are bytes, so decode first)
                text = f.read()
                text = text.decode('utf-8')
                tokens = nltk.word_tokenize(text)
                doc = ' '.join(tokens).lower()
                doc = doc.encode('ascii', 'ignore')
                input_file.write('%s %s\n' % (doc_id, doc))

In [6]:
input_file.close()
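
As a quick sanity check we can peek at the start of the merged file; each line should begin with an id like _*0 (a minimal check, assuming the same paths as above):

In [ ]:
with open('/Users/drodriguez/Downloads/alldata.txt') as f:
    print(f.readline()[:60])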

doc2vec


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import word2vec

In [3]:
word2vec.doc2vec('/Users/drodriguez/Downloads/alldata.txt', '/Users/drodriguez/Downloads/vectors.bin', cbow=0, size=100, window=10, negative=5, hs=0, sample='1e-4', threads=12, iter_=20, min_count=1, verbose=True)


Starting training using file /Users/drodriguez/Downloads/alldata.txt
Vocab size: 355046
Words in train file: 28300990
Alpha: 0.000002  Progress: 100.01%  Words/thread/sec: 92.57k  
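
The parameters mirror the command-line flags of the underlying C program (cbow, size, window, negative, hs, sample, and so on). To see the arguments the Python wrapper accepts you can use the built-in help:

In [ ]:
help(word2vec.doc2vec)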

Predictions

It is possible to load the vectors using the same WordVectors class as for a regular word2vec binary file.


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import word2vec

In [3]:
model = word2vec.load('/Users/drodriguez/Downloads/vectors.bin')

In [5]:
model.vectors.shape


Out[5]:
(355046, 100)

The document vectors are identified by the ids we used in the preprocessing section; for example, document 1 has this vector:


In [7]:
model['_*1']


Out[7]:
array([-0.09961915, -0.02504692, -0.00935447, -0.06283784,  0.0496435 ,
        0.08408608,  0.07249928,  0.02332756,  0.04755085,  0.14972655,
       -0.13693954,  0.01296212, -0.05615778,  0.05408363, -0.01922397,
        0.00903398,  0.11205222,  0.02491457,  0.04302743, -0.06734619,
       -0.2004746 , -0.10970256, -0.04777983, -0.05336951, -0.10399633,
       -0.06500414,  0.0393892 , -0.08285502,  0.05692215,  0.01362013,
        0.0013779 , -0.24589944, -0.16099831, -0.11000603, -0.08007748,
       -0.05447361,  0.10116527,  0.06073807,  0.00416331,  0.00434075,
       -0.02536621,  0.12531835, -0.0312396 , -0.03754066,  0.10542928,
       -0.01937485,  0.03270554,  0.03367785, -0.31589472,  0.00840659,
       -0.09368768, -0.11164349, -0.02970047, -0.11497822,  0.06357043,
       -0.16664146,  0.02935979,  0.25292322,  0.01335857, -0.19644944,
        0.08630948,  0.05118916, -0.08062234,  0.03329093, -0.13994266,
        0.07419056, -0.0284326 , -0.04101218, -0.01186225,  0.10280388,
        0.00699921,  0.07681306,  0.0986157 , -0.06155488, -0.17678751,
       -0.0433546 , -0.1698599 ,  0.00764652, -0.11591533, -0.12973167,
       -0.01140277, -0.02404138, -0.06018848, -0.02115276, -0.14684282,
       -0.18135296, -0.03216174, -0.02125036,  0.2539596 ,  0.16910006,
       -0.11961638,  0.03961169,  0.07747228,  0.02761923,  0.07856126,
        0.06564176,  0.05922704,  0.10623101,  0.04387141, -0.14101151])
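
Since these are ordinary numpy vectors, we can also compare documents by hand; a minimal sketch computing the cosine similarity explicitly (without assuming the vectors are normalized):

In [ ]:
import numpy as np

# Cosine similarity between document 1 and document 2, computed by hand
v1, v2 = model['_*1'], model['_*2']
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))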

We can ask for the words or documents most similar to document 1:


In [10]:
indexes, metrics = model.cosine('_*1')

In [11]:
model.generate_response(indexes, metrics).tolist()


Out[11]:
[(u'houselessness', 0.9697854490765448),
 (u'_*62909', 0.8435915200187546),
 (u'_*92297', 0.8382383325156331),
 (u'_*62902', 0.8354628520801568),
 (u'_*31249', 0.8321578405132342),
 (u'_*20758', 0.8302829157776485),
 (u'_*12342', 0.8263513274559964),
 (u'_*32435', 0.8237210585123108),
 (u'_*67836', 0.823590134267539),
 (u'_*31245', 0.8230957438394273)]

Now it is just a matter of matching the ids to the data created in the preprocessing step.
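
For example, a minimal helper (hypothetical, not part of the package) that scans alldata.txt for a given id and returns the document text:

In [ ]:
def get_document(doc_id, path='/Users/drodriguez/Downloads/alldata.txt'):
    # Scan the merged file for the line starting with the given id
    prefix = doc_id + ' '
    with open(path) as f:
        for line in f:
            if line.startswith(prefix):
                return line[len(prefix):].strip()
    return None  # id not found

get_document('_*62909')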

