doc2vec

This is an experimental code developed by Tomas Mikolov found and published on the word2vec Google group.

The input format for doc2vec is still one big text document but every line should be one document prepended with an unique id, for example:

_*0 This is sentence 1
_*1 This is sentence 2

Requirements

This notebook requires nltk

Download some data: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz.

You could use make test-data from the root of the repo.

Preprocess

Merge data into one big document with an id per line and do some basic preprocessing: word tokenizer.


In [1]:
import os
import nltk

In [2]:
directories = ['train/pos', 'train/neg', 'test/pos', 'test/neg', 'train/unsup']

In [3]:
input_file = open('../data/alldata.txt', 'w')

In [4]:
id_ = 0
for directory in directories:
    rootdir = os.path.join('../data/aclImdb', directory)
    for subdir, dirs, files in os.walk(rootdir):
        for file_ in files:
            with open(os.path.join(subdir, file_), "r") as f:
                doc_id = "_*%i" % id_
                id_ = id_ + 1

                text = f.read()
                text = text
                tokens = nltk.word_tokenize(text)
                doc = " ".join(tokens).lower()
                doc = doc.encode("ascii", "ignore")
                input_file.write("%s %s\n" % (doc_id, doc))

In [5]:
input_file.close()

doc2vec


In [ ]:
%load_ext autoreload
%autoreload 2

In [2]:
import word2vec

In [ ]:
word2vec.doc2vec('../data/alldata.txt', '../data/doc2vec-vectors.bin', cbow=0, size=100, window=10, negative=5,
                 hs=0, sample='1e-4', threads=12, iter_=20, min_count=1, binary=True, verbose=True)

Prediction

Is possible to load the vectors using the same wordvectors class as a regular word2vec binary file.


In [ ]:
%load_ext autoreload
%autoreload 2

In [2]:
import word2vec

In [3]:
model = word2vec.load('../data/doc2vec-vectors.bin')

In [4]:
model.vectors.shape


Out[4]:
(363652, 100)

The documents vector are going to be identified by the id we used in the preprocesing section, for example document 1 is going to have vector:


In [6]:
model['_*1']


Out[6]:
array([ 0.15458781,  0.05329125, -0.03372733, -0.03253617, -0.00038198,
        0.01859742,  0.05822669,  0.08649707,  0.07116864,  0.01571589,
        0.11401868,  0.05802745, -0.03446389, -0.05136882, -0.03741959,
        0.03825757,  0.03826223,  0.06828724,  0.05144465, -0.02519033,
       -0.02553019, -0.13064705, -0.01241815,  0.06572731,  0.0361409 ,
        0.06780364, -0.11406382,  0.03698812,  0.01645933,  0.04006057,
       -0.06411161,  0.10451728, -0.09016737,  0.0612823 ,  0.05084847,
        0.05336419,  0.13584702, -0.05651987,  0.01329237, -0.20875289,
       -0.08468074,  0.10479708,  0.05787989,  0.13176955, -0.01580021,
        0.07367507, -0.09352259, -0.13971043, -0.24871282, -0.14094608,
        0.04577528,  0.12431064, -0.05354084,  0.14227368, -0.11196741,
        0.03350607, -0.01999633, -0.07860398,  0.02664598, -0.0592974 ,
       -0.07686727, -0.05597671, -0.06345078, -0.27290845, -0.06400879,
        0.00619202, -0.01699482, -0.05217196,  0.17153341,  0.12772463,
       -0.02238818,  0.09656674,  0.09250097, -0.18347315, -0.15046608,
       -0.04235512, -0.15838784,  0.03597225, -0.01584246,  0.00568255,
        0.0409179 ,  0.02537416,  0.06123869,  0.0961483 , -0.12116033,
        0.07526031,  0.0303142 , -0.01561208,  0.36425424, -0.16918449,
        0.07423471,  0.14127608,  0.00530258, -0.10363475,  0.00953067,
       -0.04992943,  0.1713592 ,  0.21640711, -0.00298326,  0.05985325])

We can ask for similarity words or documents on document 1


In [10]:
indexes, metrics = model.similar('_*1')

In [11]:
model.generate_response(indexes, metrics).tolist()


Out[11]:
[('_*2351', 0.8519178427298584),
 ('imagery.an', 0.8397190257038456),
 ('_*58565', 0.8342486893807792),
 ('_*35569', 0.8274130976031872),
 ('_*79830', 0.8268153781216658),
 ('_*37088', 0.8245519264554979),
 ('b"witness', 0.8239716809480113),
 ("b'ghost", 0.822800405312356),
 ("l'anticristo", 0.8222915811568171),
 ('_*38763', 0.8189633692215181)]

Now its we just need to matching the id to the data created on the preprocessing step


In [ ]: