Using dstoolbox models

Imports


In [1]:
import numpy as np

In [2]:
from dstoolbox.models import W2VClassifier

Text

W2VClassifier

Mock the fit method so that we don't depend on an external file.


In [3]:
word2idx = {'herren': 0, 'damen': 1, 'stiefel': 2, 'rock': 3}
word_embeddings = np.array([
    [0.1, 0.1, 0.1],
    [10.0, 10.0, 10.1],
    [-0.1, -0.1, -0.1],
    [-0.1, -0.1, 0.1],
])

def mock_fit(self, X=None, y=None):
    # Mocked fit: install the small vocabulary and embeddings defined
    # above instead of loading a word2vec file from disk.
    from dstoolbox.utils import normalize_matrix
    self.word2idx = word2idx
    idx2word = {val: key for key, val in word2idx.items()}
    self.classes_ = np.array([idx2word[i] for i in range(len(idx2word))])
    self.idx2word = idx2word
    # Store the row-normalized embedding matrix.
    self.syn0 = normalize_matrix(word_embeddings)
    return self

In [4]:
print(word2idx)
print(word_embeddings)


{'herren': 0, 'rock': 3, 'damen': 1, 'stiefel': 2}
[[  0.1   0.1   0.1]
 [ 10.   10.   10.1]
 [ -0.1  -0.1  -0.1]
 [ -0.1  -0.1   0.1]]

In [5]:
setattr(W2VClassifier, 'fit', mock_fit)

In [6]:
clf = W2VClassifier('a/path', topn=3).fit()

Using the most_similar method

The most_similar method works similarly to the gensim method of the same name but does not support negative terms. It accepts a single word or a list of words; passing several positive words at once is demonstrated further below.


In [7]:
clf.classes_


Out[7]:
array(['herren', 'damen', 'stiefel', 'rock'], 
      dtype='<U7')

In [8]:
clf.most_similar('herren')


Out[8]:
[('damen', 0.99999448138848246),
 ('rock', 0.33333333333333337),
 ('stiefel', 0.0)]

In [9]:
clf.most_similar(['damen'])


Out[9]:
[('herren', 0.99999448138848246),
 ('rock', 0.33554998784897083),
 ('stiefel', 5.5186115175409611e-06)]

In [10]:
clf.most_similar('rock')


Out[10]:
[('stiefel', 0.66666666666666663),
 ('damen', 0.33554998784897083),
 ('herren', 0.33333333333333337)]
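
The returned scores are consistent with cosine similarity rescaled from [-1, 1] to [0, 1]; note that this scaling is inferred from the outputs above, not taken from the documentation. A minimal sketch reproducing the 'herren'/'rock' score by hand:

# Reproduce the most_similar score for 'herren' vs. 'rock'.
# Assumption (inferred from the outputs above): score = (1 + cos) / 2.
herren = word_embeddings[word2idx['herren']]
rock = word_embeddings[word2idx['rock']]
cos = np.dot(herren, rock) / (np.linalg.norm(herren) * np.linalg.norm(rock))
print((1 + cos) / 2)  # ~0.3333..., matching Out[8] and Out[10]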

Using the predict method

The predict method works similarly to what you would expect from an sklearn classifier, except that it returns class indices. The classes corresponding to those indices can be found in the classes_ attribute.

A predict_proba method does not exist, since it is not well defined for this case.


In [11]:
clf.predict(['herren', 'damen', 'rock'])


Out[11]:
array([1, 0, 2])

In [12]:
clf.classes_[clf.predict(['herren', 'damen', 'rock'])]


Out[12]:
array(['damen', 'herren', 'stiefel'], 
      dtype='<U7')
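
If score-like values are needed despite that, one workaround is to derive them from the kneighbors method introduced in the next section. The helper below is hypothetical (not part of the dstoolbox API) and assumes distance = 1 - similarity, which the outputs below support:

# Hypothetical helper, not part of the dstoolbox API: pair each neighbor
# with a similarity score, assuming distance = 1 - similarity.
def neighbor_scores(clf, words):
    indices, distances = clf.kneighbors(words)  # indices come first here
    return clf.classes_[indices], 1 - distances

For example, neighbor_scores(clf, ['herren']) would return the neighboring words alongside scores matching those produced by most_similar.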

The kneighbors method

Similarly to KNeighborsClassifier and other sklearn neighbors estimators, W2VClassifier supports the kneighbors method.


In [13]:
clf.kneighbors(['herren', 'rock'], return_distance=False)


Out[13]:
array([[1, 3, 2],
       [2, 1, 0]])

In [14]:
clf.kneighbors(['herren', 'rock'])


Out[14]:
(array([[1, 3, 2],
        [2, 1, 0]]),
 array([[  5.51861152e-06,   6.66666667e-01,   1.00000000e+00],
        [  3.33333333e-01,   6.64450012e-01,   6.66666667e-01]]))
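
Note that the tuple is returned as (indices, distances), the reverse of sklearn's convention. As with predict, the indices can be mapped back to words through classes_; a quick sketch:

# Map the neighbor indices back to their words via classes_.
indices, _ = clf.kneighbors(['herren', 'rock'])
print(clf.classes_[indices])
# Given Out[13], this should print
# [['damen' 'rock' 'stiefel']
#  ['stiefel' 'damen' 'herren']]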

most_similar can be called with multiple positive words


In [15]:
clf.most_similar(['herren', 'rock'])


Out[15]:
[('damen', 0.79059003445900067),
 ('herren', 0.78867513459481287),
 ('rock', 0.78867513459481287),
 ('stiefel', 0.21132486540518713)]

The new search results in an update of the dictionary; the combined query can thus be retrieved at a later point in time.


In [16]:
clf.classes_


Out[16]:
array(['herren', 'damen', 'stiefel', 'rock', 'herren rock'], 
      dtype='<U11')

In [17]:
clf.most_similar('rock')


Out[17]:
[('herren rock', 0.78867513459481287),
 ('stiefel', 0.66666666666666663),
 ('damen', 0.33554998784897083),
 ('herren', 0.33333333333333337)]
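
The scores in Out[15] suggest that the combined query is embedded as the normalized sum of the individual normalized word vectors. That is an inference from the numbers above, not documented behavior; a sketch reproducing the 'herren' score under that assumption:

# Assumption (inferred from Out[15]): a multi-word query is embedded as
# the normalized sum of the normalized word vectors.
def normalize(v):
    return v / np.linalg.norm(v)

combined = normalize(
    normalize(word_embeddings[word2idx['herren']])
    + normalize(word_embeddings[word2idx['rock']])
)
cos = np.dot(combined, normalize(word_embeddings[word2idx['herren']]))
print((1 + cos) / 2)  # ~0.7887, matching the 'herren' entry in Out[15]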