Using dstoolbox models

Imports


In [1]:
import numpy as np

In [2]:
from dstoolbox.models import W2VClassifier

Text

W2VClassifier

Mock the fit method so that we don't depend on an external file.


In [3]:
word2idx = {'herren': 0, 'damen': 1, 'stiefel': 2, 'rock': 3}
word_embeddings = np.array([
    [0.1, 0.1, 0.1],
    [10.0, 10.0, 10.1],
    [-0.1, -0.1, -0.1],
    [-0.1, -0.1, 0.1],
])

def mock_fit(self, X=None, y=None):
    # Mocked fit: install the small vocabulary and embeddings defined
    # above instead of loading a word2vec file from disk.
    from dstoolbox.utils import normalize_matrix
    self.word2idx = word2idx
    idx2word = {val: key for key, val in word2idx.items()}
    self.classes_ = np.array([idx2word[i] for i in range(len(idx2word))])
    self.idx2word = idx2word
    # Store the row-normalized embedding matrix.
    self.syn0 = normalize_matrix(word_embeddings)
    return self

In [4]:
print(word2idx)
print(word_embeddings)


{'herren': 0, 'rock': 3, 'damen': 1, 'stiefel': 2}
[[  0.1   0.1   0.1]
 [ 10.   10.   10.1]
 [ -0.1  -0.1  -0.1]
 [ -0.1  -0.1   0.1]]

In [5]:
setattr(W2VClassifier, 'fit', mock_fit)

In [6]:
clf = W2VClassifier('a/path', topn=3).fit()

Using the most_similar method

The most_similar method works similarly to the gensim method of the same name but does not support negative terms. It accepts a single word or a list of words; passing several positive words at once is demonstrated further below.


In [7]:
clf.classes_


Out[7]:
array(['herren', 'damen', 'stiefel', 'rock'], 
      dtype='<U7')

In [8]:
clf.most_similar('herren')


Out[8]:
[('damen', 0.99999448138848246),
 ('rock', 0.33333333333333337),
 ('stiefel', 0.0)]

In [9]:
clf.most_similar(['damen'])


Out[9]:
[('herren', 0.99999448138848246),
 ('rock', 0.33554998784897083),
 ('stiefel', 5.5186115175409611e-06)]

In [10]:
clf.most_similar('rock')


Out[10]:
[('stiefel', 0.66666666666666663),
 ('damen', 0.33554998784897083),
 ('herren', 0.33333333333333337)]
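
The returned scores are consistent with cosine similarity rescaled from [-1, 1] to [0, 1]; note that this scaling is inferred from the outputs above, not taken from the documentation. A minimal sketch reproducing the 'herren'/'rock' score by hand:

# Reproduce the most_similar score for 'herren' vs. 'rock'.
# Assumption (inferred from the outputs above): score = (1 + cos) / 2.
herren = word_embeddings[word2idx['herren']]
rock = word_embeddings[word2idx['rock']]
cos = np.dot(herren, rock) / (np.linalg.norm(herren) * np.linalg.norm(rock))
print((1 + cos) / 2)  # ~0.3333..., matching Out[8] and Out[10]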

Using the predict method

The predict method works similarly to what you would expect from an sklearn classifier, except that it returns class indices. The classes corresponding to those indices can be found in the classes_ attribute.

A predict_proba method does not exist, since it is not well defined for this case.


In [11]:
clf.predict(['herren', 'damen', 'rock'])


Out[11]:
array([1, 0, 2])

In [12]:
clf.classes_[clf.predict(['herren', 'damen', 'rock'])]


Out[12]:
array(['damen', 'herren', 'stiefel'], 
      dtype='<U7')
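
If score-like values are needed despite that, one workaround is to derive them from the kneighbors method introduced in the next section. The helper below is hypothetical (not part of the dstoolbox API) and assumes distance = 1 - similarity, which the outputs below support:

# Hypothetical helper, not part of the dstoolbox API: pair each neighbor
# with a similarity score, assuming distance = 1 - similarity.
def neighbor_scores(clf, words):
    indices, distances = clf.kneighbors(words)  # indices come first here
    return clf.classes_[indices], 1 - distances

For example, neighbor_scores(clf, ['herren']) would return the neighboring words alongside scores matching those produced by most_similar.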

The kneighbors method

Similarly to KNeighborsClassifier and other sklearn neighbors estimators, W2VClassifier supports the kneighbors method.


In [13]:
clf.kneighbors(['herren', 'rock'], return_distance=False)


Out[13]:
array([[1, 3, 2],
       [2, 1, 0]])

In [14]:
clf.kneighbors(['herren', 'rock'])


Out[14]:
(array([[1, 3, 2],
        [2, 1, 0]]),
 array([[  5.51861152e-06,   6.66666667e-01,   1.00000000e+00],
        [  3.33333333e-01,   6.64450012e-01,   6.66666667e-01]]))
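
Note that the tuple is returned as (indices, distances), the reverse of sklearn's convention. As with predict, the indices can be mapped back to words through classes_; a quick sketch:

# Map the neighbor indices back to their words via classes_.
indices, _ = clf.kneighbors(['herren', 'rock'])
print(clf.classes_[indices])
# Given Out[13], this should print
# [['damen' 'rock' 'stiefel']
#  ['stiefel' 'damen' 'herren']]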

most_similar can be called with multiple positive words


In [15]:
clf.most_similar(['herren', 'rock'])


Out[15]:
[('damen', 0.79059003445900067),
 ('herren', 0.78867513459481287),
 ('rock', 0.78867513459481287),
 ('stiefel', 0.21132486540518713)]

The new search results in an update of the dictionary; the combined query can thus be retrieved at a later point in time.


In [16]:
clf.classes_


Out[16]:
array(['herren', 'damen', 'stiefel', 'rock', 'herren rock'], 
      dtype='<U11')

In [17]:
clf.most_similar('rock')


Out[17]:
[('herren rock', 0.78867513459481287),
 ('stiefel', 0.66666666666666663),
 ('damen', 0.33554998784897083),
 ('herren', 0.33333333333333337)]
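
The scores in Out[15] suggest that the combined query is embedded as the normalized sum of the individual normalized word vectors. That is an inference from the numbers above, not documented behavior; a sketch reproducing the 'herren' score under that assumption:

# Assumption (inferred from Out[15]): a multi-word query is embedded as
# the normalized sum of the normalized word vectors.
def normalize(v):
    return v / np.linalg.norm(v)

combined = normalize(
    normalize(word_embeddings[word2idx['herren']])
    + normalize(word_embeddings[word2idx['rock']])
)
cos = np.dot(combined, normalize(word_embeddings[word2idx['herren']]))
print((1 + cos) / 2)  # ~0.7887, matching the 'herren' entry in Out[15]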