Using wrappers for Scikit learn API

This tutorial is about using gensim models as a part of your scikit learn workflow with the help of wrappers found at gensim.sklearn_integration

The wrappers available (as of now) are :

  • LdaModel (gensim.sklearn_integration.sklearn_wrapper_gensim_ldaModel.SklLdaModel),which implements gensim's LDA Model in a scikit-learn interface

  • LsiModel (gensim.sklearn_integration.sklearn_wrapper_gensim_lsiModel.SklLsiModel),which implements gensim's LSI Model in a scikit-learn interface

  • RpModel (gensim.sklearn_integration.sklearn_wrapper_gensim_rpmodel.SklRpModel),which implements gensim's Random Projections Model in a scikit-learn interface

  • LDASeq Model (gensim.sklearn_integration.sklearn_wrapper_gensim_lsiModel.SklLdaSeqModel),which implements gensim's LdaSeqModel in a scikit-learn interface

LDA Model

To use LdaModel begin with importing LdaModel wrapper


In [1]:
from gensim.sklearn_integration import SklLdaModel


Using TensorFlow backend.

Next we will create a dummy set of texts and convert it into a corpus


In [2]:
from gensim.corpora import Dictionary
texts = [
    ['complier', 'system', 'computer'],
    ['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],
    ['graph', 'flow', 'network', 'graph'],
    ['loading', 'computer', 'system'],
    ['user', 'server', 'system'],
    ['tree', 'hamiltonian'],
    ['graph', 'trees'],
    ['computer', 'kernel', 'malfunction', 'computer'],
    ['server', 'system', 'computer']
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

Then to run the LdaModel on it


In [3]:
model = SklLdaModel(num_topics=2, id2word=dictionary, iterations=20, random_state=1)
model.fit(corpus)
model.transform(corpus)


WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
Out[3]:
array([[ 0.85275314,  0.14724686],
       [ 0.12390183,  0.87609817],
       [ 0.4612995 ,  0.5387005 ],
       [ 0.84924177,  0.15075823],
       [ 0.49180096,  0.50819904],
       [ 0.40086923,  0.59913077],
       [ 0.28454427,  0.71545573],
       [ 0.88776198,  0.11223802],
       [ 0.84210373,  0.15789627]])

Integration with Sklearn

To provide a better example of how it can be used with Sklearn, Let's use CountVectorizer method of sklearn. For this example we will use 20 Newsgroups data set. We will only use the categories rec.sport.baseball and sci.crypt and use it to generate topics.


In [4]:
import numpy as np
from gensim import matutils
from gensim.models.ldamodel import LdaModel
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from gensim.sklearn_integration.sklearn_wrapper_gensim_ldamodel import SklLdaModel

In [5]:
rand = np.random.mtrand.RandomState(1) # set seed for getting same result
cats = ['rec.sport.baseball', 'sci.crypt']
data = fetch_20newsgroups(subset='train', categories=cats, shuffle=True)

Next, we use countvectorizer to convert the collection of text documents to a matrix of token counts.


In [6]:
vec = CountVectorizer(min_df=10, stop_words='english')

X = vec.fit_transform(data.data)
vocab = vec.get_feature_names()  # vocab to be converted to id2word 

id2word = dict([(i, s) for i, s in enumerate(vocab)])

Next, we just need to fit X and id2word to our Lda wrapper.


In [7]:
obj = SklLdaModel(id2word=id2word, num_topics=5, iterations=20)
lda = obj.fit(X)


WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy

In [8]:
from sklearn.model_selection import GridSearchCV
from gensim.models.coherencemodel import CoherenceModel

In [9]:
def scorer(estimator, X, y=None):
    goodcm = CoherenceModel(model=estimator.gensim_model, texts= texts, dictionary=estimator.gensim_model.id2word, coherence='c_v')
    return goodcm.get_coherence()

In [10]:
obj = SklLdaModel(id2word=dictionary, num_topics=5, iterations=20)
parameters = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}
model = GridSearchCV(obj, parameters, scoring=scorer, cv=5)
model.fit(corpus)


WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
Out[10]:
GridSearchCV(cv=5, error_score='raise',
       estimator=SklLdaModel(alpha='symmetric', chunksize=2000, decay=0.5, eta=None,
      eval_every=10, gamma_threshold=0.001,
      id2word=<gensim.corpora.dictionary.Dictionary object at 0x7ffa190461d0>,
      iterations=20, minimum_probability=0.01, num_topics=5, offset=1.0,
      passes=1, random_state=None, update_every=1),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=<function scorer at 0x7ffa1756f6e0>, verbose=0)

In [11]:
model.best_params_


Out[11]:
{'iterations': 20, 'num_topics': 10}

Example of Using Pipeline


In [12]:
from sklearn.pipeline import Pipeline
from sklearn import linear_model

def print_features_pipe(clf, vocab, n=10):
    ''' Better printing for sorted list '''
    coef = clf.named_steps['classifier'].coef_[0]
    print coef
    print 'Positive features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[::-1][:n] if coef[j] > 0]))
    print 'Negative features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[:n] if coef[j] < 0]))

In [13]:
id2word = Dictionary([_.split() for _ in data.data])
corpus = [id2word.doc2bow(i.split()) for i in data.data]

In [14]:
model = SklLdaModel(num_topics=15, id2word=id2word, iterations=10, random_state=37)
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline((('features', model,), ('classifier', clf)))
pipe.fit(corpus, data.target)
print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, data.target))


WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
[-0.91085778 -0.48036135 -0.41265981 -0.66310168 -0.01339967 -0.12794711
  0.01611456  0.15208847  0.21579624  0.25621594  0.54796235  0.43618653
  0.56767608  0.39267377  0.27554429]
Positive features: internet...:0.57 hanging:0.55 trawling:0.44 dome.:0.39 >Pat:0.28 red@redpoll.neoucom.edu:0.26 circuitry:0.22 *best*:0.15 01101001B:0.02
Negative features: Fame.:-0.91 Keach:-0.66 considered,:-0.48 Fame,:-0.41 comp.org.eff.talk.:-0.13 comp.org.eff.talk,:-0.01
0.640939597315

LSI Model

To use LsiModel begin with importing LsiModel wrapper


In [15]:
from gensim.sklearn_integration import SklLsiModel

Example of Using Pipeline


In [17]:
model = SklLsiModel(num_topics=15, id2word=id2word)
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline((('features', model,), ('classifier', clf)))
pipe.fit(corpus, data.target)
print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, data.target))


[ 0.13650375 -0.00382155 -0.0264042  -0.08478659 -0.02379243 -0.6006137
  1.07099917  0.03998737  0.43831279 -0.54905248  0.20204591 -0.2185433
 -1.3051437  -0.08704868  0.17599105]
Positive features: 01101001B:1.07 circuitry:0.44 hanging:0.20 >Pat:0.18 Fame.:0.14 *best*:0.04
Negative features: internet...:-1.31 comp.org.eff.talk.:-0.60 red@redpoll.neoucom.edu:-0.55 trawling:-0.22 dome.:-0.09 Keach:-0.08 Fame,:-0.03 comp.org.eff.talk,:-0.02 considered,:-0.00
0.865771812081

Random Projections Model

To use RpModel begin with importing RpModel wrapper


In [18]:
from gensim.sklearn_integration import SklRpModel

Example of Using Pipeline


In [19]:
model = SklRpModel(num_topics=2)
np.random.mtrand.RandomState(1)  # set seed for getting same result
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline((('features', model,), ('classifier', clf)))
pipe.fit(corpus, data.target)
print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, data.target))


[-0.00071555  0.00913274]
Positive features: considered,:0.01
Negative features: Fame.:-0.00
0.543624161074

LDASeq Model

To use LdaSeqModel begin with importing LdaSeqModel wrapper


In [20]:
from gensim.sklearn_integration import SklLdaSeqModel

Example of Using Pipeline


In [21]:
test_data = data.data[0:2]
test_target = data.target[0:2]
id2word = Dictionary(map(lambda x: x.split(), test_data))
corpus = [id2word.doc2bow(i.split()) for i in test_data]

model = SklLdaSeqModel(id2word=id2word, num_topics=2, time_slice=[1, 1, 1], initialize='gensim')
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline((('features', model,), ('classifier', clf)))
pipe.fit(corpus, test_target)
print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, test_target))


[ 0.04877324 -0.04877324]
Positive features: What:0.05
Negative features: NLCS:-0.05
1.0

In [ ]: