This tutorial shows how to use gensim models as part of your scikit-learn workflow, with the help of the wrappers found in gensim.sklearn_integration.
The wrappers available (as of now) are:
LdaModel (gensim.sklearn_integration.sklearn_wrapper_gensim_ldamodel.SklLdaModel), which implements gensim's LDA model in a scikit-learn interface
LsiModel (gensim.sklearn_integration.sklearn_wrapper_gensim_lsimodel.SklLsiModel), which implements gensim's LSI model in a scikit-learn interface
RpModel (gensim.sklearn_integration.sklearn_wrapper_gensim_rpmodel.SklRpModel), which implements gensim's Random Projections model in a scikit-learn interface
LdaSeqModel (gensim.sklearn_integration.sklearn_wrapper_gensim_ldaseqmodel.SklLdaSeqModel), which implements gensim's LdaSeqModel in a scikit-learn interface
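Each wrapper exposes the standard scikit-learn estimator API (fit, transform, get_params/set_params), so it can be dropped into sklearn pipelines and model-selection tools such as GridSearchCV, as the examples below demonstrate.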
To use LdaModel, begin by importing the LdaModel wrapper:
In [1]:
from gensim.sklearn_integration import SklLdaModel
Next, we create a dummy set of texts and convert it into a corpus:
In [2]:
from gensim.corpora import Dictionary
texts = [
['compiler', 'system', 'computer'],
['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],
['graph', 'flow', 'network', 'graph'],
['loading', 'computer', 'system'],
['user', 'server', 'system'],
['tree', 'hamiltonian'],
['graph', 'trees'],
['computer', 'kernel', 'malfunction', 'computer'],
['server', 'system', 'computer']
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
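Each document in the corpus is now a bag-of-words: a list of (token_id, count) pairs, one entry per distinct token. A quick way to inspect this (the exact ids depend on the order in which the dictionary assigned them):
print(corpus[0])  # e.g. [(0, 1), (1, 1), (2, 1)] -- three distinct tokens, each appearing once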
Then we fit the LdaModel on the corpus and transform it into topic distributions:
In [3]:
model = SklLdaModel(num_topics=2, id2word=dictionary, iterations=20, random_state=1)
model.fit(corpus)
model.transform(corpus)
Out[3]:
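transform returns one row per document: its distribution over the num_topics topics. The fitted gensim model itself is available on the wrapper's gensim_model attribute, so the learned topics can also be inspected directly (a minimal sketch):
print(model.gensim_model.print_topics(num_topics=2, num_words=3))  # top words per topic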
To provide a better example of how the wrapper can be used with sklearn, let's use sklearn's CountVectorizer. For this example we will use the 20 Newsgroups dataset, restricted to the categories rec.sport.baseball and sci.crypt, and generate topics from it.
In [4]:
import numpy as np
from gensim import matutils
from gensim.models.ldamodel import LdaModel
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from gensim.sklearn_integration.sklearn_wrapper_gensim_ldamodel import SklLdaModel
In [5]:
np.random.seed(1)  # seed numpy's global RNG for reproducible results
cats = ['rec.sport.baseball', 'sci.crypt']
data = fetch_20newsgroups(subset='train', categories=cats, shuffle=True)
Next, we use CountVectorizer to convert the collection of text documents to a matrix of token counts.
In [6]:
vec = CountVectorizer(min_df=10, stop_words='english')
X = vec.fit_transform(data.data)
vocab = vec.get_feature_names()  # vocabulary, to be converted to id2word
id2word = dict(enumerate(vocab))  # map feature index -> word
Next, we just need to fit the LDA wrapper on X; passing id2word lets the topics be mapped back to words.
In [7]:
obj = SklLdaModel(id2word=id2word, num_topics=5, iterations=20)
lda = obj.fit(X)
In [8]:
from sklearn.model_selection import GridSearchCV
from gensim.models.coherencemodel import CoherenceModel
In [9]:
def scorer(estimator, X, y=None):
    # Score a fitted wrapper by the topic coherence (c_v) of its underlying gensim model
    goodcm = CoherenceModel(model=estimator.gensim_model, texts=texts, dictionary=estimator.gensim_model.id2word, coherence='c_v')
    return goodcm.get_coherence()
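As a quick sanity check, the scorer can be called directly on the wrapper we fitted on the dummy corpus earlier; the X argument is accepted for API compatibility but unused by the coherence computation:
print(scorer(model, corpus))  # coherence of the model from In [3] on the dummy texts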
In [10]:
obj = SklLdaModel(id2word=dictionary, num_topics=5, iterations=20)
parameters = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}
model = GridSearchCV(obj, parameters, scoring=scorer, cv=5)
model.fit(corpus)
Out[10]:
In [11]:
model.best_params_
Out[11]:
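best_params_ holds the parameter combination that achieved the highest mean coherence score across the cross-validation folds; the corresponding model, refitted on the full corpus, is available as model.best_estimator_.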
In [12]:
from sklearn.pipeline import Pipeline
from sklearn import linear_model
def print_features_pipe(clf, vocab, n=10):
    ''' Print the top n positive and negative features of the pipeline's classifier '''
    coef = clf.named_steps['classifier'].coef_[0]
    print(coef)
    print('Positive features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[::-1][:n] if coef[j] > 0])))
    print('Negative features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[:n] if coef[j] < 0])))
In [13]:
id2word = Dictionary([doc.split() for doc in data.data])
corpus = [id2word.doc2bow(doc.split()) for doc in data.data]
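Note that id2word and corpus now refer to the newsgroups data, replacing the dummy corpus built above.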
In [14]:
model = SklLdaModel(num_topics=15, id2word=id2word, iterations=10, random_state=37)
clf = linear_model.LogisticRegression(penalty='l2', C=0.1) # l2 penalty used
pipe = Pipeline([('features', model), ('classifier', clf)])
pipe.fit(corpus, data.target)
print_features_pipe(pipe, list(id2word.values()))
print(pipe.score(corpus, data.target))
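In this pipeline, the 'features' step turns each bag-of-words document into a dense vector of topic proportions, which the logistic regression then uses to separate the two newsgroups; the score printed is the training-set accuracy.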
To use LsiModel, begin by importing the LsiModel wrapper:
In [15]:
from gensim.sklearn_integration import SklLsiModel
In [17]:
model = SklLsiModel(num_topics=15, id2word=id2word)
clf = linear_model.LogisticRegression(penalty='l2', C=0.1) # l2 penalty used
pipe = Pipeline([('features', model), ('classifier', clf)])
pipe.fit(corpus, data.target)
print_features_pipe(pipe, list(id2word.values()))
print(pipe.score(corpus, data.target))
To use RpModel, begin by importing the RpModel wrapper:
In [18]:
from gensim.sklearn_integration import SklRpModel
In [19]:
model = SklRpModel(num_topics=2)
np.random.seed(1)  # seed numpy's global RNG for reproducible results
clf = linear_model.LogisticRegression(penalty='l2', C=0.1) # l2 penalty used
pipe = Pipeline([('features', model), ('classifier', clf)])
pipe.fit(corpus, data.target)
print_features_pipe(pipe, list(id2word.values()))
print(pipe.score(corpus, data.target))
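RpModel reduces dimensionality by projecting each document onto num_topics randomly drawn directions, which approximately preserves distances between documents at very low cost; the seed set above is intended to make the random projection reproducible.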
To use LdaSeqModel, begin by importing the LdaSeqModel wrapper:
In [20]:
from gensim.sklearn_integration import SklLdaSeqModel
In [21]:
test_data = data.data[0:2]
test_target = data.target[0:2]
id2word = Dictionary([doc.split() for doc in test_data])
corpus = [id2word.doc2bow(doc.split()) for doc in test_data]
model = SklLdaSeqModel(id2word=id2word, num_topics=2, time_slice=[1, 1], initialize='gensim')  # time_slice entries must sum to the number of documents
clf = linear_model.LogisticRegression(penalty='l2', C=0.1) # l2 penalty used
pipe = Pipeline([('features', model), ('classifier', clf)])
pipe.fit(corpus, test_target)
print_features_pipe(pipe, list(id2word.values()))
print(pipe.score(corpus, test_target))
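SklLdaSeqModel wraps gensim's dynamic topic model: time_slice describes how many documents fall into each consecutive time period, and the learned topics are allowed to evolve from one period to the next.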