pyLDAvis now also supports LDA models from scikit-learn. Let's take a look at this in more detail, using the 20 Newsgroups dataset as provided by scikit-learn.


In [2]:
import pandas as pd  # used below to wrap the documents in a Series

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [ ]:
%matplotlib inline
import pyLDAvis
pyLDAvis.enable_notebook()

Next, we fetch only three categories from the newsgroups. If we set the number of topics to 3 in LDA, the intertopic distance map should intuitively separate these three categories.


In [24]:
cats = ['talk.religion.misc','comp.graphics', 'sci.med']
train = fetch_20newsgroups(subset='train',categories=cats,
                           shuffle=True,random_state=42)

vect = CountVectorizer(stop_words='english')
lda = LatentDirichletAllocation(n_topics=3, random_state=123, max_iter=1500)
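Before handing the model to pyLDAvis, it helps to see what the vectorizer and LDA actually do when fitted: the vectorizer produces a document-term matrix, and LDA turns each document into a distribution over topics. A minimal sketch on a toy corpus, assuming a recent scikit-learn (which names the topic-count parameter `n_components` rather than the older `n_topics` used above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus standing in for the 20 Newsgroups documents.
docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "the stock market fell today",
        "investors sold stocks and bonds"]

vect = CountVectorizer(stop_words='english')
dtm = vect.fit_transform(docs)                  # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0, max_iter=10)
doc_topics = lda.fit_transform(dtm)             # one topic distribution per document

print(doc_topics.shape)                         # (4, 2): 4 documents, 2 topics
```

Each row of `doc_topics` sums to 1, which is what the per-document topic proportions in the pyLDAvis panel are derived from.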

Finally, we use the prepare method to visualize the model inside the notebook.

Visualizing the model with pyLDAvis


In [14]:
prepare(pd.Series(train['data']),vect,lda)


Out[14]:

Alternatively, you can use any LDA implementation that follows scikit-learn's API, for example https://github.com/ariddell/lda.


In [15]:
from lda import LDA

lda_b = LDA(n_topics=len(cats),n_iter=1500,random_state=123)

In [17]:
prepare(pd.Series(train['data']),vect,lda_b)


Out[17]:

Note that you can retrieve the vectorizer and LDA model that have been fitted on the data by using the _extract_data method.


In [31]:
lda_fit,vect_fit, prepared = _extract_data(pd.Series(train['data']),vect,lda_b)

In [32]:
vect_fit


Out[32]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [35]:
lda_fit


Out[35]:
<lda.lda.LDA at 0x7f735378bd30>