pyLDAvis now also supports LDA models from scikit-learn. Let's look at this in more detail. We will be using the 20 newsgroups dataset as provided by scikit-learn.
In [2]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
In [ ]:
%matplotlib inline
import pandas as pd  # needed for pd.Series in the prepare calls below
import pyLDAvis
pyLDAvis.enable_notebook()
Next, we fetch only three categories from the newsgroups. If we set the number of topics to 3 in LDA, we should get an intuitive result: an intertopic distance map that measures the distances across these categories.
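To make "intertopic distance" a bit more concrete, here is a minimal sketch of the Jensen-Shannon divergence between two topic-word distributions. As far as we know, pyLDAvis uses this divergence by default before projecting the topics to two dimensions, but the function names and the toy distributions below are our own illustration, not the library's code.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Symmetric Jensen-Shannon divergence between two discrete distributions."""
    # Mixture distribution; it is positive wherever p or q is positive.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic-word distributions over a four-word vocabulary.
topic_med = [0.6, 0.3, 0.1, 0.0]
topic_graphics = [0.0, 0.1, 0.3, 0.6]
topic_med_like = [0.5, 0.4, 0.1, 0.0]

# Topics that share most of their probability mass are closer together.
d_far = jensen_shannon(topic_med, topic_graphics)
d_near = jensen_shannon(topic_med, topic_med_like)
```

Topics drawn from unrelated categories should end up with a larger divergence than topics that share vocabulary, which is what the distance map displays.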
In [24]:
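cats = ['talk.religion.misc', 'comp.graphics', 'sci.med']
train = fetch_20newsgroups(subset='train', categories=cats,
                           shuffle=True, random_state=42)
# The vectorizer is passed to prepare below, so it must be defined here.
vect = CountVectorizer(stop_words='english')
# scikit-learn >= 0.19 names this parameter n_components (older releases: n_topics).
lda = LatentDirichletAllocation(n_components=3, random_state=123, max_iter=1500)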
Finally, we use the prepare function to visualize the model inside the notebook.
In [14]:
prepare(pd.Series(train['data']),vect,lda)
Out[14]:
Alternatively, you can use any other LDA model that follows scikit-learn's API, for example https://github.com/ariddell/lda.
In [15]:
from lda import LDA
lda_b = LDA(n_topics=len(cats),n_iter=1500,random_state=123)
In [17]:
prepare(pd.Series(train['data']),vect,lda_b)
Out[17]:
Note that you can get the vectorizer and the LDA model that have been fitted on the data by using the _extract_data function.
In [31]:
lda_fit, vect_fit, prepared = _extract_data(pd.Series(train['data']), vect, lda_b)
In [32]:
vect_fit
Out[32]:
In [35]:
lda_fit
Out[35]:
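To give a feel for what a fitted vectorizer and a fitted LDA model contain, here is a minimal sketch on a toy corpus. It does not reproduce _extract_data; the corpus and the names vect_toy and lda_toy are hypothetical, and we assume a scikit-learn release that accepts n_components. The five quantities computed at the end are, to our understanding, the kind of inputs a pyLDAvis visualization is built from.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus standing in for train['data'].
docs = [
    "the doctor prescribed medicine for the patient",
    "the patient asked the doctor about the medicine",
    "the graphics card renders the image quickly",
    "the image was drawn with a graphics library",
]

vect_toy = CountVectorizer(stop_words='english')
dtm = vect_toy.fit_transform(docs)  # sparse document-term count matrix

lda_toy = LatentDirichletAllocation(n_components=2, random_state=123, max_iter=20)
doc_topic = lda_toy.fit_transform(dtm)
# Normalise defensively so each document's topic weights sum to 1.
doc_topic = doc_topic / doc_topic.sum(axis=1, keepdims=True)

# Vocabulary terms ordered by their column index in the count matrix.
vocab = sorted(vect_toy.vocabulary_, key=vect_toy.vocabulary_.get)

# components_ holds unnormalised topic-word weights; normalise each topic row.
topic_term = lda_toy.components_ / lda_toy.components_.sum(axis=1)[:, np.newaxis]

doc_lengths = np.asarray(dtm.sum(axis=1)).ravel()     # tokens kept per document
term_frequency = np.asarray(dtm.sum(axis=0)).ravel()  # corpus-wide count per term
```

Inspecting vect_fit and lda_fit from the cells above in the same way (vocabulary_, components_) is a quick check that both objects were actually fitted.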