pyLDAvis now also supports LDA models from scikit-learn. Let's take a look at this in more detail, using the 20 Newsgroups dataset as provided by scikit-learn.


In [2]:
import pandas as pd  # used below to wrap the documents in a Series

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [ ]:
%matplotlib inline
import pyLDAvis
pyLDAvis.enable_notebook()

Next, we fetch only three categories from the newsgroups. If we set the number of topics to 3 in LDA, the intertopic distance map should intuitively separate these three categories.


In [24]:
cats = ['talk.religion.misc','comp.graphics', 'sci.med']
train = fetch_20newsgroups(subset='train',categories=cats,
                           shuffle=True,random_state=42)

vect = CountVectorizer(stop_words='english')
lda = LatentDirichletAllocation(n_topics=3, random_state=123, max_iter=1500)
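Before handing the model to pyLDAvis, it helps to see what the vectorizer and LDA actually do when fitted: the vectorizer produces a document-term matrix, and LDA turns each document into a distribution over topics. A minimal sketch on a toy corpus, assuming a recent scikit-learn (which names the topic-count parameter `n_components` rather than the older `n_topics` used above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus standing in for the 20 Newsgroups documents.
docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "the stock market fell today",
        "investors sold stocks and bonds"]

vect = CountVectorizer(stop_words='english')
dtm = vect.fit_transform(docs)                  # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0, max_iter=10)
doc_topics = lda.fit_transform(dtm)             # one topic distribution per document

print(doc_topics.shape)                         # (4, 2): 4 documents, 2 topics
```

Each row of `doc_topics` sums to 1, which is what the per-document topic proportions in the pyLDAvis panel are derived from.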

Finally, we use the prepare method to visualize the model inside the notebook.

Visualizing the model with pyLDAvis


In [14]:
prepare(pd.Series(train['data']),vect,lda)


Out[14]:

Alternatively, you can use any LDA implementation that follows scikit-learn's API, for example https://github.com/ariddell/lda.


In [15]:
from lda import LDA

lda_b = LDA(n_topics=len(cats),n_iter=1500,random_state=123)

In [17]:
prepare(pd.Series(train['data']),vect,lda_b)


Out[17]:

Note that you can retrieve the vectorizer and LDA model that have been fitted on the data by using the _extract_data method.


In [31]:
lda_fit,vect_fit, prepared = _extract_data(pd.Series(train['data']),vect,lda_b)

In [32]:
vect_fit


Out[32]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [35]:
lda_fit


Out[35]:
<lda.lda.LDA at 0x7f735378bd30>