Visualizing a Gensim model

To illustrate how to use pyLDAvis's gensim helper functions, we will create a model from the 20 Newsgroups corpus. Only minimal preprocessing is done, so the model is far from optimal; the goal of this notebook is simply to demonstrate the helper functions.

Downloading the data


In [1]:
%%bash
mkdir -p data
pushd data
if [ -d "20news-bydate-train" ]
then
  echo "The data has already been downloaded..."
else
  wget http://qwone.com/%7Ejason/20Newsgroups/20news-bydate.tar.gz
  tar xfv 20news-bydate.tar.gz
  rm 20news-bydate.tar.gz
fi
echo "Lets take a look at the groups..."
ls 20news-bydate-train/
popd


/Users/bmabey/w/rbl/pyLDAvis/notebooks/data
~/w/rbl/pyLDAvis/notebooks/data ~/w/rbl/pyLDAvis/notebooks
The data has already been downloaded...
Let's take a look at the groups...
alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc
~/w/rbl/pyLDAvis/notebooks

Exploring the dataset

Each group dir has a set of files:


In [2]:
!ls -lah data/20news-bydate-train/sci.space | tail -n 5


-rw-r--r--   1 bmabey 1.5K Mar 18  2003 61250
-rw-r--r--   1 bmabey  889 Mar 18  2003 61252
-rw-r--r--   1 bmabey 1.2K Mar 18  2003 61264
-rw-r--r--   1 bmabey 1.7K Mar 18  2003 61308
-rw-r--r--   1 bmabey 1.4K Mar 18  2003 61422

Let's take a peek at one email:


In [3]:
!head -n 20 data/20news-bydate-train/sci.space/61422


From: ralph.buttigieg@f635.n713.z3.fido.zeta.org.au (Ralph Buttigieg)
Subject: Why not give $1 billion to first year-lo
Organization: Fidonet. Gate admin is fido@socs.uts.edu.au
Lines: 34

Original to: keithley@apple.com
G'day keithley@apple.com

21 Apr 93 22:25, keithley@apple.com wrote to All:

 kc> keithley@apple.com (Craig Keithley), via Kralizec 3:713/602


 kc> But back to the contest goals, there was a recent article in AW&ST
about a
 kc> low cost (it's all relative...) manned return to the moon.  A General
 kc> Dynamics scheme involving a Titan IV & Shuttle to lift a Centaur upper
 kc> stage, LEV, and crew capsule.  The mission consists of delivering two
 kc> unmanned payloads to the lunar surface, followed by a manned mission.
 kc> Total cost:  US was $10-$13 billion.  Joint ESA(?)/NASA project was

Loading and tokenizing the corpus


In [4]:
from glob import glob
import re
import string
import funcy as fp
from gensim import models
from gensim.corpora import Dictionary, MmCorpus
import nltk
import pandas as pd

In [5]:
# Quick and dirty tokenization: collapse email addresses into a single
# '#email' token, then blank out everything except letters, apostrophes, and '#'.
EMAIL_REGEX = re.compile(r"[a-z0-9\.\+_-]+@[a-z0-9\._-]+\.[a-z]*")
FILTER_REGEX = re.compile(r"[^a-z '#]")
TOKEN_MAPPINGS = [(EMAIL_REGEX, "#email"), (FILTER_REGEX, ' ')]

def tokenize_line(line):
    res = line.lower()
    for regexp, replacement in TOKEN_MAPPINGS:
        res = regexp.sub(replacement, res)
    return res.split()
    
def tokenize(lines, token_size_filter=2):
    tokens = fp.mapcat(tokenize_line, lines)
    return [t for t in tokens if len(t) > token_size_filter]
    

def load_doc(filename):
    group, doc_id = filename.split('/')[-2:]
    with open(filename) as f:
        doc = f.readlines()
    return {'group': group,
            'doc': doc,
            'tokens': tokenize(doc),
            'id': doc_id}


docs = pd.DataFrame(list(map(load_doc, glob('data/20news-bydate-train/*/*')))).set_index(['group', 'id'])
docs.head()


Out[5]:
doc tokens
group id
alt.atheism 49960 [From: mathew <mathew@mantis.co.uk>\n, Subject... [from, mathew, #email, subject, alt, atheism, ...
51060 [From: mathew <mathew@mantis.co.uk>\n, Subject... [from, mathew, #email, subject, alt, atheism, ...
51119 [From: I3150101@dbstu1.rz.tu-bs.de (Benedikt R... [from, #email, benedikt, rosenau, subject, gos...
51120 [From: mathew <mathew@mantis.co.uk>\n, Subject... [from, mathew, #email, subject, university, vi...
51121 [From: strom@Watson.Ibm.Com (Rob Strom)\n, Sub... [from, #email, rob, strom, subject, soc, motss...
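To see what the regexes above actually do, here is a quick sanity check (not in the original notebook) on a made-up header line. The email address collapses to the '#email' token and punctuation is blanked out; note that tokenize would additionally drop short tokens such as 're':

# Hypothetical sample line, just to exercise tokenize_line from the cell above.
sample = "From: mathew <mathew@mantis.co.uk> -- Re: Alt.Atheism FAQ!"
print(tokenize_line(sample))
# ['from', 'mathew', '#email', 're', 'alt', 'atheism', 'faq']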

Creating the dictionary and bag-of-words corpus


In [6]:
def nltk_stopwords():
    return set(nltk.corpus.stopwords.words('english'))

def prep_corpus(docs, additional_stopwords=set(), no_below=5, no_above=0.5):
    print('Building dictionary...')
    dictionary = Dictionary(docs)
    # Remove stopwords (ids of words missing from the dictionary map to
    # None, which filter_tokens safely ignores).
    stopwords = nltk_stopwords().union(additional_stopwords)
    stopword_ids = map(dictionary.token2id.get, stopwords)
    dictionary.filter_tokens(stopword_ids)
    dictionary.compactify()
    # Drop tokens appearing in fewer than no_below docs or in more than
    # no_above (as a fraction) of all docs.
    dictionary.filter_extremes(no_below=no_below, no_above=no_above, keep_n=None)
    dictionary.compactify()

    print('Building corpus...')
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    return dictionary, corpus

In [7]:
dictionary, corpus = prep_corpus(docs['tokens'])


Building dictionary...
Building corpus...

In [8]:
MmCorpus.serialize('newsgroups.mm', corpus)
dictionary.save('newsgroups.dict')
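
Serializing these means the preprocessing can be skipped on later runs. As a minimal sketch (using gensim's standard loaders; the file names match the cell above), reloading looks like this:

# Reload the serialized bag-of-words corpus and dictionary.
loaded_corpus = MmCorpus('newsgroups.mm')
loaded_dictionary = Dictionary.load('newsgroups.dict')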

Fitting the LDA model


In [9]:
%%time
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=50, passes=10)
lda.save('newsgroups_50.model')


CPU times: user 4min 3s, sys: 2.92 s, total: 4min 6s
Wall time: 4min 6s
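
Before firing up the visualization, it can help to eyeball a few topics directly. A quick sketch using gensim's show_topics (the topic contents will of course vary from run to run):

# Print a handful of topics as formatted word/weight strings.
for topic in lda.show_topics(num_topics=5, num_words=8):
    print(topic)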

Visualizing the model with pyLDAvis

Okay, the moment we have all been waiting for is finally here! You'll notice in the visualization that we have a few junk topics that would probably disappear with better preprocessing of the corpus. That is left as an exercise for the reader (one possible starting point is sketched below). :)
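
For instance, much of the junk comes from message headers and quoted reply lines. A rough, untested sketch of a cleanup pass one might apply in load_doc before tokenizing (the function name is hypothetical):

# Possible cleanup: skip the RFC 822-style header block (everything before
# the first blank line) and lines quoting a previous message (e.g. ' kc>').
def strip_headers_and_quotes(lines):
    body_started = False
    kept = []
    for line in lines:
        if not body_started:
            body_started = line.strip() == ''
            continue
        if re.match(r"\s*\w*>", line):  # quoted reply line
            continue
        kept.append(line)
    return kept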


In [10]:
import pyLDAvis.gensim as gensimvis
import pyLDAvis

In [11]:
vis_data = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.display(vis_data)


Out[11]:
(the interactive pyLDAvis topic-model visualization renders here)
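
If the inline widget does not render (e.g. in a static export of this notebook), pyLDAvis can also write the visualization to a standalone HTML file:

# Save the interactive visualization as a self-contained HTML page.
pyLDAvis.save_html(vis_data, 'newsgroups_50_lda.html')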