In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

1.7. Finding concepts in texts - Latent Dirichlet Allocation

Latent Semantic Analysis provided a powerful way to begin interrogating relationships among texts. Latent Dirichlet Allocation (LDA) goes a step further: it models each document as a mixture of latent topics, each of which is a probability distribution over words.

In this notebook we use the gensim implementation of Online LDA (Hoffman et al. 2010), which uses online variational Bayes as an alternative to the more typical Gibbs-sampling MCMC approach.


In [46]:
import nltk
from tethne.readers import zotero
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
import gensim
import networkx as nx
import pandas as pd

from collections import defaultdict, Counter

wordnet = nltk.WordNetLemmatizer()
stemmer = nltk.SnowballStemmer('english')
stoplist = stopwords.words('english')

In [5]:
text_root = '../data/EmbryoProjectTexts/files'
zotero_export_path = '../data/EmbryoProjectTexts'

corpus = nltk.corpus.PlaintextCorpusReader(text_root, 'https.+')
metadata = zotero.read(zotero_export_path, index_by='link', follow_links=False)

In [26]:
def normalize_token(token):
    """
    Convert token to lowercase, and lemmatize using WordNet.

    Parameters
    ----------
    token : str

    Returns
    -------
    token : str
    """
    return wordnet.lemmatize(token.lower())

def filter_token(token):
    """
    Evaluate whether or not to retain ``token``.

    Parameters
    ----------
    token : str

    Returns
    -------
    keep : bool
    """
    token = token.lower()
    return token not in stoplist and token.isalpha() and len(token) > 2
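
A quick sanity check of these helpers, using a handful of illustrative tokens (not drawn from the corpus):

raw_tokens = ['The', 'embryos', 'Embryology', '1923', 'of', 'DNA']
print([normalize_token(token) for token in raw_tokens if filter_token(token)])
# ['embryo', 'embryology', 'dna']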

We will represent our corpus as a list of lists: each sub-list contains the normalized, filtered tokens of one document.


In [27]:
documents = [[normalize_token(token)
              for token in corpus.words(fileids=[fileid])
              if filter_token(token)]
             for fileid in corpus.fileids()]

In [37]:
years = [metadata[fileid].date for fileid in corpus.fileids()]

Further filtering

Fitting LDA in Python is computationally expensive, so anything we can do to cut down on "noise" will help. Let's look at word counts and document counts to see whether we can home in on more useful terms.


In [28]:
wordcounts = nltk.FreqDist([token for document in documents for token in document])

In [29]:
wordcounts.plot(20)

[Plot: frequency distribution of the 20 most common tokens]

In [30]:
documentcounts = nltk.FreqDist([token for document in documents for token in set(document)])

In [31]:
documentcounts.plot(80)

[Plot: document frequencies of the 80 most common tokens]
Here we filter the tokens in each document, dropping words that occur 2,000 or more times in the corpus, appear in only one document, or appear in 350 or more documents. This preserves the shape of the corpus: one token list per document.


In [32]:
filtered_documents = [[token for token in document
                       if wordcounts[token] < 2000
                       and 1 < documentcounts[token] < 350]
                      for document in documents]
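
To see how much this filtering shrinks the vocabulary, we can compare the number of distinct tokens before and after (a quick diagnostic; the exact figures depend on the corpus):

vocabulary_before = len(set(token for document in documents for token in document))
vocabulary_after = len(set(token for document in filtered_documents for token in document))
print(vocabulary_before, '->', vocabulary_after)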

It's easier to compute over integers than over strings, so we use a gensim Dictionary to map each word to an integer ID.


In [33]:
dictionary = gensim.corpora.Dictionary(filtered_documents)
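
Each distinct token receives an integer ID, stored in the dictionary's token2id mapping. The specific IDs depend on the order in which the corpus was processed:

list(dictionary.token2id.items())[:5]
# e.g. [('embryo', 0), ('cell', 1), ...] -- actual pairs will vary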

The doc2bow() method converts a document (a sequence of tokens) into a sparse bag-of-words representation: a list of (word ID, count) tuples.


In [34]:
documents_bow = [dictionary.doc2bow(document) for document in filtered_documents]
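
For example, a toy document becomes a list of (word ID, count) tuples; tokens absent from the dictionary are silently dropped. This sketch assumes 'embryo' and 'gene' survived filtering:

dictionary.doc2bow(['embryo', 'embryo', 'gene'])
# e.g. [(0, 2), (5, 1)] -- the IDs depend on the dictionary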

We're ready to fit the model! We pass our BOW-transformed documents, our dictionary, and the number of topics. update_every=0 disables online (minibatch) updating, which is intended for very large corpora, and passes=20 tells the estimator to make 20 full passes over the corpus.


In [35]:
model = gensim.models.LdaModel(documents_bow, 
                               id2word=dictionary,
                               num_topics=20, 
                               update_every=0,
                               passes=20)
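
Once fitted, a single topic can be inspected with show_topic, which returns the top words and their probabilities (as (word, probability) pairs in recent gensim versions):

model.show_topic(0, topn=5)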

In [41]:
for i, topic in enumerate(model.print_topics(num_topics=20, num_words=5)):
    print(i, ':', topic)


0 : 0.012*dna + 0.010*genome + 0.008*www + 0.006*eugenics + 0.006*gene
1 : 0.012*fetus + 0.012*woman + 0.011*case + 0.011*court + 0.009*fetal
2 : 0.013*alcohol + 0.012*child + 0.010*defect + 0.009*egg + 0.009*autism
3 : 0.024*stem + 0.020*plant + 0.016*pluripotent + 0.013*genetic + 0.011*seed
4 : 0.015*stem + 0.009*bioethics + 0.008*egg + 0.008*president + 0.007*cloning
5 : 0.013*birth + 0.013*death + 0.009*blood + 0.009*pill + 0.008*hartman
6 : 0.023*sex + 0.015*male + 0.011*female + 0.010*hormone + 0.009*lillie
7 : 0.010*fistula + 0.009*spemann + 0.008*organizer + 0.006*hamlin + 0.006*experiment
8 : 0.011*theory + 0.008*animal + 0.006*organism + 0.006*evolution + 0.006*specie
9 : 0.008*loeb + 0.007*organism + 0.006*theory + 0.006*egg + 0.006*embryology
10 : 0.018*theory + 0.017*weismann + 0.010*germ + 0.010*darwin + 0.009*heredity
11 : 0.023*gene + 0.009*protein + 0.007*mouse + 0.007*nucleus + 0.006*experiment
12 : 0.018*cocaine + 0.010*stage + 0.008*blood + 0.008*sandel + 0.008*genetic
13 : 0.018*nerve + 0.017*growth + 0.017*kammerer + 0.014*chick + 0.013*experiment
14 : 0.017*court + 0.014*stem + 0.010*preembryos + 0.009*ivf + 0.008*case
15 : 0.011*abortion + 0.007*act + 0.007*report + 0.006*pope + 0.006*woman
16 : 0.015*sperm + 0.010*egg + 0.009*pregnancy + 0.009*zhang + 0.009*test
17 : 0.013*fetus + 0.012*body + 0.009*model + 0.006*anatomy + 0.006*world
18 : 0.012*hayflick + 0.010*neural + 0.010*telomere + 0.008*crest + 0.008*telomerase
19 : 0.013*chromosome + 0.013*layer + 0.011*germ + 0.008*gene + 0.007*thalidomide

In [42]:
documents_lda = model[documents_bow]

In [43]:
documents_lda[6]


Out[43]:
[(12, 0.014435555062542226),
 (14, 0.085751791682966233),
 (15, 0.16342467897748753),
 (16, 0.73486706170519944)]
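
Note that gensim omits topics whose weight in a document falls below a small threshold, which is why only a few of the twenty topics appear above. For downstream computations it can help to expand these sparse mixtures into a dense documents-by-topics matrix; a minimal sketch:

import numpy as np

theta = np.zeros((len(documents_bow), 20))  # 20 = num_topics, as above
for i, document in enumerate(documents_lda):
    for topic, weight in document:
        theta[i, topic] = weight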

In [44]:
# For each topic, count the number of documents per year in which the
# topic appears with non-negligible weight.
topic_counts = defaultdict(Counter)
for year, document in zip(years, documents_lda):
    for topic, weight in document:
        topic_counts[topic][year] += 1.
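
The loop above counts a document toward every topic it contains, however small the weight. An alternative (not what this notebook computes) is to accumulate the weights themselves, so that marginal topic assignments count for less:

weighted_counts = defaultdict(Counter)
for year, document in zip(years, documents_lda):
    for topic, weight in document:
        weighted_counts[topic][year] += weight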

In [47]:
topics_over_time = pd.DataFrame(columns=['Topic', 'Year', 'Count'])

i = 0
for topic, yearcounts in topic_counts.items():
    for year, count in yearcounts.items():
        topics_over_time.loc[i] = [topic, year, count]
        i += 1

In [48]:
topics_over_time


Out[48]:
Topic Year Count
0 0 2016 1
1 0 2007 18
2 0 2008 3
3 0 2009 7
4 0 2010 22
5 0 2011 19
6 0 2012 8
7 0 2013 13
8 0 2014 39
9 0 2015 8
10 1 2016 2
11 1 2007 15
12 1 2008 18
13 1 2009 4
14 1 2010 31
15 1 2011 21
16 1 2012 18
17 1 2013 14
18 1 2014 28
19 1 2015 6
20 2 2016 1
21 2 2007 3
22 2 2008 11
23 2 2009 9
24 2 2010 23
25 2 2011 16
26 2 2012 8
27 2 2013 3
28 2 2014 19
29 2 2015 4
... ... ... ...
152 16 2012 7
153 16 2013 7
154 16 2014 5
155 16 2015 1
156 17 2007 15
157 17 2008 10
158 17 2009 8
159 17 2010 35
160 17 2011 10
161 17 2012 7
162 17 2013 11
163 17 2014 8
164 18 2007 3
165 18 2008 2
166 18 2009 1
167 18 2010 9
168 18 2011 4
169 18 2012 7
170 18 2013 7
171 18 2014 8
172 18 2015 5
173 19 2007 4
174 19 2008 3
175 19 2009 4
176 19 2010 20
177 19 2011 21
178 19 2012 6
179 19 2013 10
180 19 2014 20
181 19 2015 3

182 rows × 3 columns


In [58]:
topic_0_over_time = topics_over_time[topics_over_time.Topic == 0]

In [60]:
plt.bar(topic_0_over_time.Year, topic_0_over_time.Count)
plt.ylabel('Number of documents')
plt.show()

[Plot: bar chart of topic 0 document counts by year]

In [63]:
from scipy.spatial import distance

In [64]:
distance.cosine


Out[64]:
<function scipy.spatial.distance.cosine>
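
With topic mixtures in hand, cosine distance gives a simple way to compare documents: 0.0 means identical mixtures, 1.0 means no shared topics. A minimal sketch, using the dense theta matrix built above:

distance.cosine(theta[0], theta[6])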
