In [1]:
%pylab inline
In [2]:
import matplotlib.pyplot as plt
In this workbook we'll start working with word-based FeatureSets, and use the model.corpus.mallet module to fit a Latent Dirichlet Allocation (LDA) topic model.
First, import the dfr module from tethne.readers.
In [3]:
from tethne.readers import dfr
Unlike WoS datasets, DfR datasets can contain wordcounts, bigrams, trigrams, and quadgrams in addition to bibliographic data. read will automatically load those data as FeatureSets.
In [4]:
corpus = dfr.read('/Users/erickpeirson/Projects/tethne-notebooks/data/dfr')
In [5]:
print 'There are %i papers in this corpus.' % len(corpus.papers)
In [6]:
print 'This corpus contains the following features: \n\t%s' % '\n\t'.join(corpus.features.keys())
Whereas Corpora generated from WoS datasets are indexed by wosid, Corpora generated from DfR datasets are indexed by doi.
In [7]:
corpus.indexed_papers.keys()[0:10] # The first 10 dois in the Paper index.
Out[7]:
So you can retrieve specific Papers from the Corpus using their DOIs:
In [8]:
corpus['10.2307/2418718']
Out[8]:
FeatureSets are stored in the features attribute of a Corpus object. Corpus.features is just a dict, so you can see which featuresets are available by calling its keys method.
In [9]:
print 'This corpus contains the following featuresets: \n\t{0}'.format(
'\n\t'.join(corpus.features.keys()) )
It's not uncommon for featuresets to be incomplete; sometimes data won't be available for all of the Papers in the Corpus. An easy way to check the number of Papers for which data is available in a FeatureSet is to get the size (len) of the features attribute of the FeatureSet. For example:
In [10]:
print 'There are %i papers with unigram features in this corpus.' % len(corpus.features['wordcounts'].features)
To check the number of features in a featureset (e.g. the size of a vocabulary), look at the size of the index attribute.
In [11]:
print 'There are %i words in the unigrams featureset' % len(corpus.features['wordcounts'].index)
index maps words to an integer representation. Here are the first ten words in the FeatureSet:
In [12]:
corpus.features['wordcounts'].index.items()[0:10]
Out[12]:
In many cases you may wish to apply a stoplist (a list of features to exclude from analysis) to a featureset. In our DfR dataset, the most common words are prepositions and other terms that don't really have anything to do with the topical content of the corpus.
In [13]:
corpus.features['wordcounts'].top(5) # The top 5 words in the FeatureSet.
Out[13]:
You may apply any stoplist you like to a featureset. In this example, we import the Natural Language ToolKit (NLTK) stopwords corpus.
In [3]:
from nltk.corpus import stopwords
stoplist = stopwords.words()
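Note that calling stopwords.words() with no arguments returns the stopword lists for every language in the NLTK corpus. If you only want English stopwords (and faster membership tests), you could restrict the list and wrap it in a set. This optional variant is just a sketch; the rest of the notebook works either way.
In [ ]:
from nltk.corpus import stopwords
# Restrict to English stopwords and use a set for fast membership checks.
stoplist = set(stopwords.words('english'))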
We then create a function that decides whether or not to keep each word. The function should take four arguments:

f -- the feature itself (the word)
v -- the number of instances of that feature in a specific document
c -- the number of instances of that feature in the whole FeatureSet
dc -- the number of documents that contain that feature

This function will be applied to each word in each document. If it returns 0 or None, the word will be excluded. Otherwise, it should return a numeric value (in this case, the count for that document).
In addition to applying the stoplist, we'll also exclude any word that occurs in more than 50 documents or in fewer than 3 documents.
In [17]:
def apply_stoplist(f, v, c, dc):
    # Drop stopwords and words that occur in more than 50 or fewer than 3 documents.
    if f in stoplist or dc > 50 or dc < 3:
        return 0
    return v
We apply the stoplist using the transform() method. FeatureSets are not modified in place; instead, a new FeatureSet is generated that reflects the specified changes. We'll call the new FeatureSet 'wordcounts_filtered'.
In [18]:
corpus.features['wordcounts_filtered'] = corpus.features['wordcounts'].transform(apply_stoplist)
In [19]:
print 'There are %i words in the wordcounts_filtered FeatureSet' % len(corpus.features['wordcounts_filtered'].index)
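As a quick sanity check, we can look at the top words in the filtered featureset, just as we did for the unfiltered wordcounts above (the exact words and counts will depend on your data):
In [ ]:
corpus.features['wordcounts_filtered'].top(5) # The top 5 words after filtering.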
Latent Dirichlet Allocation is a popular approach to discovering latent "topics" in large corpora. Many digital humanists use a software package called MALLET to fit LDA to text data. Tethne uses MALLET to fit LDA topic models.
Start by importing the mallet module from the tethne.model.corpus subpackage.
In [4]:
from tethne.model.corpus import mallet
Now we'll create a new LDAModel for our Corpus. The featureset_name parameter tells the LDAModel which FeatureSet we want to use. We'll use our filtered wordcounts.
In [21]:
model = mallet.LDAModel(corpus, featureset_name='wordcounts_filtered')
Next we'll fit the model. We need to tell MALLET how many topics to fit (the hyperparameter Z), and how many iterations (max_iter) to perform. This step may take a little while, depending on the size of your corpus.
In [47]:
model.fit(Z=50, max_iter=500)
You can inspect the inferred topics using the model's print_topics() method. By default, this will print the top ten words for each topic.
In [48]:
model.print_topics()
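print_topics() also accepts an Nwords parameter (we use it again with the WoS model below) if you want to see more or fewer words per topic; for example:
In [ ]:
model.print_topics(Nwords=20) # Show the top 20 words for each topic.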
We can also look at the representation of a topic over time using the topic_over_time() method. In the example below we'll plot the first five topics on the same figure.
In [49]:
plt.figure(figsize=(15, 5))
for k in xrange(5):    # Plot the first five topics.
    x, y = model.topic_over_time(k)
    plt.plot(x, y, label='topic {0}'.format(k), lw=2, alpha=0.7)
plt.legend(loc='best')
plt.show()
The topics module in the tethne.networks subpackage contains some useful methods for visualizing topic models as networks. You can import it just like the authors or papers modules.
In [5]:
from tethne.networks import topics
In [51]:
termGraph = topics.terms(model, threshold=0.01)
In [66]:
termGraph.name = ''
The terms function generates a network of words connected on the basis of shared affinity with a topic. If two words $i$ and $j$ are both associated with a topic $z$ with $\Phi(i|z) \geq 0.01$ and $\Phi(j|z) \geq 0.01$, then an edge is drawn between them.
The resulting graph will be smaller or larger depending on the value that you choose for threshold. You may wish to increase or decrease threshold to achieve something interpretable.
In [67]:
print 'There are {0} nodes and {1} edges in this graph.'.format(
len(termGraph.nodes()), len(termGraph.edges()))
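If the graph is too dense or too sparse, you can simply regenerate it with a different threshold. For example (the value 0.05 here is purely illustrative):
In [ ]:
# A higher threshold keeps only stronger word-topic associations, giving a sparser graph.
sparserTermGraph = topics.terms(model, threshold=0.05)
print 'There are {0} nodes and {1} edges in this graph.'.format(
    len(sparserTermGraph.nodes()), len(sparserTermGraph.edges()))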
The resulting Graph can be written to GraphML just like any other Graph.
In [37]:
from tethne.writers import graph
In [69]:
graphpath = '/Users/erickpeirson/Projects/tethne-notebooks/output/lda.graphml'
graph.write_graphml(termGraph, graphpath)
The network visualization below was generated in Cytoscape. Edge width is a function of the 'weight' attribute. Edge color is based on the 'topics' attribute, to give some sense of which clusters of terms belong to which topics. We can see right away that terms like plants, populations, species, and selection are all very central to the topics retrieved by this model.
JSTOR DfR is not the only source of wordcounts with which we can perform topic modeling. For records more recent than about 1990, the Web of Science includes abstracts in its bibliographic records.
Let's first spin up our WoS dataset.
In [6]:
from tethne.readers import wos
wosCorpus = wos.read('/Users/erickpeirson/Projects/tethne-notebooks/data/wos')
Here's what one of the abstracts looks like:
In [7]:
wosCorpus[0].abstract
Out[7]:
We can convert all of the available abstracts in our Corpus into a unigram featureset by indexing the 'abstract' field with the index_feature method. The abstracts will be diced up into their constituent words, punctuation and capitalization will be removed, and a featureset called 'abstract' will be generated.
In [8]:
from tethne import tokenize
wosCorpus.index_feature('abstract', tokenize=tokenize, structured=True)
Sure enough, here's our 'abstract' featureset:
In [9]:
wosCorpus.features.keys()
Out[9]:
Since we're not working from OCR texts (that's where JSTOR DfR data comes from), there are far fewer "junk" words. We end up with a much smaller vocabulary.
In [10]:
print 'There are {0} features in the abstract featureset.'.format(len(wosCorpus.features['abstract'].index))
But since not all of our WoS records were published in or after 1990, there are a handful for which no abstract terms are available.
In [11]:
print 'Only {0} of {1} papers have abstracts, however.'.format(len(wosCorpus.features['abstract'].features), len(wosCorpus.papers))
In [12]:
# Exclude stopwords and words that occur in fewer than 3 or more than 400 documents.
filter_func = lambda f, v, c, dc: f not in stoplist and 2 < dc < 400
wosCorpus.features['abstract_filtered'] = wosCorpus.features['abstract'].transform(filter_func)
In [13]:
print 'There are {0} features in the abstract_filtered featureset.'.format(len(wosCorpus.features['abstract_filtered'].index))
In [14]:
type(wosCorpus.features['abstract_filtered'])
Out[14]:
The 'abstract' featureset is similar to the 'wordcounts' featureset from the JSTOR DfR dataset, so we can perform topic modeling with it, too.
In [15]:
wosModel = mallet.LDAModel(wosCorpus, featureset_name='abstract_filtered')
In [18]:
wosModel.fit(Z=50, max_iter=500)
In [19]:
wosModel.print_topics(Nwords=5)
topics.cotopics() creates a network of topics, linked by virtue of their co-occurrence in documents. Use the threshold parameter to tune the density of the graph.
In [39]:
coTopicGraph = topics.cotopics(wosModel, threshold=0.15)
In [44]:
print '%i nodes and %i edges' % (coTopicGraph.order(), coTopicGraph.size())
In [41]:
graph.write_graphml(coTopicGraph, '/Users/erickpeirson/Projects/tethne-notebooks/output/lda_coTopics.graphml')
topics.topic_coupling() creates a network of documents, linked by virtue of containing shared topics. Again, use the threshold parameter to tune the density of the graph.
In [50]:
topicCoupling = topics.topic_coupling(wosModel, threshold=0.2)
In [51]:
print '%i nodes and %i edges' % (topicCoupling.order(), topicCoupling.size())
In [52]:
graph.write_graphml(topicCoupling, '/Users/erickpeirson/Projects/tethne-notebooks/output/lda_topicCoupling.graphml')