In this tutorial, you will learn how to use the author-topic model in Gensim. We will apply it to a corpus consisting of scientific papers, to get insight about the authors of the papers.
The author-topic model is an extension of Latent Dirichlet Allocation (LDA) that allows us to learn topic representations of authors in a corpus. The model can be applied to any kind of label on documents, such as tags on posts on the web. The model can be used as a novel way of data exploration, as features in machine learning pipelines, for author (or tag) prediction, or to simply leverage your topic model with existing metadata.
To learn about the theoretical side of the author-topic model, see Rosen-Zvi and co-authors 2004, for example. A report on the algorithm used in the Gensim implementation will be available soon.
Naturally, familiarity with topic modelling, LDA and Gensim is assumed in this tutorial. If you are not familiar with either LDA, or its Gensim implementation, I would recommend starting there. Consider some of these resources:
NOTE:
To run this tutorial on your own, install Jupyter, Gensim, SpaCy, Scikit-Learn, Bokeh and Pandas, e.g. using pip:
pip install jupyter gensim spacy scikit-learn bokeh pandas
Note that you need to download some data for SpaCy using
python -m spacy.en.download
Download the notebook at https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks/atmodel_tutorial.ipynb.
In this tutorial, we will learn how to prepare data for the model, how to train it, and how to explore the resulting representation in different ways. We will inspect the topic representation of some well known authors like Geoffrey Hinton and Yann LeCun, and compare authors by plotting them in reduced dimensionality and performing similarity queries.
The data we will be using consists of scientific papers about machine learning, from the Neural Information Processing Systems conference (NIPS). It is the same dataset used in the Pre-processing and training LDA tutorial, mentioned earlier.
We will be performing qualitative analysis of the model, and at times this will require an understanding of the subject matter of the data. If you try running this tutorial on your own, consider applying it on a dataset with subject matter that you are familiar with. For example, try one of the StackExchange datadump datasets.
You can download the data from Sam Roweis' website (http://www.cs.nyu.edu/~roweis/data.html). Or just run the cell below, and it will be downloaded and extracted into your /tmp directory.
In [1]:
!wget -O - 'http://www.cs.nyu.edu/~roweis/data/nips12raw_str602.tgz' > /tmp/nips12raw_str602.tgz
In [4]:
import tarfile
filename = '/tmp/nips12raw_str602.tgz'
tar = tarfile.open(filename, 'r:gz')
for item in tar:
    tar.extract(item, path='/tmp')
In the following sections we will load the data, pre-process it, train the model, and explore the results using some of the implementation's functionality. Feel free to skip the loading and pre-processing for now, if you are familiar with the process.
In the cell below, we crawl the folders and files in the dataset, and read the files into memory.
In [5]:
import os, re
# Folder containing all NIPS papers.
data_dir = '/tmp/nipstxt/' # Set this path to the data on your machine.
# Folders containing individual NIPS papers.
yrs = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
dirs = ['nips' + yr for yr in yrs]
# Get all document texts and their corresponding IDs.
docs = []
doc_ids = []
for yr_dir in dirs:
    files = os.listdir(data_dir + yr_dir)  # List of filenames.
    for filen in files:
        # Get document ID.
        (idx1, idx2) = re.search('[0-9]+', filen).span()  # Indexes of the start and end of the ID.
        doc_ids.append(yr_dir[4:] + '_' + str(int(filen[idx1:idx2])))
        # Read document text.
        # Note: ignoring characters that cause encoding errors.
        with open(data_dir + yr_dir + '/' + filen, errors='ignore', encoding='utf-8') as fid:
            txt = fid.read()
        # Replace any whitespace (newline, tabs, etc.) by a single space.
        txt = re.sub(r'\s', ' ', txt)
        docs.append(txt)
Construct a mapping from author names to document IDs.
In [2]:
filenames = [data_dir + 'idx/a' + yr + '.txt' for yr in yrs] # Using the years defined in previous cell.
# Get all author names and their corresponding document IDs.
author2doc = dict()
i = 0
for yr in yrs:
    # The files "a00.txt" and so on contain the author-document mappings.
    filename = data_dir + 'idx/a' + yr + '.txt'
    for line in open(filename, errors='ignore', encoding='utf-8'):
        # Each line corresponds to one author.
        contents = re.split(',', line)
        author_name = (contents[1] + contents[0]).strip()
        # Remove any whitespace to reduce redundant author names.
        author_name = re.sub(r'\s', '', author_name)
        # Get document IDs for author.
        ids = [c.strip() for c in contents[2:]]
        if not author2doc.get(author_name):
            # This is a new author.
            author2doc[author_name] = []
            i += 1
        # Add document IDs to author.
        author2doc[author_name].extend([yr + '_' + id for id in ids])
# Use an integer ID in author2doc, instead of the IDs provided in the NIPS dataset.
# Mapping from ID of document in NIPS dataset, to an integer ID.
doc_id_dict = dict(zip(doc_ids, range(len(doc_ids))))
# Replace NIPS IDs by integer IDs.
for a, a_doc_ids in author2doc.items():
    for i, doc_id in enumerate(a_doc_ids):
        author2doc[a][i] = doc_id_dict[doc_id]
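To see what the ID remapping does, here is a minimal sketch on toy data. The author names and document IDs are made up for illustration, and the sketch builds a new dictionary rather than updating the lists in place as the cell above does.

```python
# Toy data standing in for the NIPS IDs (hypothetical values).
toy_doc_ids = ['00_1', '00_2', '01_1']
toy_author2doc = {
    'AliceA': ['00_1', '01_1'],
    'BobB': ['00_2'],
}

# Map each string ID to its position in the document list.
toy_doc_id_dict = {doc_id: idx for idx, doc_id in enumerate(toy_doc_ids)}

# Replace string IDs by integer IDs, building a new dict instead of mutating in place.
toy_author2doc_int = {
    author: [toy_doc_id_dict[doc_id] for doc_id in ids]
    for author, ids in toy_author2doc.items()
}

print(toy_author2doc_int)
```

The integer IDs are simply the positions of the documents in the corpus, which is the form the model needs in order to look up each author's documents.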
The text will be pre-processed using the following steps:
A lot of the heavy lifting will be done by the great package, Spacy. Spacy markets itself as "industrial-strength natural language processing", is fast, enables multiprocessing, and is easy to use. First, let's import it and load the English NLP pipeline.
In [3]:
import spacy
nlp = spacy.load('en')
In the code below, Spacy takes care of tokenization, removing non-alphabetic characters, removal of stopwords, lemmatization and named entity recognition.
Note that we only keep named entities that consist of more than one word, as single-word named entities are already present among the tokens.
In [4]:
%%time
processed_docs = []
for doc in nlp.pipe(docs, n_threads=4, batch_size=100):
    # Process document using Spacy NLP pipeline.
    ents = doc.ents  # Named entities.
    # Keep only words (no numbers, no punctuation).
    # Lemmatize tokens, remove punctuation and remove stopwords.
    doc = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]
    # Remove common words from a stopword list.
    #doc = [token for token in doc if token not in STOPWORDS]
    # Add named entities, but only if they are a compound of more than one word.
    doc.extend([str(entity) for entity in ents if len(entity) > 1])
    processed_docs.append(doc)
In [5]:
docs = processed_docs
del processed_docs
Below, we use a Gensim model to add bigrams. Note that this achieves the same goal as named entity recognition, that is, finding adjacent words that have some particular significance.
In [6]:
# Compute bigrams.
from gensim.models import Phrases
# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
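For intuition about what the loop above adds to each document, here is a simplified pure-Python sketch. It is not how Phrases works internally: Phrases also scores candidate pairs against a threshold, whereas this sketch uses raw pair counts only. The toy corpus and the threshold of 2 are assumptions made for illustration.

```python
from collections import Counter

toy_tokenized = [
    ['neural', 'network', 'training'],
    ['neural', 'network', 'model'],
    ['network', 'model'],
]

# Count adjacent token pairs across all documents.
pair_counts = Counter()
for tokens in toy_tokenized:
    pair_counts.update(zip(tokens, tokens[1:]))

toy_min_count = 2  # Hypothetical threshold for this toy corpus.
frequent_pairs = {pair for pair, count in pair_counts.items() if count >= toy_min_count}

# Append frequent pairs as extra tokens, keeping the original unigrams.
augmented = []
for tokens in toy_tokenized:
    bigrams = ['_'.join(pair) for pair in zip(tokens, tokens[1:]) if pair in frequent_pairs]
    augmented.append(tokens + bigrams)

print(augmented[0])
```

As in the cell above, the bigrams are appended to the document rather than replacing the words they are made of, so both the unigrams and the bigram contribute to the bag-of-words.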
Now we are ready to construct a dictionary, as our vocabulary is finalized. We then remove common words (occurring in more than $50\%$ of the documents) and rare words (occurring in fewer than $20$ documents).
In [7]:
# Create a dictionary representation of the documents, and filter out frequent and rare words.
from gensim.corpora import Dictionary
dictionary = Dictionary(docs)
# Remove rare and common tokens.
# Filter out words that occur too frequently or too rarely.
max_freq = 0.5
min_wordcount = 20
dictionary.filter_extremes(no_below=min_wordcount, no_above=max_freq)
_ = dictionary[0] # This sort of "initializes" dictionary.id2token.
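The two thresholds are based on document frequency, that is, the number of documents a token appears in. The sketch below mimics that filtering rule in plain Python on a toy corpus. The data and threshold values are invented for illustration, and the sketch ignores the additional keep_n cap that filter_extremes also applies.

```python
toy_filter_docs = [
    ['model', 'neuron', 'spike'],
    ['model', 'kernel'],
    ['model', 'neuron', 'kernel'],
    ['model', 'policy'],
]

toy_no_below = 2     # Keep tokens appearing in at least this many documents.
toy_no_above = 0.75  # Keep tokens appearing in at most this fraction of documents.

# Document frequency: number of documents that contain each token.
doc_freq = {}
for tokens in toy_filter_docs:
    for token in set(tokens):
        doc_freq[token] = doc_freq.get(token, 0) + 1

kept_tokens = sorted(
    token for token, df in doc_freq.items()
    if df >= toy_no_below and df / len(toy_filter_docs) <= toy_no_above
)
print(kept_tokens)
```

Here "model" is dropped because it appears in every document, while "spike" and "policy" are dropped because they each appear in only one.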
We produce the vectorized representation of the documents that we supply to the author-topic model by computing the bag-of-words.
In [8]:
# Vectorize data.
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]
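As a rough picture of what a bag-of-words vector looks like, the following sketch builds one by hand. The vocabulary is a hypothetical two-word dictionary, and the code is a stand-in for illustration rather than Gensim's implementation of doc2bow.

```python
from collections import Counter

toy_token2id = {'kernel': 0, 'neuron': 1}  # Hypothetical vocabulary.
toy_doc = ['model', 'neuron', 'kernel', 'neuron']

# Count in-vocabulary tokens and emit sorted (token_id, count) pairs.
toy_counts = Counter(toy_token2id[t] for t in toy_doc if t in toy_token2id)
toy_bow = sorted(toy_counts.items())
print(toy_bow)
```

Tokens that are not in the vocabulary ("model" in this toy case) are simply left out, which is why the filtering step before this one directly shapes the vectors.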
Let's inspect the dimensionality of our data.
In [9]:
print('Number of authors: %d' % len(author2doc))
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))
We train the author-topic model on the data prepared in the previous sections.
The interface to the author-topic model is very similar to that of LDA in Gensim. In addition to a corpus, ID to word mapping (id2word) and number of topics (num_topics), the author-topic model requires either an author to document ID mapping (author2doc), or the reverse (doc2author).
Below, we have also set (this can be skipped for now):
The number of passes over the dataset (to improve the convergence of the optimization problem).
The number of iterations over each document (related to the above).
The mini-batch size (chunksize) (primarily to speed up training).
Bound evaluation, turned off (eval_every) (as it takes a long time to compute).
The alpha and eta priors (to improve the convergence of the optimization problem).
The seed (random_state) of the random number generator (to make these experiments reproducible).
We create the model and train it.
In [10]:
from gensim.models import AuthorTopicModel
%time model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \
author2doc=author2doc, chunksize=2000, passes=1, eval_every=0, \
iterations=1, random_state=1)
If you believe your model hasn't converged, you can continue training using model.update(). If you have additional documents and/or authors, call model.update(corpus, author2doc).
Before we explore the model, let's try to improve upon it. To do this, we will train several models with different random initializations, by giving different seeds for the random number generator (random_state). We evaluate the topic coherence of the model using the top_topics method, and pick the model with the highest topic coherence.
In [11]:
%%time
model_list = []
for i in range(5):
    model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \
                    author2doc=author2doc, chunksize=2000, passes=100, gamma_threshold=1e-10, \
                    eval_every=0, iterations=1, random_state=i)
    top_topics = model.top_topics(corpus)
    tc = sum([t[1] for t in top_topics])
    model_list.append((model, tc))
Choose the model with the highest topic coherence.
In [12]:
model, tc = max(model_list, key=lambda x: x[1])
print('Topic coherence: %.3e' %tc)
We save the model, to avoid having to train it again, and also show how to load it again.
In [13]:
# Save model.
model.save('/tmp/model.atmodel')
In [14]:
# Load model.
model = AuthorTopicModel.load('/tmp/model.atmodel')
Now that we have trained a model, we can start exploring the authors and the topics.
First, let's simply print the most important words in the topics. Below we have printed topic 0. As we can see, each topic is associated with a set of words, and each word has a probability of being expressed under that topic.
In [15]:
model.show_topic(0)
Below, we have given each topic a label based on what each topic seems to be about intuitively.
In [42]:
topic_labels = ['Circuits', 'Neuroscience', 'Numerical optimization', 'Object recognition', \
'Math/general', 'Robotics', 'Character recognition', \
'Reinforcement learning', 'Speech recognition', 'Bayesian modelling']
Rather than just calling model.show_topics(num_topics=10), we format the output a bit so it is easier to get an overview.
In [43]:
for topic in model.show_topics(num_topics=10):
    print('Label: ' + topic_labels[topic[0]])
    words = ''
    for word, prob in model.show_topic(topic[0]):
        words += word + ' '
    print('Words: ' + words)
    print()
These topics are by no means perfect. They have problems such as chained topics, intruded words, random topics, and unbalanced topics (see Mimno and co-authors 2011). They will do for the purposes of this tutorial, however.
Below, we use the model[name] syntax to retrieve the topic distribution for an author. Each topic has a probability of being expressed given the particular author, but only the ones above a certain threshold are shown.
In [18]:
model['YannLeCun']
Let's print the top topics of some authors. First, we make a function to help us do this more easily.
In [44]:
from pprint import pprint
def show_author(name):
    print('\n%s' % name)
    print('Docs:', model.author2doc[name])
    print('Topics:')
    pprint([(topic_labels[topic[0]], topic[1]) for topic in model[name]])
Below, we print some high profile researchers and inspect them. Three of these, Yann LeCun, Geoffrey E. Hinton and Christof Koch, are spot on.
Terrence J. Sejnowski's results are surprising, however. He is a neuroscientist, so we would expect him to get the "neuroscience" label. This may indicate that Sejnowski works with the neuroscience aspects of visual perception, or perhaps that we have labeled the topic incorrectly, or perhaps that this topic simply is not very informative.
In [55]:
show_author('YannLeCun')
In [46]:
show_author('GeoffreyE.Hinton')
In [47]:
show_author('TerrenceJ.Sejnowski')
In [53]:
show_author('ChristofKoch')
In [24]:
from gensim.models import atmodel
doc2author = atmodel.construct_doc2author(model.corpus, model.author2doc)
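The cell above builds the reverse mapping, from document IDs to author names. For intuition, the inversion can be sketched in plain Python as follows. This is an illustrative stand-in on made-up data, not the code behind construct_doc2author.

```python
example_author2doc = {'AliceA': [0, 2], 'BobB': [1, 2]}

# Invert the mapping: each document ID points to the authors who wrote it.
example_doc2author = {}
for author, doc_ids_of_author in example_author2doc.items():
    for doc_id in doc_ids_of_author:
        example_doc2author.setdefault(doc_id, []).append(author)

print(example_doc2author)
```

Document 2 ends up with two authors, which is the case that distinguishes the author-topic setting from a plain one-label-per-document setup.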
Now let's evaluate the per-word bound.
In [25]:
# Compute the per-word bound.
# Number of words in corpus.
corpus_words = sum(cnt for document in model.corpus for _, cnt in document)
# Compute bound and divide by number of words.
perwordbound = model.bound(model.corpus, author2doc=model.author2doc, \
doc2author=model.doc2author) / corpus_words
print(perwordbound)
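The normalization step is easy to check by hand on a tiny corpus. In the sketch below the bound value is a made-up number, not a model output; only the word counting mirrors the cell above.

```python
# Toy bag-of-words corpus: each document is a list of (token_id, count) pairs.
toy_corpus = [[(0, 1), (1, 2)], [(1, 3)]]
toy_bound = -42.0  # Hypothetical bound value, not a real model output.

# Total number of word occurrences in the corpus.
toy_corpus_words = sum(cnt for document in toy_corpus for _, cnt in document)

# Dividing by the word count puts the bound on a per-word scale.
toy_per_word_bound = toy_bound / toy_corpus_words
print(toy_corpus_words, toy_per_word_bound)
```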
We can evaluate the quality of the topics by computing the topic coherence, as in the LDA class. Use this to e.g. find out which of the topics are of poor quality, or as a metric for model selection.
In [26]:
%time top_topics = model.top_topics(model.corpus)
Now we're going to produce the kind of Pacific-archipelago-looking plot below. The goal of this plot is to give you a way to explore the author-topic representation in an intuitive manner.
We take all the author-topic distributions (stored in model.state.gamma) and embed them in a 2D space. To do this, we reduce the dimensionality of this data using t-SNE.
t-SNE is a method that attempts to reduce the dimensionality of a dataset while keeping points that are close in the original space close in the embedding. That means that if two authors are close together in the plot below, then their topic distributions are similar.
In the cell below, we transform the author-topic representation into the t-SNE space. You can increase the smallest_author value if you do not want to view all the authors with few documents.
In [27]:
%%time
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
smallest_author = 0  # Ignore authors with fewer documents than this.
authors = [model.author2id[a] for a in model.author2id.keys() if len(model.author2doc[a]) >= smallest_author]
_ = tsne.fit_transform(model.state.gamma[authors, :]) # Result stored in tsne.embedding_
We are now ready to make the plot.
Note that if you run this notebook yourself, you will see a different graph. The random initialization of the model will be different, and the result will thus be different to some degree. You may find an entirely different representation of the data, or it may show the same interpretation slightly differently.
If you can't see the plot, you are probably viewing this tutorial in a Jupyter Notebook. View it in an nbviewer instead at http://nbviewer.jupyter.org/github/rare-technologies/gensim/blob/develop/docs/notebooks/atmodel_tutorial.ipynb.
In [28]:
# Tell Bokeh to display plots inside the notebook.
from bokeh.io import output_notebook
output_notebook()
In [29]:
from bokeh.models import HoverTool
from bokeh.plotting import figure, show, ColumnDataSource
x = tsne.embedding_[:, 0]
y = tsne.embedding_[:, 1]
author_names = [model.id2author[a] for a in authors]
# Radius of each point corresponds to the number of documents attributed to that author.
scale = 0.1
author_sizes = [len(model.author2doc[a]) for a in author_names]
radii = [size * scale for size in author_sizes]
source = ColumnDataSource(
    data=dict(
        x=x,
        y=y,
        author_names=author_names,
        author_sizes=author_sizes,
        radii=radii,
    )
)
# Add author names and sizes to mouse-over info.
hover = HoverTool(
    tooltips=[
        ("author", "@author_names"),
        ("size", "@author_sizes"),
    ]
)
p = figure(tools=[hover, 'crosshair,pan,wheel_zoom,box_zoom,reset,save,lasso_select'])
p.scatter('x', 'y', radius='radii', source=source, fill_alpha=0.6, line_color=None)
show(p)
The circles in the plot above are individual authors, and their sizes represent the number of documents attributed to the corresponding author. Hovering your mouse over the circles will tell you the name of the authors and their sizes. Large clusters of authors tend to reflect some overlap in interest.
We see that the model tends to put duplicate authors close together. For example, Terrence J. Sejnowski and T. J. Sejnowski are the same person, and their vectors end up in the same place (see about $(-10, -10)$ in the plot).
At about $(-15, -10)$ we have a cluster of neuroscientists like Christof Koch and James M. Bower.
As discussed earlier, the "object recognition" topic was assigned to Sejnowski. If we get the topics of the other authors in Sejnowski's neighborhood, like Peter Dayan, we also get this same topic. Furthermore, we see that this cluster is close to the "neuroscience" cluster discussed above, which is further indication that this topic is about visual perception in the brain.
Other clusters include a reinforcement learning cluster at about $(-5, 8)$, and a Bayesian modelling cluster at about $(8, -12)$.
In this section, we are going to set up a system that takes the name of an author and yields the authors that are most similar. This functionality can be used as a component in an information retrieval system (i.e. a search engine of some kind), or in an author prediction system, i.e. a system that takes an unlabelled document and predicts the author(s) that wrote it.
We simply need to search for the closest vector in the author-topic space. In this sense, the approach is similar to the t-SNE plot above.
Below we illustrate a similarity query using a built-in similarity framework in Gensim.
In [30]:
from gensim.similarities import MatrixSimilarity
# Generate a similarity object for the transformed corpus.
index = MatrixSimilarity(model[list(model.id2author.values())])
# Get similarities to some author.
author_name = 'YannLeCun'
sims = index[model[author_name]]
However, this framework uses the cosine similarity, whereas we want to use the Hellinger distance. The Hellinger distance is a natural way of measuring the distance (i.e. dis-similarity) between two probability distributions. Its discrete version is defined as $$ H(p, q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^K (\sqrt{p_i} - \sqrt{q_i})^2}, $$
where $p$ and $q$ are the topic distributions of two different authors. We define the similarity as $$ S(p, q) = \frac{1}{1 + H(p, q)}. $$
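To make the two formulas concrete, here is a small pure-Python sketch that evaluates them on hand-made distributions. The function names and the example vectors are our own, introduced only for illustration.

```python
import math

def hellinger_distance(p, q):
    """Discrete Hellinger distance between two probability vectors of equal length."""
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)

def hellinger_similarity(p, q):
    """Similarity S(p, q) = 1 / (1 + H(p, q))."""
    return 1.0 / (1.0 + hellinger_distance(p, q))

# Three toy topic distributions over K = 3 topics.
same_a = [0.5, 0.5, 0.0]
overlap = [0.0, 0.5, 0.5]
disjoint_a = [1.0, 0.0, 0.0]
disjoint_b = [0.0, 0.0, 1.0]

print(hellinger_similarity(same_a, same_a))          # Identical distributions.
print(hellinger_similarity(same_a, overlap))         # Partially overlapping.
print(hellinger_similarity(disjoint_a, disjoint_b))  # No shared topics.
```

Since $H$ lies between 0 and 1, the similarity $S$ ranges from 1 for identical distributions down to 0.5 for distributions that share no topics.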
In the cell below, we prepare everything we need to perform similarity queries based on the Hellinger distance.
In [63]:
# Make a function that returns similarities based on the Hellinger distance.
from gensim import matutils
import pandas as pd
# Make a list of all the author-topic distributions.
author_vecs = [model.get_author_topics(author) for author in model.id2author.values()]
def similarity(vec1, vec2):
    '''Get similarity between two vectors'''
    dist = matutils.hellinger(matutils.sparse2full(vec1, model.num_topics), \
                              matutils.sparse2full(vec2, model.num_topics))
    sim = 1.0 / (1.0 + dist)
    return sim
def get_sims(vec):
    '''Get similarity of vector to all authors.'''
    sims = [similarity(vec, vec2) for vec2 in author_vecs]
    return sims
def get_table(name, top_n=10, smallest_author=1):
    '''
    Get table with similarities, author names, and author sizes.
    Return `top_n` authors as a dataframe.
    '''
    # Get similarities.
    sims = get_sims(model.get_author_topics(name))
    # Arrange author names, similarities, and author sizes in a list of tuples.
    table = []
    for elem in enumerate(sims):
        author_name = model.id2author[elem[0]]
        sim = elem[1]
        author_size = len(model.author2doc[author_name])
        if author_size >= smallest_author:
            table.append((author_name, sim, author_size))
    # Make dataframe and retrieve top authors.
    df = pd.DataFrame(table, columns=['Author', 'Score', 'Size'])
    df = df.sort_values('Score', ascending=False)[:top_n]
    return df
Now we can find the most similar authors to some particular author. We use the Pandas library to print the results in a nice-looking table.
In [64]:
get_table('YannLeCun')
As before, we can specify the minimum author size.
In [72]:
get_table('JamesM.Bower', smallest_author=3)
The AuthorTopicModel class accepts serialized corpora, that is, corpora that are stored on the hard-drive rather than in memory. This is usually done when the corpus is too big to fit in memory. There are, however, some caveats to this functionality, which we will discuss here. As these caveats make this functionality less than ideal, it may be improved in the future.
It is not necessary to read this section if you don't intend to use serialized corpora.
Below, we give an explanation, followed by an example and a summary.
If the corpus is serialized, the user must specify serialized=True. The input corpus can then be any type of iterable or generator.
The model will then take the input corpus and serialize it in the MmCorpus format, which is supported in Gensim.
The user must specify the path where the model should serialize all input documents, for example serialization_path='/tmp/model_serializer.mm'. To avoid accidentally overwriting some important data, the model will raise an error if there already exists a file at serialization_path; in this case, either choose another path, or delete the old file.
When you want to train on new data, and call model.update(corpus, author2doc), all the old data and the new data have to be re-serialized. This can of course be quite computationally demanding, so it is recommended that you do this only when necessary; that is, wait until you have as much new data as possible to update, rather than updating the model for every new document.
In [34]:
%time model_ser = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \
author2doc=author2doc, random_state=1, serialized=True, \
serialization_path='/tmp/model_serialization.mm')
In [35]:
# Delete the file, once you're done using it.
import os
os.remove('/tmp/model_serialization.mm')
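If you want to catch a leftover file before starting a long training run, a small guard like the one below can help. The helper check_serialization_path is hypothetical and written for this sketch; it only imitates the "refuse to overwrite" behaviour described above and is not part of Gensim.

```python
import os
import tempfile

def check_serialization_path(path):
    """Hypothetical helper: raise if a file already exists at `path`."""
    if os.path.exists(path):
        raise FileExistsError(
            'File already exists at %s; choose another path or delete it.' % path)
    return path

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'model_serialization.mm')
    # The path is free, so the check passes.
    first_ok = check_serialization_path(toy_path) == toy_path
    # Simulate a leftover file from an earlier run.
    open(toy_path, 'w').close()
    try:
        check_serialization_path(toy_path)
        second_raised = False
    except FileExistsError:
        second_raised = True

print(first_ok, second_raised)
```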
In summary, when using serialized corpora:
Set serialized=True.
Set serialization_path to a path that doesn't already contain a file.
Wait until you have lots of data before you call model.update(corpus, author2doc).
When done, delete the file at serialization_path if it's not needed anymore.
Try the model on one of the datasets in the StackExchange data dump. You can treat the tags on the posts as authors and train a "tag-topic" model. There are many different categories, from statistics to cooking to philosophy, so you can pick one that you like. You can even try your hand at a Kaggle competition that uses tags in this dataset.