The author-topic model: LDA with metadata

In this tutorial, you will learn how to use the author-topic model in Gensim. We will apply it to a corpus of scientific papers, to gain insight into the authors of those papers.

The author-topic model is an extension of Latent Dirichlet Allocation (LDA) that allows us to learn topic representations of authors in a corpus. The model can be applied to any kind of document labels, such as tags on posts on the web. The model can be used as a novel way of exploring data, as features in machine learning pipelines, for author (or tag) prediction, or simply to leverage your topic model with existing metadata.

To learn about the theoretical side of the author-topic model, see Rosen-Zvi and co-authors 2004, for example. A report on the algorithm used in the Gensim implementation will be available soon.

Naturally, familiarity with topic modelling, LDA and Gensim is assumed in this tutorial. If you are not familiar with LDA or its Gensim implementation, I would recommend starting there.

NOTE:

To run this tutorial on your own, install Jupyter, Gensim, SpaCy, Scikit-Learn, Bokeh and Pandas, e.g. using pip:

pip install jupyter gensim spacy scikit-learn bokeh pandas

Note that you need to download some data for SpaCy using python -m spacy.en.download.

Download the notebook at https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks/atmodel_tutorial.ipynb.

In this tutorial, we will learn how to prepare data for the model, how to train it, and how to explore the resulting representation in different ways. We will inspect the topic representation of some well-known authors like Geoffrey Hinton and Yann LeCun, and compare authors by plotting them in reduced dimensionality and by performing similarity queries.

Analyzing scientific papers

The data we will be using consists of scientific papers about machine learning, from the Neural Information Processing Systems conference (NIPS). It is the same dataset used in the Pre-processing and training LDA tutorial, mentioned earlier.

We will be performing qualitative analysis of the model, and at times this will require an understanding of the subject matter of the data. If you try running this tutorial on your own, consider applying it to a dataset with subject matter that you are familiar with. For example, try one of the StackExchange data dump datasets.

You can download the data from Sam Roweis' website (http://www.cs.nyu.edu/~roweis/data.html). Or just run the cell below, and it will be downloaded and extracted into your `/tmp` directory.


In [1]:
!wget -O - 'http://www.cs.nyu.edu/~roweis/data/nips12raw_str602.tgz' > /tmp/nips12raw_str602.tgz


--2017-01-16 12:29:12--  http://www.cs.nyu.edu/~roweis/data/nips12raw_str602.tgz
Resolving www.cs.nyu.edu (www.cs.nyu.edu)... 128.122.49.30
Connecting to www.cs.nyu.edu (www.cs.nyu.edu)|128.122.49.30|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12851423 (12M) [application/x-gzip]
Saving to: ‘STDOUT’

-                   100%[===================>]  12.26M  3.33MB/s    in 4.9s    

2017-01-16 12:29:18 (2.49 MB/s) - written to stdout [12851423/12851423]


In [4]:
import tarfile

filename = '/tmp/nips12raw_str602.tgz'
tar = tarfile.open(filename, 'r:gz')
for item in tar:
    tar.extract(item, path='/tmp')

In the following sections we will load the data, pre-process it, train the model, and explore the results using some of the implementation's functionality. Feel free to skip the loading and pre-processing for now, if you are familiar with the process.

Loading the data

In the cell below, we crawl the folders and files in the dataset, and read the files into memory.


In [5]:
import os, re

# Folder containing all NIPS papers.
data_dir = '/tmp/nipstxt/'  # Set this path to the data on your machine.

# Folders containing individual NIPS papers.
yrs = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
dirs = ['nips' + yr for yr in yrs]

# Get all document texts and their corresponding IDs.
docs = []
doc_ids = []
for yr_dir in dirs:
    files = os.listdir(data_dir + yr_dir)  # List of filenames.
    for filen in files:
        # Get document ID.
        (idx1, idx2) = re.search('[0-9]+', filen).span()  # Matches the indices of the start and end of the ID.
        doc_ids.append(yr_dir[4:] + '_' + str(int(filen[idx1:idx2])))
        
        # Read document text.
        # Note: ignoring characters that cause encoding errors.
        with open(data_dir + yr_dir + '/' + filen, errors='ignore', encoding='utf-8') as fid:
            txt = fid.read()
            
        # Replace any whitespace (newline, tabs, etc.) by a single space.
        txt = re.sub(r'\s', ' ', txt)
        
        docs.append(txt)

Construct a mapping from author names to document IDs.


In [2]:
# Get all author names and their corresponding document IDs,
# using the years defined in the previous cell.
author2doc = dict()
for yr in yrs:
    # The files "a00.txt" and so on contain the author-document mappings.
    filename = data_dir + 'idx/a' + yr + '.txt'
    for line in open(filename, errors='ignore', encoding='utf-8'):
        # Each line corresponds to one author.
        contents = re.split(',', line)
        author_name = (contents[1] + contents[0]).strip()
        # Remove any whitespace to reduce redundant author names.
        author_name = re.sub(r'\s', '', author_name)
        # Get document IDs for author.
        ids = [c.strip() for c in contents[2:]]
        if not author2doc.get(author_name):
            # This is a new author.
            author2doc[author_name] = []
        
        # Add document IDs to author.
        author2doc[author_name].extend([yr + '_' + id for id in ids])

# Use an integer ID in author2doc, instead of the IDs provided in the NIPS dataset.
# Mapping from ID of document in NIPS dataset, to an integer ID.
doc_id_dict = dict(zip(doc_ids, range(len(doc_ids))))
# Replace NIPS IDs by integer IDs.
for a, a_doc_ids in author2doc.items():
    for i, doc_id in enumerate(a_doc_ids):
        author2doc[a][i] = doc_id_dict[doc_id]

Pre-processing text

The text will be pre-processed using the following steps:

  • Tokenize text.
  • Replace all whitespace by single spaces.
  • Remove all punctuation and numbers.
  • Remove stopwords.
  • Lemmatize words.
  • Add multi-word named entities.
  • Add frequent bigrams.
  • Remove frequent and rare words.

A lot of the heavy lifting will be done by the excellent SpaCy package. SpaCy markets itself as "industrial-strength natural language processing"; it is fast, enables multiprocessing, and is easy to use. First, let's import it and load the English NLP pipeline.


In [3]:
import spacy
nlp = spacy.load('en')

In the code below, Spacy takes care of tokenization, removing non-alphabetic characters, removal of stopwords, lemmatization and named entity recognition.

Note that we only keep named entities that consist of more than one word, as single-word named entities are already present as individual tokens.


In [4]:
%%time
processed_docs = []    
for doc in nlp.pipe(docs, n_threads=4, batch_size=100):
    # Process document using Spacy NLP pipeline.
    
    ents = doc.ents  # Named entities.

    # Keep only words (no numbers, no punctuation).
    # Lemmatize tokens, remove punctuation and remove stopwords.
    doc = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]

    # Add named entities, but only if they are a compound of more than one word.
    doc.extend([str(entity) for entity in ents if len(entity) > 1])
    
    processed_docs.append(doc)


CPU times: user 9min 6s, sys: 276 ms, total: 9min 7s
Wall time: 2min 52s

In [5]:
docs = processed_docs
del processed_docs

Below, we use a Gensim model to add bigrams. Note that this serves a similar purpose to named entity recognition: finding adjacent words that together carry some particular significance.


In [6]:
# Compute bigrams.
from gensim.models import Phrases
# Add bigrams and trigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)


/home/olavur/Dropbox/my_folder/workstuff/DTU/thesis/code/gensim/gensim/models/phrases.py:248: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class
  warnings.warn("For a faster implementation, use the gensim.models.phrases.Phraser class")

Now that our vocabulary is finalized, we are ready to construct a dictionary. We then remove common words (occurring in more than $50\%$ of documents) and rare words (occurring fewer than $20$ times in total).


In [7]:
# Create a dictionary representation of the documents, and filter out frequent and rare words.

from gensim.corpora import Dictionary
dictionary = Dictionary(docs)

# Remove rare and common tokens.
# Filter out words that occur too frequently or too rarely.
max_freq = 0.5
min_wordcount = 20
dictionary.filter_extremes(no_below=min_wordcount, no_above=max_freq)

_ = dictionary[0]  # This sort of "initializes" dictionary.id2token.

We compute the bag-of-words representation of the documents; this is the vectorized form that we will supply to the author-topic model.


In [8]:
# Vectorize data.

# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]

Let's inspect the dimensionality of our data.


In [9]:
print('Number of authors: %d' % len(author2doc))
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))


Number of authors: 2479
Number of unique tokens: 6996
Number of documents: 1740

Train and use model

We train the author-topic model on the data prepared in the previous sections.

The interface to the author-topic model is very similar to that of LDA in Gensim. In addition to a corpus, ID to word mapping (id2word) and number of topics (num_topics), the author-topic model requires either an author to document ID mapping (author2doc), or the reverse (doc2author).
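
To make the expected input format concrete, here is a hypothetical toy author2doc mapping (the names and indices below are made up purely for illustration): the keys are author names, and the values are lists of integer indices into the corpus.

# Hypothetical example of the author2doc structure: each author name
# maps to the indices of that author's documents in `corpus`.
toy_author2doc = {
    'AdaLovelace': [0, 1],     # This author wrote documents 0 and 1.
    'AlanTuring': [1, 2, 3],   # Documents can have multiple authors.
}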

Below, we have also made the following adjustments (this can be skipped for now):

  • Increased the number of passes over the dataset (to improve the convergence of the optimization problem).
  • Decreased the number of iterations over each document (related to the above).
  • Specified the mini-batch size (chunksize) (primarily to speed up training).
  • Turned off bound evaluation (eval_every) (as it takes a long time to compute).
  • Turned on automatic learning of the alpha and eta priors (to improve the convergence of the optimization problem).
  • Set the random state (random_state) of the random number generator (to make these experiments reproducible).

We initialize the model and train it.


In [10]:
from gensim.models import AuthorTopicModel
%time model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \
                author2doc=author2doc, chunksize=2000, passes=1, eval_every=0, \
                iterations=1, random_state=1)


CPU times: user 3.56 s, sys: 316 ms, total: 3.87 s
Wall time: 3.65 s

If you believe your model hasn't converged, you can continue training using model.update(). If you have additional documents and/or authors, call model.update(corpus, author2doc), as sketched below.
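
A minimal sketch (here new_corpus and new_author2doc are hypothetical placeholders for additional data, prepared in exactly the same way as corpus and author2doc above):

# Continue training on the data the model has already seen.
model.update()

# Fold in new documents and authors (hypothetical variables).
# model.update(new_corpus, new_author2doc)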

Before we explore the model, let's try to improve upon it. To do this, we will train several models with different random initializations, by giving different seeds for the random number generator (random_state). We evaluate the topic coherence of the model using the top_topics method, and pick the model with the highest topic coherence.


In [11]:
%%time
model_list = []
for i in range(5):
    model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \
                    author2doc=author2doc, chunksize=2000, passes=100, gamma_threshold=1e-10, \
                    eval_every=0, iterations=1, random_state=i)
    top_topics = model.top_topics(corpus)
    tc = sum([t[1] for t in top_topics])
    model_list.append((model, tc))


CPU times: user 11min 59s, sys: 2min 14s, total: 14min 13s
Wall time: 11min 41s

Choose the model with the highest topic coherence.


In [12]:
model, tc = max(model_list, key=lambda x: x[1])
print('Topic coherence: %.3e' % tc)


Topic coherence: -1.847e+03

We save the model, to avoid having to train it again, and also show how to load it again.


In [13]:
# Save model.
model.save('/tmp/model.atmodel')

In [14]:
# Load model.
model = AuthorTopicModel.load('/tmp/model.atmodel')

Explore author-topic representation

Now that we have trained a model, we can start exploring the authors and the topics.

First, let's simply print the most important words in the topics. Below we have printed topic 0. As we can see, each topic is associated with a set of words, and each word has a probability of being expressed under that topic.


In [15]:
model.show_topic(0)


Out[15]:
[('chip', 0.014645100754555081),
 ('circuit', 0.011967493386263996),
 ('analog', 0.011466032752399413),
 ('control', 0.010067258628938444),
 ('implementation', 0.0078096719430403956),
 ('design', 0.0072620826472022419),
 ('implement', 0.0063648695668359189),
 ('signal', 0.0063389759280913392),
 ('vlsi', 0.0059415519461153785),
 ('processor', 0.0056545823226162124)]

Below, we have given each topic a label based on what each topic seems to be about intuitively.


In [42]:
topic_labels = ['Circuits', 'Neuroscience', 'Numerical optimization', 'Object recognition', \
               'Math/general', 'Robotics', 'Character recognition', \
                'Reinforcement learning', 'Speech recognition', 'Bayesian modelling']

Rather than just calling model.show_topics(num_topics=10), we format the output a bit so it is easier to get an overview.


In [43]:
for topic in model.show_topics(num_topics=10):
    print('Label: ' + topic_labels[topic[0]])
    words = ''
    for word, prob in model.show_topic(topic[0]):
        words += word + ' '
    print('Words: ' + words)
    print()


Label: Circuits
Words: chip circuit analog control implementation design implement signal vlsi processor 

Label: Neuroscience
Words: neuron cell spike response synaptic activity frequency stimulus synapse signal 

Label: Numerical optimization
Words: gradient noise prediction w optimal nonlinear matrix approximation series variance 

Label: Object recognition
Words: image visual object motion field direction representation map position orientation 

Label: Math/general
Words: bound f generalization class let w p theorem y threshold 

Label: Robotics
Words: dynamic control field trajectory neuron motor net forward l movement 

Label: Character recognition
Words: node distance character layer recognition matrix image sequence p code 

Label: Reinforcement learning
Words: action policy q reinforcement rule control optimal representation environment sequence 

Label: Speech recognition
Words: recognition speech word layer classifier net classification hidden class context 

Label: Bayesian modelling
Words: mixture gaussian likelihood prior data bayesian density sample cluster posterior 

These topics are by no means perfect. They have problems such as chained topics, intruded words, random topics, and unbalanced topics (see Mimno and co-authors 2011). They will do for the purposes of this tutorial, however.

Below, we use the model[name] syntax to retrieve the topic distribution for an author. Each topic has a probability of being expressed given the particular author, but only the ones above a certain threshold are shown.


In [18]:
model['YannLeCun']


Out[18]:
[(6, 0.99976720177983869)]
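
If you want to see the less probable topics as well, you can call get_author_topics directly and lower the cutoff. A sketch, assuming the minimum_probability keyword behaves as in Gensim's LDA interface:

# Show all topics for an author, including very improbable ones.
model.get_author_topics('YannLeCun', minimum_probability=0.0)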

Let's print the top topics of some authors. First, we make a function to help us do this more easily.


In [44]:
from pprint import pprint

def show_author(name):
    print('\n%s' % name)
    print('Docs:', model.author2doc[name])
    print('Topics:')
    pprint([(topic_labels[topic[0]], topic[1]) for topic in model[name]])

Below, we print some high-profile researchers and inspect them. Three of these, Yann LeCun, Geoffrey E. Hinton and Christof Koch, are spot on.

Terrence J. Sejnowski's results are surprising, however. He is a neuroscientist, so we would expect him to get the "neuroscience" label. This may indicate that Sejnowski works with the neuroscience aspects of visual perception, or perhaps that we have labeled the topic incorrectly, or perhaps that this topic simply is not very informative.


In [55]:
show_author('YannLeCun')


YannLeCun
Docs: [143, 406, 370, 495, 456, 449, 595, 616, 760, 752, 1532]
Topics:
[('Character recognition', 0.99976720177983869)]

In [46]:
show_author('GeoffreyE.Hinton')


GeoffreyE.Hinton
Docs: [56, 143, 284, 230, 197, 462, 463, 430, 688, 784, 826, 848, 869, 1387, 1684, 1728]
Topics:
[('Object recognition', 0.42128917017624745),
 ('Math/general', 0.043249835412857811),
 ('Robotics', 0.11149925993091593),
 ('Bayesian modelling', 0.42388500261455564)]

In [47]:
show_author('TerrenceJ.Sejnowski')


TerrenceJ.Sejnowski
Docs: [513, 530, 539, 468, 611, 581, 600, 594, 703, 711, 849, 981, 944, 865, 850, 883, 881, 1221, 1137, 1224, 1146, 1282, 1248, 1179, 1424, 1359, 1528, 1484, 1571, 1727, 1732]
Topics:
[('Object recognition', 0.99992379088787087)]

In [53]:
show_author('ChristofKoch')


ChristofKoch
Docs: [9, 221, 266, 272, 349, 411, 337, 371, 450, 483, 653, 663, 754, 712, 778, 921, 1212, 1285, 1254, 1533, 1489, 1580, 1441, 1657]
Topics:
[('Neuroscience', 0.99989393011046035)]

Simple model evaluation methods

We can compute the per-word bound, which is a measure of the model's predictive performance (you could also say that it is the reconstruction error).

To do that, we need the doc2author dictionary, which we can build automatically.


In [24]:
from gensim.models import atmodel
doc2author = atmodel.construct_doc2author(model.corpus, model.author2doc)

Now let's evaluate the per-word bound.


In [25]:
# Compute the per-word bound.
# Number of words in corpus.
corpus_words = sum(cnt for document in model.corpus for _, cnt in document)

# Compute bound and divide by number of words.
perwordbound = model.bound(model.corpus, author2doc=model.author2doc, \
                           doc2author=doc2author) / corpus_words
print(perwordbound)


-6.9955968712

We can evaluate the quality of the topics by computing their topic coherence, as in the LDA class. Use this, for example, to find out which topics are of poor quality, or as a metric for model selection.


In [26]:
%time top_topics = model.top_topics(model.corpus)


CPU times: user 15.6 s, sys: 4 ms, total: 15.6 s
Wall time: 15.6 s
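
top_topics returns the topics ordered from highest to lowest coherence (note that this ordering need not match our topic_labels). As a small sketch, we can print each topic's coherence score next to its top words:

# Each element of top_topics is a (topic, coherence) pair, where
# topic is a list of (probability, word) tuples.
for topic_words, coherence in top_topics:
    print('%.2f: %s' % (coherence, ' '.join(word for _, word in topic_words[:5])))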

Plotting the authors

Now we're going to produce a plot of the authors, the kind that looks a bit like a Pacific archipelago. The goal of this plot is to give you a way to explore the author-topic representation in an intuitive manner.

We take all the author-topic distributions (stored in model.state.gamma) and embed them in a 2D space. To do this, we reduce the dimensionality of this data using t-SNE.

t-SNE is a method that reduces the dimensionality of a dataset while attempting to preserve its local structure. That means that if two authors are close together in the plot below, then their topic distributions are similar.

In the cell below, we transform the author-topic representation into the t-SNE space. You can increase the smallest_author value if you do not want to view all the authors with few documents.


In [27]:
%%time
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
smallest_author = 0  # Ignore authors with fewer documents than this.
authors = [model.author2id[a] for a in model.author2id.keys() if len(model.author2doc[a]) >= smallest_author]
_ = tsne.fit_transform(model.state.gamma[authors, :])  # Result stored in tsne.embedding_


CPU times: user 35.4 s, sys: 1.16 s, total: 36.5 s
Wall time: 36.4 s

We are now ready to make the plot.

Note that if you run this notebook yourself, you will see a different graph. The random initialization of the model will be different, and the result will thus be different to some degree. You may find an entirely different representation of the data, or it may show the same interpretation slightly differently.

If you can't see the plot below, you are probably viewing a static render of this notebook. View it in nbviewer instead at http://nbviewer.jupyter.org/github/rare-technologies/gensim/blob/develop/docs/notebooks/atmodel_tutorial.ipynb.


In [28]:
# Tell Bokeh to display plots inside the notebook.
from bokeh.io import output_notebook
output_notebook()


Loading BokehJS ...

In [29]:
from bokeh.models import HoverTool
from bokeh.plotting import figure, show, ColumnDataSource

x = tsne.embedding_[:, 0]
y = tsne.embedding_[:, 1]
author_names = [model.id2author[a] for a in authors]

# Radius of each point corresponds to the number of documents attributed to that author.
scale = 0.1
author_sizes = [len(model.author2doc[a]) for a in author_names]
radii = [size * scale for size in author_sizes]

source = ColumnDataSource(
        data=dict(
            x=x,
            y=y,
            author_names=author_names,
            author_sizes=author_sizes,
            radii=radii,
        )
    )

# Add author names and sizes to mouse-over info.
hover = HoverTool(
        tooltips=[
        ("author", "@author_names"),
        ("size", "@author_sizes"),
        ]
    )

p = figure(tools=[hover, 'crosshair,pan,wheel_zoom,box_zoom,reset,save,lasso_select'])
p.scatter('x', 'y', radius='radii', source=source, fill_alpha=0.6, line_color=None)
show(p)


The circles in the plot above are individual authors, and their sizes represent the number of documents attributed to the corresponding author. Hovering your mouse over the circles will tell you the name of the authors and their sizes. Large clusters of authors tend to reflect some overlap in interest.

We see that the model tends to put duplicate authors close together. For example, Terrence J. Sejnowski and T. J. Sejnowski are the same person, and their vectors end up in the same place (see about $(-10, -10)$ in the plot).

At about $(-15, -10)$ we have a cluster of neuroscientists like Christof Koch and James M. Bower.

As discussed earlier, the "object recognition" topic was assigned to Sejnowski. If we get the topics of the other authors in Sejnowski's neighborhood, like Peter Dayan, we also get this same topic. Furthermore, we see that this cluster is close to the "neuroscience" cluster discussed above, which is a further indication that this topic is about visual perception in the brain.

Other clusters include a reinforcement learning cluster at about $(-5, 8)$, and a Bayesian modelling cluster at about $(8, -12)$.

Similarity queries

In this section, we are going to set up a system that takes the name of an author and yields the authors that are most similar. This functionality can be used as a component in an information retrieval system (i.e. a search engine of some kind), or in an author prediction system, i.e. a system that takes an unlabelled document and predicts the author(s) who wrote it.

We simply need to search for the closest vector in the author-topic space. In this sense, the approach is similar to the t-SNE plot above.

Below we illustrate a similarity query using a built-in similarity framework in Gensim.


In [30]:
from gensim.similarities import MatrixSimilarity

# Generate a similarity object for the transformed corpus.
index = MatrixSimilarity(model[list(model.id2author.values())])

# Get similarities to some author.
author_name = 'YannLeCun'
sims = index[model[author_name]]

This framework, however, uses cosine similarity, whereas we want to use the Hellinger distance. The Hellinger distance is a natural way of measuring the distance (i.e. dissimilarity) between two probability distributions. Its discrete version is defined as $$ H(p, q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^K (\sqrt{p_i} - \sqrt{q_i})^2}, $$

where $p$ and $q$ are both topic distributions for two different authors. We define the similarity as $$ S(p, q) = \frac{1}{1 + H(p, q)}. $$
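
To make the formula concrete, here is a tiny worked example on two made-up three-topic distributions, using the hellinger function from gensim.matutils (the same function we use in the next cell):

import numpy as np
from gensim import matutils

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.7])
dist = matutils.hellinger(p, q)  # H(p, q), approximately 0.52.
sim = 1.0 / (1.0 + dist)         # S(p, q), approximately 0.66.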

In the cell below, we prepare everything we need to perform similarity queries based on the Hellinger distance.


In [63]:
# Make a function that returns similarities based on the Hellinger distance.

from gensim import matutils
import pandas as pd

# Make a list of all the author-topic distributions.
author_vecs = [model.get_author_topics(author) for author in model.id2author.values()]

def similarity(vec1, vec2):
    '''Get similarity between two vectors'''
    dist = matutils.hellinger(matutils.sparse2full(vec1, model.num_topics), \
                              matutils.sparse2full(vec2, model.num_topics))
    sim = 1.0 / (1.0 + dist)
    return sim

def get_sims(vec):
    '''Get similarity of vector to all authors.'''
    sims = [similarity(vec, vec2) for vec2 in author_vecs]
    return sims

def get_table(name, top_n=10, smallest_author=1):
    '''
    Get table with similarities, author names, and author sizes.
    Return `top_n` authors as a dataframe.
    
    '''
    
    # Get similarities.
    sims = get_sims(model.get_author_topics(name))

    # Arrange author names, similarities, and author sizes in a list of tuples.
    table = []
    for author_id, sim in enumerate(sims):
        author_name = model.id2author[author_id]
        author_size = len(model.author2doc[author_name])
        if author_size >= smallest_author:
            table.append((author_name, sim, author_size))
            
    # Make dataframe and retrieve top authors.
    df = pd.DataFrame(table, columns=['Author', 'Score', 'Size'])
    df = df.sort_values('Score', ascending=False)[:top_n]
    
    return df

Now we can find the authors most similar to some particular author. We use the Pandas library to print the results as nice-looking tables.


In [64]:
get_table('YannLeCun')


Out[64]:
Author Score Size
2422 YannLeCun 1.000000 11
1717 PatriceSimard 0.999977 8
986 J.S.Denker 0.999581 3
2425 YaserAbu-Mostafa 0.998040 5
1160 JohnS.Denker 0.903560 6
187 AntoninaStarita 0.901699 1
1718 PatriceY.Simard 0.899005 4
560 DiegoSona 0.876237 1
612 EduardSackinger 0.870400 3
2413 Y.LeCun 0.868843 2

As before, we can specify the minimum author size.


In [72]:
get_table('JamesM.Bower', smallest_author=3)


Out[72]:
Author Score Size
118 JamesM.Bower 1.000000 10
44 ChristofKoch 0.999967 24
182 MatthewA.Wilson 0.999879 3
157 L.F.Abbott 0.999872 4
256 StephenP.DeWeerth 0.999869 5
82 EveMarder 0.999828 3
96 GirishN.Patel 0.856874 3
43 ChdstofKoch 0.788195 3
291 WilliamBialek 0.786987 4
247 Shih-ChiiLiu 0.781643 3

Serialized corpora

The AuthorTopicModel class accepts serialized corpora, that is, corpora that are stored on the hard drive rather than in memory. This is usually done when the corpus is too big to fit in memory. There are, however, some caveats to this functionality, which we will discuss here. As these caveats make the functionality less than ideal, it may be improved in the future.

It is not necessary to read this section if you don't intend to use serialized corpora.

In the following, we give an explanation, followed by an example and a summary.

To use a serialized corpus, specify serialized=True. The input corpus can then be any type of iterable or generator.

The model will then take the input corpus and serialize it in the MmCorpus format, which is supported in Gensim.

The user must specify the path where the model should serialize all input documents, for example serialization_path='/tmp/model_serializer.mm'. To avoid accidentally overwriting some important data, the model will raise an error if there already exists a file at serialization_path; in this case, either choose another path, or delete the old file.
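
For example, one way to guard against that error before training is to check for the file first; a minimal sketch (only remove the file if you are sure you no longer need it):

import os

serialization_path = '/tmp/model_serialization.mm'
if os.path.exists(serialization_path):
    os.remove(serialization_path)  # Or choose a different path instead.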

When you want to train on new data, and call model.update(corpus, author2doc), all the old data and the new data have to be re-serialized. This can of course be quite computationally demanding, so it is recommended that you do this only when necessary; that is, wait until you have as much new data as possible to update, rather than updating the model for every new document.


In [34]:
%time model_ser = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \
                               author2doc=author2doc, random_state=1, serialized=True, \
                               serialization_path='/tmp/model_serialization.mm')


CPU times: user 17.6 s, sys: 540 ms, total: 18.1 s
Wall time: 17.7 s

In [35]:
# Delete the file, once you're done using it.
import os
os.remove('/tmp/model_serialization.mm')

In summary, when using serialized corpora:

  • Set serialized=True.
  • Set serialization_path to a path that doesn't already contain a file.
  • Wait until you have lots of data before you call model.update(corpus, author2doc).
  • When done, delete the file at serialization_path if it's not needed anymore.

What to try next

Try the model on one of the datasets in the StackExchange data dump. You can treat the tags on the posts as authors and train a "tag-topic" model. There are many different categories, from statistics to cooking to philosophy, so you can pick one that you like. You can even try your hand at a Kaggle competition that uses the tags in this dataset. A sketch of how to get started follows.
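
A hedged sketch of how you might construct the tag-to-document mapping and pass it to the model in place of author2doc; here post_tags is a hypothetical list holding, for each document, the list of tags on that post:

# Hypothetical: post_tags[i] is the list of tags on document i.
tag2doc = {}
for doc_id, tags in enumerate(post_tags):
    for tag in tags:
        tag2doc.setdefault(tag, []).append(doc_id)

# Train exactly as before, treating tags as "authors".
# tag_model = AuthorTopicModel(corpus=corpus, num_topics=10,
#                              id2word=dictionary.id2token, author2doc=tag2doc)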