In [ ]:
%%capture
!rm -rf data/*
!unzip data.zip -d data/
!pip install --no-cache-dir pyldavis
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pyLDAvis
import pyLDAvis.sklearn
import pickle
%matplotlib inline

Topic Modeling in Python

In Lisa Rhody's article, "Topic Modeling and Figurative Language", she uses LDA topic modeling to look at ekphrasis poetry. She argues that ekphrasis poetry is particularly well-suited to an LDA analysis because of the assumption of a previously existing set of topics. She's able to extract a number of topics, each consisting of a set of words and their probabilities. While we don't have Rhody's corpus, we can use this technique on any large text corpus. We'll use a corpus of novels curated by Andrew Piper.


Corpus Description

We'll look at an English-language subset of Andrew Piper's novel corpus, totaling 150 novels by British and American authors spanning the years 1771-1930. These texts are each in a separate plaintext file in our data folder. Metadata is contained in a spreadsheet distributed with the novel files by the txtLAB at McGill.

The provided metadata describes the corpus that exists as .txt files, so let's first read in the metadata:


In [ ]:
metadata_tb = Table.read_table('data/txtlab_Novel150_English.csv')
metadata_tb.show(5)

Before we go anywhere, let's randomly shuffle the rows so that we don't have them ordered by dates or anything else:


In [ ]:
np.random.seed(0)
metadata_tb = Table.from_df(metadata_tb.to_df().sample(frac=1))
metadata_tb.show(5)

We can see the column variables we have in the metadata with the .labels attribute:


In [ ]:
metadata_tb.labels

To clarify:

  1. `filename`: Name of file on disk
  2. `id`: Unique ID in Piper corpus
  3. `language`: Language of novel
  4. `date`: Initial publication date
  5. `author`: Author's name
  6. `title`: Title of novel
  7. `gender`: Authorial gender
  8. `person`: Textual perspective
  9. `length`: Number of tokens in novel

The table contains a list of filenames; these correspond to files in a folder we have called txtlab_Novel150_English:


In [ ]:
!ls data/txtlab_Novel150_English/

We can then read in the full text for each novel by iterating through the column, reading each file and appending the string to our novel_list:


In [ ]:
# create empty list, entries will be list of tokens from each novel
novel_list = []

# iterate through filenames in metadata table
for filename in metadata_tb['filename']:
    
    # read in novel text as single string
    with open('data/txtlab_Novel150_English/'+filename, 'r') as f:
        novel = f.read()
    
    # clean up (no titles)
    toks = novel.split()  # split to tokens
    toks = [t for t in toks if not t.istitle() and not t.isupper()]  # quick & dirty no titles/proper nouns
    novel = ' '.join(toks)  # join to single string
    
    # add string
    novel_list.append(novel)

Let's double check they all came through:


In [ ]:
len(novel_list)

And look at the first 200 characters of the fourth novel:


In [ ]:
metadata_tb['author'][3], metadata_tb['title'][3], novel_list[3][:200]

Document Term Matrix

Now we need to make a document term matrix, just as we have in the past two classes. We can pull in our CountVectorizer from sklearn again to create our dtm:


In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

You may not have paid much attention to max_features, max_df, and min_df before, but for topic modeling these parameters are extremely important: without them, your topics will not be very coherent.

Let's start out with this:

  • max_features = 5000 (i.e. only include 5000 tokens in our dtm)
  • max_df = .8 (i.e. don't keep any tokens that appear in > 80% of the documents)
  • min_df = 5 (i.e. only keep a token if it appears in at least 5 documents)

We'll add in stop_words='english' too, which removes the words in sklearn's built-in English stopword list from our dtm:


In [ ]:
cv = CountVectorizer(max_features=5000, stop_words='english', max_df=0.80, min_df=5)

As with most machine learning approaches, to validate your model you need training and testing partitions. Since we don't have any labels (topic modeling is unsupervised machine learning), we just need to do this for the novel strings:


In [ ]:
train = novel_list[:120]
test = novel_list[120:]

Now we can use our cv to fit_transform our training list of novels (strings!):


In [ ]:
dtm = cv.fit_transform(train)

To get our words back out, we'll use the method get_feature_names():


In [ ]:
dtm_feature_names = cv.get_feature_names()
dtm_feature_names[:10]

We can double check that our feature limit was enforced by calling len on the dtm_feature_names:


In [ ]:
len(dtm_feature_names)

We can throw our dtm into a Table like we have before too:


In [ ]:
dtm_tb = Table(dtm_feature_names).with_rows(dtm.toarray())
dtm_tb.show(5)

Topic Modeling

Latent Dirichlet Allocation (LDA) Models

LDA reflects an intuition that words in a text are not merely chosen at random but are drawn from underlying concepts (the so-called "latent variables"). The goal of LDA is to look across many texts in order to reverse engineer these concepts by finding words that tend to cluster with one another. For this reason, LDA has been referred to as "the mother of all word collocation techniques."
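
To make this concrete, here is a minimal sketch of what a fitted LDA model gives you, run on a tiny made-up corpus (the sentences and the choice of two topics are purely illustrative): a topic-word matrix in components_ and a document-topic distribution from transform.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # tiny toy corpus, invented for illustration only
    toy_docs = [
        "the ship sailed across the stormy sea",
        "the captain steered the ship into the harbor",
        "the garden bloomed with roses and lilies",
        "she planted roses along the garden wall",
    ]

    toy_dtm = CountVectorizer(stop_words='english').fit_transform(toy_docs)

    toy_lda = LatentDirichletAllocation(n_components=2, max_iter=10, random_state=0)
    doc_topics = toy_lda.fit_transform(toy_dtm)

    print(toy_lda.components_.shape)  # (2 topics, n_features): word weights per topic
    print(doc_topics.shape)           # (4 documents, 2 topics): topic mixture per document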

sklearn has the LatentDirichletAllocation function:


In [ ]:
from sklearn.decomposition import LatentDirichletAllocation

Let's check the doc string:


In [ ]:
LatentDirichletAllocation?

Importantly, we'll note:

  • `n_components`: This is the number of topics. Choosing it is the art of topic modeling.
  • `max_iter`: The model starts from a random initialization and is iteratively refined; max_iter caps how many iterations it gets.

Let's just say we'll look for 10 topics, and we'll use a max_iter of 5. Generally, the higher max_iter is, the more opportunity the model has to tune itself accurately, but it also takes much longer.

    
    
In [ ]:
lda = LatentDirichletAllocation(n_components=10, max_iter=5)

Before we fit the model, we need to remember that a lot of these probabilistic models use random number generators to start the algorithm. If we want our results to be reproducible, we need to set the random seed of the math library we use, in this case numpy:


In [ ]:
np.random.seed(0)

Now we just fit the model, as we've done with all sklearn models. This may take a while; a lot is going on:


In [ ]:
lda_model = lda.fit(dtm)

Evaluation

One measure of the model's fit is perplexity, which judges how well the model predicts held-out data. We need to call this on our test portion after it has been transformed into a dtm with the same vectorizer:


In [ ]:
lda_model.perplexity(cv.transform(test))

NOTE: At the time of writing, sklearn's perplexity implementation is broken.

The lower the perplexity, the better the fit of the model. So one way to choose the number of topics would be to loop through several candidate numbers of topics and pick the one that minimizes perplexity, as sketched below.
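
A minimal sketch of that loop, assuming the cv, train/test split, and dtm defined above (the candidate values and the low max_iter are arbitrary choices to keep it fast):

    # hypothetical grid search over the number of topics, scored by held-out perplexity
    candidate_k = [5, 10, 15, 20]          # arbitrary candidate topic counts
    test_dtm = cv.transform(test)

    perplexities = []
    for k in candidate_k:
        model = LatentDirichletAllocation(n_components=k, max_iter=5, random_state=0)
        model.fit(dtm)                     # dtm was built from the training novels
        perplexities.append(model.perplexity(test_dtm))

    best_k = candidate_k[int(np.argmin(perplexities))]  # lower perplexity = better fit
    print(best_k, perplexities)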

Unfortunately, it has been shown time and again that minimizing perplexity does not actually separate topics into the kinds of coherent groups that humans would recognize.

Choosing the best model

Since traditional metrics for evaluating a model's fit have not proven to conform to human judgment, a new approach was developed by David Mimno in 2011.

This coherence score measures how much, within the words used to describe a topic, a common word is on average a good predictor for a less common word. (More on topic coherence.)
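
Concretely, the u_mass coherence that gensim computes (following Mimno et al. 2011) scores a topic's $M$ most probable words $v_1, \dots, v_M$, ordered from most to least frequent, roughly as

$$C = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D(v_m, v_l) + 1}{D(v_l)},$$

where $D(v)$ is the number of documents containing $v$ and $D(v, v')$ is the number containing both; the $+1$ avoids taking the log of zero. (This is a sketch of the formula, up to averaging and normalization details.)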

Here we look for the highest value. This metric is implemented in the Python gensim library (not in sklearn). I ran the following code for you on a remote server because it takes a while!


    import pickle
    import multiprocessing

    import gensim
    from joblib import Parallel, delayed

    # `corpus` and `dictionary` are a gensim bag-of-words corpus and Dictionary
    # built from the tokenized novels (see the sketch below); `try_topic_n` is
    # the list of candidate topic numbers to score
    try_topic_n = list(range(5, 200, 2))


    def try_topic_number(i):
        # fit an LDA model with i topics
        lda_model = gensim.models.LdaModel(
            corpus,
            num_topics=i,
            id2word=dictionary,
            iterations=1000,
            alpha='auto',
            passes=4)

        # score it with Mimno's u_mass topic coherence
        cm = gensim.models.CoherenceModel(
            model=lda_model,
            corpus=corpus,
            dictionary=dictionary,
            coherence='u_mass')

        return cm.get_coherence()


    if __name__ == '__main__':

        num_cores = multiprocessing.cpu_count()

        # score every candidate number of topics in parallel
        results = Parallel(n_jobs=num_cores)(delayed(try_topic_number)(i)
                                             for i in try_topic_n)

        pickle.dump(results, open('scores.pkl', 'wb'))
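
The script above assumes a gensim dictionary and corpus built beforehand. Here is a minimal sketch of how they could be constructed from the training novels we already have, assuming the same quick whitespace tokenization used earlier (this is a reconstruction, not the exact server-side code):

    from gensim.corpora import Dictionary

    # whitespace-tokenize each novel string (same rough tokenization as above)
    tokenized_novels = [novel.split() for novel in train]

    # map tokens to integer ids, mirroring the CountVectorizer vocabulary limits
    dictionary = Dictionary(tokenized_novels)
    dictionary.filter_extremes(no_below=5, no_above=0.8, keep_n=5000)

    # convert each novel to a bag-of-words list of (token_id, count) pairs
    corpus = [dictionary.doc2bow(toks) for toks in tokenized_novels]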
    

You can see above that I've dumped the coherence scores into a binary pickle file. A pickle is simply a Python object that has been serialized to a binary file. We can load it back in too:


In [ ]:
try_topic_n = list(range(5,200,2))
scores = pickle.load(open('scripts/scores.pkl', 'rb'))
list(zip(try_topic_n, scores))
    

Let's plot these results:


In [ ]:
plt.plot(try_topic_n, scores)
plt.xlabel('number of topics')
plt.ylabel('coherence score')

numpy has handy argmax and argmin functions that return the index of the highest or lowest value in an array:
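
For example, on a toy list (made up just to show that it returns the position, not the value):

    np.argmax([3, 9, 2])  # -> 1, the index of the largest value (9)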

    
    
In [ ]:
np.argmax(scores)

Then we can index into our list of candidate topic numbers to get the number of topics with the highest coherence:


In [ ]:
try_topic_n[np.argmax(scores)]

I've retrained the model for 13 topics and exported it as below (note that max_iter=1000 takes a long time, so I've pickled the model again):


    lda = LatentDirichletAllocation(n_components=13, max_iter=1000)
    lda_model = lda.fit(dtm)
    
    pickle.dump((lda, lda_model, dtm, cv), open('13-topics.pkl', 'wb'))
    

We can load in the pre-trained model from the pickle:


In [ ]:
lda, lda_model, dtm, cv = pickle.load(open('scripts/13-topics.pkl', 'rb'))

Many papers in the social sciences still don't use a quantitative evaluation metric at all. Many simply use the pyLDAvis library to visualize the topic distributions, looking for reasonably sized topics with little overlap as markers of a well-chosen number of topics:

    
    
In [ ]:
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(lda_model, dtm, cv)

Topics

To print the topics, we can write a function. display_topics will print the most probable words in each topic.

    
    
In [ ]:
def display_topics(model, feature_names, num_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(topic_idx, " ".join([feature_names[i] for i in topic.argsort()[:-num_top_words - 1:-1]]))

Now let's print the top 10 words of each of the 13 topics in the model we trained, using our display_topics function. Have a look through the output and see what topics you can spot:

    
    
In [ ]:
display_topics(lda, dtm_feature_names, 10)

We can print which topic each novel is closest to by indexing the topic probabilities and using the argmax function:

    
    
In [ ]:
doc_topic = lda.transform(dtm)

for n in range(doc_topic.shape[0]):
    topic_most_pr = doc_topic[n].argmax()
    print(metadata_tb['author'][n], metadata_tb['title'][n])
    print("doc: {} topic: {}\n".format(n, topic_most_pr))

To get the probability of each topic for a given book, we can print the whole probability vector for that novel:

    
    
In [ ]:
metadata_tb['author'][25], metadata_tb['title'][25], doc_topic[25]

Challenge

Add these topic assignments back to our Table metadata_tb.

    
    
In [ ]:
# YOUR CODE HERE

Interpreting the Model

There are many strategies that can be used to interpret the output of a topic model. In this case, we will look for any correlations between the topic distributions and metadata.

We'll first grab all the topic distributions, similar to what we did above. Remember, the order of the novels is still the same!

    
    
In [ ]:
list_of_doctopics = [doc_topic[n] for n in range(len(doc_topic))]
list_of_doctopics[0]

We'll make a DataFrame, which is similar to a Table, with the probabilities for the topics (columns) and documents (rows):

    
    
In [ ]:
df = pd.DataFrame(list_of_doctopics)
df.head()

We can add these columns to our metadata_tb Table:

    
    
In [ ]:
meta = metadata_tb.to_df()
meta[df.columns] = df
meta.head()

The corr() method will give us a correlation matrix:

    
    
In [ ]:
meta.corr()

We see some strong correlations between certain topics and date. Recall the topics:

    
    
In [ ]:
display_topics(lda, dtm_feature_names, 10)


In [ ]:
meta.plot.scatter(x='date', y=1)


In [ ]:
meta.plot.scatter(x='date', y=5)


In [ ]:
meta.plot.scatter(x='date', y=6)


In [ ]:
meta.plot.scatter(x='date', y=12)

Why do you think we see this?

Homework

We're going to work with the 20 Newsgroups corpus, which is widely used for demos on general texts:

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning to filter netnews paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
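
The pickles loaded below were presumably built from sklearn's built-in loader. If you'd rather fetch the corpus yourself, here is a minimal sketch (it downloads and caches the data on first run):

    from sklearn.datasets import fetch_20newsgroups

    # fetch the standard train/test splits of the 20 Newsgroups corpus
    newsgroups_train = fetch_20newsgroups(subset='train')
    newsgroups_test = fetch_20newsgroups(subset='test')

    print(newsgroups_train.target_names[:5])
    print(len(newsgroups_train.data), len(newsgroups_test.data))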

Let's read in the training data:


In [ ]:
train_subset = pickle.load(open('scripts/20-news-train.pkl', 'rb'))

Here are the predetermined categories:

    
    
In [ ]:
train_subset.target_names

Since we're topic modeling, we don't care how the documents have been labeled, but it'll be interesting to see how our topics line up with these categories!

How many documents are there?

    
    
In [ ]:
len(train_subset.data)

Let's get a list of documents as strings, just like we did with the novels, and then we'll randomly shuffle them in case they're ordered by category already:

    
    
In [ ]:
documents_train = train_subset.data
np.random.shuffle(documents_train)


In [ ]:
print(documents_train[0])

Now we'll do the same for the test set:

    
    
In [ ]:
test_subset = pickle.load(open('scripts/20-news-test.pkl', 'rb'))
documents_test = test_subset.data
np.random.shuffle(documents_test)
print(documents_test[0])

TASK:

You now have two arrays of strings: documents_train and documents_test. Create a dtm and then a topic model with k topics. Just choose one value of k and a very low max_iter for training so it doesn't take too long.

See how the topics match up to the annotated categories, and play with different ways of preprocessing the data. Use the pyLDAvis library to evaluate your model.

What did you have to do to get decent results?

    
    
In [ ]:


BONUS (not assigned)

Create a classifier from this corpus. The assigned groups are in the target attribute:

    
    
In [ ]:
train_subset.target


In [ ]:
test_subset.target


In [ ]: