Present-day society is flooded with digital texts: never before has humankind produced more text than it does today. To cope efficiently with the vast amounts of text published nowadays, industry and academia alike increasingly turn to automated techniques for text analysis. Spam filtering and machine translation are but two popular examples of applications of automated text analysis.
In this workshop, we will work our way through a beginner's tutorial which showcases a number of ways in which you can use computer programming to analyze digital texts. We will use Python, an intuitive yet powerful programming language which is increasingly popular among people working in text analysis. While the scope of this workshop is too limited to present all possibilities, I hope it will give you some idea of what Python can currently do for text analysis. Below, we will have a look at three introductory topics: (1) building a text representation using a bag-of-words model, (2) visualizing texts via cluster trees and (3) text mining via topic modelling. We might not make it to the third and final part of the tutorial, but if you're interested, feel free to give this part a go at home.
Because we will of course not have the time to explain all the details of coding in Python, I will try to give you an idea of what it is like to use Python to analyse texts via 'distant reading'. I do not expect you to understand every little detail of the code after today, but I trust you will nevertheless experience that Python is an easy-to-read language. Hopefully, this tutorial will whet your appetite! Below you will find code blocks (in light grey) which you should be able to execute in your browser by pressing Shift + Enter. It is important that you execute each of these code blocks as you progress through the chapter. Let's get started!
Computers cannot intuitively process or 'read' texts the way humans can. Even today, a computer is still nothing more than an incredibly large and fast abacus, which can only count and process information if it is represented in a numerical format. For computers, texts are nothing more than a long series or 'string' of characters. In Python, we can create a variable holding a piece of text as follows:
In [ ]:
text = 'It is a truth, universally acknowledged.'
Here we define a string of text by enclosing it in quotation marks and assigning it to a variable or container called text. The fact that Python still sees this piece of text as a contiguous string of characters becomes evident when we ask Python to print out the length of text, using the len() function:
In [ ]:
print(len(text))
One could say that characters are the 'atoms' or smallest meaningful units in computational text processing. Just as computer images use pixels as their fundamental building blocks, all digital text processing applications start from raw characters, and it is these characters that are physically stored on your machine in bits and bytes.
In [ ]:
# your code goes here
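As a quick illustration of this, the sketch below uses Python's built-in ord() and encode() functions to make the numbers behind the characters visible (this is just a side demonstration; feel free to try variations in the empty cell above):
In [ ]:
# every character has a numeric code point...
print(ord('a'))
# ...and is physically stored as one or more bytes (shown here as numbers)
print(list('truth'.encode('utf-8')))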
Many people find it more intuitive to think of texts as strings of words, rather than plain characters, because words correspond to more concrete entities. In Python, we can easily turn our original 'string' into a list of words:
In [ ]:
words = text.split()
print(words)
Using the split() method, we split our original sentence into a word list along instances of whitespace characters. Note that, in technical terms, the variable type of text is different from that of the newly created variable words:
In [ ]:
print(type(text))
print(type(words))
Likewise, they evidently differ in length:
In [ ]:
print(len(text))
print(len(words))
By using 'indexing' (with square brackets), we can now access individual words in our word list. Check out the following print statements:
In [ ]:
print(words[3])
print(words[5])
Note that words is a so-called list variable in technical terms, but that the individual elements of words are still plain strings:
In [ ]:
print(type(words[3]))
In the previous paragraph, we adopted an extremely crude definition of a 'word', namely a string of characters that doesn't contain any whitespace. There are of course many problems that arise if we use such a naive definition. Can you think of some?
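To make these problems concrete, here is a small sketch (the example sentence is made up for illustration): when we simply split on whitespace, punctuation sticks to the surrounding words and contractions are left untouched.
In [ ]:
# a naive whitespace split on a made-up sentence
print("Don't panic, Mr. Bennet!".split())
# note how 'panic,' and 'Bennet!' still carry their punctuation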
In computer science, and computational linguistics in particular, people have come up with much smarter ways to divide texts into words. This process is called tokenization, which refers to the fact that it divides strings of characters into lists of more meaningful tokens. One interesting package which we can use for this is nltk (the Natural Language Toolkit), a package which has been specifically designed to deal with language problems. First, we have to import it, since it isn't part of Python's standard library:
In [ ]:
import nltk
We can now apply nltk's functionality, for instance, its (default) tokenizer for English:
In [ ]:
tokens = nltk.word_tokenize(text)
print(tokens)
Note how the function word_tokenize() neatly splits off punctuation! Many improvements nevertheless remain possible. To collapse the difference between uppercase and lowercase variants, for instance, we could first lowercase the original input string:
In [ ]:
lower_str = text.lower()
lower_tokens = nltk.word_tokenize(lower_str)
print(lower_tokens)
Many applications will not be very interested in punctuation marks, so we can try to remove these as well. The isalpha() method allows you to determine whether a string contains only alphabetic characters:
In [ ]:
print(lower_tokens[1].isalpha())
print(lower_tokens[-1].isalpha())
Functions like isalpha() return what is called a 'boolean' value, a kind of variable that can only take two values, i.e. True or False. Such values are useful because you can use them to test whether some condition is true or not. For example, if isalpha() evaluates to False for a word, we can have Python ignore such a word.
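To make this concrete, here is a minimal sketch (using a few hand-picked tokens) of how such a boolean test can be used in an if statement:
In [ ]:
# keep a token only if it consists of alphabetic characters
for tok in ['truth', ',', 'acknowledged', '.']:
    if tok.isalpha():
        print(tok, '-> keep')
    else:
        print(tok, '-> ignore')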
DIY
Using some more compact Python syntax (a so-called 'list comprehension'), it is very easy to filter out non-alphabetic strings. In the example below, I inserted a logical error on purpose: can you adapt the line below to make it output the correct result? You will notice that Python is a really intuitive programming language, because it almost reads like plain English.
In [ ]:
clean_tokens = [w for w in lower_tokens if not w.isalpha()]
print(clean_tokens)
Once we have come up with a good way to split texts into individual tokens, it is time to start thinking about how we can represent texts via these tokens. One popular approach to this problem is the bag-of-words model (BOW): this is a very old (and slightly naive) strategy for text representation, but it is still surprisingly popular. Many spam filters, for instance, still rely on a bag-of-words model when deciding which email messages end up in your Junk folder.
The intuition behind this model is very simple: to represent a document, we consider it a 'bag' containing tokens in no particular order. We then characterize a particular text by counting how often each term occurs in it. Counting how often each word occurs in the list of tokens from the example above is child's play in Python. For this purpose, we copy some of the code from the previous section and apply it to a larger paragraph:
In [ ]:
text = """It is a truth universally acknowledged, that a single man
in possession of a good fortune, must be in want of a wife. However
little known the feelings or views of such a man may be on his first
entering a neighbourhood, this truth is so well fixed in the minds
of the surrounding families, that he is considered the rightful
property of some one or other of their daughters. "My dear Mr. Bennet,"
said his lady to him one day, "have you heard that Netherfield Park is
let at last?" Mr. Bennet replied that he had not. "But it is," returned
she; "for Mrs. Long has just been here, and she told me all about it."
Mr. Bennet made no answer. "Do you not want to know who has taken it?"
cried his wife impatiently. "_You_ want to tell me, and I have no
objection to hearing it." This was invitation enough."""
lower_str = text.lower()
lower_tokens = nltk.word_tokenize(lower_str)
clean_tokens = [w for w in lower_tokens if w.isalpha()]
print('Word count:', len(clean_tokens))
We obtain a list of 148 tokens. Counting how often each individual token occurs in this 'document' is trivial, using the Counter object which we can import from Python's collections module:
In [ ]:
from collections import Counter
bow = Counter(clean_tokens)
print(bow)
Let us have a look at the three most frequent items in the text:
In [ ]:
print(bow.most_common(3))
Obviously, the most common items in such a frequency list are typically small, grammatical words that are very frequent throughout all texts in a language. Let us add a small visualisation of this information. We can use a barplot to show the top-frequency items in a more pleasing manner. In the following block, we use the matplotlib package for this purpose, a popular graphics package in Python. To make sure that the plots show up neatly in our notebook, first execute this cell:
In [ ]:
%matplotlib inline
And then execute the following blocks -- and try to understand the intuition behind them:
In [ ]:
# first, we extract the counts:
nb_words = 8
wrds, cnts = zip(*bow.most_common(nb_words))
print(wrds)
print(cnts)
# now the plotting part:
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots()
bar_width = 0.5
idxs = np.arange(nb_words)
#print(idxs)
ax.bar(idxs, cnts, bar_width, color='blue', align='center')
plt.xticks(idxs, wrds)
plt.show()
We are almost there: we now know how to split documents into tokens and how to count (and even visualize!) the frequencies of these items. Now it is only a small step towards a 'real' bag-of-words model. If we represent a collection of texts under a bag-of-words model, what we would really like to end up with is a frequency table, which has a row for each document and a column for each token that occurs in the collection; this set of tokens is also called the vocabulary of the corpus. Each cell is then filled with the frequency of the corresponding vocabulary item in that document, so that the final matrix looks like a standard, two-dimensional table familiar from spreadsheet applications.
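To make this concrete, here is a small sketch of such a table for two made-up mini-documents (the words and counts are invented purely for illustration):
In [ ]:
from collections import Counter

# two tiny, made-up 'documents'
doc1 = 'the cat sat on the mat'.split()
doc2 = 'the dog sat'.split()

# the shared vocabulary: every token that occurs in the collection
vocab = sorted(set(doc1 + doc2))
print(vocab)

# one row of counts per document, one column per vocabulary item
for doc in (doc1, doc2):
    counts = Counter(doc)
    print([counts[w] for w in vocab])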
While creating such a matrix yourself isn't too difficult in Python, here we will rely on an external package which makes it really simple to create such matrices efficiently. The zipped folder which you downloaded for this course contains a small corpus of novels by a number of famous Victorian novelists. Under data/victorian_small, you will find a number of files; the filenames indicate the author and (abbreviated) title of the novel contained in each file (e.g. Austen_Pride.txt). In the block below, I prepared some code to load these files from your hard drive into Python, which you can execute now:
In [ ]:
# we import some modules which we need
import glob
import os

# we create three empty lists
authors, titles, texts = [], [], []

# we loop over the filenames under the directory:
for filename in sorted(glob.glob('data/victorian_small/*.txt')):
    # we open a file and read its contents:
    with open(filename, 'r') as f:
        text = f.read()
    # we derive the title and author from the filename
    author, title = os.path.basename(filename).replace('.txt', '').split('_')
    # we add them to the lists:
    authors.append(author)
    titles.append(title)
    texts.append(text)
This code makes use of a so-called for-loop: after retrieving the list of relevant file names, we load the content of each file and add it to a list called texts, using the append() method. Additionally, we also create lists in which we store the authors and titles of the novels:
In [ ]:
print(authors)
print(titles)
Note that these three lists can be neatly zipped together, so that the third item in authors corresponds to the third item in titles (remember: Python starts counting at zero!):
In [ ]:
print('Title:', titles[2], '- by:', authors[2])
To have a peek at the content of this novel, we can now 'stack' indices as follows. Using the first square brackets ([2]) we select the third novel in the list, i.e. Sense and Sensibility by Jane Austen. Then, using a second index ([:300]), we print the first 300 characters of that novel.
In [ ]:
print(texts[2][:300])
After loading these documents, we can now represent them using a bag-of-words model. To this end, we will use a library called scikit-learn, or sklearn for short, which is increasingly popular in text analysis nowadays. As you can see below, we import its CountVectorizer object and apply it to our corpus, specifying that we would like to extract a maximum of 10 features from the texts. This means that we will only consider the frequencies of 10 words (to keep our model small enough to be workable for now).
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(max_features=10)
BOW = vec.fit_transform(texts).toarray()
print(BOW.shape)
The code block above creates a matrix which has a 9x10 shape: this means that the resulting matrix has 9 rows and 10 columns. Can you figure out where these numbers come from?
To find out which words are included, we can inspect the newly created vec object as follows:
In [ ]:
print(vec.get_feature_names())
As you can see, the max_features argument which we used above restricts the model to the n words which are most frequent throughout our texts. These are typically smallish function words. Funnily enough, sklearn uses its own default tokenizer, which ignores certain words; these are therefore, perhaps surprisingly, absent from the vocabulary list we just inspected. Can you figure out which words? Why are they absent?
Luckily, sklearn is flexible enough to allow us to use our own tokenizer. To use the nltk tokenizer, for instance, we can simply pass it as an argument when we create the CountVectorizer. (Note that, depending on the speed of your machine, the following code block might take a while to execute, because the tokenizer now has to process entire novels instead of a single sentence. Sit tight!)
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(max_features=10, tokenizer=nltk.word_tokenize)
BOW = vec.fit_transform(texts).toarray()
print(vec.get_feature_names())
Finally, let us visually inspect the BOW model which we have constructed. To this end, we make use of pandas, a Python package that is nowadays often used to work with all sorts of data tables. In the code block below, we create a new table or 'DataFrame' and add the correct row and column names:
In [ ]:
import pandas as pd # conventional shorthand!
df = pd.DataFrame(BOW, columns=vec.get_feature_names(), index=titles)
print(df)
After creating this DataFrame, it becomes very easy to retrieve specific information from our corpus. What is the frequency, for instance, of the word 'the' in each text?
In [ ]:
print(df['the'])
Or the frequency of 'of' in 'Emma'?
In [ ]:
print(df['of']['Emma'])
Now that we have converted our small dummy corpus to a bag-of-words matrix, we can finally start actually analyzing it! One very common technique to visualize texts is to render a cluster diagram or dendrogram. Such a tree-like visualization (an example will follow shortly) can be used to obtain a rough-and-ready first idea of the (dis)similarities between the texts in our corpus. Texts that cluster together under a similar branch in the resulting diagram can be argued to be stylistically closer to each other than texts which occupy completely different places in the tree. Texts by the same author, for instance, will often form tight clades in the tree, because they are written in a similar style.
However, when comparing texts, we should be aware of the fact that documents can vary strongly in length. The bag-of-words model which we created above does not take into account that some texts might be longer than others, because it simply uses absolute frequencies, which will be much higher in the case of longer documents. Before comparing texts on the basis of word frequencies, it therefore makes sense to apply some sort of normalization. One very common type of normalization is to use relative instead of absolute word frequencies: this means that we divide the original, absolute frequencies in a document by the total number of words in that document. Remember that we are dealing with a 9x10 matrix at this stage:
Each of the 9 document rows which we obtain should now be normalized by dividing each word count by the total number of words which we recorded for that document. First, we therefore need to calculate the row-wise sum in our table.
In [ ]:
totals = BOW.sum(axis=1, keepdims=True)
print(totals)
Now, we can efficiently 'scale' or normalize the matrix using these sums:
In [ ]:
BOW = BOW / totals
print(BOW.shape)
If we inspect our new frequency table, we can see that the values are now neatly in the 0-1 range:
In [ ]:
print(BOW)
Moreover, if we now print the sum of the word frequencies for each of our nine texts, we see that the relative values sum to 1:
In [ ]:
print(BOW.sum(axis=1))
That looks great. Let us now build a model with a more serious vocabulary size (=300) for the actual cluster analysis:
In [ ]:
vec = CountVectorizer(max_features=300, tokenizer=nltk.word_tokenize)
BOW = vec.fit_transform(texts).toarray()
BOW = BOW / BOW.sum(axis=1, keepdims=True)
Clustering algorithms are essentially based on the distances between texts: they typically start by calculating the distance between each pair of texts in a corpus, so that they know for each text how (dis)similar it is from any other text. Only after these pairwise distances have been calculated can the clustering algorithm start building a tree representation, in which similar texts are joined together and merged into new nodes. To create a distance matrix, we use a number of functions from scipy (Scientific Python), a commonly used package for scientific applications.
In [ ]:
from scipy.spatial.distance import pdist, squareform
The pdist() function ('pairwise distances') can be used to calculate the distance between each pair of texts in our corpus. Using the squareform() function, we will eventually obtain a 9x9 matrix, the structure of which is conceptually easy to understand: this square distance matrix (named dm) holds, for each of our 9 texts, the distance to every other text in the corpus. Naturally, the diagonal of this matrix contains only zeroes (since the distance from a text to itself is zero). We create this distance matrix as follows:
In [ ]:
dm = squareform(pdist(BOW))
print(dm.shape)
As is clear from the shape info, we have obtained a 9 by 9 matrix, which holds the distance between each pair of texts. Note that the distance from a text to itself is of course zero (cf. diagonal cells):
In [ ]:
print(dm[3][3])
print(dm[8][8])
Additionally, we can observe that the distance from text A to text B, is equal to the distance from B to A:
In [ ]:
print(dm[2][3])
print(dm[3][2])
We can visualize this distance matrix as a square heatmap, where darker cells indicate a larger distance between texts. Again, we use the matplotlib package to achieve this:
In [ ]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
heatmap = ax.pcolor(dm, cmap=plt.cm.Blues)
ax.set_xticks(np.arange(dm.shape[0])+0.5, minor=False)
ax.set_yticks(np.arange(dm.shape[1])+0.5, minor=False)
ax.set_xticklabels(titles, minor=False, rotation=90)
ax.set_yticklabels(authors, minor=False)
plt.show()
As you can see, the little squares representing texts by the same author already tend to show lower distance scores. But how exactly are these distances calculated?
Each text in our frequency table is represented as a row consisting of 300 numbers. Such a list of numbers is also called a document vector, which is why the document modelling process described above is sometimes also called vectorization (cf. CountVectorizer). In digital text analysis, documents are compared by applying standard metrics from geometry to these document vectors containing word frequencies. Let us have a closer look at one popular, but intuitively simple distance metric, the Manhattan city block distance. The formula behind this metric is very simple (don't be afraid of the mathematical notation; it won't bite):
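For two documents represented as word frequency vectors $a = (a_1, \ldots, a_n)$ and $b = (b_1, \ldots, b_n)$, where $n$ is the size of the vocabulary, the Manhattan distance is defined as:

$$d(a, b) = \sum_{i=1}^{n} |a_i - b_i|$$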
What this formula expresses is that, to calculate the distance between two documents, we loop over each word column in both texts and calculate the absolute difference between the values for that item in each text. Afterwards, we simply sum all these absolute differences.
DIY
Consider the following two dummy vectors:
In [ ]:
a = [2, 5, 1, 6, 7]
b = [4, 5, 1, 7, 3]
Can you calculate the Manhattan distance between a and b by hand? Compare the result you obtain to this line of code:
In [ ]:
from scipy.spatial.distance import cityblock as manhattan
print(manhattan(a, b))
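If you want to double-check your hand calculation, here is a rough sketch of the same computation spelled out in plain Python (the helper name manhattan_by_hand is made up for this illustration; it is not the scipy implementation):
In [ ]:
def manhattan_by_hand(u, v):
    # sum the absolute differences, position by position
    total = 0
    for x, y in zip(u, v):
        total += abs(x - y)
    return total

# this should match the scipy result above
print(manhattan_by_hand(a, b))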
This is an example of one popular distance metric that is currently used a lot in digital text analysis. Alternatives (which might ring a bell from math classes in high school) include the Euclidean distance and the cosine distance. Our dm distance matrix from above can be created with any of these options, by specifying the correct metric when calling pdist(). Try out some of them!
In [ ]:
dm = squareform(pdist(BOW, 'cosine')) # or 'euclidean', 'cityblock', etc.
fig, ax = plt.subplots()
heatmap = ax.pcolor(dm, cmap=plt.cm.Reds)
ax.set_xticks(np.arange(dm.shape[0])+0.5, minor=False)
ax.set_yticks(np.arange(dm.shape[1])+0.5, minor=False)
ax.set_xticklabels(titles, minor=False, rotation=90)
ax.set_yticklabels(authors, minor=False)
plt.show()
Now that we have learned how to calculate the pairwise distances between texts, we are very close to the dendrogram that I promised you a while back. To be able to visualize a dendrogram, we must first figure out the (branch) linkages in the tree, because we have to determine which texts are most similar to each other. Our clustering procedure therefore starts by merging (or 'linking') the most similar texts in the corpus into a new node; only at a later stage in the tree are these nodes of very similar texts joined together with nodes representing other texts. We perform this - fairly abstract - step on our distance matrix as follows:
In [ ]:
from scipy.cluster.hierarchy import linkage
linkage_object = linkage(dm)
We are now ready to draw the actual dendrogram, which we do in the following code block. Note that we annotate the outer leaf nodes in our tree (i.e. the actual texts) using the labels argument. With the orientation argument, we make sure that our dendrogram can be easily read from left to right:
In [ ]:
from scipy.cluster.hierarchy import dendrogram
d = dendrogram(Z=linkage_object, labels=titles, orientation='right')
Using the authors as labels is of course also a good idea:
In [ ]:
d = dendrogram(Z=linkage_object, labels=authors, orientation='right')
As we can see, Jane Austen's novels form a tight and distinctive cloud; an author like Thackeray is apparently more difficult to tell apart. The actual distance between nodes is reflected in the horizontal length of the branches (i.e. the values on the x-axis in this plot). Note that in this code block too we can easily switch to, for instance, the Euclidean distance. Does the code block below produce better results?
In [ ]:
dm = squareform(pdist(BOW, 'euclidean'))
linkage_object = linkage(dm, method='ward')
d = dendrogram(Z=linkage_object, labels=authors, orientation='right')
Exercise
The code repository also contains a larger folder of novels, called victorian_large. Use the code block below to copy and paste code snippets from above, which you can slightly adapt to do the following things:
- Read in the texts, producing 3 lists: texts, authors and titles. How many texts did you load (use the len() function)?
- Select the text of David (Copperfield) by Charles Dickens and find out how often the word "and" is used in this text. Hint: use the nltk tokenizer and the Counter object from the collections module. Make sure no punctuation is included in your counts.
- Vectorize the texts using the CountVectorizer with the 250 most frequent words.
- Normalize the resulting document matrix and draw a heatmap using blue colors.
- Draw a cluster diagram and experiment with the distance metrics: which distance metric produces the 'best' result from the point of view of authorship clustering?
In [ ]:
# exercise code goes here
Up until now, we have been working with fairly small, dummy-size corpora to introduce you to some standard methods for text analysis in Python. When working with real-world data, however, we are often confronted with much larger and noisier datasets, sometimes even datasets that are too large to read or inspect manually. To deal with such huge datasets, researchers in fields such as computer science have come up with a number of techniques that allow us to nevertheless get a grasp of the kind of texts that are contained in a document collection, as well as their content.
For this part of the tutorial, I have included a set of over 3,000 documents under the folder 'data/newsgroups'. The so-called "20 newsgroups dataset" is a very famous dataset in computational linguistics (see this website): it refers to a collection of approximately 20,000 newsgroup documents, divided into 20 categories, each corresponding to a different topic. The topics are very diverse and range from science to politics. I have subsampled a number of these categories in the repository for this tutorial, but I won't tell you which... The idea is that we will use topic modelling so that you can find out for yourself which topics are discussed in this dataset! First, we start by loading the documents, using code that is very similar to the text-loading code we used above:
In [ ]:
import os

documents, names = [], []

for filename in sorted(os.listdir('data/newsgroups')):
    try:
        with open('data/newsgroups/' + filename, 'r') as f:
            text = f.read()
        documents.append(text)
        names.append(filename)
    except:
        continue

print(len(documents))
As you can see, we are dealing with 3,551 documents. Have a look at some of the documents and try to find out what they are about. Vary the index used to select a random document and print out its first 1000 characters or so:
In [ ]:
print(documents[3041][:1000])
You might already get a sense of the kind of topics that are being discussed. You will also notice that this is rather noisy data, which is challenging for humans to process manually. In the last part of this tutorial we will use a technique called topic modelling. This technique automatically determines a number of topics, or semantic word clusters, that seem to be important in a document collection. The nice thing about topic modelling is that it is a largely unsupervised technique, meaning that it does not need prior information about the document collection or the language it uses. It simply inspects which words often co-occur in documents and are therefore more likely to be semantically related.
After fitting a topic model to a document collection, we can use it to inspect which topics have been detected. Additionally, we can use the model to infer to what extent these topics are present in new documents. Interestingly, the model does not assume that texts are always about a single topic; rather, it assumes that documents contain a mixture of different topics. A text about the transfer of a football player, for instance, might contain 80% of a 'sports' topic, 15% of a 'finance'-related topic, and 5% of a topic about 'Spanish lifestyle'. For topic modelling too, we first need to convert our corpus to a numerical format (i.e. 'vectorize' it as we did above). Luckily, we already know how to do that:
In [ ]:
vec = CountVectorizer(max_df=0.95, min_df=5, max_features=2000, stop_words='english')
BOW = vec.fit_transform(documents)
print(BOW.shape)
Note that we make use of a couple of additional bells and whistles that ship with sklearn's CountVectorizer. Can you figure out what they mean (hint: df here stands for document frequency)? In topic modelling we are not interested in the kind of high-frequency grammatical words that we have used up until now. Such words are typically called function words in Information Retrieval, and they are mostly completely ignored in topic modelling. Have a look at the 2,000 features extracted: are these indeed content words?
In [ ]:
print(vec.get_feature_names())
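As a small, hedged illustration of what the extra df settings do, the toy corpus below is invented purely for this purpose: with min_df and max_df you can watch words drop out of the vocabulary because they occur in too few or too many documents:
In [ ]:
# a tiny, made-up corpus of four 'documents'
toy_docs = ['the cat sat', 'the cat ran', 'the dog barked', 'the dog slept']

# 'the' occurs in every document (document frequency of 1.0) and is dropped by max_df;
# words occurring in fewer than 2 documents are dropped by min_df
toy_vec = CountVectorizer(max_df=0.75, min_df=2)
toy_vec.fit(toy_docs)
print(toy_vec.get_feature_names())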
We are now ready to start modelling the topics in this text collection. For this we make use of a popular technique called Latent Dirichlet Allocation or LDA, which is also included in the sklearn library. In the code block below, you can safely ignore most of the settings which we use when we initialize the model, but you should pay attention to the n_topics and max_iter parameters. The former controls how many topics we will extract from the document collection (this is one of the few parameters which the model, sadly, does not learn itself). We start with a fairly small number of topics, but if you want a more fine-grained analysis of your corpus you can always increase this parameter. The max_iter setting, finally, controls how long we let the model 'think': the more iterations we allow, the better the model will get, but because LDA is, as you will see, fairly computationally intensive, it makes sense to start with a relatively low number here. You can now execute the following code block -- you will see that it might take several minutes to complete.
In [ ]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_topics=50,
                                max_iter=10,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(BOW)
After the model has (finally!) been fitted, we can inspect our topics. We do this by finding out which items in our vocabulary have the highest score for each topic. The topics are available as lda.components_ once the model has been fitted.
In [ ]:
feature_names = vec.get_feature_names()
for topic_idx, topic in enumerate(lda.components_):
    print('Topic', topic_idx, '> ', end='')
    print(' '.join([feature_names[i] for i in topic.argsort()[:-12 - 1:-1]]))
Can you make sense of these topics? Which are the main thematic categories that you can discern?
Now that we have built a topic model, we can use it to represent our corpus. Instead of representing each document as a vector containing word frequencies, we represent it as a vector containing topic scores. To achieve this, we simply apply the model's transform() function to the bag-of-words representation of our documents:
In [ ]:
topic_repr = lda.transform(BOW)
print(topic_repr.shape)
As you can see, we obtain another sort of document matrix, where the number of columns corresponds to the number of topics we extracted. Let us now find out whether this representation yields anything useful. It is difficult to visualize 3,000+ documents all at once, so in the code block below I select a smaller subset of 30 documents (and the corresponding filenames), using the random module.
In [ ]:
comb = list(zip(names, topic_repr))
import random
random.seed(10000)
random.shuffle(comb)
comb = comb[:30]
subset_names, subset_topic_repr = zip(*comb)
We can now use our clustering algorithm from above in an exactly parallel way. Go on and try it (because of the random aspect of the previous code block, it is possible that you obtain a different random selection).
In [ ]:
dm = squareform(pdist(subset_topic_repr, 'cosine')) # or 'euclidean', 'cityblock', etc.
linkage_object = linkage(dm, method='ward')
fig_size = plt.rcParams["figure.figsize"]
plt.rcParams["figure.figsize"] = [15, 9]
d = dendrogram(Z=linkage_object, labels=subset_names, orientation='right')
How many clusters can you identify? Can you inspect the content of some of the files? Does it make sense that they are placed under a similar branch in the tree? Do they treat similar topics?
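If you want to inspect the content of one of the files in the tree, a small sketch like the one below may help (the index 0 is arbitrary; replace it, or the filename, with the leaf you want to look at):
In [ ]:
# look up a document from the subset by its filename and print the beginning of it
fname = subset_names[0]
print(fname)
print(documents[names.index(fname)][:1000])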