In this notebook we will explore some tools for text analysis in Python. To do so, we first import the required Python libraries.
In [ ]:
%matplotlib inline
# Required imports
from wikitools import wiki
from wikitools import category
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import gensim
import numpy as np
import lda
import lda.datasets
import matplotlib.pyplot as plt
from test_helper import Test
In these notebooks we will explore some tools for text processing and analysis, as well as two topic modeling algorithms available in Python toolboxes.
To do so, we will explore and analyze collections of Wikipedia articles from a given category, using wikitools, a library that makes it easy to retrieve content from Wikimedia sites.
(As a side note, there are many other text collections available for testing topic modeling algorithms. In particular, the NLTK library includes many examples, which you can explore using the nltk.download() tool:
import nltk
nltk.download()
For instance, you can take the Gutenberg dataset:
Mycorpus = nltk.corpus.gutenberg
text_name = Mycorpus.fileids()[0]
raw = Mycorpus.raw(text_name)
Words = Mycorpus.words(text_name)
Also, tools like Gensim or scikit-learn include text databases to work with.)
In order to use Wikipedia data, we will select a single category of articles:
In [ ]:
site = wiki.Wiki("https://en.wikipedia.org/w/api.php")
# Select a category with a reasonable number of articles (>100)
cat = "Economics"
# cat = "Pseudoscience"
print cat
You can try any other category. Take into account that the behavior of topic modeling algorithms may depend on the number of documents available for the analysis, so select a category with at least 100 articles. You can browse the Wikipedia category tree, starting for instance at https://en.wikipedia.org/wiki/Category:Contents.
We start by downloading the text collection.
In [ ]:
# Loading category data. This may take a while
print "Loading category data. This may take a while..."
cat_data = category.Category(site, cat)
corpus_titles = []
corpus_text = []
for n, page in enumerate(cat_data.getAllMembersGen()):
    print "\r Loading article {0}".format(n + 1),
    corpus_titles.append(page.title)
    corpus_text.append(page.getWikiText())
n_art = len(corpus_titles)
print "\nLoaded " + str(n_art) + " articles from category " + cat
Now, we have stored the whole text collection in two lists:

- corpus_titles, which contains the titles of the selected articles
- corpus_text, with the text content of the selected Wikipedia articles

You can browse the content of the Wikipedia articles to get some intuition about the kind of documents that will be processed.
In [ ]:
# n = 5
# print corpus_titles[n]
# print corpus_text[n]
Topic modelling algorithms process vectorized data. In order to apply them, we need to transform the raw text input data into a vector representation. To do so, we will remove irrelevant information from the text data and preserve as much relevant information as possible to capture the semantic content in the document collection.
Thus, we will proceed with the following steps:

- Tokenization
- Homogenization (lowercase conversion, removal of non-alphanumeric tokens, and stemming or lemmatization)
- Cleaning (stopword removal)
- Vectorization (bag-of-words representation)
For the first steps, we will use some of the powerful methods available from the Natural Language Toolkit. In order to use the word_tokenize method from nltk, you might need to get the appropriate libraries using nltk.download(). You must select option "d) Download", and identifier "punkt".
In [ ]:
# You can comment this if the package is already available.
# Select option "d) Download", and identifier "punkt"
# nltk.download()
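Once "punkt" is available, you can run a quick sanity check of the tokenizer on a toy sentence (made up for illustration, not taken from the corpus):
# Toy sentence, just to see how the tokenizer splits words and punctuation
print word_tokenize("The economy grew by 2.3% in 2015, according to official figures.")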
Task: Insert the appropriate call to word_tokenize in the code below, in order to get the list of tokens corresponding to each Wikipedia article:
In [ ]:
corpus_tokens = []
for n, art in enumerate(corpus_text):
    print "\rTokenizing article {0} out of {1}".format(n + 1, n_art),
    # This is to make sure that all characters have the appropriate encoding.
    art = art.decode('utf-8')
    # Tokenize each text entry.
    # scode: tokens = <FILL IN>
    # Add the new token list as a new element to corpus_tokens (that will be a list of lists)
    # scode: <FILL IN>
print "\n The corpus has been tokenized. Let's check some portion of the first article:"
print corpus_tokens[0][0:30]
In [ ]:
Test.assertEquals(len(corpus_tokens), n_art, "The number of articles has changed unexpectedly")
Test.assertTrue(len(corpus_tokens) >= 100,
"Your corpus_tokens has less than 100 articles. Consider using a larger dataset")
By looking at the tokenized corpus you may verify that there are many tokens that correspond to punctuation signs and other symbols that are not relevant for analyzing the semantic content. They can be removed, and the remaining words normalized, in a homogenization step based on the nltk stemming tools.
The homogenization process will consist of:

- Converting all tokens to lowercase
- Removing non-alphanumeric tokens (such as punctuation signs)
- Stemming (or, alternatively, lemmatization)
Task: Convert all tokens in corpus_tokens to lowercase (using the .lower() method) and remove non-alphanumeric tokens (which you can detect with the .isalnum() method). You can do it in a single line of code...
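As a quick illustration of these two string methods, applied to a few made-up tokens:
# A few made-up tokens, just to illustrate the two string methods
print 'The'.lower()        # 'the'
print 'U.S.'.isalnum()     # False (contains dots)
print 'economy'.isalnum()  # True
print '2015'.isalnum()     # True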
In [ ]:
# Select stemmer.
stemmer = nltk.stem.SnowballStemmer('english')
corpus_filtered = []

for n, token_list in enumerate(corpus_tokens):
    print "\rFiltering article {0} out of {1}".format(n + 1, n_art),
    # Convert all tokens in token_list to lowercase and remove non-alphanumeric tokens.
    # Store the result in a new token list, filtered_tokens.
    # scode: filtered_tokens = <FILL IN>
    # Add filtered_tokens to corpus_filtered
    # scode: <FILL IN>

print "\nLet's check the first tokens from document 0 after filtering:"
print corpus_filtered[0][0:30]
In [ ]:
Test.assertTrue(all([c==c.lower() for c in corpus_filtered[23]]), 'Capital letters have not been removed')
Test.assertTrue(all([c.isalnum() for c in corpus_filtered[13]]), 'Non alphanumeric characters have not been removed')
Task: Apply the .stem() method, from the stemmer object created in the first line, to corpus_filtered.
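Before running it on the whole corpus, you can try the stemmer (created in the previous cell) on a few made-up words; note that the resulting stems are truncated roots, not necessarily real dictionary words:
# A few example words, just to see what the Snowball stemmer produces
print [stemmer.stem(w) for w in ['economists', 'economic', 'markets', 'running']]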
In [ ]:
# Select stemmer.
stemmer = nltk.stem.SnowballStemmer('english')
corpus_stemmed = []
for n, token_list in enumerate(corpus_filtered):
    print "\rStemming article {0} out of {1}".format(n + 1, n_art),
    # Apply stemming to all tokens in token_list and save them in stemmed_tokens
    # scode: stemmed_tokens = <FILL IN>
    # Add stemmed_tokens to the stemmed corpus
    # scode: <FILL IN>
print "\nLet's check the first tokens from document 0 after stemming:"
print corpus_stemmed[0][0:30]
In [ ]:
Test.assertTrue((len([c for c in corpus_stemmed[0] if c!=stemmer.stem(c)]) < 0.1*len(corpus_stemmed[0])),
'It seems that stemming has not been applied properly')
Alternatively, we can apply lemmatization. For English texts, we can use the lemmatizer from NLTK, which is based on WordNet. If you have not used WordNet before, you will likely need to download it using nltk.download().
In [ ]:
# You can comment this if the package is already available.
# Select option "d) Download", and identifier "wordnet"
# nltk.download()
Task: Apply the .lemmatize() method, from the WordNetLemmatizer object created in the first line, to corpus_filtered.
In [ ]:
wnl = WordNetLemmatizer()

corpus_lemmat = []

for n, token_list in enumerate(corpus_filtered):
    print "\rLemmatizing article {0} out of {1}".format(n + 1, n_art),
    # Apply lemmatization to all tokens in token_list and save them in lemmat_tokens
    # scode: lemmat_tokens = <FILL IN>
    # Add lemmat_tokens to the lemmatized corpus
    # scode: <FILL IN>

print "\nLet's check the first tokens from document 0 after lemmatization:"
print corpus_lemmat[0][0:30]
One of the advantages of the lemmatizer is that the result of lemmatization is still a true word, which is preferable when text processing results have to be presented or interpreted.
However, without using contextual information, lemmatize() does not remove grammatical differences. This is the reason why "is" or "are" are preserved and not replaced by the infinitive "be".
As an alternative, we can apply .lemmatize(word, pos), where pos is a string code specifying the part of speech (POS), i.e. the grammatical role of the word in its sentence. For instance, you can check the difference between wnl.lemmatize('is') and wnl.lemmatize('is', pos='v').
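A minimal check of this behavior, using the wnl object created above:
print wnl.lemmatize('is')             # 'is'  (the default part of speech is noun)
print wnl.lemmatize('is', pos='v')    # 'be'
print wnl.lemmatize('are', pos='v')   # 'be'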
In [ ]:
# You can comment this if the package is already available.
# Select option "d) Download", and identifier "stopwords"
# nltk.download()
Task: In the second line below we read a list of common English stopwords. Clean corpus_stemmed by removing all tokens in the stopword list.
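If you are curious about what these stopwords look like, you can print a few of them and try the filtering on a made-up token list:
print stopwords.words('english')[0:10]
# Toy example (made-up, already stemmed tokens): keep only tokens that are not stopwords
print [t for t in ['the', 'economi', 'of', 'a', 'countri'] if t not in stopwords.words('english')]
# Expected output: ['economi', 'countri']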
In [ ]:
corpus_clean = []
stopwords_en = stopwords.words('english')
n = 0
for token_list in corpus_stemmed:
    n += 1
    print "\rRemoving stopwords from article {0} out of {1}".format(n, n_art),
    # Remove all tokens in the stopwords list and append the result to corpus_clean
    # scode: clean_tokens = <FILL IN>
    # scode: <FILL IN>
print "\n Let's check tokens after cleaning:"
print corpus_clean[0][0:30]
In [ ]:
Test.assertTrue(len(corpus_clean) == n_art, 'List corpus_clean does not contain the expected number of articles')
Test.assertTrue(len([c for c in corpus_clean[0] if c in stopwords_en])==0, 'Stopwords have not been removed')
Up to this point, we have transformed the raw text collection of articles into a list of articles, where each article is a collection of the word roots that are most relevant for semantic analysis. Now, we need to convert these data (a list of token lists) into a numerical representation (a list of vectors, or a matrix). To do so, we will start using the tools provided by the gensim library.
As a first step, we create a dictionary containing all tokens in our text corpus, and assign an integer identifier to each of them.
In [ ]:
# Create dictionary of tokens
D = gensim.corpora.Dictionary(corpus_clean)
n_tokens = len(D)
print "The dictionary contains {0} tokens".format(n_tokens)
print "First tokens in the dictionary: "
for n in range(10):
    print str(n) + ": " + D[n]
In the second step, let us create a numerical version of our corpus using the doc2bow method. In general, D.doc2bow(token_list) transforms any list of tokens into a list of tuples (token_id, n), one for each distinct token in token_list, where token_id is the token identifier (according to dictionary D) and n is the number of occurrences of such token in token_list.
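To see how this works, here is a small made-up example, independent of the Wikipedia corpus:
# Toy example: a tiny corpus of two token lists (not the Wikipedia data)
toy_corpus = [['market', 'price', 'demand'], ['price', 'supply']]
toy_D = gensim.corpora.Dictionary(toy_corpus)
print toy_D.token2id
print toy_D.doc2bow(['price', 'demand', 'price'])
# Each tuple is (token_id, number_of_occurrences); 'price' appears twice in this document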
Task: Apply the doc2bow method from gensim dictionary D to all tokens in every article in corpus_clean. The result must be a new list named corpus_bow, where each element is a list of tuples (token_id, number_of_occurrences).
In [ ]:
# Transform token lists into sparse vectors on the D-space
# scode: corpus_bow = <FILL IN>
In [ ]:
Test.assertTrue(len(corpus_bow)==n_art, 'corpus_bow does not have the appropriate size')
At this point, it is good to make sure you understand what has happened. In corpus_clean we had a list of token lists. With it, we have constructed a Dictionary, D, which assigns an integer identifier to each token in the corpus.
After that, we have transformed each article (in corpus_clean) into a list of tuples (id, n).
In [ ]:
print "Original article (after cleaning): "
print corpus_clean[0][0:30]
print "Sparse vector representation (first 30 components):"
print corpus_bow[0][0:30]
print "The first component, {0} from document 0, states that token 0 ({1}) appears {2} times".format(
corpus_bow[0][0], D[0], corpus_bow[0][0][1])
Note that we can interpret each element of corpus_bow as a sparse vector. For example, the list of tuples
[(0, 1), (3, 3), (5, 2)]
for a dictionary of 10 elements can be represented as a vector where each tuple (id, n) states that position id takes value n, and all other positions are zero:
[1, 0, 0, 3, 0, 2, 0, 0, 0, 0]
These sparse vectors will be the inputs to the topic modeling algorithms.
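If you want to check this interpretation numerically, a sparse BoW vector can be expanded into a dense numpy array as follows (toy example, assuming a dictionary of 10 tokens):
# Toy example: expand the sparse vector [(0, 1), (3, 3), (5, 2)] over a 10-token dictionary
sparse_vec = [(0, 1), (3, 3), (5, 2)]
dense_vec = np.zeros(10)
for token_id, count in sparse_vec:
    dense_vec[token_id] = count
print dense_vec   # [ 1.  0.  0.  3.  0.  2.  0.  0.  0.  0.]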
Note that, at this point, we have built a Dictionary containing
In [ ]:
print "{0} tokens".format(len(D))
and a bow representation of a corpus with
In [ ]:
print "{0} Wikipedia articles".format(len(corpus_bow))
Before starting with the semantic analysis, it is interesting to observe the token distribution of the given corpus.
In [ ]:
# SORTED TOKEN FREQUENCIES (I):
# Create a "flat" corpus with all tuples in a single list
corpus_bow_flat = [item for sublist in corpus_bow for item in sublist]
# Initialize a numpy array that we will use to count tokens.
# token_count[n] should store the number of occurrences of the n-th token, D[n]
token_count = np.zeros(n_tokens)

# Count the number of occurrences of each token.
for x in corpus_bow_flat:
    # Update the proper element in token_count
    # scode: <FILL IN>

# Sort by decreasing number of occurrences
ids_sorted = np.argsort(- token_count)
tf_sorted = token_count[ids_sorted]
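If the trick of sorting by the negated counts is new to you, here is a minimal illustration with made-up counts:
# Minimal illustration: np.argsort on the negated array gives indices from largest to smallest
toy_counts = np.array([2., 7., 1., 5.])
print np.argsort(-toy_counts)               # [1 3 0 2]
print toy_counts[np.argsort(-toy_counts)]   # [ 7.  5.  2.  1.]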
ids_sorted is a list of all token ids, sorted by decreasing number of occurrences in the whole corpus. For instance, the most frequent term is
In [ ]:
print D[ids_sorted[0]]
which appears
In [ ]:
print "{0} times in the whole corpus".format(tf_sorted[0])
In the following we plot the most frequent terms in the corpus.
In [ ]:
# SORTED TOKEN FREQUENCIES (II):
plt.rcdefaults()
# Example data
n_bins = 25
hot_tokens = [D[i] for i in ids_sorted[n_bins-1::-1]]
y_pos = np.arange(len(hot_tokens))
z = tf_sorted[n_bins-1::-1]/n_art
plt.barh(y_pos, z, align='center', alpha=0.4)
plt.yticks(y_pos, hot_tokens)
plt.xlabel('Average number of occurrences per article')
plt.title('Token distribution')
plt.show()
In [ ]:
# SORTED TOKEN FREQUENCIES:
# Example data
plt.semilogy(tf_sorted)
plt.xlabel('Token rank')
plt.ylabel('Total number of occurrences in the corpus')
plt.title('Token distribution')
plt.show()
Exercise: There are usually many tokens that appear with very low frequency in the corpus. Count the number of tokens that appear only once, and compute the proportion they represent in the token list.
In [ ]:
# scode: cold_tokens = <FILL IN>
print "There are {0} cold tokens, which represent {1}% of the total number of tokens in the dictionary".format(
len(cold_tokens), float(len(cold_tokens))/n_tokens*100)
Exercise: Represent graphically the 20 tokens that appear in the highest number of articles. Note that you can use the code above (headed by # SORTED TOKEN FREQUENCIES) with a very minor modification.
In [ ]:
# scode: <WRITE YOUR CODE HERE>
Exercise: Count the number of tokens appearing only in a single article.
In [ ]:
# scode: <WRITE YOUR CODE HERE>
Exercise (All in one): Note that, for pedagogical reasons, we have used a different for loop for each text processing step, creating a new corpus_xxx variable after each step. For a very large corpus, this could cause memory problems.
As a summary exercise, repeat the whole text processing, starting from corpus_text up to computing the bow, with the following modifications:

- Use a single for loop, avoiding the creation of any intermediate corpus variables.
- Store the resulting bow representation in a new variable named corpus_bow1.
In [ ]:
# scode: <WRITE YOUR CODE HERE>
Exercise (Visualizing categories): Repeat the previous exercise with a second Wikipedia category. For instance, you can take "Communication".

- Save the bow representation of the second category in a new variable, corpus_bow2.
- Determine the most frequent term in corpus_bow1 (term1) and in corpus_bow2 (term2).
- Transform each article in corpus_bow1 and corpus_bow2 into a 2-dimensional vector, where the first component is the frequency of term1 and the second component is the frequency of term2, so that the articles of both categories can be visualized in this 2-dimensional space.
In [ ]:
# scode: <WRITE YOUR CODE HERE>
Exercise (bigrams): nltk provides a utility to compute n-grams from a list of tokens, in nltk.util.ngrams. Join all tokens in corpus_clean into a single list and compute the bigrams. Plot the 20 most frequent bigrams in the corpus.
In [ ]:
# scode: <WRITE YOUR CODE HERE>
# Check the code below to see how ngrams works, and adapt it to solve the exercise.
# from nltk.util import ngrams
# sentence = 'this is a foo bar sentences and i want to ngramize it'
# bigrams = ngrams(sentence.split(), 2)
# for grams in bigrams:
#     print grams
In [ ]:
# Save the dictionary and the BoW corpus to disk for later use
import pickle
data = {}
data['D'] = D
data['corpus_bow'] = corpus_bow
pickle.dump(data, open("wikiresults.p", "wb"))
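The saved dictionary and BoW corpus can be recovered later (for instance, in a follow-up session) with pickle.load; a minimal sketch:
# Sketch: load back the objects saved above
data_loaded = pickle.load(open("wikiresults.p", "rb"))
print "Recovered a dictionary with {0} tokens and {1} BoW vectors".format(
    len(data_loaded['D']), len(data_loaded['corpus_bow']))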