Text Mining

Text mining is the process of automatically extracting high-quality information from text. High-quality information is typically derived by identifying patterns and trends in the text, for example through statistical pattern learning.

Typical text mining applications include:

  • Text classification (or text categorization),
  • Text clustering,
  • Sentiment analysis,
  • Named entity recognition, etc.

In this notebook:

  1. Preprocessing: textual normalization, simple tokenization
  2. Stopword removal: its importance
  3. Verify Zipf's law on a collection of DBPedia abstracts

How to use this notebook

This environment is called Jupyter Notebook.

It has two types of cells:

  • Markdown cells (like this one, where you can write notes)
  • Code cells

Run code cells by pressing Shift+Enter. Let's try...


In [ ]:
# Run me: press Shift+Enter
print("Hello, world!!")

This is a hands-on session, so it is time for you to write some code yourself. Let's try that.


In [ ]:
# Write code to print any string...

# Then run the code.

Preprocessing

Upper case and punctuation

For the tasks in this notebook, upper-case letters and punctuation marks carry little useful information, so we normalize the text by lower-casing it and removing the punctuation.

Note: Python already provides a string of punctuation characters (string.punctuation). We simply need to import it.


In [ ]:
from string import punctuation

s = "Hello, World!!"

# Write code to lower case the string
s = ...

# Write code to remove punctuations
# HINT: for loop and for each punctuation use string replace() method
for ...
    s = ...

print(s)
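
One possible way to complete the cell above (a sketch, not the only solution): lower-case the string with str.lower(), then strip each character of string.punctuation with str.replace().


In [ ]:
# A possible solution (for reference)
from string import punctuation

s = "Hello, World!!"

# lower-case the string
s = s.lower()

# remove punctuation characters one by one
for p in punctuation:
    s = s.replace(p, "")

print(s)  # -> hello world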

Tokenization: NLTK

The Natural Language Toolkit (NLTK) is a Python platform for working with human (natural) language data.

As usual, we will first convert everything to lower case and remove punctuation.


In [ ]:
raw1 = "Grenoble is a city in southeastern France, at the foot of the French Alps, on the banks of Isère."
raw2 = "Grenoble is the capital of the department of Isère and is an important scientific centre in France."

# Write code here to convert everything in lower case and to remove punctuation.


print(raw1)
print(raw2)
# Again, SHIFT+ENTER to run the code.

NLTK already provides modules to easily tokenize text. We will tokenize the raw texts above using the word_tokenize function of the NLTK package.


In [ ]:
import nltk
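# NOTE: word_tokenize relies on NLTK's Punkt tokenizer models.
# If they are not installed yet, you may need to run:
# nltk.download('punkt')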

# Tokenization using NLTK
tokens1 = nltk.word_tokenize(raw1)
tokens2 = nltk.word_tokenize(raw2)

# print the tokens
print(tokens1)
print(tokens2)

We now build an NLTK Text object to store each tokenized text. One or more Text objects can then be merged into a TextCollection, which provides many useful operations for statistically analyzing a collection of texts.


In [ ]:
# Build NLTK Text objects
text1 = nltk.Text(tokens1)
text2 = nltk.Text(tokens2)

# A list of Text objects
text_list = [text1, text2]

# Build NLTK text collection
text_collection = nltk.text.TextCollection(text_list)

An NLTK TextCollection object can be used to calculate basic statistics:

  1. count the number of occurrences (i.e., the term frequency) of a word
  2. obtain a frequency distribution of all the words in the collection

Note: The NLTK Text objects created in the intermediate steps can also be used to calculate similar statistics at the document level.


In [ ]:
# Frequency of a word
freq = text_collection.count("grenoble")
print("Frequency of word \'grenoble\' = ", freq)

In [ ]:
# Frequency distribution
freq_dist = nltk.FreqDist(text_collection)
freq_dist

Let's automate: write a function

Using the steps above, we will now write a function called raw_to_text. This function takes a list of raw texts and returns an NLTK TextCollection object representing the input texts.


In [ ]:
"""
Converts a list of raw text to a NLTK TextCollection object.
Applies lower-casing and punctuation removal.
Returns:
text_collection - a NLTK TextCollection object
"""
def raw_to_text(raw_list):
    text_list = []
    for raw in raw_list:
        # Write code for lower-casing and punctuation removal
        
        
        # Write code to tokenize and create NLTK Text object
        # Name the variable 'text' to store the Text object
        
        
        # storing the text in the list
        text_list.append(text) 

    # Create the TextCollection from the list text_list
    text_collection = nltk.text.TextCollection(text_list)
    
    # return text collection
    return text_collection
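
For reference, a completed version might look like the sketch below (one possible implementation; it is named raw_to_text_example so it does not overwrite your own raw_to_text):


In [ ]:
# A possible reference implementation (sketch)
import nltk
from string import punctuation

def raw_to_text_example(raw_list):
    text_list = []
    for raw in raw_list:
        # lower-casing and punctuation removal
        raw = raw.lower()
        for p in punctuation:
            raw = raw.replace(p, "")
        # tokenize and wrap the tokens in an NLTK Text object
        tokens = nltk.word_tokenize(raw)
        text = nltk.Text(tokens)
        # store the text in the list
        text_list.append(text)
    # create the TextCollection from the list of Text objects
    return nltk.text.TextCollection(text_list)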

Let's test the function with some sample data


In [ ]:
raw_list_sample = ["The dog sat on the mat.",
                   "The cat sat on the mat!",
                   "We have a mat in our house."]

# Call the above raw_to_text function for the sample text
text_collection_sample = ...

As before, we can compute the frequency distribution for this collection.


In [ ]:
# Write code to compute the frequency of 'mat' in the collection.
freq = ...
print("Frequency of word \'mat\' = ", freq)

# Write code to compute and display the frequency distribution of text_collection_sample

Something bigger

We will use the DBPedia Ontology Classification Dataset, which contains the first paragraphs of Wikipedia articles. Each paragraph is assigned one of 14 categories. Here is an example of an abstract from the Written Work category:

The Regime: Evil Advances/Before They Were Left Behind is the second prequel novel in the Left Behind series written by Tim LaHaye and Jerry B. Jenkins. It was released on Tuesday November 15 2005. This book covers more events leading up to the first novel Left Behind. It takes place from 9 years to 14 months before the Rapture.

In this hands-on session we will use 15,000 documents belonging to three of these categories: Album, Film, and Written Work.

The file corpus.txt supplied with this notebook contains these 15,000 documents, one document per line.

Now we will:

  1. Load the documents as a list
  2. Create a NLTK TextCollection
  3. Analyze different counts

Note: Each line of the file corpus.txt is a document


In [ ]:
# Write code to load documents as a list
"""
Hint 1: open the file using open()
Hint 2: use read() to load the content
Hint 3: use splitlines() to get separate documents 
"""
raw_docs = ...

print("Loaded " + str(len(raw_docs)) + " documents.")

In [ ]:
# Write code to create a NLTK TextCollection
# Hint: use raw_to_text function
text_collection = ...

# Print total number of words in these documents
print("Total number of words = ", len(text_collection))
print("Total number of unique words = ", len(set(text_collection)))

Let's calculate the frequency distribution for this collection of documents and then look at the most common words.


In [ ]:
# Write code to compute frequency distribution of text_collection
freq_dist = ...

# Let's see most common 10 words.
freq_dist.most_common(10)

Something does not seem right!! Can you point out what?

Let's try by visualizing it.


In [ ]:
# importing Python package for plotting 
import matplotlib.pyplot as plt

# To plot
plt.subplots(figsize=(12,10))
freq_dist.plot(30, cumulative=True)

Observations:

  1. Just the 30 most frequent tokens account for around 260,000 of the 709,460 total tokens ($\approx 36.5\%$).
  2. Most of these are very common words, such as articles and pronouns.

Stop word filtering

Stop words are words that are filtered out before or after processing natural language data (text). There is no universal stop-word list, but such lists often include short function words such as "the", "is", "at", "which", and "on". Removing stop words has been shown to improve the performance of tasks such as search.

A file stop_words.txt is included. We will now:

  1. Load the contents of the file 'stop_words.txt', where each line is a stop word, and create a stop-word list.
  2. Modify the function raw_to_text to perform (a) stop-word removal and (b) removal of numeric tokens.

Note: Each line of the file stop_words.txt is a stop word.


In [ ]:
# Write code to load stop-word list from file 'stop_words.txt'
# Hint: use the same strategy you used to load documents
stopwords = set(...)
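
A possible way to build the stop-word set (a sketch, assuming stop_words.txt sits next to this notebook):


In [ ]:
# Possible solution (sketch)
with open("stop_words.txt") as f:
    stopwords = set(f.read().splitlines())

print("Loaded " + str(len(stopwords)) + " stop words.")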

In [ ]:
"""
VERSION 2
Converts a list of raw text to a NLTK TextCollection object.
Applies lower-casing, punctuation removal and stop-word removal.
Returns:
text_collection: a NLTK TextCollection object
"""
# Write function "raw_to_text_2".
"""
Hint 1: consult the above function "raw_to_text",
Hint 2: add a new block in the function for removing stop words
Hint 3: to remove stop words from a list of tokens -
   - create an empty list to store clean tokens
   - for each token in the token list:
         if the token is not in stop word list
            store it in the clean token list
"""

Retest our small sample with the new version.


In [ ]:
raw_list_sample = ["The dog sat on the mat.", 
                   "The cat sat on the mat!", 
                   "We have a mat in our house."]

# Write code to obtain and see freq_dist_sample with the new raw_to_text_2
# Note: raw_to_text_2 takes two inputs/arguments
text_collection_sample = ...
freq_dist_sample = ...

freq_dist_sample

Finally, rerun with the bigger document set and replot the cumulative word frequencies.

Recall that we already have the documents loaded in the variable raw_docs.


In [ ]:
# Write code to create a NLTK TextCollection with raw_to_text_2
text_collection = ...

# Write code to compute frequency distribution of text_collection
freq_dist = ...

# Write code to plot the frequencies again

Zipf's law

Verify whether the dataset follows Zipf's law by plotting the data on a log-log graph, with the axes being log(rank order) and log(frequency). You should expect to obtain an almost straight line.
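
Recall that Zipf's law states that the frequency of the $r$-th most frequent word is approximately $f(r) \approx C / r^{s}$ with $s \approx 1$. Taking logarithms gives $\log f(r) \approx \log C - s \log r$, so on a log-log plot the points should fall on an (almost) straight line with slope $-s$.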


In [ ]:
import numpy as np
import math

counts = np.array(list(freq_dist.values()))
tokens = np.array(list(freq_dist.keys()))
ranks = np.arange(1, len(freq_dist)+1)

# Obtaining indices that would sort the array in descending order
indices = np.argsort(-counts)
frequencies = counts[indices]

# Plotting the ranks vs frequencies
plt.subplots(figsize=(12,10))
plt.yscale('log')
plt.xscale('log')
plt.title("Zipf plot for our data")
plt.xlabel("Frequency rank of token")
plt.ylabel("Absolute frequency of token")
plt.grid()
plt.plot(ranks, frequencies, 'o', markersize=0.9)
for n in list(np.logspace(-0.5, math.log10(len(counts)-1), 17).astype(int)):
    dummy = plt.text(ranks[n], frequencies[n], " " + tokens[indices[n]],   
                     verticalalignment="bottom", horizontalalignment="left")
plt.show()
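
As an optional extra check, you can estimate the slope of the log-log curve with a least-squares fit, reusing the ranks and frequencies arrays from the cell above; Zipf's law predicts a slope close to -1.


In [ ]:
# Optional: rough estimate of the Zipf exponent (slope in log-log space)
slope, intercept = np.polyfit(np.log10(ranks), np.log10(frequencies), 1)
print("Estimated Zipf exponent:", -slope)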