Intro to information extraction from text


Most of the data we've looked at so far has been structured, meaning essentially that the data looked like a table or Excel spreadsheet. Not all data looks like that, however. Human-readable text is an extremely common unstructured data source. From the text of a webpage, tweet, or document, businesses want to perform tasks such as:

  • sentiment analysis
  • document summarization
  • document clustering
  • document recommendation

Later in MSAN 692, we'll learn how to extract the text from webpages or pieces of webpages such as the bestseller list at Amazon. For now, we can play with some prepared text files.

Text analysis uses words as data rather than numbers, which means tokenizing text; i.e., splitting the text string for a document into individual words. This problem is actually much harder than you might think. For example, if we split the document text on the space character, then "San Francisco" would be split into two words. For our purposes here, that'll work just fine. See Tokenization in this excellent information retrieval book for more information.
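As a quick illustration of why splitting on whitespace is imperfect (a sketch on made-up sample text, not part of the lab code):

```python
import re

text = "San Francisco is in California. It's foggy."

# Naive whitespace tokenization: punctuation sticks to the words,
# and "San Francisco" becomes two separate tokens.
tokens = text.split()
print(tokens)  # ['San', 'Francisco', 'is', 'in', 'California.', "It's", 'foggy.']

# A regex that keeps only letter runs strips the punctuation,
# though it still splits "San Francisco" and breaks "It's" apart.
tokens = re.findall(r"[A-Za-z]+", text)
print(tokens)  # ['San', 'Francisco', 'is', 'in', 'California', 'It', 's', 'foggy']
```

Neither result is perfect, which is why real tokenizers are much more sophisticated. Splitting on whitespace is good enough for this lab.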

The goal of this lecture-lab is to get familiar with tokenizing text and how to extract some basic data, such as word frequency. To visualize information extracted from a document, we'll use word clouds like the image to the right that emphasize words according to their frequency.

Tokenizing a document

Let's use an article on Istanbul as our text file and then figure out how to get an appropriate list of words out of it.

In [18]:
! head data/IntroIstanbul.txt

        The City and ITS People
        Istanbul is one of the worlds most venerable cities. Part
        of the citys allure is its setting, where Europe faces Asia acr­oss
        the winding turquoise waters of the Bosphorus, making it the only city
        in the world to bridge two continents.

In Loading files, we learned how to read the contents of such a file into a string and split it on the space character:

In [5]:
with open('data/IntroIstanbul.txt') as f:
    contents = f.read()  # read the entire file into one string
words = contents.split()
print(words[:25])    # print first 25 words

['The', 'City', 'and', 'ITS', 'People', 'Istanbul', 'is', 'one', 'of', 'the', 'worlds', 'most', 'venerable', 'cities.', 'Part', 'of', 'the', 'citys', 'allure', 'is', 'its', 'setting,', 'where', 'Europe', 'faces']

That looks more like it, although it's still not very clean. Some of the words are capitalized. We need to normalize the words so that, for example, People and people are considered the same word.

Exercise: Implement another filter pattern to convert the words to lowercase using lower(). E.g., 'The'.lower() is 'the'.

Here's one way to do it:

In [6]:
words = [w.lower() for w in words]
print(words[:25])

['the', 'city', 'and', 'its', 'people', 'istanbul', 'is', 'one', 'of', 'the', 'worlds', 'most', 'venerable', 'cities.', 'part', 'of', 'the', 'citys', 'allure', 'is', 'its', 'setting,', 'where', 'europe', 'faces']

That's not the best we can do. For example "faces" and "face" should be the same. Let's stem the words:

In [7]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]
print(stemmed[:45])

['the', 'citi', 'and', 'it', 'peopl', 'istanbul', 'is', 'one', 'of', 'the', 'world', 'most', 'vener', 'cities.', 'part', 'of', 'the', 'citi', 'allur', 'is', 'it', 'setting,', 'where', 'europ', 'face', 'asia', 'acr\xadoss', 'the', 'wind', 'turquois', 'water', 'of', 'the', 'bosphorus,', 'make', 'it', 'the', 'onli', 'citi', 'in', 'the', 'world', 'to', 'bridg', 'two']

Computing word frequencies

Let's create a bag of words representation. My work plan would have a description like "Walk through the words in a document, updating a dictionary that holds the count for each word." The pseudocode would be a loop over the words whose body increments a count in a dictionary:

  1. let wfreqs be an empty dictionary mapping words to word counts
  2. for each word w in words:
    if w not in wfreqs, let wfreqs[w] = 1.
    Otherwise add one to wfreqs[w].

My code implementation would look like the following.

In [8]:
from collections import defaultdict

wfreqs = defaultdict(int) # missing entries yield value 0
for w in words:
    wfreqs[w] += 1


Computing the frequency of elements in a list is common enough that Python's collections module provides a data structure called Counter that will do this for us:

In [9]:
from collections import Counter
ctr = Counter(words)


That data structure is nice because it can give the list of, say, 10 most common words:

In [10]:
print(ctr.most_common(10))

[('the', 123), ('of', 55), ('and', 40), ('to', 19), ('in', 16), ('is', 14), ('a', 13), ('city', 9), ('most', 9), ('from', 9)]


Exercise: Extract the most common 10 words from ctr (i.e., just the words, not the (word, count) tuples).

In [11]:
print([p[0] for p in ctr.most_common(10)])

['the', 'of', 'and', 'to', 'in', 'is', 'a', 'city', 'most', 'from']

Word clouds

Python has a nice library called wordcloud that we can use to visualize the relative frequency of words. It should already be installed in your Anaconda Python directory; if not, use the command line to install it:

$ pip install wordcloud

The key elements of the following code are the creation of the WordCloud object and the call to fit_words() with a dictionary (type dict) of word-frequency associations, wfreqs.

In [12]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wordcloud = WordCloud()
wordcloud.fit_words(wfreqs)

fig = plt.figure(figsize=(6, 4))   # prepare a plot 6x4 inches
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

That's kind of busy with all of those words in there, so let's focus on the top 30 words. To do that we call most_common(), which gives us a list of tuples. Because fit_words() requires a dict, we convert the most common word list into a dictionary:

In [17]:
# Get 30 most common word-freq pairs then convert to dictionary for use by WordCloud
wtuples = ctr.most_common(30)
wdict = dict(wtuples)

wordcloud = WordCloud()
wordcloud.fit_words(wdict)

fig = plt.figure(figsize=(6, 4))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

That looks better but it looks like common English words like "the" and "of" are dominating the visualization. To focus on the words most relevant to the document, let's filter out such so-called English stop words. scikit-learn, a machine learning library you will become very familiar with in future classes, provides a nice list of stop words we can use:

In [14]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
english = list(ENGLISH_STOP_WORDS) # Convert to a list so I can grab a subset
print(english[:25])                # Print 25 of the words

['between', 'nobody', 'here', 'cannot', 'ourselves', 'hence', 'never', 'is', 'by', 'call', 'ie', 'which', 'off', 'due', 'this', 'some', 'any', 'another', 'why', 'one', 'ltd', 'become', 'first', 'anyone', 'during']


Exercise: Filter out the English stop words from the words list we computed above and build a new Counter from the filtered list.

In [15]:
goodwords = [w for w in words if w not in ENGLISH_STOP_WORDS]
goodctr = Counter(goodwords)
print(goodctr.most_common(10))

[('city', 9), ('istanbul', 7), ('worlds', 6), ('citys', 5), ('bosphorus', 5), ('sea', 4), ('ottoman', 4), ('important', 4), ('—', 4), ('europe', 3)]

In [16]:
wtuples = goodctr.most_common(30)
wdict = dict(wtuples)

wordcloud = WordCloud()
wordcloud.fit_words(wdict)

fig = plt.figure(figsize=(5, 3))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()


Exercise: Add Porter stemming to the previous exercise.
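Here's one possible sketch of the core logic, run on a stand-in sentence so it's self-contained (in the lab you'd tokenize data/IntroIstanbul.txt instead):

```python
from collections import Counter
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Stand-in text; in the lab, read and lowercase the Istanbul file instead
contents = "Istanbul is one of the worlds most venerable cities"
words = contents.lower().split()

stemmer = PorterStemmer()
# Filter out stop words first, then stem the survivors
stemmed = [stemmer.stem(w) for w in words if w not in ENGLISH_STOP_WORDS]
ctr = Counter(stemmed)
wdict = dict(ctr.most_common(30))
print(wdict)
```

Then pass wdict to fit_words() exactly as before to draw the cloud.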

You can play around with the list of stop words to remove things like "important" and others to really get the key words to pop out. There is a technique to automatically damp down common English words called TFIDF, which we will learn about soon in this class.
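As a preview, the core TFIDF idea can be sketched with a tiny made-up corpus (illustrative only; the class will use library implementations):

```python
import math
from collections import Counter

# Tiny corpus standing in for a real document collection
docs = [
    "the city of istanbul".split(),
    "the bosphorus divides the city".split(),
    "the weather in the city".split(),
]

N = len(docs)
# Document frequency: in how many documents does each word appear?
df = Counter(w for doc in docs for w in set(doc))

def tfidf(doc):
    """Weight each word by term frequency times inverse document frequency."""
    tf = Counter(doc)
    return {w: tf[w] * math.log(N / df[w]) for w in tf}

weights = tfidf(docs[0])
# "the" and "city" appear in every document, so their weight is 0;
# "of" and "istanbul" appear only in this document, so they score highest.
print(sorted(weights, key=weights.get, reverse=True))
```

Words common to every document get weight log(N/N) = 0, which is exactly the damping effect we want.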


Summary

Text files are an unstructured data source that we typically represent as a bag of words. A bag of words representation is a set of associations mapping words to their frequency or count. We typically use a dictionary data structure for bag of words because dictionary lookup is extremely efficient, versus linearly scanning an entire list of associations. We used word clouds to visualize the relative frequency of words in a document.

The data structures and techniques described in this lecture-lab form the basis of natural language processing (NLP).