CSE 6040, Fall 2015 [02]: Processing unstructured text

Over the next two classes, we will build toward our first computational data mining problem, called the association rule mining problem. The basic task is to identify commonly co-occurring items in a series of transactions.

We will apply this problem to a corpus of unstructured text. Consequently, this first class will introduce (or review, for some of you) a few essential useful Python tools for this problem:

0: Word count

Given a fragment of text, represented by a string (possibly including newlines), how many words does it contain?

Consider the following two methods to count words in a string. Look at these with a partner.


In [ ]:
quote = """I wish you'd stop talking.
I wish you'd stop prying and trying to find things out.
I wish you were dead. No. That was silly and unkind.
But I wish you'd stop talking."""
print (quote)

def countWords1 (s):
    """Counts the number of words in a given input string."""
    Lines = s.split ('\n')
    count = 0
    for line in Lines:
        Words_in_line = line.split ()
        count = count + len (Words_in_line)
    return count

def countWords2 (s):
    """Counts the number of words in a given input string."""
    return len (quote.split ())

count1 = countWords1 (quote)
count2 = countWords2 (quote)
print ("\nWord count: Method 1 says %d words, and Method 2 says %d." % (count1, count2))
assert count1 == count2

Q: Which would the two of you predict will be better, and why?

(Insert your response to the above question(s) here.)

Q: When might these methods not work as expected? With your partner, come up with one example input and write your own word counter, below, that handles that case.


In [ ]:
def yourCountWords (s):
    """Insert your method here."""
    return 0

# Write some code to test your implementation here as well.

1: Counting unique strings

Suppose you are given a Python list of email addresses. Determine the number of unique addresses.


In [ ]:
Emails = ['quigbert@cc.gatech.edu', 'dummy@gatech.edu', 'dummy@vuduc.org', 'dummy@gatech.edu', 'hopeful@gatech.edu', 'overworked@gatech.edu', 'quigbert@cc.gatech.edu']
print (Emails)

true_answer = 5
print ("\n'Emails' has %d unique addresses." % true_answer)

Method 1. Let's use a data structure called a dictionary, which stores (key, value) pairs.


In [ ]:
Dict = {}
for email in Emails:
    Dict[email] = 1
count = len (Dict)
assert count == true_answer
print ("(Method 1 worked!)")

Method 2. Let's use a different data structure, called a set. It essentially implements a set in the mathematical sense.


In [ ]:
UniqueEmails = set (Emails)
count = len (UniqueEmails)
assert count == true_answer
print ("Method 2 worked!")

Q: So, which method is better, and why? If you think one method is better than another for this problem, for what kinds of problems would you prefer the other method?

(Insert your response to the above question(s) here.)

2: Regular expressions

The preceding exercise hints at a general problem of finding specific patterns in text. A handy tool for this problem is Python's regular expression library.

A regular expression is specially formatted pattern, written as a string. Matching patterns with regular expressions has 3 steps:

  1. You come up with a pattern to find
  2. You compile it into a pattern object
  3. You apply the pattern object to a string, to find instances of the pattern within the string

It is easiest to see by example. What follows is just a small sample of what is possible with regular expressions in Python; refer to the regular expression documentation for many more examples and details.

Example 1. The simplest pattern is text that you wish to match exactly. For instance, suppose you wish to find an instance of the word "impossible" in a piece of text. Here is a snippet of Python code to do it.

Run this snippet. Try changing the search string and see what happens.


In [ ]:
import re    # Loads the regular expression library

p = re.compile ("impossible")
m = p.search ("This mission is impossible.")
if m == None:
    print ("Not found.")
else:
    print ("Found pattern at position %d" % m.start ())

The variable m in the example contains a match object if the pattern is found. You can perform queries against the match object, such as the one illustrated above.

Beyond exact text, there is a rich syntax for specifying complex patterns.

For instance, you can use character classes to match both "Impossible" and "impossible" using the pattern, "[Ii]mpossible". That is, the characters in square brackets represent the set of all characters that may match at the given position. As another example, you could match any digit using the character class, "[0123456789]", or the more compact range notation, "[0-9]". Thus, the pattern, "cat[xyz][0-9]hole" would match "catx3hole" but neither "catX3hole" nor "catxzhole".

You can also match the complement of a character class set using a caret, "^", just after the opening square bracket. For instance, "cat[a-z][^0-9]hole" would match "catx@hole" but not "catx3hole".

There are some common character classes, which have additional shortcuts. For instance, the special escaped d, or \d, will match any digit. So, to match a 7-digit phone number, you might use the pattern, "\d\d\d-\d\d\d\d".

Parentheses actually have a special meaning in a regular expression pattern. Therefore, to match an exact parenthesis, you need to prefix it (or escape it) with a backslash, \.

Example 2. For instance, suppose you wish to match a phone number written in the US standard format, like "(404) 555-1212". The pattern is a 3-digit area code, surrounded by parentheses, followed by a space, followed by a 7-digit number separated between the third and fourth digits. This pattern can be encoded as the following regular expression pattern:

\(\d\d\d\) \d\d\d-\d\d\d\d

Try the following example, which demonstrates the phone number matching pattern.


In [ ]:
def findPhone1 (s):
    """Returns the first instance of a phone number in 's', or 'None'."""
    phonePattern = re.compile ("\(\d\d\d\) \d\d\d-\d\d\d\d")
    hasPhone = phonePattern.search (s)
    if hasPhone:
        a = hasPhone.start ()
        b = hasPhone.end ()
        phone = s[a:b]
    else:
        phone = None
    return phone
    
message = "Hi Betty. Give me a ring at (404) 555-1212 when you get a chance."
findPhone1 (message)

Example 4. You can make the phone number pattern more robust by allowing zero or more spaces between the area code and the phone number, using the * option:


In [ ]:
def findPhone2 (s):
    """Returns the first instance of a phone number in 's', or 'None'."""
    phonePattern = re.compile ("\(\d\d\d\) *\d\d\d-\d\d\d\d")
    hasPhone = phonePattern.search (s)
    if hasPhone:
        a = hasPhone.start ()
        b = hasPhone.end ()
        phone = s[a:b]
    else:
        phone = None
    return phone

findPhone2 ("Phone: (404)555-1212")

Beyond "*", other wildcards include "+" to match one or more, as well as "?" to match zero or one instances.

Example 5. It's also possible to match alternatives, using the or symbol, |. For instance, suppose you wish to recognize either of the words, "prefix" or "suffix":


In [ ]:
fixFinder = re.compile ("(pre|suf)fix")
assert fixFinder.search ("prefix")
assert fixFinder.search ("suffix")
assert not fixFinder.search ("infix")

Q: Apply these technique to our phone number finder. Define a function that can match phone numbers in any of the following forms:

  • (404) 555-1212
  • 404-555-1212
  • 404-5551212
  • 404555-1212
  • 4045551212

In [ ]:
def yourPhoneFinder (s):
    """Returns the first instance of a phone number in 's', or 'None'."""
    # Fix the pattern:
    phonePattern = re.compile ("\(\d\d\d\) *\d\d\d-\d\d\d\d")
    hasPhone = phonePattern.search (s)
    if hasPhone:
        a = hasPhone.start ()
        b = hasPhone.end ()
        phone = s[a:b]
    else:
        phone = None
    return phone

assert yourPhoneFinder ("(404)555-1212")
assert yourPhoneFinder ("(404) 555-1212")
assert yourPhoneFinder ("404-555-1212")
assert yourPhoneFinder ("4045551212")

Example 6. Another common use-case is matching a string but extracting just a portion of it. For this purpose, you can use groups.

Consider the simple form of phone numbers, such as "(404) 555-1212". Suppose you wish to match a phone number, but then extract just the digits of the area code from the remainder of the number. For instance, for the above you might produce the list, ['404','555-1212'].

You can identify a group inside the pattern by enclosing the subpattern within parentheses, and then using a special syntax to give it a name. The name allows you to refer to the matched substring later on:


In [ ]:
def findPhone3 (s):
    """Returns a the first instance of a phone number in 's', or 'None'."""
    phonePattern = re.compile ("\((?P<areacode>\d\d\d)\) (?P<number>\d\d\d-\d\d\d\d)")
    hasPhone = phonePattern.search (s)
    if hasPhone:
        areacode = hasPhone.group ('areacode')
        number = hasPhone.group ('number')
        phone = [areacode, number]
    else:
        phone = None
    return phone

findPhone3 (message)

In the preceding example, the syntax, (?P<name>xxx) defines a group named name for the subpattern represented abstractly by xxx. The example calls the match object's method, group("name"), to extract the matching substring.

Example 7. One pitfall with regular expression patterns is that they get messy quickly. The re.compile() function takes a special flag, re.VERBOSE, which allows you to write regular expressions in a more structured and hopefully also more readable way. In particular, you can insert arbitrary amounts of whitespace as well as comments.


In [ ]:
def findPhone4 (s):
    """Returns a the first instance of a phone number in 's', or 'None'."""
    phonePattern = re.compile (r"""
          # Area code:
          \(
            (?P<areacode>\d\d\d)
          \)
          # Optional separator (one or more spaces)
          \s*
          # Phone number
          (?P<number>\d\d\d-\d\d\d\d)
       """, re.VERBOSE)
    hasPhone = phonePattern.search (s)
    if hasPhone:
        areacode = hasPhone.group ('areacode')
        number = hasPhone.group ('number')
        phone = [areacode, number]
    else:
        phone = None
    return phone

findPhone4 (message)

3: File I/O

Reading from or writing to a file is accomplished through a file object.

In this example, the file "skilling-j.inbox" is a text file containing a bunch of email messages. You should download it if you haven't done so already: [download]. (If you are pulling directly from the class notebooks Github repo, this file is included already.)

Example 1. You can read the entire file into a string by using open() to create a file object, and then reading its contents:


In [ ]:
inbox = open ('skilling-j.inbox', 'r') # 'r' = read mode; use 'w' for writing
assert inbox # Makes sure it opened OK

all_messages = inbox.read ()
inbox.close () # Should close a file when you are done

# Print first 500 characters
print all_messages[0:500]

Q: Do you anticipate any pitfalls in this approach?

(Insert your response to the above question(s) here.)

Example 2. A more memory-efficient way to read a file is to read chunks at a time. For text files, a particularly convenient way is to read the file one line at a time, using a file object's readline () method, or by looping directly over the object. The following example does so in order to count the number of lines in the file.


In [ ]:
inbox = open ('skilling-j.inbox', 'r') # 'r' = read mode; use 'w' for writing
assert inbox # Makes sure it opened OK

count = 0
for line in inbox: # reads one line at a time
    count = count + 1
inbox.close ()

print ("The file has %d lines." % count)

4: Putting it all together: Email counter

Q: Using all of the techniques in this lesson, write a Python program that counts the number of unique email addresses that appear in the "skilling-j.inbox" input file.


In [ ]: