Over the next two classes, we will build toward our first computational data mining problem, called the association rule mining problem. The basic task is to identify commonly co-occurring items in a series of transactions.
We will apply this problem to a corpus of unstructured text. Consequently, this first class will introduce (or review, for some of you) a few essential useful Python tools for this problem:
In [ ]:
quote = """I wish you'd stop talking.
I wish you'd stop prying and trying to find things out.
I wish you were dead. No. That was silly and unkind.
But I wish you'd stop talking."""
print (quote)
def countWords1 (s):
"""Counts the number of words in a given input string."""
Lines = s.split ('\n')
count = 0
for line in Lines:
Words_in_line = line.split ()
count = count + len (Words_in_line)
return count
def countWords2 (s):
"""Counts the number of words in a given input string."""
return len (quote.split ())
count1 = countWords1 (quote)
count2 = countWords2 (quote)
print ("\nWord count: Method 1 says %d words, and Method 2 says %d." % (count1, count2))
assert count1 == count2
Q: Which would the two of you predict will be better, and why?
(Insert your response to the above question(s) here.)
Q: When might these methods not work as expected? With your partner, come up with one example input and write your own word counter, below, that handles that case.
In [ ]:
def yourCountWords (s):
"""Insert your method here."""
return 0
# Write some code to test your implementation here as well.
In [ ]:
Emails = ['quigbert@cc.gatech.edu', 'dummy@gatech.edu', 'dummy@vuduc.org', 'dummy@gatech.edu', 'hopeful@gatech.edu', 'overworked@gatech.edu', 'quigbert@cc.gatech.edu']
print (Emails)
true_answer = 5
print ("\n'Emails' has %d unique addresses." % true_answer)
Method 1. Let's use a data structure called a dictionary, which stores (key, value) pairs.
In [ ]:
Dict = {}
for email in Emails:
Dict[email] = 1
count = len (Dict)
assert count == true_answer
print ("(Method 1 worked!)")
Method 2. Let's use a different data structure, called a set. It essentially implements a set in the mathematical sense.
In [ ]:
UniqueEmails = set (Emails)
count = len (UniqueEmails)
assert count == true_answer
print ("Method 2 worked!")
Q: So, which method is better, and why? If you think one method is better than another for this problem, for what kinds of problems would you prefer the other method?
(Insert your response to the above question(s) here.)
The preceding exercise hints at a general problem of finding specific patterns in text. A handy tool for this problem is Python's regular expression library.
A regular expression is specially formatted pattern, written as a string. Matching patterns with regular expressions has 3 steps:
It is easiest to see by example. What follows is just a small sample of what is possible with regular expressions in Python; refer to the regular expression documentation for many more examples and details.
Example 1. The simplest pattern is text that you wish to match exactly. For instance, suppose you wish to find an instance of the word "impossible" in a piece of text. Here is a snippet of Python code to do it.
Run this snippet. Try changing the search string and see what happens.
In [ ]:
import re # Loads the regular expression library
p = re.compile ("impossible")
m = p.search ("This mission is impossible.")
if m == None:
print ("Not found.")
else:
print ("Found pattern at position %d" % m.start ())
The variable m
in the example contains a match object if the pattern is found. You can perform queries against the match object, such as the one illustrated above.
Beyond exact text, there is a rich syntax for specifying complex patterns.
For instance, you can use character classes to match both "Impossible" and "impossible" using the pattern, "[Ii]mpossible
". That is, the characters in square brackets represent the set of all characters that may match at the given position. As another example, you could match any digit using the character class, "[0123456789]
", or the more compact range notation, "[0-9]
". Thus, the pattern, "cat[xyz][0-9]hole
" would match "catx3hole" but neither "catX3hole" nor "catxzhole".
You can also match the complement of a character class set using a caret, "^
", just after the opening square bracket. For instance, "cat[a-z][^0-9]hole
" would match "catx@hole" but not "catx3hole".
There are some common character classes, which have additional shortcuts. For instance, the special escaped d
, or \d
, will match any digit. So, to match a 7-digit phone number, you might use the pattern, "\d\d\d-\d\d\d\d
".
Parentheses actually have a special meaning in a regular expression pattern. Therefore, to match an exact parenthesis, you need to prefix it (or escape it) with a backslash, \
.
Example 2. For instance, suppose you wish to match a phone number written in the US standard format, like "(404) 555-1212". The pattern is a 3-digit area code, surrounded by parentheses, followed by a space, followed by a 7-digit number separated between the third and fourth digits. This pattern can be encoded as the following regular expression pattern:
\(\d\d\d\) \d\d\d-\d\d\d\d
Try the following example, which demonstrates the phone number matching pattern.
In [ ]:
def findPhone1 (s):
"""Returns the first instance of a phone number in 's', or 'None'."""
phonePattern = re.compile ("\(\d\d\d\) \d\d\d-\d\d\d\d")
hasPhone = phonePattern.search (s)
if hasPhone:
a = hasPhone.start ()
b = hasPhone.end ()
phone = s[a:b]
else:
phone = None
return phone
message = "Hi Betty. Give me a ring at (404) 555-1212 when you get a chance."
findPhone1 (message)
Example 4. You can make the phone number pattern more robust by allowing zero or more spaces between the area code and the phone number, using the *
option:
In [ ]:
def findPhone2 (s):
"""Returns the first instance of a phone number in 's', or 'None'."""
phonePattern = re.compile ("\(\d\d\d\) *\d\d\d-\d\d\d\d")
hasPhone = phonePattern.search (s)
if hasPhone:
a = hasPhone.start ()
b = hasPhone.end ()
phone = s[a:b]
else:
phone = None
return phone
findPhone2 ("Phone: (404)555-1212")
Beyond "*
", other wildcards include "+
" to match one or more, as well as "?
" to match zero or one instances.
Example 5. It's also possible to match alternatives, using the or symbol, |
. For instance, suppose you wish to recognize either of the words, "prefix" or "suffix":
In [ ]:
fixFinder = re.compile ("(pre|suf)fix")
assert fixFinder.search ("prefix")
assert fixFinder.search ("suffix")
assert not fixFinder.search ("infix")
Q: Apply these technique to our phone number finder. Define a function that can match phone numbers in any of the following forms:
In [ ]:
def yourPhoneFinder (s):
"""Returns the first instance of a phone number in 's', or 'None'."""
# Fix the pattern:
phonePattern = re.compile ("\(\d\d\d\) *\d\d\d-\d\d\d\d")
hasPhone = phonePattern.search (s)
if hasPhone:
a = hasPhone.start ()
b = hasPhone.end ()
phone = s[a:b]
else:
phone = None
return phone
assert yourPhoneFinder ("(404)555-1212")
assert yourPhoneFinder ("(404) 555-1212")
assert yourPhoneFinder ("404-555-1212")
assert yourPhoneFinder ("4045551212")
Example 6. Another common use-case is matching a string but extracting just a portion of it. For this purpose, you can use groups.
Consider the simple form of phone numbers, such as "(404) 555-1212". Suppose you wish to match a phone number, but then extract just the digits of the area code from the remainder of the number. For instance, for the above you might produce the list, ['404','555-1212']
.
You can identify a group inside the pattern by enclosing the subpattern within parentheses, and then using a special syntax to give it a name. The name allows you to refer to the matched substring later on:
In [ ]:
def findPhone3 (s):
"""Returns a the first instance of a phone number in 's', or 'None'."""
phonePattern = re.compile ("\((?P<areacode>\d\d\d)\) (?P<number>\d\d\d-\d\d\d\d)")
hasPhone = phonePattern.search (s)
if hasPhone:
areacode = hasPhone.group ('areacode')
number = hasPhone.group ('number')
phone = [areacode, number]
else:
phone = None
return phone
findPhone3 (message)
In the preceding example, the syntax, (?P<name>xxx)
defines a group named name
for the subpattern represented abstractly by xxx
. The example calls the match object's method, group("name")
, to extract the matching substring.
Example 7. One pitfall with regular expression patterns is that they get messy quickly. The re.compile()
function takes a special flag, re.VERBOSE
, which allows you to write regular expressions in a more structured and hopefully also more readable way. In particular, you can insert arbitrary amounts of whitespace as well as comments.
In [ ]:
def findPhone4 (s):
"""Returns a the first instance of a phone number in 's', or 'None'."""
phonePattern = re.compile (r"""
# Area code:
\(
(?P<areacode>\d\d\d)
\)
# Optional separator (one or more spaces)
\s*
# Phone number
(?P<number>\d\d\d-\d\d\d\d)
""", re.VERBOSE)
hasPhone = phonePattern.search (s)
if hasPhone:
areacode = hasPhone.group ('areacode')
number = hasPhone.group ('number')
phone = [areacode, number]
else:
phone = None
return phone
findPhone4 (message)
Reading from or writing to a file is accomplished through a file object.
In this example, the file "skilling-j.inbox" is a text file containing a bunch of email messages. You should download it if you haven't done so already: [download]. (If you are pulling directly from the class notebooks Github repo, this file is included already.)
Example 1. You can read the entire file into a string by using open()
to create a file object, and then reading its contents:
In [ ]:
inbox = open ('skilling-j.inbox', 'r') # 'r' = read mode; use 'w' for writing
assert inbox # Makes sure it opened OK
all_messages = inbox.read ()
inbox.close () # Should close a file when you are done
# Print first 500 characters
print all_messages[0:500]
Q: Do you anticipate any pitfalls in this approach?
(Insert your response to the above question(s) here.)
Example 2. A more memory-efficient way to read a file is to read chunks at a time. For text files, a particularly convenient way is to read the file one line at a time, using a file object's readline ()
method, or by looping directly over the object. The following example does so in order to count the number of lines in the file.
In [ ]:
inbox = open ('skilling-j.inbox', 'r') # 'r' = read mode; use 'w' for writing
assert inbox # Makes sure it opened OK
count = 0
for line in inbox: # reads one line at a time
count = count + 1
inbox.close ()
print ("The file has %d lines." % count)
In [ ]: