Objective

Beginners typically learn the usage of Python language constructs in isolation, which is good since it avoids mental overload. However, once you have learnt the basics, it is essential to see practical examples of how Python constructs are integrated to solve real-world problems. This session shows how to integrate several basic Python constructs to solve a text processing problem.

Task: Count the occurrences of each word (ignoring case) in the Gettysburg Address (in the file "getty.txt"), and print the 10 most frequent words.

Skills:

  1. reading text files
  2. basic text clean-up operations
  3. usage of set, list and dictionary
  4. sorting with optional arguments
  5. usage of list and dictionary comprehensions
  6. usage of the string and operator modules
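
As a small illustration of skills 3 to 6, the sketch below combines a set, a dictionary comprehension, `sorted` with its optional `key` and `reverse` arguments, and the `string` and `operator` modules. The word list and the names `tally`, `ranked` and `has_punct` are made up for illustration and are not part of the exercise.

```python
import operator
import string

# Hypothetical sample data, not taken from getty.txt.
words = ["the", "cat", "saw", "the", "dog", "the", "cat"]

# set: the distinct words; dictionary comprehension: word -> count.
tally = {word: words.count(word) for word in set(words)}

# sorted() with optional arguments: key selects the count in each
# (word, count) pair, reverse=True puts the largest count first.
ranked = sorted(tally.items(), key=operator.itemgetter(1), reverse=True)

# string module: string.punctuation lists the ASCII punctuation characters.
has_punct = [ch for ch in "dog." if ch in string.punctuation]

print(ranked[:2])
print(has_punct)
```

Calling `words.count` once per distinct word rescans the list each time, which is fine for a short text but is one reason a loop-based counter is also worth writing.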

Problem decomposition

  1. read in the file "getty.txt" into a single string variable
  2. strip text of punctuation
  3. convert to lower case
  4. split string into words
  5. write a function that counts the number of each word
  6. print the top 10 words
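
The six steps above can be sketched end to end as follows. This is one possible arrangement, not the reference solution: the function names `read_text`, `to_words`, `count_words` and `top_words` are hypothetical, and a made-up sample written to a temporary file stands in for "getty.txt" so that the sketch runs on its own. It removes punctuation with a generator expression rather than `str.translate`, so it does not depend on the Python 2 signature of `translate` used elsewhere in this session.

```python
import operator
import os
import string
import tempfile


def read_text(path):
    """Step 1: read a whole file into one string."""
    with open(path) as f:
        return f.read()


def to_words(text):
    """Steps 2-4: strip punctuation, lower-case, split on whitespace."""
    no_punct = ''.join(ch for ch in text if ch not in string.punctuation)
    return no_punct.lower().split()


def count_words(words):
    """Step 5 (loop version): map each word to its number of occurrences."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def top_words(counts, n=10):
    """Step 6: the n (word, count) pairs with the highest counts."""
    return sorted(counts.items(), key=operator.itemgetter(1), reverse=True)[:n]


# Stand-in for "getty.txt": a made-up sample written to a temporary file.
sample = "The cat saw the dog.\nThe dog, the cat; the end!\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)

counts = count_words(to_words(read_text(tmp.name)))
os.remove(tmp.name)

for word, count in top_words(counts, n=3):
    print('%s %d' % (word, count))
```

To run it on the real text, replace the temporary file with `read_text("getty.txt")` and call `top_words(counts)` with the default `n=10`. Words with equal counts appear in whatever order the dictionary yields them, so ties are not ranked in any particular way.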

In [1]:
# read in the file "getty.txt" into a single string variable

In [2]:
# strip text of punctuation

In [3]:
# convert to lower case

In [4]:
# split string into words

In [5]:
# write a function that counts the number of each word (loop version)

In [6]:
# write a function that counts the number of each word (list comprehension version)

In [ ]:
# print the top 10 words

MCQ (Basic)

What value is printed after running the following code?

s = """
This old man;
he played one.
"""

print s.count('this')

  1. None
  2. Code does not run because of error
  3. 0
  4. 1
  5. 2

MCQ (Intermediate)

What is the final value of s after running the following code?

import string

s = """
This old man;
he played one.
"""

s = s.strip().translate(None, string.punctuation).upper()

  1. This old man;\nhe played one.

  2. THIS OLD MAN\nHE PLAYED ONE

  3. THIS OLD MANHE PLAYED ONE

  4. this old manhe played one

  5. THIS OLD MAN;\NHE PLAYED ONE.

MCQ (Advanced)

Which of the following ways of removing punctuation from a string s is the slowest?

  1. new_s = []
     for char in s:
         if char not in string.punctuation:
             new_s.append(char)
     new_s = ''.join(new_s)
  2. new_s = ''
     for char in s:
         if char not in string.punctuation:
             new_s += char
  3. new_s = ''.join([char for char in s if char not in string.punctuation])
  4. new_s = s.translate(None, string.punctuation)
  5. new_s = filter(lambda x: x not in string.punctuation, s)
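
One way to check an answer empirically is to time the five options with the `timeit` module. The harness below is a sketch under stated assumptions: it targets Python 3, so option 4 is spelled with `str.maketrans` and option 5 is wrapped in `''.join`, whereas the options above use Python 2 forms. The function names and the test string are hypothetical, and the ranking it reports may differ between interpreters, versions and input sizes.

```python
import string
import timeit

# Hypothetical test input; the ranking may depend on the string length.
s = "This old man; he played one. " * 1000


def loop_list(s):
    new_s = []
    for char in s:
        if char not in string.punctuation:
            new_s.append(char)
    return ''.join(new_s)


def loop_concat(s):
    new_s = ''
    for char in s:
        if char not in string.punctuation:
            new_s += char
    return new_s


def comprehension(s):
    return ''.join([char for char in s if char not in string.punctuation])


def translate(s):
    # Python 3 spelling of option 4 (assumption: the quiz itself uses Python 2).
    return s.translate(str.maketrans('', '', string.punctuation))


def filtered(s):
    # filter() returns an iterator in Python 3, hence the join.
    return ''.join(filter(lambda x: x not in string.punctuation, s))


candidates = [loop_list, loop_concat, comprehension, translate, filtered]

# All five must produce the same string before timings are worth comparing.
results = {f.__name__: f(s) for f in candidates}
assert len(set(results.values())) == 1

for f in candidates:
    seconds = timeit.timeit(lambda: f(s), number=20)
    print('%-14s %.4f s' % (f.__name__, seconds))
```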
