Now You Code 2: Information Extraction

How do we make computers seem intelligent? One approach is to use term extraction. Term extraction is a type of information extraction where we attempt to find relevant terms in text. The relevant terms come from a corpus, which is the set of plausible terms we want to extract.

For example, suppose we have the text:

One day I would like to visit Syracuse

We as smart humans can be fairly confident that Syracuse is a place, more specifically a city.

A rudimentary method to make the computer interpret Syracuse as a place is to provide a corpus of cities and have the computer look up Syracuse in that corpus.
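As a rough illustration of the lookup idea, here is a minimal sketch that uses Python's `in` operator on a list. The three-city `sample_corpus` and the variable names are made up for this example; they are not the contents of the real corpus file.

```python
# Hypothetical mini-corpus, for illustration only.
sample_corpus = ["Syracuse", "Rochester", "Austin"]

word = "Syracuse"

# The `in` operator checks whether the word appears in the list.
is_known = word in sample_corpus

if is_known:
    print(word, "is a place")
else:
    print(word, "is not in the corpus")
```

The lookup is an exact, case-sensitive match, so "syracuse" would not be found in this list.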

In this code exercise we will do just that. Let's first write a function to read cities from the file NYC2-cities.txt into a corpus of cities, which will be represented in Python as a list.
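Reading a text file into a list might look something like the sketch below. It assumes one city name per line, which may not match how NYC2-cities.txt is actually laid out, so it writes its own small stand-in file (`sample-cities.txt`, a hypothetical name) to stay self-contained.

```python
import os
import tempfile

# Build a small stand-in file so this sketch runs on its own.
# ASSUMPTION: one city per line. Check the real file before relying on this.
sample_path = os.path.join(tempfile.mkdtemp(), "sample-cities.txt")
with open(sample_path, "w") as f:
    f.write("Syracuse\nRochester\nAustin\n")

cities = []
with open(sample_path, "r") as f:
    for line in f:
        name = line.strip()      # remove the trailing newline and extra spaces
        if name:                 # skip blank lines
            cities.append(name)

print(cities)
```

Stripping each line matters: without it, every entry would end in a newline character and would never match a word typed by the user.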

Then write a main program loop to input some text and split the text into a list of words. If any of the words match a city in the corpus list, output that the word is a city.

The program should handle upper / lower case matching. A good approach is to title case the input.
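A small sketch of the title-case idea, using the built-in string methods `title()` and `split()`. The variable names here are only for illustration.

```python
text = "one day I would like to visit syracuse and rochester"

# title() capitalizes the first letter of every word;
# split() with no arguments breaks the text apart on whitespace.
words = text.title().split()

print(words)
```

After this step "syracuse" has become "Syracuse", so it can match a corpus that stores city names with a leading capital letter. Note that `split()` does not remove punctuation, so a word typed as "syracuse." would keep its period.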

IMPORTANT: Please note that our program will ONLY work for one-word cities, like Syracuse, and will not work for multiple-word cities like San Diego. Don't worry about that now.

SAMPLE RUN

Enter some text (or ENTER to quit): one day I would like to visit syracuse and rochester
Syracuse is a city
Rochester is a city
Enter some text (or ENTER to quit): austin is in texas
Austin is a city
Enter some text (or ENTER to quit): 
Quitting...

Once again we will solve this problem using the problem simplification approach. First, we will write the load_city_corpus function to build our city list. Second, we will write the is_a_city function, which, given a word and a city list, will return True when the word is a city. Finally, we conclude with the main program, which finds cities in our text, as demonstrated in our sample run.

Step 1: Problem Analysis for load_city_corpus

Inputs: None (reads from a file)

Outputs: a Python list of cities

Algorithm (Steps in Program):


In [ ]:
## Step 2: write the definition for the load_city_corpus function

Step 3: Problem Analysis for is_a_city

Inputs: a string word and a Python list of cities

Outputs: True when word is in the list of cities, False otherwise.

Algorithm (Steps in Program):


In [ ]:
## Step 4: write the definition for the is_a_city function

Step 5: Problem Analysis for entire program

Inputs:

Outputs:

Algorithm (Steps in Program): (make sure to use the two functions we created)


In [1]:
## Step 6: Write complete program, making sure to use your two functions.

Step 7: Questions

  1. Explain your approach to solving this problem for cities with 2 words, like New York or Los Angeles.

Answer:

  2. How would you solve the problem where you enter a city name which is not in the corpus?

Answer:

Step 8: Reflection

Reflect upon your experience completing this assignment. This should be a personal narrative, in your own voice, and cite specifics relevant to the activity to help the grader understand how you arrived at the code you submitted. Things to consider touching upon: Elaborate on the process itself. Did your original problem analysis work as designed? How many iterations did you go through before you arrived at the solution? Where did you struggle along the way and how did you overcome it? What did you learn from completing the assignment? What do you need to work on to get better? What was most valuable and least valuable about this exercise? Do you have any suggestions for improvements?

To make a good reflection, you should journal your thoughts, questions and comments while you complete the exercise.

Keep your response to between 100 and 250 words.

--== Write Your Reflection Below Here ==--