Python Basics at PyCAR2020

Let's search some text

You already know the components of programming. You have been exercising the reasoning programming relies on for your entire life, probably without even realizing it. Programming is just a way to take the logic you already use on a daily basis and express it in a way a computer can understand and act upon.

It's just learning how to write in a different language.

One very important disclaimer before we start doing just that: Nobody memorizes this stuff. We all have to look stuff up all the time. We don’t expect you to memorize it, either. Ask questions. Ask us to review things we’ve already told you.

(Most of us ask questions we've asked before daily — we just ask them of Google.)

Now for some code. Let's say you want to search 130,000 lines of text for certain terms: which are most common, how often they occur, and how often they cluster together, which might point to places you want to look at more closely.

No person wants to do that by hand. And people are bad at precisely that kind of work. But it's perfect for a computer.

That length happens to correspond to The Iliad. In groups of two or three, think about a book like that. In your groups, figure out two things:

  • A whole text is made up of what parts?
  • What is the first thing you need to know to begin to search a file of text? The second thing? Third thing?

Roughly, the steps might look like this:

  1. open the file
  2. break the file into individual lines
  3. begin to examine each line
  4. if the line contains the term you're looking for, capture that
  5. does anything else about the line interest you? Is your term there multiple times, for instance?
  6. if none of your conditions are met, keep going

This is a program! See, you already know how to program. Now let’s take a minute to step through this the way a computer might.

In Python and other languages, we use the concept of variables to store values. A variable is just an easy way to reference a value we want to keep track of. So if we want to store a search term and how often our program has found it, we probably want to assign them to variables to keep track of them.

Create a string that represents the search term we want to find and assign it to a variable search_term:


In [ ]:
# This could just as easily be 'horse' or 'Helen' or 'Agamemnon' or 'sand' -- or 'Trojan'
search_term = 'Achilles'

Now tell Python where the file is and have it open the file. The path to the file can be one variable: file_location. We can then use that to open the file and store the opened file in another variable, file_to_read.


In [ ]:
file_location = '../basics/data/iliad.txt'
file_to_read = open(file_location)

Now create a variable term_count containing the integer value of how many times we've seen our search term in the text. So far, that's zero.


In [ ]:
# how many times our search_term has occurred
term_count = 0

So any time we want to check how many times we've seen our search_term or check where our file_location is, we can use these variables instead of typing out the value itself!

If you forget what one of the variables is set to, you can print it out. (In Python 2.x, print was a statement and the parentheses were optional; in Python 3.x, print() is a function, so the parentheses are required.) As we add more variables, let's also write a comment to remind us what each one does.


In [ ]:
# how many lines contain at least two of our search_term
multi_term_line = 0
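
Here is a small sketch of what checking on a variable might look like. The values below are stand-ins that mirror the variables defined earlier, repeated so the example runs on its own.

```python
# Stand-in values mirroring the variables defined earlier in this notebook
search_term = 'Achilles'
term_count = 0

# print() shows the current value of each variable
print(search_term)  # prints: Achilles
print(term_count)   # prints: 0

# type() reports what kind of value a variable holds
print(type(search_term))  # prints: <class 'str'>
print(type(term_count))   # prints: <class 'int'>
```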

When the term appears more than once on a line, we'll also want to note the line number and collect all of those line numbers:


In [ ]:
# so far, zero
line_number = 0

In [ ]:
# an empty list we hope to fill with lines we might want to explore in greater detail
line_numbers_list = []

Remember that a string is just a series of characters, not a word or a sentence. So you can represent those characters as lowercase or uppercase, or see whether the string starts with or ends with specific things. Try to make our search_term lowercase:


In [ ]:
# lowercase because of line.lower() below -- we want to compare lowercase only against lowercase
search_term = search_term.lower()
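
If you want to try a few of the other string tools mentioned above, a quick sketch might look like this. The variable name sample is just for illustration; it isn't used anywhere else in the notebook.

```python
# An illustrative string; the name `sample` is only used in this example
sample = 'Achilles'

lowered = sample.lower()                # all lowercase
uppered = sample.upper()                # all uppercase
starts_with_cap = sample.startswith('A')
ends_with_s = sample.endswith('s')
# Comparisons are case-sensitive, so a lowercase 'a' does not match
starts_with_lower = sample.startswith('a')

print(lowered)            # prints: achilles
print(uppered)            # prints: ACHILLES
print(starts_with_cap)    # prints: True
print(ends_with_s)        # prints: True
print(starts_with_lower)  # prints: False
```

That last line is the reason we make both the search term and each line lowercase before comparing them.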

We've decided to standardize our strings for comparison by making them lowercase. Cool. Now we need to do the comparing. Our open file is ready to explore. And to do that, we'll need a for loop. The loop will assign each line to a variable on the fly, then reference that variable to do stuff we tell it to:


In [ ]:
# begin looping line by line through our file
for line in file_to_read:
    # increment the line_number
    line_number += 1
    # make the line lowercase
    line = line.lower()
    # check whether our search_term is in the line
    if search_term in line:
        # if it is, use a tool Python gives us to count how many times
        # and add that to the number of times we've seen already
        term_count += line.count(search_term)
        # if it has counted more than one in the line, we know it's there multiple times;
        # keep track of that, too
        if line.count(search_term) > 1:
            # print(line)
            multi_term_line += 1
            # and add that to the list using a tool Python gives us for lists
            line_numbers_list.append(line_number)
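
To see what that loop is doing without opening the whole file, here is a sketch that runs the same logic over three made-up lines. The sentences are invented for this example and are not quotes from The Iliad. It also uses enumerate(), which is one way to let Python keep track of the line number for us.

```python
# Three invented lines standing in for the file (not real text from The Iliad)
sample_lines = [
    'Achilles spoke, and Achilles wept.',
    'The ships lay on the shore.',
    'Then Hector answered Achilles.',
]

search_term = 'achilles'
term_count = 0
multi_term_line = 0
line_numbers_list = []

# enumerate() hands us a running line number alongside each line
for line_number, line in enumerate(sample_lines, start=1):
    line = line.lower()
    # count() returns 0 when the term is absent, so no separate check is needed
    occurrences = line.count(search_term)
    term_count += occurrences
    if occurrences > 1:
        multi_term_line += 1
        line_numbers_list.append(line_number)

print(term_count)         # prints: 3
print(multi_term_line)    # prints: 1
print(line_numbers_list)  # prints: [1]

# One thing to watch: count() matches pieces of words, too
print('thousand'.count('sand'))  # prints: 1
```

So a search for 'sand' would also pick up every 'thousand' in the text. Whether that matters depends on what you're looking for.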

We've read through the whole file, but our variable file_to_read still holds the open file. Let's close it explicitly, using a tool Python gives us on files:


In [ ]:
file_to_read.close()
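
Another common way to handle this is a with block, which closes the file for you once the indented code is done. Here's a minimal sketch. It writes a tiny throwaway file first so it can run on its own; sample_path and sample_file are names made up for this example.

```python
import os
import tempfile

# Create a small throwaway file so this example runs without iliad.txt
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('first line\nsecond line\n')
    sample_path = tmp.name

line_total = 0
# The file stays open only inside the indented block
with open(sample_path, encoding='utf-8') as sample_file:
    for line in sample_file:
        line_total += 1

# Once the with block ends, Python has closed the file for us
print(sample_file.closed)  # prints: True
print(line_total)          # prints: 2

# Clean up the throwaway file
os.remove(sample_path)
```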

Now let's set some language so we can make our data more readable.


In [ ]:
# if this value is zero or more than one or (somehow) negative, this word should be plural
if multi_term_line != 1:
    times = 'times'
else:
    times = 'time'

Now we can drop our variables into a sentence to help better make sense of our data:


In [ ]:
# we can do it by adding the strings to one another like this:
print(search_term + ' was in The Iliad ' + str(term_count) + ' times')

In [ ]:
# or we can use what Python calls `f-strings`, which allow us to drop variables directly into a string;
# doing it this way means we don't have to keep track as much of wayward spaces or
# whether one of our variables is an integer
print(f'{search_term} was in The Iliad {term_count} times')

And how often was our term on the same line? Which lines?


In [ ]:
print(f'It was on the same line multiple times {multi_term_line} {times}')
print(f'It was on lines {line_numbers_list} multiple times')

Another way to analyze text is by the frequency of the words it contains. There may be insights there about what's important, or they may be terms you want to use in a FOIA request.

Let's make a dictionary to keep track of how often all the words in The Iliad occur:


In [ ]:
# a dictionary to collect words as keys and number of occurrences as the value
most_common_words = {}

Remember, we closed our file, so we'll need to open it again and set it to a variable. This is a time when making the file path its own variable saves us the trouble of finding it again:


In [ ]:
file_to_read = open(file_location)

Once again, we'll need to loop through the lines in the file. This time, we care about inspecting each individual word -- not just whether a term is somewhere in the line:


In [ ]:
for line in file_to_read:
    line = line.lower()
    # make a list of words out of each line using a Python tool for strings
    word_list = line.split()
    # and loop over each word in the line
    for word in word_list:
        # if a word is not yet in the most_common_words dictionary, add it
        # if the word is there already, increase the count by 1
        most_common_words[word] = most_common_words.get(word, 0) + 1
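
The .get(word, 0) + 1 pattern can be easier to follow on a single short line. Here's a sketch using an invented sentence (not from The Iliad). It also shows Counter, a tool from Python's standard library that does the same tallying in one step; the notebook doesn't use it, so treat it as an optional alternative.

```python
from collections import Counter

# An invented line of text, just for illustration
sample_line = 'the ships and the men and the sea'
words = sample_line.split()

word_counts = {}
for word in words:
    # get() returns the current count, or 0 if the word isn't a key yet
    word_counts[word] = word_counts.get(word, 0) + 1

print(word_counts)
# prints: {'the': 3, 'ships': 1, 'and': 2, 'men': 1, 'sea': 1}

# Counter builds the same tallies without writing the loop ourselves
counter_counts = Counter(words)
print(counter_counts['the'])  # prints: 3
```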

In [ ]:
# we now have the words we want to analyze further in a dictionary -- so we don't need that file anymore. So let's close it
file_to_read.close()

Set up our baseline variables -- where we'll want to store the top values we're looking for. We'll need one variable for most_common_word, set to None, and another for highest_count, set to zero.


In [ ]:
most_common_word = None
highest_count = 0

Now we have a dictionary of every word in The Iliad. And we can spot-check the number of times any word we'd like has appeared by using the word as the key to access that (just remember we made all the keys lowercase):


In [ ]:
print(most_common_words['homer'])
print(most_common_words['paris'])
print(most_common_words['hector'])
print(most_common_words['helen'])
print(most_common_words['sand'])
print(most_common_words['trojan'])
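
Two things can trip up a spot-check like this. Asking for a key that isn't in the dictionary raises a KeyError, and because split() only breaks on whitespace, a word followed by punctuation (like 'helen,') is stored as its own key. The sketch below shows one possible way to handle both; the counts in sample_counts are made up for illustration.

```python
import string

# Made-up counts for illustration; note 'helen,' with a trailing comma
sample_counts = {'helen': 2, 'helen,': 1, 'hector': 3}

# get() returns a default instead of raising KeyError for a missing word
missing = sample_counts.get('homer', 0)
print(missing)  # prints: 0

# Strip punctuation from the ends of each key and merge the counts
cleaned_counts = {}
for word, count in sample_counts.items():
    cleaned = word.strip(string.punctuation)
    cleaned_counts[cleaned] = cleaned_counts.get(cleaned, 0) + count

print(cleaned_counts)  # prints: {'helen': 3, 'hector': 3}
```

In the real loop, the same strip() call could be applied to each word before it goes into the dictionary.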

In [ ]:
for word, count in most_common_words.items():
    # as we go through the most_common_words dictionary,
    # set the word and the count that's the biggest we've seen so far
    if count > highest_count:
        most_common_word = word
        highest_count = count
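
Writing the loop by hand is a good way to see what's happening. Python also has built-in tools that can do the same job in a line or two. This sketch uses max() and sorted() on a small made-up dictionary; it's one possible shortcut, not the approach this notebook relies on.

```python
# Made-up counts for illustration
sample_counts = {'the': 3, 'and': 2, 'ships': 1}

# max() with key= picks the dictionary key whose value is largest
most_common_word = max(sample_counts, key=sample_counts.get)
highest_count = sample_counts[most_common_word]
print(most_common_word, highest_count)  # prints: the 3

# sorted() can rank every word; slicing keeps just the top two
top_two = sorted(sample_counts.items(), key=lambda pair: pair[1], reverse=True)[:2]
print(top_two)  # prints: [('the', 3), ('and', 2)]
```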

In [ ]:
print(f'The most common word in The Iliad is: {most_common_word}')
print(f'It is in The Iliad {highest_count} times')
print('Wow! How cool is that?')

As you’ve just seen, programming can be pretty tedious when you’re trying to break tasks down. So now that you’ve gotten a little bit of a taste for what writing a program is like, let’s dive into some of the nitty-gritty basics, like how you strip whitespace from a string and what happens when you mix a float and an integer.

That sounds like a lot of fun. It must. It does. We promise.

Onward.

