Python Basics at PyCAR2020

Let's search some text

You already know the components of programming. You have been exercising the reasoning programming relies on for your entire life, probably without even realizing it. Programming is just a way to take the logic you already use on a daily basis and express it in a way a computer can understand and act upon.

It's just learning how to write in a different language.

One very important disclaimer before we start doing just that: Nobody memorizes this stuff. We all have to look stuff up all the time. We don’t expect you to memorize it, either. Ask questions. Ask us to review things we’ve already told you.

(Most of us ask questions we've asked before daily — we just ask them of Google.)

Now for some code. Let's say you want to search 130,000 lines of text for certain terms: which are most common, how frequently they occur, and how often they are concentrated in one place, which might indicate passages you want to look at more closely.

No person wants to do that by hand. And people are bad at precisely that kind of work. But it's perfect for a computer.

That length happens to correspond to The Iliad. In groups of two or three, think about a book like that. In your groups, figure out two things:

  • A whole text is made up of what parts?
  • What is the first thing you need to know to begin to search a file of text? The second thing? Third thing?

Roughly, the steps might look like this:

  1. open the file
  2. break the file into individual lines
  3. begin to examine each line
  4. if the line contains the term you're looking for, capture that
  5. does anything else about the line interest you? Is your term there multiple times, for instance?
  6. if none of your conditions are met, keep going

This is a program! See, you already know how to program. Now let’s take a minute to step through this the way a computer might.

In Python and other languages, we use the concept of variables to store values. A variable is just an easy way to reference a value we want to keep track of. So if we want to store a search term and how often our program has found it, we probably want to assign them to variables to keep track of them.

Create a string that represents the search term we want to find and assign it to a variable search_term. Let's search for the string 'Achilles':


In [ ]:
# This could just as easily be 'horse' or 'Helen' or 'Agamemnon' or 'sand' -- or 'Trojan'

Now tell Python where the file is and to open it. The path to the file can be one variable: file_location. And we can use that to open the file itself and store that opened file in a variable, file_to_read.


In [ ]:
file_location = 'data/iliad.txt'
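Here is a rough sketch of opening a file from a path stored in a variable. So that it runs anywhere, it first writes a tiny stand-in file to a temporary folder; that folder, the file name sample.txt and the sample_file variable are assumptions for illustration. In the workshop, file_location would simply be 'data/iliad.txt'.

```python
import os
import tempfile

# Assumption: build a small stand-in file so this sketch runs on its own
sample_dir = tempfile.mkdtemp()
file_location = os.path.join(sample_dir, 'sample.txt')

sample_file = open(file_location, 'w')
sample_file.write('Sing, O goddess, the anger of Achilles\n')
sample_file.close()

# open() returns a file object; with no second argument it opens for reading
file_to_read = open(file_location)
print(file_to_read.name)
print(file_to_read.mode)

# closed here only to keep this standalone sketch tidy
file_to_read.close()
```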

Now create a variable term_count containing the integer value of how many times we've seen our search term in the text. So far, that's zero.


In [ ]:
# how many times our search_term has occurred

So any time we want to check how many times we've seen our search_term or where our file_location is, we can use these variables instead of typing out the hard-coded value!

If you forget what one of the variables is set to, you can print it. (In Python 2.x, print was a statement and its parentheses were optional; in Python 3.x, print() is a function and the parentheses are required.) Let's also make a comment to remind us of what this variable does.


In [ ]:
# how many lines contain at least two of our search_term
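As a sketch, the variables described so far might look something like this. The name multi_term_lines is hypothetical; the text above does not name the variable that counts lines with at least two matches, so pick whatever reads clearly to you.

```python
# the term we are looking for
search_term = 'Achilles'

# how many times our search_term has occurred; nothing counted yet
term_count = 0

# hypothetical name: how many lines contain at least two of our search_term
multi_term_lines = 0

# print() shows the value a variable currently holds
print(search_term)
print(term_count)
print(multi_term_lines)
```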

When the term appears more than once on a line, note the line number in a line_number variable and collect all relevant line numbers into a list assigned to the variable line_numbers_list.


In [ ]:
# so far, zero

In [ ]:
# an empty list we hope to fill with lines we might want to explore in greater detail
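One possible way to set these up, with a made-up line number added to the list just to show how a list grows:

```python
# the line we are currently on; so far, zero
line_number = 0

# an empty list to fill with line numbers worth a closer look
line_numbers_list = []

# illustration only: pretend line 12 mentioned our term twice
line_number = 12
line_numbers_list.append(line_number)

print(line_numbers_list)  # prints [12]
```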

Remember that a string is just a series of characters, not a word or a sentence. So you can represent those characters as lowercase or uppercase, or see whether the string starts with or ends with specific things. Try to make our search_term lowercase:


In [ ]:
# we want to compare lowercase only against lowercase
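A quick sketch of a few string tools mentioned above. Strings can't be changed in place, so lower() hands back a new string, and we reassign it to search_term to keep the lowercase version.

```python
search_term = 'Achilles'

print(search_term.upper())          # ACHILLES
print(search_term.startswith('A'))  # True
print(search_term.endswith('x'))    # False

# lower() returns a new string; reassign to keep the lowercase version
search_term = search_term.lower()
print(search_term)                  # achilles
```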

We've decided to standardize our strings for comparison by making them lowercase. Cool. Now we need to do the comparing. Our open file is ready to explore. And to do that, we'll need a for loop. The loop will assign each line to a variable on the fly, then reference that variable to do stuff we tell it to:


In [ ]:
# begin looping line by line through our file

    # increment the line_number

    # make the line lowercase

    # check whether our search_term is in the line

        # if it is, use a tool Python gives us to count how many times
        # and add that to the number of times we've seen already

        # if it has counted more than one in the line, we know it's there multiple times;
        # keep track of that, too

            # and add that to the list using a tool Python gives us for lists
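Here is one way the loop outlined in the comments above could be filled in. It's a sketch under a few assumptions: three made-up lines stand in for the open Iliad file (io.StringIO gives us a file-like object we can loop over), and multi_term_lines and occurrences are hypothetical names.

```python
import io

# Assumption: three made-up lines stand in for the opened file
file_to_read = io.StringIO(
    'Sing, O goddess, the anger of Achilles son of Peleus\n'
    'Achilles wept, and Achilles prayed to his mother\n'
    'The ships lay drawn up on the sand\n'
)

search_term = 'achilles'
term_count = 0
multi_term_lines = 0   # hypothetical name for the multiple-match counter
line_number = 0
line_numbers_list = []

# begin looping line by line through our file
for line in file_to_read:
    # increment the line_number
    line_number += 1
    # make the line lowercase
    line = line.lower()
    # check whether our search_term is in the line
    if search_term in line:
        # count() tells us how many times it appears in this line
        occurrences = line.count(search_term)
        term_count += occurrences
        # more than one means the term is there multiple times
        if occurrences > 1:
            multi_term_lines += 1
            # append() adds the line number to the end of the list
            line_numbers_list.append(line_number)

print(term_count)         # 3
print(multi_term_lines)   # 1
print(line_numbers_list)  # [2]
```

Note that count() matches characters, not whole words, so searching for 'sand' would also count 'thousand'.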

We've read through the whole file, but our variable file_to_read still holds the open file. Let's close it explicitly, using a tool Python gives us on files:


In [ ]:
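A minimal sketch of closing a file and checking that it worked. An in-memory stand-in is assumed here in place of the real opened file; file objects returned by open() have the same close() method and closed attribute.

```python
import io

# Assumption: an in-memory stand-in for the opened file
file_to_read = io.StringIO('Sing, O goddess, the anger of Achilles\n')

print(file_to_read.closed)  # False

file_to_read.close()
print(file_to_read.closed)  # True
```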

Now let's set some language so we can make our data more readable.


In [ ]:
# if this value is zero or more than one or (somehow) negative, at least one of our words should be plural
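One way to pick the right word, sketched with a made-up count. The name time_word is hypothetical. Exactly one is the only value that reads as singular, so everything else gets the plural.

```python
term_count = 3  # made-up value for illustration

# exactly one occurrence is the only case that reads as singular
if term_count == 1:
    time_word = 'time'
else:
    time_word = 'times'

print(time_word)  # times
```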

Now we can drop our variables into a sentence to help better make sense of our data:


In [ ]:
# we can do it by adding the strings to one another

In [ ]:
# or we can use what Python calls `f-strings`, which allow us to drop variables directly into a string;
# doing it this way means we don't have to keep track as much of wayward spaces or
# whether one of our variables is an integer
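Both approaches side by side, as a sketch with made-up values (time_word, sentence_one and sentence_two are hypothetical names). With plain addition, the integer has to be wrapped in str() and every space is our responsibility; the f-string handles both.

```python
search_term = 'achilles'
term_count = 3        # made-up value for illustration
time_word = 'times'   # hypothetical variable holding the singular or plural word

# adding strings: convert the integer with str() and add spaces by hand
sentence_one = 'The term ' + search_term + ' appears ' + str(term_count) + ' ' + time_word + '.'

# f-string: variables go inside curly braces, no conversion needed
sentence_two = f'The term {search_term} appears {term_count} {time_word}.'

print(sentence_one)
print(sentence_two)
```

Leaving out str() in the first version raises a TypeError, because Python won't add a string and an integer together.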

And how often was our term on the same line? Which lines?


In [ ]:

Another way to analyze text is by the frequency of the words it contains. There may be insights there about what's important, or they may be terms you want to use in a FOIA request.

Let's make a dictionary to keep track of how often all the words in The Iliad occur:


In [ ]:
# a dictionary to collect words as keys and number of occurrences as the value

Remember, we closed our file, so we'll need to open it again and set it to a variable. This is a time when making the file path its own variable saves us the trouble of finding it again:


In [ ]:

Once again, we'll need to loop through the lines in the file. This time, we care about inspecting each individual word -- not just whether a term is somewhere in the line:


In [ ]:
# loop through the lines in The Iliad

    # make each one lowercase

    # make a list of words out of each line using a Python tool for strings

    # and loop over each word in the line

        # if a word is not yet in the most_common_words dictionary, add it
        # if the word is there already, increase the count by 1
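A sketch of how that counting loop might be filled in. Two made-up lines stand in for the reopened file, and the name words is hypothetical. split() with no arguments breaks a string apart on whitespace and returns a list.

```python
import io

# Assumption: two made-up lines stand in for the reopened file
file_to_read = io.StringIO(
    'The wrath of Achilles\n'
    'the will of Zeus\n'
)

# a dictionary to collect words as keys and number of occurrences as the value
most_common_words = {}

# loop through the lines
for line in file_to_read:
    # make each one lowercase
    line = line.lower()
    # split() breaks the line on whitespace and gives back a list of words
    words = line.split()
    # and loop over each word in the line
    for word in words:
        if word not in most_common_words:
            # first time we've seen this word
            most_common_words[word] = 1
        else:
            # seen it before, so increase the count by 1
            most_common_words[word] += 1

print(most_common_words)
```

Because split() only looks at whitespace, punctuation stays attached to words, so 'achilles,' and 'achilles' would be counted as two different keys.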

In [ ]:
# we now have the words we want to analyze further in a dictionary -- so we don't need that file anymore. So let's close it

Set up our baseline variables -- where we'll want to store the top values we're looking for. We'll need one variable for most_common_word, set to None, and another for highest_count, set to zero.


In [ ]:

Now we have a dictionary of every word in The Iliad. And we can spot-check the number of times any word we'd like has appeared by using the word as the key to access its count (just remember we made all the keys lowercase):


In [ ]:
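A spot check might look like this sketch, which uses a small made-up dictionary in place of the full one. Asking for a key that isn't there raises a KeyError, so the get() method, which returns a default value instead, can be handy.

```python
# Assumption: a small made-up dictionary standing in for the full word count
most_common_words = {'the': 2, 'wrath': 1, 'of': 2, 'achilles': 1}

# use the lowercase word as the key
print(most_common_words['achilles'])      # 1

# get() returns the second argument when the key is missing
print(most_common_words.get('helen', 0))  # 0
```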


In [ ]:
# looping through a dictionary is a little different -- we want its keys and the values to those keys

    # as we go through the most_common_words dictionary,
    # set the word and the count that's the biggest we've seen so far
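One way to find the most common word, sketched against a small made-up dictionary. The items() method hands back each key together with its value, so the loop can name both at once.

```python
# Assumption: a small made-up dictionary standing in for the full word count
most_common_words = {'the': 2, 'wrath': 1, 'of': 2, 'achilles': 3}

# baseline values before we've looked at anything
most_common_word = None
highest_count = 0

# items() gives us each key and its value together
for word, count in most_common_words.items():
    # keep the word and count if it's the biggest we've seen so far
    if count > highest_count:
        highest_count = count
        most_common_word = word

print(most_common_word, highest_count)  # achilles 3
```

With a strict greater-than comparison, a tie goes to whichever word the loop reached first.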

In [ ]:

As you’ve just seen, programming can be pretty tedious when you’re trying to break tasks down. So now that you’ve gotten a little bit of a taste for what writing a program is like, let’s dive into some of the nitty-gritty basics, like how you strip whitespace from a string and what happens when you mix a float and an integer.

That sounds like a lot of fun. It must. It does. We promise.

Onward.