Basic Programming Using Python: Files and Lists

Objectives

  • Learn how to open a file and read in data.
  • Understand how to interpret and remove newlines in Python.
  • Use the ipythonblocks library to create grids of colored cells based on characters in a file.
  • Introduce the list data structure and learn how to manipulate lists in Python.
  • Use lists to store lines from a data file in order to generate multi-dimensional color grids.

File I/O: Reading in Data From Files

In our previous lesson, we learned how to set the colors in a grid based on the characters in a string. However, what happens if we want to set the colors based on the characters contained in a file? Here we will learn how to open files and read in lines of data for use in setting the colors of a grid.

As a refresher, here's our coloring function that takes in an ImageGrid object, grid, and colors it based on the characters in the string, data:


In [2]:
def color_from_string(grid, data):
    "Color grid cells red and green according to 'R' and 'G' in data."
    assert grid.width == len(data), \
           'Grid and string lengths do not match: {0} != {1}'.format(grid.width, len(data))
    for x in range(grid.width):
        assert data[x] in 'GR', \
               'Unknown character in data string: "{0}"'.format(data[x])
        if data[x] == 'R':
            grid[x, 0] = colors['Red']
        else:
            grid[x, 0] = colors['Green']

And here's how we use it:


In [3]:
from ipythonblocks import ImageGrid, colors

row = ImageGrid(5, 1)
color_from_string(row, 'RRGRR')
row.show()


Using a conventional text editor, we can create a text file that contains just that string:


In [5]:
!cat grid_rrgrr.txt


RRGRR

Now let's read it into our program:


In [4]:
reader = open('grid_rrgrr.txt', 'r')
line = reader.readline()
reader.close()
print 'line is:', line


line is: RRGRR

The first line of our program uses a built-in function called open to open our file. open's first parameter specifies the file we want to open; the second parameter, 'r', signals that we want to read the file. (We can use 'w' to write files, which we'll explore later.) open returns a special object that keeps track of which file we opened, and how much of its data we've read. This object is sometimes called a file handle, and we can assign it to a variable like any other value.

The second line of our program asks the file handle's readline method to read the first line from the file and give it back to us as a string.

The third line of the program asks the file handle to close itself (i.e., to disconnect from the file).

Important note on closing files: When we open a file, the operating system creates a connection between our program and that file. For performance and security reasons, it will only let a single program have a fixed number of files open at any one time, and will only allow a single file to be opened by a fixed number of programs at once. Both limits are typically up in the thousands, and the operating system automatically closes open files when a program finishes running, so we're unlikely to run into problems most of the time.

But that's precisely what makes this problematic. Something that only goes wrong when we're doing something large is much harder to debug than something that also goes wrong in the small. It's therefore a very good idea to get into the habit of closing files as soon as they're no longer needed. In fact, it's such a good idea that Python and other languages have a way to guarantee that it happens automatically.


In [5]:
with open('grid_rrgrr.txt', 'r') as reader:
    line = reader.readline()
    print 'line is:', line


line is: RRGRR

The with...as... statement takes whatever is created by its first part—in our case, the result of opening a file—and assigns it to the variable given in its second part. It then executes a block of code, and when that block is finished, it cleans up the stored value. "Cleaning up" a file means closing it; it means different things for databases and connections to hardware devices, but in every case, Python guarantees to do the right thing at the right time. We'll use with statements for file I/O from now on.
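
As a quick check of that guarantee, the sketch below looks at the file handle's closed attribute inside and after a with block. It creates its own temporary file (the name example.txt is made up for illustration) rather than relying on grid_rrgrr.txt, and it calls print with parentheses and a single argument so that it runs under both Python 2 and Python 3. The try/except at the end is only there to simulate an error partway through the block.

```python
import os
import tempfile

# Create a throwaway file (hypothetical name and contents, for illustration only).
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(path, 'w') as writer:
    writer.write('RRGRR\n')

with open(path, 'r') as reader:
    inside = reader.closed       # the file is still open inside the block
    line = reader.readline()
after = reader.closed            # leaving the block closed the file

print('closed inside the block: {0}'.format(inside))
print('closed after the block: {0}'.format(after))

# The file is closed even if the block is interrupted by an error.
try:
    with open(path, 'r') as reader:
        raise ValueError('something went wrong while reading')
except ValueError:
    pass
after_error = reader.closed
print('closed after an error: {0}'.format(after_error))
```

The first value printed is False and the other two are True: the file is open only while the block is running.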

Finally, on the fourth line of our original program, we print the string we read. The result is 'RRGRR', just as expected.

Or is it? Let's take a look at line's length:


In [7]:
print len(line)


6

Why does len tell us there are six characters instead of five? We can use another function called repr to take a closer look at what we actually read:


In [8]:
print repr(line)


'RRGRR\n'

repr stands for "representation". It returns whatever we'd have to type into a Python program to create the thing we've given it as a parameter. In this case, it's telling us that our string contains 'R', 'R', 'G', 'R', 'R', and '\n'. That last thing is called an escape sequence, and it's how Python represents a newline character in a string. We can use other escape sequences to represent other special characters:


In [9]:
print 'We\'ll put a single quote in a single-quoted string.'
print "Or we\"ll put a double quote in a double-quoted string."
print 'This\nstring\ncontains\nnewlines.'
print 'And\tthis\tone\tcontains\ttabs.'


We'll put a single quote in a single-quoted string.
Or we"ll put a double quote in a double-quoted string.
This
string
contains
newlines.
And	this	one	contains	tabs.


Carriage Return, Newline, and All That

If we create our file on Windows, it might contain 'RRGRR\r\n' instead of 'RRGRR\n'. The '\r' is a [carriage return](glossary.html#carriage_return), and it's there because Windows uses two characters to mark the ends of lines rather than just one. There's no reason to prefer one convention over the other, but problems do arise when we create files one way and try to read them with programs that expect the other. Python does its best to shield us from this by converting Windows-style '\r\n' end-of-line markers to '\n' as it reads data from files. If we really want to keep the original line endings, we need to use `'rb'` (for "read binary") when we open the file instead of just `'r'`. For more on this and other madness, see Joel Spolsky's article [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html).
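
To see the difference between the two modes, the following sketch writes a Windows-style line ending byte for byte and reads it back twice. It is an illustration under a couple of assumptions: the file name is made up, and the text-mode read uses io.open because that function translates line endings the same way on every platform in both Python 2 and Python 3 (in Python 3 the built-in open is the same function).

```python
import io
import os
import tempfile

# Write a Windows-style line ending exactly as given ('wb' means "write binary").
path = os.path.join(tempfile.mkdtemp(), 'windows_style.txt')
with open(path, 'wb') as writer:
    writer.write(b'RRGRR\r\n')

# Binary mode hands back exactly what is stored in the file.
with open(path, 'rb') as reader:
    raw = reader.readline()

# Text mode via io.open translates '\r\n' to '\n' while reading.
with io.open(path, 'r') as reader:
    text = reader.readline()

print(repr(raw))
print(repr(text))
print('lengths: {0} and {1}'.format(len(raw), len(text)))
```

The binary read keeps all seven characters, including '\r', while the text-mode read returns six, ending in a single '\n'.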


The easiest way to get rid of our annoying newline character is to use str.strip, i.e., the strip method of the string data type. As its interactive help says:


In [10]:
help(str.strip)


Help on method_descriptor:

strip(...)
    S.strip([chars]) -> string or unicode
    
    Return a copy of the string S with leading and trailing
    whitespace removed.
    If chars is given and not None, remove characters in chars instead.
    If chars is unicode, S will be converted to unicode before stripping

str.strip creates a new string by removing any leading or trailing whitespace characters from the original (str.lstrip and str.rstrip remove only leading or trailing whitespace, respectively). Whitespace includes carriage return, newline, tab, and the familiar space character, so stripping the string also takes care of any accidental indentation or (invisible) trailing spaces:


In [11]:
original = '  indented with trailing spaces   '
stripped = original.strip()
print '|{0}|'.format(stripped)


|indented with trailing spaces|
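
The three variants are easiest to compare side by side. This small sketch applies them to a made-up line that has leading spaces, a trailing space, and a Windows-style '\r\n' ending; it uses repr so the invisible characters show up, and print with parentheses so it runs under both Python 2 and Python 3.

```python
line = '  RRGRR \r\n'

both_ends = line.strip()            # leading and trailing whitespace removed
left_only = line.lstrip()           # only the leading spaces removed
right_only = line.rstrip()          # trailing space, '\r' and '\n' removed
newline_only = line.rstrip('\n')    # only trailing '\n' removed; ' \r' is kept

print(repr(both_ends))
print(repr(left_only))
print(repr(right_only))
print(repr(newline_only))
```

Passing '\n' to rstrip is the more cautious choice when leading or trailing spaces are meaningful, but it leaves a stray '\r' behind if the file came from Windows.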

Let's use this to fix our string and initialize our grid. In fact, let's write a function that takes a grid and a filename as parameters and fills the grid using the color specification in that file:


In [6]:
def color_from_file(grid, filename):
    'Color the cells in a grid using a spec stored in a file.'
    with open(filename, 'r') as reader:
        line = reader.readline()
        line = line.strip()
        color_from_string(grid, line)

another_row = ImageGrid(5, 1)
color_from_file(another_row, 'grid_rrgrr.txt')
another_row.show()


That's progress, but we can do better. When we were creating grids and color strings in the same program, it was fairly easy to make sure the grid and the string were the same size. Opening a text file in an editor and counting the characters on the first line will be a lot more painful, so why don't we create the grid based on how long the string is?


In [7]:
def create_from_file(filename):
    'Create and color a grid using a spec stored in a file.'
    with open(filename, 'r') as reader:
        line = reader.readline()
        line = line.strip()
        grid = ImageGrid(len(line), 1)
        color_from_string(grid, line)
        return grid

newly_made = create_from_file('grid_rrgrr.txt')
newly_made.show()


This is starting to look like our friend skimage.novice.open: given a filename, it loads the data from that file into a suitable object in memory and gives the object back to us for further use. What's more, it does that using a function that initializes objects which are already in memory, so that we can fill things several times in exactly the same way without any duplicated code.

Using Lists to Generate Multi-dimensional Color Grids

A single row of pixels is a lot less interesting than an actual image, but before we can read the latter, we need to learn how to use lists. Just as a for loop is a way to do operations many times, a list is a way to store many values in one variable. To start our exploration of lists, try this:


In [15]:
odds = [1, 3, 5]
for number in odds:
    print number


1
3
5

[1, 3, 5] is a list. Its elements are written in square brackets and separated by commas, and just as a for loop over a string works on those characters one at a time, a for loop over a list processes the list's values one by one.

Let's do something a bit more useful with a list of numbers:


In [16]:
data = [1, 4, 2, 3, 3, 4, 3, 4, 1]
total = 0.0
for n in data:
    total += n
mean = total / len(data)
print 'mean is', mean


mean is 2.77777777778

By now, the logic here should be fairly easy to follow. data refers to our list, and total is initialized to 0.0. Each iteration of the loop adds the next number from the list to total, and when we're done, we divide the result by the list's length to get the mean. (Note that we initialize total to 0.0 rather than 0, so that it is always a floating-point number. If we didn't do this, its final value might be an integer, and the division could give us a truncated approximation to the actual mean.)


A Simpler Way

Python actually has a built-in function called `sum` that does what our loop does, so we can calculate the mean more simply using this:


In [19]:
print 'mean is', float(sum(data)) / len(data)


mean is 2.77777777778

Again, it's important to understand that `float( sum(data)/len(data) )` might not return the right answer, since it would do integer/integer division (producing a possibly-truncated result) and then convert that value to a float.
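
The effect of converting too late can be shown with the `//` operator, which always discards the fractional part when dividing integers. This is a sketch rather than a replay of the cell above: `//` is used because it behaves the same in Python 2 and Python 3, whereas `/` between two integers truncates only in Python 2.

```python
data = [1, 4, 2, 3, 3, 4, 3, 4, 1]

total = sum(data)                        # 25, an integer
truncated = float(total // len(data))    # 25 // 9 is 2, so converting gives 2.0
correct = float(total) / len(data)       # convert first, then divide

print('convert after dividing: {0}'.format(truncated))
print('convert before dividing: {0}'.format(correct))
```

Converting the already-truncated result to a float cannot bring back the lost fraction, which is why the conversion has to happen before the division.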


A Deeper Look at Lists in Python

Lists are probably used more than any other data structure in programming, so let's have a closer look at them.

First, lists are ordered (i.e. they are sequences) and you can fetch a component object out of a list by indexing the list starting at index 0:


In [14]:
values = [1, 3, 5]
print values[0]
print values[1]
print values[2]


1
3
5

Second, lists are mutable, i.e., they can be changed after they are created:


In [20]:
values = [1, 3, 5]
values[0] = 'one'
values[1] = 'three'
values[2] = 'five'
print values


['one', 'three', 'five']

As the steps below show, this works because the list doesn't actually contain any values. Instead, it stores references to values. When we assign something to values[0], what we're really doing is putting a different reference in that location in the list. Let's quickly go through the block of code above line by line:


In [8]:
values = [1, 3, 5]
print values


[1, 3, 5]


In [13]:
values = [1, 3, 5]
values[0] = 'one'
print values


['one', 3, 5]


In [15]:
values = [1, 3, 5]
values[0] = 'one'
values[1] = 'three'
print values


['one', 'three', 5]


In [17]:
values = [1, 3, 5]
values[0] = 'one'
values[1] = 'three'
values[2] = 'five'
print values


['one', 'three', 'five']
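
One way to check this for ourselves is with the built-in function id, which this lesson does not otherwise use: it returns a number that identifies an object for as long as that object exists. The sketch below is only an illustration. It compares the identity of the list, and of whatever slot 0 refers to, before and after an assignment.

```python
values = [1, 3, 5]
list_before = id(values)        # identity of the list object itself
slot_before = id(values[0])     # identity of the value that slot 0 refers to

values[0] = 'one'

same_list = id(values) == list_before        # still the same list object
same_value = id(values[0]) == slot_before    # slot 0 now refers to something else

print('same list object: {0}'.format(same_list))
print('slot 0 refers to the same value: {0}'.format(same_value))
```

The list is the same object throughout; only the reference stored in slot 0 has been replaced.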

Third, lists are variable-length and can dynamically grow and shrink in place using built-in methods such as append() and remove(). For example:


In [2]:
data = [1, 4, 2, 3]
result = []
print 'The length of result before: ', len(result)
current = 0
for n in data:
    current = current + n
    result.append(current)
print 'running total:', result
print 'The length of result after: ', len(result)


The length of result before:  0
running total: [1, 5, 7, 10]
The length of result after:  4

result starts off as an empty list with a length of 0, and current starts off as zero. Each iteration of the loop adds the next value in the list data to current to calculate the running total. It then appends this value to result, so that when the program finishes we have a complete list of partial sums.
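
Shrinking a list works the same way. Here is a brief sketch using remove on a made-up list called samples; note that remove deletes only the first matching value.

```python
samples = [1, 4, 2, 3, 4]
length_before = len(samples)

samples.remove(4)    # deletes the first 4 only; the later one stays
length_after = len(samples)

print('after remove: {0}'.format(samples))
print('length went from {0} to {1}'.format(length_before, length_after))
```

If the value is not in the list at all, remove raises a ValueError rather than silently doing nothing.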

What if we want to double the values in data in place? We could try this:


In [9]:
data = [1, 4, 2, 3] # re-initialize our sample data
for n in data:
    n = 2 * n
print 'doubled data:', data


doubled data: [1, 4, 2, 3]

but as we can see, it doesn't work. When Python calculates 2*n it creates a new value in memory. It then makes the variable n point at the value for a few microseconds before going around the loop again and pointing n at the next value from the list instead. Since nothing is pointing to the temporary value we just created any longer, Python throws it away.

The right way to solve this problem is to use indexing and the range function:


In [6]:
data = [1, 4, 2, 3] # re-initialize our sample data
for i in range(4):
    data[i] = 2 * data[i]
print 'doubled data:', data


doubled data: [2, 8, 4, 6]

Once again we have violated the DRY Principle by using range(4): if we ever change the number of values in data, our loop will either fail because we're trying to index beyond its end, or what's worse, appear to succeed but not actually update some values. Let's fix that:


In [8]:
data = [1, 4, 2, 3] # re-initialize our sample data
for i in range(len(data)):
    data[i] *= 2
print 'doubled data:', data


doubled data: [2, 8, 4, 6]

That's better: len(data) is always the actual length of the list, so range(len(data)) is always the indices we need. We've also rewritten the multiplication and assignment to use an in-place operator *= so that we aren't repeating data[i].
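
An alternative worth knowing about is the built-in function enumerate, which this lesson has not used so far. It hands the loop each index together with the value stored there, so we don't have to build the indices ourselves. A minimal sketch:

```python
data = [1, 4, 2, 3]
for i, n in enumerate(data):    # i is the index, n is the value stored at that index
    data[i] = 2 * n
print('doubled data: {0}'.format(data))
```

Assigning to data[i] changes the list, whereas assigning to n alone would not, for the reason described earlier.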

We can do this more concisely using a list comprehension. This isn't exactly the same as the for loop solution above, because it creates a new list rather than changing the existing one, but it is close enough for most applications:


In [12]:
data = [1, 4, 2, 3] # re-initialize our sample data
data = [n*2 for n in data]
print 'doubled data:', data


doubled data: [2, 8, 4, 6]
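
The difference only matters when some other variable still refers to the original list. The sketch below makes that visible with a second name for the same list; the names alias and other_name are made up for illustration, and the slice assignment in the second half is a technique this lesson does not otherwise cover.

```python
data = [1, 4, 2, 3]
alias = data                          # a second name for the same list
data = [n * 2 for n in data]          # builds a new list and points data at it
print('data after comprehension: {0}'.format(data))
print('alias after comprehension: {0}'.format(alias))

values = [1, 4, 2, 3]
other_name = values
values[:] = [n * 2 for n in values]   # replaces the contents of the existing list
print('other_name after slice assignment: {0}'.format(other_name))
```

After the plain assignment, alias still holds the undoubled values; after the slice assignment, both names see the doubled ones because there is still only one list.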

We can also do a lot of other interesting things with lists, like concatenate them:


In [29]:
left = [1, 2, 3]
right = [4, 5, 6]
combined = left + right
print combined


[1, 2, 3, 4, 5, 6]

count how many times a particular value appears in them:


In [31]:
data = ['a', 'c', 'g', 'g', 'c', 't', 'a', 'c', 'g', 'g']
print data.count('g')


4

sort them:


In [33]:
data.sort()
print data


['a', 'a', 'c', 'c', 'c', 'g', 'g', 'g', 'g', 't']

and reverse them:


In [34]:
data.reverse()
print data


['t', 'g', 'g', 'g', 'g', 'c', 'c', 'c', 'a', 'a']


A Health Warning

One thing that newcomers (and even experienced programmers) often trip over is that `sort` and `reverse` mutate the list, i.e., they rearrange values within a single list rather than creating and returning a new list. If we do this:


In [37]:
sorted_data = data.sort()

then all we have is the special value `None`, which Python uses to mean "there's nothing here":


In [36]:
print sorted_data


None

At some point or another, everyone types `data = data.sort()` and then wonders where their time series has gone…
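
When we do want a sorted copy and an untouched original, the built-in function sorted returns a new list instead of rearranging the one it is given. A short sketch contrasting the two, using a made-up list of bases:

```python
bases = ['g', 'a', 't', 'c']

in_order = sorted(bases)      # returns a new sorted list
unchanged = list(bases)       # snapshot to show bases has not been rearranged yet
print('sorted copy: {0}'.format(in_order))
print('original: {0}'.format(unchanged))

returned = bases.sort()       # sorts bases in place and returns None
print('after bases.sort(): {0}'.format(bases))
print('value returned by sort: {0}'.format(returned))
```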


Back to Multi-dimensional Color Grids

Now that we know how to create lists, we're ready to load two-dimensional images from files. Here's our first test file:


In [38]:
!cat grid_3x3.txt


RRG
RGR
GRR

and here's how we read it line by line with Python:


In [39]:
with open('grid_3x3.txt', 'r') as source:
    for line in source:
        print line


RRG

RGR

GRR

Whoops: we forgot to strip the newlines off the ends of the lines as we read them from the file. Let's fix that:


In [40]:
with open('grid_3x3.txt', 'r') as source:
    for line in source:
        print line.strip()


RRG
RGR
GRR

That's better. As this example shows, a for loop over a file reads the lines from the file one by one and assigns each to the loop variable in turn. If we want to get all the lines at once, we can do this instead:


In [41]:
with open('grid_3x3.txt', 'r') as source:
    lines = source.readlines() # with an 's' on the end
print lines


['RRG\n', 'RGR\n', 'GRR\n']

file.readlines (with an 's' on the end to distinguish it from file.readline) reads the entire file at once and returns a list of strings, one per line. The length of this list tells us how many rows we need in our grid, while the length of the first line (minus the newline character) tells us how many columns we need:


In [42]:
with open('grid_3x3.txt', 'r') as source:
    lines = source.readlines()
height = len(lines)
width = len(lines[0]) - 1
print '{0}x{1} grid'.format(width, height)


3x3 grid

Upon reflection, that's not a very good test case, since a square grid can't tell us whether we have height and width the right way around. Let's use a rectangular data file:


In [43]:
!cat grid_5x3.txt


RRRGR
RRGRR
RGRRR

and put our code in a function:


In [44]:
def read_size(filename):
    with open(filename, 'r') as source:
        lines = source.readlines()
    return len(lines[0]) - 1, len(lines)

width, height = read_size('grid_5x3.txt')
print '{0}x{1} grid'.format(width, height)


5x3 grid

As this example shows, a function can return several values at once. When it does, those values are matched against the caller's variables from left to right. This can actually be done anywhere:


In [46]:
red, green, blue = 255, 0, 128
print 'red={0} green={1} blue={2}'.format(red, green, blue)


red=255 green=0 blue=128

and gives us an easy way to swap the values of two variables:


In [47]:
low, high = 25, 10 # whoops
low, high = high, low # exchange their values
print 'low={0} high={1}'.format(low, high)


low=10 high=25
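
Behind the scenes, the comma-separated values are bundled into a single object called a tuple, and the assignment unpacks it. The sketch below shows both views using a hypothetical helper, size_of, that works on a list of lines instead of a file so that it is self-contained. As an assumption for this sketch, it measures the width with strip rather than subtracting one.

```python
def size_of(lines):
    "Return the width and height described by a list of lines (hypothetical helper)."
    return len(lines[0].strip()), len(lines)

lines = ['RRRGR\n', 'RRGRR\n', 'RGRRR\n']

size = size_of(lines)             # one variable receives the whole tuple
width, height = size_of(lines)    # two variables receive one value each

print('size is {0}'.format(size))
print('{0}x{1} grid'.format(width, height))
```

The number of variables on the left must match the number of values returned; otherwise Python raises a ValueError.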

Back to our function… Rather than just returning sizes, it would be more useful for us to create and fill in a grid. As we're doing this, though, we must remember to strip the newlines off the strings we have read from the file:


In [48]:
def read_grid(filename):
    with open(filename, 'r') as source:
        lines = source.readlines()
    width, height = len(lines[0]) - 1, len(lines)
    result = ImageGrid(width, height)
    for y in range(len(lines)):
        fill_grid_line(result, y, lines[y].strip())
    return result

This is the most complicated function we've written so far, so let's go through it step by step:

  1. Define read_grid to take a single parameter.
  2. Open the file named by that parameter and assign the file handle to source.
  3. Read all of the lines from the file at once and assign the resulting list to lines.
  4. Having closed the file, calculate the width and height of the grid.
  5. Create the grid.
  6. Loop over the lines.
  7. Fill in a single line of the grid using an as-yet-unwritten function called fill_grid_line.
  8. Once the loop is done, return the resulting grid.

We need a new function fill_grid_line because the function we've been using, color_from_string, always colors row 0 of whatever grid it's given. We need something that can color any row we specify:


In [49]:
def fill_grid_line(grid, y, data):
    "Color grid cells in row y red and green according to 'R' and 'G' in data."
    assert 0 <= y < grid.height, \
           'Row index {0} not within grid height {1}'.format(y, grid.height)
    assert grid.width == len(data), \
           'Grid and string lengths do not match: {0} != {1}'.format(grid.width, len(data))
    for x in range(grid.width):
        assert data[x] in 'GR', \
               'Unknown character in data string: "{0}"'.format(data[x])
        if data[x] == 'R':
            grid[x, y] = colors['Red']
        else:
            grid[x, y] = colors['Green']

As well as adding an extra parameter y to this function, we've added an extra assertion to make sure it's between 0 and the grid's height. In fact, we could have said, "Since we're adding an extra parameter, we've added an extra assertion," since it's good practice to check every input to a function before using it.

Let's give our functions a try:


In [50]:
rectangle = read_grid('grid_5x3.txt')
rectangle.show()


Perfect—or is it? Take another look at our data file:


In [51]:
!cat grid_5x3.txt


RRRGR
RRGRR
RGRRR

The 'G' in the top row of the data file is on the right, but the green square in the top row of the grid is on the left. The green cell in the bottom row of the grid is also in the wrong place. Somehow, our grid appears to be upside down.

The problem is that we haven't used a consistent coordinate system. ImageGrid uses a Cartesian grid with the origin in the lower left and Y going upward, but we're treating the file as if the origin was at the top, just as it is in a spreadsheet. The simplest way to fix this is to reverse our list of lines before using it:


In [54]:
def read_grid(filename):
    with open(filename, 'r') as source:
        lines = source.readlines()
    width, height = len(lines[0]) - 1, len(lines)
    result = ImageGrid(width, height)
    lines.reverse() # align with ImageGrid coordinate system
    for y in range(len(lines)):
        fill_grid_line(result, y, lines[y].strip())
    return result

rectangle = read_grid('grid_5x3.txt')
rectangle.show()
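
Another way to think about the fix is as a change of index: the line at position i in the file belongs in row height - 1 - i of the grid. The sketch below illustrates that mapping with a plain list of rows standing in for the grid, so it does not need ipythonblocks; the name rows is made up for illustration.

```python
lines = ['RRRGR', 'RRGRR', 'RGRRR']    # top line of the file first
height = len(lines)

# rows[y] is the row at height y, so rows[0] is the bottom row,
# matching the lower-left origin described above.
rows = [None] * height
for i in range(height):
    y = height - 1 - i      # the first line of the file belongs in the top row
    rows[y] = lines[i]

print('bottom row (y=0): {0}'.format(rows[0]))
print('top row (y={0}): {1}'.format(height - 1, rows[height - 1]))
print('file order untouched: {0}'.format(lines))
```

Unlike lines.reverse(), this leaves the list of lines in file order, which can be handy if we want to report problems using the file's own line numbers.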


All that's left is to make sure that all the lines are the same length so that we're warned of an error if we try to use a file like this:


In [56]:
!cat grid_ragged.txt


RRRGR
RRGR
RGR

Since we require all the lines to be the same length, we can compare their lengths against the length of any one line. We can do this in a loop of its own:

for line in lines:
    assert len(line.strip()) == width

or put the test in the loop that's filling the lines:

for y in range(len(lines)):
    assert len(lines[y].strip()) == width
    fill_grid_line(result, y, lines[y].strip())

The first does the checks before it makes any changes to the grid. Since we're creating the grid inside the function, though, this isn't a real worry: if there's an error in the file, our assertion will cause the function to fail and the partially-initialized grid will never be returned to the caller. We will therefore use the second form, but modify it slightly so that we only call strip once (DRY again):


In [72]:
def read_grid(filename):
    "Initialize a grid by reading lines of 'R' and 'G' from a file."
    with open(filename, 'r') as source:
        lines = source.readlines()
    width, height = len(lines[0]) - 1, len(lines)
    result = ImageGrid(width, height)
    lines.reverse()
    for y in range(len(lines)):
        string = lines[y].strip()
        assert len(string) == width, \
               'Line {0} is {1} long, not {2}'.format(y, len(string), width)
        fill_grid_line(result, y, string)
    return result

As always, we're not done until we test our change:


In [74]:
read_grid('grid_ragged.txt')


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-74-23ded2baa669> in <module>()
----> 1 read_grid('grid_ragged.txt')

<ipython-input-72-fd077232b9de> in read_grid(filename)
      8     for y in range(len(lines)):
      9         string = lines[y].strip()
---> 10         assert len(string) == width,                'Line {0} is {1} long, not {2}'.format(y, len(string), width)
     11         fill_grid_line(result, y, string)
     12     return result

AssertionError: Line 0 is 3 long, not 5

And of course we should make sure that it still works for a valid file:


In [75]:
once_more = read_grid('grid_5x3.txt')
once_more.show()
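
One thing to keep in mind when reading the error message for grid_ragged.txt: "Line 0" is counted after lines.reverse(), so it refers to the last line of the file, not the first. If we would rather see the numbers a text editor shows, we can check the lines in file order before reversing them. The helper below, find_ragged_lines, is a hypothetical addition for illustration and works on a list of lines so that it runs without ipythonblocks.

```python
def find_ragged_lines(lines):
    "Return (line number, length) pairs for lines whose length differs from the first."
    width = len(lines[0].strip())
    problems = []
    for i in range(len(lines)):
        length = len(lines[i].strip())
        if length != width:
            problems.append((i + 1, length))    # count lines from 1, as editors do
    return problems

ragged = ['RRRGR\n', 'RRGR\n', 'RGR\n']
regular = ['RRRGR\n', 'RRGRR\n', 'RGRRR\n']

print('ragged file: {0}'.format(find_ragged_lines(ragged)))
print('regular file: {0}'.format(find_ragged_lines(regular)))
```

An empty list means every line matches the width of the first one.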


Thumbnails Revisited

We now have all the concepts we need to create thumbnails for a set of images, and almost all the tools. The one remaining piece of the puzzle is the unpleasantly-named glob:


In [80]:
import glob
print 'text files:', glob.glob('*.txt')


text files: ['grid_3x3.txt', 'grid_5x3.txt', 'grid_ragged.txt', 'grid_rrgrr.txt']

In [81]:
print 'IPython Notebooks:', glob.glob('*.ipynb')


IPython Notebooks: ['python-0-resize-image.ipynb', 'python-1-functions.ipynb', 'python-2-loops-indexing.ipynb', 'python-3-conditionals-defensive.ipynb', 'python-4-files-lists.ipynb']

"glob" was originally short for "global command", but it has long since become a verb in its own right. It takes a single string as a parameter and uses it to do wildcard matching on filenames, returning a list of matches as a result. Once we have this list, we can loop over it and create thumbnails one by one:


In [83]:
from skimage import novice
from glob import glob

DEFAULT_WIDTH = 100

def make_all_thumbnails(pattern, width=DEFAULT_WIDTH):
    "Create thumbnails for all image files matching the given pattern."
    for filename in glob(pattern):
        make_thumbnail(filename, width)

def make_thumbnail(original_filename, width=DEFAULT_WIDTH):
    "Create a thumbnail for a single image file."
    picture = novice.open(original_filename)
    new_height = int(picture.height * float(width) / picture.width)
    picture.size = (width, new_height)
    thumbnail_filename = 'thumbnail-' + original_filename
    picture.save(thumbnail_filename)

The only thing that's really new here is the way we specify the default value for thumbnail widths. Since people might call both make_all_thumbnails and make_thumbnail directly, we want to be able to set the width for either. However, we also want their default values to be the same, so we define that value once near the top of the program and use it in both function definitions. By convention, "constant" values like DEFAULT_WIDTH are spelled in UPPER CASE to indicate that they shouldn't be changed.
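
Two details of make_thumbnail are worth trying out in isolation. First, glob returns each match with the directory part of the pattern attached, so 'thumbnail-' + original_filename puts the prefix in front of the directory when the pattern includes one. Second, the height calculation converts to float before dividing so that the aspect ratio is not truncated. The two helpers below are hypothetical sketches of those pieces, not part of skimage, and the file names are made up.

```python
import os

def thumbnail_name(original_filename):
    "Put 'thumbnail-' in front of the file's name, keeping its directory."
    directory, name = os.path.split(original_filename)
    return os.path.join(directory, 'thumbnail-' + name)

def thumbnail_height(original_width, original_height, width):
    "Height that preserves the aspect ratio for the requested width."
    return int(original_height * float(width) / original_width)

print(thumbnail_name('flower.jpg'))
print(thumbnail_name(os.path.join('photos', 'flower.jpg')))
print(thumbnail_height(640, 480, 100))
```

For files in the current directory the result is the same as simple string concatenation; the difference only shows up once a pattern such as 'photos/*.jpg' is used.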

Key Points

  • Open a file for reading using file_handle = open(filename, 'r') and close the file using file_handle.close(), or use a with statement so that the file is closed automatically.
  • Opening a file using open(...) returns a file handle object that makes a connection between a program and the file.
  • Lines read in from a file keep the newline character (\n) at the end of the line; only the last line may lack one, if the file doesn't end with a newline.
  • Remove newline characters (and all leading or trailing whitespace) in Python using line.strip().
  • List elements are written in square brackets and separated by commas.
  • Lists are mutable objects - they can be changed after they are created.
  • The list data structure in Python has built-in methods including count(), sort(), and reverse().