Use the ipythonblocks library to create grids of colored cells based on characters in a file.
In our previous lesson, we learned how to set the colors in a grid based on the characters in a string. However, what happens if we want to set the colors based on the characters contained in a file? Here we will learn how to open files and read in lines of data for use in setting the colors of a grid.
As a refresher, here's our coloring function that takes in an ImageGrid object, grid, and colors it based on the characters in the string, data:
In [2]:
def color_from_string(grid, data):
    "Color grid cells red and green according to 'R' and 'G' in data."
    assert grid.width == len(data), \
        'Grid and string lengths do not match: {0} != {1}'.format(grid.width, len(data))
    for x in range(grid.width):
        assert data[x] in 'GR', \
            'Unknown character in data string: "{0}"'.format(data[x])
        if data[x] == 'R':
            grid[x, 0] = colors['Red']
        else:
            grid[x, 0] = colors['Green']
And here's how we use it:
In [3]:
from ipythonblocks import ImageGrid, colors
row = ImageGrid(5, 1)
color_from_string(row, 'RRGRR')
row.show()
Using a conventional text editor, we can create a text file that contains just that string:
In [5]:
!cat grid_rrgrr.txt
Now let's read it into our program:
In [4]:
reader = open('grid_rrgrr.txt', 'r')
line = reader.readline()
reader.close()
print 'line is:', line
The first line of our program uses a built-in function called open to open our file. open's first parameter specifies the file we want to open; the second parameter, 'r', signals that we want to read the file. (We can use 'w' to write files, which we'll explore later.) open returns a special object that keeps track of which file we opened, and how much of its data we've read. This object is sometimes called a file handle, and we can assign it to a variable like any other value.
The second line of our program asks the file handle's readline method to read the first line from the file and give it back to us as a string.
The third line of the program asks the file handle to close itself (i.e., to disconnect from the file).
Important note on closing files: When we open a file, the operating system creates a connection between our program and that file. For performance and security reasons, it will only let a single program have a fixed number of files open at any one time, and will only allow a single file to be opened by a fixed number of programs at once. Both limits are typically up in the thousands, and the operating system automatically closes open files when a program finishes running, so we're unlikely to run into problems most of the time.
But that's precisely what makes this problematic. Something that only goes wrong when we're doing something large is much harder to debug than something that also goes wrong in the small. It's therefore a very good idea to get into the habit of closing files as soon as they're no longer needed. In fact, it's such a good idea that Python and other languages have a way to guarantee that it happens automatically.
In [5]:
with open('grid_rrgrr.txt', 'r') as reader:
    line = reader.readline()
print 'line is:', line
The with...as... statement takes whatever is created by its first part (in our case, the result of opening a file) and assigns it to the variable given in its second part. It then executes a block of code, and when that block is finished, it cleans up the stored value. "Cleaning up" a file means closing it; it means different things for databases and connections to hardware devices, but in every case, Python guarantees to do the right thing at the right time. We'll use with statements for file I/O from now on.
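As a small sketch of that guarantee, the snippet below writes a throwaway file and checks the file handle's closed attribute inside and after the with block. The temporary path and file name are assumptions made only so the example is self-contained, and the print calls are written so that they behave the same under Python 2 and Python 3.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on grid_rrgrr.txt.
path = os.path.join(tempfile.mkdtemp(), 'grid_demo.txt')
with open(path, 'w') as writer:
    writer.write('RRGRR\n')

with open(path, 'r') as reader:
    line = reader.readline()
    open_inside = not reader.closed   # the file is still open inside the block

closed_after = reader.closed          # the with statement has closed it for us

print('open inside block: {0}'.format(open_inside))
print('closed after block: {0}'.format(closed_after))
```

The variable reader still exists after the block, but the file it refers to has been closed, so any further attempt to read from it would fail.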
Finally, the fourth line of our original program prints the string we read. The result is 'RRGRR', just as expected. Or is it? Let's take a look at line's length:
In [7]:
print len(line)
Why does len tell us there are six characters instead of five? We can use another function called repr to take a closer look at what we actually read:
In [8]:
print repr(line)
repr stands for "representation". It returns whatever we'd have to type into a Python program to create the thing we've given it as a parameter. In this case, it's telling us that our string contains 'R', 'R', 'G', 'R', 'R', and '\n'. That last thing is called an escape sequence, and it's how Python represents a newline character in a string. We can use other escape sequences to represent other special characters:
In [9]:
print 'We\'ll put a single quote in a single-quoted string.'
print "Or we\"ll put a double quote in a double-quoted string."
print 'This\nstring\ncontains\nnewlines.'
print 'And\tthis\tone\tcontains\ttabs.'
If we create our file on Windows, it might contain 'RRGRR\r\n' instead of 'RRGRR\n'. The '\r' is a [carriage return](glossary.html#carriage_return), and it's there because Windows uses two characters to mark the ends of lines rather than just one. There's no reason to prefer one convention over the other, but problems do arise when we create files one way and try to read them with programs that expect the other. Python does its best to shield us from this by converting Windows-style '\r\n' end-of-line markers to '\n' as it reads data from files. If we really want to keep the original line endings, we need to use `'rb'` (for "read binary") when we open the file instead of just `'r'`. For more on this and other madness, see Joel Spolsky's article [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html).
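To see a Windows-style line ending for ourselves, here is a minimal sketch that writes one to a throwaway file and reads it back in binary mode. The file name is made up for the example, and the file is written with 'wb' so that the '\r\n' is stored exactly as given on every platform.

```python
import os
import tempfile

# Write a line that ends the way a Windows editor would end it.
path = os.path.join(tempfile.mkdtemp(), 'grid_windows.txt')
with open(path, 'wb') as writer:
    writer.write(b'RRGRR\r\n')

# 'rb' hands back the bytes exactly as they are stored in the file.
with open(path, 'rb') as reader:
    raw = reader.readline()

print(repr(raw))
print(len(raw))   # seven characters: five letters plus '\r' and '\n'
```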
The easiest way to get rid of our annoying newline character is to use str.strip, i.e., the strip method of the string data type. As its interactive help says:
In [10]:
help(str.strip)
str.strip creates a new string by removing any leading or trailing whitespace characters from the original (str.lstrip and str.rstrip remove only leading or trailing whitespace, respectively). Whitespace includes carriage return, newline, tab, and the familiar space character, so stripping the string also takes care of any accidental indentation or (invisible) trailing spaces:
In [11]:
original = ' indented with trailing spaces '
stripped = original.strip()
print '|{0}|'.format(stripped)
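For comparison, this short sketch applies all three methods to one made-up string that has spaces and a tab in front and spaces plus a Windows-style line ending behind. The sample string is an assumption chosen to show several kinds of whitespace at once.

```python
original = '  \tRRGRR  \r\n'

left_only = original.lstrip()    # leading whitespace removed
right_only = original.rstrip()   # trailing whitespace removed
both = original.strip()          # both ends cleaned up

# repr shows the whitespace that print alone would hide.
for name, value in [('lstrip', left_only), ('rstrip', right_only), ('strip', both)]:
    print('{0}: {1!r}'.format(name, value))
```

None of these calls changes original: strings cannot be modified, so each method returns a new string.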
Let's use this to fix our string and initialize our grid. In fact, let's write a function that takes a grid and a filename as parameters and fills the grid using the color specification in that file:
In [6]:
def color_from_file(grid, filename):
    'Color the cells in a grid using a spec stored in a file.'
    # The with statement closes the file for us, so no explicit close() is needed.
    with open(filename, 'r') as reader:
        line = reader.readline()
    line = line.strip()
    color_from_string(grid, line)

another_row = ImageGrid(5, 1)
color_from_file(another_row, 'grid_rrgrr.txt')
another_row.show()
That's progress, but we can do better. When we were creating grids and color strings in the same program, it was fairly easy to make sure the grid and the string were the same size. Opening a text file in an editor and counting the characters on the first line will be a lot more painful, so why don't we create the grid based on how long the string is?
In [7]:
def create_from_file(filename):
    'Create and color a grid using a spec stored in a file.'
    with open(filename, 'r') as reader:
        line = reader.readline()
    line = line.strip()
    grid = ImageGrid(len(line), 1)
    color_from_string(grid, line)
    return grid

newly_made = create_from_file('grid_rrgrr.txt')
newly_made.show()
This is starting to look like our friend skimage.novice.open: given a filename, it loads the data from that file into a suitable object in memory and gives the object back to us for further use. What's more, it does that using a function that initializes objects which are already in memory, so that we can fill things several times in exactly the same way without any duplicated code.
A single row of pixels is a lot less interesting than an actual image, but before we can read the latter, we need to learn how to use lists. Just as a for loop is a way to do operations many times, a list is a way to store many values in one variable. To start our exploration of lists, try this:
In [15]:
odds = [1, 3, 5]
for number in odds:
    print number
[1, 3, 5] is a list. Its elements are written in square brackets and separated by commas, and just as a for loop over a string works on those characters one at a time, a for loop over a list processes the list's values one by one.
Let's do something a bit more useful with a list of numbers:
In [16]:
data = [1, 4, 2, 3, 3, 4, 3, 4, 1]
total = 0.0
for n in data:
    total += n
mean = total / len(data)
print 'mean is', mean
By now, the logic here should be fairly easy to follow. data refers to our list, and total is initialized to 0.0. Each iteration of the loop adds the next number from the list to total, and when we're done, we divide the result by the list's length to get the mean. (Note that we initialize total to 0.0 rather than 0, so that it is always a floating-point number. If we didn't do this, its final value might be an integer, and the division could give us a truncated approximation to the actual mean.)
Python's built-in sum function lets us write the same calculation more compactly:
In [19]:
print 'mean is', float(sum(data)) / len(data)
Again, it's important to understand that `float( sum(data)/len(data) )` might not return the right answer, since it would do integer/integer division (producing a possibly-truncated result) and then convert that value to a float.
Lists are probably used more than any other data structure in programming, so let's have a closer look at them.
First, lists are ordered (i.e. they are sequences) and you can fetch a component object out of a list by indexing the list starting at index 0:
In [14]:
values = [1, 3, 5]
print values[0]
print values[1]
print values[2]
Second, lists are mutable, i.e., they can be changed after they are created:
In [20]:
values = [1, 3, 5]
values[0] = 'one'
values[1] = 'three'
values[2] = 'five'
print values
As the diagrams below show, this works because the list doesn't actually contain any values. Instead, it stores references to values. When we assign something to values[0], what we're really doing is putting a different reference in that location in the list. Let's quickly go through the block of code above line by line:
In [8]:
values = [1, 3, 5]
print values
In [13]:
values = [1, 3, 5]
values[0] = 'one'
print values
In [15]:
values = [1, 3, 5]
values[0] = 'one'
values[1] = 'three'
print values
In [17]:
values = [1, 3, 5]
values[0] = 'one'
values[1] = 'three'
values[2] = 'five'
print values
Third, lists are variable-length and can dynamically grow and shrink in place using list methods such as append() and remove(). For example:
In [2]:
data = [1, 4, 2, 3]
result = []
print 'The length of result before: ', len(result)
current = 0
for n in data:
    current = current + n
    result.append(current)
print 'running total:', result
print 'The length of result after: ', len(result)
result starts off as an empty list with a length of 0, and current starts off as zero. Each iteration of the loop adds the next value in the list data to current to calculate the running total. It then appends this value to result, so that when the program finishes we have a complete list of partial sums.
What if we want to double the values in data in place? We could try this:
In [9]:
data = [1, 4, 2, 3] # re-initialize our sample data
for n in data:
    n = 2 * n
print 'doubled data:', data
but as we can see, it doesn't work. When Python calculates 2*n it creates a new value in memory. It then makes the variable n point at the value for a few microseconds before going around the loop again and pointing n at the next value from the list instead. Since nothing is pointing to the temporary value we just created any longer, Python throws it away.
The right way to solve this problem is to use indexing and the range function:
In [6]:
data = [1, 4, 2, 3] # re-initialize our sample data
for i in range(4):
    data[i] = 2 * data[i]
print 'doubled data:', data
Once again we have violated the DRY Principle by using range(4): if we ever change the number of values in data, our loop will either fail because we're trying to index beyond its end, or what's worse, appear to succeed but not actually update some values. Let's fix that:
In [8]:
data = [1, 4, 2, 3] # re-initialize our sample data
for i in range(len(data)):
    data[i] *= 2
print 'doubled data:', data
That's better: len(data) is always the actual length of the list, so range(len(data)) is always the indices we need. We've also rewritten the multiplication and assignment to use an in-place operator *= so that we aren't repeating data[i].
We can do this more concisely using a list comprehension. This isn't exactly the same as the for loop solution above, because it creates a new list rather than modifying the existing one, but it is close enough for most applications:
In [12]:
data = [1, 4, 2, 3] # re-initialize our sample data
data = [n*2 for n in data]
print 'doubled data:', data
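The difference between creating a new list and changing one in place only shows up when a second name refers to the same list. The sketch below illustrates this with a hypothetical variable called alias. It also shows slice assignment, data[:] = ..., which is one way to keep the comprehension while still updating the original list in place.

```python
data = [1, 4, 2, 3]
alias = data                         # a second name for the same list

data = [n * 2 for n in data]         # builds a new list and rebinds data to it
rebound_data = list(data)
rebound_alias = list(alias)          # alias still refers to the old, unchanged list

data = [1, 4, 2, 3]
alias = data
data[:] = [n * 2 for n in data]      # replaces the contents of the existing list
inplace_alias = list(alias)          # alias sees the doubled values

print('after rebinding, alias is {0}'.format(rebound_alias))
print('after slice assignment, alias is {0}'.format(inplace_alias))
```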
We can also do a lot of other interesting things with lists, like concatenate them:
In [29]:
left = [1, 2, 3]
right = [4, 5, 6]
combined = left + right
print combined
count how many times a particular value appears in them:
In [31]:
data = ['a', 'c', 'g', 'g', 'c', 't', 'a', 'c', 'g', 'g']
print data.count('g')
sort them:
In [33]:
data.sort()
print data
and reverse them:
In [34]:
data.reverse()
print data
Note that sort and reverse change the list in place rather than returning a new one. If we assign the result of either method to a variable:
In [37]:
sorted_data = data.sort()
then all we have is the special value `None`, which Python uses to mean "there's nothing here":
In [36]:
print sorted_data
At some point or another, everyone types `data = data.sort()` and then wonders where their time series has gone…
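If we do want a sorted copy while leaving the original list alone, Python's built-in sorted function returns a new list. Here is a small sketch of the difference, using made-up sample data:

```python
data = ['g', 'a', 't', 'c']

in_order = sorted(data)        # returns a new, sorted list
untouched = list(data)         # data itself is still in its original order

returned = data.sort()         # sorts data in place and returns None

print('sorted copy: {0}'.format(in_order))
print('value returned by sort: {0}'.format(returned))
```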
Now that we know how to create lists, we're ready to load two-dimensional images from files. Here's our first test file:
In [38]:
!cat grid_3x3.txt
and here's how we read it line by line with Python:
In [39]:
with open('grid_3x3.txt', 'r') as source:
    for line in source:
        print line
Whoops: we forgot to strip the newlines off the ends of the lines as we read them from the file. Let's fix that:
In [40]:
with open('grid_3x3.txt', 'r') as source:
    for line in source:
        print line.strip()
That's better. As this example shows, a for loop over a file reads the lines from the file one by one and assigns each to the loop variable in turn. If we want to get all the lines at once, we can do this instead:
In [41]:
with open('grid_3x3.txt', 'r') as source:
    lines = source.readlines() # with an 's' on the end
print lines
file.readlines (with an 's' on the end to distinguish it from file.readline) reads the entire file at once and returns a list of strings, one per line. The length of this list tells us how many rows we need in our grid, while the length of the first line (minus the newline character) tells us how many columns we need:
In [42]:
with open('grid_3x3.txt', 'r') as source:
    lines = source.readlines()
height = len(lines)
width = len(lines[0]) - 1
print '{0}x{1} grid'.format(width, height)
Upon reflection, that's not a very good test case, since we can't tell if we have height and width the right way around. Let's use a rectangular data file:
In [43]:
!cat grid_5x3.txt
and put our code in a function:
In [44]:
def read_size(filename):
    with open(filename, 'r') as source:
        lines = source.readlines()
    return len(lines[0]) - 1, len(lines)

width, height = read_size('grid_5x3.txt')
print '{0}x{1} grid'.format(width, height)
As this example shows, a function can return several values at once. When it does, those values are matched against the caller's variables from left to right. This can actually be done anywhere:
In [46]:
red, green, blue = 255, 0, 128
print 'red={0} green={1} blue={2}'.format(red, green, blue)
and gives us an easy way to swap the values of two variables:
In [47]:
low, high = 25, 10 # whoops
low, high = high, low # exchange their values
print 'low={0} high={1}'.format(low, high)
Back to our function… Rather than just returning sizes, it would be more useful for us to create and fill in a grid. As we're doing this, though, we must remember to strip the newlines off the strings we have read from the file:
In [48]:
def read_grid(filename):
    with open(filename, 'r') as source:
        lines = source.readlines()
    width, height = len(lines[0]) - 1, len(lines)
    result = ImageGrid(width, height)
    for y in range(len(lines)):
        fill_grid_line(result, y, lines[y].strip())
    return result
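This is the most complicated function we've written so far, so let's go through it step by step:
1. We define read_grid to take a single parameter.
2. We use that parameter as the name of the file to open, and assign the resulting file handle to source.
3. We read all of the lines from the file at once and assign the resulting list to lines.
4. We work out the grid's width and height from those lines and create an ImageGrid of that size.
5. We loop over the lines, coloring one row of the grid per line with fill_grid_line.
6. We return the filled-in grid.
We need a new function fill_grid_line because the function we've been using, color_from_string, always colors row 0 of whatever grid it's given. We need something that can color any row we specify: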
In [49]:
def fill_grid_line(grid, y, data):
    "Color grid cells in row y red and green according to 'R' and 'G' in data."
    assert 0 <= y < grid.height, \
        'Row index {0} not within grid height {1}'.format(y, grid.height)
    assert grid.width == len(data), \
        'Grid and string lengths do not match: {0} != {1}'.format(grid.width, len(data))
    for x in range(grid.width):
        assert data[x] in 'GR', \
            'Unknown character in data string: "{0}"'.format(data[x])
        if data[x] == 'R':
            grid[x, y] = colors['Red']
        else:
            grid[x, y] = colors['Green']
As well as adding an extra parameter y to this function, we've added an extra assertion to make sure it's between 0 and the grid's height. In fact, we could have said, "Since we're adding an extra parameter, we've added an extra assertion," since it's good practice to check every input to a function before using it.
Let's give our functions a try:
In [50]:
rectangle = read_grid('grid_5x3.txt')
rectangle.show()
Perfect—or is it? Take another look at our data file:
In [51]:
!cat grid_5x3.txt
The 'G' in the top row of the data file is on the right, but the green square in the top row of the grid is on the left. The green cell in the bottom row of the grid is also in the wrong place. Somehow, our grid appears to be upside down.
The problem is that we haven't used a consistent coordinate system. ImageGrid uses a Cartesian grid with the origin in the lower left and Y going upward, but we're treating the file as if the origin were at the top, just as it is in a spreadsheet. The simplest way to fix this is to reverse our list of lines before using it:
In [54]:
def read_grid(filename):
    with open(filename, 'r') as source:
        lines = source.readlines()
    width, height = len(lines[0]) - 1, len(lines)
    result = ImageGrid(width, height)
    lines.reverse() # align with ImageGrid coordinate system
    for y in range(len(lines)):
        fill_grid_line(result, y, lines[y].strip())
    return result

rectangle = read_grid('grid_5x3.txt')
rectangle.show()
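Reversing the list is one way to line up the two coordinate systems; another is to leave the list alone and convert each file row number into a grid row number. The sketch below uses plain Python and a made-up list of lines (an assumption standing in for the contents of a data file) to show that the two approaches place every line in the same row.

```python
# Hypothetical file contents, top line of the file first.
lines = ['RRRRG', 'RRRRR', 'GRRRR']
height = len(lines)

# Option 1: convert each file row into a grid row, counting up from the bottom.
placement = {}
for row in range(height):
    y = height - 1 - row          # the top line of the file gets the largest y
    placement[y] = lines[row]

# Option 2: reverse a copy of the list, as read_grid does, and index it by y.
flipped = list(reversed(lines))

for y in range(height - 1, -1, -1):
    print('y={0}: {1}'.format(y, placement[y]))
```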
All that's left is to make sure that all the lines are the same length so that we're warned of an error if we try to use a file like this:
In [56]:
!cat grid_ragged.txt
Since we require all the lines to be the same length, we can compare their lengths against the length of any one line. We can do this in a loop of its own:
for line in lines:
    assert len(line.strip()) == width
or put the test in the loop that's filling the lines:
for y in range(len(lines)):
    assert len(lines[y].strip()) == width
    fill_grid_line(result, y, lines[y].strip())
The first does the checks before it makes any changes to the grid. Since we're creating the grid inside the function, though, this isn't a real worry: if there's an error in the file, our assertion will cause the function to fail and the partially-initialized grid will never be returned to the caller. We will therefore use the second form, but modify it slightly so that we only call strip once (DRY again):
In [72]:
def read_grid(filename):
    "Initialize a grid by reading lines of 'R' and 'G' from a file."
    with open(filename, 'r') as source:
        lines = source.readlines()
    width, height = len(lines[0]) - 1, len(lines)
    result = ImageGrid(width, height)
    lines.reverse()
    for y in range(len(lines)):
        string = lines[y].strip()
        assert len(string) == width, \
            'Line {0} is {1} long, not {2}'.format(y, len(string), width)
        fill_grid_line(result, y, string)
    return result
As always, we're not done until we test our change:
In [74]:
read_grid('grid_ragged.txt')
And of course we should make sure that it still works for a valid file:
In [75]:
once_more = read_grid('grid_5x3.txt')
once_more.show()
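The same length check can be tried without ipythonblocks. The sketch below is a simplified variant rather than the lesson's function: read_rows is a hypothetical helper that returns the stripped lines instead of a grid, and it takes the width from the stripped first line instead of subtracting one for the newline. The two test files are created on the fly so the example is self-contained.

```python
import os
import tempfile

def read_rows(filename):
    "Return the stripped lines of a grid file, checking that all have the same length."
    with open(filename, 'r') as source:
        rows = [line.strip() for line in source]
    width = len(rows[0])
    for y in range(len(rows)):
        assert len(rows[y]) == width, \
            'Line {0} is {1} long, not {2}'.format(y, len(rows[y]), width)
    return rows

# Create one well-formed file and one ragged file to test against.
folder = tempfile.mkdtemp()
good = os.path.join(folder, 'good.txt')
ragged = os.path.join(folder, 'ragged.txt')
with open(good, 'w') as writer:
    writer.write('RRRRG\nRRRRR\nGRRRR\n')
with open(ragged, 'w') as writer:
    writer.write('RRRRG\nRRR\nGRRRR\n')

rows = read_rows(good)
try:
    read_rows(ragged)
    message = 'no error reported'
except AssertionError as error:
    message = str(error)

print(rows)
print(message)
```

Measuring the stripped first line means the width is still right if the file uses Windows line endings or if its only line has no trailing newline.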
We now have all the concepts we need to create thumbnails for a set of images, and almost all the tools. The one remaining piece of the puzzle is the unpleasantly-named glob:
In [80]:
import glob
print 'text files:', glob.glob('*.txt')
In [81]:
print 'IPython Notebooks:', glob.glob('*.ipynb')
"glob" was originally short for "global command", but it has long since become a verb in its own right. It takes a single string as a parameter and uses it to do wildcard matching on filenames, returning a list of matches as a result. Once we have this list, we can loop over it and create thumbnails one by one:
In [83]:
from skimage import novice
from glob import glob
DEFAULT_WIDTH = 100
def make_all_thumbnails(pattern, width=DEFAULT_WIDTH):
    "Create thumbnails for all image files matching the given pattern."
    for filename in glob(pattern):
        make_thumbnail(filename, width)

def make_thumbnail(original_filename, width=DEFAULT_WIDTH):
    "Create a thumbnail for a single image file."
    picture = novice.open(original_filename)
    new_height = int(picture.height * float(width) / picture.width)
    picture.size = (width, new_height)
    thumbnail_filename = 'thumbnail-' + original_filename
    picture.save(thumbnail_filename)
The only thing that's really new here is the way we specify the default value for thumbnail widths. Since people might call both make_all_thumbnails and make_thumbnail directly, we want to be able to set the width for either. However, we also want their default values to be the same, so we define that value once near the top of the program and use it in both function definitions. By convention, "constant" values like DEFAULT_WIDTH are spelled in UPPER CASE to indicate that they shouldn't be changed.
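One detail to watch for: if the pattern passed to glob includes a directory, such as 'photos/*.jpg', each matching filename includes that directory too, so putting 'thumbnail-' in front of the whole string would yield a name like 'thumbnail-photos/flower.jpg'. The sketch below shows one possible way to handle this with the standard os.path module. The helper name thumbnail_name and the example paths are assumptions made for illustration, not part of the lesson's code.

```python
import os

def thumbnail_name(original_filename, prefix='thumbnail-'):
    "Build the thumbnail's path, keeping it in the same directory as the original."
    folder, name = os.path.split(original_filename)
    return os.path.join(folder, prefix + name)

plain = thumbnail_name('flower.jpg')
nested = thumbnail_name(os.path.join('photos', 'flower.jpg'))

print(plain)
print(nested)
```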
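Open a file for reading using file_handle = open(filename, 'r') and close the file using file_handle.close().
open(...) returns a file handle object that makes a connection between a program and the file.
Each line read from a file has a newline character (\n) at the end of the line.
Strip whitespace, including the newline, from a line using line.strip().
Lists have methods such as count(), sort(), and reverse().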