The IPython Notebook and other interactive tools are great for prototyping code and exploring data, but sooner or later we will want to use our program in a pipeline or run it in a shell script to process thousands of data files. In order to do that, we need to make it work like other Unix command-line tools.
In [46]:
!cat fractal_1.txt
We want to calculate statistics on these fractals; more specifically, we want a program that will read one or more files and report the average density of each row. For a file like the one above, the output might look something like this:
$ fracdens file_1.txt
0.25
0.5
0.5
0.375
0.25
0.375
but we might also want to look at the densities of the first four lines:
head -4 file_1.txt | fracdens
or the densities of several files one after another:
fracdens file_1.txt file_2.txt
or merge densities of several files of the same size:
fracdens -m file_1.txt file_2.txt file_3.txt
Our overall requirements are that the program must report the average density of each row of its input, that it must handle one or more filenames given on the command line or read from standard input when no filenames are given, and that the -m flag must make it merge the data from several files of the same size instead of reporting on each file separately.
To make this work, we need to know how to handle command-line arguments in a program, and how to get at standard input. We'll tackle these questions in turn below.
Using the text editor of your choice, save the following in a text file:
In [47]:
!cat sys_version.py
The first line imports a library called sys, which is short for "system". It defines values such as sys.version, which describes which version of Python we are running.
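The notebook doesn't capture the file's contents at this point, so here is a minimal sketch of what sys_version.py presumably looks like (the exact wording of the output is a guess):

# Plausible sketch of sys_version.py: report which Python we are running.
import sys
print 'version is', sys.version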
We can run this script from within the IPython Notebook like this:
In [48]:
%run sys_version.py
or like this:
In [49]:
!ipython sys_version.py
The first method, %run, uses a special command in the IPython Notebook to run a program in a .py file. The second method is more general: the exclamation mark ! tells the Notebook to run a shell command, and it just so happens that the command we run is ipython with the name of the script.
Here's another script that does something more interesting:
In [50]:
!cat argv_list.py
The strange name argv
stands for "argument values".
Whenever Python runs a program,
it takes all of the values given on the command line
and puts them in the list sys.argv
so that the program can determine what they were.
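The listing of argv_list.py isn't captured above either; a minimal version would look something like this sketch (treat the details as assumptions):

# Plausible sketch of argv_list.py: show what was on the command line.
import sys
print 'sys.argv is', sys.argv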
If we run this program with no arguments:
In [51]:
!ipython argv_list.py
the only thing in the list is the full path to our script, which is always sys.argv[0].
If we run it with a few arguments, however:
In [52]:
!ipython argv_list.py first second third
then Python adds each of those arguments to that magic list. And if we use a wildcard:
In [53]:
!ipython argv_list.py fractal_*.txt
then the shell expands it before calling our script,
so that sys.argv
holds the complete list of arguments,
rather than the string containing the wildcard.
Note, by the way, that the %run magic does almost the same thing—the only difference is that we get the relative path to the script instead of the absolute path as sys.argv[0]:
In [54]:
%run argv_list.py fractal_*.txt
With this in hand,
let's build a version of fracdens
that processes one or more files independently of each other
(i.e., that doesn't look for a -m flag, and doesn't read from standard input).
The first step is to write a main
function that outlines our implementation,
and a placeholder for the function that does the actual work:
In [55]:
!cat fracdens_1.py
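The listing isn't captured in this copy of the notebook, but based on the description that follows, fracdens_1.py is presumably something like this sketch:

# Plausible sketch of fracdens_1.py: a main function that outlines the
# implementation, plus a placeholder for the function that will do the
# real work.  The details are guesses; note that, deliberately, nothing
# calls these functions yet and sys has not been imported.
def main():
    script = sys.argv[0]
    filenames = sys.argv[1:]
    for f in filenames:
        process(f)

def process(filename):
    pass # placeholder: calculate and report densities here later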
This function gets the name of the script from sys.argv[0], because that's where it's always put, and the list of files to be processed from sys.argv[1:]. The colon inside the brackets is important: the expression list[1:] means, "All the elements of the list from index 1 to the end."
Here's a simple test:
In [56]:
%run fracdens_1.py fractal_1.txt
There is no output because we have defined two functions,
but haven't actually called either of them.
Let's add a call to main:
In [57]:
!cat fracdens_2.py
and run that:
In [58]:
%run fracdens_2.py fractal_1.txt
Oops:
we have imported sys
in this notebook,
but we haven't imported it in our script,
which is being run in a separate instance of Python.
Let's make one more change:
In [59]:
!cat fracdens_3.py
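Again the listing isn't shown here; relative to the sketch above, fracdens_3.py presumably adds the missing import and the call to main, along these lines:

# Plausible sketch of fracdens_3.py: the same outline, plus the import
# of sys and the call to main.  The placeholder now echoes each filename
# so that we can check the arguments are being handled correctly.
import sys

def main():
    script = sys.argv[0]
    filenames = sys.argv[1:]
    for f in filenames:
        process(f)

def process(filename):
    print filename # placeholder for the real density calculation

main()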
and run that:
In [60]:
%run fracdens_3.py fractal_1.txt
Success! Now, what if we run it with several filenames?
In [61]:
%run fracdens_3.py fractal_*.txt
Good: we appear to be getting the filenames correctly.
The next step is to teach our program to handle the -m flag that tells it to merge data from all the files.
By convention,
flags always appear before lists of filenames,
so we could simply do this:
In [62]:
def handle_args():
    script = sys.argv[0]
    if sys.argv[1] == '-m':
        merge_data = True
        filenames = sys.argv[2:]
    else:
        merge_data = False
        filenames = sys.argv[1:]
    return script, merge_data, filenames
But there are at least three things wrong with this approach:
It doesn't scale:
if our program eventually takes several arguments
(e.g., to control what statistics are calculated),
we're going to have a lot of branches in that if
statement.
It contains a bug: if we don't provide any arguments or filenames, the check of sys.argv[1] on line 3 will fail with an IndexError, because sys.argv will only contain the script's name.
This means that:
fracdens < fractal_221.txt
will blow up instead of printing statistics for fractal_221.txt
(which the program is reading from standard input).
It's hard to test. As written, handle_args always gets data from sys.argv, which means we can't write unit tests for it using something like Ears.
What we really ought to do is pass in the list of strings to be processed, so that we can write unit tests, and then call handle_args with sys.argv as an argument from main, and with other lists of strings as arguments from tests.
Here's a better version of handle_args
that uses Python's optparse
library to handle arguments:
In [63]:
from optparse import OptionParser

def handle_args(args):
    script, rest = args[0], args[1:]
    parser = OptionParser()
    parser.add_option('-m', '--merge', dest='merge', help='Merge data from all files',
                      default=False, action='store_true')
    options, args = parser.parse_args(args=rest)
    return script, options, args
The optparse library defines a tool called OptionParser, which knows how to handle complex arrangements of command-line parameters. Line 5 of the code above creates one of these; line 6 then tells it that our command line may contain one flag that has a short form -m and a long form --merge (which are interchangeable), whose value is stored in the merge member of the options object, and which defaults to False but is set to True if the flag is present. Let's try it out:
In [64]:
script, flags, filenames = handle_args(['fracdens', 'fractal_1.txt'])
print 'script name is', script
print 'flags.merge is', flags.merge
print 'filenames are', filenames
And again:
In [65]:
script, flags, filenames = handle_args(['fracdens', '-m', 'fractal_1.txt', 'fractal_2.txt'])
print 'script name is', script
print 'flags.merge is', flags.merge
print 'filenames are', filenames
Of course, the right thing to do here is to write a few unit tests:
In [66]:
import ears

def test_parse_no_args():
    script, flags, filenames = handle_args(['fracdens'])
    assert script == 'fracdens'
    assert not flags.merge
    assert filenames == []

def test_parse_one_filename():
    script, flags, filenames = handle_args(['fracdens', 'fractal_1.txt'])
    assert script == 'fracdens'
    assert not flags.merge
    assert filenames == ['fractal_1.txt']

def test_parse_just_merge():
    script, flags, filenames = handle_args(['fracdens', '-m'])
    assert script == 'fracdens'
    assert flags.merge
    assert filenames == []

def test_parse_merge_multiple():
    script, flags, filenames = handle_args(['fracdens', '-m', 'fractal_1.txt', 'fractal_2.txt'])
    assert script == 'fracdens'
    assert flags.merge
    assert len(filenames) == 2

def test_flag_after_filenames():
    try:
        script, flags, filenames = handle_args(['fracdens', 'fractal_1.txt', '-m'])
        assert False, 'Should have had an exception'
    except:
        pass # exception as expected

def test_unknown_flag():
    try:
        script, flags, filenames = handle_args(['fracdens', '-X', 'fractal_1.txt'])
        assert False, 'Should have had an exception'
    except:
        pass # exception as expected

ears.run()
All of our tests pass,
though OptionParser
does produce an error message during one of them.
A warning light that doesn't work is worse than no warning light at all, since it gives people a false sense of security. Similarly, error handling that doesn't actually handle errors can fool programmers into thinking that X, Y, or Z **can't** be wrong when it actually is. Our unit tests therefore ought to check that the right exceptions are raised when they should be, and as the code above shows, there's a pattern for doing this:
try:
    function_that_should_raise_exception()
    assert False, 'Exception was not raised!'
except:
    pass # do nothing because exception was raised correctly
Let's add handle_args
to our program:
In [67]:
!cat fracdens_4.py
This runs, but it doesn't take merging into account. If flags.merge is True, process should produce one set of statistics for all of our files instead of one set per file.
We could handle this as a special case:
if flags.merge:
    process_all_files(filenames)
else:
    process_single_file(filenames[0])
but there's a simpler way. If our program merges data, but only has one file to process, it will produce the statistics for just that file. We can therefore do this:
if flags.merge:
    process_all_files(filenames)
else:
    for f in filenames:
        temp = [f]
        process_all_files(temp)
i.e., create a list containing just one filename for each filename we have, and process those lists one by one. Let's simplify this a bit:
if flags.merge:
    process(filenames)
else:
    for f in filenames:
        process([f])
Now,
what does process
look like?
def process(filenames):
    print ...
Hm:
what are we supposed to print?
If filenames
contains just one filename, we print that
(since we're printing statistics for each file separately),
while if it contains more than one,
we print something like the word "all".
That would work,
but putting a test on the length of filenames
in process
feels like an awkward design.
Instead,
let's modify it so that the main program decides what should be printed:
if flags.merge:
    process('all', filenames)
else:
    for f in filenames:
        process(f, [f])
This feels better: the decision about processing files one by one or all together is made in main, and so is the decision about what to print in the output.
Our placeholder version of process
is then something like:
def process(title, filenames):
    print title
    print 'files:',
    for f in filenames:
        print f, # eventually replace this with real code
    print # make sure there's a newline at the end
Let's try running that:
In [68]:
%run fracdens_5.py fractal_1.txt
In [69]:
%run fracdens_5.py -m fractal_*.txt
We now have five Python files named `fracdens_1.py` to `fracdens_5.py`. You should **not** do this when you are writing code yourself: instead, you should use a version control system to manage the file's evolution, and commit it each time you add some useful feature. We can't do this because we want to display successive versions simultaneously in a notebook, but that's a rare case.
The next thing our program has to do is read data from standard input if no filenames are given so that we can put it in a pipeline, redirect input to it, and so on. Let's experiment in another script:
In [70]:
!cat count_stdin.py
This little program reads lines from a special "file" called sys.stdin, which is actually the program's standard input. We don't have to open it—Python and the operating system automatically take care of that between themselves when the program is run—but we can do almost anything with it that we could do to a regular file.
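The listing of count_stdin.py isn't captured here; judging from its name and the description above, it is probably something like this sketch:

# Plausible sketch of count_stdin.py: count the lines arriving on
# standard input and report how many there were.
import sys

count = 0
for line in sys.stdin:
    count += 1
print count, 'lines in standard input'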
Let's try running it as if it were a regular command-line program:
In [71]:
!ipython count_stdin.py < fractal_1.txt
What if we run it using %run?
In [72]:
%run count_stdin.py < fractal_1.txt
As you can see,
%run
doesn't understand file redirection:
that's a shell thing.
A common mistake is to try to run something that reads from standard input like this:
!ipython count_stdin.py fractal_1.txt
i.e., to forget the < character that redirects the file to standard input.
In this case,
there's nothing in standard input,
so the program waits at the start of the loop for someone to type something on the keyboard.
Since there's no way for us to do this,
our program is stuck,
and we have to halt it using the Interrupt
option from the Kernel
menu in the Notebook.
We now need to rewrite process to handle standard input as well as files on disk, and to handle merging. There are three cases: if no filenames are given, we read from standard input; if the -m flag is given, we merge the data from all of the files; otherwise, we process each file on its own. This gives us the following main program (which we've called fracdens_6.py):
In [76]:
def main():
    script, flags, filenames = handle_args(sys.argv)
    if filenames == []:
        process('stdin', None)
    elif flags.merge:
        process('all', filenames)
    else:
        for f in filenames:
            process(f, [f])
and this update to process:
In [77]:
def process(title, filenames):
    if filenames is None:
        densities = calc_density(sys.stdin)
    else:
        for f in filenames:
            with open(f, 'r') as source:
                densities = calc_density(source)
    display(title, densities)
This version of process
uses sys.stdin
as a data source if there aren't any filenames,
and opens files one by one to create a readable source if there are.
It then uses two functions calc_density
and display
to calculate densities and display the results.
These ought to be straightforward to write,
but before we dive into them,
we have a bug to fix.
Consider what happens if we run process
with a list of filenames.
We're supposed to merge all the data from them to find the overall average density,
but what we're actually doing is calculating densities for each file separately,
and then reporting the densities of the last file.
Somehow,
we need to accumulate statistics across all of our files.
Here's one approach:
In [78]:
def process(title, filenames):
    if filenames is None:
        densities = calc_density(sys.stdin)
    else:
        with open(filenames[0], 'r') as source:
            densities = calc_density(source)
        for f in filenames[1:]:
            with open(f, 'r') as source:
                combine_densities(densities, source)
    display(title, densities)
As you can (hopefully) guess from the names we've chosen for our functions,
calc_density
calculates densities for a single file,
while combine_densities
combines data from an open file with a running total of densities seen so far.
The else
branch of the function starts by getting the densities of the first data set,
then uses combine_densities
to add in the densities from all the other data sets.
If there aren't any—i.e.,
if filenames[1:]
is the empty list—
then combine_densities
will never be called,
and densities
will just hold the densities from the first file.
This is on the right track,
but it doesn't quite work.
The density of a line in our fractal is defined to be
the number of filled cells in a line
divided by the width of that line.
Adding or averaging the densities from different files one by one
isn't going to give us the right answer.
Instead,
we need to add up the number of filled cells per line across all our files,
and divide by the total width—i.e., the number of files times their width—at the end.
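Here is a tiny worked example of that point, using made-up numbers: suppose three files each contain one row of width 3, with 1, 2, and 3 filled cells respectively.

# Correct answer: total filled cells divided by total width.
print (1 + 2 + 3) / (3 * 3.0)     # 0.666..., i.e., 2/3

# Folding densities together one pair at a time loses track of how many
# files we have already seen, so earlier files get under-weighted.
running = 1 / 3.0                 # density of the first file
running = (running + 2 / 3.0) / 2 # fold in the second file: 0.5
running = (running + 3 / 3.0) / 2 # fold in the third file: 0.75, not 2/3
print running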
Let's rewrite process
one more time:
In [79]:
def process(title, filenames):
    if filenames is None:
        number = 1
        width, filled = count(sys.stdin)
    else:
        number = len(filenames)
        with open(filenames[0], 'r') as source:
            width, filled = count(source)
        for f in filenames[1:]:
            with open(f, 'r') as source:
                filled = combine(source, filled)
    display(title, filled, number * width)
Instead of calculating densities file by file,
we're using two functions called count
and combine
to count the number of filled cells per line
and combine a list of counts seen so far with counts from yet another file.
The first of these functions returns both the width of the data and the list of counts per line
so that we can correctly calculate averages,
and both branches of the if
set the value of number
for this purpose as well.
We could now go ahead and write count, combine, and display, but we can simplify things one more time before doing so.
count
is going to process input data line by line and return a list of numbers.
combine
is going to do this as well;
the only difference is,
it will produce its output by adding line counts to existing totals.
Let's rewrite process so that all the reading and line-by-line counting happens in count, and combine just adds values together:
In [89]:
def process(title, filenames):
    if filenames is None:
        number = 1
        width, filled = count(sys.stdin)
    else:
        number = len(filenames)
        with open(filenames[0], 'r') as source:
            width, filled = count(source)
        for f in filenames[1:]:
            new_width, new_filled = count(source)
            assert new_width == width, 'File widths are not the same'
            filled = combine(filled, new_filled)
    display(title, filled, number * width)
We're finally done with process: each of the functions it depends on does exactly one simple job, and we've even included a self-check to make sure that all the input files have the same width.
Our three remaining functions are now almost trivial to write:
In [90]:
def count(source):
    result = []
    for line in source:
        line = line.strip()
        width = len(line)
        n = line.count('1')
        result.append(n)
    return width, result

def combine(left, right):
    assert len(left) == len(right), 'Data sets have unequal lengths'
    result = []
    for i in range(len(left)):
        result.append(left[i] + right[i])
    return result

def display(title, counts, scaling):
    print title
    for c in counts:
        print float(c) / scaling
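As a quick sanity check, we can run count and combine by hand on a couple of made-up lines (a list of strings behaves like an open file as far as the loop in count is concerned):

lines = ['0110\n', '1111\n']  # hypothetical data, just to exercise the functions
print count(lines)            # prints (4, [2, 4])
print combine([2, 4], [1, 0]) # prints [3, 4]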
Let's try running it on a single input file:
In [91]:
%run fracdens_6.py fractal_1.txt
If we double-check the file (which is nine cells wide):
In [92]:
!cat fractal_1.txt
that seems to be the right answer: 1/9, two lines of 4/9, a 3/9, a 2/9, and another 3/9. Let's try three files separately:
In [93]:
%run fracdens_6.py fractal_1.txt fractal_2.txt fractal_3.txt
and standard input:
In [94]:
!ipython fracdens_6.py < fractal_1.txt
All that's left to test is merging. Clearly, if we "merge" one file, we should get the same answer that we got for that file:
In [95]:
%run fracdens_6.py -m fractal_1.txt
and if we merge data for that file with itself, we should get the same answer (which is an easier thing to check than merging data from two different files):
In [97]:
%run fracdens_6.py -m fractal_1.txt fractal_1.txt
Why does Python think we're trying to read from a closed file? If we take a closer look at process, we see that the loop handling filenames[1:] wasn't actually opening any files, so it was trying to read from the same source that was opened and closed for filenames[0].
Let's update process one more time to create fracdens_7.py:
In [98]:
def process(title, filenames):
    if filenames is None:
        number = 1
        width, filled = count(sys.stdin)
    else:
        number = len(filenames)
        with open(filenames[0], 'r') as source:
            width, filled = count(source)
        for f in filenames[1:]:
            with open(f, 'r') as source:
                new_width, new_filled = count(source)
                assert new_width == width, 'File widths are not the same'
                filled = combine(filled, new_filled)
    display(title, filled, number * width)
and run that:
In [99]:
%run fracdens_7.py -m fractal_1.txt fractal_1.txt
Good: that's the same answer that we had before. Let's try merging all three files:
In [100]:
%run fracdens_7.py -m fractal_1.txt fractal_2.txt fractal_3.txt
That might be right—at least, it isn't obviously wrong—but how can we be sure? Let's create three simplified input files:
In [102]:
!cat test_1.txt
In [103]:
!cat test_2.txt
In [104]:
!cat test_3.txt
and try merging those:
In [105]:
%run fracdens_7.py -m test_1.txt test_2.txt
In [106]:
%run fracdens_7.py -m test_1.txt test_2.txt test_3.txt
Those values are much easier to check: the average of 1/3 and 2/3 is 1/2, and the average of 1/3, 2/3, and 3/3 is 2/3.
In [107]:
!ipython fracdens_7.py -m < test_1.txt
That's certainly not what we expect. After a bit of digging, it turns out that `ipython` assumes the `-m` flag belongs to it, rather than to our script. This doesn't show up when we run the program with `%run` because we're not launching a separate command-line instance of IPython in that case. To make sure that arguments are actually passed to our script, we need to put them all after a double dash (`--`) so that IPython can tell which are its and which are ours:
In [108]:
!ipython fracdens_7.py -- -m < test_1.txt
One of the rules in the previous lesson was, "If you're writing a loop, you're probably doing it wrong." Let's take a look at how we could eliminate most of the loops in our density calculator using NumPy arrays and another feature of Python we haven't encountered yet: list comprehensions. Suppose we have a list of numbers:
In [1]:
nums = [2, 5, 9]
Instead of writing a loop to create a list of their squares, we can do this:
In [2]:
squares = [x ** 2 for x in nums]
print squares
This doesn't change our original list:
In [3]:
print nums
and the name of the temporary variable inside the comprehension doesn't matter:
In [4]:
print [something_else ** 2 for something_else in nums]
Used sparingly, list comprehensions make a lot of programs more readable—just compare the list comprehension form of this calculation with the loop-and-conditional form:
In [7]:
from math import sqrt
signal = [-0.7, -0.3, -0.1, 0.2, 0.3, 0.5]
# The easy way
pos_roots = [sqrt(s) for s in signal if s >= 0]
# The hard way
pos_roots = []
for s in signal:
    if s >= 0:
        pos_roots.append(sqrt(s))
Let's use list comprehensions to clean up our code. First, though, let's clean up our data. It's simple to represent our fractals using tightly-packed 1's and 0's, but nobody else's software will recognize that data format. If we use something standard, like comma-separated values (CSV), we can get rid of our own parsing code. This makes our data files somewhat larger:
In [1]:
!cat csv_1.txt
but we can now read our data with a single statement:
In [3]:
import numpy as np
print np.loadtxt('csv_1.txt', delimiter=',')
Using this,
we can rewrite process
as follows:
In [4]:
def process(title, filenames):
    if filenames is None:
        data = np.loadtxt(sys.stdin, delimiter=',')
        display(title, data, 1)
    else:
        results = [np.loadtxt(f, delimiter=',') for f in filenames]
        assert all([x.shape == results[0].shape for x in results]), 'File sizes differ'
        for r in results[1:]:
            results[0] += r
        display(title, results[0], len(results))
Let's go through this line by line: if there are no filenames, process reads data from standard input using np.loadtxt, then passes that two-dimensional array to display, along with a 1 to indicate that we've only got one data set. Otherwise, it reads every file into its own array with a list comprehension, checks that all of the arrays have the same shape, adds them together in results[0], and then calls display with the merged data, passing in the totals and the number of arrays we read. The display function now needs to be rewritten to do two things:
scale the counts per row,
and show them.
This isn't a great design,
as it violates our "one purpose per function" rule,
but it's good enough for now:
In [5]:
def display(title, data, number):
    print title
    scaling = float(number * data.shape[1])
    densities = data.sum(1) / scaling
    for d in densities:
        print d
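To make the data.sum(1) call concrete, here is what it does to a small made-up array: it sums along axis 1, i.e., across each row, giving one total per row, which we then scale to get densities.

small = np.array([[0, 1, 1],
                  [1, 1, 1]])
print small.sum(1)       # prints [2 3]
print small.sum(1) / 3.0 # row densities, roughly [0.667, 1.0]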
The program we have now produced runs exactly the same way as the previous versions:
In [7]:
%run fracdens_8.py csv_1.txt csv_2.txt
but it's two-thirds the size:
In [9]:
!wc fracdens_7.py fracdens_8.py
The new program isn't noticeably faster than the old one, though, since the time to do the calculations is dwarfed by the time needed to read the data into the program.
Is the NumPy version better than the list-based version? The answer depends on the audience we have in mind. Programmers who know Python well, and who are used to thinking in terms of applying operations to entire data sets at once, would probably have written something like the final version right from the start. Programmers who aren't that familiar with Python's features, on the other hand, would probably find the loop-based version easier to understand because it spells out the steps the program is taking, rather than the results it's producing.
These differences highlight one of the fundamental problems in programming (and indeed in any other activity that requires expertise): things that are comprehensible to a novice are painfully slow for an expert to read, while things that are natural for an expert are often opaque to novices. This doesn't mean that either is right or wrong: it's just one manifestation of the cognitive changes that occur as a task goes from being a mystery to being possible to being easy.
The same thing is true of documentation: a tutorial aimed at novices can be infuriating for experts to read, since the information they want is scattered so thinly, while a manual page for experts can seem like gibberish to people who don't yet have a mental model of the domain.