Basic Programming With Python: Back to the Command Line

Objectives

FIXME

Lesson

The IPython Notebook and other interactive tools are great for prototyping code and exploring data, but sooner or later we will want to use our program in a pipeline or run it in a shell script to process thousands of data files. In order to do that, we need to make it work like other Unix command-line tools.


In [46]:
!cat fractal_1.txt


000100000
001100110
000111100
001110000
001100000
001110000

We want to calculate statistics on these fractals; more specifically, we want a program that will read one or more files and report the average density of each row. For the file above, the output would be:

$ fracdens fractal_1.txt
0.111111111111
0.444444444444
0.444444444444
0.333333333333
0.222222222222
0.333333333333

but we might also want to look at the densities of just the first four lines:

head -4 fractal_1.txt | fracdens

or the densities of several files one after another:

fracdens fractal_1.txt fractal_2.txt

or merge densities of several files of the same size:

fracdens -m fractal_1.txt fractal_2.txt fractal_3.txt

Our overall requirements are:

  1. If no filename is given on the command line, read data from standard input.
  2. If one or more filenames are given, read data from them and report statistics for each file separately.
  3. If the -m flag is given, merge the data for several files (i.e., calculate the average density per line across those files).

To make this work, we need to know how to handle command-line arguments in a program, and how to get at standard input. We'll tackle these questions in turn below.

Command-Line Arguments

Using the text editor of your choice, save the following in a text file:


In [47]:
!cat sys_version.py


import sys
print 'version is', sys.version

The first line imports a library called sys, which is short for "system". It defines values such as sys.version, which describes which version of Python we are running. We can run this script from within the IPython Notebook like this:


In [48]:
%run sys_version.py


version is 2.7.5 |Anaconda 1.6.1 (x86_64)| (default, Jun 28 2013, 22:20:13) 
[GCC 4.0.1 (Apple Inc. build 5493)]

or like this:


In [49]:
!ipython sys_version.py


version is 2.7.5 |Anaconda 1.6.1 (x86_64)| (default, Jun 28 2013, 22:20:13) 
[GCC 4.0.1 (Apple Inc. build 5493)]

The first method, %run, uses a special command in the IPython Notebook to run a program in a .py file. The second method is more general: the exclamation mark ! tells the Notebook to run a shell command, and it just so happens that the command we run is ipython with the name of the script.

Here's another script that does something more interesting:


In [50]:
!cat argv_list.py


import sys
print 'sys.argv is', sys.argv

The strange name argv stands for "argument values". Whenever Python runs a program, it takes all of the values given on the command line and puts them in the list sys.argv so that the program can determine what they were. If we run this program with no arguments:


In [51]:
!ipython argv_list.py


sys.argv is ['/Users/gwilson/bc/lessons/swc-python/argv_list.py']

the only thing in the list is the full path to our script, which is always sys.argv[0]. If we run it with a few arguments, however:


In [52]:
!ipython argv_list.py first second third


sys.argv is ['/Users/gwilson/bc/lessons/swc-python/argv_list.py', 'first', 'second', 'third']

then Python adds each of those arguments to that magic list. And if we use a wildcard:


In [53]:
!ipython argv_list.py fractal_*.txt


sys.argv is ['/Users/gwilson/bc/lessons/swc-python/argv_list.py', 'fractal_1.txt', 'fractal_2.txt', 'fractal_3.txt']

then the shell expands it before calling our script, so that sys.argv holds the complete list of arguments, rather than the string containing the wildcard. Note, by the way, that the %run magic does almost the same thing—the only difference is that we get the relative path to the script instead of the absolute path as sys.argv[0]:


In [54]:
%run argv_list.py fractal_*.txt


sys.argv is ['argv_list.py', 'fractal_1.txt', 'fractal_2.txt', 'fractal_3.txt']

With this in hand, let's build a version of fracdens that processes one or more files independently of each other (i.e., that doesn't look for a -m flag, and doesn't read from standard input). The first step is to write a main function that outlines our implementation, and a placeholder for the function that does the actual work:


In [55]:
!cat fracdens_1.py


def main():
    script = sys.argv[0]
    filenames = sys.argv[1:]
    for f in filenames:
        process(f)

def process(filename):
    print filename

This function gets the name of the script from sys.argv[0], because that's where it's always put, and the list of files to be processed from sys.argv[1:]. The colon inside the brackets is important: the expression list[1:] means, "All the elements of the list from index 1 to the end."
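
If slicing is new, here's a quick illustration with a made-up list (the values are hypothetical, just to show the indexing):

args = ['myscript.py', 'first.txt', 'second.txt']
print args[0]    # prints myscript.py
print args[1:]   # prints ['first.txt', 'second.txt']

Here's a simple test of our script: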


In [56]:
%run fracdens_1.py fractal_1.txt

There is no output because we have defined two functions, but haven't actually called either of them. Let's add a call to main:


In [57]:
!cat fracdens_2.py


def main():
    script = sys.argv[0]
    filenames = sys.argv[1:]
    for f in filenames:
        process(f)

def process(filename):
    print filename

# Run the program.
main()

and run that:


In [58]:
%run fracdens_2.py fractal_1.txt


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/Users/gwilson/anaconda/lib/python2.7/site-packages/IPython/utils/py3compat.pyc in execfile(fname, *where)
    202             else:
    203                 filename = fname
--> 204             __builtin__.execfile(filename, *where)

/Users/gwilson/bc/lessons/swc-python/fracdens_2.py in <module>()
      9 
     10 # Run the program.
---> 11 main()

/Users/gwilson/bc/lessons/swc-python/fracdens_2.py in main()
      1 def main():
----> 2     script = sys.argv[0]
      3     filenames = sys.argv[1:]
      4     for f in filenames:
      5         process(f)

NameError: global name 'sys' is not defined

Oops: we have imported sys in this notebook, but we haven't imported it in our script, which runs with its own namespace rather than the notebook's. Let's make one more change:


In [59]:
!cat fracdens_3.py


import sys

def main():
    script = sys.argv[0]
    filenames = sys.argv[1:]
    for f in filenames:
        process(f)

def process(filename):
    print filename

# Run the program.
main()

and run that:


In [60]:
%run fracdens_3.py fractal_1.txt


fractal_1.txt

Success! Now, what if we run it with several filenames?


In [61]:
%run fracdens_3.py fractal_*.txt


fractal_1.txt
fractal_2.txt
fractal_3.txt

Good: we appear to be getting the filenames correctly.

Handling Command-Line Flags

The next step is to teach our program to handle the -m flag that tells it to merge data from all the files. By convention, flags always appear before lists of filenames, so we could simply do this:


In [62]:
def handle_args():
    script = sys.argv[0]
    if sys.argv[1] == '-m':
        merge_data = True
        filenames = sys.argv[2:]
    else:
        merge_data = False
        filenames = sys.argv[1:]
    return script, merge_data, filenames

But there are at least three things wrong with this approach:

  1. It doesn't scale: if our program eventually takes several arguments (e.g., to control what statistics are calculated), we're going to have a lot of branches in that if statement.

  2. It contains a bug: if we don't provide any flags or filenames, the attempt on line 3 to check sys.argv[1] will fail with an IndexError. This means that:

    fracdens < fractal_221.txt

    will blow up instead of printing statistics for fractal_221.txt (which the program is reading from standard input).

  3. It's hard to test. As written, handle_args always gets data from sys.argv, which means we can't write unit tests for it using something like Ears. What we really ought to do is pass in the list of strings to be processed, so that we can write unit tests, and then call handle_args with sys.argv as an argument from main, and with other lists of strings as arguments from tests.

Here's a better version of handle_args that uses Python's optparse library to handle arguments:


In [63]:
from optparse import OptionParser

def handle_args(args):
    script, rest = args[0], args[1:]
    parser = OptionParser()
    parser.add_option('-m', '--merge', dest='merge', help='Merge data from all files',
                      default=False, action='store_true')
    options, args = parser.parse_args(args=rest)
    return script, options, args

The optparse library defines a tool called OptionParser, which knows how to handle complex arrangements of command-line parameters. Line 5 of the code above creates one of these; line 6 then tells it that our command line may contain one flag that:

  • has a short form -m and a long form --merge (which are interchangeable);
  • is stored in the property called merge of the options object;
  • is false by default, but should be set to true if the flag is present; and
  • tells the program to merge data from all files.

Let's try it out:


In [64]:
script, flags, filenames = handle_args(['fracdens', 'fractal_1.txt'])
print 'script name is', script
print 'flags.merge is', flags.merge
print 'filenames are', filenames


script name is fracdens
flags.merge is False
filenames are ['fractal_1.txt']

And again:


In [65]:
script, flags, filenames = handle_args(['fracdens', '-m', 'fractal_1.txt', 'fractal_2.txt'])
print 'script name is', script
print 'flags.merge is', flags.merge
print 'filenames are', filenames


script name is fracdens
flags.merge is True
filenames are ['fractal_1.txt', 'fractal_2.txt']

Of course, the right thing to do here is to write a few unit tests:


In [66]:
import ears

def test_parse_no_args():
    script, flags, filenames = handle_args(['fracdens'])
    assert script == 'fracdens'
    assert not flags.merge
    assert filenames == []

def test_parse_one_filename():
    script, flags, filenames = handle_args(['fracdens', 'fractal_1.txt'])
    assert script == 'fracdens'
    assert not flags.merge
    assert filenames == ['fractal_1.txt']

def test_parse_just_merge():
    script, flags, filenames = handle_args(['fracdens', '-m'])
    assert script == 'fracdens'
    assert flags.merge
    assert filenames == []

def test_parse_merge_multiple():
    script, flags, filenames = handle_args(['fracdens', '-m', 'fractal_1.txt', 'fractal_2.txt'])
    assert script == 'fracdens'
    assert flags.merge
    assert len(filenames) == 2

def test_flag_after_filenames():
    try:
        script, flags, filenames = handle_args(['fracdens', 'fractal_1.txt', '-m'])
        assert False, 'Should have had an exception'
    except:
        pass # exception as expected

def test_unknown_flag():
    try:
        script, flags, filenames = handle_args(['fracdens', '-X', 'fractal_1.txt'])
        assert False, 'Should have had an exception'
    except:
        pass # exception as expected

ears.run()


......
6 pass, 0 fail, 0 error
Usage: -c [options]

-c: error: no such option: -X

All of our tests pass, though OptionParser does print an error message during the unknown-flag test: when it meets an option it doesn't recognize, it reports the problem and raises SystemExit, which the bare except in that test then catches.


Testing Error Handling

A warning light that doesn't work is worse than no warning light at all, since it gives people a false sense of security. Similarly, error handling that doesn't actually handle errors can fool programmers into thinking that X, Y, or Z **can't** be wrong when it actually is. Our unit tests therefore ought to check that the right exceptions are raised when they should be, and as the code above shows, there's a pattern for doing this:

try:
    function_that_should_raise_exception()
    assert False, 'Exception was not raised!'
except:
    pass # do nothing because exception was raised correctly
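
A bare except has a downside, though: it also catches the AssertionError raised by our own assert, along with any other unexpected error, so a test written this way can never actually fail. When we know which exception to expect, it is safer to name it. Since OptionParser signals a bad option by raising SystemExit, the unknown-flag test could be written like this (a sketch with a hypothetical test name):

def test_unknown_flag_strict():
    try:
        handle_args(['fracdens', '-X', 'fractal_1.txt'])
        assert False, 'Expected an error for the unknown flag'
    except SystemExit:
        pass # OptionParser reports a bad option by raising SystemExit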


Testing and Learning

In practice, we probably wouldn't write these unit tests if we were familiar with the `optparse` library. Since we're just introducing it, though, and are always looking for opportunities to show what unit testing looks like, we've included the tests.


Let's add handle_args to our program:


In [67]:
!cat fracdens_4.py


import sys
from optparse import OptionParser

def main():
    script, flags, filenames = handle_args(sys.argv)
    for f in filenames:
        process(f)

def handle_args(args):
    script, rest = args[0], args[1:]
    parser = OptionParser()
    parser.add_option('-m', '--merge', dest='merge', help='Merge data from all files',
                      default=False, action='store_true')
    options, args = parser.parse_args(args=rest)
    return script, options, args

def process(filename):
    print filename

# Run the program.
main()

This runs, but it doesn't take merging into account. If flags.merge is True, process should produce one set of statistics for all of our files instead of one set per file. We could handle this as a special case:

if flags.merge:
    process_all_files(filenames)
else:
    process_single_file(filenames[0])

but there's a simpler way. If our program merges data, but only has one file to process, it will produce the statistics for just that file. We can therefore do this:

if flags.merge:
    process_all_files(filenames)
else:
    for f in filenames:
        temp = [f]
        process_all_files(temp)

i.e., create a list containing just one filename for each filename we have, and process those lists one by one. Let's simplify this a bit:

if flags.merge:
    process(filenames)
else:
    for f in filenames:
        process([f])

Now, what does process look like?

def process(filenames):
    print ...

Hm: what are we supposed to print? If filenames contains just one filename, we print that (since we're printing statistics for each file separately), while if it contains more than one, we print something like the word "all".

That would work, but putting a test on the length of filenames in process feels like an awkward design. Instead, let's modify it so that the main program decides what should be printed:

if flags.merge:
    process('all', filenames)
else:
    for f in filenames:
        process(f, [f])

This feels better: the decision about processing files one by one or all together is made in main, and so is the decision about what to print in the output. Our placeholder version of process is then something like:

def process(title, filenames):
    print title
    print 'files:',
    for f in filenames:
        print f, # eventually replace this with real code
    print # make sure there's a newline at the end

Let's try running that:


In [68]:
%run fracdens_5.py fractal_1.txt


fractal_1.txt
files: fractal_1.txt

In [69]:
%run fracdens_5.py -m fractal_*.txt


all
files: fractal_1.txt fractal_2.txt fractal_3.txt


Don't Do It This Way

We now have five Python files named `fracdens_1.py` to `fracdens_5.py`. You should **not** do this when you are writing code yourself: instead, you should use a version control system to manage the file's evolution, and commit it each time you add some useful feature. We can't do this because we want to display successive versions simultaneously in a notebook, but that's a rare case.


Handling Standard Input

The next thing our program has to do is read data from standard input if no filenames are given so that we can put it in a pipeline, redirect input to it, and so on. Let's experiment in another script:


In [70]:
!cat count_stdin.py


import sys

count = 0
for line in sys.stdin:
    count += 1

print '{0} lines in standard input'.format(count)

This little program reads lines from a special "file" called sys.stdin, which is actually the program's standard input. We don't have to open it—Python and the operating system automatically take care of that between themselves when the program is run—but we can do almost anything with it that we could do to a regular file. Let's try running it as if it were a regular command-line program:


In [71]:
!ipython count_stdin.py < fractal_1.txt


6 lines in standard input

What if we run it using %run?


In [72]:
%run count_stdin.py < fractal_1.txt


0 lines in standard input

As you can see, %run doesn't understand file redirection: that's a shell thing.


No Input Takes a Long Time to Read

A common mistake is to try to run something that reads from standard input like this:

!ipython count_stdin.py fractal_1.txt

i.e., to forget the < character that redirects the file to standard input. In this case, there's nothing in standard input, so the program waits at the start of the loop for someone to type something on the keyboard. Since there's no way for us to do this, our program is stuck, and we have to halt it using the Interrupt option from the Kernel menu in the Notebook.


We now need to rewrite process to handle standard input as well as files on disk, and to handle merging. There are three cases:

  • No filenames on the command line: process standard input as a single "file".
  • Filenames given but no merge flag: process each named file separately.
  • Merge flag provided: process all named files as a single batch.

This gives us the following main program (which we've called fracdens_6.py):


In [76]:
def main():
    script, flags, filenames = handle_args(sys.argv)
    if filenames == []:
        process('stdin', None)
    elif flags.merge:
        process('all', filenames)
    else:
        for f in filenames:
            process(f, [f])

and this update to process:


In [77]:
def process(title, filenames):
    if filenames is None:
        densities = calc_density(sys.stdin)
    else:
        for f in filenames:
            with open(f, 'r') as source:
                densities = calc_density(source)
    display(title, densities)

This version of process uses sys.stdin as a data source if there aren't any filenames, and opens files one by one to create a readable source if there are. It then uses two functions calc_density and display to calculate densities and display the results. These ought to be straightforward to write, but before we dive into them, we have a bug to fix.

Consider what happens if we run process with a list of filenames. We're supposed to merge all the data from them to find the overall average density, but what we're actually doing is calculating densities for each file separately, and then reporting the densities of the last file. Somehow, we need to accumulate statistics across all of our files.

Here's one approach:


In [78]:
def process(title, filenames):
    if filenames is None:
        densities = calc_density(sys.stdin)
    else:
        with open(filenames[0], 'r') as source:
            densities = calc_density(source)
        for f in filenames[1:]:
            with open(f, 'r') as source:
                combine_densities(densities, source)
    display(title, densities)

As you can (hopefully) guess from the names we've chosen for our functions, calc_density calculates densities for a single file, while combine_densities combines data from an open file with a running total of densities seen so far. The else branch of the function starts by getting the densities of the first data set, then uses combine_densities to add in the densities from all the other data sets. If there aren't any—i.e., if filenames[1:] is the empty list—then combine_densities will never be called, and densities will just hold the densities from the first file.

This is on the right track, but it doesn't quite work. The density of a line in our fractal is defined to be the number of filled cells in that line divided by the line's width. Adding up densities file by file isn't going to give us the right answer: instead, we need to add up the number of filled cells per line across all our files, and divide by the total width—i.e., the number of files times their width—at the end. For example, if a nine-cell line has 3 filled cells in one file and 6 filled cells in the corresponding line of another, the merged density of that line is (3 + 6) / (2 × 9) = 0.5. Let's rewrite process one more time:


In [79]:
def process(title, filenames):
    if filenames is None:
        number = 1
        width, filled = count(sys.stdin)
    else:
        number = len(filenames)
        with open(filenames[0], 'r') as source:
            width, filled = count(source)
        for f in filenames[1:]:
            with open(f, 'r') as source:
                filled = combine(source, filled)
    display(title, filled, number * width)

Instead of calculating densities file by file, we're using two functions called count and combine to count the number of filled cells per line and combine a list of counts seen so far with counts from yet another file. The first of these functions returns both the width of the data and the list of counts per line so that we can correctly calculate averages, and both branches of the if set the value of number for this purpose as well.

We could now go ahead and write count, combine, and display, but we can simplify things one more time before doing so. count is going to process input data line by line and return a list of numbers. combine is going to do this as well; the only difference is, it will produce its output by adding line counts to existing totals. Let's rewrite process so that all the reading and line-by-line counting happens in count, and combine just adds values together:


In [89]:
def process(title, filenames):
    if filenames is None:
        number = 1
        width, filled = count(sys.stdin)
    else:
        number = len(filenames)
        with open(filenames[0], 'r') as source:
            width, filled = count(source)
        for f in filenames[1:]:
            new_width, new_filled = count(source)
            assert new_width == width, 'File widths are not the same'
            filled = combine(filled, new_filled)
    display(title, filled, number * width)

We're finally done with process: each of the functions it depends on does exactly one simple job, and we've even included a self-check to make sure that all the input files have the same width. Our three remaining functions are now almost trivial to write:


In [90]:
def count(source):
    result = []
    for line in source:
        line = line.strip()
        width = len(line)
        n = line.count('1')
        result.append(n)
    return width, result

def combine(left, right):
    assert len(left) == len(right), 'Data sets have unequal lengths'
    result = []
    for i in range(len(left)):
        result.append( left[i] + right[i] )
    return result

def display(title, counts, scaling):
    print title
    for c in counts:
        print float(c) / scaling
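
While we're here, count and combine are easy to unit test with the same ears library we used earlier. The tests below are just a sketch (they aren't part of fracdens_6.py); StringIO lets us hand count a string as if it were an open file:

import ears
from StringIO import StringIO

def test_count_counts_filled_cells():
    width, filled = count(StringIO('010\n011\n'))
    assert width == 3
    assert filled == [1, 2]

def test_combine_adds_counts_pairwise():
    assert combine([1, 2, 3], [4, 5, 6]) == [5, 7, 9]

ears.run()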

Let's try running it on a single input file:


In [91]:
%run fracdens_6.py fractal_1.txt


fractal_1.txt
0.111111111111
0.444444444444
0.444444444444
0.333333333333
0.222222222222
0.333333333333

If we double-check the file (which is nine cells wide):


In [92]:
!cat fractal_1.txt


000100000
001100110
000111100
001110000
001100000
001110000

that seems to be the right answer: 1/9, two lines of 4/9, a 3/9, a 2/9, and another 3/9. Let's try three files separately:


In [93]:
%run fracdens_6.py fractal_1.txt fractal_2.txt fractal_3.txt


fractal_1.txt
0.111111111111
0.444444444444
0.444444444444
0.333333333333
0.222222222222
0.333333333333
fractal_2.txt
0.111111111111
0.444444444444
0.444444444444
0.444444444444
0.222222222222
0.444444444444
fractal_3.txt
0.111111111111
0.555555555556
0.333333333333
0.444444444444
0.333333333333
0.333333333333

and standard input:


In [94]:
!ipython fracdens_6.py < fractal_1.txt


stdin
0.111111111111
0.444444444444
0.444444444444
0.333333333333
0.222222222222
0.333333333333

All that's left to test is merging. Clearly, if we "merge" one file, we should get the same answer that we got for that file:


In [95]:
%run fracdens_6.py -m fractal_1.txt


all
0.111111111111
0.444444444444
0.444444444444
0.333333333333
0.222222222222
0.333333333333

and if we merge data for that file with itself, we should get the same answer (which is an easier thing to check than merging data from two different files):


In [97]:
%run fracdens_6.py -m fractal_1.txt fractal_1.txt


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/gwilson/anaconda/lib/python2.7/site-packages/IPython/utils/py3compat.pyc in execfile(fname, *where)
    202             else:
    203                 filename = fname
--> 204             __builtin__.execfile(filename, *where)

/Users/gwilson/bc/lessons/swc-python/fracdens_6.py in <module>()
     56 
     57 # Run the program.
---> 58 main()

/Users/gwilson/bc/lessons/swc-python/fracdens_6.py in main()
      7         process('stdin', None)
      8     elif flags.merge:
----> 9         process('all', filenames)
     10     else:
     11         for f in filenames:

/Users/gwilson/bc/lessons/swc-python/fracdens_6.py in process(title, filenames)
     29             width, filled = count(source)
     30         for f in filenames[1:]:
---> 31             new_width, new_filled = count(source)
     32             assert new_width == width, 'File widths are not the same'
     33             filled = combine(filled, new_filled)

/Users/gwilson/bc/lessons/swc-python/fracdens_6.py in count(source)
     36 def count(source):
     37     result = []
---> 38     for line in source:
     39         line = line.strip()
     40         width = len(line)

ValueError: I/O operation on closed file

Why does Python think we're trying to read from a closed file? If we take a closer look at process, we see that the loop handling filenames[1:] wasn't actually opening any files, so it was trying to read from the same source that was opened and closed for filenames[0]. Let's update process one more time to create fracdens_7.py:


In [98]:
def process(title, filenames):
    if filenames is None:
        number = 1
        width, filled = count(sys.stdin)
    else:
        number = len(filenames)
        with open(filenames[0], 'r') as source:
            width, filled = count(source)
        for f in filenames[1:]:
            with open(f, 'r') as source:
                new_width, new_filled = count(source)
                assert new_width == width, 'File widths are not the same'
                filled = combine(filled, new_filled)
    display(title, filled, number * width)

and run that:


In [99]:
%run fracdens_7.py -m fractal_1.txt fractal_1.txt


all
0.111111111111
0.444444444444
0.444444444444
0.333333333333
0.222222222222
0.333333333333

Good: that's the same answer that we had before. Let's try merging all three files:


In [100]:
%run fracdens_7.py -m fractal_1.txt fractal_2.txt fractal_3.txt


all
0.111111111111
0.481481481481
0.407407407407
0.407407407407
0.259259259259
0.37037037037

That might be right—at least, it isn't obviously wrong—but how can we be sure? Let's create three simplified input files:


In [102]:
!cat test_1.txt


010
010
010

In [103]:
!cat test_2.txt


011
011
011

In [104]:
!cat test_3.txt


111
111
111

and try merging those:


In [105]:
%run fracdens_7.py -m test_1.txt test_2.txt


all
0.5
0.5
0.5

In [106]:
%run fracdens_7.py -m test_1.txt test_2.txt test_3.txt


all
0.666666666667
0.666666666667
0.666666666667

Those values are much easier to check: the average of 1/3 and 2/3 is 1/2, and the average of 1/3, 2/3, and 3/3 is 2/3.


That Was Unexpected

Just for fun, let's get our program to process standard input with merging turned on:


In [107]:
!ipython fracdens_7.py -m < test_1.txt


usage: ipython [-h] [--profile TERMINALIPYTHONAPP.PROFILE] [-c TERMINALIPYTHONAPP.CODE_TO_RUN]
               [--logappend TERMINALINTERACTIVESHELL.LOGAPPEND] [--autocall TERMINALINTERACTIVESHELL.AUTOCALL]
               [--ipython-dir TERMINALIPYTHONAPP.IPYTHON_DIR] [--gui TERMINALIPYTHONAPP.GUI] [--pylab [TERMINALIPYTHONAPP.PYLAB]]
               [-m TERMINALIPYTHONAPP.MODULE_TO_RUN] [--colors TERMINALINTERACTIVESHELL.COLORS]
               [--log-level TERMINALIPYTHONAPP.LOG_LEVEL] [--ext TERMINALIPYTHONAPP.EXTRA_EXTENSION]
               [--matplotlib [TERMINALIPYTHONAPP.MATPLOTLIB]] [--cache-size TERMINALINTERACTIVESHELL.CACHE_SIZE]
               [--logfile TERMINALINTERACTIVESHELL.LOGFILE] [--config TERMINALIPYTHONAPP.EXTRA_CONFIG_FILE] [--no-autoindent]
               [--deep-reload] [--classic] [--term-title] [--no-confirm-exit] [--autoindent] [--no-term-title] [--pprint] [--color-info]
               [--init] [--pydb] [--no-color-info] [--autoedit-syntax] [--confirm-exit] [--no-autoedit-syntax] [--quick] [--banner]
               [--automagic] [--no-automagic] [--nosep] [-i] [--quiet] [--no-deep-reload] [--no-pdb] [--debug] [--pdb] [--no-pprint]
               [--no-banner]
ipython: error: argument -m/--m: expected one argument

That's certainly not what we expect. After a bit of digging, it turns out that `ipython` assumes the `-m` flag belongs to it, rather than to our script. This doesn't show up when we run the program with `%run` because we're not launching a separate command-line instance of IPython in that case. To make sure that arguments are actually passed to our script, we need to put them all after a double dash (`--`) so that IPython can tell which arguments belong to it and which belong to our script:


In [108]:
!ipython fracdens_7.py -- -m < test_1.txt


stdin
0.333333333333
0.333333333333
0.333333333333

A More Advanced Solution

One of the rules in the previous lesson was, "If you're writing a loop, you're probably doing it wrong." Let's take a look at how we could eliminate most of the loops in our density calculator using NumPy arrays and another feature of Python we haven't encountered yet: list comprehensions. Suppose we have a list of numbers:


In [1]:
nums = [2, 5, 9]

Instead of writing a loop to create a list of their squares, we can do this:


In [2]:
squares = [x ** 2 for x in nums]
print squares


[4, 25, 81]

This doesn't change our original list:


In [3]:
print nums


[2, 5, 9]

and the name of the temporary variable inside the comprehension doesn't matter:


In [4]:
print [something_else ** 2 for something_else in nums]


[4, 25, 81]

Used sparingly, list comprehensions make a lot of programs more readable—just compare the list comprehension form of this calculation with the loop-and-conditional form:


In [7]:
from math import sqrt
signal = [-0.7, -0.3, -0.1, 0.2, 0.3, 0.5]

# The easy way
pos_roots = [sqrt(s) for s in signal if s >= 0]

# The hard way
pos_roots = []
for s in signal:
    if s >= 0:
        pos_roots.append(sqrt(s))
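
For example, the body of count from fracdens_6.py could be written with comprehensions instead of an explicit loop. This is just a sketch to show the idiom; we won't actually change that script:

def count(source):
    lines = [line.strip() for line in source]      # read and clean every line
    width = len(lines[0])                          # all lines are the same width
    return width, [line.count('1') for line in lines]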

Let's use list comprehensions to clean up our code. First, though, let's clean up our data. It's simple to represent our fractals using tightly-packed 1's and 0's, but nobody else's software will recognize that data format. If we use something standard, like comma-separated values (CSV), we can get rid of our own parsing code. This makes our data files somewhat larger:


In [1]:
!cat csv_1.txt


0,0,0,1,0,0,0,0,0
0,0,1,1,0,0,1,1,0
0,0,0,1,1,1,1,0,0
0,0,1,1,1,0,0,0,0
0,0,1,1,0,0,0,0,0
0,0,1,1,1,0,0,0,0

but we can now read our data with a single statement:


In [3]:
import numpy as np
print np.loadtxt('csv_1.txt', delimiter=',')


[[ 0.  0.  0.  1.  0.  0.  0.  0.  0.]
 [ 0.  0.  1.  1.  0.  0.  1.  1.  0.]
 [ 0.  0.  0.  1.  1.  1.  1.  0.  0.]
 [ 0.  0.  1.  1.  1.  0.  0.  0.  0.]
 [ 0.  0.  1.  1.  0.  0.  0.  0.  0.]
 [ 0.  0.  1.  1.  1.  0.  0.  0.  0.]]

Using this, we can rewrite process as follows:


In [4]:
def process(title, filenames):
    if filenames is None:
        data = np.loadtxt(sys.stdin, delimiter=',')
        display(title, data, 1)
    else:
        results = [np.loadtxt(f, delimiter=',') for f in filenames]
        assert all([x.shape == results[0].shape for x in results]), 'File sizes differ'
        for r in results[1:]:
            results[0] += r
        display(title, results[0], len(results))

Let's go through this line by line:

  • If no filenames have been given, process reads data from standard input using np.loadtxt, then passes that two-dimensional array to display, along with a 1 to indicate that we've only got one data set.
  • Otherwise, if we do have filenames, we use a list comprehension to load data from all of the files with a single statement.
  • Then, on line 7, we check that the shapes of all of the arrays match the shape of the first array.
  • Assuming they match, we add the second and following arrays to the first one to get the total number of times each cell in the grid was filled (there's a short illustration of this element-wise addition right after this list).
  • Finally, we call display with the merged data, passing in the totals and the number of arrays we read.
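
The += in that loop relies on the fact that adding two NumPy arrays of the same shape adds them element by element, so each += merges an entire grid of counts at once. A quick illustration with made-up values:

import numpy as np
first = np.array([[0, 1, 1],
                  [1, 0, 1]])
second = np.array([[1, 1, 0],
                   [0, 0, 1]])
first += second   # element-wise addition, no loops needed
print first       # prints [[1 2 1]
                  #         [1 0 2]]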

The display function now needs to be rewritten to do two things: scale the counts per row, and show them. This isn't a great design, as it violates our "one purpose per function" rule, but it's good enough for now:


In [5]:
def display(title, data, number):
    print title
    scaling = float(number * data.shape[1])
    densities = data.sum(1) / scaling
    for d in densities:
        print d
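
Two small NumPy features do the work here: data.shape[1] is the number of columns (the width of the grid), and data.sum(1) adds up the values across each row, giving the number of filled cells per line. A quick check with made-up values:

import numpy as np
grid = np.array([[0, 1, 1],
                 [1, 1, 1]])
print grid.shape[1]   # width of the grid: 3
print grid.sum(1)     # filled cells per row: [2 3]
print grid.sum(0)     # filled cells per column: [1 2 2]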

The program we have now produced runs exactly the same way as the previous versions:


In [7]:
%run fracdens_8.py csv_1.txt csv_2.txt


csv_1.txt
0.111111111111
0.444444444444
0.444444444444
0.333333333333
0.222222222222
0.333333333333
csv_2.txt
0.111111111111
0.444444444444
0.444444444444
0.444444444444
0.222222222222
0.444444444444

but it's two-thirds the size:


In [9]:
!wc fracdens_7.py fracdens_8.py


      59     185    1690 fracdens_7.py
      42     133    1238 fracdens_8.py
     101     318    2928 total

The new program isn't noticeably faster than the old one, though, since the time to do the calculations is dwarfed by the time needed to read the data into the program.

Is the NumPy version better than the list-based version? The answer depends on the audience we have in mind. Programmers who know Python well, and who are used to thinking in terms of applying operations to entire data sets at once, would probably have written something like the final version right from the start. Programmers who aren't that familiar with Python's features, on the other hand, would probably find the loop-based version easier to understand because it spells out the steps the program is taking, rather than the results it's producing.

These differences highlight one of the fundamental problems in programming (and indeed in any other activity that requires expertise): things that are comprehensible to a novice are painfully slow for an expert to read, while things that are natural for an expert are often opaque to novices. This doesn't mean that either is right or wrong: it's just one manifestation of the cognitive changes that occur as a task goes from being a mystery to being possible to being easy.

The same thing is true of documentation: a tutorial aimed at novices can be infuriating for experts to read, since the information they want is scattered so thinly, while a manual page for experts can seem like gibberish to people who don't yet have a mental model of the domain.

Key Points

FIXME