Command-Line Programs

The IPython Notebook and other interactive tools are great for prototyping code and exploring data, but sooner or later we will want to use our program in a pipeline or run it in a shell script to process thousands of data files. In order to do that, we need to make our programs work like other Unix command-line tools. For example, we may want a program that reads a data set and prints the average inflammation per patient:

$ python readings.py --mean inflammation-01.csv
5.45
5.425
6.1
...
6.4
7.05
5.9

but we might also want to look at the minimum of the first four lines

$ head -4 inflammation-01.csv | python readings.py --min

or the maximum inflammations in several files one after another:

$ python readings.py --max inflammation-*.csv

Our overall requirements are:

  1. If no filename is given on the command line, read data from standard input.
  2. If one or more filenames are given, read data from them and report statistics for each file separately.
  3. Use the --min, --mean, or --max flag to determine what statistic to print.

To make this work, we need to know how to handle command-line arguments in a program, and how to get at standard input. We'll tackle these questions in turn below.

Objectives

  • Use the values of command-line arguments in a program.
  • Handle flags and files separately in a command-line program.
  • Read data from standard input in a program so that it can be used in a pipeline.

Command-Line Arguments

Using the text editor of your choice, save the following in a text file:


In [26]:
!cat code/sys-version.py


import sys
print 'version is', sys.version

The first line imports a library called sys, which is short for "system". It defines values such as sys.version, which describes which version of Python we are running. We can run this script from within the IPython Notebook like this:


In [25]:
%run code/sys-version.py


version is 2.7.8 |Anaconda 2.0.1 (x86_64)| (default, Aug 21 2014, 15:21:46) 
[GCC 4.2.1 (Apple Inc. build 5577)]

or like this:


In [27]:
!ipython code/sys-version.py


version is 2.7.8 |Anaconda 2.0.1 (x86_64)| (default, Aug 21 2014, 15:21:46) 
[GCC 4.2.1 (Apple Inc. build 5577)]

The first method, %run, uses a special command in the IPython Notebook to run a program in a .py file. The second method is more general: the exclamation mark ! tells the Notebook to run a shell command, and it just so happens that the command we run is ipython with the name of the script.

Here's another script that does something more interesting:


In [28]:
!cat code/argv-list.py


import sys
print 'sys.argv is', sys.argv

The strange name argv stands for "argument values". Whenever Python runs a program, it takes all of the values given on the command line and puts them in the list sys.argv so that the program can determine what they were. If we run this program with no arguments:


In [29]:
!ipython code/argv-list.py


sys.argv is ['/Users/aproeme/swc/2014-12-03-edinburgh/archer/python/code/argv-list.py']

the only thing in the list is the full path to our script, which is always sys.argv[0]. If we run it with a few arguments, however:


In [30]:
!ipython code/argv-list.py first second third


sys.argv is ['/Users/aproeme/swc/2014-12-03-edinburgh/archer/python/code/argv-list.py', 'first', 'second', 'third']

then Python adds each of those arguments to that magic list.

With this in hand, let's build a version of readings.py that always prints the per-patient mean of a single data file. The first step is to write a function that outlines our implementation, and a placeholder for the function that does the actual work. By convention this function is usually called main, though we can call it whatever we want:


In [31]:
!cat code/readings-01.py


import sys
import numpy as np

def main():
    script = sys.argv[0]
    filename = sys.argv[1]
    data = np.loadtxt(filename, delimiter=',')
    for m in data.mean(axis=1):
        print m

This function gets the name of the script from sys.argv[0], because that's where it's always put, and the name of the file to process from sys.argv[1]. Here's a simple test:


In [32]:
%run code/readings-01.py inflammation-01.csv

There is no output because we have defined a function, but haven't actually called it. Let's add a call to main:


In [33]:
!cat code/readings-02.py


import sys
import numpy as np

def main():
    script = sys.argv[0]
    filename = sys.argv[1]
    data = np.loadtxt(filename, delimiter=',')
    for m in data.mean(axis=1):
        print m

main()

and run that:


In [36]:
%run code/readings-02.py data/inflammation-01.csv


5.45
5.425
6.1
5.9
5.55
6.225
5.975
6.65
6.625
6.525
6.775
5.8
6.225
5.75
5.225
6.3
6.55
5.7
5.85
6.55
5.775
5.825
6.175
6.1
5.8
6.425
6.05
6.025
6.175
6.55
6.175
6.35
6.725
6.125
7.075
5.725
5.925
6.15
6.075
5.75
5.975
5.725
6.3
5.9
6.75
5.925
7.225
6.15
5.95
6.275
5.7
6.1
6.825
5.975
6.725
5.7
6.25
6.4
7.05
5.9

The Right Way to Do It

If our programs can take complex parameters or multiple filenames, we shouldn't handle sys.argv directly. Instead, we should use Python's argparse library, which handles common cases in a systematic way, and also makes it easy for us to provide sensible error messages for our users.

Challenges

  1. Write a command-line program that does addition and subtraction:

    $ python arith.py 1 + 2
    3
    $ python arith.py 3 - 4
    -1

    What goes wrong if you try to add multiplication using '*' to the program?

  2. Using the glob module introduced 03-loop.ipynb, write a simple version of ls that shows files in the current directory with a particular suffix:

    $ python my_ls.py py
    left.py
    right.py
    zero.py

Handling Multiple Files

The next step is to teach our program how to handle multiple files. Since 60 lines of output per file is a lot to page through, we'll start by creating three smaller files, each of which has three days of data for two patients:


In [38]:
!ls data/small-*.csv


data/small-01.csv data/small-02.csv data/small-03.csv

In [39]:
!cat data/small-01.csv


0,0,1
0,1,2

In [40]:
%run code/readings-02.py data/small-01.csv


0.333333333333
1.0

Using small data files as input also allows us to check our results more easily: here, for example, we can see that our program is calculating the mean correctly for each line, whereas we were really taking it on faith before. This is yet another rule of programming: "test the simple things first".

We want our program to process each file separately, so we need a loop that executes once for each filename. If we specify the files on the command line, the filenames will be in sys.argv, but we need to be careful: sys.argv[0] will always be the name of our script, rather than the name of a file. We also need to handle an unknown number of filenames, since our program could be run for any number of files.

The solution to both problems is to loop over the contents of sys.argv[1:]. The '1' tells Python to start the slice at location 1, so the program's name isn't included; since we've left off the upper bound, the slice runs to the end of the list, and includes all the filenames. Here's our changed program:


In [41]:
!cat code/readings-03.py


import sys
import numpy as np

def main():
    script = sys.argv[0]
    for filename in sys.argv[1:]:
        data = np.loadtxt(filename, delimiter=',')
        for m in data.mean(axis=1):
            print m

main()

and here it is in action:


In [42]:
%run code/readings-03.py data/small-01.csv data/small-02.csv


0.333333333333
1.0
13.6666666667
11.0

Note: at this point, we have created three versions of our script called readings-01.py, readings-02.py, and readings-03.py. We wouldn't do this in real life: instead, we would have one file called readings.py that we committed to version control every time we got an enhancement working. For teaching, though, we need all the successive versions side by side.

Challenges

  1. Write a program called check.py that takes the names of one or more inflammation data files as arguments and checks that all the files have the same number of rows and columns. What is the best way to test your program?

Handling Command-Line Flags

The next step is to teach our program to pay attention to the --min, --mean, and --max flags. These always appear before the names of the files, so we could just do this:


In [43]:
!cat code/readings-04.py


import sys
import numpy as np

def main():
    script = sys.argv[0]
    action = sys.argv[1]
    filenames = sys.argv[2:]

    for f in filenames:
        data = np.loadtxt(f, delimiter=',')

        if action == '--min':
            values = data.min(axis=1)
        elif action == '--mean':
            values = data.mean(axis=1)
        elif action == '--max':
            values = data.max(axis=1)

        for m in values:
            print m

main()

This works:


In [44]:
%run code/readings-04.py --max data/small-01.csv


1.0
2.0

but there are seveal things wrong with it:

  1. main is too large to read comfortably.

  2. If action isn't one of the three recognized flags, the program loads each file but does nothing with it (because none of the branches in the conditional match). Silent failures like this are always hard to debug.

This version pulls the processing of each file out of the loop into a function of its own. It also checks that action is one of the allowed flags before doing any processing, so that the program fails fast:


In [45]:
!cat code/readings-05.py


import sys
import numpy as np

def main():
    script = sys.argv[0]
    action = sys.argv[1]
    filenames = sys.argv[2:]
    assert action in ['--min', '--mean', '--max'], \
           'Action is not one of --min, --mean, or --max: ' + action
    for f in filenames:
        process(f, action)

def process(filename, action):
    data = np.loadtxt(filename, delimiter=',')

    if action == '--min':
        values = data.min(axis=1)
    elif action == '--mean':
        values = data.mean(axis=1)
    elif action == '--max':
        values = data.max(axis=1)

    for m in values:
        print m

main()

This is four lines longer than its predecessor, but broken into more digestible chunks of 8 and 12 lines.

Python has a module named argparse that helps handle complex command-line flags. We will not cover this module in this lesson but you can go to Tshepang Lekhonkhobe's Argparse tutorial that is part of Python's Official Documentation.

Challenges

  1. Rewrite this program so that it uses -n, -m, and -x instead of --min, --mean, and --max respectively. Is the code easier to read? Is the program easier to understand?

  2. Separately, modify the program so that if no parameters are given (i.e., no action is specified and no filenames are given), it prints a message explaining how it should be used.

  3. Separately, modify the program so that if no action is given it displays the means of the data.

Handling Standard Input

The next thing our program has to do is read data from standard input if no filenames are given so that we can put it in a pipeline, redirect input to it, and so on. Let's experiment in another script:


In [46]:
!cat code/count-stdin.py


import sys

count = 0
for line in sys.stdin:
    count += 1

print count, 'lines in standard input'

This little program reads lines from a special "file" called sys.stdin, which is automatically connected to the program's standard input. We don't have to open it—Python and the operating system take care of that when the program starts up— but we can do almost anything with it that we could do to a regular file. Let's try running it as if it were a regular command-line program:


In [47]:
!ipython code/count-stdin.py < data/small-01.csv


2 lines in standard input

What if we run it using %run?


In [49]:
%run code/count-stdin.py < data/small-01.csv


0 lines in standard input

As you can see, %run doesn't understand file redirection: that's a shell thing.

A common mistake is to try to run something that reads from standard input like this:

!ipython count_stdin.py small-01.csv

i.e., to forget the < character that redirect the file to standard input. In this case, there's nothing in standard input, so the program waits at the start of the loop for someone to type something on the keyboard. Since there's no way for us to do this, our program is stuck, and we have to halt it using the Interrupt option from the Kernel menu in the Notebook.

We now need to rewrite the program so that it loads data from sys.stdin if no filenames are provided. Luckily, numpy.loadtxt can handle either a filename or an open file as its first parameter, so we don't actually need to change process. That leaves main:

def main():
    script = sys.argv[0]
    action = sys.argv[1]
    filenames = sys.argv[2:]
    assert action in ['--min', '--mean', '--max'], \
           'Action is not one of --min, --mean, or --max: ' + action
    if len(filenames) == 0:
        process(sys.stdin, action)
    else:
        for f in filenames:
            process(f, action)

Let's try it out (we'll see in a moment why we send the output through head):


In [50]:
!ipython code/readings-06.py --mean < data/small-01.csv


=========
 IPython
=========

Tools for Interactive Computing in Python
=========================================

    A Python shell with automatic history (input and output), dynamic object
    introspection, easier configuration, command completion, access to the
    system shell and more.  IPython can also be embedded in running programs.

Usage

    ipython [subcommand] [options] [-c cmd | -m mod | file] [--] [arg] ...

    If invoked with no options, it executes the file and exits, passing the
    remaining arguments to the script, just as if you had specified the same
    command with python. You may need to specify `--` before args to be passed
    to the script, to prevent IPython from attempting to parse them. If you
    specify the option `-i` before the filename, it will enter an interactive
    IPython session after running the script, rather than exiting. Files ending
    in .py will be treated as normal Python, but files ending in .ipy can
    contain special IPython syntax (magic commands, shell expansions, etc.).

    Almost all configuration in IPython is available via the command-line. Do
    `ipython --help-all` to see all available options.  For persistent
    configuration, look into your `ipython_config.py` configuration file for
    details.

    This file is typically installed in the `IPYTHONDIR` directory, and there
    is a separate configuration directory for each profile. The default profile
    directory will be located in $IPYTHONDIR/profile_default. IPYTHONDIR
    defaults to to `$HOME/.ipython`.  For Windows users, $HOME resolves to
    C:\Documents and Settings\YourUserName in most instances.

    To initialize a profile with the default configuration file, do::

      $> ipython profile create

    and start editing `IPYTHONDIR/profile_default/ipython_config.py`

    In IPython's documentation, we will refer to this directory as
    `IPYTHONDIR`, you can change its default location by creating an
    environment variable with this name and setting it to the desired path.

    For more information, see the manual available in HTML and PDF in your
    installation, or online at http://ipython.org/documentation.html.

Subcommands
-----------

Subcommands are launched as `ipython cmd [args]`. For information on using
subcommand 'cmd', do: `ipython cmd -h`.

locate
    print the path to the IPython dir
kernel
    Start a kernel without an attached frontend.
console
    Launch the IPython terminal-based Console.
install-nbextension
    Install IPython notebook extension files
notebook
    Launch the IPython HTML Notebook Server.
profile
    Create and manage IPython profiles.
qtconsole
    Launch the IPython Qt Console.
nbconvert
    Convert notebooks to/from other formats.
trust
    Sign notebooks to trust their potentially unsafe contents at load.
history
    Manage the IPython history database.

Options
-------

Arguments that take values are actually convenience aliases to full
Configurables, whose aliases are listed on the help line. For more information
on full configurables, see '--help-all'.

--no-autoindent
    Turn off autoindenting.
--autoedit-syntax
    Turn on auto editing of files with syntax errors.
--deep-reload
    Enable deep (recursive) reloading by default. IPython can use the
    deep_reload module which reloads changes in modules recursively (it
    replaces the reload() function, so you don't need to change anything to
    use it). deep_reload() forces a full reload of modules whose code may
    have changed, which the default reload() function does not.  When
    deep_reload is off, IPython will use the normal reload(), but
    deep_reload will still be available as dreload(). This feature is off
    by default [which means that you have both normal reload() and
    dreload()].
--confirm-exit
    Set to confirm when you try to exit IPython with an EOF (Control-D
    in Unix, Control-Z/Enter in Windows). By typing 'exit' or 'quit',
    you can force a direct exit without any confirmation.
--pylab
    Pre-load matplotlib and numpy for interactive use with
    the default matplotlib backend.
--matplotlib
    Configure matplotlib for interactive use with
    the default matplotlib backend.
--term-title
    Enable auto setting the terminal title.
--classic
    Gives IPython a similar feel to the classic Python prompt.
--autoindent
    Turn on autoindenting.
--no-automagic
    Turn off the auto calling of magic commands.
--banner
    Display a banner upon starting IPython.
--automagic
    Turn on the auto calling of magic commands. Type %%magic at the
    IPython  prompt  for  more information.
--no-deep-reload
    Disable deep (recursive) reloading by default.
--no-term-title
    Disable auto setting the terminal title.
--nosep
    Eliminate all spacing between prompts.
-i
    If running code from the command line, become interactive afterwards.
    Note: can also be given simply as '-i'.
--debug
    set log level to logging.DEBUG (maximize logging output)
--pprint
    Enable auto pretty printing of results.
--no-autoedit-syntax
    Turn off auto editing of files with syntax errors.
--quiet
    set log level to logging.CRITICAL (minimize logging output)
--no-color-info
    Disable using colors for info related things.
--color-info
    IPython can display information about objects via a set of func-
    tions, and optionally can use colors for this, syntax highlighting
    source code and various other elements.  However, because this
    information is passed through a pager (like 'less') and many pagers get
    confused with color codes, this option is off by default.  You can test
    it and turn it on permanently in your ipython_config.py file if it
    works for you.  Test it and turn it on permanently if it works with
    your system.  The magic function %%color_info allows you to toggle this
    interactively for testing.
--init
    Initialize profile with default config files.  This is equivalent
    to running `ipython profile create <profile>` prior to startup.
--no-pdb
    Disable auto calling the pdb debugger after every exception.
--quick
    Enable quick startup with no config files.
--no-confirm-exit
    Don't prompt the user when exiting.
--pydb
    Use the third party 'pydb' package as debugger, instead of pdb.
    Requires that pydb is installed.
--pdb
    Enable auto calling the pdb debugger after every exception.
--no-pprint
    Disable auto pretty printing of results.
--no-banner
    Don't display a banner upon starting IPython.
--profile=<Unicode> (BaseIPythonApplication.profile)
    Default: u'default'
    The IPython profile to use.
--pylab=<CaselessStrEnum> (InteractiveShellApp.pylab)
    Default: None
    Choices: ['auto', 'gtk', 'gtk3', 'inline', 'nbagg', 'osx', 'qt', 'qt4', 'qt5', 'tk', 'wx']
    Pre-load matplotlib and numpy for interactive use, selecting a particular
    matplotlib backend and loop integration.
--matplotlib=<CaselessStrEnum> (InteractiveShellApp.matplotlib)
    Default: None
    Choices: ['auto', 'gtk', 'gtk3', 'inline', 'nbagg', 'osx', 'qt', 'qt4', 'qt5', 'tk', 'wx']
    Configure matplotlib for interactive use with the default matplotlib
    backend.
--colors=<CaselessStrEnum> (InteractiveShell.colors)
    Default: 'LightBG'
    Choices: ('NoColor', 'LightBG', 'Linux')
    Set the color scheme (NoColor, Linux, or LightBG).
--cache-size=<Integer> (InteractiveShell.cache_size)
    Default: 1000
    Set the size of the output cache.  The default is 1000, you can change it
    permanently in your config file.  Setting it to 0 completely disables the
    caching system, and the minimum value accepted is 20 (if you provide a value
    less than 20, it is reset to 0 and a warning is issued).  This limit is
    defined because otherwise you'll spend more time re-flushing a too small
    cache than working
--logfile=<Unicode> (InteractiveShell.logfile)
    Default: ''
    The name of the logfile to use.
--profile-dir=<Unicode> (ProfileDir.location)
    Default: u''
    Set the profile location directly. This overrides the logic used by the
    `profile` option.
-c <Unicode> (InteractiveShellApp.code_to_run)
    Default: ''
    Execute the given command string.
--autocall=<Enum> (InteractiveShell.autocall)
    Default: 0
    Choices: (0, 1, 2)
    Make IPython automatically call any callable object even if you didn't type
    explicit parentheses. For example, 'str 43' becomes 'str(43)' automatically.
    The value can be '0' to disable the feature, '1' for 'smart' autocall, where
    it is not applied if there are no more arguments on the line, and '2' for
    'full' autocall, where all callable objects are automatically called (even
    if no arguments are present).
--ipython-dir=<Unicode> (BaseIPythonApplication.ipython_dir)
    Default: u''
    The name of the IPython directory. This directory is used for logging
    configuration (through profiles), history storage, etc. The default is
    usually $HOME/.ipython. This options can also be specified through the
    environment variable IPYTHONDIR.
--gui=<CaselessStrEnum> (InteractiveShellApp.gui)
    Default: None
    Choices: ('glut', 'gtk', 'gtk3', 'none', 'osx', 'pyglet', 'qt', 'qt4', 'tk', 'wx')
    Enable GUI event loop integration with any of ('glut', 'gtk', 'gtk3',
    'none', 'osx', 'pyglet', 'qt', 'qt4', 'tk', 'wx').
--logappend=<Unicode> (InteractiveShell.logappend)
    Default: ''
    Start logging to the given file in append mode.
-m <Unicode> (InteractiveShellApp.module_to_run)
    Default: ''
    Run the module as a script.
--log-level=<Enum> (Application.log_level)
    Default: 30
    Choices: (0, 10, 20, 30, 40, 50, 'DEBUG', 'INFO', 'WARN', 'ERROR', 'CRITICAL')
    Set the log level by value or name.
--ext=<Unicode> (InteractiveShellApp.extra_extension)
    Default: ''
    dotted module name of an IPython extension to load.
--config=<Unicode> (BaseIPythonApplication.extra_config_file)
    Default: u''
    Path to an extra config file to load.
    If specified, load this config file in addition to any other IPython config.

To see all available configurables, use `--help-all`

Examples
--------

    ipython --matplotlib       # enable matplotlib integration
    ipython --matplotlib=qt    # enable matplotlib integration with qt4 backend
    
    ipython --log-level=DEBUG  # set logging to DEBUG
    ipython --profile=foo      # start with profile foo
    
    ipython qtconsole          # start the qtconsole GUI application
    ipython help qtconsole     # show the help for the qtconsole subcmd
    
    ipython console            # start the terminal-based console application
    ipython help console       # show the help for the console subcmd
    
    ipython notebook           # start the IPython notebook
    ipython help notebook      # show the help for the notebook subcmd
    
    ipython profile create foo # create profile foo w/ default config files
    ipython help profile       # show the help for the profile subcmd
    
    ipython locate             # print the path to the IPython directory
    ipython locate profile foo # print the path to the directory for profile `foo`
    
    ipython nbconvert           # convert notebooks to/from other formats

[TerminalIPythonApp] CRITICAL | Bad config encountered during initialization:
[TerminalIPythonApp] CRITICAL | Unrecognized flag: '--mean'

Whoops: why are we getting IPython's help rather than the line-by-line average of our data? The answer is that IPython has a hard time telling which command-line arguments are meant for it, and which are meant for the program it's running. To make our meaning clear, we have to use -- (a double dash) to separate the two:


In [51]:
!ipython code/readings-06.py -- --mean < data/small-01.csv


0.333333333333
1.0

That's better. In fact, that's done: the program now does everything we set out to do.

Note that the basic python interpreter (as opposed to ipython) doesn't have this problem:


In [24]:
!python code/readings-06.py --mean < data/small-01.csv


0.333333333333
1.0

Challenges

  1. Write a program called line-count.py that works like the Unix wc command:
    • If no filenames are given, it reports the number of lines in standard input.
    • If one or more filenames are given, it reports the number of lines in each, followed by the total number of lines.

Key Points

  • The sys library connects a Python program to the system it is running on.
  • The list sys.argv contains the command-line arguments that a program was run with.
  • Avoid silent failures.
  • The "file" sys.stdin connects to a program's standard input.
  • The "file" sys.stdout connects to a program's standard output.