In [ ]:

    
import sys
import logging
import argparse
import collections

Scripting

Unlike in R, creating handy scripts in Python is extremely easy. Scripts are very useful when working from the command line and/or in HPC (high performance computing) environments.

A word on the Unix philosophy

When writing a script, it's always a good idea to follow the Unix philosophy, which emphasizes simplicity, interoperability and modularity instead of overengineering. In short:

Write programs that do one thing and do it well.
Write programs to work together.
Write programs to handle text streams, because that is a universal interface.

If you have even a basic knowledge of the bash (or bash-like) command line, you are probably already familiar with these concepts. Consider the following example:

> curl --silent "http://wodaklab.org/cyc2008/resources/CYC2008_complex.tab" | head -n 5
ORF     Name    Complex PubMed_id       Method  GO_id   GO_term Jaccard_Index
YKR068C BET3    TRAPP complex   10727015        "Affinity Capture-Western,Affinity Capture-MS"  GO:0030008      TRAPP complex   1
YML077W BET5    TRAPP complex
YDR108W GSG1    TRAPP complex
YGR166W KRE11   TRAPP complex

Here we have chained two command line tools: curl streams a text file from the internet, and its output is piped into head to show only the first 5 rows. An ideal Python script should follow the same principles. Imagine we wanted to replace head with a little script that transforms the text file so that, for each complex name (Name column), we report all the genes belonging to that complex. For instance:

> curl --silent "http://wodaklab.org/cyc2008/resources/CYC2008_complex.tab" | ./cyc2txt | head -n 5
SIR     YLR442C,YDL042C,YDR227W
SIP     YGL208W,YDR422C
PAC1    YGR078C,YDR488C
SIT     YDL047W
CPA     YJR109C,YOR303W
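A minimal sketch of what such a filter could look like, assuming a tab-separated input with a header row. The names group_genes and main, their parameters, and the choice of the Complex column as grouping key are illustrative assumptions: the actual cyc2txt script is not shown here and may group the genes differently.

```python
import collections
import io
import sys


def group_genes(lines, key_column='Complex', gene_column='ORF'):
    """Group gene identifiers by the value found in key_column.

    Hypothetical helper: the column names are assumptions
    based on the header shown above.
    """
    lines = iter(lines)
    header = next(lines).rstrip('\n').split('\t')
    key_idx = header.index(key_column)
    gene_idx = header.index(gene_column)
    groups = collections.OrderedDict()
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) <= max(key_idx, gene_idx):
            # skip empty or truncated lines
            continue
        groups.setdefault(fields[key_idx], []).append(fields[gene_idx])
    return groups


def main(in_stream=None, out_stream=None):
    # read from stdin and write to stdout unless told otherwise,
    # so that the script can sit in the middle of a pipe
    in_stream = sys.stdin if in_stream is None else in_stream
    out_stream = sys.stdout if out_stream is None else out_stream
    for name, genes in group_genes(in_stream).items():
        out_stream.write('{}\t{}\n'.format(name, ','.join(genes)))


# in-memory stand-in for the piped text stream
# (first rows shown above, truncated to three columns)
demo = io.StringIO('ORF\tName\tComplex\n'
                   'YKR068C\tBET3\tTRAPP complex\n'
                   'YML077W\tBET5\tTRAPP complex\n')
main(in_stream=demo)
```

Because main falls back to sys.stdin and sys.stdout, the same function works unchanged inside a pipe, in line with the "text streams" principle above.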

Parsing the command line

As shown in the example above, command line tools often accept options and even input files (e.g. head -n 5). Writing a command line argument parser that handles positional and optional arguments with the necessary flexibility, potentially with some checks on their type, is not trivial.



In [ ]:

    
def parse_args(cmd_line):
    Args = collections.namedtuple('Args',
                                  ['n', 'in_file'])
    n_trigger = False
    # default value for "n"
    n = 1
    for arg in cmd_line:
        if n_trigger:
            n = int(arg)
            n_trigger = False
            continue
        if arg == '-n':
            # next argument belongs to "-n"
            n_trigger = True
            continue
        else:
            # it must be the positional argument
            in_file = arg
    return Args(n=n, in_file=in_file)



In [ ]:

    
# imaginary command line
cmd_line = '-n 5 myfile.txt'
parse_args(cmd_line.split())



In [ ]:

    
# imaginary command line with multiple input files
cmd_line = '-n 5 myfile.txt another_one.txt'
parse_args(cmd_line.split())

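Note: in real life we would use the following strategy to read the arguments from the command line:

import sys
cmd_line = ' '.join(sys.argv[1:])

sys.argv[0] will be the name of the script, as called from the command line. Since sys.argv[1:] is already a list of arguments, it can also be passed to parse_args directly, which avoids breaking up arguments that contain spaces.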
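When simulating a command line from a string, as done in these examples, plain whitespace splitting is only an approximation of what the shell does. The sketch below compares str.split with shlex.split from the standard library; the command line itself is made up for illustration.

```python
import shlex

# imaginary command line with a quoted file name containing a space
cmd_line = '-n 5 "my file.txt"'

# naive whitespace splitting breaks the quoted file name in two
naive = cmd_line.split()
# shlex.split follows shell-like quoting rules
shell_like = shlex.split(cmd_line)

print(naive)
print(shell_like)
```

For the simple command lines used in this section both approaches give the same result, which is why str.split is good enough here.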

We need to extend our original function, to account for additional positional arguments. We'll also add an extra boolean option.



In [ ]:

    
def parse_args(cmd_line):
    Args = collections.namedtuple('Args',
                                  ['n',
                                   'verbose',
                                   'in_file',
                                   'another_file'])
    n_trigger = False
    # default value for "n"
    n = 1
    # default value for "verbose"
    verbose = False
    # list to hold the positional arguments
    positional = []
    for arg in cmd_line:
        if n_trigger:
            n = int(arg)
            n_trigger = False
            continue
        if arg == '-n':
            # next argument belongs to "-n"
            n_trigger = True
        elif arg == '--verbose' or arg == '-v':
            verbose = True
        else:
            # it must be the positional argument
            positional.append(arg)
    return Args(n=n,
                verbose=verbose,
                in_file=positional[0],
                another_file=positional[1])



In [ ]:

    
# imaginary command line with multiple input files
cmd_line = '-n 5 myfile.txt another_one.txt'
parse_args(cmd_line.split())

What if the --verbose option can be called multiple times to modulate the amount of verbosity of our script?



In [ ]:

    
def parse_args(cmd_line):
    Args = collections.namedtuple('Args',
                                  ['n',
                                   'verbose',
                                   'in_file',
                                   'another_file'])
    n_trigger = False
    # default value for "n"
    n = 1
    # default value for "verbose"
    verbose = 0
    # list to hold the positional arguments
    positional = []
    for arg in cmd_line:
        if n_trigger:
            n = int(arg)
            n_trigger = False
            continue
        if arg == '-n':
            # next argument belongs to "-n"
            n_trigger = True
        elif arg == '--verbose' or arg == '-v':
            verbose += 1
        else:
            # it must be the positional argument
            positional.append(arg)
    return Args(n=n,
                verbose=verbose,
                in_file=positional[0],
                another_file=positional[1])



In [ ]:

    
# imaginary command line with increased verbosity
cmd_line = '-n 5 -v -v myfile.txt another_one.txt'
parse_args(cmd_line.split())



In [ ]:

    
# by convention we can also increase verbosity in the following manner;
# our parser does not support it yet: "-vvv" is taken as a positional argument
cmd_line = '-n 5 -vvv myfile.txt another_one.txt'
parse_args(cmd_line.split())

Let's add this additional functionality. Hopefully you are starting to see how complicated and bug-prone writing your own command line parser is!



In [ ]:

    
def parse_args(cmd_line):
    Args = collections.namedtuple('Args',
                                  ['n',
                                   'verbose',
                                   'in_file',
                                   'another_file'])
    n_trigger = False
    # default value for "n"
    n = 1
    # default value for "verbose"
    verbose = 0
    # list to hold the positional arguments
    positional = []
    for arg in cmd_line:
        if n_trigger:
            n = int(arg)
            n_trigger = False
            continue
        if arg == '-n':
            # next argument belongs to "-n"
            n_trigger = True
        elif arg == '--verbose':
            verbose += 1
        elif arg.startswith('-') and set(arg[1:]) == {'v'}:
            # "-v", "-vv", "-vvv", ...: one verbosity level per "v"
            verbose += len(arg) - 1
        else:
            # it must be the positional argument
            positional.append(arg)
    return Args(n=n,
                verbose=verbose,
                in_file=positional[0],
                another_file=positional[1])



In [ ]:

    
# by convention we can also increase verbosity in the following manner
cmd_line = '-n 5 -vvv myfile.txt another_one.txt'
parse_args(cmd_line.split())

The `argparse` module

Python has a very useful module for creating scripts, and it is included in the standard library: argparse. It allows you to create command line parsers that are concise yet very flexible and powerful.

Let's rewrite our last example using argparse.



In [ ]:

    
def parse_args(cmd_line):
    parser = argparse.ArgumentParser(prog='fake_script',
                                     description='An argparse test')
    
    # positional arguments
    parser.add_argument('my_file',
                        help='My input file')
    parser.add_argument('another_file',
                        help='Another input file')
    
    # optional arguments
    parser.add_argument('-n',
                        type=int,
                        default=1,
                        help='Number of Ns [Default: 1]')
    parser.add_argument('-v', '--verbose',
                        action='count',
                        default=0,
                        help='Increase verbosity level')
    
    return parser.parse_args(cmd_line)



In [ ]:

    
# by convention we can also increase verbosity in the following manner
cmd_line = '-n 5 -vvv myfile.txt another_one.txt'
parse_args(cmd_line.split())

By declaring the type of the -n option, we let argparse check that the provided value can be converted to that type.



In [ ]:

    
# "-n" expects an integer: argparse reports the error and exits
cmd_line = '-n not_an_integer -vvv myfile.txt another_one.txt'
parse_args(cmd_line.split())

...and we also get an -h (help) option for free, already formatted!



In [ ]:

    
# show the automatically generated help message
cmd_line = '-h'
parse_args(cmd_line.split())
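The type keyword is not limited to built-in types: any callable that takes the raw string and returns the converted value can be used. Below is a minimal sketch, where positive_int and build_parser are hypothetical names introduced for illustration. It also shows that argparse signals invalid input by raising SystemExit, which is useful to know when experimenting in a notebook.

```python
import argparse


def positive_int(text):
    """Convert text to an int and reject values below 1 (hypothetical helper)."""
    try:
        value = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError('%r is not an integer' % text)
    if value < 1:
        raise argparse.ArgumentTypeError('%r is not a positive integer' % text)
    return value


def build_parser():
    parser = argparse.ArgumentParser(prog='fake_script',
                                     description='An argparse test')
    parser.add_argument('-n',
                        type=positive_int,
                        default=1,
                        help='Number of Ns [Default: 1]')
    return parser


parser = build_parser()
print(parser.parse_args(['-n', '5']))

# on invalid input argparse prints a usage message to stderr
# and raises SystemExit, which we catch here to inspect the exit code
try:
    parser.parse_args(['-n', '0'])
except SystemExit as error:
    print('exit code:', error.code)
```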

More `argparse` examples

Boolean arguments

Sometimes you want to add a parameter to your script that is a simple trigger and doesn't receive any value. The action keyword argument in argparse allows us to implement such behavior.

argparse's documentation has more examples on how to use the action argument.



In [ ]:

    
def parse_args(cmd_line):
    parser = argparse.ArgumentParser(prog='fake_script',
                                     description='An argparse example')
    
    # boolean option
    parser.add_argument('-f',
                        '--force',
                        action='store_true',
                        default=False,
                        help='Force file creation')
    
    return parser.parse_args(cmd_line)



In [ ]:

    
cmd_line = '-f'
parse_args(cmd_line.split())



In [ ]:

    
cmd_line = ''
parse_args(cmd_line.split())
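The opposite behavior, an option that switches something off, can be sketched with the store_false action. The --no-color flag and the color destination are made-up names used only for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog='fake_script',
                                     description='An argparse example')
    # hypothetical flag: "color" is True unless --no-color is given
    parser.add_argument('--no-color',
                        dest='color',
                        action='store_false',
                        help='Disable colored output')
    return parser


print(build_parser().parse_args([]))
print(build_parser().parse_args(['--no-color']))
```

The dest keyword decides the attribute name in the returned namespace, so the rest of the script can test options.color instead of a double negative.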

Multiple choices

Sometimes you would like not only to define a type for an option, but also to allow only certain values from a list.



In [ ]:

    
def parse_args(cmd_line):
    parser = argparse.ArgumentParser(prog='fake_script',
                                     description='An argparse example')
    
    # multiple choices optional arguments
    parser.add_argument('-m',
                        '--metric',
                        choices=['jaccard',
                                 'hamming'],
                        default='jaccard',
                        help='Distance metric [Default: jaccard]')
    parser.add_argument('-b',
                        '--bootstraps',
                        type=int,
                        choices=range(10, 21),
                        default=10,
                        help='Bootstraps [Default: 10]')
    
    return parser.parse_args(cmd_line)



In [ ]:

    
cmd_line = '-m euclidean'
parse_args(cmd_line.split())



In [ ]:

    
cmd_line = '-m hamming -b 15'
parse_args(cmd_line.split())
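Since argparse applies the type conversion before checking the value against choices, the two keywords can be combined. The sketch below uses type=str.lower to accept the metric name regardless of case; build_parser is a hypothetical name.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog='fake_script',
                                     description='An argparse example')
    parser.add_argument('-m',
                        '--metric',
                        # lower-case the value first, then check it
                        type=str.lower,
                        choices=['jaccard',
                                 'hamming'],
                        default='jaccard',
                        help='Distance metric [Default: jaccard]')
    return parser


options = build_parser().parse_args(['-m', 'Hamming'])
print(options.metric)
```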

Flexible number of arguments: nargs

In some cases you might want to have multiple values assigned to an option: the nargs keyword argument makes this possible.



In [ ]:

    
def parse_args(cmd_line):
    parser = argparse.ArgumentParser(prog='fake_script',
                                     description='An argparse example')
    
    parser.add_argument('fastq',
                        nargs='+',
                        help='Input fastq files')
    parser.add_argument('-m',
                        '--mate-pairs',
                        nargs='*',
                        help='Mate pairs fastq files')
    
    return parser.parse_args(cmd_line)



In [ ]:

    
cmd_line = 'r1.fq.gz r2.fq.gz'
parse_args(cmd_line.split())



In [ ]:

    
cmd_line = 'r1.fq.gz r2.fq.gz -m m1.fq.gz m2.fq.gz'
parse_args(cmd_line.split())



In [ ]:

    
cmd_line = '-m m1.fq.gz m2.fq.gz'
parse_args(cmd_line.split())



In [ ]:

    
cmd_line = '-h'
parse_args(cmd_line.split())
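Besides '+' and '*', nargs also accepts an integer (exactly that many values) and '?' (zero or one value). A minimal sketch follows; the --range and --output options are invented for the example.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog='fake_script',
                                     description='An argparse example')
    # exactly two values, collected in a list
    parser.add_argument('--range',
                        nargs=2,
                        type=int,
                        metavar=('START', 'END'),
                        help='Hypothetical interval to process')
    # zero or one value: "const" is used when the flag is given
    # without a value, "default" when the flag is absent
    parser.add_argument('-o',
                        '--output',
                        nargs='?',
                        const='out.txt',
                        default=None,
                        help='Hypothetical output file')
    return parser


parser = build_parser()
print(parser.parse_args([]))
print(parser.parse_args(['--range', '1', '10', '-o']))
print(parser.parse_args(['-o', 'results.txt']))
```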

One script to rule them all: subcommands

Some programs bundle more than one utility, which is handy if you don't want to remember a separate command for each task. Common command line examples are git (e.g. git commit and git push), and in bioinformatics there are many more (bwa, bedtools, samtools, ...). If you are developing a program that performs many related tasks, it might be a good idea to have them as functions/classes in a module, and call them through a single script with many subcommands.

Here's an example:



In [ ]:

    
def init(options):
    print('Init the project')
    print(options.name, options.description)
    
def add(options):
    print('Add an entry')
    print(options.ID, options.name,
          options.description, options.color)

def parse_args(cmd_line):
    parser = argparse.ArgumentParser(prog='fake_script',
                                     description='An argparse example')
    
    subparsers = parser.add_subparsers()
    
    parser_init = subparsers.add_parser('init',
                            help='Initialize the project')
    parser_init.add_argument('-n',
                             '--name',
                             default='Project',
                             help='Project name')
    parser_init.add_argument('-d',
                             '--description',
                             default='My project',
                             help='Project description')
    parser_init.set_defaults(func=init)
    
    parser_add = subparsers.add_parser('add',
                            help='Add an entry')
    parser_add.add_argument('ID',
                            help='Entry ID')
    parser_add.add_argument('-n',
                            '--name',
                            default='',
                            help='Entry name')
    parser_add.add_argument('-d',
                            '--description',
                            default='',
                            help='Entry description')
    parser_add.add_argument('-c',
                            '--color',
                            default='red',
                            help='Entry color')
    parser_add.set_defaults(func=add)
    
    return parser.parse_args(cmd_line)



In [ ]:

    
cmd_line = '-h'
parse_args(cmd_line.split())



In [ ]:

    
cmd_line = 'init -h'
parse_args(cmd_line.split())



In [ ]:

    
cmd_line = 'add -h'
parse_args(cmd_line.split())



In [ ]:

    
cmd_line = 'init -n my_project -d awesome'
options = parse_args(cmd_line.split())
options.func(options)



In [ ]:

    
cmd_line = 'add test -n entry1 -d my_entry'
options = parse_args(cmd_line.split())
options.func(options)
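One detail worth handling is the case where no subcommand is given at all. Assuming Python 3, parse_args then succeeds but the returned namespace has no func attribute, so options.func(options) would fail with an AttributeError. The sketch below shows one possible way to deal with it; run and build_parser are hypothetical names, and the init function is a reduced version of the one above.

```python
import argparse


def init(options):
    return 'Init the project {}'.format(options.name)


def build_parser():
    parser = argparse.ArgumentParser(prog='fake_script',
                                     description='An argparse example')
    # "dest" stores the name of the chosen subcommand
    subparsers = parser.add_subparsers(dest='command')
    parser_init = subparsers.add_parser('init',
                                        help='Initialize the project')
    parser_init.add_argument('-n',
                             '--name',
                             default='Project',
                             help='Project name')
    parser_init.set_defaults(func=init)
    return parser


def run(cmd_line):
    parser = build_parser()
    options = parser.parse_args(cmd_line)
    # without a subcommand there is no "func" attribute to call
    func = getattr(options, 'func', None)
    if func is None:
        parser.print_help()
        return None
    return func(options)


print(run(['init', '-n', 'my_project']))
run([])
```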

Logging

A good script keeps the user informed about "what's going on" during its execution. Given that a script might use the standard output to output its results, using the print function for status messages might not always be an option. In fact, it is good practice to write the execution messages to stderr, using the sys module. This allows you to redirect the stdout to a file or another program/script, while still being able to monitor the execution messages or to redirect them to a different file.



In [ ]:

    
sys.stderr.write('Running an imaginary analysis on the input genes\n')
# the result of our imaginary analysis
value = 400
# regular output of our imaginary script
print('\t'.join(['gene1', 'gene2', str(value)]))
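The print function can also write to stderr through its file keyword argument, which some find more readable than sys.stderr.write. A small sketch, where log and emit are hypothetical helper names:

```python
import sys


def log(message):
    # status messages go to stderr
    print(message, file=sys.stderr)


def emit(fields):
    # results go to stdout as a tab-separated line
    print('\t'.join(str(field) for field in fields))


log('Running an imaginary analysis on the input genes')
emit(['gene1', 'gene2', 400])
```

If these lines were saved in a script and run as ./script > results.txt, only the tab-separated line would end up in the file, while the status message would still appear in the terminal.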

It is also a good idea to return a non-zero exit code when the script encounters an error. By default, Python will return a non-zero exit code when the script ends because of an uncaught exception. If you are catching it and want to exit in a slightly more graceful way, you can use the sys.exit function.



In [ ]:

    
user_provided_value = 'a'
try:
    # this conversion fails: 'a' is not an integer
    parameter = int(user_provided_value)
except ValueError:
    sys.stderr.write('Invalid type provided\n')
    sys.exit(1)
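A common way to organize this, sketched below under the assumption that the script's logic lives in a function called main (a hypothetical name), is to let that function return the exit code and to call sys.exit only once, at the very end of the script. This keeps the function easy to test, since it never terminates the interpreter itself.

```python
import sys


def main(argv):
    """Return 0 on success and a non-zero code on failure."""
    try:
        parameter = int(argv[0])
    except (IndexError, ValueError):
        sys.stderr.write('Expected an integer argument\n')
        return 1
    print(parameter * 2)
    return 0


# in a real script the last line would be:
#     sys.exit(main(sys.argv[1:]))
# here we just show the returned codes
print('exit code:', main(['21']))
print('exit code:', main(['a']))
```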

If you want to be more flexible with your logging, you can use the logging module, present in python's standard library. It allows the user to:

redirect the logs to file and standard error at the same time
modulate the verbosity of the output
add custom formatters (including color with minimal tweaking)



In [ ]:

    
# create the logger
logger = logging.getLogger('fake_script')

# set the verbosity level
logger.setLevel(logging.DEBUG)

# we want the log to be redirected
# to std. err.
ch = logging.StreamHandler()
# we want a rich output with additional information
formatter = logging.Formatter('%(asctime)s - %(name)s - [%(levelname)s] - %(message)s',
                              '%Y-%m-%d %H:%M:%S')
ch.setFormatter(formatter)
logger.addHandler(ch)



In [ ]:

    
# debug message, will be shown given the level we have set
logger.debug('test')



In [ ]:

    
logger.setLevel(logging.WARNING)
# debug message, will not be shown given the level we have set
logger.debug('not-so-interesting debugging information')
# warning message, will be shown
logger.warning('this might break our script, but i\'m not sure')
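The first two capabilities listed above can be combined with the -v counter seen in the argparse examples. The sketch below is one possible arrangement, not a prescribed one: the mapping from the number of -v flags to a level, the function names, and the log file location (a temporary directory) are all assumptions made for illustration.

```python
import logging
import os
import tempfile


def verbosity_to_level(count):
    """Map the number of -v flags to a logging level (hypothetical mapping)."""
    if count <= 0:
        return logging.WARNING
    if count == 1:
        return logging.INFO
    return logging.DEBUG


def make_logger(name, verbosity, log_path):
    logger = logging.getLogger(name)
    logger.setLevel(verbosity_to_level(verbosity))
    formatter = logging.Formatter('%(asctime)s - %(name)s - [%(levelname)s] - %(message)s',
                                  '%Y-%m-%d %H:%M:%S')
    # one handler for standard error, one for the log file
    for handler in (logging.StreamHandler(), logging.FileHandler(log_path)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


# temporary location, used here only to keep the example self-contained
log_path = os.path.join(tempfile.mkdtemp(), 'fake_script.log')
logger = make_logger('fake_script_demo', verbosity=1, log_path=log_path)
logger.info('shown on stderr and written to the log file')
logger.debug('hidden at verbosity 1')
```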

The logging levels available are (in increasing order of severity):

DEBUG
INFO
WARNING
ERROR
CRITICAL

And as you might have imagined, there are corresponding functions to log messages with those levels of severity.
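A short sketch of those functions in action. The logger name levels_demo is arbitrary; the loop at the end shows that levels are plain integers, which is why a threshold set with setLevel can filter messages.

```python
import logging

logger = logging.getLogger('levels_demo')
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

# each level has a method with the same name, in lower case
logger.debug('debug message')  # below INFO: filtered out
logger.info('info message')
logger.warning('warning message')
logger.error('error message')
logger.critical('critical message')

# print the numeric value and name of each level, and whether
# this logger would emit a message of that level
for level in (logging.DEBUG, logging.INFO, logging.WARNING,
              logging.ERROR, logging.CRITICAL):
    print(level, logging.getLevelName(level), logger.isEnabledFor(level))
```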

Have a look at python's documentation for a more in-depth description of the module and its capabilities.

Script template

Find below a minimal script template, including a utility function to parse arguments and minimal logging (can also be found here).



In [ ]:

    
#!/usr/bin/env python
'''Description here'''

import logging
import argparse

def get_options():
    description = ''
    parser = argparse.ArgumentParser(description=description)

    parser.add_argument('name',
                        help='Name')

    return parser.parse_args()


def set_logging(level=logging.INFO):
    logger = logging.getLogger()
    logger.setLevel(level)
    ch = logging.StreamHandler()
    logger.addHandler(ch)
    return logger

if __name__ == "__main__":
    options = get_options()
    
    logger = set_logging()

You might be wondering what's the reason for having the if __name__ == "__main__": bit. The reasons are multiple:

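performance: the guard is usually paired with a main() function, and code running inside a function is slightly faster than code at the top level of the module, because local variables are cheaper to look up than global ones (the if block on its own does not create a new scope); this matters especially for scripts that need limited time to run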
your script might be part of a module and not always intended to be run as a script

For the latter reason, imagine your script is part of a Python module and contains a function that you want to reuse later. You only want the function, and you are not interested in running the rest of the script. By placing the top-level code of the script inside if __name__ == "__main__": you allow yourself (or your users) to use import to obtain the function of interest.

In fact, the main script gets __main__ as the value of the __name__ variable, while any imported script or module gets its own module name instead. That is the main reason why that (ugly) expression is commonly used in scripts.
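The behavior of __name__ can be checked directly. In the sketch below, double and main are hypothetical names, and the standard library's json module stands in for "any imported module".

```python
import json


def double(value):
    """A function worth reusing from other scripts."""
    return 2 * value


def main():
    print(double(21))


# an imported module sees its own name in __name__ ...
print(json.__name__)
# ... while the file being executed as a script sees "__main__"
print(__name__)

if __name__ == "__main__":
    # runs only when this file is executed, not when it is imported
    main()
```

Importing this file from another script would make double available without triggering the call to main.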