In [ ]:
import sys
import logging
import argparse
import collections
Unlike in R, in python it is extremely easy to create handy scripts. These are very useful when working from the command line and/or in HPC (high performance computing) environments.
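When writing a script, it's always a good idea to follow the Unix philosophy, which emphasizes simplicity, interoperability and modularity over overengineering. In short: write programs that do one thing and do it well, write programs that work together, and write programs that handle text streams, because that is a universal interface.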
If you have even a basic knowledge of the bash (or bash-like) command line, you are probably already familiar with these concepts. Consider the following example:
> curl --silent "http://wodaklab.org/cyc2008/resources/CYC2008_complex.tab" | head -n 5
ORF Name Complex PubMed_id Method GO_id GO_term Jaccard_Index
YKR068C BET3 TRAPP complex 10727015 "Affinity Capture-Western,Affinity Capture-MS" GO:0030008 TRAPP complex 1
YML077W BET5 TRAPP complex
YDR108W GSG1 TRAPP complex
YGR166W KRE11 TRAPP complex
Here we have chained two command line tools: curl, to stream a text file from the internet, and head, which receives that stream through a pipe and shows only its first 5 rows. An ideal python script should follow the same principles. Imagine we wanted to replace head with a little script that transforms the text file so that for each complex name (Name column) we report all the genes belonging to that complex. For instance:
> curl --silent "http://wodaklab.org/cyc2008/resources/CYC2008_complex.tab" | ./cyc2txt | head -n 5
SIR YLR442C,YDL042C,YDR227W
SIP YGL208W,YDR422C
PAC1 YGR078C,YDR488C
SIT YDL047W
CPA YJR109C,YOR303W
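A minimal sketch of what such a filter could look like is shown below. It is not the actual cyc2txt script: the function names and the column positions (ORF in the first column, complex in the third, tab-separated input with one header line) are assumptions made for illustration.

```python
import collections
import sys


def group_by_complex(lines, key_col=2, gene_col=0):
    """Map each complex to the list of genes that belong to it.

    The column indices are assumptions about the input layout.
    """
    complexes = collections.OrderedDict()
    for i, line in enumerate(lines):
        if i == 0:
            # skip the header line
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) <= max(key_col, gene_col):
            # skip malformed or empty lines
            continue
        complexes.setdefault(fields[key_col], []).append(fields[gene_col])
    return complexes


def main(in_stream=sys.stdin, out_stream=sys.stdout):
    # read from stdin and write to stdout, so the script can be piped
    for name, genes in group_by_complex(in_stream).items():
        out_stream.write('{}\t{}\n'.format(name, ','.join(genes)))
```

Because the function reads from a stream and writes to a stream, it can sit in the middle of a pipeline exactly like head does.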
As shown in the example above, command line tools often accept options and even input files (e.g. head -n 5). Writing a command line argument parser that handles positional and optional arguments with the necessary flexibility, potentially with some checks on their type, is not trivial.
In [ ]:
def parse_args(cmd_line):
    Args = collections.namedtuple('Args',
                                  ['n', 'in_file'])
    n_trigger = False
    # default value for "n"
    n = 1
    for arg in cmd_line:
        if n_trigger:
            n = int(arg)
            n_trigger = False
            continue
        if arg == '-n':
            # next argument belongs to "-n"
            n_trigger = True
            continue
        else:
            # it must be the positional argument
            in_file = arg
    return Args(n=n, in_file=in_file)
In [ ]:
# imaginary command line
cmd_line = '-n 5 myfile.txt'
parse_args(cmd_line.split())
In [ ]:
# imaginary command line with multiple input files
cmd_line = '-n 5 myfile.txt another_one.txt'
parse_args(cmd_line.split())
Note: in real life we would use the following strategy to read the arguments from the command line:
import sys
cmd_line = ' '.join(sys.argv[1:])
sys.argv[0] is the name of the script, as called from the command line, so the actual arguments start at sys.argv[1].
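A hedged sketch of how this often looks inside a script: the parser takes a list of arguments, and the script hands it sys.argv[1:] directly. This avoids the join/split round trip, so an argument containing spaces (already isolated by the shell) stays in one piece. The names toy_parse and main are hypothetical and used only for illustration.

```python
import sys


def toy_parse(args):
    """Toy parser: return (n, positional) from a list of arguments."""
    n = 1
    positional = []
    args = list(args)
    while args:
        arg = args.pop(0)
        if arg == '-n' and args:
            # the next item is the value of "-n"
            n = int(args.pop(0))
        else:
            positional.append(arg)
    return n, positional


def main(argv=None):
    if argv is None:
        # sys.argv[0] is the script name, the rest are the arguments
        argv = sys.argv[1:]
    return toy_parse(argv)
```

Accepting an optional argv list also makes the function easy to call from a notebook or a test, without touching the real command line.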
We need to extend our original function, to account for additional positional arguments. We'll also add an extra boolean option.
In [ ]:
def parse_args(cmd_line):
    Args = collections.namedtuple('Args',
                                  ['n',
                                   'verbose',
                                   'in_file',
                                   'another_file'])
    n_trigger = False
    # default value for "n"
    n = 1
    # default value for "verbose"
    verbose = False
    # list to hold the positional arguments
    positional = []
    for arg in cmd_line:
        if n_trigger:
            n = int(arg)
            n_trigger = False
            continue
        if arg == '-n':
            # next argument belongs to "-n"
            n_trigger = True
        elif arg == '--verbose' or arg == '-v':
            verbose = True
        else:
            # it must be the positional argument
            positional.append(arg)
    return Args(n=n,
                verbose=verbose,
                in_file=positional[0],
                another_file=positional[1])
In [ ]:
# imaginary command line with multiple input files
cmd_line = '-n 5 myfile.txt another_one.txt'
parse_args(cmd_line.split())
What if the --verbose option can be given multiple times to modulate the verbosity of our script?
In [ ]:
def parse_args(cmd_line):
    Args = collections.namedtuple('Args',
                                  ['n',
                                   'verbose',
                                   'in_file',
                                   'another_file'])
    n_trigger = False
    # default value for "n"
    n = 1
    # default value for "verbose"
    verbose = 0
    # list to hold the positional arguments
    positional = []
    for arg in cmd_line:
        if n_trigger:
            n = int(arg)
            n_trigger = False
            continue
        if arg == '-n':
            # next argument belongs to "-n"
            n_trigger = True
        elif arg == '--verbose' or arg == '-v':
            verbose += 1
        else:
            # it must be the positional argument
            positional.append(arg)
    return Args(n=n,
                verbose=verbose,
                in_file=positional[0],
                another_file=positional[1])
In [ ]:
# imaginary command line with increased verbosity
cmd_line = '-n 5 -v -v myfile.txt another_one.txt'
parse_args(cmd_line.split())
In [ ]:
# by convention we can also increase verbosity in the following manner
cmd_line = '-n 5 -vvv myfile.txt another_one.txt'
parse_args(cmd_line.split())
Let's add this additional functionality. Hopefully you are starting to see how complicated and bug-prone writing your own command line parser is!
In [ ]:
def parse_args(cmd_line):
    Args = collections.namedtuple('Args',
                                  ['n',
                                   'verbose',
                                   'in_file',
                                   'another_file'])
    n_trigger = False
    # default value for "n"
    n = 1
    # default value for "verbose"
    verbose = 0
    # list to hold the positional arguments
    positional = []
    for arg in cmd_line:
        if n_trigger:
            n = int(arg)
            n_trigger = False
            continue
        if arg == '-n':
            # next argument belongs to "-n"
            n_trigger = True
        elif arg == '--verbose' or arg == '-v' or arg.startswith('-v'):
            if arg.startswith('-v') and len(arg) > 2 and len({char for char in arg[1:]}) == 1:
                verbose += len(arg[1:])
            else:
                verbose += 1
        else:
            # it must be the positional argument
            positional.append(arg)
    return Args(n=n,
                verbose=verbose,
                in_file=positional[0],
                another_file=positional[1])
In [ ]:
# by convention we can also increase verbosity in the following manner
cmd_line = '-n 5 -vvv myfile.txt another_one.txt'
parse_args(cmd_line.split())
argparse module
Python has a very useful module to create scripts, and it is included in the standard library: argparse. It allows you to create command line parsers that are concise yet very flexible and powerful.
Let's rewrite our last example using argparse.
In [ ]:
def parse_args(cmd_line):
    parser = argparse.ArgumentParser(prog='fake_script',
                                     description='An argparse test')

    # positional arguments
    parser.add_argument('my_file',
                        help='My input file')
    parser.add_argument('another_file',
                        help='Another input file')

    # optional arguments
    parser.add_argument('-n',
                        type=int,
                        default=1,
                        help='Number of Ns [Default: 1]')
    parser.add_argument('-v', '--verbose',
                        action='count',
                        default=0,
                        help='Increase verbosity level')

    return parser.parse_args(cmd_line)
In [ ]:
# by convention we can also increase verbosity in the following manner
cmd_line = '-n 5 -vvv myfile.txt another_one.txt'
parse_args(cmd_line.split())
By declaring the type of the -n option, we get its value validated and converted for free.
In [ ]:
# a value for -n that cannot be converted to int is rejected with an error
cmd_line = '-n not_an_integer -vvv myfile.txt another_one.txt'
parse_args(cmd_line.split())
...and we also get an -h (help) option for free, already formatted!
In [ ]:
# ask for the automatically generated help message
cmd_line = '-h'
parse_args(cmd_line.split())
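The type keyword argument is not limited to built-in types such as int: it accepts any callable that takes the raw string and returns the converted value. As a sketch, the hypothetical positive_int function below (not part of argparse) rejects values smaller than 1 by raising argparse.ArgumentTypeError, which argparse turns into a regular usage error.

```python
import argparse


def positive_int(value):
    """Convert value to int and reject anything below 1."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError('%r is not an integer' % value)
    if number < 1:
        raise argparse.ArgumentTypeError('%r is not a positive integer' % value)
    return number


def parse_positive(cmd_line):
    parser = argparse.ArgumentParser(prog='fake_script')
    parser.add_argument('-n', type=positive_int, default=1,
                        help='Number of lines, must be >= 1 [Default: 1]')
    return parser.parse_args(cmd_line)
```

Calling parse_positive('-n 0'.split()) prints the usage message and exits, just like the non-integer value did with type=int.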
argparse examples
Boolean arguments
Sometimes you want to add a parameter to your script that is a simple trigger and doesn't receive any value. The action keyword argument in argparse allows us to implement such behavior.
argparse's documentation has more examples on how to use the action argument.
In [ ]:
def parse_args(cmd_line):
    parser = argparse.ArgumentParser(prog='fake_script',
                                     description='An argparse example')

    # boolean option
    parser.add_argument('-f',
                        '--force',
                        action='store_true',
                        default=False,
                        help='Force file creation')

    return parser.parse_args(cmd_line)
In [ ]:
cmd_line = '-f'
parse_args(cmd_line.split())
In [ ]:
cmd_line = ''
parse_args(cmd_line.split())
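Two other values of action that tend to be handy are sketched below: 'store_false', the mirror image of 'store_true', and 'append', which collects every occurrence of an option in a list. The option names (--no-header, --exclude) and the function name parse_actions are made up for this example.

```python
import argparse


def parse_actions(cmd_line):
    parser = argparse.ArgumentParser(prog='fake_script')
    # 'store_false' is the mirror image of 'store_true':
    # the value is True unless the flag is given
    parser.add_argument('--no-header', dest='header',
                        action='store_false',
                        help='Do not print the header')
    # 'append' collects every occurrence of the option in a list
    parser.add_argument('-e', '--exclude', action='append', default=[],
                        help='Gene to exclude (can be repeated)')
    return parser.parse_args(cmd_line)
```

For instance, parse_actions('--no-header -e YDL047W -e YGL208W'.split()) gives header=False and a two-element exclude list.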
Multiple choices
Sometimes you would like not only to define a type for an option, but also to allow only certain values from a list.
In [ ]:
def parse_args(cmd_line):
    parser = argparse.ArgumentParser(prog='fake_script',
                                     description='An argparse example')

    # multiple choices optional arguments
    parser.add_argument('-m',
                        '--metric',
                        choices=['jaccard',
                                 'hamming'],
                        default='jaccard',
                        help='Distance metric [Default: jaccard]')
    parser.add_argument('-b',
                        '--bootstraps',
                        type=int,
                        choices=range(10, 21),
                        default=10,
                        help='Bootstraps [Default: 10]')

    return parser.parse_args(cmd_line)
In [ ]:
cmd_line = '-m euclidean'
parse_args(cmd_line.split())
In [ ]:
cmd_line = '-m hamming -b 15'
parse_args(cmd_line.split())
Flexible number of arguments: nargs
In some cases you might want to have multiple values assigned to an option: for that, the nargs keyword argument is a flexible solution.
In [ ]:
def parse_args(cmd_line):
    parser = argparse.ArgumentParser(prog='fake_script',
                                     description='An argparse example')

    parser.add_argument('fastq',
                        nargs='+',
                        help='Input fastq files')
    parser.add_argument('-m',
                        '--mate-pairs',
                        nargs='*',
                        help='Mate pairs fastq files')

    return parser.parse_args(cmd_line)
In [ ]:
cmd_line = 'r1.fq.gz r2.fq.gz'
parse_args(cmd_line.split())
In [ ]:
cmd_line = 'r1.fq.gz r2.fq.gz -m m1.fq.gz m2.fq.gz'
parse_args(cmd_line.split())
In [ ]:
cmd_line = '-m m1.fq.gz m2.fq.gz'
parse_args(cmd_line.split())
In [ ]:
cmd_line = '-h'
parse_args(cmd_line.split())
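Besides '+' (one or more) and '*' (zero or more), nargs also accepts an integer for a fixed number of values and '?' for an optional single value. The sketch below illustrates both; the option names (--range, --log) and the run.log file name are assumptions chosen only for the example.

```python
import argparse


def parse_nargs(cmd_line):
    parser = argparse.ArgumentParser(prog='fake_script')
    # exactly two values, returned as a list
    parser.add_argument('--range', nargs=2, type=int,
                        metavar=('START', 'END'),
                        help='Region to consider')
    # zero or one value: "default" is used when the option is absent,
    # "const" when it is given without a value
    parser.add_argument('--log', nargs='?', const='run.log', default=None,
                        help='Log file [run.log if no name is given]')
    return parser.parse_args(cmd_line)
```

With this parser, --log alone selects run.log, --log x.log selects x.log, and omitting the option leaves the value at None.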
One script to rule them all: subcommands
Some programs bundle more than one utility, which is handy if you don't want to remember many separate commands and their options. Common command line examples are git (e.g. git commit and git push), and in bioinformatics there are many more (bwa, bedtools, samtools, ...). If you are developing a program that performs many related tasks, it might be a good idea to have them as functions/classes in a module, and call them through a single script with many subcommands.
Here's an example:
In [ ]:
def init(options):
    print('Init the project')
    print(options.name, options.description)


def add(options):
    print('Add an entry')
    print(options.ID, options.name,
          options.description, options.color)


def parse_args(cmd_line):
    parser = argparse.ArgumentParser(prog='fake_script',
                                     description='An argparse example')

    subparsers = parser.add_subparsers()

    parser_init = subparsers.add_parser('init',
                                        help='Initialize the project')
    parser_init.add_argument('-n',
                             '--name',
                             default='Project',
                             help='Project name')
    parser_init.add_argument('-d',
                             '--description',
                             default='My project',
                             help='Project description')
    parser_init.set_defaults(func=init)

    parser_add = subparsers.add_parser('add',
                                       help='Add an entry')
    parser_add.add_argument('ID',
                            help='Entry ID')
    parser_add.add_argument('-n',
                            '--name',
                            default='',
                            help='Entry name')
    parser_add.add_argument('-d',
                            '--description',
                            default='',
                            help='Entry description')
    parser_add.add_argument('-c',
                            '--color',
                            default='red',
                            help='Entry color')
    parser_add.set_defaults(func=add)

    return parser.parse_args(cmd_line)
In [ ]:
cmd_line = '-h'
parse_args(cmd_line.split())
In [ ]:
cmd_line = 'init -h'
parse_args(cmd_line.split())
In [ ]:
cmd_line = 'add -h'
parse_args(cmd_line.split())
In [ ]:
cmd_line = 'init -n my_project -d awesome'
options = parse_args(cmd_line.split())
options.func(options)
In [ ]:
cmd_line = 'add test -n entry1 -d my_entry'
options = parse_args(cmd_line.split())
options.func(options)
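One caveat: in Python 3 subcommands are optional by default, so calling the parser above with no subcommand returns options without a func attribute, and options.func(options) fails with an AttributeError. A possible way around it, sketched below under the assumption of Python 3.7 or later, is to pass required=True (together with dest) to add_subparsers. The function name parse_required is illustrative.

```python
import argparse


def parse_required(cmd_line):
    parser = argparse.ArgumentParser(prog='fake_script')
    # "dest" stores the chosen subcommand name; "required" makes
    # argparse exit with an error when no subcommand is given
    subparsers = parser.add_subparsers(dest='command', required=True)
    subparsers.add_parser('init', help='Initialize the project')
    parser_add = subparsers.add_parser('add', help='Add an entry')
    parser_add.add_argument('ID', help='Entry ID')
    return parser.parse_args(cmd_line)
```

With this variant, an empty command line produces the usual usage message and a non-zero exit code instead of a traceback later on.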
A good script keeps the user informed about "what's going on" during its execution. Given that a script might use the standard output to emit its results, using the print function for such messages might not always be an option. In fact, it is good practice to write at least the execution messages to stderr, using the sys module. This allows you to redirect stdout to a file or to another program/script, while still being able to monitor the execution messages or to redirect them to a different file.
In [ ]:
sys.stderr.write('Running an imaginary analysis on the input genes\n')
# the result of our imaginary analysis
value = 400
# regular output of our imaginary script
print('\t'.join(['gene1', 'gene2', str(value)]))
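The same separation can also be obtained with print alone, through its file keyword argument. A minimal sketch follows; the function name report and its out/err parameters are illustrative and exist only to make the two streams explicit.

```python
import sys


def report(value, out=None, err=None):
    # resolve the streams at call time, so redirections are honoured
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    # execution message: goes to stderr
    print('Running an imaginary analysis on the input genes', file=err)
    # result: goes to stdout
    print('\t'.join(['gene1', 'gene2', str(value)]), file=out)
```

In a script using this approach, a call such as ./script > results.tsv 2> messages.log would then send the results and the messages to two different files.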
It is also a good idea to return a non-zero exit code when the script encounters an error. By default, python will return a non-zero exit code when the script ends because of an uncaught exception. If you are catching it and want to exit in a slightly more graceful way, you can use the sys.exit function.
In [ ]:
user_provided_value = 'a'
try:
    # impossible
    parameter = int(user_provided_value)
except ValueError:
    sys.stderr.write('Invalid type provided\n')
    sys.exit(1)
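A common way to organise this, shown here only as a sketch, is to let a single function return the exit code and to call sys.exit once, at the very end of the script. The function name run and the doubling "analysis" are placeholders.

```python
import sys


def run(argv):
    """Return 0 on success, 1 if the argument is missing or not an integer."""
    try:
        parameter = int(argv[0])
    except (IndexError, ValueError):
        sys.stderr.write('Invalid or missing value provided\n')
        return 1
    # placeholder for the real work of the script
    print(parameter * 2)
    return 0

# in a real script the last line would be:
# sys.exit(run(sys.argv[1:]))
```

The shell can then react to the failure, for example with ./script a || echo "failed".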
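If you want to be more flexible with your logging, you can use the logging module, present in python's standard library. Among other things, it allows the user to set the verbosity level, choose where the messages are sent, and control their format: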
In [ ]:
# create the logger
logger = logging.getLogger('fake_script')
# set the verbosity level
logger.setLevel(logging.DEBUG)
# we want the log to be redirected
# to std. err.
ch = logging.StreamHandler()
# we want a rich output with additional information
formatter = logging.Formatter('%(asctime)s - %(name)s - [%(levelname)s] - %(message)s',
'%Y-%m-%d %H:%M:%S')
ch.setFormatter(formatter)
logger.addHandler(ch)
In [ ]:
# debug message, will be shown given the level we have set
logger.debug('test')
In [ ]:
logger.setLevel(logging.WARNING)
# debug message, will not be shown given the level we have set
logger.debug('not-so-interesting debugging information')
# warning message, will be shown
logger.warning('this might break our script, but i\'m not sure')
The logging levels available are (in order of increasing severity): DEBUG, INFO, WARNING, ERROR and CRITICAL.
And as you might have imagined, there are corresponding functions to log messages with those levels of severity.
Have a look at python's documentation for a more in-depth description of the module and its capabilities.
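The logging levels combine naturally with the -v/--verbose counter seen earlier. The sketch below shows one possible mapping (no flag: WARNING, -v: INFO, -vv or more: DEBUG); both the mapping and the function names are assumptions, not a convention imposed by either module.

```python
import argparse
import logging


def verbosity_to_level(count):
    """Translate the number of -v flags into a logging level."""
    levels = [logging.WARNING, logging.INFO, logging.DEBUG]
    # anything beyond -vv is capped at DEBUG
    return levels[min(count, len(levels) - 1)]


def parse_verbosity(cmd_line):
    parser = argparse.ArgumentParser(prog='fake_script')
    parser.add_argument('-v', '--verbose', action='count', default=0,
                        help='Increase verbosity level')
    return parser.parse_args(cmd_line)
```

The returned value can be passed straight to logger.setLevel, for example logger.setLevel(verbosity_to_level(options.verbose)).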
Find below a minimal script template, including a utility function to parse arguments and minimal logging (can also be found here).
In [ ]:
#!/usr/bin/env python
'''Description here'''

import logging
import argparse


def get_options():
    description = ''
    parser = argparse.ArgumentParser(description=description)

    parser.add_argument('name',
                        help='Name')

    return parser.parse_args()


def set_logging(level=logging.INFO):
    logger = logging.getLogger()
    logger.setLevel(level)
    ch = logging.StreamHandler()
    logger.addHandler(ch)
    return logger


if __name__ == "__main__":
    options = get_options()
    logger = set_logging()
You might be wondering what the reason is for having the if __name__ == "__main__": bit. The reasons are multiple; an important one is reuse. Imagine your script is part of a python module and contains a function that you want to reuse later. You only want the function, and you are not interested in running the rest of the script. By encapsulating the top-level logic of the script inside if __name__ == "__main__": you allow yourself (or your users) to use import to obtain the function of interest.
In fact, the main script gets __main__ as the value of the __name__ variable, while any imported script or module gets a different value for __name__ (its own module name). That is the main reason why that (ugly) expression is commonly used in scripts.
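To make this concrete, here is a small hypothetical file, imagined to be saved as seqtools.py (the file name and the function are invented for the example). Running it as a script executes the guarded block, while import seqtools only makes reverse_complement available.

```python
#!/usr/bin/env python
'''Hypothetical module, imagined to be saved as seqtools.py'''


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    pairs = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    return ''.join(pairs[base] for base in reversed(sequence.upper()))


if __name__ == "__main__":
    # executed only when the file is run as a script,
    # not when it is imported with "import seqtools"
    print(reverse_complement('ACGGT'))
```

Another script could then do from seqtools import reverse_complement without triggering the print call.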