Python is a modern, open source, object-oriented programming language, created by a Dutch programmer, Guido van Rossum. It is an interpreted language: source code is compiled to bytecode when it is run, rather than in a separate build step. The reference implementation, CPython, is itself written in C. Frequently, it is compared to languages like Perl and Ruby. It offers much of the power and flexibility of lower-level (i.e. compiled) languages, without the steep learning curve, and without most of the associated debugging pitfalls. The language is very clean and readable, and it is available for almost every modern computing platform.
Python offers a number of advantages to scientists, for experienced and novice programmers alike:
Powerful and easy to use
Python is simultaneously powerful, flexible and easy to learn and use (in general, these qualities are traded off for a given programming language). Anything that can be coded in C, FORTRAN, or Java can be done in Python, almost always in fewer lines of code, and with fewer debugging headaches. Its standard library is extremely rich, including modules for string manipulation, regular expressions, file compression, mathematics, profiling and debugging (to name only a few). Unnecessary language constructs, such as END statements and brackets, are absent, making the code terse, efficient, and easy to read. Finally, Python is object-oriented, an important programming paradigm that is particularly well suited to scientific programming because it allows data structures to be abstracted in a natural way.
Interactive
Python may be run interactively on the command line, in much the same way as Octave or S-Plus/R. Rather than compiling and running a particular program, commands may be entered one at a time, each followed by the Return key. This is often useful for mathematical programming and debugging.
Extensible
Python is often referred to as a “glue” language, meaning that it is useful in a mixed-language environment. Frequently, programmers must interact with colleagues who work in other programming languages, or use significant quantities of legacy code that would be problematic or expensive to re-code. Python was designed to interact with other programming languages, and in many cases C or FORTRAN code can be compiled directly into Python programs (using utilities such as f2py or weave). Additionally, since Python is an interpreted language, it can sometimes be slow relative to its compiled cousins. In many cases this performance deficit is due to a short loop of code that runs thousands or millions of times. Such bottlenecks may be removed by coding a function in FORTRAN, C or Cython, and compiling it into a Python module.
Third-party modules
There is a vast body of Python modules created outside the auspices of the Python Software Foundation. These include utilities for database connectivity, mathematics, statistics, and charting/plotting. Some notables include:
The DataFrame class (from the pandas package) is useful for spreadsheet-like representation and manipulation of data. It also includes high-level plotting functionality.
Free and open
Python is released on all platforms under the Python Software Foundation License, an open source license that is compatible with the GNU General Public License, meaning that the language and its source are freely distributable. Not only does this keep costs down for scientists and universities operating under a limited budget, but it also frees programmers from licensing concerns for any software they may develop. There is often little reason to buy expensive licenses for software such as Matlab or Maple, when Python can provide much of the same functionality for free!
Here is a quick example of a Python program. We will call it stats.py, because Python programs typically end with the .py suffix. This code consists of some fake data and a function, mean, which calculates the mean; a companion function, var, which calculates the variance, is defined further along. Python can be internally documented by adding lines beginning with the # symbol, or with simple strings enclosed in quotation marks. Here is the code:
In [ ]:
# Import modules you might use
import numpy as np

# Some data, in a list
my_data = [12, 5, 17, 8, 9, 11, 21]

# Function for calculating the mean of some data
def mean(data):
    # Initialize sum to zero
    sum_x = 0.0
    # Loop over data
    for x in data:
        # Add to sum
        sum_x += x
    # Divide by number of elements in list, and return
    return sum_x / len(data)
Notice that, rather than using braces or keywords to enclose units of code (such as loops or conditional statements), Python simply uses indentation. This relieves the programmer from worrying about a stray bracket causing her program to crash. It also forces programmers to code in neat blocks, making programs easier to read. So, for the following snippet of code:
In [ ]:
sum_x = 0
# Loop over data
for x in my_data:
    # Add to sum
    sum_x += x
print(sum_x)
The first line initializes a variable to hold the sum, and the second initiates a loop, where each element in the data list is given the name x, and is used in the code that is indented below. The first line of subsequent code that is not indented signifies the end of the loop. It takes some getting used to, but works rather well.
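Indentation also nests. As a minimal sketch (the values list and the positive_total name are made up for illustration, not part of stats.py), a conditional inside a loop is simply indented one level further:

```python
# Hypothetical data, chosen only for illustration
values = [3, -1, 4, -1, 5]

positive_total = 0
for v in values:
    # This block belongs to the for loop
    if v > 0:
        # This deeper block belongs to the if statement
        positive_total += v

# Back at the left margin, so the loop has ended
print(positive_total)  # prints 12
```

Each block ends at the first line that returns to a shallower indentation level.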
Now let's call the function:
In [ ]:
mean(my_data)
Our implementation of mean is by no means the most efficient. Python provides some syntax and built-in functions to make things easier, and sometimes faster, as the following versions of mean and var show:
In [ ]:
# Function for calculating the mean of some data
def mean(data):
    # Call sum, then divide by the number of elements
    return sum(data)/len(data)

# Function for calculating variance of data
def var(data):
    # Get mean of data from function above
    x_bar = mean(data)
    # Do sum of squares in one line
    sum_squares = sum([(x - x_bar)**2 for x in data])
    # Divide by n-1 and return
    return sum_squares/(len(data)-1)
In the new implementation of mean, we use the built-in function sum to reduce the function to a single line. Similarly, var employs a list comprehension to make a more compact and efficient loop.
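One way to check hand-written functions like these is to compare them with the standard library's statistics module. The sketch below repeats the two definitions so that it runs on its own; it swaps the list comprehension for a generator expression, which is a stylistic variation rather than the version shown above.

```python
import math
import statistics

def mean(data):
    return sum(data) / len(data)

def var(data):
    x_bar = mean(data)
    # Generator expression: like the list comprehension, but without
    # building an intermediate list
    return sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)

my_data = [12, 5, 17, 8, 9, 11, 21]

# statistics.variance also divides by n - 1 (the sample variance)
print(math.isclose(mean(my_data), statistics.mean(my_data)))     # prints True
print(math.isclose(var(my_data), statistics.variance(my_data)))  # prints True
```

math.isclose is used instead of == because floating-point results computed in different ways may differ in the last digits.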
An alternative looping construct involves the map function. Suppose that we had a number of datasets, for each of which we want to calculate the mean:
In [ ]:
x = (45, 95, 100, 47, 92, 43)
y = (65, 73, 10, 82, 6, 23)
z = (56, 33, 110, 56, 86, 88)
datasets = (x,y,z)
In [ ]:
datasets
This can be done using a classical loop:
In [ ]:
means = []
for d in datasets:
    means.append(mean(d))
In [ ]:
means
Or, more succinctly, using map:
In [ ]:
list(map(mean, datasets))
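A list comprehension gives the same result as map and is often considered at least as readable. The sketch below redefines mean and the datasets so that it is self-contained; the names means_map and means_comp are introduced here only for the comparison.

```python
def mean(data):
    return sum(data) / len(data)

x = (45, 95, 100, 47, 92, 43)
y = (65, 73, 10, 82, 6, 23)
z = (56, 33, 110, 56, 86, 88)
datasets = (x, y, z)

# map applies mean to each dataset lazily; list() collects the results
means_map = list(map(mean, datasets))

# The equivalent list comprehension
means_comp = [mean(d) for d in datasets]

print(means_map == means_comp)  # prints True
```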
Similarly, we did not have to code these functions ourselves to get means and variances; the numpy package that we imported at the beginning has similar functions:
In [ ]:
np.mean(datasets, axis=1)
In the introduction above, you have already seen some of the important Python data structures, including integers, floating-point numbers, lists and tuples. It is worthwhile, however, to quickly introduce all of the built-in data structures relevant to everyday Python programming.
The simplest data structures are literals, which appear directly in programs, and include most simple strings and numbers:
In [ ]:
42 # Integer
0.002243 # Floating-point
5.0J # Imaginary
'foo'
"bar" # Several string types
s = """Multi-line
string"""
There are a handful of constants that exist in the built-in namespace. Importantly, there are the boolean values True and False:
In [ ]:
type(True)
Either of these can be negated using not.
In [ ]:
not False
In addition, there is a None value (of type NoneType) that represents the absence of a value.
In [ ]:
x = None
print(x)
All the arithmetic operators are available in Python:
In [ ]:
15/4
Compatibility Corner: Note that when using Python 2, you would get a different answer! Dividing an integer by an integer will yield another integer. Though this is "correct", it is not intuitive, and hence was changed in Python 3.
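In Python 3 the old integer behaviour is still available through the floor division operator. A short sketch of the division-related operators:

```python
print(15 / 4)    # true division: prints 3.75
print(15 // 4)   # floor division: prints 3
print(15 % 4)    # remainder (modulo): prints 3
print(-15 // 4)  # floor division rounds toward negative infinity: prints -4
```

The built-in divmod function returns the quotient and remainder together, so divmod(15, 4) gives (3, 3).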
Operator precedence can be enforced using parentheses:
In [ ]:
(14 - 5) * 4
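To see why the parentheses matter, compare the same expression with and without them. The variable names below are illustrative only:

```python
without_parens = 14 - 5 * 4   # multiplication binds first: 14 - 20
with_parens = (14 - 5) * 4    # parentheses force the subtraction first: 9 * 4
power_first = -2 ** 2         # exponentiation binds tighter than unary minus

print(without_parens)  # prints -6
print(with_parens)     # prints 36
print(power_first)     # prints -4
```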
There are several Python data structures that are used to encapsulate several elements in a set or sequence. The first of these is the tuple, an immutable sequence of elements enclosed in parentheses:
In [ ]:
(34,90,56) # Tuple with three elements
In [ ]:
(15,) # Tuple with one element
In [ ]:
(12, 'foobar') # Mixed tuple
Individual elements in a tuple can be accessed by indexing. This amounts to specifying the appropriate element index enclosed in square brackets following the tuple name:
In [ ]:
foo = (5,7,2,8,2,-1,0,4)
foo[0]
Notice that the index is zero-based, meaning that the first index is zero, rather than one (in contrast to R). So for foo above, an index of 5 retrieves the sixth item, not the fifth.
Two or more sequential elements can be indexed by slicing:
In [ ]:
foo[2:5]
This retrieves the third, fourth and fifth (but not the sixth!) elements -- i.e., up to, but not including, the final index. One may also slice or index starting from the end of a sequence, by using negative indices:
In [ ]:
foo[:-2]
As you can see, this returns all elements except the final two.
You can add an optional third element to the slice, which specifies a step value. For example, the following returns every other element of foo, starting with the second element of the tuple.
In [ ]:
foo[1::2]
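A few more slicing patterns, sketched with the same tuple (repeated here so that the snippet runs on its own), show how negative indices and steps combine:

```python
foo = (5, 7, 2, 8, 2, -1, 0, 4)

print(foo[-1])    # last element: prints 4
print(foo[-3:])   # last three elements: prints (-1, 0, 4)
print(foo[::2])   # every other element from the start: prints (5, 2, 2, 0)
print(foo[::-1])  # a negative step reverses: prints (4, 0, -1, 2, 8, 2, 7, 5)
```

Slicing always returns a new sequence and leaves the original untouched.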
The elements of a tuple, as defined above, are immutable. Therefore, Python takes offense if you try to change them:
In [ ]:
a = (1,2,3)
a[0] = 6
The TypeError is called an exception, which in this case indicates that you have tried to perform an action on a type that does not support it. We will learn about handling exceptions further along.
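As a brief preview, and only as a sketch, the error can be caught with try and except, and a "changed" tuple can be obtained by building a new one rather than modifying the original. The names message and b are illustrative.

```python
a = (1, 2, 3)

try:
    a[0] = 6
except TypeError as err:
    # The assignment failed, so a is unchanged
    message = str(err)
    print("Caught:", message)

# Build a new tuple instead of modifying the existing one
b = (6,) + a[1:]
print(a)  # prints (1, 2, 3)
print(b)  # prints (6, 2, 3)
```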
Finally, the tuple() function can create a tuple from any sequence:
In [ ]:
tuple('foobar')
Why does this happen? Because in Python, strings are considered a sequence of characters.
Lists are written with square brackets and, unlike tuples, are mutable:
In [ ]:
# List with five elements
[90, 43.7, 56, 1, -4]
In [ ]:
# List with one element
[100]
In [ ]:
# Empty list
[]
Elements of a list can be arbitrarily substituted by assigning new values to the associated index:
In [ ]:
bar = [5,8,4,2,7,9,4,1]
bar[3] = -5
bar
Operations on lists are somewhat unusual. For example, multiplying a list by an integer does not multiply each element by that integer, as you might expect, but rather:
In [ ]:
bar * 3
Which is simply three copies of the list, concatenated together. This is useful for generating lists with identical elements:
In [ ]:
[0]*10
(Incidentally, this works with tuples as well.)
In [ ]:
(3,)*10
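If what you actually want is to multiply each element, one option is a list comprehension (numpy arrays, not shown here, also multiply element-wise). A minimal sketch with an illustrative list:

```python
bar = [5, 8, 4, -5, 7, 9, 4, 1]

# Repetition: three copies of the list joined together
repeated = bar * 3

# Element-wise multiplication, written explicitly
tripled = [3 * value for value in bar]

print(len(repeated))  # prints 24
print(tripled)        # prints [15, 24, 12, -15, 21, 27, 12, 3]
```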
Since lists are mutable, they have several methods, some of which modify the list in place. For example:
In [ ]:
bar.extend(foo) # Adds foo to the end of bar (in-place)
bar
In [ ]:
bar.append(5) # Appends 5 to the end of bar
bar
In [ ]:
bar.insert(0, 4) # Inserts 4 at index 0
bar
In [ ]:
bar.remove(7) # Removes the first occurrence of 7
bar
In [ ]:
bar.remove(100) # Oops! Doesn’t exist
In [ ]:
bar.pop(4) # Removes and returns indexed item
In [ ]:
bar.reverse() # Reverses bar in place
bar
In [ ]:
bar.sort() # Sorts bar in place
bar
Some methods, however, do not change the list:
In [ ]:
bar.count(7) # Counts occurrences of 7 in bar
In [ ]:
bar.index(7) # Returns index of first 7 in bar
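The in-place methods return None rather than the modified list, which is a common source of confusion. When the original order must be kept, the built-in functions sorted and reversed produce new objects instead. A small sketch, using a made-up scores list:

```python
scores = [5, 8, 4, 2]

ordered = sorted(scores)            # new sorted list; scores is unchanged
backwards = list(reversed(scores))  # new reversed list; scores is unchanged
print(scores)     # prints [5, 8, 4, 2]
print(ordered)    # prints [2, 4, 5, 8]
print(backwards)  # prints [2, 4, 8, 5]

result = scores.sort()  # sorts in place and returns None
print(result)     # prints None
print(scores)     # prints [2, 4, 5, 8]
```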
One of the more flexible built-in data structures is the dictionary. A dictionary maps a set of keys to a collection of associated values. These mappings are mutable and, unlike lists or tuples, are not indexed by position (although since Python 3.7 they do preserve the order in which items were inserted). Hence, rather than using the sequence index to return elements of the collection, the corresponding key must be used. Dictionaries are specified by a comma-separated sequence of keys and values, which are separated in turn by colons. The dictionary is enclosed by curly braces.
For example:
In [ ]:
my_dict = {'a':16, 'b':(4,5), 'foo':'''(noun) a term used as a universal substitute
for something real, especially when discussing technological ideas and
problems'''}
my_dict
In [ ]:
my_dict['b']
Notice that a indexes an integer, b a tuple, and foo a string (now you know what foo means). Hence, a dictionary is a sort of associative array. Some languages refer to such a structure as a hash or key-value store.
Like lists, dictionaries are mutable and come with a variety of methods; several built-in functions and operators also accept dictionary arguments. For example:
In [ ]:
len(my_dict)
In [ ]:
# Checks to see if 'a' is a key in my_dict
'a' in my_dict
Some useful dictionary methods are:
In [ ]:
# Returns a copy of the dictionary
my_dict.copy()
In [ ]:
# Returns a view of the key/value pairs
my_dict.items()
In [ ]:
# Returns a view of the keys
my_dict.keys()
In [ ]:
# Returns a view of the values
my_dict.values()
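In Python 3 these three methods return view objects rather than lists: they can be looped over directly and they reflect later changes to the dictionary. The sketch below uses a hypothetical inventory dictionary to illustrate both points.

```python
inventory = {'apples': 3, 'pears': 5}

keys_view = inventory.keys()
inventory['plums'] = 2  # added after the view was created

# The view reflects the new key
print(list(keys_view))  # prints ['apples', 'pears', 'plums']

# Looping over key/value pairs with items()
lines = [f"{name}: {count}" for name, count in inventory.items()]
print(lines)  # prints ['apples: 3', 'pears: 5', 'plums: 2']
```

Wrapping a view in list() gives an independent snapshot when one is needed.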
When we try to look up a key that does not exist, it raises a KeyError.
In [ ]:
my_dict['c']
If we would rather not get the error, we can use the get method, which returns None if the key is not present.
In [ ]:
my_dict.get('c')
Custom return values can be specified with a second argument.
In [ ]:
my_dict.get('c', -1)
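A common use of the default argument is counting, since a key that has not yet been seen can be treated as having a count of zero. A minimal sketch (the letters string and counts dictionary are illustrative):

```python
letters = 'foobar'
counts = {}

for letter in letters:
    # get returns 0 the first time a letter is seen
    counts[letter] = counts.get(letter, 0) + 1

print(counts)  # prints {'f': 1, 'o': 2, 'b': 1, 'a': 1, 'r': 1}
```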
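It is easy to remove items from a dictionary. The popitem method removes and returns one key/value pair (the most recently added one, as of Python 3.7):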
In [ ]:
my_dict.popitem()
In [ ]:
# Empties dictionary
my_dict.clear()
my_dict
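To remove one specific key rather than an arbitrary pair or everything at once, the del statement and the pop method can be used. This sketch uses a made-up settings dictionary:

```python
settings = {'a': 16, 'b': (4, 5), 'c': 'spam'}

del settings['a']                    # removes the key; KeyError if it is absent
removed = settings.pop('b')          # removes the key and returns its value
missing = settings.pop('zzz', None)  # a default avoids KeyError for absent keys

print(removed)   # prints (4, 5)
print(missing)   # prints None
print(settings)  # prints {'c': 'spam'}
```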
Sets are unordered collections of unique elements, written with curly braces:
In [ ]:
my_set = {4, 5, 5, 7, 8}
my_set
We can also use the set constructor.
In [ ]:
empty_set = set()
empty_set
In [ ]:
empty_set.add(-5)
another_set = empty_set
another_set
As we would expect, we can perform set operations.
In [ ]:
my_set | another_set
In [ ]:
my_set & another_set
In [ ]:
my_set - {4}
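A few further set operations, sketched with two illustrative sets named left and right:

```python
left = {4, 5, 7, 8}
right = {-5, 4}

# Symmetric difference: elements in exactly one of the two sets
sym = left ^ right
print(sym == {-5, 5, 7, 8})  # prints True

# Subset test
print({4, 5} <= left)  # prints True

# True when the two sets share no elements
print(left.isdisjoint({1, 2}))  # prints True
```

The results are compared with == rather than printed directly because the display order of set elements is not guaranteed.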
The set function is useful for returning the unique elements of a data structure. For example, recall bar:
In [ ]:
bar
In [ ]:
set(bar)
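Because a set does not keep track of position, the original ordering is lost. Two common follow-ups are sketched below using an illustrative list (redefined here so the snippet stands alone): sorting the unique values, or keeping them in order of first appearance via dict.fromkeys, which relies on dictionaries preserving insertion order (Python 3.7 and later).

```python
bar = [4, 5, 8, 4, 9, 4, 1, 5]

# Unique values, sorted
unique_sorted = sorted(set(bar))
print(unique_sorted)  # prints [1, 4, 5, 8, 9]

# Unique values, in order of first appearance
unique_in_order = list(dict.fromkeys(bar))
print(unique_in_order)  # prints [4, 5, 8, 9, 1]
```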
Bassi S (2007) A Primer on Python for Life Science Researchers. PLoS Comput Biol 3(11): e199.