Introduction to Python

What is Python?

Python is a modern, open source, object-oriented programming language, created by a Dutch programmer, Guido van Rossum. It is usually described as an interpreted scripting language: source code is not compiled ahead of time, but is translated to bytecode and executed when the program is run. The reference implementation, CPython, is itself written in C. Python is frequently compared to languages like Perl and Ruby. It offers much of the power and flexibility of lower-level (i.e. compiled) languages, without the steep learning curve and without most of the associated debugging pitfalls. The language is very clean and readable, and it is available for almost every modern computing platform.

Why use Python for scientific programming?

Python offers a number of advantages to scientists, experienced and novice programmers alike:

Powerful and easy to use
Python is simultaneously powerful, flexible and easy to learn and use (qualities that most programming languages trade off against one another). Anything that can be coded in C, FORTRAN, or Java can be done in Python, almost always in fewer lines of code, and with fewer debugging headaches. Its standard library is extremely rich, including modules for string manipulation, regular expressions, file compression, mathematics, profiling and debugging (to name only a few). Unnecessary language constructs, such as END statements and braces, are absent, making the code terse, efficient, and easy to read. Finally, Python is object-oriented, an important programming paradigm that is particularly well-suited to scientific programming because it allows data structures to be abstracted in a natural way.

Interactive
Python may be run interactively on the command line, in much the same way as Octave or S-Plus/R. Rather than compiling and running a complete program, commands may be entered one at a time, each followed by the Return key. This is often useful for mathematical programming and debugging.

Extensible
Python is often referred to as a “glue” language, meaning that it is useful in a mixed-language environment. Frequently, programmers must interact with colleagues who work in other programming languages, or use significant quantities of legacy code that would be problematic or expensive to re-code. Python was designed to interact with other programming languages, and in many cases C or FORTRAN code can be compiled directly into Python programs (using utilities such as f2py or weave). Additionally, since Python is an interpreted language, it can sometimes be slow relative to its compiled cousins. In many cases this performance deficit is due to a short loop of code that runs thousands or millions of times. Such bottlenecks may be removed by coding a function in FORTRAN, C or Cython, and compiling it into a Python module.

Third-party modules
There is a vast body of Python modules created outside the auspices of the Python Software Foundation. These include utilities for database connectivity, mathematics, statistics, and charting/plotting. Some notables include:

NumPy: Numerical Python (NumPy) is a set of extensions that provides the ability to specify and manipulate array data structures. It provides array manipulation and computational capabilities similar to those found in Matlab or Octave.
SciPy: An open source library of scientific tools for Python, SciPy supplements the NumPy module. SciPy gathers a variety of high-level science and engineering modules together as a single package. SciPy includes modules for graphics and plotting, optimization, integration, special functions, signal and image processing, genetic algorithms, ODE solvers, and others.
Matplotlib: Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Its syntax is very similar to that of Matlab.
Pandas: A module that provides high-performance, easy-to-use data structures and data analysis tools. In particular, the DataFrame class is useful for spreadsheet-like representation and manipulation of data. It also includes high-level plotting functionality.
IPython: An enhanced Python shell, designed to increase the efficiency and usability of coding, testing and debugging Python. It includes both a Qt-based console and an interactive HTML notebook interface, both of which feature multiline editing, interactive plotting and syntax highlighting.

Free and open
Python is released on all platforms under the Python Software Foundation License, a permissive open source license that is compatible with the GNU General Public License, meaning that the language and its source are freely distributable. Not only does this keep costs down for scientists and universities operating under a limited budget, but it also frees programmers from licensing concerns for any software they may develop. There is little reason to buy expensive licenses for software such as Matlab or Maple, when Python can provide much of the same functionality for free!

Sample code: mean and standard deviation

Here is a quick example of a Python program. We will call it stats.py, because Python programs typically end with the .py suffix. This code consists of some fake data and a function, mean, which calculates the mean (a companion function, var, which calculates the variance, is defined further below). Python code can be internally documented by adding lines beginning with the # symbol, or with simple strings enclosed in quotation marks. Here is the code:



In [ ]:

    
# Import modules you might use
import numpy as np

# Some data, in a list
my_data = [12, 5, 17, 8, 9, 11, 21]

# Function for calculating the mean of some data
def mean(data):

    # Initialize sum to zero
    sum_x = 0.0

    # Loop over data
    for x in data:

        # Add to sum
        sum_x += x 
    
    # Divide by number of elements in list, and return
    return sum_x / len(data)

Notice that, rather than using braces or keywords to enclose units of code (such as loops or conditional statements), Python simply uses indentation. This relieves the programmer from worrying about a stray brace breaking the program. It also forces programmers to code in neat blocks, making programs easier to read. So, for the following snippet of code:



In [ ]:

    
sum_x = 0

# Loop over data
for x in my_data:
    
    # Add to sum
    sum_x += x 

print(sum_x)

The first line initializes a variable to hold the sum, and the second initiates a loop, where each element in the data list is given the name x, and is used in the code that is indented below. The first line of subsequent code that is not indented signifies the end of the loop. It takes some getting used to, but works rather well.
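
The same rule applies when blocks are nested: each additional level of indentation opens a new block. The following is a small sketch of this, in which the names values and sum_even are made up for illustration and do not appear elsewhere in this tutorial.

```python
# Hypothetical data, for illustration only
values = [12, 5, 17, 8, 9, 11, 21]

sum_even = 0

# Loop over data
for x in values:

    # This block belongs to the loop (one level of indentation)
    if x % 2 == 0:

        # This line belongs to the conditional (two levels)
        sum_even += x

# Back at the left margin, so the loop has ended
print(sum_even)
```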

Now let's call the function:



In [ ]:

    
mean(my_data)

Our implementation of mean is by no means the most efficient one. Python provides some syntax and built-in functions to make things easier, and sometimes faster. Here is a more compact mean, along with a var function for the variance:



In [ ]:

    
# Function for calculating the mean of some data
def mean(data):

    # Call sum, then divide by the number of elements
    return sum(data)/len(data)

# Function for calculating variance of data
def var(data):

    # Get mean of data from function above
    x_bar = mean(data)

    # Do sum of squares in one line
    sum_squares = sum([(x - x_bar)**2 for x in data])

    # Divide by n-1 and return
    return sum_squares/(len(data)-1)

In the new implementation of mean, we use the built-in function sum to reduce the function to a single line. Similarly, var employs a list comprehension syntax to make a more compact and efficient loop.
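
To make the correspondence concrete, here is a minimal sketch that builds the same list of squared deviations twice, once with an explicit loop and once with a list comprehension. The data and the names squares_loop and squares_comp are invented for this example.

```python
# Hypothetical data, for illustration only
data = [2, 4, 4, 6]
x_bar = sum(data) / len(data)

# Explicit loop that builds the list of squared deviations
squares_loop = []
for x in data:
    squares_loop.append((x - x_bar)**2)

# Equivalent list comprehension
squares_comp = [(x - x_bar)**2 for x in data]

# Both approaches produce the same list
print(squares_loop == squares_comp)
print(sum(squares_comp))
```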

An alternative looping construct involves the map function. Suppose that we had a number of datasets, for each of which we want to calculate the mean:



In [ ]:

    
x = (45, 95, 100, 47, 92, 43)
y = (65, 73, 10, 82, 6, 23)
z = (56, 33, 110, 56, 86, 88) 
datasets = (x,y,z)



In [ ]:

    
datasets

This can be done using a classical loop:



In [ ]:

    
means = []
for d in datasets:
    means.append(mean(d))



In [ ]:

    
means

Or, more succinctly using map:



In [ ]:

    
list(map(mean, datasets))

Of course, we did not have to code these functions ourselves to get means and variances; the numpy package that we imported at the beginning of the module has functions for both:



In [ ]:

    
np.mean(datasets, axis=1)
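
One difference is easy to miss when switching to NumPy for the variance: np.var divides by n by default, whereas the var function defined earlier divides by n-1. The following sketch, which assumes NumPy is installed, uses the ddof argument of np.var to recover the n-1 version. The names var_by_hand, var_population and var_sample are made up for this example.

```python
import numpy as np

# Same fake data as in the first example
my_data = [12, 5, 17, 8, 9, 11, 21]

# Sample variance computed by hand, dividing by n-1
x_bar = sum(my_data) / len(my_data)
var_by_hand = sum((x - x_bar)**2 for x in my_data) / (len(my_data) - 1)

# np.var divides by n unless ddof=1 is given
var_population = np.var(my_data)
var_sample = np.var(my_data, ddof=1)

print(np.isclose(var_by_hand, var_sample))
print(var_population < var_sample)
```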

Data Types and Data Structures

In the introduction above, you have already seen some of the important Python data structures, including integers, floating-point numbers, lists and tuples. It is worthwhile, however, to quickly introduce all of the built-in data structures relevant to everyday Python programming.

Literals

The simplest data structures are literals, which appear directly in programs, and include most simple strings and numbers:



In [ ]:

    
42              # Integer
0.002243        # Floating-point
5.0J            # Imaginary
'foo'
"bar"           # Several string types
s = """Multi-line
string"""

There are a handful of constants that exist in the built-in namespace. Importantly, there are the boolean values True and False:



In [ ]:

    
type(True)

Either of these can be negated using not.



In [ ]:

    
not False

In addition, there is a None type that represents the absence of a value.



In [ ]:

    
x = None
print(x)

All the arithmetic operators are available in Python:



In [ ]:

    
15/4

Compatibility Corner: Note that when using Python 2, you would get a different answer! Dividing an integer by an integer will yield another integer. Though this is "correct", it is not intuitive, and hence was changed in Python 3.
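
If the integer result is what you want in Python 3, the floor division operator provides it. A short sketch of the related operators follows; the variable names are chosen only for this example.

```python
# True division always returns a float in Python 3
true_div = 15 / 4
print(true_div)     # 3.75

# Floor division reproduces the Python 2 integer result
floor_div = 15 // 4
print(floor_div)    # 3

# The remainder is available through the modulo operator
remainder = 15 % 4
print(remainder)    # 3

# divmod returns the quotient and remainder together
both = divmod(15, 4)
print(both)         # (3, 3)
```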

Operator precedence can be enforced using parentheses:



In [ ]:

    
(14 - 5) * 4

There are several Python data structures that are used to encapsulate several elements in a set or sequence.

Tuples

The first sequence data structure is the tuple, which is simply an immutable, ordered sequence of elements. These elements may be of arbitrary and mixed types. The tuple is specified by a comma-separated sequence of items, enclosed by parentheses:



In [ ]:

    
(34,90,56) # Tuple with three elements



In [ ]:

    
(15,) # Tuple with one element



In [ ]:

    
(12, 'foobar') # Mixed tuple

Individual elements in a tuple can be accessed by indexing. This amounts to specifying the appropriate element index enclosed in square brackets following the tuple name:



In [ ]:

    
foo = (5,7,2,8,2,-1,0,4)
foo[0]

Notice that the index is zero-based, meaning that the first index is zero, rather than one (in contrast to R). So above, foo[0] retrieves the first item, and an index of 5 would retrieve the sixth item, not the fifth.

Two or more sequential elements can be indexed by slicing:



In [ ]:

    
foo[2:5]

This retrieves the third, fourth and fifth (but not the sixth!) elements -- i.e., up to, but not including, the final index. One may also slice or index starting from the end of a sequence, by using negative indices:



In [ ]:

    
foo[:-2]

As you can see, this returns all elements except the final two.

You can add an optional third element to the slice, which specifies a step value. For example, the following returns every other element of foo, starting with the second element of the tuple.



In [ ]:

    
foo[1::2]

The elements of a tuple, as defined above, are immutable. Therefore, Python takes offense if you try to change them:



In [ ]:

    
a = (1,2,3)
a[0] = 6

The TypeError is called an exception, which in this case indicates that you have tried to perform an action on a type that does not support it. We will learn about handling exceptions further along.
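
As a brief preview, and only as a minimal sketch, an exception can be caught with a try/except block so that the program carries on instead of halting. The name message is made up for this example.

```python
a = (1, 2, 3)

try:
    # Tuples do not support item assignment, so this raises a TypeError
    a[0] = 6
except TypeError as err:
    # Handle the exception rather than halting the program
    message = str(err)
    print("Caught a TypeError:", message)

# The tuple is unchanged
print(a)
```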

Finally, the tuple() function can create a tuple from any sequence:



In [ ]:

    
tuple('foobar')

Why does this happen? Because in Python, strings are considered a sequence of characters.
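
Because strings are sequences, the indexing and slicing rules described for tuples carry over to them directly. A small sketch, with variable names chosen only for illustration:

```python
word = 'foobar'

# Indexing and slicing work as they do for tuples
first = word[0]          # 'f'
middle = word[2:5]       # 'oba'
every_other = word[::2]  # 'foa'

# Iterating over a string yields one character at a time
letters = [c for c in word]

print(first, middle, every_other)
print(letters)
```

Like tuples, strings are immutable, so an assignment such as word[0] = 'F' would raise a TypeError.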

Lists

Lists complement tuples in that they are a mutable, ordered sequence of elements. To distinguish them from tuples, they are enclosed by square brackets:



In [ ]:

    
# List with five elements
[90, 43.7, 56, 1, -4]



In [ ]:

    
# List with one element
[100]



In [ ]:

    
# Empty list
[]

Elements of a list can be arbitrarily substituted by assigning new values to the associated index:



In [ ]:

    
bar = [5,8,4,2,7,9,4,1]
bar[3] = -5
bar

Operations on lists are somewhat unusual. For example, multiplying a list by an integer does not multiply each element by that integer, as you might expect, but rather:



In [ ]:

    
bar * 3

The result is simply three copies of the list, concatenated together. This is useful for generating lists with identical elements:



In [ ]:

    
[0]*10

Incidentally, this works with tuples as well:



In [ ]:

    
(3,)*10
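
If element-wise multiplication is what you actually want, one option is a list comprehension. The sketch below contrasts the two behaviours; the names values, repeated and scaled are invented for this example.

```python
values = [5, 8, 4]

# Repetition: three copies of the list, concatenated
repeated = values * 3

# Element-wise multiplication using a list comprehension
scaled = [3 * v for v in values]

print(repeated)  # [5, 8, 4, 5, 8, 4, 5, 8, 4]
print(scaled)    # [15, 24, 12]
```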

Since lists are mutable, they have several methods, some of which change the list in place. For example:



In [ ]:

    
bar.extend(foo) # Adds foo to the end of bar (in-place)
bar



In [ ]:

    
bar.append(5) # Appends 5 to the end of bar
bar



In [ ]:

    
bar.insert(0, 4) # Inserts 4 at index 0
bar



In [ ]:

    
bar.remove(7) # Removes the first occurrence of 7
bar



In [ ]:

    
bar.remove(100) # Oops! Doesn’t exist



In [ ]:

    
bar.pop(4) # Removes and returns indexed item



In [ ]:

    
bar.reverse() # Reverses bar in place
bar



In [ ]:

    
bar.sort() # Sorts bar in place
bar

Some methods, however, do not change the list:



In [ ]:

    
bar.count(7) # Counts occurrences of 7 in bar



In [ ]:

    
bar.index(7) # Returns index of first 7 in bar
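
A related point that often trips up newcomers is that the in-place methods return None rather than the modified list. The sketch below illustrates this with sort, and contrasts it with the built-in sorted function, which leaves its argument untouched and returns a new list. The variable names are made up for this example.

```python
numbers = [5, 8, -5, 7, 9]

# sort() reorders the list in place and returns None
result = numbers.sort()
print(result)     # None
print(numbers)    # [-5, 5, 7, 8, 9]

# The built-in sorted() leaves its argument alone and returns a new list
original = [3, 1, 2]
ordered = sorted(original)
print(original)   # [3, 1, 2]
print(ordered)    # [1, 2, 3]
```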

Dictionaries

One of the more flexible built-in data structures is the dictionary. A dictionary maps a set of keys to a collection of associated values. These mappings are mutable and, unlike lists or tuples, are not indexed by position (as of Python 3.7 they do preserve insertion order, but elements are still looked up by key). Hence, rather than using the sequence index to return elements of the collection, the corresponding key must be used. Dictionaries are specified by a comma-separated sequence of key and value pairs, with each key separated from its value by a colon. The whole dictionary is enclosed by curly braces.

For example:



In [ ]:

    
my_dict = {'a':16, 'b':(4,5), 'foo':'''(noun) a term used as a universal substitute 
           for something real, especially when discussing technological ideas and 
           problems'''}
my_dict



In [ ]:

    
my_dict['b']

Notice that the key a indexes an integer, b a tuple, and foo a string (now you know what foo means). Hence, a dictionary is a sort of associative array. Some languages refer to such a structure as a hash or key-value store.
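
Because dictionaries are mutable, entries can be added or changed simply by assigning to a key. Here is a small sketch using a hypothetical dictionary of species counts; none of these names come from the examples above.

```python
# Hypothetical species counts, for illustration only
counts = {'sparrow': 12, 'finch': 5}

# Add a new key, and update an existing one
counts['wren'] = 3
counts['finch'] += 1

# Loop over keys and values together
for species, n in counts.items():
    print(species, n)

# Sum all of the values
total = sum(counts.values())
print(total)
```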

Like lists, dictionaries are mutable, and there are a variety of methods and built-in functions that operate on them. For example, some functions that accept dictionary arguments include:



In [ ]:

    
len(my_dict)



In [ ]:

    
# Checks to see if 'a' is a key in my_dict
'a' in my_dict

Some useful dictionary methods are:



In [ ]:

    
# Returns a copy of the dictionary
my_dict.copy()



In [ ]:

    
# Returns a view of the key/value pairs (a list in Python 2)
my_dict.items()



In [ ]:

    
# Returns a view of the keys (a list in Python 2)
my_dict.keys()



In [ ]:

    
# Returns a view of the values (a list in Python 2)
my_dict.values()

When we try to look up a key that does not exist, Python raises a KeyError.



In [ ]:

    
my_dict['c']

If we would rather not get the error, we can use the get method, which returns None if the key is not present.



In [ ]:

    
my_dict.get('c')

Custom return values can be specified with a second argument.



In [ ]:

    
my_dict.get('c', -1)
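
A common use of the default argument is counting occurrences without first checking whether a key exists. The following is a minimal sketch; the observations list and the tally dictionary are invented for this example.

```python
# Hypothetical sequence of observations
observations = ['a', 'b', 'a', 'c', 'a', 'b']

tally = {}
for item in observations:
    # Start from 0 the first time an item is seen
    tally[item] = tally.get(item, 0) + 1

print(tally)   # {'a': 3, 'b': 2, 'c': 1}
```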

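It is easy to remove items from a dictionary. The popitem method removes and returns a key/value pair (the most recently inserted one, as of Python 3.7):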



In [ ]:

    
my_dict.popitem()



In [ ]:

    
# Empties dictionary
my_dict.clear()
my_dict
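
Note that popitem does not let you choose which pair is removed. To remove a specific key, the pop method and the del statement are the usual tools. A short sketch, using a throwaway dictionary d defined only for this example:

```python
d = {'a': 16, 'b': (4, 5), 'c': 'foo'}

# pop removes a given key and returns its value
value_b = d.pop('b')

# Supplying a default avoids a KeyError when the key is missing
missing = d.pop('z', None)

# del removes a key without returning anything
del d['a']

print(value_b)  # (4, 5)
print(missing)  # None
print(d)        # {'c': 'foo'}
```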

Sets

If we don't require labels for our unordered collection of values, we can use a set. Sets store unique collections of values.



In [ ]:

    
my_set = {4, 5, 5, 7, 8}
my_set

We can also use the set constructor.



In [ ]:

    
empty_set = set()
empty_set



In [ ]:

    
empty_set.add(-5)
another_set = empty_set
another_set

As we would expect, we can perform set operations.



In [ ]:

    
my_set | another_set



In [ ]:

    
my_set & another_set



In [ ]:

    
my_set - {4}
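
A few further set operations are sketched below: symmetric difference, subset testing and membership. The sets s and t are defined here only for illustration.

```python
s = {4, 5, 7, 8}
t = {-5, 4}

# Symmetric difference: elements in exactly one of the two sets
sym_diff = s ^ t

# Subset test
is_subset = {4, 5} <= s

# Membership test
has_seven = 7 in s

print(sym_diff == {-5, 5, 7, 8})
print(is_subset, has_seven)
```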

The set function is useful for returning the unique elements of a data structure. For example, recall bar:



In [ ]:

    
bar



In [ ]:

    
set(bar)
