Writing idiomatic Python

We now have some interesting data about the attendees of this tutorial, so let's make use of it in Python by writing an interface to it.

First though, let's consider some important points about writing good, maintainable and reliable code.

Make use of existing tools.

The first thing to know about our data is that it is written in YAML. YAML is a data serialisation format similar to JSON with a few extra benefits, notably the ability to write comments. This makes YAML a great choice for configuration files or simply for storing human-readable data.

By using YAML we can tap into existing loaders, such as PyYAML, to avoid having to write our own - this will save us a bunch of time and allow us to focus on the important parts of our code.

An important thing to note about YAML is that it is an extensible serialisation language, meaning it is possible to define our own complex types and ultimately construct arbitrary Python objects. This is great, but it also means that loading a maliciously crafted document could run arbitrary code, so we use the yaml.safe_load function when dealing with any data which may be untrusted:


In [ ]:
import yaml

example = """
members:
 - Ringo
 - John
 - Paul
 - George
"""

print(yaml.safe_load(example))

In a similar vein, read up on Python's standard library, especially the itertools and collections modules.
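As a taste of what those two modules offer, here is a minimal sketch. The list of names is made up purely for illustration and is not the tutorial's sample data.

```python
from collections import Counter, defaultdict
from itertools import count, islice

# Hypothetical attendee names, used only for illustration.
names = ['Ringo', 'John', 'Paul', 'George', 'Jackie', 'Jermaine']

# Counter tallies hashable items without any manual bookkeeping.
initials = Counter(name[0] for name in names)
print(initials.most_common(1))

# defaultdict creates missing values on demand, here an empty list per key.
by_initial = defaultdict(list)
for name in names:
    by_initial[name[0]].append(name)
print(sorted(by_initial['J']))

# islice takes a slice of any iterable, even an endless one such as count().
first_ids = list(islice(count(10), 3))
print(first_ids)
```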

Know your data types

Python's built-in datatypes are powerful - they are implemented with well-optimised algorithms for the problems they are designed to solve, so we should make use of them where possible.

In your groups, discuss:

  • What is the difference between a list and a tuple?
  • What is a dictionary used for?
  • What is a set good for?
  • What are the differences between a list and a numeric numpy array?
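To seed the discussion, the following sketch pokes at the first three types. It is only a starting point under made-up data, not a full answer to the questions above.

```python
# Hypothetical names, used only for illustration.
band = ['Ringo', 'John', 'Paul']   # list: ordered and mutable
band.append('George')              # lists can grow in place

frozen_band = tuple(band)          # tuple: ordered but immutable, so hashable

# A tuple can be a dictionary key ...
group_labels = {frozen_band: 'group 1'}

# ... whereas a list cannot, because it is not hashable.
try:
    {band: 'group 1'}
except TypeError:
    list_is_hashable = False
else:
    list_is_hashable = True

# A set keeps only unique items and offers fast membership tests.
unique_initials = {name[0] for name in band}

print(group_labels[frozen_band])
print(list_is_hashable)
print('R' in unique_initials)
```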

Use generators

Generators are incredibly powerful in Python. If nothing else, use them to simplify your code from:

def green_bottles(n_bottles=10):
    lines = []
    for i in range(n_bottles, 0, -1):
        lines.append('{} green bottles'.format(i))
    return lines

print('\n'.join(green_bottles(5)))

to:

def green_bottles(n_bottles=10):
    for i in range(n_bottles, 0, -1):
        yield '{} green bottles'.format(i)

print('\n'.join(green_bottles(5)))

Note that the return type is different: the second version returns a generator rather than a list. Calling list() on the result of green_bottles, however, gives an exactly equivalent list.

Not only does this often simplify code, but it can also lead to impressive speedups and memory savings, especially when large quantities of data are processed sequentially and then discarded.
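The sketch below illustrates the lazy behaviour behind these points. It redefines green_bottles with range so that it is self-contained; the variable names are chosen for illustration only.

```python
def green_bottles(n_bottles=10):
    for i in range(n_bottles, 0, -1):
        yield '{} green bottles'.format(i)

bottles = green_bottles(3)

# Nothing has been computed yet; each line is produced only on request.
first_line = next(bottles)

# list() consumes whatever is left ...
remaining = list(bottles)

# ... after which the generator is exhausted and yields nothing more.
leftover = list(bottles)

# Consumers such as sum() can process items one at a time, so the full
# sequence of lines never has to exist in memory at once.
n_lines = sum(1 for line in green_bottles(1000))

print(first_line)
print(remaining)
print(leftover)
print(n_lines)
```

The flip side of laziness is that a generator can only be iterated over once; call the function again, or wrap the result in list(), if the values are needed more than once.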

Use context managers

Context managers allow one to run "enter" and "exit" functionality before and after the content within a "with" statement respectively.

The most common context manager is the one created by open when opening a file. To avoid running out of file descriptors in the Python process, it is important to close a file once we have finished with it. Without a context manager, we must remember to call .close() on the file handle ourselves, but inside a with statement this is done for us automatically:

with open(fname, 'r') as fh:
    contents = fh.read()

Additionally, Python's contextlib module makes it easy to define our own context managers.
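A minimal sketch of a home-made context manager using contextlib.contextmanager follows. The announced function and the events list are hypothetical names invented for this example.

```python
from contextlib import contextmanager

events = []


@contextmanager
def announced(label):
    # Code before the yield runs on entering the with-block.
    events.append('enter ' + label)
    try:
        # The yielded value is bound to the name after "as".
        yield label
    finally:
        # The finally clause runs on exit, even if the block raised.
        events.append('exit ' + label)


with announced('demo') as name:
    events.append('inside ' + name)

print(events)
```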

Have a consistent style

Python's PEP8 style guide gives some good suggestions on coding style, particularly with naming conventions and whitespace. Specifically:

  • class names should be in CapWords
  • function, variable and method names should be lowercase, with words separated by underscores

Following the guide makes your code more readable to those who also follow the guide (much of the scientific Python world) and gives users convenient hints as to what a certain object type might be.
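As a small illustration of the two naming conventions listed above, consider the sketch below. ClassroomGroup, group_size and describe_group are hypothetical names made up for this example.

```python
class ClassroomGroup(object):
    """A group of attendees. Class names use CapWords."""

    def __init__(self, members):
        self.members = tuple(members)

    def group_size(self):
        # Method names are lowercase, with words separated by underscores.
        return len(self.members)


def describe_group(group):
    # Function and variable names follow the same lowercase convention.
    size_text = '{} members'.format(group.group_size())
    return size_text


beatles = ClassroomGroup(['Ringo', 'John', 'Paul', 'George'])
print(describe_group(beatles))
```

A reader who knows the convention can tell at a glance that ClassroomGroup is a class to instantiate and describe_group is a function to call.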

Importantly though:

A style guide is about consistency. Consistency with this style guide is important. Consistency within a project is more important. Consistency within one module or function is most important.


Defining an interface to our classroom

We have a collection of YAML files, each representing a group in the classroom, where a group is a collection of individual attendees.

Exercise:

Using sample data in 'sample_data/*.yaml', Python's glob.glob function, and PyYAML's yaml.safe_load function, load the data and print the filename and length of each group.

The first few steps have been completed for you:


In [ ]:
import os
import glob

pattern = os.path.join('sample_data', '*.yaml')
print(glob.glob(pattern))

When you are ready, take a look at my solution by running the next cell - there is no right or wrong answer. If you think your solution is more readable (and/or Pythonic) than mine, I'd like to hear about it.


In [ ]:
%load solutions/idiomatic_1.py

Using a suitable datatype

Now that we've explored the use of yaml.safe_load, let's define a function which loads each group in the given "classroom".


In [ ]:
import os
import glob
import yaml


def load_classroom(glob_pattern):
    for fname in sorted(glob.glob(glob_pattern)):
        with open(fname) as fh:
            yield tuple(yaml.safe_load(fh).get('members') or [])


pattern = os.path.join('sample_data', '*.yaml')
classroom = list(load_classroom(pattern))

print(classroom)

The most appropriate datatype for the data depends on the questions that we wish to ask of it.

For instance, suppose we had a long list of groups and wanted to analyse the group lengths. We may wish to construct a numpy array holding exclusively those lengths. Once we have that, computing simple statistics is just a numpy method away:


In [ ]:
import numpy as np

def group_lengths(classroom):
    return np.array([len(group) for group in classroom])

lengths = group_lengths(classroom)
print('Classroom group sizes: mean {}; var {};'.format(lengths.mean(), lengths.var()))
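A different question calls for a different datatype. As a sketch, if we instead wanted to know how many groups there are of each size, a collections.Counter would be a natural fit. The example_classroom below is made up for illustration and only mimics the list-of-tuples form of the real data.

```python
from collections import Counter

# Hypothetical classroom in the same list-of-tuples form as above;
# the names are invented for illustration.
example_classroom = [
    ('Ringo', 'John', 'Paul', 'George'),
    ('Jackie', 'Tito', 'Jermaine', 'Marlon', 'Randy', 'Michael'),
    ('Simon', 'Garfunkel'),
    ('Sonny', 'Cher'),
]


def group_size_counts(classroom):
    # Tally how many groups there are of each size.
    return Counter(len(group) for group in classroom)


size_counts = group_size_counts(example_classroom)
print(sorted(size_counts.items()))
```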

Exercise: In your groups, discuss and implement an approach for turning our classroom (made up of a list of tuples) into a flat list of attendees. (hint: there are many approaches possible, and the itertools module contains at least one of them)

Once you have the list of attendees, compute the average length of a GitHub username for the class.


In [ ]:
#attendees = ...

(Extension): As a group, see how many other approaches you can find for flattening out a nested list.


In [ ]:
#attendees = ...

Using functions as arguments to other functions

One of the strangest concepts for those familiar with other languages is Python's ability to pass functions as arguments to other functions.

First, let's define a function which scores a group based on the number of people in it, where a lower score is better. 3 is the ideal number, 2 is good, 4 is better than 1, but 1 is better than 5+.


In [ ]:
def group_score(group):
    length = len(group)
    # Map length of group to a score.
    scores_mapping = {3: 0,
                      2: 1,
                      4: 2,
                      1: 3}
    # Get the score. If the length isn't defined
    # in the scores_mapping, use length as default.
    score = scores_mapping.get(length, length)
    return score


for group_len in range(1, 7):
    print('Group length {}, score {}'.format(group_len, group_score(range(group_len))))

We can now use the function group_score as an argument to the sorted function, which accepts a function in its key keyword argument.

This will allow us to sort our classroom groups by order of the score that our function gives them.


In [ ]:
sorted(classroom, key=group_score)

Notice that we do not call the group_score function ourselves - that is the job of sorted, which calls the function on each of the groups in our classroom inside a for-loop.
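To make that mechanism concrete, here is a sketch of a function which receives another function and calls it itself. best_group and the example groups are hypothetical, invented for this illustration, and this is not how sorted is actually implemented.

```python
def best_group(groups, key):
    """Return the group for which key(group) is lowest (hypothetical helper)."""
    best = None
    best_value = None
    for group in groups:
        # The passed-in function is only called here, once per group.
        value = key(group)
        if best_value is None or value < best_value:
            best, best_value = group, value
    return best


def distance_from_three(group):
    # A made-up scoring function: how far the group size is from 3.
    return abs(len(group) - 3)


# Made-up groups, used only for illustration.
example_groups = [('a', 'b', 'c', 'd', 'e'), ('f',), ('g', 'h', 'i')]

# Built-in functions such as len can be passed just like our own.
print(best_group(example_groups, key=len))
print(best_group(example_groups, key=distance_from_three))
```

Note that the functions are passed by name, without parentheses; adding parentheses would call them immediately and pass the result instead.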

Exercise: Define a function called fail which takes the classroom, a function, and a number as its only arguments. The function should return a generator of all the groups whose score, as computed by the passed-in function, is greater than the number passed as an argument:


In [ ]:
def fail(classroom, scoring_function, maximum_score):
    pass

When called with our group_score function and a maximum score of 4, we should see that the Jacksons are the only group who fail the class:

>>> print(list(fail(classroom, group_score, 4)))
[('Jackie', 'Tito', 'Jermaine', 'Marlon', 'Randy', 'Michael')]

In [ ]:
print(list(fail(classroom, group_score, 4)))

My solution to the problem can be seen below:


In [ ]:
%load solutions/idiomatic_2.py
