PEP 8 - Python Style Guide

A PEP is a Python Enhancement Proposal. PEP 8 (the eighth PEP) describes how to write Python code in a common style that will be easily readable by other programmers. If this seems unnecessary, consider that programmers spend much more time reading code than writing it.

You can read PEP 8 here: https://www.python.org/dev/peps/pep-0008/

pycodestyle

Wouldn't it be nice if you didn't need to remember all of these silly rules for how to write PEP 8-consistent code? What if there were a tool that would tell you whether or not your code matches PEP 8 conventions?

There is such a tool, called pycodestyle.


In [ ]:
"""
This is some ugly code that does not conform to PEP 8.

Check me with pycodestyle:
    pycodestyle ../resources/pep8_example.py
"""
from string import *
import math, os, sys

def f(x):
    """This function has lines that are just too long. The maximum suggested line length is 80 characters."""
    return 4.27321*x**3  -8.375134*x**2  + 7.451431*x + 2.214154 - math.log(3.42153*x) + (1 + math.exp(-6.231452*x**2))
def g(x,
     y):
    print("Bad splitting of arguments")

# examples of bad spacing
mydict  =  { 'ham' : 2,  'eggs'  : 7  }#this is badly spaced
mylist=[ 1 , 2 , 3 ]

myvar   = 7
myvar2  = myvar*myvar
myvar10 = myvar**10

# badly formatted math
a= myvar+7 *  18-myvar2  /  2

l = 1 # l looks like 1 in some fonts
I = l # also bad
O = 0 # O looks like 0 in some fonts

In [ ]:
!pycodestyle ../resources/pep8_example.py

Exercise 1

Load the ../resources/pep8_example.py file in a text editor (you can use Jupyter notebook, or something else) and fix the problems that pycodestyle is complaining about. Then rerun pycodestyle using the cell above, or from the terminal:

cd PythonWorkshop-ICE/resources
pycodestyle pep8_example.py

Naming conventions

Use descriptive names for your variables, functions, and classes. In Python, the following conventions are usually observed:

  • Variables, functions, and function arguments are lower-case, with underscores to separate words.
    index = 0
    num_columns = 3
    length_m = 7.2   # you can add units to a variable name
    
  • Constants can be written in all-caps.
    CU_SPECIFIC_HEAT_CAPACITY = 376.812   # J/(kg K)
    
  • Class names are written with the CapWords convention:
    class MyClass:
    

Programmers coming from other programming languages (especially FORTRAN and C/C++) should avoid using special encodings (e.g., Hungarian notation) in their variable names:

# don't do this!
iLoopVar = 0      # i indicates integer
szName = 'Test'   # sz means 'string'
gGlobalVar = 7    # g indicates a global variable

Comments

Comments are helpful when they clarify code. They should be used sparingly. Why?

  • If code is so difficult to read that it needs a comment to explain it, it should probably be rewritten.
  • Someone may update the code and forget to update a comment, making it misinformation.
  • Comments tend to clutter the code and make it difficult to read.

Consider this example:


In [ ]:
# this function does foo to the bar!
def foo(bar):
    bar = not bar   # bar is active low, so we invert the logic
    if bar == True:   # bar can sometimes be true
        print("The bar is True!")   # success!
    else:   # sometimes bar is not true
        print("Argh!")   # I hate it when the bar is not true!

Only one of these comments is helpful. This code is much easier to read when written properly:


In [ ]:
def foo(bar):
    """
    This function does foo to the bar!
    
    Bar is active low, so we invert the logic.
    """
    bar = not bar    # logic inversion
    if bar:
        print("The bar is True!")
    else:
        print("Argh!")

Doc strings

Doc-strings are a useful way to document what a function (or class) does.


In [ ]:
def add_two_numbers(a, b):
    """This function returns the result of a + b."""
    return a + b

In a Jupyter notebook (like this one) or an IPython shell, you can get information about what a function does and what arguments it takes by reading its doc-string:


In [ ]:
add_two_numbers?

Doc-strings can be several lines long:


In [ ]:
def analyze_data(data, old_format=False, make_plots=True):
    """
    This function analyzes our super-important data.
    
    If you want to use the old data format, set old_format to True.
    Set make_plots to False if you do not want to plot the data.
    """
    # analysis ...
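Outside of IPython and Jupyter the ? shortcut is not available, but the doc-string can still be read programmatically (the built-in help() function prints the same information interactively). A minimal sketch using only the standard library, with a shortened copy of the function above so that it runs on its own:

```python
import inspect


def analyze_data(data, old_format=False, make_plots=True):
    """
    This function analyzes our super-important data.

    If you want to use the old data format, set old_format to True.
    """
    # analysis ...


# The raw doc-string is stored on the function object itself.
raw_doc = analyze_data.__doc__

# inspect.getdoc() returns the same text with the indentation cleaned up.
clean_doc = inspect.getdoc(analyze_data)
print(clean_doc)
```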

If you are working on a large project, there may be project-specific conventions on how to write doc-strings. For example:


In [ ]:
def google_style_doc_string(arg1, arg2):
    """Example Google-style doc-string.
    
    Put a brief description of what the function does here.
    In this case, the function does nothing.
    
    Args:
        arg1 (str): Your full name (name + surname)
        arg2 (int): Your favorite number

    Returns:
        bool: The return value. True for success, False otherwise.
    """

def scipy_style_doc_string(x, y):
    """This is a SciPy/NumPy-style doc-string.
    
    All of the functions in SciPy and NumPy use this format for their
    doc-strings.
    
    Parameters
    ----------
    x : float
        Description of parameter `x`.
    y :
        Description of parameter `y` (with type not specified)

    Returns
    -------
    err_code : int
        Non-zero value indicates error code, or zero on success.
    err_msg : str or None
        Human readable error message, or None on success.
    """

In large Python projects, you may see doc-strings like this:


In [ ]:
def sphinx_example(variable):
    """This function does something.

    :param variable: Some variable that the function uses.
    :type variable: str. 
    :returns: int -- the return code. 
    """ 
    return 0

These doc-strings are for use with Sphinx, which can automatically generate HTML documentation from code (similar to Doxygen).

General advice

Import only what you need

For the love of God and all that is holy, do not do this:


In [ ]:
from numpy import *
from scipy import *
from pickle import *
from scipy.stats import *

Why not? Imagine that you import these libraries at the top of your code. At some point in a ~200 line script, you see this:


In [ ]:
with open('../resources/mystery_data', 'rb') as f:
    data = array(load(f))
x, y = data[:, 0], data[:, 1]
r = linregress(x, y)
s = polyfit(x, y, 1)
print(r.slope - s[0])

Can you identify which function belongs to which library? Don't do this!
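The same problem can be seen without any third-party libraries: the standard-library modules math and cmath both define sqrt(), so star-importing both would silently leave only one of them in scope. A minimal sketch of the safer, namespaced style:

```python
import cmath
import math

# With explicit module prefixes it is always clear which sqrt() is meant.
real_root = math.sqrt(4)        # float result
complex_root = cmath.sqrt(-4)   # complex result; math.sqrt(-4) would raise ValueError

print(real_root, complex_root)
```

For the mystery code above, the explicit equivalents would most likely be np.array(), pickle.load(), stats.linregress(), and np.polyfit(). Note that NumPy also defines a load(), so which one the star imports leave in scope depends on the import order.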

Avoid deeply nested logic

Remember that it is best to keep your functions short and concise. As a result, it is best to avoid deeply nested if ... elif ... else logic structures. These can become very long, which obscures the logic and makes them difficult to read. Consider this example:


In [ ]:
import os

class AnalyzeData:
    def __init__(self, fname):
        self.fname = fname
        self.import_data()
        self.analyze_data()
        
    def import_data(self):
        file_extension = os.path.splitext(self.fname)[-1]
        if 'csv' in file_extension:
            print("Import comma-separated data")
            # many lines of code, maybe with several if statements
        elif 'tab' in file_extension:
            print("Import tab-separated data")
            # many lines of code, maybe with several if statements
        elif 'dat' in file_extension:
            print("Import data with | delimiters (old-school)")
            # many lines of code, maybe with several if statements
        else:
            print("Unknown data format. I give up!")   # should use an exception here; see later...
            return
        
    def analyze_data(self):
        """Do some super-awesome data analysis!"""

This long list of if statements is nasty to look at, and if you want to add more file types, it will become worse. Consider the alternative, which uses a dictionary with functions as values:


In [ ]:
class AnalyzeData:
    def __init__(self, fname):
        self.fname = fname
        self.import_data()
        self.analyze_data()
        
    def import_data(self):
        valid_extensions = {'.csv': self._import_csv,
                            '.tab': self._import_tab,
                            '.dat': self._import_dat}
        file_extension = os.path.splitext(self.fname)[-1]
        importer_function = valid_extensions[file_extension]
        importer_function()

    def _import_csv(self):
        print("Import comma-separated data")
        # many lines of code, perhaps with function calls
        
    def _import_tab(self):
        print("Import tab-separated data")
        # many lines of code, perhaps with function calls
        
    def _import_dat(self):
        print("Import data with | delimiters (old-school)")
        # many lines of code, perhaps with function calls
        
    def analyze_data(self):
        """Do some super-awesome data analysis!"""

        
a = AnalyzeData('data.tab')
# a = AnalyzeData('data.xls')   # unknown file type, throws exception!

This code is much clearer and nicer to read. Adding more valid file types increases import_data() by only one line (actually the valid_extensions dictionary could be moved out of this function), and removing file types is similarly easy.

Did you notice that the else case is gone? Because we are using a dictionary, an invalid extension will automatically generate a KeyError -- try uncommenting the last line in the cell above.

Finally, this type of structure makes unit testing much easier!
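If a bare KeyError is too cryptic for users, the lookup can be wrapped so that an unknown extension produces a clearer message. The following is a minimal sketch, not part of the class above; select_importer() and the placeholder importer functions are hypothetical names chosen for illustration:

```python
import os


def import_csv(fname):
    return "comma-separated: " + fname


def import_tab(fname):
    return "tab-separated: " + fname


# Maps each supported extension to the function that handles it.
IMPORTERS = {'.csv': import_csv, '.tab': import_tab}


def select_importer(fname):
    """Return the importer for fname, or raise ValueError if unsupported."""
    extension = os.path.splitext(fname)[-1]
    try:
        return IMPORTERS[extension]
    except KeyError:
        supported = ", ".join(sorted(IMPORTERS))
        raise ValueError(
            "Unsupported file type '%s' (expected one of: %s)"
            % (extension, supported)) from None


print(select_importer('data.tab')('data.tab'))
```

Because select_importer() is a small function with no side effects, it can be tested on its own with nothing more than a file name.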

Avoid deeply nested loops

The following example was taken from real C++ code, and converted to Python:


In [ ]:
for j in range(4):                     # Loop over coarse clipping
    for i in [0, 64]:                  # Loop over each attenuation
        for k in [63]:                 # Loop over fine clipping
            for channel in range(7):   # Loop over each channel
                
                
                """Does lots of stuff (55 lines of code)"""
                
                
    """Does some other stuff (30 lines of code) at end of i, k, and channel loops"""

Don't do this!

Many things that you would need a loop for in C++ can be done in one line in Python. Code with many nested loops will also run very slowly in Python. In the NumPy lesson you will learn how NumPy eliminates the need for many nested loops.

If you absolutely must use nested loops, try to wrap the interior code in a function:


In [ ]:
def loop():
    for j in range(4):
        for i in [0, 64]:
            for k in [63]:   # this is left-over code
                for channel in range(7):
                    inner_loop_function(i, k, channel)
        outer_loop_function(j)

Or, even better (with proper variable names!):


In [ ]:
def inner_loop(coarse_clipping):
    fine_clipping = 63
    for attenuation in [0, 64]:
        for channel in range(7):
            inner_loop_details(attenuation, fine_clipping, channel)

def outer_loop():
    for coarse_clipping in range(4):
        inner_loop(coarse_clipping)
        outer_loop_details(coarse_clipping)
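Another option, when nothing needs to happen between the loop levels, is to flatten the nesting with itertools.product from the standard library. This is a sketch only; the values mirror the example above, and process_setting() is a hypothetical stand-in for the real loop body:

```python
from itertools import product


def process_setting(attenuation, fine_clipping, channel):
    """Placeholder for the real work done for one combination."""
    return (attenuation, fine_clipping, channel)


attenuations = [0, 64]
fine_clippings = [63]
channels = range(7)

# product() yields every combination, in the same order as the nested loops.
results = [process_setting(a, f, c)
           for a, f, c in product(attenuations, fine_clippings, channels)]

print(len(results))
```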

Functions

When to use functions

Python can be used as a scripting language (like Bash or Perl), and Python programs often start out as scripts. Here is an example of a script that renames image files (call it image_renamer.py):


In [ ]:
#!/usr/bin/env python3

from glob import glob
import os

jpeg_file_list = glob('Image_*.jpg')
for old_file_name in jpeg_file_list:
    fname_parts = old_file_name.split('_')
    new_file_name = fname_parts[0] + '_0' + fname_parts[1]   # add leading zero: 01 -> 001
    os.rename(old_file_name, new_file_name)

The first line indicates to the shell that this is a Python 3 script (the #! combination is called a shebang).

You can run this script as an executable from the shell, just like any other program:

chmod a+x image_renamer.py
./image_renamer.py

Often, this is all you need. However, scripts like this have several disadvantages:

  1. They are not very reusable -- reusing this code generally means copy + paste.
  2. There is nothing to break up the program -- like reading a book without chapters or headings.
  3. They are difficult to test -- the script must be run in the correct environment (a directory with images).

Functions solve all of these problems. Consider this code:


In [ ]:
"""
image_renamer.py -- simple script to rename images.
"""
from glob import glob
import sys
import os


def rename_images(image_list, test=False):
    for old_file_name in image_list:
        fname_parts = old_file_name.split('_')
        new_file_name = fname_parts[0] + '_0' + fname_parts[1]   # add leading zero: 01 -> 001
        if test:
            print(new_file_name)
        else:
            os.rename(old_file_name, new_file_name)
        

if __name__ == '__main__':   # only run this part if the file is being executed as a script
    directory = './'
    if len(sys.argv) == 2:
        directory = sys.argv[1]
    jpeg_file_list = glob(directory + '/Image_*.jpg')
    rename_images(jpeg_file_list)

To be fair, the code is now longer, and in some ways more complicated. However, it has several advantages over the simple script. Recalling our previous list, note that:

  1. Reusing the code is now very easy:

    """new_code.py"""
    from image_renamer import rename_images
    
    rename_images(some_image_list)
    
  2. The parts of the script are now easy to identify.
  3. You can now test the code to see what it does:

In [ ]:
rename_images(['Image_01.jpg', 'Image_02.jpg'], test=True)
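Going one step further, the name manipulation can be separated from the renaming itself, which makes it testable without touching any files. A sketch, assuming file names of the form Image_NN.jpg; the helper name add_leading_zero() is hypothetical:

```python
import os


def add_leading_zero(path):
    """Return path with a zero inserted after the first underscore in the file name."""
    directory, file_name = os.path.split(path)
    prefix, rest = file_name.split('_', 1)   # split on the first underscore only
    return os.path.join(directory, prefix + '_0' + rest)


print(add_leading_zero('Image_01.jpg'))
```

Working on the file name only also avoids surprises when the directory name itself contains an underscore.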

Functions should have descriptive names

Functions should have names that describe what they are for.

For example, what does this function do?


In [ ]:
def myfunc(mylist):
    import re
    f = re.compile('([0-9]+)_.*')
    return [int(f.findall(mystr)[0]) for mystr in mylist]

myfunc(['000_Image.png', '123_Image.png', '054_Image.png'])

A better name could be:

def extract_integer_index(file_list):

If you name things well, it makes comments unnecessary. Your code will speak for itself!
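Put together, a renamed version of the function might read as follows. This is a sketch; apart from the new names, the only change is moving the import to the top of the file, where PEP 8 recommends placing imports:

```python
import re


def extract_integer_index(file_list):
    leading_digits = re.compile('([0-9]+)_.*')
    return [int(leading_digits.findall(file_name)[0]) for file_name in file_list]


print(extract_integer_index(['000_Image.png', '123_Image.png', '054_Image.png']))
```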

Functions should be short

Here is an example of a function that is a bit too long. It is not very long because it is an example, but in real physics code it is not uncommon to find single functions that are hundreds of lines long!


In [ ]:
def analyze():
    print("******************************")
    print("    Starting the Analysis!    ")
    print("******************************")

    # create fake data
    x = [4.1, 2.8, 6.7, 3.5, 7.9, 8.0, 2.1, 6.3, 6.6, 4.2, 1.5]
    y = [2.2, 5.3, 6.3, 2.4, 0.1, 0.67, 7.8, 9.1, 7.1, 4.9, 5.1]
    
    # make tuple and sort
    data = list(zip(x, y))
    data.sort()
    
    # calculate statistics
    y_sum = 0
    xy_sum = 0
    xxy_sum = 0
    for xx, yy in data:
        y_sum += yy
        xy_sum += xx*yy
        xxy_sum += xx*xx*yy
    xbar = xy_sum / y_sum
    x2bar = xxy_sum/y_sum
    std_dev = (x2bar - xbar**2)**0.5
    
    # print the results
    print("Mean:   ", xbar)
    print("Std Dev:", std_dev)

    print("Analysis successful!")

analyze()

How can we improve this code? Our analysis function is really doing three things:

  1. Creating fake data
  2. Calculating some statistics
  3. Printing the status and results

Each of these things can be put in a separate function.


In [ ]:
def generate_fake_data():
    x = [4.1, 2.8, 6.7, 3.5, 7.9, 8.0, 2.1, 6.3, 6.6, 4.2, 1.5]
    y = [2.2, 5.3, 6.3, 2.4, 0.1, 0.67, 7.8, 9.1, 7.1, 4.9, 5.1]
    data = list(zip(x, y))
    data.sort()
    return data

def calculate_mean_and_stddev(xy_data):
    y_sum = 0
    xy_sum = 0
    xxy_sum = 0
    for xx, yy in xy_data:
        y_sum += yy
        xy_sum += xx*yy
        xxy_sum += xx*xx*yy
    xbar = xy_sum / y_sum
    x2bar = xxy_sum/y_sum
    std_dev = (x2bar - xbar**2)**0.5
    return xbar, std_dev

def generate_data_and_compute_statistics():
    data = generate_fake_data()
    mean, std_dev = calculate_mean_and_stddev(data)
    print("Mean:   ", mean)
    print("Std Dev:", std_dev)

generate_data_and_compute_statistics()

We note three important results of this code restructuring:

  1. It is much easier to tell at a glance what the top-level function, generate_data_and_compute_statistics(), does.
  2. The comments (which we used to organize our code before) are no longer needed.
  3. generate_fake_data() and calculate_mean_and_stddev() can now be reused elsewhere.

Functions should do one thing

A useful principle for guiding the creation of functions is that functions should do one thing.

In the previous section, our large analyze() function was doing several things, so we broke it up into smaller functions. Notice that calculate_mean_and_stddev() does two things. Should we break it up into two functions, calculate_mean() and calculate_stddev()?

The answer depends on two things:

  1. Will you ever want to calculate the mean and standard deviation separately?
  2. Will splitting the function result in a large amount of duplicated code?
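As a sketch of what such a split could look like, assuming the intent is a mean of x weighted by y, the standard deviation can call the mean function so that little code is duplicated. The function names are hypothetical:

```python
def weighted_mean(xy_data):
    """Mean of x, weighted by y."""
    weight_sum = sum(y for x, y in xy_data)
    return sum(x * y for x, y in xy_data) / weight_sum


def weighted_stddev(xy_data):
    """Standard deviation of x, weighted by y."""
    mean = weighted_mean(xy_data)
    weight_sum = sum(y for x, y in xy_data)
    variance = sum(y * (x - mean)**2 for x, y in xy_data) / weight_sum
    return variance**0.5


data = [(1.0, 1.0), (2.0, 2.0), (3.0, 1.0)]
print(weighted_mean(data), weighted_stddev(data))
```

The price of the split is that the sum of the weights is computed twice, which is usually negligible compared to the gain in clarity.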

Functions should do what they claim to do

Avoid cases where a function does more than what you would expect it to do.

For example, this function claims to just write data to a file; however, it also modifies the data!


In [ ]:
def write_data_to_file(data, filename='data.dat'):
    with open(filename, 'w') as f:
        data *= 2
        f.write(data)

Try to imagine a much larger code base where a factor of two has been introduced and you can't figure out where it came from. Then try to imagine searching that code base for the number 2.
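A version that does only what its name promises might look like the sketch below. It assumes data is a string; any scaling would be done by the caller, where it is visible:

```python
import os
import tempfile


def write_data_to_file(data, filename='data.dat'):
    """Write data to filename without modifying it."""
    with open(filename, 'w') as f:
        f.write(data)


# Demonstrate with a temporary directory so no stray files are left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.dat')
    write_data_to_file('1 2 3\n', path)
    with open(path) as f:
        contents = f.read()

print(repr(contents))
```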

Exceptions

Exceptions are a mechanism for handling errors. Traditionally, errors were handled with return codes, like this:


In [ ]:
def example_only_does_not_work():
    fin = open('does_not_exist.txt', 'r')
    if not fin:
        return -1
    # ... do stuff with file
    fin.close()
    return 0

This kind of code is problematic for a few reasons:

  1. The return codes (and therefore errors) can be ignored/forgotten.
  2. The return code must be either checked by the function that calls it, or explicitly passed to higher level functions.
  3. Return codes are generally integers, so they must be looked up in a table. They also can't provide any specific details.

To illustrate point #2, consider the following code:


In [ ]:
def foo():
    return -1   # error code!

def bar():
    foo()
    return 0    # return success?

def baz():
    bar()
    return 0    # no errors, right?

Exceptions offer an elegant solution to all three of the problems listed above.

Raising exceptions

Exceptions must derive from the BaseException class (user-defined exceptions should be derived from Exception). It is common to use one of the built-in exception subclasses. Common examples include:

  1. ImportError - raised when trying to import an unknown module.
  2. IndexError - raised when trying to access an invalid index in a sequence (such as a list).
  3. KeyError - raised when trying to use an invalid key with a dictionary.
  4. NameError - raised when trying to use a variable that hasn't been defined.
  5. TypeError - raised when trying to use an object of the wrong type.
  6. ValueError - raised when an argument has the correct type but a bad value.
  7. OSError - base exception for problems with reading/writing a file (and other things).
  8. RuntimeError - catch-all class for errors while code is running.

In general, you can use these built-in exceptions when there is one that suits the problem. For instance, you might raise a ValueError or TypeError when checking arguments to a function:


In [ ]:
def foobar(value):
    if not isinstance(value, int):
        raise TypeError("foobar requires an int!")
    if value < 0:
        raise ValueError("foobar argument 'value' should be >= 0; you passed: %i" % value)
    
# uncomment to test:
# foobar(2.7)
# foobar(-7)

You do not need to add a string argument when raising an exception. This works fine:


In [ ]:
raise Exception

However, this is not very helpful. In general, you should add some descriptive text to your exceptions to explain to the user what exactly went wrong.

To make your exceptions even more useful, or when there isn't a built-in exception that meets your needs, you can roll your own by sub-classing Exception or one of the other built-in exceptions:


In [ ]:
class MyCustomException(Exception):
    pass

# using a doc-string instead of 'pass' is more helpful
class CorruptFile(OSError):
    """Raise this exception when attempting to read a file that is corrupt."""
    
# uncomment to test...
# raise MyCustomException("Test")
# raise CorruptFile("Oh no, the file is corrupted!")

Handling exceptions

Handling exceptions is done by using try ... except blocks. That is, you try some operation where you suspect there may be some problems. If there are no problems, you continue on your merry way, except in error cases where you deal with the problem before continuing on.

Let's return to the example from the top of this section to see how this works:


In [ ]:
def foo():
    raise RuntimeError("Oh no! Can't foo!")

def bar():
    foo()

def baz():
    try:
        bar()
    except RuntimeError:
        print("Foo had an error, but it is being handled...")
        # do something useful to handle the error, or keep going
        
baz()

This is much better than using return codes (e.g., return -1 for errors) because:

  1. We can't ignore the error; we are forced to deal with it or the program execution stops.
  2. If you do forget to deal with the error, there is a descriptive error message that tells you what went wrong.
  3. bar() doesn't need to worry about error handling! The error-handling code only occurs where the error happens (where the exception is raised) and at the upper levels of your program, where the flow of the program is controlled.

What about that nice, descriptive error message that we wrote? Wouldn't it be nice if we could reuse that information in our except block? You can, and it's easy! Just convert the exception to a string:


In [ ]:
def baz():
    try:
        bar()
    except RuntimeError as e:
        print('baz error: ' + str(e))
        
baz()
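An exception object can carry more than a message. The sketch below extends the CorruptFile idea from earlier with an extra attribute that the except block can read; the bad_file attribute and the read_header() function are hypothetical and exist only for illustration:

```python
class CorruptFile(OSError):
    """Raise this exception when attempting to read a file that is corrupt."""

    def __init__(self, bad_file):
        super().__init__("File is corrupt: " + bad_file)
        self.bad_file = bad_file   # hypothetical extra attribute for handlers


def read_header(file_name, first_bytes):
    """Hypothetical check: a valid file starts with the bytes b'DATA'."""
    if not first_bytes.startswith(b'DATA'):
        raise CorruptFile(file_name)
    return first_bytes[4:]


try:
    read_header('run42.dat', b'JUNK')
except CorruptFile as e:
    message = str(e)
    bad_file = e.bad_file
    print('Handled:', message)
```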

Finally, in some cases you may want to do something in the event that an exception is not thrown. Maybe you were expecting an exception, but for some bizarre reason it wasn't raised, which might be interesting. In these cases, you can add an else to the end of the try ... except block:


In [ ]:
def foo():
    """This foo actually foos."""
    pass

def baz():
    try:
        bar()
    except RuntimeError:
        print("Bar raised an exception!")
    else:
        print("No exception was raised??")
        
baz()

Don't dismiss this as being a useless edge case -- exceptions are used for all kinds of things in Python. For instance, did you remember to install the pycodestyle package for this module?


In [ ]:
try:
    import pycodestyle
except ImportError:
    print("You didn't remember to install it. :(")
else:
    print("Nice job!")

Exercise 2

You are given the following functions (don't change them!):


In [ ]:
import random

values = {'a': 0, 'b': 1, 'c': 2}

# DON'T CHANGE THESE
def one(values):
    print(v)   # throws NameError because v is not defined

def two(values):
    values['c'] /= values['a']

def three(values):
    return values['d']

def tricky():
    if random.randint(0, 1):
        raise ValueError
    else:
        raise RuntimeError

Handle the exceptions thrown by each of the functions. The first one is done as an example.


In [ ]:
try:
    one(values)
except NameError:
    pass

# two(values)

# three(values)

# try:
#     for i in range(10):
#         tricky()
# except __:

Understanding stack traces

Consider this line from some earlier code:


In [ ]:
fin = open('does_not_exist.txt', 'r')

The file does not exist, so it raises an error -- very sensibly, a FileNotFoundError. Here, we have not handled this exception, so Python performs a "stack trace" or "traceback" (basically unrolling your code to show you where the error occurred).

These tracebacks are an excellent way to figure out what went wrong in your program. However, they can appear to be a little cryptic to the uninitiated, so we will look at how to understand them.

Consider this example, where you are trying to fit a quadratic function to two data points:


In [ ]:
from scipy.optimize import curve_fit

def f(x, a, b, c):
    return a*x**2 + b*x + c

x = [0, 1]
y = [2, 3]
curve_fit(f, x, y)

The traceback indicates that the error is a TypeError, and then starts in the current file (listed in green), where the offending call is made. It tells you that the error originates on line 8 (in this case, of the notebook cell).

Aside: you can view line numbers in a notebook by selecting a cell, pressing escape, and then pressing the (lowercase) 'L' key. Press 'L' again to turn the line numbers off.

The traceback then goes to the file where the offending function resides (in this case, in minpack.pyc in the scipy library). The exception originated during a call to leastsq().

Finally, the traceback shows you where the actual TypeError exception was raised (also in the minpack.pyc file, just at a different line). The TypeError tells you that N=3 must not exceed M=2.

This doesn't seem very helpful at first. What actually went wrong? What are N and M? In fact, the problem is one of basic linear algebra: we are trying to fit three unknowns (from our quadratic) with only two equations (one from each (x,y) data point). We need more data! Try adding another junk data point, and you will see that the error goes away.

To summarize, we note the following useful lessons:

  1. Tracebacks appear cryptic, and can be quite long, but once you understand them they are very helpful!
  2. Exceptions allow you to propagate an error from where it actually occurs to where the function is used, much higher up, without any additional code (unlike return codes).
  3. Make sure that your error messages are helpful! Probably the message about "Improper input: N=3 must not exceed M=2" seemed very clear to the original authors, but maybe is less clear to users. How could you make the error message easier to understand?
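One possible answer to that last question is to validate the input before calling the fitting routine and to describe the problem in the user's terms. A sketch using only the standard library; check_enough_points() is a hypothetical helper, not part of SciPy:

```python
import inspect


def check_enough_points(model, x, y):
    """Raise a descriptive ValueError if there are too few data points to fit model."""
    # Every argument of the model after the first (x) is a fit parameter.
    num_params = len(inspect.signature(model).parameters) - 1
    num_points = len(x)
    if len(y) != num_points:
        raise ValueError("x has %i values but y has %i" % (num_points, len(y)))
    if num_points < num_params:
        raise ValueError(
            "Cannot fit %i parameters with only %i data points; "
            "provide at least %i points." % (num_params, num_points, num_params))


def f(x, a, b, c):
    return a*x**2 + b*x + c


try:
    check_enough_points(f, [0, 1], [2, 3])
except ValueError as e:
    message = str(e)
    print(message)
```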

Unit Testing

Someone hands you the following code to calculate $n!$:


In [ ]:
def factorial(n):
    n_fact = n
    while n > 1:
        n -= 1
        n_fact *= n
    return n_fact

Usually, you check that this code is working by doing something like this:


In [ ]:
print(factorial(3), 3*2)
print(factorial(5), 5*4*3*2)

This sort of testing works fine, but it has a few issues:

  1. You need to manually test several cases each time you change your code.
  2. This sort of informal testing tends to miss edge-cases.

To illustrate point 2, note the following:


In [ ]:
factorial(0)

Oops! Recall that $0! \equiv 1$. Also, note that factorial(-1) = -1, which is also wrong!

Writing a unit test is not much more work than our manual testing above. A possible test suite could look like this:


In [ ]:
correct_factorials = {0: 1, 1: 1, 2: 2, 3: 6, 4: 24, 5: 120}
for n, expected in correct_factorials.items():
    assert factorial(n) == expected

The test fails because factorial(0) = 0, but you wouldn't know that from the output. All you know is that something isn't working.
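An assert statement accepts an optional message after a comma, which makes the failure self-explanatory. A minimal sketch, repeating the buggy factorial() from above so that it runs on its own, and collecting the failures instead of stopping at the first one:

```python
def factorial(n):
    """The buggy implementation from above, repeated so this cell runs alone."""
    n_fact = n
    while n > 1:
        n -= 1
        n_fact *= n
    return n_fact


correct_factorials = {0: 1, 1: 1, 2: 2, 3: 6, 4: 24, 5: 120}
failures = []
for n, expected in correct_factorials.items():
    try:
        assert factorial(n) == expected, (
            "factorial(%i) returned %i, expected %i" % (n, factorial(n), expected))
    except AssertionError as e:
        failures.append(str(e))

print(failures)
```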

A more realistic example of unit testing using pytest can be found in ../resources/pytest_example. Please open this directory and have a look at the code, which is organized as follows (ignoring the __pycache__ directories):

../resources/pytest_example
    factorial.py
    tests/
        __init__.py
        test_factorial.py

The file names and directory structure are important (see the pytest website). pytest can be run as follows:


In [ ]:
!pytest ../resources/pytest_example/

pytest tells us that 2/3 tests failed. One test, test_n_zero, fails because we are trying to assert that factorial(0), which equals zero, is equivalent to one: 0 == 1.

The other test that fails is test_n_negative. A proper version of our factorial function might be expected to raise a ValueError for a negative number, but the one above doesn't, so it fails the test.
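For reference, a version of factorial() that would satisfy both expectations described above might look like this. It is a sketch; the actual tests in test_factorial.py may check additional cases:

```python
def factorial(n):
    """Return n! for a non-negative integer n."""
    if n < 0:
        raise ValueError("factorial() is not defined for negative numbers; got %i" % n)
    n_fact = 1
    for i in range(2, n + 1):
        n_fact *= i
    return n_fact


print([factorial(n) for n in range(6)])
```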

For a quick tutorial and more details, see the pytest website. Several other unit testing frameworks exist, but we prefer pytest because it requires the least amount of code to set up tests and has the cleanest looking tests.

Classes

When to use classes

The question "When should I use classes?" is more difficult to answer than "When should I use functions?" (for which the answer is: almost always). Classes are generally used in Object-Oriented Programming (OOP). A full discussion of OOP is beyond the scope of this course, so we will just give some general guidance here.

You should consider using classes when:

  1. You have several functions manipulating the same set of data.
  2. You find that you are passing the same arguments to several functions.
  3. You want parts of your code to be responsible for maintaining their own internal state.
  4. You want your code to have an easy-to-use interface that doesn't require understanding exactly what the code does.

Consider this code:


In [ ]:
import random

def create_data_set(length, lower_bound=0, upper_bound=10, seed_value=None):
    random.seed(seed_value)
    return [random.uniform(lower_bound, upper_bound) for i in range(length)]
    
def shuffle(data):
    random.shuffle(data)
    return data

def mean(data):
    return sum(data)/len(data)
    
def display(data):
    print(data)
    
def analyze(data):
    print(mean(data))    
    display(data)
    new_data = shuffle(data)
    display(new_data)
    
data = create_data_set(5)
analyze(data)

The first function creates a data set (initialization), while the other functions manipulate this data set. In this case, it may make sense to create a class:


In [ ]:
import random

class DataSet:
    def __init__(self, length, lower_bound=0, upper_bound=10, seed_value=None):
        random.seed(seed_value)
        self.data = [random.uniform(lower_bound, upper_bound) for i in range(length)]
    
    def shuffle(self):
        random.shuffle(self.data)

    def mean(self):
        return sum(self.data)/len(self.data)

    def display(self):
        print(self.data)
        
    def analyze(self):
        print(self.mean())    
        self.display()
        self.shuffle()
        self.display()
        
a = DataSet(length=5)
a.analyze()

In this simple case, the class version and the function version appear more-or-less the same. However, the function version is actually better because it allows more flexibility: what if you wanted to analyze some other data set besides a set of random numbers?
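One way to recover that flexibility in the class version is to pass the data in and move the random generation into an alternative constructor. This is a sketch; the from_random() class method is a hypothetical addition, not part of the class above:

```python
import random


class DataSet:
    def __init__(self, data):
        self.data = list(data)   # copy, so the caller's list is never modified

    @classmethod
    def from_random(cls, length, lower_bound=0, upper_bound=10, seed_value=None):
        """Build a DataSet of uniformly distributed random numbers."""
        random.seed(seed_value)
        return cls(random.uniform(lower_bound, upper_bound) for i in range(length))

    def mean(self):
        return sum(self.data)/len(self.data)


measured = DataSet([1.0, 2.0, 3.0])
simulated = DataSet.from_random(length=5, seed_value=1)
print(measured.mean(), len(simulated.data))
```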

To see the real benefit of using classes, we need to consider something a bit more complex:


In [ ]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.animation as animation
from IPython.display import display, HTML
from math import sin, cos, atan2
import random


def generate_random_path(num_points):
    """Generate a random list of points pacman should visit."""
    xlim = (-PacMan.X_BOUNDS, PacMan.X_BOUNDS)
    ylim = (-PacMan.Y_BOUNDS, PacMan.Y_BOUNDS)
    waypoints = []
    for i in range(num_points):
        waypoints.append((random.uniform(*xlim), random.uniform(*ylim)))
    return waypoints


class PacMan:
    RADIUS = 0.1           # size of pacman
    ANGLE_DELTA = 5        # degrees; controls how fast pacman's mouth opens/closes
    MAX_MOUTH_ANGLE = 30   # degrees; maximum mouth opening half-angle
    MAX_SPEED = 0.02       # controls how fast pacman moves
    X_BOUNDS = 1           # controls x-axis display range
    Y_BOUNDS = 0.5         # controls y-axis display range
    
    def __init__(self, waypoints=None):
        self._init_figure()
        self._init_pacman()
        if waypoints:
            self.waypoints = waypoints
            self.go_home()
        else:
            self.waypoints = []
        self._show_animation()
        
    def _init_figure(self):
        self.fig = plt.figure(figsize=(10, 8))
        self.ax = self.fig.add_subplot(111, aspect='equal')
        self.ax.set_xlim(-self.X_BOUNDS, self.X_BOUNDS)
        self.ax.set_ylim(-self.Y_BOUNDS, self.Y_BOUNDS)
        plt.tight_layout()
        
    def _init_pacman(self):
        self.x = 0
        self.y = 0
        self.angle = 0
        self.angle_set = False
        self.mouth_closing = True
        self.mouth_open_angle = self.MAX_MOUTH_ANGLE
        pacman_patch = patches.Wedge((self.x, self.y), self.RADIUS, 
                                     self.mouth_open_angle, -self.mouth_open_angle,
                                     color="yellow", ec="none")
        self.pacman = self.ax.add_patch(pacman_patch)
    
    def _animate_mouth(self):
        if self.mouth_closing:
            self.mouth_open_angle -= self.ANGLE_DELTA
        else:
            self.mouth_open_angle += self.ANGLE_DELTA
        if self.mouth_open_angle <= 0:
            self.mouth_open_angle = 1
            self.mouth_closing = False
        if self.mouth_open_angle >= self.MAX_MOUTH_ANGLE:
            self.mouth_closing = True
        self.pacman.set_theta1(self.mouth_open_angle)
        self.pacman.set_theta2(-self.mouth_open_angle)

    def _calculate_angle_to_point(self, x, y):
        dx = x - self.x
        dy = y - self.y
        angle_rad = atan2(dy, dx)
        return angle_rad
        
    def _animate_motion(self):
        if not self.waypoints:
            return
        way_x, way_y = self.waypoints[0]
        if (self.x == way_x) and (self.y == way_y):
            self.waypoints.pop(0)
            self.angle_set = False
            return
        if not self.angle_set:
            self.angle = self._calculate_angle_to_point(way_x, way_y)
            self.angle_set = True
        dx = self.MAX_SPEED*cos(self.angle)
        dy = self.MAX_SPEED*sin(self.angle)
        if abs(way_x - (self.x + dx)) >= self.MAX_SPEED:
            self.x += dx
        else:
            self.x = way_x
        if abs(way_y - (self.y + dy)) >= self.MAX_SPEED:
            self.y += dy
        else:
            self.y = way_y
        tx = (mpl.transforms.Affine2D().rotate(self.angle)
              + mpl.transforms.Affine2D().translate(self.x, self.y)
              + self.ax.transData)
        self.pacman.set_transform(tx)
    
    def _next_frame(self, i):
        self._animate_mouth()
        self._animate_motion()
        return self.pacman,
    
    def _show_animation(self):
        if u'inline' in mpl.get_backend():
            ani = animation.FuncAnimation(self.fig, self._next_frame,
                                          frames=500, interval=30, blit=True)
            display(HTML(ani.to_html5_video()))
            plt.clf()
        else:
            ani = animation.FuncAnimation(self.fig, self._next_frame, interval=30)
            if mpl.get_backend() == u'MacOSX':
                plt.show(block=False)
            else:
                plt.show()
            
    def add_waypoint(self, x, y):
        """Add a point where pacman should go. This function is non-blocking."""
        self.waypoints.append((x, y))
    
    def add_random_path(self, num_points):
        """Add a list of random points to pacman's waypoint list."""
        random_points = generate_random_path(num_points)
        self.waypoints.extend(random_points)
        
    def go_home(self):
        """Send pacman back to the origin (0, 0)."""
        self.add_waypoint(-self.MAX_SPEED, 0)
        self.add_waypoint(0, 0)

In [ ]:
random_path = generate_random_path(num_points=10)
pac = PacMan(random_path)

Note that Pacman is responsible for maintaining his own internal state. There are functions to manage how Pacman moves and opens/closes his mouth. All the user has to do is tell him where to go.

If you have a Mac (needed for non-blocking animation), you can move Pacman via three "public" functions (the last three), and you can use them without understanding exactly what is happening inside the class. Otherwise, you should tell Pacman where to go using the __init__ function.

Exercise 3

Without changing the PacMan class, make Pacman go in a square.

For an extra challenge, change Pacman's color to purple (hint: you might need to use set_color) and make him bigger, again without changing the code inside the class!


In [ ]:
# Insert code here

Private fields and methods

Variables inside classes are called fields. Functions inside classes are called methods.

By convention, fields and methods that start with an underscore (e.g., _init_pacman()) are "private", although not in the way that Java or C++ methods are private. These items can still be accessed by users, but the underscore indicates that users should not generally mess with them (they are not part of the public API).

Fields and methods that start with two underscores can also be considered private, but the two underscores have a particular use in Python called "name mangling", and they are intended to help prevent conflicts during inheritance. Unless you know what you are doing, stick to single underscores.
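To make name mangling concrete, here is a minimal sketch. The classes Base and Child and their attribute names are hypothetical and exist only for this illustration. It shows that a double-underscore attribute is stored under a name that includes the class name, so a subclass can reuse the same name without overwriting the parent's value, whereas a single-underscore attribute is simply overwritten.

```python
class Base:
    def __init__(self):
        self._single = "base single"    # convention only: "please don't touch"
        self.__double = "base double"   # actually stored as _Base__double

    def base_double(self):
        return self.__double            # looks up _Base__double


class Child(Base):
    def __init__(self):
        super(Child, self).__init__()
        self._single = "child single"   # overwrites the attribute set by Base
        self.__double = "child double"  # stored as _Child__double, no clash


c = Child()
print(c._single)          # child single
print(c.base_double())    # base double
print(c._Child__double)   # child double
print(sorted(vars(c)))    # ['_Base__double', '_Child__double', '_single']
```

Note that the mangled names are still reachable from outside the class, so this is a tool for avoiding accidental clashes rather than a security feature.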

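Methods that start and end with two underscores (e.g., __init__()) are reserved for Python's "special" methods, which the interpreter calls on your behalf (for example, __init__() runs when an object is created). Don't invent your own method names of this form.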

Going back to the Pacman example, we note that there are only three methods needed to make Pacman move: add_waypoint(), add_random_path(), and go_home(). Each of these can be easily used without any knowledge of the complicated class internals. It is good programming practice to give a class a simple, easy-to-use interface that is difficult to use incorrectly.

Encapsulation

Encapsulation is the object-oriented programming principle that users should not meddle with the internals of your class except through an approved external interface.

In traditional OO languages like Java and C++, encapsulation is strongly encouraged, while Python is less strict.

Here is an example of how Python classes are typically written:


In [ ]:
class Rect:
    def __init__(self, width, height):
        self.width = width
        self.height = height
        
    def area(self):
        return self.width*self.height
    
    def perimeter(self):
        return 2*self.width + 2*self.height

This has a minimum of extra code ("boilerplate" in programmer-speak) and is generally the right way to make a Python class. However, note that we can do the following:


In [ ]:
a = Rect(3, -1)    # fine
print('Area of a:', a.area())

b = Rect(2, 's')   # also fine?
print('Area of b:', b.area())

It is generally good practice to validate the inputs of your classes (e.g., to avoid generating string Rects as above). We may also want to prevent users from changing the internal variables of our class accidentally or in ways that would ultimately generate bad outputs. This is traditionally done using the getter/setter model:


In [ ]:
from numbers import Number

class EncapsulatedRect:
    def __init__(self, width, height):
        self.set_width(width)
        self.set_height(height)
        
    def area(self):
        return self._width*self._height
    
    def perimeter(self):
        return 2*self._width + 2*self._height
    
    def get_width(self):
        return self._width
    
    def get_height(self):
        return self._height
    
    def set_width(self, width):
        if isinstance(width, Number) and width > 0:
            self._width = width
        else:
            raise ValueError('set_width: value should be a positive number.')
        
    def set_height(self, height):
        if isinstance(height, Number) and height > 0:
            self._height = height
        else:
            raise ValueError('set_height: value should be a positive number.')

Here, _width and _height are internal variables that are meant to be changed only through the approved setters, which make sure that the values are good.

Unlike in C++ and Java, however, even in our EncapsulatedRect we can still modify _width and _height directly:


In [ ]:
d = EncapsulatedRect(4, 5)
d._width = 2
print(d.area())

In general, the more "Pythonic" approach is actually Rect rather than EncapsulatedRect. In particular, Python encourages directly accessing fields rather than using getters and setters, which add boilerplate and clutter the code. Python expects users to be smart enough to use classes correctly.

Note that it is still good practice to validate inputs in Python. But how can you do that without using a set_... method? Python offers a @property decorator for this purpose, but we will not discuss its use here.
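Although properties are outside the scope of this section, a minimal sketch may help show the idea. The class name PropertyRect and the helper _check are hypothetical, and the sketch checks against numbers.Real rather than numbers.Number (an assumption made so that complex values are rejected before the comparison). Users read and assign width and height as plain fields, but every assignment passes through a validating setter.

```python
from numbers import Real


class PropertyRect:
    """Hypothetical variant of Rect that validates through properties."""

    def __init__(self, width, height):
        self.width = width     # goes through the width setter below
        self.height = height   # goes through the height setter below

    @staticmethod
    def _check(name, value):
        # Reject non-numbers first so the comparison is never attempted on them.
        if not isinstance(value, Real) or value <= 0:
            raise ValueError('{}: value should be a positive number.'.format(name))
        return value

    @property
    def width(self):
        return self._width

    @width.setter
    def width(self, value):
        self._width = self._check('width', value)

    @property
    def height(self):
        return self._height

    @height.setter
    def height(self, value):
        self._height = self._check('height', value)

    def area(self):
        return self.width*self.height


r = PropertyRect(4, 5)
r.width = 2          # looks like a plain field assignment, but is validated
print(r.area())      # 10

try:
    r.width = -1     # rejected by the setter; r.width stays 2
except ValueError as err:
    print(err)
```

As with EncapsulatedRect, a determined user can still assign r._width directly; the property only guards the public names.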

Exercise 4

Write a simple class called Point2D to represent a mathematical 2D point. You should be able to construct and interact with the point using either rectangular or polar coordinates. Include methods (or functions) to add and subtract two points.

Inheritance

What is inheritance?

Inheritance is a more advanced Python topic, so in case you have forgotten or didn't get to the end of your Python tutorial, here is a brief example:


In [ ]:
class Foo:
    def __init__(self, value):
        self.value = value
        
    def square(self):
        return self.value**2
    
    
class Bar(Foo):   # Bar inherits from Foo
    def __init__(self, value):
        self.value = value
        
    def double(self):
        return 2*self.value
    
    
baz = Bar(9)
print(baz.double())   # baz knows how to double because it is a Bar
print(baz.square())   # baz inherited the ability to square from Foo

Inheritance for specialization

The classic example of using inheritance for specialization is something like this:


In [ ]:
class BasicClass:
    name = "Test"
    value = 42
    
class AdvancedClass(BasicClass):
    extra = [1, 2, 3]
    
adv = AdvancedClass()
adv.value

The AdvancedClass has everything that the BasicClass has, plus more! However, in Python, you could also do this:


In [ ]:
basic = BasicClass()
basic.extra = [1, 2, 3]   # works fine

You can do the same thing with functions:


In [ ]:
basic.f = lambda x: x + 7
basic.f(3)

However, note that a new BasicClass object will not have these features:


In [ ]:
basic2 = BasicClass()
# basic2.extra    # this won't work
# basic2.f(8)     # this won't work either

Finally, there are (at least) four cases when you should definitely use inheritance:

  1. You are going to create objects from both the base class and the specialized class.
  2. You will create multiple objects from either/both the base and specialized classes.
  3. You care about the type of the object (see the section on raising exceptions below).
  4. The features you are adding to a class are numerous and/or non-trivial. In this case, inheritance is much cleaner.

In general, you should prefer using inheritance over manually adding fields or methods.
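The point about object type can be illustrated with a short sketch. It repeats the BasicClass and AdvancedClass definitions from above so that it runs on its own; the variable name patched is hypothetical. An object that was extended by hand has the extra field, but Python still reports it as a plain BasicClass.

```python
class BasicClass:
    name = "Test"
    value = 42


class AdvancedClass(BasicClass):
    extra = [1, 2, 3]


patched = BasicClass()
patched.extra = [1, 2, 3]   # added by hand to this one object only
adv = AdvancedClass()

print(isinstance(adv, AdvancedClass))       # True
print(isinstance(adv, BasicClass))          # True: a subclass instance is also a BasicClass
print(isinstance(patched, AdvancedClass))   # False: the type did not change
print(hasattr(BasicClass(), 'extra'))       # False: new objects lack the hand-added field
print(hasattr(AdvancedClass(), 'extra'))    # True: every AdvancedClass object has it
```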

super Function

Duplicated code is evil!

Duplicating code wastes your time, makes your programs longer and harder to read, and makes them more error-prone. If you make a change to a block of code that is duplicated elsewhere, you will then need to manually change that code in each location it is repeated. Yuck!

Here is a trivial example of how the super function can save you time and money!


In [ ]:
class Foo:
    def __init__(self, value):
        self.value = value
        
    def compute(self):
        print("Foo does some complicated calculations here.")
        self.value += 3
        print("Value:", self.value)
    
    
class Bar(Foo):
    def __init__(self, value):
        self.value = value
        
    def compute(self):
        print("Bar does its own complicated calculations here.")
        self.value *= 2
        super(Bar, self).compute()   # calls compute() function of parent, Foo
        
        
b = Bar(7)
b.compute()

We can also use super to call "special" functions, like the __init__ function (constructor):


In [ ]:
class Bar(Foo):
    def __init__(self, value):
        """
        This constructor is actually not needed. If you comment it out,
        then Foo's constructor will be called automatically. (Try it!)
        However, imagine you want to do something before calling Foo's
        constructor.
        """
        super(Bar, self).__init__(value)   # explicitly calls Foo's constructor
        
    def compute(self):
        print("Bar does its own complicated calculations here.")
        self.value *= 2
        super(Bar, self).compute()   # calls compute() function of parent, Foo
        
        
b = Bar(9)
b.compute()

These are very trivial examples, but please believe that the super function can really cut down on a lot of duplicated code! Use it as often as you can.
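In Python 3, the arguments to super can be omitted inside a method, which avoids repeating the class name. The sketch below uses trimmed-down versions of Foo and Bar (here compute() returns the value instead of printing it, a change made only for this illustration).

```python
class Foo:
    def __init__(self, value):
        self.value = value

    def compute(self):
        self.value += 3
        return self.value


class Bar(Foo):
    # No __init__ here, so Foo's constructor is used automatically.
    def compute(self):
        self.value *= 2
        return super().compute()   # same as super(Bar, self).compute()


b = Bar(9)
print(b.compute())   # 21
```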