Basic Programming Using Python: Unit Testing

Objectives

  • Explain why it is not practical to prove a program correct by testing it.
  • Correctly write unit tests using an xUnit-style unit testing framework.
  • Name and explain the three types of results a test can produce.

Setting Expectations

Like any other piece of experimental apparatus, a complex program requires a much higher investment in testing than a simple one. Putting it another way, a small script that is only going to be used once, to produce one figure, probably doesn't need separate testing: its output is either correct or not. A linear algebra library that will be used by thousands of people in twice that number of applications over the course of a decade, on the other hand, definitely does.

Unfortunately, it's practically impossible to prove that a program will always do what it's supposed to. To see why, consider a function that checks whether a character string contains only the letters 'A', 'C', 'G', and 'T'. These four tests clearly aren't sufficient:

assert is_all_bases('A')
assert is_all_bases('C')
assert is_all_bases('G')
assert is_all_bases('T')

because this version of is_all_bases passes them:

def is_all_bases(bases):
    return True

Adding these tests isn't enough:

assert not is_all_bases('X')
assert not is_all_bases('Y')
assert not is_all_bases('Z')

because this version still passes:

def is_all_bases(bases):
    return bases[0] in 'ACGT'

We can add yet more tests:

assert is_all_bases('ACGCGA')
assert not is_all_bases('CGAZ')

but no matter how many we have, we can always write a function that passes them, but does the wrong thing in other cases. And as we add more tests, we have to start worrying about whether the tests themselves are correct, and about whether we can afford the time needed to write them. After all, if we really want to check that the square root function is correct for all values between 0.0 and 1.0, we need to write over a billion test cases; that's a lot of typing, and the chances of us getting every one right are effectively zero.
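
For example, here is a version of is_all_bases that passes all nine of the assertions above but is still wrong:

def is_all_bases(bases):
    # Cheats by looking only at the ends of the string, so it
    # wrongly accepts strings like 'AXA'.
    return bases[0] in 'ACGT' and bases[-1] in 'ACGT'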

Testing is still worth doing, though: it's one of those things that doesn't work in theory, but is surprisingly effective in practice. If we choose our tests carefully, we can have as much confidence that our software is correct as we have in a mathematical proof or a physical experiment, both of which can contain undetected mistakes too.

Ensuring that we have the right answer is only one reason to test software. The other is that testing speeds up development by reducing the amount of re-work we have to do. Even small programs can be quite complex, and changing one thing can all too easily break something else. If we test changes as we make them, and automatically re-test things we've already done, we can catch and fix errors while the changes are still fresh in our minds.

Unit Testing

Most people don't enjoy writing tests, so if we want them to actually do it, it must be easy to:

  • add or change tests,
  • understand the tests that have already been written,
  • run those tests, and
  • understand those tests' results.

Test results must also be reliable. If a testing tool says that code is working when it's not, or reports problems when there actually aren't any, people will lose faith in it and stop using it.

The simplest kind of test is a unit test that checks the behavior of one component of a program. As an example, suppose we're testing a function called rectangle_area that returns the area of a rectangle specified as [x0, y0, x1, y1], i.e., by its lower-left and upper-right corners. We'll start by testing our code directly using assert. Here, we call the function three times with different arguments, checking that the right value is returned each time.
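
The rectangle module that the transcript below imports from isn't shown in this lesson. For concreteness, here is a hypothetical version of it, written with a deliberate bug of the kind the tests are meant to catch:

# rectangle.py (hypothetical): deliberately buggy -- it squares the
# width instead of multiplying the width by the height.
def rectangle_area(rect):
    x0, y0, x1, y1 = rect
    return float((x1 - x0) * (x1 - x0))

This version happens to get squares right, which is exactly why the two square tests below pass while the genuine rectangle test fails.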


In [11]:
from rectangle import rectangle_area

assert rectangle_area([0, 0, 1, 1]) == 1.0
assert rectangle_area([1, 1, 4, 4]) == 9.0
assert rectangle_area([0, 1, 4, 7]) == 24.0


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-11-ebf7f5f1c120> in <module>()
      3 assert rectangle_area([0, 0, 1, 1]) == 1.0
      4 assert rectangle_area([1, 1, 4, 4]) == 9.0
----> 5 assert rectangle_area([0, 1, 4, 7]) == 24.0

AssertionError: 

This result is useful, in the sense that we know something's wrong, but look what happens if we run the tests in a different order:


In [12]:
assert rectangle_area([0, 1, 4, 7]) == 24.0
assert rectangle_area([1, 1, 4, 4]) == 9.0
assert rectangle_area([0, 0, 1, 1]) == 1.0


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-12-548f3f32c981> in <module>()
----> 1 assert rectangle_area([0, 1, 4, 7]) == 24.0
      2 assert rectangle_area([1, 1, 4, 4]) == 9.0
      3 assert rectangle_area([0, 0, 1, 1]) == 1.0

AssertionError: 

Python halts at the first failed assertion, so the second and third tests aren't run at all. It would be more helpful if we could get data from all of our tests every time they're run, since the more information we have, the faster we're likely to be able to track down bugs. It would also be helpful to have some kind of summary report: if our test suite includes thirty or forty tests (as it well might for a complex function or library that's widely used), we'd like to know how many passed or failed.

Here's a different approach. First, let's put each test in a function with a meaningful name:


In [13]:
def test_unit_square():
    assert rectangle_area([0, 0, 1, 1]) == 1.0

def test_large_square():
    assert rectangle_area([1, 1, 4, 4]) == 9.0

def test_actual_rectangle():
    assert rectangle_area([0, 1, 4, 7]) == 24.0

Next, let's import a library called ears and ask it to run our tests for us:


In [14]:
import ears
ears.run()


..f
2 pass, 1 fail, 0 error
----------------------------------------
fail: test_actual_rectangle
Traceback (most recent call last):
  File "ears.py", line 43, in run
    test()
  File "<ipython-input-13-643689ad0a0f>", line 8, in test_actual_rectangle
    assert rectangle_area([0, 1, 4, 7]) == 24.0
AssertionError

ears.run looks in the calling program for functions whose names start with the letters 'test_' (or another prefix given as an argument) and runs each one. If a function completes without any assertion failing, we count the test as a success. If an assertion fails, we count the test as a failure, but if any other exception occurs, we count it as an error, because the odds are that the test itself is broken.

ears belongs to a family of tools called xUnit testing frameworks. The name "xUnit" comes from the fact that many of them are imitations of a Java testing library called JUnit. The Wikipedia page on the subject lists dozens of similar frameworks in almost as many languages, all of which share the same structure: each test is a single function that follows some naming convention (e.g., starts with 'test_'), and the framework runs the tests in some order and reports how many passed, failed, or were broken.
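
The real ears library isn't reproduced here, but a minimal runner in this style might look like the following sketch (an illustration, not the actual ears source; sys._getframe is a CPython-specific way to reach the caller's namespace):

import sys

def run(prefix='test_'):
    '''Run the caller's test functions and print a summary.'''
    caller = sys._getframe(1).f_globals    # namespace of the module that called run()
    tests = [func for name, func in caller.items()
             if name.startswith(prefix) and callable(func)]
    passes = fails = errors = 0
    for test in tests:
        try:
            test()
            passes += 1
            sys.stdout.write('.')          # test passed
        except AssertionError:
            fails += 1
            sys.stdout.write('f')          # an assertion failed
        except Exception:
            errors += 1
            sys.stdout.write('E')          # the test itself is broken
    sys.stdout.write('\n{0} pass, {1} fail, {2} error\n'.format(passes, fails, errors))

A real runner would also record the traceback of each failure so that it can print the detailed report shown above; this sketch prints only the summary.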

Most unit tests aren't as simple as a single function call, and many include several assertions to check several aspects of the values that functions return. For example, suppose we have a function called border that's supposed to draw a black border around an image grid. Here are a couple of unit tests for it:


In [23]:
from ipythonblocks import ImageGrid
from border import border

black = (0, 0, 0)
white = (255, 255, 255)

def test_border_2x2():
    fixture = ImageGrid(2, 2, fill=white)
    border(fixture, black)
    assert fixture[0, 0].rgb == black
    assert fixture[0, 1].rgb == black
    assert fixture[1, 0].rgb == black
    assert fixture[1, 1].rgb == black

def count_colors(grid):
    num_black = num_white = num_other = 0
    for x in range(grid.width):
        for y in range(grid.height):
            if grid[x, y].rgb == black:
                num_black += 1
            elif grid[x, y].rgb == white:
                num_white += 1
            else:
                num_other += 1
    return num_black, num_white, num_other
    
def test_border_3x3():
    fixture = ImageGrid(3, 3, fill=white)
    border(fixture, black)
    num_black, num_white, num_other = count_colors(fixture)
    assert num_black == 8
    assert num_white == 1
    assert num_other == 0
    assert fixture[1, 1].rgb == white # only white cell is in the center
        
ears.run('test_border_')


..
2 pass, 0 fail, 0 error

The first test checks things directly; the second uses a helper function to count cells of different colors, then checks that those counts are correct and that the only white cell is in the middle of the 3×3 grid. If we go on to test grids of a few other sizes, we can use this helper function to check them as well.
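
The border function itself isn't shown in this lesson. One implementation that would satisfy these tests might look like this (a hypothetical sketch; it assumes the set_colors method that ipythonblocks provides on grid cells):

def border(grid, color):
    '''Color every cell on the outer edge of the grid.'''
    for x in range(grid.width):
        grid[x, 0].set_colors(*color)                # one horizontal edge
        grid[x, grid.height - 1].set_colors(*color)  # the other horizontal edge
    for y in range(grid.height):
        grid[0, y].set_colors(*color)                # one vertical edge
        grid[grid.width - 1, y].set_colors(*color)   # the other vertical edge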

This example also demonstrates that writing tests can be as difficult as writing the program in the first place. In fact, if we don't build our program out of small functions that are more-or-less independent, writing tests can actually be more complicated than writing the code itself. Luckily, there's a technique to help us build things right.

Test-Driven Development

Libraries like ears can't think of test cases for us. We still have to decide what to test and how many tests to run. Our best guide here is economics: we want the tests that are most likely to give us useful information that we don't already have. For example, if rectangle_area([0, 0, 1, 1]) works, there's probably not much point testing rectangle_area([0, 0, 2, 2]), since it's hard to think of a bug that would show up in one case but not in the other.

We should therefore try to choose tests that are as different from each other as possible, so that we force the code we're testing to execute in all the different ways it can. Another way of thinking about this is that we should try to find boundary cases. If a function works for zero, one, and a million values, it will probably work for eighteen values.

Using boundary values as tests has another advantage: it can help us design our software. To see how, consider this test case for our rectangle area function:

def test_inverted_rectangle():
    assert rectangle_area([1, 5, 5, 2]) == -12.0

Is that test correct? I.e., are rectangles with x1<x0 or y1<y0 legal, and do they have negative area? Or should the test be:

def test_inverted_rectangle():
    try:
        rectangle_area([1, 5, 5, 2])
    except ValueError:
        return # rectangle_area raised the expected kind of exception
    except Exception:
        assert False, 'Function raised the wrong kind of exception for invalid rectangle'
    assert False, 'Function did not raise an exception for invalid rectangle'

The logic in this second version may take a moment to work out, but the idea is straightforward: we want to check that rectangle_area raises a ValueError if it's given a rectangle whose x1 is less than its x0 or whose y1 is less than its y0, i.e., whose upper-right corner is below or to the left of its lower-left corner.

Here's another test case that can help us design our software:

def test_zero_width():
    assert rectangle_area([2, 1, 2, 8]) == 0

We might decide that rectangles with negative areas aren't allowed, but what about rectangles with zero area, i.e., rectangles that are actually lines? Any actual implementation of rectangle_area will do something when given one of these; writing unit tests for boundary cases is a good way to specify exactly what that something should be.
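
Pulling these decisions together, one version of rectangle_area that passes all of the tests above might look like this (a sketch of one plausible design, not the only defensible one):

def rectangle_area(rect):
    '''Return the area of a rectangle specified as [x0, y0, x1, y1].

    Zero-width and zero-height rectangles are legal and have area 0.0;
    inverted rectangles raise ValueError.
    '''
    x0, y0, x1, y1 = rect
    if x1 < x0 or y1 < y0:
        raise ValueError('Invalid rectangle: {0}'.format(rect))
    return float((x1 - x0) * (y1 - y0))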

Unit tests are actually such a good way to define how functions ought to behave that many programmers use a practice called test-driven development (TDD). Instead of writing code, then figuring out how to test it, these programmers:

  1. write some unit tests for a function that doesn't exist yet,
  2. write that function,
  3. modify it until it passes all of the tests, then
  4. clean up the function, i.e., make it more readable or more efficient without breaking any of the tests.

The mantra often used during TDD is "red, green, refactor": get a red light (i.e., some failing tests), make it turn green (i.e., get something working), and then clean it up by refactoring. This cycle should take anywhere from a couple of minutes to an hour or so. If it takes longer than that, the change being made is probably too large, and should be broken down into smaller (and more comprehensible) steps.
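
To make the rhythm concrete, here is what one tiny red-green-refactor pass might look like for a hypothetical function called total (the name and behavior are invented for illustration):

# Red: write the tests first. Running them now fails,
# because total doesn't exist yet.
def test_total_empty():
    assert total([]) == 0

def test_total_values():
    assert total([1, 2, 3]) == 6

# Green: the simplest code that makes both tests pass.
def total(values):
    result = 0
    for v in values:
        result += v
    return result

# Refactor: 'return sum(values)' is clearer and just as correct,
# and the tests let us make that change with confidence.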

TDD's proponents argue that it helps people produce better code for two reasons. First, it encourages them to write code in small, self-contained chunks, and to actually write tests for those chunks. Second, it frees them from confirmation bias: since they haven't written their function yet, their subconscious cannot steer their testing toward proving it correct rather than finding errors.

Empirical studies of TDD have had mixed results: some have found it beneficial, while others have found no effect. But even if you don't use it day to day, trying it a few times helps you learn how to design functions and programs that are easier to test.

Key Points

  • Use a unit-testing framework to check and re-check code's correctness.
  • Put each unit test in its own small function.
  • Use test-driven development to define how functions should behave.