Overview

Contents

  • Introduction to Python
    • The Python Interpreter
    • First Steps with Python
    • Importing Libraries
    • About the Data
    • Arrays and their Attributes
    • Getting Help
    • More on Arrays
    • Basic Data Visualization
  • Repeating Tasks with Loops
    • Sequences
    • More Complex Loops
    • Iterators and Generators
  • Analyzing Data from Multiple Files
    • Looping over Files
    • Generating a Plot
    • Putting it All Together
  • Conditional Evaluation
    • Conditional Expressions in Python
    • Checking our Data
  • Creating Functions for Reuse
    • Composing Multiple Functions
    • Cleaning Up our Analysis Code
    • Positional versus Keyword Arguments
    • Documenting Functions
  • Capstone: Fitting Linear Models
    • About the Data
    • Introducing statsmodels
  • Python at the Command Line
    • Our First Python Script
    • Alternative Command-Line Tools
    • Modularization
  • Understanding and Handling Errors
  • Defensive Programming
    • Assertions
    • Test-Driven Development
    • Testing for Quality Control
    • Unit Testing
  • Analyzing and Optimizing Performance
    • Benchmarking
  • Capstone
    • Your Tasks
    • Getting Started
  • Connecting to SQLite with Python

Introduction to Python

The Python Interpreter

Jupyter Notebook

First Steps with Python

Importing Libraries

About the Data

The data are formatted such that:

  • Each column is the monthly mean, January (1) through December (12)
  • Each row is a year, starting from January 1948 (1) through December 2016 (69)

More information on the data can be found here.

Arrays and their Attributes

How many rows and columns are there in the barrow array?

Challenge

What do each of the following code samples do?

barrow[0]
barrow[0,]
barrow[-1]
barrow[-3:-1]

Slicing NumPy Arrays

Challenge

What's the mean monthly temperature in August of 2016? Converted to degrees Fahrenheit?

Degrees F can be calculated from degrees K by the formula:

$$ T_F = \left(T_K \times \frac{9}{5}\right) - 459.67 $$

Calculating on NumPy Arrays

What is the overall mean temperature in any month in Barrow between 1948 and 2016 in degrees C?

How cold was the coldest February in Barrow, by monthly mean temperatures, in degrees C?

Challenge

What's the minimum, maximum, and mean monthly temperature for August in Barrow, in degrees C?

Getting Help

More on Arrays

What is the mean temperature in 1948? In 1949? And so on...

Remembering the difference between axis = 0 and axis = 1 is tricky, even for experienced Python programmers. Here are some helpful, visual reminders:

Basic Data Visualization

Repeating Tasks with Loops

Sequences

Character Strings

Lists and Tuples

Performing Calculations with Lists

More Complex Loops

Challenge: Looping over Sequences

Write a for loop that iterates through the letters of your favorite city, putting each letter inside a list. The result should be a list with an element for each letter.

Hint: You can create an empty list like this:

letters = []

Hint: You can confirm you have the right result by comparing it to:

list("my favorite city")

Challenge: Sequences and Mutability

Which of the sequences we've learned about are immutable (i.e., they can't be changed)?

  • Strings are (immutable / mutable)?
  • Lists are (immutable / mutable)?
  • Tuples are (immutable / mutable)?

And what does this mean for working with each data type?

"birds".upper()

[1, 2, 3].append(4)

(1, 2, 3)

Iterators and Generators

Analyzing Data from Multiple Files

First Step: Looping over Files

Second Step: Generating a Plot

Third Step: Putting It All Together

Challenge: Integrating over Multiple File Datasets

For each location (each file), plot the difference between that location's mean temperature and the mean across all locations.

Hint: One way to calculate the mean across five (5) files is by adding the 5 arrays together, then dividing by 5. You can add arrays together in a loop like this:

# Start with an array full of zeros that is 69-elements long
running_total = np.zeros((69))

for fname in filenames:
    data = np.loadtxt(fname, delimiter = ',')
    running_total = running_total + data.mean(axis = 1)

Hint: How do you difference two arrays? Remember how the plus, +, and minus, -, operators work on arrays?

Conditional Evaluation

This code can be represented by the following workflow.

Challenge: Conditional Expressions

How can you make this code print "Greater" by changing only one line?

a_number = 42

if a_number > 100:
    print('Greater')

else:
    print('Not greater')

print('Done')

There are two (2) one-line changes you could make. Can you find them both?

Conditional Expressions in Python

What do each of the following evaluate to, True or False?

1 < 2
1 <= 1
3 == 3
2 != 3

Checking our Data

Challenge: Fitting a Line over Multiple File Datasets

Write a for loop, with an if statement inside, that calculates a line of best fit for each dataset's temperature anomalies and prints out a message as to whether that trend line is positive or negative.

Hint: What we want to know about each trend line is whether, for:

results = sm.OLS(y_data, x_data).fit()
b0, b1 = results.params

If b1, the slope of the line, is positive or negative.

Creating Functions for Reuse

What happens if we remove the keyword return from this function? Make this change and call the function again.

Composing Multiple Functions

Now that we've created a function that converts temperatures in degrees Kelvin to degrees Celsius, let's see if we can write a function that converts from degrees Celsius to degrees Fahrenheit.

$$ T_F = \left(T_C \times \frac{9}{5}\right) + 32 $$

Cleaning Up our Analysis Code

Positional versus Keyword Arguments

Documenting Functions

Challenge: Functions

Create one (or both, for an extra challenge) of the following functions...

  • A function called fences that takes an input character string and surrounds it on both sides with another string, e.g., "pasture" becomes "|pasture|" or "@pasture@" if either "|" or "@" are provided.
  • A function called rescale that takes an array and returns a corresponding array of values scaled to lie in the range 0.0 to 1.0.

Hint: Strings can be concatenated with the plus operator.

'cat' + 's'

Hint: If $x_0$ and $x_1$ are the lowest and highest values in an array, respectively, then the replacement value for any element $x$, scaled to between 0.0 and 1.0, should be:

$$ \frac{x - x_0}{x_1 - x_0} $$

Capstone: Fitting Linear Models

We've seen the basics of the Python programming language. Now, let's get some hands-on experience applying what we've learned to real scientific data. In this exercise, we'll see how to fit linear trends to data using a new Python library, statsmodels. After you see an initial example, you'll have time to extend the example on your own.

About the Data

The data are formatted such that:

  • Each column is the monthly mean, January (1) through December (12)
  • Each row is a year, starting from January 1948 (1) through December 2016 (69)

More information on the data can be found here.

Introducing statsmodels

Without going into too much detail, our linear trend line has two components: a constant term ($\alpha$) and the slope of the trend line ($\beta$). Using linear algebra, we represent these two terms as two columns in a matrix. To fit a linear model with a constant term, the first column is a column of ones.

$$ \begin{align} [\mathrm{Temp.\ anomaly}]&=[\mathrm{Some\ constant,\ }\alpha] + [\mathrm{Slope\ of\ trend\ line},\beta]\times[\mathrm{Year}]\\ \left[\begin{array}{r} -2.04\\ -0.20\\ 0.88\\ \vdots\\ \end{array}\right] &= \left[\begin{array}{rr} 1 & 1948\\ 1 & 1949\\ 1 & 1950\\ \vdots & \vdots\\ \end{array}\right] \left[\begin{array}{r} \alpha\\ \beta\end{array}\right] \end{align} $$

Challenge: Fitting a Line over Multiple Datasets

Write a for loop, with an if statement inside, that calculates a line of best fit for each dataset's temperature anomalies and prints out a message as to whether that trend line is positive or negative.

Hint: What we want to know about each trend line is whether, for:

results = sm.OLS(y_data, x_data).fit()
b0, b1 = results.params

If b1, the slope of the line, is positive or negative. So, to break that down:

  1. Loop over all the temperature files;
  2. Calculate the temperature anomaly;
  3. Fit an OLS regression to the anomaly data;
  4. print() out whether the trend line is "positive" or "negative;"

Hint: Looping over Files


In [1]:
import glob

filenames = glob.glob('*.csv')
filenames


Out[1]:
['barrow.temperature.csv',
 'reston.temperature.csv',
 'land_o_lakes.temperature.csv',
 'key_west.temperature.csv',
 'wvu.temperature.csv']

Python at the Command Line

We've seen a lot of tools and techniques for improving our productivity through reproducible Python code. So far, however, we've been working exclusively within Jupyter Notebook. Jupyter Notebook is great for interactive, exploratory work in Python and encourages literate programming, as we discussed earlier. A Notebook is a great place to demonstrate to your future self or your peers how some Python code works.

But when it's time to scale-up your work and process data, you want to be on the command line, for all the reasons we saw when we discussed the Unix shell earlier.

Let's explore Python programs at the command line using the following Python script, temp_extremes.py.

'''
Reports the min and max July temperatures for each file
that matches the given filename pattern.
'''

import csv
import os
import sys
import glob

def main():
    # Get the user-specified directory
    directory = sys.argv[1]

    # Pattern to use in searching for files
    filename_pattern = os.path.join(directory, '*temperature.csv')

    for filename in glob.glob(filename_pattern):
        july_temps = []

        # While the file is open...
        with open(filename, 'r') as stream:
            # Use a function to read the file
            reader = csv.reader(stream)

            # Each row is a year
            for row in reader:
                # Add this year's July temperature to the list
                july_temps.append(row[6])

        # A human-readable name for the file        
        pretty_name = os.path.basename(filename)
        print(pretty_name, '--Hottest July mean temp. was', max(july_temps), 'deg K')
        print(pretty_name, '--Coolest July mean temp. was', min(july_temps), 'deg K')


if __name__ == '__main__':
    main()

We can run this script on the command line by typing the following:

$ python3 temp_extremes.py .

Remember that the single dot, . represents the current working directory, which is where all of our temperature CSV files are located.

Our First Python Script

Encapsulating to Keep the Namespace Clean

Alternative Command-Line Tools

sys.argv is a rather crude tool for processing command-line arguments. There are a couple of alternatives I suggest you look into if you are going to be writing command-line programs in Python:

Optional: Try out Python Fire

Modularization

Installing Your Project as a Module

We haven't covered installing new Python modules, but when the time is right for you to package your code together as a single module (e.g., as package_name, in the example above), consider installing your module "in development mode" first.

Understanding and Handling Errors

Defensive Programming

Assertions

Assertations Regarding Type (Advanced)

Challenge: Asserting Type in a Function

Recall the celsius_to_fahr() function we saw earlier. Implement a type-checking assertion that produces an AssertionError when the temp_c argument is not a number. Remember that there are two types of numbers we've see in Python so far:

  • float
  • int

You can decide whether the celsius_to_fahr() function should accept one or both of these types as inputs. Don't forget to provide a helpful message as part of the AssertionError.


In [137]:
def celsius_to_fahr(temp_c):
    return (temp_c * (9/5)) + 32

Assertions Regarding Inherited Type (Advanced)

What if we used OrderedDict to represent our metadata?

Duck Typing (Advanced)

For more information about types, classes, and how Python represents objects, see:

Test-Driven Development

For example, suppose we need to find where two or more time series overlap. The range of each time series is represented as a pair of numbers, which are the time the interval started and ended. The output is the largest range that they all include.

Challenge: Fix the Range Overlap Function

Fix range_overlap(); re-run test_range_overlap() after each change you make.


In [152]:
def range_overlap(ranges):
    '''Return common overlap among a set of [low, high] ranges.'''
    for i, (low, high) in enumerate(ranges):
        if i == 0:
            lowest, highest = low, high
            continue
            
        lowest = max(lowest, low)
        highest = min(highest, high)
        
    if lowest >= highest:
        return None
        
    return (lowest, highest)

test_range_overlap()

When in Doubt...

Testing for Quality Control

Unit Testing

Analyzing and Optimizing Performance

"Premature optimization is the root of all evil." - Sir Tony Hoare (later popularized by Donald Knuth)

Benchmarking

Line and Memory Profiling

Line and memory profilers aren't available in the Anaconda installation I had you use, but you can read all about this topic on this excellent blog post.

Capstone

Now you have a chance to bring together everything you've learned in this Software Carpentry workshop, particularly:

  • Using the Unix shell to download and manage delimited text data files;
  • Importing data into Python;
  • Using NumPy or other Python tools to summarize the data and diagnose any data issues;
  • Cleaning and plotting the data using reproducible Python functions;

For this open-ended exercise, we'll use data from the Woods Hole Oceanographic Institution's "Martha's Vineyard Coastal Observatory." Choose which of the following datasets you want to work with:

Each of these data sources has an online file directory:

Once you've decided which dataset you want to work with, follow along with me using that particular record. I'm going to use the oceanographic data in this example.

Your Tasks

  1. Download 3 days worth of meteorological or oceanographic data files using wget or curl in the Unix shell.
  2. Join the files together as one file using cat in the Unix shell.
  3. Read the data into a Python session.
  4. Create a time plot of a variable of your choice.
  5. Filter the rows of the table to only daytime observations.
  6. Write a function to calculate the range of values in air temperature (meteorological record)) or water temperature (oceanographic record). Apply this function teach of the 3+ days for which you obtained data.

Getting Started

For this exercise, we want to work with multiple data files. Each data file in the index is one day, but we want to work with multiple days worth of data. How can we quickly and conveniently download multiple data files from the web?

This is something= the Unix shell is really great at automating. The WHOI dataset we're using exposes multiple data files at unique URLs. Below is an example file from the 120th day of 2018.

ftp://mvcodata.whoi.edu/pub/mvcodata/data/OcnDat_s/2018/2018120_OcntDat_s.C99

To download other days, we need only change one number in the URL:

ftp://mvcodata.whoi.edu/pub/mvcodata/data/OcnDat_s/2018/2018120_OcnDat_s.C99
ftp://mvcodata.whoi.edu/pub/mvcodata/data/OcnDat_s/2018/2018119_OcnDat_s.C99
ftp://mvcodata.whoi.edu/pub/mvcodata/data/OcnDat_s/2018/2018118_OcnDat_s.C99
...

How can we automate this with the Unix shell? First we need to figure out which shell program to use. Recall that the Unix shell offers many different small programs; depending on which variant of the Unix shell we're using, a certain program might not be available.

which wget
which curl

So you may have both installed. If you do, use wget instead of curl; both programs do the same thing.

Downloading Online Datafiles with wget

Here's an example shell script to get us started. We iterate over three (3) days (118, 119, 120), storing each day number in a variable called day. In each iteration, we use wget to download the file, inserting that day number stored in the variable day.

cd
cd Desktop

for day in 118 119 120
do
  wget "ftp://mvcodata.whoi.edu/pub/mvcodata/data/OcnDat_s/2018/2018${day}_OcnDat_s.C99"
done

Downloading Online Datafiles with curl

Unlike wget, the curl program prints the downloaded text data directly to the screen. Remember how we dealt with taking screen output and redirecting it to a file?

cd
cd Desktop

for day in 118 119 120
do
  curl "ftp://mvcodata.whoi.edu/pub/mvcodata/data/OcnDat_s/2018/2018${day}_OcnDat_s.C99" > 2018${day}_OcnDat_s.C99
done

Good Luck!

Try the rest of the Capstone on your own. Helpful hints are provided throughout. You're encouraged to work with a partner but feel free to work independently if that suits you.

The hints below use only the packages we've seen in Python 3 so far, but if you're feeling adventurous, the pandas package has more and better tools for dealing with mixed, tabular data like the meteorology and oceanographic records here.

import pandas as pd

For help getting started with pandas, check out 10 Minutes to Pandas, in particular, the sections:

Hint: Combining Multiple Text Files in the Unix Shell

Remember the cat program?

cat 2018120_OcnDat_s.C99

You can provide multiple files to the cat program and it will combine them line-by-line.

cat 2018120_OcnDat_s.C99 2018119_OcnDat_s.C99 2018118_OcnDat_s.C99

But this just prints everything out to the screen. Use redirection to store the output in a new file called ocean.txt or met.txt, depending on which data source you're using.

Hint: Reading in Data Using Python

The WHOI datasets are apparently space-delimited. They might also be called "fixed width" because each data column appears at the same distance along each line.


In [162]:
import numpy as np

data = np.loadtxt('/home/arthur/Desktop/ocean.txt', delimiter = ' ')

# How many rows and columns?
data.shape


Out[162]:
(216, 24)

Hint: Plotting Data in Python

Revisit your notes from when we did this earlier! Remember you need to import the plotting capability in Jupyter Notebook:

import matplotlib.pyplot as pyplot
%matplotlib inline

Hint: Filter Data Using Python

The fifth column contains the hour of the day. Recall that Python starts counting at zero, so column 5 is at position 4. The three dots (...) just mean "everything else" or, more specifically, "all the rows." Recall that with NumPy arrays, we count the number of rows, then columns: "rows comma columns."


In [163]:
data[...,4]


Out[163]:
array([  0.,   0.,   0.,   1.,   1.,   1.,   2.,   2.,   2.,   3.,   3.,
         3.,   4.,   4.,   4.,   5.,   5.,   5.,   6.,   6.,   6.,   7.,
         7.,   7.,   8.,   8.,   8.,   9.,   9.,   9.,  10.,  10.,  10.,
        11.,  11.,  11.,  12.,  12.,  12.,  13.,  13.,  13.,  14.,  14.,
        14.,  15.,  15.,  15.,  16.,  16.,  16.,  17.,  17.,  17.,  18.,
        18.,  18.,  19.,  19.,  19.,  20.,  20.,  20.,  21.,  21.,  21.,
        22.,  22.,  22.,  23.,  23.,  23.,   0.,   0.,   0.,   1.,   1.,
         1.,   2.,   2.,   2.,   3.,   3.,   3.,   4.,   4.,   4.,   5.,
         5.,   5.,   6.,   6.,   6.,   7.,   7.,   7.,   8.,   8.,   8.,
         9.,   9.,   9.,  10.,  10.,  10.,  11.,  11.,  11.,  12.,  12.,
        12.,  13.,  13.,  13.,  14.,  14.,  14.,  15.,  15.,  15.,  16.,
        16.,  16.,  17.,  17.,  17.,  18.,  18.,  18.,  19.,  19.,  19.,
        20.,  20.,  20.,  21.,  21.,  21.,  22.,  22.,  22.,  23.,  23.,
        23.,   0.,   0.,   0.,   1.,   1.,   1.,   2.,   2.,   2.,   3.,
         3.,   3.,   4.,   4.,   4.,   5.,   5.,   5.,   6.,   6.,   6.,
         7.,   7.,   7.,   8.,   8.,   8.,   9.,   9.,   9.,  10.,  10.,
        10.,  11.,  11.,  11.,  12.,  12.,  12.,  13.,  13.,  13.,  14.,
        14.,  14.,  15.,  15.,  15.,  16.,  16.,  16.,  17.,  17.,  17.,
        18.,  18.,  18.,  19.,  19.,  19.,  20.,  20.,  20.,  21.,  21.,
        21.,  22.,  22.,  22.,  23.,  23.,  23.])

The range() function in Python returns a list of consecutive integers between the first and the second number. The np.in1d function tests each element of our "hour" column to see if it is in the list of numbers from 7 to 18, inclusive.


In [164]:
np.in1d(data[...,4], range(7,19))


Out[164]:
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False], dtype=bool)

To filter the full NumPy array to just those rows where the "hour" is between 7 and 18 (inclusive), we can take this vector of True and False values and put inside brackets, as below. Remember that the three dots (...) just mean "everything else" or, more specifically, "all the columns."


In [165]:
daytime = data[np.in1d(data[...,4], range(7,19)),...]
daytime.shape


Out[165]:
(108, 24)

Hint: Creating a Function to Calculate Temperature Ranges

The range of daytime temperatures for a given day is the daily maximum temperature minus the daily minimum temperature. Create a single function to do this, then call it at least 3 times, once for each unique day in your dataset.

How can you present this function with just the data for a given day? There are a few different ways. In both datasets, the 4th column (column 3 in Python, where we start counting from zero) has an integer representing the day of the month.


In [166]:
daytime[...,3]


Out[166]:
array([ 28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,
        28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,
        28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,
        28.,  28.,  28.,  29.,  29.,  29.,  29.,  29.,  29.,  29.,  29.,
        29.,  29.,  29.,  29.,  29.,  29.,  29.,  29.,  29.,  29.,  29.,
        29.,  29.,  29.,  29.,  29.,  29.,  29.,  29.,  29.,  29.,  29.,
        29.,  29.,  29.,  29.,  29.,  29.,  30.,  30.,  30.,  30.,  30.,
        30.,  30.,  30.,  30.,  30.,  30.,  30.,  30.,  30.,  30.,  30.,
        30.,  30.,  30.,  30.,  30.,  30.,  30.,  30.,  30.,  30.,  30.,
        30.,  30.,  30.,  30.,  30.,  30.,  30.,  30.,  30.])

In [167]:
daytime[...,3] == 28


Out[167]:
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False], dtype=bool)

For another way to get all the records associated with a single day, remember that every day has the same number of records. Now that we've filtered to just the daytime records, there should be 36 records per day.


In [168]:
daytime.shape


Out[168]:
(108, 24)

In [169]:
108 / 3


Out[169]:
36.0

We can slice the first 36 rows of our data table to obtain the first day. The second day would then be rows 37 through 72, and so on.


In [170]:
daytime[0:36,3]


Out[170]:
array([ 28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,
        28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,
        28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,  28.,
        28.,  28.,  28.])

Connecting to SQLite with Python


In [178]:
cursor.close()
connection.close()

Best Practices with Database Connections