statsmodels
The data are formatted such that:
How many rows and columns are there in the barrow
array?
What is the overall mean temperature in any month in Barrow between 1948 and 2016 in degrees C?
How cold was the coldest February in Barrow, by monthly mean temperatures, in degrees C?
What is the mean temperature in 1948? In 1949? And so on...
Remembering the difference between axis = 0
and axis = 1
is tricky, even for experienced Python programmers. Here are some helpful, visual reminders:
Write a for
loop that iterates through the letters of your favorite city, putting each letter inside a list. The result should be a list with an element for each letter.
Hint: You can create an empty list like this:
letters = []
Hint: You can confirm you have the right result by comparing it to:
list("my favorite city")
Which of the sequences we've learned about are immutable (i.e., they can't be changed)?
And what does this mean for working with each data type?
"birds".upper()
[1, 2, 3].append(4)
(1, 2, 3)
For each location (each file), plot the difference between that location's mean temperature and the mean across all locations.
Hint: One way to calculate the mean across five (5) files is by adding the 5 arrays together, then dividing by 5. You can add arrays together in a loop like this:
# Start with an array full of zeros that is 69-elements long
running_total = np.zeros((69))
for fname in filenames:
data = np.loadtxt(fname, delimiter = ',')
running_total = running_total + data.mean(axis = 1)
Hint: How do you difference two arrays? Remember how the plus, +
, and minus, -
, operators work on arrays?
This code can be represented by the following workflow.
How can you make this code print "Greater" by changing only one line?
a_number = 42
if a_number > 100:
print('Greater')
else:
print('Not greater')
print('Done')
There are two (2) one-line changes you could make. Can you find them both?
What do each of the following evaluate to, True
or False
?
1 < 2
1 <= 1
3 == 3
2 != 3
Write a for
loop, with an if
statement inside, that calculates a line of best fit for each dataset's temperature anomalies and prints out a message as to whether that trend line is positive or negative.
Hint: What we want to know about each trend line is whether, for:
results = sm.OLS(y_data, x_data).fit()
b0, b1 = results.params
If b1
, the slope of the line, is positive or negative.
What happens if we remove the keyword return
from this function? Make this change and call the function again.
Now that we've created a function that converts temperatures in degrees Kelvin to degrees Celsius, let's see if we can write a function that converts from degrees Celsius to degrees Fahrenheit.
$$ T_F = \left(T_C \times \frac{9}{5}\right) + 32 $$Create one (or both, for an extra challenge) of the following functions...
fences
that takes an input character string and surrounds it on both sides with another string, e.g., "pasture" becomes "|pasture|" or "@pasture@" if either "|" or "@" are provided.rescale
that takes an array and returns a corresponding array of values scaled to lie in the range 0.0 to 1.0.Hint: Strings can be concatenated with the plus operator.
'cat' + 's'
Hint: If $x_0$ and $x_1$ are the lowest and highest values in an array, respectively, then the replacement value for any element $x$, scaled to between 0.0 and 1.0, should be:
$$ \frac{x - x_0}{x_1 - x_0} $$We've seen the basics of the Python programming language. Now, let's get some hands-on experience applying what we've learned to real scientific data. In this exercise, we'll see how to fit linear trends to data using a new Python library, statsmodels
. After you see an initial example, you'll have time to extend the example on your own.
The data are formatted such that:
Without going into too much detail, our linear trend line has two components: a constant term ($\alpha$) and the slope of the trend line ($\beta$). Using linear algebra, we represent these two terms as two columns in a matrix. To fit a linear model with a constant term, the first column is a column of ones.
$$ \begin{align} [\mathrm{Temp.\ anomaly}]&=[\mathrm{Some\ constant,\ }\alpha] + [\mathrm{Slope\ of\ trend\ line},\beta]\times[\mathrm{Year}]\\ \left[\begin{array}{r} -2.04\\ -0.20\\ 0.88\\ \vdots\\ \end{array}\right] &= \left[\begin{array}{rr} 1 & 1948\\ 1 & 1949\\ 1 & 1950\\ \vdots & \vdots\\ \end{array}\right] \left[\begin{array}{r} \alpha\\ \beta\end{array}\right] \end{align} $$Write a for
loop, with an if
statement inside, that calculates a line of best fit for each dataset's temperature anomalies and prints out a message as to whether that trend line is positive or negative.
Hint: What we want to know about each trend line is whether, for:
results = sm.OLS(y_data, x_data).fit()
b0, b1 = results.params
If b1
, the slope of the line, is positive or negative. So, to break that down:
print()
out whether the trend line is "positive" or "negative;"
In [1]:
import glob
filenames = glob.glob('*.csv')
filenames
Out[1]:
We've seen a lot of tools and techniques for improving our productivity through reproducible Python code. So far, however, we've been working exclusively within Jupyter Notebook. Jupyter Notebook is great for interactive, exploratory work in Python and encourages literate programming, as we discussed earlier. A Notebook is a great place to demonstrate to your future self or your peers how some Python code works.
But when it's time to scale-up your work and process data, you want to be on the command line, for all the reasons we saw when we discussed the Unix shell earlier.
Let's explore Python programs at the command line using the following Python script, temp_extremes.py
.
'''
Reports the min and max July temperatures for each file
that matches the given filename pattern.
'''
import csv
import os
import sys
import glob
def main():
# Get the user-specified directory
directory = sys.argv[1]
# Pattern to use in searching for files
filename_pattern = os.path.join(directory, '*temperature.csv')
for filename in glob.glob(filename_pattern):
july_temps = []
# While the file is open...
with open(filename, 'r') as stream:
# Use a function to read the file
reader = csv.reader(stream)
# Each row is a year
for row in reader:
# Add this year's July temperature to the list
july_temps.append(row[6])
# A human-readable name for the file
pretty_name = os.path.basename(filename)
print(pretty_name, '--Hottest July mean temp. was', max(july_temps), 'deg K')
print(pretty_name, '--Coolest July mean temp. was', min(july_temps), 'deg K')
if __name__ == '__main__':
main()
We can run this script on the command line by typing the following:
$ python3 temp_extremes.py .
Remember that the single dot, .
represents the current working directory, which is where all of our temperature CSV files are located.
sys.argv
is a rather crude tool for processing command-line arguments. There are a couple of alternatives I suggest you look into if you are going to be writing command-line programs in Python:
argparse
, another built-in library, that handles common cases in a systematic way. Check out this tutorial.Fire
, a very new Python module from Google, which can turn any Python object (function, class, etc.) into a command-line API.We haven't covered installing new Python modules, but when the time is right for you to package your code together as a single module (e.g., as package_name
, in the example above), consider installing your module "in development mode" first.
Recall the celsius_to_fahr()
function we saw earlier. Implement a type-checking assertion that produces an AssertionError
when the temp_c
argument is not a number. Remember that there are two types of numbers we've see in Python so far:
float
int
You can decide whether the celsius_to_fahr()
function should accept one or both of these types as inputs. Don't forget to provide a helpful message as part of the AssertionError
.
In [137]:
def celsius_to_fahr(temp_c):
return (temp_c * (9/5)) + 32
What if we used OrderedDict
to represent our metadata?
For more information about types, classes, and how Python represents objects, see:
For example, suppose we need to find where two or more time series overlap. The range of each time series is represented as a pair of numbers, which are the time the interval started and ended. The output is the largest range that they all include.
Fix range_overlap()
; re-run test_range_overlap()
after each change you make.
In [152]:
def range_overlap(ranges):
'''Return common overlap among a set of [low, high] ranges.'''
for i, (low, high) in enumerate(ranges):
if i == 0:
lowest, highest = low, high
continue
lowest = max(lowest, low)
highest = min(highest, high)
if lowest >= highest:
return None
return (lowest, highest)
test_range_overlap()
"Premature optimization is the root of all evil." - Sir Tony Hoare (later popularized by Donald Knuth)
Line and memory profilers aren't available in the Anaconda installation I had you use, but you can read all about this topic on this excellent blog post.
Now you have a chance to bring together everything you've learned in this Software Carpentry workshop, particularly:
For this open-ended exercise, we'll use data from the Woods Hole Oceanographic Institution's "Martha's Vineyard Coastal Observatory." Choose which of the following datasets you want to work with:
solar_campmt_m50[W/m^2]
) and rainfall (rain_campmt[mm]
); the order of the fields in the delimited files is listed here.Each of these data sources has an online file directory:
Once you've decided which dataset you want to work with, follow along with me using that particular record. I'm going to use the oceanographic data in this example.
wget
or curl
in the Unix shell.cat
in the Unix shell.For this exercise, we want to work with multiple data files. Each data file in the index is one day, but we want to work with multiple days worth of data. How can we quickly and conveniently download multiple data files from the web?
This is something= the Unix shell is really great at automating. The WHOI dataset we're using exposes multiple data files at unique URLs. Below is an example file from the 120th day of 2018.
ftp://mvcodata.whoi.edu/pub/mvcodata/data/OcnDat_s/2018/2018120_OcntDat_s.C99
To download other days, we need only change one number in the URL:
ftp://mvcodata.whoi.edu/pub/mvcodata/data/OcnDat_s/2018/2018120_OcnDat_s.C99
ftp://mvcodata.whoi.edu/pub/mvcodata/data/OcnDat_s/2018/2018119_OcnDat_s.C99
ftp://mvcodata.whoi.edu/pub/mvcodata/data/OcnDat_s/2018/2018118_OcnDat_s.C99
...
How can we automate this with the Unix shell? First we need to figure out which shell program to use. Recall that the Unix shell offers many different small programs; depending on which variant of the Unix shell we're using, a certain program might not be available.
which wget
which curl
So you may have both installed. If you do, use wget
instead of curl
; both programs do the same thing.
Here's an example shell script to get us started. We iterate over three (3) days (118, 119, 120), storing each day number in a variable called day
. In each iteration, we use wget
to download the file, inserting that day number stored in the variable day
.
cd
cd Desktop
for day in 118 119 120
do
wget "ftp://mvcodata.whoi.edu/pub/mvcodata/data/OcnDat_s/2018/2018${day}_OcnDat_s.C99"
done
Unlike wget
, the curl
program prints the downloaded text data directly to the screen. Remember how we dealt with taking screen output and redirecting it to a file?
cd
cd Desktop
for day in 118 119 120
do
curl "ftp://mvcodata.whoi.edu/pub/mvcodata/data/OcnDat_s/2018/2018${day}_OcnDat_s.C99" > 2018${day}_OcnDat_s.C99
done
Try the rest of the Capstone on your own. Helpful hints are provided throughout. You're encouraged to work with a partner but feel free to work independently if that suits you.
The hints below use only the packages we've seen in Python 3 so far, but if you're feeling adventurous, the pandas
package has more and better tools for dealing with mixed, tabular data like the meteorology and oceanographic records here.
import pandas as pd
For help getting started with pandas
, check out 10 Minutes to Pandas, in particular, the sections:
Remember the cat
program?
cat 2018120_OcnDat_s.C99
You can provide multiple files to the cat
program and it will combine them line-by-line.
cat 2018120_OcnDat_s.C99 2018119_OcnDat_s.C99 2018118_OcnDat_s.C99
But this just prints everything out to the screen. Use redirection to store the output in a new file called ocean.txt
or met.txt
, depending on which data source you're using.
The WHOI datasets are apparently space-delimited. They might also be called "fixed width" because each data column appears at the same distance along each line.
In [162]:
import numpy as np
data = np.loadtxt('/home/arthur/Desktop/ocean.txt', delimiter = ' ')
# How many rows and columns?
data.shape
Out[162]:
Revisit your notes from when we did this earlier! Remember you need to import the plotting capability in Jupyter Notebook:
import matplotlib.pyplot as pyplot
%matplotlib inline
The fifth column contains the hour of the day. Recall that Python starts counting at zero, so column 5 is at position 4. The three dots (...
) just mean "everything else" or, more specifically, "all the rows." Recall that with NumPy arrays, we count the number of rows, then columns: "rows comma columns."
In [163]:
data[...,4]
Out[163]:
The range()
function in Python returns a list of consecutive integers between the first and the second number. The np.in1d
function tests each element of our "hour" column to see if it is in the list of numbers from 7 to 18, inclusive.
In [164]:
np.in1d(data[...,4], range(7,19))
Out[164]:
To filter the full NumPy array to just those rows where the "hour" is between 7 and 18 (inclusive), we can take this vector of True
and False
values and put inside brackets, as below. Remember that the three dots (...
) just mean "everything else" or, more specifically, "all the columns."
In [165]:
daytime = data[np.in1d(data[...,4], range(7,19)),...]
daytime.shape
Out[165]:
The range of daytime temperatures for a given day is the daily maximum temperature minus the daily minimum temperature. Create a single function to do this, then call it at least 3 times, once for each unique day in your dataset.
How can you present this function with just the data for a given day? There are a few different ways. In both datasets, the 4th column (column 3 in Python, where we start counting from zero) has an integer representing the day of the month.
In [166]:
daytime[...,3]
Out[166]:
In [167]:
daytime[...,3] == 28
Out[167]:
For another way to get all the records associated with a single day, remember that every day has the same number of records. Now that we've filtered to just the daytime records, there should be 36 records per day.
In [168]:
daytime.shape
Out[168]:
In [169]:
108 / 3
Out[169]:
We can slice the first 36 rows of our data table to obtain the first day. The second day would then be rows 37 through 72, and so on.
In [170]:
daytime[0:36,3]
Out[170]:
In [178]:
cursor.close()
connection.close()