This lesson introduces Python as an environment for reproducible scientific data analysis and programming. The materials are based on the Software Carpentry Programming with Python lesson.
At the end of this lesson, you will be able to:
This lesson will introduce Python as a general purpose programming language. Python is a great programming language to use for a wide variety of applications, including:
As with the Software Carpentry lesson, this lesson is licensed for open use under the CC BY 4.0 license.
Python is a general purpose programming language that allows for the rapid development of scientific workflows. Python's main advantages are:
The only language that computers really understand is machine language, or binary: ones and zeros. Anything we tell computers to do has to be translated to binary for computers to execute.
Python is what we call an interpreted language. This means that computers can translate Python to machine code as they are reading it. This distinguishes Python from languages like C, C++, or Java, which have to be compiled to machine code before they are run. The details aren't important to us; what is important is that we can use Python in two ways:
For this lesson, we'll be using the Python interpreter that is embedded in Jupyter Notebook. Jupyter Notebook is a fancy, browser-based environment for literate programming, the combination of Python scripts with rich text for telling a story about the task you set out to do with Python. This is a powerful way to collect the code, the analysis, the context, and the results in a single place.
The Python interpreter we'll interact with in Jupyter Notebook is the same interpreter we could use from the command line. To launch Jupyter Notebook:
Type jupyter notebook at the command line; then press ENTER.

Let's try out the Python interpreter.
In [1]:
print('Hello, world!')
Alternatively, we could save that one line of Python code to a text file with a *.py
extension and then execute that file. We'll see that towards the end of this lesson.
In interactive mode, the Python interpreter does three things for us, in order:
This is called a read, evaluate, print loop (REPL). Let's try it out.
In [2]:
5 * 11
Out[2]:
We can use Python as a fancy calculator, like any programming language.
When we perform calculations with Python, or run any Python statement that produces output, if we don't explicitly save that output somewhere, then we can't access it again. Python prints the output to the screen, but it doesn't keep a record of the output.
In order to save the output of an arbitrary Python statement, we have to assign that output to a variable. We do this using the equal sign operator:
In [3]:
weight_kg = 5 * 11
Notice there is no output associated with running this command. That's because the output we saw earlier has instead been saved to the variable named weight_kg.

If we want to retrieve this output, we can ask Python for the value associated with the variable named weight_kg.
In [4]:
weight_kg
Out[4]:
As we saw earlier, we can also use the function print()
to explicitly print the value to the screen.
In [5]:
print('Weight in pounds:', 2.2 * weight_kg)
A function like print()
can take multiple arguments, or inputs to the function. In the example above, we've provided two arguments; two different things to print to the screen in a sequence.
We can also change a variable's assigned value.
In [6]:
weight_kg = 57.5
print('Weight in pounds:', 2.2 * weight_kg)
If we imagine the variable as a sticky note with a name written on it, assignment is like putting the sticky note on a particular value. See this illustration.
This means that assigning a value to one variable does not change the values of other variables. For example:
In [7]:
weight_lb = 2.2 * weight_kg
weight_lb
Out[7]:
In [8]:
weight_kg = 100.0
weight_lb
Out[8]:
Since weight_lb doesn't remember where its value came from, it isn't automatically updated when weight_kg changes. This is different from how, say, spreadsheets work.
What are some tasks you're hoping to complete with Python? Alternatively, what kinds of things have you done in other programming languages?
When you're thinking of starting a new computer-aided analysis or building a new software tool, there's always the possibility that someone else has created just the piece of software you need to get your job done faster. Because Python is a popular, general-purpose, and open-source programming language with a long history, there's a wealth of completed software tools out there written in Python for you to use. Each of these software libraries extends the basic functionality of Python to let you do new and better things.
The Python Package Index (PyPI) is the place to start when you're looking for a piece of Python software to use. We'll talk about that later.
For now, we'll load a Python package that is already available on our systems. NumPy is a numerical computing library that allows us to both represent sophisticated data structures and perform calculations on them.
In [9]:
import numpy
Now that we've imported numpy
, we have access to new tools and functions that we didn't have before. For instance, we can use numpy
to read in tabular data for us to work with.
In [10]:
numpy.loadtxt('barrow.temperature.csv', delimiter = ',')
Out[10]:
The expression numpy.loadtxt()
is a function call that asks Python to run the function loadtxt()
which belongs to the numpy
library. Here, the word numpy
is the namespace to which a function belongs. This dotted notation is used everywhere in Python to refer to the parts of things as thing.component
.
Because the loadtxt()
function and others belong to the numpy
library, to access them we will always have to type numpy.
in front of the function name. This can get tedious, especially in interactive mode, so Python allows us to come up with a new namespace as an alias.
In [11]:
import numpy as np
The np
alias for the numpy
library is a very common alias; so common, in fact, that you can get help for NumPy functions by looking up np
and the function name in a search engine.
With this alias, the loadtxt()
function is now called as:
In [12]:
np.loadtxt('barrow.temperature.csv', delimiter = ',')
Out[12]:
np.loadtxt()
has two arguments: the name of the file we want to read, and the delimiter that separates values on a line. These both need to be character strings (or strings for short), so we put them in quotes.
Finally, note that we haven't stored the Barrow temperature data because we haven't assigned it to a variable. Let's fix that.
In [13]:
barrow = np.loadtxt('barrow.temperature.csv', delimiter = ',')
The data we're using for this lesson are monthly averages of surface air temperatures from 1948 to 2016 for five different locations. They are derived from the NOAA NCEP CPC Monthly Global Surface Air Temperature Data Set, which has a 0.5 degree spatial resolution.
What is the unit for air temperature used in this dataset?

Recall that when we assign a value to a variable, we don't see any output on the screen. To see our Barrow temperature data, we can use the print() function again.
In [14]:
print(barrow)
The data are formatted such that each row is one year, from 1948 through 2016, and each column is one month, from January through December.
Now that our data are stored in memory, we can start asking substantial questions about it. First, let's ask how Python represents the value stored in the barrow
variable.
In [15]:
type(barrow)
Out[15]:
This output indicates that barrow
currently refers to an N-dimensional array created by the NumPy library.
A NumPy array contains one or more elements of the same data type. The type()
function only tells us that we have a NumPy array. We can find out the type of data contained in the array by asking for the data type of the array.
In [16]:
barrow.dtype
Out[16]:
This tells us that the NumPy array's elements are 64-bit floating point, or decimal numbers.
In the last example, we accessed an attribute of the barrow
array called dtype
. Because dtype
is not a function, we don't call it using a pair of parentheses. We'll talk more about this later but, for now, it's sufficient to distinguish between these examples:
np.loadtxt() - a function that takes arguments, which go inside the parentheses
barrow.dtype - an attribute of the barrow array; the dtype of an array doesn't depend on anything, so dtype is not a function and it does not take arguments

How many rows and columns are there in the barrow array?
In [17]:
barrow.shape
Out[17]:
We see there are 69 rows and 12 columns.
The shape
attribute, like the dtype
, is a piece of information that was generated and stored when we first created the barrow
array. This extra information, shape
and dtype
, describe barrow
in the same way an adjective describes a noun. We use the same dotted notation here as we did with the loadtxt()
function because they have the same part-and-whole relationship.
To access the elements of the barrow
array, we use square-brackets as follows.
In [18]:
barrow[0, 0]
Out[18]:
The 0, 0
element is the element in the first row and the first column. Python starts counting from zero, not from one, just like other languages in the C family (including C++, Java, and Perl).
With this bracket notation, remember that rows are counted first, then columns. For instance, this is the value in the first row and second column of the array:
In [19]:
barrow[0, 1]
Out[19]:
Challenge: What do each of the following code samples do?
barrow[0]
barrow[0,]
barrow[-1]
barrow[-3:-1]
We can make a larger selection with slicing. For instance, here is the first year of monthly average temperatures, all 12 of them, for Barrow:
In [20]:
barrow[0, 0:12]
Out[20]:
The notation 0:12
can be read, "Start at index 0 and go up to, but not including, index 12." The up-to-but-not-including is important; we have 12 values in the array but, since we started counting at zero, there isn't a value at index 12.
In [21]:
barrow[0, 12]
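The same up-to-but-not-including rule applies to any Python sequence, not just NumPy arrays. Here is a quick sketch with a plain list of made-up values:

```python
# A small list to illustrate half-open slices
months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# 0:12 starts at index 0 and stops *before* index 12,
# so it yields all twelve values even though there is no index 12
print(months[0:12])
print(len(months[0:12]))   # 12

# 3:6 yields the elements at indices 3, 4, and 5 -- not index 6
print(months[3:6])         # [4, 5, 6]
```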
Slices don't have to start at zero and they also don't have to include the upper or lower bound, if we want to simply take all the ending or beginning values, respectively.
Here's the last six monthly averages of the first three years, written two different ways:
In [22]:
barrow[0:3, 6:12]
Out[22]:
In [23]:
barrow[:3, 6:]
Out[23]:
If we don't include a number at all, then the : symbol indicates "take everything."
In [25]:
barrow[0, :]
Out[25]:
Challenge: What's the mean monthly temperature in August of 2016? Converted to degrees Fahrenheit?
Degrees F can be calculated from degrees K by the formula:

$$ T_F = \left(\left(T_K - 273.15\right) \times \frac{9}{5}\right) + 32 $$
Arrays know how to perform common mathematical operations on their values. This allows us to treat them like pure numbers, as we did earlier.
Convert the first year of Barrow air temperatures from degrees Kelvin to degrees Celsius.
In [26]:
barrow[0,:] - 273.15
Out[26]:
We can also perform calculations that have only matrices as operands.
Calculate the monthly average of the first two years of air temperatures in Barrow. (Consider this the average of the monthly averages.) Then convert to Celsius.
In [27]:
two_year_sum = barrow[0,:] + barrow[1,:]
(two_year_sum / 2) - 273.15
Out[27]:
Quite often, we want to do more than add, subtract, multiply, and divide values of data. NumPy knows how to do more complex operations on arrays, including statistical summaries of the data.
What is the overall mean temperature in any month in Barrow between 1948 and 2016 in degrees C?
In [28]:
barrow.mean() - 273.15
Out[28]:
Here, note that the mean()
function is an attribute of the barrow
array. When the attribute of a Python object is a function, we usually call it a method. Methods are functions that belong to Python objects. Here, the barrow
array owns a method called mean()
. When we call the mean()
method of barrow
, the array knows how to calculate its overall mean.
Note, also, that barrow.mean()
is an example of a function that doesn't have to take any input. In this example, no input is needed because the overall mean is not dependent on any external information.
How cold was the coldest February in Barrow, by monthly mean temperatures, in degrees C?
In [29]:
barrow[:,1].min() - 273.15
Out[29]:
Challenge: What's the minimum, maximum, and mean monthly temperature for August in Barrow, in degrees C?
How did we find out what methods the array has for us to use?
While in Jupyter Notebook, we can use the handy shortcut:
In [30]:
barrow?
In general, Python provides two helper functions, help()
and dir()
.
help(barrow)
In [32]:
dir(barrow)
Out[32]:
Before, we saw how the built-in statistical summary methods like mean()
could be used to calculate an overall statistical summary for the barrow
array or a subset of the array. More often, we want to look at partial statistics, such as the mean temperature in each year or the maximum temperature in each month.
One way to do this is to take what we saw earlier and apply it to each row or column.
What is the mean temperature in 1948? In 1949? And so on...
In [33]:
barrow[0,:].mean()
Out[33]:
In [34]:
barrow[1,:].mean()
Out[34]:
But this gets tedious very quickly. Instead, we can calculate a statistical summary over a particular axis of the array: along its rows or along its columns.
In [35]:
barrow.mean()
Out[35]:
In [36]:
barrow.mean(axis = 1)
Out[36]:
Recall that in our square bracket notation, we number the rows first, then the columns. Also, recall that Python starts counting at zero. Therefore, the "1" axis refers to the columns. To calculate a mean across the column axis, that is, the mean temperature in each year, we use the axis = 1
argument in the function call.
As a quick check, we can confirm that there are 69 values in the output, one for each of the 69 years between 1948 and 2016, inclusive.
In [37]:
barrow.mean(axis = 1).shape
Out[37]:
What, then, does the following function call give us?
In [38]:
barrow.mean(axis = 0)
Out[38]:
Remembering the difference between axis = 0 and axis = 1 is tricky, even for experienced Python programmers. Here is a helpful reminder: axis 0 runs down the rows, so collapsing it with axis = 0 leaves one value per column; axis 1 runs across the columns, so collapsing it with axis = 1 leaves one value per row.
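One way to keep the rule straight is to try it on a tiny array where the answer is obvious. This sketch uses a made-up 2-row, 3-column array, not our temperature data:

```python
import numpy as np

small = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# axis=0 collapses the rows: one mean per column (2.5, 3.5, 4.5)
print(small.mean(axis=0))

# axis=1 collapses the columns: one mean per row (2.0, 5.0)
print(small.mean(axis=1))
```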
Visualizations, such as line plots, bar charts, and other graphics, can often help us to better interpret data, particularly when we have a large set of numbers. Here, we'll look at some basic data visualization tools available in Python.
While there is no "official" plotting library in Python, the most often used is a library called matplotlib
. First, we'll import the pyplot
module from matplotlib
and use two of its functions to create and display a heatmap of our data.
In [39]:
import matplotlib.pyplot as pyplot
%matplotlib inline
We also used a Jupyter Notebook "magic" command to make matplotlib
figures display inside our notebook, instead of a separate window. Commands with percent signs in front of them, as in the above example, are particular to Jupyter Notebook and won't work in a basic Python interpreter or at the command line.
In [40]:
image = pyplot.imshow(barrow, aspect = 1/3)
pyplot.show()
Let's look at the average temperature over the years.
In [41]:
avg_temp = barrow.mean(axis = 1)
avg_temp_plot = pyplot.plot(avg_temp)
pyplot.show()
Let's review what we've just learned.
Import a library with import library_name;
Assign a value to a variable with variable = value;
Get the number of rows and columns in an array with my_array.shape;
Get the data type of an array's elements with my_array.dtype;
Use array[x, y] to select the element in row X, column Y from an array;
Use a:b to specify a slice that includes indices from A up to, but not including, B;
Plot data with matplotlib.

We saw how to do these things for just one CSV file we loaded into Python. But what if we're interested in more places than just Barrow, Alaska? We could execute the same code for each place we're interested in, each CSV file...
In this next lesson, we'll learn how to instruct Python to do the same thing multiple times without mistake.
Before we discuss how Python can execute repetitive tasks, we need to discuss sequences. Sequences are a very powerful and flexible class of data structures in Python.
The simplest example of a sequence is a character string.
In [42]:
'Hello, world!'
Out[42]:
Character strings can be created with either single or double quotes; it doesn't matter which. If you want to include quotes inside a string you have some options...
In [43]:
print("My mother's cousin")
In [44]:
print('My mother\'s cousin said, "Behave."')
Every character string is a sequence of letters.
In [45]:
word = 'bird'
We can access each letter in this sequence of letters using the index notation we saw earlier.
In [46]:
print(word[0])
print(word[1])
print(word[2])
print(word[3])
This is a bad approach, though, for multiple reasons.
Here's a better approach.
In [47]:
for letter in word:
    print(letter)
This is much shorter! And we can quickly see how it scales much better.
In [48]:
word = 'cockatoo'
for letter in word:
    print(letter)
Some things to note about this example:
The keyword for is what gives this technique its name; it's a for loop.
The loop visits every element of the word sequence, in order.
In each iteration of the for loop, the variable letter takes on a new value. (What is the value of letter in the first iteration? The second?)

Most importantly, the second line is indented by four spaces. In Python, code blocks are structured by indentation. When a group of Python lines belongs to a for loop, as in this example, those lines are indented one level more than the line with the for statement. It doesn't matter whether we use tabs or spaces, or how many we use, as long as we're consistent throughout the Python program.
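A minimal sketch of how indentation defines the loop body: the indented line runs once per element, while the unindented line runs only once, after the loop finishes.

```python
for number in [1, 2, 3]:
    print('Inside the loop:', number)   # indented: runs once per element
print('After the loop')                 # not indented: runs once, at the end
```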
You'll note that Jupyter Notebook indented the second line automatically for you and that it uses four spaces. Jupyter Notebook follows the Python Enhancement Proposal 8, or PEP-8, which is a style guide for Python. It is therefore strongly recommended that you use four (4) spaces for indentation in your own Python code, wherever you're writing it.
Another type of sequence in Python that is more general and more useful than a character string is the list. Lists are formed by putting comma-separated values inside square brackets.
In [49]:
cities = ['Barrow', 'Veracruz', 'Chennai']
for city in cities:
    print('Visit beautiful', city, '!')
Lists can be indexed and sliced just as we saw with character strings (another sequence) and NumPy arrays.
In [50]:
cities[0]
Out[50]:
In [51]:
cities[0:1]
Out[51]:
In [52]:
cities[0:2]
Out[52]:
In [53]:
cities[-1]
Out[53]:
What really distinguishes lists is that they can hold different types of data in a single instance, including other lists!
In [54]:
things = [100, 'oranges', [1,2,3]]
for thing in things:
    print(thing)
Just as we saw with NumPy arrays, lists also have a number of useful built-in methods.
How many times is the value "Barrow"
found in the cities
list? Exactly once.
In [55]:
cities.count('Barrow')
Out[55]:
Some of these methods change the list in place, meaning that they modify the underlying list without returning a new list. Since there is no output from these methods, to confirm that something changed with cities, we have to examine its value again.
In [56]:
cities.reverse()
cities
Out[56]:
Also, try running the above code block multiple times over.
A tuple is a lot like a list in that it can take multiple types of data. However, once it is created, a tuple cannot be changed.
In [57]:
cities
Out[57]:
In [58]:
cities_tuple = tuple(cities)
cities_tuple
Out[58]:
In [59]:
cities[2] = 'Accra'
cities
Out[59]:
In [60]:
cities_tuple[2] = 'Accra'
And, therefore, tuples also lack the list methods that modify a sequence, such as append().
In [61]:
cities.append('Paris')
cities
Out[61]:
In [62]:
cities_tuple.append('Paris')
Tuples are advantageous when you have sequence data that you want to make sure doesn't get changed. Partly because they can't be changed, tuples are also slightly faster in computations where the sequence is very large.
Lists and tuples also allow us to quickly exchange values between variables and the elements of a sequence.
In [63]:
option1, option2 = ('Chocolate', 'Strawberry')
print('Option 1 is:', option1)
print('Option 2 is:', option2)
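The same unpacking idea gives Python a concise way to exchange, or swap, two values in a single statement; the right-hand side is packed into a tuple first, then unpacked into the names on the left. Repeating the assignment above:

```python
option1, option2 = ('Chocolate', 'Strawberry')

# Swap the two values in a single statement, no temporary variable needed
option1, option2 = option2, option1
print(option1)   # Strawberry
print(option2)   # Chocolate
```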
Performing calculations with lists is different from with NumPy arrays.
In [64]:
numbers = [1, 1, 2, 3, 5, 8]
numbers * 2
Out[64]:
In [65]:
numbers + 1
In [66]:
numbers + [13]
Out[66]:
So, with lists, the addition operator + means to concatenate, while the multiplication operator * means to repeat. This is the same for strings:
In [67]:
'butter' + 'fly'
Out[67]:
In [68]:
'Cats! ' * 5
Out[68]:
We can execute any arbitrary Python code inside a loop. That includes changing the value of an assigned variable within a loop.
In [69]:
length = 0
for letter in 'Bangalore':
    length = length + 1
length
Out[69]:
Note that a loop variable is just a variable that’s being used to record progress in a loop. It still exists after the loop is over, and we can re-use variables previously defined as loop variables as well.
In [70]:
letter
Out[70]:
Also, in the last exercise, we used a loop to count the number of letters in the string 'Bangalore'
. Python has a built-in function to count the number of elements in any sequence.
In [71]:
len('Bangalore')
Out[71]:
In [72]:
cities
Out[72]:
In [73]:
len(cities)
Out[73]:
Write a for
loop that iterates through the letters of your favorite city, putting each letter inside a list. The result should be a list with an element for each letter.
Hint: You can create an empty list like this:
letters = []
Hint: You can confirm you have the right result by comparing it to:
list("my favorite city")
Which of the sequences we've learned about are immutable (i.e., they can't be changed)?
And what does this mean for working with each data type?
"birds".upper()
[1, 2, 3].append(4)
(1, 2, 3)
Earlier, we saw that lists in Python have a built-in method to reverse their elements.
In [74]:
cities.reverse()
cities
Out[74]:
There's also a built-in global function, reversed()
, that returns a reversed version of a sequence.
In [75]:
reversed(cities)
Out[75]:
But, whoa, what is this? Why didn't we get a reversed version of the cities
list?
In [76]:
print(reversed(cities))
What we got from reversed()
is called an iterator. A Python iterator is any object that iteratively produces a value as part of a loop.
Remember that everything in Python is an object. Objects that are also iterators have a special built-in behavior where they "give up" a value when they're used inside a for
loop, while
loop, or similar iterative procedure. Put another way, iterators only produce new values on demand. By returning an iterator instead of a reversed copy of the list, the reversed()
function is saving memory and computing time.
In [77]:
for each in reversed(cities):
    print(each)
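To see the on-demand behavior directly, we can ask an iterator for its values one at a time with the built-in next() function. This sketch uses a small throwaway list rather than our cities list:

```python
letters = ['a', 'b', 'c']
backwards = reversed(letters)   # an iterator; no values have been produced yet

print(next(backwards))   # 'c' -- the first value, produced on demand
print(next(backwards))   # 'b'
print(next(backwards))   # 'a'
# A fourth call to next() would raise StopIteration: the iterator is exhausted
```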
We have almost all of the tools we need to process all of our data files. The last thing we need is a tool for navigating our file system and finding the files that we're interested in.
In [78]:
import glob
The glob library contains a function, also called glob, that finds files and directories whose names match a pattern. These patterns are called "globs," hence the name of this Python module and its chief function.
glob patterns have two wildcards:

The * character matches zero or more characters;
The ? character matches any one character.

We can use this to get the names of all the CSV files in the current directory:
In [79]:
glob.glob('*.csv')
Out[79]:
As you can see, glob returns the names of the matching files as a list. This means that we can loop over that list, doing something with each filename in turn.
In our case, we want to generate a set of plots for each file in our temperature dataset.
First, let's confirm that we can loop over filenames and load in each file.
In [80]:
filenames = glob.glob('*.csv')
for fname in filenames:
    print('Working on', fname, '...')
    data = np.loadtxt(fname, delimiter = ',')
Let's use the variables created in the last iteration of the loop to test that we can create the matplotlib
figure that we want. Recall that after a for
loop has finished, the looping variable retains the value of the last element accessed in the loop.
In [81]:
fname
Out[81]:
In [82]:
fig = pyplot.figure(figsize = (10.0, 3.0))
axes1 = fig.add_subplot(1, 3, 1)
axes2 = fig.add_subplot(1, 3, 2)
axes3 = fig.add_subplot(1, 3, 3)
axes1.set_ylabel('Average (deg K)')
axes1.plot(data.mean(axis = 1))
axes2.set_ylabel('Minimum (deg K)')
axes2.plot(data.min(axis = 1))
axes3.set_ylabel('Maximum (deg K)')
axes3.plot(data.max(axis = 1))
fig.tight_layout()
pyplot.show()
Now we need to combine the code we've written so far so that we create these images for each file.
In [83]:
import glob
import numpy as np
import matplotlib.pyplot as pyplot
%matplotlib inline
# Get a list of the filenames we're interested in
filenames = glob.glob('*.csv')
for fname in filenames:
    print('Working on', fname, '...')
    # Load the data
    data = np.loadtxt(fname, delimiter = ',')
    # Create a 1 x 3 figure
    fig = pyplot.figure(figsize = (10.0, 3.0))
    axes1 = fig.add_subplot(1, 3, 1)
    axes2 = fig.add_subplot(1, 3, 2)
    axes3 = fig.add_subplot(1, 3, 3)
    axes1.set_ylabel('Average (deg K)')
    axes1.plot(data.mean(axis = 1))
    axes2.set_ylabel('Minimum (deg K)')
    axes2.plot(data.min(axis = 1))
    axes3.set_ylabel('Maximum (deg K)')
    axes3.plot(data.max(axis = 1))
    fig.tight_layout()
    pyplot.show()
Note that I've introduced comments into my code using the hash character, #. Anything that comes after this symbol on the same line will be ignored by the Python interpreter. This is a really useful way of reminding yourself, as well as communicating to others, what certain parts of your code are doing.
For each location (each file), plot the difference between that location's mean temperature and the mean across all locations.
Hint: One way to calculate the mean across five (5) files is by adding the 5 arrays together, then dividing by 5. You can add arrays together in a loop like this:
# Start with an array full of zeros that is 69-elements long
running_total = np.zeros((69))
for fname in filenames:
    data = np.loadtxt(fname, delimiter = ',')
    running_total = running_total + data.mean(axis = 1)
Hint: How do you difference two arrays? Remember how the plus, +
, and minus, -
, operators work on arrays?
In [84]:
filenames = glob.glob('*.csv')
running_total = np.zeros((69))
for fname in filenames:
    data = np.loadtxt(fname, delimiter = ',')
    running_total = running_total + data.mean(axis = 1)

overall_mean = running_total / len(filenames)

for fname in filenames:
    data = np.loadtxt(fname, delimiter = ',')
    fig = pyplot.figure(figsize = (10.0, 3.0))
    axis1 = fig.add_subplot(1, 1, 1)
    axis1.plot((data.mean(axis = 1) - overall_mean))
    axis1.set_title(fname)
    pyplot.show()
As humans, we often face choices in our work and daily lives. Given some information about the world, we make a decision to do one thing or another.
It should be obvious that our computer programs need to do this as well. So far, we've written Python code that does the exact same thing with whatever input it receives. This is a great benefit, of course, for our productivity; we can reliably perform analyses the same way on multiple datasets.
Now, we need to learn how Python can detect certain conditions and act accordingly.
Conditional evaluation in Python is performed using what are called conditional statements.
In [85]:
a_number = 42
if a_number > 100:
    print('Greater')
else:
    print('Not greater')
print('Done')
This code can be represented by the following workflow.
The if
statement in Python is a conditional statement; it contains a conditional expression, which evaluates to True
or False
.
In [86]:
a_number > 100
Out[86]:
Thus, when the conditional expression following the if statement evaluates to False, the code skips to the else statement. Note that it is the indented code blocks below the if and else statements that are conditionally executed.
True
and False
are special values in Python and the first letter is always capitalized.
Conditional statements don't have to include an else statement. If there is no else statement, Python simply does nothing when the test is False.
In [87]:
if a_number > 100:
    print(a_number, 'is greater than 100')
Note that there is no output associated with this code.
How can you make this code print "Greater" by changing only one line?
a_number = 42
if a_number > 100:
    print('Greater')
else:
    print('Not greater')
print('Done')
There are two (2) one-line changes you could make. Can you find them both?
As we've seen, conditional expressions evaluate to either True
or False
. These two, special values are called logical or Boolean values. Let's get more familiar with Booleans.
In [88]:
True and True
Out[88]:
In [89]:
True or False
Out[89]:
In [90]:
True or not True
Out[90]:
In [91]:
not True
Out[91]:
Here, and and or are two logical (Boolean) operators, which combine Boolean values. The "greater than" sign we saw earlier is a comparison operator, and there are several more.
What do each of the following evaluate to, True
or False
?
1 < 2
1 <= 1
3 == 3
2 != 3
We can combine multiple comparison operators to create complex tests in our code.
In [92]:
if (1 > 0) and (-1 > 0):
    print('Both parts are true')
else:
    print('At least one part is false')
What happens if we change the and
to an or
?
In [93]:
if (1 > 0) or (-1 > 0):
    print('At least one part is true')
Finally, we can chain conditional expressions together using elif
, which stands for "else if" and follows an initial if
statement.
In [94]:
number = -3
if number > 0:
    print('Positive')
elif number == 0:
    print('Is Zero')
else:
    print('Negative')
Whereas each if
statement has only one else
, it can have as many elif
statements as you need.
How can we put conditional evaluation to work in analyzing our temperature data? Let's say we're interested in temperature anomalies; that is, the year-to-year deviation in temperature from a long-term mean.
Let's also say we want to fit a straight line to the anomalies. We can use another Python library, statsmodels
, to fit an ordinary-least squares (OLS) regression to our temperature anomaly.
In [95]:
import statsmodels.api as sm
data = np.loadtxt('wvu.temperature.csv', delimiter = ',')
# Subtract the location's long-term mean
y_data = data.mean(axis = 1) - data.mean()
# Create an array of numbers 1948, 1949, ..., 2016
x_data = np.arange(1948, 2017)
# Add a constant (the intercept term)
x_data = sm.add_constant(x_data)
# Fit the temperature anomalies to a simple time series
results = sm.OLS(y_data, x_data).fit()
results.summary()
Out[95]:
In [96]:
b0, b1 = results.params
# Calculate a line of best fit
fit_line = b0 + (b1 * x_data[:,1])
fig = pyplot.figure(figsize = (10.0, 3.0))
axis1 = fig.add_subplot(1, 1, 1)
axis1.plot(x_data[:,1], y_data, 'k')
axis1.plot(x_data[:,1], fit_line, 'r')
axis1.set_xlim(1948, 2016)
axis1.set_title('wvu.temperature.csv')
pyplot.show()
What if we wanted to detect the direction of this fit line automatically? It could be tedious to have a human being tally up how many are trending upwards versus downwards... Let's have the computer do it.
Write a for
loop, with an if
statement inside, that calculates a line of best fit for each dataset's temperature anomalies and prints out a message as to whether that trend line is positive or negative.
Hint: What we want to know about each trend line is whether, for:

results = sm.OLS(y_data, x_data).fit()
b0, b1 = results.params

the slope of the line, b1, is positive or negative.
In [97]:
filenames = glob.glob('*.csv')
# I can do these things outside of the loop
# because the X data are the same for each dataset
x_data = np.arange(1948, 2017)
x_data = sm.add_constant(x_data) # Add the intercept term
for fname in filenames:
    data = np.loadtxt(fname, delimiter = ',')
    # Subtract the location's long-term mean
    y_data = data.mean(axis = 1) - data.mean()
    # Fit the temperature anomalies to a simple time series
    results = sm.OLS(y_data, x_data).fit()
    b0, b1 = results.params
    if b1 > 0:
        print(fname, '-- Long-term trend is positive')
    else:
        print(fname, '-- Long-term trend is negative')
At this point, what have we learned?
Use glob.glob() to create a list of files whose names match a given pattern;
Use if statements to test a condition;
Use elif and else statements to test alternative conditions;
Use the operators ==, >=, <=, and, and or in conditional expressions;
X and Y is only true if both X and Y are true;
X or Y is true if either X, Y, or both are true;
Nest an if statement inside a for loop.

At this point, we've done a lot of interesting things with our temperature dataset. But our code is getting kind of long. The interesting parts that we've figured out together could be useful to us in the future or to other people. Is there a way for us to package this code for later re-use?
In Python, we can do just that by creating our own custom functions. We'll start by defining a function that converts temperatures from Kelvin to Celsius.
In [98]:
def kelvin_to_celsius(temp_k):
    return temp_k - 273.15
How do we know this function works? We can devise tests where we know what output should correspond with a given input.
In [99]:
if kelvin_to_celsius(273.15) == 0:
    print('Passed')

if kelvin_to_celsius(373.15) == 100:
    print('Passed')
Let's break down this function definition.
- The def keyword indicates to Python that we are defining a function; the name of the function we want to define comes next.
- As with for loops and if statements, after the colon, :, we indent 4 spaces and then write the body of the function. This is what the function actually does when it is called. Inside the body, any arguments provided are available as variables with the name of the argument as we wrote it in the function definition.

What happens if we remove the keyword return from this function? Make this change and call the function again.
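Try it yourself first. A minimal sketch of what you should observe, using a hypothetical renamed copy of the function so the original is preserved:

```python
def kelvin_to_celsius_no_return(temp_k):
    temp_k - 273.15  # the value is computed, but never returned

# A function without a return statement implicitly returns None
result = kelvin_to_celsius_no_return(273.15)
print(result)  # prints: None
```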
Now that we've created a function that converts temperatures in degrees Kelvin to degrees Celsius, let's see if we can write a function that converts from degrees Celsius to degrees Fahrenheit.
$$ T_F = \left(T_C \times \frac{9}{5}\right) + 32 $$
In [100]:
def celsius_to_fahr(temp_c):
    return (temp_c * (9/5)) + 32
In [101]:
if celsius_to_fahr(0) == 32:
    print('Passed')

if celsius_to_fahr(-40) == -40:
    print('Passed')
Now, what if we want to convert temperatures in degrees Kelvin to degrees Fahrenheit?
In [102]:
def kelvin_to_fahr(temp_k):
    temp_c = kelvin_to_celsius(temp_k)
    return celsius_to_fahr(temp_c)
In [103]:
kelvin_to_fahr(300)
Out[103]:
Another way to do this is to chain multiple function calls.
In [104]:
celsius_to_fahr(kelvin_to_celsius(300))
Out[104]:
In [105]:
print(fname)
kelvin_to_fahr(data.mean(axis = 0))
Out[105]:
Now that we understand functions in Python we can begin to clean up the codebase we've established so far.
For starters, let's create a function that calculates temperature anomalies.
In [106]:
def temperature_anomaly(temp_array):
    return temp_array.mean(axis = 1) - temp_array.mean()

temperature_anomaly(data)
Out[106]:
This works great! Now, suppose I suspect the long-term mean is a little biased; maybe I want a different measure of central tendency to use in calculating anomalies. Say, maybe I want to subtract the median temperature value instead of the mean.
In Python, I can just add another argument to provide this flexibility.
In [107]:
def temperature_anomaly(temp_array, long_term_avg):
    return temp_array.mean(axis = 1) - long_term_avg

temperature_anomaly(data, np.median(data))
Out[107]:
This function is more flexible, but I also have to type more. Suppose that I want a function that allows me to use the median value but will default to the mean value, because the mean value is what most people who will use this function want.
In Python, functions can take default arguments; these are arguments that could be provided when the function is called, but aren't required.
In [108]:
def temperature_anomaly(temp_array, long_term_avg = None):
    if long_term_avg:
        return temp_array.mean(axis = 1) - long_term_avg
    else:
        return temp_array.mean(axis = 1) - temp_array.mean()
pyplot.plot(temperature_anomaly(data))
pyplot.show()
In [109]:
pyplot.plot(temperature_anomaly(data, np.median(data)))
pyplot.show()
None is a special value in Python. It means just that: nothing. It is similar to "null" in other programming languages. Like True and False, the first letter is always capitalized.

Because None has a "False-y" value, we can treat it like the Python False.
In [110]:
not None
Out[110]:
However, we usually want to distinguish values that are False-y from values that are None. For example, a long_term_avg of 0 is False-y, so the test above would silently ignore it. Let's make two changes to our function:

- Explicitly test whether long_term_avg is not None;
- Remove the else statement, which is unnecessary.
In [111]:
def temperature_anomaly(temp_array, long_term_avg = None):
    if long_term_avg is not None:
        return temp_array.mean(axis = 1) - long_term_avg
    return temp_array.mean(axis = 1) - temp_array.mean()
pyplot.plot(temperature_anomaly(data))
pyplot.show()
In a Python function, there are two kinds of arguments: positional arguments and keyword arguments.
Positional arguments are simpler and more common. In this example, Python knows that the value 1
we've provided is the first (and only) argument, so it assigns this value to the first argument in the corresponding function definition.
In [112]:
data.mean(1)
Out[112]:
In [113]:
data.mean?
We can call this same function more explicitly by providing the axis
, 1, as a keyword argument.
In [114]:
data.mean(axis = 1)
Out[114]:
Keyword arguments are provided as key-value pairs.
Positional arguments have to be provided in the right order, otherwise Python doesn't know which value goes with which argument. When there's only one argument, it doesn't matter, but with multiple arguments, they have to come in order.
In [115]:
np.zeros?
In [116]:
np.zeros(10, int)
Out[116]:
In [117]:
np.zeros(int, 10)
Keyword arguments, on the other hand, can be provided in any order.
In [118]:
np.zeros(dtype = int, shape = 10)
Out[118]:
As we saw earlier, comments help clarify to others and our future selves what our intentions are and how our code works. This kind of documentation is critical for creating software that other people want to use and that we will be able to make sense of later.
Functions, in particular, should be documented well using both in-line comments and also docstrings.
In [119]:
def temperature_anomaly(temp_array, long_term_avg = None):
    '''
    Calculates the inter-annual temperature anomalies. Subtracts the
    long-term mean by default but another long-term average can
    be provided as the `long_term_avg` argument.
    '''
    if long_term_avg is not None:
        return temp_array.mean(axis = 1) - long_term_avg
    # If no long-term average is provided, use the overall mean
    return temp_array.mean(axis = 1) - temp_array.mean()
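One payoff of writing a docstring is that Python attaches it to the function itself, so help() (or the ? suffix in Jupyter) can display it later. A minimal sketch, using a hypothetical fahr_to_celsius function:

```python
def fahr_to_celsius(temp_f):
    '''Converts temperatures from degrees Fahrenheit to degrees Celsius.'''
    return (temp_f - 32) * (5/9)

# The docstring is stored on the function and displayed by help()
help(fahr_to_celsius)
```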
Create one (or both, for an extra challenge) of the following functions...

- A function called fences that takes an input character string and surrounds it on both sides with another string, e.g., "pasture" becomes "|pasture|" or "@pasture@" if either "|" or "@" are provided.
- A function called rescale that takes an array and returns a corresponding array of values scaled to lie in the range 0.0 to 1.0.

Hint: Strings can be concatenated with the plus operator.
'cat' + 's'
Hint: If $x_0$ and $x_1$ are the lowest and highest values in an array, respectively, then the replacement value for any element $x$, scaled to between 0.0 and 1.0, should be:
$$ \frac{x - x_0}{x_1 - x_0} $$
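Try the exercise yourself first. For reference, the formula above translates almost directly into NumPy; here is one possible sketch of the rescale function:

```python
import numpy as np

def rescale(input_array):
    '''Rescales an array so its values lie in the range 0.0 to 1.0.'''
    low = input_array.min()   # x_0, the lowest value
    high = input_array.max()  # x_1, the highest value
    return (input_array - low) / (high - low)

print(rescale(np.array([1.0, 2.0, 3.0])).tolist())  # prints: [0.0, 0.5, 1.0]
```

Note that NumPy's element-wise arithmetic lets us apply the formula to the whole array at once, without writing a loop.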