Questions for students to consider:
1) What happens when you get a new dataset that you need to analyze in the same way you analyzed a previous dataset?
2) What processes do you do often? How do you implement these?
3) Do you have a clear workflow you could replicate?
4) Or even better, could you plug this new data set into your old workflow?
Optional lesson
Basic Overview of the suggested workflow using Socrative (Optional)
Please download the cleaned data file:
Let's begin by creating a new Jupyter notebook.
Question:
Learning Objective: Employ best practices for naming variables: don't reuse existing function names, avoid periods in names, and don't start a name with a number.
References:
In [ ]:
# write out three variables, assign a number, string, list
x = 'Asia' # String
y = 1952 # an integer
z = 1.5 # a floating point number
cal_1 = y * z
print(cal_1)
# or
x, y = 'Asia', 'Africa'
w = x
w = x + x # concatenating (combining) strings
print(w)
h = 'Africa'
list_1 = ['Asia', 'Africa', 'Europe'] # list
print(list_1)
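The naming rules from the learning objective can be sketched in a quick cell (all names here are made up for illustration; the bad examples are left as comments so the cell still runs):

```python
# Good: descriptive names that follow the rules
max_temperature = 31.5   # does not shadow any built-in function
sample_count = 10        # underscores instead of periods

# Bad (shown as comments so this cell still runs):
# list = [1, 2, 3]    # shadows the built-in list() function
# sample.count = 10   # a period means attribute access, not part of a name
# 1st_sample = 5      # SyntaxError: names cannot start with a number

print(max_temperature, sample_count)
```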
Questions for students:
Python lists are indexed from 0 to the length of the list minus 1.
Example:
list_1 = ['Asia', 'Africa', 'Europe']
Asia index = 0,
Africa index = 1,
Europe index = 2
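To try this in a cell (using a throwaway copy of the list):

```python
countries_example = ['Asia', 'Africa', 'Europe']
print(countries_example[0])    # first element: 'Asia'
print(countries_example[2])    # last valid index is len(countries_example) - 1, i.e. 2
print(countries_example[-1])   # negative indices count from the end: 'Europe'
```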
list_1 is not a very descriptive or identifiable variable name.
What would be a better name for the variable that holds these values?
In [ ]:
countries = ['Asia', 'Africa', 'Europe']
Global variables are available in the environment your script is working in. Every variable we have made at this point is a global variable.
Local variables will be useful to understand when we start using functions to automate our code. Local variables exist only in the function's environment, not in the global environment where your linear workflow code runs.
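A minimal sketch of the difference (the function greet here is invented for illustration):

```python
message = 'global'  # defined at the top level: a global variable

def greet():
    message = 'local'  # a local variable: exists only while greet() runs
    return message

print(greet())    # the local value is returned from inside the function
print(message)    # the global variable is untouched
```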
See what variables exist in the Jupyter notebook:
%who
In [ ]:
%who
To get started, we will import the Python modules that we will use in the session. These modules are developed by programmers and made available as open-source packages for Python. We would normally have to install each of these ourselves, but they are included as part of the Anaconda Python Distribution.
The %matplotlib inline statement is part of the Jupyter and IPython magic system; it enables plots generated by the matplotlib package to be displayed as output in the Jupyter Notebook instead of opening in a separate window.
In [ ]:
import numpy as np
import pandas as pd
import pylab as plt
import matplotlib
%matplotlib inline
We will continue where the data exploration module left off by importing the cleaned gapminder dataset and assigning it to a new variable named df to denote that we have imported a pandas dataframe.
As validation that we have imported the data, we will also look at the top five rows using the head method of pandas.
In [ ]:
cleaned_data_location = 'gapminder_cleaned.csv'
In [ ]:
df = pd.read_csv(cleaned_data_location)
df.head()
In [ ]:
df['year'].unique()
Learning Objectives
As you write software, there comes a time when you want to repeat an analysis step you have already performed. Our natural tendency is to copy the code we wrote and paste it into the new location for reuse. Sounds easy, right? Copy, paste, move on... not so fast.
What happens if there is a problem with the code, or you decide to tweak it, just a little, to change a format or enhance it?
You will have to change the code in every place you have copied it. How do you know if you got all of the copies? What happens if one of the copies is not changed?
These examples illustrate the principle of "Don't Repeat Yourself" (DRY). We are going to look at how to refactor our code and pull pieces out by making them functions. Then we will call the function every time we want to use that code.
In [ ]:
# Define which continent / category we will use
category = 'lifeexp'
continent = 'asia'
In [ ]:
# Create a mask that selects the continent of choice
mask_continent = df['continent'] == continent
df_continent = df[mask_continent]
In [ ]:
# Loop through years and calculate the statistic of interest
years = df_continent['year'].unique()
summary = []
for year in years:
    mask_year = df_continent['year'] == year
    df_year = df_continent[mask_year]
    value = np.mean(df_year[category])
    summary.append((continent, year, value))
# Turn the summary into a dataframe so that we can visualize easily
summary = pd.DataFrame(summary, columns=['continent', 'year', category])
In [ ]:
summary.plot.line('year', 'lifeexp')
Learning Objectives
In [ ]:
def calculate_statistic_over_time(data, category, continent, func=None):
    if func is None:
        func = np.mean
    # Create a mask that selects the continent of choice
    mask_continent = data['continent'] == continent
    data_continent = data[mask_continent]
    # Loop through years and calculate the statistic of interest
    years = data_continent['year'].unique()
    summary = []
    for year in years:
        mask_year = data_continent['year'] == year
        data_year = data_continent[mask_year]
        value = func(data_year[category])
        summary.append((continent, year, value))
    # Turn the summary into a dataframe so that we can visualize easily
    summary = pd.DataFrame(summary, columns=['continent', 'year', category])
    return summary
In [ ]:
category = 'lifeexp'
continents = df['continent'].unique()
fig, ax = plt.subplots()
for continent in continents:
    output = calculate_statistic_over_time(df, category, continent)
    output.plot.line('year', category, ax=ax)
In [ ]:
category = 'lifeexp'
mean_values = df.groupby('continent').mean()[category]
mean_values = mean_values.sort_values(ascending=False)
continents = mean_values.index.values
n_continents = len(continents)
cmap = plt.cm.coolwarm_r
fig, ax = plt.subplots()
for ii, continent in enumerate(continents):
    this_color = cmap(float(ii / n_continents))
    output = calculate_statistic_over_time(df, category, continent)
    output.plot.line('year', category, ax=ax, label=continent,
                     color=this_color)
plt.legend(loc=(1.02, 0))
ax.set(ylabel=category, xlabel='Year',
       title='{} over time'.format(category))
plt.setp(ax.lines, lw=4, alpha=.4)
In [ ]:
def plot_statistic_over_time(data, category, func=None,
                             cmap=None, ax=None, legend=True,
                             sort=True):
    if ax is None:
        fig, ax = plt.subplots()
    if cmap is None:
        cmap = plt.cm.viridis
    if sort is True:
        # Sort the continents by the category of choice
        mean_values = data.groupby('continent').mean()[category]
        mean_values = mean_values.sort_values(ascending=False)
        continents = mean_values.index.values
    else:
        continents = np.unique(data['continent'])
    n_continents = len(continents)
    # Loop through continents, calculate its stat, and add a line
    for ii, continent in enumerate(continents):
        this_color = cmap(float(ii / n_continents))
        output = calculate_statistic_over_time(data, category, continent,
                                               func=func)
        output.plot.line('year', category, ax=ax, label=continent,
                         color=this_color)
    if legend is True:
        plt.legend(loc=(1.02, 0))
    else:
        ax.get_legend().set(visible=False)
    ax.set(ylabel=category, xlabel='Year',
           title='{} over time'.format(category))
    plt.setp(ax.lines, lw=4, alpha=.4)
    return ax
In [ ]:
plot_statistic_over_time(df, category, cmap=plt.cm.coolwarm)
In [ ]:
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
categories = ['pop', 'gdppercap']
for ax, i_category in zip(axs, categories):
    plot_statistic_over_time(df, i_category,
                             ax=ax, sort=False)
plt.setp(axs[0].get_legend(), visible=False)
plt.tight_layout()
In [ ]:
fig, axs = plt.subplots(1, 2, figsize=(10, 5), sharey=True)
cmaps = [plt.cm.viridis, plt.cm.coolwarm]
for ax, cmap in zip(axs, cmaps):
    plot_statistic_over_time(df, category,
                             cmap=cmap, ax=ax, legend=False)
In [ ]:
ax = df.groupby(['continent', 'year']).mean()['lifeexp']\
    .unstack('continent').plot(cmap=plt.cm.viridis, alpha=.4, lw=3)
Learning Objective
Now we are going to cut and paste the functions that we created above. We are going to save them to a single file, so we will paste all of the code that we want in that file below. Additionally, we will use the Jupyter / IPython %%writefile cell magic to save the contents of the cell below to a file. The file name follows the %%writefile command below.
We are assuming that the notebook is in the project code directory and notebook subdirectory.
In [ ]:
%%writefile stats_and_plot.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def calculate_statistic_over_time(data, category, continent, func=None):
    if func is None:
        func = np.mean
    # Create a mask that selects the continent of choice
    mask_continent = data['continent'] == continent
    data_continent = data[mask_continent]
    # Loop through years and calculate the statistic of interest
    years = data_continent['year'].unique()
    summary = []
    for year in years:
        mask_year = data_continent['year'] == year
        data_year = data_continent[mask_year]
        value = func(data_year[category])
        summary.append((continent, year, value))
    # Turn the summary into a dataframe so that we can visualize easily
    summary = pd.DataFrame(summary, columns=['continent', 'year', category])
    return summary


def plot_statistic_over_time(data, category, func=None, cmap=None, ax=None,
                             legend=True, sort=True):
    if ax is None:
        fig, ax = plt.subplots()
    if cmap is None:
        cmap = plt.cm.viridis
    if sort is True:
        # Sort the continents by the category of choice
        mean_values = data.groupby('continent').mean()[category]
        mean_values = mean_values.sort_values(ascending=False)
        continents = mean_values.index.values
    else:
        continents = np.unique(data['continent'])
    n_continents = len(continents)
    # Loop through continents, calculate its stat, and add a line
    for ii, continent in enumerate(continents):
        this_color = cmap(float(ii / n_continents))
        output = calculate_statistic_over_time(data, category, continent,
                                               func=func)
        output.plot.line('year', category, ax=ax, label=continent,
                         color=this_color)
    if legend is True:
        plt.legend(loc=(1.02, 0))
    else:
        ax.get_legend().set(visible=False)
    ax.set(ylabel=category, xlabel='Year',
           title='{} over time'.format(category))
    plt.setp(ax.lines, lw=4, alpha=.4)
    return ax
If you would like to see the contents of your file you can type:
%load stats_and_plot.py
or try
%pycat stats_and_plot.py
In [ ]:
Some information on imports:
In order to use external code within Python, two things have to happen: 1) the code has to exist on your local computer, and 2) we have to import the code to use it in our program.
The first requirement is satisfied when we install the software on our computer using conda or pip, or, as we saw a minute ago, when we write custom functions ourselves. Second, we need to tell Python how to access and refer to the packages or source code we want to use.
Import Guidelines
https://www.python.org/dev/peps/pep-0008/#imports
Imports should be grouped in the following order:
1. Standard library imports
2. Related third-party imports
3. Local application / library-specific imports
You should put a blank line between each group of imports.
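As a sketch, the imports used in this notebook could be grouped per PEP 8 like this (the local import is left commented out in case stats_and_plot.py is not on your path yet):

```python
# Standard library imports
import os
import sys

# Related third-party imports
import numpy as np
import pandas as pd

# Local application imports (the module we wrote in this lesson)
# from stats_and_plot import calculate_statistic_over_time
```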
In [ ]:
from stats_and_plot import calculate_statistic_over_time, plot_statistic_over_time
# Saving for publication
cmaps = [plt.cm.magma, plt.cm.rainbow]
for ii, cmap in enumerate(cmaps):
    fig, ax = plt.subplots(figsize=(10, 10))
    plot_statistic_over_time(df, category,
                             cmap=cmap, ax=ax, legend=False)
    labels = [ax.get_xticklabels(), ax.get_yticklabels(),
              ax.yaxis.label, ax.xaxis.label, ax.title]
    _ = plt.setp(labels, fontsize=30)
    # ax.set_axis_off()
    fig.savefig('fig_{}.png'.format(ii), transparent=True,
                bbox_inches='tight', dpi=300)
Let's get some help on our new code:
In [ ]:
calculate_statistic_over_time?
In [ ]:
help(calculate_statistic_over_time)
Get details on docstrings here.
In [ ]:
"""Form a complex number.
Keyword arguments:
real -- the real part (default 0.0)
imag -- the imaginary part (default 0.0)
"""
In [ ]:
%%writefile stats_and_plot.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def calculate_statistic_over_time(data, category, continent, func=None):
    """Calculate a statistic per year for one continent (default: numpy's mean).

    Keyword arguments:
    data -- the dataframe data source
    category -- the category to be summarized
    continent -- the continent to be examined
    func -- the function to be applied to the data (default numpy.mean)
    """
    if func is None:
        func = np.mean
    # Create a mask that selects the continent of choice
    mask_continent = data['continent'] == continent
    data_continent = data[mask_continent]
    # Loop through years and calculate the statistic of interest
    years = data_continent['year'].unique()
    summary = []
    for year in years:
        mask_year = data_continent['year'] == year
        data_year = data_continent[mask_year]
        value = func(data_year[category])
        summary.append((continent, year, value))
    # Turn the summary into a dataframe so that we can visualize easily
    summary = pd.DataFrame(summary, columns=['continent', 'year', category])
    return summary


def plot_statistic_over_time(data, category, func=None, cmap=None, ax=None,
                             legend=True, sort=True):
    """Plot a statistic over time, with one line per continent.

    Keyword arguments:
    data -- the dataframe data source
    category -- the category to be summarized
    func -- the function applied to each year's data (default numpy.mean)
    cmap -- the matplotlib colormap for the lines (default viridis)
    ax -- the matplotlib axis to draw on (default: create a new figure)
    legend -- whether to draw a legend (default True)
    sort -- whether to sort continents by their mean value (default True)
    """
    if ax is None:
        fig, ax = plt.subplots()
    if cmap is None:
        cmap = plt.cm.viridis
    if sort is True:
        # Sort the continents by the category of choice
        mean_values = data.groupby('continent').mean()[category]
        mean_values = mean_values.sort_values(ascending=False)
        continents = mean_values.index.values
    else:
        continents = np.unique(data['continent'])
    n_continents = len(continents)
    # Loop through continents, calculate its stat, and add a line
    for ii, continent in enumerate(continents):
        this_color = cmap(float(ii / n_continents))
        output = calculate_statistic_over_time(data, category, continent,
                                               func=func)
        output.plot.line('year', category, ax=ax, label=continent,
                         color=this_color)
    if legend is True:
        plt.legend(loc=(1.02, 0))
    else:
        ax.get_legend().set(visible=False)
    ax.set(ylabel=category, xlabel='Year',
           title='{} over time'.format(category))
    plt.setp(ax.lines, lw=4, alpha=.4)
    return ax
Optional lesson
In [ ]:
import pytest
In [ ]:
# Check to see if pytest is installed:
pytest.__version__
In [ ]:
# content of test_sample.py
def func(x):
    return x + 1


def test_answer():
    assert func(3) == 5
In [ ]:
!pytest
In [ ]:
# `assert` is a Python keyword, so `pytest.assert?` is invalid; inspect a pytest helper instead
pytest.raises?
A nice, very brief intro to the syntax of the Python 3 programming language can be viewed here: