Questions for students to consider:
1) What happens when you get a new dataset that you need to analyze in the same way you analyzed a previous dataset?
2) What processes do you do often? How do you implement these?
3) Do you have a clear workflow you could replicate?
4) Or even better, could you plug this new data set into your old workflow?
Optional lesson
Basic Overview of the suggested workflow using Socrative (Optional)
Please download the cleaned data file:
Let's begin by creating a new Jupyter notebook.
Question:
Learning Objective: Employ best practices for naming variables: don't reuse existing function names, avoid periods in names, and don't start a name with a number.
References:
In [ ]:
# write out three variables, assign a number, string, list
x = 'Asia' # String
y = 1952 # an integer
z = 1.5 # a floating point number
cal_1 = y * z
print(cal_1)
# or
x, y = 'Asia', 'Africa'
w = x
w = x + x # concatenating (combining) strings
print(w)
h = 'Africa'
list_1 = ['Asia', 'Africa', 'Europe'] # list
print(list_1)
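The naming rules from the learning objective can be sketched in a quick cell (all names here are made up for illustration; the bad examples are left as comments so the cell still runs):

```python
# Good: descriptive names that follow the rules
max_temperature = 31.5   # does not shadow any built-in function
sample_count = 10        # underscores instead of periods

# Bad (shown as comments so this cell still runs):
# list = [1, 2, 3]    # shadows the built-in list() function
# sample.count = 10   # a period means attribute access, not part of a name
# 1st_sample = 5      # SyntaxError: names cannot start with a number

print(max_temperature, sample_count)
```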
Questions for students:
Python lists are indexed from 0 to the length of the list minus 1.
Example:
list_1 = ['Asia', 'Africa', 'Europe']
Asia index = 0,
Africa index = 1,
Europe index = 2
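To try this in a cell (using a throwaway copy of the list):

```python
countries_example = ['Asia', 'Africa', 'Europe']
print(countries_example[0])    # first element: 'Asia'
print(countries_example[2])    # last valid index is len(countries_example) - 1, i.e. 2
print(countries_example[-1])   # negative indices count from the end: 'Europe'
```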
list_1 is not a very descriptive or identifiable variable name.
What would be a better name for the variable that holds these values?
In [ ]:
countries = ['Asia', 'Africa', 'Europe']
Global variables are available in the environment your script is working in. Every variable we have made at this point is a global variable.
Local variables will be useful to understand when we start using functions to automate our code. Local variables exist only in the function's environment, not in the global environment where your linear workflow code runs.
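A minimal sketch of the difference (the function greet here is invented for illustration):

```python
message = 'global'  # defined at the top level: a global variable

def greet():
    message = 'local'  # a local variable: exists only while greet() runs
    return message

print(greet())    # the local value is returned from inside the function
print(message)    # the global variable is untouched
```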
See what variables exist in the Jupyter notebook:
%who
In [ ]:
%who
To get started, we will import the Python modules that we will use in the session. These modules are developed by programmers and made available as open-source packages for Python. We would normally have to install each of these ourselves, but they are included as part of the Anaconda Python Distribution.
The %matplotlib inline statement is part of the Jupyter and IPython magic system; it enables plots generated by the matplotlib package to be displayed as output in the Jupyter Notebook instead of opening in a separate window.
In [ ]:
import numpy as np
import pandas as pd
import pylab as plt
import matplotlib
%matplotlib inline
We will continue where the data exploration module left off by importing the cleaned gapminder dataset and assigning it to a new variable named df to denote that we have imported a pandas dataframe.
As validation that we have imported the data, we will also look at the top five rows using the head method of pandas.
In [ ]:
cleaned_data_location = 'gapminder_cleaned.csv'
In [ ]:
df = pd.read_csv(cleaned_data_location)
df.head()
In [ ]:
df['year'].unique()
Learning Objectives
As you write software, there comes a time when you want to repeat an analysis step you have already performed. Our natural tendency is to copy the code we wrote and paste it into the new location for reuse. Sounds easy, right? Copy, paste, move on... not so fast.
What happens if there is a problem with the code, or you decide to tweak it, just a little, to change a format or enhance it?
You will have to change the code in every place you have copied it. How do you know if you got all of the copies? What happens if one of the copies is not changed?
These examples illustrate the principle of "Don't Repeat Yourself" (DRY). We are going to look at how to refactor our code and pull pieces out by making them functions. Then we will call the function every time we want to use that code.
In [ ]:
# Define which continent / category we will use
category = 'lifeexp'
continent = 'asia'
In [ ]:
# Create a mask that selects the continent of choice
mask_continent = df['continent'] == continent
df_continent = df[mask_continent]
In [ ]:
# Loop through years and calculate the statistic of interest
years = df_continent['year'].unique()
summary = []
for year in years:
    mask_year = df_continent['year'] == year
    df_year = df_continent[mask_year]
    value = np.mean(df_year[category])
    summary.append((continent, year, value))
# Turn the summary into a dataframe so that we can visualize easily
summary = pd.DataFrame(summary, columns=['continent', 'year', category])
In [ ]:
summary.plot.line('year', 'lifeexp')
Learning Objectives
In [ ]:
def calculate_statistic_over_time(data, category, continent, func=None):
    if func is None:
        func = np.mean
    # Create a mask that selects the continent of choice
    mask_continent = data['continent'] == continent
    data_continent = data[mask_continent]
    # Loop through years and calculate the statistic of interest
    years = data_continent['year'].unique()
    summary = []
    for year in years:
        mask_year = data_continent['year'] == year
        data_year = data_continent[mask_year]
        value = func(data_year[category])
        summary.append((continent, year, value))
    # Turn the summary into a dataframe so that we can visualize easily
    summary = pd.DataFrame(summary, columns=['continent', 'year', category])
    return summary
In [ ]:
category = 'lifeexp'
continents = df['continent'].unique()
fig, ax = plt.subplots()
for continent in continents:
    output = calculate_statistic_over_time(df, category, continent)
    output.plot.line('year', category, ax=ax)
In [ ]:
category = 'lifeexp'
mean_values = df.groupby('continent').mean()[category]
mean_values = mean_values.sort_values(ascending=False)
continents = mean_values.index.values
n_continents = len(continents)
cmap = plt.cm.coolwarm_r
fig, ax = plt.subplots()
for ii, continent in enumerate(continents):
    this_color = cmap(float(ii / n_continents))
    output = calculate_statistic_over_time(df, category, continent)
    output.plot.line('year', category, ax=ax, label=continent,
                     color=this_color)
plt.legend(loc=(1.02, 0))
ax.set(ylabel=category, xlabel='Year',
       title='{} over time'.format(category))
plt.setp(ax.lines, lw=4, alpha=.4)
In [ ]:
def plot_statistic_over_time(data, category, func=None,
                             cmap=None, ax=None, legend=True,
                             sort=True):
    if ax is None:
        fig, ax = plt.subplots()
    if cmap is None:
        cmap = plt.cm.viridis
    if sort is True:
        # Sort the continents by the category of choice
        mean_values = data.groupby('continent').mean()[category]
        mean_values = mean_values.sort_values(ascending=False)
        continents = mean_values.index.values
    else:
        continents = np.unique(data['continent'])
    n_continents = len(continents)
    # Loop through continents, calculate its stat, and add a line
    for ii, continent in enumerate(continents):
        this_color = cmap(float(ii / n_continents))
        output = calculate_statistic_over_time(data, category, continent,
                                               func=func)
        output.plot.line('year', category, ax=ax, label=continent,
                         color=this_color)
    if legend is True:
        plt.legend(loc=(1.02, 0))
    else:
        ax.get_legend().set(visible=False)
    ax.set(ylabel=category, xlabel='Year',
           title='{} over time'.format(category))
    plt.setp(ax.lines, lw=4, alpha=.4)
    return ax
In [ ]:
plot_statistic_over_time(df, category, cmap=plt.cm.coolwarm)
In [ ]:
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
categories = ['pop', 'gdppercap']
for ax, i_category in zip(axs, categories):
    plot_statistic_over_time(df, i_category,
                             ax=ax, sort=False)
plt.setp(axs[0].get_legend(), visible=False)
plt.tight_layout()
In [ ]:
fig, axs = plt.subplots(1, 2, figsize=(10, 5), sharey=True)
cmaps = [plt.cm.viridis, plt.cm.coolwarm]
for ax, cmap in zip(axs, cmaps):
    plot_statistic_over_time(df, category,
                             cmap=cmap, ax=ax, legend=False)
In [ ]:
ax = df.groupby(['continent', 'year']).mean()['lifeexp']\
    .unstack('continent').plot(cmap=plt.cm.viridis, alpha=.4, lw=3)
Learning Objective
Now we are going to cut and paste the functions that we created above. We are going to save them to a single file, so we will paste all of the code that we want in that file below. Additionally, we will use the Jupyter / IPython %%writefile cell magic to save the contents of the cell below to a file. The file name follows the %%writefile command below.
We are assuming that the notebook is in the project code directory and notebook subdirectory.
In [ ]:
%%writefile stats_and_plot.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def calculate_statistic_over_time(data, category, continent, func=None):
    if func is None:
        func = np.mean
    # Create a mask that selects the continent of choice
    mask_continent = data['continent'] == continent
    data_continent = data[mask_continent]
    # Loop through years and calculate the statistic of interest
    years = data_continent['year'].unique()
    summary = []
    for year in years:
        mask_year = data_continent['year'] == year
        data_year = data_continent[mask_year]
        value = func(data_year[category])
        summary.append((continent, year, value))
    # Turn the summary into a dataframe so that we can visualize easily
    summary = pd.DataFrame(summary, columns=['continent', 'year', category])
    return summary


def plot_statistic_over_time(data, category, func=None, cmap=None, ax=None,
                             legend=True, sort=True):
    if ax is None:
        fig, ax = plt.subplots()
    if cmap is None:
        cmap = plt.cm.viridis
    if sort is True:
        # Sort the continents by the category of choice
        mean_values = data.groupby('continent').mean()[category]
        mean_values = mean_values.sort_values(ascending=False)
        continents = mean_values.index.values
    else:
        continents = np.unique(data['continent'])
    n_continents = len(continents)
    # Loop through continents, calculate its stat, and add a line
    for ii, continent in enumerate(continents):
        this_color = cmap(float(ii / n_continents))
        output = calculate_statistic_over_time(data, category, continent,
                                               func=func)
        output.plot.line('year', category, ax=ax, label=continent,
                         color=this_color)
    if legend is True:
        plt.legend(loc=(1.02, 0))
    else:
        ax.get_legend().set(visible=False)
    ax.set(ylabel=category, xlabel='Year',
           title='{} over time'.format(category))
    plt.setp(ax.lines, lw=4, alpha=.4)
    return ax
If you would like to see the contents of your file you can type:
%load stats_and_plot.py
or try
%pycat stats_and_plot.py
In [ ]:
Some information on imports:
In order to use external code within Python, two things have to happen: 1) the code has to exist on your local computer, and 2) we have to import the code to use it in our program.
The first requirement is satisfied when we install the software on our computer using conda or pip, or, as we saw a minute ago, when we write custom functions ourselves. Second, we need to tell Python how to access and refer to the packages or source code we want to use.
Import Guidelines
https://www.python.org/dev/peps/pep-0008/#imports
Imports should be grouped in the following order:
1. Standard library imports
2. Related third-party imports
3. Local application / library-specific imports
You should put a blank line between each group of imports.
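As a sketch, the imports used in this notebook could be grouped per PEP 8 like this (the local import is left commented out in case stats_and_plot.py is not on your path yet):

```python
# Standard library imports
import os
import sys

# Related third-party imports
import numpy as np
import pandas as pd

# Local application imports (the module we wrote in this lesson)
# from stats_and_plot import calculate_statistic_over_time
```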
In [ ]:
from stats_and_plot import calculate_statistic_over_time, plot_statistic_over_time
# Saving for publication
cmaps = [plt.cm.magma, plt.cm.rainbow]
for ii, cmap in enumerate(cmaps):
    fig, ax = plt.subplots(figsize=(10, 10))
    plot_statistic_over_time(df, category,
                             cmap=cmap, ax=ax, legend=False)
    labels = [ax.get_xticklabels(), ax.get_yticklabels(),
              ax.yaxis.label, ax.xaxis.label, ax.title]
    _ = plt.setp(labels, fontsize=30)
    # ax.set_axis_off()
    fig.savefig('fig_{}.png'.format(ii), transparent=True,
                bbox_inches='tight', dpi=300)
Let's get some help on our new code:
In [ ]:
calculate_statistic_over_time?
In [ ]:
help(calculate_statistic_over_time)
Get details on docstrings here.
In [ ]:
"""Form a complex number.
Keyword arguments:
real -- the real part (default 0.0)
imag -- the imaginary part (default 0.0)
"""
In [ ]:
%%writefile stats_and_plot.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def calculate_statistic_over_time(data, category, continent, func=None):
    """Calculate a statistic per year for one continent (default: numpy's mean).

    Keyword arguments:
    data -- the dataframe data source
    category -- the category to be summarized
    continent -- the continent to be examined
    func -- the function to be applied to the data (default numpy.mean)
    """
    if func is None:
        func = np.mean
    # Create a mask that selects the continent of choice
    mask_continent = data['continent'] == continent
    data_continent = data[mask_continent]
    # Loop through years and calculate the statistic of interest
    years = data_continent['year'].unique()
    summary = []
    for year in years:
        mask_year = data_continent['year'] == year
        data_year = data_continent[mask_year]
        value = func(data_year[category])
        summary.append((continent, year, value))
    # Turn the summary into a dataframe so that we can visualize easily
    summary = pd.DataFrame(summary, columns=['continent', 'year', category])
    return summary


def plot_statistic_over_time(data, category, func=None, cmap=None, ax=None,
                             legend=True, sort=True):
    """Plot a statistic over time, with one line per continent.

    Keyword arguments:
    data -- the dataframe data source
    category -- the category to be summarized
    func -- the function applied to each year's data (default numpy.mean)
    cmap -- the matplotlib colormap for the lines (default viridis)
    ax -- the matplotlib axis to draw on (default: create a new figure)
    legend -- whether to draw a legend (default True)
    sort -- whether to sort continents by their mean value (default True)
    """
    if ax is None:
        fig, ax = plt.subplots()
    if cmap is None:
        cmap = plt.cm.viridis
    if sort is True:
        # Sort the continents by the category of choice
        mean_values = data.groupby('continent').mean()[category]
        mean_values = mean_values.sort_values(ascending=False)
        continents = mean_values.index.values
    else:
        continents = np.unique(data['continent'])
    n_continents = len(continents)
    # Loop through continents, calculate its stat, and add a line
    for ii, continent in enumerate(continents):
        this_color = cmap(float(ii / n_continents))
        output = calculate_statistic_over_time(data, category, continent,
                                               func=func)
        output.plot.line('year', category, ax=ax, label=continent,
                         color=this_color)
    if legend is True:
        plt.legend(loc=(1.02, 0))
    else:
        ax.get_legend().set(visible=False)
    ax.set(ylabel=category, xlabel='Year',
           title='{} over time'.format(category))
    plt.setp(ax.lines, lw=4, alpha=.4)
    return ax
Optional lesson
In [ ]:
import pytest
In [ ]:
# Check to see if pytest is installed:
pytest.__version__
In [ ]:
# content of test_sample.py
def func(x):
    return x + 1


def test_answer():
    assert func(3) == 5
In [ ]:
!pytest
In [ ]:
# `assert` is a Python keyword, so `pytest.assert?` is invalid; inspect a pytest helper instead
pytest.raises?
A nice, very brief intro to the syntax of the Python 3 programming language can be viewed here: