Analysis Workflow

This lecture takes some of the ideas we developed last week and organizes them into a workflow. We'll do this by example, analyzing the Pronto data. Topics include:

  • Retrieving data
  • Encapsulating repeated code into functions
  • Creating Python modules
  • Organizing the analysis into multiple directories

These topics provide a rough outline for an analysis workflow.

Retrieving Data


In [1]:
# Packages
from urllib import request
import os
import pandas as pd

In [2]:
# Constants used in analysis
TRIP_DATA = "https://data.seattle.gov/api/views/tw7j-dfaw/rows.csv?accessType=DOWNLOAD"
TRIP_FILE = "pronto_trips.csv"

WEATHER_DATA = "http://uwseds.github.io/data/pronto_weather.csv"
WEATHER_FILE = "pronto_weather.csv"

In [ ]:
# Download the trip data from its URL (left commented out here; the file is about 43M)
#request.urlretrieve(TRIP_DATA, TRIP_FILE)

In [3]:
!ls -lh


total 44M
-rw-rw-r-- 1 ubuntu ubuntu 4.7K Oct  9 14:53 analysis_workflow.ipynb
-rw-rw-r-- 1 ubuntu ubuntu 556K Oct  4 14:15 Project-overview.pptx
-rw-rw-r-- 1 ubuntu ubuntu  43M Oct  4 14:43 pronto_trips.csv

Two challenges

  1. The file is big. We don't want to download it if it's already present.
  2. We're going to repeatedly download files. We don't want to just copy and paste the same code.

Encapsulating Repeated Code In Functions

A function is a named block of code that can be invoked by many callers. A function may take arguments that the caller specifies, and it returns values that it creates.


In [7]:
# Example function
def xyz(value):            # The function's name is "xyz". It has one argument, "value".
    return int(value) + 1  # The function returns one value: the argument converted to int, plus 1.

print(xyz("3"))
#a = xyz(3)
#print(xyz(a))


4

In [8]:
def addTwo(input1, input2):
    return input1 + input2
# Call the function with two arguments
addTwo(1, 2)


Out[8]:
3

Colin will provide more details about functions, such as variable scope and multiple return values.
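
As a brief preview of those two ideas, here is a minimal sketch. The function name min_and_max and its variables are made up for illustration and are not part of the Pronto analysis.

```python
def min_and_max(values):
    # smallest and largest are local: they exist only while the function runs.
    smallest = min(values)
    largest = max(values)
    # Returning two values separated by a comma returns them as one tuple.
    return smallest, largest


low, high = min_and_max([3, 1, 4])  # unpack the returned tuple into two names
print(low, high)

try:
    print(smallest)  # the local name is not visible outside the function
except NameError:
    print("smallest is not defined outside min_and_max")
```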


In [10]:
# Function to download from a URL
def download(url, filename):
    print("Downloading", filename)
    #request.urlretrieve(url, filename)

In [11]:
download(TRIP_DATA, TRIP_FILE)


Downloading pronto_trips.csv

In [13]:
# Enhance the function to skip the download when the file is already present
import os.path
def download(url, filename):
    if os.path.isfile(filename):
        print("Already present %s." % filename)
    else:
        print("Downloading %s" % filename)
        #request.urlretrieve(url, filename)
        
download(TRIP_DATA, "none.csv")


Downloading none.csv

In [14]:
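# Use the version of the function that lives in the download module.
# Note: this import rebinds the name "download" from the function above to the module.
import download
download.download_file(TRIP_DATA, "none.csv")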


Downloading none.csv

Creating A Python Module

We'll leave the Jupyter notebook and start using a file editor.
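
A minimal sketch of what such a module file might contain, assuming it is saved as download.py and exposes download_file, the name the notebook calls above. The boolean return value is an addition made here for illustration; the notebook version of the function only prints.

```python
# download.py (assumed file name): a sketch, not necessarily the file used in class
import os.path
from urllib import request


def download_file(url, filename):
    """Download url to filename unless the file is already present.

    Returns True if a download happened and False if the file already
    existed (the return value is an assumption added for this sketch).
    """
    if os.path.isfile(filename):
        print("Already present %s." % filename)
        return False
    print("Downloading %s" % filename)
    request.urlretrieve(url, filename)
    return True
```

Because the existence check runs first, calling download_file a second time with the same filename does not fetch the data again.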

Using the Python Module

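We have moved the download function to an external file (imported above as the module download). Now we want to use that file: import the module, then call the function through the module name.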

Accessing a Python Module in Another Directory
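
One common way to do this is to add the module's directory to sys.path before importing. The sketch below is self-contained: it creates a temporary directory and a small stand-in module, download_demo, both of which are hypothetical names used only so the example runs anywhere. In a real analysis, module_dir would be the path to the directory that holds your module.

```python
import os
import sys
import tempfile

# Hypothetical layout: the module lives in a different directory from the notebook.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "download_demo.py"), "w") as f:
    f.write(
        "def download_file(url, filename):\n"
        "    return 'would download ' + filename\n"
    )

# Python searches the directories in sys.path when it resolves an import.
if module_dir not in sys.path:
    sys.path.insert(0, module_dir)

import download_demo  # found because module_dir is now on sys.path

print(download_demo.download_file("http://example.com/data.csv", "data.csv"))
```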

Advanced Pandas

  • Pivot tables
  • Plotting
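
A minimal sketch of a pivot table, using a tiny made-up table rather than the real trip file. The column names day, usertype, and duration_min are assumptions for illustration and may not match the columns in pronto_trips.csv.

```python
import pandas as pd

# Made-up trips: one row per trip.
trips = pd.DataFrame({
    "day": ["Mon", "Mon", "Tue", "Tue", "Tue"],
    "usertype": ["Member", "Short-Term", "Member", "Member", "Short-Term"],
    "duration_min": [10.0, 30.0, 12.0, 8.0, 45.0],
})

# Rows are days, columns are user types, cells are the mean trip duration.
by_day = trips.pivot_table(
    index="day", columns="usertype", values="duration_min", aggfunc="mean"
)
print(by_day)

# With matplotlib installed, by_day.plot() would draw one line per user type.
```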