Analysis Workflow

This lecture takes some of the ideas we developed last week and organizes them into a workflow. We'll do this by example, analyzing the Pronto data. Topics include:

Retrieving data
Encapsulating repeated code into functions
Creating Python modules
Organizing the analysis into multiple directories

These topics provide a rough outline for an analysis workflow.

Retrieving Data



In [1]:

    
# Packages
from urllib import request
import os
import pandas as pd



In [2]:

    
# Constants used in analysis
TRIP_DATA = "https://data.seattle.gov/api/views/tw7j-dfaw/rows.csv?accessType=DOWNLOAD"
TRIP_FILE = "pronto_trips.csv"

WEATHER_DATA = "http://uwseds.github.io/data/pronto_weather.csv"
WEATHER_FILE = "pronto_weather.csv"



In [ ]:

    
# Get the URL data
#request.urlretrieve(TRIP_DATA, TRIP_FILE)



In [3]:

    
!ls -lh

total 44M
-rw-rw-r-- 1 ubuntu ubuntu 4.7K Oct  9 14:53 analysis_workflow.ipynb
-rw-rw-r-- 1 ubuntu ubuntu 556K Oct  4 14:15 Project-overview.pptx
-rw-rw-r-- 1 ubuntu ubuntu  43M Oct  4 14:43 pronto_trips.csv

Two challenges

The file is big. We don't want to download it if it's already present.
We will download files repeatedly, so we don't want to copy and paste the same code each time.

Encapsulating Repeated Code In Functions

A function is a named block of code that can be invoked by many callers. A function may take arguments supplied by the caller, and it can return values to the caller.



In [7]:

    
# Example function
def xyz(value):     # The function's name is "xyz". It has one argument, "value".
    return int(value) + 1  # Returns one value: the argument converted to int, plus 1

print(xyz("3"))
#a = xyz(3)
#print(xyz(a))



In [8]:

    
def addTwo(input1, input2):
    return input1 + input2

addTwo(1, 2)

Out[8]:

3

Colin will provide more details about functions, such as variable scope and multiple return values.
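
As a brief preview of those two topics, here is a minimal sketch. The function name min_max and its variables are hypothetical, chosen only for illustration.

```python
# Hypothetical example: the names here are illustrative, not from the lecture.
def min_max(values):
    smallest = min(values)    # local variable: it exists only inside min_max
    largest = max(values)     # also local to the function
    return smallest, largest  # two values, returned together as a tuple

# The caller unpacks the returned tuple into two names of its own.
low, high = min_max([3, 1, 4])
print(low, high)
```

The names smallest and largest are not visible outside the function. The caller sees only the returned values.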



In [10]:

    
# Function to download from a URL
def download(url, filename):
    print("Downloading", filename)
    #request.urlretrieve(url, filename)



In [11]:

    
download(TRIP_DATA, TRIP_FILE)

Downloading pronto_trips.csv



In [13]:

    
# Enhancing function to detect file already present
import os.path
def download(url, filename):
    if os.path.isfile(filename):
        print("Already present %s." % filename)
    else:
        print("Downloading %s" % filename)
        #request.urlretrieve(url, filename)
        
download(TRIP_DATA, "none.csv")

Downloading none.csv



In [14]:

    
import download
download.download_file(TRIP_DATA, "none.csv")

Downloading none.csv

Creating A Python Module

We'll leave the Jupyter notebook and start using a file editor.
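
The lecture does not show the module file itself. The following is a minimal sketch of what it might contain, assuming the file is named download.py and exposes download_file, the names used by the notebook cell that imports it. The True/False return value is an assumption added here so that callers can tell what happened.

```python
# download.py: a sketch of the module. The lecture does not show the real file.
import os.path
from urllib import request


def download_file(url, filename):
    """Download url to filename unless the file is already present.

    Returns True if a download happened, False if the file already existed.
    (The return value is an assumption made for this sketch.)
    """
    if os.path.isfile(filename):
        print("Already present %s." % filename)
        return False
    print("Downloading %s" % filename)
    request.urlretrieve(url, filename)  # fetch the URL and save it to disk
    return True
```

Unlike the notebook version, this sketch performs the download, so calling it with the trip data URL fetches the full file the first time.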

Using the Python Module

We have moved the download function to an external file. Now, we want to use that file.

Accessing a Python Module in Another Directory
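
One way to import a module that lives in a different directory is to add that directory to sys.path before importing. The sketch below is self-contained under stated assumptions: it creates a temporary directory to stand in for a separate code directory, and writes a small stand-in module there under the hypothetical name my_download.py. Neither the directory nor the module name comes from the lecture.

```python
import os
import sys
import tempfile

# Stand-in for a separate code directory (hypothetical layout).
code_dir = tempfile.mkdtemp()

# Write a tiny stand-in module so that this sketch runs on its own.
with open(os.path.join(code_dir, "my_download.py"), "w") as f:
    f.write("def download_file(url, filename):\n")
    f.write("    print('Downloading %s' % filename)\n")
    f.write("    return filename\n")

# Python searches the directories listed in sys.path when importing.
sys.path.append(code_dir)

import my_download

result = my_download.download_file("http://example.invalid/x.csv", "none.csv")
```

In a real project, code_dir would be the path to the directory that holds the module, and the module would not need to be generated.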

Advanced Pandas

Pivot tables
Plotting
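
As a rough sketch of a pivot table, the example below uses a small made-up table. The column names day, rider, and minutes are hypothetical and are not the real Pronto schema.

```python
import pandas as pd

# Toy data with made-up column names; this is not the real Pronto schema.
trips = pd.DataFrame({
    "day": ["Mon", "Mon", "Tue", "Tue", "Tue"],
    "rider": ["member", "casual", "member", "member", "casual"],
    "minutes": [10, 25, 12, 8, 30],
})

# One row per day, one column per rider type; each cell holds total minutes.
table = trips.pivot_table(values="minutes", index="day",
                          columns="rider", aggfunc="sum")
print(table)
```

Assuming matplotlib is installed, calling table.plot(kind="bar") on the result would draw one group of bars per day.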