CSE 6040, Fall 2015 [10]: A Large-Data Workflow

This notebook derives from an awesome demo by the makers of plot.ly.

In particular, this notebook starts with a large database of complaints filed by residents of New York City since 2010 via 311 calls. The full dataset is available at the NYC open data portal. At about 6 GB and 10 million complaints, you can infer that a) you might not want to read it all into memory at once, and b) NYC residents are really whiny. (OK, maybe you should only draw conclusion "a".) The notebook then combines sqlite, pandas, and plotly to build interactive visualizations. So it's a great way to exercise several of the things we've learned so far in our course!

Getting started

To complete this notebook, you'll need to get your environment set up. The basic steps are:

  1. Set up plotly.
  2. Download the sample dataset.

Set up plotly. To do the interactive visualization part of this notebook, you'll need to install plotly and sign up for an online plotly account. From the command-line on your system, you can do this by running:

pip install plotly

You might also be able to accomplish the same thing from within this notebook by running the following cell. (This example assumes a default Mac OS X install of Anaconda; you may need to adjust it for other systems.)


In [ ]:
!pip install plotly

The plotly service requires access to their servers.

To get started, you will need to sign up for a plotly account, if you haven't done so already, at:

https://plot.ly/

It's free! Well, to the extent that any too-good-to-be-true web service is "free."

Once you've done that, figure out what your API key is by visiting:

https://plot.ly/settings/api

Lastly, sign in to the plotly servers from within your notebook as follows.

Please modify this code to use your own username and API key.


In [ ]:
import plotly.plotly as py
py.sign_in ('USERNAME', 'APIKEY')

Next, as a quick test, let's make a simple plot using the "baby names" data set from Lab 8.


In [ ]:
import pandas as pd

# Build a Pandas data frame
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]
BabyDataSet = list (zip (names, births))  # materialize the pairs: in Python 3, zip returns an iterator
df = pd.DataFrame(data=BabyDataSet, columns=['Names', 'Births'])
df

In [ ]:
# Plot, using `plotly`
from plotly.graph_objs import Bar

plot_data = [Bar (x=df.Names, y=df.Births)]
py.iplot (plot_data)

Download the sample dataset. Next, grab a copy of today's dataset, which is a small (~20%) subset of the full dataset.

Connect to this database as you did in the last lab.


In [ ]:
# SQLite database filename
DB_FILENAME = 'NYC-311-2M.db'

# Connect
import sqlite3 as db
disk_engine = db.connect (DB_FILENAME)

Preview the data. This sample database has just a single table, named data. Let's query it and see how long it takes to read. To carry out the query, we will use the SQL reader built into pandas.


In [ ]:
import time

print ("Reading ...")
start_time = time.time ()

# Perform SQL query through the disk_engine connection.
# The return value is a pandas data frame.
df = pd.read_sql_query('SELECT * FROM data', disk_engine)

elapsed_time = time.time () - start_time
print ("==> Took %g seconds." % elapsed_time)

# Dump the first few rows
df.head()
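
Aside: if the table really were too big for memory, as the full 6 GB dataset would be, you could process it incrementally instead. The sketch below uses the chunksize argument of pd.read_sql_query, which returns an iterator of smaller data frames; the 100,000-row chunk size is an arbitrary choice for illustration.


In [ ]:
# Hedged sketch: stream the table in chunks rather than reading it all at once.
num_rows = 0
for chunk in pd.read_sql_query ('SELECT * FROM data', disk_engine,
                                chunksize=100000):
    num_rows += len (chunk)  # ... replace with per-chunk processing ...
print ("==> Saw %d rows, one chunk at a time." % num_rows)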

More SQL stuff

Partial queries: LIMIT clause. The preceding command was overkill for what we wanted, which was just to preview the table. Instead, we could have used a LIMIT clause to ask for just a few results.


In [ ]:
query = '''
  SELECT *
    FROM data
    LIMIT 5
'''
start_time = time.time ()
df = pd.read_sql_query(query, disk_engine)
elapsed_time = time.time () - start_time
print ("==> LIMIT version took %g seconds." % elapsed_time)

df

Set membership: IN operator. Another common idiom is to ask for rows whose attributes fall within a set, for which you can use the IN operator.


In [ ]:
query = '''
  SELECT ComplaintType, Descriptor, Agency
    FROM data
    WHERE Agency IN ("NYPD", "DOB")
    LIMIT 10
'''

df = pd.read_sql_query (query, disk_engine)
df.head()

Finding unique values: DISTINCT qualifier. Yet another common idiom is to ask for the unique values of some attribute, for which you can use the DISTINCT qualifier.


In [ ]:
query = 'SELECT DISTINCT City FROM data'
df = pd.read_sql_query(query, disk_engine)

print ("Found %d unique cities. The first few are:" % len (df))
df.head()
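
Relatedly, if all you want is how many unique values there are, you can let SQL do the counting using the standard COUNT(DISTINCT ...) form, as in this quick sketch.


In [ ]:
# Sketch: count the unique cities directly in SQL
query = 'SELECT COUNT(DISTINCT City) AS NumCities FROM data'
pd.read_sql_query (query, disk_engine)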

Renaming columns: AS operator. Sometimes you might want to rename a result column. For instance, the following query counts the number of complaints by "Agency," using the COUNT(*) function and GROUP BY clause, which we discussed in an earlier lab. If you wish to refer to the counts column of the resulting data frame, you can give it a more "friendly" name using the AS operator.


In [ ]:
query = '''
  SELECT Agency, COUNT(*) AS NumComplaints
    FROM data
    GROUP BY Agency
'''
df = pd.read_sql_query (query, disk_engine)
df.head()

Ordering results: ORDER BY clause. You can also order the results. For instance, suppose we want to order the results of the previous query by the number of complaints.


In [ ]:
query = '''
  SELECT Agency, COUNT(*) as NumComplaints
    FROM data
    GROUP BY Agency
    ORDER BY NumComplaints
'''
df = pd.read_sql_query (query, disk_engine)
df.tail ()

Note that the above example prints the bottom (tail) of the data frame. You could also have asked for the query results in reverse (descending) order, by prefixing the ORDER BY attribute with a - (minus) sign; SQLite then sorts by the negated value.


In [ ]:
query = '''
  SELECT Agency, COUNT(*) as NumComplaints
    FROM data
    GROUP BY Agency
    ORDER BY -NumComplaints
'''
df = pd.read_sql_query (query, disk_engine)
df.head ()
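
For the record, the minus-sign trick only works for numeric values; the standard and more portable way to get descending order is the DESC keyword. Here is the equivalent query written that way.


In [ ]:
# Equivalent query using the standard DESC keyword
query = '''
  SELECT Agency, COUNT(*) AS NumComplaints
    FROM data
    GROUP BY Agency
    ORDER BY NumComplaints DESC
'''
df = pd.read_sql_query (query, disk_engine)
df.head ()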

And of course we can plot all of this data!


In [ ]:
py.iplot ([Bar (x=df.Agency, y=df.NumComplaints)],
          filename='311/most common complaints by agency')

Exercise. Create a pandas data frame that shows the number of complaints for each type, in descending order. What is the most common type of complaint?


In [ ]:
# Insert your answer here
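
If you get stuck, here is one possible solution, written as a sketch; the column name NumComplaints is our choice, made to match the plotting code below.


In [ ]:
# One possible solution (sketch): complaint counts by type, in descending order
query = '''
  SELECT ComplaintType, COUNT(*) AS NumComplaints
    FROM data
    GROUP BY ComplaintType
    ORDER BY -NumComplaints
'''
df = pd.read_sql_query (query, disk_engine)
df.head ()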

Let's also visualize the result, as a bar chart showing complaint types on the x-axis and the number of complaints on the y-axis. If necessary, modify the plotly command below to pull the correct columns from your data frame.


In [ ]:
py.iplot({
    'data': [Bar (x=df.ComplaintType, y=df.NumComplaints)],
    'layout': { 
        'margin': {'b': 150}, # Make the bottom margin a bit bigger to handle the long text
        'xaxis': {'tickangle': 40}} # Angle the labels a bit
    }, filename='311/most common complaints by complaint type')

Exercise. Determine the Top 10 whiniest cities. (That is, the 10 cities with the largest numbers of complaints.)


In [ ]:
# Insert your answer here
query = '''
'''
df = pd.read_sql_query (query, disk_engine)
df.head (10)
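
One possible query, shown as a sketch, simply groups by City; run it before reading on.


In [ ]:
# One possible solution (sketch): complaint counts by city, in descending order
query = '''
  SELECT City, COUNT(*) AS NumComplaints
    FROM data
    GROUP BY City
    ORDER BY -NumComplaints
    LIMIT 10
'''
df = pd.read_sql_query (query, disk_engine)
df.head (10)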

You should notice two bits of funny behavior: cities are treated in a case-sensitive manner, and None appears as a city. (Presumably the latter occurs when a complaint is not localized to a particular city or the city was left unspecified.)

Case-insensitive grouping: COLLATE NOCASE. One way to carry out the preceding query in a case-insensitive way is to add a COLLATE NOCASE qualifier to the GROUP BY clause.

Let's filter out the 'None' cases as well, while we are at it.


In [ ]:
query = '''
  SELECT City, COUNT(*) AS NumComplaints
    FROM data
    WHERE City <> 'None'
    GROUP BY City COLLATE NOCASE
    ORDER BY -NumComplaints
    LIMIT 10
'''
df = pd.read_sql_query (query, disk_engine)
df

Brooklyn is NYC's whiniest city. I knew it!

Lastly, for later use, let's save the names of just the top 7 cities.


In [ ]:
TOP_CITIES = df.head (7)['City']
TOP_CITIES

Multiple series in plotly

Here is another example of how to use a query to extract data and then recombine the results into a plot.

Suppose we want to look at the number of complaints by type and by city.

Furthermore, suppose we want to render these results as a bar chart with complaint types along the x-axis and cumulative counts, as stacked bars, along the y-axis, where different bars correspond to different cities. The plotly package requires that we create a list of traces, where each trace is a series to plot.

Here's how we might construct such a list of traces.


In [ ]:
traces = []

for city in TOP_CITIES:
    query = '''
      SELECT ComplaintType, COUNT(*) as NumComplaints
        FROM data
        WHERE City = "{}" COLLATE NOCASE
        GROUP BY ComplaintType
        ORDER BY -NumComplaints
    '''.format (city)
    df = pd.read_sql_query (query, disk_engine)
    
    traces.append (Bar (x=df['ComplaintType'],
                        y=df.NumComplaints, 
                        name=city.capitalize()))
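
An aside on style: interpolating the city name into the SQL string via .format works for these city names, but it would break on a name containing a quote and is unsafe with untrusted input. The sketch below shows the same per-city query using parameter binding instead, via SQLite's ? placeholder and the params argument of pd.read_sql_query; 'BROOKLYN' stands in as an example value.


In [ ]:
# Sketch: the same per-city query, with the city bound as a parameter
query = '''
  SELECT ComplaintType, COUNT(*) AS NumComplaints
    FROM data
    WHERE City = ? COLLATE NOCASE
    GROUP BY ComplaintType
    ORDER BY -NumComplaints
'''
pd.read_sql_query (query, disk_engine, params=('BROOKLYN',)).head ()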

From this list, we can create the stacked bar chart accordingly.


In [ ]:
from plotly.graph_objs import Layout

py.iplot({'data': traces,
          'layout': Layout (barmode='stack',
                            xaxis={'tickangle': 40},
                            margin={'b': 150})},
         filename='311/complaints by city stacked')

You can also click on the legend entries to hide/show the traces. Click-and-drag to zoom in and shift-drag to pan.

Exercise. Make a variation of the above stacked bar chart that shows, for each complaint type (x-axis), the percentage of complaints attributed to each city.

Your code should create a new list of traces, norm_traces, which the plotly code below can then render as the final result.
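
One possible way to build norm_traces appears below, as a sketch: it reruns the per-city queries, concatenates the results, and normalizes each city's count by the total for that complaint type across the top cities.


In [ ]:
# One possible solution (sketch): per-city complaint counts as percentages
city_frames = []
for city in TOP_CITIES:
    query = '''
      SELECT ComplaintType, COUNT(*) AS NumComplaints
        FROM data
        WHERE City = "{}" COLLATE NOCASE
        GROUP BY ComplaintType
    '''.format (city)
    df_city = pd.read_sql_query (query, disk_engine)
    df_city['City'] = city.capitalize ()
    city_frames.append (df_city)

combined = pd.concat (city_frames, ignore_index=True)

# Total complaints per type (across just these cities), aligned row-by-row
totals = combined.groupby ('ComplaintType')['NumComplaints'].transform ('sum')
combined['Percent'] = 100.0 * combined['NumComplaints'] / totals

norm_traces = []
for city, group in combined.groupby ('City'):
    norm_traces.append (Bar (x=group['ComplaintType'],
                             y=group['Percent'],
                             name=city))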


In [ ]:
py.iplot({'data': norm_traces, 
          'layout': Layout(
                barmode='stack',
                xaxis={'tickangle': 40, 'autorange': False, 'range': [-0.5, 16]},
                yaxis={'title': 'Percent of Complaints by City'},
                margin={'b': 150},
                title='Relative Number of 311 Complaints by City')
         }, filename='311/relative complaints by city', validate=False)

From the above data, what would you conclude about the various areas of New York City?

Part 2: SQLite time series with Pandas

Filter SQLite rows with timestamp strings: YYYY-MM-DD hh:mm:ss

Timestamps in this ISO-like format sort lexicographically in chronological order, so plain string comparisons suffice for date filtering.

In [ ]:
query = '''
  SELECT ComplaintType, CreatedDate, City
    FROM data
    WHERE CreatedDate <= "2015-09-15 23:59:59"
          AND CreatedDate >= "2015-09-15 00:00:00"
'''
df = pd.read_sql_query (query, disk_engine)
df

Pull out the hour from timestamps with STRFTIME

In [ ]:
query = '''
  SELECT CreatedDate, STRFTIME ('%H', CreatedDate) AS Hour, ComplaintType
    FROM data
    LIMIT 5
'''
df = pd.read_sql_query (query, disk_engine)
df.head()
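
The same idea extracts other fields; for instance, '%w' pulls out the day of the week (0 = Sunday). A quick sketch:


In [ ]:
# Sketch: extract the day of the week instead of the hour
query = '''
  SELECT CreatedDate, STRFTIME ('%w', CreatedDate) AS DayOfWeek
    FROM data
    LIMIT 5
'''
pd.read_sql_query (query, disk_engine)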

Count the number of complaints (rows) per hour with STRFTIME, GROUP BY, and COUNT(*)

In [ ]:
query = '''
  SELECT
      CreatedDate,
      strftime ('%H', CreatedDate) as Hour,
      COUNT (*) AS `Complaints per Hour`
    FROM data
    GROUP BY Hour
'''
df = pd.read_sql_query (query, disk_engine)
df.head()

In [ ]:
py.iplot({
    'data': [Bar (x=df['Hour'], y=df['Complaints per Hour'])],
    'layout': Layout (xaxis={'title': 'Hour in Day'},
                      yaxis={'title': 'Number of Complaints'})},
         filename='311/complaints per hour')

Filter noise complaints by hour

In [ ]:
query = '''
  SELECT CreatedDate,
         STRFTIME ('%H', CreatedDate) AS Hour,
         COUNT (*) AS `Complaints per Hour`
    FROM data
    WHERE ComplaintType LIKE '%Noise%'
    GROUP BY Hour
    ORDER BY -`Complaints per Hour`
'''
df = pd.read_sql_query (query, disk_engine)
from IPython.display import display  # `display` is not predefined in older IPython versions
display (df.head (n=2))

py.iplot({
    'data': [Bar(x=df['Hour'], y=df['Complaints per Hour'])],
    'layout': Layout(xaxis={'title': 'Hour in Day'},
                     yaxis={'title': 'Number of Complaints'},
                     title='Number of Noise Complaints in NYC by Hour in Day'
                    )}, filename='311/noise complaints per hour')

Segregate complaints by hour

In [ ]:
complaint_traces = {} # Each series in the graph will represent a complaint
complaint_traces['Other'] = {}

for hour in range (24):
    hour_str = '{:02d}'.format (hour)  # zero-padded, matching STRFTIME's '%H' output
    query = '''
      SELECT CreatedDate,
             ComplaintType,
             STRFTIME ('%H', CreatedDate) AS Hour,
             COUNT (*) AS NumComplaints
        FROM data
        WHERE Hour = "{}"
        GROUP BY ComplaintType
        ORDER BY -NumComplaints
    '''.format (hour_str)
    df = pd.read_sql_query (query, disk_engine)
    
    complaint_traces['Other'][hour] = sum (df.NumComplaints)
    
    # Grab the 7 most common complaints for that hour
    for i in range(7):
        complaint = df.at[i, 'ComplaintType']  # (df.get_value was removed in newer pandas)
        count = df.at[i, 'NumComplaints']
        complaint_traces['Other'][hour] -= count
        if complaint in complaint_traces:
            complaint_traces[complaint][hour] = count
        else:
            complaint_traces[complaint] = {hour: count}

In [ ]:
traces = []
for complaint in complaint_traces:
    traces.append({
        'x': list (range (24)),
        'y': [complaint_traces[complaint].get (i, None) for i in range (24)],
        'name': complaint,
        'type': 'bar'
    })

py.iplot({
    'data': traces, 
    'layout': {
        'barmode': 'stack',
        'xaxis': {'title': 'Hour in Day'},
        'yaxis': {'title': 'Number of Complaints'},
        'title': 'The 7 Most Common 311 Complaints by Hour in a Day'
    }}, filename='311/most common complaints by hour')

Aggregated time series

First, create a new column with timestamps rounded down to the nearest 15-minute interval. The trick: STRFTIME ('%s', ...) converts a timestamp to Unix seconds, integer division by the interval length discards the remainder, multiplying back recovers the start of the interval, and DATETIME (..., 'unixepoch') converts the result back into a timestamp.


In [ ]:
minutes = 15
seconds = minutes*60

query = '''
  SELECT CreatedDate,
         DATETIME ((STRFTIME ('%s', CreatedDate) / {seconds}) * {seconds},
                   'unixepoch')
           AS Interval
    FROM data
    LIMIT 10
'''.format (seconds=seconds)
df = pd.read_sql_query (query, disk_engine)
display (df.head ())

Then, GROUP BY that interval and COUNT(*)


In [ ]:
minutes = 15
seconds = minutes*60

query = '''
  SELECT CreatedDate,
         DATETIME ((STRFTIME ('%s', CreatedDate) / {seconds}) * {seconds},
                   'unixepoch')
           AS Interval,
         COUNT (*) AS `Complaints / Interval`
    FROM data
    GROUP BY Interval
    ORDER BY Interval
    LIMIT 500
'''.format (seconds=seconds)
df = pd.read_sql_query (query, disk_engine)

display (df.head ())
display (df.tail ())

In [ ]:
py.iplot(
    {
        'data': [{
            'x': df.Interval,
            'y': df['Complaints / Interval'],
            'type': 'bar'
        }],
        'layout': {
            'title': 'Number of 311 Complaints per 15 Minutes'
        }
}, filename='311/complaints per 15 minutes')

The same idiom aggregates by day: just set the interval length to a full day's worth of seconds.


In [ ]:
hours = 24
minutes = hours*60
seconds = minutes*60

query = '''
  SELECT CreatedDate,
         DATETIME ((STRFTIME ('%s', CreatedDate) / {seconds}) * {seconds},
                   'unixepoch')
           AS Interval,
         COUNT (*) AS `Complaints / Interval`
    FROM data
    GROUP BY Interval
    ORDER BY Interval
    LIMIT 500
'''.format (seconds=seconds)

df = pd.read_sql_query (query, disk_engine)
df.head ()

In [ ]:
py.iplot(
    {
        'data': [{
            'x': df.Interval,
            'y': df['Complaints / Interval'],
            'type': 'bar'
        }],
        'layout': {
            'title': 'Number of 311 Complaints per Day'
        }
}, filename='311/complaints per day')
