Byte 2 Google Data Labs Notebook

This is an example notebook that illustrates use of Pandas and Seaborn to load and explore data from a Fusion Table.

We start by loading required libraries:


In [1]:
# Import libraries.
import csv
import httplib2
from apiclient.discovery import build
import urllib
import json
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns


---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-646d63de88ed> in <module>()
      2 import csv
      3 import httplib2
----> 4 from apiclient.discovery import build
      5 import urllib
      6 import json

/Users/nbanovic/Documents/Teaching/data/data/Assignments/jmankoff-explore/apiclient/__init__.py in <module>()
     16   )
     17 
---> 18 from googleapiclient import channel
     19 from googleapiclient import discovery
     20 from googleapiclient import errors

/Users/nbanovic/Documents/Teaching/data/data/Assignments/jmankoff-explore/googleapiclient/channel.py in <module>()
     61 import uuid
     62 
---> 63 from googleapiclient import errors
     64 from oauth2client import util
     65 import six

/Users/nbanovic/Documents/Teaching/data/data/Assignments/jmankoff-explore/googleapiclient/errors.py in <module>()
     24 import json
     25 
---> 26 from oauth2client import util
     27 
     28 

ImportError: cannot import name 'util'

Load Data from a Fusion Table

Here we download the data and store it in a local JSON file for easier access.


In [3]:
# This API key is provided by google as described in the tutorial
API_KEY = 'AIzaSyCuKKwcT5mUYZP_Q-heqCPjvdWyacRuX00'

# This is the table id for the fusion table
TABLE_ID = '1ymz3EtGdi4qKGMl5AxEFXtTlgk3tKi8iCpjTzvM'

# open the data stored in a file called "data.json"
try:
    fp = open("data.json")
    response = json.load(fp)

# but if that file does not exist, download the data from fusiontables
except IOError:
    service = build('fusiontables', 'v1', developerKey=API_KEY)
    query = "SELECT * FROM " + TABLE_ID + " WHERE  AnimalType = 'DOG'"
    response = service.query().sql(sql=query).execute()
    fp = open("data.json", "w")
    json.dump(response, fp)

In [4]:
# Check how many rows we have.
print len(response['rows'])


9635

Load JSON Data into a Dataframe

Now that we have the JSON file locally, we will load it into a Pandas dataframe for easy access.


In [5]:
data_df = pd.DataFrame(response[u'rows'], columns = response[u'columns'])

In [6]:
# Show a few rows.
data_df.head()


Out[6]:
Outcome AnimalID AnimalType IntakeDate IntakeYear IntakeMonth Name Breed Age Sex ... OutcomeSubtype ZipFound Latitude Longitude ZipPlaced OutcomeDate OutcomeYear OutcomeMonth icon type IconName
0 A346562 DOG 2011-04-16T00:00:00Z 2011 April DIXIE Older than 1 year MALE ... 40218 38.1938 -85.6573 NaN 2011-04-16T00:00:00Z 2011 April NaN
1 A364651 DOG 2011-03-11T00:00:00Z 2011 March KAYLEE YORKSHIRE TERR - POODLE TOY Older than 1 year FEMALE ... 40202 38.2529 -85.7514 40218 2011-04-01T00:00:00Z 2011 April NaN
2 A369673 DOG 2011-03-31T00:00:00Z 2011 March AUSTIN Older than 1 year MALE ... 40228 38.1376 -85.6274 NaN 2011-04-12T00:00:00Z 2011 April NaN
3 A415767 DOG 2011-04-12T00:00:00Z 2011 April EBONY Older than 1 year FEMALE ... 40215 38.1919 -85.7841 40215 2011-04-12T00:00:00Z 2011 April NaN
4 A420365 DOG 2011-06-19T00:00:00Z 2011 June MARLEY SIBERIAN HUSKY - MIX Older than 1 year FEMALE ... 40216 38.1889 -85.8314 40216 2011-06-21T00:00:00Z 2011 June NaN

5 rows × 24 columns

Select and Clean Up Data

Pandas makes it easy to access different columns and rows in the dataframe. Here we create a new data frame containing only Latitude and Longitude.


In [22]:
lat_lon_df = data_df[['Latitude', 'Longitude']]
lat_lon_df.head()


Out[22]:
Latitude Longitude
0 38.1938 -85.6573
1 38.2529 -85.7514
2 38.1376 -85.6274
3 38.1919 -85.7841
4 38.1889 -85.8314

Now we can quickly convert both columns to numeric and remove values that could not be converted (NaN). It is all done on one line!


In [19]:
lat_lon_df = lat_lon_df.apply(lambda x: pd.to_numeric(x, errors='ignore')).dropna()

Visualize Data

We can now use Seaborn to quickly plot the data from the data frame. In this example we quickly compare the two variables in a pairplot.


In [21]:
sns.pairplot(lat_lon_df)


Out[21]:
<seaborn.axisgrid.PairGrid at 0x7ff6c409d3d0>

In [ ]: