This workbook shows many of the core features of Tables in a fairly large example that involves real-world data wrangling. It takes two large data sets from the Berkeley Open Data Portal: the citywide parcel database and the business license database. It begins by visualizing the parcel data on maps. Digging into this reveals the wrangling challenge of working with data in its native form: Use Codes are literally all over the map. It illustrates some structured techniques for doing the wrangling that leave behind a clear definition of how the raw data is transformed into workable parcel data. It then joins the residential portion of the parcel data with the business license data to answer a simple question: do people with business licenses live in larger homes? Yes, and by enough to warrant a study.
First, a couple of preliminaries that will be in all our notebooks.
In [1]:
from datascience import *
import numpy as np
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
%matplotlib inline
# datascience version number of last run of this notebook
version.__version__
Out[1]:
We've grabbed a big chunk of data from the recent Berkeley Open Data Portal, so let's read it in as a Table.
In [2]:
raw_parcels = Table.read_table("./data/BerkeleyData/Parcels.csv")
Cool, what does that look like? Tables print themselves in nice HTML format.
In [3]:
raw_parcels
Out[3]:
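Why does just naming the table produce HTML? Jupyter looks for a `_repr_html_` method on the object and, if it finds one, renders what it returns. Here is a minimal sketch of that hook. `ToyTable` and its data are made up for illustration; this is not how the datascience `Table` is actually implemented, only the same display mechanism in miniature.

```python
from html import escape


class ToyTable:
    """Hypothetical stand-in for a table; not the datascience Table class."""

    def __init__(self, columns):
        self.columns = columns  # dict: column label -> list of values

    def _repr_html_(self):
        # Jupyter calls this hook, when present, to render an object as HTML.
        labels = list(self.columns)
        header = "".join(f"<th>{escape(label)}</th>" for label in labels)
        # Transpose the columns into rows for display.
        rows = zip(*(self.columns[label] for label in labels))
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
            for row in rows
        )
        return f"<table><tr>{header}</tr>{body}</table>"


toy = ToyTable({"label": ["a", "b"], "value": [1, 2]})
html_text = toy._repr_html_()
print(html_text)
```

In a notebook, putting `toy` on the last line of a cell would show the rendered table rather than the markup.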
How many parcels are we talking about here?
In [4]:
raw_parcels.num_rows
Out[4]:
Tables are ordered collections of labeled columns.
How many columns are there, and what are their names? You can tell by inspection above, but you also need to be able to get at it programmatically. (Be sure to try command completion with tab.)
In [5]:
raw_parcels.column_labels
Out[5]:
In [6]:
# len() of a table is the number of columns
len(raw_parcels)
Out[6]:
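It can be surprising that `len` counts columns while `num_rows` counts rows. It follows from thinking of a table as an ordered collection of labeled columns. A plain-Python sketch of that model, assuming nothing about the library's internals; the `toy_table` labels and values are hypothetical and not taken from the parcel file:

```python
# Toy model of a table: an ordered mapping from column label to column values.
# Illustration only; the labels and values below are hypothetical.
toy_table = {
    "APN": ["001", "002", "003"],
    "latitude": [37.87, 37.86, 37.88],
    "longitude": [-122.27, -122.26, -122.28],
}

column_labels = list(toy_table)         # labels, in insertion order
num_columns = len(toy_table)            # len of the mapping counts columns
num_rows = len(toy_table["latitude"])   # each column holds one entry per row

print(column_labels)
print(num_columns, num_rows)
```

Seen this way, asking for the length of the collection naturally gives the number of columns, and the number of rows is a property of the columns themselves.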
This table seems to contain geocoded data, since it has columns called latitude and longitude. Let's just assume that is what's going on and throw all 28,000 points on a map. We can even label the points in case you click on them.
In [7]:
raw_parcels.points('latitude','longitude')
Out[7]:
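We assumed the columns hold coordinates just because of their names. A cheap way to back that up is to check that the values fall in the valid ranges for latitude and longitude. A minimal sketch, using a hypothetical helper `looks_geocoded` and made-up sample values near Berkeley rather than the real parcel columns:

```python
def looks_geocoded(latitudes, longitudes):
    """Return True if every pair lies within valid latitude/longitude ranges."""
    return all(
        -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0
        for lat, lon in zip(latitudes, longitudes)
    )


# Hypothetical sample values near Berkeley, not taken from the parcel file.
sample_lat = [37.8716, 37.8591, 37.8802]
sample_lon = [-122.2727, -122.2860, -122.2965]

print(looks_geocoded(sample_lat, sample_lon))  # True
print(looks_geocoded(sample_lon, sample_lat))  # False: columns swapped
```

A range check like this will not catch every problem, but it does catch swapped columns in this part of the world, since a longitude near -122 is not a valid latitude.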