This workbook shows many of the core features in Tables in an fairly large example that involves real world wrangling with data. It takes two large data sets from the Berkeley Open Data Portal - the city wide parcel data base and the business license database. It begins by doing some visualization on maps of the parcel data. Digging into this reveals the wrangling challenge of working with native data - Use Codes are literally all over the map. It illustrates some structured techniques for doing the wrangling that leave behind a clear definition of how the raw data is transformed into workable parcel data. It then joins the residential portion of the parcel data with the business license data to answer a simple question - do people with business licenses live in larger homes? Yes, enough bigger for a study.

First, a couple of preliminaries that will be in all our notebooks


In [1]:
from datascience import *
import numpy as np
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
%matplotlib inline
# datascience version number of last run of this notebook
version.__version__


Out[1]:
'0.3.dev19'

We've grabbed a big chunk of data from the recent Berkeley Open Data portal, so let's read it in as a Table.

https://data.cityofberkeley.info/browse


In [2]:
raw_parcels = Table.read_table("./data/BerkeleyData/Parcels.csv")

Cool, what does that look like. Tables print themselves in nice HTML format.


In [3]:
raw_parcels


Out[3]:
CONDO APN POINT_X POINT_Y LocationID StreetNum Prequalifi Direction StreetName StreetSufx Unit City State Zip UseCode UseCodeDes BldgSqft LotSqft latitude longitude X_min X_max Y_min Y_max Shape_area Shape_len
YES 055 183213100 565028 4.19089e+06 71843 2550 nan nan DANA ST nan BERKELEY CA 94704 7390 CONDOMINIUM-COMMON AREA 0 0 37.8632 -122.261 565007 565050 4.19088e+06 4.19091e+06 1018.78 129.821
YES 053 166004000 562508 4.19004e+06 67007 1012 nan nan GRAYSON ST A BERKELEY CA 94710 4200 INDUSTRIAL LIGHT/MANUFAC. 1932 1667 37.8557 -122.289 562497 562520 4.19002e+06 4.19006e+06 476.854 92.7454
YES 053 166004200 562508 4.19004e+06 67009 1012 nan nan GRAYSON ST C BERKELEY CA 94710 4101 CONDOMINIUM-INDUSTRIAL 1932 1667 37.8557 -122.289 562497 562520 4.19002e+06 4.19006e+06 476.854 92.7454
YES 057 209002100 562032 4.19171e+06 47615 1813 nan nan NINTH ST nan BERKELEY CA 94710 7300 CONDOMINIUMS-SINGLE RESDL 972 2600 37.8707 -122.295 562012 562053 4.1917e+06 4.19172e+06 483.811 103.162
YES 057 208303100 562499 4.19172e+06 12322 1901 nan nan CURTIS ST nan BERKELEY CA 94702 7300 CONDOMINIUMS-SINGLE RESDL 786 2700 37.8708 -122.289 562483 562515 4.19171e+06 4.19173e+06 533.749 96.0604
YES 052 152702400 564234 4.18923e+06 65546 1829 nan nan SIXTY-THIRD ST nan BERKELEY CA 94703 7300 CONDOMINIUMS-SINGLE RESDL 1387 2363 37.8483 -122.27 564226 564243 4.18921e+06 4.18926e+06 438.797 103.952
YES 052 157310800 565682 4.19013e+06 48709 2641 nan nan WEBSTER ST 1 BERKELEY CA 94705 7300 CONDOMINIUMS-SINGLE RESDL 630 1167 37.8562 -122.253 565672 565692 4.19011e+06 4.19014e+06 413.513 89.6718
YES 052 151603400 563132 4.18952e+06 65541 1314 nan nan HASKELL ST A BERKELEY CA 94702 1160 SINGLE FAMILY RESIDENTIAL 935 1575 37.851 -122.282 563120 563144 4.1895e+06 4.18955e+06 570.197 116.043
YES 055 183203100 565019 4.19095e+06 7145 2522 nan nan DANA ST 101 BERKELEY CA 94704 9300 MEDICAL-DENTAL BUILDING 1228 1270 37.8637 -122.261 564998 565040 4.19093e+06 4.19097e+06 1034.69 130.691
YES 056 200500400 563426 4.19151e+06 47852 2122 nan nan JEFFERSON AVE nan BERKELEY CA 94703 7340 CONDOMINIUM-SFR,R&T 402.1 1403 2166 37.8689 -122.279 563369 563480 4.19142e+06 4.19161e+06 16800.6 587.516

... (28795 rows omitted)

How many parcels are we talking about here?


In [4]:
raw_parcels.num_rows


Out[4]:
28805

Tables are ordered collections of labeled columns.

How many columns and what are their names? You can tell by inspection above. But you also need to be able get at it programmatically. (Be sure to try command completion with tab.)


In [5]:
raw_parcels.column_labels


Out[5]:
('CONDO',
 'APN',
 'POINT_X',
 'POINT_Y',
 'LocationID',
 'StreetNum',
 'Prequalifi',
 'Direction',
 'StreetName',
 'StreetSufx',
 'Unit',
 'City',
 'State',
 'Zip',
 'UseCode',
 'UseCodeDes',
 'BldgSqft',
 'LotSqft',
 'latitude',
 'longitude',
 'X_min',
 'X_max',
 'Y_min',
 'Y_max',
 'Shape_area',
 'Shape_len')

In [6]:
# Len of a table is the number of columns
len(raw_parcels)


Out[6]:
26

This table seems to contain geocoded data, since it has columns called latitude and longitude. Let's just assume that is what's going on and throw all 28,000 points on a map. We can even label the points in case you click on them.


In [7]:
raw_parcels.points('latitude','longitude')


Out[7]: