This workbook shows many of the core features of Tables in a fairly large example that involves real-world data wrangling. It takes two large data sets from the Berkeley Open Data Portal - the citywide parcel database and the business license database. It begins by doing some visualization of the parcel data on maps. Digging in reveals the wrangling challenge of working with raw data - Use Codes are literally all over the map. It illustrates some structured techniques for doing the wrangling that leave behind a clear definition of how the raw data is transformed into workable parcel data. It then joins the residential portion of the parcel data with the business license data to answer a simple question - do people with business licenses live in larger homes? Yes, larger by just about enough for a study.
First, a couple of preliminaries that will be in all our notebooks.
In [1]:
from datascience import *
import numpy as np
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
%matplotlib inline
# datascience version number of last run of this notebook
version.__version__
Out[1]:
We've grabbed a big chunk of data from the recently launched Berkeley Open Data portal, so let's read it in as a Table.
In [2]:
raw_parcels = Table.read_table("./data/BerkeleyData/Parcels.csv")
Cool, what does that look like? Tables print themselves in a nice HTML format.
In [3]:
raw_parcels
Out[3]:
How many parcels are we talking about here?
In [4]:
raw_parcels.num_rows
Out[4]:
Tables are ordered collections of labeled columns.
How many columns and what are their names? You can tell by inspection above, but you also need to be able to get at it programmatically. (Be sure to try command completion with tab.)
In [5]:
raw_parcels.column_labels
Out[5]:
In [6]:
# Len of a table is the number of columns
len(raw_parcels)
Out[6]:
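As a quick sketch of getting at the structure programmatically (using the same column_labels attribute seen above), we can walk the labels and peek at a sample value from each column:
# Sketch: pair each column label with the first value in that column
for label in raw_parcels.column_labels:
    print(label, raw_parcels[label][0])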
This table seems to contain geocoded data, since it has columns called latitude and longitude. Let's just assume that is what's going on and throw all 28,000 points on a map. We can even label the points in case you click on them.
In [7]:
raw_parcels.points('latitude','longitude')
Out[7]:
We can also easily do some descriptive statistics on the Table to try to figure out what we are looking at. We might want to select some useful columns and plot that.
First, select. This returns a new table with the selected columns.
In [8]:
parcels = raw_parcels.select(('APN', 'BldgSqft','LotSqft'))
In [9]:
# Table.hist shows a histogram of the data in each of the columns
parcels.select(('BldgSqft','LotSqft')).hist(bins=50)
Well, that's not terribly useful. It seems we have a very long tail on this distribution. We might want to look at some of this data more closely; we get a better sense by sorting and plotting.
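As an aside before we sort, a purely illustrative sketch: clipping at a high percentile is another quick way to tame the tail for display.
# Sketch: histogram of building sizes below the 99th percentile
cutoff = np.percentile(parcels['BldgSqft'], 99)
parcels.where(parcels['BldgSqft'] < cutoff).select(('BldgSqft',)).hist(bins=50)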
Sort also returns a new table.
In [10]:
sorted_parcels = parcels.sort('BldgSqft', descending=True)
sorted_parcels
Out[10]:
Yep, big industrial properties with half a million sq ft. Here's a bit more.
In [11]:
sorted_parcels.select(('BldgSqft','LotSqft')).plot()
In [12]:
parcels.select(('BldgSqft','LotSqft')).stats()
Out[12]:
If we are going to make much sense of the data, we will have to break it down into useful categories and do analysis within those.
Each of the columns in a Table can be selected to get a column of values.
In [13]:
raw_parcels['UseCodeDes']
Out[13]:
OK, so how many use codes and descriptions do we have?
In [14]:
use_codes = np.unique(raw_parcels['UseCode'])
len(use_codes)
Out[14]:
In [15]:
use_desc = np.unique(raw_parcels['UseCodeDes'])
len(use_desc)
Out[15]:
Isn't that interesting? There must be some places where the same description has multiple different codes! We could already start grouping things by use code, but 100 codes is still a lot to deal with. Guess we get to do some wrangling. Grouping may still be handy, because it lets us find the most common uses.
In [16]:
parcels_by_use = raw_parcels.select(('APN','UseCode','UseCodeDes')).group('UseCode')
parcels_by_use
Out[16]:
Now that's interesting. Much more information than 'GROUP BY' in SQL and the like, because the entries in the table are lists of the values in each group. Notice that 1100 and 1101 are both "SINGLE FAMILY RESIDENTIAL".
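A short sketch over the two raw columns makes the duplicates explicit (plain numpy, nothing Table-specific):
# Sketch: descriptions that appear with more than one distinct use code
for des in np.unique(raw_parcels['UseCodeDes']):
    codes = np.unique(raw_parcels['UseCode'][raw_parcels['UseCodeDes'] == des])
    if len(codes) > 1:
        print(des, codes)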
We can provide a collection function to combine these values into one.
In [17]:
use_codes = raw_parcels.select(('APN','UseCodeDes')).group('UseCodeDes', len)
use_codes
Out[17]:
Now we can sort that. Notice that the label of the collected column reflects how we got here.
In [18]:
common_uses = use_codes.sort('APN len', descending=True)
common_uses.max_str_rows = 20
common_uses
Out[18]:
Let's pull out the 16,568 'SINGLE FAMILY RESIDENTIAL' parcels.
Unlike select, which picks columns, here we want the rows of the table WHERE some particular property holds.
In [19]:
residential = raw_parcels.where(raw_parcels['UseCodeDes'] == 'SINGLE FAMILY RESIDENTIAL')
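By the way, the same rows can be pulled out with where's two-argument form - a column label and a value to match. Assuming that form of the API, this sketch should give an identical result:
# Sketch: equivalent selection with the label/value form of where
raw_parcels.where('UseCodeDes', 'SINGLE FAMILY RESIDENTIAL').num_rows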
In [20]:
residential.points('latitude','longitude')
Out[20]:
Well, that is more useful. We might like to understand what portion of the parcels are 'SINGLE FAMILY RESIDENTIAL'. It would be nice to visualize that too. This brings us to one of the key observations: we have a real programming language around tables, and we are learning how to use it. For example, we might define a simple function.
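As a minimal first sketch of that idea, using only tables we already have, a one-off function can report the share of parcels matching a given description:
# Sketch: fraction of parcels carrying a particular use description
def use_fraction(table, description):
    return table.where(table['UseCodeDes'] == description).num_rows / table.num_rows

use_fraction(raw_parcels, 'SINGLE FAMILY RESIDENTIAL')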
In [21]:
parcels = raw_parcels.select(['APN','UseCode','UseCodeDes','BldgSqft','LotSqft','latitude','longitude'])
Not only can we use the columns of a table by indexing them with their labels, we can also introduce new columns. Tables themselves can serve as handy data tools for doing operations on other tables. Here we build a small table that maps keywords found in a use description to a category, along with a display color for each category; below we apply a categorizing function to every description, and later join the colors back in.
In [22]:
categories = Table.from_rows(
    [("Residential", ["RESIDENTIAL", "SFR", "SINGLE RESDL", "SINGLE FAM"], "blue"),
     ("Multi-unit",  ["UNIT", "FRAT", "SORO"], "orange"),
     ("Industrial",  ["INDUSTRIAL"], "red"),
     ("Commercial",  ["COMMERCIAL", "DEALERSHIP", "BANK", "RESTAURANT", "STORE", "MARKET",
                      "SHOP", "OFFICE", "FUNERAL", "THEATER"], "green"),
     ("Public",      ["PUBLIC", "SCHOOL"], "yellow")],
    ("Category", "Keys", "Color"))
categories
Out[22]:
Lots of other opportunities here to talk about defining functions, scoping, and higher-order functions. This one is in the yucky 70s style of referencing a global, but we don't have lambda yet.
In [23]:
def categorize(title):
    # Return the first category whose keywords appear in the description
    for category, keywords, _ in categories.rows:
        for word in keywords:
            if title.find(word) >= 0:
                return category
    return 'Other'
Try it out.
In [24]:
categorize("SINGLE FAMILY RESIDENTIAL")
Out[24]:
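For contrast, a sketch of the same keyword scan as a single expression, once lambda and comprehensions are in the toolbox:
# Sketch: the same lookup as a one-liner
categorize2 = lambda title: next((cat for cat, keywords, _ in categories.rows
                                  if any(word in title for word in keywords)),
                                 'Other')
categorize2("SINGLE FAMILY RESIDENTIAL")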
In [25]:
parcels['Category'] = parcels.apply(categorize, 'UseCodeDes')
parcels
Out[25]:
Let's try to get a sense of how our categorization did.
In [26]:
# For each category, list the distinct descriptions that landed in it
for cat, des in parcels.select(["UseCodeDes","Category"]).group('Category', np.unique).rows:
    print(cat, des)
In [27]:
colored = parcels.join('Category', categories)
colored.sort('BldgSqft', descending=True)
Out[27]:
In [28]:
colored.points('latitude', 'longitude', labels='UseCodeDes', colors='Color')
Out[28]:
Ok, so let's do some more detailed analysis on the residential part, now that we see where it is.
In [29]:
residential = parcels.where(parcels['Category'] == 'Residential')
In [30]:
residential.select(['BldgSqft']).hist(bins=40)
In [31]:
residential.select(['BldgSqft','LotSqft']).stats()
Out[31]:
In [32]:
def firstQtile(x): return np.percentile(x, 25)
def thirdQtile(x): return np.percentile(x, 75)
summary_ops = (min, firstQtile, np.median, np.mean, thirdQtile, max)
In [33]:
residential.select(['BldgSqft','LotSqft']).stats(summary_ops)
Out[33]:
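A quick sanity check of these summary functions on a tiny known input (illustrative only):
# Sketch: on 1..9 the quartiles should come out 3, 5, 7
demo = np.arange(1, 10)
firstQtile(demo), np.median(demo), thirdQtile(demo)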
In [34]:
# Tag each residence as 'small' or 'large', splitting at 3,000 sq ft
residential['Size'] = ['small' if x < 3000 else 'large' for x in residential['BldgSqft']]
residential
Out[34]:
In [35]:
biz = Table.read_table("./data/BerkeleyData/Business_Licenses.csv")
In [36]:
biz
Out[36]:
In [37]:
biz_residential = residential.join('APN', biz)
In [38]:
biz_residential
Out[38]:
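Note that join keeps only the APNs that appear in both tables, so biz_residential contains just the residential parcels that carry a business license.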
In [39]:
biz_residential.select(['latitude', 'longitude']).points('latitude','longitude')
Out[39]:
In [40]:
biz_residential.num_rows
Out[40]:
What fraction of residential properties have business licenses?
In [41]:
biz_residential.num_rows / residential.num_rows
Out[41]:
In [42]:
residential.select(['BldgSqft','LotSqft']).stats(summary_ops)
Out[42]:
In [43]:
biz_residential.select(['BldgSqft','LotSqft']).stats(summary_ops)
Out[43]:
In [44]:
# 100 sq ft bins from 500 to 5,900, with a catch-all first bin starting at 0
bins = np.insert(np.arange(500,6000,100), 0, 0)
bins
Out[44]:
In [45]:
residential.select(['BldgSqft']).hist(bins=bins)
In [46]:
biz_residential.select(['BldgSqft']).hist(bins=bins)
In [47]:
res_dist = residential.select(['BldgSqft']).bin(bins=bins, normed=True)
res_dist.hist(counts='bin', bins='bin')
In [48]:
biz_dist = biz_residential.select(['BldgSqft']).bin(bins=bins, normed=True)
biz_dist.hist(counts='bin', bins='bin')
In [49]:
biz_dist
Out[49]:
In [50]:
res_dist['Biz Bldg Sqft'] = biz_dist['BldgSqft density']
res_dist.relabel('BldgSqft density','Res Bldg Sqft')
res_dist
Out[50]:
In [51]:
res_dist['Var Dist'] = res_dist['Res Bldg Sqft']-res_dist['Biz Bldg Sqft']
In [52]:
res_dist
Out[52]:
In [53]:
res_dist.bar('bin')
It sure looks like non-biz-license residences tend to be smaller and biz-license residences tend to be larger.
A statistic based entirely on the differences between these normalized histograms is the total variation distance: half the sum of the absolute differences.
In [54]:
# Total variation distance: half the sum of the absolute differences
TVD = np.sum(np.abs(res_dist['Var Dist'])) / 2
TVD
Out[54]:
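As a tiny worked check on made-up distributions (purely illustrative):
# Sketch: TVD of two toy distributions; 0 means identical, 1 means disjoint
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
np.sum(np.abs(p - q)) / 2   # 0.3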
In [55]:
res_size = residential.select(['APN','BldgSqft'])
res_size
Out[55]:
In [56]:
biz_residential.select(['APN','BldgSqft'])
Out[56]:
In [57]:
"016 141404600" in biz_residential['APN']
Out[57]:
In [58]:
def is_biz(apn):
    # Does this parcel number appear among the licensed residences?
    return apn in biz_residential['APN']
In [59]:
is_biz("052 136400300")
Out[59]:
In [60]:
is_biz("052 136400301")
Out[60]:
In [61]:
res_size['Biz']=res_size.apply(is_biz,'APN')
In [62]:
res_size
Out[62]:
In [63]:
# Compute the statistic: mean building size overall, with a license, and without
print(np.mean(res_size['BldgSqft']))
print(np.mean(res_size.where('Biz')['BldgSqft']))
print(np.mean(res_size.where('Biz',False)['BldgSqft']))
In [64]:
np.mean(res_size.where('Biz',True)['BldgSqft'])-np.mean(res_size.where('Biz',False)['BldgSqft'])
Out[64]:
In [65]:
# Permuted version of the biz column
permuted_biz = res_size.select(['Biz']).sample()
res_size['pBiz']=permuted_biz['Biz']
res_size
Out[65]:
In [66]:
num_permutations = 300
for i in range(num_permutations):
    permuted_biz = res_size.select(['Biz']).sample()
    res_size['pBiz-' + str(i)] = permuted_biz['Biz']
res_size
Out[66]:
In [67]:
# Under the null hypothesis (holding a business license is unrelated to
# building size), compute the same statistic for each permuted labeling
null_hypothesis = Table.empty(['diff mean'])
for i in range(num_permutations):
    p = 'pBiz-' + str(i)
    stat = np.mean(res_size.where(p, True)['BldgSqft']) - np.mean(res_size.where(p, False)['BldgSqft'])
    null_hypothesis.append([stat])
null_hypothesis.hist(bins=20)
In [68]:
observation = np.mean(res_size.where('Biz',True)['BldgSqft'])-np.mean(res_size.where('Biz',False)['BldgSqft'])
In [69]:
observation
Out[69]:
In [70]:
def ptest(x):
    # Is a permuted difference larger than the observed one?
    return x > observation
In [71]:
null_hypothesis['ptest'] = null_hypothesis.apply(ptest,'diff mean')
null_hypothesis
Out[71]:
In [72]:
p_value = null_hypothesis.where('ptest').num_rows/num_permutations
p_value
Out[72]:
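A small p-value means that few of the permuted labelings produce a difference as large as the one we observed, so the size gap is unlikely to be chance - the answer promised at the top: license holders really do live in larger homes, by about the size of a study.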