There was a nice recent blog post that showed the results of joining 2010 U.S. Census data and 2007-2011 American Community Survey data at the zip code level. I downloaded the data as a csv file from the Fusion Table site. There's geographical, income, demographic, education, and housing data. We know that these must all be related in non-trivial ways. For this exploration, we'll work on using the data to predict one of the columns.
(c) wise.io 2012-2013
In [2]:
%pylab inline
import pandas as pd
In [2]:
census = pd.read_csv("/Users/jbloom/Downloads/census_2010_acs.csv")
In [3]:
print len(census), len(census.columns)
print census.dtypes
I think it would be interesting to predict the median house value given all the other variables. So let's start by inspecting that distribution.
In [4]:
print census.median_house_value.median(), census.median_house_value.min(), census.median_house_value.max()
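A quick histogram makes the shape of the distribution (and its suspicious tails) easier to see. This is just a sketch using the pylab namespace loaded above:

# Log-scaled histogram of median house values across zip codes
hist(census.median_house_value.values, bins=100, log=True)
xlabel("median house value [$]")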
The median seems sensible, but the minimum and maximum look suspicious. It's not clear why there are over 1000 outlier zip codes, but so be it.
In [5]:
# Split off zip codes with implausible median house values (< $100 or > $1M)
outliers = census[(census.median_house_value < 100) | (census.median_house_value > 1e6)]
census = census[(census.median_house_value >= 100) & (census.median_house_value <= 1e6)]
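As a quick sanity check on the counts:

# How many zip codes did we flag, and how many remain?
print len(outliers), "outliers;", len(census), "zip codes kept"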
In [6]:
print census.median_house_value.median(), census.median_house_value.min(), census.median_house_value.max()
Now let's save these two dataframes. We can build a regression model on house value using the non-outliers and predict what the outliers should be.
In [26]:
census.to_csv("census.clean.csv")
outliers.to_csv("census.outliers.csv")
We can also break the census data up into training and testing sets, to get a feel for how well we should expect to do on unseen data.
In [36]:
# Shuffle the row indices so the train/test split is random
indices = census.index.values.copy()
np.random.shuffle(indices)
In [43]:
train = census.ix[indices[:25000]]  # first 25,000 shuffled zip codes
test = census.ix[indices[25000:]]   # the remainder, held out for testing
print len(train), len(test)
In [44]:
train.to_csv("census.train.csv")
test.to_csv("census.test.csv")
We're going to use the wise.io platform to build some models, just to get a feel for how the workflow looks graphically. However, we could do all of the below using the Python API (check out the docs for some worked examples that do not use the graphical interface).
If you don't have an account on the platform, send us an email (contact@wise.io) to request access.
The first step is to upload the files:
The next step is to learn a model. I decided not to include postal_code as one of the features, since relative locations are already encoded in the latitude and longitude columns.
We can take a look at feature importance after building the model:
Let's take a look at how we did by predicting the test set with a model built on just the training set.
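The platform handles the model building here, but for readers following along without an account, here's a rough local stand-in using scikit-learn's RandomForestRegressor. This is an assumption on my part: wise.io's actual model and settings aren't specified, and the fillna(0) imputation (plus the assumption that all remaining columns are numeric) is just for the sketch.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

train = pd.read_csv("census.train.csv", index_col=0)
test = pd.read_csv("census.test.csv", index_col=0)

# Use every column except the target and the postal_code identifier
feature_cols = [c for c in train.columns
                if c not in ("median_house_value", "postal_code")]

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(train[feature_cols].fillna(0), train.median_house_value.values)

# Rough analogue of the platform's feature-importance view: top 10 features
for imp, name in sorted(zip(rf.feature_importances_, feature_cols), reverse=True)[:10]:
    print "%.3f  %s" % (imp, name)

# Held-out performance on the test set (R^2)
print "test R^2:", rf.score(test[feature_cols].fillna(0), test.median_house_value.values)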
We can download predictions from the models. For what's below, I downloaded the predictions for the outliers.
In [45]:
outlier_predictions = pd.read_csv("/Users/jbloom/Downloads/predictions_a21464b675974d518c5085374e67f6f5.csv")
Let's visualize the predictions in California.
In [109]:
figure(figsize=(8,8))
from mpl_toolkits.basemap import Basemap
import matplotlib as mpl

# Lambert conformal projection roughly centered on the SF Bay Area
map = Basemap(width=620000, height=580000,
              rsphere=(6378137.00, 6356752.3142),
              resolution='l', area_thresh=1000., projection='lcc',
              lat_1=38., lat_2=37, lat_0=38, lon_0=-122.)
map.drawcoastlines()
map.drawcountries()
map.drawstates()
x, y = map(outliers.longitude.values, outliers.latitude.values)

# Scale marker size and color by the predicted median house value
vals = outlier_predictions.median_house_value.values
nn = vals.max() - vals.min()  # value range, to normalize the colormap input
map.scatter(x, y, s=20*vals/1e6, marker='o',
            color=mpl.cm.seismic((vals - vals.min())/nn))
draw()
Red indicates larger and blue smaller median home values. Seems sensible, at least in broad strokes. Here's the non-outlier dataset:
In [110]:
figure(figsize=(8,8))
map = Basemap(width=620000, height=580000,
              rsphere=(6378137.00, 6356752.3142),
              resolution='l', area_thresh=1000., projection='lcc',
              lat_1=38., lat_2=37, lat_0=38, lon_0=-122.)
map.drawcoastlines()
map.drawcountries()
map.drawstates()
x, y = map(census.longitude.values, census.latitude.values)

# Same scaling as above, now using the actual median house values
vals = census.median_house_value.values
nn = vals.max() - vals.min()
map.scatter(x, y, s=20*vals/1e6, marker='o',
            color=mpl.cm.seismic((vals - vals.min())/nn))
draw()
In [7]:
preds = pd.read_csv("/Users/jbloom/Downloads/predictions_41b66ea7bc1844499c316a0eccaa0cfa.csv")
Let's see how well we do predicting the data itself.
In [51]:
scatter(census.median_house_value.values, preds.median_house_value.values, alpha=0.1)
plot([0,1e6], [0,1e6])  # the 1:1 line of perfect prediction
xlim(10000, 1e6)
ylim(10000, 1e6)
xlabel("median house value [$]")
ylabel("predicted house value [$]")
Out[51]: [scatter plot: predicted vs. actual median house value, with the 1:1 line]
Not too bad... though there looks to be a systematic underprediction at the high end. There are also some interesting outliers where the prediction sits far above the reported value, suggesting the median prices there were much lower than the model expects. Perhaps there's something going on in these places.
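To put a rough number on that high-end bias (a sketch; the $500k cutoff is arbitrary):

# Mean relative error for the most expensive zip codes
expensive = census.median_house_value.values > 5e5
rel_err = (preds.median_house_value.values - census.median_house_value.values)/census.median_house_value.values
print "mean relative error above $500k: %.3f" % rel_err[expensive].mean()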
In [13]:
# Relative prediction error for every zip code
tmp = (preds.median_house_value.values - census.median_house_value.values)/census.median_house_value.values
In [44]:
scatter(census.median_house_value.values, tmp)
ylim(0.0,2)
xlim(100000,200000)
Out[44]: [scatter of relative prediction error vs. median house value, zoomed to the $100k-$200k band]
In [47]:
# Zip codes in the $160k-$180k band whose predicted value is more than
# 2.5x the reported value (relative error tmp > 1.5)
i = census[(census.median_house_value.values > 160000) &
           (census.median_house_value.values < 180000) &
           (tmp > 1.5)]
In [49]:
i.postal_code
Out[49]: [the postal codes of the flagged zip codes, including 95134]
Zip code 95134 is in San Jose, right near the airport and at the bottom of the Bay. It's noisy and it smells like Bay-stank.