2010 Census Introspection

A nice recent blog post showed the results of joining 2010 U.S. Census data and 2007-2011 American Community Survey data at the ZIP code level. I downloaded the data as a CSV file from the Fusion Table site. There are geographical, income, demographic, education, and housing columns, and we know these must all be related in non-trivial ways. For this exploration, we'll use the data to predict one of the columns.

(c) wise.io 2012-2013

Data Inspection and Preparation


In [2]:
%pylab inline
import pandas as pd


Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

In [2]:
census = pd.read_csv("/Users/jbloom/Downloads/census_2010_acs.csv")

In [3]:
print len(census), len(census.columns)
print census.dtypes


33120 66
postal_code              int64
state                   object
region                  object
division                object
statefips                int64
latitude               float64
longitude              float64
land_area_sq_meters      int64
total_population         int64
population_density     float64
housing_unit_count       int64
white_population         int64
black_population         int64
native_population        int64
asian_population         int64
...
education_level_6thgrade_pct                      float64
education_level_8thgrade_pct                      float64
education_level_9thgrade_pct                      float64
education_level_10thgrade_pct                     float64
education_level_11thgrade_pct                     float64
education_level_12thgrade_pct                     float64
education_level_highschool_pct                    float64
education_level_college_lessthan1year_pct         float64
education_level_college_morethan1year_pct         float64
education_level_college_associates_pct            float64
education_level_college_bachelors_pct             float64
education_level_college_masters_pct               float64
education_level_college_professionaldegree_pct    float64
education_level_college_doctorate_pct             float64
avg_education_years                               float64
Length: 66, dtype: object

I think it would be interesting to predict the median house value given all the other variables, so let's start by inspecting that distribution.


In [4]:
print census.median_house_value.median(), census.median_house_value.min(), census.median_house_value.max()


122300.0 0 1000001

The median seems sensible but the max and min do not: a minimum of 0 and a maximum of exactly 1,000,001 look like sentinel or top-coded values rather than real medians. It's not clear why there are over 1000 such outlier ZIP codes, but so be it.
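One quick way to see this is to histogram the raw values. A minimal sketch, relying on the %pylab namespace loaded above, which should show the pile-ups at the two edges of the range:

# histogram of the raw median house values; spikes at 0 and 1,000,001
# would confirm the sentinel-like edge values
hist(census.median_house_value.values, bins=100)
xlabel("median house value [$]")
ylabel("number of ZIP codes")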


In [5]:
outliers = census[(census.median_house_value < 100) | (census.median_house_value > 1e6)]
census   = census[(census.median_house_value >= 100) & (census.median_house_value <= 1e6)]

In [6]:
print census.median_house_value.median(), census.median_house_value.min(), census.median_house_value.max()


126000.0 9999 1000000

Now let's save these two dataframes. We can build a regression model on house value using the non-outliers and predict what the outliers should be.


In [26]:
census.to_csv("census.clean.csv")
outliers.to_csv("census.outliers.csv")

We can also break the census data into training and test sets to get a feel for how well we should do blindly (a rough local baseline sketch follows the next few cells).


In [36]:
import copy
indices = copy.copy(census.index.values)
np.random.shuffle(indices)
indices = list(indices)

In [43]:
train = census.ix[indices[0:25000]]
test  = census.ix[indices[25000:]]
print len(train), len(test)


25000 6701

In [44]:
train.to_csv("census.train.csv")
test.to_csv("census.test.csv")
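Before moving to the platform, it can help to sanity-check locally. This isn't part of the wise.io workflow, but a minimal scikit-learn baseline might look like the sketch below (assumptions: scikit-learn is installed, and the numeric columns have no missing values; postal_code and the non-numeric columns are dropped, matching the feature choice made later):

from sklearn.ensemble import RandomForestRegressor

# keep only the numeric predictors; drop the target and identifier columns
feature_cols = [c for c in train.columns
                if c not in ("postal_code", "state", "region", "division",
                             "median_house_value")]

# fit a random forest on the training split and report R^2 on the test split
rf = RandomForestRegressor(n_estimators=100)
rf.fit(train[feature_cols].values, train.median_house_value.values)
print rf.score(test[feature_cols].values, test.median_house_value.values)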

Learning some machine learning models using wise.io

We're going to use the wise.io platform to build some models, just to get a feel for the graphical workflow. However, we could do all of the below using the Python API (check out the docs for some worked examples that do not use the graphical interface).

If you don't have an account on the platform, send us an email (contact@wise.io) to request access.

The first step is to upload the files:

The next step is to learn a model. I decided not to include postal_code as one of the features, since relative location is already encoded in the latitude and longitude columns.

We can take a look at feature importance after building the model:
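As a local analogue, the hypothetical scikit-learn baseline sketched above exposes the same idea through its feature_importances_ attribute:

# rank features by the forest's impurity-based importance scores
order = np.argsort(rf.feature_importances_)[::-1]
for idx in order[:10]:
    print "%-50s %.3f" % (feature_cols[idx], rf.feature_importances_[idx])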

Let's take a look at how we did by predicting the testing set using just the training set.

We can download predictions from the models. For what follows, I downloaded the predictions for the outliers.
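With the hypothetical local baseline above, the analogous step would be a direct call to predict (again assuming no missing values in the outlier rows):

# predict median house values for the outlier ZIP codes locally
local_outlier_preds = rf.predict(outliers[feature_cols].values)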


In [45]:
outlier_predictions = pd.read_csv("/Users/jbloom/Downloads/predictions_a21464b675974d518c5085374e67f6f5.csv")

Let's visualize the predictions in California.


In [109]:
figure(figsize=(8,8))
from mpl_toolkits.basemap import Basemap
import matplotlib as mpl

# Lambert conformal projection roughly centered on the San Francisco Bay Area
m = Basemap(width=620000, height=580000,
            rsphere=(6378137.00, 6356752.3142),
            resolution='l', area_thresh=1000., projection='lcc',
            lat_1=38., lat_2=37., lat_0=38., lon_0=-122.)

m.drawcoastlines()
m.drawcountries()
m.drawstates()

# size and color each ZIP code by its predicted median house value,
# normalizing the colormap input to [0, 1] over the predicted range
vals = outlier_predictions.median_house_value.values
nn = vals.max() - vals.min()
x, y = m(outliers.longitude.values, outliers.latitude.values)
m.scatter(x, y, s=20*vals/1e6, marker='o',
          color=mpl.cm.seismic((vals - vals.min())/nn))
draw()


Red marks larger and blue smaller median home values. That seems sensible, at least in broad strokes. Here's the same map for the non-outlier dataset:


In [110]:
figure(figsize=(8,8))
m = Basemap(width=620000, height=580000,
            rsphere=(6378137.00, 6356752.3142),
            resolution='l', area_thresh=1000., projection='lcc',
            lat_1=38., lat_2=37., lat_0=38., lon_0=-122.)

m.drawcoastlines()
m.drawcountries()
m.drawstates()

# same plot as above, but sized and colored by the actual (non-outlier)
# median house values, normalized over their own range
vals = census.median_house_value.values
nn = vals.max() - vals.min()
x, y = m(census.longitude.values, census.latitude.values)
m.scatter(x, y, s=20*vals/1e6, marker='o',
          color=mpl.cm.seismic((vals - vals.min())/nn))
draw()



In [7]:
preds = pd.read_csv("/Users/jbloom/Downloads/predictions_41b66ea7bc1844499c316a0eccaa0cfa.csv")

Let's see how well we do predicting the data itself.


In [51]:
scatter(census.median_house_value.values, preds.median_house_value.values, alpha=0.1)
plot([0,1e6],[0,1e6])
xlim(10000,1e6)
ylim(10000,1e6)
xlabel("median house value [$]")
ylabel("predicted house value [$]")


Out[51]:
<matplotlib.text.Text at 0x10ab3af10>

Not too bad... there looks to be a systematic underprediction at the high end, plus some interesting outliers where the reported median prices were much lower than the model thinks they should be. Perhaps there's something going on in those places.
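To put rough numbers on that by-eye impression, a quick sketch using the arrays already in memory:

# summary error metrics for predicted vs. actual median house value
resid = preds.median_house_value.values - census.median_house_value.values
print "RMSE: $%.0f" % np.sqrt((resid ** 2).mean())
print "median absolute error: $%.0f" % np.median(np.abs(resid))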


In [13]:
# fractional residual: (predicted - actual) / actual
tmp = (preds.median_house_value.values - census.median_house_value.values)/census.median_house_value.values

In [44]:
scatter(census.median_house_value.values, tmp)
ylim(0.0,2)
xlim(100000,200000)


Out[44]:
(100000, 200000)

In [47]:
i = census[(census.median_house_value.values > 160000) & (census.median_house_value.values < 180000) &
        (tmp > 1.5)]

In [49]:
i.postal_code


Out[49]:
11747    95134
Name: postal_code, dtype: int64

ZIP code 95134 is in San Jose, right near the airport and at the bottom of the Bay. It's noisy and it smells like Bay-stank.