For a forecoming installment of The Odds Must Be Crazy, I was doing some research and wanted to know a bit about the movement between states in the US. I myself moved from Illinois to Calfornia and have anecdotally observed many people with similar stories. I wanted to know more about patterns in state-to-state moves, and I found the US census tracks this data.
I quickly found a spreadsheet available containing this data, but wasn't able to readily find a way to pull this via API. I realized I could probably parse and clean this spreadsheet faster than I could location the proper endpoint, so that's what I did. My script is below and transforms the excel file into a more useful tab separated file, which I'll be using in a data visualization project shortly.
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import urllib2
import math
import os
In [3]:
fname = 'State_to_State_Migrations_Table_2014.xls'
files = os.listdir('.')
try:
i = files.index(fname)
except ValueError:
url = 'http://www.census.gov/hhes/migration/files/acs/st-to-st/' + fname
r = urllib2.urlopen(url)
f = open(fname, 'wb')
f.write(r.read())
f.close()
In [4]:
df = pd.read_excel(fname, header=False, skiprows=4)
In [5]:
# Remove the repeated "Current residence" columns
filter = df.iloc[0]=='Current residence'
template = np.arange(len(filter))
redundant = template[np.array(filter)]
redundant = redundant[1:]
df = df.drop(df.columns[redundant], axis=1)
In [6]:
# Discard margin of error
filter = df.iloc[2]=='MOE'
template = np.arange(len(filter))
moe_cols = template[np.array(filter)]
df = df.drop(df.columns[moe_cols], axis=1)
In [7]:
# Remove all empty rows
def soft_isnan(x):
if type(x) == float:
return math.isnan(x)
else:
return False
headers = []
for i in range(df.shape[0]):
if sum(df.iloc[i].apply(soft_isnan)) == df.shape[1]:
headers.append(i)
headers = headers[1:]
df = df.drop(df.index[headers])
In [8]:
# Remove footer
df.index = np.arange(df.shape[0])
x = df.index[df[df.columns[0]]=="Footnotes:"][0]
df = df.iloc[0:x]
In [9]:
# Label Estimate column with state name
x = df.ix[0][0:5].tolist()
x.extend(df.ix[1][5:].tolist())
df.columns = map(lambda x: x.strip(), x)
In [10]:
# Remove repeated headers
df = df.ix[4:]
filter = df['Current residence'] != 'Current residence'
df = df[filter]
filter = df['Current residence'].apply(lambda x: type(x) != float)
df = df[filter]
In [11]:
# Set index
df = df.drop(4, axis=0)
df = df.set_index('Current residence')
In [12]:
df.head()
Out[12]:
In [13]:
df.to_csv('s2s_migration.tsv', index=True, sep="\t")