State to State Migration Data Munging

For a forecoming installment of The Odds Must Be Crazy, I was doing some research and wanted to know a bit about the movement between states in the US. I myself moved from Illinois to Calfornia and have anecdotally observed many people with similar stories. I wanted to know more about patterns in state-to-state moves, and I found the US census tracks this data.

I quickly found a spreadsheet available containing this data, but wasn't able to readily find a way to pull this via API. I realized I could probably parse and clean this spreadsheet faster than I could location the proper endpoint, so that's what I did. My script is below and transforms the excel file into a more useful tab separated file, which I'll be using in a data visualization project shortly.


In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import urllib2
import math
import os

In [3]:
fname = 'State_to_State_Migrations_Table_2014.xls'
files = os.listdir('.')
try:
    i = files.index(fname)
except ValueError:
    url = 'http://www.census.gov/hhes/migration/files/acs/st-to-st/' + fname
    r = urllib2.urlopen(url)
    f = open(fname, 'wb')
    f.write(r.read())
    f.close()

In [4]:
df = pd.read_excel(fname, header=False, skiprows=4)

In [5]:
# Remove the repeated "Current residence" columns
filter = df.iloc[0]=='Current residence'
template = np.arange(len(filter))
redundant = template[np.array(filter)]
redundant = redundant[1:]
df = df.drop(df.columns[redundant], axis=1)

In [6]:
# Discard margin of error
filter = df.iloc[2]=='MOE'
template = np.arange(len(filter))
moe_cols = template[np.array(filter)]
df = df.drop(df.columns[moe_cols], axis=1)

In [7]:
# Remove all empty rows

def soft_isnan(x):
    if type(x) == float:
        return math.isnan(x)
    else:
        return False

headers = []
for i in range(df.shape[0]):
    if sum(df.iloc[i].apply(soft_isnan)) == df.shape[1]:
        headers.append(i)

headers = headers[1:]
df = df.drop(df.index[headers])

In [8]:
# Remove footer
df.index = np.arange(df.shape[0])
x = df.index[df[df.columns[0]]=="Footnotes:"][0]
df = df.iloc[0:x]

In [9]:
# Label Estimate column with state name
x = df.ix[0][0:5].tolist()
x.extend(df.ix[1][5:].tolist())
df.columns = map(lambda x: x.strip(), x)

In [10]:
# Remove repeated headers
df = df.ix[4:]
filter = df['Current residence'] != 'Current residence'
df = df[filter]
filter = df['Current residence'].apply(lambda x: type(x) != float)
df = df[filter]

In [11]:
# Set index
df = df.drop(4, axis=0)
df = df.set_index('Current residence')

In [12]:
df.head()


Out[12]:
Population 1 year and over Same house 1 year ago Same state of residence 1 year ago Different state of residence 1 year ago Alabama Alaska Arizona Arkansas California Colorado ... Vermont Virginia Washington West Virginia Wisconsin Wyoming Total Puerto Rico U.S. Island Area Foreign Country
Current residence
Alabama 4791931 4081522 589211 107919 NaN 1136 837 1450 5402 2019 ... 179 1925 1143 1162 234 307 13279 320 46 12913
Alaska 725206 585916 97358 37842 603 NaN 2049 240 3419 2017 ... 115 684 4841 0 459 107 4090 118 0 3972
Arizona 6651964 5431241 922129 249730 785 3140 NaN 1180 57446 9663 ... 661 3202 13152 32 5948 2670 48864 679 332 47853
Arkansas 2931123 2464189 383087 74741 2145 66 782 NaN 5070 1149 ... 0 1250 777 303 822 34 9106 148 241 8717
California 38340324 32849251 4659084 513968 10439 7098 35200 1570 NaN 17500 ... 1041 17847 32837 897 6451 1401 318021 4317 1752 311952

5 rows × 59 columns


In [13]:
df.to_csv('s2s_migration.tsv', index=True, sep="\t")