State to State Migration Data Munging

For a forecoming installment of The Odds Must Be Crazy, I was doing some research and wanted to know a bit about the movement between states in the US. I myself moved from Illinois to Calfornia and have anecdotally observed many people with similar stories. I wanted to know more about patterns in state-to-state moves, and I found the US census tracks this data.

I quickly found a spreadsheet available containing this data, but wasn't able to readily find a way to pull this via API. I realized I could probably parse and clean this spreadsheet faster than I could location the proper endpoint, so that's what I did. My script is below and transforms the excel file into a more useful tab separated file, which I'll be using in a data visualization project shortly.



In [2]:

    
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import urllib2
import math
import os



In [3]:

    
fname = 'State_to_State_Migrations_Table_2014.xls'
files = os.listdir('.')
try:
    i = files.index(fname)
except ValueError:
    url = 'http://www.census.gov/hhes/migration/files/acs/st-to-st/' + fname
    r = urllib2.urlopen(url)
    f = open(fname, 'wb')
    f.write(r.read())
    f.close()



In [4]:

    
df = pd.read_excel(fname, header=False, skiprows=4)



In [5]:

    
# Remove the repeated "Current residence" columns
filter = df.iloc[0]=='Current residence'
template = np.arange(len(filter))
redundant = template[np.array(filter)]
redundant = redundant[1:]
df = df.drop(df.columns[redundant], axis=1)



In [6]:

    
# Discard margin of error
filter = df.iloc[2]=='MOE'
template = np.arange(len(filter))
moe_cols = template[np.array(filter)]
df = df.drop(df.columns[moe_cols], axis=1)



In [7]:

    
# Remove all empty rows

def soft_isnan(x):
    if type(x) == float:
        return math.isnan(x)
    else:
        return False

headers = []
for i in range(df.shape[0]):
    if sum(df.iloc[i].apply(soft_isnan)) == df.shape[1]:
        headers.append(i)

headers = headers[1:]
df = df.drop(df.index[headers])



In [8]:

    
# Remove footer
df.index = np.arange(df.shape[0])
x = df.index[df[df.columns[0]]=="Footnotes:"][0]
df = df.iloc[0:x]



In [9]:

    
# Label Estimate column with state name
x = df.ix[0][0:5].tolist()
x.extend(df.ix[1][5:].tolist())
df.columns = map(lambda x: x.strip(), x)



In [10]:

    
# Remove repeated headers
df = df.ix[4:]
filter = df['Current residence'] != 'Current residence'
df = df[filter]
filter = df['Current residence'].apply(lambda x: type(x) != float)
df = df[filter]



In [11]:

    
# Set index
df = df.drop(4, axis=0)
df = df.set_index('Current residence')



In [12]:

    
df.head()









    Out[12]:






  
    
      
      Population 1 year and over
      Same house 1 year ago
      Same state of residence 1 year ago
      Different state of residence 1 year ago
      Alabama
      Alaska
      Arizona
      Arkansas
      California
      Colorado
      ...
      Vermont
      Virginia
      Washington
      West Virginia
      Wisconsin
      Wyoming
      Total
      Puerto Rico
      U.S. Island Area
      Foreign Country
    
    
      Current residence
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Alabama
        4791931
        4081522
        589211
       107919
         NaN
       1136
         837
       1450
        5402
        2019
      ...
        179
        1925
        1143
       1162
        234
        307
        13279
        320
         46
        12913
    
    
      Alaska
         725206
         585916
         97358
        37842
         603
        NaN
        2049
        240
        3419
        2017
      ...
        115
         684
        4841
          0
        459
        107
         4090
        118
          0
         3972
    
    
      Arizona
        6651964
        5431241
        922129
       249730
         785
       3140
         NaN
       1180
       57446
        9663
      ...
        661
        3202
       13152
         32
       5948
       2670
        48864
        679
        332
        47853
    
    
      Arkansas
        2931123
        2464189
        383087
        74741
        2145
         66
         782
        NaN
        5070
        1149
      ...
          0
        1250
         777
        303
        822
         34
         9106
        148
        241
         8717
    
    
      California
       38340324
       32849251
       4659084
       513968
       10439
       7098
       35200
       1570
         NaN
       17500
      ...
       1041
       17847
       32837
        897
       6451
       1401
       318021
       4317
       1752
       311952
    
  

5 rows × 59 columns



In [13]:

    
df.to_csv('s2s_migration.tsv', index=True, sep="\t")

	Population 1 year and over	Same house 1 year ago	Same state of residence 1 year ago	Different state of residence 1 year ago	Alabama	Alaska	Arizona	Arkansas	California	Colorado	...	Vermont	Virginia	Washington	West Virginia	Wisconsin	Wyoming	Total	Puerto Rico	U.S. Island Area	Foreign Country
Current residence
Alabama	4791931	4081522	589211	107919	NaN	1136	837	1450	5402	2019	...	179	1925	1143	1162	234	307	13279	320	46	12913
Alaska	725206	585916	97358	37842	603	NaN	2049	240	3419	2017	...	115	684	4841	0	459	107	4090	118	0	3972
Arizona	6651964	5431241	922129	249730	785	3140	NaN	1180	57446	9663	...	661	3202	13152	32	5948	2670	48864	679	332	47853
Arkansas	2931123	2464189	383087	74741	2145	66	782	NaN	5070	1149	...	0	1250	777	303	822	34	9106	148	241	8717
California	38340324	32849251	4659084	513968	10439	7098	35200	1570	NaN	17500	...	1041	17847	32837	897	6451	1401	318021	4317	1752	311952