NOTEBOOK: Wrangling for CBSA-CITY-STATE

To create a table that has the CBSA code for City and State abbreviation that belongs to that MSA that. MSA = Metropolitan Statistical Area

OUTPUT

CBSA-CITY_STATE.csv --- for ing_wr_walkscore and ing_wr_FTA

DEPENDENCIES

MSA_principal.xls
MSA_STATE.xls



In [6]:

    
import os
import requests
import pandas as pd
import csv
import urllib2
import openpyxl
import csv
import xlrd
import numpy as np

WRANGLING

Part 1

Read xls file into csv and read csv into variable 'df'
Relabel columns
Drop unnecessary columns to create variable 'df_ABB'
'df_ABB' contains CBSA code of the MSA, title for that CBSA, principal city that MSA and the State it belongs in State FIPS format



In [8]:

    
xls_file = os.path.join(os.getcwd(), 'MSA_principal.xls')

data_xls = pd.read_excel(xls_file, 'List 2', index_col=None)
data_xls.to_csv('principal.csv', encoding='utf-8')



In [9]:

    
df = pd.read_csv('principal.csv')



In [10]:

    
df.columns = ["DELETE", "CBSA", "CBSA_TITLE", "CBSA TYPE", "PRINCIPAL", "FIPS_STATE", "FIPS_PLACE"]



In [14]:

    
df.head(10)









    Out[14]:






  
    
      
      DELETE
      CBSA
      CBSA_TITLE
      CBSA TYPE
      PRINCIPAL
      FIPS_STATE
      FIPS_PLACE
    
  
  
    
      0
      0
      List 2. PRINCIPAL CITIES OF METROPOLITAN AND M...
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      1
      1
      CBSA Code
      CBSA Title
      Metropolitan/Micropolitan Statistical Area
      Principal City Name
      FIPS State Code
      FIPS Place Code
    
    
      2
      2
      10100
      Aberdeen, SD
      Micropolitan Statistical Area
      Aberdeen
      46
      00100
    
    
      3
      3
      10140
      Aberdeen, WA
      Micropolitan Statistical Area
      Aberdeen
      53
      00100
    
    
      4
      4
      10180
      Abilene, TX
      Metropolitan Statistical Area
      Abilene
      48
      01000
    
    
      5
      5
      10220
      Ada, OK
      Micropolitan Statistical Area
      Ada
      40
      00200
    
    
      6
      6
      10260
      Adjuntas, PR
      Micropolitan Statistical Area
      Adjuntas
      72
      00358
    
    
      7
      7
      10300
      Adrian, MI
      Micropolitan Statistical Area
      Adrian
      26
      00440
    
    
      8
      8
      10380
      Aguadilla-Isabela, PR
      Metropolitan Statistical Area
      Aguadilla
      72
      00745
    
    
      9
      9
      10380
      Aguadilla-Isabela, PR
      Metropolitan Statistical Area
      Isabela
      72
      36478



In [15]:

    
df = df.drop([0,1]) ## drop rows



In [16]:

    
df.head()









    Out[16]:






  
    
      
      DELETE
      CBSA
      CBSA_TITLE
      CBSA TYPE
      PRINCIPAL
      FIPS_STATE
      FIPS_PLACE
    
  
  
    
      2
      2
      10100
      Aberdeen, SD
      Micropolitan Statistical Area
      Aberdeen
      46
      00100
    
    
      3
      3
      10140
      Aberdeen, WA
      Micropolitan Statistical Area
      Aberdeen
      53
      00100
    
    
      4
      4
      10180
      Abilene, TX
      Metropolitan Statistical Area
      Abilene
      48
      01000
    
    
      5
      5
      10220
      Ada, OK
      Micropolitan Statistical Area
      Ada
      40
      00200
    
    
      6
      6
      10260
      Adjuntas, PR
      Micropolitan Statistical Area
      Adjuntas
      72
      00358



In [17]:

    
df.drop(df.columns[[0]], axis=1, inplace=True) ## remove column 0



In [18]:

    
df["FIPS"] = df["FIPS_STATE"].map(str) + df["FIPS_PLACE"].map(str) ## creates new wrong column called MSA



In [19]:

    
df.head()









    Out[19]:






  
    
      
      CBSA
      CBSA_TITLE
      CBSA TYPE
      PRINCIPAL
      FIPS_STATE
      FIPS_PLACE
      FIPS
    
  
  
    
      2
      10100
      Aberdeen, SD
      Micropolitan Statistical Area
      Aberdeen
      46
      00100
      4600100
    
    
      3
      10140
      Aberdeen, WA
      Micropolitan Statistical Area
      Aberdeen
      53
      00100
      5300100
    
    
      4
      10180
      Abilene, TX
      Metropolitan Statistical Area
      Abilene
      48
      01000
      4801000
    
    
      5
      10220
      Ada, OK
      Micropolitan Statistical Area
      Ada
      40
      00200
      4000200
    
    
      6
      10260
      Adjuntas, PR
      Micropolitan Statistical Area
      Adjuntas
      72
      00358
      7200358



In [24]:

    
df_ABB = df.drop(df.columns[[2,5,6]], axis=1) ## remove FIPS and FIPS_PLACE



In [25]:

    
df_ABB.drop(df_ABB.tail(4).index, inplace=True) ## remove the bottom



In [26]:

    
df_ABB.tail()









    Out[26]:






  
    
      
      CBSA
      CBSA_TITLE
      PRINCIPAL
      FIPS_STATE
    
  
  
    
      1262
      49660
      Youngstown-Warren-Boardman, OH-PA
      Youngstown
      39
    
    
      1263
      49700
      Yuba City, CA
      Yuba City
      06
    
    
      1264
      49740
      Yuma, AZ
      Yuma
      04
    
    
      1265
      49780
      Zanesville, OH
      Zanesville
      39
    
    
      1266
      49820
      Zapata, TX
      Zapata
      48



In [23]:

    
df_ABB.head()









    Out[23]:






  
    
      
      CBSA
      CBSA_TITLE
      CBSA TYPE
      PRINCIPAL
      FIPS_STATE
    
  
  
    
      2
      10100
      Aberdeen, SD
      Micropolitan Statistical Area
      Aberdeen
      46
    
    
      3
      10140
      Aberdeen, WA
      Micropolitan Statistical Area
      Aberdeen
      53
    
    
      4
      10180
      Abilene, TX
      Metropolitan Statistical Area
      Abilene
      48
    
    
      5
      10220
      Ada, OK
      Micropolitan Statistical Area
      Ada
      40
    
    
      6
      10260
      Adjuntas, PR
      Micropolitan Statistical Area
      Adjuntas
      72

Part 2

Read xls file into csv and read csv into variable dfs_x
Drop unnecessary columns
'dfs_x' ontains the State Abbreviation and the Fips state code



In [27]:

    
xls_state = os.path.join(os.getcwd(), 'MSA_STATE.xls')

data_state= pd.read_excel(xls_state, 'cqr_universe_fixedwidth_all', index_col=None)
data_state.to_csv('state.csv', encoding='utf-8')



In [28]:

    
dfs_x= pd.read_csv('state.csv')



In [29]:

    
dfs_x.drop(dfs_x.columns[[0,3,4,5,6,7]], axis=1, inplace=True) ## remove column 0



In [30]:

    
dfs_x.columns = ["STATE_ABB", "FIPS_STATE"]



In [31]:

    
dfs_x["FIPS_STATE"] = dfs_x.FIPS_STATE.apply(lambda x: str(int(x)).zfill(2))



In [32]:

    
dfs_x.head()

Part 3

Merge 'df_ABB' and 'dfs_x' from part 1 and part 2
This merges on 'FIPS STATE'
Drop unnecessary columns
new dataframe 'df_c_abb' that contains CBSA codes of MSAs, then the city that belongs to them with the state (abbreviation) it belongs to
from new dataframe 'df_c_abb' create new csv called 'CBSA-CITY-STATE.csv'



In [65]:

    
df_c_abb = pd.merge(df_ABB, dfs_x, how = "left", on = "FIPS_STATE")



In [66]:

    
df_c_abb = df_c_abb.drop_duplicates(take_last=True) ## REMOVE DUPLICATES









    



/Users/kbelita/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:1: FutureWarning: the take_last=True keyword is deprecated, use keep='last' instead
  if __name__ == '__main__':



In [67]:

    
df_c_abb.head()









    Out[67]:






  
    
      
      CBSA
      CBSA_TITLE
      PRINCIPAL
      FIPS_STATE
      STATE_ABB
    
  
  
    
      1331
      10100
      Aberdeen, SD
      Aberdeen
      46
      SD
    
    
      1702
      10140
      Aberdeen, WA
      Aberdeen
      53
      WA
    
    
      3341
      10180
      Abilene, TX
      Abilene
      48
      TX
    
    
      4165
      10220
      Ada, OK
      Ada
      40
      OK
    
    
      4505
      10260
      Adjuntas, PR
      Adjuntas
      72
      PR



In [68]:

    
df_c_abb.reset_index(level=0, inplace=True) ## reset



In [69]:

    
df_c_abb.head()









    Out[69]:






  
    
      
      index
      CBSA
      CBSA_TITLE
      PRINCIPAL
      FIPS_STATE
      STATE_ABB
    
  
  
    
      0
      1331
      10100
      Aberdeen, SD
      Aberdeen
      46
      SD
    
    
      1
      1702
      10140
      Aberdeen, WA
      Aberdeen
      53
      WA
    
    
      2
      3341
      10180
      Abilene, TX
      Abilene
      48
      TX
    
    
      3
      4165
      10220
      Ada, OK
      Ada
      40
      OK
    
    
      4
      4505
      10260
      Adjuntas, PR
      Adjuntas
      72
      PR



In [70]:

    
df_c_abb.drop(df_c_abb.columns[[0,2,4]], axis=1, inplace=True) ## remove column 0



In [71]:

    
df_c_abb.head()









    Out[71]:






  
    
      
      CBSA
      PRINCIPAL
      STATE_ABB
    
  
  
    
      0
      10100
      Aberdeen
      SD
    
    
      1
      10140
      Aberdeen
      WA
    
    
      2
      10180
      Abilene
      TX
    
    
      3
      10220
      Ada
      OK
    
    
      4
      10260
      Adjuntas
      PR



In [72]:

    
df_c_abb.columns = ["CBSA", "CITY", "STATE"]



In [73]:

    
df_c_abb.head()



In [40]:

    
df_c_abb.to_csv("CBSA-CITY-ST.csv")



In [ ]:

	DELETE	CBSA	CBSA_TITLE	CBSA TYPE	PRINCIPAL	FIPS_STATE	FIPS_PLACE
0	0	List 2. PRINCIPAL CITIES OF METROPOLITAN AND M...	NaN	NaN	NaN	NaN	NaN
1	1	CBSA Code	CBSA Title	Metropolitan/Micropolitan Statistical Area	Principal City Name	FIPS State Code	FIPS Place Code
2	2	10100	Aberdeen, SD	Micropolitan Statistical Area	Aberdeen	46	00100
3	3	10140	Aberdeen, WA	Micropolitan Statistical Area	Aberdeen	53	00100
4	4	10180	Abilene, TX	Metropolitan Statistical Area	Abilene	48	01000
5	5	10220	Ada, OK	Micropolitan Statistical Area	Ada	40	00200
6	6	10260	Adjuntas, PR	Micropolitan Statistical Area	Adjuntas	72	00358
7	7	10300	Adrian, MI	Micropolitan Statistical Area	Adrian	26	00440
8	8	10380	Aguadilla-Isabela, PR	Metropolitan Statistical Area	Aguadilla	72	00745
9	9	10380	Aguadilla-Isabela, PR	Metropolitan Statistical Area	Isabela	72	36478

	CBSA	CBSA_TITLE	PRINCIPAL	FIPS_STATE
1262	49660	Youngstown-Warren-Boardman, OH-PA	Youngstown	39
1263	49700	Yuba City, CA	Yuba City	06
1264	49740	Yuma, AZ	Yuma	04
1265	49780	Zanesville, OH	Zanesville	39
1266	49820	Zapata, TX	Zapata	48

	CBSA	CBSA_TITLE	PRINCIPAL	FIPS_STATE	STATE_ABB
1331	10100	Aberdeen, SD	Aberdeen	46	SD
1702	10140	Aberdeen, WA	Aberdeen	53	WA
3341	10180	Abilene, TX	Abilene	48	TX
4165	10220	Ada, OK	Ada	40	OK
4505	10260	Adjuntas, PR	Adjuntas	72	PR