(c) Karen Belita
Team Neighborhood Change
Last Updated 9/7/2016

NOTEBOOK: Wrangling for CBSA-CITY-STATE

To create a table that has the CBSA code for City and State abbreviation that belongs to that MSA that. MSA = Metropolitan Statistical Area

OUTPUT

CBSA-CITY_STATE.csv --- for ing_wr_walkscore and ing_wr_FTA

DEPENDENCIES

  • MSA_principal.xls
  • MSA_STATE.xls

In [6]:
import os
import requests
import pandas as pd
import csv
import urllib2
import openpyxl
import csv
import xlrd
import numpy as np

WRANGLING

Part 1

  • Read xls file into csv and read csv into variable 'df'
  • Relabel columns
  • Drop unnecessary columns to create variable 'df_ABB'
  • 'df_ABB' contains CBSA code of the MSA, title for that CBSA, principal city that MSA and the State it belongs in State FIPS format

In [8]:
xls_file = os.path.join(os.getcwd(), 'MSA_principal.xls')

data_xls = pd.read_excel(xls_file, 'List 2', index_col=None)
data_xls.to_csv('principal.csv', encoding='utf-8')

In [9]:
df = pd.read_csv('principal.csv')

In [10]:
df.columns = ["DELETE", "CBSA", "CBSA_TITLE", "CBSA TYPE", "PRINCIPAL", "FIPS_STATE", "FIPS_PLACE"]

In [14]:
df.head(10)


Out[14]:
DELETE CBSA CBSA_TITLE CBSA TYPE PRINCIPAL FIPS_STATE FIPS_PLACE
0 0 List 2. PRINCIPAL CITIES OF METROPOLITAN AND M... NaN NaN NaN NaN NaN
1 1 CBSA Code CBSA Title Metropolitan/Micropolitan Statistical Area Principal City Name FIPS State Code FIPS Place Code
2 2 10100 Aberdeen, SD Micropolitan Statistical Area Aberdeen 46 00100
3 3 10140 Aberdeen, WA Micropolitan Statistical Area Aberdeen 53 00100
4 4 10180 Abilene, TX Metropolitan Statistical Area Abilene 48 01000
5 5 10220 Ada, OK Micropolitan Statistical Area Ada 40 00200
6 6 10260 Adjuntas, PR Micropolitan Statistical Area Adjuntas 72 00358
7 7 10300 Adrian, MI Micropolitan Statistical Area Adrian 26 00440
8 8 10380 Aguadilla-Isabela, PR Metropolitan Statistical Area Aguadilla 72 00745
9 9 10380 Aguadilla-Isabela, PR Metropolitan Statistical Area Isabela 72 36478

In [15]:
df = df.drop([0,1]) ## drop rows

In [16]:
df.head()


Out[16]:
DELETE CBSA CBSA_TITLE CBSA TYPE PRINCIPAL FIPS_STATE FIPS_PLACE
2 2 10100 Aberdeen, SD Micropolitan Statistical Area Aberdeen 46 00100
3 3 10140 Aberdeen, WA Micropolitan Statistical Area Aberdeen 53 00100
4 4 10180 Abilene, TX Metropolitan Statistical Area Abilene 48 01000
5 5 10220 Ada, OK Micropolitan Statistical Area Ada 40 00200
6 6 10260 Adjuntas, PR Micropolitan Statistical Area Adjuntas 72 00358

In [17]:
df.drop(df.columns[[0]], axis=1, inplace=True) ## remove column 0

In [18]:
df["FIPS"] = df["FIPS_STATE"].map(str) + df["FIPS_PLACE"].map(str) ## creates new wrong column called MSA

In [19]:
df.head()


Out[19]:
CBSA CBSA_TITLE CBSA TYPE PRINCIPAL FIPS_STATE FIPS_PLACE FIPS
2 10100 Aberdeen, SD Micropolitan Statistical Area Aberdeen 46 00100 4600100
3 10140 Aberdeen, WA Micropolitan Statistical Area Aberdeen 53 00100 5300100
4 10180 Abilene, TX Metropolitan Statistical Area Abilene 48 01000 4801000
5 10220 Ada, OK Micropolitan Statistical Area Ada 40 00200 4000200
6 10260 Adjuntas, PR Micropolitan Statistical Area Adjuntas 72 00358 7200358

In [24]:
df_ABB = df.drop(df.columns[[2,5,6]], axis=1) ## remove FIPS and FIPS_PLACE

In [25]:
df_ABB.drop(df_ABB.tail(4).index, inplace=True) ## remove the bottom

In [26]:
df_ABB.tail()


Out[26]:
CBSA CBSA_TITLE PRINCIPAL FIPS_STATE
1262 49660 Youngstown-Warren-Boardman, OH-PA Youngstown 39
1263 49700 Yuba City, CA Yuba City 06
1264 49740 Yuma, AZ Yuma 04
1265 49780 Zanesville, OH Zanesville 39
1266 49820 Zapata, TX Zapata 48

In [23]:
df_ABB.head()


Out[23]:
CBSA CBSA_TITLE CBSA TYPE PRINCIPAL FIPS_STATE
2 10100 Aberdeen, SD Micropolitan Statistical Area Aberdeen 46
3 10140 Aberdeen, WA Micropolitan Statistical Area Aberdeen 53
4 10180 Abilene, TX Metropolitan Statistical Area Abilene 48
5 10220 Ada, OK Micropolitan Statistical Area Ada 40
6 10260 Adjuntas, PR Micropolitan Statistical Area Adjuntas 72

Part 2

  • Read xls file into csv and read csv into variable dfs_x
  • Drop unnecessary columns
  • 'dfs_x' ontains the State Abbreviation and the Fips state code

In [27]:
xls_state = os.path.join(os.getcwd(), 'MSA_STATE.xls')

data_state= pd.read_excel(xls_state, 'cqr_universe_fixedwidth_all', index_col=None)
data_state.to_csv('state.csv', encoding='utf-8')

In [28]:
dfs_x= pd.read_csv('state.csv')

In [29]:
dfs_x.drop(dfs_x.columns[[0,3,4,5,6,7]], axis=1, inplace=True) ## remove column 0

In [30]:
dfs_x.columns = ["STATE_ABB", "FIPS_STATE"]

In [31]:
dfs_x["FIPS_STATE"] = dfs_x.FIPS_STATE.apply(lambda x: str(int(x)).zfill(2))

In [32]:
dfs_x.head()


Out[32]:
STATE_ABB FIPS_STATE
0 AL 01
1 AL 01
2 AL 01
3 AL 01
4 AL 01

Part 3

  • Merge 'df_ABB' and 'dfs_x' from part 1 and part 2
  • This merges on 'FIPS STATE'
  • Drop unnecessary columns
  • new dataframe 'df_c_abb' that contains CBSA codes of MSAs, then the city that belongs to them with the state (abbreviation) it belongs to
  • from new dataframe 'df_c_abb' create new csv called 'CBSA-CITY-STATE.csv'

In [65]:
df_c_abb = pd.merge(df_ABB, dfs_x, how = "left", on = "FIPS_STATE")

In [66]:
df_c_abb = df_c_abb.drop_duplicates(take_last=True) ## REMOVE DUPLICATES


/Users/kbelita/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:1: FutureWarning: the take_last=True keyword is deprecated, use keep='last' instead
  if __name__ == '__main__':

In [67]:
df_c_abb.head()


Out[67]:
CBSA CBSA_TITLE PRINCIPAL FIPS_STATE STATE_ABB
1331 10100 Aberdeen, SD Aberdeen 46 SD
1702 10140 Aberdeen, WA Aberdeen 53 WA
3341 10180 Abilene, TX Abilene 48 TX
4165 10220 Ada, OK Ada 40 OK
4505 10260 Adjuntas, PR Adjuntas 72 PR

In [68]:
df_c_abb.reset_index(level=0, inplace=True) ## reset

In [69]:
df_c_abb.head()


Out[69]:
index CBSA CBSA_TITLE PRINCIPAL FIPS_STATE STATE_ABB
0 1331 10100 Aberdeen, SD Aberdeen 46 SD
1 1702 10140 Aberdeen, WA Aberdeen 53 WA
2 3341 10180 Abilene, TX Abilene 48 TX
3 4165 10220 Ada, OK Ada 40 OK
4 4505 10260 Adjuntas, PR Adjuntas 72 PR

In [70]:
df_c_abb.drop(df_c_abb.columns[[0,2,4]], axis=1, inplace=True) ## remove column 0

In [71]:
df_c_abb.head()


Out[71]:
CBSA PRINCIPAL STATE_ABB
0 10100 Aberdeen SD
1 10140 Aberdeen WA
2 10180 Abilene TX
3 10220 Ada OK
4 10260 Adjuntas PR

In [72]:
df_c_abb.columns = ["CBSA", "CITY", "STATE"]

In [73]:
df_c_abb.head()


Out[73]:
CBSA CITY STATE
0 10100 Aberdeen SD
1 10140 Aberdeen WA
2 10180 Abilene TX
3 10220 Ada OK
4 10260 Adjuntas PR

In [40]:
df_c_abb.to_csv("CBSA-CITY-ST.csv")

In [ ]: