This is the companion jupyter notebook to 'ageDataOver65.py'.

Purpose: open a CSV file from https://censusreporter.org/ and recode it with the corresponding JSON metadata file.

Written by CC on 20161228


In [1]:
import pandas
import numpy
import os
import ijson

In [2]:
path = os.chdir('/Users/superuser/Documents/projects/SDRegionalDataLib/age friendly community/acs2015_1yr_B01001/')

In [3]:
ageData = pandas.read_csv('acs2015_1yr_B01001.csv');
ageData.head()


Out[3]:
geoid name B01001001 B01001001, Error B01001002 B01001002, Error B01001003 B01001003, Error B01001004 B01001004, Error ... B01001045 B01001045, Error B01001046 B01001046, Error B01001047 B01001047, Error B01001048 B01001048, Error B01001049 B01001049, Error
0 16000US0666000 San Diego, CA 1394907 83 693826 6780 45036 3224 40131 3386 ... 19137 2032 19728 1815 14059 1730 12966 1649 13440 1875

1 rows × 100 columns

get a list of the column names


In [4]:
colNames = list(ageData.columns.values)

#show the first ten coded column names
colNames[0:9]


Out[4]:
['geoid',
 'name',
 'B01001001',
 'B01001001, Error',
 'B01001002',
 'B01001002, Error',
 'B01001003',
 'B01001003, Error',
 'B01001004']

In [5]:
#open the json file
jsonFile = 'metadata.json';

with open(jsonFile, 'r') as f:
    objects = ijson.items(f, 'tables.B01001.columns')
    columnAttr = list(objects)

#show the values of columnAttr
columnAttr[:]


Out[5]:
[{'B01001001': {'indent': 0, 'name': 'Total:'},
  'B01001002': {'indent': 1, 'name': 'Male:'},
  'B01001003': {'indent': 3, 'name': 'Under 5 years'},
  'B01001004': {'indent': 3, 'name': '5 to 9 years'},
  'B01001005': {'indent': 3, 'name': '10 to 14 years'},
  'B01001006': {'indent': 3, 'name': '15 to 17 years'},
  'B01001007': {'indent': 3, 'name': '18 and 19 years'},
  'B01001008': {'indent': 3, 'name': '20 years'},
  'B01001009': {'indent': 3, 'name': '21 years'},
  'B01001010': {'indent': 3, 'name': '22 to 24 years'},
  'B01001011': {'indent': 3, 'name': '25 to 29 years'},
  'B01001012': {'indent': 3, 'name': '30 to 34 years'},
  'B01001013': {'indent': 3, 'name': '35 to 39 years'},
  'B01001014': {'indent': 3, 'name': '40 to 44 years'},
  'B01001015': {'indent': 3, 'name': '45 to 49 years'},
  'B01001016': {'indent': 3, 'name': '50 to 54 years'},
  'B01001017': {'indent': 3, 'name': '55 to 59 years'},
  'B01001018': {'indent': 3, 'name': '60 and 61 years'},
  'B01001019': {'indent': 3, 'name': '62 to 64 years'},
  'B01001020': {'indent': 3, 'name': '65 and 66 years'},
  'B01001021': {'indent': 3, 'name': '67 to 69 years'},
  'B01001022': {'indent': 3, 'name': '70 to 74 years'},
  'B01001023': {'indent': 3, 'name': '75 to 79 years'},
  'B01001024': {'indent': 3, 'name': '80 to 84 years'},
  'B01001025': {'indent': 3, 'name': '85 years and over'},
  'B01001026': {'indent': 1, 'name': 'Female:'},
  'B01001027': {'indent': 3, 'name': 'Under 5 years'},
  'B01001028': {'indent': 3, 'name': '5 to 9 years'},
  'B01001029': {'indent': 3, 'name': '10 to 14 years'},
  'B01001030': {'indent': 3, 'name': '15 to 17 years'},
  'B01001031': {'indent': 3, 'name': '18 and 19 years'},
  'B01001032': {'indent': 3, 'name': '20 years'},
  'B01001033': {'indent': 3, 'name': '21 years'},
  'B01001034': {'indent': 3, 'name': '22 to 24 years'},
  'B01001035': {'indent': 3, 'name': '25 to 29 years'},
  'B01001036': {'indent': 3, 'name': '30 to 34 years'},
  'B01001037': {'indent': 3, 'name': '35 to 39 years'},
  'B01001038': {'indent': 3, 'name': '40 to 44 years'},
  'B01001039': {'indent': 3, 'name': '45 to 49 years'},
  'B01001040': {'indent': 3, 'name': '50 to 54 years'},
  'B01001041': {'indent': 3, 'name': '55 to 59 years'},
  'B01001042': {'indent': 3, 'name': '60 and 61 years'},
  'B01001043': {'indent': 3, 'name': '62 to 64 years'},
  'B01001044': {'indent': 3, 'name': '65 and 66 years'},
  'B01001045': {'indent': 3, 'name': '67 to 69 years'},
  'B01001046': {'indent': 3, 'name': '70 to 74 years'},
  'B01001047': {'indent': 3, 'name': '75 to 79 years'},
  'B01001048': {'indent': 3, 'name': '80 to 84 years'},
  'B01001049': {'indent': 3, 'name': '85 years and over'}}]

function: getRecodingKeys filters out column names that don't need to be recoded

codingDF is a lookup table that can be use to recode the column names of ageData


In [6]:
def getRecodingKeys(element):
    if ('Error' not in element) and ('name' != element) and ('geoid' != element):
        return element
    return False

#filter out the original column names that don't require recoding
codingDF = pandas.DataFrame({'origColNames': list(filter(getRecodingKeys, colNames))})

#add a new column to codingDF that contains the recoded ageDF column names
codingDF['recodeColName'] = ''

codingDF is intended to be used as a lookup table for recoding.
codingDF contains 2 columns:
origColNames: contain the original coded column name in the ageData dataframe

recodeColName: contains the actual column names that are a lot more descriptive than the coded column names in the ageData dataframe


In [7]:
#append the recodedColNames to codingDF.recodeColName
for idx, origColName in enumerate(codingDF.origColNames):
    codingDF.recodeColName[idx] = columnAttr[0][origColName]['name']

codingDF.head()


Out[7]:
origColNames recodeColName
0 B01001001 Total:
1 B01001002 Male:
2 B01001003 Under 5 years
3 B01001004 5 to 9 years
4 B01001005 10 to 14 years

The 2 cells below are my attempts to recode the ageData column names.

Some things to deal with:
1) Each coded column name in the ageData dataframe 'B01001xxx' for instance has a corresponding column cotaining the error of each measurement with a column name 'B01001xxx, Error'.

2) Finding the index values of python/pandas dataframe where the rows are equal to a certain value. In CRAN-R you could execute the following code: 'index = which(dataFrame$column=='B01001xxx')'. A python/pandas equivalent would be great if there is one.


In [ ]:
#recode ageData with the actual column names
for idx, col in enumerate(ageData.columns):
    if  codingDF.origColNames.str.contains(col): #codingDF.origColNames.str.contains(ageData.columns[idx]):
        colMatchIDX = codingDF.origColNames.get_loc(col)
        if ageData.columns[idx].str.contains('Error'):
            tempColName = codingDF.recodeColName[colMatchIDX] + '_Error'
            ageData.columns[idx] = tempColName
        else:
            tempColName = codingDF.recodeColName[colMatchIDX]

In [ ]:
#recode ageData with the actual column names
for idx, col in enumerate(ageData.columns):
    if  codingDF.origColNames.str.contains(col): #codingDF.origColNames.str.contains(ageData.columns[idx]):
        colMatchIDX = codingDF.origColNames.get_loc(col)
        if ageData.columns[idx].str.contains('Error'):
            tempColName = codingDF.recodeColName[colMatchIDX] + '_Error'
            ageData.rename(columns={col:tempColName}, inplace=True)
        else:
            tempColName = codingDF.recodeColName[colMatchIDX]
            print(tempColName)
            ageData.rename(columns={col:tempColName}, inplace=True)