This is the companion jupyter notebook to 'ageDataOver65.py'.

Purpose: open a CSV file from https://censusreporter.org/ and recode it with the corresponding JSON metadata file.

Written by CC on 20161228



In [1]:

    
import pandas
import numpy
import os
import ijson



In [2]:

    
path = os.chdir('/Users/superuser/Documents/projects/SDRegionalDataLib/age friendly community/acs2015_1yr_B01001/')



In [3]:

    
ageData = pandas.read_csv('acs2015_1yr_B01001.csv');
ageData.head()









    Out[3]:






  
    
      
      geoid
      name
      B01001001
      B01001001, Error
      B01001002
      B01001002, Error
      B01001003
      B01001003, Error
      B01001004
      B01001004, Error
      ...
      B01001045
      B01001045, Error
      B01001046
      B01001046, Error
      B01001047
      B01001047, Error
      B01001048
      B01001048, Error
      B01001049
      B01001049, Error
    
  
  
    
      0
      16000US0666000
      San Diego, CA
      1394907
      83
      693826
      6780
      45036
      3224
      40131
      3386
      ...
      19137
      2032
      19728
      1815
      14059
      1730
      12966
      1649
      13440
      1875
    
  

1 rows × 100 columns

get a list of the column names



In [4]:

    
colNames = list(ageData.columns.values)

#show the first ten coded column names
colNames[0:9]









    Out[4]:





['geoid',
 'name',
 'B01001001',
 'B01001001, Error',
 'B01001002',
 'B01001002, Error',
 'B01001003',
 'B01001003, Error',
 'B01001004']



In [5]:

    
#open the json file
jsonFile = 'metadata.json';

with open(jsonFile, 'r') as f:
    objects = ijson.items(f, 'tables.B01001.columns')
    columnAttr = list(objects)

#show the values of columnAttr
columnAttr[:]









    Out[5]:





[{'B01001001': {'indent': 0, 'name': 'Total:'},
  'B01001002': {'indent': 1, 'name': 'Male:'},
  'B01001003': {'indent': 3, 'name': 'Under 5 years'},
  'B01001004': {'indent': 3, 'name': '5 to 9 years'},
  'B01001005': {'indent': 3, 'name': '10 to 14 years'},
  'B01001006': {'indent': 3, 'name': '15 to 17 years'},
  'B01001007': {'indent': 3, 'name': '18 and 19 years'},
  'B01001008': {'indent': 3, 'name': '20 years'},
  'B01001009': {'indent': 3, 'name': '21 years'},
  'B01001010': {'indent': 3, 'name': '22 to 24 years'},
  'B01001011': {'indent': 3, 'name': '25 to 29 years'},
  'B01001012': {'indent': 3, 'name': '30 to 34 years'},
  'B01001013': {'indent': 3, 'name': '35 to 39 years'},
  'B01001014': {'indent': 3, 'name': '40 to 44 years'},
  'B01001015': {'indent': 3, 'name': '45 to 49 years'},
  'B01001016': {'indent': 3, 'name': '50 to 54 years'},
  'B01001017': {'indent': 3, 'name': '55 to 59 years'},
  'B01001018': {'indent': 3, 'name': '60 and 61 years'},
  'B01001019': {'indent': 3, 'name': '62 to 64 years'},
  'B01001020': {'indent': 3, 'name': '65 and 66 years'},
  'B01001021': {'indent': 3, 'name': '67 to 69 years'},
  'B01001022': {'indent': 3, 'name': '70 to 74 years'},
  'B01001023': {'indent': 3, 'name': '75 to 79 years'},
  'B01001024': {'indent': 3, 'name': '80 to 84 years'},
  'B01001025': {'indent': 3, 'name': '85 years and over'},
  'B01001026': {'indent': 1, 'name': 'Female:'},
  'B01001027': {'indent': 3, 'name': 'Under 5 years'},
  'B01001028': {'indent': 3, 'name': '5 to 9 years'},
  'B01001029': {'indent': 3, 'name': '10 to 14 years'},
  'B01001030': {'indent': 3, 'name': '15 to 17 years'},
  'B01001031': {'indent': 3, 'name': '18 and 19 years'},
  'B01001032': {'indent': 3, 'name': '20 years'},
  'B01001033': {'indent': 3, 'name': '21 years'},
  'B01001034': {'indent': 3, 'name': '22 to 24 years'},
  'B01001035': {'indent': 3, 'name': '25 to 29 years'},
  'B01001036': {'indent': 3, 'name': '30 to 34 years'},
  'B01001037': {'indent': 3, 'name': '35 to 39 years'},
  'B01001038': {'indent': 3, 'name': '40 to 44 years'},
  'B01001039': {'indent': 3, 'name': '45 to 49 years'},
  'B01001040': {'indent': 3, 'name': '50 to 54 years'},
  'B01001041': {'indent': 3, 'name': '55 to 59 years'},
  'B01001042': {'indent': 3, 'name': '60 and 61 years'},
  'B01001043': {'indent': 3, 'name': '62 to 64 years'},
  'B01001044': {'indent': 3, 'name': '65 and 66 years'},
  'B01001045': {'indent': 3, 'name': '67 to 69 years'},
  'B01001046': {'indent': 3, 'name': '70 to 74 years'},
  'B01001047': {'indent': 3, 'name': '75 to 79 years'},
  'B01001048': {'indent': 3, 'name': '80 to 84 years'},
  'B01001049': {'indent': 3, 'name': '85 years and over'}}]

function: getRecodingKeys filters out column names that don't need to be recoded

codingDF is a lookup table that can be use to recode the column names of ageData



In [6]:

    
def getRecodingKeys(element):
    if ('Error' not in element) and ('name' != element) and ('geoid' != element):
        return element
    return False

#filter out the original column names that don't require recoding
codingDF = pandas.DataFrame({'origColNames': list(filter(getRecodingKeys, colNames))})

#add a new column to codingDF that contains the recoded ageDF column names
codingDF['recodeColName'] = ''

codingDF is intended to be used as a lookup table for recoding.
codingDF contains 2 columns:
origColNames: contain the original coded column name in the ageData dataframe

recodeColName: contains the actual column names that are a lot more descriptive than the coded column names in the ageData dataframe



In [7]:

    
#append the recodedColNames to codingDF.recodeColName
for idx, origColName in enumerate(codingDF.origColNames):
    codingDF.recodeColName[idx] = columnAttr[0][origColName]['name']

codingDF.head()









    Out[7]:






  
    
      
      origColNames
      recodeColName
    
  
  
    
      0
      B01001001
      Total:
    
    
      1
      B01001002
      Male:
    
    
      2
      B01001003
      Under 5 years
    
    
      3
      B01001004
      5 to 9 years
    
    
      4
      B01001005
      10 to 14 years

The 2 cells below are my attempts to recode the ageData column names.

Some things to deal with:
1) Each coded column name in the ageData dataframe 'B01001xxx' for instance has a corresponding column cotaining the error of each measurement with a column name 'B01001xxx, Error'.

2) Finding the index values of python/pandas dataframe where the rows are equal to a certain value. In CRAN-R you could execute the following code: 'index = which(dataFrame$column=='B01001xxx')'. A python/pandas equivalent would be great if there is one.



In [ ]:

    
#recode ageData with the actual column names
for idx, col in enumerate(ageData.columns):
    if  codingDF.origColNames.str.contains(col): #codingDF.origColNames.str.contains(ageData.columns[idx]):
        colMatchIDX = codingDF.origColNames.get_loc(col)
        if ageData.columns[idx].str.contains('Error'):
            tempColName = codingDF.recodeColName[colMatchIDX] + '_Error'
            ageData.columns[idx] = tempColName
        else:
            tempColName = codingDF.recodeColName[colMatchIDX]



In [ ]:

    
#recode ageData with the actual column names
for idx, col in enumerate(ageData.columns):
    if  codingDF.origColNames.str.contains(col): #codingDF.origColNames.str.contains(ageData.columns[idx]):
        colMatchIDX = codingDF.origColNames.get_loc(col)
        if ageData.columns[idx].str.contains('Error'):
            tempColName = codingDF.recodeColName[colMatchIDX] + '_Error'
            ageData.rename(columns={col:tempColName}, inplace=True)
        else:
            tempColName = codingDF.recodeColName[colMatchIDX]
            print(tempColName)
            ageData.rename(columns={col:tempColName}, inplace=True)

	origColNames	recodeColName
0	B01001001	Total:
1	B01001002	Male:
2	B01001003	Under 5 years
3	B01001004	5 to 9 years
4	B01001005	10 to 14 years