The AEC started putting co-ordinates on polling place files from 2007.
The code below matches 1999 polling places to 2007 polling places where the name of the venue is the same.
While not perfect, this should, in the vast majority of cases, yield a co-ordinate closely representing the location of the polling place in 1999.
In [6]:
import pandas as pd
import numpy as np
from IPython.display import display, HTML
import json
import googlemaps
In [7]:
def left_of_bracket(s):
    """Return the text of *s* before its first '(', right-trimmed.

    Used to normalise polling place names, e.g. 'Richmond (East)' -> 'Richmond'.
    Strings without a bracket are returned unchanged.
    """
    needle = s.find('(')
    if needle == -1:
        return s
    # Slice up to (not one short of) the bracket: the old `s[:needle-1]`
    # chopped a real character when there was no space before '(' and
    # wrapped to s[:-1] when '(' was the first character.
    return s[:needle].strip()
def state_abbreviation(state):
    """Convert a full Australian state/territory name to its abbreviation.

    The eight official names are looked up directly; anything else falls
    back to the original heuristics (initials for three-word names,
    '<first letter>A' for names containing 'Australia', otherwise the
    first three letters upper-cased).
    """
    # explicit table: clearer than the heuristic chain and removes the
    # need for the old Queensland / Northern Territory special cases
    known = {
        'New South Wales': 'NSW',
        'Victoria': 'VIC',
        'Queensland': 'QLD',
        'South Australia': 'SA',
        'Western Australia': 'WA',
        'Tasmania': 'TAS',
        'Northern Territory': 'NT',
        'Australian Capital Territory': 'ACT',
    }
    if state in known:
        return known[state]
    # fallback heuristics, kept identical to the original logic
    if state.count(' ') == 2:
        # for each word in state grab first letter
        return ''.join(word[:1].upper() for word in state.split(' '))
    if 'Australia' in state:
        return state[:1].upper() + 'A'
    return state[:3].upper()
# thin wrapper around drop_duplicates so the call site reads clearly
def dedup_df(df, keys, keep = False):
    """Remove duplicated rows from *df*, judged on the *keys* columns.

    The default keep=False discards every member of a duplicate group;
    pass keep='first' to retain one representative row instead.
    """
    return df.drop_duplicates(subset=keys, keep=keep)
Usage: import_polling_places(filepath) — returns a tidy polling-place DataFrame;
the matching key columns are ['state', 'polling_place'].
In [8]:
def import_polling_places(filepath):
    """Read an AEC polling-place CSV and return a tidy DataFrame.

    Keeps only the name/address/coordinate columns, derives a bracket-free
    'polling_place' name, renames the join columns to match the 1999 file,
    drops duplicate rows and lower-cases all headers.
    """
    # the columns I want to keep
    cols = [
        'State',
        'PollingPlaceNm',
        'PremisesNm',
        'PremisesAddress1',
        'PremisesAddress2',
        'PremisesAddress3',
        'PremisesSuburb',
        'PremisesStateAb',
        'PremisesPostCode',
        'Latitude',
        'Longitude'
    ]
    df_pp = pd.read_csv(filepath)[cols]
    # bracket-free venue name, e.g. 'Richmond (East)' -> 'Richmond'
    # (apply the function directly — no need for a lambda wrapper)
    df_pp['polling_place'] = df_pp['PollingPlaceNm'].apply(left_of_bracket)
    # rename join columns in place instead of copying them
    renames = {
        'PollingPlaceNm': 'polling_place',
        'PremisesNm': 'premises',
        'PremisesPostCode': 'postcode',
    }
    df_pp = df_pp.rename(columns={'PremisesNm': 'premises',
                                  'PremisesPostCode': 'postcode'})
    # reorder: same column order as `cols`, with the new names swapped in
    cols = [renames.get(c, c) for c in cols]
    df_pp = df_pp[cols]
    # dedup
    df_pp = df_pp.drop_duplicates()
    # make all headers lower case
    df_pp.columns = [c.lower() for c in df_pp.columns]
    return df_pp
In [9]:
# smoke-test the importer against the 2007 polling-place export
filepath = 'federal_election_polling_places/pp_2007_election.csv'
test = import_polling_places(filepath)
# report how many (deduplicated) polling places were loaded
display('Rows: ' + str(len(test.index)))
In [10]:
def import_1999_pp(filepath):
    """Load the 1999 referenda polling-place CSV, ready for matching.

    Adds empty match-metadata and lat/lng columns, then indexes the frame
    on (state, polling_place) — the unique key used by every match pass.
    """
    frame = pd.read_csv(filepath)
    # placeholders to be filled in by successive matching passes
    for column in ('match_source', 'match_type', 'latitude', 'longitude'):
        frame[column] = np.nan
    return frame.set_index(['state', 'polling_place'])
In [11]:
# load the 1999 referenda polling places and eyeball the first rows
filepath = '1999_referenda_output/polling_places.csv'
df_pp_1999 = import_1999_pp(filepath)
display(df_pp_1999.head(3))
pandas generates warnings when working with a data frame that is a copy of another: it thinks I might believe I'm changing df_pp_1999 when I'm actually working with df_pp_1999_working. I'm turning this warning off because I'm doing this on purpose, so I can keep df_pp_1999 as a 'yet to be matched' file and update it with each subsequent working file.
In [12]:
pd.set_option('chained_assignment',None)
Usage: match_polling_places(df_pp_1999, df_pp, settings), for example:

    match_polling_places(
        df_pp_1999,
        df_pp,
        dict(
            keys = ['state','premises','postcode'],
            match_source = '2007 Polling Places',
            match_type = 'Match 01 - state, premises, postcode'
        )
    )
In [13]:
def match_polling_places(df1, df2, settings):
    """Left-join lat/lng from *df2* onto *df1* using settings['keys'].

    settings also supplies 'match_source' and 'match_type' labels, which
    are stamped on every row of the working frame. Rows that find no
    co-ordinate are dropped, and rows that matched ambiguously (duplicate
    key combinations) are removed so only unambiguous matches remain.
    """
    keys = settings['keys']
    # the 1999-file columns carried through every match pass
    base_cols = [
        'state',
        'polling_place',
        'premises',
        'address',
        'suburb',
        'postcode',
        'wheelchair_access'
    ]
    df_working = df1.reset_index()[base_cols]
    # label every candidate row with where/how it was matched
    df_working['match_source'] = settings['match_source']
    df_working['match_type'] = settings['match_type']
    # from the reference frame we only need the join keys plus co-ordinates
    lookup_cols = keys + ['latitude', 'longitude']
    df_working = pd.merge(df_working, df2[lookup_cols], on=keys, how='left')
    # keep only rows that actually picked up a co-ordinate
    df_working = df_working[~df_working['latitude'].isnull()]
    # drop key combinations that matched more than once (ambiguous)
    return dedup_df(df_working, keys)
In [14]:
# test match_polling_places end-to-end: 1999 file vs the 2007 reference
filepath = '1999_referenda_output/polling_places.csv'
df1 = import_1999_pp(filepath)
filepath2 = 'federal_election_polling_places/pp_2007_election.csv'
df2 = import_polling_places(filepath2)
# run one pass with the strictest key set (state + premises + postcode)
test = match_polling_places(
df1,
df2,
dict(
keys = ['state','premises','postcode'],
match_source = '2007 Polling Places',
match_type = 'Match 01 - state, premises, postcode'
)
)
display(test.head(3))
Usage: match_unmatched_polling_places(df1, settings) — wraps match_polling_places,
running it only over rows of df1 that no earlier pass has matched.
In [15]:
def match_unmatched_polling_places(df1, settings):
    """Run a match pass over the still-unmatched rows of *df1*.

    Loads the reference polling-place file named in settings['pp_filepath'],
    matches only rows whose match_source is still empty, and folds any new
    lat/lng results back into *df1* via DataFrame.update. Note *df1* is
    modified in place as well as returned.
    """
    reference = import_polling_places(settings['pp_filepath'])
    # only attempt rows no earlier pass has matched
    unmatched = df1[df1.match_source.isnull()]
    matches = match_polling_places(unmatched, reference, settings)
    # state + polling_place is my unique key: dedup on it, then index on it
    # (set_index doubles as a check that the dedup worked)
    unique_key = ['state', 'polling_place']
    matches = dedup_df(matches, unique_key).set_index(unique_key)
    df1.update(matches)
    return df1
In [16]:
def match_status(df1):
    """Summarise match progress: a count per match_type plus the
    'Not yet matched' (NaN match_type) row count.

    Returns a two-column DataFrame: match_type, count.
    """
    # how many NaNs are in match_type?
    not_matched = int(df1['match_type'].isnull().sum())
    # single row describing the unmatched remainder
    none = pd.DataFrame(dict(
        match_type = 'Not yet matched',
        count = not_matched
    ), index=[0])
    if not_matched == len(df1.index):
        # nothing matched yet, so the groupby below would be empty
        return none
    counts = df1.groupby('match_type')['match_type'].count().reset_index(name='count')
    # pd.concat instead of DataFrame.append (removed in pandas 2.0)
    return pd.concat([counts, none])
In [17]:
# first match attempt - set up file
filepath = '1999_referenda_output/polling_places.csv'
df_pp_1999 = import_1999_pp(filepath)
# double check none are somehow magically matched yet
print('before')
display(match_status(df_pp_1999))
# configure match settings: strictest keys, 2007 file first
settings = dict(
pp_filepath = 'federal_election_polling_places/pp_2007_election.csv',
keys = ['state','premises','postcode'],
match_source = '2007 Polling Places',
match_type = 'Match 01 - state, premises, postcode'
)
# run match (updates df_pp_1999 in place and returns it)
df_pp_1999 = match_unmatched_polling_places(df_pp_1999, settings)
print('after')
display(match_status(df_pp_1999))
In [18]:
## matches 2-4: state/premises/postcode against the 2010, 2013 and 2016
## files — the three cells were identical apart from year and match number,
## so run them as one loop (output strings are unchanged)
for match_no, year in zip((2, 3, 4), (2010, 2013, 2016)):
    print('before %d' % match_no)
    display(match_status(df_pp_1999))
    # configure match settings for this year's polling-place file
    settings = dict(
        pp_filepath = 'federal_election_polling_places/pp_%d_election.csv' % year,
        keys = ['state', 'premises', 'postcode'],
        match_source = '%d Polling Places' % year,
        match_type = 'Match %02d - state, premises, postcode' % match_no
    )
    # run match (only rows still unmatched are attempted)
    df_pp_1999 = match_unmatched_polling_places(df_pp_1999, settings)
    print('after')
    display(match_status(df_pp_1999))
In [19]:
print('before 5')
display(match_status(df_pp_1999))
# configure match settings: looser key set now — polling_place name
# instead of premises name, back on the 2007 file
settings = dict(
pp_filepath = 'federal_election_polling_places/pp_2007_election.csv',
keys = ['state','polling_place','postcode'],
match_source = '2007 Polling Places',
match_type = 'Match 05 - state, polling_place, postcode'
)
# run match
df_pp_1999 = match_unmatched_polling_places(df_pp_1999, settings)
print('after')
display(match_status(df_pp_1999))
In [20]:
## matches 6-8: state/polling_place/postcode against the 2010, 2013 and
## 2016 files — identical cells apart from year and match number, so run
## them as one loop (output strings are unchanged)
for match_no, year in zip((6, 7, 8), (2010, 2013, 2016)):
    print('before %d' % match_no)
    display(match_status(df_pp_1999))
    # configure match settings for this year's polling-place file
    settings = dict(
        pp_filepath = 'federal_election_polling_places/pp_%d_election.csv' % year,
        keys = ['state', 'polling_place', 'postcode'],
        match_source = '%d Polling Places' % year,
        match_type = 'Match %02d - state, polling_place, postcode' % match_no
    )
    # run match (only rows still unmatched are attempted)
    df_pp_1999 = match_unmatched_polling_places(df_pp_1999, settings)
    print('after')
    display(match_status(df_pp_1999))
In [21]:
def get_google_api_key():
    """Fetch the Google Maps API key from config/keys.json.

    Keeping the key in a local config file keeps it out of the notebook
    (and therefore out of version control).
    """
    with open('config/keys.json') as data_file:
        return json.load(data_file)['google_maps']
In [22]:
def geocode_address(address):
    """Geocode *address* via the Google Maps API, restricted to Australia.

    Returns the client's raw list of geocode results (an empty list when
    nothing matched). Reads the API key and builds a fresh client on every
    call.
    """
    gmaps = googlemaps.Client(key=get_google_api_key())
    # pass the country restriction as the Python client's named
    # 'components' parameter — the old positional dict relied on argument
    # order, and the dead commented-out unrestricted call is removed
    return gmaps.geocode(address, components={'country': 'AU'})
In [23]:
# smoke test the geocoder (makes a live Google Maps API call)
# display(geocode_address('10/44 Lord St Richmond VIC 3121'))
display(geocode_address('Salt Creek Rd SALT CREEK 5275'))
In [24]:
def geocode_polling_place(row, match_source = 'google geocode'):
    """Geocode one polling-place record (a dict) via the Google API.

    Builds an address string from the row, geocodes it, writes the match
    metadata and lat/lng back onto the row, and returns the row wrapped in
    a one-row DataFrame. Failed geocodes get match_type 'failed' so later
    passes can retry them.
    """
    # postcode may have been parsed as a float (e.g. 3121.0) — keep 4 chars
    address = row['address'] + ' ' + row['suburb'] + ' ' + str(row['postcode'])[:4]
    geocode = geocode_address(address)
    if geocode:
        row['match_source'] = match_source
        # Google's location_type (e.g. ROOFTOP/APPROXIMATE) doubles as the
        # match type, recording how precise the geocode was
        row['match_type'] = geocode[0]['geometry']['location_type']
        row['latitude'] = geocode[0]['geometry']['location']['lat']
        row['longitude'] = geocode[0]['geometry']['location']['lng']
    else:
        # BUG FIX: the failure branch previously hard-coded 'google geocode',
        # ignoring the match_source argument — label failures with the same
        # source the caller asked for
        row['match_source'] = match_source
        row['match_type'] = 'failed'
    return pd.DataFrame(row, index=[0])
In [25]:
# test the geocoder on just three rows before spending API quota on all of them
test = df_pp_1999.head(3)
geocode_test = pd.DataFrame()
i = 0  # NOTE(review): unused leftover counter
for row in test.reset_index().to_dict('records')[:3]:
    row = geocode_polling_place(row, 'Match 09 - Google Geocode')
    if geocode_test.empty:
        geocode_test = row
    else:
        # NOTE(review): DataFrame.append was removed in pandas 2.0 — switch
        # to pd.concat if upgrading
        geocode_test = geocode_test.append(row)
geocode_test
Out[25]:
In [26]:
# geocode every row still unmatched after passes 1-8 (one API call per row)
unmatched_places = df_pp_1999[df_pp_1999.match_source.isnull()]
geocode_matches = pd.DataFrame()
for row in unmatched_places.reset_index().to_dict('records'):
    row = geocode_polling_place(row)
    if geocode_matches.empty:
        geocode_matches = row
    else:
        # NOTE(review): DataFrame.append was removed in pandas 2.0 — switch
        # to pd.concat if upgrading
        geocode_matches = geocode_matches.append(row)
geocode_matches.head(5)
Out[26]:
In [27]:
# reorder geocode results into the same column pattern as the main frame
# get all column names from the main table (index included via reset_index)
keys = df_pp_1999.reset_index().columns.values
# reorder by that
geocode_matches_ordered = geocode_matches[keys]
# restore the unique (state, polling_place) index so update() can align rows
keys = ['state','polling_place']
# df1_matches = dedup_df(df1_matches, keys)
# check that worked by making it a key now
geocode_matches_ordered = geocode_matches_ordered.set_index(keys)
# fold the geocoded lat/lng back into the main frame
df_pp_1999.update(geocode_matches_ordered)
# where are we at?
display(match_status(df_pp_1999))
In [28]:
# load ABS suburb (SSC) centroids as a last-resort co-ordinate source
filepath = 'from_abs/ssc_2016_aust_centroid.csv'
# dropped the stray `results =` alias from the original double assignment
subtown_centroids = pd.read_csv(filepath)
# create a column for abbreviated states (apply the function directly —
# no lambda wrapper needed)
subtown_centroids['state'] = subtown_centroids['STE_NAME16'].apply(state_abbreviation)
# strip the brackets out of suburb names, and make all caps, to match the 1999 file
subtown_centroids['suburb'] = subtown_centroids['SSC_NAME16'].apply(
    lambda name: left_of_bracket(name).upper()
)
display(subtown_centroids.head(5))
In [29]:
print('before suburbs')
display(match_status(df_pp_1999))
# configure match settings: keys are only state + suburb, so this matches
# to the suburb's centroid rather than the exact venue
# (pp_filepath is unused by match_polling_places, kept for the record)
settings = dict(
pp_filepath = 'from_abs/ssc_2016_aust_centroid.csv',
keys = ['state','suburb'],
match_source = 'ABS Suburb Centroids',
match_type = 'Match 10 - suburb centroid'
)
# rows to try
## try not matched rows, OR rows that the geocode failed on
df_to_match = df_pp_1999[df_pp_1999.match_source.isnull()]
# NOTE(review): DataFrame.append was removed in pandas 2.0 — use pd.concat if upgrading
df_to_match = df_to_match.append(df_pp_1999[df_pp_1999['match_type'] == 'failed'])
# get matches
df1_matches = match_polling_places(
df_to_match, # only run on empty rows
subtown_centroids,
settings
)
# dedup this file for combinations of state/polling_place (my unique key)
keys = ['state','polling_place']
# df1_matches = dedup_df(df1_matches, keys)
# check that worked by making it a key now
df1_matches = df1_matches.set_index(keys)
# update with matches
df_pp_1999.update(df1_matches)
print('after')
display(match_status(df_pp_1999))
In [30]:
# inspect whatever is still unmatched (or failed geocoding) after all passes
df_to_match = df_pp_1999[df_pp_1999.match_source.isnull()]
# NOTE(review): DataFrame.append was removed in pandas 2.0 — use pd.concat if upgrading
df_to_match = df_to_match.append(df_pp_1999[df_pp_1999['match_type'] == 'failed'])
display(df_to_match)
In [33]:
# write the geocoded 1999 polling places out to CSV
# (sep=',' is pandas' default; kept for explicitness)
df_pp_1999.to_csv(
'1999_referenda_output/polling_places_geocoded.csv',
sep = ','
)