Notebook to add co-ordinates for 1999 Polling Places

The AEC has included co-ordinates in its polling place files since 2007.

The code below matches the 1999 polling places to those in the 2007–2016 files wherever the name of the venue (or of the polling place itself) is the same, then falls back to the Google geocoder and ABS suburb centroids for the rest.

While not perfect, this should give, in the vast majority of cases, a co-ordinate closely representing the location of the polling place in 1999.

Libraries


In [6]:
import pandas as pd
import numpy as np
from IPython.display import display, HTML
import json
import googlemaps

Helper functions


In [7]:
def left_of_bracket(s):
    # return everything before the first '(' (e.g. 'Abbotsford (NSW)' -> 'Abbotsford')
    if '(' in s:
        needle = s.find('(')
        return s[:needle].strip()
    else:
        return s
    
def state_abbreviation(state):
    # convert a full state name to its abbreviation, e.g. 'New South Wales' -> 'NSW'
    spaces = state.count(' ')
    
    if spaces == 2:
        # three-word names: take the initials
        # (New South Wales -> NSW, Australian Capital Territory -> ACT)
        bits = state.split(' ')
        r = ''
        for b in bits:
            r = r + b[:1].upper() # for each word in state grab first letter
        return r
    elif 'Australia' in state:
        # South Australia -> SA, Western Australia -> WA
        r = state[:1].upper() + 'A'
        return r
    elif state == 'Queensland':
        return 'QLD'
    elif state == 'Northern Territory':
        return 'NT'
    else:
        # Victoria -> VIC, Tasmania -> TAS
        r = state[:3].upper()
        return r

# I keep forgetting the syntax for this, so writing a wrapper
def dedup_df(df, keys, keep = False):
    
    # for a data frame, drop anything that's a duplicate on the given keys
    # keep=False drops every copy; pass keep='first' to keep the first row instead
    df_dedup = df.drop_duplicates(keys, keep)
    return df_dedup
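
As a quick illustration (not part of the original run), here's what these helpers return for typical inputs, with expected values noted in the comments:

print(left_of_bracket('Abbotsford (NSW)'))  # 'Abbotsford'

for name in ['New South Wales', 'Australian Capital Territory',
             'South Australia', 'Western Australia',
             'Queensland', 'Northern Territory', 'Victoria', 'Tasmania']:
    print(name, '->', state_abbreviation(name))
# NSW, ACT, SA, WA, QLD, NT, VIC, TAS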

Import polling places

import_polling_places(filepath)

  • Takes a path to a polling place file
  • Returns a tidy data frame
    • Renames some columns
    • Drops fully duplicate rows

In [8]:
def import_polling_places(filepath):
    
    # read csv
    df_pp = pd.read_csv(
        filepath
    )
    
    # pick the columns I want to keep
    cols = [
        'State',
        'PollingPlaceNm',
        'PremisesNm',
        'PremisesAddress1',
        'PremisesAddress2',
        'PremisesAddress3',
        'PremisesSuburb',
        'PremisesStateAb',
        'PremisesPostCode',
        'Latitude',
        'Longitude'
    ]
    
    # filter for those
    df_pp = df_pp[cols]

    # create a polling place column with the bracketed suffix removed
    df_pp['polling_place'] = df_pp['PollingPlaceNm'].apply(left_of_bracket)

    # rename columns to make joining easier
    df_pp['premises'] = df_pp['PremisesNm']
    df_pp['postcode'] = df_pp['PremisesPostCode']

    # replace in the col headers list where I've modified/added the column
    cols = [c.replace('PollingPlaceNm', 'polling_place') for c in cols]
    cols = [c.replace('PremisesNm', 'premises') for c in cols]
    cols = [c.replace('PremisesPostCode', 'postcode') for c in cols]
    
    # reorder df
    df_pp = df_pp[cols]

    # drop fully duplicate rows
    df_pp = df_pp.drop_duplicates()

    # make all headers lower case
    df_pp.columns = [x.lower() for x in df_pp.columns]
    
    return df_pp

In [9]:
filepath = 'federal_election_polling_places/pp_2007_election.csv'
test = import_polling_places(filepath)
display('Rows: ' + str(len(test.index)))


'Rows: 7861'

Import 1999 polling places


In [10]:
def import_1999_pp(filepath):

    df_pp_1999 = pd.read_csv(
        filepath
    )

    # add blank columns for match types and lat/lng
    df_pp_1999['match_source'] = np.nan
    df_pp_1999['match_type'] = np.nan
    df_pp_1999['latitude'] = np.nan
    df_pp_1999['longitude'] = np.nan

    # tell it to index on state and polling place
    df_pp_1999 = df_pp_1999.set_index(['state','polling_place'])
    
    return df_pp_1999

In [11]:
filepath = '1999_referenda_output/polling_places.csv'
df_pp_1999 = import_1999_pp(filepath)
display(df_pp_1999.head(3))


premises address suburb postcode wheelchair_access match_source match_type latitude longitude
state polling_place
ACT Bonython Bonython Primary School Hurtle Ave BONYTHON 2905.0 F NaN NaN NaN NaN
Calwell Calwell High School Casey Cres CALWELL 2905.0 F NaN NaN NaN NaN
Canberra Hospital The Canberra Hospital Blding 2 Level 3 Yamba Dr GARRAN 2605.0 F NaN NaN NaN NaN

Matches

Pandas setting I need for below to behave

pandas generates warnings when you work on a data frame that's a copy of another: it thinks I might believe I'm changing df_pp_1999 when I'm really working on a derived copy (like df_working below). I'm turning this warning off because I'm doing it deliberately, so I can keep df_pp_1999 as a 'yet to be matched' file and update it with each subsequent working file.


In [12]:
pd.set_option('mode.chained_assignment', None)
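
For context, a hypothetical example of the pattern that triggers the warning (illustrative only, not run here):

# slicing produces what may be a copy of df_pp_1999...
subset = df_pp_1999[df_pp_1999['postcode'] == 2905.0]
# ...so assigning to it makes pandas warn that df_pp_1999 itself may be unaffected
subset['match_source'] = 'example'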

Functions

match_polling_places(df_pp_1999, df_pp, settings)

  • Given the 1999 data frame, another polling place data frame, and a dict of settings, runs a merge and returns only the rows that matched on the join you specified
    E.g:
    match_polling_places(
      df_pp_1999,
      df_pp,
      dict(
          keys = ['state','premises','postcode'],
          match_source = '2007 Polling Places',
          match_type = 'Match 01 - state, premises, postcode'   
      )
    )
  • runs a join on state, premises, and postcode between df1 and df2
  • keeps a defined set of columns from df1
  • adds the columns match_source and match_type, and sets their value
  • replaces the latitude and longitude columns of df1 with those from df2
  • returns this data frame, with every row from df1 that didn't match removed

In [13]:
def match_polling_places(df1, df2, settings):
    
    # split up our meta field
    keys = settings['keys']
    match_source = settings['match_source']
    match_type = settings['match_type']

    # filter for those columns
 
    df_working = df1.reset_index()[[
        'state',
        'polling_place',
        'premises',
        'address',
        'suburb',
        'postcode',
        'wheelchair_access'
    ]]

    # the keys I want to keep from the second df in the join are the group_by keys, and also lat/lng
    cols_df2 = keys + ['latitude','longitude']
    
    # add cols for match type
    df_working['match_source'] = match_source
    df_working['match_type'] = match_type
    
    # run the join
    df_working = pd.merge(
        df_working,
        df2[cols_df2],
        on=keys,
        how='left'
    )
    
    # delete those which we didn't match
    df_working = df_working[~df_working['latitude'].isnull()]
    
    # dedup on the keys we matched on (keep=False, so ambiguous matches are dropped entirely)
    df_working = dedup_df(df_working, keys)
    
    return df_working

In [14]:
# test match_polling_places
filepath = '1999_referenda_output/polling_places.csv'
df1 = import_1999_pp(filepath)

filepath2 = 'federal_election_polling_places/pp_2007_election.csv'
df2 = import_polling_places(filepath2)

test = match_polling_places(
  df1,
  df2,
  dict(
      keys = ['state','premises','postcode'],
      match_source = '2007 Polling Places',
      match_type = 'Match 01 - state, premises, postcode'   
  )
)

display(test.head(3))


state polling_place premises address suburb postcode wheelchair_access match_source match_type latitude longitude
0 ACT Bonython Bonython Primary School Hurtle Ave BONYTHON 2905.0 F 2007 Polling Places Match 01 - state, premises, postcode -35.4318 149.083
1 ACT Calwell Calwell High School Casey Cres CALWELL 2905.0 F 2007 Polling Places Match 01 - state, premises, postcode -35.4406 149.116
2 ACT Canberra Hospital The Canberra Hospital Blding 2 Level 3 Yamba Dr GARRAN 2605.0 F 2007 Polling Places Match 01 - state, premises, postcode -35.3453 149.100

match_unmatched_polling_places(df1, settings)

  • This is a wrapper function for match_polling_places
  • It only passes rows of df1 that are NOT yet matched to the match function, so that we record the point in our order at which each row matched (rather than overwriting it each time a later match also hits)
  • This matters because the matches get lower quality towards the bottom of the pile

In [15]:
def match_unmatched_polling_places(df1, settings):
    
    # get polling place file from settings
    filepath = settings['pp_filepath']    
    df2 = import_polling_places(filepath)

    # work out which rows we haven't yet matched
    df1_unmatched = df1[df1.match_source.isnull()]
    
    # run match for those
    df1_matches = match_polling_places(df1_unmatched, df2, settings)
    
    # dedup this file for combinations of state/polling_place (my unique key)
    keys = ['state','polling_place']
    df1_matches = dedup_df(df1_matches, keys)
    
    # check that worked by making it a key now
    df1_matches = df1_matches.set_index(keys)
    
    # update with matches
    df1.update(df1_matches)
    
    # return
    return df1

match_status(df1)

  • a function to tell me for a given data frame what the match status is

In [16]:
def match_status(df1):
    
    # how many NaNs are in match_type?
    not_matched = len(df1[df1['match_type'].isnull()].index)
    
    # make a df for none
    none = pd.DataFrame(dict(
                match_type = 'Not yet matched',
                count = not_matched
    ), index=[0])
    
    if not_matched == len(df1.index): # if all values are not-matched
        return none
    else:
        df = pd.DataFrame(
            df1.groupby('match_type')['match_type'].count().reset_index(name='count')
        )
        
        # add the non-matched row (DataFrame.append was removed in pandas 2.x)
        df = pd.concat([df, none])

        return df

Match attempts

Match 1 - 2007 on premises name, state, and postcode

  • Other than schools that have moved, these places should be the same
  • And for schools that have moved, the postcode test should ensure it's not too far

In [17]:
# first match attempt - set up file
filepath = '1999_referenda_output/polling_places.csv'
df_pp_1999 = import_1999_pp(filepath)

# double check none are somehow magically matched yet
print('before')
display(match_status(df_pp_1999))

# configure match settings
settings = dict(
    pp_filepath = 'federal_election_polling_places/pp_2007_election.csv',
    keys = ['state','premises','postcode'],
    match_source = '2007 Polling Places',
    match_type = 'Match 01 - state, premises, postcode'
)

# run match
df_pp_1999 = match_unmatched_polling_places(df_pp_1999, settings)

print('after')
display(match_status(df_pp_1999))


before
count match_type
0 7044 Not yet matched
after
count match_type
0 4883 Match 01 - state, premises, postcode
0 2161 Not yet matched

Match 2 through 4 - 2010 through 2016 on premises name, state, and postcode

  • Other than schools that have moved, these places should be the same
    • And for schools that have moved, the postcode test should ensure it's not too far

In [18]:
## 2
print('before 2')
display(match_status(df_pp_1999))

# configure match settings
settings = dict(
    pp_filepath = 'federal_election_polling_places/pp_2010_election.csv',
    keys = ['state','premises','postcode'],
    match_source = '2010 Polling Places',
    match_type = 'Match 02 - state, premises, postcode'
)

# run match
df_pp_1999 = match_unmatched_polling_places(df_pp_1999, settings)

print('after')
display(match_status(df_pp_1999))

## 3
print('before 3')
display(match_status(df_pp_1999))

# configure match settings
settings = dict(
    pp_filepath = 'federal_election_polling_places/pp_2013_election.csv',
    keys = ['state','premises','postcode'],
    match_source = '2013 Polling Places',
    match_type = 'Match 03 - state, premises, postcode'
)

# run match
df_pp_1999 = match_unmatched_polling_places(df_pp_1999, settings)

print('after')
display(match_status(df_pp_1999))

## 4
print('before 4')
display(match_status(df_pp_1999))

# configure match settings
settings = dict(
    pp_filepath = 'federal_election_polling_places/pp_2016_election.csv',
    keys = ['state','premises','postcode'],
    match_source = '2016 Polling Places',
    match_type = 'Match 04 - state, premises, postcode'
)

# run match
df_pp_1999 = match_unmatched_polling_places(df_pp_1999, settings)

print('after')
display(match_status(df_pp_1999))


before 2
count match_type
0 4883 Match 01 - state, premises, postcode
0 2161 Not yet matched
after
count match_type
0 4883 Match 01 - state, premises, postcode
1 107 Match 02 - state, premises, postcode
0 2054 Not yet matched
before 3
count match_type
0 4883 Match 01 - state, premises, postcode
1 107 Match 02 - state, premises, postcode
0 2054 Not yet matched
after
count match_type
0 4883 Match 01 - state, premises, postcode
1 107 Match 02 - state, premises, postcode
2 46 Match 03 - state, premises, postcode
0 2008 Not yet matched
before 4
count match_type
0 4883 Match 01 - state, premises, postcode
1 107 Match 02 - state, premises, postcode
2 46 Match 03 - state, premises, postcode
0 2008 Not yet matched
after
count match_type
0 4883 Match 01 - state, premises, postcode
1 107 Match 02 - state, premises, postcode
2 46 Match 03 - state, premises, postcode
3 27 Match 04 - state, premises, postcode
0 1981 Not yet matched
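
An aside: matches 2 through 4 (and 6 through 8 below) repeat the same settings with only the year and match number changing, so the cells above could equally be driven by a loop like this sketch. The notebook keeps them as separate cells so each before/after can be inspected:

# hypothetical consolidation of matches 2-4; not how the notebook actually runs
for i, year in enumerate([2010, 2013, 2016], start=2):
    settings = dict(
        pp_filepath = 'federal_election_polling_places/pp_%d_election.csv' % year,
        keys = ['state','premises','postcode'],
        match_source = '%d Polling Places' % year,
        match_type = 'Match %02d - state, premises, postcode' % i
    )
    df_pp_1999 = match_unmatched_polling_places(df_pp_1999, settings)
    display(match_status(df_pp_1999))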

Match 5 - 2007 polling places on polling place name, state, and postcode

  • This will match on the polling place name even where the venue differs, as long as it is in the same postcode
  • For the purposes of this analysis, this should be good enough


In [19]:
print('before 5')
display(match_status(df_pp_1999))

# configure match settings
settings = dict(
    pp_filepath = 'federal_election_polling_places/pp_2007_election.csv',
    keys = ['state','polling_place','postcode'],
    match_source = '2007 Polling Places',
    match_type = 'Match 05 - state, polling_place, postcode'
)

# run match
df_pp_1999 = match_unmatched_polling_places(df_pp_1999, settings)

print('after')
display(match_status(df_pp_1999))


before 5
count match_type
0 4883 Match 01 - state, premises, postcode
1 107 Match 02 - state, premises, postcode
2 46 Match 03 - state, premises, postcode
3 27 Match 04 - state, premises, postcode
0 1981 Not yet matched
after
count match_type
0 4883 Match 01 - state, premises, postcode
1 107 Match 02 - state, premises, postcode
2 46 Match 03 - state, premises, postcode
3 27 Match 04 - state, premises, postcode
4 1548 Match 05 - state, polling_place, postcode
0 433 Not yet matched

Match 6-8 - 2010-2016 polling places on polling place name, state, and postcode


In [20]:
print('before 6')
display(match_status(df_pp_1999))

# configure match settings
settings = dict(
    pp_filepath = 'federal_election_polling_places/pp_2010_election.csv',
    keys = ['state','polling_place','postcode'],
    match_source = '2010 Polling Places',
    match_type = 'Match 06 - state, polling_place, postcode'
)

# run match
df_pp_1999 = match_unmatched_polling_places(df_pp_1999, settings)

print('after')
display(match_status(df_pp_1999))

## 7
print('before 7')
display(match_status(df_pp_1999))

# configure match settings
settings = dict(
    pp_filepath = 'federal_election_polling_places/pp_2013_election.csv',
    keys = ['state','polling_place','postcode'],
    match_source = '2013 Polling Places',
    match_type = 'Match 07 - state, polling_place, postcode'
)

# run match
df_pp_1999 = match_unmatched_polling_places(df_pp_1999, settings)

print('after')
display(match_status(df_pp_1999))

## 8
print('before 8')
display(match_status(df_pp_1999))

# configure match settings
settings = dict(
    pp_filepath = 'federal_election_polling_places/pp_2016_election.csv',
    keys = ['state','polling_place','postcode'],
    match_source = '2016 Polling Places',
    match_type = 'Match 08 - state, polling_place, postcode'
)

# run match
df_pp_1999 = match_unmatched_polling_places(df_pp_1999, settings)

print('after')
display(match_status(df_pp_1999))


before 6
count match_type
0 4883 Match 01 - state, premises, postcode
1 107 Match 02 - state, premises, postcode
2 46 Match 03 - state, premises, postcode
3 27 Match 04 - state, premises, postcode
4 1548 Match 05 - state, polling_place, postcode
0 433 Not yet matched
after
count match_type
0 4883 Match 01 - state, premises, postcode
1 107 Match 02 - state, premises, postcode
2 46 Match 03 - state, premises, postcode
3 27 Match 04 - state, premises, postcode
4 1548 Match 05 - state, polling_place, postcode
5 10 Match 06 - state, polling_place, postcode
0 423 Not yet matched
before 7
count match_type
0 4883 Match 01 - state, premises, postcode
1 107 Match 02 - state, premises, postcode
2 46 Match 03 - state, premises, postcode
3 27 Match 04 - state, premises, postcode
4 1548 Match 05 - state, polling_place, postcode
5 10 Match 06 - state, polling_place, postcode
0 423 Not yet matched
after
count match_type
0 4883 Match 01 - state, premises, postcode
1 107 Match 02 - state, premises, postcode
2 46 Match 03 - state, premises, postcode
3 27 Match 04 - state, premises, postcode
4 1548 Match 05 - state, polling_place, postcode
5 10 Match 06 - state, polling_place, postcode
6 2 Match 07 - state, polling_place, postcode
0 421 Not yet matched
before 8
count match_type
0 4883 Match 01 - state, premises, postcode
1 107 Match 02 - state, premises, postcode
2 46 Match 03 - state, premises, postcode
3 27 Match 04 - state, premises, postcode
4 1548 Match 05 - state, polling_place, postcode
5 10 Match 06 - state, polling_place, postcode
6 2 Match 07 - state, polling_place, postcode
0 421 Not yet matched
after
count match_type
0 4883 Match 01 - state, premises, postcode
1 107 Match 02 - state, premises, postcode
2 46 Match 03 - state, premises, postcode
3 27 Match 04 - state, premises, postcode
4 1548 Match 05 - state, polling_place, postcode
5 10 Match 06 - state, polling_place, postcode
6 2 Match 07 - state, polling_place, postcode
7 4 Match 08 - state, polling_place, postcode
0 417 Not yet matched

Google geocoder

keys.json contains a Google Maps API key, so the key itself isn't in this notebook.


In [21]:
def get_google_api_key():
    filepath = 'config/keys.json'
    with open(filepath) as data_file:    
        data = json.load(data_file)
        
    key = data['google_maps']
    
    return key
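
For reference, the function above expects config/keys.json to look something like this (the 'google_maps' field name comes from the lookup above; the value here is just a placeholder):

{
    "google_maps": "YOUR_API_KEY_HERE"
}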

Google geocode example


In [22]:
def geocode_address(address):
    key = get_google_api_key()

    gmaps = googlemaps.Client(key=key)

    # restrict results to Australia via the geocoder's 'components' argument
    geocode_result = gmaps.geocode(address, components={'country': 'AU'})
    
    return geocode_result

In [23]:
display(geocode_address('Salt Creek Rd SALT CREEK 5275'))


[{'address_components': [{'long_name': 'Salt Creek Road',
    'short_name': 'Salt Creek Rd',
    'types': ['route']},
   {'long_name': 'Salt Creek',
    'short_name': 'Salt Creek',
    'types': ['locality', 'political']},
   {'long_name': 'Coorong District Council',
    'short_name': 'Coorong District Council',
    'types': ['administrative_area_level_2', 'political']},
   {'long_name': 'South Australia',
    'short_name': 'SA',
    'types': ['administrative_area_level_1', 'political']},
   {'long_name': 'Australia',
    'short_name': 'AU',
    'types': ['country', 'political']},
   {'long_name': '5264', 'short_name': '5264', 'types': ['postal_code']}],
  'formatted_address': 'Salt Creek Rd, Salt Creek SA 5264, Australia',
  'geometry': {'bounds': {'northeast': {'lat': -36.11577, 'lng': 139.7110613},
    'southwest': {'lat': -36.125792, 'lng': 139.647127}},
   'location': {'lat': -36.1201402, 'lng': 139.6792152},
   'location_type': 'GEOMETRIC_CENTER',
   'viewport': {'northeast': {'lat': -36.11577, 'lng': 139.7110613},
    'southwest': {'lat': -36.125792, 'lng': 139.647127}}},
  'place_id': 'ChIJt1A1Rdp0tmoRUfmf0QIpxuU',
  'types': ['route']}]

In [24]:
def geocode_polling_place(row, match_source = 'google geocode'):
    # build a one-line address; postcode arrives as a float like 2905.0,
    # so truncate its string to the digits (assumes a 4-digit postcode)
    address = row['address'] + ' ' + row['suburb'] + ' ' + str(row['postcode'])[:4]

    # geocode address
    geocode = geocode_address(address)

    # update the row
    if geocode:
        row['match_source'] = match_source
        row['match_type'] = geocode[0]['geometry']['location_type']
        row['latitude'] = geocode[0]['geometry']['location']['lat']
        row['longitude'] = geocode[0]['geometry']['location']['lng']
    else:
        row['match_source'] = match_source
        row['match_type'] = 'failed'
    
    # hand back a one-row data frame so callers can concat the results
    row = pd.DataFrame(row, index=[0])

    return row

In [25]:
# test the above function on a few rows
test = df_pp_1999.head(3)

results = []
for row in test.reset_index().to_dict('records'):
    results.append(geocode_polling_place(row, 'Match 09 - Google Geocode'))

# DataFrame.append was removed in pandas 2.x, so gather the rows and concat
geocode_test = pd.concat(results)

geocode_test


Out[25]:
address latitude longitude match_source match_type polling_place postcode premises state suburb wheelchair_access
0 Hurtle Ave -35.433363 149.082700 Match 09 - Google Geocode GEOMETRIC_CENTER Bonython 2905.0 Bonython Primary School ACT BONYTHON F
0 Casey Cres -35.440247 149.113499 Match 09 - Google Geocode GEOMETRIC_CENTER Calwell 2905.0 Calwell High School ACT CALWELL F
0 Blding 2 Level 3 Yamba Dr -35.346829 149.100011 Match 09 - Google Geocode ROOFTOP Canberra Hospital 2605.0 The Canberra Hospital ACT GARRAN F

Match the places still unmatched with the Google geocoder


In [26]:
# geocode every row not yet matched (one Google API call per row)
unmatched_places = df_pp_1999[df_pp_1999.match_source.isnull()]

results = []
for row in unmatched_places.reset_index().to_dict('records'):
    results.append(geocode_polling_place(row))

geocode_matches = pd.concat(results)

geocode_matches.head(5)


Out[26]:
address latitude longitude match_source match_type polling_place postcode premises state suburb wheelchair_access
0 Maclaurin Cres -35.350298 149.081043 google geocode GEOMETRIC_CENTER Chifley 2606.0 Melrose Primary School ACT CHIFLEY A
0 Bangalay Cres -35.345552 149.038070 google geocode GEOMETRIC_CENTER Rivett 2611.0 Rivett Primary School ACT RIVETT F
0 Bingle St -35.205468 149.045095 google geocode GEOMETRIC_CENTER Flynn 2615.0 Flynn Primary School ACT FLYNN N
0 54 Coranderrk St -35.280107 149.140103 google geocode RANGE_INTERPOLATED Reid 2612.0 Uniting Church Hall ACT REID N
0 Crofts Cres -35.198717 149.062818 google geocode GEOMETRIC_CENTER Spence 2615.0 Former Spence Primary School ACT SPENCE N

In [27]:
# reorder geocode in same pattern as non-matches
# get all keys from that table
keys = df_pp_1999.reset_index().columns.values

#reorder by that
geocode_matches_ordered = geocode_matches[keys]

# set state/polling_place back as the index so update() lines up on the key
keys = ['state','polling_place']
geocode_matches_ordered = geocode_matches_ordered.set_index(keys)

# update with matches
df_pp_1999.update(geocode_matches_ordered)

# where are we at?
display(match_status(df_pp_1999))


count match_type
0 21 APPROXIMATE
1 310 GEOMETRIC_CENTER
2 4883 Match 01 - state, premises, postcode
3 107 Match 02 - state, premises, postcode
4 46 Match 03 - state, premises, postcode
5 27 Match 04 - state, premises, postcode
6 1548 Match 05 - state, polling_place, postcode
7 10 Match 06 - state, polling_place, postcode
8 2 Match 07 - state, polling_place, postcode
9 4 Match 08 - state, polling_place, postcode
10 30 RANGE_INTERPOLATED
11 56 ROOFTOP
0 0 Not yet matched

Set the rest to the centroid of their suburb


In [28]:
filepath = 'from_abs/ssc_2016_aust_centroid.csv'

subtown_centroids = pd.read_csv(
    filepath
)

# create a column for abbreviated states
subtown_centroids['state'] = subtown_centroids['STE_NAME16'].apply(state_abbreviation)

# strip the brackets out of suburb names, and make them all caps, to match the 1999 file
subtown_centroids['suburb'] = subtown_centroids['SSC_NAME16'].apply(
    lambda x: left_of_bracket(x).upper()
)

display(subtown_centroids.head(5))


longitude latitude SSC_CODE16 SSC_NAME16 STE_CODE16 STE_NAME16 AREASQKM16 xcoord ycoord state suburb
0 149.802885 -32.879097 10001 Aarons Pass 1 New South Wales 82.7639 149.80289 -32.87910 NSW AARONS PASS
1 150.861203 -33.870091 10002 Abbotsbury 1 New South Wales 4.9788 150.86120 -33.87009 NSW ABBOTSBURY
2 151.129240 -33.850609 10003 Abbotsford (NSW) 1 New South Wales 1.0180 151.12924 -33.85061 NSW ABBOTSFORD
3 149.556431 -33.389973 10004 Abercrombie 1 New South Wales 2.9775 149.55643 -33.38997 NSW ABERCROMBIE
4 149.347666 -33.910344 10005 Abercrombie River 1 New South Wales 127.1701 149.34767 -33.91034 NSW ABERCROMBIE RIVER

In [29]:
print('before suburbs')
display(match_status(df_pp_1999))

# configure match settings
settings = dict(
    pp_filepath = 'from_abs/ssc_2016_aust_centroid.csv',
    keys = ['state','suburb'],
    match_source = 'ABS Suburb Centroids',
    match_type = 'Match 10 - suburb centroid'
)

# rows to try: not-yet-matched rows, plus rows the geocoder failed on
df_to_match = pd.concat([
    df_pp_1999[df_pp_1999.match_source.isnull()],
    df_pp_1999[df_pp_1999['match_type'] == 'failed']
])

# get matches
df1_matches = match_polling_places(
    df_to_match, # only run on empty rows
    subtown_centroids,
    settings
)

# set state/polling_place as the index so update() lines up on the key
keys = ['state','polling_place']
df1_matches = df1_matches.set_index(keys)

# update with matches
df_pp_1999.update(df1_matches)

print('after')
display(match_status(df_pp_1999))


before suburbs
count match_type
0 21 APPROXIMATE
1 310 GEOMETRIC_CENTER
2 4883 Match 01 - state, premises, postcode
3 107 Match 02 - state, premises, postcode
4 46 Match 03 - state, premises, postcode
5 27 Match 04 - state, premises, postcode
6 1548 Match 05 - state, polling_place, postcode
7 10 Match 06 - state, polling_place, postcode
8 2 Match 07 - state, polling_place, postcode
9 4 Match 08 - state, polling_place, postcode
10 30 RANGE_INTERPOLATED
11 56 ROOFTOP
0 0 Not yet matched
after
count match_type
0 21 APPROXIMATE
1 310 GEOMETRIC_CENTER
2 4883 Match 01 - state, premises, postcode
3 107 Match 02 - state, premises, postcode
4 46 Match 03 - state, premises, postcode
5 27 Match 04 - state, premises, postcode
6 1548 Match 05 - state, polling_place, postcode
7 10 Match 06 - state, polling_place, postcode
8 2 Match 07 - state, polling_place, postcode
9 4 Match 08 - state, polling_place, postcode
10 30 RANGE_INTERPOLATED
11 56 ROOFTOP
0 0 Not yet matched

How many are left now?


In [30]:
df_to_match = pd.concat([
    df_pp_1999[df_pp_1999.match_source.isnull()],
    df_pp_1999[df_pp_1999['match_type'] == 'failed']
])
display(df_to_match)


premises address suburb postcode wheelchair_access match_source match_type latitude longitude
state polling_place

Hooray! WE HAVE A RESULT FOR EVERYWHERE

Let's write a CSV:


In [33]:
df_pp_1999.to_csv(
    '1999_referenda_output/polling_places_geocoded.csv',
    sep = ','
)
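
As a final sanity check (not in the original notebook), the file can be read back to confirm every row carries a co-ordinate:

# expect 7044 rows (per the match counts above) and zero missing co-ordinates
check = pd.read_csv('1999_referenda_output/polling_places_geocoded.csv')
print('Rows:', len(check.index))
print('Missing latitudes:', check['latitude'].isnull().sum())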