Benford's Law and Toronto's Mayoral Results

Benford's law describes the frequency distribution of leading digits in many real-life sets of numerical data, and it has been used, among other things, to detect accounting fraud. Here I'm going to parse the City of Toronto's 2014 Mayoral Election results and see if the vote counts follow Benford's Law.
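For a leading digit d in {1, ..., 9}, the law predicts the proportion

P(d) = log10(1 + 1/d)

so the digit 1 should lead roughly 30.1% of the values while 9 should lead only about 4.6%.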

You can get the raw data in xls format from here. I've written a script to convert the raw xls Mayoral results into a normalized csv file. The script and the csv are available here.
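The conversion is roughly along these lines. This is only a sketch: the filename, sheet layout, and column names below are assumptions, and the real spreadsheet needs more cleanup than this.

# Sketch of the xls-to-csv normalization (hypothetical layout: one sheet per
# ward, a 'Candidate' column, and one column of vote counts per subdivision).
import pandas as pd

sheets = pd.read_excel('MAYOR.xls', sheet_name=None)  # dict of DataFrames, one per sheet
frames = []
for ward, sheet in sheets.items():
    melted = pd.melt(sheet, id_vars='Candidate',
                     var_name='Subdivision', value_name='Votes')
    melted['Ward'] = ward
    frames.append(melted)

normalized = pd.concat(frames)
normalized = normalized[normalized['Votes'] > 0]  # keep only non-zero counts
normalized.to_csv('mayor.csv', index=False, encoding='utf-8')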


In [22]:
import pandas as pd
import math
import matplotlib.pyplot as plt
from collections import Counter

import seaborn as sns
%matplotlib inline

df_raw_data = pd.read_csv('mayor.csv',  encoding='utf-8')
datapoints = len(df_raw_data.index)
print str(datapoints) + " datapoints in the raw dataset!"


22602 datapoints in the raw dataset!

In [23]:
raw_benford = [ math.log10(1+1.0/i) for i in xrange(1,10) ]
df_benford = pd.DataFrame({ 'Digit': range(1,10), "Benford's Law": raw_benford })

ax = sns.barplot("Digit", "Benford's Law", data=df_benford, color="#4F628E", label="Probability")
ax = ax.set(ylabel='Probability', title="Benford's Law " + r'$\log_{10}(1+\frac{1}{d})$')



In [24]:
def first_digit(input_number):
    return int(str(input_number)[0])


macro_data = df_raw_data.groupby(by=['Candidate'])['Votes'].sum()
num_candidates = len(macro_data)


df_firstdigit = pd.DataFrame( { 'Candidates': macro_data.index, 'First Digit':  map( first_digit, macro_data) })
ser_highlevel = df_firstdigit['First Digit'].value_counts(normalize=True)
df_highlevel = pd.DataFrame( data={ 'Digit': ser_highlevel.index, 'Probability': ser_highlevel.values } ).sort('Digit')
df_highlevel = pd.merge( df_highlevel, df_benford )

def plot_figs( dataframe, legend_lbls ):
    ax = sns.barplot("Digit", "Probability", data=dataframe, color="#7887AB" )
    dataframe["Benford's Law"].plot(kind='bar', width = .5, color =  '#4F628E', grid=False)
    ax.set_xticklabels(labels=xrange(1,10), rotation=0 )
    ax.set(ylabel = "Probability", title="Benford's Law and Toronto's Election Results" )
    main_lbl= plt.Rectangle((0,0),1,1, fc="#7887AB", edgecolor = 'none')
    benford_lbl= plt.Rectangle((0,0),1,1, fc="#4F628E", edgecolor = 'none')
    ax.legend( [main_lbl, benford_lbl], legend_lbls,  loc=1, ncol = 2)

plot_figs(df_highlevel,['High-level Mayoral Results', "Benford's Law"] )


While the candidate totals seem to roughly fit Benford's Law, there are only 65 candidates. Why stick to 65 datapoints when the dataset has 22602 non-zero datapoints at the subdivision (below the ward) level for all the candidates?
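Before switching to the granular data, a quick sanity check that every row really is a non-zero vote count (an assumption about how the conversion script behaved), so each row contributes a valid leading digit:

# Every subdivision-level row should have at least one vote.
print((df_raw_data['Votes'] > 0).all())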


In [27]:
# This data is the most granular possible
df_raw_data['First Digit'] = map( first_digit, df_raw_data['Votes'] )
digit_distrib = df_raw_data['First Digit'].value_counts(normalize=True)

df_alldata = pd.DataFrame( data={ 'Digit': digit_distrib.index, 'Probability': digit_distrib.values } )
df_alldata = pd.merge(df_alldata, df_benford )

plot_figs(df_alldata,['Granular Election Results', "Benford's Law"] )


I had a hunch that the granular data was skewed towards the digit one because of the abundance of candidates inhabiting the long tail beyond the top three: 62 candidates, to be precise. It's relatively cheap and easy to get yourself on the ballot for a mayoral race in Toronto, and these candidates generally get at most one vote per subdivision. If I was right, plotting only the top candidates would correct the bias towards the digit one, while plotting only the long-tail candidates would be skewed towards one even more heavily.
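A quick way to test the hunch directly, using the 5,000-vote cutoff applied in the next cell, is to check how often a long-tail candidate's subdivision count is exactly one:

# Share of subdivision-level rows where a long-tail candidate received exactly one vote.
tail_candidates = macro_data[ macro_data < 5000 ].index
tail_rows = df_raw_data[ df_raw_data['Candidate'].isin(tail_candidates) ]
print("{0:.1%} of long-tail rows are single votes".format((tail_rows['Votes'] == 1).mean()))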


In [29]:
# Split the field at 5000 total votes: the top of the race vs. the long tail
ser_top_candidates = macro_data[ macro_data > 5000 ]
ser_bot_candidates = macro_data[ macro_data < 5000 ]

df_top_raw = df_raw_data[ df_raw_data['Candidate'].isin( ser_top_candidates.index ) ] 
df_bot_raw = df_raw_data[ df_raw_data['Candidate'].isin( ser_bot_candidates.index ) ] 

top3_datapoints = len(df_top_raw)

wipeout = ( float( datapoints - top3_datapoints) / float(datapoints) )* 100
print str(wipeout)+"%"


76.5728696576%

Removing those candidates wiped out roughly 76% of the datapoints! Talk about a fat long tail (in terms of datapoints, not votes).
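To put numbers on that, the variables already defined above can compare the top candidates' share of the votes with their share of the rows (a side calculation, not one of the original cells):

# Top candidates' share of total votes vs. share of datapoints.
vote_share = float(ser_top_candidates.sum()) / macro_data.sum()
row_share = float(top3_datapoints) / datapoints
print("Top 3 vote share: {0:.1%}".format(vote_share))
print("Top 3 datapoint share: {0:.1%}".format(row_share))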


In [31]:
ser_digit_distrib = df_top_raw['First Digit'].value_counts(normalize=True)
df_top3distrib = pd.DataFrame( data={ 'Digit': ser_digit_distrib.index, 'Probability': ser_digit_distrib.values } ).sort('Digit')
df_top3distrib = pd.merge(df_top3distrib, df_benford )

plot_figs(df_top3distrib,['Top 3 Candidates', "Benford's Law"] )



In [34]:
digit_distrib = df_bot_raw['First Digit'].value_counts(normalize=True)
df_bot62distrib = pd.DataFrame( data={ 'Digit': digit_distrib.index, 'Probability': digit_distrib.values } ).sort('Digit')
df_bot62distrib = pd.merge(df_bot62distrib, df_benford )

plot_figs(df_bot62distrib,['Remaining 62 Candidates', "Benford's Law"] )


My hunch was correct. Plotting only the datapoints for the top 3 candidates produced a much closer fit to Benford's law, because the long tail of other candidates was collecting a lot of single votes at the subdivision level.
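One way to go beyond eyeballing the bars (a rough sketch, not part of the original analysis) is to compute each distribution's mean absolute deviation from the Benford probabilities; a smaller number means a closer fit:

# Mean absolute deviation from Benford's Law for each subset,
# using the merged dataframes built above.
for name, df in [('All datapoints', df_alldata),
                 ('Top 3 candidates', df_top3distrib),
                 ('Remaining candidates', df_bot62distrib)]:
    mad = (df['Probability'] - df["Benford's Law"]).abs().mean()
    print("{0}: MAD = {1:.4f}".format(name, mad))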

I was inspired by this notebook.

Learn more at DataGenetics and of course at Wikipedia.