Benford's Law



In [465]:

    
import pandas as pd
import wbdata
from IPython.display import Latex, HTML, YouTubeVideo
%pylab inline









    



Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

Created by Narasimha Prasad Prabhu and published under the MIT licence

Theory



In [2]:

    
HTML('<iframe src=http://en.wikipedia.org/wiki/Benford%27s_law width=700 height=350></iframe>')









    Out[2]:

Benford's law is a neat piece of mathematics which essentially states that given a large data set with a large distribution, the numbers with leading digit as 1 occurs about 30% of the time. Now this is obviously counter intuitive; given a set of numbers, the probability of a number starting with 0-9 should be equal for all the numbers.

A set of numbers is said to satisfy Benford's law if the leading digit $(d\\in{1,\\ldots,9})$ occurs with probability

$$P(d)=log_{10}(d+1)-log_{10}(d)=log_{10}(1+\\frac{1}{d})$$



In [469]:

    
benford=[log10(1+1.0/i) for i in xrange(1,10)]
x = range(1,10)
fig = figure()
ax = fig.add_subplot(111)
ax.bar(x,benford, align='center')
ax.set_title("Benford's Law")
ax.set_xlabel(r'$d$', fontsize=20)
ax.set_xticks(x)
ax.set_ylabel(r'$log_{10}(1+\frac{1}{d})$', fontsize=20)









    Out[469]:





<matplotlib.text.Text at 0x24212a10>

Time to test it out

I'm using the wbdata package to get 'Total Population' data from the World Bank. First we search for the string which gives us the search indicator to obtain. Followed by the actual gathering of the data. The data is contains multiple indexes as it contians population data for all countries from the year 1960. So, we unstack the data first before we start processing it



In [470]:

    
wbdata.search_indicators('population, total')









    



SP.POP.TOTL	Population, total



In [471]:

    
data = wbdata.get_dataframe({'SP.POP.TOTL':'population, total'})
datau = data.unstack(1)

It's not necessay, but it's always nice to have pretty pictures. In this case it also made me realise, that there is probably some data that we do not require. Ideally we would only have data from each countries, but this data set also has population of different income categories and the total world population



In [472]:

    
pop_tot = pd.DataFrame(datau, columns=[('population, total', '2012')])
fig1 = pop_tot.plot(kind='bar',legend=False, fontsize=10, figsize=(40,5))
fig1.set_ylabel('Population')









    Out[472]:





<matplotlib.text.Text at 0x246839d0>

First we develop an exclusion list and filter the data to only include the countries of the world. Next, we plot the data like in the previous cell and we have a nice graph showing China and India as the countries with the largest population.



In [473]:

    
exclude_list = ('East Asia & Pacific (all income levels)',
                    'East Asia & Pacific (developing only)',
                    'Euro area', 'Europe & Central Asia (all income levels)',
                    'Europe & Central Asia (developing only)',
                    'European Union', 'Heavily indebted poor countries (HIPC)',
                    'High income', 'High income: OECD', 'High income: nonOECD',
                    'Latin America & Caribbean (all income levels)',
                    'Latin America & Caribbean (developing only)',
                    'Least developed countries: UN classification',
                    'Low & middle income', 'Low income', 'Lower middle income',
                    'Middle East & North Africa (all income levels)',
                    'Middle East & North Africa (developing only)',
                    'Middle income', 'Not classified', 'OECD members',
                    'South Asia',
                    'Sub-Saharan Africa (all income levels)',
                    'Sub-Saharan Africa (developing only)',
                    'Upper middle income','World')
pop_tot_filt=pd.DataFrame(datau, columns=[('population, total', '2012')], index=pop_tot.index-exclude_list)
fig2=pop_tot_filt.plot(kind='bar',legend=False, fontsize=10, figsize=(40,5))
fig2.set_ylabel('Population')









    Out[473]:





<matplotlib.text.Text at 0x24c74d90>

Next, we define a small function to return the first digit of all the population data we just obtained.



In [474]:

    
def frst_dig(pop_data):
    return int(str(pop_data)[0])

Time to get all that data in an array(Almost there!)



In [475]:

    
benford_pop = [frst_dig(pop_tot_filt.iat[i,0]) for i in xrange(0,len(pop_tot_filt.index))]

Graphing time and voila, it seems Benford's law does hold up for the population data of 2012



In [476]:

    
fig3 = figure()
ax3 = fig3.add_subplot(111)
ax3.hist(benford_pop, normed=True, bins=range(1,11), label='Sample Data')
ax3.bar(x,benford, alpha=0.35, color='red', width=1, label='Benford')
ax3.set_xticks(x)
ax3.legend()
ax3.grid()
ax3.set_ylabel('Probability')









    Out[476]:





<matplotlib.text.Text at 0x27221ed0>

Youtube video from numberphile

Do check out, this video from numberphile detailing the Benford's Law.



In [468]:

    
YouTubeVideo('XXjlR2OK1kM')









    Out[468]: