Benford for Python

Current version: 0.1.0.3

Installation

As of Dec 2017, Benford for Python is a package on PyPI, so you can install it with pip:

$ pip install benford_py

Or you can cd into the site-packages subfolder of your Python distribution (or environment) and clone from there:

$ git clone http://github.com/milcent/Benford_py.git

Demo

This demo assumes you have (at least) some familiarity with Benford's Law.

First let's import some libraries and the benford module.


In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
#import pandas_datareader.data as web # Not a dependency, but we'll need it now.

In [2]:
import benford as bf

Quick start

Getting some public data: the S&P 500 ETF (SPY) quotes, up until March 2017 (the last rows are shown below).


In [5]:
sp = pd.read_csv('data/SPY.csv', index_col='Date', parse_dates=True)

Creating simple and log return columns


In [5]:
#adding '_' to facilitate handling the column
#sp.rename(columns={'Adj Close':'Adj_Close'}, inplace=True) 
sp['p_r'] = sp.Close/sp.Close.shift()-1        #simple returns
sp['l_r'] = np.log(sp.Close/sp.Close.shift())  #log returns
sp.tail()


Out[5]:
                  Open        High         Low       Close   Adj Close    Volume       p_r       l_r
Date
2017-03-27  231.929993  233.919998  231.610001  233.619995  231.338181  87454500 -0.001026 -0.001027
2017-03-28  233.270004  235.809998  233.139999  235.320007  233.021576  93483900  0.007277  0.007250
2017-03-29  234.990005  235.809998  234.729996  235.539993  233.239410  61950400  0.000935  0.000934
2017-03-30  235.470001  236.520004  235.270004  236.289993  233.982101  56737900  0.003184  0.003179
2017-03-31  235.899994  236.509995  235.679993  235.740005  233.437485  73733100 -0.002328 -0.002330

First Digits Test

Let us see if the SPY log returns conform to Benford's Law.


In [6]:
f1d = bf.first_digits(sp.l_r, digs=1, decimals=8) # digs=1 for the first digit (1-9)


Initialized sequence with 6029 registries.

Test performed on 6029 registries.
Discarded 0 records < 1 after preparation.

The first_digits function draws the plot (by default), with bars for the frequencies found for each digit and a line for the expected Benford proportions.

It also returns a DataFrame with the Counts, Found proportions and Expected proportions for each digit in the data studied.


In [7]:
f1d


Out[7]:
             Counts     Found  Expected
First_1_Dig
1              1849  0.306684  0.301030
2               963  0.159728  0.176091
3               679  0.112622  0.124939
4               537  0.089069  0.096910
5               496  0.082269  0.079181
6               455  0.075469  0.066947
7               388  0.064356  0.057992
8               352  0.058384  0.051153
9               310  0.051418  0.045757
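
The Expected column follows directly from Benford's Law: a number starts with the digit group d with probability log10(1 + 1/d). The sketch below reproduces that column with NumPy only. It is an illustration, not the benford module's code, and the helper name benford_expected is made up for this example.

```python
import numpy as np
import pandas as pd

def benford_expected(digs=1):
    """Hypothetical helper: expected Benford proportions for the first `digs` digits."""
    # The digit groups run over 1-9, 10-99 or 100-999, depending on `digs`.
    d = np.arange(10 ** (digs - 1), 10 ** digs)
    return pd.Series(np.log10(1 + 1 / d), index=d, name="Expected")

expected_1d = benford_expected(1)
print(expected_1d.round(6))
```

Calling benford_expected(2) or benford_expected(3) gives the reference proportions for the First Two and First Three Digits tests in the same way.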

First Two Digits


In [8]:
f2d = bf.first_digits(sp.l_r, digs=2, decimals=8) # Note the parameter digs=2!


Initialized sequence with 6029 registries.

Test performed on 6029 registries.
Discarded 0 records < 10 after preparation.

In [9]:
f2d.head()


Out[9]:
             Counts     Found  Expected
First_2_Dig
10              303  0.050257  0.041393
11              231  0.038315  0.037789
12              208  0.034500  0.034762
13              242  0.040139  0.032185
14              178  0.029524  0.029963

In [10]:
f2d.tail()


Out[10]:
             Counts     Found  Expected
First_2_Dig
95               34  0.005639  0.004548
96               29  0.004810  0.004501
97               32  0.005308  0.004454
98               31  0.005142  0.004409
99               30  0.004976  0.004365

Assessing conformity

There are some tests to evaluate more precisely whether the data studied is a good fit to Benford's Law.

The first we'll use is the Z statistic for the proportions.

In the digits functions, you can turn it on by setting the parameter confidence, which tells the function which confidence level to consider after calculating the Z score for each proportion.


In [11]:
# For a significance of 5%, a confidence of 95
f2d = bf.first_digits(sp.l_r, digs=2, decimals=8, confidence=95)


Initialized sequence with 6029 registries.

Test performed on 6029 registries.
Discarded 0 records < 10 after preparation.

The entries with the significant positive deviations are:

             Expected     Found   Z_score
First_2_Dig                              
67           0.006434  0.010449  3.818945
13           0.032185  0.040139  3.463147
10           0.041393  0.050257  3.423007
66           0.006531  0.009786  3.057914
72           0.005990  0.008459  2.400701
82           0.005264  0.007298  2.093307
75           0.005752  0.007796  2.012781

Some things happened:

  • It printed a DataFrame with the significant positive deviations, in descending order of the Z score.
  • In the plot, it added upper and lower boundaries around the Benford Expected line, based on the confidence level set by the parameter. Accordingly, it changed the colors of the bars whose proportions fell below or above those boundaries, for better visualization.
  • The confidence parameter takes the following values other than None: 80, 85, 90, 95, 99, 99.9, 99.99, 99.999, 99.9999 and 99.99999.
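
For reference, the Z scores listed above can be approximated by hand. The sketch below assumes the usual Z statistic for a single proportion with a continuity correction of 1/(2N); that formula is an assumption about how the value is obtained, not a copy of the module's code, and the function name z_score is made up for this example. The inputs are the rounded values printed above for the digits 67.

```python
import numpy as np

def z_score(found, expected, n):
    """Z statistic for one proportion, with a continuity correction (assumed formula)."""
    deviation = abs(found - expected)
    correction = 1 / (2 * n)
    # The correction is only applied when it is smaller than the deviation itself.
    if correction < deviation:
        deviation -= correction
    return deviation / np.sqrt(expected * (1 - expected) / n)

# Rounded values taken from the First Two Digits output for the digits 67.
z_67 = z_score(found=0.010449, expected=0.006434, n=6029)
print(round(z_67, 2))  # 3.82
```

With confidence=95 the two-sided critical value is about 1.96, which is why every entry listed above has a Z score larger than that.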

    Other tests

    We can do all this with the First Three Digits, Second Digit and the Last Two Digits tests too.

    
    
    In [12]:
    # First Three Digits Test, now with 99% confidence level
    # digs=3 for the first three digits
    f3d = bf.first_digits(sp.l_r, digs=3, decimals=8, confidence=99)
    
    
    
    
    Initialized sequence with 6029 registries.
    
    Test performed on 6029 registries.
    Discarded 0 records < 100 after preparation.
    
    The entries with the significant positive deviations are:
    
                 Expected     Found   Z_score
    First_3_Dig                              
    952          0.000456  0.001659  4.072761
    962          0.000451  0.001493  3.504961
    695          0.000624  0.001825  3.472363
    997          0.000435  0.001327  3.009661
    823          0.000527  0.001493  2.984556
    139          0.003113  0.005308  2.942979
    676          0.000642  0.001659  2.862410
    945          0.000459  0.001327  2.843428
    751          0.000578  0.001493  2.687925
    874          0.000497  0.001327  2.604667
    
    
    
    In [13]:
    # The First Three Digits plot is better seen and zoomed in and out without the inline plotting.
    # Try %matplotlib
    

    There are also the Second Digit test and the Last Two Digits test, as shown below.

    
    
    In [14]:
    # Second Digit Test
    sd = bf.second_digit(sp.l_r, decimals=8, confidence=95)
    
    
    
    
    Initialized sequence with 6029 registries.
    
    Test performed on 6029 registries.
    Discarded 0 records < 10 after preparation.
    
    The entries with the significant positive deviations are:
    
             Expected     Found   Z_score
    Sec_Dig                              
    0        0.119679  0.128545  2.101085
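
The Expected value for the second digit is not a single logarithm: it is the sum of the first-two-digits probabilities over all nine possible first digits. A small sketch of that calculation with NumPy only (the function name second_digit_expected is made up for this example and is not part of the benford module):

```python
import numpy as np

def second_digit_expected():
    """Hypothetical helper: expected Benford proportions for the second digit (0-9)."""
    first_two = np.arange(10, 100)
    probs = np.log10(1 + 1 / first_two)
    # Rows are the first digits 1-9, columns are the second digits 0-9.
    return probs.reshape(9, 10).sum(axis=0)

sec_expected = second_digit_expected()
print(sec_expected.round(6))
```

The first entry, for the digit 0, comes out at about 0.119679, the Expected value shown in the output above.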
    
    
    
    In [16]:
    # Last Two Digits Test
    l2d = bf.last_two_digits(sp.l_r, decimals=8, confidence=90)
    
    
    
    
    Initialized sequence with 6029 registries.
    
    Test performed on 6029 registries.
    Discarded 0 records < 1000 after preparation
    
    The entries with the significant positive deviations are:
    
                Expected     Found   Z_score
    Last_2_Dig                              
    2           0.010101  0.013103  2.266925
    99          0.010101  0.012937  2.138130
    

    Other Important Parameters

  • *digs*: only used in the First Digits function, to tell it which test to run: 1- First Digits; 2- First Two Digits; and 3- First Three Digits.
  • *decimals*: informs the number of decimal places to consider. Defaults to 2, for currencies, but I set it to 8 here, since we are dealing with log returns (long floats). If the sequence is of integers, set it to 0. You may also set it to *infer* if you don't know exactly or if the data has registries with different numbers of decimal places, and it will treat every registry separately.
  • *sign*: tells which portion of the data to consider. *pos*: only the positive entries; *neg*: only the negative ones; *all*: all entries but zeros. Defaults to *all*.
  • *inform*: gives information about the test during its run, like the number of registries analysed, the number of registries discarded according to each test (e.g., < 10 for the First Two Digits), and shows the top Z scores of the resulting DataFrame if *confidence* is not None.
  • *high_Z*: chooses which Z scores to use when displaying results, according to the confidence level chosen. Defaults to *pos*, which will return only values higher than the expected frequencies; *neg* will return only values lower than the expected frequencies; *all* will return both extremes (positive and negative); and an integer will return the first n entries, positive and negative, regardless of whether Z is higher than the confidence or not.
  • *limit_N*: sets a limit to the sample size for the calculation of the Z scores. This may be useful if the sample is too big, due to the Z test power problem. Defaults to None.
  • *show_plot*: draws the test plot. Defaults to True. Note that if *confidence* is not None, the plot will highlight the bars outside the lower and upper boundaries, regardless of the *high_Z* value.
  • *MAD* and *MSE*: calculate, respectively, the Mean Absolute Deviation and the Mean Squared Error of the sample, for each test. Default to False. Both can be used inside the test functions or separately, in their own functions, mad and mse.
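
To make the roles of *decimals* and *digs* more concrete, here is a rough sketch of one way the data could be prepared before counting digits: take absolute values, shift the chosen number of decimal places into the integer part, discard what is too small, and keep the leading digits. This is an assumption for illustration only, not the module's actual implementation, and first_digits_of is a hypothetical helper.

```python
import pandas as pd

def first_digits_of(values, digs=1, decimals=2):
    """Hypothetical helper, not the benford module's own code."""
    series = pd.Series(values, dtype=float).dropna()
    # Drop the sign and move `decimals` decimal places into the integer part.
    ints = (series.abs() * 10 ** decimals).astype("int64")
    # Records with fewer than `digs` digits cannot be tested and are discarded.
    ints = ints[ints >= 10 ** (digs - 1)]
    # Keep only the leading `digs` digits of each remaining integer.
    return ints.astype(str).str[:digs].astype("int64")

log_returns = [-0.001027, 0.007250, 0.000934]
print(first_digits_of(log_returns, digs=1, decimals=8).tolist())   # [1, 7, 9]
print(first_digits_of([0.5, 12.34], digs=1, decimals=0).tolist())  # [1]
```

Under this scheme, the same log returns with decimals=2 would all truncate to zero and be discarded, which is why a larger value such as 8 suits long floats.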
    MAD

    The Mean Absolute Deviation, or MAD, is, as the name states, the average of all absolute deviations between the found proportions and the expected Benford proportions.

    Drake and Nigrini (2000) developed this model, later revised by Nigrini (2001), using empirical data to set limits of conformity for the First, First Two, First Three and Second Digits tests.

    The MAD averages the proportions, so it is not directly influenced by the sample size. The lower the MAD, the better the conformity.
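
As a quick check of the definition, the MAD for the First Digits test can be recomputed by hand from the Counts column shown earlier. The sketch below uses NumPy only and does not call the benford module.

```python
import numpy as np

# Counts for the first digits 1-9, taken from the First Digits Test output.
counts = np.array([1849, 963, 679, 537, 496, 455, 388, 352, 310])

found = counts / counts.sum()                    # found proportions
expected = np.log10(1 + 1 / np.arange(1, 10))    # Benford expected proportions

# Average of the absolute differences between found and expected.
mad_by_hand = np.abs(found - expected).mean()
print(round(mad_by_hand, 6))  # 0.008116
```

The result agrees, to the precision shown, with the value returned by the mad function for the First Digits test.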

    
    
    In [17]:
    mad1 = bf.mad(sp.l_r, test=1, decimals=8) # test=1 : MAD for the First Digits
    mad1
    
    
    
    
    Out[17]:
    0.00811560097616021

    Note that you must choose the test parameter, since there is one MAD for each test.

  • First Digit: *1* or *'F1D'*;
  • First Two Digits: *2* or *'F2D'*;
  • First Three Digits: *3* or *'F3D'*;
  • Second Digit: *22* or *'SD'*;
  • Last Two Digits: *-2* or *'L2D'*; *# pythonic*
    
    In [18]:
    mad2 = bf.mad(sp.l_r, test=2, decimals=8) # test=2 : MAD for the First Two Digits
    mad2
    
    
    
    
    Out[18]:
    0.001414903612544444
    
    
    In [19]:
    mad3 = bf.mad(sp.l_r, test=3, decimals=8) # test=3 : MAD for the First Three Digits
    mad3
    
    
    
    
    Out[19]:
    0.00034020108565942513
    
    
    In [20]:
    mad_sd = bf.mad(sp.l_r, test=22, decimals=8) # test=22 : MAD for the Second Digits
    mad_sd
    
    
    
    
    Out[20]:
    0.00427937420604161
    
    
    In [21]:
    mad_l2d = bf.mad(sp.l_r, test=-2, decimals=8) # test=-2 : MAD for the Last Two Digits
    mad_l2d
    
    
    
    
    Out[21]:
    0.000980262066677724

    Or you can set the MAD parameter to True when running the test functions, and it will also give the corresponding conformity limits (as long as inform is also True).

    
    
    In [22]:
    f2d = bf.first_digits(sp.l_r, digs=2, decimals=8, MAD=True, show_plot=False)
    
    
    
    
    Initialized sequence with 6029 registries.
    
    Test performed on 6029 registries.
    Discarded 0 records < 10 after preparation.
    
    The Mean Absolute Deviation is 0.001414903612544444
    For the First Two Digits:
                - 0.0000 to 0.0012: Close Conformity
                - 0.0012 to 0.0018: Acceptable Conformity
                - 0.0018 to 0.0022: Marginally Acceptable Conformity
                - Above 0.0022: Nonconformity
    
    
    
    In [23]:
    sd = bf.second_digit(sp.l_r, decimals=8, MAD=True, show_plot=False)
    
    
    
    
    Initialized sequence with 6029 registries.
    
    Test performed on 6029 registries.
    Discarded 0 records < 10 after preparation.
    
    The Mean Absolute Deviation is 0.00427937420604161
    For the Second Digits:
                - 0.0000 to 0.008: Close Conformity
                - 0.008 to 0.01: Acceptable Conformity
                - 0.01 to 0.012: Marginally Acceptable Conformity
                - Above 0.012: Nonconformity
    

    Mantissas

    The mantissa is the decimal part of a logarithm. In a Benford data set, the mantissas of the registries' logs are uniformly distributed, such that, when ordered, they should form a straight line in the interval [0,1), with slope 1/N, N being the sample size.
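
A minimal sketch of what a mantissa is, using NumPy only: take the base-10 logarithm of the absolute value and keep its fractional part. The function name mantissas_of is made up for this example, and the module may treat values below 1 differently, so this is an illustration of the concept rather than its implementation.

```python
import numpy as np

def mantissas_of(values):
    """Hypothetical helper: fractional part of log10(|x|), always in [0, 1)."""
    arr = np.abs(np.asarray(values, dtype=float))
    arr = arr[arr > 0]           # log10 is undefined for zero; NaN is dropped too
    return np.log10(arr) % 1     # the modulo keeps negative logs inside [0, 1)

# Numbers with the same digits but different magnitudes share one mantissa.
m = mantissas_of([2, 20, 0.02, -200])
print(m.round(5))  # [0.30103 0.30103 0.30103 0.30103]
```

For a uniform distribution on [0,1), the mean is 0.5 and the variance is 1/12, which is where the reference values 0.5 and 0.083333 in the output below come from.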

    
    
    In [24]:
    mant = bf.mantissas(sp.l_r, inform=True, show_plot=True)
    
    
    
    
    The Mantissas MEAN is 0.4920278832245962. 		Ref: 0.5.
    The Mantissas VARIANCE is 0.08976797626032448. 	Ref: 0.083333.
    The Mantissas SKEWNESS is 0.05262788543118481. 	Ref: 0.
    The Mantissas KURTOSIS is -1.2797511488052609. 	Ref: -1.2.
    
    
    
    In [25]:
    mant.hist(bins=30, figsize=(12,5))
    
    
    
    
    Out[25]:
    <matplotlib.axes._subplots.AxesSubplot at 0x7f2762c15ac8>

    That's it for now.

    If you have a data set that you think would be nice to study with Benford tests, share it and we can post a notebook with all tests and comments.

    Thanks

    Milcent

    
    