Matthieu Antony, at AEDES, developed an algorithm to orient the program's supervision. The algorithm works in the following steps:

  1. Computation of the aggregate level of payment diminution for each indicator
  2. Determination of the subset of N indicators that causes most of the payment diminutions
  3. Computation of a weighted correction rate of these indicators for each facility, using a priori weights reflecting the qualitative interest of each activity
  4. Classification of facilities into three risk levels, based on their weighted correction rates, correction amounts, and predefined thresholds
  5. Monthly sampling of facilities, with sampling probabilities depending on each facility's class
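These steps can be sketched end-to-end on toy data. This is a minimal sketch: the facility and indicator names, the amounts, and the normalization of the weighted correction by claimed payments are illustrative assumptions, not the program's actual figures or schema.

```python
import pandas as pd

## Toy corrections (claimed minus verified payment) per indicator and facility
corrections = pd.DataFrame(
    {'A': [30.0, 15.0, 9.0, 6.0], 'B': [20.0, 10.0, 6.0, 4.0]},
    index=['consultations', 'deliveries', 'vaccinations', 'referrals'])
claimed = pd.Series({'A': 400.0, 'B': 300.0})  ## total claimed payment per facility

## Steps 1-2: total correction per indicator, keep those covering ~80% of the risk
delta_i = corrections.sum(axis=1).sort_values(ascending=False)
cumulative_share = (delta_i / delta_i.sum()).cumsum()
critical = cumulative_share[cumulative_share < 0.8].index

## Step 3: weighted correction per facility, weights relative to the largest correction,
## expressed here (an assumption) as a rate of claimed payments
weights = delta_i / delta_i.max()
weighted = corrections.loc[critical].mul(weights.loc[critical], axis=0).sum()
rate = weighted / claimed

## Step 4: two-dimensional classification on payment level and weighted rate
q40 = claimed.quantile(0.4)

def classify(facility):
    high_pay, high_rate = claimed[facility] > q40, rate[facility] > 0.1
    if high_pay and high_rate:
        return 'red'
    if not high_pay and not high_rate:
        return 'green'
    return 'orange'

classes = {f: classify(f) for f in claimed.index}
print(classes)
```

Step 5, the monthly sampling, is illustrated further down once the supervision probabilities are introduced.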

In [4]:
import sys
sys.path.insert(0, '../../src/monitoring_algorithms/')
from aedes_algorithm import *

In [5]:
%matplotlib inline

store = pd.HDFStore('../../data/processed/orbf_benin.h5')
tarifs = store['tarifs']
data = store['data']
store.close()

data = data[data.period.apply(str).str[0:4] == '2016']
data = data[data.date < '2016-11']
data_test = data[data.geozone_name.isin(bm_zones)]
data_test = data_test.set_index('indicator_label')

## Computing claimed and verified payments

data_test['claimed_payment'] = data_test.indicator_claimed_value * data_test['indicator_tarif']
data_test['verified_payment'] = data_test.indicator_verified_value * data_test['indicator_tarif']


---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-5-1f828ec24148> in <module>()
      3 store = pd.HDFStore('../../data/processed/orbf_benin.h5')
      4 tarifs = store['tarifs']
----> 5 data = store['data']
      6 store.close()

ModuleNotFoundError: No module named 'pandas._period'

Replication of the manual results

First, I want to replicate the results Matthieu obtained by manually processing the data in Excel. To do so, I subset the data to what was reported by the World Bank Zones de Santé for the year 2016, dropping all data collected after November 2016.

Indicator ranking and subsetting

The first step of the algorithm is to rank the indicators according to their importance in terms of financial risk for the program. We want to evaluate the risk each indicator carries in terms of unduly claimed payments from the program.

Let $P_{ift}^c$ be the claimed payment of facility $f$ at time $t$ for indicator $i$, and $P_{ift}^v$ the verified payment of the same facility for the same indicator and period.

The ranking algorithm runs in three steps:

  1. Computing the difference between claimed and verified payments: $$ \Delta_{if} = \sum_t P_{ift}^c - P_{ift}^v$$ $\Delta_{if}$ is positive if the payment was lowered after verification, and negative if it was raised. Aggregated at the program level, this difference is denoted $\Delta_{i} = \sum_{f,t} P_{ift}^c - P_{ift}^v$

  2. Computing the share of the financial risk that a given indicator holds for the program: $$r_i = \frac{\Delta_i}{\sum_i \Delta_i}$$

  3. Finally, the indicators are ranked in decreasing order of their share of the financial risk, and the progressive cumulative sum of these shares is computed. The top-ranked indicators, together representing around 80% of the total risk, are considered critical.
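A minimal sketch of this ranking on toy data (indicator names and amounts are illustrative, not the program's actual figures):

```python
import pandas as pd

## Toy corrections: total claimed minus verified payment per indicator (Delta_i)
delta = pd.Series({'consultations': 500.0, 'deliveries': 250.0,
                   'vaccinations': 150.0, 'referrals': 100.0})

## Share of the total financial risk held by each indicator (r_i)
share = delta / delta.sum()

## Rank by decreasing share and compute the progressive cumulative sum
ranked = share.sort_values(ascending=False)
cumulative = ranked.cumsum()

## Critical indicators: those kept while the cumulative share stays under ~80%
critical = list(cumulative[cumulative < 0.8].index)
print(critical)
```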


In [ ]:
## Total amount of money recovered by the program
table_1 = make_first_table(data_test)
table_1

In [ ]:
indicateurs_critiques = list(table_1[table_1['% Cumulé'] < 0.8].index)
indicateurs_critiques

data_classif = data_test.loc[indicateurs_critiques]
data_classif = data_classif.reset_index()

Weighted Difference

In a second step, each facility is given a weighted metric of correction importance. This metric is a linear combination of the monetary corrections summed by indicator, using fixed importance weights for each indicator. Noting $w_i$ the weight associated with indicator $i$, the weighted difference metric is:

$$\delta_f = \sum_{i} w_i \Delta_{if}$$

where :

$$ w_i = \frac{\Delta_i}{\max_j(\Delta_j)}$$
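This weighting can be sketched as follows (toy figures; `facility_A`, `facility_B`, and the indicator amounts are made up):

```python
import pandas as pd

## Toy per-indicator total corrections (Delta_i) and per-facility corrections (Delta_if)
delta_i = pd.Series({'consultations': 500.0, 'deliveries': 250.0})
delta_if = pd.DataFrame({'consultations': [100.0, 50.0],
                         'deliveries': [20.0, 80.0]},
                        index=['facility_A', 'facility_B'])

## Weights: each indicator's total correction relative to the largest one (w_i)
weights = delta_i / delta_i.max()

## Weighted difference per facility: delta_f = sum_i w_i * Delta_if
delta_f = delta_if.mul(weights, axis=1).sum(axis=1)
print(delta_f)
```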

In [ ]:
ponderation = table_1['Volume Financier Récupéré'] / max(table_1['Volume Financier Récupéré'])

print(ponderation)

In [ ]:
ecart_moyen_pondere = data_classif.groupby(['geozone_name' , 'entity_name']).apply(get_ecart_pondere , ponderation = ponderation)

Facilities Classification

Finally, each facility is classified according to a combination of the amount of payments it receives and its specific weighted difference. This two-dimensional classification follows a simple rule:

  • Facilities receiving payments over the 40th percentile of all payments, and with a weighted difference index over 0.1, are classified as high risk.
  • Facilities receiving payments under the 40th percentile of all payments, and with a weighted difference index under 0.1, are classified as low risk.
  • All other facilities are classified as moderate risk.
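The rule can be sketched on toy data (the payments and weighted-difference values below are illustrative):

```python
import pandas as pd

## Toy facilities: total payment received and weighted difference index
facilities = pd.DataFrame({
    'payment': [100.0, 200.0, 300.0, 400.0, 500.0],
    'weighted_diff': [0.05, 0.30, 0.02, 0.15, 0.20],
})

## 40th percentile of all payments
q40 = facilities['payment'].quantile(0.4)

def classify(row):
    high_payment = row['payment'] > q40
    high_diff = row['weighted_diff'] > 0.1
    if high_payment and high_diff:
        return 'red'       ## high risk
    if not high_payment and not high_diff:
        return 'green'     ## low risk
    return 'orange'        ## moderate risk

facilities['risk'] = facilities.apply(classify, axis=1)
print(facilities['risk'].tolist())
```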

Figure 1 displays the classification of facilities in the World Bank zones. The dispersion of facilities in the plane varies greatly between departments: Banikoara shows little variation in terms of weighted difference, so most of its classification is driven by the payment distribution. Figure 2 shows the distribution of facility classes. In most departments, there are roughly equal numbers of facilities in the red and orange categories, with green facilities making up only a small share of the remaining facilities.


In [ ]:
fig=plt.figure(figsize=(18, 16), dpi= 80, facecolor='w', edgecolor='k')
for i in range(1,9):
    plt.subplot(4,2,i) 
    departement = list(ecart_moyen_pondere.index.levels[0])[i-1]
    make_cadran(ecart_moyen_pondere.loc[departement])
    plt.title(departement)

In [ ]:
def classify_facilities(ecart_moyen_pondere):
    q4_rev = ecart_moyen_pondere['Montant'].quantile(0.4)
    ecart_moyen_pondere['Class'] = 'red'
    ecart_moyen_pondere.loc[(ecart_moyen_pondere['Montant'] <= q4_rev) &
                            (ecart_moyen_pondere['Ecart Moyen Pondéré'] <= 0.1) , 'Class'] = 'green' 
    ecart_moyen_pondere.loc[(ecart_moyen_pondere['Montant'] <= q4_rev) & 
                            (ecart_moyen_pondere['Ecart Moyen Pondéré'] > 0.1) , 'Class'] = 'orange'
    ecart_moyen_pondere.loc[(ecart_moyen_pondere['Montant'] > q4_rev) & 
                            (ecart_moyen_pondere['Ecart Moyen Pondéré'] <= 0.1), 'Class'] = 'orange'
    ecart_moyen_pondere = ecart_moyen_pondere.sort_values('Class')
    return ecart_moyen_pondere

In [ ]:
classified_data = ecart_moyen_pondere.groupby(level = 0).apply(classify_facilities)

In [ ]:
def print_classification(val):
    ## applymap passes each cell value; color the cells holding a class name
    color = val if val in ['red', 'green', 'orange'] else 'white'
    return 'background-color: %s' % color

In [ ]:
classified_data.loc['BANIKOARA'].style.applymap(print_classification)

In [ ]:
def bar_cols(col_data , order_cols = ['green' , 'orange' , 'red']):
    o = []
    for col in order_cols:
        try :
            n = col_data.loc[col]
            o.append(n)
        except KeyError :
            o.append(0)
    
    plt.bar([0,1,2], o , color = order_cols)
    plt.xticks([0,1,2] , order_cols)

In [ ]:
classes_counts = classified_data.Class.groupby(level = 0).value_counts()

fig=plt.figure(figsize=(18, 16), dpi= 80, facecolor='w', edgecolor='k')
for i in range(1,9):
    plt.subplot(4,2,i) 
    departement = list(classes_counts.index.levels[0])[i-1]
    bar_cols(classes_counts.loc[departement])
    plt.title(departement , fontsize=15)

Routine implementation

I now implement the classification as a single function, and run this routine separately in each department. The full routine takes less than 3 seconds to run.


In [ ]:
%%time
classified_facilities = data.groupby('geozone_name').apply(make_facilities_classification , 
                                                           ponderation = ponderation , 
                                                           perc_risk =  0.8)

In [ ]:
classes_counts = classified_facilities.groupby(level = 0).Class.value_counts()

In [ ]:
fig=plt.figure(figsize=(18, 16), dpi= 80, facecolor='w', edgecolor='k')
for i in range(1,17):
    plt.subplot(4,4,i) 
    departement = list(classes_counts.index.levels[0])[i-1]
    bar_cols(classes_counts.loc[departement])
    departement = departement.replace('’', "'")
    plt.title(departement , fontsize=15)

Supervision simulation

Once the facilities have been classified, the facilities to be monitored each month are drawn so as to maximize supervision of the most volatile facilities and minimize supervision of the least volatile ones. The monthly supervision probability for each facility is:

Category   Monthly probability of supervision
Green      20%
Orange     60%
Red        100%
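A sketch of the monthly draw with these probabilities, on illustrative data. It follows the notebook's `sample(frac=...)` style, which supervises a fixed share of each class each month rather than drawing each facility independently with its probability:

```python
import pandas as pd

## Toy classified facilities; class labels follow the notebook's color scheme
facilities = pd.DataFrame({'Class': ['green'] * 10 + ['orange'] * 5 + ['red'] * 3},
                          index=['fac_%d' % i for i in range(18)])

## Monthly supervision probabilities per class, from the table above
fractions = {'green': 0.2, 'orange': 0.6, 'red': 1.0}

def draw_supervision(facilities, fractions, seed=0):
    samples = []
    for cls, frac in fractions.items():
        group = facilities[facilities['Class'] == cls]
        ## sample(frac=...) draws a fixed share of each class
        samples.append(group.sample(frac=frac, random_state=seed))
    return pd.concat(samples)

supervised = draw_supervision(facilities, fractions)
print(supervised['Class'].value_counts().to_dict())
```

With `frac=1.0`, every red facility is supervised every month, matching the 100% probability in the table.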

In [ ]:
def draw_supervision_months(facilities):
    green_fac = facilities[facilities['Class'] == 'green']
    orange_fac = facilities[facilities['Class'] == 'orange']
    red_fac = facilities[facilities['Class'] == 'red']

    ## Sampling fractions follow the monthly supervision probabilities above
    green_sample = green_fac.sample(frac=0.2)
    orange_sample = orange_fac.sample(frac=0.6)
    red_sample = red_fac.sample(frac=1.0)

    return {'green_sample': green_sample, 'orange_sample': orange_sample,
            'red_sample': red_sample}

In [ ]:
out = draw_supervision_months(classified_data)

classified_facilities.Class.value_counts()
#print(len(out['orange_sample']))

In [ ]:
## Draft: re-run the classification using only the data observed before a given
## month. This is a sketch; make_facilities_classification and ponderation come
## from the earlier cells.
def classify_up_to(data, date, ponderation):
    dat = data[data.date < date]
    return dat.groupby('geozone_name').apply(make_facilities_classification,
                                             ponderation=ponderation,
                                             perc_risk=0.8)