Matthieu Antony, in AEDES, developed an algorithm to orient the program's supervision. This algorithm works in six steps :
In [4]:
import sys
sys.path.insert(0, '../../src/monitoring_algorithms/')
from aedes_algorithm import *
In [5]:
%matplotlib inline
store = pd.HDFStore('../../data/processed/orbf_benin.h5')
tarifs = store['tarifs']
data = store['data']
store.close()
data = data[data.period.apply(str).str[0:4] == '2016']
data = data[data.date < '2016-11']
data_test = data[data.geozone_name.isin(bm_zones)]
data_test = data_test.set_index('indicator_label')
## Making payment claimed and verified
data_test['claimed_payment'] = list(data_test.indicator_claimed_value * data_test['indicator_tarif'])
data_test['verified_payment'] = list(data_test.indicator_verified_value * data_test['indicator_tarif'])
First I want to replicate the results Matthieu obtained by Manually processing the data in Excel. First I want to subset the data to the data reported by the WB Zones de Santé for the year 2016, and dropping all data collected after november 2016.
The first step of the algorithm is to rank the indicators according to their importance in term of financial risk for the program. We want to evaluate the risk of each indicator in term of unduly claimed payments from the program.
Let's $P_{ift}^c$ be the claimed payment of facility $f$ at time $t$ for the indicator $i$ and $P_{ift}^v$ the verified payment of this facility for this period for this indicator.
The ranking algorithm runs in three steps :
Computing the difference between claimed payment and verified payment : $$ \Delta_{if} = \sum_t P_{ift}^c - P_{ift}^v$$ And $\Delta_{if}$ will be positive if the payment was lowered, and negative if the payment was made higher after verification. Aggregated at program level, this difference is denoted $\Delta_{i} = \sum_{f,t} P_{ift}^c - P_{ift}$
Computing the share of the financial risk that a given indicator holds for the program, computed as $$r_i = \frac{\Delta_i}{\sum_i \Delta_i}$$
Finally, the indicators are ranked according in decreasing importance of their share in financial risk, and the progressive cumulative sum of this share is computed. The first indicators of these ranking are considered critical, to keep indicators representing around 80\% of the total risk
In [ ]:
## Getting total amount of money the program got back
table_1 = make_first_table(data_test)
table_1
In [ ]:
indicateurs_critiques = list(table_1[table_1['% Cumulé'] < 0.8].index)
indicateurs_critiques
data_classif = data_test.loc[indicateurs_critiques]
data_classif = data_classif.reset_index()
In a second step, facilities are given a weighted metric of correction importance. This correction metric is linear combination of fixed importance weights given to each indicator, and the sum of monetary corrections by indicator. Noting $w_i$ the weight associated with indicator $i$, the weighted difference metric is :
$$\delta_f = \sum_{i} w_i \Delta_{if}$$where :
$$ w_j = \frac{\Delta_j}{\max_i(\Delta_i)}$$
In [ ]:
ponderation = table_1['Volume Financier Récupéré'] / max(table_1['Volume Financier Récupéré'])
print(ponderation)
In [ ]:
ecart_moyen_pondere = data_classif.groupby(['geozone_name' , 'entity_name']).apply(get_ecart_pondere , ponderation = ponderation)
Finally, each facility in classified, according to a combination of the amount of payment they receive, and their specific weighted difference. This two dimensions classifiation is made according to a simple rule :
Figure 1 displays the classification of facilities in the World Bank zones. We can see that the dispersion of facilities in the plane varies greatly between departements. Banikoara does not have a lot of variation in terms of weighted difference, and most of the classification is made along the payments distributions. Figure 2 shows the distribution of facilities classes. We see that in most departments, we have an equal number of facilities in red and orange category, with green facilities being only a minor part of remaining facilites.
In [ ]:
fig=plt.figure(figsize=(18, 16), dpi= 80, facecolor='w', edgecolor='k')
for i in range(1,9):
plt.subplot(4,2,i)
departement = list(ecart_moyen_pondere.index.levels[0])[i-1]
make_cadran(ecart_moyen_pondere.loc[departement])
plt.title(departement)
In [ ]:
def classify_facilities(ecart_moyen_pondere):
q4_rev = ecart_moyen_pondere['Montant'].quantile(0.4)
ecart_moyen_pondere['Class'] = 'red'
ecart_moyen_pondere.loc[(ecart_moyen_pondere['Montant'] <= q4_rev) &
(ecart_moyen_pondere['Ecart Moyen Pondéré'] <= 0.1) , 'Class'] = 'green'
ecart_moyen_pondere.loc[(ecart_moyen_pondere['Montant'] <= q4_rev) &
(ecart_moyen_pondere['Ecart Moyen Pondéré'] > 0.1) , 'Class'] = 'orange'
ecart_moyen_pondere.loc[(ecart_moyen_pondere['Montant'] > q4_rev) &
(ecart_moyen_pondere['Ecart Moyen Pondéré'] <= 0.1), 'Class'] = 'orange'
ecart_moyen_pondere = ecart_moyen_pondere.sort('Class')
return ecart_moyen_pondere
In [ ]:
classified_data = ecart_moyen_pondere.groupby(level = 0).apply(classify_facilities)
In [ ]:
def print_classification(df):
color = df if df in ['red' , 'green' , 'orange'] else 'white'
return 'background-color: %s' % color
In [ ]:
classified_data.loc['BANIKOARA'].style.applymap(print_classification)
In [ ]:
def bar_cols(col_data , order_cols = ['green' , 'orange' , 'red']):
o = []
for col in order_cols:
try :
n = col_data.loc[col]
o.append(n)
except KeyError :
o.append(0)
plt.bar([0,1,2], o , color = order_cols)
plt.xticks([0,1,2] , order_cols)
In [ ]:
classes_counts = classified_data.Class.groupby(level = 0).value_counts()
fig=plt.figure(figsize=(18, 16), dpi= 80, facecolor='w', edgecolor='k')
for i in range(1,9):
plt.subplot(4,2,i)
departement = list(classes_counts.index.levels[0])[i-1]
bar_cols(classes_counts.loc[departement])
plt.title(departement , fontsize=15)
In [ ]:
%%time
classified_facilities = data.groupby('geozone_name').apply(make_facilities_classification ,
ponderation = ponderation ,
perc_risk = 0.8)
In [ ]:
classes_counts = classified_facilities.groupby(level = 0).Class.value_counts()
In [ ]:
fig=plt.figure(figsize=(18, 16), dpi= 80, facecolor='w', edgecolor='k')
for i in range(1,17):
plt.subplot(4,4,i)
departement = list(classes_counts.index.levels[0])[i-1]
bar_cols(classes_counts.loc[departement])
departement =departement.replace('’' , "'")
plt.title(departement , fontsize=15)
Once the facilities have been classified, facilities to be monitored each month are drawn as to maximize the supervision of most volatile facilities, and minimize the supervision of less volatile facilities. The supervision probability, for each facility, each month, is :
Category | Monthly probability of supervision |
---|---|
Green | 20 % |
Orange | 60 % |
Red | 100 % |
In [ ]:
def draw_supervision_months(facilities):
green_fac = facilities[facilities['Class'] == 'green']
orange_fac = facilities[facilities['Class'] == 'orange']
green_sample = green_fac.sample(frac = 0.2)
orange_sample = orange_fac.sample(frac = 0.8)
return {'green_sample':green_sample , 'orange_sample':orange_sample}
In [ ]:
out = draw_supervision_months(classified_data)
classified_facilities.Class.value_counts()
#print(len(out['orange_sample']))
In [ ]:
if date == 1:
dat = data[data.date < date]
classified_facilities = data.groupby('geozone_name').apply(make_facilities_classification ,
ponderation = ponderation ,
perc_risk = 0.8)