In [1]:
# Disabling autosave to allow for notebook pairing via https://github.com/mwouts/jupytext.
%autosave 0
This notebook describes how we perform measure calculations.
A measure is the compuation of a ratio between two values (a numerator and a denominator).
For instance:
Measures are computed monthly for each practice, and are also aggregated at CCG and national level, and we call the results of the computations "measure values".
For each month, we compute the deciles in the distribution of measure values for practices and CCGs.
Certain measures are identified as cost-saving. For these measures, we want to know how much a practice or CCG would have saved if its prescribing had been in line with the practice or CCG at the 10th percentile.
For the measures on the website, these calculations are done by performing a series of queries against data in BigQuery. The code is here, and some of the queries are quite gnarly.
(Note: In the OpenPrescribing source, we store measure values for practices and CCGs in MeasureValue objects, and national measure values in MeasureGlobal objects. Deciles for practices and CCGs are also stored in MeasureGlobal objects. Cost-saving calculations are stored in both kinds of objects.)
This notebook works through these calculations for the Cerazette/Desogestrel measure using one month of relevant prescribing data, using Pandas.
The OpenPrescribing implementation metadata for the measure is here. Note that the measure value is computed as the ratio of the quantity of prescriptions for presentations whose BNF code starts with 0703021Q0B (ie branded prescriptions for Desogestrel) to the quantity of prescriptions for presentations whose BNF code starts with 0703021Q0 (ie all prescriptions for Desogestrel). Also note that the is_cost_based and low_is_good fields are true.
We need pandas for the computation, and requests to validate the computation against the OpenPrescribing implementation, via the OpenPrescribing API.
In [2]:
import pandas as pd
import requests
We can get the prescribing data from BigQuery. We're only interested in prescribing from GPs.
In [3]:
sql = '''
SELECT
prescriptions.pct AS ccg,
prescriptions.practice,
prescriptions.bnf_code,
prescriptions.bnf_name,
prescriptions.quantity,
prescriptions.actual_cost AS cost
FROM
hscic.normalised_prescribing_standard AS prescriptions
INNER JOIN hscic.practices
ON prescriptions.practice = practices.code
WHERE
prescriptions.month = TIMESTAMP('2018-08-01')
AND prescriptions.bnf_code LIKE '0703021Q0%'
AND practices.setting = 4
'''
prescriptions = pd.read_gbq(sql, 'ebmdatalab', dialect='standard')
prescriptions.head()
Out[3]:
We filter prescribing to find the prescriptions that contribute to the measure's numerator (that is, prescriptions for branded Desogestrel), and the prescriptions that contribute to the measure's denominator (that is, all prescriptions for Desogestrel).
In [4]:
numerators = prescriptions[prescriptions['bnf_code'].str.startswith('0703021Q0B')]
denominators = prescriptions[prescriptions['bnf_code'].str.startswith('0703021Q0')]
We can look at the numerators and deonominators for one particular practice.
In [5]:
numerators[numerators['practice'] == 'A81001']
Out[5]:
In [6]:
denominators[denominators['practice'] == 'A81001']
Out[6]:
Note the branded prescriptions (Cerelle, Cerazette, Zelleta) are present in both the numerator and the denominator. Generic prescriptions are only present in the denominator.
We'll start with the calculations for practices.
Not every practice will have prescribed Desogestrel in January 2018, so we also want a list of all practices.
In [7]:
sql = 'SELECT code, name, ccg_id AS ccg FROM hscic.practices WHERE setting = 4 ORDER BY code'
practices = pd.read_gbq(sql, 'ebmdatalab', dialect='standard').set_index('code')
practices.head()
Out[7]:
practices is a DataFrame indexed by practice code. We will add columns to practices as we compute the measure values, percentiles, and cost savings.
For each practice, we want to know the total quantity and total cost for Desogestrel prescriptions, as well as the quantity and cost for branded and for generic prescriptions.
We can get the total quantity and cost from the denominators, the quantity and cost of branded prescriptions from the numerators, and the quantity and cost of generic prescriptions by subtracting one from the other.
In [8]:
practices['quantity_total'] = denominators.groupby('practice')['quantity'].sum()
practices['cost_total'] = denominators.groupby('practice')['cost'].sum()
practices['quantity_branded'] = numerators.groupby('practice')['quantity'].sum()
practices['cost_branded'] = numerators.groupby('practice')['cost'].sum()
# Practices with no prescribing will have NaN values for the new columns. This
# also means that the new columns will have dtypes of float64, but quantities
# should always be ints.
practices = practices.fillna(0)
practices['quantity_total'] = practices['quantity_total'].astype('int')
practices['quantity_branded'] = practices['quantity_branded'].astype('int')
practices['quantity_generic'] = practices['quantity_total'] - practices['quantity_branded']
practices['cost_generic'] = practices['cost_total'] - practices['cost_branded']
practices.head()
Out[8]:
The measure value for this measure is the ratio of the quantity of branded prescriptions to the quantity of total prescriptions.
In [9]:
practices['quantity_ratio'] = practices['quantity_branded'] / practices['quantity_total']
practices.head()
Out[9]:
Note the quantity_ratio for a practice with no prescribing is NaN. The OpenPrescribing implementation stores such values as database NULLs, or Python Nones.
We can compute each practice's percentile, according to their quantity_ratio.
In [10]:
# In order to match the percentiles calculated by the OpenPrescribing
# implementation (via BigQuery) we need to adjust the results of .rank() to
# ignore any NaNs.
#
# I can't see why this isn't the same as:
#
# practices['quantity_ratio'].dropna().rank(method='min', pct=True) * 100
#
# But it's not, quite.
#
# Thanks Dave for working out what to do here!
ranks = practices['quantity_ratio'].rank(method='min')
num_non_nans = practices['quantity_ratio'].count()
practices['quantity_ratio_percentile'] = (ranks - 1) / ((num_non_nans - 1) / 100)
practices.head()
Out[10]:
We can compute the value of the ratio of the practice at the 10th percentile:
In [11]:
practice_quantity_ratio_10 = practices['quantity_ratio'].quantile(0.1)
practice_quantity_ratio_10
Out[11]:
That is, for the practice at the 10th percentile, the ratio of the quantity of branded prescriptions to total prescriptions is 0.0172.
To find out the how much a practice would have saved if its prescribing had matched that of the practice at the 10th percentile, we first need to compute the unit cost of branded and generic prescriptions.
In [12]:
practices['unit_cost_branded'] = practices['cost_branded'] / practices['quantity_branded']
practices['unit_cost_generic'] = practices['cost_generic'] / practices['quantity_generic']
practices[['name', 'unit_cost_branded', 'unit_cost_generic']].head()
Out[12]:
When a practice hasn't prescribed either branded or generic or both, we use global values.
In [13]:
global_unit_cost_branded = practices['cost_branded'].sum() / practices['quantity_branded'].sum()
global_unit_cost_generic = practices['cost_generic'].sum() / practices['quantity_generic'].sum()
global_unit_cost_branded, global_unit_cost_generic
Out[13]:
In [14]:
practices['unit_cost_branded'] = practices['unit_cost_branded'].fillna(global_unit_cost_branded)
practices['unit_cost_generic'] = practices['unit_cost_generic'].fillna(global_unit_cost_generic)
practices[['name', 'unit_cost_branded', 'unit_cost_generic']].head()
Out[14]:
We also need the quantity of branded and generic prescriptions that would have been prescribed had its prescribing matched that of the practice at the 10th percentile, assuming the total prescribing remains the same.
In [15]:
practices['quantity_branded_10'] = practices['quantity_total'] * practice_quantity_ratio_10
practices['quantity_generic_10'] = practices['quantity_total'] - practices['quantity_branded_10']
practices[['name', 'quantity_branded', 'quantity_branded_10', 'quantity_generic', 'quantity_generic_10']].head()
Out[15]:
We can then work out a target cost for each practice, and the amount that would be saved if the practice could make that target.
In [16]:
practices['target_cost_10'] = practices['unit_cost_branded'] * practices['quantity_branded_10'] + practices['unit_cost_generic'] * practices['quantity_generic_10']
practices['cost_saving_10'] = practices['cost_total'] - practices['target_cost_10']
practices.head()
Out[16]:
Here are the practices that would save the least.
In [17]:
practices.sort_values('cost_saving_10').head()[['name', 'cost_saving_10']]
Out[17]:
And the practices that would save the most.
In [18]:
practices.sort_values('cost_saving_10').tail()[['name', 'cost_saving_10']]
Out[18]:
We can verify our calculations by checking against data returned by the API, for each of the following kinds of practice:
In [19]:
def get_practice_data_from_api(practice_id):
url = 'https://openprescribing.net/api/1.0/measure_by_practice/'
params = {
'format': 'json',
'measure': 'desogestrel',
'org': practice_id,
}
rsp = requests.get(url, params)
for record in rsp.json()['measures'][0]['data']:
if record['date'] == '2018-08-01':
return record
assert False
In [20]:
for partition in [
practices[practices['quantity_total'] == 0], # no prescribing
practices[(practices['quantity_total'] > 0) & (practices['quantity_generic'] == 0)], # all branded
practices[(practices['quantity_total'] > 0) & (practices['quantity_branded'] == 0)], # all generic
practices[(practices['quantity_branded'] > 0) & (practices['quantity_generic'] > 0)], # a mixture
]:
practice_id = partition.sample().index[0]
data = get_practice_data_from_api(practice_id)
practice = practices.loc[practice_id]
print('-' * 80)
print(practice)
print(data)
assert data['numerator'] == practice['quantity_branded']
assert data['denominator'] == practice['quantity_total']
if data['percentile'] is not None:
assert abs(data['percentile'] - practice['quantity_ratio_percentile']) < 0.001
assert abs(data['cost_savings']['10'] - practice['cost_saving_10']) < 0.001
We'll move onto the calculations for CCGs. The calculations are very similar to those for practices.
We're interested in CCGs with active GPs.
In [21]:
sql = '''
SELECT
ccgs.code, ccgs.name
FROM
hscic.ccgs
INNER JOIN (
SELECT DISTINCT(ccg_id)
FROM hscic.practices
WHERE practices.setting = 4 AND practices.status_code = 'A'
) AS practices
ON ccgs.code = practices.ccg_id
WHERE
ccgs.org_type = 'CCG'
ORDER BY ccgs.code
'''
ccgs = pd.read_gbq(sql, 'ebmdatalab', dialect='standard').set_index('code')
ccgs.head()
Out[21]:
In [22]:
ccgs['quantity_total'] = denominators.groupby('ccg')['quantity'].sum()
ccgs['cost_total'] = denominators.groupby('ccg')['cost'].sum()
ccgs['quantity_branded'] = numerators.groupby('ccg')['quantity'].sum()
ccgs['cost_branded'] = numerators.groupby('ccg')['cost'].sum()
ccgs = ccgs.fillna(0)
ccgs['quantity_total'] = ccgs['quantity_total'].astype('int')
ccgs['quantity_branded'] = ccgs['quantity_branded'].astype('int')
ccgs['quantity_generic'] = ccgs['quantity_total'] - ccgs['quantity_branded']
ccgs['cost_generic'] = ccgs['cost_total'] - ccgs['cost_branded']
ccgs['quantity_ratio'] = ccgs['quantity_branded'] / ccgs['quantity_total']
ranks = ccgs['quantity_ratio'].rank(method='min')
num_non_nans = ccgs['quantity_ratio'].count()
ccgs['quantity_ratio_percentile'] = (ranks - 1) / ((num_non_nans - 1) / 100)
ccgs.head()
Out[22]:
In [23]:
ccg_quantity_ratio_10 = ccgs['quantity_ratio'].quantile(0.1)
ccg_quantity_ratio_10
Out[23]:
In [24]:
ccgs['unit_cost_branded'] = ccgs['cost_branded'] / ccgs['quantity_branded']
ccgs['unit_cost_generic'] = ccgs['cost_generic'] / ccgs['quantity_generic']
ccgs[['name', 'unit_cost_branded', 'unit_cost_generic']].head()
Out[24]:
In [25]:
global_unit_cost_branded = ccgs['cost_branded'].sum() / ccgs['quantity_branded'].sum()
global_unit_cost_generic = ccgs['cost_generic'].sum() / ccgs['quantity_generic'].sum()
global_unit_cost_branded, global_unit_cost_generic
Out[25]:
In [26]:
ccgs['unit_cost_branded'] = ccgs['unit_cost_branded'].fillna(global_unit_cost_branded)
ccgs['unit_cost_generic'] = ccgs['unit_cost_generic'].fillna(global_unit_cost_generic)
ccgs[['name', 'unit_cost_branded', 'unit_cost_generic']].head()
Out[26]:
In [27]:
ccgs['quantity_branded_10'] = ccgs['quantity_total'] * ccg_quantity_ratio_10
ccgs['quantity_generic_10'] = ccgs['quantity_total'] - ccgs['quantity_branded_10']
ccgs[['name', 'quantity_branded', 'quantity_branded_10', 'quantity_generic', 'quantity_generic_10']].head()
Out[27]:
In [28]:
ccgs['target_cost_10'] = ccgs['unit_cost_branded'] * ccgs['quantity_branded_10'] + ccgs['unit_cost_generic'] * ccgs['quantity_generic_10']
ccgs['cost_saving_10'] = ccgs['cost_total'] - ccgs['target_cost_10']
ccgs.head()
Out[28]:
Here are the CCGs that would save the least.
In [29]:
ccgs.sort_values('cost_saving_10').head()[['name', 'cost_saving_10']]
Out[29]:
And the CCGs that would save the most.
In [30]:
ccgs.sort_values('cost_saving_10').tail()[['name', 'cost_saving_10']]
Out[30]:
Again, we can verify our calculations by checking against data returned by the API.
In [31]:
def get_ccg_data_from_api(ccg_id):
url = 'https://openprescribing.net/api/1.0/measure_by_ccg/'
params = {
'format': 'json',
'measure': 'desogestrel',
'org': ccg_id,
}
rsp = requests.get(url, params)
for record in rsp.json()['measures'][0]['data']:
if record['date'] == '2018-08-01':
return record
assert False
In [32]:
for ccg_id, ccg in ccgs.sample(4).iterrows():
data = get_ccg_data_from_api(ccg_id)
print('-' * 80)
print(ccg)
assert data['numerator'] == ccg['quantity_branded']
assert data['denominator'] == ccg['quantity_total']
assert abs(data['percentile'] - ccg['quantity_ratio_percentile']) < 0.001
assert abs(data['cost_savings']['10'] - ccg['cost_saving_10']) < 0.001