In [1]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
sns.set_context('poster')
Confusion matrices for physician-certified verbal autopsy and random-allocation verbal autopsy. Panel A shows the confusion matrix for physician-certified verbal autopsy (with a length-three cause list for clarity). The entry in each cell counts the number of deaths truly due to the row cause that were predicted to be due to the column cause. For example, the value 83 in the “other” row, “stroke” column indicates that 83 deaths truly due to causes other than stroke or diabetes were (incorrectly) attributed to stroke by physicians. This table demonstrates that (for this dataset) physicians are right more often than they are wrong when they predict stroke as the cause of death, but wrong more often than they are right when they predict diabetes. Panel B shows the confusion matrix for Random Allocation on the same dataset, where random chance predicts stroke and diabetes incorrectly for the vast majority of cases. True and PCVA data are from Lozano et al., 2011, in which physicians were presented with VAI data for deaths whose underlying causes were known to meet stringent clinical diagnostic criteria, and their results were compared to the truth.18
In [2]:
df = pd.read_csv('../3-data/pcva_results.csv')
In [3]:
df
Out[3]:
In [4]:
# collapse the gold-standard cause list to three causes
df['gs3'] = 'Other'
df.loc[df.gs_text34=='Stroke', 'gs3'] = 'Stroke'
df.loc[df.gs_text34=='Diabetes', 'gs3'] = 'Diabetes'
In [5]:
df.gs3.value_counts()
Out[5]:
In [6]:
# collapse the physician-assigned ICD-10 codes to the same three causes
df['pc3'] = 'Other'
df.loc[df.gs_assigned_MRR_1=='I64', 'pc3'] = 'Stroke'
df.loc[df.gs_assigned_MRR_1=='E10', 'pc3'] = 'Diabetes'
In [7]:
df.pc3.value_counts()
Out[7]:
In [8]:
# confusion matrix: rows = true cause (gs3), columns = predicted cause (pc3)
cnts = df.groupby(['gs3', 'pc3']).id.count()
cnts = cnts.unstack()
# reorder rows and columns for display
cnts = pd.DataFrame(cnts, columns=['Stroke', 'Diabetes', 'Other'], index=['Stroke', 'Diabetes', 'Other'])
cnts
Out[8]:
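As an aside, the same confusion matrix can be built in a single call with pd.crosstab; the following sketch is equivalent to the groupby/unstack construction above, with reindex fixing the display order:

# equivalent one-liner: cross-tabulate true vs. predicted causes
order = ['Stroke', 'Diabetes', 'Other']
pd.crosstab(df.gs3, df.pc3).reindex(index=order, columns=order)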
In [9]:
# spot-check a few things
assert np.all(cnts.sum(axis=0) - df.pc3.value_counts() == 0)
assert np.all(cnts.sum(axis=1) - df.gs3.value_counts() == 0)
assert cnts.loc['Stroke', 'Stroke'] == ((df.pc3=='Stroke') & (df.gs3=='Stroke')).sum()
assert cnts.loc['Stroke', 'Other'] == ((df.pc3=='Other') & (df.gs3=='Stroke')).sum()
In [10]:
# set seed for reproducibility
np.random.seed(12345)
# random allocation: assign each death one of the three causes uniformly at random
df['ra3'] = np.random.choice(['Stroke', 'Diabetes', 'Other'], size=len(df.index))
In [11]:
# confusion matrix for random allocation: rows = true cause, columns = predicted cause
cnts = df.groupby(['gs3', 'ra3']).id.count()
cnts = cnts.unstack()
cnts = pd.DataFrame(cnts, columns=['Stroke', 'Diabetes', 'Other'], index=['Stroke', 'Diabetes', 'Other'])
cnts
Out[11]:
Confusion matrix for Random-From-Train verbal autopsy. The confusion matrix for Random-From-Train (with a length-three cause list for clarity). As in Table 1, the entry in each cell counts the number of deaths truly due to the row cause that were predicted to be due to the column cause. This table demonstrates that while Random-From-Train is inaccurate at the individual level, at the population level the prediction distribution can closely match the truth.
In [12]:
# random-from-train: sample predicted causes from the empirical cause distribution
df['rft'] = np.random.choice(df.gs3, size=len(df.index))
In [13]:
# confusion matrix for random-from-train
cnts = df.groupby(['gs3', 'rft']).id.count()
cnts = cnts.unstack()
cnts = pd.DataFrame(cnts, columns=['Stroke', 'Diabetes', 'Other'], index=['Stroke', 'Diabetes', 'Other'])
cnts
Out[13]:
In [14]:
# compare true (row) and predicted (column) totals
pd.DataFrame({'row sums': cnts.sum(axis=1), 'col sums': cnts.sum(axis=0)})
Out[14]:
To do this, we performed a Monte Carlo calculation of the CSMF accuracy of Random Allocation, by simulating a dataset with known CSMF distribution, assigning “predicted” causes of death uniformly at random, and measuring the CSMF accuracy of the predictions.
The distribution of the simulated dataset is an important and subtle detail of this calculation. We sampled the true CSMF distribution from an uninformative Dirichlet distribution (which gives equal probability to all possible CSMF distributions). We generated XXX replicates of the Monte Carlo simulation and calculated the mean and standard deviation of the CSMF accuracy.
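For reference, the CSMF accuracy metric used throughout is one minus the total absolute error between the true and predicted cause fractions, normalized by the worst possible error. A minimal single-replicate sketch (the function defined in the next cell computes the same quantity, vectorized over replicates):

import numpy as np

def csmf_accuracy(csmf_true, csmf_pred):
    # CSMF accuracy for a single replicate:
    # 1 - sum_j |true_j - pred_j| / (2 * (1 - min_j true_j)),
    # so a perfect prediction scores 1 and the worst possible scores 0
    csmf_true = np.asarray(csmf_true, dtype=float)
    csmf_pred = np.asarray(csmf_pred, dtype=float)
    err = np.abs(csmf_true - csmf_pred).sum()
    return 1 - err / (2 * (1 - csmf_true.min()))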
In [15]:
def csmf_acc_for_random(J=34, r=500, n=10000, seed=12345):
    """ Use Monte Carlo to approximate the CSMF accuracy
    of randomly assigning deaths to causes
    J : int, number of causes
    r : int, number of replicates in Monte Carlo approx
    n : int, size of test db
    """
    # set random seed for reproducibility
    np.random.seed(seed)

    # generate r true CSMFs from an uninformative dirichlet prior
    csmf_true = np.random.dirichlet(np.ones(J), size=r)

    # generate r predicted CSMFs by allocating n deaths
    # uniformly at random among the J causes
    csmf_rand = np.random.multinomial(n, np.ones(J) / float(J), size=r) / float(n)
    assert np.allclose(csmf_rand.sum(axis=1), 1)  # rows sum to one (modulo machine precision)

    # calculate CSMF accuracy for all replicates
    csmf_acc = 1 - np.sum(np.absolute(csmf_true - csmf_rand), axis=1) \
                   / (2 * (1 - np.min(csmf_true, axis=1)))
    return csmf_acc

acc = csmf_acc_for_random(J=34)
acc.mean(), acc.std()
Out[15]:
We also used this simulation framework to perform a Monte Carlo calculation of the concordance of random allocation, which provides a cross-check for the analytical derivation of CCC in [ref]. Since a uniformly random prediction matches any given true cause with probability 1/J, the expected concordance of random allocation is 1/J. We repeated the simulations for cause lists ranging from 3 to 50 causes.
In [16]:
def concordance_for_random(J=34, r=500, n=10000, seed=12345):
    """ Use Monte Carlo to approximate the concordance
    of randomly assigning deaths to causes
    J : int, number of causes
    r : int, number of replicates in Monte Carlo approx
    n : int, size of test db
    """
    # set random seed for reproducibility
    np.random.seed(seed)

    # generate r replicates of n underlying causes of death
    # (true and predicted), each uniform on {0, ..., J-1}
    cause_true = np.floor(np.random.uniform(0, J, size=(r,n)))
    cause_pred = np.floor(np.random.uniform(0, J, size=(r,n)))

    # calculate per-cause concordance for the r replicates
    c = np.empty((J,r))
    for j in range(J):
        n_j = (cause_true == j).sum(axis=1)
        n_j = np.array(n_j, dtype=float)  # ensure floating point division
        c[j] = ((cause_true == j) & (cause_pred == j)).sum(axis=1) / n_j

    # average over causes to get overall concordance per replicate
    concordance = np.mean(c, axis=0)
    assert concordance.shape == (r,)
    return concordance

c = concordance_for_random(J=3)
c.mean(), c.std()
Out[16]:
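As a quick sanity check (a sketch using the function just defined), the Monte Carlo means can be compared directly against the analytic value 1/J, which appears on the x-axis of Figure 2:

# compare Monte Carlo concordance of random allocation to the analytic value 1/J
for J in [3, 10, 34, 50]:
    print(J, concordance_for_random(J=J).mean(), 1/float(J))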
This simulation setting also provided an opportunity to demonstrate the importance of randomly resampling the cause fractions of the test set from an uninformative Dirichlet distribution (a technical point that perhaps has not been sufficiently appreciated since its introduction in [ref]). To do so, we compared the CCCSMF accuracy of Random Allocation with that of Random-From-Train, where the training data was either uniformly distributed among causes or distributed identically to the test data.
In [17]:
def csmf_acc_for_rft(J=34, r=500, n=10000, seed=12345, uniform_train=False):
    """ Use Monte Carlo to approximate the CSMF accuracy
    of randomly assigning deaths according to the train set distribution
    J : int, number of causes
    r : int, number of replicates in Monte Carlo approx
    n : int, size of test db
    uniform_train : bool, if True, distribute the train set uniformly
        among causes instead of matching the test distribution
    """
    # set random seed for reproducibility
    np.random.seed(seed)

    # generate r true CSMFs from an uninformative dirichlet prior
    csmf_true = np.random.dirichlet(np.ones(J), size=r)
    assert np.allclose(csmf_true.sum(axis=1), 1)  # rows sum to one (modulo machine precision)

    # generate a train set of size n for each replicate
    X_train = np.empty((r,n))
    for i in range(r):
        X_train[i,:] = np.random.choice(range(len(csmf_true[i])), p=csmf_true[i], size=n)

    # re-calculate csmf for the train set
    # (since the empirical distribution differs slightly from the sampled one)
    csmf_true = np.empty((r,J))
    for i in range(r):
        for j in range(J):
            csmf_true[i,j] = (X_train[i] == j).sum() / float(n)
    assert np.allclose(csmf_true.sum(axis=1), 1)  # rows sum to one (modulo machine precision)

    # resample train set to have equal class sizes, _if requested_
    if uniform_train:
        X_train = np.empty((r,n))
        for i in range(r):
            X_train[i,:] = np.random.choice(range(J), p=np.ones(J)/float(J), size=n)

    # generate test set predictions using random-from-train
    X_test = np.empty((r,n))
    for i in range(r):
        X_test[i] = np.random.choice(X_train[i], size=n, replace=True)

    # calculate csmf for the test set predictions
    csmf_rft = np.empty((r,J))
    for i in range(r):
        for j in range(J):
            csmf_rft[i,j] = (X_test[i] == j).sum() / float(n)
    assert np.allclose(csmf_rft.sum(axis=1), 1)  # rows sum to one (modulo machine precision)

    # calculate CSMF accuracy for all replicates
    csmf_acc = 1 - np.sum(np.absolute(csmf_true - csmf_rft), axis=1) \
                   / (2 * (1 - np.min(csmf_true, axis=1)))
    return csmf_acc

acc = csmf_acc_for_rft(J=34)
print(acc.mean(), acc.std())
acc = csmf_acc_for_rft(J=34, uniform_train=True)
print(acc.mean(), acc.std())
In [18]:
import sys
In [19]:
%%time
results = []
for J in range(3,51):
    sys.stdout.write(str(J)+' ')
    results_J = {'J': float(J)}  # store J as float so 1/df.J below is float division
    acc = csmf_acc_for_random(J, r=10000)
    results_J['acc_rand'] = acc.mean()
    results_J['acc_rand_lb'] = np.percentile(acc, 2.5)
    results_J['acc_rand_ub'] = np.percentile(acc, 97.5)
    c = concordance_for_random(J, r=500)
    results_J['conc_rand'] = c.mean()
    results_J['conc_rand_lb'] = np.percentile(c, 2.5)
    results_J['conc_rand_ub'] = np.percentile(c, 97.5)
    acc = csmf_acc_for_rft(J, r=500)
    results_J['acc_rft'] = acc.mean()
    results_J['acc_rft_lb'] = np.percentile(acc, 2.5)
    results_J['acc_rft_ub'] = np.percentile(acc, 97.5)
    acc = csmf_acc_for_rft(J, r=500, uniform_train=True)
    results_J['acc_rft_unif'] = acc.mean()
    results_J['acc_rft_unif_lb'] = np.percentile(acc, 2.5)
    results_J['acc_rft_unif_ub'] = np.percentile(acc, 97.5)
    results.append(results_J)
print()
# collect results into a DataFrame (avoids the deprecated DataFrame.append)
df_sim = pd.DataFrame(results)
In [20]:
df = df_sim
df.head()
Out[20]:
CCCSMF accuracy of Random Allocation and Random-From-Train with and without resampling the test CSMF distribution. This table demonstrates the importance of resampling the CSMF distribution in the test set; if the test and train sets have the same CSMF distribution, then simple approaches like Random-From-Train, as well as state-of-the-art approaches like King-Lu,23 can appear to have better performance than is justified, due to “overfitting”.
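The chance correction referred to here is a linear rescaling under which random allocation scores approximately zero; a minimal sketch of the transformation applied in the next two cells:

def cccsmf_accuracy(csmf_acc, chance=0.632):
    # chance-corrected CSMF accuracy: rescale so that chance-level
    # performance (CSMF accuracy of about 1 - 1/e = 0.632 for random
    # allocation) maps to 0 and perfect performance maps to 1
    return (csmf_acc - chance) / (1 - chance)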
In [21]:
df.index = df.J
# chance-corrected CSMF accuracy at selected cause-list lengths
np.round(((df.filter(['acc_rft', 'acc_rand', 'acc_rft_unif']) - .632) / (1 - .632)).loc[[5, 15, 25, 35, 50]], 3)
Out[21]:
Figure 1. CSMF accuracy of random allocation as a function of CoD list length. The mean CSMF accuracy of random allocation was calculated with 10,000 Monte Carlo replicates for cause-list lengths ranging from 3 to 50. The CSMF accuracy decreases monotonically as a function of J and appears to stay above 1 − 1/e ≈ 0.632, the value we selected for our chance-correction parameter.
In [22]:
df.acc_rand.plot(color='k',marker='o')
plt.ylabel('CSMF Accuracy\nof Random Allocation', rotation=0, ha='right')
plt.xlabel('Number of Causes ($J$)')
plt.axis(xmin=0, xmax=53)
plt.subplots_adjust(left=.3)
Figure 2. Comparison of concordance from Monte Carlo calculation and analytic calculation. The analogous chance-correction value for concordance was calculated analytically in Murray et al.13, and we confirmed its accuracy in our simulation environment. The absolute relative difference was always less than 1%.
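The <1% claim can be checked directly from the simulation results (a sketch, assuming df still holds the df_sim results with index J):

# maximum relative difference between Monte Carlo and analytic (1/J) concordance
rel_diff = np.abs(df.conc_rand - 1/df.J) / (1/df.J)
print(rel_diff.max())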
In [23]:
plt.plot(1/df.J, df.conc_rand, 'ko')
plt.ylabel('Monte Carlo Estimate\nof Concordance\nof Random Allocation', rotation=0, ha='right')
plt.xlabel('Analytically Calculated Concordance\nof Random Allocation')
df.index = df.J
for j in [3, 5, 10, 25, 50]:
    plt.text(1/float(j), df.conc_rand[j], ' $J=%d$' % j, ha='left', va='center', fontsize=24)
plt.subplots_adjust(left=.3)
Figure 3. Comparison of individual-level and population-level prediction quality for three commonly used methods: InterVA, Tariff, and physician-certified verbal autopsy (PCVA). Questions that rely on the deceased having health care experience (HCE) are necessary for population-level PCVA quality to surpass random guessing. Data from Murray et al.12
In [24]:
df = pd.read_excel('../3-data/va_methods_comparison.xlsx')
# chance-correct accuracy
df.accuracy = (df.accuracy - 0.632) / (1 - 0.632)
df.head()
Out[24]:
In [25]:
fig, ax_list = plt.subplots(2, 3, sharex=True, sharey=True, figsize=(15,10))
for i, hce in enumerate(['With HCE', 'Without HCE']):
    for j, module in enumerate(['Neonate', 'Child', 'Adult']):
        t = df[(df.hce==hce)&(df.module==module)&df.model.isin(['InterVA', 'PCVA', 'Tariff'])]
        t = t.dropna(subset=['accuracy', 'CCC'])
        ax = ax_list[i,j]
        ax.axvline(0.64, color='grey', linestyle='--', linewidth=2)
        ax.plot(t.accuracy, t.CCC, 'o', ms=15, color='grey', mec='k', mew=2, alpha=1)
        # label each point, nudging the Tariff label in the first column
        # to the left so it stays inside the axes
        for k in t.index:
            ha = 'left'
            dx = .06
            dy = 0
            if t.model[k] == 'Tariff' and j == 0:
                ha = 'right'
                dx = -.06
            ax.text(t.accuracy[k]+dx, t.CCC[k]+dy, t.model[k], ha=ha, va='center', fontsize=18)
        if i == 0:
            ax.set_title(module)
    ax = ax_list[i,2]
    ax.set_ylabel(hce, labelpad=-320)
ax = ax_list[0,0]
ax.set_ylabel('Individual-Level Quality (CCC)', ha='center', position=(0,0))
ax = ax_list[1,1]
ax.set_xlabel('Population-Level Quality (CCCSMF Accuracy)')
fig.subplots_adjust(wspace=0.1, hspace=0.1)
ax.set_xticks([-.3, 0, .3])
ax.set_yticks([.2, .4, .6, .8])
ax.axis(xmin=-.4, xmax=.6, ymin=.15, ymax=.55)
Out[25]: