Replication Archive for "Measuring causes of death in populations: a new metric that corrects cause-specific mortality fractions for chance"


In [1]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline

sns.set_style('whitegrid')
sns.set_context('poster')

Table 1

Confusion matrices for physician-certified verbal autopsy and random-allocation verbal autopsy. Panel A shows the confusion matrix for physician-certified verbal autopsy (with a length-three cause list for clarity). The entry in each cell counts the number of deaths truly due to the row cause that were predicted to be due to the column cause. For example, the value 83 in the “other” row, “stroke” column indicates that 83 deaths truly due to causes other than stroke or diabetes were (incorrectly) attributed to stroke by physicians. This table demonstrates that (for this dataset) physicians are right more often than they are wrong when they predict stroke as the cause of death, but wrong more often than they are right when they predict diabetes. Panel B shows the confusion matrix for Random Allocation with the same dataset, where random chance predicts stroke and diabetes incorrectly for the vast majority of cases. True and PCVA data are from Lozano et al., 2011, in which physicians were presented with VAI data for deaths whose underlying cause was known to meet stringent clinical diagnostic criteria, and their results were compared to the truth.18
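A confusion matrix like Panel A can also be computed in a single step with `pandas.crosstab`; the following is a minimal, self-contained sketch using hypothetical label arrays, not the replication data loaded below:

```python
import numpy as np
import pandas as pd

# hypothetical true and predicted cause labels, for illustration only
true_cause = np.array(['Stroke', 'Stroke', 'Diabetes', 'Other', 'Other'])
pred_cause = np.array(['Stroke', 'Other', 'Diabetes', 'Stroke', 'Other'])

# rows = true cause, columns = predicted cause
cm = pd.crosstab(true_cause, pred_cause, rownames=['true'], colnames=['predicted'])
print(cm)
```

The replication code below builds the same kind of matrix with `groupby`/`unstack`, which makes it easy to force a particular row and column order.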


In [2]:
df = pd.read_csv('../3-data/pcva_results.csv')

In [3]:
df


Out[3]:
id gs_text34 gs_assigned_MRR_1
0 0 Stroke ZZ23
1 1 IHD - Acute Myocardial Infarction I21
2 2 Stroke I64
3 3 Pneumonia E10
4 4 Stroke I64
5 5 Other Cardiovascular Diseases ZZ25
6 6 Diabetes E10
7 7 Renal Failure D91
8 8 Stroke ZZ23
9 9 IHD - Acute Myocardial Infarction ZZ23
10 10 Cirrhosis A09
11 11 Stroke ZZ23
12 12 Stroke ZZ23
13 13 Stroke W00
14 14 Other Cardiovascular Diseases I21
15 15 Diabetes E10
16 16 Poisonings ZZ25
17 17 Diarrhea/Dysentery I64
18 18 Stroke ZZ23
19 19 IHD - Acute Myocardial Infarction ZZ23
20 20 Epilepsy J45
21 21 Falls W00
22 22 Stroke ZZ25
23 23 IHD - Acute Myocardial Infarction E10
24 24 IHD - Acute Myocardial Infarction E10
25 25 Pneumonia J33
26 26 Stroke ZZ23
27 27 IHD - Acute Myocardial Infarction I21
28 28 Stroke ZZ23
29 29 Falls W00
... ... ... ...
2672 2672 Other Infectious Diseases J33
2673 2673 Cirrhosis ZZ21
2674 2674 Poisonings ZZ27
2675 2675 Other Infectious Diseases ZZ21
2676 2676 Other Infectious Diseases J45
2677 2677 Pneumonia E10
2678 2678 Diabetes K71
2679 2679 Maternal O67
2680 2680 Other Infectious Diseases ZZ27
2681 2681 Maternal O67
2682 2682 Drowning ZZ27
2683 2683 Falls ZZ27
2684 2684 Renal Failure O67
2685 2685 Suicide ZZ27
2686 2686 IHD - Acute Myocardial Infarction I21
2687 2687 Fires ZZ27
2688 2688 Homicide ZZ27
2689 2689 Homicide Y00
2690 2690 Maternal O67
2691 2691 Fires ZZ27
2692 2692 Drowning ZZ27
2693 2693 Fires ZZ27
2694 2694 Drowning W65
2695 2695 Drowning ZZ25
2696 2696 Fires ZZ27
2697 2697 Drowning W65
2698 2698 Homicide Y00
2699 2699 Pneumonia ZZ23
2700 2700 IHD - Acute Myocardial Infarction I21
2701 2701 Cirrhosis K71

2702 rows × 3 columns


In [4]:
df['gs3'] = 'Other'
df.loc[df.gs_text34=='Stroke', 'gs3'] = 'Stroke'
df.loc[df.gs_text34=='Diabetes', 'gs3'] = 'Diabetes'

In [5]:
df.gs3.value_counts()


Out[5]:
Other       2290
Stroke       266
Diabetes     146
Name: gs3, dtype: int64

In [6]:
df['pc3'] = 'Other'
df.loc[df.gs_assigned_MRR_1=='I64', 'pc3'] = 'Stroke'
df.loc[df.gs_assigned_MRR_1=='E10', 'pc3'] = 'Diabetes'

In [7]:
df.pc3.value_counts()


Out[7]:
Other       2292
Stroke       211
Diabetes     199
Name: pc3, dtype: int64

Panel A


In [8]:
cnts = df.groupby(['gs3', 'pc3']).id.count()
cnts = cnts.unstack()
cnts = pd.DataFrame(cnts, columns=['Stroke', 'Diabetes', 'Other'], index = ['Stroke', 'Diabetes', 'Other'])
cnts


Out[8]:
Stroke Diabetes Other
Stroke 123 18 125
Diabetes 5 86 55
Other 83 95 2112

In [9]:
# spot-check a few things

assert np.all(cnts.sum(axis=0) - df.pc3.value_counts() == 0)
assert np.all(cnts.sum(axis=1) - df.gs3.value_counts() == 0)

assert cnts.loc['Stroke', 'Stroke'] == ((df.pc3=='Stroke') & (df.gs3=='Stroke')).sum()
assert cnts.loc['Stroke', 'Other'] == ((df.pc3=='Other') & (df.gs3=='Stroke')).sum()

Panel B


In [10]:
# set seed for reproducibility
np.random.seed(12345)

df['ra3'] = np.random.choice(['Stroke', 'Diabetes', 'Other'], size=len(df.index))

In [11]:
cnts = df.groupby(['gs3', 'ra3']).id.count()
cnts = cnts.unstack()
cnts = pd.DataFrame(cnts, columns=['Stroke', 'Diabetes', 'Other'], index = ['Stroke', 'Diabetes', 'Other'])
cnts


Out[11]:
Stroke Diabetes Other
Stroke 87 84 95
Diabetes 32 61 53
Other 746 780 764

Table 2

Confusion matrix for Random-From-Train verbal autopsy (with a length-three cause list for clarity). As in Table 1, the entry in each cell counts the number of deaths truly due to the row cause that were predicted to be due to the column cause. This table demonstrates that while Random-From-Train is inaccurate at the individual level, at the population level the prediction distribution can closely match the truth.


In [12]:
df['rft'] = np.random.choice(df.gs3, size=len(df.index))

In [13]:
cnts = df.groupby(['gs3', 'rft']).id.count()
cnts = cnts.unstack()
cnts = pd.DataFrame(cnts, columns=['Stroke', 'Diabetes', 'Other'], index = ['Stroke', 'Diabetes', 'Other'])
cnts


Out[13]:
Stroke Diabetes Other
Stroke 28 14 224
Diabetes 11 13 122
Other 223 106 1961

The marginal totals confirm the population-level match: the predicted (column) totals are close to the true (row) totals.


In [14]:
pd.DataFrame({'row sums': cnts.sum(1), 'col sums': cnts.sum(0)})


Out[14]:
col sums row sums
Stroke 262 266
Diabetes 133 146
Other 2307 2290

To do this, we performed a Monte Carlo calculation of the CSMF accuracy of Random Allocation, by simulating a dataset with known CSMF distribution, assigning “predicted” causes of death uniformly at random, and measuring the CSMF accuracy of the predictions.

The distribution of the simulated dataset is an important and subtle detail of this calculation. We sampled the true CSMF distribution from an uninformative Dirichlet distribution (which gives equal probability to all possible CSMF distributions). We generated 500 replicates of the Monte Carlo simulation and calculated the mean and standard deviation of the CSMF accuracy.
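For reference, the CSMF accuracy of a single replicate is one minus the total absolute CSMF error, normalized by the largest error possible. A minimal sketch of the formula implemented inside the function below (the `csmf_accuracy` helper name is ours; both arguments are assumed to be cause-fraction vectors that sum to one):

```python
import numpy as np

def csmf_accuracy(csmf_true, csmf_pred):
    """CSMF accuracy = 1 - sum_j |true_j - pred_j| / (2 * (1 - min_j true_j))."""
    csmf_true = np.asarray(csmf_true, dtype=float)
    csmf_pred = np.asarray(csmf_pred, dtype=float)
    return 1 - np.abs(csmf_true - csmf_pred).sum() / (2 * (1 - csmf_true.min()))

# a perfect prediction scores exactly 1
print(csmf_accuracy([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]))  # -> 1.0
```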


In [15]:
def csmf_acc_for_random(J=34, r=500, n=10000, seed=12345):
    """ Use Monte Carlo to approximate the CSMF accuracy
    of randomly assigning deaths to causes

    J : int, number of causes
    r : int, number of replicates in Monte Carlo approx
    n : int, size of test db
    """
    # set random seed for reproducibility
    np.random.seed(seed)

    #######################################################
    #######################################################
    # generate r CSMFs from uninformative dirichlet prior

    csmf_true = np.random.dirichlet(np.ones(J), size=r)

    
    ########################################################
    ########################################################
    # generate r CSMFs from uniformly-random allocation
    # of causes to n deaths

    csmf_rand = np.random.multinomial(n, np.ones(J) / float(J), size=r) / float(n)
    assert np.allclose(csmf_rand.sum(axis=1), 1)  #  rows sum to one (modulo machine precision)


    ########################################################
    ########################################################
    # calculate CSMF accuracy for all replicates

    csmf_acc = 1 - np.sum(np.absolute(csmf_true - csmf_rand), axis=1) \
                        / (2 * (1 - np.min(csmf_true, axis=1)))


    
    return csmf_acc

acc = csmf_acc_for_random(J=34)
acc.mean(), acc.std()


Out[15]:
(0.63650814200295769, 0.039592336123364276)

We also used this simulation framework to perform a Monte Carlo calculation of the concordance for random allocation, which provides a cross-check for the analytical derivation of CCC in [ref]. We repeated the simulations for cause lists ranging from 3 to 50 causes.
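Concordance for a single cause j is the fraction of deaths truly due to j that were predicted as j, and the (unweighted) mean over causes is what the function below estimates; for random allocation its expected value is 1/J. A minimal sketch (the `mean_concordance` helper is illustrative, not part of the replication code):

```python
import numpy as np

def mean_concordance(cause_true, cause_pred, J):
    """Average over causes j of the fraction of true-j deaths predicted as j."""
    cause_true = np.asarray(cause_true)
    cause_pred = np.asarray(cause_pred)
    per_cause = []
    for j in range(J):
        mask = (cause_true == j)
        per_cause.append((cause_pred[mask] == j).mean())
    return np.mean(per_cause)

# for random guessing, mean concordance should be near 1/J
rng = np.random.RandomState(0)
t = rng.randint(0, 3, size=100000)
p = rng.randint(0, 3, size=100000)
print(mean_concordance(t, p, J=3))  # close to 1/3
```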


In [16]:
def concordance_for_random(J=34, r=500, n=10000, seed=12345):
    """ Use Monte Carlo to approximate the concordance
    of randomly assigning deaths to causes

    J : int, number of causes
    r : int, number of replicates in Monte Carlo approx
    n : int, size of test db
    """
    # set random seed for reproducibility
    np.random.seed(seed)

    #######################################################
    #######################################################
    # generate r replicates of n underlying causes of death
    # (true and predicted)
    
    cause_true = np.floor(np.random.uniform(0, J, size=(r,n)))
    cause_pred = np.floor(np.random.uniform(0, J, size=(r,n)))

    ########################################################
    ########################################################
    # calculate concordance for r replicates
    c = np.empty((J,r))
    for j in range(J):
        n_j = (cause_true == j).sum(axis=1)
        n_j = np.array(n_j, dtype=float)  # ensure that we get floating point division
        
        c[j] = ((cause_true == j)&(cause_pred == j)).sum(axis=1) / n_j
        
    concordance = np.mean(c, axis=0)
    assert concordance.shape == (r,)

    return concordance

c = concordance_for_random(J=3)
c.mean(), c.std()


Out[16]:
(0.33348456656282904, 0.004793418009544693)

This simulation setting also provided an opportunity to demonstrate the importance of randomly resampling the cause fractions of the test set from an uninformative Dirichlet distribution (a technical point that perhaps has not been sufficiently appreciated since its introduction in [ref]). To do so, we compared the CCCSMF accuracy of Random Allocation with that of Random-From-Train, where the training data was either uniformly distributed among causes or distributed according to the same distribution as the test data.


In [17]:
def csmf_acc_for_rft(J=34, r=500, n=10000, seed=12345, uniform_train=False):
    """ Use Monte Carlo to approximate the CSMF accuracy
    of randomly assigning deaths according to train set distribution

    J : int, number of causes
    r : int, number of replicates in Monte Carlo approx
    n : int, size of test db
    uniform_train : bool, should train set have different distribution than test set
    """
    # set random seed for reproducibility
    np.random.seed(seed)

    #######################################################
    #######################################################
    # generate r CSMFs from uninformative dirichlet prior

    csmf_true = np.random.dirichlet(np.ones(J), size=r)
    assert np.allclose(csmf_true.sum(axis=1), 1)  # rows sum to one (modulo machine precision)

    #######################################################
    #######################################################
    # generate train set of size n
    
    X_train = np.empty((r,n))
    for i in range(r):
        X_train[i,:] = np.random.choice(range(len(csmf_true[i])), p=csmf_true[i], size=n)
    
    ########################################################
    ########################################################
    # re-calculate csmf for train set
    # (the empirical CSMF differs slightly from the sampled one)
    csmf_true = np.empty((r,J))
    for i in range(r):
        for j in range(J):
            csmf_true[i,j] = (X_train[i] == j).sum() / float(n)
    assert np.allclose(csmf_true.sum(axis=1), 1)  #  rows sum to one (modulo machine precision)
    
    #######################################################
    #######################################################
    # resample train set to have equal class sizes,
    # _if requested_
    if uniform_train:
        X_train = np.empty((r,n))
        for i in range(r):
            X_train[i,:] = np.random.choice(range(J), p=np.ones(J)/float(J), size=n)
    
    
    
    ########################################################
    ########################################################
    # generate test set using random-from-train
    X_test = np.empty((r,n))
    for i in range(r):
        X_test[i] = np.random.choice(X_train[i], size=n, replace=True)
        
    ########################################################
    ########################################################
    # calculate csmf for test set
    csmf_rft = np.empty((r,J))
    for i in range(r):
        for j in range(J):
            csmf_rft[i,j] = (X_test[i] == j).sum() / float(n)
    assert np.allclose(csmf_rft.sum(axis=1), 1)  #  rows sum to one (modulo machine precision)


    ########################################################
    ########################################################
    # calculate CSMF accuracy for all replicates

    csmf_acc = 1 - np.sum(np.absolute(csmf_true - csmf_rft), axis=1) \
                        / (2 * (1 - np.min(csmf_true, axis=1)))


    
    return csmf_acc

acc = csmf_acc_for_rft(J=34)
print(acc.mean(), acc.std())

acc = csmf_acc_for_rft(J=34, uniform_train=True)
print(acc.mean(), acc.std())


(0.9801022955774783, 0.0028779120303083984)
(0.63496926759115124, 0.040351311224907462)

In [18]:
import sys

In [19]:
%%time

df = pd.DataFrame(columns=['J'])
for J in range(3,51):
    sys.stdout.write(str(J)+' ')
    results_J = {'J':J}
    
    acc = csmf_acc_for_random(J, r=10000)
    results_J['acc_rand'] = acc.mean()
    results_J['acc_rand_lb'] = np.percentile(acc, 2.5)
    results_J['acc_rand_ub'] = np.percentile(acc, 97.5)
    
    
    c = concordance_for_random(J, r=500)
    results_J['conc_rand'] = c.mean()
    results_J['conc_rand_lb'] = np.percentile(c, 2.5)
    results_J['conc_rand_ub'] = np.percentile(c, 97.5)

    acc = csmf_acc_for_rft(J, r=500)
    results_J['acc_rft'] = acc.mean()
    results_J['acc_rft_lb'] = np.percentile(acc, 2.5)
    results_J['acc_rft_ub'] = np.percentile(acc, 97.5)

    acc = csmf_acc_for_rft(J, r=500, uniform_train=True)
    results_J['acc_rft_unif'] = acc.mean()
    results_J['acc_rft_unif_lb'] = np.percentile(acc, 2.5)
    results_J['acc_rft_unif_ub'] = np.percentile(acc, 97.5)
    
    df = df.append(results_J, ignore_index=True)
print()
df_sim = df


3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 ()
CPU times: user 4min 20s, sys: 9.92 s, total: 4min 30s
Wall time: 4min 30s

In [20]:
df = df_sim
df.head()


Out[20]:
J acc_rand acc_rand_lb acc_rand_ub acc_rft acc_rft_lb acc_rft_ub acc_rft_unif acc_rft_unif_lb acc_rft_unif_ub conc_rand conc_rand_lb conc_rand_ub
0 3.0 0.672649 0.409661 0.906207 0.994532 0.986763 0.999225 0.676596 0.412911 0.915653 0.333485 0.324596 0.342875
1 4.0 0.664950 0.419911 0.870437 0.993817 0.986271 0.998765 0.670330 0.433460 0.870414 0.249967 0.241173 0.258959
2 5.0 0.659377 0.445512 0.845829 0.992700 0.985842 0.997518 0.666023 0.466219 0.844205 0.199946 0.192522 0.208066
3 6.0 0.655971 0.456732 0.827259 0.991902 0.985600 0.996989 0.659627 0.467981 0.806137 0.167013 0.160000 0.174345
4 7.0 0.652765 0.465657 0.815029 0.991545 0.985601 0.996526 0.659969 0.494500 0.822932 0.143023 0.136691 0.149720

Table 3

CCCSMF accuracy of Random Allocation and Random-From-Train with and without resampling the test CSMF distribution. This table demonstrates the importance of resampling the CSMF distribution in the test set; if the test and train sets have the same CSMF distribution, then simple approaches like Random-From-Train, as well as state-of-the-art approaches like King-Lu,23 can appear to have better performance than is justified, due to “overfitting”.
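The chance correction applied in the next cell (and again for Figure 3) rescales CSMF accuracy so that random allocation scores approximately zero and perfect prediction scores one; a one-line sketch (the `cccsmf` helper name is ours, not from the replication code):

```python
def cccsmf(csmf_accuracy):
    """Chance-corrected CSMF accuracy: random allocation
    (CSMF accuracy ~ 1 - 1/e ~ 0.632) maps to ~0, perfect prediction to 1."""
    return (csmf_accuracy - 0.632) / (1 - 0.632)

print(round(cccsmf(1.0), 3))    # 1.0
print(round(cccsmf(0.632), 3))  # 0.0
```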


In [21]:
df.index = df.J
np.round(((df.filter(['acc_rft', 'acc_rand', 'acc_rft_unif']) - .632) / (1 - .632)).loc[[5, 15, 25, 35, 50]], 3)


Out[21]:
acc_rft acc_rand acc_rft_unif
J
5 0.980 0.074 0.092
15 0.964 0.028 0.027
25 0.953 0.016 0.016
35 0.945 0.010 0.007
50 0.933 0.006 -0.005

Figure 1

Figure 1. CSMF Accuracy of random allocation as a function of CoD list length. The mean CSMF accuracy of random allocation was calculated with 10,000 Monte Carlo replicates for cause-list lengths ranging from 3 to 50. The CSMF accuracy decreases monotonically as a function of J and appears to stay above 1 - 1/e ≈ 0.632, which we selected for our chance-correction parameter.


In [22]:
df.acc_rand.plot(color='k',marker='o')

plt.ylabel('CSMF Accuracy\nof Random Allocation', rotation=0, ha='right')
plt.xlabel('Number of Causes ($J$)')

plt.axis(xmin=0, xmax=53)
plt.subplots_adjust(left=.3)


Figure 2

Figure 2. Comparison of concordance from Monte Carlo calculation and analytic calculation. The analogous chance-correction value for concordance was calculated analytically in Murray et al.13, and we confirmed its accuracy in our simulation environment. The absolute relative difference was always less than 1%.
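As a quick, self-contained illustration of this cross-check (a fresh simulation rather than the `df_sim` results, with smaller samples and an assumed 5% tolerance to keep it fast):

```python
import numpy as np

# compare Monte Carlo concordance of random allocation to the analytic value 1/J
rng = np.random.RandomState(12345)
rel_diffs = {}
for J in [3, 10, 25]:
    true_cause = rng.randint(0, J, size=200000)
    pred_cause = rng.randint(0, J, size=200000)
    conc = np.mean([(pred_cause[true_cause == j] == j).mean() for j in range(J)])
    rel_diffs[J] = abs(conc - 1.0 / J) / (1.0 / J)
    print(J, round(conc, 4))
```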


In [23]:
plt.plot(1/df.J, df.conc_rand, 'ko')
plt.ylabel('Monte Carlo Estimate\nof Concordance\nof Random Allocation', rotation=0, ha='right')
plt.xlabel('Analytically Calculated Concordance\nof Random Allocation')

df.index = df.J
for j in [3,5,10,25,50]:
    plt.text(1/float(j), df.conc_rand[j], '   $J=%d$'%j, ha='left', va='center', fontsize=24)
    
plt.subplots_adjust(left=.3)


Figure 3

Figure 3. Comparison of individual-level and population-level prediction quality for three commonly used methods: InterVA, Tariff, and physician-certified verbal autopsy (PCVA). Questions that rely on the deceased having health care experience (HCE) are necessary for population-level PCVA quality to surpass random guessing. Data from Murray et al.12


In [24]:
df = pd.read_excel('../3-data/va_methods_comparison.xlsx')

# chance-correct accuracy
df.accuracy = (df.accuracy - 0.632) / (1 - 0.632)
df.head()


Out[24]:
model module hce accuracy CCC
0 InterVA Adult With HCE -0.017486 0.238500
1 InterVA Adult Without HCE -0.054394 0.234244
2 KL Adult With HCE 0.153660 0.000000
3 KL Adult Without HCE 0.108869 0.000000
4 MCCD Adult With HCE 0.562500 0.759000

In [25]:
fig, ax_list = plt.subplots(2, 3, sharex=True, sharey=True, figsize=(15,10))

for i, hce in enumerate(['With HCE', 'Without HCE']):
    for j, module in enumerate(['Neonate', 'Child', 'Adult']):
        t = df[(df.hce==hce)&(df.module==module)&df.model.isin(['InterVA', 'PCVA', 'Tariff'])]
        t = t.dropna(subset=['accuracy', 'CCC'])
        ax = ax_list[i,j]
        ax.axvline(0.64, color='grey', linestyle='--', linewidth=2)
        ax.plot(t.accuracy, t.CCC, 'o', ms=15, color='grey', mec='k', mew=2, alpha=1)
        
        for k in t.index:
            ha = 'left'
            dx = .06
            dy = 0
            if t.model[k] == 'Tariff' and j==0:
                ha = 'right'
                dx = -.06
            elif t.model[k] == 'InterVA':
                dy = -.0
            ax.text(t.accuracy[k]+dx, t.CCC[k]+dy, t.model[k], ha=ha, va='center', fontsize=18)
        
        if i == 0:
            ax.set_title(module)
    
    ax = ax_list[i,2]
    ax.set_ylabel(hce, labelpad=-320)

ax = ax_list[0,0]
ax.set_ylabel('Individual-Level Quality (CCC)', ha='center', position=(0,0))
ax = ax_list[1,1]
ax.set_xlabel('Population-Level Quality (CCCSMF Accuracy)')
fig.subplots_adjust(wspace=0.1, hspace=0.1)

ax.set_xticks([-.3, 0, .3])
ax.set_yticks([.2, .4, .6, .8])
ax.axis(xmin=-.4, xmax=.6, ymin=.15, ymax=.55)


Out[25]:
(-0.4, 0.6, 0.15, 0.55)