Set up the environment and load the data.



In [1]:

    
import numpy as np
import pandas as pd



In [2]:

    
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'svg'



In [3]:

    
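# Set some Pandas display options (full option names; bare 'html' is ambiguous in recent pandas)
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 10)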

Using 2007 Q1 for now



In [4]:

    
data = pd.read_hdf('/var/datasets/dshs/CD2007Q1/reduced_PUDF_base1q2007.h5','data')

It consists of ~740K claims.



In [5]:

    
nclaims = data.Discharge.count()



In [6]:

    
nclaims









    Out[6]:





740288

It can be used to create about 75M diagnosis triplets.



In [12]:

    
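# Select the principal diagnosis plus the 24 secondary diagnosis columns
diagnosisView = data[['Princ_Diag_Code'] + ['Oth_Diag_Code_'+str(i) for i in range(1,25)]]
diagnosis = diagnosisView.copy()
# Non-null entries per column: cnt[k] = number of claims with at least k+1 codes
cnt = diagnosis.count().values
# diff[k] = number of claims with exactly k+1 codes (assumes codes fill the columns left to right)
diff = cnt[:-1]-cnt[1:]
diff = np.append(diff,cnt[-1])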



In [14]:

    
# A claim with i codes yields C(i,3) = i*(i-1)*(i-2)/6 triplets; diff[i-1] claims have exactly i codes
trigrams=[diff[i-1]*i*(i-1)*(i-2)//6 for i in range(3, 26)]
sum(trigrams)









    Out[14]:





75175406
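The total above can be reproduced without the HDF5 file. The sketch below is a cross-check, not part of the original analysis: it copies the per-claim counts from the `diff` array printed in this notebook and uses `math.comb` (Python 3.8+) instead of the explicit product. The name `claims_with_i_codes` is introduced here for illustration, and the counts are read as "claims with exactly i codes", which assumes codes fill the diagnosis columns left to right.

```python
import math

# Claims with exactly i diagnosis codes (list index = i), copied from the
# diff array printed in this notebook. Assumption: codes fill the columns
# left to right, so column counts translate into per-claim code counts.
claims_with_i_codes = [
    0, 56214, 104082, 80893, 67767, 58724, 51644, 46195,
    41906, 103513, 23062, 16971, 14595, 19677, 9448, 9077,
    6599, 5124, 4325, 3817, 5977, 5197, 929, 843,
    923, 2454,
]

# Each claim with i codes contributes C(i, 3) unordered triplets;
# math.comb returns 0 for i < 3, so short claims drop out automatically.
total_triplets = sum(
    n * math.comb(i, 3) for i, n in enumerate(claims_with_i_codes)
)
print(total_triplets)  # 75175406
```

Claims with 9, 20, 21 and 25 codes account for a large share of the total, because the per-claim count C(i, 3) grows roughly with the cube of i.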



In [15]:

    
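def calcNumberOfTrigrams(nvars):
    # Triplets from claims with 3..nvars codes; claims with more than nvars codes are not counted.
    # Relies on the global diff as defined in In [12], before a leading 0 is inserted in In [22]-[24].
    trigrams=[diff[i-1]*i*(i-1)*(i-2)//6 for i in range(3, nvars + 1)]
    return sum(trigrams)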



In [16]:

    
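def calcNumberOfTrigramsJoin(nvars, claimsperpatient=10):
    # Treats every claim with i codes as a patient record with i*claimsperpatient codes;
    # the number of records is not divided by claimsperpatient.
    trigrams=[diff[i-1]*(i*claimsperpatient)*((i*claimsperpatient)-1)*((i*claimsperpatient)-2)//6 for i in range(3, nvars + 1)]
    return sum(trigrams)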



In [17]:

    
nvars=range(3,12)



In [18]:

    
ntri = [calcNumberOfTrigrams(x) for x in nvars]



In [19]:

    
ntrij = [calcNumberOfTrigramsJoin(x) for x in nvars]



In [21]:

    
plt.bar(nvars, ntri)
plt.title('PUF 2007 Q1')
plt.xlabel('Number of diagnosis features used')
plt.ylabel('Number of extracted triplets')









    Out[21]:





<matplotlib.text.Text at 0x7fad20e18a10>

The above is slightly incorrect, as the list of codes for a claim may contain repetitions. Most importantly:

Other stats to consider:

Major hospitals
Popular diagnoses
Basic demographics
Rural vs urban
Data is available for 1999 through 2007: cross-reference with major events and seasonal occurrences.
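On the repeated codes mentioned above: one way to correct the count is to use the number of distinct codes per claim rather than the number of filled columns. The sketch below runs on a made-up three-claim frame (the codes and the name `toy` are hypothetical, only the column names follow this notebook) and compares the naive and de-duplicated triplet counts. It has not been run against the real PUF data.

```python
import math
import pandas as pd

# Hypothetical toy claims; column names mimic the PUF diagnosis columns.
toy = pd.DataFrame({
    'Princ_Diag_Code': ['4280', '25000', '486'],
    'Oth_Diag_Code_1': ['4280', '4019', None],
    'Oth_Diag_Code_2': ['4019', '2724', None],
    'Oth_Diag_Code_3': ['5849', '25000', None],
})

# Filled columns per claim (what the column counts measure).
filled = toy.count(axis=1)
# Distinct codes per claim; missing values are ignored by default.
distinct = toy.nunique(axis=1)

naive_triplets = sum(math.comb(int(n), 3) for n in filled)
dedup_triplets = sum(math.comb(int(n), 3) for n in distinct)

print(naive_triplets, dedup_triplets)  # 8 2
```

In this toy frame the first two claims each repeat one code, so each drops from four filled columns to three distinct codes, and from four triplets to one.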



In [22]:

    
dif = diff.tolist()



In [23]:

    
dif.insert(0,0)



In [24]:

    
diff = np.array(dif)



In [25]:

    
diff









    Out[25]:





array([     0,  56214, 104082,  80893,  67767,  58724,  51644,  46195,
        41906, 103513,  23062,  16971,  14595,  19677,   9448,   9077,
         6599,   5124,   4325,   3817,   5977,   5197,    929,    843,
          923,   2454])



In [26]:

    
sum(diff)









    Out[26]:





739956



In [27]:

    
nclaims









    Out[27]:





740288



In [28]:

    
cdf = np.cumsum(diff)/float(sum(diff))



In [29]:

    
pdf = diff/float(sum(diff))



In [32]:

    
# pdf[i] is the fraction of claims with exactly i diagnosis codes
plt.bar(range(0, len(pdf)), pdf)
plt.title('Number of *new* triplets contributed by including features')
plt.xlabel('Diagnosis feature #')
plt.ylabel('Fraction of triplets contributed')









    Out[32]:





<matplotlib.text.Text at 0x7fad20d4cb90>

Including DIAG9 is (for some reason) especially beneficial: 103,513 claims have exactly nine diagnosis codes.

What if claims were linked?

This section estimates how many triplets could be extracted from the data if a fixed number of claims were linked for each patient.

The PUF from DSHS does not contain linkable claims, so this is only an estimate of what linkable claims would yield.
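The estimate rests on one modelling step: if the number of codes on each claim is an independent draw from the same distribution, then the total number of codes over k linked claims follows the k-fold convolution of that distribution. The sketch below illustrates this on a made-up distribution. `toy_pmf` and `linked_pmf` are names introduced here, and independence between a patient's claims is an assumption that real data may not satisfy.

```python
import numpy as np

def linked_pmf(pmf, k):
    """PMF of the total code count over k independent claims (k-fold convolution)."""
    out = np.array([1.0])  # point mass at 0 codes: the identity for convolution
    for _ in range(k):
        out = np.convolve(out, pmf)
    return out

# Hypothetical per-claim distribution: P(0), P(1), P(2), P(3) codes.
toy_pmf = np.array([0.0, 0.5, 0.3, 0.2])

pmf2 = linked_pmf(toy_pmf, 2)
print(len(pmf2))   # 7 entries: totals of 0..6 codes
print(pmf2.sum())  # still a probability distribution (sums to 1 up to rounding)
```

Two linked claims in this toy case give a total of two codes with probability 0.5 * 0.5 = 0.25, which is the first non-zero entry of `pmf2`.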



In [33]:

    
pdf2 = np.convolve(pdf, pdf)



In [34]:

    
plt.bar(range(0, len(pdf2)), pdf2)









    Out[34]:





<Container object of 51 artists>



In [35]:

    
pdf4 = np.convolve(pdf2, pdf2)



In [36]:

    
plt.bar(range(0, len(pdf4)), pdf4)









    Out[36]:





<Container object of 101 artists>



In [37]:

    
pdf8 = np.convolve(pdf4, pdf4)



In [38]:

    
pdf16 = np.convolve(pdf8, pdf8)



In [39]:

    
plt.bar(range(0, len(pdf16)), pdf16)









    Out[39]:





<Container object of 401 artists>



In [40]:

    
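def calcNumberOfTrigramsPMF(nvars, pmf, claimsperpatient, nclaims=nclaims):
    # Keep only the first nvars features: claims with more codes are folded into the nvars bucket
    pmfcut = pmf.copy()
    pmfcut[nvars+1:]=0
    pmfcut[nvars] = 1 - sum(pmfcut[:nvars])
    # Codes per patient = sum over claimsperpatient claims (assumed independent),
    # so the per-patient distribution is the PMF convolved with itself
    pconv = pmfcut.copy()
    for k in range(1,claimsperpatient):
        pconv = np.convolve(pconv,pmfcut)
    # nclaims/claimsperpatient patients, each yielding C(i,3) triplets with probability pconv[i]
    trigrams=[(nclaims/float(claimsperpatient))*pconv[i]*i*(i-1)*(i-2)/6 for i in range(3, len(pconv))]
    return sum(trigrams)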
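A small hand-checkable case can make the function above easier to trust. The sketch below restates the same three steps (truncate, convolve, take the expected number of triplets) in a standalone helper and runs it on a made-up distribution with 1000 claims. `expected_triplets` and `toy_pmf` are hypothetical names, and the helper is a restatement for illustration rather than a replacement for `calcNumberOfTrigramsPMF`.

```python
import numpy as np

def expected_triplets(nvars, pmf, claims_per_patient, nclaims):
    """Standalone restatement of the truncate / convolve / expectation steps."""
    cut = pmf.copy()
    cut[nvars + 1:] = 0                 # drop features beyond nvars
    cut[nvars] = 1 - cut[:nvars].sum()  # fold the dropped mass into the nvars bucket
    conv = cut.copy()
    for _ in range(1, claims_per_patient):
        conv = np.convolve(conv, cut)   # total codes over the linked claims
    n = np.arange(len(conv))
    per_patient = (conv * n * (n - 1) * (n - 2) / 6).sum()  # E[C(N, 3)]
    return nclaims / claims_per_patient * per_patient

# Hypothetical per-claim distribution: P(0), P(1), P(2), P(3) codes.
toy_pmf = np.array([0.0, 0.5, 0.3, 0.2])

unlinked = expected_triplets(3, toy_pmf, 1, 1000)   # only 3-code claims yield a triplet
linked = expected_triplets(3, toy_pmf, 2, 1000)     # 500 patients with 2 claims each
truncated = expected_triplets(2, toy_pmf, 2, 1000)  # as above, but only 2 features kept
print(unlinked, linked, truncated)  # approximately 200, 1730 and 750
```

Even in this toy setting, linking two claims per patient raises the estimate from about 200 to about 1730 triplets although the number of patients halves, because the triplet count grows roughly with the cube of the number of codes per patient.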



In [41]:

    
ntri = calcNumberOfTrigramsPMF(5, pdf, 1)
ntri



In [44]:

    
nvars = range(3,15)

Set the average number of linked claims per patient



In [55]:

    
numberOfClaimsPerPatient = 1



In [56]:

    
ntrip = [12*calcNumberOfTrigramsPMF(x, pdf, numberOfClaimsPerPatient) for x in nvars]
plt.bar(nvars, ntrip)
plt.title('2007-2010 PUF, assuming ' + str(numberOfClaimsPerPatient) + ' claims per patient.')
plt.xlabel('Number of diagnosis features')
plt.ylabel('Number of extracted triplets')









    Out[56]:





<matplotlib.text.Text at 0x7fad1de16f90>



In [46]:

    
sum((nclaims*pdf-diff)/nclaims)









    Out[46]:





0.0004484741073743976
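The small value above has a simple reading. Because `pdf` is `diff` divided by `sum(diff)`, the expression reduces to `(nclaims - sum(diff)) / nclaims`, that is, the share of claims that are counted in `nclaims` but have no entry in `Princ_Diag_Code`. The sketch below redoes that arithmetic with the two totals printed earlier in this notebook. The variable names are introduced here for illustration.

```python
# Totals printed earlier in this notebook.
nclaims = 740288           # data.Discharge.count()
claims_with_diag = 739956  # sum(diff): claims with a principal diagnosis code

# Claims that have a Discharge value but no principal diagnosis code.
missing = nclaims - claims_with_diag
share_missing = missing / nclaims

print(missing)        # 332
print(share_missing)  # about 0.000448, matching the cell output above
```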



In [47]:

    
diff









    Out[47]:





array([     0,  56214, 104082,  80893,  67767,  58724,  51644,  46195,
        41906, 103513,  23062,  16971,  14595,  19677,   9448,   9077,
         6599,   5124,   4325,   3817,   5977,   5197,    929,    843,
          923,   2454])


