In [1]:
import reader
In [2]:
data=reader.Data()
In [3]:
data['all_encounter_data'].columns.values
Out[3]:
In [4]:
df = data['all_encounter_data'][['Enc_Date','Person_Nbr','mNPDR', 'MNPDR', 'SNPDR', 'PDR']].copy()
df.head()
Out[4]:
In [5]:
df.shape
Out[5]:
In [6]:
df[['mNPDR', 'MNPDR', 'SNPDR', 'PDR']].sum(axis=0)
Out[6]:
Some encounters carry more than one DR diagnosis.
In [7]:
from collections import Counter
Counter(df[['mNPDR', 'MNPDR', 'SNPDR', 'PDR']].sum(axis=1))
Out[7]:
If we keep the multi-diagnosis encounters and group all encounters by person, 555 people have 2 diagnoses, 98 people have 3, and 10 people have 4. But from these counts alone we can't tell whether the diagnoses come from a series of encounters or from a single encounter.
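As a minimal sketch of this per-person count, assuming the four diagnosis columns hold 0/1 (or boolean) flags, the grouping can also be written in one vectorized step. The frame `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

dx_cols = ['mNPDR', 'MNPDR', 'SNPDR', 'PDR']

# Hypothetical toy data: one row per encounter, one 0/1 flag per diagnosis
toy = pd.DataFrame({
    'Person_Nbr': [1, 1, 2, 3, 3],
    'mNPDR':      [1, 0, 1, 1, 0],
    'MNPDR':      [0, 1, 1, 0, 0],
    'SNPDR':      [0, 0, 0, 0, 0],
    'PDR':        [0, 0, 0, 0, 0],
})

# Sum the flags per person, then count how many diagnosis columns are non-zero
per_person = toy.groupby('Person_Nbr')[dx_cols].sum()
n_distinct = per_person.gt(0).sum(axis=1)

# Distribution of "number of distinct diagnoses per person"
distribution = n_distinct.value_counts().sort_index()
print(distribution)
```

In this toy frame, persons 1 and 2 both end up with two diagnoses, although person 1 received them in two separate encounters and person 2 in a single one. That is the ambiguity described above.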
In [8]:
tmp={k:v[['mNPDR', 'MNPDR', 'SNPDR', 'PDR']].sum(axis=0) for k,v in df.groupby('Person_Nbr')}
In [9]:
tmp1 = {k:len(v[~v.isin([0])].index.tolist()) for k,v in tmp.items()}
In [10]:
Counter(tmp1.values())
Out[10]:
So I select only the encounters that have exactly 1 diagnosis and set the multi-diagnosis encounters aside. In this subset, 449 people have more than 1 diagnosis, and these diagnoses must come from a sequence of encounters. But the encounters I have set aside may influence this count.
In [11]:
df1=df[df[['mNPDR', 'MNPDR', 'SNPDR', 'PDR']].sum(axis=1).isin([1])]
In [12]:
temp={k:v[['mNPDR', 'MNPDR', 'SNPDR', 'PDR']].sum(axis=0) for k,v in df1.groupby('Person_Nbr')}
In [13]:
len(temp)
Out[13]:
In [14]:
temp[863]
Out[14]:
In [15]:
temp1 = {k:len(v[~v.isin([0])].index.tolist()) for k,v in temp.items()}
In [16]:
Counter(temp1.values())
Out[16]:
The multi-diagnosis encounters may mislead our conclusions for 340 people, out of the 663 people who have multiple diagnoses in total.
In [17]:
df2=df[df[['mNPDR', 'MNPDR', 'SNPDR', 'PDR']].sum(axis=1).isin([2,3])]
In [18]:
temp2={k:v[['mNPDR', 'MNPDR', 'SNPDR', 'PDR']].sum(axis=0) for k,v in df2.groupby('Person_Nbr')}
In [19]:
len(temp2)
Out[19]:
In [20]:
temp2[863]
Out[20]:
In [21]:
temp3 = {k:len(v[~v.isin([0])].index.tolist()) for k,v in temp2.items()}
In [22]:
Counter(temp3.values())
Out[22]:
There is overlap between the sets I extracted above.
Among the 3180 people found in the previous step to have only one diagnosis throughout their entire record, 127 also have multi-diagnosis encounters. Shall we delete the multi-diagnosis encounters from their records? I don't think it is a good idea...
In [23]:
len(set([k for k,v in temp1.items() if v==1])&set(temp3.keys()))
Out[23]:
87 people have only multi-diagnosis encounters; they have no single-diagnosis encounters.
In [24]:
len([k for k in temp3.keys() if k not in temp1.keys()])
Out[24]:
Among the 449 people found in the previous step to have a diagnosis sequence, 126 also have multi-diagnosis encounters.
In [25]:
len(set([k for k,v in temp1.items() if v>1]))
Out[25]:
In [27]:
len([k1 for k1 in set([k for k,v in temp1.items() if v>1]) if k1 not in temp3.keys()])
Out[27]:
In [26]:
len(set([k for k,v in temp1.items() if v>1])&set(temp3.keys()))
Out[26]:
In [43]:
len([k1 for k1 in temp3.keys() if k1 not in set([k for k,v in temp1.items() if v>1])])
Out[43]:
(The count from In [27] is the number of people we can definitely use to study how diagnoses change.)
(The count from In [26] is the number of people who could serve our purpose after some deletion/cleaning.)
(The people counted in In [43] can't be used, in my view; they are no different from the single-diagnosis patients.)
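The three counts above amount to splitting people by two properties: whether their single-diagnosis encounters form a sequence, and whether they have any multi-diagnosis encounter. A minimal sketch with plain set operations follows; `single_dx` and `multi_dx` are hypothetical stand-ins for `temp1` and `temp3`, filled with made-up person numbers.

```python
# Hypothetical stand-ins for temp1 and temp3:
# person -> number of distinct diagnoses seen in that subset of encounters
single_dx = {101: 1, 102: 2, 103: 3, 104: 1}   # from single-diagnosis encounters
multi_dx = {103: 2, 104: 2, 105: 3}            # from multi-diagnosis encounters

# People whose single-diagnosis encounters show more than one diagnosis
sequence = {p for p, n in single_dx.items() if n > 1}
# People with at least one multi-diagnosis encounter
has_multi = set(multi_dx)

pure_sequence = sequence - has_multi        # sequence, no multi-diagnosis encounter
mixed_sequence = sequence & has_multi       # sequence plus multi-diagnosis encounters
multi_no_sequence = has_multi - sequence    # multi-diagnosis encounters, no sequence

print(len(pure_sequence), len(mixed_sequence), len(multi_no_sequence))
```

Building the sets once and reusing them keeps each count to a single expression and makes it easy to check that the groups do not overlap.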
In [28]:
# Example of pure diagnosis sequence
df[df.Person_Nbr.isin([639475])]
Out[28]:
In [29]:
# Example of mixture of diagnosis sequence and multi-diagnosis encounters
df[df.Person_Nbr.isin([186925])]
Out[29]:
In [53]:
# Example of people who have both multi-diagnosis encounters and single-diagnosis encounters
df[df.Person_Nbr.isin([138753])]
Out[53]:
In [52]:
# Example of people who have only multi-diagnosis encounters
df[df.Person_Nbr.isin([476849])]
Out[52]:
For people who have a pure diagnosis sequence, how does their condition change over time?
In [30]:
index = [k1 for k1 in set([k for k,v in temp1.items() if v>1]) if k1 not in temp3.keys()]
In [31]:
df3 = df[df[['mNPDR', 'MNPDR', 'SNPDR', 'PDR']].sum(axis=1).isin([0,1])]
df3 = df3[df3.Person_Nbr.isin(index)]
df3.head()
Out[31]:
In [39]:
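import pprint
dx_cols = ['mNPDR', 'MNPDR', 'SNPDR', 'PDR']
d = {}
for person, v in df3.groupby('Person_Nbr'):
    l = []
    t = v.sort_values(by=['Enc_Date'], ascending=True)
    # Use a separate name for the row label so the person key is not overwritten
    for row in t.index:
        # Look only at the diagnosis flags, not Enc_Date or Person_Nbr
        t1 = t.loc[row, dx_cols]
        if len(t1[t1 == True]) > 0:
            l.append({t.loc[row, 'Enc_Date']: (t1[t1 == True].index[0])})
        else:
            l.append({t.loc[row, 'Enc_Date']: 'unknown'})
    d[person] = l
# list(...) is needed on Python 3, where dict views cannot be indexed
list(d.items())[0]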
Out[39]:
In [40]:
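dictionary={'mNPDR':1, 'MNPDR':2, 'SNPDR':3, 'PDR':4}
def f(dic):
    # Turn one person's list of {date: diagnosis} dicts into (dates, severity levels)
    l=[]
    t=[]
    for idx, v in enumerate(dic):
        # list(...) is needed on Python 3, where dict views cannot be indexed
        t.append(list(v.keys())[0])
        if list(v.values())[0]=='unknown':
            # Carry the previous level forward; use 0 if the first encounter is unknown
            if idx!=0:
                l.append(l[idx-1])
            else:
                l.append(0)
        else:
            l.append(dictionary[list(v.values())[0]])
    return (t,l)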
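The rule in `f` (an unknown encounter inherits the previous severity level, and a leading unknown becomes 0) is a forward fill. A minimal sketch of the same rule with pandas, assuming one person's labels are already sorted by encounter date; the series `labels` and its dates are invented for illustration.

```python
import pandas as pd

severity = {'mNPDR': 1, 'MNPDR': 2, 'SNPDR': 3, 'PDR': 4}

# Hypothetical diagnosis labels for one person, sorted by encounter date
labels = pd.Series(
    ['unknown', 'mNPDR', 'unknown', 'SNPDR', 'unknown'],
    index=pd.to_datetime(['2010-01-05', '2010-06-01', '2011-02-10',
                          '2012-03-15', '2013-01-20']),
)

# 'unknown' is not a key of the mapping, so map() turns it into NaN;
# ffill() carries the last known level forward, and fillna(0) handles
# an unknown first encounter
levels = labels.map(severity).ffill().fillna(0).astype(int)
print(levels)
```

The resulting series keeps the dates as its index, so `levels.index` and `levels.values` could be plotted in the same way as the `(t, l)` pair returned by `f`.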
In [41]:
import matplotlib.pyplot as plt
import matplotlib.dates
In [42]:
i=0
for k,v in d.items():
    t,l = f(v)
    plt.plot(t,l)
    i+=1
    if i>9:
        break
plt.yticks(range(0,5), ['unknown', 'mNPDR', 'MNPDR', 'SNPDR', 'PDR'])
plt.ylabel('Severity of DR')
plt.xlabel('Year')
#plt.savefig('hhh.jpg')
plt.show()