In [1]:
import reader

In [2]:
data=reader.Data()


Local data read/write folder path:
	Default path: /Users/Dan/2017 spring/MATH 497/code and data/data/

Data: systemic_disease_list 
File: systemic_disease_list.pickle
File already exists.

Data: SNOMED_problem_list 
File: SNOMED_problem_list.pickle
File already exists.

Data: macula_findings_for_Enc 
File: macula_findings_for_Enc.pickle
File already exists.

Data: SL_Lens_for_Enc 
File: SL_Lens_for_Enc.pickle
File already exists.

Data: family_hist_list 
File: family_hist_list.pickle
File already exists.

Data: systemic_disease_for_Enc 
File: systemic_disease_for_Enc.pickle
File already exists.

Data: all_person_data 
File: all_person_data_Richard_20170307.pickle
File does not exist. Searching from drive...
	Got the file id. Print metadata:
		Title: all_person_data_Richard_20170307.pickle
		MIME type: application/octet-stream
	Download Progress: 37%
	Download Progress: 75%
	Download Progress: 100%
	Download Complete

Data: family_hist_for_Enc 
File: family_hist_for_Enc.pickle
File already exists.

Data: person_profile 
File: person_profile_df.pickle
File already exists.

Data: all_encounter_data 
File: all_encounter_data_Richard_20170307.pickle
File already exists.

Data: encounters 
File: encounters.pickle
File already exists.

Data: demographics 
File: demographics_processed_Dan_20170304.pickle
File already exists.

Data: ICD_for_Enc 
File: ICD_for_Enc_processed_Dan_20170304.pickle
File already exists.

In [3]:
data['all_encounter_data'].columns.values


Out[3]:
array(['Enc_Date', 'Person_Nbr', 'Primary_Payer', 'Smoking_Status',
       'MR_OD_SPH', 'MR_OD_CYL', 'MR_OD_AXIS', 'MR_OD_DVA', 'MR_OD_NVA',
       'MR_OS_SPH', 'MR_OS_CYL', 'MR_OS_AXIS', 'MR_OS_DVA', 'MR_OS_NVA',
       'BB_OD_SPH', 'BB_OD_CYL', 'BB_OD_AXIS', 'BB_OD_DVA', 'BB_OD_NVA',
       'BB_OS_SPH', 'BB_OS_CYL', 'BB_OS_AXIS', 'BB_OS_DVA', 'BB_OS_NVA',
       'CYCLO_OD_SPH', 'CYCLO_OD_CYL', 'CYCLO_OD_AXIS', 'CYCLO_OD_DVA',
       'CYCLO_OD_NVA', 'CYCLO_OS_SPH', 'CYCLO_OS_CYL', 'CYCLO_OS_AXIS',
       'CYCLO_OS_DVA', 'CYCLO_OS_NVA', 'Glucose', 'BMI', 'BP_Systolic',
       'A1C', 'BP_Diastolic', 'ME', 'MNPDR', 'DM', 'SNPDR',
       'Glaucoma_Suspect', 'mNPDR', 'Open_angle_Glaucoma', 'PDR',
       'Cataract'], dtype=object)

In [4]:
df = data['all_encounter_data'][['Enc_Date','Person_Nbr','mNPDR', 'MNPDR', 'SNPDR', 'PDR']].copy()
df.head()


Out[4]:
Enc_Nbr   Enc_Date             Person_Nbr  mNPDR  MNPDR  SNPDR    PDR
1043      2016-03-08 06:15:00      544674  False  False  False  False
1802      2016-05-13 03:45:00      605657  False  False  False  False
2698      2014-06-08 10:15:00      514762  False  False  False  False
2966      2016-06-24 03:15:00      552364   True  False  False  False
4091      2015-10-29 19:45:00      931187  False  False  False  False

In [5]:
df.shape


Out[5]:
(61862, 6)

In [6]:
df[['mNPDR', 'MNPDR', 'SNPDR', 'PDR']].sum(axis=0)


Out[6]:
mNPDR    3795
MNPDR    2012
SNPDR     907
PDR      3676
dtype: int64

Some encounters have more than one DR diagnosis. The next cell counts how many of the four DR flags are set in each encounter.


In [7]:
from collections import Counter
Counter(df[['mNPDR', 'MNPDR', 'SNPDR', 'PDR']].sum(axis=1))


Out[7]:
Counter({0: 52011, 1: 9322, 2: 519, 3: 10})

Because of these multi-diagnosis encounters, once all encounters are grouped by person there are 555 people with 2 distinct DR diagnoses, 98 with 3, and 10 with all 4. From these counts alone we cannot tell whether a person's diagnoses came from a series of encounters or from a single encounter.


In [8]:
# Per person: number of encounters in which each DR grade was recorded
tmp={k:v[['mNPDR', 'MNPDR', 'SNPDR', 'PDR']].sum(axis=0) for k,v in df.groupby('Person_Nbr')}

In [9]:
# Per person: number of distinct DR grades recorded at least once
tmp1 = {k:len(v[~v.isin([0])].index.tolist()) for k,v in tmp.items()}

In [10]:
Counter(tmp1.values())


Out[10]:
Counter({0: 12323, 1: 3053, 2: 555, 3: 98, 4: 10})
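
The per-person count above can also be written as a single vectorized `groupby`. The sketch below is only an illustration on a made-up toy table (the `toy` frame and its values are assumptions, not the study data); it reuses the column names from `df`.

```python
import pandas as pd

DR_COLS = ['mNPDR', 'MNPDR', 'SNPDR', 'PDR']

# Toy encounter table with made-up values (not the study data).
toy = pd.DataFrame({
    'Person_Nbr': [1, 1, 2, 2, 3],
    'mNPDR': [True, False, False, False, False],
    'MNPDR': [False, True, False, False, False],
    'SNPDR': [False, False, False, False, False],
    'PDR':   [False, False, True, True, False],
})

# For each person, flag which grades were ever recorded, then count them.
ever_recorded = toy.groupby('Person_Nbr')[DR_COLS].any()
n_grades = ever_recorded.sum(axis=1)

# Distribution of "number of distinct grades per person".
print(n_grades.value_counts().sort_index().to_dict())  # {0: 1, 1: 1, 2: 1}
```

Applied to `df`, `n_grades.value_counts()` should play the same role as `Counter(tmp1.values())`.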

So I first keep only the encounters that have exactly 1 diagnosis and ignore the multi-diagnosis encounters. Among these, 449 people have more than 1 distinct diagnosis, and those diagnoses must come from a sequence of encounters. However, the encounters I ignored may affect this count.


In [11]:
df1=df[df[['mNPDR', 'MNPDR', 'SNPDR', 'PDR']].sum(axis=1).isin([1])]

In [12]:
temp={k:v[['mNPDR', 'MNPDR', 'SNPDR', 'PDR']].sum(axis=0) for k,v in df1.groupby('Person_Nbr')}

In [13]:
len(temp)


Out[13]:
3629

In [14]:
temp[863]


Out[14]:
mNPDR    0
MNPDR    2
SNPDR    0
PDR      0
dtype: int64

In [15]:
temp1 = {k:len(v[~v.isin([0])].index.tolist()) for k,v in temp.items()}

In [16]:
Counter(temp1.values())


Out[16]:
Counter({1: 3180, 2: 388, 3: 57, 4: 4})

The multi-diagnosis encounters involve 340 people, out of the 663 people who have multiple diagnoses in total, so they may mislead our conclusions about these people.


In [17]:
df2=df[df[['mNPDR', 'MNPDR', 'SNPDR', 'PDR']].sum(axis=1).isin([2,3])]

In [18]:
temp2={k:v[['mNPDR', 'MNPDR', 'SNPDR', 'PDR']].sum(axis=0) for k,v in df2.groupby('Person_Nbr')}

In [19]:
len(temp2)


Out[19]:
340

In [20]:
temp2[863]


Out[20]:
mNPDR    0
MNPDR    1
SNPDR    0
PDR      1
dtype: int64

In [21]:
temp3 = {k:len(v[~v.isin([0])].index.tolist()) for k,v in temp2.items()}

In [22]:
Counter(temp3.values())


Out[22]:
Counter({2: 305, 3: 31, 4: 4})

The sets extracted above overlap.

Among the 3180 people identified in the previous step as having only one distinct diagnosis across their single-diagnosis encounters, 127 also have multi-diagnosis encounters. Should we delete the multi-diagnosis encounters from their records? I don't think that is a good idea.


In [23]:
len(set([k for k,v in temp1.items() if v==1])&set(temp3.keys()))


Out[23]:
127

87 people have only multi-diagnosis encounters and no single-diagnosis encounters.


In [24]:
len([k for k in temp3.keys() if k not in temp1.keys()])


Out[24]:
87

Among the 449 people identified in the previous step as having a diagnosis sequence, 126 also have multi-diagnosis encounters.


In [25]:
len(set([k for k,v in temp1.items() if v>1]))


Out[25]:
449

In [27]:
len([k1 for k1 in set([k for k,v in temp1.items() if v>1]) if k1 not in temp3.keys()])


Out[27]:
323

In [26]:
len(set([k for k,v in temp1.items() if v>1])&set(temp3.keys()))


Out[26]:
126

In [43]:
len([k1 for k1 in temp3.keys() if k1 not in set([k for k,v in temp1.items() if v>1])])


Out[43]:
214

My conclusion:

3716 people have a DR diagnosis. 663 of the 3716 have multiple DR diagnoses across their entire record.

-323 of 663 have a pure diagnosis sequence (only single-diagnosis encounters).

(These are the people we can definitely use to study how the diagnosis changes.)

-126 of 663 have both a diagnosis sequence and multi-diagnosis encounters.

(With some deletion/cleaning, these people could also be used for our purpose.)

-214 of 663 have multi-diagnosis encounters but no diagnosis sequence.

--127 of 214 have both multi-diagnosis encounters and single-diagnosis encounters
--87 of 214 have only multi-diagnosis encounters

(In my view these people cannot be used; they are no different from single-diagnosis patients.)

Of course, any of the people above may also have encounters without a DR diagnosis.
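
The grouping in this conclusion can be reproduced in one pass. The sketch below is a hypothetical helper (`classify_people` and its labels are my own names, not part of the notebook) run on a made-up toy table; it assumes the same boolean DR columns as `df`.

```python
import pandas as pd

DR_COLS = ['mNPDR', 'MNPDR', 'SNPDR', 'PDR']

def classify_people(enc):
    """Return {Person_Nbr: group label} for everyone with a DR diagnosis."""
    per_enc = enc[DR_COLS].sum(axis=1)      # DR flags set in each encounter
    single = enc[per_enc == 1]              # single-diagnosis encounters
    multi = enc[per_enc > 1]                # multi-diagnosis encounters
    # Distinct grades per person, counting single-diagnosis encounters only.
    grades = single.groupby('Person_Nbr')[DR_COLS].any().sum(axis=1)
    has_sequence = set(grades[grades > 1].index)
    has_single = set(single['Person_Nbr'])
    has_multi = set(multi['Person_Nbr'])
    labels = {}
    for person in has_single | has_multi:
        if person in has_sequence and person not in has_multi:
            labels[person] = 'pure sequence'
        elif person in has_sequence:
            labels[person] = 'sequence + multi'
        elif person in has_multi and person in has_single:
            labels[person] = 'multi + single'
        elif person in has_multi:
            labels[person] = 'multi only'
        else:
            labels[person] = 'single diagnosis'
    return labels

# Toy data (made up): one person per group, plus person 6 with no DR at all.
rows = [
    (1, 1, 0, 0, 0), (1, 0, 1, 0, 0),
    (2, 1, 0, 0, 0), (2, 0, 1, 0, 0), (2, 0, 0, 1, 1),
    (3, 0, 0, 0, 1), (3, 0, 0, 1, 1),
    (4, 1, 1, 0, 0),
    (5, 0, 0, 0, 1),
    (6, 0, 0, 0, 0),
]
toy = pd.DataFrame(rows, columns=['Person_Nbr'] + DR_COLS)
toy[DR_COLS] = toy[DR_COLS].astype(bool)

labels = classify_people(toy)
for person in sorted(labels):
    print(person, labels[person])
```

On the real data, `Counter(classify_people(df).values())` would be expected to give the 323 / 126 / 127 / 87 split reported above, with the remaining people in the single-diagnosis group.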


In [28]:
# Example of pure diagnosis sequence
df[df.Person_Nbr.isin([639475])]


Out[28]:
Enc_Nbr   Enc_Date             Person_Nbr  mNPDR  MNPDR  SNPDR    PDR
286486    2015-11-16 11:45:00      639475  False   True  False  False
483707    2011-11-23 13:15:00      639475  False  False  False  False
3723748   2011-12-04 19:15:00      639475  False  False  False  False
4539952   2014-10-08 08:15:00      639475  False  False  False  False
5154739   2016-08-02 23:45:00      639475  False  False   True  False
7605153   2014-05-21 03:15:00      639475   True  False  False  False
10912274  2016-08-06 03:30:00      639475  False  False  False   True
10960755  2016-07-22 13:00:00      639475  False  False  False  False
11416041  2011-11-17 17:45:00      639475  False  False  False  False
13482828  2016-10-04 10:30:00      639475  False  False  False   True

In [29]:
# Example of mixture of diagnosis sequence and multi-diagnosis encounters
df[df.Person_Nbr.isin([186925])]


Out[29]:
Enc_Nbr   Enc_Date             Person_Nbr  mNPDR  MNPDR  SNPDR    PDR
2181817   2011-12-12 23:45:00      186925   True   True  False   True
5676339   2013-05-06 10:00:00      186925  False  False   True  False
7103183   2016-07-05 21:00:00      186925  False  False  False   True
8432456   2013-04-03 03:30:00      186925  False  False   True   True
9577267   2015-12-14 01:45:00      186925  False  False   True  False
9833154   2013-07-19 23:15:00      186925  False  False   True   True
9942232   2016-03-02 02:45:00      186925  False  False  False   True
11265339  2016-02-06 08:00:00      186925  False  False  False   True
11664322  2016-09-06 08:45:00      186925  False  False  False   True
14092477  2012-07-20 01:15:00      186925   True  False  False  False
14409371  2014-01-16 16:15:00      186925  False  False   True   True
14484579  2013-12-09 02:30:00      186925  False  False  False   True
14496790  2016-06-18 04:30:00      186925  False  False  False   True
14702115  2013-12-07 22:45:00      186925  False  False   True   True
15024156  2014-03-15 06:00:00      186925  False  False   True   True
16450694  2016-01-18 06:45:00      186925  False  False   True   True
16529961  2013-02-11 02:30:00      186925   True  False  False  False

In [53]:
# Example of people who have both multi-diagnosis encounters and single-diagnosis encounters
df[df.Person_Nbr.isin([138753])]


Out[53]:
Enc_Nbr   Enc_Date             Person_Nbr  mNPDR  MNPDR  SNPDR    PDR
2513688   2013-09-13 03:30:00      138753  False  False   True  False
15601634  2013-08-14 03:00:00      138753  False   True   True  False

In [52]:
# Example of people who have only multi-diagnosis encounters
df[df.Person_Nbr.isin([476849])]


Out[52]:
Enc_Nbr   Enc_Date             Person_Nbr  mNPDR  MNPDR  SNPDR    PDR
5528625   2012-10-15 21:15:00      476849   True   True  False  False

For people who have a pure diagnosis sequence, how does their condition change over time?


In [30]:
index = [k1 for k1 in set([k for k,v in temp1.items() if v>1]) if k1 not in temp3.keys()]

In [31]:
df3 = df[df[['mNPDR', 'MNPDR', 'SNPDR', 'PDR']].sum(axis=1).isin([0,1])]
df3 = df3[df3.Person_Nbr.isin(index)]
df3.head()


Out[31]:
Enc_Nbr   Enc_Date             Person_Nbr  mNPDR  MNPDR  SNPDR    PDR
8341      2013-11-03 04:30:00     1003061  False  False  False  False
9831      2014-06-09 16:45:00      800883  False   True  False  False
16624     2015-02-06 06:00:00      556527  False  False  False  False
20261     2012-04-21 04:45:00      764764  False  False  False   True
44860     2016-10-17 16:15:00      771595  False  False  False   True

In [39]:
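import pprint
# For each person, build a chronological list of {Enc_Date: DR grade} records
d={}
for k,v in df3.groupby('Person_Nbr'):
    l=[]
    t = v.sort_values(by=['Enc_Date'], ascending=True)
    for k in t.index:
        t1=t.loc[k]
        if len(t1[t1==True])>0:
            # the encounter has exactly one DR flag set; record its column name
            l.append({t.loc[k,'Enc_Date']: (t1[t1==True].index[0])})
        else:
            # no DR flag set in this encounter
            l.append({t.loc[k,'Enc_Date']: 'unknown'})
    # NOTE: the inner loop reuses k, so the key here is the person's latest Enc_Nbr, not Person_Nbr
    d[k]=l
    
# list(...) makes this work on both Python 2 and Python 3
list(d.items())[0]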


Out[39]:
(12789760,
 [{Timestamp('2012-01-28 08:30:00'): 'mNPDR'},
  {Timestamp('2012-03-19 07:45:00'): 'MNPDR'},
  {Timestamp('2012-09-03 19:45:00'): 'MNPDR'},
  {Timestamp('2013-09-03 02:45:00'): 'MNPDR'},
  {Timestamp('2014-07-04 11:45:00'): 'MNPDR'},
  {Timestamp('2014-08-02 12:30:00'): 'unknown'},
  {Timestamp('2014-10-12 22:15:00'): 'mNPDR'},
  {Timestamp('2015-07-20 06:15:00'): 'MNPDR'},
  {Timestamp('2016-01-07 21:15:00'): 'PDR'}])

In [40]:
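# Ordinal severity code for each DR grade (0 is reserved for 'unknown')
dictionary={'mNPDR':1, 'MNPDR':2, 'SNPDR':3, 'PDR':4}
def f(dic):
    # dic: one person's list of {Enc_Date: grade} dicts in chronological order
    # returns (list of dates, list of severity codes)
    l=[]
    t=[]
    for idx, v in enumerate(dic):
        t.append(list(v.keys())[0])
        grade = list(v.values())[0]
        if grade=='unknown':
            if idx!=0:
                # carry the previous encounter's severity forward
                l.append(l[idx-1])
            else:
                # nothing to carry forward yet
                l.append(0)
        else:
            l.append(dictionary[grade])
    return (t,l)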
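
The carry-forward rule in `f` (an 'unknown' encounter inherits the previous severity, or 0 when nothing precedes it) can also be expressed with pandas. This is a minimal sketch on a made-up label list; `SEVERITY` and `severity_series` are hypothetical names introduced for illustration.

```python
import pandas as pd

SEVERITY = {'mNPDR': 1, 'MNPDR': 2, 'SNPDR': 3, 'PDR': 4}

def severity_series(labels):
    """Map grade labels to 1-4, carry the last grade over 'unknown', use 0 before any grade."""
    # 'unknown' is not a key of SEVERITY, so map() turns it into NaN.
    s = pd.Series(labels).map(SEVERITY)
    # Forward-fill the gaps, then set any leading gaps to 0.
    return s.ffill().fillna(0).astype(int)

labels = ['unknown', 'mNPDR', 'MNPDR', 'unknown', 'PDR']
print(severity_series(labels).tolist())  # [0, 1, 2, 2, 4]
```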

In [41]:
import matplotlib.pyplot as plt
import matplotlib.dates

In [42]:
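# Plot the severity trajectories of the first 10 people in d
i=0
for k,v in d.items():
    t,l = f(v)
    plt.plot(t,l)
    i+=1
    if i>9:
        break
# Label the numeric severity codes with the DR grade names
plt.yticks(range(0,5), ['unknown', 'mNPDR', 'MNPDR', 'SNPDR', 'PDR'])
plt.ylabel('Severity of DR')
plt.xlabel('Year')
#plt.savefig('hhh.jpg')
plt.show()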


