The diagnosis table is used to track active problems for a given patient. It contains a list of diagnoses that were documented for each patient in the Active Diagnosis/Treatment sections of the eCareManager medical record. For some records, the corresponding International Classification of Diseases (ICD) codes are available. This can be useful for determining if certain diseases were documented during the ICU stay and at what point in the patient’s ICU stay these diagnoses were documented.
Important columns in this table include:
diagnosisstring
- the problem documentedicd9code
- a mapping from the problem to ICD-9 and ICD-10 code (the name of the column is antiquated)
In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import psycopg2
import getpass
import pdvega
# for configuring connection
from configobj import ConfigObj
import os
%matplotlib inline
In [2]:
# Create a database connection using settings from config file
config='../db/config.ini'
# connection info
conn_info = dict()
if os.path.isfile(config):
config = ConfigObj(config)
conn_info["sqluser"] = config['username']
conn_info["sqlpass"] = config['password']
conn_info["sqlhost"] = config['host']
conn_info["sqlport"] = config['port']
conn_info["dbname"] = config['dbname']
conn_info["schema_name"] = config['schema_name']
else:
conn_info["sqluser"] = 'postgres'
conn_info["sqlpass"] = ''
conn_info["sqlhost"] = 'localhost'
conn_info["sqlport"] = 5432
conn_info["dbname"] = 'eicu'
conn_info["schema_name"] = 'public,eicu_crd'
# Connect to the eICU database
print('Database: {}'.format(conn_info['dbname']))
print('Username: {}'.format(conn_info["sqluser"]))
if conn_info["sqlpass"] == '':
# try connecting without password, i.e. peer or OS authentication
try:
if (conn_info["sqlhost"] == 'localhost') & (conn_info["sqlport"]=='5432'):
con = psycopg2.connect(dbname=conn_info["dbname"],
user=conn_info["sqluser"])
else:
con = psycopg2.connect(dbname=conn_info["dbname"],
host=conn_info["sqlhost"],
port=conn_info["sqlport"],
user=conn_info["sqluser"])
except:
conn_info["sqlpass"] = getpass.getpass('Password: ')
con = psycopg2.connect(dbname=conn_info["dbname"],
host=conn_info["sqlhost"],
port=conn_info["sqlport"],
user=conn_info["sqluser"],
password=conn_info["sqlpass"])
query_schema = 'set search_path to ' + conn_info['schema_name'] + ';'
In [3]:
patientunitstayid = 2704494
In [4]:
query = query_schema + """
select *
from diagnosis
where patientunitstayid = {}
order by diagnosisoffset
""".format(patientunitstayid)
df = pd.read_sql_query(query, con)
df
Out[4]:
Here we can see that roughly on admission (diagnosisoffset = 48
minutes after ICU admission), the patient had 4 problems documented:
Later in the stay (diagnosisoffset = 2153
), the same conditions are redocumented, and then finally a few minutes later (diagnosisoffset = 2159
) these conditions are documented a final time with activeupondischarge = True
, indicating that the patient was discharged with these conditions.
It's also worth noting that ICD-9 and ICD-10 codes are available with 3/4 of these diagnosis strings - only leukemia does not have matching codes. This often occurs because the diagnosisstring
is at too high a level in the hierarchy, and so no appropriate ICD-9/ICD-10 code could be found. Let's look at this in more detail.
In [5]:
dx = 'leukemia'
query = query_schema + """
select diagnosisstring, icd9code
, count(distinct patientunitstayid) as n
from diagnosis
where lower(diagnosisstring) like '%{}%'
group by diagnosisstring, icd9code
order by diagnosisstring, n desc
""".format(dx)
df = pd.read_sql_query(query, con)
df.head(n=10)
Out[5]:
Above, we can see that the high level 'hematology|oncology and leukemia|leukemia' does not have an associated ICD-9/ICD-10 code, but the more granular values of 'hematology|oncology and leukemia|leukemia|acute lymphocytic', 'hematology|oncology and leukemia|leukemia|in remission', etc do have mappings to ICD-9/ICD-10 codes. Thus, if using ICD codes to define diagnoses, it is highly recommended to first find associated problems, then identify levels higher up in the hierarchy in order to capture all patients.
In [6]:
icd9 = '250.'
query = query_schema + """
select diagnosisstring, icd9code
, count(distinct patientunitstayid) as n
from diagnosis
where icd9code like '%{}%'
group by diagnosisstring, icd9code
order by diagnosisstring, n desc
""".format(icd9)
df = pd.read_sql_query(query, con)
df
Out[6]:
We have identified 8 rows, beginning with 'endocrine|glucose metabolism'. Let's modify our query to search for this phrase.
In [7]:
dx = 'endocrine|glucose metabolism'
query = query_schema + """
select diagnosisstring, icd9code
, count(distinct patientunitstayid) as n
from diagnosis
where diagnosisstring like '%{}%'
group by diagnosisstring, icd9code
order by diagnosisstring, n desc
""".format(dx)
df = pd.read_sql_query(query, con)
df
Out[7]:
We now have 21 rows, and though not all are equivalent to diabetes, we do find that the top hit, 'endocrine|glucose metabolism|diabetes mellitus' has 14620 observations and is by far the most frequently documented problem for diagbetes, even though it does not have an ICD-9/ICD-10 code!
In [8]:
query = query_schema + """
select
pt.hospitalid
, count(distinct pt.patientunitstayid) as number_of_patients
, count(distinct dx.patientunitstayid) as number_of_patients_with_tbl
from patient pt
left join diagnosis dx
on pt.patientunitstayid = dx.patientunitstayid
group by pt.hospitalid
""".format(patientunitstayid)
df = pd.read_sql_query(query, con)
df['data completion'] = df['number_of_patients_with_tbl'] / df['number_of_patients'] * 100.0
df.sort_values('number_of_patients_with_tbl', ascending=False, inplace=True)
df.head(n=10)
Out[8]:
In [9]:
df[['data completion']].vgplot.hist(bins=10,
var_name='Number of hospitals',
value_name='Percent of patients with data')
The diagnosis table is a core component of eICU Care Manager, and the majority of hospitals use this interface routinely. However, the particular use of it may vary. For example, we can ask: how many hospitals document problems on ICU discharge?
In [10]:
query = query_schema + """
select
pt.hospitalid
, count(distinct pt.patientunitstayid) as number_of_patients
, count(distinct dx.patientunitstayid) as number_of_patients_with_tbl
, count(distinct case when dx.activeupondischarge = 'True' then dx.patientunitstayid else null end)
as number_of_patients_with_active
from patient pt
left join diagnosis dx
on pt.patientunitstayid = dx.patientunitstayid
group by pt.hospitalid
""".format(patientunitstayid)
df = pd.read_sql_query(query, con)
df['data completion'] = df['number_of_patients_with_tbl'] / df['number_of_patients'] * 100.0
df['active dx completion'] = df['number_of_patients_with_active'] / df['number_of_patients'] * 100.0
df.sort_values('number_of_patients_with_tbl', ascending=False, inplace=True)
df.head(n=10)
df.head()
Out[10]:
In [19]:
df[['data completion', 'active dx completion']].vgplot.hist(bins=10, alpha=0.6, stacked=False,
var_name='Number of hospitals',
value_name='Percent of patients with data')
The number is reduced, indicating there are a few ICUs which do not routinely document diagnosis on discharge. Conversely, for a few hospitals, every patient may have a diagnosis documented on discharge.