Laboratory tests that have have been mapped to a standard set of measurements. Unmapped measurements are recorded in the customLab table. The lab table is fairly well populated by hospitals. It is possible some rarely obtained lab measurements are not interfaced into the system and therefore will not be available in the database. Absence of a rare lab measurement, such as serum lidocaine concentrations, would not indicate the lab was not drawn. However, absence of a platelet count would likely indicate the value was not obtained.
In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import psycopg2
import getpass
import pdvega
# for configuring connection
from configobj import ConfigObj
import os
%matplotlib inline
In [2]:
# Create a database connection using settings from config file
config='../db/config.ini'
# connection info
conn_info = dict()
if os.path.isfile(config):
config = ConfigObj(config)
conn_info["sqluser"] = config['username']
conn_info["sqlpass"] = config['password']
conn_info["sqlhost"] = config['host']
conn_info["sqlport"] = config['port']
conn_info["dbname"] = config['dbname']
conn_info["schema_name"] = config['schema_name']
else:
conn_info["sqluser"] = 'postgres'
conn_info["sqlpass"] = ''
conn_info["sqlhost"] = 'localhost'
conn_info["sqlport"] = 5432
conn_info["dbname"] = 'eicu'
conn_info["schema_name"] = 'public,eicu_crd'
# Connect to the eICU database
print('Database: {}'.format(conn_info['dbname']))
print('Username: {}'.format(conn_info["sqluser"]))
if conn_info["sqlpass"] == '':
# try connecting without password, i.e. peer or OS authentication
try:
if (conn_info["sqlhost"] == 'localhost') & (conn_info["sqlport"]=='5432'):
con = psycopg2.connect(dbname=conn_info["dbname"],
user=conn_info["sqluser"])
else:
con = psycopg2.connect(dbname=conn_info["dbname"],
host=conn_info["sqlhost"],
port=conn_info["sqlport"],
user=conn_info["sqluser"])
except:
conn_info["sqlpass"] = getpass.getpass('Password: ')
con = psycopg2.connect(dbname=conn_info["dbname"],
host=conn_info["sqlhost"],
port=conn_info["sqlport"],
user=conn_info["sqluser"],
password=conn_info["sqlpass"])
query_schema = 'set search_path to ' + conn_info['schema_name'] + ';'
In [3]:
patientunitstayid = 2704494
In [4]:
query = query_schema + """
select *
from lab
where patientunitstayid = {}
order by labresultoffset
""".format(patientunitstayid)
df = pd.read_sql_query(query, con)
df.head()
Out[4]:
In [5]:
query = query_schema + """
select *
from patient
where patientunitstayid = {}
""".format(patientunitstayid)
pt = pd.read_sql_query(query, con)
pt[['patientunitstayid', 'apacheadmissiondx', 'hospitaladmitoffset']]
Out[5]:
Immediately we can note the very large negative labresultoffset
. This likely means we have some lab values pre-ICU. In some cases this will be a lab measured in another hospital location such as the emergency department or hospital floor. In this case, the large value (-99620 minutes, or ~70 days) is surprising, but we can see from the patient table that the patient was admitted to the hospital -99779 minutes before their ICU stay (hospitaladmitoffset
). This patient was admitted to the ICU with thrombocytopenia (apacheadmissiondx
), and inspection of the diagnosis table indicates they have a form of cancer, so likely this is a long hospital stay where labs were taken on hospital admission.
In [6]:
query = query_schema + """
select labname, count(*) as n
from lab
group by labname
order by n desc
""".format(patientunitstayid)
lab = pd.read_sql_query(query, con)
print('{} total vlues for {} distinct labs.'.format(lab['n'].sum(), lab.shape[0]))
print('\nTop 5 labs by frequency:')
lab.head()
Out[6]:
The lab table is a large table with over 39 million observations. The most frequent observation is bedside glucose which accounts for almost 10% of the lab table, followed by potassium and sodium.
In [7]:
query = query_schema + """
with t as
(
select distinct patientunitstayid
from lab
)
select
pt.hospitalid
, count(distinct pt.patientunitstayid) as number_of_patients
, count(distinct t.patientunitstayid) as number_of_patients_with_tbl
from patient pt
left join t
on pt.patientunitstayid = t.patientunitstayid
group by pt.hospitalid
""".format(patientunitstayid)
df = pd.read_sql_query(query, con)
df['data completion'] = df['number_of_patients_with_tbl'] / df['number_of_patients'] * 100.0
df.sort_values('number_of_patients_with_tbl', ascending=False, inplace=True)
df.head(n=10)
Out[7]:
In [8]:
df.tail(n=10)
Out[8]:
In [9]:
df[['data completion']].vgplot.hist(bins=10,
var_name='Number of hospitals',
value_name='Percent of patients with data')
Above we can see that very few hospitals are missing lab data. Most of the data at < 90% data completion is driven by a few hospitals with very few patients. The majority of hospitals have 90-100% of patients with data in the lab table (right side of histogram, 0-90% bin).