This notebook reproduces the arterial line study in MIMIC-III. It extracts the study cohort and covariates using SQL queries run against MIMIC-III on AWS Athena; a companion R notebook then evaluates whether an arterial line is associated with mortality after propensity matching.
Note that the original arterial line study used a genetic algorithm to select the covariates in the propensity score. We omit the genetic algorithm step, and instead use the final set of covariates described by the authors. For more detail, see:
Hsu DJ, Feng M, Kothari R, Zhou H, Chen KP, Celi LA. The association between indwelling arterial catheters and mortality in hemodynamically stable patients with respiratory failure: a propensity score analysis. CHEST Journal. 2015 Dec 1;148(6):1470-6.
In [ ]:
# Install the PyAthena Python package. This only needs to be run once for each new notebook instance.
!pip install PyAthena
In [1]:
from __future__ import print_function

# Import libraries
import datetime
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import boto3
from botocore.client import ClientError
from pyathena import connect
from pyathena.util import as_pandas
# below is used to print out pretty pandas dataframes
from IPython.display import display, HTML
%matplotlib inline
s3 = boto3.resource('s3')
client = boto3.client("sts")
account_id = client.get_caller_identity()["Account"]
my_session = boto3.session.Session()
region = my_session.region_name
athena_query_results_bucket = 'aws-athena-query-results-'+account_id+'-'+region
try:
    s3.meta.client.head_bucket(Bucket=athena_query_results_bucket)
except ClientError:
    bucket = s3.create_bucket(Bucket=athena_query_results_bucket)
    print('Creating bucket '+athena_query_results_bucket)
cursor = connect(s3_staging_dir='s3://'+athena_query_results_bucket+'/athena/temp').cursor()
# The Glue database name of your MIMIC-III parquet data
gluedatabase="mimiciii"
# location of the queries to generate aline specific materialized views
aline_path = './'
# location of the queries to generate materialized views from the MIMIC code repository
concepts_path = './concepts/'
Before generating the aline cohort, we require the following materialized views to be already generated:
angus.sql
HeightWeightQuery.sql
aline_vaso_flag.sql
You can generate the above (along with the ventilation settings and durations tables) by executing the code blocks below. If you haven't changed the directory structure, they should work as-is; otherwise you may need to modify the concepts_path variable above.
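Each table that follows is (re)generated with the same three-step pattern: drop any existing table, read the query from its .sql file, and substitute the Glue database name for the DATABASE placeholder before executing. The helper below is a sketch of that pattern; the cells that follow spell each step out explicitly instead of using it.

```python
def build_statements(sql_path, table, database):
    """Return the (drop, create) statement pair for regenerating `table`
    from a SQL file that uses the DATABASE placeholder."""
    drop_stmt = 'DROP TABLE IF EXISTS {}.{};'.format(database, table)
    with open(sql_path) as fp:
        create_stmt = fp.read().replace('DATABASE', database)
    return drop_stmt, create_stmt
```

Usage would mirror the cells below, e.g. `drop, create = build_statements('concepts/sepsis/angus-awsathena.sql', 'angus_sepsis', gluedatabase)` followed by `cursor.execute(drop)` and `cursor.execute(create)`.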
In [2]:
# Load in the query from file
query='DROP TABLE IF EXISTS DATABASE.angus_sepsis;'
cursor.execute(query.replace("DATABASE", gluedatabase))
f = os.path.join(concepts_path,'sepsis/angus-awsathena.sql')
with open(f) as fp:
    query = ''.join(fp.readlines())
# Execute the query
print('Generating table \'angus_sepsis\' using {} ...'.format(f),end=' ')
cursor.execute(query.replace("DATABASE", gluedatabase))
print('done.')
# Load in the query from file
query='DROP TABLE IF EXISTS DATABASE.heightweight;'
cursor.execute(query.replace("DATABASE", gluedatabase))
f = os.path.join(concepts_path,'demographics/HeightWeightQuery-awsathena.sql')
with open(f) as fp:
    query = ''.join(fp.readlines())
# Execute the query
print('Generating table \'heightweight\' using {} ...'.format(f),end=' ')
cursor.execute(query.replace("DATABASE", gluedatabase))
print('done.')
# Load in the query from file
query='DROP TABLE IF EXISTS DATABASE.aline_vaso_flag;'
cursor.execute(query.replace("DATABASE", gluedatabase))
f = os.path.join(aline_path,'aline_vaso_flag-awsathena.sql')
with open(f) as fp:
    query = ''.join(fp.readlines())
# Execute the query
print('Generating table \'aline_vaso_flag\' using {} ...'.format(f),end=' ')
cursor.execute(query.replace("DATABASE", gluedatabase))
print('done.')
# Load in the query from file
query='DROP TABLE IF EXISTS DATABASE.ventsettings;'
cursor.execute(query.replace("DATABASE", gluedatabase))
f = os.path.join(concepts_path,'durations/ventilation-settings-awsathena.sql')
with open(f) as fp:
    query = ''.join(fp.readlines())
# Execute the query
print('Generating table \'ventsettings\' using {} ...'.format(f),end=' ')
cursor.execute(query.replace("DATABASE", gluedatabase))
print('done.')
# Load in the query from file
query='DROP TABLE IF EXISTS DATABASE.ventdurations;'
cursor.execute(query.replace("DATABASE", gluedatabase))
f = os.path.join(concepts_path,'durations/ventilation-durations-awsathena.sql')
with open(f) as fp:
    query = ''.join(fp.readlines())
# Execute the query
print('Generating table \'ventdurations\' using {} ...'.format(f),end=' ')
cursor.execute(query.replace("DATABASE", gluedatabase))
print('done.')
Now we generate the aline_cohort table using the aline_cohort.sql file.
Afterwards, we can generate the remaining 6 materialized views in any order, as they all depend on only aline_cohort and raw MIMIC-III data.
In [3]:
# Load in the query from file
query='DROP TABLE IF EXISTS DATABASE.aline_cohort_all;'
cursor.execute(query.replace("DATABASE", gluedatabase))
f = os.path.join(aline_path,'aline_cohort-awsathena.sql')
with open(f) as fp:
    query = ''.join(fp.readlines())
# Execute the query
print('Generating table \'aline_cohort_all\' using {} ...'.format(f),end=' ')
cursor.execute(query.replace("DATABASE", gluedatabase))
print('done.')
# Load in the query from file
query='DROP TABLE IF EXISTS DATABASE.aline_cohort;'
cursor.execute(query.replace("DATABASE", gluedatabase))
f = os.path.join(aline_path,'aline_final_cohort-awsathena.sql')
with open(f) as fp:
    query = ''.join(fp.readlines())
# Execute the query
print('Generating table \'aline_cohort\' using {} ...'.format(f),end=' ')
cursor.execute(query.replace("DATABASE", gluedatabase))
print('done.')
In [4]:
query = """
select
icustay_id
, exclusion_readmission
, exclusion_shortstay
, exclusion_vasopressors
, exclusion_septic
, exclusion_aline_before_admission
, exclusion_not_ventilated_first24hr
, exclusion_service_surgical
from DATABASE.aline_cohort_all
"""
cursor.execute(query.replace("DATABASE", gluedatabase))
# Load the result of the query into a dataframe
df = as_pandas(cursor)
# print out exclusions
idxRem = df['icustay_id'].isnull()
for c in df.columns:
    if 'exclusion_' in c:
        print('{:5d} - {}'.format(df[c].sum(), c))
        idxRem[df[c]==1] = True
# total number of patients removed by any exclusion criterion
print('Will remove {} of {} patients.'.format(np.sum(idxRem), df.shape[0]))
print('')
print('')
print('Reproducing the flow of the flowchart from Chest paper.')
# first stay
idxRem = (df['exclusion_readmission']==1) | (df['exclusion_shortstay']==1)
print('{:5d} - removing {:5d} ({:2.2f}%) patients - short stay // readmission.'.format(
    df.shape[0], np.sum(idxRem), 100.0*np.mean(idxRem)))
df = df.loc[~idxRem,:]
idxRem = df['exclusion_not_ventilated_first24hr']==1
print('{:5d} - removing {:5d} ({:2.2f}%) patients - not ventilated in first 24 hours.'.format(
    df.shape[0], np.sum(idxRem), 100.0*np.mean(idxRem)))
df = df.loc[df['exclusion_not_ventilated_first24hr']==0,:]
print('{:5d}'.format(df.shape[0]))
idxRem = df['icustay_id'].isnull()
for c in ['exclusion_septic', 'exclusion_vasopressors',
          'exclusion_aline_before_admission', 'exclusion_service_surgical']:
    print('{:5s} - removing {:5d} ({:2.2f}%) patients - additional {:5d} ({:2.2f}%) - {}'.format(
        '', df[c].sum(), 100.0*df[c].mean(),
        np.sum((idxRem==0)&(df[c]==1)), 100.0*np.mean((idxRem==0)&(df[c]==1)),
        c))
    idxRem = idxRem | (df[c]==1)
df = df.loc[~idxRem,:]
print('{} - final cohort.'.format(df.shape[0]))
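The exclusions above are applied sequentially: each criterion only removes patients not already excluded, which is why the "additional" counts differ from the marginal counts. That logic can be captured in a small reusable sketch; the toy flags below are hypothetical, not real cohort data.

```python
import pandas as pd

def apply_exclusions(df, exclusion_cols):
    """Apply 1/0 exclusion flags in order, returning the surviving rows
    and the number removed at each step (sequential, not marginal, counts)."""
    removed = {}
    for c in exclusion_cols:
        idx = df[c] == 1
        removed[c] = int(idx.sum())
        df = df.loc[~idx]
    return df, removed

# toy illustration with hypothetical flags
toy = pd.DataFrame({
    'icustay_id': [1, 2, 3, 4],
    'exclusion_shortstay': [1, 0, 0, 0],
    'exclusion_septic': [1, 1, 0, 0],
})
cohort, removed = apply_exclusions(toy, ['exclusion_shortstay', 'exclusion_septic'])
# row 1 is counted only once (under shortstay), so removed['exclusion_septic'] == 1
```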
The following codeblock loads in the SQL from each file in the aline subfolder and executes the query to generate the materialized view. We specifically exclude the cohort and vasopressor-flag queries (aline_cohort-awsathena.sql, aline_final_cohort-awsathena.sql, aline_vaso_flag-awsathena.sql) as we have already executed them above. Again, the order of query execution does not matter for these queries. Note also that each filename prefix matches the name of the materialized view it creates, for convenience.
In [5]:
# get a list of all files in the subfolder
aline_queries = [f for f in os.listdir(aline_path)
                 # only keep the filename if it is actually a file (and not a directory)
                 if os.path.isfile(os.path.join(aline_path, f))
                 # and only keep the filename if it is an SQL file
                 and f.endswith('.sql')
                 # and skip the cohort/vaso queries - they were executed above
                 and f not in ('aline_cohort-awsathena.sql',
                               'aline_final_cohort-awsathena.sql',
                               'aline_vaso_flag-awsathena.sql')]
for f in aline_queries:
    # drop the existing table, named after the file prefix
    table = f.split('-')[0]
    query = 'DROP TABLE IF EXISTS DATABASE.{};'.format(table)
    cursor.execute(query.replace("DATABASE", gluedatabase))
    print('Executing {} ...'.format(f), end=' ')
    # load in the query from file and execute it
    with open(os.path.join(aline_path, f)) as fp:
        query = ''.join(fp.readlines())
    cursor.execute(query.replace("DATABASE", gluedatabase))
    print('done.')
Having summarized the cohort exclusions, we now pull all the data together into a single dataframe, joining the cohort to the covariate tables generated above.
In [6]:
# Load in the query from file
query = """
--FINAL QUERY
select
co.subject_id, co.hadm_id, co.icustay_id
-- static variables from patient tracking tables
, co.age
, co.gender
-- , co.gender_num -- gender, 0=F, 1=M
, co.intime as icustay_intime
, co.day_icu_intime -- day of week, text
--, co.day_icu_intime_num -- day of week, numeric (0=Sun, 6=Sat)
, co.hour_icu_intime -- hour of ICU admission (24 hour clock)
, case
when co.hour_icu_intime >= 7
and co.hour_icu_intime < 19
then 1
else 0
end as icu_hour_flag
, co.outtime as icustay_outtime
-- outcome variables
, co.icu_los_day
, co.hospital_los_day
, co.hosp_exp_flag -- 1/0 patient died within current hospital stay
, co.icu_exp_flag -- 1/0 patient died within current ICU stay
, co.mort_day -- days from ICU admission to mortality, if they died
, co.day_28_flag -- 1/0 whether the patient died 28 days after *ICU* admission
, co.mort_day_censored -- days until patient died *or* 150 days (150 days is our censor time)
, co.censor_flag -- 1/0 did this patient have 150 imputed in mort_day_censored
-- aline flags
-- , co.initial_aline_flag -- always 0, we remove patients admitted w/ aline
, co.aline_flag -- 1/0 did the patient receive an aline
, co.aline_time_day -- if the patient received aline, fractional days until aline put in
-- demographics extracted using regex + echos
, bmi.weight as weight_first
, bmi.height as height_first
, bmi.bmi
-- service patient was admitted to the ICU under
, co.service_unit
-- severity of illness just before ventilation
, so.sofa as sofa_first
-- vital sign value just preceding ventilation
, vi.map as map_first
, vi.heartrate as hr_first
, vi.temperature as temp_first
, vi.spo2 as spo2_first
-- labs!
, labs.bun_first
, labs.creatinine_first
, labs.chloride_first
, labs.hgb_first
, labs.platelet_first
, labs.potassium_first
, labs.sodium_first
, labs.tco2_first
, labs.wbc_first
-- comorbidities extracted using ICD-9 codes
, icd.chf as chf_flag
, icd.afib as afib_flag
, icd.renal as renal_flag
, icd.liver as liver_flag
, icd.copd as copd_flag
, icd.cad as cad_flag
, icd.stroke as stroke_flag
, icd.malignancy as malignancy_flag
, icd.respfail as respfail_flag
, icd.endocarditis as endocarditis_flag
, icd.ards as ards_flag
, icd.pneumonia as pneumonia_flag
-- sedative use
, sed.sedative_flag
, sed.midazolam_flag
, sed.fentanyl_flag
, sed.propofol_flag
from DATABASE.aline_cohort co
-- The following tables are generated by code within this repository
left join DATABASE.aline_sofa so
on co.icustay_id = so.icustay_id
left join DATABASE.aline_bmi bmi
on co.icustay_id = bmi.icustay_id
left join DATABASE.aline_icd icd
on co.hadm_id = icd.hadm_id
left join DATABASE.aline_vitals vi
on co.icustay_id = vi.icustay_id
left join DATABASE.aline_labs labs
on co.icustay_id = labs.icustay_id
left join DATABASE.aline_sedatives sed
on co.icustay_id = sed.icustay_id
order by co.icustay_id
"""
cursor.execute(query.replace("DATABASE", gluedatabase))
# Load the result of the query into a dataframe
df = as_pandas(cursor)
df.describe().T
Out[6]:
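Every join in the query above is a left join keyed on icustay_id (or hadm_id for the ICD table), so the result should contain exactly one row per cohort ICU stay. A minimal sanity check, sketched as a hypothetical helper (not part of the original study code):

```python
import pandas as pd

def check_unique_key(df, key='icustay_id'):
    """Raise if a one-to-many join accidentally duplicated cohort rows;
    otherwise return the number of rows."""
    n_dupes = int(df[key].duplicated().sum())
    if n_dupes:
        raise ValueError('{} duplicated {} values after join'.format(n_dupes, key))
    return df.shape[0]
```

Running `check_unique_key(df)` on the dataframe loaded above should simply return the cohort size.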
Now we need to remove obvious outliers. In particular, we replace the anonymized ages (patients older than 89 have their ages shifted to implausibly large values in MIMIC-III) with 91.4, the median age of patients older than 89.
In [7]:
# plot the distribution of each numeric column
for col in df.columns:
    if df.dtypes[col] in ('int64', 'float64'):
        plt.figure(figsize=[12, 6])
        plt.hist(df[col].dropna(), bins=50, density=True)
        plt.xlabel(col, fontsize=24)
        plt.show()
In [8]:
# apply corrections
df.loc[df['age']>89, 'age'] = 91.4
In [9]:
df.to_csv('aline_data.csv',index=False)
We will create the propensity score using R in the Jupyter Notebook file aline_propensity_score.ipynb.