Arterial line study

This notebook reproduces the arterial line study in MIMIC-III. The following is an outline of the notebook:

  1. Generate the necessary materialized views in SQL
  2. Combine the materialized views into a single dataframe
  3. Write the data to file for use in R

The R code then evaluates whether an arterial line is associated with mortality after propensity matching.

Note that the original arterial line study used a genetic algorithm to select the covariates in the propensity score. We omit the genetic algorithm step, and instead use the final set of covariates described by the authors. For more detail, see:

Hsu DJ, Feng M, Kothari R, Zhou H, Chen KP, Celi LA. The association between indwelling arterial catheters and mortality in hemodynamically stable patients with respiratory failure: a propensity score analysis. Chest. 2015 Dec 1;148(6):1470-6.


In [ ]:
# Install Python dependencies. This only needs to be run once for each new notebook instance.
!pip install PyAthena

In [1]:
# __future__ imports must come before any other statement in the cell
from __future__ import print_function

from pyathena import connect
from pyathena.util import as_pandas

# Import libraries
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import boto3
from botocore.client import ClientError
# below is used to print out pretty pandas dataframes
from IPython.display import display, HTML
%matplotlib inline


s3 = boto3.resource('s3')
client = boto3.client("sts")
account_id = client.get_caller_identity()["Account"]
my_session = boto3.session.Session()
region = my_session.region_name
athena_query_results_bucket = 'aws-athena-query-results-'+account_id+'-'+region

try:
    s3.meta.client.head_bucket(Bucket=athena_query_results_bucket)
except ClientError:
    bucket = s3.create_bucket(Bucket=athena_query_results_bucket)
    print('Creating bucket '+athena_query_results_bucket)
cursor = connect(s3_staging_dir='s3://'+athena_query_results_bucket+'/athena/temp').cursor()


# The Glue database name of your MIMIC-III parquet data
gluedatabase="mimiciii"

# location of the queries to generate aline specific materialized views
aline_path = './'

# location of the queries to generate materialized views from the MIMIC code repository
concepts_path = './concepts/'

1 - Generate materialized views

Before generating the aline cohort, we require the following materialized views to be already generated:

  • angus_sepsis - from angus-awsathena.sql
  • heightweight - from HeightWeightQuery-awsathena.sql
  • aline_vaso_flag - from aline_vaso_flag-awsathena.sql
  • ventsettings - from ventilation-settings-awsathena.sql
  • ventdurations - from ventilation-durations-awsathena.sql

You can generate the above by executing the code block below. If you haven't changed the directory structure, it should work as-is; otherwise you may need to modify the concepts_path variable above.


In [2]:
# Load in the query from file
query='DROP TABLE IF EXISTS DATABASE.angus_sepsis;'
cursor.execute(query.replace("DATABASE", gluedatabase))
f = os.path.join(concepts_path,'sepsis/angus-awsathena.sql')
with open(f) as fp:
    query = ''.join(fp.readlines())
    
# Execute the query
print('Generating table \'angus_sepsis\' using {} ...'.format(f),end=' ')
cursor.execute(query.replace("DATABASE", gluedatabase))
print('done.')

# Load in the query from file
query='DROP TABLE IF EXISTS DATABASE.heightweight;'
cursor.execute(query.replace("DATABASE", gluedatabase))
f = os.path.join(concepts_path,'demographics/HeightWeightQuery-awsathena.sql')
with open(f) as fp:
    query = ''.join(fp.readlines())
    
# Execute the query
print('Generating table \'heightweight\' using {} ...'.format(f),end=' ')
cursor.execute(query.replace("DATABASE", gluedatabase))
print('done.')


# Load in the query from file
query='DROP TABLE IF EXISTS DATABASE.aline_vaso_flag;'
cursor.execute(query.replace("DATABASE", gluedatabase))
f = os.path.join(aline_path,'aline_vaso_flag-awsathena.sql')
with open(f) as fp:
    query = ''.join(fp.readlines())
    
# Execute the query
print('Generating table \'aline_vaso_flag\' using {} ...'.format(f),end=' ')
cursor.execute(query.replace("DATABASE", gluedatabase))
print('done.')


# Load in the query from file
query='DROP TABLE IF EXISTS DATABASE.ventsettings;'
cursor.execute(query.replace("DATABASE", gluedatabase))
f = os.path.join(concepts_path,'durations/ventilation-settings-awsathena.sql')
with open(f) as fp:
    query = ''.join(fp.readlines())
    
# Execute the query
print('Generating table \'vent_settings\' using {} ...'.format(f),end=' ')
cursor.execute(query.replace("DATABASE", gluedatabase))
print('done.')


# Load in the query from file
query='DROP TABLE IF EXISTS DATABASE.ventdurations;'
cursor.execute(query.replace("DATABASE", gluedatabase))
f = os.path.join(concepts_path,'durations/ventilation-durations-awsathena.sql')
with open(f) as fp:
    query = ''.join(fp.readlines())
    
# Execute the query
print('Generating table \'vent_durations\' using {} ...'.format(f),end=' ')
cursor.execute(query.replace("DATABASE", gluedatabase))
print('done.')


Generating table 'angus_sepsis' using ./concepts/sepsis/angus-awsathena.sql ... done.
Generating table 'heightweight' using ./concepts/demographics/HeightWeightQuery-awsathena.sql ... done.
Generating table 'aline_vaso_flag' using ./aline_vaso_flag-awsathena.sql ... done.
Generating table 'vent_settings' using ./concepts/durations/ventilation-settings-awsathena.sql ... done.
Generating table 'vent_durations' using ./concepts/durations/ventilation-durations-awsathena.sql ... done.

Now we generate the aline_cohort_all table using aline_cohort-awsathena.sql, and then the final aline_cohort table using aline_final_cohort-awsathena.sql.

Afterwards, we can generate the remaining 6 materialized views in any order, as they each depend only on aline_cohort and raw MIMIC-III data.


In [3]:
# Load in the query from file
query='DROP TABLE IF EXISTS DATABASE.aline_cohort_all;'
cursor.execute(query.replace("DATABASE", gluedatabase))
f = os.path.join(aline_path,'aline_cohort-awsathena.sql')
with open(f) as fp:
    query = ''.join(fp.readlines())
    
# Execute the query
print('Generating table \'aline_cohort_all\' using {} ...'.format(f),end=' ')
cursor.execute(query.replace("DATABASE", gluedatabase))
print('done.')


# Load in the query from file
query='DROP TABLE IF EXISTS DATABASE.aline_cohort;'
cursor.execute(query.replace("DATABASE", gluedatabase))
f = os.path.join(aline_path,'aline_final_cohort-awsathena.sql')
with open(f) as fp:
    query = ''.join(fp.readlines())
    
# Execute the query
print('Generating table \'aline_cohort\' using {} ...'.format(f),end=' ')
cursor.execute(query.replace("DATABASE", gluedatabase))
print('done.')


Generating table 'aline_cohort_all' using ./aline_cohort-awsathena.sql ... done.
Generating table 'aline_cohort' using ./aline_final_cohort-awsathena.sql ... done.

In [4]:
query = """
select
icustay_id
, exclusion_readmission
, exclusion_shortstay
, exclusion_vasopressors
, exclusion_septic
, exclusion_aline_before_admission
, exclusion_not_ventilated_first24hr
, exclusion_service_surgical
from DATABASE.aline_cohort_all
"""

cursor.execute(query.replace("DATABASE", gluedatabase))
# Load the result of the query into a dataframe
df = as_pandas(cursor)

# print out exclusions
idxRem = df['icustay_id'].isnull()
for c in df.columns:
    if 'exclusion_' in c:
        print('{:5d} - {}'.format(df[c].sum(), c))
        idxRem[df[c]==1] = True   
        
# total number of patients excluded by any criterion
print('Will remove {} of {} patients.'.format(np.sum(idxRem), df.shape[0]))


print('')
print('')
print('Reproducing the flow of the flowchart from Chest paper.')

# first stay
idxRem = (df['exclusion_readmission']==1) | (df['exclusion_shortstay']==1)
print('{:5d} - removing {:5d} ({:2.2f}%) patients - short stay // readmission.'.format(
        df.shape[0], np.sum(idxRem), 100.0*np.mean(idxRem)))
df = df.loc[~idxRem,:]

idxRem = df['exclusion_not_ventilated_first24hr']==1
print('{:5d} - removing {:5d} ({:2.2f}%) patients - not ventilated in first 24 hours.'.format(
        df.shape[0], np.sum(idxRem), 100.0*np.mean(idxRem)))

df = df.loc[df['exclusion_not_ventilated_first24hr']==0,:]

print('{:5d}'.format(df.shape[0]))
idxRem = df['icustay_id'].isnull()
for c in ['exclusion_septic', 'exclusion_vasopressors',
            'exclusion_aline_before_admission', 'exclusion_service_surgical']:
    print('{:5s} - removing {:5d} ({:2.2f}%) patients - additional {:5d} {:2.2f}% - {}'.format(
            '', df[c].sum(), 100.0*df[c].mean(),
            np.sum((idxRem==0)&(df[c]==1)), 100.0*np.mean((idxRem==0)&(df[c]==1)),
            c))
    idxRem = idxRem | (df[c]==1)

df = df.loc[~idxRem,:]
print('{} - final cohort.'.format(df.shape[0]))


14024 - exclusion_readmission
 8948 - exclusion_shortstay
17399 - exclusion_vasopressors
17007 - exclusion_septic
12961 - exclusion_aline_before_admission
29449 - exclusion_not_ventilated_first24hr
15499 - exclusion_service_surgical
Will remove 49907 of 52430 patients.


Reproducing the flow of the flowchart from Chest paper.
52430 - removing 20816 (39.70%) patients - short stay // readmission.
31614 - removing 15738 (49.78%) patients - not ventilated in first 24 hours.
15876
      - removing  5717 (36.01%) patients - additional  5717 36.01% - exclusion_septic
      - removing  9110 (57.38%) patients - additional  5751 36.22% - exclusion_vasopressors
      - removing  7601 (47.88%) patients - additional  1598 10.07% - exclusion_aline_before_admission
      - removing  6747 (42.50%) patients - additional   287 1.81% - exclusion_service_surgical
2523 - final cohort.
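The accounting above distinguishes each flag's total count from the patients it *additionally* removes once earlier flags have been applied. A minimal sketch of that cumulative logic on synthetic flags (toy data, not MIMIC):

```python
import pandas as pd

# Synthetic exclusion flags for four ICU stays (toy data, not MIMIC)
flags = pd.DataFrame({
    'excl_a': [1, 0, 0, 1],
    'excl_b': [0, 1, 0, 1],
})

removed = pd.Series(False, index=flags.index)
for c in flags.columns:
    total = int(flags[c].sum())                             # stays with this flag set
    additional = int(((~removed) & (flags[c] == 1)).sum())  # newly removed by this flag
    removed = removed | (flags[c] == 1)
    print('{}: total {}, additional {}'.format(c, total, additional))

print('final cohort size:', int((~removed).sum()))  # 1
```

Here excl_b has a total of 2 but only removes 1 additional stay, because one of its stays was already excluded by excl_a.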

The following code block loads the SQL from each file in the aline subfolder and executes it to generate the corresponding materialized view. We specifically exclude the cohort and vasopressor-flag queries, as they were already executed above. Again, the order of query execution does not matter here. Note also that, for convenience, each filename matches the name of the materialized view it creates.
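Each table name is recovered from its filename by splitting on the first hyphen, which is the pattern the loop below relies on:

```python
# Filenames follow the pattern <table>-awsathena.sql;
# everything before the first hyphen is the table name.
filename = 'aline_bmi-awsathena.sql'
table = filename.split('-')[0]
print(table)  # aline_bmi
```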


In [5]:
# get a list of all files in the subfolder
aline_queries = [f for f in os.listdir(aline_path)
                 # only keep the filename if it is actually a file (and not a directory)
                 if os.path.isfile(os.path.join(aline_path, f))
                 # and only keep the filename if it is an SQL file
                 and f.endswith('.sql')
                 # and we do *not* want the cohort/vaso queries - they were generated above
                 and f not in ('aline_cohort-awsathena.sql',
                               'aline_final_cohort-awsathena.sql',
                               'aline_vaso_flag-awsathena.sql')]



for f in aline_queries:
    # Load in the query from file
    table=f.split('-')
    query='DROP TABLE IF EXISTS DATABASE.{};'.format(table[0])
    cursor.execute(query.replace("DATABASE", gluedatabase))
    print('Executing {} ...'.format(f), end=' ')
    
    with open(os.path.join(aline_path,f)) as fp:
        query = ''.join(fp.readlines())
    cursor.execute(query.replace("DATABASE", gluedatabase))
    print('done.')


Executing aline_bmi-awsathena.sql ... done.
Executing aline_sofa-awsathena.sql ... done.
Executing aline_labs-awsathena.sql ... done.
Executing aline_sedatives-awsathena.sql ... done.
Executing aline_icd-awsathena.sql ... done.
Executing aline_vitals-awsathena.sql ... done.

2 - Extract all covariates and outcome measures

Having summarized the cohort exclusions, we now aggregate all the data from the various views into a single dataframe.
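The SQL below assembles the dataframe with a series of left joins onto the cohort table. The same pattern in pandas terms, on toy data with illustrative columns:

```python
import pandas as pd

cohort = pd.DataFrame({'icustay_id': [1, 2, 3], 'aline_flag': [1, 0, 1]})
labs = pd.DataFrame({'icustay_id': [1, 3], 'wbc_first': [11.3, 8.0]})

# LEFT JOIN keeps every cohort row; stays without a labs row get NaN
merged = cohort.merge(labs, on='icustay_id', how='left')
print(len(merged), int(merged['wbc_first'].isnull().sum()))  # 3 1
```

This is why the covariate columns in the final dataframe can have fewer non-null values than the cohort has rows.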


In [6]:
# Load in the query from file
query = """
--FINAL QUERY
select
  co.subject_id, co.hadm_id, co.icustay_id

  -- static variables from patient tracking tables
  , co.age
  , co.gender
  -- , co.gender_num -- gender, 0=F, 1=M
  , co.intime as icustay_intime
  , co.day_icu_intime -- day of week, text
  --, co.day_icu_intime_num -- day of week, numeric (0=Sun, 6=Sat)
  , co.hour_icu_intime -- hour of ICU admission (24 hour clock)
  , case 
      when co.hour_icu_intime >= 7
       and co.hour_icu_intime < 19
         then 1
      else 0
    end as icu_hour_flag
  , co.outtime as icustay_outtime

  -- outcome variables
  , co.icu_los_day
  , co.hospital_los_day
  , co.hosp_exp_flag -- 1/0 patient died within current hospital stay
  , co.icu_exp_flag -- 1/0 patient died within current ICU stay
  , co.mort_day -- days from ICU admission to mortality, if they died
  , co.day_28_flag -- 1/0 whether the patient died 28 days after *ICU* admission
  , co.mort_day_censored -- days until patient died *or* 150 days (150 days is our censor time)
  , co.censor_flag -- 1/0 did this patient have 150 imputed in mort_day_censored

  -- aline flags
  -- , co.initial_aline_flag -- always 0, we remove patients admitted w/ aline
  , co.aline_flag -- 1/0 did the patient receive an aline
  , co.aline_time_day -- if the patient received aline, fractional days until aline put in

  -- demographics extracted using regex + echos
  , bmi.weight as weight_first
  , bmi.height as height_first
  , bmi.bmi

  -- service patient was admitted to the ICU under
  , co.service_unit

  -- severity of illness just before ventilation
  , so.sofa as sofa_first

  -- vital sign value just preceding ventilation
  , vi.map as map_first
  , vi.heartrate as hr_first
  , vi.temperature as temp_first
  , vi.spo2 as spo2_first

  -- labs!
  , labs.bun_first
  , labs.creatinine_first
  , labs.chloride_first
  , labs.hgb_first
  , labs.platelet_first
  , labs.potassium_first
  , labs.sodium_first
  , labs.tco2_first
  , labs.wbc_first

  -- comorbidities extracted using ICD-9 codes
  , icd.chf as chf_flag
  , icd.afib as afib_flag
  , icd.renal as renal_flag
  , icd.liver as liver_flag
  , icd.copd as copd_flag
  , icd.cad as cad_flag
  , icd.stroke as stroke_flag
  , icd.malignancy as malignancy_flag
  , icd.respfail as respfail_flag
  , icd.endocarditis as endocarditis_flag
  , icd.ards as ards_flag
  , icd.pneumonia as pneumonia_flag

  -- sedative use
  , sed.sedative_flag
  , sed.midazolam_flag
  , sed.fentanyl_flag
  , sed.propofol_flag
  
from DATABASE.aline_cohort co
-- The following tables are generated by code within this repository
left join DATABASE.aline_sofa so
on co.icustay_id = so.icustay_id
left join DATABASE.aline_bmi bmi
  on co.icustay_id = bmi.icustay_id
left join DATABASE.aline_icd icd
  on co.hadm_id = icd.hadm_id
left join DATABASE.aline_vitals vi
  on co.icustay_id = vi.icustay_id
left join DATABASE.aline_labs labs
  on co.icustay_id = labs.icustay_id
left join DATABASE.aline_sedatives sed
  on co.icustay_id = sed.icustay_id
order by co.icustay_id
"""

cursor.execute(query.replace("DATABASE", gluedatabase))
# Load the result of the query into a dataframe
df = as_pandas(cursor)
df.describe().T


Out[6]:
count mean std min 25% 50% 75% max
subject_id 2523.0 41200.547364 29712.636902 22.000000 15641.000000 31269.000000 66535.000000 99881.000000
hadm_id 2523.0 149881.509314 29258.187301 100016.000000 124222.000000 150164.000000 175300.000000 199962.000000
icustay_id 2523.0 250829.827190 28926.610595 200019.000000 226843.000000 251681.000000 275897.000000 299995.000000
age 2523.0 64.544657 50.086406 16.203184 42.349852 57.126036 73.741161 300.052602
hour_icu_intime 2523.0 12.770908 7.533683 0.000000 5.000000 14.000000 19.000000 23.000000
icu_hour_flag 2523.0 0.404677 0.490927 0.000000 0.000000 0.000000 1.000000 1.000000
icu_los_day 2523.0 3.657932 3.403448 1.000579 1.709288 2.525197 4.355093 37.304780
hospital_los_day 2523.0 8.445863 7.819610 0.038194 3.807292 6.450000 10.527083 123.687500
hosp_exp_flag 2523.0 0.131193 0.337678 0.000000 0.000000 0.000000 0.000000 1.000000
icu_exp_flag 2523.0 0.090765 0.287332 0.000000 0.000000 0.000000 0.000000 1.000000
mort_day 910.0 429.589134 690.777814 0.000694 5.094097 62.694097 558.330556 3731.972222
day_28_flag 2523.0 0.155767 0.362706 0.000000 0.000000 0.000000 0.000000 1.000000
mort_day_censored 2523.0 250.842692 435.912143 0.000694 150.000000 150.000000 150.000000 3731.972222
censor_flag 2523.0 0.639318 0.480294 0.000000 0.000000 1.000000 1.000000 1.000000
aline_flag 2523.0 0.514863 0.499878 0.000000 0.000000 1.000000 1.000000 1.000000
aline_time_day 1299.0 0.286686 0.710809 0.000694 0.018056 0.066667 0.246528 11.843750
weight_first 2446.0 80.868029 26.924396 1.000000 65.000000 77.300000 91.100000 710.000000
height_first 898.0 169.866036 17.067155 15.240000 162.560000 170.180000 177.800000 444.500000
bmi 898.0 0.003432 0.014497 0.000815 0.002325 0.002683 0.003150 0.434431
sofa_first 2523.0 4.365834 1.996958 0.000000 3.000000 4.000000 5.000000 15.000000
map_first 2523.0 84.808427 16.846600 -6.000000 73.000000 83.666702 95.000000 174.000000
hr_first 2523.0 87.056282 19.476617 0.000000 74.000000 85.000000 99.000000 174.000000
temp_first 2513.0 36.807150 0.908248 32.222222 36.222221 36.777778 37.388889 40.444446
spo2_first 2520.0 98.641270 3.020225 47.000000 98.000000 100.000000 100.000000 100.000000
bun_first 2474.0 20.187146 15.136237 1.000000 12.000000 16.000000 23.000000 139.000000
creatinine_first 2474.0 1.140097 1.168660 0.100000 0.700000 0.900000 1.100000 18.800000
chloride_first 2481.0 104.074970 5.933029 56.000000 101.000000 104.000000 107.000000 129.000000
hgb_first 2479.0 12.220976 2.235188 4.100000 10.700000 12.300000 13.800000 19.800000
platelet_first 2471.0 240.770943 101.345871 6.000000 177.000000 231.000000 294.000000 1313.000000
potassium_first 2485.0 4.065634 0.681857 1.800000 3.600000 4.000000 4.400000 9.200000
sodium_first 2481.0 139.494156 4.874810 74.000000 137.000000 140.000000 142.000000 165.000000
tco2_first 2497.0 25.123348 5.209551 3.000000 22.000000 25.000000 28.000000 53.000000
wbc_first 2470.0 12.232874 6.163138 0.200000 8.000000 11.300000 15.200000 94.000000
chf_flag 2523.0 0.070947 0.256788 0.000000 0.000000 0.000000 0.000000 1.000000
afib_flag 2523.0 0.138327 0.345312 0.000000 0.000000 0.000000 0.000000 1.000000
renal_flag 2523.0 0.065002 0.246578 0.000000 0.000000 0.000000 0.000000 1.000000
liver_flag 2523.0 0.057868 0.233539 0.000000 0.000000 0.000000 0.000000 1.000000
copd_flag 2523.0 0.015458 0.123389 0.000000 0.000000 0.000000 0.000000 1.000000
cad_flag 2523.0 0.093539 0.291245 0.000000 0.000000 0.000000 0.000000 1.000000
stroke_flag 2523.0 0.158938 0.365691 0.000000 0.000000 0.000000 0.000000 1.000000
malignancy_flag 2523.0 0.156163 0.363082 0.000000 0.000000 0.000000 0.000000 1.000000
respfail_flag 2523.0 0.365042 0.481537 0.000000 0.000000 0.000000 1.000000 1.000000
endocarditis_flag 2523.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
ards_flag 2523.0 0.017043 0.129458 0.000000 0.000000 0.000000 0.000000 1.000000
pneumonia_flag 2523.0 0.177963 0.382557 0.000000 0.000000 0.000000 0.000000 1.000000
sedative_flag 2523.0 0.211256 0.408281 0.000000 0.000000 0.000000 0.000000 1.000000
midazolam_flag 2523.0 0.024970 0.156065 0.000000 0.000000 0.000000 0.000000 1.000000
fentanyl_flag 2523.0 0.040824 0.197922 0.000000 0.000000 0.000000 0.000000 1.000000
propofol_flag 2523.0 0.187872 0.390687 0.000000 0.000000 0.000000 0.000000 1.000000

Now we need to remove obvious outliers and correct the anonymized ages: MIMIC-III shifts the ages of patients older than 89 to around 300, so we replace ages > 89 with 91.4, the median age of that group.
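As a toy illustration of this correction (synthetic ages; 91.4 per the convention described above):

```python
import pandas as pd

# Ages > 89 are shifted to ~300 by MIMIC-III de-identification;
# map them back to 91.4, the median age of the > 89 group.
age = pd.Series([45.0, 67.2, 300.1, 88.5, 302.5])
age.loc[age > 89] = 91.4
print(age.tolist())  # [45.0, 67.2, 91.4, 88.5, 91.4]
```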


In [7]:
# plot the distribution of each numeric column
for col in df.columns:
    if df.dtypes[col] in ('int64','float64'):
        plt.figure(figsize=[12,6])
        plt.hist(df[col].dropna(), bins=50, density=True)
        plt.xlabel(col,fontsize=24)
        plt.show()

In [8]:
# apply corrections
df.loc[df['age']>89, 'age'] = 91.4

3 - Write to file


In [9]:
df.to_csv('aline_data.csv',index=False)

4 - Create a propensity score using this data

We will create the propensity score using R in the Jupyter Notebook file aline_propensity_score.ipynb.
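The matching itself happens in the R notebook; as a rough sketch of what 1:1 greedy nearest-neighbor matching on propensity scores looks like (synthetic scores and a deliberately simple algorithm, not the paper's genetic-algorithm-derived model):

```python
import numpy as np

# Hypothetical propensity scores: P(receiving an aline)
treated_scores = np.array([0.30, 0.62, 0.85])
control_scores = np.array([0.10, 0.33, 0.58, 0.80, 0.90])

# Greedy 1:1 nearest-neighbor matching without replacement:
# each treated stay is paired with the closest still-available control.
available = np.ones(len(control_scores), dtype=bool)
matches = []
for i, s in enumerate(treated_scores):
    dist = np.abs(control_scores - s)
    dist[~available] = np.inf       # already-matched controls are ineligible
    j = int(np.argmin(dist))
    available[j] = False
    matches.append((i, j))

print(matches)  # [(0, 1), (1, 2), (2, 3)]
```

In practice one would first fit a model (e.g. logistic regression) to estimate the scores from the covariates extracted above, then check covariate balance in the matched sample.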