Cohort selection

The aim of this tutorial is to describe how patients are tracked in the MIMIC-III database. By the end of this notebook you should:

Understand what subject_id, hadm_id, and icustay_id represent
Know how to set up a cohort table for subselecting a patient population
Understand the difference between service and physical location

Requirements:

MIMIC-III in a PostgreSQL database
Python packages installable with:
- pip install numpy pandas matplotlib psycopg2 jupyter

First, as always, we open a connection to a local copy of the database. If you don't have a local copy of the database in PostgreSQL, follow the tutorial online (http://mimic.physionet.org) to install one.

Let's begin!



In [1]:

    
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import psycopg2

# below imports are used to print out pretty pandas dataframes
from IPython.display import display, HTML

%matplotlib inline
plt.style.use('ggplot')

# information used to create a database connection
sqluser = 'postgres'
dbname = 'mimic'
schema_name = 'mimiciii'

# Connect to postgres with a copy of the MIMIC-III database
con = psycopg2.connect(dbname=dbname, user=sqluser)

# the below statement is prepended to queries to ensure they select from the right schema
query_schema = 'set search_path to ' + schema_name + ';'

Cohort selection will begin with three tables: patients, admissions, and icustays:

patients: information about a patient that does not change - e.g. date of birth, genotypical sex
admissions: information recorded on hospital admission - admission type (elective, emergency), time of admission
icustays: information recorded on intensive care unit admission - primarily admission and discharge time

As MIMIC-III is primarily an intensive care unit (ICU) database, the focus will be on patients admitted to and discharged from the ICU. That is, rather than selecting our cohort based off the individual patient (identified by subject_id in the database), we will usually want to select our cohort based off the ICU stay (identified by icustay_id). Thus, it is sensible to begin with the icustays table:



In [2]:

    
query = query_schema + """
SELECT subject_id, hadm_id, icustay_id
FROM icustays
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df.head()

Note: in the above we use LIMIT 10: this limits our results to only 10 rows. It's nice to include this statement when prototyping as it speeds up queries immensely. Later on when we are doing full data extraction, we would remove this statement.

If we are interested in the length of stay for the ICU patients, we can query the intime and outtime columns, adding in some SQL specific syntax for calculating the difference between two dates.



In [3]:

    
query = query_schema + """
SELECT subject_id, hadm_id, icustay_id
, outtime - intime as icu_length_of_stay_interval
, EXTRACT(EPOCH FROM outtime - intime) as icu_length_of_stay
FROM icustays
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df.head()









    Out[3]:






  
    
      
      subject_id
      hadm_id
      icustay_id
      icu_length_of_stay_interval
      icu_length_of_stay
    
  
  
    
      0
      268
      110404
      280836
      3 days 05:58:33
      280713.0
    
    
      1
      269
      106296
      206613
      3 days 06:41:28
      283288.0
    
    
      2
      270
      188028
      220345
      2 days 21:27:09
      250029.0
    
    
      3
      271
      173727
      249196
      2 days 01:26:22
      177982.0
    
    
      4
      272
      164716
      210407
      1 days 14:53:09
      139989.0

Note that the EXTRACT(EPOCH FROM ... ) code extracts the number of fractional seconds represented by an interval data type. So the logic is roughly intime - outtime -> icu_length_of_stay_interval, followed by EXTRACT(EPOCH FROM icu_length_of_stay_interval) -> fractional seconds (numeric). ICU Length of stay is most readily interpretted when represented in fractional days, so let's do that conversion.



In [4]:

    
query = query_schema + """
SELECT subject_id, hadm_id, icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
FROM icustays
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df.head()









    Out[4]:






  
    
      
      subject_id
      hadm_id
      icustay_id
      icu_length_of_stay
    
  
  
    
      0
      268
      110404
      280836
      3.248993
    
    
      1
      269
      106296
      206613
      3.278796
    
    
      2
      270
      188028
      220345
      2.893854
    
    
      3
      271
      173727
      249196
      2.059977
    
    
      4
      272
      164716
      210407
      1.620243

In the above, seconds are converted to days easily by dividing by: 60 (seconds in a minute), 60 (minutes in an hour), and 24 (hours in a day). We also omit the icu_length_of_stay_interval as it's now redundant for our purposes.

If we are only interested in ICU stays lasting a certain length (say 24 hours), we need to do the following two steps:

use an in-line view to "hold" the data
use the WHERE clause to filter this data using our generated column

Here's an example of us filtering to only stays lasting at least 2 days:



In [5]:

    
query = query_schema + """
WITH co AS
(
SELECT subject_id, hadm_id, icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
FROM icustays
LIMIT 10
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
FROM co
WHERE icu_length_of_stay >= 2
"""
df = pd.read_sql_query(query, con)
df









    Out[5]:






  
    
      
      subject_id
      hadm_id
      icustay_id
      icu_length_of_stay
    
  
  
    
      0
      268
      110404
      280836
      3.248993
    
    
      1
      269
      106296
      206613
      3.278796
    
    
      2
      270
      188028
      220345
      2.893854
    
    
      3
      271
      173727
      249196
      2.059977
    
    
      4
      274
      130546
      254851
      8.814259
    
    
      5
      275
      129886
      219649
      7.131412

Looks good - none of the above stays are shorter than 2 days.

Many studies using the MIMIC-III database are focused on specific subgroups of patients. For example, MIMIC-III contains both adults and neonates, but it is rare that a study would like to evaluate some phenomenom in both groups simulatenously. As a result, the first step of many studies is selecting a subpopulation from the icustays table. Concretely, we will want to select a set of icustay_id which represent our patient population. You've just seen an example of doing this: in the above code, we limited our population to only those who were in the ICU for at least 2 days.

When subselecting the patient population, it is generally good practice to build a "cohort" table - that is a table with all icustay_id available in the database, each associated with binary flags indicating whether or not they are excluded from your population. Let's take a look at how this would work with the above query which limited the dataset to patients who stayed longer than 2 days.



In [6]:

    
query = query_schema + """
WITH co AS
(
SELECT subject_id, hadm_id, icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
FROM icustays
LIMIT 10
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
  , CASE
        WHEN co.icu_length_of_stay < 2 then 1
    ELSE 0 END
        as exclusion_los
FROM co
"""
df = pd.read_sql_query(query, con)
df









    Out[6]:






  
    
      
      subject_id
      hadm_id
      icustay_id
      icu_length_of_stay
      exclusion_los
    
  
  
    
      0
      268
      110404
      280836
      3.248993
      0
    
    
      1
      269
      106296
      206613
      3.278796
      0
    
    
      2
      270
      188028
      220345
      2.893854
      0
    
    
      3
      271
      173727
      249196
      2.059977
      0
    
    
      4
      272
      164716
      210407
      1.620243
      1
    
    
      5
      273
      158689
      241507
      1.486181
      1
    
    
      6
      274
      130546
      254851
      8.814259
      0
    
    
      7
      275
      129886
      219649
      7.131412
      0
    
    
      8
      276
      135156
      206327
      1.337836
      1
    
    
      9
      277
      171601
      272866
      0.731273
      1

In the earlier query we had a total of 6 rows returned because we filtered 4 of them out. In the above query, we keep all 10 rows, but we have indicated that 4 of them should be excluded in the last column. This is a good practice to have as it will make it very easy to summarize your exclusions at the end of your study and modify them if your later work deems it necessary.

Let's go back to the common exclusion criteria mentioned earlier: flagging non-adults for removal. First, we'll need to calculate the patient's age on ICU admission, which will require the patient's date of birth and the ICU admission time. We already have the ICU admission time (intime in the icustays table), so all we need to do is get the date of birth from the patients table.



In [7]:

    
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, icu.intime - pat.dob AS age
FROM icustays icu
INNER JOIN patients pat
  ON icu.subject_id = pat.subject_id
LIMIT 10
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
  , co.age
  , CASE
        WHEN co.icu_length_of_stay < 2 then 1
    ELSE 0 END
        as exclusion_los
FROM co
"""
df = pd.read_sql_query(query, con)
df









    Out[7]:






  
    
      
      subject_id
      hadm_id
      icustay_id
      icu_length_of_stay
      age
      exclusion_los
    
  
  
    
      0
      2
      163353
      243653
      0.091829
      0 days 21:20:07
      1
    
    
      1
      3
      145834
      211552
      6.064560
      27950 days 19:10:11
      0
    
    
      2
      4
      185777
      294638
      1.678472
      17475 days 00:29:31
      1
    
    
      3
      5
      178980
      214757
      0.084444
      0 days 06:04:24
      1
    
    
      4
      6
      107064
      228232
      3.672917
      24084 days 21:30:54
      0
    
    
      5
      7
      118037
      278444
      0.267720
      0 days 15:35:29
      1
    
    
      6
      7
      118037
      236754
      0.739097
      2 days 03:26:01
      1
    
    
      7
      8
      159514
      262299
      1.075521
      0 days 12:36:10
      1
    
    
      8
      9
      150750
      220597
      5.323056
      15263 days 13:07:02
      0
    
    
      9
      10
      184167
      288409
      8.092106
      0 days 11:39:05
      0

Notes from the above query: we have to specify the table in icu.subject_id because there is a subject_id column in both the icustays table and the patients table, and the program doesn't know which we want unless we specify it.

Now, looking at the results, it appears age is returned as the number of days between the dob and the intime - perhaps not what we desire! As mentioned before, this is an interval data type - it's useful when doing date operations but for our purposes it is not practical. We have three options:

we can use the function EXTRACT() to extract the seconds and convert that into an age by dividing by the number of seconds in a year (as we did before)
we can use the PostgreSQL function AGE() to return a symbolic representation of the age in years followed by the function DATE_PART() to extract the years
the same as the above, but calling DATE_PART() to get the months and days as well for more precision

Let's look at all three.



In [8]:

    
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, icu.intime - pat.dob AS age
FROM icustays icu
INNER JOIN patients pat
  ON icu.subject_id = pat.subject_id
LIMIT 10
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
  , co.age
  , EXTRACT('year' FROM co.age) as age_extract_year 
  , EXTRACT('year' FROM co.age) 
    + EXTRACT('months' FROM co.age) / 12.0
    + EXTRACT('days' FROM co.age) / 365.242
    + EXTRACT('hours' FROM co.age) / 24.0 / 364.242
    as age_extract_precise
  , EXTRACT('epoch' from co.age) / 60.0 / 60.0 / 24.0 / 365.242 as age_extract_epoch
  , CASE
        WHEN co.icu_length_of_stay < 2 then 1
    ELSE 0 END
        as exclusion_los
FROM co
"""
df = pd.read_sql_query(query, con)
df









    Out[8]:






  
    
      
      subject_id
      hadm_id
      icustay_id
      icu_length_of_stay
      age
      age_extract_year
      age_extract_precise
      age_extract_epoch
      exclusion_los
    
  
  
    
      0
      2
      163353
      243653
      0.091829
      0 days 21:20:07
      0.0
      0.002402
      0.002434
      1
    
    
      1
      3
      145834
      211552
      6.064560
      27950 days 19:10:11
      0.0
      76.526779
      76.526792
      0
    
    
      2
      4
      185777
      294638
      1.678472
      17475 days 00:29:31
      0.0
      47.844990
      47.845047
      1
    
    
      3
      5
      178980
      214757
      0.084444
      0 days 06:04:24
      0.0
      0.000686
      0.000693
      1
    
    
      4
      6
      107064
      228232
      3.672917
      24084 days 21:30:54
      0.0
      65.942245
      65.942297
      0
    
    
      5
      7
      118037
      278444
      0.267720
      0 days 15:35:29
      0.0
      0.001716
      0.001779
      1
    
    
      6
      7
      118037
      236754
      0.739097
      2 days 03:26:01
      0.0
      0.005819
      0.005868
      1
    
    
      7
      8
      159514
      262299
      1.075521
      0 days 12:36:10
      0.0
      0.001373
      0.001438
      1
    
    
      8
      9
      150750
      220597
      5.323056
      15263 days 13:07:02
      0.0
      41.790219
      41.790228
      0
    
    
      9
      10
      184167
      288409
      8.092106
      0 days 11:39:05
      0.0
      0.001258
      0.001329
      0

As we can see, there is very little difference between the second and last approach - so it is up to preference (and desire for true precision). We will use the EXTRACT('epoch' ... ) approach as it's the simplest.

Now, we will filter out neonates by requiring age to be greater than 16 (note while this also removes children, there are no children in the MIMIC-III database).



In [9]:

    
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
FROM icustays icu
INNER JOIN patients pat
  ON icu.subject_id = pat.subject_id
LIMIT 10
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
  , co.age
  , CASE
        WHEN co.icu_length_of_stay < 2 then 1
    ELSE 0 END
        as exclusion_los
  , CASE
        WHEN co.age < 16 then 1
    ELSE 0 END
        as exclusion_age
FROM co
"""
df = pd.read_sql_query(query, con)
df









    Out[9]:






  
    
      
      subject_id
      hadm_id
      icustay_id
      icu_length_of_stay
      age
      exclusion_los
      exclusion_age
    
  
  
    
      0
      2
      163353
      243653
      0.091829
      0.002434
      1
      1
    
    
      1
      3
      145834
      211552
      6.064560
      76.526792
      0
      0
    
    
      2
      4
      185777
      294638
      1.678472
      47.845047
      1
      0
    
    
      3
      5
      178980
      214757
      0.084444
      0.000693
      1
      1
    
    
      4
      6
      107064
      228232
      3.672917
      65.942297
      0
      0
    
    
      5
      7
      118037
      278444
      0.267720
      0.001779
      1
      1
    
    
      6
      7
      118037
      236754
      0.739097
      0.005868
      1
      1
    
    
      7
      8
      159514
      262299
      1.075521
      0.001438
      1
      1
    
    
      8
      9
      150750
      220597
      5.323056
      41.790228
      0
      0
    
    
      9
      10
      184167
      288409
      8.092106
      0.001329
      0
      1

Above we can see we have "flagged" 6 icustay_id for exclusion due to their age, and that many of these exclusions overlap with the requirement that their ICU length of stay be longer than 2 days.

Let's try another common exclusion criteria: secondary admissions to the ICU - either in-hospital or out of hospital. The primary reason for this is it simplifies many statistical analyses which assume independent observations. If we kept multiple ICU stays for the same patient, then we would have to account for the fact that these ICU stays are highly correlated (e.g. the same patient may repeatedly be admitted for the same condition), and this can add an undesirable layer of complexity. To identify readmissions, we first rank ICU stays from earliest to latest using the RANK() function.



In [10]:

    
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age

, RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order

FROM icustays icu
INNER JOIN patients pat
  ON icu.subject_id = pat.subject_id
LIMIT 10
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
  , co.age
  , co.icustay_id_order
  
  , CASE
        WHEN co.icu_length_of_stay < 2 then 1
    ELSE 0 END
        as exclusion_los
  , CASE
        WHEN co.age < 16 then 1
    ELSE 0 END
        as exclusion_age
FROM co
"""
df = pd.read_sql_query(query, con)
df









    Out[10]:






  
    
      
      subject_id
      hadm_id
      icustay_id
      icu_length_of_stay
      age
      icustay_id_order
      exclusion_los
      exclusion_age
    
  
  
    
      0
      2
      163353
      243653
      0.091829
      0.002434
      1
      1
      1
    
    
      1
      3
      145834
      211552
      6.064560
      76.526792
      1
      0
      0
    
    
      2
      4
      185777
      294638
      1.678472
      47.845047
      1
      1
      0
    
    
      3
      5
      178980
      214757
      0.084444
      0.000693
      1
      1
      1
    
    
      4
      6
      107064
      228232
      3.672917
      65.942297
      1
      0
      0
    
    
      5
      7
      118037
      278444
      0.267720
      0.001779
      1
      1
      1
    
    
      6
      7
      118037
      236754
      0.739097
      0.005868
      2
      1
      1
    
    
      7
      8
      159514
      262299
      1.075521
      0.001438
      1
      1
      1
    
    
      8
      9
      150750
      220597
      5.323056
      41.790228
      1
      0
      0
    
    
      9
      10
      184167
      288409
      8.092106
      0.001329
      1
      0
      1

We can see that subject_id = 7 has been admitted twice - we would like to exclude this second admission, so we code in a CASE statement to do exactly this (note: while the subject would be excluded anyway, due to the other exclusion criteria, it's still a good example case!).



In [11]:

    
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age

, RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order

FROM icustays icu
INNER JOIN patients pat
  ON icu.subject_id = pat.subject_id
LIMIT 10
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
  , co.age
  , co.icustay_id_order
  
  , CASE
        WHEN co.icu_length_of_stay < 2 then 1
    ELSE 0 END
    AS exclusion_los
  , CASE
        WHEN co.age < 16 then 1
    ELSE 0 END
    AS exclusion_age
  , CASE 
        WHEN co.icustay_id_order != 1 THEN 1
    ELSE 0 END 
    AS exclusion_first_stay
FROM co
"""
df = pd.read_sql_query(query, con)
df









    Out[11]:






  
    
      
      subject_id
      hadm_id
      icustay_id
      icu_length_of_stay
      age
      icustay_id_order
      exclusion_los
      exclusion_age
      exclusion_first_stay
    
  
  
    
      0
      2
      163353
      243653
      0.091829
      0.002434
      1
      1
      1
      0
    
    
      1
      3
      145834
      211552
      6.064560
      76.526792
      1
      0
      0
      0
    
    
      2
      4
      185777
      294638
      1.678472
      47.845047
      1
      1
      0
      0
    
    
      3
      5
      178980
      214757
      0.084444
      0.000693
      1
      1
      1
      0
    
    
      4
      6
      107064
      228232
      3.672917
      65.942297
      1
      0
      0
      0
    
    
      5
      7
      118037
      278444
      0.267720
      0.001779
      1
      1
      1
      0
    
    
      6
      7
      118037
      236754
      0.739097
      0.005868
      2
      1
      1
      1
    
    
      7
      8
      159514
      262299
      1.075521
      0.001438
      1
      1
      1
      0
    
    
      8
      9
      150750
      220597
      5.323056
      41.790228
      1
      0
      0
      0
    
    
      9
      10
      184167
      288409
      8.092106
      0.001329
      1
      0
      1
      0

As we can see, we are now excluding this later admission.

Finally, we may want to exclude patients who were admitted for certain services. This is commonly done as the patient demographics can vary widely based upon service type, and we may want a more homoegenous group of patients. The services table provides the hospital service that a patient was admitted under, and is the best place to identify the type of care the patient is receiving. Let's take a look at the services table now:



In [12]:

    
query = query_schema + """
SELECT subject_id, hadm_id, transfertime, prev_service, curr_service
FROM services
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df









    Out[12]:






  
    
      
      subject_id
      hadm_id
      transfertime
      prev_service
      curr_service
    
  
  
    
      0
      471
      135879
      2122-07-22 14:07:27
      TSURG
      MED
    
    
      1
      471
      135879
      2122-07-26 18:31:49
      MED
      TSURG
    
    
      2
      472
      173064
      2172-09-28 19:22:15
      None
      CMED
    
    
      3
      473
      129194
      2201-01-09 20:16:45
      None
      NB
    
    
      4
      474
      194246
      2181-03-23 08:24:41
      None
      NB
    
    
      5
      474
      146746
      2181-04-04 17:38:46
      None
      NBB
    
    
      6
      475
      139351
      2131-09-16 18:44:04
      None
      NB
    
    
      7
      476
      161042
      2100-07-05 10:26:45
      None
      NB
    
    
      8
      477
      191025
      2156-07-20 11:53:03
      None
      MED
    
    
      9
      478
      137370
      2194-07-15 13:55:21
      None
      NB

Above we can see that the curr_service column gives an abbreviation for the current service. The prev_service column is null, unless the patient had a transfer of service, in which case it identifies the previous service. For example, we can see subject_id = 471 has had at least two service changes: once from TSURG to MED and once from MED back to TSURG (note: there may be more as we have limited this query using LIMIT 10, and you could examine this patient in detail using SELECT * FROM services WHERE subject_id = 471 if you like).

A list of the unique services and their descriptions can be found at: http://mimic.physionet.org/mimictables/services/

In particular, if we want to exclude surgery, we should exclude patients who were admitted under:

CSURG
NSURG
ORTHO
PSURG
SURG
TSURG
VSURG

We can simplify this to patients who were under service '%SURG' or 'ORTHO' - where '%' is a wildcard matching any letter(s).



In [13]:

    
query = query_schema + """
SELECT hadm_id, curr_service
, CASE
    WHEN curr_service like '%SURG' then 1
    WHEN curr_service = 'ORTHO' then 1
    ELSE 0 END
  AS surgical
FROM services se
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df









    Out[13]:






  
    
      
      hadm_id
      curr_service
      surgical
    
  
  
    
      0
      135879
      MED
      0
    
    
      1
      135879
      TSURG
      1
    
    
      2
      173064
      CMED
      0
    
    
      3
      129194
      NB
      0
    
    
      4
      194246
      NB
      0
    
    
      5
      146746
      NBB
      0
    
    
      6
      139351
      NB
      0
    
    
      7
      161042
      NB
      0
    
    
      8
      191025
      MED
      0
    
    
      9
      137370
      NB
      0

This seems to be working nicely - except we only have hadm_id, and we are basing our cohort off of icustay_id. No problem, we can join from the icustays table to get the icustay_id for each hadm_id.



In [14]:

    
query = query_schema + """
SELECT icu.hadm_id, icu.icustay_id, curr_service
, CASE
    WHEN curr_service like '%SURG' then 1
    WHEN curr_service = 'ORTHO' then 1
    ELSE 0 END
  AS surgical
FROM icustays icu
LEFT JOIN services se
  ON icu.hadm_id = se.hadm_id
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df









    Out[14]:






  
    
      
      hadm_id
      icustay_id
      curr_service
      surgical
    
  
  
    
      0
      100001
      275225
      MED
      0
    
    
      1
      100003
      209281
      MED
      0
    
    
      2
      100006
      291788
      MED
      0
    
    
      3
      100006
      291788
      OMED
      0
    
    
      4
      100007
      217937
      SURG
      1
    
    
      5
      100009
      253656
      CSURG
      1
    
    
      6
      100010
      271147
      GU
      0
    
    
      7
      100011
      214619
      TRAUM
      0
    
    
      8
      100012
      239289
      SURG
      1
    
    
      9
      100016
      217590
      MED
      0

Note however that now we have a new issue: which service do we pick for each icustay_id? This is a cohort selection question, not a syntax question. We choose to exclude patients whose last service before ICU admission was surgical. We can update our join clause to reflect this choice:



In [15]:

    
query = query_schema + """
SELECT icu.hadm_id, icu.icustay_id, se.curr_service
, CASE
    WHEN curr_service like '%SURG' then 1
    WHEN curr_service = 'ORTHO' then 1
    ELSE 0 END
  AS surgical
FROM icustays icu
LEFT JOIN services se
 ON icu.hadm_id = se.hadm_id
AND se.transfertime < icu.intime + interval '12' hour
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df









    Out[15]:






  
    
      
      hadm_id
      icustay_id
      curr_service
      surgical
    
  
  
    
      0
      100001
      275225
      MED
      0
    
    
      1
      100003
      209281
      MED
      0
    
    
      2
      100006
      291788
      MED
      0
    
    
      3
      100007
      217937
      SURG
      1
    
    
      4
      100009
      253656
      CSURG
      1
    
    
      5
      100010
      271147
      GU
      0
    
    
      6
      100011
      214619
      TRAUM
      0
    
    
      7
      100012
      239289
      SURG
      1
    
    
      8
      100016
      217590
      MED
      0
    
    
      9
      100017
      258320
      MED
      0

Note how icustay_id = 291788 no longer has an entry for OMED above: this is because this service was given after their ICU admission, so we do not want to consider it. Also note that our join clause has + interval '12' hour - this adds a bit of "fuzziness" to our criteria. As these times are entered asynchronously by different people in varying locations in the hospital, there can be some minor inconsistencies in the order. For example, a patient may be transferred to the surgical service for ICU admission, but the transfertime in services occurs after the intime in icustays by an hour or so. This is administrative "noise" - and a fuzzy interval can be useful in these cases. Again, this is a cohort selection decision - you may not want to use an interval as large as 12 hours - perhaps only 2 or 4 - though in this case there is likely to be very minor differences as 80% of patients never have a change in hospital service.

Finally, we want to collapse this down so we only have one service for a given ICU admission. As done earlier, we will use RANK() to do this.



In [16]:

    
query = query_schema + """
WITH serv AS
(
SELECT icu.hadm_id, icu.icustay_id, se.curr_service
, CASE
    WHEN curr_service like '%SURG' then 1
    WHEN curr_service = 'ORTHO' then 1
    ELSE 0 END
  AS surgical
, RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank
FROM icustays icu
LEFT JOIN services se
 ON icu.hadm_id = se.hadm_id
AND se.transfertime < icu.intime + interval '12' hour
LIMIT 10
)
SELECT hadm_id, icustay_id, curr_service, surgical
FROM serv
WHERE rank = 1
"""
df = pd.read_sql_query(query, con)
df









    Out[16]:






  
    
      
      hadm_id
      icustay_id
      curr_service
      surgical
    
  
  
    
      0
      100001
      275225
      MED
      0
    
    
      1
      100003
      209281
      MED
      0
    
    
      2
      100006
      291788
      MED
      0
    
    
      3
      100007
      217937
      SURG
      1
    
    
      4
      100009
      253656
      CSURG
      1
    
    
      5
      100010
      271147
      GU
      0
    
    
      6
      100011
      214619
      TRAUM
      0
    
    
      7
      100012
      239289
      SURG
      1
    
    
      8
      100016
      217590
      MED
      0
    
    
      9
      100017
      258320
      MED
      0

We can join this table to our original cohort from above.



In [17]:

    
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
, RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
FROM icustays icu
INNER JOIN patients pat
  ON icu.subject_id = pat.subject_id
LIMIT 10
)
, serv AS
(
SELECT icu.hadm_id, icu.icustay_id, se.curr_service
, CASE
    WHEN curr_service like '%SURG' then 1
    WHEN curr_service = 'ORTHO' then 1
    ELSE 0 END
  as surgical
, RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank
FROM icustays icu
LEFT JOIN services se
 ON icu.hadm_id = se.hadm_id
AND se.transfertime < icu.intime + interval '12' hour
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
  , co.age
  , co.icustay_id_order
  
  , CASE
        WHEN co.icu_length_of_stay < 2 then 1
    ELSE 0 END
    AS exclusion_los
  , CASE
        WHEN co.age < 16 then 1
    ELSE 0 END
    AS exclusion_age
  , CASE 
        WHEN co.icustay_id_order != 1 THEN 1
    ELSE 0 END 
    AS exclusion_first_stay
  , CASE
        WHEN serv.surgical = 1 THEN 1
    ELSE 0 END
    as exclusion_surgical
FROM co
LEFT JOIN serv
  ON  co.icustay_id = serv.icustay_id
  AND serv.rank = 1
"""
df = pd.read_sql_query(query, con)
df









    Out[17]:






  
    
      
      subject_id
      hadm_id
      icustay_id
      icu_length_of_stay
      age
      icustay_id_order
      exclusion_los
      exclusion_age
      exclusion_first_stay
      exclusion_surgical
    
  
  
    
      0
      6
      107064
      228232
      3.672917
      65.942297
      1
      0
      0
      0
      1
    
    
      1
      7
      118037
      278444
      0.267720
      0.001779
      1
      1
      1
      0
      0
    
    
      2
      7
      118037
      236754
      0.739097
      0.005868
      2
      1
      1
      1
      0
    
    
      3
      3
      145834
      211552
      6.064560
      76.526792
      1
      0
      0
      0
      1
    
    
      4
      9
      150750
      220597
      5.323056
      41.790228
      1
      0
      0
      0
      0
    
    
      5
      8
      159514
      262299
      1.075521
      0.001438
      1
      1
      1
      0
      0
    
    
      6
      2
      163353
      243653
      0.091829
      0.002434
      1
      1
      1
      0
      0
    
    
      7
      5
      178980
      214757
      0.084444
      0.000693
      1
      1
      1
      0
      0
    
    
      8
      10
      184167
      288409
      8.092106
      0.001329
      1
      0
      1
      0
      0
    
    
      9
      4
      185777
      294638
      1.678472
      47.845047
      1
      1
      0
      0
      0

Great! You now have a cohort for which you can start extracting data.

A common question asked is: why did we use the services table for identifying surgical patients, rather than the first_careunit column from the icustays? This is a very important concept in the MIMIC-III database: while patients may be cared for by the surgical service, they are not necessarily in the surgical ICU. These patients are called "boarders", and the reason why they are not in the usual ICU for their service is multifactorial. Let's take a look at some care units:



In [18]:

    
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id, first_careunit
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
, RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
FROM icustays icu
INNER JOIN patients pat
  ON icu.subject_id = pat.subject_id
LIMIT 10
)
, serv AS
(
SELECT icu.hadm_id, icu.icustay_id, se.curr_service
, CASE
    WHEN curr_service like '%SURG' then 1
    WHEN curr_service = 'ORTHO' then 1
    ELSE 0 END
  as surgical
, RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank
FROM icustays icu
LEFT JOIN services se
 ON icu.hadm_id = se.hadm_id
AND se.transfertime < icu.intime + interval '12' hour
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
  , co.age
  , co.icustay_id_order
  , serv.curr_service
  , co.first_careunit
  , CASE
        WHEN co.icu_length_of_stay < 2 then 1
    ELSE 0 END
    AS exclusion_los
  , CASE
        WHEN co.age < 16 then 1
    ELSE 0 END
    AS exclusion_age
  , CASE 
        WHEN co.icustay_id_order != 1 THEN 1
    ELSE 0 END 
    AS exclusion_first_stay
  , CASE
        WHEN serv.surgical = 1 THEN 1
    ELSE 0 END
    as exclusion_surgical
FROM co
LEFT JOIN serv
  ON  co.icustay_id = serv.icustay_id
  AND serv.rank = 1
"""
df = pd.read_sql_query(query, con)
df









    Out[18]:






  
    
      
      subject_id
      hadm_id
      icustay_id
      icu_length_of_stay
      age
      icustay_id_order
      curr_service
      first_careunit
      exclusion_los
      exclusion_age
      exclusion_first_stay
      exclusion_surgical
    
  
  
    
      0
      6
      107064
      228232
      3.672917
      65.942297
      1
      SURG
      SICU
      0
      0
      0
      1
    
    
      1
      7
      118037
      278444
      0.267720
      0.001779
      1
      NB
      NICU
      1
      1
      0
      0
    
    
      2
      7
      118037
      236754
      0.739097
      0.005868
      2
      NB
      NICU
      1
      1
      1
      0
    
    
      3
      3
      145834
      211552
      6.064560
      76.526792
      1
      VSURG
      MICU
      0
      0
      0
      1
    
    
      4
      9
      150750
      220597
      5.323056
      41.790228
      1
      NMED
      MICU
      0
      0
      0
      0
    
    
      5
      8
      159514
      262299
      1.075521
      0.001438
      1
      NB
      NICU
      1
      1
      0
      0
    
    
      6
      2
      163353
      243653
      0.091829
      0.002434
      1
      NB
      NICU
      1
      1
      0
      0
    
    
      7
      5
      178980
      214757
      0.084444
      0.000693
      1
      NB
      NICU
      1
      1
      0
      0
    
    
      8
      10
      184167
      288409
      8.092106
      0.001329
      1
      NB
      NICU
      0
      1
      0
      0
    
    
      9
      4
      185777
      294638
      1.678472
      47.845047
      1
      MED
      MICU
      1
      0
      0
      0

Without specifically looking for it, we have found an example in icustay_id 211552: they were admitted under the VSURG service, but admitted to a medical ICU (MICU). If we used the first_careunit, then we would undesirably include this "boarder" in our study.

Let's summarize our exclusions by looking at some simple summary measures of the dataframe df.



In [19]:

    
print('{:20s} {:5d}'.format('Observations', df.shape[0]))
idxExcl = np.zeros(df.shape[0],dtype=bool)
for col in df.columns:
    if "exclusion_" in col:
        print('{:20s} {:5d} ({:2.2f}%)'.format(col, df[col].sum(), df[col].sum()*100.0/df.shape[0]))
        idxExcl = (idxExcl) | (df[col]==1)

# print a summary of how many were excluded in total
print('')
print('{:20s} {:5d} ({:2.2f}%)'.format('Total excluded', np.sum(idxExcl), np.sum(idxExcl)*100.0/df.shape[0]))









    



Observations            10
exclusion_los            6 (60.00%)
exclusion_age            6 (60.00%)
exclusion_first_stay     1 (10.00%)
exclusion_surgical       2 (20.00%)

Total excluded           9 (90.00%)

As we can see, summarizing the exclusions is very simple because we have created this cohort table. With that, we conclude this tutorial on cohort selection. To recap, you have learned that:

best practice is to create a "cohort" table with a single row for every unique icustay_id, which is usually the identifier of interest
exclusions flags can be created based off rules, allowing easy prototyping, modification, and summarization later
when identifying the type of care provided, use the services table
read the docs, and don't make assumptions!

Also, remember that when prototyping the LIMIT clause is very useful for speed gains, but don't forget to remove it once you want to test your code on all 60,000+ admissions :)

Good luck in your analysis!



In [20]:

    
# close out the database connection
con.close()

	subject_id	hadm_id	icustay_id
0	268	110404	280836
1	269	106296	206613
2	270	188028	220345
3	271	173727	249196
4	272	164716	210407

	subject_id	hadm_id	icustay_id	icu_length_of_stay_interval	icu_length_of_stay
0	268	110404	280836	3 days 05:58:33	280713.0
1	269	106296	206613	3 days 06:41:28	283288.0
2	270	188028	220345	2 days 21:27:09	250029.0
3	271	173727	249196	2 days 01:26:22	177982.0
4	272	164716	210407	1 days 14:53:09	139989.0

	subject_id	hadm_id	icustay_id	icu_length_of_stay
0	268	110404	280836	3.248993
1	269	106296	206613	3.278796
2	270	188028	220345	2.893854
3	271	173727	249196	2.059977
4	272	164716	210407	1.620243

	subject_id	hadm_id	icustay_id	icu_length_of_stay	age	exclusion_los
0	2	163353	243653	0.091829	0 days 21:20:07	1
1	3	145834	211552	6.064560	27950 days 19:10:11	0
2	4	185777	294638	1.678472	17475 days 00:29:31	1
3	5	178980	214757	0.084444	0 days 06:04:24	1
4	6	107064	228232	3.672917	24084 days 21:30:54	0
5	7	118037	278444	0.267720	0 days 15:35:29	1
6	7	118037	236754	0.739097	2 days 03:26:01	1
7	8	159514	262299	1.075521	0 days 12:36:10	1
8	9	150750	220597	5.323056	15263 days 13:07:02	0
9	10	184167	288409	8.092106	0 days 11:39:05	0

	subject_id	hadm_id	transfertime	prev_service	curr_service
0	471	135879	2122-07-22 14:07:27	TSURG	MED
1	471	135879	2122-07-26 18:31:49	MED	TSURG
2	472	173064	2172-09-28 19:22:15	None	CMED
3	473	129194	2201-01-09 20:16:45	None	NB
4	474	194246	2181-03-23 08:24:41	None	NB
5	474	146746	2181-04-04 17:38:46	None	NBB
6	475	139351	2131-09-16 18:44:04	None	NB
7	476	161042	2100-07-05 10:26:45	None	NB
8	477	191025	2156-07-20 11:53:03	None	MED
9	478	137370	2194-07-15 13:55:21	None	NB

	hadm_id	icustay_id	curr_service	surgical
0	100001	275225	MED	0
1	100003	209281	MED	0
2	100006	291788	MED	0
3	100006	291788	OMED	0
4	100007	217937	SURG	1
5	100009	253656	CSURG	1
6	100010	271147	GU	0
7	100011	214619	TRAUM	0
8	100012	239289	SURG	1
9	100016	217590	MED	0