Cohort selection

The aim of this tutorial is to describe how patients are tracked in the MIMIC-III database. By the end of this notebook you should:

  • Understand what subject_id, hadm_id, and icustay_id represent
  • Know how to set up a cohort table for subselecting a patient population
  • Understand the difference between service and physical location

Requirements:

  • MIMIC-III in a PostgreSQL database
  • Python packages installable with:
    • pip install numpy pandas matplotlib psycopg2 jupyter

First, as always, we open a connection to a local copy of the database. If you don't have a local copy of the database in PostgreSQL, follow the tutorial online (http://mimic.physionet.org) to install one.

Let's begin!


In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import psycopg2

# below imports are used to print out pretty pandas dataframes
from IPython.display import display, HTML

%matplotlib inline
plt.style.use('ggplot')

# information used to create a database connection
sqluser = 'postgres'
dbname = 'mimic'
schema_name = 'mimiciii'

# Connect to postgres with a copy of the MIMIC-III database
con = psycopg2.connect(dbname=dbname, user=sqluser)

# the below statement is prepended to queries to ensure they select from the right schema
query_schema = 'set search_path to ' + schema_name + ';'

Cohort selection will begin with three tables: patients, admissions, and icustays:

  • patients: information about a patient that does not change - e.g. date of birth, genotypical sex
  • admissions: information recorded on hospital admission - admission type (elective, emergency), time of admission
  • icustays: information recorded on intensive care unit admission - primarily admission and discharge time

As MIMIC-III is primarily an intensive care unit (ICU) database, the focus will be on patients admitted to and discharged from the ICU. That is, rather than selecting our cohort based off the individual patient (identified by subject_id in the database), we will usually want to select our cohort based off the ICU stay (identified by icustay_id). Thus, it is sensible to begin with the icustays table:


In [2]:
query = query_schema + """
SELECT subject_id, hadm_id, icustay_id
FROM icustays
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df.head()


Out[2]:
subject_id hadm_id icustay_id
0 268 110404 280836
1 269 106296 206613
2 270 188028 220345
3 271 173727 249196
4 272 164716 210407

Note: in the above we use LIMIT 10: this limits our results to only 10 rows. It's nice to include this statement when prototyping as it speeds up queries immensely. Later on when we are doing full data extraction, we would remove this statement.

If we are interested in the length of stay for the ICU patients, we can query the intime and outtime columns, adding in some SQL specific syntax for calculating the difference between two dates.


In [3]:
query = query_schema + """
SELECT subject_id, hadm_id, icustay_id
, outtime - intime as icu_length_of_stay_interval
, EXTRACT(EPOCH FROM outtime - intime) as icu_length_of_stay
FROM icustays
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df.head()


Out[3]:
subject_id hadm_id icustay_id icu_length_of_stay_interval icu_length_of_stay
0 268 110404 280836 3 days 05:58:33 280713.0
1 269 106296 206613 3 days 06:41:28 283288.0
2 270 188028 220345 2 days 21:27:09 250029.0
3 271 173727 249196 2 days 01:26:22 177982.0
4 272 164716 210407 1 days 14:53:09 139989.0

Note that the EXTRACT(EPOCH FROM ... ) code extracts the number of fractional seconds represented by an interval data type. So the logic is roughly intime - outtime -> icu_length_of_stay_interval, followed by EXTRACT(EPOCH FROM icu_length_of_stay_interval) -> fractional seconds (numeric). ICU Length of stay is most readily interpretted when represented in fractional days, so let's do that conversion.


In [4]:
query = query_schema + """
SELECT subject_id, hadm_id, icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
FROM icustays
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df.head()


Out[4]:
subject_id hadm_id icustay_id icu_length_of_stay
0 268 110404 280836 3.248993
1 269 106296 206613 3.278796
2 270 188028 220345 2.893854
3 271 173727 249196 2.059977
4 272 164716 210407 1.620243

In the above, seconds are converted to days easily by dividing by: 60 (seconds in a minute), 60 (minutes in an hour), and 24 (hours in a day). We also omit the icu_length_of_stay_interval as it's now redundant for our purposes.

If we are only interested in ICU stays lasting a certain length (say 24 hours), we need to do the following two steps:

  • use an in-line view to "hold" the data
  • use the WHERE clause to filter this data using our generated column

Here's an example of us filtering to only stays lasting at least 2 days:


In [5]:
query = query_schema + """
WITH co AS
(
SELECT subject_id, hadm_id, icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
FROM icustays
LIMIT 10
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
FROM co
WHERE icu_length_of_stay >= 2
"""
df = pd.read_sql_query(query, con)
df


Out[5]:
subject_id hadm_id icustay_id icu_length_of_stay
0 268 110404 280836 3.248993
1 269 106296 206613 3.278796
2 270 188028 220345 2.893854
3 271 173727 249196 2.059977
4 274 130546 254851 8.814259
5 275 129886 219649 7.131412

Looks good - none of the above stays are shorter than 2 days.

Many studies using the MIMIC-III database are focused on specific subgroups of patients. For example, MIMIC-III contains both adults and neonates, but it is rare that a study would like to evaluate some phenomenom in both groups simulatenously. As a result, the first step of many studies is selecting a subpopulation from the icustays table. Concretely, we will want to select a set of icustay_id which represent our patient population. You've just seen an example of doing this: in the above code, we limited our population to only those who were in the ICU for at least 2 days.

When subselecting the patient population, it is generally good practice to build a "cohort" table - that is a table with all icustay_id available in the database, each associated with binary flags indicating whether or not they are excluded from your population. Let's take a look at how this would work with the above query which limited the dataset to patients who stayed longer than 2 days.


In [6]:
query = query_schema + """
WITH co AS
(
SELECT subject_id, hadm_id, icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
FROM icustays
LIMIT 10
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
  , CASE
        WHEN co.icu_length_of_stay < 2 then 1
    ELSE 0 END
        as exclusion_los
FROM co
"""
df = pd.read_sql_query(query, con)
df


Out[6]:
subject_id hadm_id icustay_id icu_length_of_stay exclusion_los
0 268 110404 280836 3.248993 0
1 269 106296 206613 3.278796 0
2 270 188028 220345 2.893854 0
3 271 173727 249196 2.059977 0
4 272 164716 210407 1.620243 1
5 273 158689 241507 1.486181 1
6 274 130546 254851 8.814259 0
7 275 129886 219649 7.131412 0
8 276 135156 206327 1.337836 1
9 277 171601 272866 0.731273 1

In the earlier query we had a total of 6 rows returned because we filtered 4 of them out. In the above query, we keep all 10 rows, but we have indicated that 4 of them should be excluded in the last column. This is a good practice to have as it will make it very easy to summarize your exclusions at the end of your study and modify them if your later work deems it necessary.

Let's go back to the common exclusion criteria mentioned earlier: flagging non-adults for removal. First, we'll need to calculate the patient's age on ICU admission, which will require the patient's date of birth and the ICU admission time. We already have the ICU admission time (intime in the icustays table), so all we need to do is get the date of birth from the patients table.


In [7]:
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, icu.intime - pat.dob AS age
FROM icustays icu
INNER JOIN patients pat
  ON icu.subject_id = pat.subject_id
LIMIT 10
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
  , co.age
  , CASE
        WHEN co.icu_length_of_stay < 2 then 1
    ELSE 0 END
        as exclusion_los
FROM co
"""
df = pd.read_sql_query(query, con)
df


Out[7]:
subject_id hadm_id icustay_id icu_length_of_stay age exclusion_los
0 2 163353 243653 0.091829 0 days 21:20:07 1
1 3 145834 211552 6.064560 27950 days 19:10:11 0
2 4 185777 294638 1.678472 17475 days 00:29:31 1
3 5 178980 214757 0.084444 0 days 06:04:24 1
4 6 107064 228232 3.672917 24084 days 21:30:54 0
5 7 118037 278444 0.267720 0 days 15:35:29 1
6 7 118037 236754 0.739097 2 days 03:26:01 1
7 8 159514 262299 1.075521 0 days 12:36:10 1
8 9 150750 220597 5.323056 15263 days 13:07:02 0
9 10 184167 288409 8.092106 0 days 11:39:05 0

Notes from the above query: we have to specify the table in icu.subject_id because there is a subject_id column in both the icustays table and the patients table, and the program doesn't know which we want unless we specify it.

Now, looking at the results, it appears age is returned as the number of days between the dob and the intime - perhaps not what we desire! As mentioned before, this is an interval data type - it's useful when doing date operations but for our purposes it is not practical. We have three options:

  • we can use the function EXTRACT() to extract the seconds and convert that into an age by dividing by the number of seconds in a year (as we did before)
  • we can use the PostgreSQL function AGE() to return a symbolic representation of the age in years followed by the function DATE_PART() to extract the years
  • the same as the above, but calling DATE_PART() to get the months and days as well for more precision

Let's look at all three.


In [8]:
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, icu.intime - pat.dob AS age
FROM icustays icu
INNER JOIN patients pat
  ON icu.subject_id = pat.subject_id
LIMIT 10
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
  , co.age
  , EXTRACT('year' FROM co.age) as age_extract_year 
  , EXTRACT('year' FROM co.age) 
    + EXTRACT('months' FROM co.age) / 12.0
    + EXTRACT('days' FROM co.age) / 365.242
    + EXTRACT('hours' FROM co.age) / 24.0 / 364.242
    as age_extract_precise
  , EXTRACT('epoch' from co.age) / 60.0 / 60.0 / 24.0 / 365.242 as age_extract_epoch
  , CASE
        WHEN co.icu_length_of_stay < 2 then 1
    ELSE 0 END
        as exclusion_los
FROM co
"""
df = pd.read_sql_query(query, con)
df


Out[8]:
subject_id hadm_id icustay_id icu_length_of_stay age age_extract_year age_extract_precise age_extract_epoch exclusion_los
0 2 163353 243653 0.091829 0 days 21:20:07 0.0 0.002402 0.002434 1
1 3 145834 211552 6.064560 27950 days 19:10:11 0.0 76.526779 76.526792 0
2 4 185777 294638 1.678472 17475 days 00:29:31 0.0 47.844990 47.845047 1
3 5 178980 214757 0.084444 0 days 06:04:24 0.0 0.000686 0.000693 1
4 6 107064 228232 3.672917 24084 days 21:30:54 0.0 65.942245 65.942297 0
5 7 118037 278444 0.267720 0 days 15:35:29 0.0 0.001716 0.001779 1
6 7 118037 236754 0.739097 2 days 03:26:01 0.0 0.005819 0.005868 1
7 8 159514 262299 1.075521 0 days 12:36:10 0.0 0.001373 0.001438 1
8 9 150750 220597 5.323056 15263 days 13:07:02 0.0 41.790219 41.790228 0
9 10 184167 288409 8.092106 0 days 11:39:05 0.0 0.001258 0.001329 0

As we can see, there is very little difference between the second and last approach - so it is up to preference (and desire for true precision). We will use the EXTRACT('epoch' ... ) approach as it's the simplest.

Now, we will filter out neonates by requiring age to be greater than 16 (note while this also removes children, there are no children in the MIMIC-III database).


In [9]:
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
FROM icustays icu
INNER JOIN patients pat
  ON icu.subject_id = pat.subject_id
LIMIT 10
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
  , co.age
  , CASE
        WHEN co.icu_length_of_stay < 2 then 1
    ELSE 0 END
        as exclusion_los
  , CASE
        WHEN co.age < 16 then 1
    ELSE 0 END
        as exclusion_age
FROM co
"""
df = pd.read_sql_query(query, con)
df


Out[9]:
subject_id hadm_id icustay_id icu_length_of_stay age exclusion_los exclusion_age
0 2 163353 243653 0.091829 0.002434 1 1
1 3 145834 211552 6.064560 76.526792 0 0
2 4 185777 294638 1.678472 47.845047 1 0
3 5 178980 214757 0.084444 0.000693 1 1
4 6 107064 228232 3.672917 65.942297 0 0
5 7 118037 278444 0.267720 0.001779 1 1
6 7 118037 236754 0.739097 0.005868 1 1
7 8 159514 262299 1.075521 0.001438 1 1
8 9 150750 220597 5.323056 41.790228 0 0
9 10 184167 288409 8.092106 0.001329 0 1

Above we can see we have "flagged" 6 icustay_id for exclusion due to their age, and that many of these exclusions overlap with the requirement that their ICU length of stay be longer than 2 days.

Let's try another common exclusion criteria: secondary admissions to the ICU - either in-hospital or out of hospital. The primary reason for this is it simplifies many statistical analyses which assume independent observations. If we kept multiple ICU stays for the same patient, then we would have to account for the fact that these ICU stays are highly correlated (e.g. the same patient may repeatedly be admitted for the same condition), and this can add an undesirable layer of complexity. To identify readmissions, we first rank ICU stays from earliest to latest using the RANK() function.


In [10]:
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age

, RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order

FROM icustays icu
INNER JOIN patients pat
  ON icu.subject_id = pat.subject_id
LIMIT 10
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
  , co.age
  , co.icustay_id_order
  
  , CASE
        WHEN co.icu_length_of_stay < 2 then 1
    ELSE 0 END
        as exclusion_los
  , CASE
        WHEN co.age < 16 then 1
    ELSE 0 END
        as exclusion_age
FROM co
"""
df = pd.read_sql_query(query, con)
df


Out[10]:
subject_id hadm_id icustay_id icu_length_of_stay age icustay_id_order exclusion_los exclusion_age
0 2 163353 243653 0.091829 0.002434 1 1 1
1 3 145834 211552 6.064560 76.526792 1 0 0
2 4 185777 294638 1.678472 47.845047 1 1 0
3 5 178980 214757 0.084444 0.000693 1 1 1
4 6 107064 228232 3.672917 65.942297 1 0 0
5 7 118037 278444 0.267720 0.001779 1 1 1
6 7 118037 236754 0.739097 0.005868 2 1 1
7 8 159514 262299 1.075521 0.001438 1 1 1
8 9 150750 220597 5.323056 41.790228 1 0 0
9 10 184167 288409 8.092106 0.001329 1 0 1

We can see that subject_id = 7 has been admitted twice - we would like to exclude this second admission, so we code in a CASE statement to do exactly this (note: while the subject would be excluded anyway, due to the other exclusion criteria, it's still a good example case!).


In [11]:
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age

, RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order

FROM icustays icu
INNER JOIN patients pat
  ON icu.subject_id = pat.subject_id
LIMIT 10
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
  , co.age
  , co.icustay_id_order
  
  , CASE
        WHEN co.icu_length_of_stay < 2 then 1
    ELSE 0 END
    AS exclusion_los
  , CASE
        WHEN co.age < 16 then 1
    ELSE 0 END
    AS exclusion_age
  , CASE 
        WHEN co.icustay_id_order != 1 THEN 1
    ELSE 0 END 
    AS exclusion_first_stay
FROM co
"""
df = pd.read_sql_query(query, con)
df


Out[11]:
subject_id hadm_id icustay_id icu_length_of_stay age icustay_id_order exclusion_los exclusion_age exclusion_first_stay
0 2 163353 243653 0.091829 0.002434 1 1 1 0
1 3 145834 211552 6.064560 76.526792 1 0 0 0
2 4 185777 294638 1.678472 47.845047 1 1 0 0
3 5 178980 214757 0.084444 0.000693 1 1 1 0
4 6 107064 228232 3.672917 65.942297 1 0 0 0
5 7 118037 278444 0.267720 0.001779 1 1 1 0
6 7 118037 236754 0.739097 0.005868 2 1 1 1
7 8 159514 262299 1.075521 0.001438 1 1 1 0
8 9 150750 220597 5.323056 41.790228 1 0 0 0
9 10 184167 288409 8.092106 0.001329 1 0 1 0

As we can see, we are now excluding this later admission.

Finally, we may want to exclude patients who were admitted for certain services. This is commonly done as the patient demographics can vary widely based upon service type, and we may want a more homoegenous group of patients. The services table provides the hospital service that a patient was admitted under, and is the best place to identify the type of care the patient is receiving. Let's take a look at the services table now:


In [12]:
query = query_schema + """
SELECT subject_id, hadm_id, transfertime, prev_service, curr_service
FROM services
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df


Out[12]:
subject_id hadm_id transfertime prev_service curr_service
0 471 135879 2122-07-22 14:07:27 TSURG MED
1 471 135879 2122-07-26 18:31:49 MED TSURG
2 472 173064 2172-09-28 19:22:15 None CMED
3 473 129194 2201-01-09 20:16:45 None NB
4 474 194246 2181-03-23 08:24:41 None NB
5 474 146746 2181-04-04 17:38:46 None NBB
6 475 139351 2131-09-16 18:44:04 None NB
7 476 161042 2100-07-05 10:26:45 None NB
8 477 191025 2156-07-20 11:53:03 None MED
9 478 137370 2194-07-15 13:55:21 None NB

Above we can see that the curr_service column gives an abbreviation for the current service. The prev_service column is null, unless the patient had a transfer of service, in which case it identifies the previous service. For example, we can see subject_id = 471 has had at least two service changes: once from TSURG to MED and once from MED back to TSURG (note: there may be more as we have limited this query using LIMIT 10, and you could examine this patient in detail using SELECT * FROM services WHERE subject_id = 471 if you like).

A list of the unique services and their descriptions can be found at: http://mimic.physionet.org/mimictables/services/

In particular, if we want to exclude surgery, we should exclude patients who were admitted under:

  • CSURG
  • NSURG
  • ORTHO
  • PSURG
  • SURG
  • TSURG
  • VSURG

We can simplify this to patients who were under service '%SURG' or 'ORTHO' - where '%' is a wildcard matching any letter(s).


In [13]:
query = query_schema + """
SELECT hadm_id, curr_service
, CASE
    WHEN curr_service like '%SURG' then 1
    WHEN curr_service = 'ORTHO' then 1
    ELSE 0 END
  AS surgical
FROM services se
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df


Out[13]:
hadm_id curr_service surgical
0 135879 MED 0
1 135879 TSURG 1
2 173064 CMED 0
3 129194 NB 0
4 194246 NB 0
5 146746 NBB 0
6 139351 NB 0
7 161042 NB 0
8 191025 MED 0
9 137370 NB 0

This seems to be working nicely - except we only have hadm_id, and we are basing our cohort off of icustay_id. No problem, we can join from the icustays table to get the icustay_id for each hadm_id.


In [14]:
query = query_schema + """
SELECT icu.hadm_id, icu.icustay_id, curr_service
, CASE
    WHEN curr_service like '%SURG' then 1
    WHEN curr_service = 'ORTHO' then 1
    ELSE 0 END
  AS surgical
FROM icustays icu
LEFT JOIN services se
  ON icu.hadm_id = se.hadm_id
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df


Out[14]:
hadm_id icustay_id curr_service surgical
0 100001 275225 MED 0
1 100003 209281 MED 0
2 100006 291788 MED 0
3 100006 291788 OMED 0
4 100007 217937 SURG 1
5 100009 253656 CSURG 1
6 100010 271147 GU 0
7 100011 214619 TRAUM 0
8 100012 239289 SURG 1
9 100016 217590 MED 0

Note however that now we have a new issue: which service do we pick for each icustay_id? This is a cohort selection question, not a syntax question. We choose to exclude patients whose last service before ICU admission was surgical. We can update our join clause to reflect this choice:


In [15]:
query = query_schema + """
SELECT icu.hadm_id, icu.icustay_id, se.curr_service
, CASE
    WHEN curr_service like '%SURG' then 1
    WHEN curr_service = 'ORTHO' then 1
    ELSE 0 END
  AS surgical
FROM icustays icu
LEFT JOIN services se
 ON icu.hadm_id = se.hadm_id
AND se.transfertime < icu.intime + interval '12' hour
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df


Out[15]:
hadm_id icustay_id curr_service surgical
0 100001 275225 MED 0
1 100003 209281 MED 0
2 100006 291788 MED 0
3 100007 217937 SURG 1
4 100009 253656 CSURG 1
5 100010 271147 GU 0
6 100011 214619 TRAUM 0
7 100012 239289 SURG 1
8 100016 217590 MED 0
9 100017 258320 MED 0

Note how icustay_id = 291788 no longer has an entry for OMED above: this is because this service was given after their ICU admission, so we do not want to consider it. Also note that our join clause has + interval '12' hour - this adds a bit of "fuzziness" to our criteria. As these times are entered asynchronously by different people in varying locations in the hospital, there can be some minor inconsistencies in the order. For example, a patient may be transferred to the surgical service for ICU admission, but the transfertime in services occurs after the intime in icustays by an hour or so. This is administrative "noise" - and a fuzzy interval can be useful in these cases. Again, this is a cohort selection decision - you may not want to use an interval as large as 12 hours - perhaps only 2 or 4 - though in this case there is likely to be very minor differences as 80% of patients never have a change in hospital service.

Finally, we want to collapse this down so we only have one service for a given ICU admission. As done earlier, we will use RANK() to do this.


In [16]:
query = query_schema + """
WITH serv AS
(
SELECT icu.hadm_id, icu.icustay_id, se.curr_service
, CASE
    WHEN curr_service like '%SURG' then 1
    WHEN curr_service = 'ORTHO' then 1
    ELSE 0 END
  AS surgical
, RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank
FROM icustays icu
LEFT JOIN services se
 ON icu.hadm_id = se.hadm_id
AND se.transfertime < icu.intime + interval '12' hour
LIMIT 10
)
SELECT hadm_id, icustay_id, curr_service, surgical
FROM serv
WHERE rank = 1
"""
df = pd.read_sql_query(query, con)
df


Out[16]:
hadm_id icustay_id curr_service surgical
0 100001 275225 MED 0
1 100003 209281 MED 0
2 100006 291788 MED 0
3 100007 217937 SURG 1
4 100009 253656 CSURG 1
5 100010 271147 GU 0
6 100011 214619 TRAUM 0
7 100012 239289 SURG 1
8 100016 217590 MED 0
9 100017 258320 MED 0

We can join this table to our original cohort from above.


In [17]:
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
, RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
FROM icustays icu
INNER JOIN patients pat
  ON icu.subject_id = pat.subject_id
LIMIT 10
)
, serv AS
(
SELECT icu.hadm_id, icu.icustay_id, se.curr_service
, CASE
    WHEN curr_service like '%SURG' then 1
    WHEN curr_service = 'ORTHO' then 1
    ELSE 0 END
  as surgical
, RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank
FROM icustays icu
LEFT JOIN services se
 ON icu.hadm_id = se.hadm_id
AND se.transfertime < icu.intime + interval '12' hour
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
  , co.age
  , co.icustay_id_order
  
  , CASE
        WHEN co.icu_length_of_stay < 2 then 1
    ELSE 0 END
    AS exclusion_los
  , CASE
        WHEN co.age < 16 then 1
    ELSE 0 END
    AS exclusion_age
  , CASE 
        WHEN co.icustay_id_order != 1 THEN 1
    ELSE 0 END 
    AS exclusion_first_stay
  , CASE
        WHEN serv.surgical = 1 THEN 1
    ELSE 0 END
    as exclusion_surgical
FROM co
LEFT JOIN serv
  ON  co.icustay_id = serv.icustay_id
  AND serv.rank = 1
"""
df = pd.read_sql_query(query, con)
df


Out[17]:
subject_id hadm_id icustay_id icu_length_of_stay age icustay_id_order exclusion_los exclusion_age exclusion_first_stay exclusion_surgical
0 6 107064 228232 3.672917 65.942297 1 0 0 0 1
1 7 118037 278444 0.267720 0.001779 1 1 1 0 0
2 7 118037 236754 0.739097 0.005868 2 1 1 1 0
3 3 145834 211552 6.064560 76.526792 1 0 0 0 1
4 9 150750 220597 5.323056 41.790228 1 0 0 0 0
5 8 159514 262299 1.075521 0.001438 1 1 1 0 0
6 2 163353 243653 0.091829 0.002434 1 1 1 0 0
7 5 178980 214757 0.084444 0.000693 1 1 1 0 0
8 10 184167 288409 8.092106 0.001329 1 0 1 0 0
9 4 185777 294638 1.678472 47.845047 1 1 0 0 0

Great! You now have a cohort for which you can start extracting data.

A common question asked is: why did we use the services table for identifying surgical patients, rather than the first_careunit column from the icustays? This is a very important concept in the MIMIC-III database: while patients may be cared for by the surgical service, they are not necessarily in the surgical ICU. These patients are called "boarders", and the reason why they are not in the usual ICU for their service is multifactorial. Let's take a look at some care units:


In [18]:
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id, first_careunit
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
, RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
FROM icustays icu
INNER JOIN patients pat
  ON icu.subject_id = pat.subject_id
LIMIT 10
)
, serv AS
(
SELECT icu.hadm_id, icu.icustay_id, se.curr_service
, CASE
    WHEN curr_service like '%SURG' then 1
    WHEN curr_service = 'ORTHO' then 1
    ELSE 0 END
  as surgical
, RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank
FROM icustays icu
LEFT JOIN services se
 ON icu.hadm_id = se.hadm_id
AND se.transfertime < icu.intime + interval '12' hour
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
  , co.age
  , co.icustay_id_order
  , serv.curr_service
  , co.first_careunit
  , CASE
        WHEN co.icu_length_of_stay < 2 then 1
    ELSE 0 END
    AS exclusion_los
  , CASE
        WHEN co.age < 16 then 1
    ELSE 0 END
    AS exclusion_age
  , CASE 
        WHEN co.icustay_id_order != 1 THEN 1
    ELSE 0 END 
    AS exclusion_first_stay
  , CASE
        WHEN serv.surgical = 1 THEN 1
    ELSE 0 END
    as exclusion_surgical
FROM co
LEFT JOIN serv
  ON  co.icustay_id = serv.icustay_id
  AND serv.rank = 1
"""
df = pd.read_sql_query(query, con)
df


Out[18]:
subject_id hadm_id icustay_id icu_length_of_stay age icustay_id_order curr_service first_careunit exclusion_los exclusion_age exclusion_first_stay exclusion_surgical
0 6 107064 228232 3.672917 65.942297 1 SURG SICU 0 0 0 1
1 7 118037 278444 0.267720 0.001779 1 NB NICU 1 1 0 0
2 7 118037 236754 0.739097 0.005868 2 NB NICU 1 1 1 0
3 3 145834 211552 6.064560 76.526792 1 VSURG MICU 0 0 0 1
4 9 150750 220597 5.323056 41.790228 1 NMED MICU 0 0 0 0
5 8 159514 262299 1.075521 0.001438 1 NB NICU 1 1 0 0
6 2 163353 243653 0.091829 0.002434 1 NB NICU 1 1 0 0
7 5 178980 214757 0.084444 0.000693 1 NB NICU 1 1 0 0
8 10 184167 288409 8.092106 0.001329 1 NB NICU 0 1 0 0
9 4 185777 294638 1.678472 47.845047 1 MED MICU 1 0 0 0

Without specifically looking for it, we have found an example in icustay_id 211552: they were admitted under the VSURG service, but admitted to a medical ICU (MICU). If we used the first_careunit, then we would undesirably include this "boarder" in our study.

Let's summarize our exclusions by looking at some simple summary measures of the dataframe df.


In [19]:
print('{:20s} {:5d}'.format('Observations', df.shape[0]))
idxExcl = np.zeros(df.shape[0],dtype=bool)
for col in df.columns:
    if "exclusion_" in col:
        print('{:20s} {:5d} ({:2.2f}%)'.format(col, df[col].sum(), df[col].sum()*100.0/df.shape[0]))
        idxExcl = (idxExcl) | (df[col]==1)

# print a summary of how many were excluded in total
print('')
print('{:20s} {:5d} ({:2.2f}%)'.format('Total excluded', np.sum(idxExcl), np.sum(idxExcl)*100.0/df.shape[0]))


Observations            10
exclusion_los            6 (60.00%)
exclusion_age            6 (60.00%)
exclusion_first_stay     1 (10.00%)
exclusion_surgical       2 (20.00%)

Total excluded           9 (90.00%)

As we can see, summarizing the exclusions is very simple because we have created this cohort table. With that, we conclude this tutorial on cohort selection. To recap, you have learned that:

  • best practice is to create a "cohort" table with a single row for every unique icustay_id, which is usually the identifier of interest
  • exclusions flags can be created based off rules, allowing easy prototyping, modification, and summarization later
  • when identifying the type of care provided, use the services table
  • read the docs, and don't make assumptions!

Also, remember that when prototyping the LIMIT clause is very useful for speed gains, but don't forget to remove it once you want to test your code on all 60,000+ admissions :)

Good luck in your analysis!


In [20]:
# close out the database connection
con.close()