The aim of this tutorial is to describe how patients are tracked in the MIMIC-III database. By the end of this notebook you should:
subject_id
, hadm_id
, and icustay_id
representRequirements:
pip install numpy pandas matplotlib psycopg2 jupyter
First, as always, we open a connection to a local copy of the database. If you don't have a local copy of the database in PostgreSQL, follow the tutorial online (http://mimic.physionet.org) to install one.
Let's begin!
In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import psycopg2
# below imports are used to print out pretty pandas dataframes
from IPython.display import display, HTML
%matplotlib inline
plt.style.use('ggplot')
# information used to create a database connection
sqluser = 'postgres'
dbname = 'mimic'
schema_name = 'mimiciii'
# Connect to postgres with a copy of the MIMIC-III database
con = psycopg2.connect(dbname=dbname, user=sqluser)
# the below statement is prepended to queries to ensure they select from the right schema
query_schema = 'set search_path to ' + schema_name + ';'
Cohort selection will begin with three tables: patients, admissions, and icustays:
As MIMIC-III is primarily an intensive care unit (ICU) database, the focus will be on patients admitted to and discharged from the ICU. That is, rather than selecting our cohort based off the individual patient (identified by subject_id
in the database), we will usually want to select our cohort based off the ICU stay (identified by icustay_id
). Thus, it is sensible to begin with the icustays table:
In [2]:
query = query_schema + """
SELECT subject_id, hadm_id, icustay_id
FROM icustays
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df.head()
Out[2]:
Note: in the above we use LIMIT 10
: this limits our results to only 10 rows. It's nice to include this statement when prototyping as it speeds up queries immensely. Later on when we are doing full data extraction, we would remove this statement.
If we are interested in the length of stay for the ICU patients, we can query the intime
and outtime
columns, adding in some SQL specific syntax for calculating the difference between two dates.
In [3]:
query = query_schema + """
SELECT subject_id, hadm_id, icustay_id
, outtime - intime as icu_length_of_stay_interval
, EXTRACT(EPOCH FROM outtime - intime) as icu_length_of_stay
FROM icustays
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df.head()
Out[3]:
Note that the EXTRACT(EPOCH FROM ... )
code extracts the number of fractional seconds represented by an interval data type. So the logic is roughly intime - outtime
-> icu_length_of_stay_interval
, followed by EXTRACT(EPOCH FROM icu_length_of_stay_interval)
-> fractional seconds (numeric). ICU Length of stay is most readily interpretted when represented in fractional days, so let's do that conversion.
In [4]:
query = query_schema + """
SELECT subject_id, hadm_id, icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
FROM icustays
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df.head()
Out[4]:
In the above, seconds are converted to days easily by dividing by: 60 (seconds in a minute), 60 (minutes in an hour), and 24 (hours in a day). We also omit the icu_length_of_stay_interval
as it's now redundant for our purposes.
If we are only interested in ICU stays lasting a certain length (say 24 hours), we need to do the following two steps:
WHERE
clause to filter this data using our generated columnHere's an example of us filtering to only stays lasting at least 2 days:
In [5]:
query = query_schema + """
WITH co AS
(
SELECT subject_id, hadm_id, icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
FROM icustays
LIMIT 10
)
SELECT
co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
FROM co
WHERE icu_length_of_stay >= 2
"""
df = pd.read_sql_query(query, con)
df
Out[5]:
Looks good - none of the above stays are shorter than 2 days.
Many studies using the MIMIC-III database are focused on specific subgroups of patients. For example, MIMIC-III contains both adults and neonates, but it is rare that a study would like to evaluate some phenomenom in both groups simulatenously. As a result, the first step of many studies is selecting a subpopulation from the icustays table. Concretely, we will want to select a set of icustay_id
which represent our patient population. You've just seen an example of doing this: in the above code, we limited our population to only those who were in the ICU for at least 2 days.
When subselecting the patient population, it is generally good practice to build a "cohort" table - that is a table with all icustay_id
available in the database, each associated with binary flags indicating whether or not they are excluded from your population. Let's take a look at how this would work with the above query which limited the dataset to patients who stayed longer than 2 days.
In [6]:
query = query_schema + """
WITH co AS
(
SELECT subject_id, hadm_id, icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
FROM icustays
LIMIT 10
)
SELECT
co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
, CASE
WHEN co.icu_length_of_stay < 2 then 1
ELSE 0 END
as exclusion_los
FROM co
"""
df = pd.read_sql_query(query, con)
df
Out[6]:
In the earlier query we had a total of 6 rows returned because we filtered 4 of them out. In the above query, we keep all 10 rows, but we have indicated that 4 of them should be excluded in the last column. This is a good practice to have as it will make it very easy to summarize your exclusions at the end of your study and modify them if your later work deems it necessary.
Let's go back to the common exclusion criteria mentioned earlier: flagging non-adults for removal. First, we'll need to calculate the patient's age on ICU admission, which will require the patient's date of birth and the ICU admission time. We already have the ICU admission time (intime
in the icustays table), so all we need to do is get the date of birth from the patients table.
In [7]:
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, icu.intime - pat.dob AS age
FROM icustays icu
INNER JOIN patients pat
ON icu.subject_id = pat.subject_id
LIMIT 10
)
SELECT
co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
, co.age
, CASE
WHEN co.icu_length_of_stay < 2 then 1
ELSE 0 END
as exclusion_los
FROM co
"""
df = pd.read_sql_query(query, con)
df
Out[7]:
Notes from the above query: we have to specify the table in icu.subject_id
because there is a subject_id
column in both the icustays table and the patients table, and the program doesn't know which we want unless we specify it.
Now, looking at the results, it appears age
is returned as the number of days between the dob
and the intime
- perhaps not what we desire! As mentioned before, this is an interval data type - it's useful when doing date operations but for our purposes it is not practical. We have three options:
EXTRACT()
to extract the seconds and convert that into an age by dividing by the number of seconds in a year (as we did before)AGE()
to return a symbolic representation of the age in years followed by the function DATE_PART()
to extract the yearsDATE_PART()
to get the months and days as well for more precisionLet's look at all three.
In [8]:
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, icu.intime - pat.dob AS age
FROM icustays icu
INNER JOIN patients pat
ON icu.subject_id = pat.subject_id
LIMIT 10
)
SELECT
co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
, co.age
, EXTRACT('year' FROM co.age) as age_extract_year
, EXTRACT('year' FROM co.age)
+ EXTRACT('months' FROM co.age) / 12.0
+ EXTRACT('days' FROM co.age) / 365.242
+ EXTRACT('hours' FROM co.age) / 24.0 / 364.242
as age_extract_precise
, EXTRACT('epoch' from co.age) / 60.0 / 60.0 / 24.0 / 365.242 as age_extract_epoch
, CASE
WHEN co.icu_length_of_stay < 2 then 1
ELSE 0 END
as exclusion_los
FROM co
"""
df = pd.read_sql_query(query, con)
df
Out[8]:
As we can see, there is very little difference between the second and last approach - so it is up to preference (and desire for true precision). We will use the EXTRACT('epoch' ... )
approach as it's the simplest.
Now, we will filter out neonates by requiring age to be greater than 16 (note while this also removes children, there are no children in the MIMIC-III database).
In [9]:
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
FROM icustays icu
INNER JOIN patients pat
ON icu.subject_id = pat.subject_id
LIMIT 10
)
SELECT
co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
, co.age
, CASE
WHEN co.icu_length_of_stay < 2 then 1
ELSE 0 END
as exclusion_los
, CASE
WHEN co.age < 16 then 1
ELSE 0 END
as exclusion_age
FROM co
"""
df = pd.read_sql_query(query, con)
df
Out[9]:
Above we can see we have "flagged" 6 icustay_id
for exclusion due to their age, and that many of these exclusions overlap with the requirement that their ICU length of stay be longer than 2 days.
Let's try another common exclusion criteria: secondary admissions to the ICU - either in-hospital or out of hospital. The primary reason for this is it simplifies many statistical analyses which assume independent observations. If we kept multiple ICU stays for the same patient, then we would have to account for the fact that these ICU stays are highly correlated (e.g. the same patient may repeatedly be admitted for the same condition), and this can add an undesirable layer of complexity. To identify readmissions, we first rank ICU stays from earliest to latest using the RANK()
function.
In [10]:
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
, RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
FROM icustays icu
INNER JOIN patients pat
ON icu.subject_id = pat.subject_id
LIMIT 10
)
SELECT
co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
, co.age
, co.icustay_id_order
, CASE
WHEN co.icu_length_of_stay < 2 then 1
ELSE 0 END
as exclusion_los
, CASE
WHEN co.age < 16 then 1
ELSE 0 END
as exclusion_age
FROM co
"""
df = pd.read_sql_query(query, con)
df
Out[10]:
We can see that subject_id
= 7 has been admitted twice - we would like to exclude this second admission, so we code in a CASE
statement to do exactly this (note: while the subject would be excluded anyway, due to the other exclusion criteria, it's still a good example case!).
In [11]:
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
, RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
FROM icustays icu
INNER JOIN patients pat
ON icu.subject_id = pat.subject_id
LIMIT 10
)
SELECT
co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
, co.age
, co.icustay_id_order
, CASE
WHEN co.icu_length_of_stay < 2 then 1
ELSE 0 END
AS exclusion_los
, CASE
WHEN co.age < 16 then 1
ELSE 0 END
AS exclusion_age
, CASE
WHEN co.icustay_id_order != 1 THEN 1
ELSE 0 END
AS exclusion_first_stay
FROM co
"""
df = pd.read_sql_query(query, con)
df
Out[11]:
As we can see, we are now excluding this later admission.
Finally, we may want to exclude patients who were admitted for certain services. This is commonly done as the patient demographics can vary widely based upon service type, and we may want a more homoegenous group of patients. The services table provides the hospital service that a patient was admitted under, and is the best place to identify the type of care the patient is receiving. Let's take a look at the services table now:
In [12]:
query = query_schema + """
SELECT subject_id, hadm_id, transfertime, prev_service, curr_service
FROM services
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df
Out[12]:
Above we can see that the curr_service
column gives an abbreviation for the current service. The prev_service
column is null, unless the patient had a transfer of service, in which case it identifies the previous service. For example, we can see subject_id = 471
has had at least two service changes: once from TSURG to MED and once from MED back to TSURG (note: there may be more as we have limited this query using LIMIT 10
, and you could examine this patient in detail using SELECT * FROM services WHERE subject_id = 471
if you like).
A list of the unique services and their descriptions can be found at: http://mimic.physionet.org/mimictables/services/
In particular, if we want to exclude surgery, we should exclude patients who were admitted under:
We can simplify this to patients who were under service '%SURG'
or 'ORTHO'
- where '%'
is a wildcard matching any letter(s).
In [13]:
query = query_schema + """
SELECT hadm_id, curr_service
, CASE
WHEN curr_service like '%SURG' then 1
WHEN curr_service = 'ORTHO' then 1
ELSE 0 END
AS surgical
FROM services se
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df
Out[13]:
This seems to be working nicely - except we only have hadm_id
, and we are basing our cohort off of icustay_id
. No problem, we can join from the icustays table to get the icustay_id
for each hadm_id
.
In [14]:
query = query_schema + """
SELECT icu.hadm_id, icu.icustay_id, curr_service
, CASE
WHEN curr_service like '%SURG' then 1
WHEN curr_service = 'ORTHO' then 1
ELSE 0 END
AS surgical
FROM icustays icu
LEFT JOIN services se
ON icu.hadm_id = se.hadm_id
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df
Out[14]:
Note however that now we have a new issue: which service do we pick for each icustay_id
? This is a cohort selection question, not a syntax question. We choose to exclude patients whose last service before ICU admission was surgical. We can update our join clause to reflect this choice:
In [15]:
query = query_schema + """
SELECT icu.hadm_id, icu.icustay_id, se.curr_service
, CASE
WHEN curr_service like '%SURG' then 1
WHEN curr_service = 'ORTHO' then 1
ELSE 0 END
AS surgical
FROM icustays icu
LEFT JOIN services se
ON icu.hadm_id = se.hadm_id
AND se.transfertime < icu.intime + interval '12' hour
LIMIT 10
"""
df = pd.read_sql_query(query, con)
df
Out[15]:
Note how icustay_id
= 291788 no longer has an entry for OMED above: this is because this service was given after their ICU admission, so we do not want to consider it. Also note that our join clause has + interval '12' hour
- this adds a bit of "fuzziness" to our criteria. As these times are entered asynchronously by different people in varying locations in the hospital, there can be some minor inconsistencies in the order. For example, a patient may be transferred to the surgical service for ICU admission, but the transfertime
in services occurs after the intime
in icustays by an hour or so. This is administrative "noise" - and a fuzzy interval can be useful in these cases. Again, this is a cohort selection decision - you may not want to use an interval as large as 12 hours - perhaps only 2 or 4 - though in this case there is likely to be very minor differences as 80% of patients never have a change in hospital service.
Finally, we want to collapse this down so we only have one service for a given ICU admission. As done earlier, we will use RANK() to do this.
In [16]:
query = query_schema + """
WITH serv AS
(
SELECT icu.hadm_id, icu.icustay_id, se.curr_service
, CASE
WHEN curr_service like '%SURG' then 1
WHEN curr_service = 'ORTHO' then 1
ELSE 0 END
AS surgical
, RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank
FROM icustays icu
LEFT JOIN services se
ON icu.hadm_id = se.hadm_id
AND se.transfertime < icu.intime + interval '12' hour
LIMIT 10
)
SELECT hadm_id, icustay_id, curr_service, surgical
FROM serv
WHERE rank = 1
"""
df = pd.read_sql_query(query, con)
df
Out[16]:
We can join this table to our original cohort from above.
In [17]:
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
, RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
FROM icustays icu
INNER JOIN patients pat
ON icu.subject_id = pat.subject_id
LIMIT 10
)
, serv AS
(
SELECT icu.hadm_id, icu.icustay_id, se.curr_service
, CASE
WHEN curr_service like '%SURG' then 1
WHEN curr_service = 'ORTHO' then 1
ELSE 0 END
as surgical
, RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank
FROM icustays icu
LEFT JOIN services se
ON icu.hadm_id = se.hadm_id
AND se.transfertime < icu.intime + interval '12' hour
)
SELECT
co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
, co.age
, co.icustay_id_order
, CASE
WHEN co.icu_length_of_stay < 2 then 1
ELSE 0 END
AS exclusion_los
, CASE
WHEN co.age < 16 then 1
ELSE 0 END
AS exclusion_age
, CASE
WHEN co.icustay_id_order != 1 THEN 1
ELSE 0 END
AS exclusion_first_stay
, CASE
WHEN serv.surgical = 1 THEN 1
ELSE 0 END
as exclusion_surgical
FROM co
LEFT JOIN serv
ON co.icustay_id = serv.icustay_id
AND serv.rank = 1
"""
df = pd.read_sql_query(query, con)
df
Out[17]:
Great! You now have a cohort for which you can start extracting data.
A common question asked is: why did we use the services table for identifying surgical patients, rather than the first_careunit
column from the icustays? This is a very important concept in the MIMIC-III database: while patients may be cared for by the surgical service, they are not necessarily in the surgical ICU. These patients are called "boarders", and the reason why they are not in the usual ICU for their service is multifactorial. Let's take a look at some care units:
In [18]:
query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id, first_careunit
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
, RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
FROM icustays icu
INNER JOIN patients pat
ON icu.subject_id = pat.subject_id
LIMIT 10
)
, serv AS
(
SELECT icu.hadm_id, icu.icustay_id, se.curr_service
, CASE
WHEN curr_service like '%SURG' then 1
WHEN curr_service = 'ORTHO' then 1
ELSE 0 END
as surgical
, RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank
FROM icustays icu
LEFT JOIN services se
ON icu.hadm_id = se.hadm_id
AND se.transfertime < icu.intime + interval '12' hour
)
SELECT
co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
, co.age
, co.icustay_id_order
, serv.curr_service
, co.first_careunit
, CASE
WHEN co.icu_length_of_stay < 2 then 1
ELSE 0 END
AS exclusion_los
, CASE
WHEN co.age < 16 then 1
ELSE 0 END
AS exclusion_age
, CASE
WHEN co.icustay_id_order != 1 THEN 1
ELSE 0 END
AS exclusion_first_stay
, CASE
WHEN serv.surgical = 1 THEN 1
ELSE 0 END
as exclusion_surgical
FROM co
LEFT JOIN serv
ON co.icustay_id = serv.icustay_id
AND serv.rank = 1
"""
df = pd.read_sql_query(query, con)
df
Out[18]:
Without specifically looking for it, we have found an example in icustay_id
211552: they were admitted under the VSURG service, but admitted to a medical ICU (MICU). If we used the first_careunit
, then we would undesirably include this "boarder" in our study.
Let's summarize our exclusions by looking at some simple summary measures of the dataframe df.
In [19]:
print('{:20s} {:5d}'.format('Observations', df.shape[0]))
idxExcl = np.zeros(df.shape[0],dtype=bool)
for col in df.columns:
if "exclusion_" in col:
print('{:20s} {:5d} ({:2.2f}%)'.format(col, df[col].sum(), df[col].sum()*100.0/df.shape[0]))
idxExcl = (idxExcl) | (df[col]==1)
# print a summary of how many were excluded in total
print('')
print('{:20s} {:5d} ({:2.2f}%)'.format('Total excluded', np.sum(idxExcl), np.sum(idxExcl)*100.0/df.shape[0]))
As we can see, summarizing the exclusions is very simple because we have created this cohort table. With that, we conclude this tutorial on cohort selection. To recap, you have learned that:
icustay_id
, which is usually the identifier of interestAlso, remember that when prototyping the LIMIT
clause is very useful for speed gains, but don't forget to remove it once you want to test your code on all 60,000+ admissions :)
Good luck in your analysis!
In [20]:
# close out the database connection
con.close()