Copyright 2019 Google Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
The aim of this tutorial is to familiarize you with BigQuery so that you can query, filter, aggregate, and export data with Python.
You should already have a valid Gmail account registered with the datathon organizers. If you do not have a Gmail account, you can create one at http://www.gmail.com. You then need to notify the datathon organizers to register your new account for data access. If you have not yet signed the data use agreement (DUA) sent by the organizers, please do so immediately to get access to the MIMIC-III dataset.
To be able to run the queries in this tutorial, you need to create a copy of this Colab notebook by clicking the "File" > "Save a copy in Drive..." menu. You can share your copy with your teammates by clicking the "SHARE" button in the top-right corner of your Colab notebook copy. Everyone with "Edit" permission is able to modify the notebook at the same time, so it is a great way for team collaboration. Before running any cell in this colab, please make sure there is a green check mark before "CONNECTED" in the top-right corner; if not, click the "CONNECTED" button to connect to a random backend.
Now that you have done the initial setup, let us start playing with the data. First, you need to run some initialization code. You can run the following cell by clicking on the triangle button when you hover over the [ ] space on the top-left corner of the code cell below.
In [0]:
# Import libraries
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.path as path
import tensorflow as tf
# Below imports are used to print out pretty pandas dataframes
from IPython.display import display, HTML
# Imports for accessing Datathon data using Google BigQuery.
from google.colab import auth
from google.cloud import bigquery
Before running any queries using BigQuery, you need to first authenticate yourself by running the following cell. If you are running it for the first time, it will ask you to follow a link to log in using your Gmail account, and accept the data access requests to your profile. Once this is done, it will generate a string of verification code, which you should paste back to the cell below and press enter.
In [0]:
auth.authenticate_user()
The data-hosting project physionet-data grants you read-only access; as a result, you need to set a default project that you have BigQuery access to. A shared project should be created by the event organizers, and we will be using it throughout this tutorial.
Note that during the datathon, all participants will be divided into teams and a Google Cloud project will be created for each team specifically. That project would be the preferred project to use. For now we'll stick with the shared project for the purpose of the tutorial.
After the datathon is finished, the shared project may either have its access locked down or be deleted. It is still possible to run queries from a project you own personally, as long as you have access to the data-hosting project.
Change the variable project_id below to the project you are using.
In [0]:
# Note that this should be the project for the datathon work,
# not the physionet-data project which is for data hosting.
project_id = 'REPLACE_WITH_PROJECT_ID'
os.environ['GOOGLE_CLOUD_PROJECT'] = project_id
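As an optional sanity check (a minimal sketch; it assumes the google.cloud.bigquery client imported above can reach both your default project and the data-hosting project), you can confirm that the demo dataset is visible from this notebook:
In [0]:
# Optional sanity check: list a few tables from the demo dataset to confirm
# that the physionet-data project is readable from this notebook.
client = bigquery.Client(project=project_id)
tables = list(client.list_tables('physionet-data.mimiciii_demo'))
print('Found %d tables, e.g. %s' % (len(tables), tables[0].table_id))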
Let's define a few methods to wrap BigQuery operations, so that we don't have to write the configurations again and again.
In [0]:
# Read data from BigQuery into pandas dataframes.
def run_query(query):
  # Use standard SQL (the queries below use standard-SQL syntax).
  return pd.read_gbq(
      query,
      project_id=project_id,
      dialect='standard')
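As a quick smoke test of run_query (a minimal sketch using the demo patients table that also appears in the queries below), you can count the patients in the demo dataset:
In [0]:
# Smoke test: a tiny aggregate query to confirm BigQuery access is working.
display(run_query(
    'SELECT COUNT(*) AS num_patients FROM `physionet-data.mimiciii_demo.patients`'))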
OK, that's it for setup; now let's get our hands on the MIMIC demo data!
Let's run some queries adapted from the MIMIC cohort selection tutorial.
First, let's run the following query to produce data for a histogram showing the distribution of patient ages in ten-year buckets (i.e. [0, 10), [10, 20), ..., [90, ∞)).
In [0]:
df = run_query("""
WITH ps AS (
SELECT
icu.subject_id,
icu.hadm_id,
icu.icustay_id,
pat.dob,
DATETIME_DIFF(icu.outtime, icu.intime, DAY) AS icu_length_of_stay,
DATE_DIFF(DATE(icu.intime), DATE(pat.dob), YEAR) AS age
FROM `physionet-data.mimiciii_demo.icustays` AS icu
INNER JOIN `physionet-data.mimiciii_demo.patients` AS pat
ON icu.subject_id = pat.subject_id),
bu AS (
SELECT
CAST(FLOOR(age / 10) AS INT64) AS bucket
FROM ps)
SELECT
COUNT(bucket) AS num_icu_stays,
IF(bucket >= 9, ">= 90", FORMAT("%d - %d", bucket * 10, (bucket + 1) * 10)) AS age_bucket
FROM bu
GROUP BY bucket
ORDER BY bucket ASC
""")
df.set_index('age_bucket').plot(title='stay - age', kind='bar', legend=False)
The query consists of 3 parts:

1. The icustays and patients tables are joined to produce the length of ICU stay in days for each patient, which is saved in a temporary table ps;
2. Each stay is mapped to a ten-year age bucket, which is saved in the bu table;
3. The stays in each bucket are counted and formatted into two output columns, age_bucket and num_icu_stays, to plot the chart.

Note: If you are having a hard time following the queries in this colab, or you want to know more about the table structures of the MIMIC-III dataset, please consult our colab for a previous Datathon held in Sydney.
Now let's see if there is a correlation between age and the average length of stay in hours. Since we are using the age of patients when they were admitted, we don't need to worry about multiple admissions per patient. Note that we treat the redacted ages (> 90) as noise and filter them out.
In [0]:
df = run_query("""
WITH re AS (
SELECT
DATETIME_DIFF(icu.outtime, icu.intime, HOUR) AS icu_length_of_stay,
DATE_DIFF(DATE(icu.intime), DATE(pat.dob), YEAR) AS age
FROM `physionet-data.mimiciii_demo.icustays` AS icu
INNER JOIN `physionet-data.mimiciii_demo.patients` AS pat
ON icu.subject_id = pat.subject_id)
SELECT
icu_length_of_stay AS stay,
age
FROM re
WHERE age < 100
""")
df.plot(kind='scatter', x='age', y='stay')
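To put a rough number on what the scatter plot shows, we can also compute the Pearson correlation between the two columns (a small sketch using pandas only):
In [0]:
# Pearson correlation between age and ICU length of stay (in hours).
print('Correlation between age and stay: %.3f' % df['age'].corr(df['stay']))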
Let's take a look at another query that uses a filter we often need: the current service that the ICU patient is undergoing.
In [0]:
df = run_query("""
WITH co AS (
SELECT
icu.subject_id,
icu.hadm_id,
icu.icustay_id,
pat.dob,
DATETIME_DIFF(icu.outtime, icu.intime, DAY) AS icu_length_of_stay,
DATE_DIFF(DATE(icu.intime), DATE(pat.dob), YEAR) AS age,
RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
FROM `physionet-data.mimiciii_demo.icustays` AS icu
INNER JOIN `physionet-data.mimiciii_demo.patients` AS pat
ON icu.subject_id = pat.subject_id
ORDER BY hadm_id DESC),
serv AS (
SELECT
icu.hadm_id,
icu.icustay_id,
se.curr_service,
IF(curr_service like '%SURG' OR curr_service = 'ORTHO', 1, 0) AS surgical,
RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank
FROM `physionet-data.mimiciii_demo.icustays` AS icu
LEFT JOIN `physionet-data.mimiciii_demo.services` AS se
ON icu.hadm_id = se.hadm_id
AND se.transfertime < DATETIME_ADD(icu.intime, INTERVAL 12 HOUR)
ORDER BY icustay_id)
SELECT
co.subject_id,
co.hadm_id,
co.icustay_id,
co.icu_length_of_stay,
co.age,
IF(co.icu_length_of_stay < 2, 1, 0) AS short_stay,
  IF(co.icustay_id_order = 1, 1, 0) AS first_stay,
IF(serv.surgical = 1, 1, 0) AS surgical
FROM co
LEFT JOIN serv USING (icustay_id, hadm_id)
WHERE
serv.rank = 1 AND age < 100
ORDER BY subject_id, icustay_id_order
""")
print('Number of rows in dataframe: %d' % len(df))
df.head()
This is a long query, but it is pretty simple if we take a closer look. It consists of 3 steps as well:

1. The icustays and patients tables are joined. Note that since each patient may be admitted multiple times, we usually filter out follow-up ICU stays and only keep the first one, to minimize unwanted data correlation. This is achieved by partitioning over subject_id and ordering by admission time, then choosing only the first one with the RANK function; the result is saved to a temporary table co;
2. The service each ICU stay was under around admission time is looked up from the services table, and surgical services are flagged; the result is saved to a temporary table serv;
3. co and serv are joined. For the convenience of later analysis, we rename some columns and filter out patients more than 100 years old.

The times in the tables are stored as DATETIME objects. This means you cannot simply use operators like <, =, or > for comparing them; such comparisons can instead be done with the help of the DATETIME_SUB() function. In the example below, we are looking for stays of less than 1 hour (where the admit time is less than 1 hour away from the discharge time):

[...] WHERE ADMITTIME BETWEEN DATETIME_SUB(DISCHTIME, INTERVAL 1 HOUR) AND DISCHTIME
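For instance, a minimal sketch of this pattern against the demo admissions table (assuming its admittime and dischtime columns, which are not otherwise shown in this tutorial) looks like this:
In [0]:
# Sketch: count hospital admissions discharged within 1 hour of admission in the
# demo dataset (assumes the admissions table has admittime/dischtime columns).
df_short = run_query("""
SELECT COUNT(*) AS num_short_admissions
FROM `physionet-data.mimiciii_demo.admissions`
WHERE admittime BETWEEN DATETIME_SUB(dischtime, INTERVAL 1 HOUR) AND dischtime
""")
display(df_short)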
Next we will show an example of using TensorFlow (getting started doc) to build a simple predictor, where we use the patient's age and whether it is the first ICU stay to predict whether the ICU stay will be a short one. With only 127 data points in total, we don't expect to actually build an accurate or useful predictor, but it should serve the purpose of showing how a model can be trained and used with TensorFlow within Colab.
First, let us split the 127 data points into a training set with 100 records and a test set with 27, and examine the distributions of the two splits to make sure they are similar.
In [0]:
data = df[['age', 'first_stay', 'short_stay']]
# Shuffle the rows before splitting (reindex returns a new dataframe).
data = data.reindex(np.random.permutation(data.index))
training_df = data.head(100)
validation_df = data.tail(27)
print('Training data summary:')
display(training_df.describe())
print('Validation data summary:')
display(validation_df.describe())
And let's quickly check the label distribution for the features.
In [0]:
display(training_df.groupby(['short_stay', 'first_stay']).count())
fig, ax = plt.subplots()
shorts = training_df[training_df.short_stay == 1].age
longs = training_df[training_df.short_stay == 0].age
colors = ['b', 'g']
ax.hist([shorts, longs],
bins=10,
color=colors,
label=['short_stay=1', 'short_stay=0'])
ax.set_xlabel('Age')
ax.set_ylabel('Number of Patients')
plt.legend(loc='upper left')
plt.show()
Let's first build a linear regression model to predict the numeric value of "short_stay" based on age and first_stay features. You can tune the parameters on the right-hand side and observe differences in the evaluation result.
In [0]:
#@title Linear Regression Parameters {display-mode:"both"}
BATCH_SIZE = 5 # @param
NUM_EPOCHS = 100 # @param
first_stay = tf.feature_column.numeric_column('first_stay')
age = tf.feature_column.numeric_column('age')
# Build linear regressor
linear_regressor = tf.estimator.LinearRegressor(
feature_columns=[first_stay, age])
# Train the Model.
model = linear_regressor.train(
input_fn=tf.compat.v1.estimator.inputs.pandas_input_fn(
x=training_df,
y=training_df['short_stay'],
        num_epochs=NUM_EPOCHS,
batch_size=BATCH_SIZE,
shuffle=True),
steps=100)
# Evaluate the model.
eval_result = linear_regressor.evaluate(
input_fn=tf.compat.v1.estimator.inputs.pandas_input_fn(
x=validation_df,
y=validation_df['short_stay'],
batch_size=BATCH_SIZE,
shuffle=False))
display(eval_result)
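To inspect the raw numeric outputs of the trained regressor, a minimal sketch (reusing validation_df from above; the estimator's predict() yields dictionaries with a 'predictions' entry) is:
In [0]:
# Feed the validation set back through the trained regressor and show a few predictions.
pred_input_fn = tf.compat.v1.estimator.inputs.pandas_input_fn(
    x=validation_df, num_epochs=1, shuffle=False)
predicted = [p['predictions'][0] for p in linear_regressor.predict(input_fn=pred_input_fn)]
display(pd.DataFrame({
    'age': validation_df['age'].values,
    'first_stay': validation_df['first_stay'].values,
    'predicted_short_stay': predicted}).head())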
Remember that the label short_stay is actually categorical, with the value 1 for an ICU stay of 1 day or less, and the value 0 for stays of 2 days or more. So a classification model better fits this task. Here we try a deep neural network model using the DNNClassifier estimator. Notice the small changes from the regression code above.
In [0]:
#@title ML Training example {display-mode:"both"}
BATCH_SIZE = 5 # @param
NUM_EPOCHS = 100 # @param
HIDDEN_UNITS = [10, 10] # @param
# Build DNN classifier
classifier = tf.estimator.DNNClassifier(
feature_columns=[first_stay, age], hidden_units=HIDDEN_UNITS)
# Train the Model.
model = classifier.train(
input_fn=tf.compat.v1.estimator.inputs.pandas_input_fn(
x=training_df,
y=training_df['short_stay'],
        num_epochs=NUM_EPOCHS,
batch_size=BATCH_SIZE,
shuffle=True),
steps=100)
# Evaluate the model.
eval_result = classifier.evaluate(
input_fn=tf.compat.v1.estimator.inputs.pandas_input_fn(
x=validation_df,
y=validation_df['short_stay'],
batch_size=BATCH_SIZE,
shuffle=False))
display(eval_result)
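Unlike the regressor, the classifier emits predicted classes and per-class probabilities. A short sketch to peek at a few of them (again reusing validation_df):
In [0]:
# Show predicted class IDs and probabilities for the first few validation rows.
pred_input_fn = tf.compat.v1.estimator.inputs.pandas_input_fn(
    x=validation_df, num_epochs=1, shuffle=False)
for pred in list(classifier.predict(input_fn=pred_input_fn))[:5]:
  print('class_id=%d, probabilities=%s' % (pred['class_ids'][0], pred['probabilities']))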
Congratulations! You have now finished this datathon tutorial and are ready to explore the real data by querying Google BigQuery. To do so, simply use mimiciii_clinical as the dataset name. For example, the table mimiciii_demo.icustays becomes mimiciii_clinical.icustays when you need the actual MIMIC data. One thing to note, though, is that it is highly recommended to aggregate data aggressively wherever possible, because large dataframes may cause Colab's performance to drop drastically or even lead to out-of-memory errors. Now, let's do the substitution and start the real datathon exploration.
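For example, a minimal sketch of the substitution (it assumes your account has been granted access to the full mimiciii_clinical dataset), aggregated down to a single number rather than pulling raw rows into the dataframe:
In [0]:
# Same icustays table as before, but on the full dataset, aggregated to one count.
display(run_query("""
SELECT COUNT(DISTINCT icustay_id) AS num_icu_stays
FROM `physionet-data.mimiciii_clinical.icustays`
"""))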
ERROR:root:An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))
If you try to run a query and see this error message, scroll to the bottom of the error text. The very last row of the error will show the specific error message, which is usually related to having the wrong project_id or not having access to the project/dataset.
Reset the runtime to reinitialize. Note that this will clear any local variables or uploaded files. Do this by clicking the Runtime menu at the top, then Reset all runtimes.