Introduction

This tutorial shows how to use Google Cloud Platform technologies to work with structured healthcare data to build a predictive model.

Synthea

Synthea is a data generator that simulates the lives of patients based on several medical modules. Each module models a different medical condition based on some real world statistics. Each patient in the Synthea dataset dies either due to medical reasons or non-medical random events not modeled by the generator.

Problem definition

Given the patient records generated by Synthea, predict the probability of the patient dying due to medical reasons.

Overview

Setup: Authentication, importing libraries, and project and dataset naming. NOTE: You must execute these commands every time you reconnect to Colab.

Data generation: Download Synthea and run it, and then export the data to BigQuery.

Feature extraction: Pivot the vertical tables, join them with the horizontal tables, and simplify.

Model: Identify columns with sufficient data, train the model using BigQuery ML.

Explore: Check model weights and most important features.

Model with Tables: Train a neural net using AutoML Tables and compare the results with the BigQuery ML model.

Flow/reconnecting

Whenever you restart or reconnect to this notebook, run the steps in Setup again. The remaining steps need to be performed in order, but they do not need to be repeated after you complete them once.

Prerequisites

This notebook is accompanied by another notebook that generates the Synthea dataset and imports it into BigQuery. Run that notebook before running this one.

Requirements

To run this tutorial, you will need a GCP project with a billing account.

Costs

There is a small cost associated with importing the dataset and storing it in BigQuery. However, if you run the AutoML Tables step at the end of the tutorial, the costs can reach up to $20 per hour of model training.

Setup

First, sign into your Google account to access Google Cloud Platform (GCP).

You will also import some standard Python data analysis packages that you'll use later to extract features.

Authentication: Run the following commands, click the link that displays, and follow the instructions to authenticate. Then copy the key from the browser and paste it into the box that appears in the cell's results area.

NOTE: You will need to repeat this step each time you reconnect to the notebook server.


In [0]:
from google.colab import auth

# Authenticate the Colab runtime so it can access your GCP project, then print
# the application default credentials check as confirmation.
auth.authenticate_user()
credentials = auth._check_adc()
print(credentials)

Library imports:

NOTE: You will need to repeat this step each time you reconnect to the notebook server.


In [0]:
import re
import pandas as pd
from google.cloud import bigquery

Setup:

Enter the name of your GCP project. The dataset name, output table, and model names are supplied for you. Use the same GCP project and dataset that you used when importing Synthea data in the previous notebook.

NOTE: You will need to repeat this step each time you reconnect to the notebook server.


In [0]:
project = "" #@param {type:"string"}
if not project:
  raise Exception("Project is empty.")
!gcloud config set project $project


dataset = "SYNMASS_2k" #@param {type:"string"}

output_table = "ml_ready_table_1" #@param {type:"string"}
ml_ready_table_name = "{}.{}.{}".format(project, dataset, output_table)

model_name = "mortality_model_1" #@param {type:"string"}
full_model_name = "{}.{}".format(dataset, model_name)
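
Optionally, you can confirm that the dataset created by the previous notebook is visible in this project before continuing. This is a small sketch using the BigQuery client library; it raises NotFound if the dataset does not exist.


In [0]:
# Optional sanity check: confirm the Synthea dataset from the previous notebook
# exists in this project (raises google.api_core.exceptions.NotFound if not).
bigquery.Client(project=project).get_dataset(dataset)
print("Found dataset {}.{}".format(project, dataset))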

Feature extraction

Like many healthcare datasets, the Synthea dataset contains both "vertical/longitudinal" and "horizontal" tables.

Horizontal tables

Horizontal tables contain one row per patient. Each column provides a piece of information about that patient. A good example of this is the patients table where each column provides a demographic feature of the patient such as gender, race, and so forth. A more technical term for horizontal tables is first normal form (1NF).

Vertical or longitudinal tables

Vertical tables are usually used to store longitudinal measurements and observations. Unlike horizontal tables, each patient can have multiple rows representing different measurements or observations taken at different times. An example is the observations table, where each row can contain the result of a specific lab test: the description column provides the name of the lab test and the value column contains its result. A more technical term for vertical tables is entity–attribute–value model.
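
To see the two shapes side by side, you can optionally preview a few rows of each table. This sketch assumes the patients and observations tables created by the previous notebook are present in your dataset.


In [0]:
# Optional: peek at a horizontal table (one row per patient) ...
horizontal_sql = ("SELECT Id, gender, race, city "
                  "FROM `{}.{}.patients` LIMIT 5").format(project, dataset)
print(pd.read_gbq(query=horizontal_sql, project_id=project, dialect="standard"))

# ... and at a vertical table (many rows per patient, one per observation).
vertical_sql = ("SELECT PATIENT, DATE, DESCRIPTION, VALUE "
                "FROM `{}.{}.observations` LIMIT 5").format(project, dataset)
print(pd.read_gbq(query=vertical_sql, project_id=project, dialect="standard"))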

ML-ready table

You cannot directly train the regression model on the dataset as-is in BigQuery. Instead, you will build a simple horizontal table in which each row is a single training example containing all of the data for one patient and each column corresponds to a feature. This type of table is called an ML-ready table.

Transformation

The following code extracts data from the source vertical and horizontal tables, joins them on patient ID, and creates a single ML-ready table. To extract features from the vertical tables, a query is built that groups the rows by their description and aggregates the corresponding values into an array sorted by date in descending order. For each description that occurs more than 100 times, a new column is added to the ML-ready table to serve as a model feature. The column name is derived from the description, and the column value is the first element of the aggregated array. In other words, for each type of measurement, the most recent value becomes the value of the corresponding feature.
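
To make the pivot concrete before reading the full code, here is a minimal sketch for a single feature. It assumes the observations table contains a description such as "Body Weight" (a common Synthea observation); the full code below generates one such expression per frequent description.


In [0]:
# A minimal sketch of the pivot used below, for one assumed description
# ("Body Weight"). The inner query collects each patient's values for every
# description, newest first; the outer query turns the chosen description into
# a numeric column whose value is the most recent measurement.
pivot_sql = """
SELECT
  PATIENT,
  ANY_VALUE(IF(DESCRIPTION = "Body Weight",
               CAST(values[OFFSET(0)] AS float64), NULL)) AS obs_Body_Weight
FROM (
  SELECT
    PATIENT,
    DESCRIPTION,
    ARRAY_AGG(VALUE ORDER BY DATE DESC) AS values
  FROM `{}.{}.observations`
  GROUP BY 1, 2)
GROUP BY 1
LIMIT 5""".format(project, dataset)
pd.read_gbq(query=pivot_sql, project_id=project, dialect="standard")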


In [0]:
_MIN_DESCRIPTION_OCCURENCES = 100

def GetFullTableName(name):
  """Returns the full BQ table name for the given short table name."""

  return "{}.{}.{}".format(project, dataset, name)


def UpdateFieldNameToDesc(field_name_to_desc, prefix, descriptions_to_exclude,
                          description):
  """Updates a given field name to description dictionary with the given values.

  The description is converted to a valid BQ field name, and then the
  dictionary is updated to map the field name to the description.

  Args:
    field_name_to_desc: The map that should be updated.
    prefix: The prefix used for the normalized field name. This is required to
      differentiate between same descriptions in different tables.
    descriptions_to_exclude: A list of descriptions that should be excluded from
      the map.
    description: the description that should be added to the map.
  """
  if description in descriptions_to_exclude:
    return
  pattern = re.compile(r"[\W_]+")
  field_name = pattern.sub(" ", description)
  field_name = field_name.replace(" ", "_")
  field_name = "{}_{}".format(prefix, field_name)
  field_name_to_desc[field_name] = description


def BuildFieldNameToDesc(prefix, table, typ, descriptions_to_exclude):
  """Reads a vertical table and returns a map of field name to description.

  The description value of the rows determines the column name of the
  ML-ready table. We extract all the unique descriptions and transform them
  to a valid name that can be used as BQ column names.

  Args:
    prefix: The prefix used for the normalized field names.
    table: The name of the table for which the map is built.
    typ: If this is set, the output is limited to the fields having this type.
    descriptions_to_exclude: A list of descriptions that should be excluded from
      the map.

  Returns:
    A map from normalized BQ field names to their corresponding description.
  """

  type_constraint = ""
  if typ is not None:
    type_constraint = " WHERE TYPE='{}' ".format(typ)
  sql = """
    SELECT
      DESCRIPTION as description,
      count(*) as occurences
    FROM `{}`{}
    GROUP BY 1 ORDER BY 2 DESC""".format(table, type_constraint)

  data = pd.read_gbq(query=sql, project_id=project, dialect="standard")

  # Filter the data to contain descriptions that have at least
  # _MIN_DESCRIPTION_OCCURENCES occurences.
  data = data[data["occurences"] > _MIN_DESCRIPTION_OCCURENCES]

  field_name_to_desc = {}

  def UpdateFn(description):
    UpdateFieldNameToDesc(field_name_to_desc, prefix, descriptions_to_exclude,
                          description)

  data["description"].apply(UpdateFn)
  return field_name_to_desc


def BuildQueryToHorizontalize(prefix, table, typ, descriptions_to_exclude):
  """Builds a query that horizontalizes the given table.

  The description column determines the feature name and the value column
  determines the value. In case of multiple values for a description, the last
  value is used.

  Args:
    prefix: The prefix used for the normalized field names.
    table: The name of the table for which the query is built.
    typ: Type of the values, it can be either "numeric" or "text".
    descriptions_to_exclude: descriptions that shouldn't be featurized.

  Returns:
    A sql query to horizontalize the table.
  """

  field_name_to_desc = BuildFieldNameToDesc(prefix, table, typ,
                                            descriptions_to_exclude)
  columns_str = ""
  for field_name, desc in field_name_to_desc.items():
    if typ == "numeric":
      columns_str += (
          ", any_value(if(DESCRIPTION = \"{}\", CAST(values[OFFSET(0)] as "
          "float64), NULL)) AS {}\n").format(desc, field_name)
    else:
      columns_str += (", any_value(if(DESCRIPTION = \"{}\", values[OFFSET(0)], "
                      "NULL)) AS {}\n").format(desc, field_name)

  sql = """
  SELECT
    PATIENT
  {}
  FROM (
    SELECT
      PATIENT,
      DESCRIPTION,
      ARRAY_AGG(VALUE order by DATE desc) as values
    FROM `{}` group by 1,2) group by 1""".format(columns_str, table)
  return sql


def BuildQueryToHorizontalizeBinaryFeatures(prefix, table):
  """Builds a query to horizontalize the table.

  If a patient has no row with a given description the corresponding feature
  will have value 0, otherwise 1.

  Args:
    prefix: The prefix used for the normalized field names.
    table: The name of the table for which the query is built.

  Returns:
    A sql query to horizontalize the table.
  """
  descriptions_to_exclude = set()
  field_name_to_desc = BuildFieldNameToDesc(
      prefix, table, typ=None, descriptions_to_exclude=descriptions_to_exclude)
  columns_str = ""
  for field_name, desc in field_name_to_desc.items():
    columns_str += ", sum(if(DESCRIPTION = \"{}\", 1, 0)) AS {}\n".format(
        desc, field_name)

  sql = """
  SELECT
    PATIENT
  {}
  FROM (
    SELECT
      PATIENT,
      DESCRIPTION
    FROM `{}` group by 1,2) group by 1""".format(columns_str, table)
  return sql


def BuildQueryToExtractHorizontalFeatures(table, columns):
  """Builds a query to extract a given set of features from a horizontal table."""

  features_str = ""
  for col in columns:
    features_str += ", {}".format(col)

  sql = """
  SELECT
    Id as PATIENT
  {}
  FROM `{}`""".format(features_str, table)
  return sql


def BuildLabelQuery(table):
  """Returns the query to build a table with patient id and label columns."""

  sql = """
  SELECT
    PATIENT,
    sum(if(DESCRIPTION="Death Certification", 1, 0)) as LABEL
  FROM `{}` group by 1""".format(table)
  return sql


# Build queries to extract features from patients, observations, and conditions
# tables.

demographics = BuildQueryToExtractHorizontalFeatures(
    table=GetFullTableName("patients"),
    columns=["ethnicity", "gender", "city", "race"])

# Exclude the rows that contain the cause of death; otherwise the model would
# be cheating ;).
numeric_observations = BuildQueryToHorizontalize(
    prefix="obs",
    table=GetFullTableName("observations"),
    typ="numeric",
    descriptions_to_exclude=set(
        ["Cause of Death [US Standard Certificate of Death]"]))

text_observations = BuildQueryToHorizontalize(
    prefix="obs",
    table=GetFullTableName("observations"),
    typ="text",
    descriptions_to_exclude=set(
        ["Cause of Death [US Standard Certificate of Death]"]))

# Conditions are modeled as binary features: the corresponding column is 1 if
# and only if the conditions table contains a row with the matching
# description.
conditions = BuildQueryToHorizontalizeBinaryFeatures(
    prefix="cond", table=GetFullTableName("conditions"))

# Build the query for the label table.
label = BuildLabelQuery(table=GetFullTableName("encounters"))

# Build the main query that uses subqueries to extract the label and to
# horizontalize the observations and conditions tables. The results of the
# subqueries are joined on patient ID.
sql_query = """
  SELECT * FROM ({})
  left join ({}) using (PATIENT)
  left join ({}) using (PATIENT)
  left join ({}) using (PATIENT)
  left join ({}) using (PATIENT)""".format(numeric_observations,
                                           text_observations, conditions,
                                           demographics, label)

job_config = bigquery.QueryJobConfig()

# Set the destination table
table_name = ml_ready_table_name.split(".")[-1]

bq_client = bigquery.Client(project=project)

table_ref = bq_client.dataset(dataset).table(table_name)
job_config.destination = table_ref
job_config.write_disposition = "WRITE_TRUNCATE"

# Start the query, passing in the extra configuration.
query_job = bq_client.query(
    sql_query,
    # Location must match that of the dataset(s) referenced in the query
    # and of the destination table.
    location="US",
    job_config=job_config)  # API request - starts the query

query_job.result()  # Waits for the query to finish
print("Query results loaded to table {}".format(table_ref.path))

Model training

Now that the data is transformed into the ML-ready table, you are ready to train a model. At this point, you have several options, including BigQuery ML, AutoML Tables, and Cloud Machine Learning Engine. This tutorial focuses on the simplest and quickest tool, BigQuery ML (BQML), and uses it to train a linear logistic regression model to predict the probability of death due to medical reasons.

BQML automatically applies the required transformations depending on each variable's data type. For example, STRINGs are transformed into one-hot vectors, and TIMESTAMPs are standardized.

BQML also lets you apply regularization to help with the generalization error. Because the features here were all generated by pivoting every frequent description in the observations and conditions tables, you will use a relatively large L1 regularization coefficient to avoid overfitting.

This step takes ~5 minutes.

Set up the context for bigquery.magics which you will use in the following sections:


In [0]:
# Set the default project for running queries.
bigquery.magics.context.project = project

# Wrap the bigquery magic's query runner so that the {model_name} and
# {ml_ready_table_name} placeholders used in the %%bigquery cells below are
# substituted before each query runs.
sub_dict = dict()
sub_dict["model_name"] = "{}.{}".format(dataset, model_name)
sub_dict["ml_ready_table_name"] = ml_ready_table_name
if globals().get('custom_run_query') is None:
  original_run_query = bigquery.magics._run_query

  def custom_run_query(client, query, job_config=None):
    query = query.format(**sub_dict)
    return original_run_query(client, query, job_config)

  bigquery.magics._run_query = custom_run_query
print('done')

Next, run the following commands to perform the actual model training and evaluation using BigQuery ML.

NOTE: If the following command fails with a permission error, check your Cloud IAM settings and make sure that the Compute Engine default service account (PROJECT_NUMBER-compute@developer.gserviceaccount.com) has a role that includes BigQuery model creation permissions, such as BigQuery Data Editor (roles/bigquery.dataEditor). This only happens if you have intentionally changed the role of the Compute Engine default service account; its default role already includes the required permissions.
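
If you do hit this permission error, the following sketch restores the role. It assumes that your own account is allowed to modify the project's IAM policy.


In [0]:
# Optional: only needed if the training step below fails with a permission
# error. Grants the BigQuery Data Editor role back to the Compute Engine
# default service account.
project_number = !gcloud projects describe $project --format="value(projectNumber)"
service_account = "{}-compute@developer.gserviceaccount.com".format(project_number[0])
!gcloud projects add-iam-policy-binding $project --member="serviceAccount:$service_account" --role="roles/bigquery.dataEditor"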


In [0]:
%%bigquery
# BigQuery ML create model statement:
CREATE OR REPLACE MODEL `{model_name}`
OPTIONS(
  # Use logistic_reg for discrete predictions (classification) and linear_reg
  # for continuous predictions (forecasting).
  model_type = 'logistic_reg',
  early_stop = False,
  max_iterations = 25,
  l1_reg = 2,
  # Identify the column to use as the label.
  input_label_cols = ["LABEL"]
)
AS
SELECT
  *
FROM `{ml_ready_table_name}`

Exploring the results

To see the training metrics, go to the BigQuery dashboard in the Cloud Console and select the project that you are running this tutorial in. You can then find the model, named with the model_name you set during Setup, under the dataset you used.

BigQuery shows you useful plots on the Training and Evaluation tabs. On the Training tab, the training and validation loss are plotted as a function of the iteration number, and you can also see the learning rate used at each iteration.

The Evaluation tab provides useful accuracy metrics such as the F1 score, log loss, and ROC AUC. You should see an AUC of around 0.8.
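
If you prefer to stay in the notebook, you can also retrieve the same evaluation metrics with a query. This sketch relies on BigQuery ML's ML.EVALUATE and on the {model_name} substitution configured earlier.


In [0]:
%%bigquery
# Optional: fetch the evaluation metrics (precision, recall, log loss, ROC AUC,
# and so on) for the model directly from BigQuery ML.
SELECT * FROM ML.EVALUATE(MODEL `{model_name}`)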

Inspecting the weights of the different features

Because you converted all of the data in the conditions and observations tables to features, you might ask whether all of these features are required to train a model. Because BigQuery ML trains linear models, you can answer this question by inspecting the weights learned for the features.

Run the following query to view the top 10 categorical features having the largest weight variance:


In [0]:
%%bigquery
SELECT
  processed_input,
  STDDEV(cws.weight) as stddev_w,
  max(cws.weight) as max_w,
  min(cws.weight) as min_w
from (
  SELECT processed_input, cws
  FROM
      ML.WEIGHTS(MODEL `{model_name}`)
    cross join unnest(category_weights) as cws
)
group by 1
order by 2 desc limit 10

Run the following query to view the top 10 numeric (non-categorical) features by maximum absolute weight:


In [0]:
%%bigquery
SELECT processed_input, max(abs(weight)) as max_abs_weight
FROM ML.WEIGHTS(MODEL `{model_name}`)
group by 1
order by 2 desc limit 10
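
Finally, to tie back to the problem definition, you can score individual patients from the notebook. This is a small sketch using BigQuery ML's ML.PREDICT; the output column names (predicted_LABEL and predicted_LABEL_probs) follow BQML's predicted_<label_column> naming convention for the LABEL column used here.


In [0]:
%%bigquery
# Optional: score every patient with the trained model. predicted_LABEL_probs
# is an array of (label, probability) pairs, one entry per class.
SELECT
  PATIENT,
  predicted_LABEL,
  predicted_LABEL_probs
FROM ML.PREDICT(MODEL `{model_name}`,
                (SELECT * FROM `{ml_ready_table_name}`))
LIMIT 10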

AutoML Tables

If you successfully trained a model using BigQuery ML, your "ml_ready_table" is in good shape. You can now use AutoML Tables to train a neural net, which should perform better than the logistic regression model that you trained using BigQuery ML.

Complete the following steps to enable the AutoML Tables API and view the UI:

  1. Go to the Google Cloud Console.

  2. Select the project you are using for this demo from the project dropdown at the top of the window.

  3. Open the main menu by selecting the "hamburger" icon at the upper left of the window, scroll down to the Artificial Intelligence section near the bottom of the menu, and select Tables.

  4. If this is your first time using Tables on this project, you will be asked to enable the API. This will take several minutes.

  5. After the API is enabled, you will be taken to the dataset screen. Select New Dataset. Give the dataset a name and select Create Dataset.

  6. You will then be taken to the AutoML Tables GUI Import tab.

Import "ml_ready_table" to AutoML Tables

To import the "ml_ready_table" table to Tables, complete the following steps:

  1. Make sure that Import data from BigQuery is selected.

  2. Enter your project ID, dataset ID, and table ID for this tutorial. Pay close attention to the use of underscores (_) and dashes (-) when entering these values.

NOTE: These are the project, dataset, and output_table values that you entered at the beginning of this tutorial.

  3. Select IMPORT.

If you get an error, carefully check your project, dataset and table names. Importing will take several minutes. You will receive an email when importing is complete.

TRAIN

After you finish with the import, you will end up on the TRAIN page.

Here, you need to use the dropdown to pick the Target Column, which for your table is called LABEL.

On this tab, you can also review statistics about your input columns, including the distribution of values and the percentage of missing or invalid values.

When you are done reviewing the data statistics, click the TRAIN MODEL button and assign a training budget. For this tutorial, two hours of training is adequate, but three hours might give you better results. (Training costs approximately $20 per hour.)

Then click TRAIN MODEL again to start training.

Training will take a little more than the budgeted training time. You will receive an email when training is complete.

EVALUATE

After training is done (again, you will receive an email), follow the link in the email to go to the EVALUATE tab, where you can review the evaluation metrics for the trained model.

CLEAN UP

Because the BigQuery tables and models created by this tutorial are stored in your GCP project, you will be billed for storage on an ongoing basis. After you have finished exploring these assets, delete them to avoid recurring charges. You can do this by deleting the resources individually or by deleting the entire project, which deletes all of the underlying resources.
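
If you prefer to clean up from this notebook rather than the console, the following sketch deletes the BQML model and the whole dataset (including the imported Synthea tables) using the BigQuery client library; the not_found_ok flag assumes a reasonably recent google-cloud-bigquery version. Any AutoML Tables dataset and model you created can be deleted from the Tables UI.


In [0]:
# Delete the trained BQML model, then the dataset and all tables in it.
# This also removes the imported Synthea data, so only run it when you are
# completely done with the tutorial.
bq_client = bigquery.Client(project=project)
bq_client.delete_model("{}.{}.{}".format(project, dataset, model_name),
                       not_found_ok=True)
bq_client.delete_dataset(dataset, delete_contents=True, not_found_ok=True)
print("Deleted model and dataset {}.".format(dataset))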