Introduction

This tutorial shows how to generate data using the Synthea generator and then upload the data into BigQuery. This is a prerequisite for other tutorials that work with the Synthea dataset.

Synthea

Synthea is a data generator that simulates the lives of patients using a set of medical modules. Each module models a different medical condition based on real-world statistics. Each patient in the Synthea dataset eventually dies, either from a modeled medical condition or from a random non-medical event that the generator does not model.

Requirements

To run this tutorial you will need a GCP project with a billing account.

Costs

There is a small cost associated with importing the dataset and storing it in BigQuery.

Setup

NOTE: At present, this demo works only on Colab. To run the demo, upload the notebook into your Colab environment. (The first step, generating data using Synthea, requires the Java SDK, which is available by default in Colab but not in Cloud Datalab or ML notebooks.)

First, you need to sign into your Google account to access Google Cloud Platform (GCP).

Authentication: Run the following commands, click the link that displays, and follow the instructions to authenticate. Paste the key that you copy from the browser into the results box.

NOTE: You will need to repeat this step each time you reconnect to the notebook server.


In [0]:
from google.colab import auth

# Launch the OAuth flow and store Application Default Credentials.
auth.authenticate_user()

# _check_adc() is a private helper; it returns the credentials if
# Application Default Credentials were set up correctly.
credentials = auth._check_adc()
print(credentials)

Library Imports:

NOTE: You will need to repeat this step each time you reconnect to the notebook server.


In [0]:
from google.cloud import bigquery
from google.cloud import storage

Setup:

Enter the name of your GCP project and the name of a staging bucket in Cloud Storage. The staging bucket is created if it does not already exist. The dataset name is supplied for you.

NOTE: You will need to repeat this step each time you reconnect to the notebook server.


In [0]:
project = "" #@param {type:"string"}
if not project:
  raise Exception("Project is empty.")

!gcloud config set project $project


dataset = "SYNMASS_2k" #@param {type:"string"}

staging_bucket_name = "" #@param {type:"string"}


if not staging_bucket_name:
  raise Exception("Staging bucket name is empty.")

# Accept either a bare bucket name or a full gs:// path; derive both forms.
if staging_bucket_name.startswith("gs://"):
  staging_bucket_path = staging_bucket_name
  staging_bucket_name = staging_bucket_path[5:]
else:
  staging_bucket_path = "gs://" + staging_bucket_name

# Create the staging bucket if it doesn't exist.
storage_client = storage.Client(project)
if storage_client.lookup_bucket(staging_bucket_name) is None:
  bucket = storage_client.create_bucket(staging_bucket_name)
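
If you want to confirm that the staging bucket is reachable before continuing, you can fetch it explicitly. This is a minimal sketch using the same storage_client from the cell above; get_bucket raises NotFound if the bucket does not exist or you lack access.


In [0]:
from google.cloud import exceptions

# Verify that the staging bucket exists and is accessible.
try:
  bucket = storage_client.get_bucket(staging_bucket_name)
  print("Staging bucket ready:", bucket.name)
except exceptions.NotFound:
  print("Bucket not found; check the name and your permissions.")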

Generate the Synthea data

This section explains how to generate synthetic data and import it into BigQuery. You only need to complete this step once; you do not need to repeat it if you restart or reconnect to the notebook.

First, clone the Synthea generator from GitHub, and then build it using Gradle.

This step takes two to three minutes.

You'll know that the build has finished successfully when the output contains BUILD SUCCESSFUL. If you encounter any errors about missing JavaDoc comments, you can safely ignore them.


In [0]:
# Clone the Synthea code and pin it to a known-good revision.
!git clone https://github.com/synthetichealth/synthea.git
%cd ./synthea
!git checkout 56032e01bd2afb154dd94f62ae836459ee7821c9
# Compile the code, skipping tests. This will take ~2 minutes.
!./gradlew build -x test

Generate the data

In this step, you generate clinical data for 2,000 patients. Synthea supports multiple output formats, including FHIR and CSV. This tutorial uses the CSV format.

NOTE: This step takes ~8 minutes to run.


In [0]:
%%bash
time ./run_synthea Massachusetts -p 2000 -s 123 --exporter.csv.export=true > data_generation.log 2> error.log
echo "done"

Export the data to BigQuery

Run the following commands to create a BigQuery dataset and then import the CSV files into it. A Dataflow job detects the underlying table schemas and imports the data into BigQuery.

NOTE: The output might contain a list of all of your projects. However, the project set by the $project variable will automatically be selected at the end of the operation, so you don't need to enter anything.


In [0]:
%%bash -s "$project" "$dataset"

# This step is only needed if the dataset does not exist.
bq mk --dataset $1:$2
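
You can confirm from Python that the dataset now exists. A minimal sketch using the BigQuery client library imported earlier; get_dataset raises NotFound if the dataset is missing.


In [0]:
# Fetch the dataset's metadata; this raises NotFound if it doesn't exist.
bq_client = bigquery.Client(project=project)
ds = bq_client.get_dataset(dataset)
print("Dataset ready:", ds.full_dataset_id)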

Run the following commands to:

  1. Make sure Java 8 is used, because later versions are not yet supported by Cloud Dataflow.
  2. Clone the data importer code into the notebook environment.
  3. Compress the generated CSV files and copy them to Google Cloud Storage.

In [0]:
%%bash -s "$staging_bucket_path"
update-java-alternatives -s java-1.8.0-openjdk-amd64 
git clone https://github.com/GoogleCloudPlatform/bigquery-data-importer.git
tar --create --gzip --file synmass.tar.gz output/csv
gsutil cp synmass.tar.gz "$1"

Run the data importer pipeline. This step takes ~11 minutes; you can monitor the progress of the job via the Cloud Dataflow dashboard (https://console.cloud.google.com/dataflow). Before running this command, use the Cloud Platform Console (https://console.developers.google.com/apis/api/dataflow.googleapis.com/overview) to enable the Dataflow API.


In [0]:
%cd bigquery-data-importer

In [0]:
%%bash -s "$project" "$dataset" "$staging_bucket_path" "$staging_bucket_name"

./gradlew run --stacktrace -PappArgs="[\
'--gcp_project_id', '${1}',\
'--gcs_uri', '${3}/synmass.tar.gz',\
'--bq_dataset', '${2}',\
'--temp_bucket', '${4}',\
'--verbose', 'true'\
]"

Examine the Synthea data in BigQuery

By this point, you have installed Synthea, used it to synthesize data for 2,000 patients, and imported the resulting CSV files into BigQuery. To explore the data, complete the following steps (or query it directly from the notebook, as shown after the list):

  1. Go to the Cloud Console.
  2. Select the project under which you are running this tutorial.
  3. Using the "hamburger" menu on the upper left, scroll down to the "Big Data" section and select BigQuery.
  4. A list of projects under which you have BigQuery datasets displays. Select the tutorial project (again).
  5. A dataset displays under the tutorial project. The default name of the dataset is SYNMASS_2k, but if you used a different value, that value appears instead.
  6. Click on the dataset. A list of tables appears: allergies, careplans, conditions, encounters, imaging_studies, immunizations, medications, observations, organizations, patients, and providers.

  7. Select any of the tables and then explore it using the schema and preview tabs.
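
You can also explore the tables with SQL directly from the notebook. The following is a minimal sketch, assuming the importer kept Synthea's CSV column headers (e.g. GENDER); adjust the column names if your schema differs.


In [0]:
# Count patients by gender as a quick sanity check of the import.
query = f"""
  SELECT GENDER, COUNT(*) AS n
  FROM `{project}.{dataset}.patients`
  GROUP BY GENDER
"""
bq_client = bigquery.Client(project=project)
for row in bq_client.query(query).result():
  print(row.GENDER, row.n)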