This tutorial shows how to generate data using the Synthea generator and then upload the data into BigQuery. This is a prerequisite for other tutorials that work with the Synthea dataset.
Synthea
Synthea is a data generator that simulates the lives of patients using a collection of medical modules. Each module models a different medical condition based on real-world statistics. Every patient in a Synthea dataset eventually dies, either of a modeled medical cause or of a random non-medical event that the generator does not model.
To run this tutorial you will need a GCP project with a billing account.
There is a small cost associated with importing the dataset and storing it in BigQuery.
NOTE: At present, this demo works only on Colab. To run the demo, open Colab (https://colab.research.google.com) and upload the notebook into your environment. (The first step, generating data using Synthea, requires a Java JDK, which is available by default in Colab but not in Cloud Datalab or ML notebooks.)
First, you need to sign into your Google account to access Google Cloud Platform (GCP).
Authentication
Run the following commands, click the link that appears, and follow the instructions to authenticate. Scroll to the results box to see where to paste the key that you copy from the browser.
NOTE: You will need to repeat this step each time you reconnect to the notebook server.
In [0]:
from google.colab import auth

# Trigger the OAuth flow; paste the verification code when prompted.
auth.authenticate_user()

# Confirm that Application Default Credentials were written.
credentials = auth._check_adc()
print(credentials)
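Optionally, you can verify that authentication succeeded before continuing. This check is not part of the original tutorial; it simply asks the google-auth library for the Application Default Credentials that the cell above created.
In [0]:
# Optional sanity check: confirm Application Default Credentials exist.
import google.auth

creds, adc_project = google.auth.default()
print("Credential type:", type(creds).__name__)
print("Project from ADC:", adc_project)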
Library Imports:
NOTE: You will need to repeat this step each time you reconnect to the notebook server.
In [0]:
from google.cloud import bigquery
from google.cloud import storage
Setup:
Enter the name of your GCP project and the name of a staging bucket in Cloud Storage. The staging bucket is created if it does not already exist. A default dataset name is supplied for you.
NOTE: You will need to repeat this step each time you reconnect to the notebook server.
In [0]:
project = "" #@param {type:"string"}
if not project:
raise Exception("Project is empty.")
!gcloud config set project $project
dataset = "SYNMASS_2k" #@param {type:"string"}
staging_bucket_name = "" #@param {type:"string"}
if not staging_bucket_name:
raise Exception("Staging bucket name is empty.")
if staging_bucket_name.startswith("gs://"):
staging_bucket_path = staging_bucket_name
staging_bucket_name = staging_bucket_path[5:]
else:
staging_bucket_path = "gs://" + staging_bucket_name
# Create the staging bucket if it doesn't exist.
storage_client = storage.Client(project)
if storage_client.lookup_bucket(staging_bucket_name) is None:
bucket = storage_client.create_bucket(staging_bucket_name)
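As a quick, optional check (not part of the original tutorial), you can confirm that the staging bucket is now reachable using the storage client created above; get_bucket raises NotFound if the bucket is missing.
In [0]:
# Optional: confirm the staging bucket exists and is reachable.
bucket = storage_client.get_bucket(staging_bucket_name)
print("Staging bucket ready:", bucket.name)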
This section explains how to generate synthetic data and import it into BigQuery. You only need to complete this step once; you do not need to repeat it after restarting or reconnecting to the notebook.
First, clone the Synthea generator from GitHub, and then build it using Gradle.
This step takes two to three minutes.
You'll know that the build has finished successfully when the output contains BUILD SUCCESSFUL. If you encounter any errors about missing JavaDoc comments, you can safely ignore them.
In [0]:
# Clone the Synthea code.
!git clone https://github.com/synthetichealth/synthea.git
%cd ./synthea
# Pin a known-good commit so the tutorial is reproducible.
!git checkout 56032e01bd2afb154dd94f62ae836459ee7821c9
# Compile the code, skipping the test suite. This takes ~2 minutes.
!./gradlew build -x test
Generate the data
In this step, you generate clinical data for 2,000 patients. Synthea supports multiple output formats including FHIR and CSV. This tutorial uses the CSV format.
NOTE: This step takes ~8 minutes to run.
In [0]:
%%bash
# Generate 2,000 patients (-p) in Massachusetts with a fixed random seed (-s)
# and enable the CSV exporter. stdout goes to data_generation.log and
# stderr to error.log.
time ./run_synthea Massachusetts -p 2000 -s 123 --exporter.csv.export=true > data_generation.log 2> error.log
echo "done"
Export the data to BigQuery
Run the following commands to create a BigQuery dataset and then import the CSV files into it. The importer launches a Dataflow job that detects each table's schema and loads the data into BigQuery.
NOTE: The output might contain a list of all of your projects. However, the project set by the $project variable is automatically selected at the end of the operation, so you don't need to enter anything.
In [0]:
%%bash -s "$project" "$dataset"
# This step is only needed if the dataset does not exist.
bq mk --dataset $1:$2
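If you prefer to stay in Python, the dataset-creation step can also be done with the BigQuery client library imported earlier. This is an optional, equivalent sketch rather than the tutorial's own method; the exists_ok flag requires a reasonably recent version of google-cloud-bigquery.
In [0]:
# Optional Python equivalent of `bq mk --dataset` (idempotent).
bq_client = bigquery.Client(project=project)
bq_client.create_dataset(bigquery.Dataset(f"{project}.{dataset}"), exists_ok=True)
print("Dataset ready:", dataset)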
Run the following commands to switch the default Java to Java 8 (which the importer's build expects), clone the BigQuery data importer from GitHub, compress the generated CSV files into a tarball, and copy the tarball to the staging bucket.
In [0]:
%%bash -s "$staging_bucket_path"
update-java-alternatives -s java-1.8.0-openjdk-amd64
git clone https://github.com/GoogleCloudPlatform/bigquery-data-importer.git
tar --create --gzip --file synmass.tar.gz output/csv
gsutil cp synmass.tar.gz "$1"
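Optionally (not part of the original tutorial), confirm that the archive reached the staging bucket before starting the pipeline, using the storage client from the Setup step:
In [0]:
# Optional: verify the upload landed in the staging bucket.
blob = storage_client.bucket(staging_bucket_name).blob("synmass.tar.gz")
print("Archive in staging bucket:", blob.exists())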
Run the data importer pipeline. This step takes ~11 minutes; you can monitor the progress of the job in the Cloud Dataflow dashboard (https://console.cloud.google.com/dataflow). Before running this command, enable the Dataflow API in the Cloud Platform Console: https://console.developers.google.com/apis/api/dataflow.googleapis.com/overview.
In [0]:
%cd bigquery-data-importer
In [0]:
%%bash -s "$project" "$dataset" "$staging_bucket_path" "$staging_bucket_name"
./gradlew run --stacktrace -PappArgs="[\
'--gcp_project_id', '${1}',\
'--gcs_uri', '${3}/synmass.tar.gz',\
'--bq_dataset', '${2}',\
'--temp_bucket', '${4}',\
'--verbose', 'true'
]"
By this point, you have installed Synthea, used it to synthesize data for 2,000 patients, and imported the resulting CSV files into BigQuery. To explore the data, open the BigQuery web UI (https://console.cloud.google.com/bigquery), select your project, and complete the following steps:
Click on the dataset. A list of tables appears:
allergies
careplans
conditions
encounters
imaging_studies
immunizations
medications
observations
organizations
patients
providers
Select any of the tables and then explore it using the Schema and Preview tabs.
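As a starting point for exploration, here is an optional sample query you can run from the notebook. The gender column name is an assumption based on Synthea's CSV schema; check the table's Schema tab and adjust the query if the importer used a different name or casing.
In [0]:
# Optional sample query: count patients by gender.
# NOTE: the `gender` column name is assumed from Synthea's CSV schema.
query = f"""
SELECT gender AS patient_gender, COUNT(*) AS patient_count
FROM `{project}.{dataset}.patients`
GROUP BY gender
ORDER BY patient_count DESC
"""
for row in bigquery.Client(project=project).query(query).result():
  print(row.patient_gender, row.patient_count)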