Copyright 2018 Google Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This tutorial is for educational purposes only and is not intended for use in clinical diagnosis, clinical decision-making, or any other clinical use.

Training/Inference on Breast Density Classification Model on AutoML Vision

The goal of this tutorial is to train, deploy, and run inference on a breast density classification model. Breast density is thought to be a factor in the risk of developing breast cancer. The tutorial emphasizes using the Cloud Healthcare API to store, retrieve, and transcode medical images (in DICOM format) in a managed and scalable way, and it focuses on using Cloud AutoML Vision to train and serve the model at scale.

Note: This is the AutoML version of the Cloud ML Engine Codelab found here.

Requirements

Notebook dependencies

We will need to install the hcls_imaging_ml_toolkit package found here. This toolkit makes it easier to work with DICOM objects and the Cloud Healthcare API. In addition, we will install dicomweb-client to interact with the DICOMweb API and pydicom to help us construct DICOM objects.


In [ ]:
%%bash

pip3 install git+https://github.com/GoogleCloudPlatform/healthcare.git#subdirectory=imaging/ml/toolkit
pip3 install dicomweb-client
pip3 install pydicom

Input Dataset

The dataset that will be used for training is the TCIA CBIS-DDSM dataset. This dataset contains ~2500 mammography images in DICOM format. Each image is given a BI-RADS breast density score from 1 to 4. In this tutorial, we will build a binary classifier that distinguishes between breast density "2" (scattered density) and "3" (heterogeneously dense). These are the two most common, and most variably assigned, scores, and the literature suggests they are particularly difficult for radiologists to distinguish consistently.


In [0]:
project_id = "MY_PROJECT" # @param
location = "us-central1"
dataset_id = "MY_DATASET" # @param
dicom_store_id = "MY_DICOM_STORE" # @param

# Input data used by AutoML must be in a bucket with the following format.
automl_bucket_name = "gs://" + project_id + "-vcm"

In [0]:
%%bash -s {project_id} {location} {automl_bucket_name}
# Create bucket.
gsutil -q mb -c regional -l $2 $3

# Allow Cloud Healthcare API to write to bucket.
PROJECT_NUMBER=`gcloud projects describe $1 | grep projectNumber | sed 's/[^0-9]//g'`
SERVICE_ACCOUNT="service-${PROJECT_NUMBER}@gcp-sa-healthcare.iam.gserviceaccount.com"
COMPUTE_ENGINE_SERVICE_ACCOUNT="${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"

gsutil -q iam ch serviceAccount:${SERVICE_ACCOUNT}:objectAdmin $3
gsutil -q iam ch serviceAccount:${COMPUTE_ENGINE_SERVICE_ACCOUNT}:objectAdmin $3
gcloud projects add-iam-policy-binding $1 --member=serviceAccount:${SERVICE_ACCOUNT} --role=roles/pubsub.publisher
gcloud projects add-iam-policy-binding $1 --member=serviceAccount:${COMPUTE_ENGINE_SERVICE_ACCOUNT} --role roles/pubsub.admin
# Allow compute service account to create datasets and dicomStores.
gcloud projects add-iam-policy-binding $1 --member=serviceAccount:${COMPUTE_ENGINE_SERVICE_ACCOUNT} --role roles/healthcare.dicomStoreAdmin
gcloud projects add-iam-policy-binding $1 --member=serviceAccount:${COMPUTE_ENGINE_SERVICE_ACCOUNT} --role roles/healthcare.datasetAdmin

In [0]:
import json
import os
import google.auth
from google.auth.transport.requests import AuthorizedSession
from hcls_imaging_ml_toolkit import dicom_path

credentials, project = google.auth.default()
authed_session = AuthorizedSession(credentials)
# Path to Cloud Healthcare API.
HEALTHCARE_API_URL = 'https://healthcare.googleapis.com/v1'

# Create Cloud Healthcare API dataset.
path = os.path.join(HEALTHCARE_API_URL, 'projects', project_id, 'locations', location, 'datasets?dataset_id=' + dataset_id)
headers = {'Content-Type': 'application/json'}
resp = authed_session.post(path, headers=headers)

assert resp.status_code == 200, 'error creating Dataset, code: {0}, response: {1}'.format(resp.status_code, resp.text)
print('Full response:\n{0}'.format(resp.text))

# Create Cloud Healthcare API DICOM store.
path = os.path.join(HEALTHCARE_API_URL, 'projects', project_id, 'locations', location, 'datasets', dataset_id, 'dicomStores?dicom_store_id=' + dicom_store_id)
resp = authed_session.post(path, headers=headers)
assert resp.status_code == 200, 'error creating DICOM store, code: {0}, response: {1}'.format(resp.status_code, resp.text)
print('Full response:\n{0}'.format(resp.text))
dicom_store_path = dicom_path.Path(project_id, location, dataset_id, dicom_store_id)

Next, we are going to transfer the DICOM instances to the Cloud Healthcare API.

Note: We are transferring >100GB of data, so this will take some time to complete.


In [0]:
# Store DICOM instances in Cloud Healthcare API.
path = 'https://healthcare.googleapis.com/v1/{}:import'.format(dicom_store_path)
headers = {'Content-Type': 'application/json'}
body = { 
      'gcsSource': {
        'uri': 'gs://gcs-public-data--healthcare-tcia-cbis-ddsm/dicom/**'
      }
}
resp = authed_session.post(path, headers=headers, json=body)
assert resp.status_code == 200, 'error importing DICOM instances, code: {0}, response: {1}'.format(resp.status_code, resp.text)
print('Full response:\n{0}'.format(resp.text))
response = json.loads(resp.text)
operation_name = response['name']

In [0]:
import time

def wait_for_operation_completion(path, timeout, sleep_time=30): 
  success = False
  while time.time() < timeout:
    print('Waiting for operation completion...')
    resp = authed_session.get(path)
    assert resp.status_code == 200, 'error polling for Operation results, code: {0}, response: {1}'.format(resp.status_code, resp.text)
    response = json.loads(resp.text)
    if 'done' in response:
      if response['done'] and 'error' not in response:
        success = True
      break
    time.sleep(sleep_time)

  print('Full response:\n{0}'.format(resp.text))      
  assert success, "operation did not complete successfully in time limit"
  print('Success!')
  return response

In [0]:
path = os.path.join(HEALTHCARE_API_URL, operation_name)
timeout = time.time() + 40*60 # Wait up to 40 minutes.
_ = wait_for_operation_completion(path, timeout)

Explore the Cloud Healthcare DICOM dataset (optional)

This is an optional section to explore the Cloud Healthcare DICOM dataset. In the following code, we simply list the studies that were loaded into the Cloud Healthcare API. You can modify the num_of_studies_to_print parameter to print as many studies as desired.


In [0]:
num_of_studies_to_print = 2 # @param


path = os.path.join(HEALTHCARE_API_URL, dicom_store_path.dicomweb_path_str, 'studies')
resp = authed_session.get(path)
assert resp.status_code == 200, 'error querying studies, code: {0}, response: {1}'.format(resp.status_code, resp.text)
response = json.loads(resp.text)

print(json.dumps(response[:num_of_studies_to_print], indent=2))

Convert DICOM to JPEG

The ML model that we will build requires the input dataset to be in JPEG format. We will leverage the Cloud Healthcare API to transcode DICOM to JPEG.

First, we will pick a folder in the Google Cloud Storage bucket created earlier to hold the output JPEG files. Next, we will use the ExportDicomData API to transcode the DICOM instances to JPEG.


In [0]:
# Folder to store input images for AutoML Vision.
jpeg_folder = automl_bucket_name + "/images/"

Next, we will convert the DICOM instances to JPEGs using ExportDicomData.


In [0]:
%%bash -s {jpeg_folder} {project_id} {location} {dataset_id} {dicom_store_id}
gcloud beta healthcare --project $2  dicom-stores export gcs $5 --location=$3 --dataset=$4 --mime-type="image/jpeg; transfer-syntax=1.2.840.10008.1.2.4.50" --gcs-uri-prefix=$1

Meanwhile, you should be able to observe the JPEG images being added to your Google Cloud Storage bucket.
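
As an optional spot check, the following cell lists a few of the exported JPEG files. This is just a convenience listing; the export runs asynchronously, so the output may be partial until the operation completes.


In [0]:
%%bash -s {jpeg_folder}
# List a handful of the exported JPEG files (the listing may be partial while the
# export operation is still running).
gsutil ls -r $1 | head -n 10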

Next, we will join the training data stored in Google Cloud Storage with the labels from the TCIA website. The output of this step is a CSV file that is input to AutoML. The CSV contains a list of (IMAGE_PATH, LABEL) pairs.


In [0]:
# tensorflow==1.15.0 to have same versions in all environments - dataflow, automl, ai-platform
!pip install tensorflow==1.15.0 --ignore-installed
# CSV to hold (IMAGE_PATH, LABEL) list.
input_data_csv = automl_bucket_name + "/input.csv"

import csv
import os
import re
from tensorflow.python.lib.io import file_io
import scripts.tcia_utils as tcia_utils

# Get map of study_uid -> file paths.
path_list = file_io.get_matching_files(os.path.join(jpeg_folder, '*/*/*'))
study_uid_to_file_paths = {}
pattern = r'^{0}(?P<study_uid>[^/]+)/(?P<series_uid>[^/]+)/(?P<instance_uid>.*)'.format(jpeg_folder)
for path in path_list:
  match = re.search(pattern, path)
  study_uid_to_file_paths[match.group('study_uid')] = path

# Get map of study_uid -> labels.
study_uid_to_labels = tcia_utils.GetStudyUIDToLabelMap()

# Join the two maps, output results to CSV in Google Cloud Storage.
with file_io.FileIO(input_data_csv, 'w') as f:
  writer = csv.writer(f, delimiter=',')
  for study_uid, label in study_uid_to_labels.items():
    if study_uid in study_uid_to_file_paths:
      writer.writerow([study_uid_to_file_paths[study_uid], label])
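
As an optional sanity check, the cell below prints the first few rows of the generated CSV so you can confirm the (IMAGE_PATH, LABEL) format. It assumes the join above produced at least a few rows.


In [0]:
# Print the first few (IMAGE_PATH, LABEL) rows written to the CSV.
with file_io.FileIO(input_data_csv, 'r') as f:
  for line in f.readlines()[:5]:
    print(line.strip())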

Training

This section will focus on using AutoML through its API. AutoML can also be used through the user interface found here. All of the steps in this section can also be done through the web UI.

We will use AutoML Vision to train the classification model. AutoML provides a fully managed solution for training the model; all we need to supply is the list of input images and labels. The trained AutoML model will classify the mammography images as either "2" (scattered density) or "3" (heterogeneously dense).

As a first step, we will create an AutoML dataset.


In [0]:
automl_dataset_display_name = "MY_AUTOML_DATASET" # @param

In [0]:
import json
import os

# Path to AutoML API.
AUTOML_API_URL = 'https://automl.googleapis.com/v1beta1'

# Path to request creation of AutoML dataset.
path = os.path.join(AUTOML_API_URL, 'projects', project_id, 'locations', location, 'datasets')

# Headers (request in JSON format).
headers = {'Content-Type': 'application/json'}

# Body (encoded in JSON format).
config = {'display_name': automl_dataset_display_name, 'image_classification_dataset_metadata': {'classification_type': 'MULTICLASS'}}

resp = authed_session.post(path, headers=headers, json=config)
assert resp.status_code == 200, 'error creating AutoML dataset, code: {0}, response: {1}'.format(resp.status_code, resp.text)
print('Full response:\n{0}'.format(resp.text))

# Record the AutoML dataset name.
response = json.loads(resp.text)
automl_dataset_name = response['name']

Next, we will import the CSV containing the (IMAGE_PATH, LABEL) pairs into AutoML. Please ignore errors regarding an existing ground truth.


In [0]:
# Path to request import into AutoML dataset.
path = os.path.join(AUTOML_API_URL, automl_dataset_name + ':importData')

# Body (encoded in JSON format).
config = {'input_config': {'gcs_source': {'input_uris': [input_data_csv]}}} 

resp = authed_session.post(path, headers=headers, json=config)
assert resp.status_code == 200, 'error importing AutoML dataset, code: {0}, response: {1}'.format(resp.status_code, resp.text)
print('Full response:\n{0}'.format(resp.text))

# Record operation_name so we can poll for it later.
response = json.loads(resp.text)
operation_name = response['name']

The output of the previous step is an operation whose status we will need to poll. We will poll until the operation's "done" field is set to true. This will take a few minutes, so we will wait for it to complete.


In [0]:
path = os.path.join(AUTOML_API_URL, operation_name)
timeout = time.time() + 40*60 # Wait up to 40 minutes.
_ = wait_for_operation_completion(path, timeout)

Next, we will train the model to perform classification. We will set the training budget to a maximum of 1 hour (this can be modified below). The cost of using AutoML can be found here. Typically, the longer the model is trained, the more accurate it will be.


In [0]:
# Name of the model.
model_display_name = "MY_MODEL_NAME" # @param

# Training budget (1 hr).
training_budget = 1 # @param

In [0]:
# Path to request creation of the AutoML model.
path = os.path.join(AUTOML_API_URL, 'projects', project_id, 'locations', location, 'models')

# Headers (request in JSON format).
headers = {'Content-Type': 'application/json'}

# Body (encoded in JSON format).
automl_dataset_id = automl_dataset_name.split('/')[-1]
config = {'display_name': model_display_name, 'dataset_id': automl_dataset_id, 'image_classification_model_metadata': {'train_budget': training_budget}}

resp = authed_session.post(path, headers=headers, json=config)
assert resp.status_code == 200, 'error creating AutoML model, code: {0}, response: {1}'.format(resp.status_code, resp.text)
print('Full response:\n{0}'.format(resp.text))

# Record operation_name so we can poll for it later.
response = json.loads(resp.text)
operation_name = response['name']

The output of the previous step is also an operation whose status we will need to poll. We will poll until the operation's "done" field is set to true. This will take a few minutes to complete.


In [0]:
path = os.path.join(AUTOML_API_URL, operation_name)
timeout = time.time() + 40*60 # Wait up to 40 minutes.
sleep_time = 5*60 # Poll every 5 minutes.
response = wait_for_operation_completion(path, timeout, sleep_time)
full_model_name = response['response']['name']

Before the model can serve online predictions, it needs to be deployed. The following cell installs the google-cloud-automl client library and deploys the trained model.


In [0]:
# Install google-cloud-automl to make API calls to Cloud AutoML.
!pip install google-cloud-automl
from google.cloud import automl_v1
client = automl_v1.AutoMlClient()
response = client.deploy_model(name=full_model_name)
print(u'Model deployment finished. {}'.format(response.result()))

Next, we will check the accuracy metrics of the trained model. The following request returns the AUC (ROC), precision, and recall of the model at various classification thresholds.


In [0]:
# Path to request to get model accuracy metrics.
path = os.path.join(AUTOML_API_URL, full_model_name,  'modelEvaluations')

resp = authed_session.get(path)
assert resp.status_code == 200, 'error getting AutoML model evaluations, code: {0}, response: {1}'.format(resp.status_code, resp.text)
print('Full response:\n{0}'.format(resp.text))
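
Optionally, the cell below pulls a few headline numbers out of the evaluation response. This is a minimal sketch that assumes the v1beta1 response shape, i.e. a top-level "modelEvaluation" list whose entries carry "classificationEvaluationMetrics"; it uses defensive lookups in case some fields are absent.


In [0]:
# Summarize the evaluation response: overall AUC (ROC) plus precision/recall at a
# few confidence thresholds. Field names assume the AutoML v1beta1 response shape.
evaluations = json.loads(resp.text).get('modelEvaluation', [])
for evaluation in evaluations:
  metrics = evaluation.get('classificationEvaluationMetrics')
  if not metrics:
    continue
  print('Evaluation for label: {}'.format(evaluation.get('displayName', '<all labels>')))
  print('  AUC (ROC): {}'.format(metrics.get('auRoc')))
  # Precision and recall are reported per confidence threshold.
  for entry in metrics.get('confidenceMetricsEntry', [])[:3]:
    print('  threshold={} precision={} recall={}'.format(
        entry.get('confidenceThreshold'), entry.get('precision'), entry.get('recall')))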

Inference

To allow medical imaging ML models to be easily integrated into clinical workflows, an inference module can be used. A standalone modality, a PACS system, or a DICOM router can push DICOM instances into Cloud Healthcare DICOM stores, triggering ML models to run inference. The inference results can then be structured into various DICOM formats (e.g. DICOM structured reports) and stored in the Cloud Healthcare API, from which they can be retrieved by the customer.

The inference module is built as a Docker container and deployed using Kubernetes, allowing you to easily scale your deployment. The dataflow for inference is as follows (a minimal code sketch of steps 2-5 appears after the list):

  1. Client application uses STOW-RS to push a new DICOM instance to the Cloud Healthcare DICOMWeb API.

  2. The insertion of the DICOM instance triggers a Cloud Pubsub message to be published. The inference module pulls incoming Pubsub messages and receives a message for the newly inserted DICOM instance.

  3. The inference module will retrieve the instance in JPEG format from the Cloud Healthcare API using WADO-RS.

  4. The inference module will send the JPEG bytes to the model hosted on AutoML.

  5. AutoML will return the prediction back to the inference module.

  6. The inference module will package the prediction into a DICOM instance. This could be a DICOM structured report, a presentation state, or even text burned into the image. In this codelab, we will focus on DICOM structured reports. The structured report is then stored back in the Cloud Healthcare API using STOW-RS.

  7. The client application can query for (or retrieve) the structured report by using QIDO-RS or WADO-RS. Pubsub can also be used by the client application to poll for the newly created DICOM structured report instance.
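
To make the dataflow above concrete, here is a minimal, illustrative sketch of one pass through steps 2-5. It is not the actual inference module (that code lives in scripts/inference/inference.py); the subscription and model names below are placeholders, the Pubsub message data is assumed to be the resource path of the new instance, and the structured-report packaging and upload (steps 6-7) are omitted.


In [0]:
# Illustrative sketch only -- the real implementation is in scripts/inference/inference.py.
from google.cloud import automl_v1
from google.cloud import pubsub_v1
from dicomweb_client.api import DICOMwebClient
from dicomweb_client import session_utils

# Placeholders -- substitute your own subscription and model names.
subscription = 'projects/MY_PROJECT/subscriptions/MY_PUBSUB_SUBSCRIPTION_ID'
model_name = 'projects/MY_PROJECT/locations/us-central1/models/MY_MODEL_ID'

# Step 2: pull one Pubsub notification. For Cloud Healthcare DICOM stores, the
# message data is the resource path of the newly stored instance.
subscriber = pubsub_v1.SubscriberClient()
pull_response = subscriber.pull(subscription=subscription, max_messages=1)
message = pull_response.received_messages[0]
resource_path = message.message.data.decode('utf-8')
subscriber.acknowledge(subscription=subscription, ack_ids=[message.ack_id])

# The resource path looks like:
#   projects/.../dicomStores/.../dicomWeb/studies/S/series/E/instances/I
store_path, instance_part = resource_path.split('/dicomWeb/')
study_uid, series_uid, instance_uid = instance_part.split('/')[1::2]

# Step 3: retrieve the instance as JPEG bytes using WADO-RS.
session = session_utils.create_session_from_gcp_credentials()
client = DICOMwebClient(
    'https://healthcare.googleapis.com/v1/' + store_path + '/dicomWeb', session)
jpeg_bytes = client.retrieve_instance_rendered(
    study_uid, series_uid, instance_uid, media_types=('image/jpeg',))

# Steps 4-5: send the JPEG bytes to the AutoML model and read back the prediction.
prediction_client = automl_v1.PredictionServiceClient()
payload = {'image': {'image_bytes': jpeg_bytes}}
print(prediction_client.predict(name=model_name, payload=payload))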

To begin, we will create a new DICOM store that will hold our inference source (the DICOM mammography instance) and results (the DICOM structured report). To enable Pubsub notifications to be triggered for inserted instances, we will configure the DICOM store with a Pubsub topic to publish to.


In [0]:
# Pubsub config.
pubsub_topic_id = "MY_PUBSUB_TOPIC_ID" # @param
pubsub_subscription_id = "MY_PUBSUB_SUBSRIPTION_ID" # @param

# DICOM store to hold the DICOM instances used for inference.
inference_dicom_store_id = "MY_INFERENCE_DICOM_STORE" # @param

pubsub_subscription_name = "projects/" + project_id + "/subscriptions/" + pubsub_subscription_id
inference_dicom_store_path = dicom_path.FromPath(dicom_store_path, store_id=inference_dicom_store_id)

In [0]:
%%bash -s {pubsub_topic_id} {pubsub_subscription_id} {project_id} {location} {dataset_id} {inference_dicom_store_id}

# Create Pubsub topic and subscription.
gcloud beta pubsub topics create $1
gcloud beta pubsub subscriptions create $2 --topic $1

# Create a Cloud Healthcare DICOM store that publishes to the given Pubsub topic.
TOKEN=`gcloud beta auth application-default print-access-token`
NOTIFICATION_CONFIG="{notification_config: {pubsub_topic: \"projects/$3/topics/$1\"}}"
curl -s -X POST -H "Content-Type: application/json" -H "Authorization: Bearer ${TOKEN}" -d "${NOTIFICATION_CONFIG}" https://healthcare.googleapis.com/v1/projects/$3/locations/$4/datasets/$5/dicomStores?dicom_store_id=$6

# Enable Cloud Healthcare API to publish on given Pubsub topic.
PROJECT_NUMBER=`gcloud projects describe $3 | grep projectNumber | sed 's/[^0-9]//g'`
SERVICE_ACCOUNT="service-${PROJECT_NUMBER}@gcp-sa-healthcare.iam.gserviceaccount.com"
gcloud beta pubsub topics add-iam-policy-binding $1 --member="serviceAccount:${SERVICE_ACCOUNT}" --role="roles/pubsub.publisher"

Next, we will build the inference module using the Cloud Build API. This creates a Docker container that is stored in Google Container Registry. The inference module code is found in inference.py, and the build script used to build the Docker container is cloudbuild.yaml. The progress of the build can be monitored on the Cloud Build dashboard.


In [0]:
%%bash -s {project_id}
PROJECT_ID=$1

gcloud builds submit --config scripts/inference/cloudbuild.yaml --timeout 1h scripts/inference

Next, we will deploy the inference module to Kubernetes: we first create a Kubernetes cluster and then a Deployment that runs the inference module.


In [0]:
%%bash -s {project_id} {location} {pubsub_subscription_name} {full_model_name} {inference_dicom_store_path}
gcloud container clusters create inference-module --region=$2 --scopes https://www.googleapis.com/auth/cloud-platform --num-nodes=1

PROJECT_ID=$1
SUBSCRIPTION_PATH=$3
MODEL_PATH=$4
INFERENCE_DICOM_STORE_PATH=$5

cat <<EOF | kubectl create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-module
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inference-module
  template:
    metadata:
      labels:
        app: inference-module
    spec:
      containers:
        - name: inference-module
          image: gcr.io/${PROJECT_ID}/inference-module:latest
          command:
            - "/opt/inference_module/bin/inference_module"
            - "--subscription_path=${SUBSCRIPTION_PATH}"
            - "--model_path=${MODEL_PATH}"
            - "--dicom_store_path=${INFERENCE_DICOM_STORE_PATH}"
            - "--prediction_service=AutoML"
EOF

Next, we will store a mammography DICOM instance from the TCIA dataset in the DICOM store. This is the image we will request inference for. Pushing this instance to the DICOM store will publish a Pubsub message, which triggers the inference module.


In [0]:
# DICOM Study/Series/Instance UIDs of the input mammography image that we'll push for inference.
input_mammo_study_uid = "1.3.6.1.4.1.9590.100.1.2.85935434310203356712688695661986996009"
input_mammo_series_uid = "1.3.6.1.4.1.9590.100.1.2.374115997511889073021386151921807063992"
input_mammo_instance_uid = "1.3.6.1.4.1.9590.100.1.2.289923739312470966435676008311959891294"

In [0]:
from google.cloud import storage
from dicomweb_client.api import DICOMwebClient
from dicomweb_client import session_utils
import pydicom


storage_client = storage.Client()
bucket = storage_client.bucket('gcs-public-data--healthcare-tcia-cbis-ddsm', user_project=project_id)
blob = bucket.blob("dicom/{}/{}/{}.dcm".format(input_mammo_study_uid,input_mammo_series_uid,input_mammo_instance_uid))
blob.download_to_filename('example.dcm')
dataset = pydicom.dcmread('example.dcm')
session = session_utils.create_session_from_gcp_credentials()
study_path = dicom_path.FromPath(inference_dicom_store_path, study_uid=input_mammo_study_uid)
dicomweb_url = os.path.join(HEALTHCARE_API_URL, study_path.dicomweb_path_str)
dcm_client = DICOMwebClient(dicomweb_url, session)
dcm_client.store_instances(datasets=[dataset])

You should be able to observe the inference module's logs by running the following command. In the logs, you should see that the inference module successfully received the Pubsub message and ran inference on the DICOM instance; the logs should also include the inference results. It can take a few minutes for the Kubernetes deployment to start up, so you may need to run this a few times.


In [0]:
!kubectl logs -l app=inference-module

You can also query the Cloud Healthcare DICOMweb API (using QIDO-RS) to verify that the DICOM structured report has been inserted for the study. The structured report contents can be found under tag "0040A730".

You can optionally also use WADO-RS to retrieve the instance (e.g. for viewing).


In [0]:
dcm_client.search_for_instances(study_path.study_uid, fields=['all'])
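
Optionally, here is a minimal sketch that retrieves the instances of the study with WADO-RS and prints any structured-report content found. It assumes the QIDO-RS response uses the standard DICOM JSON keys for the Series and SOP Instance UIDs ("0020000E" and "00080018") and that the inference module has already stored the structured report for this study.


In [0]:
# Retrieve each instance in the study and print structured-report content, if any.
instances = dcm_client.search_for_instances(study_path.study_uid, fields=['all'])
for instance in instances:
  series_uid = instance['0020000E']['Value'][0]
  instance_uid = instance['00080018']['Value'][0]
  retrieved = dcm_client.retrieve_instance(study_path.study_uid, series_uid, instance_uid)
  if 'ContentSequence' in retrieved:  # Tag (0040,A730) holds the report content.
    print('Structured report found in instance {}:'.format(instance_uid))
    print(retrieved.ContentSequence)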