2018 NUS-MIT Datathon Tutorial: Machine Learning on CBIS-DDSM

Goal

In this Colab, we are going to train a simple convolutional neural network (CNN) with TensorFlow that can be used to classify mammographic images based on breast density.

The network we are going to build is adapted from the official TensorFlow tutorial.

CBIS-DDSM

The dataset we are going to work with is CBIS-DDSM. To quote its website:

"This CBIS-DDSM (Curated Breast Imaging Subset of DDSM) is an updated and standardized version of the Digital Database for Screening Mammography (DDSM)."

CBIS-DDSM differs from the original DDSM dataset in that its images have been converted to DICOM format, which is easier to work with.
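
For instance, a DICOM file can be read into a NumPy array with the pydicom library. This is a minimal sketch, assuming pydicom is installed; the file name is a placeholder, and the tutorial itself works with images already extracted from the DICOM files:


In [0]:
import pydicom  # assumption: pydicom is not otherwise used in this tutorial

ds = pydicom.dcmread("sample.dcm")  # "sample.dcm" is a placeholder file name
pixels = ds.pixel_array  # the image as a NumPy array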

Note that although this tutorial focuses on the CBIS-DDSM dataset, most of it can easily be applied to The International Skin Imaging Collaboration (ISIC) dataset as well. More details are provided in the Dataset section below.

Setup

To be able to run the code cells in this tutorial, you need to create a copy of this Colab notebook by clicking "File" > "Save a copy in Drive..." in the menu.

You can share your copy with your teammates by clicking the "SHARE" button in the top-right corner of your copy. Everyone with "Edit" permission can modify the notebook at the same time, which makes it a great way to collaborate as a team.

First, let's import the modules needed to complete the tutorial. You can run the following cell by clicking the triangle button that appears when you hover over the [ ] space in the top-left corner of the code cell below.


In [0]:
import numpy as np
import os
import pandas as pd
import random
import tensorflow as tf

from google.colab import auth
from google.cloud import storage
from io import BytesIO
# The next import is used to print out pretty pandas dataframes
from IPython.display import display, HTML
from PIL import Image

Next, we need to authenticate ourselves to Google Cloud Platform. If you are running the code cell below for the first time, a link will show up, leading to a web page for authentication and authorization. Log in with your credentials and make sure the permissions it requests are appropriate. After clicking the Allow button, you will be redirected to another web page that displays a verification code. Copy the code and paste it into the input field below.


In [0]:
auth.authenticate_user()

Let's also set the project we are going to use throughout the tutorial.


In [0]:
project_id = 'nus-datathon-2018-team-00'
os.environ["GOOGLE_CLOUD_PROJECT"] = project_id

Optional: In this Colab you can opt to use a GPU to train the model by clicking "Runtime" in the top menu, then "Change runtime type", and selecting "GPU" as the hardware accelerator. You can verify that the GPU is working with the following code cell.


In [0]:
# Should output something like '/device:GPU:0'.
tf.test.gpu_device_name()

Dataset

We have already extracted the images from the DICOM files to separate folders on GCS, and some preprocessing was also done on the raw images (if you need custom preprocessing, please consult our tutorial on image preprocessing).

The folders ending with _demo contain subsets of the training and test images. Specifically, the demo training dataset has 100 images, 25 for each breast density category (1 - 4). The test dataset has 20 randomly selected images. All images were first padded to 5251x7111 (the largest width and height among the selected images) and then resized to 95x128 to fit in memory and save training time. Both training and test images are "Cranial-Caudal" views only.

The ISIC dataset is organized slightly differently: the images are in JPEG format, and each image comes with a JSON file containing metadata. To make this tutorial work for ISIC, you will need to first pad and resize the images (we provide a script to do that here; a sketch of the idea is shown below) and extract the labels you are interested in from the JSON files.
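
For illustration, the pad-then-resize step could look like the following sketch using PIL. The function name pad_and_resize and the top-left anchoring are assumptions, not the provided script:


In [0]:
from PIL import Image

PAD_SIZE = (5251, 7111)  # largest width and height among the selected images
OUT_SIZE = (95, 128)     # target size used in this tutorial

def pad_and_resize(img):
  # Paste the original image onto a black canvas of the padded size,
  # then shrink the whole canvas to the target size.
  padded = Image.new(img.mode, PAD_SIZE)
  padded.paste(img, (0, 0))
  return padded.resize(OUT_SIZE, Image.LANCZOS)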

Training

Before coding up our neural network, let's create a few helper methods to make loading data from Google Cloud Storage (GCS) easier.


In [0]:
client = storage.Client()

bucket_name = 'datathon-cbis-ddsm-colab'
bucket = client.get_bucket(bucket_name)

def load_images(folder):
  images = []
  labels = []
  # The image name is in the format: <LABEL>_Calc_{Train,Test}_P_<Patient_ID>_{Left,Right}_CC.
  for label in [1, 2, 3, 4]:
    blobs = bucket.list_blobs(prefix=("%s/%s_" % (folder, label)))

    for blob in blobs:
      byte_stream = BytesIO()
      blob.download_to_file(byte_stream)
      byte_stream.seek(0)

      img = Image.open(byte_stream)
      images.append(np.array(img, dtype=np.float32))
      labels.append(label - 1)  # Subtract 1 so labels fall in [0, 4).

  return np.array(images), np.array(labels, dtype=np.int32)

def load_train_images():
  return load_images('small_train_demo')

def load_test_images():
  return load_images('small_test_demo')
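
As a quick sanity check (optional, and not part of the original tutorial), you can load the demo training set and inspect what the helpers return:


In [0]:
train_data, train_labels = load_train_images()
# 100 demo images as float32 arrays, and integer labels in [0, 4).
print(train_data.shape, train_data.dtype)
print(train_labels.shape, train_labels[:10])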

Let's create a model function, which will be passed to an estimator that we will create later. The model has an architecture of 6 layers:

  1. Convolutional Layer: Applies 32 5x5 filters, with ReLU activation function
  2. Pooling Layer: Performs max pooling with a 2x2 filter and stride of 2
  3. Convolutional Layer: Applies 64 5x5 filters, with ReLU activation function
  4. Pooling Layer: Same setup as #2
  5. Dense Layer: 1,024 neurons, with dropout regularization rate of 0.25
  6. Logits Layer: 4 neurons, one for each breast density category, i.e. [0, 4)

Note that you can change the parameters on the right (or inline) to tune the neural network. It is highly recommended to check out the original TensorFlow tutorial to get a deeper understanding of the network we are building here.
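
One detail worth spelling out is where the flattened size 23 * 32 * 64 used below comes from: each 2x2, stride-2 pooling layer halves the spatial dimensions, and TensorFlow floors the division for odd sizes. A quick check:


In [0]:
h, w = 95, 128         # input height and width
h, w = h // 2, w // 2  # after pool1: 47 x 64
h, w = h // 2, w // 2  # after pool2: 23 x 32
print(h, w, h * w * 64)  # 23 32 47104 features per example after flattening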


In [0]:
KERNEL_SIZE = 5 #@param
DROPOUT_RATE = 0.25 #@param

def cnn_model_fn(features, labels, mode):
  """Model function for CNN."""

  # Input Layer.
  # Reshape to 4-D tensor: [batch_size, height, width, channels]
  # DDSM images are grayscale, which have 1 channel.
  input_layer = tf.reshape(features["x"], [-1, 95, 128, 1])

  # Convolutional Layer #1.
  # Input Tensor Shape: [batch_size, 95, 128, 1]
  # Output Tensor Shape: [batch_size, 95, 128, 32]
  conv1 = tf.layers.conv2d(
      inputs=input_layer,
      filters=32,
      kernel_size=KERNEL_SIZE,
      padding="same",
      activation=tf.nn.relu)

  # Pooling Layer #1.
  # Input Tensor Shape: [batch_size, 95, 128, 32]
  # Output Tensor Shape: [batch_size, 47, 64, 32]
  pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)

  # Convolutional Layer #2.
  # Input Tensor Shape: [batch_size, 47, 64, 32]
  # Output Tensor Shape: [batch_size, 47, 64, 64]
  conv2 = tf.layers.conv2d(
      inputs=pool1,
      filters=64,
      kernel_size=KERNEL_SIZE,
      padding="same",
      activation=tf.nn.relu)

  # Pooling Layer #2.
  # Input Tensor Shape: [batch_size, 47, 64, 64]
  # Output Tensor Shape: [batch_size, 23, 32, 64]
  pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)

  # Flatten tensor into a batch of vectors
  # Input Tensor Shape: [batch_size, 23, 32, 64]
  # Output Tensor Shape: [batch_size, 23 * 32 * 64]
  pool2_flat = tf.reshape(pool2, [-1, 23 * 32 * 64])

  # Dense Layer.
  # Input Tensor Shape: [batch_size, 23 * 32 * 64]
  # Output Tensor Shape: [batch_size, 1024]
  dense = tf.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.relu)

  # Dropout operation.
  # Each element is kept with probability 1 - DROPOUT_RATE (0.75 by default).
  dropout = tf.layers.dropout(inputs=dense, rate=DROPOUT_RATE,
                              training=(mode == tf.estimator.ModeKeys.TRAIN))

  # Logits Layer.
  # Input Tensor Shape: [batch_size, 1024]
  # Output Tensor Shape: [batch_size, 4]
  logits = tf.layers.dense(inputs=dropout, units=4)

  predictions = {
      # Generate predictions (for PREDICT and EVAL mode)
      "classes": tf.argmax(input=logits, axis=1),
      # Add `softmax_tensor` to the graph. It is used for PREDICT and by the
      # `logging_hook`.
      "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
  }
  if mode == tf.estimator.ModeKeys.PREDICT:
    return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

  # Loss Calculation.
  loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

  if mode == tf.estimator.ModeKeys.TRAIN:
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
    train_op = optimizer.minimize(
        loss=loss,
        global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

  # Add evaluation metrics (for EVAL mode).
  eval_metric_ops = {
      "accuracy": tf.metrics.accuracy(
          labels=labels, predictions=predictions["classes"])}
  return tf.estimator.EstimatorSpec(
      mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)

Now that we have a model function, the next step is to feed it to an estimator for training. Here we are creating a main function, as required by TensorFlow.


In [0]:
BATCH_SIZE = 20 #@param
STEPS = 1000 #@param

artifacts_bucket_name = 'nus-datathon-2018-team-00-shared-files'
# Append a random number to avoid collision.
artifacts_path = "ddsm_model_%s" % random.randint(0, 1000)
model_dir = "gs://%s/%s" % (artifacts_bucket_name, artifacts_path)

def main(_):
  # Load training and test data.
  train_data, train_labels = load_train_images()
  eval_data, eval_labels = load_test_images()

  # Create the Estimator.
  ddsm_classifier = tf.estimator.Estimator(
      model_fn=cnn_model_fn,
      model_dir=model_dir)

  # Set up logging for predictions.
  # Log the values in the "Softmax" tensor with label "probabilities".
  tensors_to_log = {"probabilities": "softmax_tensor"}
  logging_hook = tf.train.LoggingTensorHook(
      tensors=tensors_to_log, every_n_iter=50)

  # Train the model.
  train_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
      x={"x": train_data},
      y=train_labels,
      batch_size=BATCH_SIZE,
      num_epochs=None,
      shuffle=True)
  ddsm_classifier.train(
      input_fn=train_input_fn,
      steps=STEPS,
      hooks=[logging_hook])

  # Evaluate the model and print results.
  eval_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
      x={"x": eval_data},
      y=eval_labels,
      num_epochs=1,
      shuffle=False)
  eval_results = ddsm_classifier.evaluate(input_fn=eval_input_fn)
  print(eval_results)
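
Although main() only trains and evaluates, the PREDICT branch of cnn_model_fn can be exercised too. Below is a minimal sketch, assuming it is run inside main() where ddsm_classifier and eval_data are in scope:


In [0]:
predict_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={"x": eval_data}, num_epochs=1, shuffle=False)
for pred in ddsm_classifier.predict(input_fn=predict_input_fn):
  print(pred["classes"], pred["probabilities"])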

Finally, here comes the exciting moment. We are going to train and evaluate the model we just built! Run the following code cell and pay attention to the accuracy printed at the end of the logs.

Note: if this is not the first time you are running the training cell below, please run the following command first to remove the temporary files; this avoids weird errors like "NaN loss during training".


In [0]:
# Remove temporary files.
artifacts_bucket = client.get_bucket(artifacts_bucket_name)
artifacts_bucket.delete_blobs(artifacts_bucket.list_blobs(prefix=artifacts_path))
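
If you want to double-check what will be removed before running the cell above, you can list the matching blobs first (optional):


In [0]:
for blob in artifacts_bucket.list_blobs(prefix=artifacts_path):
  print(blob.name)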

In [0]:
# Set logging level.
tf.logging.set_verbosity(tf.logging.INFO)

# Start training; this will call the main method defined above behind the scenes.
# The whole training process will take ~5 mins.
tf.app.run()

As you can see, the result doesn't look too good. This is expected given how little data we used for training and how simple our network is.

Now, for those of you who are interested, let's move on to using Cloud Machine Learning Engine to train a model on the whole dataset, with a standalone GPU and a TPU respectively. Please continue with the instructions here.