Summary

In the Second Annual Data Science Bowl Kaggle competition you estimate the volume of the heart at maximal expansion and contraction from an MRI study.

My solution localizes the heart in the horizontal (sax) slice images of a study and uses this location to crop the images. It also converts the time sequence of each slice into channels (the DC component and the first two DFT frequencies, with phase).

Each cropped horizontal slice is fed into a CNN which predicts the volume contribution of that slice to the entire volume of the heart. When predicting, the results from all slices of the same study are added up. When training, a special batch arrangement is used in which all slices from the same study appear in the same batch, and the loss function sums all slices from the same study before computing the loss.

The CNN predicts both the volume and the error of that prediction, and the loss is the negative log-likelihood of a normal distribution.
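As a rough illustration of this loss (a sketch only, with made-up function and variable names, not the notebook's implementation), the per-slice predictions of one study are summed and the total is scored with a Gaussian negative log-likelihood:

import numpy as np

def study_nll(slice_volumes, slice_sigmas, true_volume):
    # Sum the per-slice volume contributions of one study.
    pred_volume = np.sum(slice_volumes)
    # Combine the per-slice error estimates (assumed independent here;
    # the notebook may combine them differently).
    sigma = np.sqrt(np.sum(np.square(slice_sigmas)))
    # Negative log-likelihood of the true volume under N(pred_volume, sigma^2).
    return (0.5 * np.log(2 * np.pi * sigma ** 2)
            + 0.5 * ((true_volume - pred_volume) / sigma) ** 2)

# Example: 8 slices of one study, each predicting a (volume, sigma) pair.
vols = np.array([20.0, 25.0, 30.0, 28.0, 22.0, 15.0, 10.0, 5.0])
sigs = np.full(8, 3.0)
print(study_nll(vols, sigs, true_volume=160.0))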

The final submission is built from an ensemble of many predictions, each based on a different subset of the training data.

The solution is made of several steps, each a Jupyter notebook, which you execute one after the other. The results of each step can be stored on S3, allowing some of the steps to run in parallel on several AWS EC2 instances.
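If you need to move a step's results between instances yourself, any S3 client will do; a minimal sketch with boto3 (the file name below is hypothetical, the bucket and prefix come from OUT_DATA_PATH in SETTINGS.json):

import boto3

s3 = boto3.client('s3')
# Upload a (hypothetical) intermediate result to s3://udikaggle/dsb.2/
s3.upload_file('/mnt/data/crops.pkl', 'udikaggle', 'dsb.2/crops.pkl')
# Download it again on another instance
s3.download_file('udikaggle', 'dsb.2/crops.pkl', '/mnt/data/crops.pkl')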

Setup

hardware

All steps were run on an AWS EC2 g2.2xlarge instance running Ubuntu 14.04.3 LTS, but many other Linux / OS X setups will also work.

software

The code runs on Python 2.7.

Install my fork of Keras

pip install git+git://github.com/udibr/keras.git#validate_batch_size

configuration


In [1]:
!cat SETTINGS.json


{
  "TRAIN_DATA_PATH": "/vol1/data/train",
  "VALID_DATA_PATH": "/vol1/data/validate",
  "TEST_DATA_PATH": "/vol1/data/test",
  "OUT_DATA_PATH": "s3://udikaggle/dsb.2",
  "TEMP_DATA_PATH": "/mnt/data"
}

Each step stores its results in OUT_DATA_PATH. Modify this field to an S3 bucket and folder to which you have access; you can also change it to a regular file-system directory.

Some notebooks require a large amount of disk space to store temporary results in `TEMP_DATA_PATH`. Make sure the path you choose exists (and/or is mounted) and that you have enough disk space.
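If you want to reuse these paths in your own scripts, they can be loaded directly from the file (a minimal sketch, not necessarily how the notebooks read it):

import json

with open('SETTINGS.json') as f:
    settings = json.load(f)

print(settings['TRAIN_DATA_PATH'])  # /vol1/data/train
print(settings['OUT_DATA_PATH'])    # s3://udikaggle/dsb.2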

data

Download the data

train.csv should be in TRAIN_DATA_PATH and should have exactly Nt=500 studies. This directory should also contain all 500 training DICOM study sub-directories, named 1 $\ldots$ 500.

validate.csv should be in VALID_DATA_PATH and should have exactly Nv=200 studies. It can be missing if a first-stage run is performed. This directory should contain all 200 validation study sub-directories, named 501 $\ldots$ 700.

TEST_DATA_PATH should contain all Ns=440 test study sub-directories, named 701 $\ldots$ 1140.
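A quick sanity check of this layout (a sketch that assumes the paths in SETTINGS.json and the study numbering above) can catch missing studies before the long-running steps:

import os
import json

with open('SETTINGS.json') as f:
    s = json.load(f)

# Expected study ids per directory: train 1..500, validate 501..700, test 701..1140
expected = {
    s['TRAIN_DATA_PATH']: range(1, 501),
    s['VALID_DATA_PATH']: range(501, 701),
    s['TEST_DATA_PATH']: range(701, 1141),
}
for path, studies in expected.items():
    missing = [i for i in studies if not os.path.isdir(os.path.join(path, str(i)))]
    print('%s is missing %d studies' % (path, len(missing)))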

Preprocessing

Patient information is read from the CSV file(s) and from the DICOM metadata and placed in a Pandas DataFrame.
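A minimal sketch of this kind of preprocessing, assuming the pydicom and pandas packages and the competition's <study>/study/sax_*/ directory layout (the exact DICOM fields used by the notebooks may differ):

import os
import glob
import dicom       # pydicom < 1.0 imports as `dicom`
import pandas as pd

def study_metadata(study_dir):
    # Collect a few standard DICOM header fields from every sax image of one study.
    rows = []
    for f in glob.glob(os.path.join(study_dir, 'study', 'sax_*', '*.dcm')):
        d = dicom.read_file(f)
        rows.append({
            'file': f,
            'slice_location': float(d.SliceLocation),
            'slice_thickness': float(d.SliceThickness),
            'pixel_spacing': [float(x) for x in d.PixelSpacing],
        })
    return pd.DataFrame(rows)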

localization

You can perform the localization step in parallel with the preprocessing step.

Both modeling and prediction require the approximate location and size of the LV in each image slice.

The next two notebooks are almost identical to the deep learning tutorial, and you should follow that tutorial's installation process in order for the notebooks below to work.

You first need to build a pixel model which determines whether each individual pixel belongs to the LV. Building the pixel model uses the Sunnybrook dataset (see the instructions at the start of the notebook on how to download and open these files).

This dataset does not change across the different stages of the competition, so you can use my precomputed model, which is downloaded for you in the next notebook.

Next, you need to predict the pixel values for the entire competition data set; the predictions are stored as "masks".

Next, run a Fourier analysis which performs the localization based on the pixel-level predictions read from the masks.
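The intuition behind this Fourier localization is that pixels inside the beating LV vary strongly at the heart-beat frequency. A rough sketch of the idea (not the notebook's implementation), assuming the masks of one slice are stacked into a (time, height, width) array:

import numpy as np

def localize_lv(mask_seq):
    # Magnitude of the first temporal DFT frequency at every pixel.
    h1 = np.abs(np.fft.fft(mask_seq, axis=0)[1])
    # The pixel with the strongest periodic response is a good LV candidate.
    return np.unravel_index(np.argmax(h1), h1.shape)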

cropping

Crop, zoom, and rotate the area around the LV in every slice and store the result in a single file. This step also collapses the time sequence of images into 5 channels: the DC component and the first two frequencies of a DFT, with phase.
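A sketch of how such a 5-channel representation can be computed with numpy (the exact ordering and normalization used in the notebook may differ):

import numpy as np

def time_to_channels(seq):
    # seq: array of shape (time, height, width) for one cropped slice.
    spec = np.fft.fft(seq, axis=0)
    return np.stack([
        np.abs(spec[0]) / len(seq),          # DC component (mean over time)
        np.abs(spec[1]), np.angle(spec[1]),  # first frequency: magnitude and phase
        np.abs(spec[2]), np.angle(spec[2]),  # second frequency: magnitude and phase
    ])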

At the bottom of this notebook you have the option to review samples from 120 studies at a time, validating that the cropped images cover the LV and that it appears at about the same size and orientation in all studies.

Modeling

Build a model and use it to predict the heart volume on the test data. The modeling uses only $2/3$ of the training data, selected according to the seed and isplit variables at the top of the notebook.

An ensemble of predictions is built by running the notebook many times, each time with different seed and isplit values, which you have to set by hand before every run. For stage 2 of the competition, generate 60 models: seed values range from 1000 to 1019 (20 different values), and for each seed value, run with 3 different isplit values: 0, 1, and 2.
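A sketch of how a (seed, isplit) pair could select $2/3$ of the training studies (illustrative only; the notebook's own splitting code may differ):

import numpy as np

def train_studies(seed, isplit, n_train=500, n_splits=3):
    # Shuffle study ids 1..n_train with `seed`, drop fold `isplit`, keep the other 2/3.
    rng = np.random.RandomState(seed)
    ids = rng.permutation(np.arange(1, n_train + 1))
    folds = np.array_split(ids, n_splits)
    return np.sort(np.concatenate([f for i, f in enumerate(folds) if i != isplit]))

# One of the 60 stage-2 runs
print(len(train_studies(seed=1000, isplit=0)))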

You can run in parallel on many `g2.2xlarge` AWS EC2 instances. After starting each instance you can "forget" it, since it will self-terminate once the final prediction is made.

Building submission

Finally, fuse the ensemble into a single submission file.
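One simple way to fuse the ensemble (a sketch only, assuming each run wrote a submission-style CSV with an Id column and the competition's cumulative-probability columns; the file pattern is hypothetical and the actual fusing notebook may weight or filter the predictions differently) is to average the predicted distributions:

import glob
import pandas as pd

# Hypothetical per-run prediction files
files = glob.glob('/mnt/data/submission_seed*_isplit*.csv')
dfs = [pd.read_csv(f, index_col='Id') for f in files]

# Average the cumulative probabilities across runs and clip to [0, 1]
fused = sum(dfs) / len(dfs)
fused = fused.clip(0.0, 1.0)
fused.to_csv('final_submission.csv')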