BioVida: The Cancer Imaging Archive


The Cancer Imaging Archive is a large repository of medical images of various forms of cancer. Programmatic access is provided through a RESTful web API. However, this service requires an API key, which you can apply for by following the instructions found here. Typically, this process is quick.

Note: If there is enough demand, BioVida may include an API key for this service in the future. At the moment, however, you will have to follow the instructions linked above to obtain one.

BioVida provides an easy-to-use Python interface for this web API which, like the OpeniInterface class, is located in the images subpackage.


In [1]:
from biovida.images import CancerImageInterface


Using Theano backend.

We can create an instance of the tool using an API key.


In [2]:
cii = CancerImageInterface(api_key=YOUR_API_KEY_HERE)
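
Hard-coding a key in a notebook makes it easy to leak by accident. One alternative is to read it from the environment. The sketch below is only an illustration: the variable name TCIA_API_KEY and the helper load_api_key are hypothetical and are not part of BioVida.

```python
import os


def load_api_key(env_var="TCIA_API_KEY"):
    """Return the API key stored in `env_var` (a hypothetical variable name).

    Raises a RuntimeError with a readable message if the variable is unset or empty.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            "Set the {0} environment variable to your API key.".format(env_var))
    return key
```

The returned string could then be passed along as CancerImageInterface(api_key=load_api_key()).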

Note:

The list of studies (collections) provided here is somewhat shorter than the one on The Cancer Imaging Archive's website. This is because certain studies, such as those with restricted access, are excluded.

With this notice out of the way, let's go ahead and perform a search.


In [3]:
cii.search(cancer_type='breast')


Out[3]:
   collection             cancer_type    modalities      subjects  location  metadata  access  status    updated     modalities_full
0  ISPY1                  Breast Cancer  MR, SEG         222       [Breast]  Yes       Public  Complete  2016-08-31  [Magnetic Resonance Imaging (MRI), Segmentation]
1  CBIS-DDSM              Breast Cancer  MG              1431      [Breast]  Yes       Public  Ongoing   2016-08-31  [Mammography]
2  Breast-MRI-NACT-Pilot  Breast Cancer  MR, SEG         64        [Breast]  Yes       Public  Complete  2016-01-26  [Magnetic Resonance Imaging (MRI), Segmentation]
3  TCGA-BRCA              Breast Cancer  MR, MG          139       [Breast]  Yes       Public  Complete  2014-12-30  [Magnetic Resonance Imaging (MRI), Mammography]
4  QIN Breast DCE-MRI     Breast Cancer  MR, KO          10        [Breast]  Yes       Public  Ongoing   2014-07-31  [Magnetic Resonance Imaging (MRI), Key Object ...
5  Breast Diagnosis       Breast Cancer  MR, PT, CT, MG  88        [Breast]  Yes       Public  Complete  2011-11-09  [Magnetic Resonance Imaging (MRI), Positron Em...
6  RIDER Breast MRI       Breast Cancer  MR              5         [Breast]  No        Public  Complete  2011-11-08  [Magnetic Resonance Imaging (MRI)]

Next, we can download this data. First, though, we define a small helper that tidies up DataFrames for display in this tutorial.


In [4]:
import numpy as np
def simplify_df(df):
    """This function simplifies dataframes
    for the purposes of this tutorial."""
    data_frame = df.copy()
    for c in ('cached_dicom_images_path', 'cached_images_path'):
        data_frame[c] = data_frame[c].map(
            lambda x: tuple(['path_to_image'] * len(x)) if isinstance(x, tuple) else x)
    return data_frame[0:5].replace({np.nan: ''})

The code below will download the data in our search results, but with three noteworthy restrictions.
First, patient_limit=5 will limit the number of patients/subjects downloaded to the first 5.
Second, collections_limit=1 will limit the number of collections downloaded to one (in this case, 'ISPY1').
Third, session_limit=1 will limit the results returned to the first time the patient/subject was scanned, e.g., before surgical intervention to remove diseased tissue.

Additionally, the save_dicom parameter lets us save the raw DICOM image files that the Cancer Imaging Archive provides. By default, pull() saves only DICOM files. However, the save_png argument gives you the option to also convert the DICOM files to PNG images.


In [5]:
pull_df = cii.pull(patient_limit=5, collections_limit=1, session_limit=1)




Let's take a look at the data we've downloaded. We could view the pull_df object above, or the identical records_db attribute of cii, i.e., cii.records_db. However, both of those DataFrames contain several columns which are not relevant to every use of the data. So, instead, we can view an abbreviated DataFrame, records_db_short.


In [6]:
simplify_df(cii.records_db_short)


Out[6]:
modality protocol_name series_date series_description body_part_examined annotations_flag collection manufacturer manufacturer_model_name age ... cancer_type query pull_time modality_full series_number_rescaled cached_dicom_images_path cached_images_path error_free_conversion allowed_modality image_count_converted_cache
0 MR 3D FIESTA - LEFT 1984-10-13 3P LEFT BREAST SCOUT breast ISPY1 GE MEDICAL SYSTEMS GENESIS_SIGNA 38.0 ... breast cancer {'location': None, 'collection': None, 'cancer... 2017-04-10 06:20:08.220896 Magnetic Resonance Imaging (MRI) 1.0 (path_to_image, path_to_image, path_to_image, ... True
1 MR 3D FIESTA - LEFT 1984-10-13 T1-axial-locator breast ISPY1 GE MEDICAL SYSTEMS GENESIS_SIGNA 38.0 ... breast cancer {'location': None, 'collection': None, 'cancer... 2017-04-10 06:20:08.220896 Magnetic Resonance Imaging (MRI) 2.0 (path_to_image, path_to_image, path_to_image, ... True
2 MR 3D FIESTA - LEFT 1984-10-13 T2-FSE-Sagittal breast ISPY1 GE MEDICAL SYSTEMS GENESIS_SIGNA 38.0 ... breast cancer {'location': None, 'collection': None, 'cancer... 2017-04-10 06:20:08.220896 Magnetic Resonance Imaging (MRI) 3.0 (path_to_image, path_to_image, path_to_image, ... True
3 MR 3D FIESTA - LEFT 1984-10-13 Dynamic-3dfgre breast ISPY1 GE MEDICAL SYSTEMS GENESIS_SIGNA 38.0 ... breast cancer {'location': None, 'collection': None, 'cancer... 2017-04-10 06:20:08.220896 Magnetic Resonance Imaging (MRI) 4.0 (path_to_image, path_to_image, path_to_image, ... True
4 MR 3D FIESTA - LEFT 1984-10-13 Dynamic-3dfgre: SER breast ISPY1 GE MEDICAL SYSTEMS GENESIS_SIGNA 38.0 ... breast cancer {'location': None, 'collection': None, 'cancer... 2017-04-10 06:20:08.220896 Magnetic Resonance Imaging (MRI) 4.1 (path_to_image, path_to_image, path_to_image, ... True

5 rows × 26 columns

Notes:

  • The 'cached_dicom_images_path' and 'cached_images_path' columns refer to multiple images.
  • The number of converted images may differ from the number of raw DICOM images because 3D DICOM images are saved as individual frames when they are converted to PNG. The 'image_count_converted_cache' column provides an account of how many images resulted from any given DICOM → PNG conversion.

Working With DICOMs

The Cancer Imaging Archive stores images in a format known as Digital Imaging and Communications in Medicine (DICOM). If you have experience working with this file format, you can safely skip this section.

In Python, we can manipulate DICOM files using the pydicom library. This tool will allow us to extract the image data as ndarrays.


In [7]:
import pydicom as dicom  # pydicom < 1.0 was imported with `import dicom`

We will also use matplotlib, imported further below, to visualize the ndarrays we extract.

We can start by extracting a list of DICOMs from the images we downloaded above.


In [8]:
sample_dicoms = cii.records_db['cached_dicom_images_path'].iloc[1]

We can load these images in as ndarrays using the dicom (pydicom) library.


In [9]:
# `dcmread` replaces the older `read_file`, which recent pydicom releases no longer provide
dicoms_arrs = [dicom.dcmread(f).pixel_array for f in sample_dicoms]

DICOM groups images using a tag known as SeriesInstanceUID, that is, the unique ID of the series. If multiple DICOM files/images share the same SeriesInstanceUID, they are part of the same 'series'.
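
The grouping this tag implies can be sketched in a few lines. The example below is illustrative only: the helper group_by_series is hypothetical, and the file names and UIDs are made up; in practice the UID would be read from each file's header.

```python
from collections import defaultdict


def group_by_series(headers):
    """Group file paths by their SeriesInstanceUID.

    `headers` is an iterable of (path, series_instance_uid) pairs.
    """
    series = defaultdict(list)
    for path, uid in headers:
        series[uid].append(path)
    return dict(series)


# Made-up paths and UIDs, for illustration only.
example = [
    ("a_001.dcm", "1.2.3.1"),
    ("a_002.dcm", "1.2.3.1"),
    ("b_001.dcm", "1.2.3.2"),
]
grouped = group_by_series(example)
print({uid: len(paths) for uid, paths in grouped.items()})
```

Every key in the result is one series, and the length of its list is the number of files that belong to it.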

If we get the length of dicoms_arrs, we see that multiple DICOM files share the same SeriesInstanceUID.


In [10]:
len(dicoms_arrs)


Out[10]:
15

In [11]:
# cii.records_db['series_instance_uid'].iloc[1]

This suggests that this particular series is either a 3D volume or a time-series. So we can go ahead and stack these images on top of one another as a way of representing this relationship between the images.


In [12]:
stacked_dicoms = np.stack(dicoms_arrs)

If we check the shape of stacked_dicoms, we can see that we have indeed stacked 15 256x256 images on top of one another.


In [13]:
stacked_dicoms.shape


Out[13]:
(15, 256, 256)
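
For readers new to np.stack, the following sketch reproduces the same shape with synthetic arrays (zeros standing in for real pixel data). It also shows that stacking only works when every slice has the same shape, which is worth keeping in mind for series with mixed image sizes.

```python
import numpy as np

# Synthetic stand-ins for 15 slices of 256x256 pixels.
slices = [np.zeros((256, 256), dtype=np.uint16) for _ in range(15)]
volume = np.stack(slices)  # the new leading axis indexes the slice
print(volume.shape)

# Slices of differing shape cannot be stacked; NumPy raises a ValueError.
stack_error = None
try:
    np.stack(slices + [np.zeros((128, 128), dtype=np.uint16)])
except ValueError as err:
    stack_error = err
print("stacking mixed shapes failed:", stack_error is not None)
```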

We can also go ahead and define a small function which will enable us to visualize this 'stack' of images.


In [14]:
import matplotlib.pyplot as plt
def sample_stack(stack, rows=3, cols=5, start_with=0, show_every=1):
    """Function to display stacked ndarray.
    Source: https://www.raddq.com/dicom-processing-segmentation-visualization-in-python
    Note: this code has been slightly modified."""
    if rows*cols != stack.shape[0]:
        raise ValueError("The product of `rows` and `cols` does not equal number of images.")
    # squeeze=False keeps `ax` two-dimensional even when rows or cols is 1
    fig, ax = plt.subplots(rows, cols, figsize=(12, 12), squeeze=False)
    for i in range(rows*cols):
        ind = start_with + i*show_every
        r, c = divmod(i, cols)  # grid position of the i-th image
        ax[r, c].set_title('slice {0}'.format(ind + 1))
        ax[r, c].imshow(stack[ind], cmap='gray')
        ax[r, c].axis('off')
    plt.show()

If you're curious to see what these images look like, you can uncomment the line below to view the stack of images.


In [15]:
# sample_stack(stacked_dicoms, rows=3)

Note:

Ordering DICOM images in space is tricky. Currently, this class uses a somewhat reliable, but far from ideal, means of ordering images.
Errors are possible.

For the Medical Imaging Folks:
Images sharing a 'series_instance_uid' are ordered by the InstanceNumber tag rather than by working out the geometry required to sort the images in space. This is not ideal because, among other reasons, InstanceNumber is a type 2 tag: it must be present in the file, but its value may be left empty. Hopefully, this is something that will be improved in the future.
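
For comparison, one common way of working out the geometry is to project each slice's ImagePositionPatient onto the normal of the image plane, which is the cross product of the two direction vectors in ImageOrientationPatient. The sketch below is not what BioVida does; the helper names are hypothetical, and plain dictionaries with made-up values stand in for DICOM headers.

```python
def slice_position(image_position, image_orientation):
    """Distance of a slice along the normal of the image plane.

    image_position: the three values of ImagePositionPatient.
    image_orientation: the six values of ImageOrientationPatient
        (row direction cosines followed by column direction cosines).
    """
    rx, ry, rz, cx, cy, cz = image_orientation
    # The slice normal is the cross product of the row and column directions.
    normal = (ry * cz - rz * cy, rz * cx - rx * cz, rx * cy - ry * cx)
    return sum(n * p for n, p in zip(normal, image_position))


def sort_slices(headers):
    """Sort header-like dicts by their position along the slice normal."""
    return sorted(
        headers,
        key=lambda h: slice_position(h["ImagePositionPatient"],
                                     h["ImageOrientationPatient"]),
    )


# Made-up axial slices, deliberately listed out of order.
axial = [1, 0, 0, 0, 1, 0]
headers = [
    {"name": "c", "ImagePositionPatient": [0.0, 0.0, 10.0], "ImageOrientationPatient": axial},
    {"name": "a", "ImagePositionPatient": [0.0, 0.0, 0.0], "ImageOrientationPatient": axial},
    {"name": "b", "ImagePositionPatient": [0.0, 0.0, 5.0], "ImageOrientationPatient": axial},
]
print([h["name"] for h in sort_slices(headers)])
```

Note that this only orders slices in space. A time-series, in which several images share the same position, would need an additional key such as acquisition time.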


Train, Validation and Test

Splitting images obtained from the Cancer Imaging Archive into training, validation and/or testing sets is nearly identical to doing so using an instance of the OpeniInterface class introduced in the prior tutorial. Accordingly, the instructions provided here will be condensed. If you would like more detail, please review this earlier tutorial.

First, we import the image_divvy tool.


In [16]:
from biovida.images import image_divvy

Next, we can define a 'divvy_rule'.


In [17]:
def my_divvy_rule(row):
    if row['modality_full'] == 'Magnetic Resonance Imaging (MRI)':
        return 'mri'
    if row['modality_full'] == 'Segmentation':
        return 'seg'

This rule will select only those images which are MRIs or segmentations, placing them in the 'mri' and 'seg' categories respectively. All other images will be excluded.
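
To see what the rule returns, it can be applied to a few stand-in rows. The rows below are plain dictionaries made up for illustration; this is not how image_divvy is implemented, only a look at the rule's return values.

```python
def my_divvy_rule(row):
    if row['modality_full'] == 'Magnetic Resonance Imaging (MRI)':
        return 'mri'
    if row['modality_full'] == 'Segmentation':
        return 'seg'
    # Falling through returns None, i.e. the row is given no category.


# Made-up rows containing only the column the rule inspects.
rows = [
    {'modality_full': 'Magnetic Resonance Imaging (MRI)'},
    {'modality_full': 'Segmentation'},
    {'modality_full': 'Mammography'},
]
labels = [my_divvy_rule(r) for r in rows]
print(labels)
```

The mammography row falls through both conditions, so the rule returns None for it.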


In [18]:
train_test = image_divvy(instance=cii,
                         divvy_rule=my_divvy_rule,
                         db_to_extract='records_db',
                         action='ndarray',
                         train_val_test_dict={'train': 0.8, 'test': 0.2})




Structure:

- 'train':
  - 'mri'
  - 'seg'
- 'test':
  - 'mri'
  - 'seg'
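
The proportions in train_val_test_dict can be pictured as a shuffled split of the items in each category. The sketch below illustrates that idea only; split_items is a hypothetical helper, and the way image_divvy actually shuffles and rounds is an assumption not confirmed here.

```python
import random


def split_items(items, proportions, seed=0):
    """Shuffle `items` and split them according to `proportions`.

    `proportions` maps a group name to a fraction; the fractions must sum to 1.
    """
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    names = list(proportions)
    groups, start = {}, 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            # The last group takes whatever remains, so rounding drops nothing.
            end = len(shuffled)
        else:
            end = start + int(round(proportions[name] * len(shuffled)))
        groups[name] = shuffled[start:end]
        start = end
    return groups


split = split_items(range(10), {'train': 0.8, 'test': 0.2})
print({name: len(members) for name, members in split.items()})
```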

In [19]:
train_mri, test_mri = train_test['train']['mri'], train_test['test']['mri']
train_seg, test_seg = train_test['train']['seg'], train_test['test']['seg']

One important thing to point out is that some of the image arrays returned will, in fact, be stacked arrays of images.

For example:


In [20]:
train_seg[10].shape


Out[20]:
(60, 256, 256)
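
Because single images and stacks can appear side by side, downstream code often needs to treat them uniformly. A minimal sketch, assuming every array is either a 2D grayscale image or a 3D stack whose first axis indexes the slice (as_slices is a hypothetical helper, and the data is synthetic):

```python
import numpy as np


def as_slices(arrays):
    """Yield every 2D image contained in a mix of 2D images and 3D stacks."""
    for arr in arrays:
        arr = np.asarray(arr)
        if arr.ndim == 2:
            yield arr
        elif arr.ndim == 3:
            # Iterating a 3D array walks its first axis, one slice at a time.
            for frame in arr:
                yield frame
        else:
            raise ValueError("expected a 2D or 3D array, got {0}D".format(arr.ndim))


# Synthetic data: one single image and one stack of 60 slices.
mixed = [np.zeros((256, 256)), np.zeros((60, 256, 256))]
flat = list(as_slices(mixed))
print(len(flat), flat[0].shape)
```

The assumption about the first axis matters: a color image stored as (height, width, channels) would be split incorrectly by this helper.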

Conclusion

Here we've explored how BioVida can be used to easily obtain and process data from the Cancer Imaging Archive database.

In the next tutorial, we'll investigate ways of managing and integrating the data cached by BioVida.