The Cancer Imaging Archive is a large repository of medical images of various forms of cancer. Programmatic web access is granted through a RESTful web API. However, this service requires an API key, which you can apply for by following the instructions found here. Typically, this process is quick.
Note: If it is heavily requested, BioVida may simply include an API key for this service in the future. At the moment however, you will have to follow the instructions linked to above to obtain one.
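For reference, the underlying web service can also be queried directly. The snippet below is only a minimal sketch of such a request; the endpoint, the getCollectionValues resource, and the format and api_key parameters are assumptions based on TCIA's public API documentation and are not part of BioVida itself.
import requests

# Hypothetical direct query against the TCIA REST API (assumed endpoint and parameters).
BASE = 'https://services.cancerimagingarchive.net/services/v4/TCIA/query'
resp = requests.get(BASE + '/getCollectionValues',
                    params={'format': 'json', 'api_key': 'YOUR_API_KEY_HERE'})
resp.raise_for_status()
print(resp.json()[:5])  # a few of the available collections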
BioVida provides an easy-to-use Python interface for this web API which, like the OpeniInterface class, is located in the images subpackage.
In [1]:
from biovida.images import CancerImageInterface
We can create an instance of the tool using an API key.
In [2]:
cii = CancerImageInterface(api_key=YOUR_API_KEY_HERE)
Note:
The number of studies (collections) provided here is somewhat smaller than the number listed on The Cancer Imaging Archive's website. This is because certain studies, such as those with restricted access, are excluded.
With this notice out of the way, let's go ahead and perform a search.
In [3]:
cii.search(cancer_type='breast')
Out[3]:
Next, we can easily download this data.
In [4]:
import numpy as np

def simplify_df(df):
    """This function simplifies dataframes
    for the purposes of this tutorial."""
    data_frame = df.copy()
    for c in ('cached_dicom_images_path', 'cached_images_path'):
        data_frame[c] = data_frame[c].map(
            lambda x: tuple(['path_to_image'] * len(x)) if isinstance(x, tuple) else x)
    return data_frame[0:5].replace({np.NaN: ''})
The code below will download the data in our search results, but with three noteworthy restrictions.
First, patient_limit=5 will limit the number of patients/subjects downloaded to the first 5.
Second, collections_limit will limit the number of collections downloaded to one (in this case, 'TCGA-COAD').
Third, session_limit=1 will limit the results returned to the first time the patient/subject was scanned, e.g., before surgical intervention to remove diseased tissue.
Additionally, the save_dicom parameter will enable us to save the raw DICOM image files that the Cancer Imaging Archive provides. By default, pull() only generates DICOM files. However, the save_png argument also gives you the option to convert the DICOM files to PNG images.
In [5]:
pull_df = cii.pull(patient_limit=5, collections_limit=1, session_limit=1)
Let's take a look at the data we've downloaded. We could view the pull_df object above, or the identical records_db attribute of cii, e.g., cii.records_db. However, both of those DataFrames contain several columns which are not typically relevant for every use of the data. So, instead, we can view an abbreviated DataFrame, records_db_short.
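If you are curious about exactly what is dropped, one quick way to compare the two is to inspect their columns directly (the column names themselves will vary with the BioVida version and the data downloaded):
# Compare the full and abbreviated record DataFrames.
print(len(cii.records_db.columns), len(cii.records_db_short.columns))
print(set(cii.records_db.columns) - set(cii.records_db_short.columns))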
In [6]:
simplify_df(cii.records_db_short)
Out[6]:
Notes:
The 'cached_dicom_images_path' and 'cached_images_path' columns refer to multiple images.
The 'image_count_converted_cache' column provides an account of how many images resulted from any given DICOM $\rightarrow$ PNG conversion.

The Cancer Imaging Archive stores images in a format known as Digital Imaging and Communications in Medicine (DICOM). If you have experience working with this file format, you can safely skip this section.
In Python, we can manipulate DICOM files using the pydicom library. This tool will allow us to extract the image data as ndarrays.
In [7]:
import dicom # in the future you will have to use `import pydicom as dicom`
We can also go ahead and import matplotlib to allow us to visualize the ndarrays we extract.
We can start by extracting a list of DICOMs from the images we downloaded above.
In [8]:
sample_dicoms = cii.records_db['cached_dicom_images_path'].iloc[1]
We can load these images in as ndarrays using the dicom (pydicom) library.
In [9]:
dicoms_arrs = [dicom.read_file(f).pixel_array for f in sample_dicoms]
DICOM represents imaging sessions with a tag known as SeriesInstanceUID. That is, the unique ID of the series. If multiple DICOM files/images share the same SeriesInstanceUID, it means they are part of the same 'series'.
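As a quick illustration, we could read this tag from each of the files we downloaded and confirm that they all belong to the same series. This is only a sketch; it assumes pydicom's attribute-style access to the SeriesInstanceUID tag.
# Read the SeriesInstanceUID tag from each downloaded DICOM file.
# If all files belong to the same series, this set will contain a single element.
series_uids = {dicom.read_file(f).SeriesInstanceUID for f in sample_dicoms}
print(len(series_uids), series_uids)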
If we get the length of dicoms_arrs, we see that multiple DICOM files share the same SeriesInstanceUID.
In [10]:
len(dicoms_arrs)
Out[10]:
In [11]:
# cii.records_db['series_instance_uid'].iloc[1]
This suggests that this particular series is either a 3D volume or a time series. We can therefore go ahead and stack these images on top of one another as a way of representing this relationship between the images.
In [12]:
stacked_dicoms = np.stack(dicoms_arrs)
If we check the shape of stacked_dicoms, we can see that we have indeed stacked 15 256x256 images on top of one another.
In [13]:
stacked_dicoms.shape
Out[13]:
We can also go ahead and define a small function which will enable us to visualize this 'stack' of images.
In [14]:
import matplotlib.pyplot as plt

def sample_stack(stack, rows=3, cols=5, start_with=0, show_every=1):
    """Function to display stacked ndarray.
    Source: https://www.raddq.com/dicom-processing-segmentation-visualization-in-python
    Note: this code has been slightly modified."""
    if rows * cols != stack.shape[0]:
        raise ValueError("The product of `rows` and `cols` does not equal number of images.")
    fig, ax = plt.subplots(rows, cols, figsize=[12, 12])
    for i in range(rows * cols):
        ind = start_with + i * show_every
        ax[int(i / cols), int(i % cols)].set_title('slice {0}'.format(str(ind + 1)))
        ax[int(i / cols), int(i % cols)].imshow(stack[ind], cmap='gray')
        ax[int(i / cols), int(i % cols)].axis('off')
    plt.show()
If you're curious to see what these images look like, you can uncomment the line below to view the stack of images.
In [15]:
# sample_stack(stacked_dicoms, rows=3)
Note:
Ordering DICOM images in space is tricky. Currently, this class uses a somewhat reliable, but far from ideal, means of ordering images.
Errors are possible.
For the Medical Imaging Folks:
Images in the ``'series_instance_uid'`` column are ordered against the ``InstanceNumber`` tag instead of actually working out the geometry required to sort the images in space. This is obviously not ideal because, among other reasons, ``InstanceNumber`` is a type 2 tag. Hopefully, in the future, this is something that will be improved.
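For the curious, a rough sketch of both approaches is given below: the first mirrors the InstanceNumber-based ordering described above, while the second sorts slices by their position projected onto the slice normal. Both snippets are illustrative only; they use standard pydicom tag names (InstanceNumber, ImagePositionPatient, ImageOrientationPatient) rather than BioVida internals, and assume those tags are populated.
# Re-read the downloaded files as pydicom datasets (illustrative only).
datasets = [dicom.read_file(f) for f in sample_dicoms]

# 1. Order by the InstanceNumber tag (as described above). InstanceNumber is a
#    type 2 tag, so it may be empty and need not reflect the true geometry.
by_instance = sorted(datasets, key=lambda d: int(d.InstanceNumber))

# 2. Order geometrically: project ImagePositionPatient onto the slice normal
#    derived from ImageOrientationPatient.
def slice_position(d):
    orientation = np.array(d.ImageOrientationPatient, dtype=float)
    normal = np.cross(orientation[:3], orientation[3:])
    return float(np.dot(normal, np.array(d.ImagePositionPatient, dtype=float)))

by_geometry = sorted(datasets, key=slice_position)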
Splitting images obtained from the Cancer Imaging Archive into training, validation and/or testing sets is nearly identical to doing so with an instance of the OpeniInterface class introduced in the prior tutorial. Accordingly, the instructions provided here will be condensed. If you would like more detail, please review this earlier tutorial.
First, we import the image_divvy tool.
In [16]:
from biovida.images import image_divvy
Next, we can define a 'divvy_rule'.
In [17]:
def my_divvy_rule(row):
    if row['modality_full'] == 'Magnetic Resonance Imaging (MRI)':
        return 'mri'
    if row['modality_full'] == 'Segmentation':
        return 'seg'
This rule will select only those images which are MRIs or segmentations. All other images will be excluded.
In [18]:
train_test = image_divvy(instance=cii,
                         divvy_rule=my_divvy_rule,
                         db_to_extract='records_db',
                         action='ndarray',
                         train_val_test_dict={'train': 0.8, 'test': 0.2})
In [19]:
train_mri, test_mri = train_test['train']['mri'], train_test['test']['mri']
train_seg, test_seg = train_test['train']['seg'], train_test['test']['seg']
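As a quick sanity check, we can compare the sizes of the resulting splits against the requested 80/20 proportions (the exact counts will of course depend on the data actually downloaded):
# Rough check of the train/test proportions for the MRI arrays.
n_train, n_test = len(train_mri), len(test_mri)
print(n_train, n_test, round(n_train / float(n_train + n_test), 2))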
One important thing to point out is that some of the image arrays returned will, in fact, be stacked arrays of images.
For example:
In [20]:
train_seg[10].shape
Out[20]:
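If your downstream code expects individual 2D images, one simple, illustrative approach is to separate the 2D arrays from the stacked 3D ones and, if desired, unstack the latter into slices:
# Separate single 2D images from stacked 3D volumes (illustrative only).
seg_2d = [a for a in train_seg if a.ndim == 2]
seg_3d = [a for a in train_seg if a.ndim == 3]

# Unstack 3D volumes into individual 2D slices and combine with the 2D images.
seg_slices = seg_2d + [s for volume in seg_3d for s in volume]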
Here we've explored how BioVida can be used to easily obtain and process data from the Cancer Imaging Archive database. In the next tutorial, we'll investigate ways of managing and integrating the data cached by BioVida.