The Cancer Imaging Archive is a large repository of medical images of various forms of cancer. Programmatic web access is granted through a RESTful web API. However, this service requires an API key, which you can apply for by following the instructions found here. Typically, this process is quick.
Note: If it is heavily requested, BioVida may simply include an API key for this service in the future. At the moment however, you will have to follow the instructions linked to above to obtain one.
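For reference, the underlying web service can also be queried directly. The snippet below is only a minimal sketch of such a request; the endpoint, the getCollectionValues resource, and the format and api_key parameters are assumptions based on TCIA's public API documentation and are not part of BioVida itself.
import requests

# Hypothetical direct query against the TCIA REST API (assumed endpoint and parameters).
BASE = 'https://services.cancerimagingarchive.net/services/v4/TCIA/query'
resp = requests.get(BASE + '/getCollectionValues',
                    params={'format': 'json', 'api_key': 'YOUR_API_KEY_HERE'})
resp.raise_for_status()
print(resp.json()[:5])  # a few of the available collections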
BioVida provides an easy-to-use Python interface for this web API which, like the OpeniInterface class, is located in the images subpackage.
In [1]:
from biovida.images import CancerImageInterface
We can create an instance of the tool using an API key.
In [2]:
cii = CancerImageInterface(api_key=YOUR_API_KEY_HERE)
Note:
The number of studies (collections) provided here is somewhat smaller than the number listed on The Cancer Imaging Archive's website. This is because certain studies, such as those with restricted access, are excluded.
With this notice out of the way, let's go ahead and perform a search.
In [3]:
cii.search(cancer_type='breast')
Out[3]:
Next, we can easily download this data.
In [4]:
import numpy as np

def simplify_df(df):
    """This function simplifies dataframes
    for the purposes of this tutorial."""
    data_frame = df.copy()
    for c in ('cached_dicom_images_path', 'cached_images_path'):
        data_frame[c] = data_frame[c].map(
            lambda x: tuple(['path_to_image'] * len(x)) if isinstance(x, tuple) else x)
    return data_frame[0:5].replace({np.NaN: ''})
The code below will download the data in our search results, but with three noteworthy restrictions.
First, patient_limit=5 will limit the number of patients/subjects downloaded to the first 5.
Second, collections_limit will limit the number of collections downloaded to one (in this case, 'TCGA-COAD').
Third, session_limit=1 will limit the results returned to the first time the patient/subject was scanned, e.g., before surgical intervention to remove diseased tissue.
Additionally, the save_dicom parameter will enable us to save the raw DICOM image files that the Cancer Imaging Archive provides. By default, pull() only generates DICOM files. However, the save_png argument also gives you the option to convert the DICOM files to PNG images.
In [5]:
pull_df = cii.pull(patient_limit=5, collections_limit=1, session_limit=1)
Let's take a look at the data we've downloaded. We could view the pull_df object above, or the identical records_db attribute of cii, e.g., cii.records_db. However, both of those DataFrames contain several columns which are not typically relevant for every use of the data. So, instead, we can view an abbreviated DataFrame, records_db_short.
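If you are curious about exactly what is dropped, one quick way to compare the two is to inspect their columns directly (the column names themselves will vary with the BioVida version and the data downloaded):
# Compare the full and abbreviated record DataFrames.
print(len(cii.records_db.columns), len(cii.records_db_short.columns))
print(set(cii.records_db.columns) - set(cii.records_db_short.columns))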
In [6]:
simplify_df(cii.records_db_short)
Out[6]:
Notes:
The 'cached_dicom_images_path' and 'cached_images_path' columns refer to multiple images.
The 'image_count_converted_cache' column provides an account of how many images resulted from any given DICOM $\rightarrow$ PNG conversion.

The Cancer Imaging Archive stores images in a format known as Digital Imaging and Communications in Medicine (DICOM). If you have experience working with this file format, you can safely skip this section.
In Python, we can manipulate DICOM files using the pydicom library. This tool will allow us to extract the image data as ndarrays.
In [7]:
import dicom # in the future you will have to use `import pydicom as dicom`
We can also go ahead and import matplotlib to allow us to visualize the ndarrays we extract.
We can start by extracting a list of DICOMs from the images we downloaded above.
In [8]:
sample_dicoms = cii.records_db['cached_dicom_images_path'].iloc[1]
We can load these images in as ndarrays using the dicom (pydicom) library.
In [9]:
dicoms_arrs = [dicom.read_file(f).pixel_array for f in sample_dicoms]
DICOM represents imaging sessions with a tag known as SeriesInstanceUID. That is, the unique ID of the series. If multiple DICOM files/images share the same SeriesInstanceUID, it means they are part of the same 'series'.
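As a quick illustration, we could read this tag from each of the files we downloaded and confirm that they all belong to the same series. This is only a sketch; it assumes pydicom's attribute-style access to the SeriesInstanceUID tag.
# Read the SeriesInstanceUID tag from each downloaded DICOM file.
# If all files belong to the same series, this set will contain a single element.
series_uids = {dicom.read_file(f).SeriesInstanceUID for f in sample_dicoms}
print(len(series_uids), series_uids)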
If we get the length of dicoms_arrs, we see that multiple DICOM files share the same SeriesInstanceUID.
In [10]:
len(dicoms_arrs)
Out[10]:
In [11]:
# cii.records_db['series_instance_uid'].iloc[1]
This suggests that this particular series is either a 3D volume or a time series. We can therefore go ahead and stack these images on top of one another as a way of representing this relationship between the images.
In [12]:
stacked_dicoms = np.stack(dicoms_arrs)
If we check the shape of stacked_dicoms, we can see that we have indeed stacked 15 256x256 images on top of one another.
In [13]:
stacked_dicoms.shape
Out[13]:
We can also go ahead and define a small function which will enable us to visualize this 'stack' of images.
In [14]:
import matplotlib.pyplot as plt

def sample_stack(stack, rows=3, cols=5, start_with=0, show_every=1):
    """Function to display stacked ndarray.
    Source: https://www.raddq.com/dicom-processing-segmentation-visualization-in-python
    Note: this code has been slightly modified."""
    if rows * cols != stack.shape[0]:
        raise ValueError("The product of `rows` and `cols` does not equal number of images.")
    fig, ax = plt.subplots(rows, cols, figsize=[12, 12])
    for i in range(rows * cols):
        ind = start_with + i * show_every
        ax[int(i / cols), int(i % cols)].set_title('slice {0}'.format(str(ind + 1)))
        ax[int(i / cols), int(i % cols)].imshow(stack[ind], cmap='gray')
        ax[int(i / cols), int(i % cols)].axis('off')
    plt.show()
If you're curious to see what these images look like, you can uncomment the line below to view the stack of images.
In [15]:
# sample_stack(stacked_dicoms, rows=3)
Note:
Ordering DICOM images in space is tricky. Currently, this class uses a somewhat reliable, but far from ideal, means of ordering images.
Errors are possible.
For the Medical Imaging Folks:
Images in the ``'series_instance_uid'`` column are ordered against the ``InstanceNumber`` tag instead of actually working out the geometry required to sort the images in space. This is obviously not ideal because, among other reasons, ``InstanceNumber`` is a type 2 tag. Hopefully, in the future, this is something that will be improved.
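For the curious, a rough sketch of both approaches is given below: the first mirrors the InstanceNumber-based ordering described above, while the second sorts slices by their position projected onto the slice normal. Both snippets are illustrative only; they use standard pydicom tag names (InstanceNumber, ImagePositionPatient, ImageOrientationPatient) rather than BioVida internals, and assume those tags are populated.
# Re-read the downloaded files as pydicom datasets (illustrative only).
datasets = [dicom.read_file(f) for f in sample_dicoms]

# 1. Order by the InstanceNumber tag (as described above). InstanceNumber is a
#    type 2 tag, so it may be empty and need not reflect the true geometry.
by_instance = sorted(datasets, key=lambda d: int(d.InstanceNumber))

# 2. Order geometrically: project ImagePositionPatient onto the slice normal
#    derived from ImageOrientationPatient.
def slice_position(d):
    orientation = np.array(d.ImageOrientationPatient, dtype=float)
    normal = np.cross(orientation[:3], orientation[3:])
    return float(np.dot(normal, np.array(d.ImagePositionPatient, dtype=float)))

by_geometry = sorted(datasets, key=slice_position)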
Splitting images obtained from the Cancer Imaging Archive into training, validation and/or testing sets is nearly identical to doing so with an instance of the OpeniInterface class introduced in the prior tutorial. Accordingly, the instructions provided here will be condensed. If you would like more detail, please review this earlier tutorial.
First, we import the image_divvy tool.
In [16]:
from biovida.images import image_divvy
Next, we can define a 'divvy_rule'.
In [17]:
def my_divvy_rule(row):
    if row['modality_full'] == 'Magnetic Resonance Imaging (MRI)':
        return 'mri'
    if row['modality_full'] == 'Segmentation':
        return 'seg'
This rule will select only those images which are MRIs or segmentations. All other images will be excluded.
In [18]:
train_test = image_divvy(instance=cii,
                         divvy_rule=my_divvy_rule,
                         db_to_extract='records_db',
                         action='ndarray',
                         train_val_test_dict={'train': 0.8, 'test': 0.2})
In [19]:
train_mri, test_mri = train_test['train']['mri'], train_test['test']['mri']
train_seg, test_seg = train_test['train']['seg'], train_test['test']['seg']
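As a quick sanity check, we can compare the sizes of the resulting splits against the requested 80/20 proportions (the exact counts will of course depend on the data actually downloaded):
# Rough check of the train/test proportions for the MRI arrays.
n_train, n_test = len(train_mri), len(test_mri)
print(n_train, n_test, round(n_train / float(n_train + n_test), 2))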
One important thing to point out is that some of the image arrays returned will, in fact, be stacked arrays of images.
For example:
In [20]:
train_seg[10].shape
Out[20]:
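If your downstream code expects individual 2D images, one simple, illustrative approach is to separate the 2D arrays from the stacked 3D ones and, if desired, unstack the latter into slices:
# Separate single 2D images from stacked 3D volumes (illustrative only).
seg_2d = [a for a in train_seg if a.ndim == 2]
seg_3d = [a for a in train_seg if a.ndim == 3]

# Unstack 3D volumes into individual 2D slices and combine with the 2D images.
seg_slices = seg_2d + [s for volume in seg_3d for s in volume]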
Here we've explored how BioVida can be used to easily obtain and process data from the Cancer Imaging Archive database. In the next tutorial, we'll investigate ways of managing and integrating the data cached by BioVida.