Author: Brendan Crabb brendancrabb8388@pointloma.edu
Created August 1, 2017
Welcome to SlideSeg, a Python module that allows you to segment whole slide images into usable image chips for deep learning. Image masks for each chip are generated from associated markup and annotation files.
SlideSeg runs on Python 2.7 and depends on the following Python libraries:
SlideSeg and the necessary Python libraries can be installed using:
pip install slideseg
If pip isn't installed, you may have to enter the following before installing slideseg (OS X):
sudo easy_install pip
If you are using the preconfigured SlideSeg Anaconda environment, these dependencies will already be installed. SlideSeg also depends on several C libraries; see section 2.2 (Windows) and section 2.3 (Mac OS X) for installation instructions.
Make sure Anaconda is installed. The SlideSeg environment has an IPython kernel with all of the necessary packages already installed; however, conda support for Jupyter notebooks is needed to switch kernels. This support is available through conda itself and can be enabled by issuing the following command:
conda install nb_conda
Copy the environment_slideseg.yml file to the anaconda directory, .../anaconda/scripts/. In the same directory, issue the following command to create the anaconda environment from the file:
conda env create -f environment_slideseg.yml
Creating the environment might take a few minutes. Once finished, activate the environment. On Windows, issue:
activate SlideSeg
On Mac OS X, issue:
source activate SlideSeg
If the environment was activated successfully, you should see (SlideSeg) at the beginning of the command prompt. This will set the SlideSeg kernel as your default kernel when running Jupyter.
OpenSlide and OpenCV are C libraries; as a result, they have to be installed separately from the conda environment, which contains all of the Python dependencies.
The Windows Binaries for OpenSlide can be found at 'openslide.org/download/'. Download the appropriate binaries for your system (either 32-bit or 64-bit) and unzip the file.
Copy the .dll files in ../bin/ to .../Anaconda/envs/SlideSeg/Library/bin/.
Copy the .h files to .../Anaconda/envs/SlideSeg/include/.
Finally, copy the .lib file to .../Anaconda/envs/SlideSeg/libs/.
OpenSlide has now been installed.
Use the following tutorial to download OpenCV, either from prebuilt binaries or from source:
http://docs.opencv.org/3.2.0/d5/de5/tutorial_py_setup_in_windows.html
OpenSlide and OpenCV are C libraries; as a result, they have to be installed separately from the conda environment, which contains all of the Python dependencies.
If you are using Homebrew, enter the following in the terminal:
brew install openslide
brew install opencv
OpenSlide and OpenCV should now be installed in your anaconda environment.
The Jupyter Notebook App can be launched by clicking on the Jupyter Notebook icon installed by Anaconda in the start menu (Windows) or by typing in the terminal (cmd on Windows):
jupyter notebook
This will launch a new browser window showing the Notebook Dashboard. When started, the Jupyter Notebook app can only access files within its start-up folder. If you stored the SlideSeg notebook documents in a subfolder of your user folder, no configuration is necessary. Otherwise, you need to change your Jupyter Notebook App start-up folder.
To launch Jupyter Notebook App:
After launching the Jupyter Notebook App, navigate to the SlideSeg notebook and click on its name to open it in a new browser tab. In the upper right corner, you should see Python [conda env:SlideSeg]. If not, click on Kernel > Change Kernel and change your current kernel to Python [conda env:SlideSeg].
Copy all of the slide images into the images folder in the main project directory. Copy the markup and annotation files (in .xml format) into the xml folder in the main project directory. It is important that the annotation files have the same file name as the slide they are associated with.
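The same-name rule can be checked before a long run. The following is a minimal sketch, not part of the slideseg module: the helper name pair_slides_with_xml and the example filenames are hypothetical, and it only compares base names with the extension removed.

```python
import os


def pair_slides_with_xml(slide_names, xml_names):
    """Match each slide to the XML file that shares its base name.

    Hypothetical helper for illustration; slideseg does not provide it.
    """
    xml_by_stem = {os.path.splitext(name)[0]: name for name in xml_names}
    pairs, missing = {}, []
    for slide in sorted(slide_names):
        stem = os.path.splitext(slide)[0]
        if stem in xml_by_stem:
            pairs[slide] = xml_by_stem[stem]
        else:
            missing.append(slide)  # slide without an annotation file
    return pairs, missing


# Example filenames are made up; in practice pass os.listdir('images/')
# and os.listdir('xml/').
pairs, missing = pair_slides_with_xml(
    ["tumor_01.svs", "tumor_02.svs"], ["tumor_01.xml"])
print(pairs)
print(missing)
```

Any slide reported in missing has no annotation file with a matching name in the xml folder.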
SlideSeg can read virtual slides in the following formats:
SlideSeg can read annotations in the following formats:
SlideSeg depends on the following parameters:
slide_path: Path to the folder of slide images
xml_path: Path to the folder of xml files
output_dir: Path to the output folder where image_chips, image_masks, and text_files will be saved
format: Output format of the image_chips and image_masks (png or jpg only)
quality: Output quality: JPEG compression quality if output format is 'jpg' (100 recommended; jpg compression artifacts will distort image segmentation)
size: Size of image_chips and image_masks in pixels
overlap: Pixel overlap between image chips
key: The text file containing annotation keys and color codes
save_all: True saves every image_chip, False only saves chips containing an annotated pixel
save_ratio: Ratio of image_chips containing annotations to image_chips not containing annotations (use 'inf' if only annotated chips are desired; only applicable if save_all == False)
These parameters can be specified in the cell below.
In [ ]:
Parameters = {
    'slide_path': 'images/',
    'xml_path': 'xml/',
    'output_dir': 'output/',
    'format': 'jpg',
    'quality': 100,
    'size': 128,
    'overlap': 1,
    'key': 'Annotation_Key.txt',
    'save_all': False,
    'save_ratio': 'inf'
}
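To see how size and overlap interact, the sketch below computes where chips would start along one axis of a slide. It assumes that consecutive chips are spaced by (size - overlap) pixels and that partial chips at the edge are skipped; the function chip_origins is hypothetical, and the tiling that slideseg.getchips actually performs may differ.

```python
def chip_origins(length, size, overlap):
    """Return the start offsets of chips along one axis.

    Assumption: chips advance by (size - overlap) pixels, and a chip is
    only placed where it fits entirely inside the axis.
    """
    stride = size - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than size")
    last_start = max(length - size, 0)
    return list(range(0, last_start + 1, stride))


# A 512-pixel axis with the default size of 128 and overlap of 1.
origins = chip_origins(length=512, size=128, overlap=1)
print(origins)  # [0, 127, 254, 381]
```

Under this assumption, a larger overlap produces more chips per slide, since each chip advances by fewer pixels.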
The main directory should already contain an Annotation_Key.txt file. If no Annotation_Key file is present, one will be generated automatically from the annotation files in the xml folder.
The Annotation_Key file contains every annotation key with its associated color code. In all image masks, annotations with that key will have the specified pixel value. If an unknown key is encountered, it will be given a pixel value and added to the Annotation_Key automatically.
The following functions are defined within the slideseg module and used to generate, edit, and read the annotation key:
def loadkeys(annotation_key):
    """
    Opens annotation_key file and loads keys and color codes
    :param annotation_key: the filename of the annotation key
    :return: color codes
    """

def addkeys(annotation_key, key):
    """
    Adds new key and color_code to annotation key
    :param annotation_key: the filename of the annotation key
    :param key: The annotation to be added
    :return: updated annotation key file
    """

def writeannotations(annotation_key, annotations):
    """
    Writes annotation keys and color codes to annotation key text file
    :param annotation_key: filename of annotation key
    :param annotations: Dictionary of annotation keys and color codes
    :return: .txt file with annotation keys
    """

def generatekey(annotation_key, path):
    """
    Generates annotation_key from folder of xml files
    :param annotation_key: the name of the annotation key file
    :param path: Directory containing xml files
    :return: annotation_key file
    """
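The idea of giving an unknown key a pixel value automatically can be sketched in memory, without touching the key file. This is an illustration only: add_key is a hypothetical helper, and the choices of 0 as the background value and of the lowest unused value in 1..255 are assumptions, not a description of how slideseg.addkeys picks color codes.

```python
def add_key(annotations, key):
    """Give `key` the lowest unused pixel value in 1..255.

    Hypothetical helper; assumes 0 is reserved for unannotated pixels.
    `annotations` maps annotation key -> pixel value and is updated in place.
    """
    if key in annotations:
        return annotations[key]  # known keys keep their value
    used = set(annotations.values())
    for value in range(1, 256):
        if value not in used:
            annotations[key] = value
            return value
    raise ValueError("no free pixel values left")


annotations = {"tumor": 1, "stroma": 2}
new_value = add_key(annotations, "necrosis")
print(new_value)  # 3
```

Keeping the mapping stable across runs matters because every image mask encodes annotations by these pixel values.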
Use the cell below to import slideseg, as well as some other useful modules.
In [ ]:
import slideseg
import sys
import os
Run the cell below to display the Annotation Key. The cell first generates a new annotation key from the folder 'xml/' if no annotation key exists, then prints the key in the notebook.
In [ ]:
if not os.path.isfile('Annotation_Key.txt'):
    slideseg.generatekey('Annotation_Key.txt', 'xml/')
# Print the key line by line; the with-block closes the file afterwards
with open('Annotation_Key.txt', 'r') as key_file:
    for line in key_file:
        sys.stdout.write(line)
Every generated image chip will be saved in the output/image_chips folder. The chips are saved with the naming convention of slide filename_level number_row_column.format. If the chip contains an area that was annotated and the tags are enabled, it will have an associated tag (under the Subject category) with the annotation key. If the image chip does not contain annotations, the 'NONE' tag will be added. To view these tags, switch to details view and click display 'Subject' in the explorer. The files can be sorted according to their tags. Unfortunately, these tags will only be available if the output format is .jpg.
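The naming convention can be illustrated with a small sketch. The exact separators, the field order, and whether the slide's own extension is dropped are assumptions here; chip_name and parse_chip_name are hypothetical helpers, and the names actually produced by slideseg may differ in detail.

```python
import os


def chip_name(slide_filename, level, row, col, suffix):
    """Build a chip filename as slide_level_row_column.suffix (assumed layout)."""
    stem = os.path.splitext(slide_filename)[0]
    return "{0}_{1}_{2}_{3}.{4}".format(stem, level, row, col, suffix)


def parse_chip_name(name):
    """Recover (slide stem, level, row, col) from a name built by chip_name."""
    stem_and_fields = os.path.splitext(name)[0]
    # Split from the right so underscores inside the slide name are kept.
    stem, level, row, col = stem_and_fields.rsplit("_", 3)
    return stem, int(level), int(row), int(col)


name = chip_name("tumor_01.svs", 0, 256, 512, "jpg")
print(name)                   # tumor_01_0_256_512.jpg
print(parse_chip_name(name))  # ('tumor_01', 0, 256, 512)
```

Parsing from the right is what lets a slide name containing underscores round-trip correctly.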
The following functions are defined in the slideseg module and are used to save both the image chips and image masks, as well as to attach EXIF metadata to the images:
def ensuredirectory(dest):
    """
    Ensures the existence of a directory
    :param dest: Directory to ensure.
    :return: new directory if it did not previously exist.
    """

def attachtags(path, keys):
    """
    Attaches image tags to metadata of chips and masks
    :param path: file to attach tags to.
    :param keys: keys to attach as tags
    :return: JPG with metadata tags
    """

def savechip(chip, path, quality, keys):
    """
    Saves the image chip
    :param chip: the slide image chip to save
    :param path: the full path to the chip
    :param quality: the output quality
    :param keys: keys associated with the chip
    :return:
    """

def savemask(mask, path, keys):
    """
    Saves the image masks
    :param mask: the image mask to save
    :param path: the complete path for the mask
    :param keys: keys associated with the chip
    :return:
    """

def checksave(save_all, pix_list, save_ratio, save_count_annotated, save_count_blank):
    """
    Checks whether or not an image chip should be saved
    :param save_all: (bool) saves all chips if true
    :param pix_list: list of pixel values in image mask
    :param save_ratio: ratio of annotated chips to unannotated chips
    :param save_count_annotated: total annotated chips saved
    :param save_count_blank: total blank chips saved
    :return: bool
    """

def formatcheck(format):
    """
    Assures correct format parameter was defined correctly
    :param format: the output format parameter
    :return: format
    :return: suffix
    """
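One way the save_all and save_ratio parameters could drive the save decision is sketched below. This is not the slideseg.checksave implementation: should_save is a hypothetical function, and it assumes that a pixel value of 0 means "not annotated", that save_ratio is positive, and that blank chips are saved only while they stay within the requested ratio.

```python
def should_save(save_all, pix_list, save_ratio,
                save_count_annotated, save_count_blank):
    """Decide whether a chip should be saved (illustrative logic only).

    Assumptions: 0 marks unannotated pixels and save_ratio > 0.
    """
    if save_all:
        return True
    if any(value != 0 for value in pix_list):
        return True  # chip contains at least one annotated pixel
    ratio = float(save_ratio)  # float('inf') also accepts the string 'inf'
    # Blank chips allowed so far: annotated / ratio (0 when ratio is inf).
    return save_count_blank < save_count_annotated / ratio


print(should_save(False, [0, 0, 3], 'inf', 0, 0))  # True: annotated chip
print(should_save(False, [0, 0, 0], 'inf', 10, 0))  # False: blanks never saved
print(should_save(False, [0, 0, 0], 2, 4, 1))       # True: 1 < 4 / 2
```

With save_ratio set to 'inf' the allowance for blank chips is always zero, which matches the documented meaning of saving only annotated chips.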
The main functionality of SlideSeg is performed by the following functions. These functions take all of the inputs specified in Parameters and use them to generate image chips and image masks.
def openwholeslide(path):
    """
    Opens a whole slide image
    :param path: Slide image path.
    :return: slide image, levels, and dimensions
    """

def curatemask(mask, scale_width, scale_height, chip_size):
    """
    Resize and pad annotation mask if necessary
    :param mask: an image mask
    :param scale_width: scaling for higher magnification levels
    :param scale_height: scaling for higher magnification levels
    :return: curated annotation mask
    """

def getchips(levels, dims, chip_size, overlap, mask, annotations, filename, suffix, save_all, save_ratio):
    """
    Finds chip locations that should be loaded and saved
    :param levels: levels in whole slide image
    :param dims: dimension of whole slide image
    :param chip_size: the size of the image chips
    :param overlap: overlap between image chips (stride)
    :param mask: annotation mask for slide image
    :param annotations: dictionary of annotations in image
    :param filename: slide image filename
    :param suffix: output format for saving.
    :param save_all: whether or not to save every image chip (bool)
    :param save_ratio: ratio of annotated to unannotated chips (float)
    :return: chip_dict. Dictionary of chip names, level, col, row, and scale
    :return: image_dict. Dictionary of annotations and chips with those annotations
    """

def run(parameters, filename):
    """
    Runs SlideSeg: Generates image chips from a whole slide image.
    :param parameters: specified in Parameters.txt file
    :param filename: filename of whole slide image
    :return: image chips and masks.
    """
An image mask for each image chip is saved in the output/image_masks folder. The mask has the same name as the image chip it is associated with. Furthermore, these masks will have the same tags, allowing you to sort by annotation type.
The following function handles the generation of an annotation mask from xml files:
def makemask(annotation_key, size, xml_path):
    """
    Reads xml file and makes annotation mask for entire slide image
    :param annotation_key: name of the annotation key file
    :param size: size of the whole slide image
    :param xml_path: path to the xml file
    :return: annotation mask
    :return: dictionary of annotation keys and color codes
    """
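The first step of building a mask, reading annotated regions out of an xml file, might look like the sketch below. The XML layout used here (Region elements with a Text attribute and Vertex children carrying X and Y attributes) is an assumption chosen for illustration, as are SAMPLE_XML and read_regions; the formats that makemask actually accepts are not spelled out in this document. Rasterizing the polygons into a mask is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation file content; the element and attribute names
# are assumptions, not a documented slideseg format.
SAMPLE_XML = """
<Annotations>
  <Annotation>
    <Regions>
      <Region Text="tumor">
        <Vertices>
          <Vertex X="10" Y="20"/>
          <Vertex X="30" Y="20"/>
          <Vertex X="30" Y="40"/>
        </Vertices>
      </Region>
    </Regions>
  </Annotation>
</Annotations>
"""


def read_regions(xml_text):
    """Return a dict mapping annotation key -> list of polygons.

    Each polygon is a list of (x, y) integer pixel coordinates.
    """
    root = ET.fromstring(xml_text.strip())
    regions = {}
    for region in root.iter("Region"):
        key = region.get("Text", "NONE")
        polygon = [(int(float(v.get("X"))), int(float(v.get("Y"))))
                   for v in region.iter("Vertex")]
        regions.setdefault(key, []).append(polygon)
    return regions


regions = read_regions(SAMPLE_XML)
print(regions)
```

Each polygon could then be filled into a slide-sized array using the pixel value that the Annotation_Key assigns to its key.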
A text file with details about annotations and image chips will also be saved to output/textfiles. For each slide image, this text file will contain a list of all annotation keys present in the image. For each annotation key, a list of every image chip/mask containing that specific key is also recorded in this file.
The following functions generate these .txt files:
def writekeys(filename, annotations):
    """
    Writes each annotation key to the output text file
    :param filename: filename of image chip
    :param annotations: dictionary of annotation keys
    :return: updated text file
    """

def writeimagelist(filename, image_dictionary):
    """
    Writes list of images containing each annotation key
    :param filename: the name of the slide image
    :param image_dictionary: dictionary of images with each key
    :return text
    """
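The per-key image list can be pictured as an inverted index from annotation key to chip names. The sketch below builds such an index and formats it as text; build_image_dict, format_report, and the output layout are hypothetical and only meant to illustrate the idea, not to reproduce the files that slideseg writes.

```python
def build_image_dict(chip_keys):
    """Invert a mapping of chip filename -> annotation keys.

    Returns a dict mapping annotation key -> sorted list of chip filenames.
    """
    image_dict = {}
    for chip in sorted(chip_keys):
        for key in chip_keys[chip]:
            image_dict.setdefault(key, []).append(chip)
    return image_dict


def format_report(slide_name, image_dict):
    """Render one line per annotation key (layout is an assumption)."""
    lines = [slide_name]
    for key in sorted(image_dict):
        lines.append("{0}: {1}".format(key, ", ".join(image_dict[key])))
    return "\n".join(lines)


# Example chip names and keys are made up.
chip_keys = {
    "slide_0_0_0.jpg": ["tumor"],
    "slide_0_0_128.jpg": ["tumor", "stroma"],
}
image_dict = build_image_dict(chip_keys)
print(format_report("slide.svs", image_dict))
```

Such an index makes it straightforward to collect every chip that contains a given annotation when assembling a training set.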
To execute SlideSeg, simply run the jupyter notebook cells below. Alternatively, you can run the python script 'main.py'. Make sure that you defined the Parameters above. If the python script is used, the parameters are specified in the Parameters.txt file.
To get started, run the cell below to make sure all of the necessary modules are imported.
In [ ]:
import slideseg
import tqdm
import os
The following cell defines a function run(parameters, filename) that generates image chips and masks from the slide image and xml file specified by filename. This function uses the slideseg module to open the slide image, generate an annotation mask, find regions of interest, and save chip data. This function is also defined within the module as slideseg.run(parameters, filename).
In [ ]:
def run(parameters, filename):
    """
    Runs SlideSeg: Generates image chips from a whole slide image.
    :param parameters: specified in Parameters.txt file
    :param filename: filename of whole slide image
    :return: image chips and masks.
    """
    # Define variables
    _slide_path = parameters["slide_path"]
    _xml_path = parameters["xml_path"]
    _output_dir = parameters["output_dir"]
    _format = parameters["format"]
    _quality = int(parameters["quality"])
    _chip_size = int(parameters["size"])
    _overlap = int(parameters["overlap"])
    _key = parameters["key"]
    _save_all = parameters["save_all"]
    _save_ratio = parameters["save_ratio"]

    # Open slide
    _osr, _levels, _dims = slideseg.openwholeslide('{0}{1}'.format(_slide_path, filename))
    _size = (int(_dims[0][0]), int(_dims[0][1]))

    # Annotation Mask
    # Swap the slide extension for ".xml". str.rstrip(".svs") strips any trailing
    # '.', 's', or 'v' characters, so "lens.svs" would become "len".
    xml_file = os.path.splitext(filename)[0] + ".xml"
    print('loading annotation data from {0}/{1}'.format(_xml_path, xml_file))
    _mask, _annotations = slideseg.makemask(_key, _size, '{0}{1}'.format(_xml_path, xml_file))

    # Define output directory
    output_directory_chip = '{0}image_chips/'.format(_output_dir)
    output_directory_mask = '{0}image_mask/'.format(_output_dir)

    # Output formatting check
    _format, _suffix = slideseg.formatcheck(_format)

    # Find chip data/locations to be saved
    chip_dictionary, image_dict = slideseg.getchips(_levels, _dims, _chip_size, _overlap,
                                                    _mask, _annotations, filename, _suffix,
                                                    _save_all, _save_ratio)

    # Save chips and masks (iteritems() is Python 2.7; use items() on Python 3)
    print('Saving chips... {0} total chips'.format(len(chip_dictionary)))
    for chip_name, value in tqdm.tqdm(chip_dictionary.iteritems()):
        keys = value[0]
        i = value[1]
        col = value[2]
        row = value[3]
        scale_factor_width = value[4]
        scale_factor_height = value[5]

        # load chip region from slide image
        img = _osr.read_region([int(col * scale_factor_width), int(row * scale_factor_height)], i,
                               [_chip_size, _chip_size]).convert('RGB')

        # load image mask and curate
        img_mask = _mask[int(row * scale_factor_height):int((row + _chip_size) * scale_factor_height),
                         int(col * scale_factor_width):int((col + _chip_size) * scale_factor_width)]
        img_mask = slideseg.curatemask(img_mask, scale_factor_width, scale_factor_height, _chip_size)

        # save the image chip and image mask
        _path_chip = output_directory_chip + chip_name
        _path_mask = output_directory_mask + chip_name
        slideseg.savechip(img, _path_chip, _quality, keys)
        slideseg.savemask(img_mask, _path_mask, keys)

    # Make text output of Annotation Data
    print('Updating txt file details...')
    slideseg.writekeys(xml_file, _annotations)
    slideseg.writeimagelist(xml_file, image_dict)
    print('txt file details updated')
Now that we have imported the necessary modules and defined the function run(), we can execute SlideSeg by running the cell below, which simply passes the parameter and filename information to run().
In [ ]:
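print('running __main__ with parameters: {0}'.format(Parameters))
if not os.path.isdir(Parameters["slide_path"]):
    # slide_path points at a single slide: split it into folder and filename
    path, filename = os.path.split(Parameters["slide_path"])
    xpath, xml_filename = os.path.split(Parameters["xml_path"])
    # run() concatenates folder and filename, so keep a trailing separator
    Parameters["slide_path"] = os.path.join(path, '')
    Parameters["xml_path"] = os.path.join(xpath, '')
    print('loading {0}'.format(filename))
    run(Parameters, filename)
else:
    # slide_path is a folder: process every slide it contains
    for filename in os.listdir(Parameters["slide_path"]):
        run(Parameters, filename)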