Overview:
This notebook describes how to merge annotations generated by tiled analysis of a whole-slide image. Since tiled analysis is carried out on small tiles, the annotations produced by image segmentation algorithms will be disjoint at the tile boundaries, prohibiting analysis of large structures that span multiple tiles.
The example presented below addresses the case where the annotations are stored in an array format that preserves the spatial organization of tiles. This scenario arises when iterating through the columns and rows of a tiled representation of a whole-slide image. Analysis of this organized array format is faster and preferred, since the interfaces where annotations need to be merged are known. In cases where the annotations to be merged do not come from tiled analysis, or where the tile results are not organized, an alternative method based on R-trees provides a slightly slower solution.
This extends some of the work described in Amgad et al., 2019:
Mohamed Amgad, Habiba Elfandy, Hagar Hussein, ..., Jonathan Beezley, Deepak R Chittajallu, David Manthey, David A Gutman, Lee A D Cooper, Structured crowdsourcing enables convolutional segmentation of histology images, Bioinformatics, 2019, btz083
This is a sample result:
Implementation summary
In the tiled array approach the tiles must be rectangular and unrotated. The algorithm used merges polygons in coordinate space so that almost-arbitrarily large structures can be handled without encountering memory issues. The algorithm works as follows:
Extract contours from the given masks using functionality from masks_to_annotations_handler.py, making sure to account for the contour offset so that all coordinates are relative to the whole-slide image frame.
Identify contours that touch tile interfaces.
Identify shared edges between tiles.
For each shared edge, find contours that neighbor each other (using bounding box location) and verify, using shapely, that they should be paired (see the sketch after this list).
Using 4-connectivity, link all pairs of contours that are to be merged.
Use morphologic processing to dilate and fill gaps in the linked pairs and then erode to generate the final merged contour.
These initial steps ensure that the number of comparisons made is much smaller than n^2. This is important because algorithm complexity plays a key role: whole-slide images may contain tens of thousands of annotated structures.
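As a rough illustration of the shapely-based pairing check mentioned above, the sketch below tests whether two contours that abut a shared tile edge belong to the same structure by buffering their polygons slightly and checking for intersection. The function name, buffer size, and toy coordinates are illustrative only and are not the library's internal implementation.
# Illustrative sketch only -- NOT Polygon_merger's internal code.
# Two contours that abut the same tile edge are only merged if their
# polygons actually touch once buffered slightly to close the one-pixel
# gap left at the tile boundary.
from shapely.geometry import Polygon

def should_pair(coords_a, coords_b, buffer_px=1):
    """Return True if two contours (lists of (x, y) vertices in the
    whole-slide frame) appear to form one structure across a tile seam."""
    poly_a = Polygon(coords_a).buffer(buffer_px)
    poly_b = Polygon(coords_b).buffer(buffer_px)
    return poly_a.intersects(poly_b)

# Two squares separated only by a one-pixel tile seam are paired ...
left_contour = [(0, 0), (255, 0), (255, 255), (0, 255)]
right_contour = [(256, 0), (511, 0), (511, 255), (256, 255)]
print(should_pair(left_contour, right_contour))  # True
# ... while a distant contour is not.
far_contour = [(600, 600), (700, 600), (700, 700), (600, 700)]
print(should_pair(left_contour, far_contour))    # False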
Where to look?
histomicstk/
|_annotations_and_masks/
|_polygon_merger.py
|_tests/
|_ test_polygon_merger.py
|_ test_annotations_to_masks_handler.py
In [1]:
import os

import girder_client
from pandas import read_csv

from histomicstk.annotations_and_masks.polygon_merger import Polygon_merger
from histomicstk.annotations_and_masks.masks_to_annotations_handler import (
    get_annotation_documents_from_contours)

CWD = os.getcwd()
In [2]:
APIURL = 'http://candygram.neurology.emory.edu:8080/api/v1/'
SAMPLE_SLIDE_ID = '5d586d76bd4404c6b1f286ae'
gc = girder_client.GirderClient(apiUrl=APIURL)
gc.authenticate(interactive=True)
# gc.authenticate(apiKey='kri19nTIGOkWH01TbzRqfohaaDWb6kPecRqGmemb')
# read GTCodes dataframe
PTESTS_PATH = os.path.join(CWD, '..', '..', 'tests')
GTCODE_PATH = os.path.join(PTESTS_PATH, 'test_files', 'sample_GTcodes.csv')
GTCodes_df = read_csv(GTCODE_PATH)
GTCodes_df.index = GTCodes_df.loc[:, 'group']
# This is where masks for adjacent rois are saved
MASK_LOADPATH = os.path.join(
    PTESTS_PATH, 'test_files', 'annotations_and_masks', 'polygon_merger_roi_masks')
maskpaths = [
    os.path.join(MASK_LOADPATH, j) for j in os.listdir(MASK_LOADPATH)
    if j.endswith('.png')]
In [3]:
print(Polygon_merger.__doc__)
In [4]:
print(Polygon_merger.__init__.__doc__)
In [5]:
print(Polygon_merger.run.__doc__)
This contains the ground truth codes and information dataframe. This is a dataframe that is indexed by the annotation group name and has the following columns:

- group: group name of annotation (string), e.g. "mostly_tumor"
- GT_code: int, desired ground truth code (in the mask). Pixels of this value belong to the corresponding group (class)
- color: str, rgb format, e.g. rgb(255,0,0)

NOTE: Zero pixels have special meaning and do NOT encode a specific ground truth class. Instead, they simply mean 'Outside ROI' and should be IGNORED during model training or evaluation.
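To make this structure concrete, here is a minimal hand-built example of such a dataframe. It is for illustration only: the group names and codes below are arbitrary, and the bundled sample_GTcodes.csv read above may contain additional columns beyond the three described here.
# Minimal illustrative GTcodes dataframe (not the bundled CSV)
from pandas import DataFrame

GTCodes_example = DataFrame({
    'group': ['mostly_tumor', 'mostly_stroma'],
    'GT_code': [1, 2],                          # pixel values in the mask
    'color': ['rgb(255,0,0)', 'rgb(0,255,0)'],  # display color per group
})
# index by group name, as done for the CSV version above
GTCodes_example.index = GTCodes_example.loc[:, 'group']
print(GTCodes_example)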
In [6]:
GTCodes_df.head()
Out[6]:
In [7]:
[os.path.split(j)[1] for j in maskpaths[:5]]
Out[7]:
Note that the patterns _left-123_ and _top-123_ are assumed to encode the x and y offsets of the mask at base magnification. If you prefer some other convention, you will need to manually provide the roi_offsets parameter to the method Polygon_merger.set_roi_bboxes.
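For illustration, the snippet below shows one way such offsets could be parsed out of a mask file name; the helper parse_offsets and the file name used here are made up for this example and are not part of HistomicsTK.
# Hypothetical helper -- not part of HistomicsTK.
# The integers following "_left-" and "_top-" in the mask file name give
# the x and y offsets of that mask at base magnification.
import re

def parse_offsets(maskname):
    """Extract (left, top) offsets from a mask file name."""
    left = int(re.search(r'_left-(\d+)_', maskname).group(1))
    top = int(re.search(r'_top-(\d+)_', maskname).group(1))
    return left, top

print(parse_offsets('slide-xyz_left-15232_top-25056_mask.png'))
# (15232, 25056)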
In [8]:
print(Polygon_merger.set_roi_bboxes.__doc__)
In [9]:
pm = Polygon_merger(
    maskpaths=maskpaths, GTCodes_df=GTCodes_df,
    discard_nonenclosed_background=True, verbose=1,
    monitorPrefix='test')
contours_df = pm.run()
In [10]:
contours_df.head()
Out[10]:
In [11]:
# delete existing annotations in target slide (if any)
existing_annotations = gc.get('/annotation/item/' + SAMPLE_SLIDE_ID)
for ann in existing_annotations:
    gc.delete('/annotation/%s' % ann['_id'])

# get list of annotation documents
annotation_docs = get_annotation_documents_from_contours(
    contours_df.copy(), separate_docs_by_group=True,
    docnamePrefix='test',
    verbose=False, monitorPrefix=SAMPLE_SLIDE_ID + ": annotation docs")

# post annotations to slide -- make sure it posts without errors
for annotation_doc in annotation_docs:
    resp = gc.post(
        "/annotation?itemId=" + SAMPLE_SLIDE_ID, json=annotation_doc)
Now you can go to HistomicsUI and confirm that the posted annotations make sense and correspond to tissue boundaries and expected labels.