Positive pixel count is a routine that classifies each pixel by its position in the HSI (hue, saturation, intensity) color space and computes statistics based on this classification.
HistomicsTK has two main functions for positive pixel counting, count_slide and count_image,
both of which live in the histomicstk.segmentation.positive_pixel_count module
(imported in this notebook as simply ppc). count_image operates on an in-memory image
in a format compatible with NumPy's ndarray type, and returns both the classification statistics
and a label image that may be useful for visualization or further analysis.
count_slide accepts a path to an image instead, and while it can also carry out
exactly the same computation as count_image, its advantage lies in its ability to distribute
its computation using Dask and to operate on images
too large to fit in memory. This ability comes at a cost – to enable it, the generation of
the label image must be disabled. (For the curious, the necessary underlying support for writing
large images a tile at a time is lacking.) The HistomicsTK CLI PositivePixelCount is a wrapper
around this function.
The rest of this example is subdivided, in order, into count_image and count_slide examples, followed by a CLI example.
In [1]:
# Configuration and imports of other libraries
from __future__ import print_function
import large_image
%matplotlib inline
import matplotlib.pyplot as plt
import skimage.io
# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 15, 15
plt.rcParams['image.cmap'] = 'gray'
# Import and alias positive_pixel_count
import histomicstk.segmentation.positive_pixel_count as ppc
In [3]:
image_url = ('https://data.kitware.com/api/v1/file/'
             '598b71ee8d777f7d33e9c1d4/download')  # DAB.png
im_input = skimage.io.imread(image_url)
print('Input image')
plt.imshow(im_input)
plt.show()
With a helper function, it becomes reasonable to use count_image interactively
to explore parameter values. A more advanced analysis might also look at the HSI values directly.
Here, we show briefly how one can use the helper function count_and_label defined below.
In [4]:
def count_and_label(params):
    """Compute the label image with count_image, and then display it."""
    label_image = ppc.count_image(im_input, params)[1]
    plt.imshow(label_image)
    plt.show()
To start out, we'll pick a set of parameters. The HSI color space used by the routines is defined to use values in the range [0,1].
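For readers unfamiliar with HSI, the following is a minimal sketch of one common RGB-to-HSI conversion, scaled so that hue, saturation, and intensity all lie in [0, 1]. The function name rgb_to_hsi is hypothetical, and the exact formulas used inside HistomicsTK may differ; the sketch only illustrates the kind of values the parameters are compared against.

```python
import numpy as np


def rgb_to_hsi(rgb):
    """Convert an (..., 3) 8-bit RGB array to hue, saturation, intensity in [0, 1].

    Hypothetical helper using a textbook HSI definition; not HistomicsTK code.
    """
    rgb = np.asarray(rgb, dtype=float) / 255.0
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]

    # Intensity is the mean of the three channels.
    intensity = (r + g + b) / 3.0

    # Saturation is undefined for black pixels; report 0 there.
    minimum = np.minimum(np.minimum(r, g), b)
    with np.errstate(divide='ignore', invalid='ignore'):
        saturation = np.where(intensity > 0, 1.0 - minimum / intensity, 0.0)

    # Hue is an angle; gray pixels (zero denominator) are assigned hue 0.
    numerator = 0.5 * ((r - g) + (r - b))
    denominator = np.sqrt((r - g) ** 2 + (r - b) * (g - b))
    with np.errstate(divide='ignore', invalid='ignore'):
        cos_theta = np.where(denominator > 0, numerator / denominator, 1.0)
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))

    # Angles past 180 degrees occur when blue exceeds green; scale to [0, 1].
    hue = np.where(b > g, 2.0 * np.pi - theta, theta) / (2.0 * np.pi)
    return hue, saturation, intensity


pixels = np.array([[255, 0, 0], [0, 255, 0], [0, 0, 255], [128, 128, 128]],
                  dtype=np.uint8)
hue, saturation, intensity = rgb_to_hsi(pixels)
print(hue, saturation, intensity)
```

In this convention pure red has hue 0, so a hue_value near 0.05 targets red-brown tones such as DAB staining.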
In [5]:
template_params = ppc.Parameters(
    hue_value=0.05,
    hue_width=0.15,
    saturation_minimum=0.05,
    intensity_upper_limit=0.95,
    intensity_weak_threshold=0.65,
    intensity_strong_threshold=0.35,
    intensity_lower_limit=0.05,
)
In [6]:
count_and_label(template_params)
Internally, the label values are conveniently increasing, regularly spaced integers.
With the gray color map configured above, black is therefore background / negative,
dark gray is weak positive, light gray is positive, and white is strong positive.
To use the label values programmatically, prefer the static attributes of ppc.Labels
(.NEGATIVE, .WEAK, .PLAIN, and .STRONG) over hard-coded integers.
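As a rough illustration of how the parameters could map pixels to these four labels, here is a sketch under stated assumptions. The rule (hue within half of hue_width of hue_value, with the intensity thresholds splitting positives into weak, plain, and strong) is inferred from the parameter names and is not confirmed to match HistomicsTK's implementation. The integer constants are hypothetical stand-ins for ppc.Labels; only their ordering is taken from the text.

```python
import numpy as np

# Hypothetical stand-ins for ppc.Labels; only the ordering comes from the text.
NEGATIVE, WEAK, PLAIN, STRONG = 0, 1, 2, 3


def classify_hsi(hue, saturation, intensity, hue_value, hue_width,
                 saturation_minimum, intensity_upper_limit,
                 intensity_weak_threshold, intensity_strong_threshold,
                 intensity_lower_limit):
    """Label pixels from HSI arrays in [0, 1] (assumed rule, not library code)."""
    hue = np.asarray(hue, dtype=float)
    saturation = np.asarray(saturation, dtype=float)
    intensity = np.asarray(intensity, dtype=float)

    # Hue is circular, so measure the shorter way around the color wheel.
    diff = np.abs(hue - hue_value) % 1.0
    hue_distance = np.minimum(diff, 1.0 - diff)

    positive = ((hue_distance <= hue_width / 2.0)
                & (saturation >= saturation_minimum)
                & (intensity >= intensity_lower_limit)
                & (intensity < intensity_upper_limit))

    labels = np.full(hue.shape, NEGATIVE, dtype=np.uint8)
    labels[positive] = PLAIN
    # Brighter positives are weak staining; darker positives are strong.
    labels[positive & (intensity >= intensity_weak_threshold)] = WEAK
    labels[positive & (intensity < intensity_strong_threshold)] = STRONG
    return labels


hue = np.array([0.05, 0.05, 0.05, 0.50, 0.99, 0.05])
saturation = np.array([0.5, 0.5, 0.5, 0.5, 0.5, 0.01])
intensity = np.array([0.8, 0.5, 0.2, 0.5, 0.5, 0.5])
labels = classify_hsi(hue, saturation, intensity,
                      0.05, 0.15, 0.05, 0.95, 0.65, 0.35, 0.05)
# Because the labels are small consecutive integers, bincount tallies them.
counts = np.bincount(labels, minlength=4)
print(labels, counts)
```

The same np.bincount idiom should work on a real label image if the actual ppc.Labels values are small non-negative integers, which is an assumption here.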
These parameters do a reasonably good job already, but let's see what happens if we shift the hue center a bit.
Since Parameters is a collections.namedtuple, we can use its ._replace method to substitute a value.
In [7]:
count_and_label(template_params._replace(hue_value=0.))
Here, we see that a number of pixels previously considered positive are now considered negative, creating holes in visible nuclei. This indicates that we've moved the hue range too far.
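The same ._replace idiom extends naturally to a small parameter sweep. The sketch below uses a hypothetical stand-in namedtuple, Params, with the same field names as the parameters above so that it runs without HistomicsTK; hue_sweep is likewise a hypothetical helper.

```python
from collections import namedtuple

# Hypothetical stand-in with the same field names as the parameters above.
Params = namedtuple('Params', [
    'hue_value', 'hue_width', 'saturation_minimum',
    'intensity_upper_limit', 'intensity_weak_threshold',
    'intensity_strong_threshold', 'intensity_lower_limit',
])

base = Params(hue_value=0.05, hue_width=0.15, saturation_minimum=0.05,
              intensity_upper_limit=0.95, intensity_weak_threshold=0.65,
              intensity_strong_threshold=0.35, intensity_lower_limit=0.05)


def hue_sweep(params, hue_values):
    """Return one parameter set per hue center, leaving other fields unchanged."""
    # _replace returns a new tuple; the original is never modified.
    return [params._replace(hue_value=h) for h in hue_values]


candidates = hue_sweep(base, [0.0, 0.05, 0.1])
for p in candidates:
    print(p.hue_value, p.hue_width)
```

Each element of candidates could then be passed to count_and_label in turn to compare the resulting label images side by side.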
In any case, we can also view the statistics. Here we use the original parameter values.
In [8]:
stats, label_image = ppc.count_image(im_input, template_params)
def pp_namedtuple(t):
    """Pretty-print a namedtuple by printing each field on its own line and left-aligning all values."""
    print(type(t).__name__)
    maxlen = max(map(len, t._fields))
    for f in t._fields:
        print(f, getattr(t, f), sep=':' + ' ' * (maxlen - len(f)) + '\t')

pp_namedtuple(stats)
count_slide
count_slide has several more parameters than count_image. Besides make_label_image,
which controls the creation of the label image as mentioned, there is also region, which instructs
count_slide to operate on only the specified region of the image.
We will visualize its operation on a small region, and then run it on a larger region.
In [9]:
# Comment this out (or just don't run it) once you have the file
!curl -OJ 'https://data.kitware.com/api/v1/file/598b5ee88d777f7d33e9c1d1/download'
In [10]:
# Set up a basic configuration. Change as needed.
import dask.distributed
dask.distributed.Client()
In [11]:
slide_path = 'TCGA-DX-A6BG-01Z-00-DX2.34763958-0613-4069-9ACC-13D6633FE415.svs'
region = dict(
    left=50000, top=35000,
    width=1600, height=900,
)
ts = large_image.getTileSource(slide_path)
im_region = ts.getRegion(region=region, format=large_image.tilesource.TILE_FORMAT_NUMPY)[0]
print("The region")
plt.imshow(im_region)
plt.show()
We'll reuse the parameters from before.
In [12]:
stats, label_image = ppc.count_slide(slide_path, template_params, region=region, make_label_image=True)
pp_namedtuple(stats)
plt.imshow(label_image)
plt.show()
The output is about the same quality as before. As the small region used in the previous section is in fact extracted from another part of this slide, this is not too surprising.
We'll now rerun for only the stats, which will make use of Dask. make_label_image
defaults to False, so we simply omit it here. For the purposes of illustration,
we force tile_grouping to 1 to process each tile in its own task. A larger value, 256,
is used as the default value to reduce the overhead associated with Dask tasks.
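To get a feel for the trade-off, the sketch below estimates how many tasks a region would produce for a given tile_grouping. It assumes square tiles of a hypothetical size of 256 pixels and ignores how the region aligns with the slide's tile grid, so the numbers are only indicative; count_tasks is not part of HistomicsTK.

```python
import math


def count_tasks(region_width, region_height, tile_size, tile_grouping):
    """Estimate (number of tiles, number of tasks) for a region.

    Hypothetical helper: assumes the region starts on a tile boundary.
    """
    tiles_x = math.ceil(region_width / tile_size)
    tiles_y = math.ceil(region_height / tile_size)
    n_tiles = tiles_x * tiles_y
    # Each task processes up to tile_grouping tiles.
    n_tasks = math.ceil(n_tiles / tile_grouping)
    return n_tiles, n_tasks


TILE_SIZE = 256  # assumed tile size; the real value depends on the slide

for grouping in (1, 256):
    print(grouping, count_tasks(1600, 900, TILE_SIZE, grouping))
```

Under these assumptions the 1600x900 region covers 28 tiles, so a tile_grouping of 1 yields 28 tasks while the default of 256 yields a single task, which is why the value is lowered here for illustration.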
In [13]:
# Note that we still return a tuple, though it now has length 1.
stats_dask, = ppc.count_slide(slide_path, template_params, region=region, tile_grouping=1)
pp_namedtuple(stats_dask)
The results are identical up to a tiny amount of floating point error, as can be seen below:
In [14]:
print("stats_dask - stats:")
pp_namedtuple(ppc.Output(**{f: getattr(stats_dask, f) - getattr(stats, f) for f in ppc.Output._fields}))
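Instead of inspecting the differences by eye, the two results can also be compared field by field against a tolerance. This is a minimal sketch using math.isclose; the Stats type and its field names are hypothetical stand-ins for the real output type, and the tolerance is an arbitrary choice.

```python
import math
from collections import namedtuple

# Hypothetical stand-in for the statistics namedtuple; field names are made up.
Stats = namedtuple('Stats', ['count', 'intensity_sum'])


def fields_close(a, b, rel_tol=1e-9, abs_tol=0.0):
    """Map each field name to whether the two values agree within tolerance."""
    return {
        f: math.isclose(getattr(a, f), getattr(b, f),
                        rel_tol=rel_tol, abs_tol=abs_tol)
        for f in a._fields
    }


serial = Stats(count=100, intensity_sum=12.5)
parallel = Stats(count=100, intensity_sum=12.5 + 1e-12)
print(fields_close(serial, parallel))
```

Integer counts should match exactly, while floating-point sums may differ slightly because the order of summation changes when tiles are processed in separate tasks.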
count_slide on a large region
As a better exhibition of count_slide's use of parallelism in computing statistics,
we run it here on a much larger region, 30Kx30K pixels.
(By leaving out the region parameter, count_slide will look at the entire image.
At 146Kx79K pixels, this image will cause count_slide to use a lot of memory,
potentially enough to require swap space, without additional configuration of
large_image's caching behavior.)
In [15]:
large_region = dict(
    left=60e3, top=30e3,
    width=30e3, height=30e3,
)
stats, = %time ppc.count_slide(slide_path, template_params, large_region)
pp_namedtuple(stats)
In [16]:
pp_namedtuple(template_params)
print('\nRegion:', region)
In [17]:
%%script sh
# Stats and label image output must be specified via file arguments to
# --returnparameterfile and --outputLabelImage, respectively.
python ../../server/PositivePixelCount/PositivePixelCount.py \
'TCGA-DX-A6BG-01Z-00-DX2.34763958-0613-4069-9ACC-13D6633FE415.svs' \
0.05 0.15 0.05 0.95 0.65 0.35 0.05 --region 50000,35000,1600,900 \
--returnparameterfile stats.txt --outputLabelImage labelImage.png 2>/dev/null
Here are the results. Note that, if a label image is output, it's colorized according to colors from the coolwarm color map. White is negative, blue is weak positive, gray is positive, and red is strong positive.
In [18]:
print("stats.txt:")
with open('stats.txt') as f:
    for line in f:
        print(line.rstrip())
print("\nlabelImage.png:")
plt.imshow(skimage.io.imread('labelImage.png'))
plt.show()