developed by Jonas Hartmann (Gilmour group, EMBL Heidelberg)
as part of the EMBL Bio-IT Course on Intermediate Python Programming
20.09.2017
This tutorial introduces a number of tools and strategies available in Python for image analysis, specifically of fluorescence microscopy images as produced in the biosciences. It aims to give course attendees a starting point from which to further explore image analysis packages and pipelines. Furthermore, it serves as another practical example of scientific Python programming.
In class, teacher and students develop the pipeline below together in an open session of about two hours. The tutorial can also be used for self-study, which is best done by re-implementing, testing, and playing around with each step of the pipeline.
The tutorial pipeline encompasses the following parts:
It is based on an example 3-channel image of human HT29 colon cancer cells in culture, labeled with...
The example image was obtained from the CellProfiler website and derives from [Moffat et al., 2006].
All packages used in this tutorial are part of the Anaconda distribution of Python.
In [ ]:
# For python 2 users
from __future__ import division, print_function
# Scientific python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Image analysis
import scipy.ndimage as ndi
from skimage import io, segmentation, graph, filters, measure
# Machine learning
from sklearn import preprocessing, svm, metrics
In [ ]:
# Read data
raw = io.imread('../data/HT29.tif')
# Check data structure and type
print(type(raw))
print(raw.shape)
print(raw.dtype)
print(raw.min(), raw.max(), raw.mean())
# Split channels
nuc = raw[:,:,0]
pH3 = raw[:,:,1]
act = raw[:,:,2]
In [ ]:
# Simple imshow
plt.imshow(nuc, interpolation='none', cmap='gray')
plt.show()
In [ ]:
# Nice subplots
fig,ax = plt.subplots(1, 3, figsize=(12,4))
for axis,image,title in zip(ax, [nuc,pH3,act], ['Hoechst','pH3','Phalloidin']):
    axis.imshow(image, interpolation='none', cmap='gray')
    axis.set_title(title)
    axis.axis('off')
plt.show()
In [ ]:
# Colored overlay using rgb
fig = plt.figure(figsize=(4,4))
plt.imshow(np.zeros_like(nuc), vmax=1) # Black background
rgb = np.zeros(nuc.shape+(3,)) # Empty RGB
for i,image in enumerate([act,pH3,nuc]): # Add each channel to RGB
    rgb[:,:,i] = (image.astype(float) - image.min()) / (image.max() - image.min()) # Normalize each channel to [0,1]
plt.imshow(rgb, interpolation='none')
plt.axis('off')
plt.show()
Smoothing an image to reduce technical noise is almost always the first step in image analysis. The most common smoothing algorithm is the Gaussian filter.
The Gaussian filter is an example of a key technique of image analysis: kernel convolution. In image analysis, a kernel is a small 'mask' that is moved over each pixel in the image. At each pixel position, the kernel determines which of the surrounding pixels are used to compute the new value and how much each surrounding pixel contributes. Kernel convolutions can be implemented using Fast Fourier Transforms (FFTs), which makes them very fast.
For the Gaussian filter, the kernel is a small Gaussian-like distribution (fig. 1). To compute a pixel of the smoothed image, the values of the surrounding pixels are multiplied by the corresponding kernel value, summed up, and normalized again (by dividing by the sum of kernel values). Thus, by 'diluting' the values of individual pixels with the values of neighboring pixels, a convolution with a Gaussian kernel leads to a smoothing of the image.
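The equivalence between Gaussian smoothing and kernel convolution can be checked directly: convolving with an explicitly constructed, normalized Gaussian kernel gives the same result as `ndi.gaussian_filter` (when both use the same kernel support and boundary mode). A minimal sketch on a random test image:

```python
import numpy as np
import scipy.ndimage as ndi

# Build a small normalized Gaussian kernel (sigma=1, 7x7 support)
x = np.arange(-3, 4)
xx, yy = np.meshgrid(x, x)
kernel = np.exp(-(xx**2 + yy**2) / (2 * 1.0**2))
kernel /= kernel.sum()  # Normalize so the kernel values sum to 1

# Convolve a random test image with the explicit kernel
img = np.random.rand(64, 64)
smoothed_conv = ndi.convolve(img, kernel, mode='reflect')

# Compare with the built-in Gaussian filter, truncated at 3 sigma
# so that both use the same 7x7 support
smoothed_gauss = ndi.gaussian_filter(img, sigma=1, truncate=3)
print(np.abs(smoothed_conv - smoothed_gauss).max())  # Differences are tiny
```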
In [ ]:
# Gaussian smoothing
nuc_smooth = ndi.gaussian_filter(nuc, sigma=1)
# Show
fig,ax = plt.subplots(1, 2, figsize=(12,12))
ax[0].imshow(nuc, interpolation='none', cmap='gray')
ax[0].set_title('Raw')
ax[0].axis('off')
ax[1].imshow(nuc_smooth, interpolation='none', cmap='gray')
ax[1].set_title('Smoothed')
ax[1].axis('off')
plt.show()
A simple way of segmenting nuclei in these images is to combine adaptive background subtraction and thresholding.
The idea of adaptive background subtraction is to compute a local background for each position of the image. If there is a slow continuous change in the image background, the local background can be adjusted for this, hence evening out the image.
A simple way of computing the local background is a convolution with a relatively large uniform (mean) kernel (fig. 2). If this kernel is large compared to the structures in the image, the mean will usually end up lower than the foreground but higher than the background - perfect for background subtraction.
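The uniform filter is itself just another kernel convolution: every pixel in an n×n neighborhood contributes equally (weight 1/n²). This can be verified against an explicit mean-kernel convolution; a minimal sketch:

```python
import numpy as np
import scipy.ndimage as ndi

# A flat (mean) kernel: all weights equal, summing to 1
size = 5
kernel = np.ones((size, size)) / size**2

img = np.random.rand(64, 64)
local_mean = ndi.convolve(img, kernel, mode='reflect')

# ndi.uniform_filter computes the same local mean, just more efficiently
diff = np.abs(local_mean - ndi.uniform_filter(img, size=size)).max()
print(diff)  # Differences are only floating-point noise
```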
In [ ]:
# Adaptive background subtraction
nuc_smooth_bg = ndi.uniform_filter(nuc_smooth, size=20)
nuc_smooth_bgsub = nuc_smooth - nuc_smooth_bg    # Note: uint8 subtraction wraps around below zero...
nuc_smooth_bgsub[nuc_smooth < nuc_smooth_bg] = 0 # ...so pixels below the local background are zeroed explicitly
# Show
fig,ax = plt.subplots(1, 2, figsize=(12,12))
ax[0].imshow(nuc_smooth, interpolation='none', cmap='gray')
ax[0].set_title('Smoothed')
ax[0].axis('off')
ax[1].imshow(nuc_smooth_bgsub, interpolation='none', cmap='gray')
ax[1].set_title('Background-subtracted')
ax[1].axis('off')
plt.show()
In [ ]:
# Interactive search for a good threshold
# Plotting function
def threshold_plot(threshold=10):
    # Threshold
    nuc_mask = nuc_smooth_bgsub > threshold
    # Show
    fig = plt.figure(figsize=(6,6))
    plt.imshow(nuc_smooth_bgsub, interpolation='none', cmap='gray')
    plt.imshow(np.ma.array(nuc_mask, mask=nuc_mask==0), interpolation='none', cmap='autumn', alpha=0.5)
    plt.axis('off')
    plt.show()
# Interactive widget
from ipywidgets import interactive
interactive(threshold_plot, threshold=(1,255,1))
In [ ]:
# Apply threshold
nuc_mask = nuc_smooth_bgsub > 10
Note: There are a number of problems with a simple segmentation like this, namely the risk of fused nuclei and of artefacts, e.g. small debris or background fluctuations that are wrongly considered a nucleus. There are a number of ways to address these problems but for the purpose of this course we will consider the current result good enough.
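One common fix for small artefacts, for example, is size filtering: label the mask, count the pixels in each object, and discard objects below a minimum size. A minimal sketch (the `min_size` threshold is an arbitrary value for illustration; scikit-image also provides `morphology.remove_small_objects` for this purpose):

```python
import numpy as np
import scipy.ndimage as ndi

def remove_small_objects(mask, min_size=64):
    """Remove connected components smaller than min_size pixels."""
    labeled, n = ndi.label(mask)
    # Pixel count per label (index 0 is the background)
    sizes = np.bincount(labeled.ravel())
    # Keep only labels whose object is large enough
    keep = sizes >= min_size
    keep[0] = False  # Never keep the background
    return keep[labeled]

# Example: a mask with one large square and one tiny speck
mask = np.zeros((50, 50), dtype=bool)
mask[5:25, 5:25] = True    # 400-pixel object (kept)
mask[40:42, 40:42] = True  # 4-pixel speck (removed)
cleaned = remove_small_objects(mask, min_size=64)
```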
In [ ]:
# Label the image to give each object a unique number
nuc_labeled = ndi.label(nuc_mask)[0]
# Show
fig = plt.figure(figsize=(6,6))
plt.imshow(nuc_smooth_bgsub, interpolation='none', cmap='gray')
plt.imshow(np.ma.array(nuc_labeled, mask=nuc_labeled==0), interpolation='none', cmap='prism', alpha=0.5)
plt.axis('off')
plt.show()
For many structures, simply filtering and thresholding the image is not enough to get a segmentation. In these cases, one of many alternatives must be applied.
A very common approach is the watershed algorithm (fig. 3), which works by treating the image as a topographical map and slowly filling up the valleys in the map with water, starting from so-called seeds. Wherever the waterfronts of two different seeds meet, the boundary between these two objects is generated.
Here we can use the labeled nuclei as seeds for a watershed segmentation of the cells based on the phalloidin channel.
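Before applying this to the real data, the behaviour of a seeded watershed can be seen on a toy example: two seeds placed in a synthetic 'topography' with a ridge between them flood into exactly two regions whose boundary follows the ridge. A minimal sketch:

```python
import numpy as np
from skimage import segmentation

# Synthetic 'topography': a vertical ridge down the middle
img = np.zeros((20, 20))
img[:, 9:11] = 1.0  # High 'elevation' separating two valleys

# Seeds: label 1 in the left valley, label 2 in the right, 0 = unlabeled
seeds = np.zeros((20, 20), dtype=int)
seeds[10, 2] = 1
seeds[10, 17] = 2

labels = segmentation.watershed(img, seeds)
# The flooding from the two seeds meets at the ridge:
# the left half becomes label 1, the right half label 2
```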
In [ ]:
# Identify a background seed
# Here, the 5th percentile on signal intensity is used.
act_bgsub = act - np.percentile(act,5)
act_bgsub[act < np.percentile(act,5)] = 0
# Show
fig = plt.figure(figsize=(6,6))
plt.imshow(act, interpolation='none', cmap='gray')
plt.imshow(np.ma.array(act_bgsub==0, mask=act_bgsub!=0), interpolation='none', cmap='autumn', alpha=0.5)
plt.axis('off')
plt.show()
In [ ]:
# Prepare the seeds
seeds = np.copy(nuc_labeled) # Cell seeds
seeds[act_bgsub==0] = nuc_labeled.max()+1 # Add background seeds
In [ ]:
# Prepare the image by Sobel edge filtering
act_sobel = filters.sobel(act_bgsub)
# Show
fig = plt.figure(figsize=(6,6))
plt.imshow(act_sobel, interpolation='none', cmap='gray')
plt.axis('off')
plt.show()
In [ ]:
# Run watershed
act_ws = segmentation.watershed(act_sobel, seeds)
# Remove background
act_ws[act_ws==nuc_labeled.max()+1] = 0
# Show
fig = plt.figure(figsize=(6,6))
plt.imshow(act_bgsub, interpolation='none', cmap='gray')
plt.imshow(np.ma.array(act_ws, mask=act_ws==0), interpolation='none', cmap='prism', alpha=0.5)
plt.axis('off')
plt.show()
In [ ]:
# Better visualization
fig = plt.figure(figsize=(12,12))
plt.imshow(act_bgsub, interpolation='none', cmap='gray')
plt.imshow(np.ma.array(nuc_labeled, mask=nuc_labeled==0), interpolation='none', cmap='prism', alpha=0.3)
boundaries = filters.sobel(act_ws) > 0
plt.imshow(np.ma.array(boundaries, mask=boundaries==0), interpolation='none', cmap='autumn', alpha=0.5)
plt.axis('off')
plt.show()
In [ ]:
# Regionprops provides a number of measurements per label
nuc_props_nuc = measure.regionprops(nuc_labeled, intensity_image=nuc) # Props for nuclear mask, nuc channel
nuc_props_pH3 = measure.regionprops(nuc_labeled, intensity_image=pH3) # Props for nuclear mask, pH3 channel
nuc_props_act = measure.regionprops(nuc_labeled, intensity_image=act) # Props for nuclear mask, act channel
In [ ]:
# To better handle these, they can be transformed into dictionaries or pandas dataframes
# Function to convert to dict
def props2dict(props):
    # Get prop names (excluding non-scalar props!)
    propdict = {prop_name:[] for prop_name in props[0]
                if not (type(props[0][prop_name]) in [tuple, np.ndarray])}
    # For each prop name...
    for prop_name in propdict:
        # For each region...
        for region in props:
            # Add the corresponding value
            propdict[prop_name].append(region[prop_name])
        # Convert the values to an array
        propdict[prop_name] = np.array(propdict[prop_name])
    # Return results
    return propdict
# Converting nuc_props_pH3 and nuc_props_act to dicts
propdict_pH3 = props2dict(nuc_props_pH3)
propdict_act = props2dict(nuc_props_act)
# Converting nuc_props_nuc to a pandas df
propdf = pd.DataFrame(props2dict(nuc_props_nuc))
propdf = propdf.drop(['bbox_area', 'euler_number'], axis=1)
In [ ]:
propdf.head()
In [ ]:
propdf.describe()
In [ ]:
# Boxplot
fig = plt.figure(figsize=(12,4))
propdf.boxplot()
fig.autofmt_xdate()
plt.show()
In [ ]:
# Backmapping onto image
color_prop = 'area'
nuc_propcolored = np.zeros(nuc.shape)
for row,label in enumerate(propdf.label):
    nuc_propcolored[nuc_labeled==label] = propdict_act[color_prop][row]
# Show
fig = plt.figure(figsize=(6,6))
plt.imshow(act_bgsub, interpolation='none', cmap='gray')
plt.imshow(np.ma.array(nuc_propcolored, mask=nuc_propcolored==0),
           interpolation='none', cmap='plasma')
plt.colorbar(label=color_prop, fraction=0.046, pad=0.04)
plt.axis('off')
plt.show()
In [ ]:
# Scatter plot to look at relations
fig = plt.figure(figsize=(4,4))
plt.scatter(propdf.area, propdf.perimeter, edgecolor='none', alpha=0.5)
plt.plot(np.sort(propdf.perimeter)**2.0 / (4*np.pi), np.sort(propdf.perimeter), color='r', alpha=0.8)
plt.legend(['data', 'theoretical circles'], loc=4, fontsize=10)
plt.xlabel('area')
plt.ylabel('perimeter')
plt.show()
Based on the pH3 channel, cells currently undergoing mitosis can be identified without any ambiguity. However, if the Hoechst channel holds enough information on its own to confidently classify cells as mitotic and non-mitotic, the pH3 channel is no longer needed and its wavelength is freed up for other purposes.
Here, we can use the pH3 channel to create a ground truth for the mitotic vs. non-mitotic classification. We can then use this to train a support vector classifier to identify mitotic cells without use of the pH3 channel.
Warning: This is just a mock example! Larger datasets will feature many cases that are less clear-cut than the ones observed here, so much more data would be needed to train a robust classifier. In particular, many more cases of mitotic cells would be needed so that a balanced training set can be constructed. Furthermore, because of the lack of data in this example, the classifier's performance is evaluated on the training set itself, which means that overfitting goes unnoticed. In a real case, it is imperative to evaluate classifiers on separate training and test sets!
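With more data, the evaluation would be done on a held-out test set rather than the training data. A minimal sketch of how this could look with scikit-learn, using synthetic mock data in place of the feature table:

```python
import numpy as np
from sklearn import svm, metrics
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the standardized features and ground truth
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X[:, 0] + X[:, 1] > 0  # Mock 'mitotic' label depending on two features

# Hold out 25% of the data for evaluation, preserving class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

svc = svm.SVC()
svc.fit(X_train, y_train)

# Evaluate on data the classifier has never seen
test_accuracy = metrics.accuracy_score(y_test, svc.predict(X_test))
```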
In [ ]:
# Standardizing the features to zero mean and unit variance
propdf_stand = (propdf - propdf.mean()) / propdf.std()
# Show
fig = plt.figure(figsize=(12,4))
propdf_stand.boxplot()
fig.autofmt_xdate()
plt.show()
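The same standardization can also be done with scikit-learn's `preprocessing` module (imported above), which is convenient when the identical scaling must later be applied to new data. A minimal sketch on mock data:

```python
import numpy as np
from sklearn import preprocessing

X = np.random.rand(100, 4) * 10 + 5  # Mock feature matrix

scaler = preprocessing.StandardScaler()
X_stand = scaler.fit_transform(X)  # Zero mean, unit variance per column

# The fitted scaler can re-apply the same transform to new samples
X_new = scaler.transform(np.random.rand(5, 4) * 10 + 5)
```

(Note that `StandardScaler` divides by the population standard deviation, so the result differs marginally from pandas' `std()`, which uses the sample estimate.)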
In [ ]:
# Use pH3 signal to create ground truth labels (True: "in mitosis" | False: "not in mitosis")
# Check pH3 signal distribution with histogram
plt.hist(propdict_pH3['mean_intensity'], bins=50)
plt.ylim([0,20])
plt.show()
# Create ground truth
ground_truth = propdict_pH3['mean_intensity'] > 20
In [ ]:
# Train Support Vector Classifier
svc = svm.SVC()
svc.fit(propdf_stand, ground_truth)
In [ ]:
# Predict on the training data
prediction = svc.predict(propdf_stand)
In [ ]:
# Evaluate prediction with a confusion matrix
cmat = metrics.confusion_matrix(ground_truth, prediction)
# Show
plt.imshow(cmat,interpolation='none',cmap='Blues')
for (i, j), z in np.ndenumerate(cmat):
    plt.text(j, i, z, ha='center', va='center')
plt.xticks([0,1], ["Non-Mitotic","Mitotic"])
plt.yticks([0,1], ["Non-Mitotic","Mitotic"], rotation=90)
plt.xlabel("prediction")
plt.ylabel("ground truth")
plt.show()
Jason Moffat, Dorre A. Grueneberg, Xiaoping Yang, So Young Kim, Angela M. Kloepfer, Gregory Hinkle, Bruno Piqani, Thomas M. Eisenhaure, Biao Luo, Jennifer K. Grenier, Anne E. Carpenter, Shi Yin Foo, Sheila A. Stewart, Brent R. Stockwell, Nir Hacohen, William C. Hahn, Eric S. Lander, David M. Sabatini, David E. Root (2006): A Lentiviral RNAi Library for Human and Mouse Genes Applied to an Arrayed Viral High-Content Screen, Cell 124:6, 1283-1298.