An Introduction to BioImage Data Analysis with Python

developed by Jonas Hartmann (Gilmour group, EMBL Heidelberg)
as part of the EMBL Bio-IT Course on Intermediate Python Programming
20.09.2017


About

Purpose

This tutorial introduces a number of tools and strategies available in python for image analysis, specifically of fluorescence microscopy images as produced in the biosciences. It aims to give course attendees a starting point from which to further explore image analysis packages and pipelines. Furthermore, it serves as another practical example of scientific python programming.

Format

In the course, teacher and students develop the pipeline below together in an open session of about two hours. The tutorial can also be used for self-study, which is best done by re-implementing, testing and playing around with each step of the pipeline.

Content

The tutorial pipeline encompasses the following parts:

  1. Loading & viewing images
  2. Image processing & segmentation
  3. Data extraction & analysis

It is based on an example 3-channel image of human HT29 colon cancer cells in culture, labeled with...

  • Hoechst stain (DNA)
  • Phalloidin (actin)
  • Histone H3 phosphorylated on serine 10 [pH3] antibody (mitosis marker)

The example image was obtained from the CellProfiler website and derives from [Moffat et al., 2006].

Dependencies

  • python (2.7 or 3.x)
  • numpy, scipy, matplotlib
  • scikit-image (skimage), pandas, scikit-learn (sklearn)

All packages used in this tutorial are part of the Anaconda distribution of python.

1. Loading & Viewing Images

1a. Imports


In [ ]:
# For python 2 users
from __future__ import division, print_function 

# Scientific python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Image analysis
import scipy.ndimage as ndi
from skimage import io, segmentation, graph, filters, measure

# Machine learning
from sklearn import preprocessing, svm, metrics

1b. Loading Image Data


In [ ]:
# Read data
raw = io.imread('../data/HT29.tif')

# Check data structure and type
print(type(raw))
print(raw.shape)
print(raw.dtype)
print(raw.min(), raw.max(), raw.mean())

# Split channels
nuc = raw[:,:,0]
pH3 = raw[:,:,1]
act = raw[:,:,2]

1c. Viewing Image Data


In [ ]:
# Simple imshow
plt.imshow(nuc, interpolation='none', cmap='gray')
plt.show()

In [ ]:
# Nice subplots
fig,ax = plt.subplots(1, 3, figsize=(12,4))
for axis,image,title in zip(ax, [nuc,pH3,act], ['Hoechst','pH3','Phalloidin']):
    axis.imshow(image, interpolation='none', cmap='gray')
    axis.set_title(title)
    axis.axis('off')
plt.show()

In [ ]:
# Colored overlay using rgb
fig = plt.figure(figsize=(4,4))
plt.imshow(np.zeros_like(nuc), vmax=1)    # Black background
rgb = np.zeros(nuc.shape+(3,))            # Empty RGB image
for i,image in enumerate([act,pH3,nuc]):  # Add each channel to RGB
    rgb[:,:,i] = (image.astype(float) - image.min()) / (image.max() - image.min()) # Normalize each channel to [0,1]
plt.imshow(rgb, interpolation='none')
plt.axis('off')
plt.show()

2. Image Processing & Segmentation

2a. Preprocessing by Smoothing

Smoothing an image to reduce technical noise is almost always the first step in image analysis. The most common smoothing algorithm is the Gaussian filter.

The Gaussian filter is an example of a key technique of image analysis: kernel convolution. In image analysis, a kernel is a small 'mask' that is moved over each pixel in the image. At each pixel position, the kernel determines which of the surrounding pixels are used to compute the new value and how much each surrounding pixel contributes. Kernel convolutions can be implemented using Fast Fourier Transforms (FFTs), which makes them very fast.

For the Gaussian filter, the kernel is a small Gaussian-like distribution (fig. 1). To compute a pixel of the smoothed image, the values of the surrounding pixels are multiplied by the corresponding kernel value, summed up, and normalized again (by dividing by the sum of kernel values). Thus, by 'diluting' the values of individual pixels with the values of neighboring pixels, a convolution with a Gaussian kernel leads to a smoothing of the image.


Fig 1: Example of a Gaussian convolution kernel.
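
To make the kernel convolution described above more concrete, the following cell builds a small Gaussian kernel by hand and applies it with scipy.ndimage. This is a minimal sketch for illustration only; the kernel radius and sigma are arbitrary choices, and the result is essentially the same as that of the ready-made Gaussian filter used in the next cell.

In [ ]:
# Minimal sketch of kernel convolution (kernel radius and sigma chosen arbitrarily)
kernel_radius = 3
y, x = np.mgrid[-kernel_radius:kernel_radius+1, -kernel_radius:kernel_radius+1]
kernel = np.exp(-(x**2 + y**2) / (2.0 * 1.0**2))  # Gaussian-shaped kernel with sigma=1
kernel = kernel / kernel.sum()                    # Normalize so the kernel values sum to 1

# Convolve the nucleus channel with the hand-made kernel
nuc_conv = ndi.convolve(nuc.astype(float), kernel)

# Show
fig = plt.figure(figsize=(6,6))
plt.imshow(nuc_conv, interpolation='none', cmap='gray')
plt.axis('off')
plt.show()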


In [ ]:
# Gaussian smoothing
nuc_smooth = ndi.gaussian_filter(nuc, sigma=1)

# Show
fig,ax = plt.subplots(1, 2, figsize=(12,12))
ax[0].imshow(nuc, interpolation='none', cmap='gray')
ax[0].set_title('Raw')
ax[0].axis('off')
ax[1].imshow(nuc_smooth, interpolation='none', cmap='gray')
ax[1].set_title('Smoothed')
ax[1].axis('off')
plt.show()

2b. Simple Nucleus Segmentation

A simple way of segmenting nuclei in these images is to combine adaptive background subtraction and thresholding.

The idea of adaptive background subtraction is to compute a local background for each position in the image. If the background changes slowly and continuously across the image, subtracting this local estimate evens the image out.

A simple way of computing the local background is a convolution with a relatively large uniform (mean) kernel (fig. 2). If this kernel is large compared to the structures in the image, the mean will usually end up lower than the foreground but higher than the background - perfect for background subtraction.


Fig 2: Example of a circular uniform convolution kernel.


In [ ]:
# Adaptive background subtraction
nuc_smooth_bg = ndi.uniform_filter(nuc_smooth, size=20)  # Local background: mean filter with a large kernel
nuc_smooth_bgsub = nuc_smooth - nuc_smooth_bg            # Subtract the local background
nuc_smooth_bgsub[nuc_smooth < nuc_smooth_bg] = 0         # Zero out pixels below the local background (this also corrects the uint8 wrap-around caused by the subtraction)

# Show
fig,ax = plt.subplots(1, 2, figsize=(12,12))
ax[0].imshow(nuc_smooth, interpolation='none', cmap='gray')
ax[0].set_title('Smoothed')
ax[0].axis('off')
ax[1].imshow(nuc_smooth_bgsub, interpolation='none', cmap='gray')
ax[1].set_title('Background-subtracted')
ax[1].axis('off')
plt.show()

In [ ]:
# Interactive search for a good threshold

# Plotting function
def threshold_plot(threshold=10):
    
    # Threshold
    nuc_mask = nuc_smooth_bgsub > threshold

    # Show
    fig = plt.figure(figsize=(6,6))
    plt.imshow(nuc_smooth_bgsub, interpolation='none', cmap='gray')
    plt.imshow(np.ma.array(nuc_mask, mask=nuc_mask==0), interpolation='none', cmap='autumn', alpha=0.5)
    plt.axis('off')
    plt.show()    
    
    
# Interactive widget
from ipywidgets import interactive
interactive(threshold_plot, threshold=(1,255,1))

In [ ]:
# Apply threshold
nuc_mask = nuc_smooth_bgsub > 10

Note: There are a number of problems with a simple segmentation like this, namely the risk of fused nuclei and of artefacts, e.g. small debris or background fluctuations that are wrongly considered a nucleus. There are a number of ways to address these problems, but for the purpose of this course we will consider the current result good enough; one possible clean-up step is sketched below.
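
The following cell illustrates such a clean-up step by removing very small objects from the mask with skimage's morphology module. This is a minimal sketch with an arbitrarily chosen size cutoff; the rest of the pipeline continues with the unmodified mask.

In [ ]:
# Minimal sketch of one possible clean-up step: removing small artefacts
# (the size cutoff of 100 pixels is an arbitrary choice)
from skimage import morphology
nuc_mask_clean = morphology.remove_small_objects(nuc_mask, min_size=100)
print('Objects before clean-up:', ndi.label(nuc_mask)[1])
print('Objects after clean-up: ', ndi.label(nuc_mask_clean)[1])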


In [ ]:
# Label the image to give each object a unique number
nuc_labeled = ndi.label(nuc_mask)[0]

# Show
fig = plt.figure(figsize=(6,6))
plt.imshow(nuc_smooth_bgsub, interpolation='none', cmap='gray')
plt.imshow(np.ma.array(nuc_labeled, mask=nuc_labeled==0), interpolation='none', cmap='prism', alpha=0.5)
plt.axis('off')
plt.show()

2c. Cell Segmentation by Watershed

For many structures, simply filtering and thresholding the image is not enough to get a segmentation. In these cases, one of many alternatives must be applied.

A very common approach is the watershed algorithm (fig. 3), which works by treating the image as a topographical map and slowly filling up the valleys in the map with water, starting from so-called seeds. Wherever the waterfronts of two different seeds meet, the boundary between these two objects is generated.

Here we can use the labeled nuclei as seeds for a watershed segmentation of the cells based on the phalloidin channel.


Fig 3: Graphical explanation of watershed segmentation.
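
Before applying the watershed to the pipeline data, the cell below illustrates the algorithm on a small synthetic example: two touching circles are separated by flooding the inverted distance transform from two manually placed seeds. This is a minimal sketch; the circle positions and radii are arbitrary.

In [ ]:
# Minimal synthetic watershed example (circle positions and radii chosen arbitrarily)
yy, xx = np.indices((80, 120))
demo_mask = ((yy - 40)**2 + (xx - 45)**2 < 25**2) | ((yy - 40)**2 + (xx - 75)**2 < 25**2)

# The distance transform is highest at the circle centers; inverting it turns
# the centers into the deepest 'valleys' of the topographical map
distance = ndi.distance_transform_edt(demo_mask)

# Place one seed at each circle center
demo_seeds = np.zeros(demo_mask.shape, dtype=int)
demo_seeds[40, 45] = 1
demo_seeds[40, 75] = 2

# Flood the inverted distance map from the seeds, restricted to the mask
demo_labels = segmentation.watershed(-distance, demo_seeds, mask=demo_mask)

# Show
fig = plt.figure(figsize=(6,3))
plt.imshow(demo_labels, interpolation='none', cmap='prism')
plt.axis('off')
plt.show()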

In [ ]:
# Identify a background seed
# Here, everything below the 5th percentile of signal intensity is considered background.
act_bgsub = act - np.percentile(act,5)
act_bgsub[act < np.percentile(act,5)] = 0

# Show
fig = plt.figure(figsize=(6,6))
plt.imshow(act, interpolation='none', cmap='gray')
plt.imshow(np.ma.array(act_bgsub==0, mask=act_bgsub!=0), interpolation='none', cmap='autumn', alpha=0.5)
plt.axis('off')
plt.show()

In [ ]:
# Prepare the seeds
seeds = np.copy(nuc_labeled)               # Cell seeds
seeds[act_bgsub==0] = nuc_labeled.max()+1  # Add background seeds

In [ ]:
# Prepare the image by Sobel edge filtering
act_sobel = filters.sobel(act_bgsub)

# Show
fig = plt.figure(figsize=(6,6))
plt.imshow(act_sobel, interpolation='none', cmap='gray')
plt.axis('off')
plt.show()

In [ ]:
# Run watershed
act_ws = segmentation.watershed(act_sobel, seeds)

# Remove background
act_ws[act_ws==nuc_labeled.max()+1] = 0

# Show
fig = plt.figure(figsize=(6,6))
plt.imshow(act_bgsub, interpolation='none', cmap='gray')
plt.imshow(np.ma.array(act_ws, mask=act_ws==0), interpolation='none', cmap='prism', alpha=0.5)
plt.axis('off')
plt.show()

In [ ]:
# Better visualization
fig = plt.figure(figsize=(12,12))
plt.imshow(act_bgsub, interpolation='none', cmap='gray')
plt.imshow(np.ma.array(nuc_labeled, mask=nuc_labeled==0), interpolation='none', cmap='prism', alpha=0.3)
boundaries = filters.sobel(act_ws) > 0 
plt.imshow(np.ma.array(boundaries, mask=boundaries==0), interpolation='none', cmap='autumn', alpha=0.5)
plt.axis('off')
plt.show()

3. Data Extraction & Analysis

3a. Extracting Region Data


In [ ]:
# Regionprops provides a number of measurements per label
nuc_props_nuc = measure.regionprops(nuc_labeled, intensity_image=nuc)  # Props for nuclear mask, nuc channel
nuc_props_pH3 = measure.regionprops(nuc_labeled, intensity_image=pH3)  # Props for nuclear mask, pH3 channel
nuc_props_act = measure.regionprops(nuc_labeled, intensity_image=act)  # Props for nuclear mask, act channel

In [ ]:
# To better handle these, they can be transformed into dictionaries or pandas dataframes

# Function to convert to dict
def props2dict(props):
        
    # Get prop names (excluding non-scalar props!)
    propdict = {prop_name:[] for prop_name in props[0]
                if not (type(props[0][prop_name]) in [tuple, np.ndarray])}
    
    # For each prop name...
    for prop_name in propdict:
        
        # For each region...
        for region in props:
            
            # Add the corresponding value
            propdict[prop_name].append(region[prop_name])
        
        # Convert the values to an array
        propdict[prop_name] = np.array(propdict[prop_name])
        
    # Return results
    return propdict


# Converting nuc_props_pH3 and nuc_props_act to dicts
propdict_pH3 = props2dict(nuc_props_pH3)
propdict_act = props2dict(nuc_props_act)

# Converting nuc_props_nuc to a pandas df
propdf = pd.DataFrame(props2dict(nuc_props_nuc))
propdf = propdf.drop(['bbox_area', 'euler_number'], axis=1)
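
Note: newer versions of scikit-image (0.16 and later, so possibly newer than the one used in this course) also provide measure.regionprops_table, which returns a dictionary of scalar measurements directly. A minimal sketch, assuming such a version is installed:

In [ ]:
# Minimal sketch using regionprops_table (requires scikit-image >= 0.16)
props_table = measure.regionprops_table(nuc_labeled, intensity_image=nuc,
                                        properties=['label', 'area', 'perimeter', 'mean_intensity'])
propdf_alt = pd.DataFrame(props_table)
propdf_alt.head()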

3b. Visualizing Region Data


In [ ]:
propdf.head()

In [ ]:
propdf.describe()

In [ ]:
# Boxplot
fig = plt.figure(figsize=(12,4))
propdf.boxplot()
fig.autofmt_xdate()
plt.show()

In [ ]:
# Backmapping onto image
color_prop = 'area'
nuc_propcolored = np.zeros(nuc.shape)
for row,label in enumerate(propdf.label):
    nuc_propcolored[nuc_labeled==label] = propdict_act[color_prop][row]
    
# Show
fig = plt.figure(figsize=(6,6))
plt.imshow(act_bgsub, interpolation='none', cmap='gray')
plt.imshow(np.ma.array(nuc_propcolored, mask=nuc_propcolored==0), 
           interpolation='none', cmap='plasma')
plt.colorbar(label=color_prop, fraction=0.046, pad=0.04)
plt.axis('off')
plt.show()

In [ ]:
# Scatter plot to look at relations
fig = plt.figure(figsize=(4,4))
plt.scatter(propdf.area, propdf.perimeter, edgecolors='none', alpha=0.5, label='data')
plt.plot(np.sort(propdf.perimeter)**2.0 / (4*np.pi), np.sort(propdf.perimeter), color='r', alpha=0.8, label='theoretical circles')
plt.legend(loc=4, fontsize=10)
plt.xlabel('area')
plt.ylabel('perimeter')
plt.show()

3c. Identifying Dividing Cells Using a Support Vector Machine

Based on the pH3 channel, cells currently undergoing mitosis can be identified without any ambiguity. However, if the Hoechst channel holds enough information on its own to confidently classify cells as mitotic and non-mitotic, the pH3 channel is no longer needed and its wavelength is freed up for other purposes.

Here, we can use the pH3 channel to create a ground truth for the mitotic vs. non-mitotic classification. We can then use this to train a support vector classifier to identify mitotic cells without use of the pH3 channel.

Warning: This is just a mock example! Larger datasets will feature many cases that are less clear-cut than the ones observed here, so much more data would be needed to train a robust classifier. In particular, many more cases of mitotic cells would be needed so that a balanced training set can be constructed. Furthermore, because of the lack of data in this example, the classifier's performance is evaluated on the training set itself, which means that overfitting goes unnoticed. In a real case, it is imperative to evaluate classifiers on separate training and test sets!


In [ ]:
# Standardizing the features to zero mean and unit variance
propdf_stand = (propdf - propdf.mean()) / propdf.std()

# Show
fig = plt.figure(figsize=(12,4))
propdf_stand.boxplot()
fig.autofmt_xdate()
plt.show()
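
The same standardization can also be done with the preprocessing module imported at the start. A minimal sketch (note that StandardScaler divides by the population standard deviation, whereas pandas uses the sample standard deviation, so the results differ very slightly):

In [ ]:
# Minimal sketch: standardization with sklearn's StandardScaler
scaler = preprocessing.StandardScaler()
propdf_stand_alt = pd.DataFrame(scaler.fit_transform(propdf), columns=propdf.columns)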

In [ ]:
# Use pH3 signal to create ground truth labels (True: "in mitosis" | False: "not in mitosis") 

# Check pH3 signal distribution with histogram
plt.hist(propdict_pH3['mean_intensity'], bins=50)
plt.ylim([0,20])
plt.show()

# Create ground truth
ground_truth = propdict_pH3['mean_intensity'] > 20

In [ ]:
# Train Support Vector Classifier
svc = svm.SVC()
svc.fit(propdf_stand, ground_truth)

In [ ]:
# Predict on the training data
prediction = svc.predict(propdf_stand)
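
As emphasized in the warning above, a real analysis would evaluate the classifier on data it has not been trained on. The cell below sketches how a held-out test set could be used with scikit-learn's model_selection module; with the small dataset in this tutorial the numbers are not meaningful, so this is for illustration only.

In [ ]:
# Minimal sketch of evaluation on a held-out test set (illustration only; the dataset here is too small)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(propdf_stand, ground_truth,
                                                    test_size=0.25, random_state=0)
svc_split = svm.SVC().fit(X_train, y_train)
print('Accuracy on held-out test set:', svc_split.score(X_test, y_test))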

In [ ]:
# Evaluate prediction with a confusion matrix
cmat = metrics.confusion_matrix(ground_truth, prediction)

# Show
plt.imshow(cmat,interpolation='none',cmap='Blues')
for (i, j), z in np.ndenumerate(cmat):
    plt.text(j, i, z, ha='center', va='center')
plt.xticks([0,1], ["Non-Mitotic","Mitotic"])
plt.yticks([0,1], ["Non-Mitotic","Mitotic"], rotation=90)
plt.xlabel("prediction")
plt.ylabel("ground truth")
plt.show()

Resources for Further Study

  • Free E-Book: BioImage Data Analysis, a comprehensive introduction to BioImage Data Analysis with Fiji and MATLAB, by Kota Miura (former EMBL Center for Microscopy and Image Analysis).
  • Online resource: OpenCV is a large library of image analysis algorithms that can be used with different programming languages. Tutorials on how to use OpenCV with Python can be found here.
  • Online resource: The SciPy Docs for scipy.ndimage provide a (brief and technical) overview of the ndimage module, which tends to be useful when working with higher-dimensional images and complements scikit-image quite well.
  • Online resource: Mahotas is another python package for image analysis. It is largely redundant with scikit-image but built differently under the hood (with C++ instead of Cython), which can be advantageous in some cases.
  • Online resource: Python (or more accurately Jython) can also be used as a macro language in ImageJ/Fiji. An introduction can be found here, though I believe it is currently being rewritten and therefore a mess. The old version is here.

References

Jason Moffat, Dorre A. Grueneberg, Xiaoping Yang, So Young Kim, Angela M. Kloepfer, Gregory Hinkle, Bruno Piqani, Thomas M. Eisenhaure, Biao Luo, Jennifer K. Grenier, Anne E. Carpenter, Shi Yin Foo, Sheila A. Stewart, Brent R. Stockwell, Nir Hacohen, William C. Hahn, Eric S. Lander, David M. Sabatini, David E. Root (2006): A Lentiviral RNAi Library for Human and Mouse Genes Applied to an Arrayed Viral High-Content Screen, Cell 124:6, 1283-1298.