Gel Analysis Using K Means Clustering

I want to automate the selection of intensity bands from gel images. The data I'm using in this notebook is from folding results of my 84X and 42X noodle structures. I've used a Cy3 object probe and a Cy5 defect probe for both strucutres. I used the Typhoon at 940V for Cy3 and 900V for Cy5

I will be using KMeans clustering to segment the gel images. There are many other clustering algorithms we could use, but I like K Means because it is not too computationally intensive. Here is a cool explanation of the differences in clustering algorithms from the folks at scikit-learn: http://scikit-learn.org/stable/modules/clustering.html#overview-of-clustering-methods

We start with some package imports:


In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
import sklearn
from sklearn.cluster import KMeans
from sklearn import cluster
from scipy import misc
from scipy.ndimage import label

Functions

Here we define some functions to use later on


In [40]:
def invert(img):
    from copy import deepcopy
    imin = np.min(img)
    imax = np.max(img)
    img_inv = imax - deepcopy(img)+imin
    return img_inv

K Means Clustering


In [3]:
def kmeans(img,nlabels=5):
    lena = img
    X = lena.reshape((-1, 1)) # We need an (n_sample, n_feature) array
    k_means = cluster.KMeans(n_clusters=nlabels, n_init=1)
    k_means.fit(X) 
    values = k_means.cluster_centers_.squeeze()
    labels = k_means.labels_
    return labels.reshape(img.shape)

Cluster Labelling and Centroid Calculation


In [4]:
def get_labels(data, min_extent=5):
    labels, nlabels = label(data)
    for idx in range(1, nlabels+1):
        if sum(labels==idx)<min_extent:
            labels[labels==idx] = 0
    nlabels = len(np.unique(labels))
    return labels, nlabels

In [5]:
def get_centroids(labels):
    labelnums = np.unique(labels)
    centroids = []
    for label in labelnums:
        centroids.append(np.mean(np.asarray(np.nonzero(labels==label)), axis = 1))
    return centroids

Plotting Functions


In [6]:
def make_boxplot(measure_names,cy5,cy3,title=''):
    import scipy.stats as stats
    ind = np.arange(len(measure_names))
    width = 0.35

    fig = plt.figure(figsize=(24,8))
    ax = fig.add_subplot(111)
    
    rects1 = ax.boxplot(cy3,positions = ind,widths=0.2)
    rects2 = ax.boxplot(cy5,positions = [i+width for i in ind],widths=0.2)
    ax.set_xticklabels(measure_names,rotation=0)
    p1 = [plot(np.ones(len(cy3[i]))*ind[i],cy3[i],'ro',alpha=0.005) for i in range(len(measure_names))]
    c1 = [plot(np.ones(len(cy5[i]))*ind[i]+width,cy5[i],'bo',alpha=0.005) for i in range(len(measure_names))]
    
    for key in rects1.keys():
        setp(rects1[key],color='red')
        setp(rects2[key],color='blue')
    
    ax.set_ylabel('Intensity')
    ax.set_xlabel('MgCl2 concentration')
    ax.set_title(title)
    ax.set_xticks(ind+width/2)
    ax.legend( (rects1[key][0], rects2[key][0]), ('Cy3', 'Cy5') )

42X Noodle

Read in Files


In [43]:
cy3 = imread("../downloads/10282013_42XCy3_D_O_3H_70V-[Cy3].jpg")
cy5 = imread("../downloads/10282013_42XCy3_D_O_3H_70V-[Cy5].jpg")
fig,ax = subplots(ncols=2,nrows=1,figsize=(12,6))
ax[0].hist(cy3.ravel(),bins=100,color="black");
ax[0].set_title("Cy3 Histogram")
ax[1].hist(cy5.ravel(),bins=100,color="black");
ax[1].set_title("Cy5 Histogram")


Out[43]:
<matplotlib.text.Text at 0x121869850>

Run K Means Clustering


In [44]:
labels = kmeans(cy3)

In [45]:
fig,ax = subplots(ncols=2,nrows=1,figsize=(24,8))
ax[0].imshow(cy3,cmap=cm.Greys)
ax[0].set_title('Cy3 - Object')
cmap = cm.get_cmap('jet', 5)
l = ax[1].imshow(labels,cmap=cmap)
ax[1].set_title('K Means Clustering')
ax[1].figure.colorbar(l);


Choose K-Means Cluster, and Minimum Cluster Size

In this case, we choose cluster 3, or the orange clusters in the picture above. Unfortunately this value will change each time K Means is run. We also limit the size of our clusters to >1000 pixels


In [53]:
bandlabels, nlabels = get_labels(labels==3,1000)

Plot Images


In [54]:
fig,ax = subplots(ncols=4,nrows=1,figsize=(36,12))
ax[0].imshow(cy3,cmap=cm.Greys)
cmap = cm.get_cmap('jet', 5)
l = ax[1].imshow(labels,cmap=cmap)
ax[1].set_title('K Means Clustering')
#ax[1].figure.colorbar(l)
cmap = cm.get_cmap('jet', nlabels)
b=ax[2].imshow(bandlabels,cmap=cmap,alpha=0.3)
ax[2].set_title("ROI Labelling")
centroids = get_centroids(bandlabels)
for i,c in enumerate(centroids):
    ax[2].annotate('%d'%i,c[::-1],color="black");
ax[3].imshow(cy5,cmap=cm.Greys)
ax[3].set_title('Cy5 - Defect')
ax[0].set_title('Cy3 - Object')
#ax[1].figure.colorbar(b)


Out[54]:
<matplotlib.text.Text at 0x1070ca990>

Choose Regions of Interest (ROIs) and map to names


In [55]:
rois = [5,6,9,11,13,15,17]
roinames = ['10','12','14','16','18','20','22'] # These are the MgCl2 concentrations I used in the gel

Extract ROI values for both Cy3 and Cy5 Images


In [56]:
labelnames = np.unique(bandlabels)
cy3vals = []
cy5vals = []
cy3inv = invert(cy3)
cy5inv = invert(cy5)
for idx in rois:
    cy3vals.append(cy3inv[bandlabels==labelnames[idx]])
    cy5vals.append(cy5inv[bandlabels==labelnames[idx]])

Plot a Boxplot of Defect(Cy5) and object(Cy3) probes


In [57]:
names = ['%d'%d for d in range(len(labelnames))]
make_boxplot(roinames,cy5vals,cy3vals,'Object (Cy3) and Defect(Cy5) Probe Intensities for 42X')


84X Noodle

Read in Files


In [58]:
cy3 = imread("../downloads/10232013_84XCy3_D_O_3H_70V-[Cy3].jpg")
cy5 = imread("../downloads/10232013_84XCy3_D_O_3H_70V-[Cy5].jpg")
fig,ax = subplots(ncols=2,nrows=1,figsize=(12,6))
ax[0].hist(cy3.ravel(),bins=100,color="black");
ax[0].set_title("Cy3 Histogram")
ax[1].hist(cy5.ravel(),bins=100,color="black");
ax[1].set_title("Cy5 Histogram")


Out[58]:
<matplotlib.text.Text at 0x13ded7990>

Run K Means Clustering

For this case, I calculated just 3 clusters


In [59]:
labels = kmeans(cy3,3)

In [60]:
fig,ax = subplots(ncols=2,nrows=1,figsize=(24,8))
ax[0].imshow(cy3,cmap=cm.Greys)
ax[0].set_title('Cy3 - Object')
cmap = cm.get_cmap('jet', 3)
l = ax[1].imshow(labels,cmap=cmap)
ax[1].set_title('K Means Clustering')
ax[1].figure.colorbar(l);


Choose K-Means Cluster, and Minimum Cluster Size

In this case, we choose cluster 1 (the green clusters in the picture above). Unfortunately this value will change each time K Means is run. We also limit the size of our clusters to >500 pixes, since the resolution is different than for the 42X example


In [61]:
bandlabels, nlabels = get_labels(labels==1,500)

Plot Images


In [62]:
fig,ax = subplots(ncols=4,nrows=1,figsize=(36,12))
ax[0].imshow(cy3,cmap=cm.Greys)
cmap = cm.get_cmap('jet', 5)
l = ax[1].imshow(labels,cmap=cmap)
ax[1].set_title('K Means Clustering')
#ax[1].figure.colorbar(l)
cmap = cm.get_cmap('jet', nlabels)
b=ax[2].imshow(bandlabels,cmap=cmap,alpha=0.3)
ax[2].set_title("ROI Labelling")
centroids = get_centroids(bandlabels)
for i,c in enumerate(centroids):
    ax[2].annotate('%d'%i,c[::-1],color="black");
ax[3].imshow(cy5,cmap=cm.Greys)
ax[3].set_title('Cy5 - Defect')
ax[0].set_title('Cy3 - Object');
#ax[1].figure.colorbar(b)


Out[62]:
<matplotlib.text.Text at 0x1251e8fd0>

Choose Regions of Interest (ROIs) and map to names


In [63]:
rois = [14,12,13,11,9,8,10]
roinames = ['10','12','14','16','18','20','22'] # These are the MgCl2 concentrations I used in the gel

Extract ROI values for both Cy3 and Cy5 Images


In [66]:
labelnames = np.unique(bandlabels)
cy3vals = []
cy5vals = []
cy3inv = invert(cy3)
cy5inv = invert(cy5)
for idx in rois:
    cy3vals.append(cy3inv[bandlabels==labelnames[idx]])
    cy5vals.append(cy5inv[bandlabels==labelnames[idx]])

Plot a Boxplot of Defect(Cy5) and object(Cy3) probes


In [67]:
make_boxplot(roinames,cy5vals,cy3vals,'Object (Cy3) and Defect(Cy5) Probe Intensities for 84X')


Results

  • There is a dramatic difference in the boxplots of the 42X noodle and the 84X noodle!
  • The defect probe (Cy5) intensity is high and closer to the object probe (Cy3) for 42X than 84X, showing what we already knew: 42X folds badly
  • There doesn't seem to be a trend between magnesium concentration and Cy3/Cy5 intensity for both the 84X and 42X noodles
  • For the 84X plot, the object probe intensity is much higher than defect, we know this folds well.
  • I have not calclated the ratio between Cy5 and Cy3 yet.
    • I think I should add a background subtraction step where I subtract the average of the $0^{th}$ ROI.

Future Steps

This processing pipeline is not quite fully automated. We still needed to define the following from visual inspection:

  • The number of clusters for K Means
    • This really depends on the contast spread of the image. Therefore, it might be useful to analyze the peaks in a histogram to determine the number of K Means clusters.
  • The K Means cluster of our desired bands
    • Unfortunately, the K Means algorithms labels clusters randomly. So in one case the "staples" cluster might be 0 but other times it is 1, even though the pixels included in it are identical
    • Finding the cluster label with our desired bands is also tricky. One way is to count the total number of clusters from each K Means label, and choose the label where the number of clusters is closest to a user set value
  • The minimum cluster size
    • This can be calculated from the resolution, I think. We need to choose an area cutoff in real life units, like $mm^2$
  • ROI definition
    • This can't be helped! The user needs to choose which ROIs to run analyses on

In [ ]: