Corpus callosum's shape signature for segmentation error detection in large datasets

Abstract

Corpus Callosum (CC) is a subcortical, white matter structure with great importance in clinical and research studies because its shape and volume are correlated with subject's characteristics and neurodegenerative diseases. CC segmentation is a important step for any medical, clinical or research posterior study. Currently, magnetic resonance imaging (MRI) is the main tool for evaluating brain because it offers the better soft tissue contrast. Particullary, segmentation in MRI difussion modality has great importante given information associated to brain microstruture and fiber composition.

In this work a method for detection of erroneous segmentations in large datasets is proposed based-on shape signature. Shape signature is obtained from segmentation, calculating curvature along contour using a spline formulation. A mean correct signature is used as reference for compare new segmentations through root mean square error. This method was applied to 152 subject dataset for three different segmentation methods in diffusion: Watershed, ROQS and pixel-based presenting high accuracy in error detection. This method do not require per-segmentation reference and it can be applied to any MRI modality and other image aplications.


In [1]:
## Functions

import sys,os
import copy
import random
path = os.path.abspath('../dev/')
if path not in sys.path:
    sys.path.append(path)

import bib_mri as FW
import numpy as np
import scipy as scipy
import scipy.misc as misc 
import matplotlib as mpl
import matplotlib.pyplot as plt
from numpy import genfromtxt
import platform

import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
%matplotlib inline

from sklearn.cluster import DBSCAN
import sklearn.cluster as skl_clus
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn import preprocessing, svm

def sign_extract(seg, resols): #Function for shape signature extraction
    splines = FW.get_spline(seg,smoothness)

    sign_vect = np.array([]).reshape(0,points) #Initializing temporal signature vector
    for resol in resols:
        sign_vect = np.vstack((sign_vect, FW.get_profile(splines, n_samples=points, radius=resol)))
    
    return sign_vect

def sign_fit(sig_ref, sig_fit): #Function for signature fitting
    dif_curv = []
    for shift in range(points):
        dif_curv.append(np.abs(np.sum((sig_ref - np.roll(sig_fit[0],shift))**2)))
    return np.apply_along_axis(np.roll, 1, sig_fit, np.argmin(dif_curv))

print "Python version: ", platform.python_version()
print "Numpy version: ", np.version.version
print "Scipy version: ", scipy.__version__
print "Matplotlib version: ", mpl.__version__


Python version:  2.7.13
Numpy version:  1.12.1
Scipy version:  0.19.0
Matplotlib version:  2.0.2

In [2]:
def agreement_matrix(x1,x2):
    x = np.vstack((x1,x2)).T
    y = x.dot(1 << np.arange(x.shape[-1] - 1, -1, -1)).astype('uint8')
    m_ag = np.zeros((4))
    for pos in np.unique(y):
        m_ag[pos] = np.sum(y==pos)
    return m_ag[::-1]

def plot_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    import itertools
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    #print(cm)

    plt.figure(figsize=(8, 8))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

Introduction

The Corpus Callosum (CC) is the largest white matter structure in the central nervous system that connects both brain hemispheres and allows the communication between them. The CC has great importance in research studies due to the correlation between shape and volume with some subject's characteristics, such as: gender, age, numeric and mathematical skills and handedness. In addition, some neurodegenerative diseases like Alzheimer, autism, schizophrenia and dyslexia could cause CC shape deformation.

CC segmentation is a necessary step for morphological and physiological features extraction in order to analyze the structure in image-based clinical and research applications. Magnetic Resonance Imaging (MRI) is the most suitable image technique for CC segmentation due to its ability to provide contrast between brain tissues however CC segmentation is challenging because of the shape and intensity variability between subjects, volume partial effect in diffusion MRI, fornex proximity and narrow areas in CC. Among the known MRI modalities, Diffusion-MRI arouses special interest to study the CC, despite its low resolution and high complexity, since it provides useful information related to the organization of brain tissues and the magnetic field does not interfere with the diffusion process itself.

Some CC segmentation approaches using Diffusion-MRI were found in the literature. Niogi et al. proposed a method based on thresholding, Freitas et al. e Rittner et al. proposed region methods based on Watershed transform, Nazem-Zadeh et al. implemented based on level surfaces, Kong et al. presented an clustering algorithm for segmentation, Herrera et al. segmented CC directly in diffusion weighted imaging (DWI) using a model based on pixel classification and Garcia et al. proposed a hybrid segmentation method based on active geodesic regions and level surfaces.

With the growing of data and the proliferation of automatic algorithms, segmentation over large databases is affordable. Therefore, error automatic detection is important in order to facilitate and speed up filter on CC segmentation databases. presented proposals for content-based image retrieval (CBIR) using shape signature of the planar object representation.

In this work, a method for automatic detection of segmentation error in large datasets is proposed based on CC shape signature. Signature offers shape characterization of the CC and therefore it is expected that a "typical correct signature" represents well any correct segmentation. Signature is extracted measuring curvature along segmentation contour. The method was implemented in three main stages: mean correct signature generation, signature configuration and method testing. The first one takes 20 corrects segmentations and generates one correct signature of reference (typical correct signature), per-resolution, using mean values in each point. The second stage stage takes 10 correct segmentations and 10 erroneous segmentations and adjusts the optimal resolution and threshold, based on mean correct signature, that lets detection of erroneous segmentations. The third stage labels a new segmentation as correct and erroneous comparing with the mean signature using optimal resolution and threshold.

The comparison between signatures is done using root mean square error (RMSE). True label for each segmentation was done visually. Correct segmentation corresponds to segmentations with at least 50% of agreement with the structure. It is expected that RMSE for correct segmentations is lower than RMSE associated to erroneous segmentation when compared with a typical correct segmentation.


In [3]:
#Loading labeled segmentations
seg_label = genfromtxt('../../dataset/Seg_Watershed/watershed_label.csv', delimiter=',').astype('uint8')

list_masks_wt = seg_label[np.logical_or(seg_label[:,1] == 0, seg_label[:,1] == 1), 0] #Extracting segmentations
list_labels_wt = seg_label[np.logical_or(seg_label[:,1] == 0, seg_label[:,1] == 1), 1] #Extracting labels

#Loading labeled segmentations
seg_label = genfromtxt('../../dataset/Seg_ROQS/roqs_label.csv', delimiter=',').astype('uint8')

list_masks_rq = seg_label[np.logical_or(seg_label[:,1] == 0, seg_label[:,1] == 1), 0] #Extracting segmentations
list_labels_rq = seg_label[np.logical_or(seg_label[:,1] == 0, seg_label[:,1] == 1), 1] #Extracting labels

#Loading labeled segmentations
seg_label = genfromtxt('../../dataset/Seg_pixel/pixel_label.csv', delimiter=',').astype('uint8')

list_masks_px = seg_label[np.logical_or(seg_label[:,1] == 0, seg_label[:,1] == 1), 0] #Extracting segmentations
list_labels_px = seg_label[np.logical_or(seg_label[:,1] == 0, seg_label[:,1] == 1), 1] #Extracting labels

#Loading labeled segmentations
seg_label = genfromtxt('../../dataset/Seg_DataAug/dataaug_label.csv', delimiter=',').astype('uint8')

list_masks_dt = seg_label[np.logical_or(seg_label[:,1] == 0, seg_label[:,1] == 1), 0] #Extracting segmentations
list_labels_dt = seg_label[np.logical_or(seg_label[:,1] == 0, seg_label[:,1] == 1), 1] #Extracting labels

Shape signature for comparison

Signature is a shape descriptor that measures the rate of variation along the segmentation contour. As shown in figure, the curvature $k$ in the pivot point $p$, with coordinates ($x_p$,$y_p$), is calculated using the next equation. This curvature depict the angle between the segments $\overline{(x_{p-ls},y_{p-ls})(x_p,y_p)}$ and $\overline{(x_p,y_p)(x_{p+ls},y_{p+ls})}$. These segments are located to a distance $ls>0$, starting in a pivot point and finishing in anterior and posterior points, respectively. The signature is obtained calculating the curvature along all segmentation contour.

\begin{equation} \label{eq:per1} k(x_p,y_p) = \arctan\left(\frac{y_{p+ls}-y_p}{x_{p+ls}-x_p}\right)-\arctan\left(\frac{y_p-y_{p-ls}}{x_p-x_{p-ls}}\right) \end{equation}

Signature construction is performed from segmentation contour of the CC. From contour, spline is obtained. Spline purpose is twofold: to get a smooth representation of the contour and to facilitate calculation of the curvature using its parametric representation. The signature is obtained measuring curvature along spline. $ls$ is the parametric distance between pivot point and both posterior and anterior points and it determines signature resolution. By simplicity, $ls$ is measured in percentage of reconstructed spline points.

In order to achieve quantitative comparison between two signatures root mean square error (RMSE) is introduced. RMSE measures distance, point to point, between signatures $a$ and $b$ along all points $p$ of signatures.

\begin{equation} \label{eq:per4} RMSE = \sqrt{\frac{1}{P}\sum_{p=1}^{P}(k_{ap}-k_{bp})^2} \end{equation}

Frequently, signatures of different segmentations are not fitted along the 'x' axis because of the initial point on the spline calculation starts in different relative positions. This makes impossible to compare directly two signatures and therefore, a prior fitting process must be accomplished. The fitting process is done shifting one of the signature while the other is kept fixed. For each shift, RMSE between the two signatures is measured. The point giving the minor error is the fitting point. Fitting was done at resolution $ls = 0.35$. This resolution represents globally the CC's shape and eases their fitting.

After fitting, RMSE between signatures can be measured in order to achieve final quantitative comparison.

Signature for segmentation error detection

For segmentation error detection, a typical correct signature is obtained calculating mean over a group of signatures from correct segmentations. Because of this signature could be used in any resolution, $ls$ must be chosen for achieve segmentation error detection. The optimal resolution must be able to return the greatest RMSE difference between correct and erroneous segmentation when compared with a typical correct signature.

In the optimal resolution, a threshold must be chosen for separate erroneous and correct segmentations. This threshold stays between RMSE associated to correct ($RMSE_E$) and erroneous ($RMSE_C$) signatures and it is given by the next equation where N (in percentage) represents proximity to correct or erroneous RMSE. If RMSE calculated over a group of signatures, mean value is applied.

\begin{equation} \label{eq:eq3} th = N*(\overline{RMSE_E}-\overline{RMSE_C})+\overline{RMSE_C} \end{equation}

Experiments and results

In this work, comparison of signatures through RMSE is used for segmentation error detection in large datasets. For this, it will be calculated a mean correct signature based on 20 correct segmentation signatures. This mean correct signature represents a tipycal correct segmentation. For a new segmentation, signature is extracted and compared with mean signature.

For experiments, DWI from 152 subjects at the University of Campinas, were acquired on a Philips scanner Achieva 3T in the axial plane with a $1$x$1mm$ spatial resolution and $2mm$ slice thickness, along $32$ directions ($b-value=1000s/mm^2$, $TR=8.5s$, and $TE=61ms$). All data used in this experiment was acquired through a project approved by the research ethics committee from the School of Medicine at UNICAMP. From each acquired DWI volume, only the midsaggital slice was used.

Three segmentation methods were implemented to obtained binary masks over a 152 subject dataset: Watershed, ROQS and pixel-based. 40 Watershed segmentations were chosen as follows: 20 correct segmentations for mean correct signature generation and 10 correct and 10 erroneous segmentations for signature configuration stage. Watershed was chosen to generate and adjust the mean signature because of its higher error rate and its variability in the erroneous segmentation shape. These characteristics allow improve generalization. The method was tested on the remaining Watershed segmentations (108 masks) and two additional segmentations methods: ROQS (152 masks) and pixel-based (152 masks).

Mean correct signature generation

In this work, segmentations based on Watershed method were used for implementation of the first and second stages. From the Watershed dataset, 20 correct segmentations were chosen. Spline for each one was obtained from segmentation contour. The contour was obtained using mathematical morphology, applying xor logical operation, pixel-wise, between original segmentation and the eroded version of itself by an structuring element b:

\begin{equation} \label{eq:per2} G_E = XOR(S,S \ominus b) \end{equation}

From contour, it is calculated spline. The implementation, is a B-spline (Boor's basic spline). This formulation has two parameters: degree, representing polynomial degrees of the spline, and smoothness, being the trade off between proximity and smoothness in the fitness of the spline. Degree was fixed in 5 allowing adequate representation of the contour. Smoothness was fixed in 700. This value is based on the mean quantity of pixels of the contour that are passed for spline calculation. The curvature was measured over 500 points over the spline to generate the signature along 20 segmentations. Signatures were fitted to make possible comparison (Fig. signatures). Fitting resolution was fixed in 0.35.

In order to get a representative correct signature, mean signature per-resolution is generated using 20 correct signatures. The mean is calculated in each point.

Signature configuration

Because of the mean signature was extracted for all the resolutions, it is necessary to find resolution in that diference between RMSE for correct signature and RMSE for erroneous signature is maximum. So, 20 news segmentations were used to find this optimal resolution, being divided as 10 correct segmentations and 10 erroneous segmentations. For each segmentation, it was extracted signature for all resolutions.


In [4]:
res_ex = 15
smoothness = 700 #Smoothness
degree = 5 #Spline degree
fit_res = 0.35
resols = np.arange(0.01,0.5,0.01) #Signature resolutions
resols = np.insert(resols,0,fit_res) #Insert resolution for signature fitting
points = 500 #Points of Spline reconstruction

In [4]:
prof_vec_wt = np.empty((len(list_masks_wt),resols.shape[0],points)) #Initializing correct signature vector
for ind, mask in enumerate(list_masks_wt):
    #Loading correct mask
    mask_pn = np.load('../../dataset/Seg_Watershed/mask_wate_{}.npy'.format(mask))
    refer_temp = sign_extract(mask_pn, resols) #Function for shape signature extraction
    prof_vec_wt[ind] = refer_temp
    if mask > 0: #Fitting curves using the first one as basis
        prof_ref = prof_vec_wt[0]
        prof_vec_wt[ind] = sign_fit(prof_ref[0], refer_temp) #Function for signature fitting

ind_rel_cor = np.where(np.logical_not(list_labels_wt))[0]
ind_rel_err = np.where(list_labels_wt)[0]

print "Correct segmentations' vector: ", prof_vec_wt[ind_rel_cor].shape
print "Erroneous segmentations' vector: ", prof_vec_wt[ind_rel_err].shape


Correct segmentations' vector:  (123, 50, 500)
Erroneous segmentations' vector:  (24, 50, 500)

In [5]:
plt.figure()
plt.plot(prof_vec_wt[ind_rel_cor,res_ex,:].T)
plt.title("Correct signatures for res: %f"%(resols[res_ex]))
plt.show()

plt.figure()
plt.plot(prof_vec_wt[ind_rel_err,res_ex,:].T)
plt.title("Erroneous signatures for res: %f"%(resols[res_ex]))
plt.show()



In [6]:
prof_vec_rq = np.empty((len(list_masks_rq),resols.shape[0],points)) #Initializing correct signature vector
for ind, mask in enumerate(list_masks_rq):
    mask_pn = np.load('../../dataset/Seg_ROQS/mask_roqs_{}.npy'.format(mask)) #Loading mask
    refer_temp = sign_extract(mask_pn, resols) #Function for shape signature extraction
    prof_vec_rq[ind] = sign_fit(prof_vec_wt[0][0], refer_temp) #Function for signature fitting using Watershed as basis

ind_rel_cor = np.where(np.logical_not(list_labels_rq))[0]
ind_rel_err = np.where(list_labels_rq)[0]
        
print "Correct segmentations' vector: ", prof_vec_rq[ind_rel_cor].shape
print "Erroneous segmentations' vector: ", prof_vec_rq[ind_rel_err].shape


Correct segmentations' vector:  (136, 50, 500)
Erroneous segmentations' vector:  (16, 50, 500)

In [7]:
plt.figure()
plt.plot(prof_vec_rq[ind_rel_cor,res_ex,:].T)
plt.title("Correct signatures for res: %f"%(resols[res_ex]))
plt.show()

plt.figure()
plt.plot(prof_vec_rq[ind_rel_err,res_ex,:].T)
plt.title("Erroneous signatures for res: %f"%(resols[res_ex]))
plt.show()



In [8]:
prof_vec_px = np.empty((len(list_masks_px),resols.shape[0],points)) #Initializing correct signature vector
for ind, mask in enumerate(list_masks_px):
    mask_pn = np.load('../../dataset/Seg_pixel/mask_pixe_{}.npy'.format(mask)) #Loading mask
    refer_temp = sign_extract(mask_pn, resols) #Function for shape signature extraction
    prof_vec_px[ind] = sign_fit(prof_vec_wt[0][0], refer_temp) #Function for signature fitting using Watershed as basis

ind_rel_cor = np.where(np.logical_not(list_labels_px))[0]
ind_rel_err = np.where(list_labels_px)[0]
        
print "Correct segmentations' vector: ", prof_vec_px[ind_rel_cor].shape
print "Erroneous segmentations' vector: ", prof_vec_px[ind_rel_err].shape


Correct segmentations' vector:  (130, 50, 500)
Erroneous segmentations' vector:  (22, 50, 500)

In [9]:
plt.figure()
plt.plot(prof_vec_px[ind_rel_cor,res_ex,:].T)
plt.title("Correct signatures for res: %f"%(resols[res_ex]))
plt.show()

plt.figure()
plt.plot(prof_vec_px[ind_rel_err,res_ex,:].T)
plt.title("Erroneous signatures for res: %f"%(resols[res_ex]))
plt.show()



In [10]:
prof_vec_dt = np.empty((len(list_masks_dt),resols.shape[0],points)) #Initializing correct signature vector
for ind, mask in enumerate(list_masks_dt):
    mask_pn = np.load('../../dataset/Seg_DataAug/mask_data_{}.npy'.format(mask)) #Loading mask
    refer_temp = sign_extract(mask_pn, resols) #Function for shape signature extraction
    prof_vec_dt[ind] = sign_fit(prof_vec_dt[0][0], refer_temp) #Function for signature fitting using Watershed as basis

ind_rel_cor = np.where(np.logical_not(list_labels_dt))[0]
ind_rel_err = np.where(list_labels_dt)[0]
        
print "Correct segmentations' vector: ", prof_vec_dt[ind_rel_cor].shape
print "Erroneous segmentations' vector: ", prof_vec_dt[ind_rel_err].shape


Correct segmentations' vector:  (0, 50, 500)
Erroneous segmentations' vector:  (151, 50, 500)

In [12]:
plt.figure()
plt.plot(prof_vec_dt[ind_rel_err,res_ex,:].T)
plt.title("Erroneous signatures for res: %f"%(resols[res_ex]))
plt.show()


Datasets


In [13]:
X_total = np.concatenate((prof_vec_wt, prof_vec_rq, prof_vec_px, prof_vec_dt), axis=0)
Y_total = np.concatenate((list_labels_wt, list_labels_rq, list_labels_px, list_labels_dt), axis=0)
print X_total.shape
print Y_total.shape


(602, 50, 500)
(602,)

In [14]:
#np.save('../../dataset/X_set.npy',X_total)
#np.save('../../dataset/Y_set.npy',Y_total)

In [5]:
X_total = np.load('../../dataset/X_set.npy')
X_total = X_total[:,1:,:]
print X_total.shape
Y_total = np.load('../../dataset/Y_set.npy')
print Y_total.shape
resols = resols[1:]
print resols.shape

resols_ref = np.arange(1,len(resols)+1)
print resols_ref.shape


(602, 49, 500)
(602,)
(49,)
(49,)

In [6]:
print("Initial division:", np.sum(Y_total)/len(Y_total))
split_train = 0.4
split_valid = split_train + 0.2
set_split_train = int(Y_total.shape[0]*split_train)
set_split_valid = int(Y_total.shape[0]*split_valid)
index_random = np.arange(Y_total.shape[0])
np.random.shuffle(index_random)

X_train = X_total[index_random[:set_split_train]]
X_valid = X_total[index_random[set_split_train:set_split_valid]]
X_test = X_total[index_random[set_split_valid:]]
Y_train = Y_total[index_random[:set_split_train]]
Y_valid = Y_total[index_random[set_split_train:set_split_valid]]
Y_test = Y_total[index_random[set_split_valid:]]
print X_train.shape, Y_train.shape, np.sum(Y_train)/len(Y_train)
print X_valid.shape, Y_valid.shape, np.sum(Y_valid)/len(Y_valid)
print X_test.shape, Y_test.shape, np.sum(Y_test)/len(Y_test)


('Initial division:', 0.3538205980066445)
(240, 49, 500) (240,) 0.375
(121, 49, 500) (121,) 0.371900826446
(241, 49, 500) (241,) 0.323651452282

Normalizing


In [7]:
val_norm = np.amax(np.amax(X_train,axis=0),axis=-1).reshape(1,-1,1)
print val_norm.shape

X_train_norm = X_train/val_norm
X_valid_norm = X_valid/val_norm
X_test_norm = X_test/val_norm


(1, 49, 1)

SVM's


In [8]:
tuned_parameters = [{'kernel': ['rbf', 'linear'], 'C': [1, 10, 50, 100]}]
cv_s = StratifiedKFold(5, shuffle=True)

tuned_parameters2 = [{'kernel': ['rbf', 'linear', 'poly'], 'C': [1, 10, 50, 100]}]

acum_acc = []
acum_pred = np.array([]).reshape(0,Y_valid.shape[0])
d_train = {}
for res in np.arange(resols.shape[0]):
    d_train["string{0}".format(res)] = GridSearchCV(svm.SVC(),tuned_parameters, cv=cv_s, scoring='accuracy')
    d_train["string{0}".format(res)].fit(X_train_norm[:,res,:], Y_train)
    #print "--------------------------------------------------------------------"
    #print d_train["string{0}".format(res)].best_estimator_

    y_true, y_pred = Y_valid, d_train["string{0}".format(res)].predict(X_valid_norm[:,res,:])
    Matrix_conf = confusion_matrix(y_true, y_pred)
    A_rf_n = (Matrix_conf[0,0]+Matrix_conf[1,1])/np.sum(Matrix_conf).astype('float64')
    acum_acc = np.append(acum_acc, A_rf_n)
    acum_pred = np.vstack((acum_pred, y_pred))

acum_acc2 = np.logical_not(np.logical_xor(acum_pred.astype('bool'),Y_valid.astype('bool')))

mm_conf = np.empty((4,acum_acc2.shape[0],acum_acc2.shape[0]))
for i in range(acum_acc2.shape[0]):
    for j in range(acum_acc2.shape[0]):
        mm_conf[:,i,j] = agreement_matrix(acum_acc2[i],acum_acc2[j])

a = mm_conf[0]
b = mm_conf[1]
c = mm_conf[2]
d = mm_conf[3]

DM = (b+c)/(a+b+c+d)
plot_matrix(DM, classes=map(str, resols),title='DM matrix')



In [42]:
choose_el = 'min_dist'#'min_dist', 'random'
num_clusters = np.arange(1,50)
clusters_acc_mean = []
clusters_acc_std = []
for n_cl in num_clusters:
    agg = skl_clus.AgglomerativeClustering(n_clusters=n_cl, affinity='precomputed', linkage='average')
    labels_res = agg.fit_predict(DM)
    #print labels_res
    random_acc = []
    for loop_random in range(15):
        list_clusters = []
        res_chs = []
        labels_list = np.unique(labels_res)
        for clust in labels_list:
            res_clust = np.where(labels_res==clust)[0]
            list_clusters.append(res_clust)
            #print "Set({}): ".format(clust), resols_ref[res_clust]
            if choose_el == 'random':
                res_chs.append(random.choice(res_clust))
            else:
                sum_dist = []
                for r_clust in res_clust:
                    sum_dist.append(np.sum(DM[r_clust,res_clust]))
                res_chs.append(res_clust[np.argmin(sum_dist)])
        #print "Clusters:", len(list_clusters), "Choosen resolutions:", resols_ref[res_chs]

        svm_ind = np.array([]).reshape(0,Y_valid.shape[0])
        for res_ch in res_chs:
            svm_ind = np.vstack((svm_ind, d_train["string{0}".format(res_ch)].predict(X_valid_norm[:,res_ch,:])))
        svm_ind = svm_ind.T
        clf = GridSearchCV(svm.SVC(),tuned_parameters2, cv=cv_s, scoring='accuracy')
        clf.fit(svm_ind, Y_valid)

        svm_ind = np.array([]).reshape(0,Y_test.shape[0])
        for res_ch in res_chs:
            svm_ind = np.vstack((svm_ind, d_train["string{0}".format(res_ch)].predict(X_test_norm[:,res_ch,:])))
        svm_ind = svm_ind.T
        y_true, y_pred = Y_test, clf.predict(svm_ind)
        Matrix_conf = confusion_matrix(y_true, y_pred)
        A_rf_n = (Matrix_conf[0,0]+Matrix_conf[1,1])/np.sum(Matrix_conf).astype('float64')
        #print 'Final accuracy: ', A_rf_n
        random_acc.append(A_rf_n)
    print n_cl,":", random_acc
    print "Mean:",np.mean(random_acc),"/ SD:",np.std(random_acc)
    clusters_acc_mean.append(np.mean(random_acc))
    clusters_acc_std.append(np.std(random_acc))


1 : [0.97925311203319498, 0.97925311203319498, 0.97925311203319498, 0.97925311203319498, 0.97925311203319498, 0.97925311203319498, 0.97925311203319498, 0.97925311203319498, 0.97925311203319498, 0.97925311203319498, 0.97925311203319498, 0.97925311203319498, 0.97925311203319498, 0.97925311203319498, 0.97925311203319498]
Mean: 0.979253112033 / SD: 0.0
2 : [0.975103734439834, 0.975103734439834, 0.975103734439834, 0.975103734439834, 0.975103734439834, 0.975103734439834, 0.975103734439834, 0.975103734439834, 0.975103734439834, 0.975103734439834, 0.975103734439834, 0.975103734439834, 0.975103734439834, 0.975103734439834, 0.975103734439834]
Mean: 0.97510373444 / SD: 0.0
3 : [0.97095435684647302, 0.97095435684647302, 0.97095435684647302, 0.97095435684647302, 0.97925311203319498, 0.97925311203319498, 0.97095435684647302, 0.97925311203319498, 0.97095435684647302, 0.97925311203319498, 0.975103734439834, 0.97095435684647302, 0.975103734439834, 0.97925311203319498, 0.97925311203319498]
Mean: 0.974827109267 / SD: 0.00385294281527
4 : [0.97925311203319498, 0.97925311203319498, 0.97095435684647302, 0.97095435684647302, 0.97925311203319498, 0.97095435684647302, 0.97095435684647302, 0.97925311203319498, 0.97925311203319498, 0.97925311203319498, 0.97925311203319498, 0.97095435684647302, 0.97095435684647302, 0.97925311203319498, 0.97925311203319498]
Mean: 0.975933609959 / SD: 0.00406554314155
5 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596]
Mean: 0.987275242047 / SD: 0.00103503662151
6 : [0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.985892116183 / SD: 0.00203277157077
7 : [0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596]
Mean: 0.984232365145 / SD: 0.00165975103734
8 : [0.97925311203319498, 0.98755186721991706, 0.98755186721991706, 0.97925311203319498, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.97925311203319498, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.97925311203319498]
Mean: 0.985338865837 / SD: 0.00366984762418
9 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.97925311203319498, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.97925311203319498, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.97925311203319498]
Mean: 0.985892116183 / SD: 0.00331950207469
10 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706]
Mean: 0.986721991701 / SD: 0.00165975103734
11 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.98755186722 / SD: 0.0
12 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.98755186722 / SD: 0.0
13 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706]
Mean: 0.986168741355 / SD: 0.00195603535598
14 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706]
Mean: 0.986445366528 / SD: 0.00183492381209
15 : [0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.986721991701 / SD: 0.00165975103734
16 : [0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.986721991701 / SD: 0.00165975103734
17 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.987275242047 / SD: 0.00103503662151
18 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.987275242047 / SD: 0.00103503662151
19 : [0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.986998616874 / SD: 0.00141051715452
20 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596]
Mean: 0.986721991701 / SD: 0.00165975103734
21 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.987275242047 / SD: 0.00103503662151
22 : [0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.986721991701 / SD: 0.00165975103734
23 : [0.97925311203319498, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.97925311203319498, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596]
Mean: 0.982849239281 / SD: 0.00141051715452
24 : [0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.97925311203319498, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596]
Mean: 0.983125864454 / SD: 0.00103503662151
25 : [0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596]
Mean: 0.983679114799 / SD: 0.00103503662151
26 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.98755186722 / SD: 0.0
27 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.98755186722 / SD: 0.0
28 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.98755186722 / SD: 0.0
29 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.98755186722 / SD: 0.0
30 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.986998616874 / SD: 0.00141051715452
31 : [0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596]
Mean: 0.984785615491 / SD: 0.00195603535598
32 : [0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596]
Mean: 0.98561549101 / SD: 0.00207007324303
33 : [0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596]
Mean: 0.985892116183 / SD: 0.00203277157077
34 : [0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706]
Mean: 0.985338865837 / SD: 0.00207007324303
35 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.98755186722 / SD: 0.0
36 : [0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596]
Mean: 0.984785615491 / SD: 0.00195603535598
37 : [0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.985892116183 / SD: 0.00203277157077
38 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706]
Mean: 0.985892116183 / SD: 0.00203277157077
39 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.98755186722 / SD: 0.0
40 : [0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706]
Mean: 0.986445366528 / SD: 0.00183492381209
41 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.98755186722 / SD: 0.0
42 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.98755186722 / SD: 0.0
43 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.98755186722 / SD: 0.0
44 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.98755186722 / SD: 0.0
45 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.98755186722 / SD: 0.0
46 : [0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706, 0.98755186721991706]
Mean: 0.98755186722 / SD: 0.0
47 : [0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596]
Mean: 0.984785615491 / SD: 0.00195603535598
48 : [0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596]
Mean: 0.984508990318 / SD: 0.00183492381209
49 : [0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98340248962655596, 0.98755186721991706, 0.98340248962655596]
Mean: 0.984508990318 / SD: 0.00183492381209

In [43]:
plt.figure()
plt.plot(clusters_acc_mean)
plt.plot(np.array(clusters_acc_mean)-np.array(clusters_acc_std), linestyle='dashed')
plt.plot(np.array(clusters_acc_mean)+np.array(clusters_acc_std), linestyle='dashed')
plt.title("Mean and standard deviation accuracy per cluster size")
plt.show()

plt.figure()
plt.plot(clusters_acc_std)
plt.title("Standard deviation accuracy per cluster size")
plt.show()



In [44]:
max_acc, min_acc = np.amax(clusters_acc), np.amin(clusters_acc)
num_clust_ch = np.where(clusters_acc > min_acc+0.9*(max_acc-min_acc))[0][0]+1
print "Cluster size:", num_clust_ch


Cluster size: 10

Qualitative test


In [45]:
agg = skl_clus.AgglomerativeClustering(n_clusters=num_clust_ch, affinity='precomputed', linkage='complete')
labels_res = agg.fit_predict(DM)
print 'Element labels', labels_res

list_clusters = []
res_chs = []
labels_list = np.unique(labels_res)
for clust in labels_list:
    res_clust = np.where(labels_res==clust)[0]
    list_clusters.append(res_clust)
    print "--------------------------------------------------------------------"
    if len(res_clust) == 1:
        dist_intra = np.amax(DM[res_clust[0],res_clust[0]])
    else:
        dist_intra = np.amax(DM[res_clust[0],res_clust[1:]])
    print "Cluster({}): ".format(clust),resols_ref[res_clust],"Distance intra: ",dist_intra
    
    for res_ch in res_clust:
        y_true, y_pred = Y_test, d_train["string{0}".format(res_ch)].predict(X_test_norm[:,res_ch,:])
        Matrix_conf = confusion_matrix(y_true, y_pred)
        A_rf_n = (Matrix_conf[0,0]+Matrix_conf[1,1])/np.sum(Matrix_conf).astype('float64')
        
        ind_err_rand = np.where(np.logical_xor(y_true, y_pred))
        ind_err = index_random[set_split_valid:][ind_err_rand]
        print "Element({}): ".format(resols_ref[res_ch]),ind_err, "Acc:({})".format(A_rf_n)

    if choose_el == 'random':
        res_chs.append(random.choice(res_clust))
    else:
        sum_dist = []
        for r_clust in res_clust:
            sum_dist.append(np.sum(DM[r_clust,res_clust]))
        res_chs.append(res_clust[np.argmin(sum_dist)])

print "====================================================================="
print "Size ensemble:", len(list_clusters), "Choosen resolutions:", resols_ref[res_chs]

svm_ind = np.array([]).reshape(0,Y_valid.shape[0])
for res_ch in res_chs:
    svm_ind = np.vstack((svm_ind, d_train["string{0}".format(res_ch)].predict(X_valid_norm[:,res_ch,:])))
svm_ind = svm_ind.T
clf = GridSearchCV(svm.SVC(),tuned_parameters2, cv=cv_s, scoring='accuracy')
clf.fit(svm_ind, Y_valid)

svm_ind = np.array([]).reshape(0,Y_test.shape[0])
for res_ch in res_chs:
    svm_ind = np.vstack((svm_ind, d_train["string{0}".format(res_ch)].predict(X_test_norm[:,res_ch,:])))
svm_ind = svm_ind.T
y_true, y_pred = Y_test, clf.predict(svm_ind)
Matrix_conf = confusion_matrix(y_true, y_pred)
A_rf_n = (Matrix_conf[0,0]+Matrix_conf[1,1])/np.sum(Matrix_conf).astype('float64')

ind_err_rand = np.where(np.logical_xor(y_true, y_pred))
ind_err = index_random[set_split_valid:][ind_err_rand]
print "Ensemble: ",ind_err, "Final acc ensemble:({})".format(A_rf_n)

y_gt = y_true[ind_err_rand]
y_pr = y_pred[ind_err_rand]

for i, ind_e in enumerate(ind_err):
    if ind_e < 147:
        mask = list_masks_wt[ind_e]
        mask_pn = np.load('../../dataset/Seg_Watershed/mask_wate_{}.npy'.format(mask))
    elif np.logical_and(ind_e >= 147, ind_e < 299):
        mask = list_masks_rq[ind_e-147]
        mask_pn = np.load('../../dataset/Seg_ROQS/mask_roqs_{}.npy'.format(mask))
    elif np.logical_and(ind_e >= 299, ind_e < 451):
        mask = list_masks_px[ind_e-299]
        mask_pn = np.load('../../dataset/Seg_pixel/mask_pixe_{}.npy'.format(mask))
    else:
        mask = list_masks_dt[ind_e-451]
        mask_pn = np.load('../../dataset/Seg_DataAug/mask_data_{}.npy'.format(mask))

    plt.figure()
    plt.imshow(mask_pn,cmap='gray')
    plt.title("Element:{},True:{},Predicted:{}".format(ind_e, y_gt[i], y_pr[i]))


Element labels [4 7 7 4 4 4 9 9 3 3 3 3 3 3 3 1 1 1 1 1 8 6 6 6 5 5 1 5 6 6 0 2 2 2 0 0 0
 2 0 0 0 0 0 1 0 0 0 0 1]
--------------------------------------------------------------------
Cluster(0):  [31 35 36 37 39 40 41 42 43 45 46 47 48] Distance intra:  0.0413223140496
Element(31):  [257 265 144 183 379 162 397] Acc:(0.970954356846)
Element(35):  [257 265 183 379 162] Acc:(0.979253112033)
Element(36):  [257 265 144 183 379 162 397] Acc:(0.970954356846)
Element(37):  [265 144 183 379 397] Acc:(0.979253112033)
Element(39):  [265 364 183 379 397] Acc:(0.979253112033)
Element(40):  [265 364 183 379 397] Acc:(0.979253112033)
Element(41):  [265 364 183 379 397] Acc:(0.979253112033)
Element(42):  [257 265 364 183 379 200 397] Acc:(0.970954356846)
Element(43):  [257 265 364 183 379 397] Acc:(0.97510373444)
Element(45):  [265 183 379 397 390] Acc:(0.979253112033)
Element(46):  [183 379 249 397] Acc:(0.983402489627)
Element(47):  [183 379 162 397] Acc:(0.983402489627)
Element(48):  [183 379 249 397] Acc:(0.983402489627)
--------------------------------------------------------------------
Cluster(1):  [16 17 18 19 20 27 44 49] Distance intra:  0.0330578512397
Element(16):  [257 183 379 200 385] Acc:(0.979253112033)
Element(17):  [183 379 200 397 390 125] Acc:(0.97510373444)
Element(18):  [257 183 379 200 390] Acc:(0.979253112033)
Element(19):  [257 183 379 200 390 453] Acc:(0.97510373444)
Element(20):  [257  77 183 379 200  84 390 453] Acc:(0.966804979253)
Element(27):  [257 265 183 379 200 162 390] Acc:(0.970954356846)
Element(44):  [ 51 183 379 397 390] Acc:(0.979253112033)
Element(49):  [183 379 397 555] Acc:(0.983402489627)
--------------------------------------------------------------------
Cluster(2):  [32 33 34 38] Distance intra:  0.0165289256198
Element(32):  [257 265 183 379 200 162 397] Acc:(0.970954356846)
Element(33):  [257 265 183 379 200 162] Acc:(0.97510373444)
Element(34):  [257 265  99 183 379 200 397 390] Acc:(0.966804979253)
Element(38):  [257 265 183 379 200 162 397] Acc:(0.970954356846)
--------------------------------------------------------------------
Cluster(3):  [ 9 10 11 12 13 14 15] Distance intra:  0.0247933884298
Element(9):  [257 183 379 200 390] Acc:(0.979253112033)
Element(10):  [183 379 200 390] Acc:(0.983402489627)
Element(11):  [257 144 183 379 200 390] Acc:(0.97510373444)
Element(12):  [144 183 379 200 390] Acc:(0.979253112033)
Element(13):  [257 144 183 379 200 390] Acc:(0.97510373444)
Element(14):  [257 144 183 379 200 390] Acc:(0.97510373444)
Element(15):  [183 379 200 385] Acc:(0.983402489627)
--------------------------------------------------------------------
Cluster(4):  [1 4 5 6] Distance intra:  0.0330578512397
Element(1):  [257 143 183 379 200 162 397 286 428 233] Acc:(0.958506224066)
Element(4):  [257  32 183 379 162 397 385 233] Acc:(0.966804979253)
Element(5):  [257  32 183 379 162 385] Acc:(0.97510373444)
Element(6):  [257  32 183 379 162] Acc:(0.979253112033)
--------------------------------------------------------------------
Cluster(5):  [25 26 28] Distance intra:  0.0247933884298
Element(25):  [257 265  77 144  51 183 379  84 384] Acc:(0.96265560166)
Element(26):  [257 265 144  51 183 379  84] Acc:(0.970954356846)
Element(28):  [257 265 144  51 183 379  84 129] Acc:(0.966804979253)
--------------------------------------------------------------------
Cluster(6):  [22 23 24 29 30] Distance intra:  0.0165289256198
Element(22):  [257  77 144  51 183 379 200  84 390] Acc:(0.96265560166)
Element(23):  [257 265  77 144 183 379 200  84 390 129] Acc:(0.958506224066)
Element(24):  [257 265  77 144  51 183 379 200  84 390] Acc:(0.958506224066)
Element(29):  [257 265 144  51 183 379  84 397] Acc:(0.966804979253)
Element(30):  [257 265 144 183  95 379  84 162 397] Acc:(0.96265560166)
--------------------------------------------------------------------
Cluster(7):  [2 3] Distance intra:  0.0165289256198
Element(2):  [257  32 183  95 379 397 286 385 453] Acc:(0.96265560166)
Element(3):  [257  32 183 379 397 385] Acc:(0.97510373444)
--------------------------------------------------------------------
Cluster(8):  [21] Distance intra:  0.0
Element(21):  [257 265 183 379 200 390] Acc:(0.97510373444)
--------------------------------------------------------------------
Cluster(9):  [7 8] Distance intra:  0.00826446280992
Element(7):  [257 183 379 162] Acc:(0.983402489627)
Element(8):  [257 265 183 379 162 390] Acc:(0.97510373444)
=====================================================================
Size ensemble: 10 Choosen resolutions: [39 18 32  9  5 25 23  2 21  7]
Ensemble:  [257 183 379] Final acc ensemble:(0.98755186722)

Assessing clusters

Design of Effective Neural Network Ensembles for Image Classification Purposes

$\forall E_i, E_j \text{ and } i\neq j$

$d(E_i, E_j)=\max_{n_s \in E_i, n_t \in E_j}(d(n_s, n_t))$


In [46]:
matrix_distance = np.zeros((len(labels_list),len(labels_list)))
ref_clust = np.arange(len(labels_list))
for clust in ref_clust:
    curr_clus = list_clusters[clust]
    for clusts in ref_clust[np.arange(len(ref_clust))!=clust]:
        d_per_clust = []
        for in_d in curr_clus:
            #print(in_d,"vs",list_clusters[clusts])
            d_per_clust.append(DM[in_d,list_clusters[clusts]])
        matrix_distance[clust,clusts] = np.amax(d_per_clust)
        #print("Max distance inter-cluster ({}) to ({})".format(labels_list[clust],labels_list[clusts]),
        #      np.amax(d_per_clust))
plot_matrix(matrix_distance, classes=map(str, labels_list),title='Inter-cluster distance')


The RMSE over the 10 correct segmentations was compared with RMSE over the 10 erroneous segmentations. As expected, RMSE for correct segmentations was greater than RMSE for erroneous segmentations along all the resolutions. In general, this is true, but optimal resolution guarantee the maximum difference between both of RMSE results: correct and erroneous.

So, to find optimal resolution, difference between correct and erroneous RMSE was calculated over all resolutions.

The greatest difference resulted at resolution 0.1. In this resolution, threshold for separate erroneous and correct segmentations is established as 30% of the distance between the mean RMSE of the correct masks and the mean RMSE of the erroneous masks.

Method testing

Finally, method test was performed in the 152 subject dataset: Watershed dataset with 112 segmentations, ROQS dataset with 152 segmentations and pixel-based dataset with 152 segmentations.

Discussion and conclusion

In this work, a method for segmentation error detection in large datasets was proposed based-on shape signature. RMSE was used for comparison between signatures. Signature can be extracted en various resolutions but optimal resolution (ls=0.1) was chosen in order to get maximum separation between correct RMSE and erroneous RMSE. In this optimal resolution, threshold was fixed at 27.95 allowing separation of the two segmentation classes.The method achieved 95% of accuracy on the test Watershed segmentations, and 95% and 94% on new datasets: ROQS and pixel-based, respectively.

40 Watershed segmentations on dataset were used to generation and configuration mean correct signature because of the greater number of erroneous segmentations and major variability on the error shape. Because the signature holds the CC shape, the method can be extended to new datasets segmented with any method. Accuracy and generalization can be improve varying the segmentations used to generate and adjust the mean correct signature.