author: Diogo Silva


In [1]:
%pylab inline
#%qtconsole


Populating the interactive namespace from numpy and matplotlib

In [49]:
home = %env HOME

In [50]:
cd $home/QCThesis/EAC


/home/diogoaos/QCThesis/EAC

Generate data & data partitions


In [4]:
%run generatePartitions.py -d synthetic -n 100 -D 2 -C 6 -i 3 -m cuda -s syn_cem -np 20 -mc 4 -Mc 25 -dir test/


2015-04-22 09:26:30,627 - status - INFO - Start of logging.
2015-04-22 09:26:30,627 - status - INFO - Start of logging.
INFO:status:Start of logging.
2015-04-22 09:26:30,630 - status - INFO - Generating data:nsamples=100,ndims=2,centers=6
2015-04-22 09:26:30,630 - status - INFO - Generating data:nsamples=100,ndims=2,centers=6
INFO:status:Generating data:nsamples=100,ndims=2,centers=6
2015-04-22 09:26:30,637 - status - INFO - Generating partition: #0, clusters=[11]
2015-04-22 09:26:30,637 - status - INFO - Generating partition: #0, clusters=[11]
INFO:status:Generating partition: #0, clusters=[11]
2015-04-22 09:26:30,654 - status - INFO - Saved partition: test/syn_cem_partition_0.csv
2015-04-22 09:26:30,654 - status - INFO - Saved partition: test/syn_cem_partition_0.csv
INFO:status:Saved partition: test/syn_cem_partition_0.csv
2015-04-22 09:26:30,658 - status - INFO - Generating partition: #1, clusters=[13]
2015-04-22 09:26:30,658 - status - INFO - Generating partition: #1, clusters=[13]
INFO:status:Generating partition: #1, clusters=[13]
2015-04-22 09:26:30,675 - status - INFO - Saved partition: test/syn_cem_partition_1.csv
2015-04-22 09:26:30,675 - status - INFO - Saved partition: test/syn_cem_partition_1.csv
INFO:status:Saved partition: test/syn_cem_partition_1.csv
2015-04-22 09:26:30,679 - status - INFO - Generating partition: #2, clusters=[17]
2015-04-22 09:26:30,679 - status - INFO - Generating partition: #2, clusters=[17]
INFO:status:Generating partition: #2, clusters=[17]
2015-04-22 09:26:30,698 - status - INFO - Saved partition: test/syn_cem_partition_2.csv
2015-04-22 09:26:30,698 - status - INFO - Saved partition: test/syn_cem_partition_2.csv
INFO:status:Saved partition: test/syn_cem_partition_2.csv
2015-04-22 09:26:30,702 - status - INFO - Generating partition: #3, clusters=[12]
2015-04-22 09:26:30,702 - status - INFO - Generating partition: #3, clusters=[12]
INFO:status:Generating partition: #3, clusters=[12]
2015-04-22 09:26:30,721 - status - INFO - Saved partition: test/syn_cem_partition_3.csv
2015-04-22 09:26:30,721 - status - INFO - Saved partition: test/syn_cem_partition_3.csv
INFO:status:Saved partition: test/syn_cem_partition_3.csv
2015-04-22 09:26:30,724 - status - INFO - Generating partition: #4, clusters=[11]
2015-04-22 09:26:30,724 - status - INFO - Generating partition: #4, clusters=[11]
INFO:status:Generating partition: #4, clusters=[11]
2015-04-22 09:26:30,742 - status - INFO - Saved partition: test/syn_cem_partition_4.csv
2015-04-22 09:26:30,742 - status - INFO - Saved partition: test/syn_cem_partition_4.csv
INFO:status:Saved partition: test/syn_cem_partition_4.csv
2015-04-22 09:26:30,746 - status - INFO - Generating partition: #5, clusters=[8]
2015-04-22 09:26:30,746 - status - INFO - Generating partition: #5, clusters=[8]
INFO:status:Generating partition: #5, clusters=[8]
2015-04-22 09:26:30,763 - status - INFO - Saved partition: test/syn_cem_partition_5.csv
2015-04-22 09:26:30,763 - status - INFO - Saved partition: test/syn_cem_partition_5.csv
INFO:status:Saved partition: test/syn_cem_partition_5.csv
2015-04-22 09:26:30,767 - status - INFO - Generating partition: #6, clusters=[4]
2015-04-22 09:26:30,767 - status - INFO - Generating partition: #6, clusters=[4]
INFO:status:Generating partition: #6, clusters=[4]
2015-04-22 09:26:30,784 - status - INFO - Saved partition: test/syn_cem_partition_6.csv
2015-04-22 09:26:30,784 - status - INFO - Saved partition: test/syn_cem_partition_6.csv
INFO:status:Saved partition: test/syn_cem_partition_6.csv
2015-04-22 09:26:30,788 - status - INFO - Generating partition: #7, clusters=[6]
2015-04-22 09:26:30,788 - status - INFO - Generating partition: #7, clusters=[6]
INFO:status:Generating partition: #7, clusters=[6]
2015-04-22 09:26:30,804 - status - INFO - Saved partition: test/syn_cem_partition_7.csv
2015-04-22 09:26:30,804 - status - INFO - Saved partition: test/syn_cem_partition_7.csv
INFO:status:Saved partition: test/syn_cem_partition_7.csv
2015-04-22 09:26:30,808 - status - INFO - Generating partition: #8, clusters=[9]
2015-04-22 09:26:30,808 - status - INFO - Generating partition: #8, clusters=[9]
INFO:status:Generating partition: #8, clusters=[9]
2015-04-22 09:26:30,824 - status - INFO - Saved partition: test/syn_cem_partition_8.csv
2015-04-22 09:26:30,824 - status - INFO - Saved partition: test/syn_cem_partition_8.csv
INFO:status:Saved partition: test/syn_cem_partition_8.csv
2015-04-22 09:26:30,828 - status - INFO - Generating partition: #9, clusters=[22]
2015-04-22 09:26:30,828 - status - INFO - Generating partition: #9, clusters=[22]
INFO:status:Generating partition: #9, clusters=[22]
2015-04-22 09:26:30,848 - status - INFO - Saved partition: test/syn_cem_partition_9.csv
2015-04-22 09:26:30,848 - status - INFO - Saved partition: test/syn_cem_partition_9.csv
INFO:status:Saved partition: test/syn_cem_partition_9.csv
2015-04-22 09:26:30,851 - status - INFO - Generating partition: #10, clusters=[14]
2015-04-22 09:26:30,851 - status - INFO - Generating partition: #10, clusters=[14]
INFO:status:Generating partition: #10, clusters=[14]
2015-04-22 09:26:30,868 - status - INFO - Saved partition: test/syn_cem_partition_10.csv
2015-04-22 09:26:30,868 - status - INFO - Saved partition: test/syn_cem_partition_10.csv
INFO:status:Saved partition: test/syn_cem_partition_10.csv
2015-04-22 09:26:30,872 - status - INFO - Generating partition: #11, clusters=[11]
2015-04-22 09:26:30,872 - status - INFO - Generating partition: #11, clusters=[11]
INFO:status:Generating partition: #11, clusters=[11]
2015-04-22 09:26:30,889 - status - INFO - Saved partition: test/syn_cem_partition_11.csv
2015-04-22 09:26:30,889 - status - INFO - Saved partition: test/syn_cem_partition_11.csv
INFO:status:Saved partition: test/syn_cem_partition_11.csv
2015-04-22 09:26:30,893 - status - INFO - Generating partition: #12, clusters=[10]
2015-04-22 09:26:30,893 - status - INFO - Generating partition: #12, clusters=[10]
INFO:status:Generating partition: #12, clusters=[10]
2015-04-22 09:26:30,910 - status - INFO - Saved partition: test/syn_cem_partition_12.csv
2015-04-22 09:26:30,910 - status - INFO - Saved partition: test/syn_cem_partition_12.csv
INFO:status:Saved partition: test/syn_cem_partition_12.csv
2015-04-22 09:26:30,914 - status - INFO - Generating partition: #13, clusters=[24]
2015-04-22 09:26:30,914 - status - INFO - Generating partition: #13, clusters=[24]
INFO:status:Generating partition: #13, clusters=[24]
2015-04-22 09:26:30,935 - status - INFO - Saved partition: test/syn_cem_partition_13.csv
2015-04-22 09:26:30,935 - status - INFO - Saved partition: test/syn_cem_partition_13.csv
INFO:status:Saved partition: test/syn_cem_partition_13.csv
2015-04-22 09:26:30,938 - status - INFO - Generating partition: #14, clusters=[11]
2015-04-22 09:26:30,938 - status - INFO - Generating partition: #14, clusters=[11]
INFO:status:Generating partition: #14, clusters=[11]
2015-04-22 09:26:30,956 - status - INFO - Saved partition: test/syn_cem_partition_14.csv
2015-04-22 09:26:30,956 - status - INFO - Saved partition: test/syn_cem_partition_14.csv
INFO:status:Saved partition: test/syn_cem_partition_14.csv
2015-04-22 09:26:30,959 - status - INFO - Generating partition: #15, clusters=[16]
2015-04-22 09:26:30,959 - status - INFO - Generating partition: #15, clusters=[16]
INFO:status:Generating partition: #15, clusters=[16]
2015-04-22 09:26:30,978 - status - INFO - Saved partition: test/syn_cem_partition_15.csv
2015-04-22 09:26:30,978 - status - INFO - Saved partition: test/syn_cem_partition_15.csv
INFO:status:Saved partition: test/syn_cem_partition_15.csv
2015-04-22 09:26:30,981 - status - INFO - Generating partition: #16, clusters=[23]
2015-04-22 09:26:30,981 - status - INFO - Generating partition: #16, clusters=[23]
INFO:status:Generating partition: #16, clusters=[23]
2015-04-22 09:26:31,001 - status - INFO - Saved partition: test/syn_cem_partition_16.csv
2015-04-22 09:26:31,001 - status - INFO - Saved partition: test/syn_cem_partition_16.csv
INFO:status:Saved partition: test/syn_cem_partition_16.csv
2015-04-22 09:26:31,004 - status - INFO - Generating partition: #17, clusters=[21]
2015-04-22 09:26:31,004 - status - INFO - Generating partition: #17, clusters=[21]
INFO:status:Generating partition: #17, clusters=[21]
2015-04-22 09:26:31,023 - status - INFO - Saved partition: test/syn_cem_partition_17.csv
2015-04-22 09:26:31,023 - status - INFO - Saved partition: test/syn_cem_partition_17.csv
INFO:status:Saved partition: test/syn_cem_partition_17.csv
2015-04-22 09:26:31,026 - status - INFO - Generating partition: #18, clusters=[9]
2015-04-22 09:26:31,026 - status - INFO - Generating partition: #18, clusters=[9]
INFO:status:Generating partition: #18, clusters=[9]
2015-04-22 09:26:31,043 - status - INFO - Saved partition: test/syn_cem_partition_18.csv
2015-04-22 09:26:31,043 - status - INFO - Saved partition: test/syn_cem_partition_18.csv
INFO:status:Saved partition: test/syn_cem_partition_18.csv
2015-04-22 09:26:31,047 - status - INFO - Generating partition: #19, clusters=[5]
2015-04-22 09:26:31,047 - status - INFO - Generating partition: #19, clusters=[5]
INFO:status:Generating partition: #19, clusters=[5]
2015-04-22 09:26:31,063 - status - INFO - Saved partition: test/syn_cem_partition_19.csv
2015-04-22 09:26:31,063 - status - INFO - Saved partition: test/syn_cem_partition_19.csv
INFO:status:Saved partition: test/syn_cem_partition_19.csv

Get test partitions


In [55]:
ls test


syn_cem.log                syn_data_partition_1.csv   syn_data_partition_34.csv
syn_cem_data.csv           syn_data_partition_10.csv  syn_data_partition_35.csv
syn_cem_ground_truth.csv   syn_data_partition_11.csv  syn_data_partition_36.csv
syn_cem_partition_0.csv    syn_data_partition_12.csv  syn_data_partition_37.csv
syn_cem_partition_1.csv    syn_data_partition_13.csv  syn_data_partition_38.csv
syn_cem_partition_10.csv   syn_data_partition_14.csv  syn_data_partition_39.csv
syn_cem_partition_11.csv   syn_data_partition_15.csv  syn_data_partition_4.csv
syn_cem_partition_12.csv   syn_data_partition_16.csv  syn_data_partition_40.csv
syn_cem_partition_13.csv   syn_data_partition_17.csv  syn_data_partition_41.csv
syn_cem_partition_14.csv   syn_data_partition_18.csv  syn_data_partition_42.csv
syn_cem_partition_15.csv   syn_data_partition_19.csv  syn_data_partition_43.csv
syn_cem_partition_16.csv   syn_data_partition_2.csv   syn_data_partition_44.csv
syn_cem_partition_17.csv   syn_data_partition_20.csv  syn_data_partition_45.csv
syn_cem_partition_18.csv   syn_data_partition_21.csv  syn_data_partition_46.csv
syn_cem_partition_19.csv   syn_data_partition_22.csv  syn_data_partition_47.csv
syn_cem_partition_2.csv    syn_data_partition_23.csv  syn_data_partition_48.csv
syn_cem_partition_3.csv    syn_data_partition_24.csv  syn_data_partition_49.csv
syn_cem_partition_4.csv    syn_data_partition_25.csv  syn_data_partition_5.csv
syn_cem_partition_5.csv    syn_data_partition_26.csv  syn_data_partition_6.csv
syn_cem_partition_6.csv    syn_data_partition_27.csv  syn_data_partition_7.csv
syn_cem_partition_7.csv    syn_data_partition_28.csv  syn_data_partition_8.csv
syn_cem_partition_8.csv    syn_data_partition_29.csv  syn_data_partition_9.csv
syn_cem_partition_9.csv    syn_data_partition_3.csv   syn_mil.log
syn_data.log               syn_data_partition_30.csv  syn_mil_data.csv
syn_data_data.csv          syn_data_partition_31.csv  syn_mil_ground_truth.csv
syn_data_ground_truth.csv  syn_data_partition_32.csv
syn_data_partition_0.csv   syn_data_partition_33.csv

In [88]:
files=!ls $home/QCThesis/EAC/test
folder= home + "/QCThesis/EAC/test/"
for i,f in enumerate(files):
    files[i] = folder+f

partition_files = [f for f in files if "_partition_" in f and "syn_cem" in f]

In [89]:
partition_files


Out[89]:
['/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_0.csv',
 '/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_1.csv',
 '/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_10.csv',
 '/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_11.csv',
 '/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_12.csv',
 '/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_13.csv',
 '/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_14.csv',
 '/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_15.csv',
 '/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_16.csv',
 '/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_17.csv',
 '/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_18.csv',
 '/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_19.csv',
 '/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_2.csv',
 '/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_3.csv',
 '/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_4.csv',
 '/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_5.csv',
 '/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_6.csv',
 '/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_7.csv',
 '/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_8.csv',
 '/home/diogoaos/QCThesis/EAC/test/syn_cem_partition_9.csv']

Cluster


In [60]:
pwd


Out[60]:
u'/home/diogoaos/QCThesis/EAC'

In [83]:
import eac

In [84]:
reload(eac)


Out[84]:
<module 'eac' from 'eac.pyc'>

In [90]:
#%debug
reload(eac)
estimator=eac.EAC(100)
estimator.fit(partition_files,files=True,assoc_mode='full', prot_mode='random', nprot=None)

In [91]:
np.where(estimator._coassoc==20)[0].size


Out[91]:
140

In [92]:
print estimator._coassoc.shape
print estimator._coassoc
print "Diagonal"
print estimator._coassoc.diagonal()
print "Symmetric:\t",(estimator._coassoc.T == estimator._coassoc).all()


(100, 100)
[[ 20.  14.  14. ...,  10.  11.   7.]
 [ 14.  20.  15. ...,  10.  12.   7.]
 [ 14.  15.  20. ...,  11.  14.   9.]
 ..., 
 [ 10.  10.  11. ...,  20.   8.   5.]
 [ 11.  12.  14. ...,   8.  20.  15.]
 [  7.   7.   9. ...,   5.  15.  20.]]
Diagonal
[ 20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.
  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.
  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.
  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.
  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.
  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.
  20.  20.  20.  20.  20.  20.  20.  20.  20.  20.]
Symmetric:	True

Linkage


In [20]:
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def _apply_linkage(assoc_mat,method='single'):
    """
    SciPy linkage wants a distance array of format pdist. SciPy squareform 
    converts between the two formats.

    assoc_mat 	: pair-wise similarity association matrix
    method 		: linkage method to use; can be 'single'(default), 'complete',
                  'average', 'weighted', 'centroid', 'median', 'ward'
    """

    condensed_assoc = squareform(assoc_mat)

    # convert pair-wise similarity array (assoc_mat->condensed_assoc) to dissimilarity
    condensed_diss_assoc = condensed_assoc.max() - condensed_assoc

    Z = linkage(condensed_diss_assoc,method=method)

    return Z

In [56]:
assoc=estimator._coassoc
coassoc = assoc / 20
coassoc = coassoc.max() - coassoc

In [57]:
coassoc


Out[57]:
array([[ 0.  ,  0.3 ,  0.3 , ...,  0.5 ,  0.45,  0.65],
       [ 0.3 ,  0.  ,  0.25, ...,  0.5 ,  0.4 ,  0.65],
       [ 0.3 ,  0.25,  0.  , ...,  0.45,  0.3 ,  0.55],
       ..., 
       [ 0.5 ,  0.5 ,  0.45, ...,  0.  ,  0.6 ,  0.75],
       [ 0.45,  0.4 ,  0.3 , ...,  0.6 ,  0.  ,  0.25],
       [ 0.65,  0.65,  0.55, ...,  0.75,  0.25,  0.  ]])

In [58]:
Z=_apply_linkage(coassoc)

In [48]:
import scipy.cluster.hierarchy as hie

In [64]:
for l in Z:
    print l
    
X = [c for c in Z if c[0] >=100 and c[1] >= 100]
print "argmax\t",np.argmax(Z[:,2])
print Z[np.argmax(Z[:,2])]


[  0.  98.   0.   2.]
[  34.  100.    0.    3.]
[  38.  101.    0.    4.]
[  41.  102.    0.    5.]
[  42.  103.    0.    6.]
[  45.  104.    0.    7.]
[  91.  105.    0.    8.]
[  47.  106.    0.    9.]
[  51.  107.    0.   10.]
[  53.  108.    0.   11.]
[  55.  109.    0.   12.]
[  56.  110.    0.   13.]
[  61.  111.    0.   14.]
[  63.  112.    0.   15.]
[  64.  113.    0.   16.]
[  66.  114.    0.   17.]
[  67.  115.    0.   18.]
[  70.  116.    0.   19.]
[  71.  117.    0.   20.]
[  75.  118.    0.   21.]
[  78.  119.    0.   22.]
[  79.  120.    0.   23.]
[  81.  121.    0.   24.]
[  82.  122.    0.   25.]
[  84.  123.    0.   26.]
[  85.  124.    0.   27.]
[  88.  125.    0.   28.]
[  89.  126.    0.   29.]
[  33.  127.    0.   30.]
[  32.  128.    0.   31.]
[  46.  129.    0.   32.]
[  16.  130.    0.   33.]
[   1.  131.    0.   34.]
[   2.  132.    0.   35.]
[   6.  133.    0.   36.]
[   9.  134.    0.   37.]
[  14.  135.    0.   38.]
[  18.  136.    0.   39.]
[  20.  137.    0.   40.]
[  21.  138.    0.   41.]
[  22.  139.    0.   42.]
[  23.  140.    0.   43.]
[  92.  141.    0.   44.]
[  24.  142.    0.   45.]
[  97.  143.    0.   46.]
[  28.  144.    0.   47.]
[  27.  145.    0.   48.]
[  26.  146.    0.   49.]
[  4.30000000e+01   9.90000000e+01   5.00000000e-02   2.00000000e+00]
[  6.90000000e+01   1.48000000e+02   5.00000000e-02   3.00000000e+00]
[  1.10000000e+01   1.49000000e+02   5.00000000e-02   4.00000000e+00]
[  7.20000000e+01   1.50000000e+02   5.00000000e-02   5.00000000e+00]
[  7.30000000e+01   1.51000000e+02   5.00000000e-02   6.00000000e+00]
[  7.40000000e+01   1.52000000e+02   5.00000000e-02   7.00000000e+00]
[  9.30000000e+01   1.53000000e+02   5.00000000e-02   8.00000000e+00]
[  7.00000000e+00   1.54000000e+02   5.00000000e-02   9.00000000e+00]
[  1.30000000e+01   1.55000000e+02   5.00000000e-02   1.00000000e+01]
[  5.00000000e+00   1.56000000e+02   5.00000000e-02   1.10000000e+01]
[  8.30000000e+01   1.57000000e+02   5.00000000e-02   1.20000000e+01]
[  8.60000000e+01   1.58000000e+02   5.00000000e-02   1.30000000e+01]
[  9.00000000e+01   1.59000000e+02   5.00000000e-02   1.40000000e+01]
[  3.50000000e+01   1.60000000e+02   5.00000000e-02   1.50000000e+01]
[  6.50000000e+01   1.61000000e+02   5.00000000e-02   1.60000000e+01]
[  3.10000000e+01   1.62000000e+02   5.00000000e-02   1.70000000e+01]
[  6.20000000e+01   1.63000000e+02   5.00000000e-02   1.80000000e+01]
[  5.80000000e+01   1.64000000e+02   5.00000000e-02   1.90000000e+01]
[  3.70000000e+01   1.65000000e+02   5.00000000e-02   2.00000000e+01]
[  4.80000000e+01   1.66000000e+02   5.00000000e-02   2.10000000e+01]
[  1.47000000e+02   1.67000000e+02   5.00000000e-02   7.00000000e+01]
[  5.00000000e+01   1.68000000e+02   5.00000000e-02   7.10000000e+01]
[  3.60000000e+01   1.69000000e+02   1.00000000e-01   7.20000000e+01]
[  4.40000000e+01   1.70000000e+02   1.00000000e-01   7.30000000e+01]
[  2.90000000e+01   1.71000000e+02   1.00000000e-01   7.40000000e+01]
[  4.00000000e+01   1.72000000e+02   1.00000000e-01   7.50000000e+01]
[  8.70000000e+01   1.73000000e+02   1.00000000e-01   7.60000000e+01]
[  3.90000000e+01   1.74000000e+02   1.00000000e-01   7.70000000e+01]
[  3.00000000e+00   1.75000000e+02   1.00000000e-01   7.80000000e+01]
[  4.00000000e+00   1.76000000e+02   1.00000000e-01   7.90000000e+01]
[  5.20000000e+01   1.77000000e+02   1.00000000e-01   8.00000000e+01]
[  8.00000000e+00   1.78000000e+02   1.00000000e-01   8.10000000e+01]
[  1.50000000e+01   1.79000000e+02   1.00000000e-01   8.20000000e+01]
[  9.40000000e+01   1.80000000e+02   1.00000000e-01   8.30000000e+01]
[  7.60000000e+01   1.81000000e+02   1.00000000e-01   8.40000000e+01]
[  5.40000000e+01   1.82000000e+02   1.00000000e-01   8.50000000e+01]
[  1.70000000e+01   1.83000000e+02   1.00000000e-01   8.60000000e+01]
[  9.50000000e+01   1.84000000e+02   1.00000000e-01   8.70000000e+01]
[  1.00000000e+01   1.85000000e+02   1.00000000e-01   8.80000000e+01]
[  5.90000000e+01   1.86000000e+02   1.00000000e-01   8.90000000e+01]
[  6.80000000e+01   1.87000000e+02   1.00000000e-01   9.00000000e+01]
[  6.00000000e+01   1.88000000e+02   1.00000000e-01   9.10000000e+01]
[  9.60000000e+01   1.89000000e+02   1.00000000e-01   9.20000000e+01]
[  1.90000000e+01   1.90000000e+02   1.00000000e-01   9.30000000e+01]
[  1.20000000e+01   1.91000000e+02   1.00000000e-01   9.40000000e+01]
[  4.90000000e+01   1.92000000e+02   1.50000000e-01   9.50000000e+01]
[  7.70000000e+01   1.93000000e+02   1.50000000e-01   9.60000000e+01]
[  5.70000000e+01   1.94000000e+02   1.50000000e-01   9.70000000e+01]
[  3.00000000e+01   1.95000000e+02   1.50000000e-01   9.80000000e+01]
[  2.50000000e+01   1.96000000e+02   1.50000000e-01   9.90000000e+01]
[  8.00000000e+01   1.97000000e+02   1.50000000e-01   1.00000000e+02]
argmax	93
[  4.90000000e+01   1.92000000e+02   1.50000000e-01   9.50000000e+01]

In [125]:
plt.figure(figsize=(16,12))
r=hie.dendrogram(Z,leaf_rotation=90,show_contracted=True)



In [131]:
len(r['color_list'])


Out[131]:
99

In [123]:
dif=Z[1:,2]-Z[0:-1,2]
#[maximo,indice]=max(dif);

In [124]:
print "max lifetime: ", dif.max()
print "index: ", dif.argmax()


max lifetime:  0.05
index:  47

K random prototypes


In [100]:
#%debug
reload(eac)
estimator=eac.EAC(100)
estimator.fit(partition_files,files=True,assoc_mode='prot', prot_mode='random', nprot=5)

In [101]:
estimator._coassoc.sum()


Out[101]:
4274.0

K Centroid-based prototypes


In [21]:
#%%debug -b eac.py:190
reload(eac)
estimator=eac.EAC(100,data=data)
estimator.fit(partition_files,files=True,assoc_mode='prot', prot_mode='other', nprot=5)


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-21-799d1632dc67> in <module>()
      2 reload(eac)
      3 estimator=eac.EAC(100,data=data)
----> 4 estimator.fit(partition_files,files=True,assoc_mode='prot', prot_mode='other', nprot=5)

/home/chiroptera/workspace/QCThesis/EAC/eac.py in fit(self, ensemble, files, assoc_mode, prot_mode, nprot, link)
     64 
     65 
---> 66                         self._build_prototypes(nprot=nprot,mode=prot_mode, data=self.data)
     67 
     68                 # received names of partition files

/home/chiroptera/workspace/QCThesis/EAC/eac.py in _build_prototypes(self, nprot, mode, data)
    146                         if data is None:
    147                                 raise Exception("Data needs to be set for this method of choosing prototypes.")
--> 148                         self.k_labels = self._build_k_prototypes(nprot,data)
    149 
    150                 else:

/home/chiroptera/workspace/QCThesis/EAC/eac.py in _build_k_prototypes(self, nprot, data)
    176             grouper = K_Means()
    177             grouper._centroid_mode = "index"
--> 178             grouper.fit(data, nprot, iters=300, mode="cuda", cuda_mem='manual',tol=1e-4,max_iters=300)
    179             centroids = grouper.centroids
    180 

/home/chiroptera/workspace/QCThesis/EAC/K_Means3.pyc in fit(self, data, K, iters, mode, cuda_mem, tol, max_iters)
     51     def fit(self, data, K, iters=3, mode="cuda", cuda_mem='manual',tol=1e-4,max_iters=300):
     52 
---> 53         N,D = data.shape
     54 
     55         self.N = N

AttributeError: 'module' object has no attribute 'shape'

In [112]:
b=np.zeros(a.shape[0])
for c in xrange(a.shape[0]):
    dist = data - a[c]
    dist = dist **2
    dist = dist.sum(axis=1)
    b[c] = np.argmin(dist)
b


Out[112]:
array([ 62.,   5.,  32.])

In [108]:
a=array([[-778.1035957 ,  728.38305131],
       [ 474.98214377,  654.43652209],
       [ -62.22709694,  637.21319263]])

In [115]:
estimator.k_labels


Out[115]:
array([32,  0], dtype=int32)

In [88]:
estimator._coassoc.sum()


Out[88]:
3730.0

K-Nearest Neighbours

This is how it is implemented inside the EAC class.

Load data


In [7]:
datafile=None
for f in files:
    if 'syn_cem' in f and 'data' in f:
        datafile=f
        
print datafile


/home/chiroptera/workspace/QCThesis/EAC/test/syn_cem_data.csv

In [8]:
data = np.genfromtxt(datafile,delimiter=',')

Selecting the neighbours


In [251]:
from sklearn.neighbors import NearestNeighbors

In [252]:
K_neigh=6
# Minkowski distance is a generalization of Euclidean distance and is equivelent to it for p=2
neigh = NearestNeighbors(n_neighbors=K_neigh, radius=1.0, algorithm='auto', leaf_size=30, metric='minkowski', p=2)
neigh.fit(data)


Out[252]:
NearestNeighbors(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_neighbors=6, p=2, radius=1.0)

In [253]:
data_neighbours=neigh.kneighbors(X=data,return_distance=False)

# the closest neighbour is the point itself - no interest in that
data_neighbours = data_neighbours[:,1:]

In [254]:
data_neighbours


Out[254]:
array([[58, 98, 19, 16,  9],
       [61, 31, 35, 11,  6],
       [68, 65, 50, 28, 81],
       [ 4, 77,  8,  7,  5],
       [ 7,  5,  8,  3, 53],
       [ 7, 53,  4,  8, 30],
       [35, 11, 61, 80, 97],
       [ 5, 53,  4,  8, 63],
       [14,  4,  7,  5, 90],
       [41, 71, 72, 16, 95],
       [91, 36, 42, 39, 22],
       [80,  6, 61,  1, 35],
       [49, 31, 29, 64,  1],
       [73, 66, 59, 81, 24],
       [90, 54, 60,  8, 85],
       [47, 66, 33, 13, 83],
       [92,  9, 72, 41, 71],
       [23, 55, 78, 94, 38],
       [57, 93, 52, 26, 78],
       [74, 58,  0, 98, 44],
       [80, 11,  6,  1, 97],
       [69, 27, 86, 25, 46],
       [39, 96, 87, 75, 79],
       [17, 94, 38, 78, 51],
       [99, 83, 33, 66, 84],
       [86, 75, 87, 79, 69],
       [52, 57, 18, 93, 78],
       [46, 21, 86, 69, 25],
       [50, 68, 65, 81, 59],
       [82, 35, 61, 31, 49],
       [85, 53,  5,  7, 60],
       [ 1, 61, 49, 35,  6],
       [40, 44, 95, 71, 41],
       [83, 24, 99, 66, 84],
       [88, 89, 48, 38, 94],
       [61,  6,  1, 11, 31],
       [42, 10, 25, 86, 91],
       [64, 29, 82, 12, 49],
       [51, 89, 23, 34, 94],
       [22, 96, 87, 75, 79],
       [32, 44, 95, 71, 41],
       [71, 95,  9, 72, 44],
       [36, 46, 86, 10, 25],
       [60, 76, 85, 54, 30],
       [41, 95, 71, 40, 98],
       [67,  0, 16, 92,  9],
       [27, 42, 21, 86, 25],
       [15, 13, 66, 73, 59],
       [88, 34, 89, 93, 38],
       [12, 31, 29,  1, 61],
       [28, 81, 68, 65, 59],
       [38, 89, 23, 34, 94],
       [57, 26, 93, 18, 78],
       [ 5,  7, 30, 85,  4],
       [60, 90, 76, 14, 85],
       [78, 17, 94, 23, 52],
       [77, 90, 14,  8, 54],
       [52, 18, 93, 26, 78],
       [98,  0, 44, 41, 19],
       [81, 73, 50, 28, 13],
       [54, 85, 14, 76, 90],
       [35,  6,  1, 11, 31],
       [69, 79, 87, 75, 25],
       [ 7,  4,  5, 53,  3],
       [37, 12, 49, 29, 82],
       [68, 28, 50,  2, 84],
       [24, 73, 99, 33, 83],
       [45, 16, 92,  0,  9],
       [65, 50, 28,  2, 81],
       [21, 62, 86, 25, 79],
       [49, 12, 31,  1, 61],
       [95, 41, 72,  9, 44],
       [71, 95,  9, 41, 92],
       [59, 81, 66, 13, 50],
       [19, 58, 98,  0, 44],
       [87, 25, 79, 86, 39],
       [54, 60, 43, 90, 14],
       [56,  8,  3,  4, 14],
       [94, 55, 23, 52, 17],
       [87, 75, 62, 25, 22],
       [11,  6,  1, 61, 35],
       [59, 50, 73, 28, 68],
       [29, 35, 61,  6, 97],
       [33, 24, 99, 66, 84],
       [99, 28, 24, 65, 68],
       [30, 60, 14, 53, 90],
       [25, 75, 69, 87, 21],
       [79, 75, 25, 39, 22],
       [48, 34, 89, 38, 93],
       [38, 51, 34, 94, 88],
       [14, 54, 60,  8, 85],
       [10, 36, 42, 39, 22],
       [16,  9, 72, 71, 41],
       [57, 18, 52, 34, 94],
       [78, 23, 55, 34, 89],
       [71, 41, 72,  9, 44],
       [22, 39, 87, 79, 75],
       [ 6, 11, 35, 61, 80],
       [58, 44, 41,  0, 71],
       [24, 83, 84, 33, 66]])

EAC


In [11]:
#%debug
reload(eac)
estimator=eac.EAC(100,data=data)
estimator.fit(partition_files,files=True,assoc_mode='prot', prot_mode='knn', nprot=5)

In [12]:
estimator._coassoc.sum()


Out[12]:
3601.0

sparse matrix linkage


In [197]:
diss_mat = estimator._coassoc
diss_labels = estimator.k_neighbours

In [13]:
from scipy.sparse import *

In [214]:
smat = csr_matrix((100,100))
mat = np.zeros((100,100))

In [212]:
%timeit smat[0,0]=50
%timeit mat[0,0]=50


The slowest run took 6.74 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 179 µs per loop
The slowest run took 16.41 times longer than the fastest. This could mean that an intermediate result is being cached 
10000000 loops, best of 3: 174 ns per loop

In [208]:
rows=diss_labels[0]
cols=diss_labels[1]

In [216]:
rows


Out[216]:
array([58, 98, 19, 16,  9])

In [223]:
%timeit mat[rows[:,np.newaxis],cols]=1
%timeit smat[rows[:,np.newaxis],cols]=1


The slowest run took 66.20 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 5.8 µs per loop
1000 loops, best of 3: 237 µs per loop

In [227]:
smat*2


Out[227]:
<100x100 sparse matrix of type '<type 'numpy.float64'>'
	with 25 stored elements in Compressed Sparse Row format>

Testing updating sparse coassoc


In [142]:
def _update_coassoc_knn(assoc_mat,clusters,k_neighbours):
    """
    Updates an NxK co-association matrix.
    k_neighbours is an NxK array where the k-th element of the i-th row is the index of a data point 
    that corresponds to the k-th nearest neighbour of the i-th data point. That neighbour is the k-th
    prototype of the i-th data point.
    """
    nclusters = len(clusters)
    for i in xrange(nclusters):

        if clusters[i].shape > 1:

            # all data points in cluster - rows to select
            n_in_cluster = clusters[i]

            # update row j of matrix
            for j in n_in_cluster:
                # all prototypes in cluster - columns to select
                k_in_cluster = np.where(np.in1d(k_neighbours[j]	,n_in_cluster))

                # this indexing selects the rows and columns specified by n_in_cluster and k_in_cluster
                assoc_mat[j,k_in_cluster] += 1 # np.newaxis is alias for None
        pass

def _update_coassoc_knn_sparse(assoc_mat,clusters,k_neighbours):
    """
    Updates an NxK co-association matrix.
    k_neighbours is an NxK array where the k-th element of the i-th row is the index of a data point 
    that corresponds to the k-th nearest neighbour of the i-th data point. That neighbour is the k-th
    prototype of the i-th data point.
    """
    nclusters = len(clusters)
    for i in xrange(nclusters):

        if clusters[i].shape > 1:

            # all data points in cluster - rows to select
            n_in_cluster = clusters[i]

            # update row j of matrix
            for j in n_in_cluster:
                # all prototypes in cluster - columns to select
                # column indices corresponding to the K-prototypes
                #k_in_cluster = np.where(np.in1d(n_in_cluster,k_neighbours[j]))[0]
                
                k_in_cluster = n_in_cluster[np.in1d(n_in_cluster,k_neighbours[j])]
                
                # this indexing selects the rows and columns specified by n_in_cluster and k_in_cluster
                #assoc_mat[j,k_in_cluster] += 1 # np.newaxis is alias for None
                assoc_mat[j,k_in_cluster] += np.ones_like(k_in_cluster)
        pass

In [10]:
def mat_match(sparse,normal,neigh):

    for row in xrange(sparse.shape[0]):
        cols_in_sparse = neigh[row][normal[row].astype(bool)]
        row_in_sparse = sparse[row,cols_in_sparse]
        row_in_normal = normal[row][normal[row].astype(bool)]
        if (row_in_sparse != row_in_normal).any():
            return False
    return True

In [111]:
def zero_low_tri(sparse):
    r,c = sparse.shape
    for i in xrange(r):
        for j in xrange(c):
            if j >= c:
                continue
            if sparse[i,j] != 0:
                return False
    return True

In [202]:
def func1(smat):
    """
    check that, where the matrix is not symetric, at least one of the values is 0
    """
    a=smat.todense()
    row,col = np.where(a!=a.T)
    row=np.array(row).flatten()
    col=np.array(col).flatten()
    for i in xrange(row.size):
        if a[row[i],col[i]] != 0 and a[col[i],row[i]] != 0:
            return False
    return True

In [227]:
def func2(smat):
    """
    true if lower triangle is zero where matrix is not symetric
    """
    a=smat.todense()
    row,col = np.where(a!=a.T)
    row=np.array(row).flatten()
    col=np.array(col).flatten()
    for i in xrange(row.size):
        if row[i] > col[i] and a[row[i],col[i]] != 0:
            return False
    return True

In [226]:
def func3(smat):
    """
    true if upper triangle is zero where matrix is not symetric
    """
    a=smat.todense()
    row,col = np.where(a!=a.T)
    row=np.array(row).flatten()
    col=np.array(col).flatten()
    for i in xrange(row.size):
        if row[i] < col[i] and a[row[i],col[i]] != 0:
            return False
    return True

In [251]:
def func3(smat):
    """
    make dists only on upper triangle
    """
    a=smat.todense()
    rows,cols = smat.nonzero()
    for i in xrange(rows.size):
        if rows[i]>cols[i] and smat[rows[i],cols[i]] != 0:
            smat[cols[i],rows[i]] = smat[rows[i],cols[i]]
        smat[rows[i],cols[i]] = 0      
        
    pass

In [259]:
def func4(smat):
    """
    make dists symmetric
    """
    a=smat.todense()
    rows,cols = smat.nonzero()
    for i in xrange(rows.size):
        if rows[i]<cols[i] and smat[rows[i],cols[i]] != 0:
            smat[cols[i],rows[i]] = smat[rows[i],cols[i]]    
        
    pass

In [157]:
clusters_a = estimator._readPartition(partition_files[0])
neigh_labels = estimator.k_neighbours
#sparse_coassoc = csc_matrix((100,100))
#sparse_coassoc = csr_matrix((100,100))
sparse_coassoc = lil_matrix((100,100))
#sparse_coassoc = coo_matrix((100,100)) # doesn't support slicing
coassoc = np.zeros((100,neigh_labels.shape[1]))

In [158]:
for f in partition_files:
    clusters_a = estimator._readPartition(f)
    _update_coassoc_knn(coassoc,clusters_a,neigh_labels)
    _update_coassoc_knn_sparse(sparse_coassoc,clusters_a,neigh_labels)
mat_match(sparse_coassoc,coassoc,neigh_labels)


Out[158]:
True

In [29]:
%timeit _update_coassoc_knn(coassoc,clusters_a,neigh_labels)
%timeit _update_coassoc_knn_sparse(sparse_coassoc,clusters_a,neigh_labels)


100 loops, best of 3: 10.5 ms per loop
10 loops, best of 3: 86.8 ms per loop

In [30]:
print "csr/np = {}".format(134/10.4)
print "csc/np = {}".format(131/10.4)
print "lil/np = {}".format(86.8/10.4)


csr/np = 12.8846153846
csc/np = 12.5961538462
lil/np = 8.34615384615

In [31]:
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def _apply_linkage(assoc_mat,method='single'):
    """
    SciPy linkage wants a distance array of format pdist. SciPy squareform 
    converts between the two formats.

    assoc_mat 	: pair-wise similarity association matrix
    method 		: linkage method to use; can be 'single'(default), 'complete',
                  'average', 'weighted', 'centroid', 'median', 'ward'
    """

    condensed_assoc = squareform(assoc_mat)

    # convert pair-wise similarity array (assoc_mat->condensed_assoc) to dissimilarity
    condensed_diss_assoc = condensed_assoc.max() - condensed_assoc

    Z = linkage(condensed_diss_assoc,method=method)

    return Z

In [87]:
Y = sparse_coassoc.tocsr()
Y[i,j]=Y.max()-Y[i,j]

In [88]:



Out[88]:
<100x100 sparse matrix of type '<type 'numpy.float64'>'
	with 167 stored elements in Compressed Sparse Row format>

Linkage experiment

I want to evaluate if complete link is equivalent to single link when the dissimilarity measure is inverted.


In [115]:
import scipy

In [119]:
points=np.array([[5,5],[5,6],[1,1],[1,2],[-2,-2]])
dist=scipy.spatial.distance.pdist(points)
invDist=dist.max()-dist

In [125]:
Z_d=linkage(dist,method="single")
Z_id=linkage(invDist,method="complete")

In [126]:
p=scipy.cluster.hierarchy.dendrogram(Z_d)



In [127]:
p=scipy.cluster.hierarchy.dendrogram(Z_id)


They're not equivelent since the distance between clusters is always minimized indepedentely of how the metric chosen.


In [138]:
a=np.ones(5)

In [140]:
a[:,np.newaxis]


Out[140]:
array([[ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.]])