Intro

The enphasis of the last notebook was on the accuracy of Horn's algorithm. We analyzed the iris and crab datasets, and also a mixture of gaussians.

Now the method's performance will be analysed. Currently the gradient descent implementation (port of the original code provided by (Horn) is quite simple and the only stopping criteria is the number of iterations. Future work will involve implementing a more complete gradient descent, but for now we'll use the number of iterations to test it's performance.

Since we're only testing performance, accuracy is not of essence so we'll use only artificial data where it's easy to scale diminsionality and cardinality and see it's effect on the performance. Furthermore, raw data will be input to the algorithm instead of the PCA representation, Also, while the true algorithm is only the convergence part, we'll also time the assignment for each test case.

Tests

In [3]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib

In [55]:
import sklearn
from sklearn import preprocessing,decomposition,datasets
import seaborn as sns

In [5]:
%cd /home/chiroptera/workspace/QCThesis/Horn
import HornAlg

/home/chiroptera/workspace/QCThesis/Horn
Out[5]:
<module 'HornAlg' from 'HornAlg.pyc'>

In [6]:
%cd /home/chiroptera/workspace/QCThesis/cluster\ consistency
import determine_ci

/home/chiroptera/workspace/QCThesis/cluster consistency
Out[6]:
<module 'determine_ci' from 'determine_ci.pyc'>

In [8]:
# These are the "Tableau 20" colors as RGB.
tableau20 = [(31, 119, 180), (174, 199, 232), (255, 127, 14), (255, 187, 120),
(44, 160, 44), (152, 223, 138), (214, 39, 40), (255, 152, 150),
(148, 103, 189), (197, 176, 213), (140, 86, 75), (196, 156, 148),
(227, 119, 194), (247, 182, 210), (127, 127, 127), (199, 199, 199),
(188, 189, 34), (219, 219, 141), (23, 190, 207), (158, 218, 229)]
# Scale the RGB values to the [0, 1] range, which is the format matplotlib accepts.
for i in range(len(tableau20)):
r, g, b = tableau20[i]
tableau20[i] = (r / 255., g / 255., b / 255.)

We'll use a random sigma and fixed number of steps of 200 for all runs.

In [10]:
n_magnitude=[1,2,3,4] #10^n_magniture data_points
n_features=[2,3,4,5]
centers=4

steps=200
sigma=np.random.rand()
minD=2

In [54]:
timings_clust=np.zeros((len(n_magnitude),len(n_features)))
timings_assign=np.zeros((len(n_magnitude),len(n_features)))

for i,mag in enumerate(n_magnitude):
n_samples=10**mag
for j,dim in enumerate(n_features):
print "\nGenerating dataset with ",n_samples, " samples, ",dim, " features."
x_Gauss,x_assign=sklearn.datasets.make_blobs(n_samples=n_samples,n_features=dim,centers=centers)
print "Clustering..."
print "Took ",hello," seconds"
timings_clust[i,j]=hello

print "Assigning..."
assignment,hello=HornAlg.fineCluster(gaussD,dist,potential=V,timeit=True)
timings_assign[i,j]=hello
print "Took ",hello," seconds"

Generating dataset with  10  samples,  2  features.
Clustering...
Took  0.35382  seconds
Assigning...
Took  0.000159  seconds

Generating dataset with  10  samples,  3  features.
Clustering...
Took  0.411391  seconds
Assigning...
Took  0.000158  seconds

Generating dataset with  10  samples,  4  features.
Clustering...
Took  0.385114  seconds
Assigning...
Took  0.000156  seconds

Generating dataset with  10  samples,  5  features.
Clustering...
Took  0.429747  seconds
Assigning...
Took  0.000157  seconds

Generating dataset with  100  samples,  2  features.
Clustering...
Took  2.95465  seconds
Assigning...
Took  0.000519  seconds

Generating dataset with  100  samples,  3  features.
Clustering...
Took  3.322593  seconds
Assigning...
Took  0.000506  seconds

Generating dataset with  100  samples,  4  features.
Clustering...
Took  3.74372  seconds
Assigning...
Took  0.000521  seconds

Generating dataset with  100  samples,  5  features.
Clustering...
Took  4.143823  seconds
Assigning...
Took  0.000516  seconds

Generating dataset with  1000  samples,  2  features.
Clustering...
Took  52.840666  seconds
Assigning...
Took  0.007672  seconds

Generating dataset with  1000  samples,  3  features.
Clustering...
Took  60.293262  seconds
Assigning...
Took  0.008273  seconds

Generating dataset with  1000  samples,  4  features.
Clustering...
Took  68.225671  seconds
Assigning...
Took  0.00777  seconds

Generating dataset with  1000  samples,  5  features.
Clustering...
Took  81.523212  seconds
Assigning...
Took  0.00807  seconds

Generating dataset with  10000  samples,  2  features.
Clustering...
Took  3009.678259  seconds
Assigning...
Took  0.395657  seconds

Generating dataset with  10000  samples,  3  features.
Clustering...
Took  3418.34283  seconds
Assigning...
Took  0.400728  seconds

Generating dataset with  10000  samples,  4  features.
Clustering...
Took  3956.289064  seconds
Assigning...
Took  0.400611  seconds

Generating dataset with  10000  samples,  5  features.
Clustering...
Took  4918.185844  seconds
Assigning...
Took  0.426076  seconds

In [7]:
clust=array([[  3.53820000e-01,   4.11391000e-01,   3.85114000e-01,
4.29747000e-01],
[  2.95465000e+00,   3.32259300e+00,   3.74372000e+00,
4.14382300e+00],
[  5.28406660e+01,   6.02932620e+01,   6.82256710e+01,
8.15232120e+01],
[  3.00967826e+03,   3.41834283e+03,   3.95628906e+03,
4.91818584e+03]])
assign=array([[  1.59000000e-04,   1.58000000e-04,   1.56000000e-04,
1.57000000e-04],
[  5.19000000e-04,   5.06000000e-04,   5.21000000e-04,
5.16000000e-04],
[  7.67200000e-03,   8.27300000e-03,   7.77000000e-03,
8.07000000e-03],
[  3.95657000e-01,   4.00728000e-01,   4.00611000e-01,
4.26076000e-01]])

In [59]:
fig = pyplot.figure(figsize=(16,18))

for i in range(4):
ax.set_title(str(10**n_magnitude[i]) + ' samples')
ax.set_xlabel('Features')
ax.set_ylabel('Time [s]')

line, = ax.plot(n_features,clust[i], color=tableau20[i*2], lw=1)
ax.set_yscale('linear')

ax.set_title('Influence of cardinality')
ax.set_xlabel('10^ samples')
ax.set_ylabel('Time log[s]')
for i in range(4):
line, = ax.plot(n_magnitude,clust[:,i], color=tableau20[i*2], lw=1,label=str(n_features[i])+' features')
ax.set_yscale('linear')

ax.set_title('Influence of cardinality')
ax.set_xlabel('10^ samples')
ax.set_ylabel('Time log[s]')
for i in range(4):
line, = ax.plot(n_magnitude,clust[:,i], color=tableau20[i*2], lw=1,label=str(n_features[i])+' features')
ax.set_yscale('log')

In [53]:
feature_inc=np.zeros((4,3))
for i in range(1,4):
feature_inc[:,i-1]=clust[:,i]/clust[:,i-1]

sample_inc=np.zeros((3,4))
for i in range(1,4):
sample_inc[i-1,:]=clust[i,:]/clust[i-1,:]

In [56]:
import pandas as pd

In [85]:
feature_inc_pd=pd.DataFrame(feature_inc)
sample_inc_pd=pd.DataFrame(sample_inc)

0         1         2
0  1.162713  0.936126  1.115896
1  1.124530  1.126746  1.106873
2  1.141039  1.131564  1.194905
3  1.135783  1.157370  1.243131

0          1          2          3
0   8.350715   8.076484   9.721070   9.642471
1  17.883900  18.146448  18.224031  19.673430
2  56.957614  56.695271  57.988276  60.328656

In [92]:
import numba

return a+2

type(numba_kernel)

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-92-56d4bb95910e> in <module>()
----> 1 import numba
2
4     return a+2
5

/usr/local/lib/python2.7/dist-packages/numba/__init__.py in <module>()
4 from __future__ import print_function, division, absolute_import
5 import re
----> 6 from . import testing, decorators
7 from ._version import get_versions
8 from . import special, types, config

/usr/local/lib/python2.7/dist-packages/numba/testing.py in <module>()
2 import sys
3 import functools
----> 4 from numba import config
5
6

/usr/local/lib/python2.7/dist-packages/numba/config.py in <module>()
7 import warnings
8
----> 9 import llvmlite.binding as ll
10
11

ImportError: No module named llvmlite.binding