Clustering and outlier detection


Objectives

  • Test clustering methods on the features extracted from the graph for nodes and transactions
  • Test outlier detection methds on the features extracted from the graph for nodes and transactions
  • Detect if the clustering is splitting some publicy known groups of addresses (exchange, pool, smart contracts etc.)
  • Plot Clustering and outlier detection

0 - Data preparation

Importing librairies


In [150]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from time import time
from joblib import Parallel, delayed
import multiprocessing
import time
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.cluster import MiniBatchKMeans, KMeans
from sklearn.metrics.pairwise import pairwise_distances_argmin
from sklearn.datasets.samples_generator import make_blobs
from scipy.spatial.distance import cdist, pdist
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.covariance import EmpiricalCovariance, MinCovDet

loading different datasets :

  • Publicy known adresses
  • Features dataframe from the graph features generators

In [151]:
%%time
known = pd.read_csv('../data/known.csv')
rogues = pd.read_csv('../data/rogues.csv')
transactions = pd.read_csv('../data/edges.csv').drop('Unnamed: 0',1)
#Dropping features and fill na with 0
df = pd.read_csv('../data/features_full.csv').drop('Unnamed: 0',1).fillna(0)
df = df.set_index(['nodes'])
#build normalize values
data = scale(df.values)
n_sample = 10000


CPU times: user 42.6 s, sys: 20.4 s, total: 1min 3s
Wall time: 1min 12s

I - Clustering Nodes


Exploring clustering methods on the nodes featured dataset

A - k-means

First a very simple kmeans method


In [3]:
#Define estimator / by default clusters = 6 an init = 10
#kmeans = KMeans(init='k-means++', n_clusters=6, n_init=10)

In [4]:
#kmeans.fit(data)


Out[4]:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=6, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

1 - Parameters Optimization

a - Finding the best k

code from http://www.slideshare.net/SarahGuido/kmeans-clustering-with-scikitlearn#notes-panel)


In [160]:
#Quick PCA for k selection
X = PCA(n_components=2).fit_transform(data)

In [162]:
%%time
#Determine your k range
k_range = range(1,14)
# Fit the kmeans model for each n_clusters = k
k_means_var = [KMeans(n_clusters=k).fit(X) for k in k_range]
# Pull out the centroids for each model
centroids = [X.cluster_centers_ for X in k_means_var]


CPU times: user 2min 25s, sys: 35.2 s, total: 3min 1s
Wall time: 2min 46s

In [161]:
X


Out[161]:
array([[ -3.02596507e+01,  -2.24063114e+01],
       [ -1.53363648e-02,   1.80421834e-01],
       [ -2.50002320e-02,  -2.33593931e-01],
       ..., 
       [ -1.06877774e-02,   5.65299279e-02],
       [ -1.06721013e-02,   5.65302070e-02],
       [ -1.07159776e-02,   5.56550749e-02]])

In [156]:
%%time
# Caluculate the Euclidean distance from each pont to each centroid
k_euclid=[cdist(X, cent, 'euclidean') for cent in centroids]
dist = [np.min(ke,axis=1) for ke in k_euclid]

# Total within-cluster sum of squares
wcss = [sum(d**2) for d in dist]

# The total sum of squares
tss = sum(pdist(X)**2)/X.shape[0]

#The between-cluster sum of squares
bss = tss - wcss


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-156-fac1ab8808b7> in <module>()
----> 1 get_ipython().run_cell_magic(u'time', u'', u"# Caluculate the Euclidean distance from each pont to each centroid\nk_euclid=[cdist(X, cent, 'euclidean') for cent in centroids]\ndist = [np.min(ke,axis=1) for ke in k_euclid]\n\n# Total within-cluster sum of squares\nwcss = [sum(d**2) for d in dist]\n\n# The total sum of squares\ntss = sum(pdist(X)**2)/X.shape[0]\n\n#The between-cluster sum of squares\nbss = tss - wcss")

/usr/local/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in run_cell_magic(self, magic_name, line, cell)
   2291             magic_arg_s = self.var_expand(line, stack_depth)
   2292             with self.builtin_trap:
-> 2293                 result = fn(magic_arg_s, cell)
   2294             return result
   2295 

/usr/local/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)

/usr/local/lib/python2.7/site-packages/IPython/core/magic.pyc in <lambda>(f, *a, **k)
    191     # but it's overkill for just that one bit of state.
    192     def magic_deco(arg):
--> 193         call = lambda f, *a, **k: f(*a, **k)
    194 
    195         if callable(arg):

/usr/local/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)
   1165         else:
   1166             st = clock2()
-> 1167             exec(code, glob, local_ns)
   1168             end = clock2()
   1169             out = None

<timed exec> in <module>()

/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.pyc in cdist(XA, XB, metric, p, V, VI, w)
   2058 
   2059     # The C code doesn't do striding.
-> 2060     [XA] = _copy_arrays_if_base_present([_convert_to_double(XA)])
   2061     [XB] = _copy_arrays_if_base_present([_convert_to_double(XB)])
   2062 

/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.pyc in _convert_to_double(X)
    144 def _convert_to_double(X):
    145     if X.dtype != np.double:
--> 146         X = X.astype(np.double)
    147     if not X.flags.contiguous:
    148         X = X.copy()

TypeError: float() argument must be a string or a number

In [ ]:
%%time
plt.plot(k_range,bss/tss,'-bo')
plt.xlabel('number of cluster')
plt.ylabel('% of variance explained')
plt.title('Variance explained vs k')
plt.grid(True)
plt.show()
  • Difficult to find an elbow criteria
  • Other heuristic criteria k = sqrt(n/2)

b - Other heuristic method

$k=\sqrt{\frac{n}{2}}$


In [ ]:
np.sqrt(data.shape[0]/2)

-> Weird

c - Silhouette Metrics for supervised ?

2 - Visualize with PCA reduction

code from scikit learn


In [6]:
batch_size = 10
n_clusters = 6
#PCA
X = PCA(n_components=2).fit_transform(data)

##############################################################################
# Compute clustering with Means

k_means = KMeans(init='k-means++', n_clusters=6, n_init=10,random_state=2)
t0 = time.time()
k_means.fit(X)
t_batch = time.time() - t0
k_means_labels = k_means.labels_
k_means_cluster_centers = k_means.cluster_centers_
k_means_labels_unique = np.unique(k_means_labels)

##############################################################################
# Compute clustering with MiniBatchKMeans

mbk = MiniBatchKMeans(init='k-means++', n_clusters=6, batch_size=batch_size,
                      n_init=10, max_no_improvement=10, verbose=0,random_state=2)
t0 = time.time()
mbk.fit(X)
t_mini_batch = time.time() - t0
mbk_means_labels = mbk.labels_
mbk_means_cluster_centers = mbk.cluster_centers_
mbk_means_labels_unique = np.unique(mbk_means_labels)

In [7]:
##############################################################################
# Plot result

fig = plt.figure(figsize=(15, 5))
colors = ['#4EACC5', '#FF9C34', '#4E9A06','#FF0000','#800000','purple']
#fig.subplots_adjust(left=0.02, right=0.98, bottom=0.05, top=0.9)

# We want to have the same colors for the same cluster from the
# MiniBatchKMeans and the KMeans algorithm. Let's pair the cluster centers per
# closest one.

order = pairwise_distances_argmin(k_means_cluster_centers,
                                  mbk_means_cluster_centers)

# KMeans
ax = fig.add_subplot(1, 3, 1)
for k, col in zip(range(n_clusters), colors):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    ax.plot(X[my_members, 0], X[my_members, 1], 'w',
            markerfacecolor=col, marker='.',markersize=10)
    ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=6)
ax.set_title('KMeans')
ax.set_xticks(())
ax.set_yticks(())
#plt.text(10,10,  'train time: %.2fs\ninertia: %f' % (
    #t_batch, k_means.inertia_))

# Plot result


# MiniBatchKMeans
ax = fig.add_subplot(1, 3, 2)
for k, col in zip(range(n_clusters), colors):
    my_members = mbk_means_labels == order[k]
    cluster_center = mbk_means_cluster_centers[order[k]]
    ax.plot(X[my_members, 0], X[my_members, 1], 'w',
            markerfacecolor=col, marker='.', markersize=10)
    ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=6)
ax.set_title('MiniBatchKMeans')
ax.set_xticks(())
ax.set_yticks(())
#plt.text(-5, 10, 'train time: %.2fs\ninertia: %f' %
         #(t_mini_batch, mbk.inertia_))

# Plot result



# Initialise the different array to all False
different = (mbk_means_labels == 4)
ax = fig.add_subplot(1, 3, 3)

for l in range(n_clusters):
    different += ((k_means_labels == k) != (mbk_means_labels == order[k]))

identic = np.logical_not(different)
ax.plot(X[identic, 0], X[identic, 1], 'w',
        markerfacecolor='#bbbbbb', marker='.')
ax.plot(X[different, 0], X[different, 1], 'w',
        markerfacecolor='m', marker='.')
ax.set_title('Difference')
ax.set_xticks(())
ax.set_yticks(())

plt.show()



In [14]:
fig2 = plt.figure(figsize=(15, 10))
for k, col in zip(range(n_clusters), colors):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], 'w',
            markerfacecolor=col, marker='.',markersize=13)
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=8)
plt.title('KMeans')
plt.show()


B - Mini batch

II - Outlier Detection


Objectives :

  • Perform outlier detection on node data
  • Test different methods (with perf metrics)
  • Plot outlier detection
  • Tag transaction

Explain : Mahalanobis Distance


In [56]:
X = PCA(n_components=2).fit_transform(data)
# compare estimators learnt from the full data set with true parameters
emp_cov = EmpiricalCovariance().fit(X)
robust_cov = MinCovDet().fit(X)


###############################################################################
# Display results
fig = plt.figure(figsize=(15, 8))
plt.subplots_adjust(hspace=-.1, wspace=.4, top=.95, bottom=.05)

# Show data set
subfig1 = plt.subplot(1, 1, 1)
inlier_plot = subfig1.scatter(X[:, 0], X[:, 1],
                              color='black', label='points')
subfig1.set_xlim(subfig1.get_xlim()[0], 11.)
subfig1.set_title("Mahalanobis distances of a contaminated data set:")

# Show contours of the distance functions
xx, yy = np.meshgrid(np.linspace(plt.xlim()[0], plt.xlim()[1], 100),
                     np.linspace(plt.ylim()[0], plt.ylim()[1], 100))
zz = np.c_[xx.ravel(), yy.ravel()]

mahal_emp_cov = emp_cov.mahalanobis(zz)
mahal_emp_cov = mahal_emp_cov.reshape(xx.shape)
emp_cov_contour = subfig1.contour(xx, yy, np.sqrt(mahal_emp_cov),
                                  cmap=plt.cm.PuBu_r,
                                  linestyles='dashed')

mahal_robust_cov = robust_cov.mahalanobis(zz)
mahal_robust_cov = mahal_robust_cov.reshape(xx.shape)
robust_contour = subfig1.contour(xx, yy, np.sqrt(mahal_robust_cov),
                                 cmap=plt.cm.YlOrBr_r, linestyles='dotted')


plt.xticks(())
plt.yticks(())
plt.show()



III - Look at the clusters


In [16]:
k_means = KMeans(init='random', n_clusters=6, n_init=10, random_state=2)
clusters = k_means.fit_predict(data)

In [ ]:
df['clusters'] = clusters


Tagging the known and rogues nodes


In [ ]:
tagged = pd.merge(known,df,left_on='id',how='inner',right_index=True)
rogues_tag = pd.merge(rogues,df,left_on='id',how='inner',right_index=True)


Build the distribution


In [149]:
distrib = pd.DataFrame(df.groupby('clusters').count().apply(lambda x: 100*x/float(x.sum()))['total_degree'].values,columns=['Global'])
distrib['Known']=tagged.groupby('clusters').count().apply(lambda x: 100*x/float(x.sum()))['id']
distrib['Rogues']=rogues_tag.groupby('clusters').count().apply(lambda x: 100*x/float(x.sum()))['id']
distrib['Clusters']=distrib.index
distrib
distrib.get(['Global', 'Known','Rogues','Clusters']).groupby(['Clusters']).mean().plot(kind='bar',title='Cluster Distriubtion per Population');



Insight and Nature of clusters


In [138]:
#Several Insights, Nature of the clusters
df.groupby('clusters').mean()


Out[138]:
total_degree degree_in degree_out unique_successors unique_predecessors mean_value_in mean_value_out std_value_in std_value_out ratio_in_timestamp ... frequency_out balance mean_velocity_in mean_velocity_out std_velocity_in std_velocity_out mean_acceleration_in mean_acceleration_out min_path_to_rogue min_path_from_rogue
clusters
0 2.612900e+05 4.042695e+04 220863.052632 7042.631579 7918.052632 7.212737e+20 1.098110e+22 5.273081e+21 2.178159e+22 1.890468 ... 0.011008 3.381983e+24 8.516933e+16 2.825936e+20 4.798054e+18 3.794037e+20 1.967291e+14 6.388873e+18 2.315789 2.526316
1 2.138136e+01 2.118231e+01 0.199044 0.052243 2.076097 7.417071e+19 3.937509e+19 5.389531e+19 4.038152e+19 1.156968 ... 0.000140 4.895816e+20 1.127456e+17 3.659001e+16 6.598585e+16 4.025345e+15 1.157953e+15 3.678747e+13 0.111728 4.103121
2 1.228371e+06 1.084279e+06 144092.000000 42729.000000 23113.000000 5.009805e+19 4.865686e+20 2.017054e+21 7.598143e+21 5.043979 ... 0.004264 -1.579038e+25 2.566505e+15 3.213480e+17 7.026051e+17 3.929951e+19 3.696645e+11 1.529833e+16 2.000000 1.000000
3 3.786418e+02 2.353881e+02 143.253731 71.447761 181.477612 9.070842e+22 2.818395e+23 8.254031e+22 9.752651e+22 4.250715 ... 0.004012 7.059440e+22 6.472854e+20 1.240381e+19 2.959118e+20 2.150052e+19 1.432339e+19 1.243209e+16 3.253731 2.791045
4 2.000000e+00 1.000000e+00 1.000000 1.000000 1.000000 9.173332e+18 9.172241e+18 0.000000e+00 0.000000e+00 1.000000 ... 0.000000 1.091282e+15 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 4317.845744 4316.450729
5 3.097724e+01 1.625887e+01 14.718373 2.183575 1.622426 1.340362e+20 2.098518e+20 7.264903e+19 1.237454e+20 0.967353 ... 0.001317 -2.875345e+20 6.179641e+16 2.648585e+17 2.996784e+16 1.146132e+17 2.842660e+14 3.994207e+14 21.917123 22.265763

6 rows × 22 columns

IV - Tag transactions

Function to get the cluster corresponding to the node


In [146]:
#write function
def get_cluster(node,df):
    return df.loc[node].clusters


Let's tag the node in the transaction database


In [148]:
#Tag from node
%%time
transactions['cluster_from'] = transactions['from'].map(lambda x: get_cluster(x,df))


CPU times: user 26min 8s, sys: 24.4 s, total: 26min 33s
Wall time: 26min 51s

In [ ]:
#Tag to node
%%time
transactions['cluster_to'] = transactions['to'].map(lambda x: get_cluster(x,df))