http://www.meetup.com/Dusseldorf-Data-Science-Meetup/events/225280000/
"Frankfurt airport is an international hub"
"Ebola is a contagious disease"
"Lehman was too central to be let fail"
"Thomas Wiecki is a well connected data scientist"
The Laplacian of the graph is given by $$ L = D - A $$ where $A$ is the adjacency matrix and $D$ is the degree matrix $$D_{i,j}:= \begin{cases} \deg(v_i) & \mbox{if}\ i = j \\ 0 & \mbox{otherwise} \end{cases} $$ and $\deg(v_i)$ is the degree of vertex $v_i$.
The eigenvectors of the graph Laplacian are very interesting!
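To make this concrete, here is a minimal sketch with numpy of computing the Laplacian and its eigendecomposition; the small hand-written adjacency matrix is an assumption chosen only for illustration.
import numpy as np
# Tiny undirected graph: a triangle on nodes 0-2 plus a separate edge between nodes 3 and 4
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # unnormalised graph Laplacian
eig_vals, eig_vecs = np.linalg.eigh(L)   # L is symmetric, so eigh applies
print(eig_vals)  # two (near-)zero eigenvalues, one per connected component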
In [1]:
%load_ext autoreload
%autoreload 2
In [2]:
from IPython.display import display, display_html
Let's assume a well-behaved dataset, with points drawn from one of the following distributions:
$X_1 \sim N\left( \mu=(1,1),\, \sigma^2=(0.9, 0.9) \right)$
$X_2 \sim N\left( \mu=(-0.5,0),\, \sigma^2=(0.9, 0.9) \right)$
$X_3 \sim N\left( \mu=(1,-1),\, \sigma^2=(0.9, 0.9) \right)$
Well-behaved means in this case that the dataset is described by almost non-overlapping centroids.
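The helper generate_three_circles from kmeans_aux produces such a dataset; the actual helper is not shown here, but equivalent sampling with numpy might look roughly like this (sample size per centroid is an assumption):
import numpy as np
rng = np.random.RandomState(0)
means = np.array([[1.0, 1.0], [-0.5, 0.0], [1.0, -1.0]])
n_per_cluster = 200  # assumed number of points per centroid
# variance 0.9 per coordinate, so standard deviation sqrt(0.9)
X_demo = np.vstack([rng.normal(mu, np.sqrt(0.9), size=(n_per_cluster, 2)) for mu in means])
labels_demo = np.repeat(np.arange(3), n_per_cluster)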
In [3]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from bokeh.plotting import output_notebook, figure, show, ColumnDataSource
TOOLS="resize,crosshair,pan,wheel_zoom,box_zoom,reset,tap,previewsave,box_select,poly_select,lasso_select"
output_notebook()
In [4]:
from kmeans_aux import *
In [5]:
from sympy import *
init_printing(use_unicode=True)
In [6]:
m = Matrix([[-2, 1, 1], [2, -2, 0], [1, 0, -1]])
m
Out[6]:
In [7]:
X, labels_true, centres = generate_three_circles()
src = create_datasource(X, labels_true, len(centres))
In [8]:
p = figure(tools=TOOLS);
p.scatter('x', 'y', radius='radius', color='fill_color', source=src);
show(p)
One possible way of recovering the categories is by using the K-Means clustering algorithm.
The $k$-means algorithm uses $k$ centroids to define clusters. It finds the best centroids by alternating between an assignment step and an update step:
Repeat until convergence: {
For every $i$, set $$c^{(i)} := \arg\min_{j} \lVert x^{(i)} - \mu_j \rVert^2$$
For each $j$, set $$\mu_j := \frac{\sum_{i=1}^m 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^m 1\{c^{(i)} = j\}}$$ }
Scikit-learn already has an implementation of the algorithm
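do_kmeans from kmeans_aux presumably wraps sklearn.cluster.KMeans along these lines (a sketch, not the actual helper):
from sklearn.cluster import KMeans
def do_kmeans(X, n_clusters, n_init=10):
    # Fit k-means and return the cluster labels and the cluster centres
    km = KMeans(n_clusters=n_clusters, n_init=n_init).fit(X)
    return km.labels_, km.cluster_centers_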
In [9]:
k_means_3_labels, k_means_3_cluster_centres = do_kmeans(X, n_clusters=3)
distance, order, is_correct = compare_kmeans_with_prior(k_means_3_cluster_centres, k_means_3_labels, centres, labels_true)
In [10]:
# Highlight the k-means mistakes: fade the correctly classified points and double the radius of the misclassified ones
alpha = 0.9 - is_correct * 0.8
rad = src.data['radius'] * (1 + (~ is_correct * 1))
src.add(rad, 'radius2')
src.add(alpha, name='alpha')
Out[10]:
In [11]:
p = figure(tools=TOOLS);
p.scatter('x', 'y', color='fill_color', alpha='alpha', radius='radius2', source=src);
show(p)
In [12]:
X, true_labels = generate_two_moons()
p = figure(tools=TOOLS);
x,y = [list(t) for t in zip(*X)]
p.scatter(x, y, color='black', radius=0.02);
show(p)
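generate_two_moons is another helper from kmeans_aux; scikit-learn's make_moons produces the same kind of data, so an equivalent sketch (with assumed sample size and noise level) would be:
from sklearn.datasets import make_moons
# Two interleaving half-circles with a small amount of noise
X_demo, labels_demo = make_moons(n_samples=500, noise=0.05, random_state=0)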
The distribution is no longer described by centroids.
K-Means does not capture the non-linear structure of the dataset.
In [13]:
k_means_2_labels, k_means_2_cluster_centres = do_kmeans(X, n_clusters=2, n_init=10)
src2 = create_datasource(X, k_means_2_labels, len(k_means_2_cluster_centres))
In [14]:
p = figure(tools=TOOLS);
p.scatter('x', 'y', color='fill_color', source=src2);
# alpha='alpha', radius='radius',
show(p)
Construct a similarity graph from the data points and let $A$ be its weighted adjacency matrix.
Compute the unnormalized Laplacian $L = D - A$
Compute the first $k$ eigenvectors $u_1, \ldots, u_k$ of $L$
Let $U \in R^{n \times k}$ be the matrix containing the vectors $u_1, \ldots, u_k$ as columns
For $i = 1,\ldots, n$, let $y_i \in R^k$ be the vector corresponding to the i-th row of U
Cluster the points $(y_i)_{i=1,\ldots,n}$ in $R^k$ with the $k$-means algorithm into clusters $C_1, \ldots, C_k$. Output clusters $A_1,\ldots,A_k$ with $A_i = \{ j\, |\, y_j \in C_i\}$
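For reference, scikit-learn bundles all of these steps into a single estimator; a minimal sketch with sklearn.cluster.SpectralClustering (the cells below build the pipeline step by step instead):
from sklearn.cluster import SpectralClustering
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', n_neighbors=10)
spectral_labels = sc.fit_predict(X)  # X is the two-moons dataset from above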
The following similarity graphs can be defined:
The $\epsilon$-neighborhood graph connects all points whose pairwise distances are smaller than $\epsilon$
The $k$-nearest neighbor graph connects $v_i$ and $v_j$ if $v_j$ is among the $k$ nearest points of $v_i$
The fully connected graph outsources the locality modelling to the similarity function, such as $s(x_i, x_j) = \exp\left( - \frac{|| x_i - x_j||^2}{ 2 \sigma^2} \right)$
No full metric is required; e.g., it is possible to compare:
In [15]:
# Turn the distance into a similarity by applying a Gaussian kernel
from sklearn.metrics.pairwise import euclidean_distances
distance = euclidean_distances(X, X, squared=True)
sig = 0.5
similarity = np.exp(-distance/sig)
In [16]:
%matplotlib notebook
sns.clustermap(similarity);
In [17]:
A = create_adjacency_matrix(similarity, num_neighbours=10)
# Calculate the degree - the sum of all incoming weights to a vertex
D = np.diag(np.sum(A, axis=0))
# Unnormalised Laplacian Matrix
L = D - A
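create_adjacency_matrix also comes from the helper module; a rough sketch of the kind of k-nearest-neighbour sparsification it might perform (an assumption, not the actual implementation):
import numpy as np
def create_adjacency_matrix(similarity, num_neighbours=10):
    # For every point keep only the edges to its num_neighbours most similar points
    n = similarity.shape[0]
    A = np.zeros_like(similarity)
    for i in range(n):
        nearest = np.argsort(similarity[i])[::-1][:num_neighbours]
        A[i, nearest] = similarity[i, nearest]
    # Symmetrise so that the resulting graph is undirected
    return np.maximum(A, A.T)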
In [18]:
# Perform an eigenvalue decomposition of the Laplacian
from scipy.linalg import eig
eig_vals, eig_vecs = eig(L)
sorted_inds = np.argsort(eig_vals.real, axis=0)
In [19]:
# Perform a dimensionality reduction: keep the eigenvectors belonging to the k smallest eigenvalues
k = 3
F = eig_vecs[:, sorted_inds[0:k]].real
# Do k-means on the reduced space
k_means_3_labels, k_means_3_cluster_centers = do_kmeans(F, n_clusters=3)
ind_3 = np.argsort(k_means_3_labels)
In [34]:
%matplotlib notebook
plt.bar(range(10), eig_vals.real[sorted_inds[0:10]]);
# plt.imshow(similarity[ind_3].T[ind_3])
Out[34]:
In [21]:
from networkx.generators import random_graphs
from networkx.generators import classic
In [22]:
from d3networkx import EventfulGraph, ForceDirectedGraph, empty_eventfulgraph_hook
In [23]:
def handle_graph(graph):
    print(graph.graph._sleep)
    graph_widget = ForceDirectedGraph(graph, width=1800, height=1600)
    display(graph_widget)
EventfulGraph.on_constructed(handle_graph)
random_graphs.empty_graph = empty_eventfulgraph_hook(sleep=0.01)
In [24]:
import networkx as nx
g = nx.from_numpy_matrix(A)
In [25]:
g0 = random_graphs.empty_graph()
g0.add_nodes_from(g)
for v in g.edges_iter():
    g0.add_edge(v[0], v[1])
In [26]:
# Colour the nodes by label; switch between the k-means labels and the true labels here
# ll = k_means_3_labels
ll = true_labels
for n in g0.nodes_iter():
    g0.node[n]['fill'] = '#770000' if ll[n] == 0 else '#007700' if ll[n] == 1 else '#000077'
In [27]:
# k_means_2_labels, k_means_2_cluster_centres = do_kmeans(X, n_clusters=2, n_init=10)
src3 = create_datasource(X, k_means_3_labels, len(k_means_3_cluster_centers))
In [28]:
len(k_means_3_labels)
Out[28]:
In [29]:
p = figure(tools=TOOLS);
p.scatter('x', 'y', color='fill_color', source=src3);
show(p)
Construct a sparse similarity graph by applying a threshold $r$ on the distance.
Determine the connected components.
Outliers are the components with a number of nodes smaller than a threshold $p$.
More info in https://github.com/dmarx/Topological-Anomaly-Detection
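A minimal sketch of this idea with networkx, assuming a precomputed pairwise distance matrix and hypothetical thresholds r and p:
import networkx as nx
def topological_outliers(distance, r=0.5, p=5):
    n = distance.shape[0]
    graph = nx.Graph()
    graph.add_nodes_from(range(n))
    # Connect points whose pairwise distance is below the threshold r
    for i in range(n):
        for j in range(i + 1, n):
            if distance[i, j] < r:
                graph.add_edge(i, j)
    # Components with fewer than p nodes are flagged as outliers
    outliers = []
    for component in nx.connected_components(graph):
        if len(component) < p:
            outliers.extend(component)
    return sorted(outliers)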
More of an art than a science. Run the algorithm with several configurations and try to minimize the 'inertia' $$\sum_{i=1}^{n} \min_{\mu_j \in C} \lVert x_i - \mu_j \rVert^2,$$ i.e. the sum of squared distances of each point to its nearest cluster centre. The (related) concept of "silhouette" is also important.
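In practice this is often done with an "elbow" plot; a sketch with scikit-learn, whose KMeans exposes the inertia of a fitted model via its inertia_ attribute:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]
plt.plot(ks, inertias, marker='o')  # look for an 'elbow' where the curve flattens
plt.xlabel('number of clusters k')
plt.ylabel('inertia')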
There is no definitive theoretical argument for preferring one version of the algorithm over another.
Works best with not too many different clusters, and is better suited to non-linear cluster shapes than K-Means.
The similarity matrix is quadratic in the number of samples, so the approach does not scale to "big data".
It might also be helpful to perform a dimensionality reduction first with e.g. PCA.
http://www.cs.yale.edu/homes/spielman/561/2009/lect02-09.pdf
Spread of risk across financial markets: better to invest in the peripheries. F. Pozzi, T. Di Matteo and T. Aste (2013), Nature Scientific Reports 3, Article number 1665, doi:10.1038/srep01665, http://www.nature.com/srep/2013/130416/srep01665/full/srep01665.html
IPython notebooks and material in https://github.com/mvaz/PyData2014-Berlin
Different methods for sparsifying the matrix were tested; in particular, a topological filter (the Planar Maximally Filtered Graph) was used.
In this paper the authors were interested in centrality rather than community structure. The gist of the paper is that, by choosing assets in the periphery of the network, the portfolios become more diversified.
(picture taken from the paper)
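The paper's PMFG construction is not reproduced here, but a toy illustration of ranking nodes by centrality and picking the periphery with networkx (a sketch, not the paper's method) could be:
import networkx as nx
centrality = nx.betweenness_centrality(g)  # g is the similarity graph built above
peripheral_nodes = sorted(centrality, key=centrality.get)[:10]  # the least central nodes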
The goal is to embed high-dimensional objects into a lower-dimensional space (e.g. 2D).
t-SNE models in a sophisticated way how the local structure of the data shapes the low-dimensional embedding.
Given a set of N high-dimensional objects $\mathbf{x}_1, \dots, \mathbf{x}_N$, t-SNE first computes probabilities $p_{ij}$ that are proportional to the similarity of objects $\mathbf{x}_i$ and $\mathbf{x}_j$, as follows:
$$p_{j|i} = \frac{\exp(-\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert\mathbf{x}_i - \mathbf{x}_k\rVert^2 / 2\sigma_i^2)}$$,
Symmetric distance $$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$$
The gradient turns out to be easy to compute:
$$ \frac{\partial \, KL(P || Q)}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})\, g\left( \left| y_i - y_j\right| \right) u_{ij}, \quad \textrm{where} \; g(z) = \frac{z}{1+z^2} $$Here, $u_{ij}$ is a unit vector going from $y_j$ to $y_i$. This gradient expresses the sum of all spring forces applied to map point $i$.
Adaptations to big data sets are already available, using the Barnes-Hut approximation.
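scikit-learn ships a t-SNE implementation (Barnes-Hut is its default gradient method); a minimal sketch applied to the two-moons data from above (the perplexity value is an assumption):
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
embedding = TSNE(n_components=2, perplexity=30.0).fit_transform(X)
plt.scatter(embedding[:, 0], embedding[:, 1], c=true_labels, s=5)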
Local methods can help reveal the structure of your data, with minor assumptions on its shape
You don't always need a distance metric. Often, a similarity function is enough, e.g.
Graph theory helps you find
This talk is highly influenced by Bart Baddeley's talk at the PyData 2014 in London
http://nbviewer.ipython.org/github/BartBaddeley/PyDataTalk-2014/blob/master/PyDataTalk.ipynb
and the repository is made available at https://github.com/BartBaddeley/PyDataTalk-2014