In [1]:
%load_ext autoreload
%autoreload 2
In [74]:
import warnings
warnings.filterwarnings('ignore')
http://pydata.org/amsterdam2016/schedule/presentation/34/ https://github.com/mvaz/PyData2016-Amsterdam
"Frankfurt airport is an international hub"
"Ebola is a contagious disease"
"Lehman was too central to be let fail"
An obligation for certain types of contracts was introduced by the "European Market Infrastructure Regulation" (EMIR) and the "Dodd–Frank Wall Street Reform and Consumer Protection Act".
Zhi-Qiang Jiang, Wen-Jie Xie, Xiong Xiong, Wei Zhang, Yong-Jie Zhang & Wei-Xing Zhou (2013), Trading networks, abnormal motifs and stock manipulation, Quantitative Finance Letters, 1, 1-8. http://dx.doi.org/10.1080/21649502.2013.802877
In [39]:
from IPython.display import display, display_html
In [40]:
from kmeans_aux import *
In [85]:
import matplotlib.pyplot as plt
import seaborn as sns
from bokeh.plotting import output_notebook, figure, show, ColumnDataSource
TOOLS="resize,crosshair,pan,wheel_zoom,box_zoom,reset,tap,previewsave,box_select,poly_select,lasso_select"
output_notebook()
In [111]:
X, true_labels = generate_two_moons()
p = figure(tools=TOOLS);
x,y = [list(t) for t in zip(*X)]
p.scatter(x, y, color='black', radius=0.02);
show(p);
A naïve centroid-based approach such as K-Means does not capture the non-linear structure of the dataset.
In [91]:
k_means_2_labels, k_means_2_cluster_centres = do_kmeans(X, n_clusters=2, n_init=10)
src2 = create_datasource(X, k_means_2_labels, len(k_means_2_cluster_centres))
In [112]:
p = figure(tools=TOOLS);
p.scatter('x', 'y', color='fill_color', source=src2);
# alpha='alpha', radius='radius',
show(p);
The Laplacian of the graph is given by $$ L = D - A $$ where $A$ is the adjacency matrix and $D$ is the degree matrix $$D_{i,j}:= \begin{cases} \deg(v_i) & \mbox{if}\ i = j \\ 0 & \mbox{otherwise} \end{cases} $$ with $\deg(v_i)$ the degree of vertex $v_i$.
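As a toy illustration of this definition, here is a minimal NumPy sketch for a 3-node path graph (the graph itself is assumed purely for demonstration):
In [ ]:
import numpy as np
# Unweighted path graph 0 - 1 - 2: adjacency, degree and Laplacian matrices
A_toy = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]])
D_toy = np.diag(A_toy.sum(axis=1))  # degrees: 1, 2, 1
L_toy = D_toy - A_toy
print(L_toy)
# [[ 1 -1  0]
#  [-1  2 -1]
#  [ 0 -1  1]]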
The eigenvectors of the graph Laplacian are very interesting!
Construct a similarity graph
Let $A$ be its weighted adjacency matrix
Compute the unnormalized Laplacian $L = D - A$
Compute the first $k$ eigenvectors $u_1, \ldots, u_k$ of $L$
Let $U \in R^{n \times k}$ be the matrix containing the vectors $u_1, \ldots, u_k$ as columns
For $i = 1,\ldots, n$, let $y_i \in R^k$ be the vector corresponding to the $i$-th row of $U$
Cluster the points $(y_i)_{i=1,\ldots,n}$ in $R^k$ with the k-means algorithm into clusters $C_1,\ldots,C_k$
Output clusters $A_1,\ldots,A_k$ with $A_i = \{ j\, |\, y_j \in C_i\}$
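Putting these steps together, a minimal sketch of the unnormalized algorithm (assuming NumPy, SciPy and scikit-learn; the helper below is illustrative and not part of the talk's kmeans_aux module):
In [ ]:
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances

def spectral_clustering_sketch(X, n_clusters, sigma=0.5, n_neighbours=10):
    # 1. Fully connected similarity graph via a Gaussian kernel ...
    dist = euclidean_distances(X, X, squared=True)
    S = np.exp(-dist / (2 * sigma ** 2))
    np.fill_diagonal(S, 0)
    # ... sparsified by keeping the k largest similarities per row (then symmetrised)
    keep = np.argsort(-S, axis=1)[:, :n_neighbours]
    A = np.zeros_like(S)
    rows = np.arange(len(S))[:, None]
    A[rows, keep] = S[rows, keep]
    A = np.maximum(A, A.T)
    # 2. Unnormalized Laplacian L = D - A
    L = np.diag(A.sum(axis=1)) - A
    # 3. Eigenvectors of the k smallest eigenvalues form the embedding U
    eig_vals, eig_vecs = eigh(L)          # eigenvalues returned in ascending order
    U = eig_vecs[:, :n_clusters]
    # 4. k-means on the rows of U gives the cluster labels
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)
For the two-moons data above, spectral_clustering_sketch(X, n_clusters=2) should separate the two arcs.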
The following similarity graphs can be defined:
$\epsilon-$neighborhood graph connects all points whose pairwise distances are smaller than $\epsilon$
$k$-nearest neighbor graphs connect $v_i$ and $v_j$ if $v_j$ is among the $k$ nearest points of $v_i$
fully connected graph outsources the locality modelling to the similarity function, such as $s(x_i, x_j) = \exp\left( - \frac{|| x_i - x_j||^2}{ 2 \sigma^2} \right)$
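For reference, scikit-learn ships helpers for the first two constructions; a sketch with illustrative parameter values:
In [ ]:
import numpy as np
from sklearn.neighbors import kneighbors_graph, radius_neighbors_graph
from sklearn.metrics.pairwise import rbf_kernel

points = np.random.rand(100, 2)  # illustrative point cloud

# epsilon-neighborhood graph: connect points closer than epsilon
A_eps = radius_neighbors_graph(points, radius=0.2, mode='connectivity')

# k-nearest-neighbor graph (symmetrised, since the kNN relation is not symmetric)
A_knn = kneighbors_graph(points, n_neighbors=10, mode='distance')
A_knn = 0.5 * (A_knn + A_knn.T)

# fully connected graph with the Gaussian similarity s(x_i, x_j) above (sigma = 0.5)
A_full = rbf_kernel(points, gamma=1.0 / (2 * 0.5 ** 2))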
No full metric is required. E.g. it is possible to compare:
In [93]:
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Turn the distance into a similarity by applying a Gaussian kernel
distance = euclidean_distances(X, X, squared=True)
sig = 0.5
similarity = np.exp(-distance / sig)
In [110]:
%matplotlib notebook
sns.clustermap(similarity);
In [96]:
A = create_adjacency_matrix(similarity, num_neighbours=10)
# Calculate the degree - the sum of all incoming weights to a vertex
D = np.diag(np.sum(A, axis=0))
# Unnormalised Laplacian Matrix
L = D - A
In [97]:
# Perform an eigenvalue decomposition of the Laplacian
from scipy.linalg import eig
eig_vals, eig_vecs = eig(L)
sorted_inds = np.argsort(eig_vals.real, axis=0)
In [98]:
# Perform a dimensionality reduction
k = 3
F = eig_vecs[:,sorted_inds[0:k]]
# Do k-means on the reduced space
k_means_3_labels, k_means_3_cluster_centers = do_kmeans(F, n_clusters=3)
ind_3 = np.argsort(k_means_3_labels)
In [101]:
%matplotlib notebook
plt.bar(range(10), eig_vals.real[sorted_inds[0:10]]);
# plt.imshow(similarity[ind_3].T[ind_3])
In [102]:
from networkx.generators import random_graphs
from networkx.generators import classic
In [103]:
import networkx as nx
g = nx.from_numpy_matrix(A)
In [104]:
g0 = random_graphs.empty_graph()
g0.add_nodes_from(g)
# Copy the edges of g into g0, dropping the edge weights
for v in g.edges_iter():
    g0.add_edge(v[0], v[1])
In [105]:
from graphs import *
In [113]:
p_g0 = figure(toolbar_location='left', tools=TOOLS)
# prepare_datasources is assumed to return the node and edge data sources for the plot
ds_g0_nodes, ds_g0_edges = prepare_datasources(g0, k=.05, iterations=150)
plot_graph(p_g0, ds_g0_nodes, ds_g0_edges)
show(p_g0);
In [107]:
# k_means_2_labels, k_means_2_cluster_centres = do_kmeans(X, n_clusters=2, n_init=10)
src3 = create_datasource(X, k_means_3_labels, 3)  # i.e. len(k_means_3_cluster_centers)
In [114]:
p = figure(toolbar_location='left', tools=TOOLS);
p.scatter('x', 'y', color='fill_color', source=src3);
show(p);
Construct a sparse similarity graph by applying a threshold r on the distance.
Determine the connected components.
Outliers are the nodes that fall in components with fewer than p nodes
More info in https://github.com/dmarx/Topological-Anomaly-Detection
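A minimal sketch of this idea (using SciPy's connected components; the thresholds r and p here are illustrative, not taken from the linked project):
In [ ]:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.metrics.pairwise import euclidean_distances

def small_component_outliers(points, r=0.3, p=5):
    # Sparse graph from the distance threshold r
    dist = euclidean_distances(points)
    adjacency = csr_matrix(dist < r)
    # Connected components of the thresholded graph
    n_components, labels = connected_components(adjacency, directed=False)
    # Outliers: points belonging to components with fewer than p nodes
    sizes = np.bincount(labels)
    return sizes[labels] < p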
Then compute the windowed pairwise correlation of each pair of assets:
$$X = R_{i,\,t_0\ldots t_k} \qquad Y = R_{j,\,t_0\ldots t_k}$$
Can we put this structure to use?
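With pandas, such windowed pairwise correlations can be computed roughly as follows (a sketch on synthetic returns; the window length and asset names are illustrative):
In [ ]:
import numpy as np
import pandas as pd

# Illustrative synthetic daily returns for a handful of assets
rng = np.random.default_rng(1)
dates = pd.bdate_range('2015-01-01', periods=250)
returns = pd.DataFrame(rng.normal(scale=0.01, size=(250, 4)),
                       index=dates, columns=['A', 'B', 'C', 'D'])

window = 60  # e.g. 60 trading days
rolling_corr = returns.rolling(window).corr()   # MultiIndex: (date, asset) x asset

# Correlation matrix estimated over the window ending at the last date
corr_t = rolling_corr.loc[returns.index[-1]]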
F. Pozzi, T. Di Matteo and T. Aste (2013), Spread of risk across financial markets: better to invest in the peripheries, Nature Scientific Reports 3, Article number: 1665. doi:10.1038/srep01665 http://www.nature.com/srep/2013/130416/srep01665/full/srep01665.html
IPython notebooks and material in https://github.com/mvaz/PyData2014-Berlin
If you preprocess your data such that each row has $\mu=0$ and $\sigma = 1$ (which disallows constant vectors!), then correlation reduces to cosine:
$$\operatorname{Cor}(X,Y)=\frac{\operatorname{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X- \mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y} = E[X Y] = \frac{1}{n}\langle X, Y\rangle$$
where $n$ is the dimension of the vectors. Under the same conditions, squared Euclidean distance also reduces to cosine:
$$d^2_{\mathrm{Euclid}}(X,Y)= \sum_i (X_i-Y_i)^2 = \sum_i X_i^2 + \sum_i Y_i^2 - 2 \sum_i X_i Y_i = 2n - 2\langle X, Y\rangle$$
[other notebook]
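A quick numerical check of these identities (a minimal NumPy sketch; the random vectors are illustrative):
In [ ]:
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)

# Standardise each vector: zero mean, unit (population) standard deviation
xs = (x - x.mean()) / x.std()
ys = (y - y.mean()) / y.std()

corr = np.corrcoef(x, y)[0, 1]        # Pearson correlation
cosine_like = xs @ ys / n             # (1/n) <X, Y> on the standardised data
sq_dist = np.sum((xs - ys) ** 2)      # squared Euclidean distance

print(np.isclose(corr, cosine_like))           # True
print(np.isclose(sq_dist, 2 * n * (1 - corr))) # True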
Different methods for sparsifying the correlation matrix were tested; here a topological criterion, the Planar Maximally Filtered Graph (PMFG), was used.
In this paper the authors were interested in centrality rather than community structure. The gist of the paper is that by choosing assets in the periphery of the network, the portfolios become more diversified.
(picture taken from the paper)
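The PMFG construction itself is somewhat involved; as a simpler, closely related filter (not the method of the paper, just a sketch of the idea with networkx), one can take the minimum spanning tree of the correlation-based distance $d_{ij} = \sqrt{2(1 - \rho_{ij})}$:
In [ ]:
import numpy as np
import networkx as nx

def mst_filter(corr):
    # corr: correlation matrix as a NumPy array
    # Minimum spanning tree of the correlation-based distance d = sqrt(2 (1 - rho))
    dist = np.sqrt(2.0 * (1.0 - corr))
    g_full = nx.Graph()
    n_assets = corr.shape[0]
    for i in range(n_assets):
        for j in range(i + 1, n_assets):
            g_full.add_edge(i, j, weight=dist[i, j])
    return nx.minimum_spanning_tree(g_full, weight='weight')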
For each point in time:
Build asset baskets:
Construct portfolios
Evaluate the signal-to-noise ratio (SNR) for the periods following construction: $$ \frac{\overline{r}}{s} $$
Market portfolio profits the most from Markowitz.
Central baskets profit less, which means that their diversification is already very good.
Large central portfolios are already very good.
Structure is important, and network theory helps you exploit it.
There is a lot of structure in finance.
Local methods can help reveal the structure of your data.
You don't always need a distance metric. Often, a similarity function is enough, e.g.
The goal is to embed high-dimensional objects into a lower-dimensional space (e.g. 2D).
Sophisticated modelling of the effect of the local structure on the low-dimensional embedding.
Given a set of N high-dimensional objects $\mathbf{x}_1, \dots, \mathbf{x}_N$, t-SNE first computes probabilities $p_{ij}$ that are proportional to the similarity of objects $\mathbf{x}_i$ and $\mathbf{x}_j$, as follows:
$$p_{j|i} = \frac{\exp(-\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert\mathbf{x}_i - \mathbf{x}_k\rVert^2 / 2\sigma_i^2)}$$,
Symmetric distance $$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$$
The gradient turns out to be easy to compute:
$$ \frac{\partial \, KL(P || Q)}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})\, g\!\left( \left\| y_i - y_j\right\| \right) u_{ij}, \quad \textrm{where} \quad g(z) = \frac{z}{1+z^2} $$
Here, $u_{ij}$ is a unit vector going from $y_j$ to $y_i$. This gradient expresses the sum of all spring forces applied to map point $i$.
Adaptations to big data sets are already available, using the Barnes–Hut approximation.
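As a usage sketch with scikit-learn's TSNE (the data and parameters below are illustrative, not tuned):
In [ ]:
import numpy as np
from sklearn.manifold import TSNE

X_high = np.random.rand(200, 50)     # illustrative high-dimensional data

tsne = TSNE(n_components=2, perplexity=30, method='barnes_hut', random_state=0)
X_2d = tsne.fit_transform(X_high)    # N x 2 embedding, ready for plotting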
In [ ]:
More of an art than a science. Run the algorithm with several configurations and try to minimize the 'inertia' $$\sum_{i=1}^{n} \min_{\mu_j \in C} \lVert x_i - \mu_j\rVert^2,$$ i.e. the sum of squared distances of each point to its nearest cluster centre. The (related) concept of "silhouette" is also important.
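In practice this can look as follows (a sketch with scikit-learn on synthetic blobs; the data and the range of k are illustrative):
In [ ]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Compare inertia and silhouette across candidate numbers of clusters
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    sil = silhouette_score(X_demo, km.labels_)
    print('k=%d  inertia=%.1f  silhouette=%.3f' % (k, km.inertia_, sil))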
There is no strong theoretical argument for choosing one of the versions over another.
Works best when there are not too many different clusters; handles non-linear cluster shapes better.
The similarity matrix is quadratic in the number of samples, so the method does not scale to "Big Data".
It might also be helpful to perform a dimensionality reduction first, e.g. with PCA.
http://www.cs.yale.edu/homes/spielman/561/2009/lect02-09.pdf