Given two anonymized versions of a graph $G=(V,E)$, an interesting problem is to find the true mapping between the nodes of these two graphs. Generally, these two anonymized graphs could also have a slightly different set of edges (assume for example that they are the connections of a set of people in two different online social networks)
This problem can be formulated as follows:
Given two graphs, $G_1 = (V,E_1)$ and $G_2 = (V,E_2)$ of size $n=|V|$, a permutation $\pi$ on $V$ maps the vertices from $G_1$ onto those in $G_2$.
Find the “best” such permutation, denoted by $\pi_0$.
The cost of a permutation can be expressed in many different ways. One such measure is the edge-inconsistency between two graphs, $G_1$ and $G_2$ [Pedarsani2011]. This represents the number of edges that exist in one graph and not the other, under a matching $\pi$. \begin{equation} \Delta_{\pi} = \sum_{e \in E_1} \mathbb{1}_{\{\pi(e) \not\in E_2\}} + \sum_{e \in E_2} \mathbb{1}_{\{\pi^{-1}(e) \not\in E_1\}} \end{equation} Here $\mathbb{1}_{\{ A \}}$ denotes the indicator function, and $\pi(e)$ denotes the mapped version of edge $e = (u,v)$, i.e. $\pi(e) = (\pi(u), \pi(v))$.
We will try two different ways of generating the input instances. First, we will just anonymize the graph $G$ and we will try to match it to itself. This is equivalent to generating $G_1$ and $G_2$ by starting from $G$ and randomly renaming the nodes in $V$.
Then, in order to simulate the generation of $G_1$ and $G_2$ based on two different social networks, we will start from $G$ and, with some probability $p$, we will remove edges from $E$.
We will examine these two scenarios separately, starting with the easy one.
In [66]:
%matplotlib inline
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import graph_isomorphisms as giso
# This helper function will be used to generate the seed for the randomness, so that we can be consistent
import hashlib
def hash_string(text):
return int(hashlib.md5(text).hexdigest()[0:7], 16)
giso.setup_plots()
Our algorithms are all based on the same intuition. We will represent each node in the graph as a multi-dimensional point. Each dimension will represent the similarity of the node to some fixed beacon node. For these beacon nodes we assume that we know the correct mapping across the two graphs. This is a reasonable assumption to make, as in most social networks there are some nodes with a high and unique degree.
In [6]:
def find_beacons_max_unique(G, num_of_beacons=3):
# Return the nodes with maximum unique degrees
inverse_degree = {}
for k, v in nx.degree(G).iteritems():
inverse_degree[v] = inverse_degree.get(v, [])
inverse_degree[v].append(k)
degrees = sorted(inverse_degree.keys(),reverse=True)
multiplicity=[len(inverse_degree[degree]) for degree in degrees]
unique = [i for i,m in enumerate(multiplicity) if m == 1]
return [inverse_degree[degrees[i]][0] for i in unique[0:num_of_beacons]]
def find_beacons_sample(G, num_of_beacons=3, seed=hash_string('find_beacons_sample')):
# Sample a beacon based on its degree
from numpy.random import RandomState
prng = RandomState(seed)
degrees = np.array(nx.degree(G).values())
return prng.choice(np.arange(len(degrees)), num_of_beacons, p=degrees *1./sum(degrees), replace=False )
def find_beacons_sample_inverse(G, num_of_beacons=3, seed=hash_string('find_beacons_sample')):
# Sample a beacon based on its degree
from numpy.random import RandomState
prng = RandomState(seed)
degrees = np.array(nx.degree(G).values())
return prng.choice(np.arange(len(degrees)), num_of_beacons, p=(1./degrees) *1./sum(1./degrees), replace=False )
In [191]:
G = nx.generators.barabasi_albert_graph(50,2, seed=hash_string('example graphs'))
In [192]:
G1 = giso.noisy_remove(G,0.2,seed=giso.hash_string('noisy removal 2'))
G2 = giso.noisy_remove(G,0.2,seed=giso.hash_string('noisy removal 2'))
In [193]:
beacons_G1 = find_beacons_sample_inverse(G1,4)
beacons_G2 = beacons_G1
In [194]:
rest_of_nodes = [x for x in G1.nodes() if x not in beacons_G1]
anonymous_mapping = dict(zip(beacons_G1,beacons_G1))
anonymous_mapping.update(zip(rest_of_nodes,np.random.permutation(rest_of_nodes)))
G2 = nx.relabel_nodes(G2,anonymous_mapping, copy=True)
The end result can be seen below. The beacon nodes are the bigger ones, with the labels. For these nodes, we know their correspondence accross the two graphs.
In [195]:
giso.plot_graphs(G1, G2, beacons_G1, beacons_G2)
Now that we have picked the beacon sets $B_{G_1}$ and $B_{G_2}$, each of size $b$, and we know the true mapping $\pi_0(i), i \in B_{G_1}$, we will project each node in the graph to a multi-dimensional space.
The projection of node $v$ in $G_i$ is the vector $p(v) \in \mathcal{R}^{b}$ defined as \begin{equation} p(v) = \left\lbrace d(v, B_{G_i}^{(1)}), \ldots, d(v, B_{G_i}^{(b)}) \right\rbrace \end{equation} where $d(v, B_{G_i}^{(j)})$ is the shortest-path distance between node $v$ and the $j$-th beacon. Notice that when we do this projection, the dimensions are ordered according to the mapping between the beacons.
In [7]:
def shortest_path_project(G, beacons):
projection = np.zeros((G.number_of_nodes() - len(beacons), len(beacons)))
for i,beacon in enumerate(beacons):
lengths = nx.shortest_path_length(G, source=beacon)
node_index = 0
for node in lengths:
if node in beacons:
continue
projection[node_index][i] = lengths[node]
node_index += 1
return projection
In [197]:
p1 = shortest_path_project(G1,beacons_G1)
p2 = shortest_path_project(G2,beacons_G2)
In [198]:
from sklearn.manifold import Isomap
from sklearn.metrics.pairwise import euclidean_distances
Here, we get a sense of how these $b$-dimensional spaces look like. We project their coordinates to a $2$-dimensional space by using the Isomap algorithm. Keep in mind that the darker the marker is, the bigger the number of points that have the same projection.
In [199]:
coords1 = giso.isomap_project(p1)
coords2 = giso.isomap_project(p2)
ax = plt.subplot(121)
ax.plot(coords1[:, 0], coords1[:, 1], 'o', alpha = 0.4, markersize=8, c='#4b809c')
remove_border(ax)
ax = plt.subplot(122)
ax.plot(coords2[:, 0], coords2[:, 1], 'D', alpha = 0.3, markersize=8, c='#ba7c5a')
remove_border(ax)
A natural question is where the nodes that are close with each other in the high-dimensional space are in the graph. We answer this by running a k-medians algorithm in that space and then showing these clusters back on the graphs, side by side.
In [200]:
from sklearn.cluster import KMeans
def kmeans_plot(X, mds, ax, n_clusters=8, marker='o'):
est = KMeans(n_clusters=n_clusters)
est.fit(X-X.mean())
kmeans_labels = est.labels_
for k in range(n_clusters):
my_members = kmeans_labels == k
ax.plot(mds[my_members, 0], mds[my_members, 1], marker, alpha = 0.6, markersize=8)
remove_border(ax)
return kmeans_labels
def kmedians_plot(X, mds, ax, n_clusters=8, marker='o'):
est = giso.KMedians(k=n_clusters)
est.fit(X-X.mean())
kmedians_labels = est.labels_
for k in range(n_clusters):
my_members = kmedians_labels == k
ax.plot(mds[my_members, 0], mds[my_members, 1], marker, alpha = 0.6, markersize=8)
remove_border(ax)
return kmedians_labels
In [201]:
import brewer2mpl
from sklearn.cluster import KMeans
accent_colors = brewer2mpl.get_map('Accent', 'qualitative',8).mpl_colors
rcParams['axes.color_cycle'] = accent_colors
f, (ax1,ax2) = plt.subplots(1,2)
kmedians_labels1 = kmedians_plot(p1, coords1, ax1, n_clusters=4)
plt.figure()
kmedians_labels2 = kmedians_plot(p2, coords2, ax2, n_clusters=4, marker='D')
In [202]:
def cluster_colors(G, beacons, kmeans_labels, colors):
j=0
clustered_colors=[]
for i,node in enumerate(G.nodes()):
if node not in beacons:
clustered_colors.append(colors[kmeans_labels[j]])
j = j + 1
else:
clustered_colors.append((101./256,171./256,208./256))
return clustered_colors
giso.paint_clusters(G1,G2,beacons_G1, beacons_G2, kmeans_labels1, kmeans_labels2)
In [204]:
d= euclidean_distances(p1,p2)
best = {k:v for k,v in giso.find_best_match(d)}
In the plot below, we visualize the result of the matching. The x and y axes are the node labels, ordered by the edge centrality of the node. If the cell $(i,j)$ is light green, it means that the node $i \in G_1$ and $j \in G_2$ were used as beacons. If the cell $(i,j)$ is dark green, it means that the method found a correct matching between nodes $i$ and $j$, in $G_1$ and $G_2$ respectively. Finally, a purple square represents a mismatch.
In [221]:
(correct, matching_matrix) = giso.count_matches(G1, G2, beacons_G1, beacons_G2, best, anonymous_mapping)
print "Found:", correct
giso.print_matching(G, matching_matrix)
The next thing we will check is how the number of beacons affects the recovery of nodes. For a fixed graph, and fixed noisy versions of the graph, we try different values of beacon percentage. Every time we repeat the experiment, we pick a different beacon set and measure how many of the node labels we can correctly recover.
In [74]:
G = nx.generators.barabasi_albert_graph(num_nodes,2)
G1 = giso.noisy_remove(G,0.05)
G2 = giso.noisy_remove(G,0.05)
repeats = 50
num_nodes = 200
beacon_percentages=[0.01,0.02,0.03,0.04,0.05,0.1, 0.15,0.2]
averages = []
for i,beacon_percentage in enumerate(beacon_percentages):
total = 0
for tries in xrange(repeats):
(best, anonymous, beacons_G1, beacons_G2) = giso.match(G1, G2, beacon_percentage)
(correct, matching_matrix) = giso.count_matches(G1, G2, beacons_G1, beacons_G2, best, anonymous)
total += correct
averages.append(total * 1. / repeats)
In [73]:
beacon_percentages=[0.01,0.02,0.03,0.04,0.05,0.1, 0.15,0.2]
fig, ax = plt.subplots()
ax.plot(averages)
ax.set_xticklabels(beacon_percentages)
ax.set_xlabel("Beacon set site (as a percentage)")
ax.set_ylabel("Average number of recovered labels")
Out[73]:
We ran this experiment for a graph of 200 nodes. As expected, the number of recovered labels increases as we add more beacons to the graph. However, when the beacons exceed the 5% of the total graph, the number of recovered nodes drops rapidly. The reason might be that there is a predefined set of "possibly identifiable" nodes, that depends on the structure of the graph, and as we add more beacon nodes, we cover parts of this set as well.
We tried this experiment for different sizes for $G$, and the trend seems to be similar in all the cases.
As the next step, instead of using the shortest path distance between each node and the beacons, we will use the effective resistance between them.
The effective resistance between two nodes is the resistance of the whole graph when voltage is applied to them. We imply here that we view the network as an electrical circuit, where the edges are unit resistors and the nodes are connection points. Consequently, we are allow to attach a battery at any two nodes in the network, and current will flow from one battery terminal to the other.
In [110]:
def effective_resistance_project(G, beacons):
from numpy.linalg import pinv
projection = np.zeros((G.number_of_nodes() - len(beacons), len(beacons)))
L = nx.laplacian_matrix(G)
B = nx.incidence_matrix(G).T
B_e = B.copy()
L_pseudo = pinv(L)
for i in xrange(B.shape[0]):
min_ace = np.min(np.where(B[i,:] ==1)[1])
B_e[i, min_ace] = -1
for i,beacon in enumerate(beacons):
node_index = 0
for j,node in enumerate(G.nodes()):
if node in beacons:
continue
battery = np.zeros((B_e.shape[1],1))
battery[i] = 1
battery[node_index] = -1
p = L_pseudo * battery
projection[node_index][i] = abs(p[i] - p[j])
node_index += 1
return projection
For a random graph $G$, we plot the nodes by using the effective resistance projection above. We notice that, even though the two graphs are noisy versions of each other, the geometry of the points is very similar in both plots.
In [108]:
G = nx.generators.barabasi_albert_graph(50,2)
G1 = giso.noisy_remove(G,0.05)
G2 = giso.noisy_remove(G,0.05)
beacons = giso.find_beacons_sample_inverse(G, num_of_beacons=int(0.05 * G.number_of_nodes()))
In [117]:
ef_project1 = effective_resistance_project(G1,beacons)
ef_project2 = effective_resistance_project(G2,beacons)
In [120]:
coords1 = ef_project1
coords2 = ef_project2
ax = plt.subplot(121)
ax.plot(coords1[:, 0], coords1[:, 1], 'o', alpha = 0.4, markersize=8, c='#4b809c')
giso.remove_border(ax)
ax = plt.subplot(122)
ax.plot(coords2[:, 0], coords2[:, 1], 'D', alpha = 0.3, markersize=8, c='#ba7c5a')
giso.remove_border(ax)
Here, we will repeat the experiment we had above, but we will now use the effective resistance as the similarity function between two nodes.
In [129]:
repeats = 50
num_nodes = 200
G = nx.generators.barabasi_albert_graph(num_nodes,2)
G1 = giso.noisy_remove(G,0.05)
G2 = giso.noisy_remove(G,0.05)
beacon_percentages=[0.01,0.02,0.03,0.04,0.05,0.1, 0.15,0.2]
averages = []
for i,beacon_percentage in enumerate(beacon_percentages):
total = 0
for tries in xrange(repeats):
(best, anonymous, beacons_G1, beacons_G2) = giso.match(G1, G2, beacon_percentage, electrical=True)
(correct, matching_matrix) = giso.count_matches(G1, G2, beacons_G1, beacons_G2, best, anonymous)
total += correct
averages.append(total * 1. / repeats)
In [130]:
beacon_percentages=[0.01,0.02,0.03,0.04,0.05,0.1, 0.15,0.2]
fig, ax = plt.subplots()
ax.plot(averages)
ax.set_xticklabels(beacon_percentages)
ax.set_xlabel("Beacon set site (as a percentage)")
ax.set_ylabel("Average number of recovered labels")
Out[130]:
As it seems, although the geometry of the projection looks very promising, the nearest neighbor distance fails to identify the correct isomorphism in the presence of noise. The reason for this might be that the projection is very dense, i.e. there are many neighbors to each node even when the neighborhood is small.