Finding an isomorphism between unlabeled graphs

Preliminaries

Problem definition

Given two anonymized versions of a graph $G=(V,E)$, an interesting problem is to find the true mapping between the nodes of these two graphs. Generally, these two anonymized graphs could also have a slightly different set of edges (assume for example that they are the connections of a set of people in two different online social networks)

This problem can be formulated as follows:

Given two graphs, $G_1 = (V,E_1)$ and $G_2 = (V,E_2)$ of size $n=|V|$, a permutation $\pi$ on $V$ maps the vertices from $G_1$ onto those in $G_2$.

Find the “best” such permutation, denoted by $\pi_0$.

Measuring the error in a solution

The cost of a permutation can be expressed in many different ways. One such measure is the edge-inconsistency between two graphs, $G_1$ and $G_2$ [Pedarsani2011]. This represents the number of edges that exist in one graph and not the other, under a matching $\pi$. \begin{equation} \Delta_{\pi} = \sum_{e \in E_1} \mathbb{1}_{\{\pi(e) \not\in E_2\}} + \sum_{e \in E_2} \mathbb{1}_{\{\pi^{-1}(e) \not\in E_1\}} \end{equation} Here $\mathbb{1}_{\{ A \}}$ denotes the indicator function, and $\pi(e)$ denotes the mapped version of edge $e = (u,v)$, i.e. $\pi(e) = (\pi(u), \pi(v))$.

Problem instances

We will try two different ways of generating the input instances. First, we will just anonymize the graph $G$ and we will try to match it to itself. This is equivalent to generating $G_1$ and $G_2$ by starting from $G$ and randomly renaming the nodes in $V$.

Then, in order to simulate the generation of $G_1$ and $G_2$ based on two different social networks, we will start from $G$ and, with some probability $p$, we will remove edges from $E$.

We will examine these two scenarios separately, starting with the easy one.


In [66]:
%matplotlib inline
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import graph_isomorphisms as giso

# This helper function will be used to generate the seed for the randomness, so that we can be consistent
import hashlib
def hash_string(text):
    return int(hashlib.md5(text).hexdigest()[0:7], 16)

giso.setup_plots()

Algorithms

Our algorithms are all based on the same intuition. We will represent each node in the graph as a multi-dimensional point. Each dimension will represent the similarity of the node to some fixed beacon node. For these beacon nodes we assume that we know the correct mapping across the two graphs. This is a reasonable assumption to make, as in most social networks there are some nodes with a high and unique degree.


In [6]:
def find_beacons_max_unique(G, num_of_beacons=3):
    # Return the nodes with maximum unique degrees
    inverse_degree = {}
    for k, v in nx.degree(G).iteritems():
        inverse_degree[v] = inverse_degree.get(v, [])
        inverse_degree[v].append(k)
    degrees = sorted(inverse_degree.keys(),reverse=True)
    multiplicity=[len(inverse_degree[degree]) for degree in degrees]
    unique = [i for i,m in enumerate(multiplicity) if m == 1]
    return [inverse_degree[degrees[i]][0] for i in unique[0:num_of_beacons]]

def find_beacons_sample(G, num_of_beacons=3, seed=hash_string('find_beacons_sample')):
    # Sample a beacon based on its degree
    from numpy.random import RandomState
    prng = RandomState(seed)
    degrees = np.array(nx.degree(G).values())
    return prng.choice(np.arange(len(degrees)), num_of_beacons, p=degrees *1./sum(degrees), replace=False )

def find_beacons_sample_inverse(G, num_of_beacons=3, seed=hash_string('find_beacons_sample')):
    # Sample a beacon based on its degree
    from numpy.random import RandomState
    prng = RandomState(seed)
    degrees = np.array(nx.degree(G).values())
    return prng.choice(np.arange(len(degrees)), num_of_beacons, p=(1./degrees) *1./sum(1./degrees), replace=False )

Generate the source graph


In [191]:
G = nx.generators.barabasi_albert_graph(50,2, seed=hash_string('example graphs'))

Remove edges with some probability

For each edge in $G$, we remove it with probability $p=0.2$. We repeat this process two times and get back $G_1$ and $G_2$.


In [192]:
G1 = giso.noisy_remove(G,0.2,seed=giso.hash_string('noisy removal 2'))
G2 = giso.noisy_remove(G,0.2,seed=giso.hash_string('noisy removal 2'))

Find the beacons

To pick the beacons, we use one of the techniques that we saw above. The one that seemed to be more practical is sampling according to the inverse degree.


In [193]:
beacons_G1 = find_beacons_sample_inverse(G1,4)
beacons_G2 = beacons_G1

Anonymize the graphs by relabeling the non-beacon nodes

After having choosen the beacon nodes, we need to randomly label the rest of the nodes. We pick a random permutation and we label the nodes accordingly.


In [194]:
rest_of_nodes = [x for x in G1.nodes() if x not in beacons_G1]

anonymous_mapping = dict(zip(beacons_G1,beacons_G1))
anonymous_mapping.update(zip(rest_of_nodes,np.random.permutation(rest_of_nodes)))
G2 = nx.relabel_nodes(G2,anonymous_mapping, copy=True)

The end result can be seen below. The beacon nodes are the bigger ones, with the labels. For these nodes, we know their correspondence accross the two graphs.


In [195]:
giso.plot_graphs(G1, G2, beacons_G1, beacons_G2)


Shortest-path distance projection

Now that we have picked the beacon sets $B_{G_1}$ and $B_{G_2}$, each of size $b$, and we know the true mapping $\pi_0(i), i \in B_{G_1}$, we will project each node in the graph to a multi-dimensional space.

The projection of node $v$ in $G_i$ is the vector $p(v) \in \mathcal{R}^{b}$ defined as \begin{equation} p(v) = \left\lbrace d(v, B_{G_i}^{(1)}), \ldots, d(v, B_{G_i}^{(b)}) \right\rbrace \end{equation} where $d(v, B_{G_i}^{(j)})$ is the shortest-path distance between node $v$ and the $j$-th beacon. Notice that when we do this projection, the dimensions are ordered according to the mapping between the beacons.


In [7]:
def shortest_path_project(G, beacons):
    projection = np.zeros((G.number_of_nodes() - len(beacons), len(beacons)))
    
    for i,beacon in enumerate(beacons):
        lengths = nx.shortest_path_length(G, source=beacon)
        node_index = 0
        for node in lengths:
            if node in beacons:
                continue
            projection[node_index][i] = lengths[node]
            node_index += 1           
    return projection

In [197]:
p1 = shortest_path_project(G1,beacons_G1)
p2 = shortest_path_project(G2,beacons_G2)

Projecting to a 2-dimensional space, for visualization purposes


In [198]:
from sklearn.manifold import Isomap
from sklearn.metrics.pairwise import euclidean_distances

Here, we get a sense of how these $b$-dimensional spaces look like. We project their coordinates to a $2$-dimensional space by using the Isomap algorithm. Keep in mind that the darker the marker is, the bigger the number of points that have the same projection.


In [199]:
coords1 = giso.isomap_project(p1)
coords2 = giso.isomap_project(p2)
ax = plt.subplot(121)
ax.plot(coords1[:, 0], coords1[:, 1], 'o', alpha = 0.4, markersize=8, c='#4b809c')
remove_border(ax)
ax = plt.subplot(122)
ax.plot(coords2[:, 0], coords2[:, 1], 'D', alpha = 0.3, markersize=8, c='#ba7c5a')
remove_border(ax)


Run k-means on the $b$-dimensional projection space

A natural question is where the nodes that are close with each other in the high-dimensional space are in the graph. We answer this by running a k-medians algorithm in that space and then showing these clusters back on the graphs, side by side.


In [200]:
from sklearn.cluster import KMeans
def kmeans_plot(X, mds, ax, n_clusters=8, marker='o'):
    est = KMeans(n_clusters=n_clusters)
    est.fit(X-X.mean())
    kmeans_labels = est.labels_
    for k in range(n_clusters):
        my_members = kmeans_labels == k
        ax.plot(mds[my_members, 0], mds[my_members, 1], marker, alpha = 0.6, markersize=8)
    remove_border(ax)
    return kmeans_labels

def kmedians_plot(X, mds, ax, n_clusters=8, marker='o'):
    est = giso.KMedians(k=n_clusters)
    est.fit(X-X.mean())
    kmedians_labels = est.labels_
    for k in range(n_clusters):
        my_members = kmedians_labels == k
        ax.plot(mds[my_members, 0], mds[my_members, 1], marker, alpha = 0.6, markersize=8)
    remove_border(ax)
    return kmedians_labels

In [201]:
import brewer2mpl
from sklearn.cluster import KMeans
accent_colors = brewer2mpl.get_map('Accent', 'qualitative',8).mpl_colors
rcParams['axes.color_cycle'] = accent_colors
f, (ax1,ax2) = plt.subplots(1,2)
kmedians_labels1 = kmedians_plot(p1, coords1, ax1, n_clusters=4)
plt.figure()
kmedians_labels2 = kmedians_plot(p2, coords2, ax2, n_clusters=4, marker='D')


<matplotlib.figure.Figure at 0x279a9b38>

Paint the graphs with the cluster colors

Make the clustering above viewable directly on the graph. The goal is to check if there is a certain structure in the coloring of the nodes that we can discern.


In [202]:
def cluster_colors(G, beacons, kmeans_labels, colors):
    j=0
    clustered_colors=[]
    for i,node in enumerate(G.nodes()):
        if node not in beacons:
            clustered_colors.append(colors[kmeans_labels[j]])
            j = j + 1
        else:
            clustered_colors.append((101./256,171./256,208./256))
    return clustered_colors

giso.paint_clusters(G1,G2,beacons_G1, beacons_G2, kmeans_labels1, kmeans_labels2)


Match nodes from $G_1$ to nodes in $G_2$ by finding the nearest neigbor

To generate a matching, we assign each node from graph $G_1$ to its closest node in graph $G_2$


In [204]:
d= euclidean_distances(p1,p2)
best = {k:v for k,v in giso.find_best_match(d)}

Visualize the matching

In the plot below, we visualize the result of the matching. The x and y axes are the node labels, ordered by the edge centrality of the node. If the cell $(i,j)$ is light green, it means that the node $i \in G_1$ and $j \in G_2$ were used as beacons. If the cell $(i,j)$ is dark green, it means that the method found a correct matching between nodes $i$ and $j$, in $G_1$ and $G_2$ respectively. Finally, a purple square represents a mismatch.


In [221]:
(correct, matching_matrix) = giso.count_matches(G1, G2, beacons_G1, beacons_G2, best, anonymous_mapping)
print "Found:", correct

giso.print_matching(G, matching_matrix)


Found: 5

Number of beacons

The next thing we will check is how the number of beacons affects the recovery of nodes. For a fixed graph, and fixed noisy versions of the graph, we try different values of beacon percentage. Every time we repeat the experiment, we pick a different beacon set and measure how many of the node labels we can correctly recover.


In [74]:
G = nx.generators.barabasi_albert_graph(num_nodes,2)
G1 = giso.noisy_remove(G,0.05)
G2 = giso.noisy_remove(G,0.05)

repeats = 50
num_nodes = 200
beacon_percentages=[0.01,0.02,0.03,0.04,0.05,0.1, 0.15,0.2]
averages = []
for i,beacon_percentage in enumerate(beacon_percentages):
    total = 0
    for tries in xrange(repeats):
        (best, anonymous, beacons_G1, beacons_G2) = giso.match(G1, G2, beacon_percentage)
        (correct, matching_matrix) = giso.count_matches(G1, G2, beacons_G1, beacons_G2, best, anonymous)
        total += correct
    averages.append(total * 1. / repeats)

In [73]:
beacon_percentages=[0.01,0.02,0.03,0.04,0.05,0.1, 0.15,0.2]
fig, ax = plt.subplots()
ax.plot(averages)
ax.set_xticklabels(beacon_percentages)
ax.set_xlabel("Beacon set site (as a percentage)")
ax.set_ylabel("Average number of recovered labels")


Out[73]:
<matplotlib.text.Text at 0x140916d8>

We ran this experiment for a graph of 200 nodes. As expected, the number of recovered labels increases as we add more beacons to the graph. However, when the beacons exceed the 5% of the total graph, the number of recovered nodes drops rapidly. The reason might be that there is a predefined set of "possibly identifiable" nodes, that depends on the structure of the graph, and as we add more beacon nodes, we cover parts of this set as well.

Size of $G$

We tried this experiment for different sizes for $G$, and the trend seems to be similar in all the cases.

Using Effective Resistance as the distance function

As the next step, instead of using the shortest path distance between each node and the beacons, we will use the effective resistance between them.

The effective resistance between two nodes is the resistance of the whole graph when voltage is applied to them. We imply here that we view the network as an electrical circuit, where the edges are unit resistors and the nodes are connection points. Consequently, we are allow to attach a battery at any two nodes in the network, and current will flow from one battery terminal to the other.


In [110]:
def effective_resistance_project(G, beacons):
    from numpy.linalg import pinv
    projection = np.zeros((G.number_of_nodes() - len(beacons), len(beacons)))
    L = nx.laplacian_matrix(G)
    B = nx.incidence_matrix(G).T
    B_e = B.copy()
    
    L_pseudo = pinv(L)
    for i in xrange(B.shape[0]):
        min_ace = np.min(np.where(B[i,:] ==1)[1])
        B_e[i, min_ace] = -1
    
    for i,beacon in enumerate(beacons):
        node_index = 0
        for j,node in enumerate(G.nodes()):
            if node in beacons:
                continue
                
            battery = np.zeros((B_e.shape[1],1))
            battery[i] = 1
            battery[node_index] = -1

            p = L_pseudo * battery
            projection[node_index][i] = abs(p[i] - p[j])
            node_index += 1 
    return projection

For a random graph $G$, we plot the nodes by using the effective resistance projection above. We notice that, even though the two graphs are noisy versions of each other, the geometry of the points is very similar in both plots.


In [108]:
G = nx.generators.barabasi_albert_graph(50,2)
G1 = giso.noisy_remove(G,0.05)
G2 = giso.noisy_remove(G,0.05)
beacons = giso.find_beacons_sample_inverse(G, num_of_beacons=int(0.05 * G.number_of_nodes()))

In [117]:
ef_project1  = effective_resistance_project(G1,beacons)
ef_project2 = effective_resistance_project(G2,beacons)


(48L, 2L)
(48L, 2L)

In [120]:
coords1 = ef_project1
coords2 = ef_project2
ax = plt.subplot(121)
ax.plot(coords1[:, 0], coords1[:, 1], 'o', alpha = 0.4, markersize=8, c='#4b809c')
giso.remove_border(ax)
ax = plt.subplot(122)
ax.plot(coords2[:, 0], coords2[:, 1], 'D', alpha = 0.3, markersize=8, c='#ba7c5a')
giso.remove_border(ax)


The effect of the number of beacons when using the effective resistance projection

Here, we will repeat the experiment we had above, but we will now use the effective resistance as the similarity function between two nodes.


In [129]:
repeats = 50
num_nodes = 200
G = nx.generators.barabasi_albert_graph(num_nodes,2)
G1 = giso.noisy_remove(G,0.05)
G2 = giso.noisy_remove(G,0.05)

beacon_percentages=[0.01,0.02,0.03,0.04,0.05,0.1, 0.15,0.2]
averages = []
for i,beacon_percentage in enumerate(beacon_percentages):
    total = 0
    for tries in xrange(repeats):
        (best, anonymous, beacons_G1, beacons_G2) = giso.match(G1, G2, beacon_percentage, electrical=True)
        (correct, matching_matrix) = giso.count_matches(G1, G2, beacons_G1, beacons_G2, best, anonymous)
        total += correct
    averages.append(total * 1. / repeats)

In [130]:
beacon_percentages=[0.01,0.02,0.03,0.04,0.05,0.1, 0.15,0.2]
fig, ax = plt.subplots()
ax.plot(averages)
ax.set_xticklabels(beacon_percentages)
ax.set_xlabel("Beacon set site (as a percentage)")
ax.set_ylabel("Average number of recovered labels")


Out[130]:
<matplotlib.text.Text at 0x15b317f0>

As it seems, although the geometry of the projection looks very promising, the nearest neighbor distance fails to identify the correct isomorphism in the presence of noise. The reason for this might be that the projection is very dense, i.e. there are many neighbors to each node even when the neighborhood is small.

References

Pedarsani, Pedram, and Matthias Grossglauser. "On the privacy of anonymized networks." Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011.