Notes

Permutation matrices and graphs

  • $P$ is obtained by permuting the rows of an identity matrix; there are $N!$ possible permutation matrices of size $N$. $PA$ moves the $i^{th}$ row of $A$ to the $\pi(i)^{th}$ row of $PA$; $AP$ moves the $i^{th}$ column of $A$ to the $\pi(i)^{th}$ column of $AP$.
  • $PP^T = P^TP = I$ so $P^T = P^{-1}$
  • $Me_j$ selects the $j^{th}$ column of $M$; $e^T_iM$ selects the $i^{th}$ row of $M$; $e^T_iMe_j$ selects the entry in the $i^{th}$ row and $j^{th}$ column, i.e. $M_{ij}$
  • let $A_1$ and $A_2$ be the adjacency matrices of two isomorphic graphs related by the permutation $\pi_A$. Edge $(i,j)$ in $A_1$ corresponds to $(\pi_A(i),\pi_A(j))$ in $A_2$, so $$(A_2)_{(\pi_A(i),\pi_A(j))} = e^T_iA_1e_j$$ $$(Pe_i)^TA_2(Pe_j) = e^T_{i}A_1e_{j}$$
  • more generally, $A_2 = PA_1P^T$, which is equivalent to $A_2P=PA_1$
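A quick numerical sanity check of the identities above; a minimal sketch, with an arbitrary 4-node graph and permutation:

```python
import numpy as np

# Hypothetical permutation pi on 4 nodes: node i in G1 becomes pi[i] in G2.
pi = [2, 0, 3, 1]
n = len(pi)

# Permutation matrix with P e_i = e_{pi(i)}, i.e. column i of P is e_{pi(i)}.
P = np.zeros((n, n), dtype=int)
for i in range(n):
    P[pi[i], i] = 1

# P is orthogonal: P P^T = P^T P = I, so P^T = P^{-1}.
assert np.array_equal(P @ P.T, np.eye(n, dtype=int))

# Adjacency matrix of a small undirected graph G1 (a 4-cycle).
A1 = np.array([[0, 1, 0, 1],
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [1, 0, 1, 0]])

# Relabelling: A2 = P A1 P^T sends edge (i, j) to (pi(i), pi(j)).
A2 = P @ A1 @ P.T
assert all(A2[pi[i], pi[j]] == A1[i, j] for i in range(n) for j in range(n))
```

The convention here is $Pe_i = e_{\pi(i)}$ (column $i$ of $P$ is $e_{\pi(i)}$), matching the row-permutation behaviour described above.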

QAP formulation of network alignment problem

  • network alignment problem: $$ \arg\min_{P \in \Pi} ||A-PBP^T||^2_2 = \arg\min_{P \in \Pi} ||AP-PB||^2_2 $$ where $\Pi$ is the set of permutation matrices
  • norm and trace: $$ ||A||^2_2 = \sum_{i,j} A_{ij}A_{ij} = \sum_i \sum_j A_{ij}A^T_{ji} = \sum_i (AA^T)_{ii} = Tr(AA^T)$$
  • trace invariant under cyclic permutation: $$ Tr(AB) = \sum_i (AB)_{ii} = \sum_i \sum_j A_{ij}B_{ji} = \sum_j \sum_i B_{ji} A_{ij} = \sum_j (BA)_{jj} = Tr(BA) $$
  • trace and transpose: $$ Tr(A) = \sum_i A_{ii} = \sum_i A^T_{ii} = Tr(A^T) $$
  • QAP formulation using cycle+transpose: $$||AP-PB||^2_2 = Tr((AP-PB)^T(AP-PB)) = Tr(P^TA^TAP-P^TA^TPB-B^TP^TAP+B^TP^TPB)$$ $$=Tr(A^TA+B^TB) - Tr(P^TA^TPB+B^TP^TAP) = Tr(A^TA+B^TB) - Tr((B^TP^TAP)^T+B^TP^TAP) = Tr(A^TA+B^TB) - 2Tr(B^TP^TAP) $$
  • Therefore: $$\arg\min_{P \in \Pi} ||AP-PB||^2_2 = \arg\min_{P \in \Pi} Tr(A^TA+B^TB) - 2Tr(B^TP^TAP) = \arg\max_{P \in \Pi} Tr(B^TP^TAP) = \arg\max_{P \in \Pi} Tr(A^TPBP^T) $$
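The trace identities above are easy to verify numerically; a sketch with arbitrary random symmetric 0/1 matrices standing in for adjacency matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

def random_sym01(rng, n):
    """Random symmetric 0/1 matrix with zero diagonal (toy adjacency)."""
    M = np.triu(rng.integers(0, 2, (n, n)), 1)
    return M + M.T

A, B = random_sym01(rng, n), random_sym01(rng, n)
P = np.eye(n, dtype=int)[rng.permutation(n)]   # random permutation matrix

# ||A - P B P^T||_F^2 == ||AP - PB||_F^2 (right-multiplying by orthogonal P).
lhs = np.linalg.norm(A - P @ B @ P.T) ** 2
mid = np.linalg.norm(A @ P - P @ B) ** 2
# The QAP trace form: Tr(A^T A + B^T B) - 2 Tr(B^T P^T A P).
rhs = np.trace(A.T @ A + B.T @ B) - 2 * np.trace(B.T @ P.T @ A @ P)
assert np.isclose(lhs, mid) and np.isclose(mid, rhs)
```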

PGM papers

Summary

  • 1.2: first paper to introduce the network de-anonymization problem as a "graph matching with side information" problem. algorithm has 2 steps: find a clique-like seed set, then use threshold percolation to incrementally grow the matching from the seeds; the specific algorithm is purely heuristic
  • 1.3: formalizes the graph matching problem using the random bigraph model $G(n,p,s)$; shows that nodes can be identified when the mean degree grows slightly faster than $\log(n)$ and that the identity matching minimizes a mismatch cost function
  • 1.4 analysis of phase transition in seed size of PGM model using bootstrap percolation theory
  • 1.5: extension of 1.3; for 2 correlated random graphs with partial overlap, it introduces a cost function and shows that the identity matching minimizes it for some value of $\alpha$ (the no-match penalty)
  • 1.6: introduces a matching algorithm for PA networks; the algorithm exploits the degree distribution and percolates in phases, bucketing by degree and percolating over higher-degree nodes first
  • 1.7 extends PGM algorithm by reducing seed size requirement for percolation, but at the cost of some loss in precision
  • 1.8: similar to the Korula paper; uses degree-based graph slices and percolation to show that if seeds can be selected, the number of seeds required drops to $O(n^{\epsilon})$
  • 1.9: extends PGM to directed networks; uses a threshold to filter candidate pairs and selects the pair with max similarity score (based on edge directionality and {in,out}-degree)

Strong assumptions

  • edge sampling is independent and controlled by a single parameter $s$; one way to generalize the analysis is to use separate parameters $s_1$, $s_2$
  • the auxiliary network is uniformly sampled from the graph; isn't it more realistic to model the secondary network as an edge-sampled subgraph? maybe this is more practical for motif-based percolation

Extensions

  • seed selection problem: how to best select seeds in the network to maximize percolation
  • anonymization problem: how to best anonymize (alter) the network to make it difficult to deanonymize but at the same time avoid altering fundamental properties of the network significantly
  • can we reduce percolation to an influence maximization problem on the product graph?
  • possible applications to network model fitting?

Deanonymizing social networks (Narayanan, 2009)

  • first paper to demonstrate feasibility of large-scale, passive de-anonymization of real-world social networks.
  • Our de-anonymization algorithm is based purely on the network topology, does not require creation of a large number of dummy "sybil" nodes, and is robust to noise. In social networks, too, user anonymity has been used as the answer to all privacy concerns
  • Backstrom et al.: active attacks to de-anonymize users in a network are impractical but can be used to find an initial set of seeds; collusion-based attacks using a small set of colluding nodes can only de-anonymize nodes in their neighborhood
  • Zhou and Pei: anonymization algorithm for when the adversary knows the 1-hop neighborhood of a node: edge additions to make the neighborhood of node $u$ isomorphic to $k-1$ other neighborhoods (k-anonymity)
  • We argue that the auxiliary information which is likely to be available to the attacker is global in nature (e.g., another social network with partially overlapping membership); re-identification of some nodes provides the attacker with more auxiliary information, which is then used for further re-identification
  • In terms of entropy, most of the information in the released graph resides in the edges, and this is what our de-anonymization algorithm will exploit.
  • network owners release anonymized and possibly sanitized network graphs to commercial partners and academic researchers; can sensitive information about specific individuals be extracted from anonymized social-network graphs?
  • We assume that in addition to the anonymized, sanitized target network S_san, the attacker also has access to a different network S_aux as a graph G_aux = (V_aux, E_aux) whose membership partially overlaps with S.
  • We also assume that the attacker possesses detailed information about a very small number of members of the target network S (in our experiments, we find that between 30 and 150 seeds are sufficient for networks with 10^5 members)
  • To keep our model independent of the semantics of a particular network, we treat the privacy policy as a syntactic, exogenous labeling that specifies for every node attribute, edge, and edge attribute whether it should be public or private (PP: X ∪ Y × E -> {public, private})
  • task: node re-identification using G_san and G_aux; only a fraction of nodes are mapped since the overlap between the 2 graphs can be small
  • algorithm takes as input S_san and S_aux and outputs probabilistic mapping of nodes from V_aux to (V_san U Null). Privacy of a node u is compromised if the attacker assigns a high probability to the true private attribute value
  • Since mapping $\hat{\mu}$ is probabilistic, a concrete mapping $\mu(u)$ is sampled. Then, the success of the de-anonymization is $$ \frac{\sum_{v\in V_{mapped}} \Pr(\mu(v)=\mu_G(v))\rho(v)}{\sum_{v\in V_{mapped}}\rho(v)} $$ where $\rho(v)$ is the weight of node $v$ (degree centrality or 1)
  • Algorithm:
    • 2 steps: seed identification and propagation. A seed is a clique of $k$ nodes present in both graphs. The brute-force seed algorithm takes as input the target graph, $k$ seed nodes in the auxiliary graph, $k$ degree values, $\binom{k}{2}$ common-neighbor counts, and an error parameter $\epsilon$, and searches the target graph for a $k$-clique with matching node degrees and common-neighbor counts.
    • Propagation: given 2 graphs and a partial seed mapping, output a 1-to-1 mapping using network structure. At every iteration, pick an unmapped node $u$ in $V_1$. For every node $v$ in $V_2$, assign a score equal to the number of neighbors of $u$ that have been mapped to neighbors of $v$. If the score exceeds a threshold for some node in $V_2$, add the mapping to the list. Mapping $(u,v)$ is only retained if $v$ gets mapped back to $u$ when the input graphs are switched. Threshold is based on eccentricity.
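A minimal sketch of one propagation pass, simplified relative to the paper: it omits the eccentricity-based threshold and the reverse-match check, and the adjacency-dict representation and toy graph are illustrative stand-ins:

```python
from collections import defaultdict

def propagation_step(adj1, adj2, mapping, threshold=2):
    """For each unmapped u in G1, score candidates v in G2 by the number of
    u's neighbours already mapped to neighbours of v; accept the best v if
    its score clears the threshold and v is still unclaimed."""
    new_pairs = {}
    claimed = set(mapping.values())
    for u in adj1:
        if u in mapping:
            continue
        scores = defaultdict(int)
        for w in adj1[u]:                   # mapped neighbours of u ...
            if w in mapping:
                for v in adj2[mapping[w]]:  # ... vote for neighbours of their images
                    if v not in claimed:
                        scores[v] += 1
        if scores:
            v, s = max(scores.items(), key=lambda kv: kv[1])
            if s >= threshold:
                new_pairs[u] = v
    return new_pairs

# Toy example: two copies of the same graph, seeds {0: 0, 1: 1}.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(propagation_step(adj, adj, {0: 0, 1: 1}))   # node 2 gets matched
```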

On the privacy of anonymized networks (Pedarsani, Grossglauser, 2011)

  • can we assume that a sufficiently sparse network is inherently anonymous? paper proves that if mean degree grows slightly faster than log(n) then nodes can be identified
  • paper assumes that attacker has no side information about nodes in the network, other than network structure
  • n! possible mappings; choose mapping that minimized error function $$\Delta_{\pi} = \sum_{e \in E_1} I[\pi(e) \notin E_2] + \sum_{e \in E_2} I[\pi^{-1}(e) \notin E_1]$$
  • if edge sampling probability s beyond threshold, then correct mapping minimizes $\Delta_{\pi}$
  • limitations: theorem only holds for the G(n,p,s) model and if p is low enough; no method to find the correct mapping is provided
  • experiments: verify the assumption of edge-sampling independence and edge overlap in real networks
  • proof sketch
    • if $s=\omega(1/n)$ and $p \to 0$ with the mean degree growing slightly faster than $\log(n)$, then the identity mapping has the lowest error and almost surely yields a perfect matching; shown by proving that the probability that the error of a random mapping is less than that of the identity mapping approaches 0
    • $S_k$: number of order-$k$ permutations that have less error than identity; $E[S]$ is the expected number of permutations for which the error is $\leq$ that of identity. If $E[S]$ approaches 0, then the identity mapping has the lowest error (non-negative RV)

On the performance of PGM (Yartseva, Grossglauser, 2013)

  • the performance of PGM algorithms is sensitive to seed size. this paper rigorously analyzes a common PGM algorithm using bootstrap percolation theory
  • algorithm: add (i,j) to mapped pairs if it has at least $r$ neighboring mapped pairs. "canonical base algorithm for PGM"
  • contribution: derive phase transition of seed set size for matching as a function of threshold $r$ and network parameters of ER-like $G(n,p,s)$ model
  • current work: limited understanding of seed set size, heuristic-based, not scalable; in privacy-related research the "attacker" is allowed to alter the network before de-anonymization; attributes + structure used
  • $G(n,p,s)$: generate ER graph using n and p. Create two copies of ER graph and delete each edge with probability $1-s$.
  • (theory paper: Janson et al, Bootstrap percolation theory)
    • bootstrap percolation: systems in which node is a part of the clusters if it has at least $r$ neighbors in the cluster; captures spread of influence
    • theorem: for $r \geq 2$ and $n^{-1} < p < n^{-\frac{1}{r}}$, if the ratio of seed set size to critical value $\frac{a_0}{a_c} > 1$, then the final percolation size is $a^* = n-o(n)$ w.h.p., where $$a_c = (1-\frac{1}{r})(\frac{(r-1)!}{np^r})^{\frac{1}{r-1}}$$
  • this paper extends the theorem to graph matching, where percolation is defined over pairs of nodes of two graphs. need to show correctness and that percolation touches most nodes.
  • input: graphs $G_1, G_2$ have $n$ nodes. and input seed set $A_0$ of correct matches (i,i)
  • propagation: at time $t$, use one unused mapped pair $(i,j)$ and add one to the score $M_{(i',j')}$ of each tuple $(i',j') \in N(i) \times N(j)$. If the score of any tuple reaches $r$, add it to $A$. Break ties by picking at random. Stops at time $T=\min\{t \geq 0: A(t) \setminus Z(t) = \emptyset\}$
  • tradeoff: $r$ low -> false matches, $r$ high -> percolation stops early
  • limitation: at every time t, only one unused mapped pair is used to spread the marks. If multiple mapped pairs are used at once, the most "obvious" correct pairs will receive a higher score, avoiding matching errors. (deferred matching variant)
    • use all matched pairs to spread marks. Once $A-Z$ is empty, pick candidate pair with maximum score of all candidates if max. score is greater than threshold
    • variant decreases error rate but has the same critical value seed set size
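The canonical algorithm can be sketched as follows; a simplification in that conflicting candidate pairs are silently skipped once either side is matched, and the adjacency-dict representation is an assumption:

```python
import random
from collections import defaultdict

def pgm(adj1, adj2, seeds, r=2, rng=None):
    """Canonical PGM sketch: repeatedly take an unused matched pair (i, j)
    and add one mark to every neighbouring pair (i', j') in N(i) x N(j);
    a pair is matched once it collects r marks."""
    rng = rng or random.Random(0)
    matched = dict(seeds)                  # left node -> right node
    used_right = set(matched.values())
    unused = list(matched.items())
    marks = defaultdict(int)
    while unused:                          # stop when no unused pairs remain
        i, j = unused.pop(rng.randrange(len(unused)))
        for ip in adj1[i]:
            for jp in adj2[j]:
                if ip in matched or jp in used_right:
                    continue
                marks[(ip, jp)] += 1
                if marks[(ip, jp)] >= r:
                    matched[ip] = jp
                    used_right.add(jp)
                    unused.append((ip, jp))
    return matched

# Two copies of a small triangle-rich graph; seeds {0: 0, 1: 1}, r = 2.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
print(pgm(adj, adj, {0: 0, 1: 1}))
```

On this toy input the percolation recovers the full identity mapping; with only one seed it cannot produce $r=2$ marks and stops immediately.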
  • theoretical results and proof sketch
    • Janson percolation seed size cannot directly be applied because the probability of a wrong pair receiving a mark is not independent. However, the probability of a correct pair receiving a mark is independent w.r.t other candidate pairs at time t.
    • using correct pair independence and $ps^2 > (ps)^2$, subset of $(i,i)$ pairs can be analyzed using bootstrap percolation framework
    • define $X$ to be the event that a wrong pair $(i,j)$ gets $r$ marks before $(i,i)$ or $(j,j)$. we can show that $P(X) \rightarrow 0$, so the matching algorithm makes no error w.h.p.: if $p < n^{-\frac{3}{r}-\epsilon}/\log(n)$ and $r \geq 4$, then $P(X)$ goes to 0.
    • algorithm percolates: restrict percolation process to correct pairs only (independence). The Janson results can be applied to $G(n, ps^2)$ directly; final set size $n-o(n)$
    • seed set size increases with $r$. $ps^2 > \frac{1}{n}$ because of the ER connected-component phase transition. results only hold for sparse graphs $ps^2 < n^{-\frac{3}{r}-\epsilon}$ because otherwise the ratio $\frac{1}{p}$ between the probabilities of generating correct and wrong marks is not large enough
    • denser graphs (as $p,s$ increase) require fewer seeds
  • simulations validate the critical value; experiments on real networks show less error when the graphs are more correlated, and more coverage with many seeds; conjecture: percolation works well because of the compactness of these graphs: every node is close to some seeds, which allows one to "triangulate" the nodes; counterexample: random geometric model, high error rate, low coverage

When can two unlabeled networks be aligned under partial overlap? (Kazemi, Grossglauser, 2015)

  • extension of "privacy in anonymized networks"; looks into matching of 2 graphs that partially overlap; defines the G(n,p; t,s) model and a cost function to minimize, and gives sufficient parameter conditions for perfect matchability (information-theoretic constraints w.r.t. graph properties; computational aspects not considered)
  • partial mappings on a subset of the cartesian product of V1, V2; any node in V1, V2 is matched to at most one node in the other graph; let V0 be the set of overlapping nodes and pi0 the identity map; node pairs can either be matched ([i,i]), mismatched ([i,j]) or unmatched (not in G or corresponding node not in G')
  • define prod(l,k) as the set of all matching with l correct matches and k wrong matches
  • PHI: mismatched edges: number of edges in E1 mapped to a non-edge in E2 + edges in E2 mapped to a non-edge in E1
  • U: unmatched edges: number of unmatched edges in E1 + in E2
  • cost function Delta(pi) = PHI(pi) + alpha*U(pi); alpha is a parameter; e.g. if alpha > 1 then optimal matching is always of largest possible size since net increase per node pair addition
  • theorem: if $\log(n)/(n s^3 t^2) \ll p \ll 1$, then for some $\alpha$ the optimal matching that minimizes the cost function is the identity mapping
  • misc: G_0 (overlap induced subgraph) is sufficient to infer optimal identity because t,s are iid so knowing G1-G0 and G0-G1 information does not tell anything about G1<->G0 matching
  • the "on the privacy of networks" paper finds a matching from the set of all matchings on a common V; in this case, because of partial overlap, min(n1,n2) is only an upper bound on the matching size, so the set of matchings is much larger
  • if alpha=1, then error function is basically edit distance: fix mismatch by addition/deletion
  • proof sketch: let S = number of matchings with error less than that of the identity matching; P[S>=1] <= E[S] (Markov inequality), then show that E[S] approaches 0 (concentration bound on a sum of Bernoullis; then stochastically bound the difference)
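A literal sketch of the cost function under one reading of the definitions above (Φ counts matched edges landing on non-edges, U counts edges with an unmatched endpoint); edges are sets of `frozenset` pairs:

```python
def delta(E1, E2, pi, alpha=1.0):
    """Delta(pi) = Phi(pi) + alpha * U(pi) for a partial matching pi
    (dict V1 -> V2). Phi: edges whose image/preimage is a non-edge;
    U: edges with at least one unmatched endpoint."""
    inv = {v: u for u, v in pi.items()}
    phi = unmatched = 0
    for a, b in E1:
        if a in pi and b in pi:
            phi += frozenset((pi[a], pi[b])) not in E2
        else:
            unmatched += 1
    for a, b in E2:
        if a in inv and b in inv:
            phi += frozenset((inv[a], inv[b])) not in E1
        else:
            unmatched += 1
    return phi + alpha * unmatched

# Toy path graphs 0-1-2 in both G1 and G2.
E1 = {frozenset(e) for e in [(0, 1), (1, 2)]}
E2 = {frozenset(e) for e in [(0, 1), (1, 2)]}
print(delta(E1, E2, {0: 0, 1: 1, 2: 2}))   # identity: zero cost
print(delta(E1, E2, {0: 0, 1: 2, 2: 1}))   # swapping 1 and 2 costs mismatches
```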

An Efficient reconciliation algorithm for social networks (Korula, Lattanzi, 2014)

  • theoretically analyzes a percolation-based algorithm (with provable guarantees) to match networks on ER and PA graphs using an initial seed set of correct matches
  • OSNs are subsets of "real" social networks; graph matching to combine all into 1
  • easier than "generalized graph isomorphism" because OSNs have specific network structure (degree distribution etc); graphs are not identical, i.e. the correct matching is defined over $V_0 = V_1 \bigcap V_2$; domain-specific attributes could be added to improve the algorithm but are not considered
  • feature-based ML models are not robust to malicious users that can "copy" features of real nodes by creating fake profiles; trusted seed set + percolation makes algorithm robust
  • similar to the Narayanan-Shmatikov paper; differences: theory, scalability, use of degree bucketing + common-neighbor threshold
  • graph model: real graph $G(V,E)$ is generated using preferential attachment. Graphs $G_{1,2}$ are generated by selecting each edge in $E$ with prob. $s_1, s_2$; without seed nodes, if $s_1=s_2=1$, then problem is equivalent to graph isomorphism. With probability $l$, nodes explicitly linked across $G_1$ and $G_2$
  • user-matching algorithm
    • input $G_1, G_2, L, D, T, k$, L is seed set, D is max degree, T is threshold, k is number of iterations; outputs large set of matches
    • at iteration $i$, consider all pairs of nodes $(u,v) \in V_1 \times V_2$ s.t. $d_u, d_v > \frac{D}{2^i}$. compute number of neighboring matched pairs (similarity witnesses) for each pair. Add $(u,v)$ to set if score exceeds $T$ and is the maximum score in which $u$ or $v$ appear
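The iteration above, as a runnable sketch; recomputing all similarity witnesses from scratch each phase and the tie handling are simplifications of the paper's algorithm:

```python
from collections import defaultdict

def reconcile(adj1, adj2, seeds, D, T, k):
    """Sketch of the user-matching iteration: in phase i only nodes with
    degree above D / 2**i are candidates; a pair is accepted if its
    similarity-witness count is >= T and is the best score in which either
    endpoint appears."""
    M = dict(seeds)
    images = set(M.values())
    for i in range(1, k + 1):
        cutoff = D / 2 ** i
        scores = defaultdict(int)
        for u1, u2 in M.items():             # each matched pair is a witness
            for v1 in adj1[u1]:
                for v2 in adj2[u2]:
                    if (v1 not in M and v2 not in images
                            and len(adj1[v1]) > cutoff and len(adj2[v2]) > cutoff):
                        scores[(v1, v2)] += 1
        best = defaultdict(int)              # best score touching each node
        for (v1, v2), s in scores.items():
            best['L', v1] = max(best['L', v1], s)
            best['R', v2] = max(best['R', v2], s)
        for (v1, v2), s in sorted(scores.items(), key=lambda kv: -kv[1]):
            if (s >= T and v1 not in M and v2 not in images
                    and s == best['L', v1] == best['R', v2]):
                M[v1] = v2
                images.add(v2)
    return M

# Two copies of a small graph; high-degree nodes get matched in early phases.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
print(reconcile(adj, adj, {0: 0, 1: 1}, D=4, T=2, k=3))
```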
  • correctness proof for ER graphs
    • assume $p < \frac{1}{c}$ (most networks have lower edge density) and $nps > (1-\epsilon) log(n)$ (for $G_{1,2}$ to be connected)
    • expected number of similarity witnesses for correct pairs $(u_i, v_i)$ is $(n-1)lps^2$ and for wrong pairs $(n-2)lp^2s^2$.
    • if $p$ is greater than $p_d$, chernoff bound used to show the gap between the expected number of similarity witnesses for correct and wrong pairs. first phase is enough to distinguish correct copy and possible incorrect matches
    • if $p$ is less than $p_d$, then w.h.p. no wrong pair has more than 2 similarity witnesses. So, $T=3$ suffices.
  • correctness proof sketch for PA graphs
    • power-law degree distribution complicates the proof. for high-degree nodes, neighborhoods are different enough (high-degree early birds) to use a Chernoff bound. for medium-degree nodes, any wrong pair is unlikely to have more than 8 neighbors in common, so $T=9$ is used. high-degree nodes identified initially (degree bucketing) help identify lower-degree nodes.
  • note on experiments: extremely high link probability $l=0.1$ used to evaluate real networks; robust to strong attacks; matches high-degree nodes better than low-degree nodes; degree bucketing improves precision and percolation coverage/recall

Growing a graph matching from a handful of seeds (Kazemi, Grossglauser, 2015)

contribution

  • new matching algorithm that requires fewer seeds
  • very robust to incorrect seed pairs, at the cost of a few matching errors
  • In summary, we manage to trade off a very significant reduction in the seed set size with a fairly benign increase in the error rate

graph matching (1)

  • generalization of graph isomorphism, relies only on link structure
  • matching: bijection between full or partial vertex sets of the 2 graphs using only link structure
  • applications: security de-anonymization, protein network alignment and computer vision
  • most approaches use percolation theory, start from (very few) matched seeds and infect neighborhoods until most matched
  • existing algorithms very sensitive to size and correctness of seed set
  • The matching is generated incrementally, starting from the seed pairs and percolating to other node pairs; for this reason, we refer to this class of algorithms as Percolation Graph Matching (PGM) methods.
  • strong sensitivity to seed size: if too small, the percolation does not take off; when increased, abrupt change to a supercritical regime, where the algorithm succeeds in de-anonymizing a large fraction of the network
  • In every step, the set of node pairs matched so far are used as evidence to match an additional node pair, if possible.

main idea

  • there is a tradeoff between accuracy/correctness of matching and the number of matches made by PGM. One bad match can have a cascading effect since new matches are found by using matched pairs. On the other hand, a strong evidence threshold for matching a pair can restrict the percolation.
  • New algo breaks the tradeoff by decoupling the decision of matching a candidate pair based on evidence and deciding on whether or not to use a candidate pair as evidence. This allows the algo to match more pairs without incorrectly matching because of previous wrong matches
    • requires fewer seed matches and more robust to errors

notation

  • A pair [i′, j′] ∈ V1 ×V2 is a neighbour of another pair [i, j] if (i, i′) ∈ E1 and (j, j′) ∈ E2. In the description of the matching algorithms, we refer to a pair [i,j] spreading out marks as adding one mark to each neighbouring pair of [i,j]. The score of a pair is defined as the number of marks it received from other pairs so far.
  • the hidden correct mapping between the nodes in V0 is the identity mapping
  • Let Λ(S) denote the number of correct pairs in a set S of pairs, and let Ψ(S) represent the number of wrong pairs

PercolatedMatched (related work)

  • at each time step τ, the algorithm picks an unused pair from the set A and spreads marks to all of its neighbouring pairs; as soon as a pair obtains at least r marks, i.e., it is a neighbour of at least r used pairs in the set A, it is added to the set A; the algorithm stops when there exist no further unused pairs in the set A.

Correlated Random Graph Model (synthetic network for analysis)

  • G(n,p; t,s) first generates G(n,p) ER to then generate G1 and G2 with partially overlapping vertex sets
  • filter vertex set of G with probability t for G1 (V1) and G2 (V2)
  • for every pair of nodes in V1 (resp. V2) that has an edge in G, sample that edge into G1 (resp. G2) independently with probability s
  • goal: identify mapping between nodes in intersection of vertex sets
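A sketch of sampling from this model (naive $O(n^2)$ edge generation; the representation choices are mine):

```python
import random

def correlated_pair(n, p, t, s, rng=None):
    """Sample from the G(n,p; t,s) model: one ER parent graph, two vertex
    subsets each kept with probability t, and each parent edge re-sampled
    independently with probability s into each copy."""
    rng = rng or random.Random(0)
    # Parent ER graph G(n, p); edges stored as frozensets.
    E = {frozenset((i, j)) for i in range(n) for j in range(i + 1, n)
         if rng.random() < p}
    # Overlapping vertex sets: each node survives into each copy w.p. t.
    V1 = {v for v in range(n) if rng.random() < t}
    V2 = {v for v in range(n) if rng.random() < t}
    # Edge sampling: keep a parent edge if both endpoints survived and it
    # passes an independent coin flip with probability s.
    E1 = {e for e in E if e <= V1 and rng.random() < s}
    E2 = {e for e in E if e <= V2 and rng.random() < s}
    return (V1, E1), (V2, E2)

# With t = s = 1 the two copies coincide with the parent graph.
(V1, E1), (V2, E2) = correlated_pair(30, 0.3, 1.0, 1.0)
assert V1 == V2 == set(range(30)) and E1 == E2
```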

NoisySeeds algorithm

  • starts with an initial noisy seed set A0, i.e., a set with possibly many wrong pairs
  • spread marks from pairs in A0 and add A0 pairs to Z used pairs set
  • If there is any pair with score at least r, then we add this pair to the matched set M.
  • Each time a pair [i,j] ∈ M \ Z is chosen randomly, it spreads marks to its neighbouring pairs and is added to Z
  • As the pair [i,j] is in the matching M, any other pair in the form of [i,j′] or [i′,j] is not a candidate for matching any longer and is permanently removed

expand once algorithm

  • expands seed set to a larger seed set and then starts PGM
  • algorithm
    • for all pairs [i,j] in initial seed set A0
      • for all neighboring pairs [i', j'] of [i,j]
        • if size of A0' < max_size and no pair conflict
          • add [i', j'] to A0'
    • return A0'
  • ExpandOnce adds a lot of pairs to the initial set, which improves the chance of proper percolation but at the cost of adding many incorrectly matched pairs
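The pseudocode above as a runnable sketch, where "pair conflict" is read as sharing a left or right node with an already chosen pair (my interpretation):

```python
def expand_once(adj1, adj2, A0, max_size):
    """ExpandOnce sketch: grow the noisy seed set with every neighbouring
    pair of a seed pair, skipping pairs that conflict on either side and
    stopping once max_size pairs have been collected."""
    out = set(A0)
    left = {i for i, _ in out}
    right = {j for _, j in out}
    for i, j in list(A0):
        for ip in adj1[i]:
            for jp in adj2[j]:
                if len(out) >= max_size:
                    return out
                if ip not in left and jp not in right:
                    out.add((ip, jp))
                    left.add(ip)
                    right.add(jp)
    return out

# Toy triangle in both graphs; one seed pair expands to three pairs.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(expand_once(adj, adj, {(0, 0)}, max_size=10))
```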

intuition behind robustness: let the known matched pair be [i,j] and a neighbour pair be [i', j'] such that (i,i') \in E1 and (j,j') \in E2.

  • suppose [i,j] match is correct. Then the neighborhoods of i and j are "correct"/the same too. The probability that candidate [i',j'] is marked is p*s^2.
    • p because they are ultimately the same edge generated once in the original network
    • s^2 because they are in G1 and G2.
  • suppose [i,j] match is wrong or that [i,j] is correct but [i', j'] match is wrong. In either case, the probability that [i',j'] is marked is p^2*s^2.
    • p^2 because the 2 edges actually are mismatched and different. since there are 2 in G, p^2
    • s^2 same reason, in G1 and G2

So, the probability of a correct pair being marked once is much larger than that of an incorrect pair. P(marked twice): p^2s^4 vs. p^4s^4. Only a few wrong pairs receive 2 or more marks: O(n1^2 n2^2 p^4 s^4)
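A quick Monte Carlo check of the two marking probabilities above (parameter values arbitrary):

```python
import random

def mark_probs(trials=200000, p=0.3, s=0.7, rng=None):
    """Estimate the one-mark probabilities: a correct neighbour pair needs
    one parent edge surviving in both copies (p * s^2); a wrong pair needs
    two distinct parent edges, one surviving in each copy (p^2 * s^2)."""
    rng = rng or random.Random(1)
    correct = wrong = 0
    for _ in range(trials):
        e = rng.random() < p      # parent edge (i, i') exists
        f = rng.random() < p      # second parent edge (j, j') exists
        in1 = rng.random() < s    # first edge survives into G1
        in2 = rng.random() < s    # second edge survives into G2
        correct += e and in1 and in2          # same parent edge, both copies
        wrong += e and f and in1 and in2      # two different parent edges
    return correct / trials, wrong / trials

print(mark_probs())   # roughly (p * s^2, p^2 * s^2) = (0.147, 0.0441)
```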

expand-when-stuck heuristic to improve performance on real networks

  • The robustness arguments of Section 4 allow PGM algorithms to be much more aggressive in spreading out marks
  • A main feature: expand the seed set by many noisy candidate pairs whenever there are no other unused matched pairs
  • Whenever there are no further pairs with score at least two, we add all the unused, unmatched neighbouring pairs of all the matched pairs to the candidate pairs
    • most are wrong but effect negligible
    • some are correct and help percolation by increasing size
  • we further make the following modification. At each time step, instead of adding all the candidate pairs with score at least two to the matched set, we choose the one with the highest score among such pairs and add it to the matched set
    • break ties by picking the pair that minimizes the absolute degree difference

empirical analysis

  • real-world networks have more structural variance (degree, clustering, communities) than ER graphs. So if the algorithm performs well on ER, it'll likely perform well on real networks too
  • precision is better than recall in all experiments. Percolation stops at some point, which is why recall is low.
  • Our experiments show that choosing seeds among high-degree nodes, instead of picking them randomly, results in better matchings.
  • clustering and motif-based counts

De-anonymizing Scale-free Social Networks by Percolation Graph Matching (Chiasserini, Leonardi, 2014)

  • exploits large inhomogeneities in node degree of scale free networks; leads to improvement in matching, reduction in seed size; method: bootstrap percolation + graph slicing
  • O(n/log n) seed nodes required to percolate G(n,p,s) networks; what is the critical seed size for more realistic network models? Paper uses the Chung-Lu model: edge probability p(i,j) = w(i)w(j)/(N*avg_w), max degree is O(n^0.5)
  • DDM (degree-driven matching) problem: sqrt(n) seed nodes selected uniformly can percolate scale-free networks; even fewer if the attacker can choose seeds based on degree
  • Different from Korula model: simpler scale-free model, korula paper doesn't consider seed size
  • pair graph: V = V1 × V2, with an edge between pairs (i1,j2) and (m1,n2) if (i1,m1) ∈ E1 and (j2,n2) ∈ E2; (i1,i2) is a correctly matched pair
  • the DDM algorithm partitions the pair graph by first looking at the subgraph P1 induced by high-degree vertices; degrees in P1 are homogeneous, so results from ER graphs can be adapted; repeat for P2..Pk until percolation stops
  • a small but non-vanishing fraction of nodes in G1 and G2 have fewer than r edges and so cannot be identified
  • if seed nodes chosen uniform (n^0.5+eps), chosen based on degree (n^eps), then O(n) good pairs and no bad pairs
  • very low-degree nodes are usually susceptible to error since their edges do not provide much information. P_0 is considered at the very end; the DDM algorithm can provably match all good pairs in slices Pk for $k \geq 2$ and propagate to the next slice
  • uses high threshold r=4 since objective is to maximize precision; r can be lowered but at the cost of precision
  • Experiment: DDM w. selected seeds take 10 times fewer seed nodes critical value than uniform, similar precision

PDGM: Percolation-based Directed Graph Matching in Social Networks (Wang, Chen, 2017)

  • extend PGM to directed OSNs by modeling directional relationships and "celebrity penalty" for high-indegree nodes; outperforms baseline PGM algorithms
  • model: G(V,E) -> G1, G2 where fraction of overlapping nodes is alpha_v, fraction of overlapping edges is alpha_e; small seed set L known
  • matching is a success if precision and network coverage are high
  • phase transition, definitions of similarity witness and candidate matched pair same as paper "On the perf. of PGM"
  • the similarity score c(i',j') received from each witness (i',j') is not 1; it depends on the degrees of i' and j'; the score S(i,j) of a candidate pair is the sum of the c scores of its witnesses
  • K(i,j) is the number of marks a pair receives from witnesses
  • celebrity effect: a node with high centrality/degree has little impact in identifying a neighbor's social behavior pattern; the similarity score is low if witness degrees are high; penalize high-indegree neighbors using an indegree threshold
    • weight of the edge from i to j, w(i,j), is 1 iff the indegree of node j is less than d_tr (threshold), otherwise 0
  • c(i',j') is the similarity score from witness (i',j') for candidate (i,j); c_out(i',j')=2 if the product of the weights of the edges from i to i' and from j to j' is nonzero (i.e. the indegrees of i' and j' should be low).
  • algorithm: same as PGM; main difference: similarity function based on high indegree used. At every step, pick matched pair with max similarity score S(i,j) and K(i,j) >= r. Stop if no pair w. K(i,j) >= r
  • not a good paper -> no justification for the "celebrity penalty" and similarity function, no theoretical results; experiments show that PGM has more coverage than PDGM and the difference in precision is not much

Non-PGM papers

Summary

  • 2.2: each node has 2 types of features (local, e.g. degree, clustering; and distance: distance to mapped nodes); these form the feature vector. estimate the posterior probability that 2 nodes are the same using conditionally independent features. Use posterior probabilities as weights of edges connecting the pairs and find likely matches by maximizing flow in the bipartite graph.
  • 2.3: uses local, neighborhood and recursive features (aggregates of the previous features) to de-anonymize specific nodes in the network; pick the one that minimizes euclidean distance (evaluation: number of guesses until correct guess). not scalable

Bayesian method for matching 2 similar graphs without seeds (Pedarsani, Grossglauser, 2013)

  • graph matching in a Bayesian formulation without explicit side (seed) information, which uses node structural feature vectors (degree, distance) to incrementally map likely pairs.
  • percolation algorithm limitations: require a moderately large seed set and a relatively dense network for percolation to take place
  • method: bayesian model used to compute posterior probability P(u1=u2 | features); posteriors used as weights in maximum weighted bipartite matching. Size of candidates sets doubles at every step (total log(N) steps), time complexity O(n^3 log(n)). limitation: not as scalable as PGM
  • n! possible mappings; reduce search space by incrementally finding matched pairs. Evaluate "quality" of mapping by the product of the posteriors of the final matched pairs.
  • Node features used: node-specific features: degree, clustering etc.; anchor-specific features: distance of candidate pair (i,j) to each anchor (already mapped pairs)
  • Posterior $P(U_1=U_2|f_1,f_2)$ computed using Bayes rule + conditional independence of individual features: $P(X_i=x_i, X_j=x_j|U_1=U_2) = \sum_y P(X_i=x_i|Y=y)P(X_j=x_j|Y=y)p(y)$ and $P(X_i=x_i, X_j=x_j|U_1 \neq U_2) = (\sum_y P(X_i=x_i|Y=y)p(y))(\sum_y P(X_j=x_j|Y=y)p(y))$, i.e. LOTP over a latent $Y$
  • each anchor can be "wrong". Random variable $M_i=0$ implies the $i^{th}$ anchor is wrong. $P(M_i=0)$ = posterior of the matched anchor. $2^m$ (m: total anchors) combinations. Monte Carlo used to approximate $P[U_1=U_2|f_1,f_2,M]$. Conditional probability of distance in the component graph given distance in the original graph is tough to compute; a heuristic is used that "performs well empirically".
  • matching algorithm
    • sort nodes in G1, G2 by degree. Size of the candidate set at step t is $N_t = 2^{t+1}$. Let $V_{t,i}$ = set of $N_t$ nodes with highest degree in $G_i$. For each node, compute the structural feature vector. For each pair of nodes in the product of $V_{t,1}$ and $V_{t,2}$, compute the posterior probability of a match. normalize posterior probabilities and run a bipartite max-weight matching algorithm to find the optimal mapping.
    • repeat for step t+1 until all nodes are matched. Use the top 50% of the matched pairs with highest posterior as anchors in the next step.
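The per-step matching can be sketched with a brute-force maximizer standing in for a proper max-weight bipartite matching algorithm (fine for the tiny candidate sets of early steps; `W` is a toy posterior matrix, not from the paper):

```python
from itertools import permutations
import math

def best_mapping(W):
    """W[i][j]: posterior that node i in one candidate set matches node j
    in the other. The quality of a mapping is the product of its
    posteriors, so maximize the sum of log-posteriors over all bijections."""
    n = len(W)
    best_score, best_pi = -math.inf, None
    for pi in permutations(range(n)):
        score = sum(math.log(W[i][pi[i]]) for i in range(n))
        if score > best_score:
            best_score, best_pi = score, pi
    return list(best_pi)

# Toy posteriors for a 3x3 candidate set.
W = [[0.8, 0.1, 0.1],
     [0.2, 0.7, 0.1],
     [0.3, 0.1, 0.6]]
print(best_mapping(W))   # the diagonal wins: [0, 1, 2]
```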
  • algorithm does not perform well at all if edge overlap is low (<0.7). performance improves as clustering increases because clustering makes the node neighborhoods more robust to edge removals.

It's who you know: graph mining using recursive structural features (Henderson, Faloutsos, 2011)

  • note: general feature extraction method applied to graph matching problem; does not work as well as PGM algorithms
  • presents recursive feature extraction method for network structure features; applies method to de-anonymization and classification
  • features for de-anonymization should be general enough to be transferable; considers neighbor (ego-network) and regional features in addition to local node features
  • algorithm
    • neighborhood features (local + egonet) and recursive features (local + egonet + regional, combined recursively)
    • neighborhood features are the base features; local: (un)weighted, (un)directed degrees of nodes; egonet: clustering coefficient, number of edges entering and leaving the egonet
    • two types of recursive features: means and sums; example: mean clustering coefficient over all neighbors' egonets
    • number of recursive features grows exponentially with every iteration; when two features are too similar, one is pruned -> the simpler feature is kept
    • pruning: all features are first binned logarithmically (the fraction $p$ of nodes with lowest values gets bin 0, and so on). Justification: features like degree and clustering have power-law and lognormal distributions, so discriminatory power lies among nodes with large feature values. Pairs of features that disagree by no more than a threshold $s$ are connected by an edge; the simplest feature in each connected component is chosen as a final feature. Pruning is performed at every iteration. Stop if no new features are added at the end of an iteration.
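The recursion and the vertical logarithmic binning can be sketched in a few lines; this uses only the degree as the base feature and omits egonet features and the similarity-graph pruning step:

```python
import numpy as np

def recursive_features(adj, iters=2):
    """ReFeX-style recursion sketch: start from the local degree
    feature, then repeatedly append the mean and sum of each node's
    neighbors' existing features (egonet/regional features omitted)."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    feats = deg[:, None]                      # base (local) feature: degree
    for _ in range(iters):
        nbr_sum = adj @ feats                 # sum of neighbors' features
        with np.errstate(invalid="ignore", divide="ignore"):
            nbr_mean = np.where(deg[:, None] > 0, nbr_sum / deg[:, None], 0.0)
        feats = np.hstack([feats, nbr_sum, nbr_mean])  # grows 3x per iteration
    return feats

def log_bin(column, p=0.5):
    """Vertical logarithmic binning: the fraction p of nodes with the
    smallest values gets bin 0, then p of the remainder gets bin 1, etc."""
    order = np.argsort(column, kind="stable")
    bins = np.zeros(len(column), dtype=int)
    start, b = 0, 0
    while start < len(column):
        take = max(1, int(np.ceil(p * (len(column) - start))))
        bins[order[start + take:]] = b + 1
        start += take
        b += 1
    return bins
```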
  • deanonymization
    • data: Twitter follow network, Twitter mention network, and an IP network; the graphs must be somewhat correlated w.r.t. structural features
    • not exactly graph matching: for a given test vertex, how many guesses are required to predict the correct match? Guesses are ranked by Euclidean distance between the feature vector of the test vertex in the reference graph and those of candidate vertices in the target graph. The same set of features is used for both.
    • only high-degree vertices are considered when evaluating the algorithm, which overestimates its performance. Regional and neighborhood features outperform purely local features.
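The guess-counting evaluation reduces to ranking candidates by feature distance; a minimal sketch:

```python
import numpy as np

def guesses_to_correct(test_feat, candidate_feats, true_idx):
    """Rank candidates in the target graph by Euclidean distance to the
    test vertex's feature vector; return how many guesses are needed
    before the correct match appears (1 = first guess is right)."""
    d = np.linalg.norm(candidate_feats - test_feat, axis=1)
    ranking = np.argsort(d, kind="stable")
    return int(np.where(ranking == true_idx)[0][0]) + 1
```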

Projected power iteration for network alignment (Onaran, Villar, 2017)

  • summary: find the permutation $\pi$ such that the largest fraction of edges of the 2 graphs is matched. Graph matching is NP-hard in the worst case (over the entire data distribution $D$), but efficient algorithms work well for a given input distribution (a subset of $D$), e.g. ER. Extends EigenAlign by using projected power iteration.
  • notes: matching, isomorphism, and TSP are special cases of the quadratic assignment problem, which is NP-hard. Restrict $D$ to signal (low rank) plus noise at level $\lambda$; example: $D_{\lambda} = (G_1, G_2=PG_1P^T+E_{\lambda})$, where $E_{\lambda}$ flips entries of the permuted adjacency matrix with probability $\lambda$ (cf. the SBM used to study the community detection threshold). Open problem: amount of noise in the dataset vs. {exact, weak, no} recovery of the permutation. Optimization problem to obtain the optimal permutation $X$: $$ \min_{X\in \prod} ||G_1X-XG_2||^2_F $$
  • EigenAlign setup: construct an alignment matrix $A$ from $G_1, G_2$ whose $n^2$ nodes are the candidate matches. The entry for a pair of candidate matches is $s_1$ if the corresponding edges are in both $E_1$ and $E_2$, $s_2$ if in neither, and $s_3$ otherwise. The optimization problem is then to maximize $y^TAy$ subject to $\sum_{j} y_{i,j} = \sum_{i} y_{i,j} = 1$ and $y_{i,j} \in \{0,1\}$. $y$ is basically the flattened version of the optimal permutation matrix.
  • EigenAlign algorithm: find the top eigenvector $v$ of $A$. Then solve max-weight bipartite matching, i.e. obtain $y$ by maximizing $v^Ty$ subject to the same constraints.
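A sketch of this setup for undirected graphs, exploiting the Kronecker structure of the alignment matrix; the score values $s_1, s_2, s_3$ are illustrative, and the diagonal (self-pair) handling is glossed over:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def eigenalign(G1, G2, s1=10.0, s2=1.0, s3=0.0):
    """EigenAlign sketch: the n^2 x n^2 alignment matrix scores a pair of
    candidate matches (i,i'), (j,j') with s1 if (i,j) and (i',j') are
    both edges, s2 if neither is, and s3 otherwise. The top eigenvector
    is then rounded with max-weight bipartite matching."""
    e1 = np.asarray(G1, dtype=float)
    e2 = np.asarray(G2, dtype=float)
    n = len(e1)
    # np.kron(e1, e2)[(i*n + ip), (j*n + jp)] = e1[i, j] * e2[ip, jp]
    A = (s1 * np.kron(e1, e2)
         + s2 * np.kron(1 - e1, 1 - e2)
         + s3 * (np.kron(e1, 1 - e2) + np.kron(1 - e1, e2)))
    vals, vecs = np.linalg.eigh(A)
    v = np.abs(vecs[:, -1]).reshape(n, n)   # top eigenvector as an n x n score matrix
    rows, cols = linear_sum_assignment(-v)  # maximize v^T y over permutations
    return cols                             # cols[i] = node in G2 matched to node i in G1
```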
  • Projected power iteration: http://www.princeton.edu/~yc5/slides/Alignment_slides_Harvard.pdf; like the power method, but at every iteration $z^{(t)}$ is projected back onto the feasible set (a convex hull).
  • Contribution: projected power network alignment, an extension of EigenAlign: compute the top eigenvector $v$ of $A$ and then use projected power iteration, $v^{(t+1)} = F(Av^{(t)})$. Instead of rounding the top eigenvector with a linear program, this algorithm rounds the eigenvector greedily via projection.
  • Contribution: improves EigenAlign by using the projected power method. Simulations show significantly better recovery of the permutation matrix as the noise level increases.
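A toy sketch of the iteration, approximating the projection step with max-weight bipartite matching (i.e. rounding to the nearest permutation in inner product each step); the paper's actual projection set and stopping rules differ:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def projected_power_iteration(A, n, z0, iters=20):
    """Projected power iteration over the n^2-dim alignment matrix A:
    multiply by A, then project the iterate back onto the set of
    permutation matrices via bipartite matching (greedy rounding)."""
    z = np.asarray(z0, dtype=float).copy()
    for _ in range(iters):
        z = A @ z
        w = z.reshape(n, n)
        rows, cols = linear_sum_assignment(-w)  # nearest permutation in inner product
        P = np.zeros((n, n))
        P[rows, cols] = 1.0
        z = P.reshape(-1)
    return z.reshape(n, n)
```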
  • Limitations: not scalable to large networks, compared to PGM.
  • TODO: PCA, SVD, maximum weight bipartite matching, message passing, power method, projected power method

30 years of graph matching in pattern recognition (Conte, Vento, 2004)

  • graph matching categorized into exact and inexact graph matching.
  • exact: exact graph matching is characterized by the fact that the mapping between the nodes of the two graphs must be edge-preserving: if two nodes in the first graph are linked by an edge, they are mapped to two nodes in the second graph that are linked by an edge as well. Multiple variants: graph isomorphism, subgraph isomorphism (graph-to-subgraph mapping), monomorphism (edges preserved one way), homomorphism (many-to-one mapping), mapping between 2 subsets of nodes (MCS)
    • most algorithms based on tree search with backtracking + extensions. Notable: Ullmann, Ghahramani netgraph, VF2 (similar to PGM)
    • others: canonical labeling of graphs, decision trees
    • most algorithms are extended to incorporate edge and node attributes; common in ML
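The edge-preserving definition of exact matching can be made concrete with a tiny brute-force matcher; real algorithms like Ullmann's and VF2 prune this $N!$ search with backtracking:

```python
import itertools
import numpy as np

def find_isomorphism(A1, A2):
    """Brute-force exact graph matching for tiny graphs: search all N!
    permutations for an edge-preserving bijection, i.e. a mapping p with
    A2[p[i], p[j]] == A1[i, j] for all i, j (equivalently A2 = P A1 P^T)."""
    A1, A2 = np.asarray(A1), np.asarray(A2)
    n = len(A1)
    for perm in itertools.permutations(range(n)):
        p = list(perm)  # p maps node i of graph 1 to node p[i] of graph 2
        if np.array_equal(A2[np.ix_(p, p)], A1):
            return p
    return None
```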
  • inexact: the matching process must be tolerant: it must accommodate differences by relaxing, to some extent, the constraints that define the matching type. Tradeoff between exactness (accuracy) and time; constraints are relaxed into penalties.
    • optimal inexact matching algorithms always find a solution that is the global minimum of the matching cost
    • approximate algorithms may find only a local minimum, not the global one, but are often much faster
    • cost-based models that penalize "deformations" are categorized as error-correcting (graph edit operations etc)
    • tree-search models w. backtracking extended to inexact matching by directing search using heuristic estimates of cost
    • continuous optimization: approximate matching; the discrete optimization problem is cast as a continuous one, solved in polynomial time, and mapped back to discrete labels. Examples: relaxation labeling (probabilities over labels), Bayesian graph edit distance, quadratic assignment using a permutation matrix, graduated assignment
    • spectral methods are based on the following observation: the eigenvalues of the adjacency matrix of a graph are invariant with respect to node permutations (the eigenvectors are permuted accordingly). Hence, if two graphs are isomorphic, their adjacency matrices have the same eigenvalues (orthogonal matrix + PCA + GD, eigendecomposition + clustering)
    • other methods: neural networks, genetic algorithms, discrete relaxation/propagation
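The permutation-invariance of the spectrum that spectral methods rely on is easy to verify numerically:

```python
import numpy as np

# Eigenvalues are invariant under node permutation: for any permutation
# matrix P, the spectrum of P A P^T equals the spectrum of A.
rng = np.random.default_rng(0)
A = rng.integers(0, 2, (5, 5))
A = np.triu(A, 1)
A = A + A.T                     # random simple undirected graph
perm = rng.permutation(5)
P = np.eye(5)[perm]             # permutation matrix: rows of I permuted
B = P @ A @ P.T
ev_A = np.sort(np.linalg.eigvalsh(A.astype(float)))
ev_B = np.sort(np.linalg.eigvalsh(B))
```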
  • applications: specific to ML - images, video, biomedical {recognition, retrieval, authentication, alignment}

Misc papers

Detecting strong ties using motifs (Rotabi, Sharma, 2017)

Kronecker graphs: an approach to modeling networks (Leskovec, Ghahramani, 2010)

  • why is it relevant? The (stochastic) Kronecker model is fit to the network data using a likelihood formulation, so each node in the Kronecker graph needs to correspond to a node in the network data; the nodes of the 2 graphs must be matched in order to maximize the likelihood. The method described in the paper maximizes the expected likelihood by averaging the likelihood over mappings sampled using a Metropolis sampler.
  • an alternative would be to use scalable graph matching algorithms to find the optimal matching; this is applicable to any generative graph model that can be fit using a likelihood formulation.
  • https://cs.stanford.edu/people/jure/pubs/kronecker-jmlr10.pdf
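A toy sketch of the expected-likelihood idea, assuming a fixed edge-probability matrix `Theta` (e.g. a Kronecker power of an initiator); the swap proposal and acceptance details are illustrative, not the paper's exact sampler:

```python
import numpy as np

def log_likelihood(A, Theta, sigma):
    """Log-likelihood of adjacency A under edge-probability matrix Theta,
    with node i of the data mapped to row sigma[i] of Theta."""
    P = Theta[np.ix_(sigma, sigma)]
    ll = np.where(A == 1, np.log(P), np.log(1 - P))
    return ll[~np.eye(len(A), dtype=bool)].sum()   # ignore self-loops

def metropolis_mappings(A, Theta, samples=200, rng=None):
    """Metropolis sampler over node mappings: propose swapping two nodes,
    accept with prob min(1, likelihood ratio), and average the
    log-likelihood over the sampled mappings."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(A)
    sigma = np.arange(n)
    ll = log_likelihood(A, Theta, sigma)
    lls = []
    for _ in range(samples):
        i, j = rng.choice(n, 2, replace=False)
        prop = sigma.copy()
        prop[i], prop[j] = prop[j], prop[i]
        ll_new = log_likelihood(A, Theta, prop)
        if np.log(rng.random()) < ll_new - ll:     # accept/reject
            sigma, ll = prop, ll_new
        lls.append(ll)
    return float(np.mean(lls))
```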
