In [1]:
# %load ../../preconfig.py
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['axes.grid'] = False
import numpy as np
import pandas as pd
#import itertools
import networkx as nx
import string
import logging
logger = logging.getLogger()
The essential characteristics of a social network are:
There is a collection of entities that participate in the network.
There is at least one relationship between entities of the network.
all-or-nothing: friends of Facebook.
discrete degree: friends, family as in Google plus.
real number: the times of talking.
There is an assumption of nonrandomness or locality.
If A is related to both B and C, then there is a higher probability than average that B and C are related.
In [4]:
plt.imshow(plt.imread('./res/fig10_1.png'))
Out[4]:
Is Fig 10.1 typical of a social network, in the sense that it exhibits locality of relationships?
Suppose X, Y and Z are nodes of Fig 10.1, with edges between X and Y and also between X and Z.
What would we expect the probability of an edge between Y and Z to be?
The graph has 9 edges out of the $C_7^2 = 21$ pairs of nodes.
Since we already have (X,Y) and (X,Z), the probability of an edge (Y,Z) is $(9-2)/(21-2) = 0.368$. If the graph were large enough, that probability would be very close to the fraction of the pairs of nodes that have edges between them, i.e., $9/21 = 0.429$.
Then, we must compute the probability that $(Y,Z)$ exists in Fig 10.1, given that edges $(X,Y)$ and $(Y,Z)$ exist.
We count the pairs of nodes that could be $Y$ and $Z$, without worrying about which node is $Y$ and which is $Z$.
If $X$ is $A$, $(B,C)$ contributes one positive example. All details are in the table following. In all, the fraction of times the third edge exists in thus $9 / 16 = 0.563 > 0.368$.
node | pos | neg | |
---|---|---|---|
A | 1 | 0 | |
C | 1 | 0 | |
E | 1 | 0 | |
G | 1 | 0 | |
F | 2 | 1 | |
B | 1 | 2 | |
D | 2 | 4 | |
sum | 9 | 7 | 16 |
Telephone Networks
Email Networks
Collaboration Networks
Other example
There are other social phenomena that involve entities of different types. The natural way to represent such information is as a $K$-partite graph for some $k > 1$. In general, a $k$-partite graph consists of $k$ disjoint sets of nodes, with no edges between nodes of the same set.
Eg:
deli.cio.cu: there are three different kinds of entites: users, tags and pages.
In [5]:
plt.imshow(plt.imread('./res/fig10_2.png'))
Out[5]:
#exercise
clustering of the graph as a way to identify communities.
distance measure: \begin{align} d(x,y) = \begin{cases} 0 \text{ or } 1 \text{ or } 1 & \text{if edge} (x,y) {exists} \\ 1 \text{ or } \infty \text{ or } 1.5 & \text{otherwise} \end{cases} \end{align}
They are generally unsuitable for the problem of clustering social-network graphs.
betweenness of an edge $(a,b)$:
the number of pairs of nodes $x$ and $y$ such that $(a,b)$ lies on the shortest path between $x$ and $y$. And it's credicted with the fraction of all shortest paths between $x$ and $y$ if existed.
A high score indicates that $(a,b)$ runs between two different communities, namely, $a$ and $b$ do no belong to the same community.
The algorithm is aimed to calculate the number of shortest paths going through each eage.
1) Starting at a node $X$, perform a breadth-first search (BFS) of the graph to label levels of each node.
DAG edges: edges between levels.
If there is a DAG edge $(Y,Z)$, where $Y$ is at the level about $Z$, then we shall call $Y$ a parent of $Z$ and $Z$ a child of $Y$.
2) to label each node by the number of shortest paths that reach it from the root.
In [4]:
plt.imshow(plt.imread('./res/fig10_4.png'))
Out[4]:
3) to calculate for each edge $e$ the sum over all nodes $Y$ of the fraction of shortest paths from the root $X$ to $Y$ that go through $e$.
Each leaf in the DAG gets a credit of 1.
Each node that is not a leaf gets a credit = 1 + the sum of the credits of the DAG edges from that node to its child nodes.
credit for $(Y_i, Z)$ is: $$\text{credit}(Y_i, Z) = \text{credit}(Z) \frac{p_i}{\sum_{j=1}^k p_i}$$ where $Y_1, Y_2, \dotsc, Y_k$ are the parents of $Z$, and $p_i$ is the number of the shorest paths from the root to $Y_i$, namely, the labels in Step 2).
In [3]:
plt.imshow(plt.imread('./res/fig10_6.png'))
Out[3]:
4) After performing the credit calculation with each node as the root,
we sum the credits for each edge.
And then divive the credit for each edge by 2, as each shortest path will have been discovered twice.
Then we get the the true betweenness.
In [5]:
plt.imshow(plt.imread('./res/fig10_7.png'))
plt.figure()
plt.imshow(plt.imread('./res/fig10_8.png'))
Out[5]:
Speeding up the Betweenness Calculation:
If the large is large, we can pick a subset of the nodes at random and use these as the root of breadth-first searches, we can get an approximation to the betweenness of each edge that will serve in most applications.
cons:
It is not possible to place an individual in two different communities, and everyone is assigned to a community.
In [6]:
# Exercises for Section 10.2
A complete bipartite graph $K_{s,t}$ consists of $s$ nodes on one side and $t$ nodes on the other side, with all $st$ possible edges between the nodes of one side and the other present.
Idea:
While it is not possible to guarantee that a graph with many edges necessarily has a large clique, it is possible to guarantee that a bipartite graph with many edges has a large complete bipartite subgraph.
We can regard a complete bipartite subgraph as the nucleus of a community, and add to it nodes with many edges to existing members of the community.
If its nodes is consisted of two or more types, construct bipartite graphs directly.
If all nodes have the same type, divide the node into two equal groups at random.
It's possible to view the problem of finding instance of $K_{s,t}$ within $G$ as one of finding frequent itemsets.
"items" - left side.
"baskets" - right side.
The members of the basket for node $v$ are the nodes of the left side to which $v$ is connected.
Let the support threshold be $s$, the number of nodes that the instance of $K_{s,t}$ has on the right side.
the problem of find $K_{s,t}$ $\to$ finding frequent itemsets $F$ of size $t$.
Assume:
the graph $G$ has $n$ nodes on the left and another $n$ nodes on the right.
let $d$ be the average degree of all nodes.
the degree of the $i$th node on the right is $d_i$.
Proof:
The total contribution of the $n$ nodes on the right is $\sum_i \binom{d_i}{t} \geq n \binom{d}{t}$ .
The number of itemsets of size $t$ is $\binom{n}{t}$.
Thus, the average count of an itemset of size $t$ is $n \binom{d}{t} / \binom{n}{t}$ ???
That is, if there is a community with $n$ nodes on each side, the average degree of the nodes is $d$, and $n(d/n)^t \geq s$, then this community is guaranteed to have a complete bipartite subgraph $K_{s,t}$.
In [2]:
# Exercises for Section 10.3
In [2]:
plt.imshow(plt.imread('./res/fig10_11.png'))
Out[2]:
A proper definition of a "good" cut must belance the size of the cut itself against the difference in the sizes of the sets that the cut creates.
The normalized cut value for $S$ and $T$ is \begin{equation} \frac{\operatorname{Cut}(S,T)}{\operatorname{Vol}(S)} + \frac{\operatorname{Cut}(S,T)}{\operatorname{Vol}(T)} \end{equation} where $\operatorname{Vol}(S)$ is the number of edges with at least one end in $S$, and $\operatorname{Cut}(S,T)$ is the number of edges connected between $S$ and $T$.
adjacent matrix: \begin{equation}
A_{i,j} = \begin{cases}
1, & \text{if node $i$ and $j$ is connected} \\
0, & \text{otherwise}
\end{cases}
\end{equation}
degree matrix: $D_{i,i}$ is the degree of the $i$th node.
Laplacian matrix: $L = D - A$ Notice that each row and column sums to zero.
The smallest eignevalues and their eigenvectors reveal the information we desire.
The smallest eignevalues for every Laplacian matrix is 0, and its corresponding eigenvectors is $\mathbf{1}$ ones matrix.
the second-smallest eigenvalues of $L$ is the minimum of $x^T L x$, and the minimum is taken under the constraints:
In all: $x^T L x = \sum_{i,j} (x_i - x_j)^2$
As a consequence, $x$ must have some positive and some negative components $\to$ two sets/groups.
We could set the threshold at some point other than zero.
We may also want to a partition into more than two components:
Attention:
while each eigenvector tries to produce a minimum-sized cut, the fact that successive eigenvectors have to satisfy more and more constraints generally causes the cuts they describe to be progressively worse.
In [4]:
# Exercises for Section 10.4
## Ex 10.4.1
### (a)
edges = [
('A', 'B'), ('A', 'C'), ('B', 'C'),
('B', 'H'), ('C', 'D'), ('H', 'I'),
('H', 'G'), ('D', 'E'), ('D', 'F'),
('I', 'G'), ('G', 'E'), ('E', 'F')
]
G = nx.Graph()
G.add_edges_from(edges)
In [13]:
nx.draw(G)
In [16]:
A = nx.adjacency_matrix(G).todense()
A
Out[16]:
In [35]:
D = np.diag(np.ravel(A.sum(axis=1)))
D
Out[35]:
In [36]:
L = D - A
L
Out[36]:
In [89]:
### Ex 10.4.2
w, v = np.linalg.eig(L)
In [98]:
eig_min = np.argmin(w)
w = np.delete(w, eig_min)
v =np.delete(v, eig_min, axis=0)
eig_2nd_min = np.argmin(w)
eigv_2nd_min = np.ravel(v[eig_2nd_min])
In [105]:
setA = set([n for v, n in zip(eigv_2nd_min, G.nodes()) if v >= 0])
setB = set(G.nodes()) - setA
print(setA, setB)
In [2]:
plt.imshow(plt.imread('./res/fig10_19.png'))
Out[2]:
We assume that the value of the parameters that gives the largest value of the likelihood is the correct model for the observed artifact. $$\operatorname{argmax}_{\theta} P[f(\theta)]) $$
prior probabilities: $$\operatorname{argmax}_{\theta} P[f(\theta)]) = \operatorname{argmax}_{\theta} P[f(\theta) | \theta] \, P[\theta] $$
affiliation-graph model: generate social graphs from communities.
community-affiliation graphs:
Given: $C$ communities, $N$ nodes(individuals).
Question: $n_i \in C_k$ ?
Model:
we compute the likelihood that a given graph with the proper number of nodes is generated by this mechanism.
Parameter: membership
define membership: $n_i \in C_k$.
0 / 1, Yes / No, decrete variable $\to$ brute search
Parameter: $P_{ck}$ \begin{align} &P_{u,v} = 1 - \displaystyle \prod_{C_k in M} (1 - P_{ck}) \quad \text{where } M = {C_i: u \in C_i, v \in C_i}\\ &P[f(P_{ck}, C_k, E)] = \displaystyle \prod_{(u,v) \in E} P_{u,v} \, \prod_{(u,v) \notin E} (1 - P_{u,v}) \end{align}
Goal: find $\operatorname{argmax}_{C_k} P[f(P_{ck}, C_k, E)]$.
avoiding the use of discrete membership changes. This improvement allows us to use standard methods.
"strenght of membership":
the stronger the membership of two individuals in the same community, the more likely it is that this community will cause them to have an edge between them.
In the improved model,
Parameter: membership define membership: strength, $F_{xC} \in \mathbb{R}_{\ge 0}$
Parameter: $P_{ck}$ $$P_C(u,v) = 1 - e^{- F_{u,C} F_{v,C}}$$
In [3]:
# exercise
Let $M$ be the transition matrix of the graph $G$: the entry in row $i$ and column $j$ of $M$ is $1/k$ if node $j$ of $G$ has degree $k$, and one of the adjacent nodes is $i$.
$\beta$ is the probability that the walker continues at random, so $1 - \beta$ is the probability the walker will teleport to the initial node $N$.
$v'$ is the probability the walker is at each of the nodes at the next round: $$v' = \beta M v + (1 - \beta) e_N$$
In [2]:
plt.imshow(plt.imread('./res/fig10_22.png'))
Out[2]:
In [2]:
edges = [
('Picture 1', 'Sky'),
('Picture 1', 'Tree'),
('Picture 2', 'Sky'),
('Picture 3', 'Sky'),
('Picture 3', 'Tree')
]
G=nx.Graph()
G.add_edges_from(edges)
nx.draw(G)
In [3]:
nodelist = ['Picture 1', 'Picture 2', 'Picture 3', 'Sky', 'Tree']
adj = nx.adjacency_matrix(G, nodelist=nodelist).todense()
In [4]:
adj = pd.DataFrame(adj, index=nodelist, columns=nodelist)
adj
Out[4]:
In [5]:
M = adj.apply(lambda x: x / sum(x), axis=0)
M
Out[5]:
In [6]:
beta = 0.8
e_0 = np.identity(M.shape[0])[0]
e_0 = pd.DataFrame({'e_0': e_0}, index=nodelist)
e_0
Out[6]:
In [7]:
def random_walk(v: pd.DataFrame, beta: float, M: pd.DataFrame, e_n: pd.DataFrame) -> pd.DataFrame:
return beta * (M.dot(v)) + (1 - beta) * e_n
In [8]:
v = [e_0]
iter_time = 50
v_ = v[0]
for k in range(iter_time):
v_ = random_walk(v_, beta, M, e_0)
v.append(v_)
index = list(range(iter_time+1))
pd.concat(v, axis=1).T.reset_index(drop=True).T
Out[8]:
If we wanted to know what node were most similar to another node, we would have to start the analsis over for that node.
notice that convergence takes time, since there is an initial oscillation.
In [66]:
# Exercise
to measure the extent to which a graph looks like a social network.
If we start with $n$ nodes and add $m$ edges to a graph at random:
We expect the number of triangles to be much greater than the value for a random graph.
It has been demonstrated that the age of a community is related to the density of triangles.
Suppose we have a graph of $n$ nodes and $m \geq n$ edges.
nouns:
heavy hitter: the node whose degree is at least $\sqrt{m}$. note, the number of heavy-hitter nodes is nore more than $2\sqrt{m}$, since otherwise the sum of degree of nodes would be more than $2m$.
heavy-hitter triangle: the triangle all three of whose nodes are heavy hitter.
Assuming the graph is represented by its edges, we preprocess the graph as follows:
Compute the degree of each node. $O(m)$
Create an index on edges, with the pair of nodes at its ends as the key. constructed in $O(m)$. Query the existence of an edge $O(1)$.
Create another index of edges, this one with key equal to a single node. to retrieve the nodes adjacent to given node. $O(\sqrt{m})$
order the nodes: nodes.sort_values(by=['degree', 'id'])
Heavy-Hitter Triangles: Find in all heavy-hitter nodes which is only $O(\sqrt{m})$. time: $O(\sqrt{m}^3) = O(m^{3/2})$
Other Triangles: Consider each edge $(v_1, v_2)$:
if both $v_1$ and $v_2$ are heavy hitters, ignore the dege. (use $v_3$ since the computation is less).
if $v_1 < v_2$, query whether $(v_1.\text{adjacent nodes}), v_2)$ exits. $O(\sqrt{m} * m) = O(m^{3/2})$.
The total time of the algorithm is $O(m^{3/2})$.
It turns out the algorithm described above is, to within an order of magnitude the best possible.
For a complete graph on $n$ nodes, it has $m = \binom{n}{2}$ edges and the number of triangles is $\binom{n}{3}$. Since we cannot enumerate triangles in less time than the number of those triangles $O(n^3) = O((\sqrt{m})^3) = O(m^{3/2})$.
For sparse graphs, we can add to the complete graph a chain of nodes with any length up to $n^2$ to convert.
multiway join technique: $E(X, Y) \bowtie E(X, Z) \bowtie E(Y, Z)$
if we hash nodes to $b$ buckets, then there will be $b^3$ reducers, since $(h(u), h(v), z)$, $(h(u), y, h(v))$ and $(x, h(u), h(v)$ are mapped.
$\to$ The total communication required is thus $3b$ key-value pairs for each of the $m$ tuples of the edge relation $E$, namely $O(mb)$ if we use $b^3$ Reduce tasks.
$\to$ each Reduce task receives $O(mb) / b^3 = O(m / b^2)$ edges.
$\to$ If we use the algorithm of Section 10.7.2, the computation cost of each Reduce is $O((m / b^2)^{3/2})$. Thus, the total computation cost of $b^3$ Reduce is $O((m / b^2)^{3/2} * b^3) = O(m^{3/2})$.
all undirected graphs can be represented by directed graphs.
path: a sequence of nodes in a directed graph. Its length is the number of arcs, instead of nodes, along the path.
The neighborhood of radius $d$ for $v$ is: $\{u : \operatorname{len}(u, v) \leq d\}$, denote this neighborhood by $N(v, d)$.
The neighborhood profile of a node $v$ is the sequence of sizes of its neighborhoods $|N(v, 1)|, |N(v, 2)|, \dotso$.
The diameter $d$ of a directed graph: $\max (\operatorname{len}(u, v)): u \in G, v \in G$.
for each node $v$, we can find the smallest $d$ such that $|N(v, d)| = |N(v, d+1)|$, and call it $d(v)$. $\to$ the $d$ of $G$ is: $\max_{v} d(v)$.
A graph is strongly connected if there is a paht from any node to any other node. If $G$ is strongly connected, the $d = \max_{v} d(v)$.
"six degrees of separation": the diameter of social graph is six. Unfortunately, not all important graphs exhibit such tight connections.
The transitive closure of a graph is: $\{(u, v) : \operatorname{len of Path}(u, v) \geq 0 \}$. denoted $\operatorname{Path}(u, v)$.
reachability: we say node $u$ reaches node $v$ if $\operatorname{Path}(u, v)$.
$\operatorname{Path}(u, v)$ is true if and only if $v$ is in $N(u, \infty) = \cup_{i \geq 0} N(u, i)$.
The two problems - transtive closure and reachability - are related, but there are many examples of graphs where reachability is feasible and transitive closure is not.
transitive closure is actually more readily parallelizable than is reachability.
$\operatorname{Arc}(X, Y) = \{(x, y) \text{ where } x \to y\}$.
SELECT DISTINCT Arc.Y
FROM Reach, Arc
WHERE Arc.X = Reach.X
how many rounds this process requires depends on how far from $v$ is the furthest node that $v$ can reach.
recursive-doubling method
SELECT DISTINCT p1.X, p2.Y
FROM Path p1, Path p2
WHERE p1.Y = p2.X
The above recursive-doubling method does a lot of redundant work, since there may exist many paths between two nodes.
smart transitive closure:
Every path of length greater than 1 can be broken into a head whose length is a power of 2, followed by a tail whose length is no greater than the length of the head.
$Q(X, Y)$ holds all pairs of nodes $(x, y)$ such that the shortest path from $x$ to $y$ is of length exactly $2^i$ after the $i$th round.
Intitally, set both $Q$ and Path to be copies of the relation Arc.
On the $(i + 1)$st round, we do the following:
Compute a new value for $Q$ by joining it with itself:
SELECT DISTINCT q1.X, q2.Y
FROM Q q1, Q q2
WHERE q1.Y = q2.X
Subtract Path from the relation Q computed in step 1.
Join Path with the nw value of Q computed in 2:
SELECT DISTINCT Q.X, Path.Y
FROM Q, Path
WHERE Q.Y = Path.X
Set the new value of Path to be the union of the relation computed in step 3, the new value of Q computed in step 1, and the old value of Path.
collapse an SCC (Stronly connected components) to a single node when computing the transitive closure.
to find most of the SCC's in a graph by some random node selections followed by two breadth-first searches.
Let $G$ be the graph to be reduced, and let $G'$ be $G$ with all the arcs reversed.
Pick a node $v$ from $G$ at random.
Find $N_G(v, \infty)$, the set of nodes reachable from $v$ in $G$.
Find $N_{G'} (v, \infty)$, the set of nodes that $v$ reaches in the graph $G'$ that has the arcs of $G$ reversed.
Construct the SCC S containing $v$, which is $N_G(v, \infty) \cap N_{G'}(v, \infty)$.
Replace SCC S by a single node $s$ in $G$.
In [ ]:
# Exercise 10.8.8