(by Fiona Pigott)
A centrality metric is a measure on a network intended to rank vertices by importance, and different centrality measures reflect different notions of the "importance" of a vertex.
Because the idea of the "importance" of a vertex depends on the problem at hand, different centrality measures differ partly because they measure different things, and the choice of a centrality metric should depend on exactly what you are trying to quantify.
This notebook looks long, but half of it is plotting code (sorry). It is intended to illustrate Chapter 7, sections 7.1-7.7 of M. Newman's Networks: An Introduction.
In [ ]:
# handy graph library for python
import igraph
# science
import numpy as np
from collections import defaultdict
# plot things
import tabulate
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
# get some toy graph data so we can demonstrate these properties
network = igraph.Nexus.get("kaptail")['KAPFTI1']
network.summary()
In [ ]:
# visualize the graph. if you can't get Cairo to work, don't worry about it.
# if you are worried about it anyway, google "cairo py2cairo" and sort this mess out for yourself
igraph.plot(network)
In [ ]:
# Using canned methods from igraph won't be very illustrative, so for the most part, I'll use the adjacency matrix
adj_mat = np.array(network.get_adjacency().data)
num_nodes = adj_mat.shape[0]
fig, axes = plt.subplots(1, figsize = (5,5))
axes.set_title("Adjacency Matrix (black = edge)")
axes.matshow(adj_mat, cmap = 'gist_heat_r')
# these are just constants that we'll want to use
# Get some 1s we'll need repeatedly
ones = np.ones((num_nodes, 1))
normed_ones = np.divide(ones, np.linalg.norm(ones))
zeros = np.zeros((num_nodes, 1))
identity = np.eye(num_nodes)
In [ ]:
# calculate the in- and out- degree
outdegree = np.sum(adj_mat,1)
indegree = np.sum(adj_mat,0)
One possible measure of importance is simply vertex degree. Degree is a simple metric, and it's easy to compute because it depends only on the immediate neighborhood of each vertex.
Even something as simple as "degree centrality," however, can have subtleties. For instance, is the most important vertex the one with the most incoming edges or the most outgoing edges? This would depend a lot on exactly what you are trying to measure.
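Before moving on, a quick sanity check (a small sketch, not part of the analysis itself): igraph's built-in degree methods should agree with our adjacency-matrix row and column sums.
In [ ]:
# sanity check: igraph's built-in degree counts should match the matrix sums above
print all(network.outdegree() == outdegree)
print all(network.indegree() == indegree)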
In [ ]:
# graph the vertices ordered by in + out degree
names_and_degrees = {network.vs["name"][i]: {"in": indegree[i], "out": outdegree[i]} for i in range(0,num_nodes)}
sorted_names_and_degrees = sorted(names_and_degrees.items(), key = lambda x: -1*(x[1]["in"] + x[1]["out"]))
sorted_indegree = [x[1]["in"] for x in sorted_names_and_degrees]
sorted_outdegree = [x[1]["out"] for x in sorted_names_and_degrees]
sorted_names = [x[0] for x in sorted_names_and_degrees]
fig, axes = plt.subplots(3,1, figsize = (16,20))
plt.subplots_adjust(bottom = .1)
axes[0].set_title("Degree for each node", size = 16)
axes[0].bar([x + .4 for x in range(0,num_nodes)], sorted_indegree, color = 'darkorange', width = .8)
axes[0].bar([x + .4 for x in range(0,num_nodes)], sorted_outdegree, bottom = sorted_indegree, color = 'navajowhite', width = .8)
axes[0].legend(["in-degree", "out-degree"])
axes[0].set_ylim(0,max(outdegree + indegree) + 1)
axes[0].set_ylabel("degree", size = 16)
axes[0].set_xticks([x + .9 for x in range(0,num_nodes)])
_ = axes[0].set_xticklabels(sorted_names, rotation = 90)
# graph the vertices sorted by the indegree
sorted_by_indegrees = sorted(names_and_degrees.items(), key = lambda x: -1*(x[1]["in"]))
axes[1].bar([x + .4 for x in range(0,num_nodes)], [x[1]["in"] for x in sorted_by_indegrees], color = 'darkorange', width = .8)
axes[1].set_ylim(0,max(outdegree + indegree) + 1)
axes[1].set_ylabel("indegree", size = 16)
axes[1].legend(["in-degree"])
axes[1].set_xticks([x + .9 for x in range(0,num_nodes)])
_ = axes[1].set_xticklabels([x[0] for x in sorted_by_indegrees], rotation = 90)
# graph the vertices ordered by the out degree
sorted_by_outdegrees = sorted(names_and_degrees.items(), key = lambda x: -1*(x[1]["out"]))
axes[2].bar([x + .4 for x in range(0,num_nodes)], [x[1]["out"] for x in sorted_by_outdegrees], color = 'navajowhite', width = .8)
axes[2].set_ylim(0,max(outdegree + indegree) + 1)
axes[2].set_xlabel("vertex", size = 16)
axes[2].set_ylabel("outdegree", size = 16)
axes[2].legend(["out-degree"])
axes[2].set_xticks([x + .9 for x in range(0,num_nodes)])
_ = axes[2].set_xticklabels([x[0] for x in sorted_by_outdegrees], rotation = 90)
In [ ]:
# I'm also going to compute total degree, scaled from 0 to 1, so that I can compare with other centrality metrics later
max_degree = float(max([x[1]["in"] + x[1]["out"] for x in sorted_names_and_degrees]))
scaled_degree = {x[0]: (x[1]["in"] + x[1]["out"])/max_degree for x in sorted_names_and_degrees}
sorted_degree_centrality = sorted(scaled_degree.items(), key = lambda x:-x[1])
#max_degree = float(max([x[1]["in"] for x in sorted_names_and_degrees]))
#scaled_degree = {x[0]: (x[1]["in"])/max_degree for x in sorted_names_and_degrees}
#sorted_degree_centrality = sorted(scaled_degree.items(), key = lambda x:-x[1])
Eigenvector centrality measures the importance of a vertex as a sum of the importance of its neighbors. You can think of this as each vertex "voting" for each neighbor (vertex $i$ votes for vertex $j$ if there is an edge between $i$ and $j$, i.e., if $A_{ij} = 1$).
Iteratively, we can then solve for eigenvector centrality: $$ x'_{i} = \sum_j A_{ij}x_j$$ On each iteration, estimates of $x'$ are updated by the previous weights. You might recognize $\sum_j A_{ij}x_j$ as the definition of the matrix multiplication $\mathbf{A}\mathbf{x}$.
Essentially we are solving for the fixed point (where $\mathbf{x}$ is fixed, save some constant factor) of $\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$: the equation for the eigenvectors of $\mathbf{A}$. The vector $\mathbf{x}$ is in fact the eigenvector corresponding to the largest eigenvalue of $\mathbf{A}$.
I glossed over why $\mathbf{x}$ is specifically the principal eigenvector of $\mathbf{A}$. Here's the argument, from Networks Ch 7:
Many iterations of the equation $\mathbf{x}' = \mathbf{A}\mathbf{x}$ give $\mathbf{x}(t) = \mathbf{A}^t\mathbf{x}(0)$. If $\mathbf{x}(0)$ (some initial guess for the centralities) can be written as a linear combination of the eigenvectors $\mathbf{v_i}$ of $\mathbf{A}$ (guaranteed whenever $\mathbf{A}$ has a full set of linearly independent eigenvectors, which is always true of the symmetric adjacency matrix of an undirected graph), we can rewrite $\mathbf{x}(0)$ as that combination $\mathbf{x}(0) = \sum_{i}c_i\mathbf{v_i}$. Then: $$ \mathbf{\vec{x}}(t) =\mathbf{A}^t \sum_{i}c_i\mathbf{v_i} = \sum_{i}\lambda_i^t c_i\mathbf{v_i} $$ Multiply by 1 in the form of $\frac{\lambda_1^t}{\lambda_1^t}$, where $\lambda_1$ is the principal (largest) eigenvalue: $$ = {\lambda_1}^t \sum_{i} \left(\frac{\lambda_i}{\lambda_1}\right)^t c_i\mathbf{v_i} $$ Then, because $\left|\frac{\lambda_i}{\lambda_1}\right| < 1$ for all $i \neq 1$, as the number of iterations $t \rightarrow \infty$, $\left(\frac{\lambda_i}{\lambda_1}\right)^t \rightarrow 0 \; \forall \; i \neq 1$
$$ = {\lambda_1}^t c_1\mathbf{v_1} $$So $\mathbf{x} \propto \mathbf{v_1}$
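How fast does this converge? The non-principal terms die off like $\left(\frac{|\lambda_2|}{\lambda_1}\right)^t$, so the ratio of the second-largest eigenvalue magnitude to the largest tells us roughly how many iterations we'll need. A quick check for our network (a sketch using numpy directly):
In [ ]:
# the non-principal terms decay like (|lambda_2| / lambda_1)^t, so this ratio
# governs how quickly the power iteration below converges
eigenvalue_magnitudes = sorted(abs(np.linalg.eigvals(adj_mat)))
print "|lambda_2| / lambda_1 =", eigenvalue_magnitudes[-2] / eigenvalue_magnitudes[-1]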
In [ ]:
# set some initial guess for the centralities. giving every node the same value is a good start
eigenvector_centrality = normed_ones
# give an initial error value and a tolerance to stop iterating
err = 100
tol = .01
while err > tol:
# calculate x' with Ax
eigenvector_centrality_new_unnormed = np.dot(adj_mat, eigenvector_centrality)
# norm your x values (only the proportions matter; if you don't normalize, the values blow up)
eigenvector_centrality_new = np.divide(eigenvector_centrality_new_unnormed,
np.linalg.norm(eigenvector_centrality_new_unnormed))
# calculate the error
err = sum(abs(np.subtract(eigenvector_centrality_new, eigenvector_centrality)))
# set the new centrality vector
eigenvector_centrality = eigenvector_centrality_new
# sort and scale the centrality metric from 0 to 1
sorted_eigenvector_centrality = sorted(zip(network.vs["name"],
[x/max(eigenvector_centrality) for x in eigenvector_centrality]),
key = lambda x: -x[1])
dict_eigenvector_centrality = dict(sorted_eigenvector_centrality)
In [ ]:
# plot things
fig, axes = plt.subplots(2,1, figsize = (16,12))
plt.subplots_adjust(bottom = .1)
axes[0].set_title("Eigenvector Centrality", size = 16)
axes[0].bar([x + .4 for x in range(0,num_nodes)], [x[1] for x in sorted_eigenvector_centrality],
color = 'royalblue', width = .8)
axes[0].set_ylabel("Eigenvector Centrality Values", size = 16)
axes[0].set_xticks([x + .9 for x in range(0,num_nodes)])
_ = axes[0].set_xticklabels([x[0] for x in sorted_eigenvector_centrality], rotation = 90)
node_ordering = sorted_degree_centrality #sorted_eigenvector_centrality
axes[1].plot(range(0,num_nodes), [dict_eigenvector_centrality[x[0]] for x in node_ordering], color = 'royalblue')
axes[1].plot(range(0,num_nodes), [scaled_degree[x[0]] for x in node_ordering], color = 'navajowhite')
axes[1].plot(range(0,num_nodes), [dict_eigenvector_centrality[x[0]] for x in node_ordering], color = 'royalblue', marker = 'o')
axes[1].plot(range(0,num_nodes), [scaled_degree[x[0]] for x in node_ordering], color = 'navajowhite', marker = 'o', alpha = .5)
axes[1].set_ylabel("Eigenvector Centrality vs Degree", size = 16)
axes[1].set_xticks([x for x in range(0,num_nodes)])
axes[1].grid()
axes[1].set_xlim(-0.5,38.5)
axes[1].set_ylim(0,1.05)
axes[1].legend(["eigenvector centrality", "degree centrality"])
_ = axes[1].set_xticklabels([x[0] for x in node_ordering], rotation = 90)
In [ ]:
# matrix formulation: find the principal eigenvector directly
# (as discussed below, for directed or disconnected graphs the result can be degenerate)
# get the eigenvalues and vectors
e_vals, e_vecs = np.linalg.eig(adj_mat)
# find the principal eigenvector
largest_eval_at = e_vals.argsort()[-1]
# centralities are proportional to the principal eigenvector, up to a (potentially negative) constant factor
matrix_eigen = sorted(zip(network.vs["name"], e_vecs[:,largest_eval_at]), key = lambda x: -1*abs(x[1]))
print tabulate.tabulate([[x[0], np.real(x[1])] for x in matrix_eigen][0:8])
One potential problem with eigenvector centrality is that only vertices in a strongly connected component of two or more vertices (a set of vertices in which each vertex can reach every other by a directed path), or in the out-component of such a component, can have a non-zero eigenvector centrality.
Below is an example of a graph with no strongly connected component larger than one vertex, so eigenvector centrality breaks down:
In [ ]:
tree = igraph.Graph(directed = True)
tree.add_vertices([0,1,2,3,4,5,6])
tree.add_edges([(0,1),(0,2),(1,3),(1,4),(2,5),(2,6)])
tree.vs["label"] = [0,1,2,3,4,5,6]
igraph.plot(tree, layout = tree.layout("sugiyama"), bbox=(0, 0, 150, 150), vertex_label_size = 1)
In [ ]:
# I'll use the built-in eigenvector centrality here for brevity, but it's the same algorithm
tree.eigenvector_centrality()
We can get meaningful centrality values for the above case if we assign every node at least a little centrality for free, rather than having these weakly connected or disconnected vertices get a centrality of zero. This is the idea behind Katz centrality, and the iterative centrality equation becomes: $$ x_{i}' = \alpha \sum_j A_{ij}x_j + \beta $$ Where the $\beta$ term is the centrality that every node gets for free.
This can be rewritten in matrix form: $$ \mathbf{x} = \alpha \mathbf{A}\mathbf{x} + \beta \mathbf{1} $$
Which can be solved for $\mathbf{x}$: $\mathbf{x} = (\mathbf{I} - \alpha \mathbf{A})^{-1}\beta\mathbf{1} \;$ ($\beta$ can simply be set to 1, since it's simply a multiplicative constant)
Now, note that $\alpha$ is a free parameter: $\alpha = 0$ would make the Katz centrality the uniform vector $\beta\mathbf{1}$, and the larger $\alpha$ is, the greater the emphasis put on the eigenvector term. But $\alpha$ cannot be arbitrarily large: the term $(\mathbf{I} - \alpha \mathbf{A})^{-1}$ diverges where $\text{det}(\mathbf{I} - \alpha \mathbf{A}) = 0$, which first happens at $\alpha = 1/\lambda_1$, so it is necessary to choose a value of $\alpha$ below that point. Below is an illustration of how the determinant depends on $\alpha$.
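Since the determinant first vanishes at $\alpha = 1/\lambda_1$, we can also compute the bound on $\alpha$ directly from the principal eigenvalue (a quick one-line sketch with numpy, as an alternative to the determinant scan below):
In [ ]:
# the divergence point is alpha = 1/lambda_1, so any usable alpha must fall below this
print "alpha must be less than:", 1.0/max(abs(np.linalg.eigvals(adj_mat)))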
In [ ]:
alpha_values = np.arange(0.219,.4,.0005)
# det(I - alpha*A) = 0 exactly when 1/alpha is an eigenvalue of A, so we can equivalently
# look for the zero crossing of det(A - (1/alpha)*I) as alpha varies
det_values = [np.linalg.det(np.subtract(adj_mat, 1.0/a * np.eye(adj_mat.shape[0]))) for a in alpha_values]
f, axes = plt.subplots(1,2, figsize = (14,5))
axes[0].plot(alpha_values, [0 for a in alpha_values], 'k')
axes[1].plot(alpha_values, [0 for a in alpha_values], 'k')
axes[0].plot(alpha_values, det_values)
axes[1].plot([0.21988], [0], 'ro') # red dot marks the first zero crossing, alpha = 1/lambda_1
axes[1].plot(alpha_values, det_values)
xlim0 = axes[0].set_xlim(0.219,.32)
xlim1 = axes[1].set_xlim(0.219,.23)
In [ ]:
alpha_index = np.where(np.array(det_values) > 0)[0][0] - 1
katz_alpha = alpha_values[alpha_index] * .9
print "The chosen alpha value is: ", katz_alpha
Calculate Katz centrality with the chosen $\alpha$ using the iterative formulation:
In [ ]:
# iterative formulation
katz_centrality = zeros
err = 100
tol = .01
while err > tol:
katz_centrality_new = katz_alpha * np.dot(adj_mat, katz_centrality) + ones
err = sum(abs(np.subtract(katz_centrality_new, katz_centrality)))
katz_centrality = katz_centrality_new
sorted_katz_centrality = sorted(zip(network.vs["name"],
[x[0]/max(katz_centrality)[0] for x in katz_centrality]),
key = lambda x: -x[1])
dict_katz_centrality = dict(sorted_katz_centrality)
In [ ]:
fig, axes = plt.subplots(2,1, figsize = (16,12))
plt.subplots_adjust(bottom = .1)
# plot katz centrality
axes[0].set_title("Katz Centrality", size = 16)
axes[0].bar([x + .4 for x in range(0,num_nodes)], [x[1] for x in sorted_katz_centrality],
color = 'limegreen', width = .8)
axes[0].set_ylabel("Katz Centrality Values", size = 16)
axes[0].set_xticks([x + .9 for x in range(0,num_nodes)])
_ = axes[0].set_xticklabels([x[0] for x in sorted_katz_centrality], rotation = 90)
# compare with other measures
fade = .3
node_ordering = sorted_degree_centrality #sorted_katz_centrality
axes[1].plot(range(0,num_nodes), [dict_katz_centrality[x[0]] for x in node_ordering], color = 'limegreen')
axes[1].plot(range(0,num_nodes), [dict_eigenvector_centrality[x[0]] for x in node_ordering], color = 'royalblue', alpha = fade)
axes[1].plot(range(0,num_nodes), [scaled_degree[x[0]] for x in node_ordering], color = 'navajowhite', alpha = fade)
axes[1].plot(range(0,num_nodes), [dict_katz_centrality[x[0]] for x in node_ordering], color = 'limegreen', marker = 'o')
axes[1].plot(range(0,num_nodes), [dict_eigenvector_centrality[x[0]] for x in node_ordering], color = 'royalblue', marker = 'o', alpha = fade)
axes[1].plot(range(0,num_nodes), [scaled_degree[x[0]] for x in node_ordering], color = 'navajowhite', marker = 'o', alpha = fade)
axes[1].set_ylabel("Katz vs Eigenvector vs Degree", size = 16)
axes[1].set_xticks([x for x in range(0,num_nodes)])
axes[1].grid()
axes[1].set_xlim(-0.5,38.5)
axes[1].set_ylim(0,1.05)
axes[1].legend(["katz centrality", "eigenvector centrality", "degree centrality"])
_ = axes[1].set_xticklabels([x[0] for x in node_ordering], rotation = 90)
In [ ]:
# in case you aren't convinced, here is the matrix formulation
matrix_form_katz_centrality = np.dot(np.linalg.inv(identity - katz_alpha*adj_mat), ones)
sorted_matrix_form_katz_centrality = sorted(zip(network.vs["name"], [x[0] for x in matrix_form_katz_centrality]),
key = lambda x: -x[1])
print tabulate.tabulate([[x[0],x[1]] for x in sorted_matrix_form_katz_centrality][0:8])
This is how Katz centrality behaves on the directed acyclic graph for which eigenvector centrality failed:
In [ ]:
# and for the tree graph case
tree_adj_mat = np.array(tree.get_adjacency().data)
katz_centrality_tree = np.dot(np.linalg.inv(np.eye(len(tree.vs)) - katz_alpha*tree_adj_mat),
np.ones(len(tree.vs)))
katz_centrality_tree = sorted(zip(tree.vs["label"], [x for x in katz_centrality_tree]),
key = lambda x: -x[1])
print tabulate.tabulate([[x[0],x[1]] for x in katz_centrality_tree][0:8])
igraph.plot(tree, layout = tree.layout("sugiyama"), bbox=(0, 0, 150, 150), vertex_label_size = 1)
In Katz centrality (as in eigenvector centrality), if a vertex is very important, it makes all of the vertices that it points to important. This could be undesirable in the case where very important vertices also point to many other vertices, as with Google, which points to many websites, none of which are necessarily particularly important.
We said that in eigenvector centrality, each vertex "votes" for the importance of each of its neighbors, one vote per neighbor. In the PageRank formulation of centrality, each vertex only gets one "vote" in total, so if it has many neighbors, each neighbor only gets a fraction of a vote. This is achieved by dividing each neighbor's centrality contribution by its out-degree.
$$ x_{i}' = \alpha \sum_j A_{ij}\frac{x_j}{k^{out}_j} + \beta $$I won't walk through the matrix manipulation, but the above equation is equivalent to: $$ \mathbf{x} = \mathbf{D}(\mathbf{D} - \alpha\mathbf{A})^{-1}\beta\mathbf{1} $$ Where $\mathbf{D}$ is the diagonal matrix with elements $D_{ii} = \text{max}(k_i^{out},1)$ (ensuring that $\mathbf{D}$ has no zeros on its diagonal means that it is invertible).
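(For the curious, the manipulation is short: the fixed point is $\mathbf{x} = \alpha\mathbf{A}\mathbf{D}^{-1}\mathbf{x} + \beta\mathbf{1}$, so $(\mathbf{I} - \alpha\mathbf{A}\mathbf{D}^{-1})\mathbf{x} = \beta\mathbf{1}$; since $\mathbf{I} - \alpha\mathbf{A}\mathbf{D}^{-1} = (\mathbf{D} - \alpha\mathbf{A})\mathbf{D}^{-1}$, inverting gives $\mathbf{x} = \mathbf{D}(\mathbf{D} - \alpha\mathbf{A})^{-1}\beta\mathbf{1}$.)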
In [ ]:
# we need an out-degree vector with no zeros in it, so that we don't divide by zero
# replacing zeros with ones keeps the division well-defined for vertices with no outgoing edges
outdegree_no_zeros = outdegree.copy()
outdegree_no_zeros[outdegree_no_zeros == 0] = 1
Degree = np.diag(outdegree_no_zeros)
outdegree_no_zeros = outdegree_no_zeros.reshape(num_nodes,1)
In [ ]:
# iterative formulation
pagerank = ones
err = 100
tol = .001
pagerank_alpha = .85
while err > tol:
pagerank_new = (pagerank_alpha * np.dot(adj_mat, np.divide(pagerank, outdegree_no_zeros))) + ones
err = sum(abs(np.subtract(pagerank_new, pagerank)))
pagerank = pagerank_new
sorted_pagerank = sorted(zip(network.vs["name"], [x/float(max(pagerank)) for x in pagerank]), key = lambda x:-x[1])
dict_pagerank = dict(sorted_pagerank)
In [ ]:
# plotting code, ignore
fig, axes = plt.subplots(2,1, figsize = (16,12))
plt.subplots_adjust(bottom = .1)
# plot katz centrality
axes[0].set_title("PageRank Centrality", size = 16)
axes[0].bar([x + .4 for x in range(0,num_nodes)], [x[1] for x in sorted_pagerank],
color = 'purple', width = .8)
axes[0].set_ylabel("PageRank Centrality Values", size = 16)
axes[0].set_xticks([x + .9 for x in range(0,num_nodes)])
_ = axes[0].set_xticklabels([x[0] for x in sorted_pagerank], rotation = 90)
# compare with other measures
fade = .3
node_ordering = sorted_degree_centrality #sorted_pagerank
axes[1].plot(range(0,num_nodes), [dict_pagerank[x[0]] for x in node_ordering], color = 'purple')
axes[1].plot(range(0,num_nodes), [dict_katz_centrality[x[0]] for x in node_ordering], color = 'limegreen', alpha = fade)
axes[1].plot(range(0,num_nodes), [dict_eigenvector_centrality[x[0]] for x in node_ordering], color = 'royalblue', alpha = fade)
axes[1].plot(range(0,num_nodes), [scaled_degree[x[0]] for x in node_ordering], color = 'navajowhite', alpha = fade)
axes[1].plot(range(0,num_nodes), [dict_pagerank[x[0]] for x in node_ordering], color = 'purple', marker = 'o')
axes[1].plot(range(0,num_nodes), [dict_katz_centrality[x[0]] for x in node_ordering], color = 'limegreen', marker = 'o', alpha = fade)
axes[1].plot(range(0,num_nodes), [dict_eigenvector_centrality[x[0]] for x in node_ordering], color = 'royalblue', marker = 'o', alpha = fade)
axes[1].plot(range(0,num_nodes), [scaled_degree[x[0]] for x in node_ordering], color = 'navajowhite', marker = 'o', alpha = fade)
axes[1].set_ylabel("PageRank vs Katz vs Eigenvector vs Degree", size = 16)
axes[1].set_xticks([x for x in range(0,num_nodes)])
axes[1].grid()
axes[1].set_xlim(-0.5,38.5)
axes[1].set_ylim(0,1.05)
axes[1].legend(["pagerank centrality", "katz centrality", "eigenvector centrality", "degree centrality"])
_ = axes[1].set_xticklabels([x[0] for x in node_ordering], rotation = 90)
In [ ]:
# matrix formulation, in case you didn't believe my hasty non-derivation
D_minus_alpha_a = np.subtract(Degree, pagerank_alpha * adj_mat)
D_minus_alpha_a_inv = np.linalg.inv(D_minus_alpha_a)
pagerank_matrix_method = np.dot(np.dot(Degree, D_minus_alpha_a_inv), ones)
print tabulate.tabulate([[x[0], x[1]] for x in
sorted(zip(network.vs["name"], pagerank_matrix_method), key = lambda x:-x[1])[0:8]])
It is conceivable that you might want to distinguish vertices that point to important vertices, as well as vertices pointed to by other vertices (being pointed to is what the previously discussed centrality measures reward).
In this case, we can distinguish between vertices with a high authority centrality (corresponding to vertices with many important incoming connections, quite similar to the previously discussed metrics) and vertices with a high hub centrality (measuring how many important authority-vertices a vertex points to).
Call the authority centrality $\mathbf{x}$ and call the hub centrality $\mathbf{y}$, then solve: $$ \mathbf{x} = \alpha\mathbf{A}\mathbf{y} $$ $$ \mathbf{y} = \beta\mathbf{A}^T\mathbf{x} $$
In [ ]:
normed_ones = np.divide(np.ones(num_nodes), np.linalg.norm(np.ones(num_nodes)))
err = 100
tol = .001
hub_score = normed_ones
authority_score = normed_ones
alpha = 1
beta= 1
while err > tol:
# find the new hub and authority scores
authority_score_new_unnormed = alpha * np.dot(adj_mat, hub_score)
hub_score_new_unnormed = beta * np.dot(np.transpose(adj_mat), authority_score)
# norm the scores (we only care about proportional values anyway)
authority_score_new = np.divide(authority_score_new_unnormed, np.linalg.norm(authority_score_new_unnormed))
hub_score_new = np.divide(hub_score_new_unnormed, np.linalg.norm(hub_score_new_unnormed))
# find error and update
err = sum(abs(np.subtract(hub_score_new, hub_score))) + sum(abs(np.subtract(authority_score_new, authority_score)))
authority_score = authority_score_new
hub_score = hub_score_new
dict_authority = dict(zip(network.vs["name"], [x/max(authority_score) for x in authority_score] ))
dict_hub = dict(zip(network.vs["name"], [x/max(hub_score) for x in hub_score] ))
sorted_authority_score = sorted(dict_authority.items(), key = lambda x: -x[1])
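Note (from Networks Ch 7) that substituting one equation into the other gives $\mathbf{A}\mathbf{A}^T\mathbf{x} = \lambda\mathbf{x}$ and $\mathbf{A}^T\mathbf{A}\mathbf{y} = \lambda\mathbf{y}$ with $\lambda = (\alpha\beta)^{-1}$: the authority and hub scores are the principal eigenvectors of $\mathbf{A}\mathbf{A}^T$ and $\mathbf{A}^T\mathbf{A}$ respectively. A quick cross-check of the iteration above (a sketch; compare with the top of the authority ranking):
In [ ]:
# the authority score should be proportional to the principal eigenvector of A A^T
# (np.linalg.eigh applies because A A^T is symmetric; it returns eigenvalues in ascending order)
evals_AAt, evecs_AAt = np.linalg.eigh(np.dot(adj_mat, np.transpose(adj_mat)))
principal_AAt = abs(evecs_AAt[:, -1])
print tabulate.tabulate(sorted(zip(network.vs["name"], principal_AAt), key = lambda x: -x[1])[0:8])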
In [ ]:
fig, axes = plt.subplots(2,1, figsize = (16,12))
plt.subplots_adjust(bottom = .1)
# plot katz centrality
axes[0].set_title("Hub/Authority Centrality", size = 16)
axes[0].bar([x + .2 for x in range(0,num_nodes)], [x[1] for x in sorted_authority_score], color = 'indianred', width = .4)
axes[0].bar([x + .6 for x in range(0,num_nodes)], [dict_hub[x[0]] for x in sorted_authority_score], color = 'brown', width = .4)
axes[0].set_ylabel("Hub/Authority Centrality Values", size = 16)
axes[0].set_xticks([x + .8 for x in range(0,num_nodes)])
_ = axes[0].set_xticklabels([x[0] for x in sorted_authority_score], rotation = 90)
axes[0].legend(["authority score", "hub score"])
# compare with other measures
fade = .3
node_ordering = sorted_degree_centrality #sorted_pagerank
axes[1].plot(range(0,num_nodes), [dict_authority[x[0]] for x in node_ordering], color = 'indianred')
axes[1].plot(range(0,num_nodes), [dict_hub[x[0]] for x in node_ordering], color = 'brown')
axes[1].plot(range(0,num_nodes), [dict_pagerank[x[0]] for x in node_ordering], color = 'purple', alpha = fade)
axes[1].plot(range(0,num_nodes), [dict_katz_centrality[x[0]] for x in node_ordering], color = 'limegreen', alpha = fade)
axes[1].plot(range(0,num_nodes), [dict_eigenvector_centrality[x[0]] for x in node_ordering], color = 'royalblue', alpha = fade)
axes[1].plot(range(0,num_nodes), [scaled_degree[x[0]] for x in node_ordering], color = 'navajowhite', alpha = fade)
axes[1].plot(range(0,num_nodes), [dict_authority[x[0]] for x in node_ordering], color = 'indianred', marker = 'o')
axes[1].plot(range(0,num_nodes), [dict_hub[x[0]] for x in node_ordering], color = 'brown', marker = 'o')
axes[1].plot(range(0,num_nodes), [dict_pagerank[x[0]] for x in node_ordering], color = 'purple', marker = 'o', alpha = fade)
axes[1].plot(range(0,num_nodes), [dict_katz_centrality[x[0]] for x in node_ordering], color = 'limegreen', marker = 'o', alpha = fade)
axes[1].plot(range(0,num_nodes), [dict_eigenvector_centrality[x[0]] for x in node_ordering], color = 'royalblue', marker = 'o', alpha = fade)
axes[1].plot(range(0,num_nodes), [scaled_degree[x[0]] for x in node_ordering], color = 'navajowhite', marker = 'o', alpha = fade)
axes[1].set_ylabel("Hub/Authority vs PageRank vs Katz vs Eigenvector vs Degree", size = 12)
axes[1].set_xticks([x for x in range(0,num_nodes)])
axes[1].grid()
axes[1].set_xlim(-0.5,38.5)
axes[1].set_ylim(0,1.05)
axes[1].legend(["authority score", "hub score",
"pagerank centrality", "katz centrality", "eigenvector centrality", "degree centrality"])
_ = axes[1].set_xticklabels([x[0] for x in node_ordering], rotation = 90)
In [ ]:
# Let's look at how hub and authority scores are correlated
auth_hub = [(y[0][0], y[0][1], y[1])
for y in zip(sorted_authority_score,[dict_hub[x[0]] for x in sorted_authority_score])]
plt.figure(figsize = (10,8))
plt.plot([x[1] for x in auth_hub], [x[2] for x in auth_hub], 'o', color = 'indianred')
plt.plot([0,1],[0,1], 'k--')
plt.ylim(-.05,1.05)
plt.xlim(-.05,1.05)
for label, x, y in auth_hub:
plt.annotate(label, (x - .02,y + .011), size = 8)
plt.grid()
plt.title("Authority score vs hub score", size = 16)
plt.ylabel("Authority Score", size = 16)
_ = plt.xlabel("Hub Score", size = 16)
So far, we have formulated centrality based on the adjacency matrix. A different way of thinking about centrality is to think about the message-passing ability of each node: how far and how quickly can it pass a message across the network? How important is it to the flow of information across the network?
This notion of centrality is based on the distance from a vertex to the other vertices of the graph.
In [ ]:
# it doesn't make sense to calculate these metrics on components that aren't connected, so I'll get the
# strongly connected giant component to use in the next few metrics
connected_components = network.clusters(mode = 'STRONG')
giant_component = max(zip([len(c) for c in connected_components], connected_components), key = lambda x:x[0])[1]
connected_subnetwork = network.subgraph(giant_component)
unconnected_nodes = list(set(network.vs["name"]) - set(connected_subnetwork.vs["name"]))
connected_subnetwork.summary()
In [ ]:
num_connected_nodes = len(connected_subnetwork.vs)
adj_mat_connected_subnetwork = np.array(connected_subnetwork.get_adjacency().data)
outdegree_connected_subnetwork = np.sum(adj_mat_connected_subnetwork,1)
indegree_connected_subnetwork = np.sum(adj_mat_connected_subnetwork,0)
Closeness centrality measures the average geodesic (shortest) distance between a vertex and every other vertex on the network.
Because the average distance would be small for the most central node and large for the least central, we take the inverse of the average distance in order to create a centrality score.
$d_{ij}$ denotes the distance between vertices $i$ and $j$, $n$ denotes the total number of vertices in the graph:
$$ C_i = \frac{n}{\sum_j d_{ij}} $$
In [ ]:
# closeness: the number of vertices divided by the sum of geodesic distances to every other vertex
closeness_centrality = []
for vertex in connected_subnetwork.vs:
    closeness_centrality.append(num_connected_nodes / float(sum(connected_subnetwork.shortest_paths(vertex)[0])))
# scale from 0 to 1, and give the vertices outside the giant component a closeness of zero
dict_closeness_centrality = defaultdict(float)
for i in range(0, num_connected_nodes):
    dict_closeness_centrality[connected_subnetwork.vs["name"][i]] = closeness_centrality[i]/max(closeness_centrality)
for unconnected_node in unconnected_nodes:
    dict_closeness_centrality[unconnected_node] = 0
sorted_closeness = sorted(dict_closeness_centrality.items(), key = lambda x: -x[1])
In [ ]:
fig, axes = plt.subplots(2,1, figsize = (16,12))
plt.subplots_adjust(bottom = .1)
# plot centrality
axes[0].set_title("Closeness Centrality", size = 16)
axes[0].bar([x + .4 for x in range(0,num_nodes)], [x[1] for x in sorted_closeness],
color = 'dimgray', width = .8)
axes[0].set_ylabel("Closeness Centrality Values", size = 16)
axes[0].set_xticks([x + .8 for x in range(0,num_nodes)])
_ = axes[0].set_xticklabels([x[0] for x in sorted_closeness], rotation = 90)
# compare with other measures
fade = .3
node_ordering = sorted_degree_centrality #sorted_pagerank
axes[1].plot(range(0,num_nodes), [dict_closeness_centrality[x[0]] for x in node_ordering], color = 'dimgray')
axes[1].plot(range(0,num_nodes), [dict_authority[x[0]] for x in node_ordering], color = 'indianred', alpha = fade)
axes[1].plot(range(0,num_nodes), [dict_hub[x[0]] for x in node_ordering], color = 'brown', alpha = fade)
axes[1].plot(range(0,num_nodes), [dict_pagerank[x[0]] for x in node_ordering], color = 'purple', alpha = fade)
axes[1].plot(range(0,num_nodes), [dict_katz_centrality[x[0]] for x in node_ordering], color = 'limegreen', alpha = fade)
axes[1].plot(range(0,num_nodes), [dict_eigenvector_centrality[x[0]] for x in node_ordering], color = 'royalblue', alpha = fade)
axes[1].plot(range(0,num_nodes), [scaled_degree[x[0]] for x in node_ordering], color = 'navajowhite', alpha = fade)
axes[1].plot(range(0,num_nodes), [dict_closeness_centrality[x[0]] for x in node_ordering], color = 'dimgray', marker = 'o')
axes[1].plot(range(0,num_nodes), [dict_authority[x[0]] for x in node_ordering], color = 'indianred', marker = 'o', alpha = fade)
axes[1].plot(range(0,num_nodes), [dict_hub[x[0]] for x in node_ordering], color = 'brown', marker = 'o', alpha = fade)
axes[1].plot(range(0,num_nodes), [dict_pagerank[x[0]] for x in node_ordering], color = 'purple', marker = 'o', alpha = fade)
axes[1].plot(range(0,num_nodes), [dict_katz_centrality[x[0]] for x in node_ordering], color = 'limegreen', marker = 'o', alpha = fade)
axes[1].plot(range(0,num_nodes), [dict_eigenvector_centrality[x[0]] for x in node_ordering], color = 'royalblue', marker = 'o', alpha = fade)
axes[1].plot(range(0,num_nodes), [scaled_degree[x[0]] for x in node_ordering], color = 'navajowhite', marker = 'o', alpha = fade)
axes[1].set_ylabel("PageRank vs Katz vs Eigenvector vs Degree", size = 16)
axes[1].set_xticks([x for x in range(0,num_nodes)])
axes[1].grid()
axes[1].set_xlim(-0.5,38.5)
axes[1].set_ylim(0,1.05)
axes[1].legend(["closeness centrality", "authority score", "hub score",
"pagerank centrality", "katz centrality", "eigenvector centrality", "degree centrality"])
_ = axes[1].set_xticklabels([x[0] for x in node_ordering], rotation = 90)
You might notice (by looking at the bar chart) that the actual range of values for the closeness centrality score is quite narrow, and the difference between the most central vertices is quite small. If you remember a previous discussion on small-diameter social graphs, this is because the average distance between any two nodes on a random graph tends to be both relatively constant and quite low.
This lack of range is a problem with the closeness centrality score, as it makes the ranking sensitive to small perturbations in the structure of the network.
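One common remedy (a variant Newman also discusses in Ch 7) is harmonic centrality: average the inverse distances $1/d_{ij}$ rather than inverting the average distance. Distant (or unreachable) vertices then contribute little (or nothing) to the sum, and the scores spread out more. A minimal sketch, reusing the giant-component variables from above:
In [ ]:
# harmonic centrality: the mean of 1/d_ij over all other vertices j
harmonic_centrality = []
for vertex in connected_subnetwork.vs:
    distances = connected_subnetwork.shortest_paths(vertex)[0]
    harmonic_centrality.append(sum([1.0/d for d in distances if d > 0]) / (num_connected_nodes - 1))
print tabulate.tabulate(sorted(zip(connected_subnetwork.vs["name"], harmonic_centrality), key = lambda x: -x[1])[0:8])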
Betweenness centrality measures how often a vertex lies on the paths between other vertices. Example: London would have a high betweenness centrality in a map of airplane routes from the US to Europe, because you often travel through London to get to the rest of Europe.
A vertex with high betweenness centrality is one that connects many disparate parts of the network (indeed, one algorithm to find communities in a network iteratively cuts the network at the vertices or edges with the highest betweenness).
The most common formulation of betweenness centrality counts the number of times a certain vertex $i$ appears on the shortest path between each pair of vertices $(s,t)$ on the network. In the case where there are multiple shortest paths between $s$ and $t$, we count the fraction of those paths on which $i$ appears.
Define $n^i_{s,t}$ as the number of geodesic (shortest) paths between two vertices $s$ and $t$ that pass through $i$, and define $g_{s,t}$ as the total number of geodesic paths between $s$ and $t$.
$$ x_i = \sum_{s,t}\frac{n^i_{s,t}}{g_{s,t}} $$
In [ ]:
# enumerate every geodesic path between every ordered pair of vertices
shortest_paths_all_pairs = {}
for vertex in range(0,num_connected_nodes):
    for dest in range(0,num_connected_nodes):
        shortest_paths_all_pairs[(vertex, dest)] = connected_subnetwork.get_all_shortest_paths(vertex, to = dest)
# for each vertex v, sum over all pairs the fraction of geodesics that pass through v
betweenness_sums = defaultdict(int)
for v in range(0,num_connected_nodes):
    for i in range(0,num_connected_nodes):
        for j in range(0,num_connected_nodes):
            n_v_vector = [int(v in path) for path in shortest_paths_all_pairs[(i,j)]]
            betweenness_sums[v] += float(sum(n_v_vector))/float(len(n_v_vector))
betweenness = [x[1] for x in sorted(betweenness_sums.items(), key = lambda x:x[0])]
In [ ]:
print tabulate.tabulate([[x[0], x[1]] for x in
sorted(zip(connected_subnetwork.vs["name"], betweenness), key = lambda x: -x[1])][0:8])
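As a cross-check, here is igraph's built-in betweenness (avoided above for illustration's sake). Note that igraph excludes the endpoints of each path while the formulation above counts them, so the built-in values should sit a constant $2n - 1$ below ours, leaving the ranking unchanged:
In [ ]:
# igraph's built-in betweenness: same ordering as above, offset by (2n - 1)
igraph_betweenness = connected_subnetwork.betweenness()
print tabulate.tabulate(sorted(zip(connected_subnetwork.vs["name"], igraph_betweenness), key = lambda x: -x[1])[0:8])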
The above formulation of betweenness centrality assumes that information (or people, planes, etc.) travels along the most efficient (shortest) path on the network. This is by far the most common formulation, and in general, it's what scientists mean when they say "betweenness centrality." However, for some kinds of networks it's more helpful to assume a different sort of information diffusion.
One alternative (of many) is to instead count the fraction of random walks between vertices $s$ and $t$ on which some vertex $i$ appears.
In [ ]:
# random walk betweenness
iterations = 50
give_up_after = 2*num_connected_nodes
paths = defaultdict(list)
# random walk between every pair of vertices
for vertex in range(0,num_connected_nodes):
for dest in range(0,num_connected_nodes):
for i in range(0,iterations):
# start at the first vertex
current_vertex = vertex
path = [current_vertex]
# now wander around
steps = 0
while (current_vertex != dest) and (steps < give_up_after):
steps += 1
                # choose a random vertex to walk to from the current vertex
                options = np.where(adj_mat_connected_subnetwork[current_vertex] != 0)[0]
# dead end
if options.size == 0:
path = []
break
else:
current_vertex = np.random.choice(options)
path.append(current_vertex)
paths[(vertex,dest)].append(path)
In [ ]:
# for each vertex v, average over the simulated walks the fraction that pass through v
flow_betweenness_sums = defaultdict(int)
for v in range(0,num_connected_nodes):
    for i in range(0,num_connected_nodes):
        for j in range(0,num_connected_nodes):
            n_v_vector = [int(v in path) for path in paths[(i,j)]]
            flow_betweenness_sums[v] += float(sum(n_v_vector))/float(len(n_v_vector))
flow_betweenness = [x[1] for x in sorted(flow_betweenness_sums.items(), key = lambda x:x[0])]
In [ ]:
dict_betweenness = dict(zip(connected_subnetwork.vs["name"], [x/max(betweenness) for x in betweenness]))
dict_flow_betweenness = dict(zip(connected_subnetwork.vs["name"], [x/max(flow_betweenness) for x in flow_betweenness]))
for unconnected_node in unconnected_nodes:
dict_betweenness[unconnected_node] = 0
dict_flow_betweenness[unconnected_node] = 0
sorted_flow_betweenness = sorted(dict_flow_betweenness.items(), key = lambda x: -x[1])
sorted_betweenness = sorted(dict_betweenness.items(), key = lambda x: -x[1])
In [ ]:
print tabulate.tabulate([[x[0], x[1]] for x in
sorted(zip(connected_subnetwork.vs["name"], flow_betweenness), key = lambda x: -x[1])][0:8])
In [ ]:
fig, axes = plt.subplots(2,1, figsize = (16,12))
plt.subplots_adjust(bottom = .1)
# plot centrality
axes[0].set_title("Betweenness Centrality", size = 16)
axes[0].bar([x + .2 for x in range(0,num_nodes)], [x[1] for x in sorted_betweenness], color = 'cyan', width = .4)
axes[0].bar([x + .6 for x in range(0,num_nodes)], [dict_flow_betweenness[x[0]] for x in sorted_betweenness], color = 'darkcyan', width = .4)
axes[0].set_ylabel("Betweenness Centrality Values", size = 16)
axes[0].set_xticks([x + .8 for x in range(0,num_nodes)])
_ = axes[0].set_xticklabels([x[0] for x in sorted_betweenness], rotation = 90)
axes[0].legend(["betweenness by geodesic distances", "betweenness by random walk"])
# compare with other measures
fade = .3
node_ordering = sorted_degree_centrality #sorted_pagerank
axes[1].plot(range(0,num_nodes), [dict_betweenness[x[0]] for x in node_ordering], color = 'cyan')
axes[1].plot(range(0,num_nodes), [dict_flow_betweenness[x[0]] for x in node_ordering], color = 'darkcyan')
axes[1].plot(range(0,num_nodes), [dict_closeness_centrality[x[0]] for x in node_ordering], color = 'dimgray', alpha = fade)
#axes[1].plot(range(0,num_nodes), [dict_authority[x[0]] for x in node_ordering], color = 'indianred', alpha = fade)
#axes[1].plot(range(0,num_nodes), [dict_hub[x[0]] for x in node_ordering], color = 'brown', alpha = fade)
#axes[1].plot(range(0,num_nodes), [dict_pagerank[x[0]] for x in node_ordering], color = 'purple', alpha = fade)
#axes[1].plot(range(0,num_nodes), [dict_katz_centrality[x[0]] for x in node_ordering], color = 'limegreen', alpha = fade)
#axes[1].plot(range(0,num_nodes), [dict_eigenvector_centrality[x[0]] for x in node_ordering], color = 'royalblue', alpha = fade)
#axes[1].plot(range(0,num_nodes), [scaled_degree[x[0]] for x in node_ordering], color = 'navajowhite', alpha = fade)
axes[1].plot(range(0,num_nodes), [dict_betweenness[x[0]] for x in node_ordering], color = 'cyan', marker = 'o')
axes[1].plot(range(0,num_nodes), [dict_flow_betweenness[x[0]] for x in node_ordering], color = 'darkcyan', marker = 'o')
axes[1].plot(range(0,num_nodes), [dict_closeness_centrality[x[0]] for x in node_ordering], color = 'dimgray', marker = 'o', alpha = fade)
#axes[1].plot(range(0,num_nodes), [dict_authority[x[0]] for x in node_ordering], color = 'indianred', marker = 'o', alpha = fade)
#axes[1].plot(range(0,num_nodes), [dict_hub[x[0]] for x in node_ordering], color = 'brown', marker = 'o', alpha = fade)
#axes[1].plot(range(0,num_nodes), [dict_pagerank[x[0]] for x in node_ordering], color = 'purple', marker = 'o', alpha = fade)
#axes[1].plot(range(0,num_nodes), [dict_katz_centrality[x[0]] for x in node_ordering], color = 'limegreen', marker = 'o', alpha = fade)
#axes[1].plot(range(0,num_nodes), [dict_eigenvector_centrality[x[0]] for x in node_ordering], color = 'royalblue', marker = 'o', alpha = fade)
#axes[1].plot(range(0,num_nodes), [scaled_degree[x[0]] for x in node_ordering], color = 'navajowhite', marker = 'o', alpha = fade)
axes[1].set_ylabel("Betweenness vs PageRank vs Katz vs Eigenvector vs Degree", size = 12)
axes[1].set_xticks([x for x in range(0,num_nodes)])
axes[1].grid()
axes[1].set_xlim(-0.5,38.5)
axes[1].set_ylim(0,1.05)
axes[1].legend(["betweenness", "flow betweenness", "closeness centrality"])#, "authority score", "hub score", "pagerank centrality", "katz centrality", "eigenvector centrality", "degree centrality"])
_ = axes[1].set_xticklabels([x[0] for x in node_ordering], rotation = 90)