[Data, the Humanist's New Best Friend](index.ipynb)
*Class 19*

In this class you are expected to learn:

  • Centrality
  • Degree
  • Betweenness
  • Closeness
  • Eigenvector
  • Current flow betweenness
  • Ego networks
*Aliens? Networks!*

In [1]:
# We will need this later
%pylab inline
pylab.rcParams['figure.figsize'] = (12.0, 6.0)


Populating the interactive namespace from numpy and matplotlib

More terminology, weee!

The social scientist John Barnes once described graph theory as a "terminological jungle, in which any newcomer may plant a tree". We now turn to some more of the fundamental concepts and definitions surrounding graphs.

  • Paths. We've seen this already, but... A path is simply a sequence of nodes with the property that each consecutive pair in the sequence is connected by an edge.
  • Cycles. A cycle is a path with at least three edges, in which the first and last nodes are the same, but otherwise all nodes are distinct.
  • Connectivity. Given a graph, it is natural to ask whether every node can reach every other node by a path. With this in mind, we say that a graph is connected if for every pair of nodes, there is a path between them.
  • Components. A connected component of a graph (often shortened just to the term component) is a subset of the nodes such that: (i) every node in the subset has a path to every other; and (ii) the subset is not part of some larger set with the property that every node can reach every other.
  • Giant Components. A connected component that contains a significant fraction of all the nodes.
  • Subgraphs. A subgraph is a subset of the nodes of a network, and all of the edges linking these nodes. Any group of nodes can form a subgraph. Ego networks are subgraphs that are centered on a certain node.
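The terminology above can be checked on a tiny toy graph (the node names here are made up for illustration):

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"),  # a 3-cycle
                  ("c", "d"),                          # a pendant node
                  ("x", "y")])                         # a separate pair

# Connectivity: the graph as a whole is not connected...
print(nx.is_connected(G))  # False

# ...because it splits into two connected components
components = list(nx.connected_components(G))
print(len(components))     # 2

# A subgraph induced by a subset of nodes keeps all edges among them
sub = G.subgraph(["a", "b", "c"])
print(sub.number_of_edges())  # 3
```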

Centrality

First, let's take a second look at our hamsterster network data set.


In [2]:
import networkx as nx
hamsterster = nx.read_edgelist("data/petster-friendships-hamster.txt")
nx.draw(hamsterster)


It's hard to tell by looking at the image. How many components do you think the graph has? Let's find out by using NetworkX's nx.connected_components().


In [3]:
# connected_components() returns a generator in recent NetworkX;
# materialize it as a list (largest first) so we can reuse it below
hamsterster_components = sorted(nx.connected_components(hamsterster), key=len, reverse=True)
len(hamsterster_components)


Out[3]:
23

And how big is the biggest, and how small the smallest?


In [4]:
[len(c) for c in hamsterster_components]


Out[4]:
[1788, 7, 7, 5, 5, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]

In this case we can say that the network has one giant component with 1788 nodes.

The centrality of a vertex measures its relative importance within the graph. There are many ways of calculating centrality, each one with a slightly different meaning.

However, there is little agreement on what centrality actually means. In practice, the first answer is usually "it depends." It depends on the data—what does it mean that two people are connected? Did they just "friend each other" on a social networking site—or do they have a deeper connection? It depends on whether the links mean information exchange, or responsibility (as in the case of organizational networks). It depends on the understanding of power and influence that is desired as an output. In this class, we will explore the four classic centrality metrics.

Degree centrality

The idea of degree centrality is the simplest. The degree centrality of a node is just its degree, that is, the number of neighbors it has. The more connected a node is to other nodes, the higher its degree centrality. In a sense, it identifies the "celebrities" of the graph.
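As a sanity check of the definition, here is a minimal sketch on a toy star graph (not the hamsterster data): NetworkX normalizes degree centrality by dividing each degree by n - 1, the maximum possible degree.

```python
import networkx as nx

star = nx.star_graph(4)  # node 0 is the hub, connected to nodes 1..4
dc = nx.degree_centrality(star)
print(dc[0])  # 4 / 4 = 1.0, the hub touches every other node
print(dc[1])  # 1 / 4 = 0.25, a leaf has a single neighbor
```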


In [5]:
degree = nx.degree_centrality(hamsterster)

Once we have calculated degree centrality, we sort the results to see which nodes are most central.


In [6]:
sorted(degree.items(), key=lambda x: x[1], reverse=True)[:10]


Out[6]:
[('237', 0.1464728056004308),
 ('238', 0.11954765751211631),
 ('44', 0.09100700053850296),
 ('168', 0.08292945611200861),
 ('137', 0.07969843834141088),
 ('169', 0.0791599353796446),
 ('45', 0.07646742057081314),
 ('46', 0.07377490576198169),
 ('176', 0.06785137318255251),
 ('177', 0.06677436725901992)]

In [35]:
degree_hist = plt.hist(list(degree.values()), 100)
plt.loglog(degree_hist[1][1:], degree_hist[0])
plt.title("Degree histogram (loglog)")


Out[35]:
<matplotlib.text.Text at 0x7fa7c7403bd0>

It seems like there is an interesting cluster. Let's try to trim the graph and draw it.


In [23]:
# Return a new graph object with pendant and isolated nodes (or, more
# generally, nodes of degree <= `degree`) removed
def trim_degrees(graph, degree=1):
    g = graph.copy()
    d = dict(nx.degree(g))
    # iterate over a copy of the node list, since we modify g as we go
    for n in list(g.nodes()):
        if d[n] <= degree:
            g.remove_node(n)
    return g

len(trim_degrees(hamsterster))


Out[23]:
1572

In [28]:
len(trim_degrees(hamsterster, degree=25))


Out[28]:
258

In [34]:
nx.draw(trim_degrees(hamsterster, degree=25))


Betweenness centrality

Degree centrality is clearly significant, but high degree centrality does not correspond very well with cropping up often on paths between other nodes in the network. This is better captured by betweenness centrality. Betweenness centrality has to do with how often a node crops up when taking a best path (there may be more than one) between two other nodes. The betweenness centrality of a node is defined as the proportion of best paths between all other pairs of nodes that pass through it. By definition, the betweenness centrality of a node depends on the connection properties of every pair of nodes in the graph, except pairs that include it, whereas degree centrality depends only on the node's neighbors.

A very closely related notion is the notion of betweenness for edges. Given the set of shortest paths between all node pairs, the betweenness centrality of an edge is simply the proportion of those which pass through it. To get a better feel for the intuition, it often helps to think of a network as having traffic: Travelers start out from each node with their destinations evenly divided among the other nodes, always choosing best paths for their journeys. In this situation, the edge with the highest betweenness will have the most traffic.
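The traffic intuition can be verified on a toy graph (made up for illustration): two triangles joined by a single bridge edge, which all cross traffic must use.

```python
import networkx as nx

# Two triangles joined by the single "bridge" edge (c, d): every
# shortest path between the two sides passes through it, so it should
# have the highest edge betweenness.
G = nx.Graph([("a", "b"), ("b", "c"), ("c", "a"),
              ("c", "d"),
              ("d", "e"), ("e", "f"), ("f", "d")])

edge_bw = nx.edge_betweenness_centrality(G)
top_edge = max(edge_bw, key=edge_bw.get)
print(top_edge)  # ('c', 'd')
```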


In [7]:
betweenness = nx.betweenness_centrality(hamsterster)

In [8]:
sorted(betweenness.items(), key=lambda x: x[1], reverse=True)[:10]


Out[8]:
[('237', 0.07955057589298353),
 ('137', 0.05961357357799319),
 ('169', 0.05332285093462576),
 ('238', 0.03863041164165812),
 ('251', 0.03546679437500915),
 ('44', 0.03277672954551926),
 ('296', 0.027557830872305678),
 ('168', 0.026840912958145585),
 ('23', 0.02657300119877967),
 ('65', 0.0260147153874862)]

Activity

Write a function, `centrality_scatter(cent1, cent2)`, that receives two centrality dictionaries and plots each node label as a point, using each dictionary as one of the axes. Add a linear best-fit trend line, axis labels, and a title.

Closeness centrality

Ability to move information from one side of the network to another (i.e., gossip) is an important step towards establishing a shared perception of the world, whether it has to do with someone's choice of outfits at a party or the formation of a political movement. Therefore, an important node is typically "close" to, and can communicate quickly with, the other nodes in the network. That's the main idea behind closeness centrality.

Closeness centrality is based on distance calculation and tries to quantify the intuitive notion of what position a node occupies, if central or peripheral.
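Concretely, NetworkX computes closeness as (n - 1) divided by the sum of shortest-path distances to all other nodes. A toy path graph (not the hamsterster data) makes the central vs. peripheral distinction visible:

```python
import networkx as nx

# On a path a-b-c, the middle node is closest to everyone else
P = nx.path_graph(["a", "b", "c"])
cc = nx.closeness_centrality(P)
print(cc["b"])  # 2 / (1 + 1) = 1.0, the central node
print(cc["a"])  # 2 / (1 + 2) = 0.666..., a peripheral node
```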


In [9]:
closeness = nx.closeness_centrality(hamsterster)
sorted(closeness.items(), key=lambda x: x[1], reverse=True)[:10]


Out[9]:
[('237', 0.42158339409479156),
 ('137', 0.4173880253671492),
 ('238', 0.40452567972539516),
 ('176', 0.3941413395628363),
 ('177', 0.3941413395628363),
 ('118', 0.39225334500744863),
 ('178', 0.39100469861588333),
 ('168', 0.39002918224374117),
 ('3', 0.38976397654411943),
 ('117', 0.387830100250937)]

Eigenvector centrality

For this measure, an important node is one that is connected to important neighbors. This is very similar to the idea behind Google's PageRank, which counts the number and quality of links to a page to estimate how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

Eigenvector centrality is often referred to as Bonacich's centrality.
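A minimal sketch on a toy path graph (not the hamsterster data) shows how this differs from plain degree: nodes with equal degree can get different eigenvector scores depending on how well connected their neighbors are.

```python
import networkx as nx

# On the path a-b-c-d-e, nodes b, c and d all have degree 2, yet c
# scores highest: its neighbors (b and d) are better connected than
# the neighbors of b (a, a degree-1 node, and c).
P = nx.path_graph(["a", "b", "c", "d", "e"])
ev = nx.eigenvector_centrality(P, max_iter=1000)
top = max(ev, key=ev.get)
print(top)               # 'c'
print(ev["b"] > ev["a"]) # True
```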


In [10]:
eigenvector = nx.eigenvector_centrality(hamsterster)
sorted(eigenvector.items(), key=lambda x: x[1], reverse=True)[:10]



Current flow betweenness centrality

Newman, the author of the measure, calls it "random-walk betweenness". The current flow centrality of a node can be defined as the probability of passing through it on a random walk that starts at some node and ends at some other node (neither of them being the node itself). Newman argues that betweenness for some social networks should not be computed just as a function of shortest paths, but of all paths, assigning the shortest the greatest weight. This is because messages in such networks may get to where they're going not by the shortest path, but by a random path, such as that of a sexual disease through a sexual contact network.

This measure needs connected graphs, so we will first extract the first component of our hamsterster example as a subgraph, and then apply the measure (it can take a while).

*Be patient, my friend*

In [11]:
# connected_component_subgraphs() was removed from recent NetworkX;
# build the subgraph of the largest component directly instead
giant = max(nx.connected_components(hamsterster), key=len)
hamsterster_subgraph = hamsterster.subgraph(giant).copy()
current_flow = nx.current_flow_betweenness_centrality(hamsterster_subgraph)
sorted(current_flow.items(), key=lambda x: x[1], reverse=True)[:10]


Out[11]:
[('237', 0.05922581737437469),
 ('169', 0.05441662536956891),
 ('137', 0.041596720418423654),
 ('238', 0.041310498229222864),
 ('65', 0.03657795826966395),
 ('44', 0.03620653287526644),
 ('251', 0.03571610669668275),
 ('87', 0.034926744563065104),
 ('66', 0.030416985907935196),
 ('23', 0.029590387063215626)]

Activity

Pick another network from KONECT and apply all the centrality measures. Find the set of nodes that appear in the top 100 of every measure.

Drawing centrality

One way to represent centrality is by modifying node size or color. NetworkX's nx.draw() accepts a list of sizes through node_size, a list of RGB colors through node_color, and the list of nodes those values correspond to through nodelist. The sizes must be greater than 1, otherwise you won't see anything.


In [36]:
node_list = list(eigenvector.keys())
node_size = [100 * v for v in eigenvector.values()]
nx.draw(hamsterster, nodelist=node_list, node_size=node_size)


Still doesn't say much. Maybe colouring the nodes? :D

Activity

Plot a network using a centrality measure to modify the colors of the nodes. Scale the colors in such a way that the 25% most central nodes are blue, the next 25% green, the next 25% red, and the final 25% white.

Ego Networks

Ego networks are subnetworks or subgraphs that are centered on a certain node. On Facebook and LinkedIn, these are simply described as "your network", but you can only access your own ego networks, and can't do a broader survey. Having a larger dataset allows us to survey and compare ego networks of various people.

We need to think about the idea of what it means to be connected in a social network, and what network distance means. In the most basic case, a link means that "Alice is friends with Bob," a distance of 2 means "Carol is a friend of a friend of Alice," and a distance of 3 means that "Dave is a friend of a friend of a friend of Alice." Intuitively, we understand that while we know quite a bit about our friends, and we know something about some of our friends' friends, we know nearly nothing about our friends' friends' friends. Formally, this concept is known as a Horizon of Observability, a concept loosely borrowed from physics and the notion of the observable universe.

In online networks such as Twitter, with a significantly looser definition of a "friend" and computer-facilitated retention of information, the useful radius of the ego network is large. However, trust and influence still travel over the less-reliable "wetware" interpersonal channels, so it would be a mistake to consider ego networks with a radius over 3.

Extraction of ego networks is quite simple, as NetworkX provides a ready-made function to do the job. In our hamsterster data set, node 237 ranks high on every centrality measure.


In [46]:
ego = nx.ego_graph(hamsterster, '237')
ego


Out[46]:
<networkx.classes.graph.Graph at 0x7fa7c6f6d0d0>

Knowing the size of an ego network is important to understand the reach of the information that a person can transmit (or, conversely, have access to).


In [48]:
len(ego)


Out[48]:
273

Let's expand the radius up to 2.


In [50]:
len(nx.ego_graph(hamsterster, '237', radius=2))


Out[50]:
1148

Wow! That's quite different. When we expand the radius to 2, friend of a friend, our celeb node 237 is able to reach up to 1,148 nodes.
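To see why the reach explodes with radius, here is a toy sketch (a synthetic balanced tree, not the hamsterster data): when every node has two "friends" below it, the ego network roughly doubles with each extra step.

```python
import networkx as nx

# A balanced binary tree of height 4: node 0 is the root
T = nx.balanced_tree(2, 4)
sizes = [len(nx.ego_graph(T, 0, radius=r)) for r in (1, 2, 3)]
print(sizes)  # [3, 7, 15]
```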

The other metric is called clustering coefficient. Essentially, it measures the proportion of your friends that are also friends with each other (mutual trust). This metric can be applied to entire networks, but in a large network with widely varying densities and multiple cores, average clustering is difficult to interpret. In ego networks, the interpretation is simple: dense ego networks with a lot of mutual trust have a high clustering coefficient. Star networks with a single broadcast node and passive listeners have a low clustering coefficient.
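The two extremes of this metric can be seen on toy graphs (made up for illustration): in a complete graph all of a node's friends know each other, while in a star graph none of the hub's friends do.

```python
import networkx as nx

clique = nx.complete_graph(5)  # everyone knows everyone: mutual trust
star = nx.star_graph(5)        # one broadcaster, five passive listeners

c_clique = nx.average_clustering(clique)
c_star = nx.average_clustering(star)
print(c_clique)  # 1.0
print(c_star)    # 0.0
```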


In [52]:
nx.average_clustering(ego)


Out[52]:
0.3539200721346954

Its clustering coefficient shows that node 237 sits in a trust network of friends, a network where a revolutionary message can easily spread and be sustained.

For the next class: