In this class you are expected to learn:
In [1]:
# We will need this later
%pylab inline
pylab.rcParams['figure.figsize'] = (12.0, 6.0)
The social scientist John Barnes once described graph theory as a "terminological jungle, in which any newcomer may plant a tree". We now turn to some more of the fundamental concepts and definitions surrounding graphs.
First, let's take a second look at our hamsterster network data set.
In [2]:
import networkx as nx
hamsterster = nx.read_edgelist("data/petster-friendships-hamster.txt")
nx.draw(hamsterster)
It's hard to tell by looking at the image. How many components do you think the graph has? Let's find out by using NetworkX's nx.connected_components().
In [3]:
hamsterster_components = nx.connected_components(hamsterster)
len(list(hamsterster_components))
Out[3]:
And how big is the biggest, and how small the smallest? Because nx.connected_components() returns a generator, we need to either save its output for later or invoke it again every time we need to do something with the connected components.
In [4]:
hamsterster_components = nx.connected_components(hamsterster)
[len(c) for c in hamsterster_components]
Out[4]:
In this case we can say that the network has one giant component with 1788 nodes.
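Because nx.connected_components() returns a generator, a common pattern is to materialize it into a list once and then reuse it. A minimal sketch on a toy two-component graph (not the hamsterster data):

```python
import networkx as nx

# A toy graph with two components: {1, 2, 3} and {4, 5}
g = nx.Graph([(1, 2), (2, 3), (4, 5)])

# Materialize the generator once so the result can be reused
components = list(nx.connected_components(g))

len(components)                  # → 2 components
max(len(c) for c in components)  # → 3, the size of the "giant" component
```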
Centrality of a vertex measures its relative importance within the graph. There are many ways of calculating centrality, each one with a slightly different meaning.
However, there is little agreement on what centrality actually means. In practice, the first answer is usually "it depends." It depends on the data—what does it mean that two people are connected? Did they just "friend each other" on a social networking site—or do they have a deeper connection? It depends on whether the links mean information exchange, or responsibility (as in the case of organizational networks). It depends on the understanding of power and influence that is desired as an output. In this class, we will explore the four classic centrality metrics.
The idea of degree centrality is the simplest. The degree centrality of a node is just its degree, that is, the number of neighbors it has. The more connected a node is to other nodes, the higher its degree centrality. In a sense, it identifies the "celebrities" of the graph.
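One detail worth noting: nx.degree_centrality() returns the degree normalized by n - 1, the maximum possible degree, so values fall between 0 and 1. A quick sanity check on a toy star graph, not the hamsterster data:

```python
import networkx as nx

# A star with hub 0 and three leaves: the hub touches every other node
g = nx.star_graph(3)

dc = nx.degree_centrality(g)
# NetworkX normalizes by (n - 1), so the hub scores 3/3 = 1.0
# and each leaf scores 1/3
```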
In [5]:
degree = nx.degree_centrality(hamsterster)
Once we have calculated degree centrality, we sort the results to see which nodes are most central, showing the first 10.
In [6]:
sorted(degree.items(), key=lambda x: x[1], reverse=True)[:10]
Out[6]:
In [7]:
degree_hist = plt.hist(list(degree.values()), 100)
plt.loglog(degree_hist[1][1:], degree_hist[0])
plt.title("Degree histogram (loglog)")
Out[7]:
It seems like there is an interesting cluster. Let's try to trim the graph and draw it.
In [8]:
# Return a new graph object that contains the network with pendant and
# isolated nodes removed
def trim_degrees(graph, degree=1):
    g = graph.copy()
    # Snapshot the degrees once, so removals don't shift the threshold
    d = dict(nx.degree(g))
    # Iterate over a copy of the node list: removing nodes while
    # iterating over the live view would raise an error
    for n in list(g.nodes()):
        if d[n] <= degree:
            g.remove_node(n)
    return g
len(trim_degrees(hamsterster))
Out[8]:
In [9]:
len(trim_degrees(hamsterster, degree=25))
Out[9]:
In [10]:
nx.draw(trim_degrees(hamsterster, degree=25))
Degree centrality is clearly significant, but high degree centrality does not correspond very well with cropping up often on paths between other nodes in the network. This is better captured by betweenness centrality. Betweenness centrality has to do with how often a node crops up when taking a best path (there may be more than one) between two other nodes. The betweenness centrality of a node is defined as the proportion of best paths between any other pairs of nodes which pass through it. By definition, the betweenness centrality of a node depends on the connection properties of every pair of nodes in the graph, except pairs that include it, whereas degree centrality depends only on the node's neighbors.
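The definition is easy to verify by hand on a graph small enough to enumerate the paths. A sketch on a five-node path, not the hamsterster data:

```python
import networkx as nx

# Path 0-1-2-3-4: the middle node sits on the most shortest paths
p = nx.path_graph(5)
bc = nx.betweenness_centrality(p)  # normalized by the (n-1)(n-2)/2 pairs

# Node 2 lies on 4 of the 6 pairs of other nodes:
# (0,3), (0,4), (1,3) and (1,4), so its score is 4/6.
# The endpoints lie between no pairs at all, so they score 0.
```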
A very closely related notion is the notion of betweenness for edges. Given the set of shortest paths between all node pairs, the betweenness centrality of an edge is simply the proportion of those which pass through it. To get a better feel for the intuition, it often helps to think of a network as having traffic: Travelers start out from each node with their destinations evenly divided among the other nodes, always choosing best paths for their journeys. In this situation, the edge with the highest betweenness will have the most traffic. Therefore, a good path calculator should avoid those nodes with high betweenness centrality while keeping a good total length of the path chosen. Not an easy problem.
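The traffic intuition is easiest to see on a "barbell": two dense clusters joined by a single bridge, which all cross-cluster traffic must use. A sketch on a toy graph, not the hamsterster data:

```python
import networkx as nx

# Two triangles (nodes 0-2 and 3-5) joined by the bridge edge (2, 3)
g = nx.barbell_graph(3, 0)

eb = nx.edge_betweenness_centrality(g)
# Every shortest path between the two triangles crosses the bridge,
# so (2, 3) has the highest edge betweenness
```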
In [11]:
betweenness = nx.betweenness_centrality(hamsterster)
In [12]:
sorted(betweenness.items(), key=lambda x: x[1], reverse=True)[:10]
Out[12]:
Activity
Write a function, `centrality_scatter(cent1, cent2)`, that receives two centrality dictionaries and plots each node as a labelled point, using each dictionary as one of the axes. Add a linear best-fit trend line, axis labels, and a title.
Ability to move information from one side of the network to another (i.e., gossip) is an important step towards establishing a shared perception of the world, whether it has to do with someone's choice of outfits at a party or the formation of a political movement. Therefore, an important node is typically "close" to, and can communicate quickly with, the other nodes in the network. That's the main idea behind closeness centrality.
Closeness centrality is based on distance calculation and tries to quantify the intuitive notion of what position a node occupies, if central or peripheral.
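Concretely, NetworkX computes the closeness of a node as (n - 1) divided by the sum of its shortest-path distances to all other nodes. A hand-checkable sketch on a three-node path, not the hamsterster data:

```python
import networkx as nx

# Path 0-1-2: the middle node is closest to everyone else
p = nx.path_graph(3)
cc = nx.closeness_centrality(p)  # (n - 1) / sum of distances

# Node 1: distances 1 + 1, so 2/2 = 1.0
# Node 0: distances 1 + 2, so 2/3
```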
In [13]:
closeness = nx.closeness_centrality(hamsterster)
sorted(closeness.items(), key=lambda x: x[1], reverse=True)[:10]
Out[13]:
For this measure, an important node is connected to important neighbors, which is very similar to the idea behind Google's PageRank. PageRank works by counting the number and quality of links to a page to produce a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
Eigenvector centrality is often referred to as Bonacich's centrality.
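A small sketch of the "important neighbors" idea with nx.eigenvector_centrality(), on a toy graph rather than the hamsterster data (the graph contains a triangle, so it is not bipartite and the power iteration converges):

```python
import networkx as nx

# A triangle a-b-c, plus a pendant node d hanging off a
g = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"), ("a", "d")])

ev = nx.eigenvector_centrality(g)
# a has the most, and the best-connected, neighbors; d only touches a,
# so a scores highest and d lowest
```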
In [14]:
eigenvector = nx.eigenvector_centrality(hamsterster)
sorted(eigenvector.items(), key=lambda x: x[1], reverse=True)[:10]
Out[14]:
Newman, the author of the measure, calls it "random walk centrality". The current-flow centrality of a node can be defined as the probability of passing through it on a random walk that starts at some node and ends at some other node (neither of them the node itself). Newman argues that betweenness for some social networks should not be computed just as a function of shortest paths, but of all paths, assigning the shortest the greatest weight. This is because messages in such networks may get to where they're going not by the shortest path but by a random one, such as the spread of a sexual disease through a sexual contact network.
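One way to see the relationship between the two measures: on a tree every pair of nodes has a single path, so a random walk has nowhere else to go and current-flow betweenness coincides with shortest-path betweenness. A sketch on a small path graph (note that NetworkX's current-flow functions require SciPy):

```python
import networkx as nx

# On a path (a tree) all betweenness notions agree: there is only one
# route between any two nodes
p = nx.path_graph(5)
cf = nx.current_flow_betweenness_centrality(p)
bc = nx.betweenness_centrality(p)

# The middle node is the most central under both measures,
# with (numerically) identical scores
```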
This measure needs connected graphs, so we will first extract the first component of our hamsterster example as a subgraph, and then apply the measure (it can take a while).
In [15]:
# next() returns the next element of an iterator, so the first time it will be the first element
hamsterster_subgraph = hamsterster.subgraph(next(nx.connected_components(hamsterster)))
current_flow = nx.current_flow_betweenness_centrality(hamsterster_subgraph)
sorted(current_flow.items(), key=lambda x: x[1], reverse=True)[:10]
Out[15]:
Activity
Pick another network from KONECT and apply all the centrality measures. Find the set of nodes that appear in the top 10 for every measure.
One way to represent centrality is by modifying node size or color. NetworkX's nx.draw() accepts a list of sizes as node_size, a list of RGB colors as node_color, and the list of nodes to draw as nodelist. The sizes must be greater than 1, otherwise you won't see anything.
In [16]:
node_list = list(eigenvector.keys())
node_size = [100 * v for v in eigenvector.values()]
nx.draw(hamsterster, nodelist=node_list, node_size=node_size)
Still doesn't say much. Maybe colouring the nodes? :D
Activity
Plot a network using a centrality measure to modify the colors of the nodes (google it!). Scale the colors in such a way that 25% of highest central nodes are blue, 25% next are green, 25% next are red, and the final 25% are white.
Ego networks are subnetworks or subgraphs that are centered on a certain node. On Facebook and LinkedIn, these are simply described as "your network", but you can only access your own ego network, and can't do a broader survey. Having a larger dataset allows us to survey and compare ego networks of various people.
We need to think about the idea of what it means to be connected in a social network, and what network distance means. In the most basic case, a link means that "Alice is a friend of Bob," a distance of 2 means "Carol is a friend of a friend of Alice," and a distance of 3 means that "Dave is a friend of a friend of a friend of Alice." Intuitively, we understand that while we know quite a bit about our friends, and we know something about some of our friends' friends, we know nearly nothing about our friends' friends' friends. Formally, this concept is known as a Horizon of Observability, a concept loosely borrowed from physics and the notion of the observable universe.
In online networks such as Twitter, with a significantly looser definition of a "friend" and computer-facilitated retention of information, the useful radius of the ego network is large. However, trust and influence still travel over the less-reliable "wetware" interpersonal channels, so it would be a mistake to consider ego networks with a radius over 3.
Extraction of ego networks is quite simple, as NetworkX provides a ready-made function to do the job. In our hamsterster data set, the node 237 is always high in centrality.
In [17]:
ego = nx.ego_graph(hamsterster, '237')
ego
Out[17]:
Knowing the size of an ego network is important to understand the reach of the information that a person can transmit (or, conversely, have access to).
In [18]:
len(ego)
Out[18]:
Let's expand the radius up to 2.
In [19]:
len(nx.ego_graph(hamsterster, '237', radius=2))
Out[19]:
Wow! That's quite different. When we expand the radius to 2, friend of a friend, our celeb node 237 is able to reach up to 1,148 nodes.
The other metric is called clustering coefficient. Essentially, it measures the proportion of your friends that are also friends with each other (mutual trust). This metric can be applied to entire networks, but in a large network with widely varying densities and multiple cores, average clustering is difficult to interpret. In ego networks, the interpretation is simple: dense ego networks with a lot of mutual trust have a high clustering coefficient. Star networks with a single broadcast node and passive listeners have a low clustering coefficient.
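The two extremes are easy to check on toy graphs: a complete graph, where everyone's friends all know each other, and a star, where none of the hub's friends know each other. A sketch (not the hamsterster data):

```python
import networkx as nx

# Complete graph: every pair of anyone's neighbors is connected
clique = nx.complete_graph(4)
# Star: the hub's neighbors share no edges, and the leaves have a
# single neighbor (NetworkX defines their clustering as 0)
star = nx.star_graph(4)

nx.average_clustering(clique)  # → 1.0
nx.average_clustering(star)    # → 0.0
```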
In [20]:
nx.average_clustering(ego)
Out[20]:
Its clustering coefficient shows that node 237 is in a trust network of friends where approximately 1 pair of friends out of 3 is connected. It's a network where a revolutionary message could be spread and sustained, although the speed of the messaging could be improved by creating new edges.