[Data, the Humanist's New Best Friend](index.ipynb)
*Class 19*

In this class you are expected to learn:

  • Social Network Analysis
  • Small Worlds
  • Random vs Scale Free
  • Modularity and Community Structure
*What if Kevin Bacon never worked with Kevin Bacon?*

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import networkx as nx
plt.rcParams['figure.figsize'] = (12.0, 6.0)

Social Network Analysis

The study of the properties of social networks has been an active area of research in sociology, anthropology, and social psychology for many decades. But due to some seminal work in the 90s, coupled with the explosion of the world wide web and the fresh interest generated by social media, the field has entered a new phase of rapid growth. It has also become clear that understanding social networks involves understanding network properties of relevance in a large number of disciplines. One of the welcome consequences is that there are large amounts of freely available data. A good place to start is Jure Leskovec's data page, known as the Stanford Large Network Dataset (SNAP) collection, which has links to a great variety of networks, ranging from Facebook-style social networks to citation networks to Twitter networks to open communities like LiveJournal. For smaller scale practice networks, try Mark Newman's data page. Mark Newman is a physicist at the University of Michigan and the Santa Fe Institute who has done a lot of important work on networks.

The study of social networks is part of the larger discipline of the study of systems, large entities with many interacting parts that can function cooperatively (note the word can) to get things done that none of the parts could get done alone. Systems include

  • The human body
  • Language (both computer and human)
  • Organizations such as the USSR, a boy scout troop, and an automobile company.
  • The network of power lines across the U.S. and Canada
  • The internet
  • Political blogs on the Internet
  • The protein interactions in a living cell
  • A food web
  • Communities of people

But what about the network part? Why study systems as networks? Systems that have parts that enter into recurring relations may be pictured as networks. The system of the human body has a subsystem called the central nervous system that has parts called neurons connected to other neurons in complex patterns; the way in which one neuron is linked to another, however, is relatively uniform. Internet sites are linked to other internet sites by hyperlinks; companies have a hierarchical structure in which the relation "is a superior of" recurs over and over; power plants, power substations, and transformers are linked together by high-voltage transmission lines; various functional units and organs of the human body are linked together by various kinds of pathways (circulatory, neural, and lymphatic). Species are linked to each other by food dependencies that ecologists call a food web. The proteins in a living cell are linked to each other through a network of interactions.

Small Worlds

The first set of metrics that we have seen is usually called "centrality". There are different types of importance in a network. Who are the celebrities? Who are the gossipmongers? Who are the community bridges? Who is the éminence grise? All these questions have centrality measures at their core: degree centrality, closeness centrality, betweenness centrality, and eigenvector centrality, respectively.
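
As a quick illustration (a sketch that is not part of the original notes), the four measures can be computed directly in NetworkX. The Krackhardt kite graph used here is an arbitrary built-in example; any graph would do.


In [ ]:
# Sketch: the four centrality measures on a small built-in graph.
kite = nx.krackhardt_kite_graph()
print(nx.degree_centrality(kite))       # celebrities: the most connections
print(nx.closeness_centrality(kite))    # gossipmongers: quickest reach to everyone
print(nx.betweenness_centrality(kite))  # community bridges: on many shortest paths
print(nx.eigenvector_centrality(kite))  # éminences grises: connected to the well connected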

Centrality allows us to identify key players and the communities they bridge. Networks with this structure are usually called small-world networks. These kinds of networks keep a local neighborhood structure but allow a small number of ties to reach far away. The prevailing argument in the community is that small-world shapes dominate the social network landscape. These networks consist of dense communities or neighborhoods that are loosely connected by boundary spanners, and these spanners are identified by the centrality measures. Specifically, a small-world network is defined to be a network where the typical distance $L$ between two randomly chosen nodes (the number of steps required) grows proportionally to the logarithm of the number of nodes $N$ in the network, that is:

$$ L \propto \log N $$

In the context of a social network, this results in the small world phenomenon of strangers being linked by a mutual acquaintance.

In NetworkX this is as easy as calling the function average_shortest_path_length(). However, average_shortest_path_length() expects a connected graph, so we first extract the main component of the Hamsterster graph.


In [2]:
hamsterster = nx.read_edgelist("data/petster-friendships-hamster.txt")
hamsterster_component = next(nx.connected_component_subgraphs(hamsterster))
nx.average_shortest_path_length(hamsterster_component)


Out[2]:
3.452640184078649

And the size of our main component is calculated the usual way.


In [3]:
len(hamsterster_component)


Out[3]:
1788
*Isn't it, Scar?*

But with only a single snapshot of the network (no evolution in time), it is really hard to see whether the logarithmic proportion holds. Fortunately, the average shortest path length, $L$, is just one of the two independent structural features of a small world. The other is called the clustering coefficient.

Watts and Strogatz defined what they called small worlds as follows:

A network with a low average shortest path length (APL) and a high clustering coefficient (CC) is called a small world.

The clustering coefficient of a node is the ratio of the number of links connecting a node's neighbors to each other to the maximum possible number of such links. The clustering coefficient of a graph is the average clustering coefficient of all the nodes in the graph. In an ordered (regular lattice) graph, clustering is quite high; the neighbors a node n is connected to are quite likely to be connected to other nodes n is connected to. In a random graph, the clustering coefficient is dramatically reduced, because those predictable links have all been destroyed.
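
To make the definition concrete, here is a minimal sketch (assuming the hamsterster_component from cell [2] is still in memory) that computes the clustering coefficient of a single node both by hand and with nx.clustering(); the best-connected node is picked only to avoid a zero denominator.


In [ ]:
# Sketch: clustering coefficient of one node, from the definition and with NetworkX.
# Assumes hamsterster_component (cell [2]) is still in memory.
degrees = dict(hamsterster_component.degree())
node = max(degrees, key=degrees.get)  # best-connected node, so the denominator is not zero
neighbors = list(hamsterster_component.neighbors(node))
possible = len(neighbors) * (len(neighbors) - 1) / 2
actual = sum(1 for i, u in enumerate(neighbors)
             for v in neighbors[i + 1:]
             if hamsterster_component.has_edge(u, v))
actual / possible, nx.clustering(hamsterster_component, node)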

In social network terms, the clustering coefficient measures the "all-my-friends-know-each-other" property (sometimes summarized as "the friends of my friends are my friends"). The higher the clustering coefficient of a person, the greater the proportion of their friends who all know each other. Thus, our intuitions suggest that social graphs will have high CCs. This property is utterly unlike random graphs. By the same token, it seems, human social networks should have high APLs: if our friends more or less know all the same people we do, then the world is just a collection of cliques, and paths from one clique to another will be hard to come by. In fact, our intuitions are right about CCs, but they are completely wrong about APLs: human social networks actually have low APLs, and in this respect they are like random graphs. Summing it up:

$$\begin{array}{lll} & \mbox{APL} & \mbox{CC}\\ \hline \mbox{Human} & \mbox{low} & \mbox{high} \\ \mbox{Random} & \mbox{low} & \mbox{low} \end{array}$$

Thus, human social networks are small worlds.
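
As a check on the table above, here is a minimal sketch (again assuming hamsterster_component is still in memory): build a random graph with the same number of nodes and edges and compare clustering coefficients. The random twin should score dramatically lower.


In [ ]:
# Sketch: clustering coefficient of the Hamsterster component vs. a random
# graph with the same number of nodes and edges.
n = hamsterster_component.number_of_nodes()
m = hamsterster_component.number_of_edges()
random_twin = nx.gnm_random_graph(n, m)
nx.average_clustering(hamsterster_component), nx.average_clustering(random_twin)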

With this simple definition, W&S succeeded not only in isolating some key properties of social graphs, but in identifying key structural properties that characterized a large class of real world networks of great interest.

W&S demonstrated the applicability of their definition of small worlds with three very different examples:

  • Collaboration network of Hollywood actors
  • The power grid of the Western United States
  • The neural network of the worm Caenorhabditis elegans

Each of these is a small-world graph, and each is a proxy for an important class of networks. As an example, let's take the KONECT Movies network dataset, actors.txt, and calculate both APL and CC.


In [4]:
actors = nx.read_edgelist("data/actors.txt")

In [5]:
actors_component = next(nx.connected_component_subgraphs(actors))
actors_component.size()


Out[5]:
281384

In [ ]:
# This will take forever, graphs are hard-core mathematics! Be patient
nx.average_shortest_path_length(actors_component)

In [6]:
nx.average_clustering(actors_component)


Out[6]:
0.00037324806955060845

Random vs. Scale-free

We have already seen degree distributions. The reason to care about them is that we can compare them with known statistical distributions to learn things about the structural properties of graphs. The degree distribution is one of a number of global properties of a graph that say something about how orderly its functioning can be.

We turn next to one of the key tools in studying the emergence of order in a complex system: studying random systems. In particular, random graphs.

In a series of ground-breaking papers published in 1959-1968, Paul Erdős and Alfréd Rényi investigated the properties of random graphs and laid the foundations of a new area of graph theory. Our chief interest in discussing some of the properties of random graphs here is to compare those properties with those of social networks. Comparing orderly systems to random systems gives us some insight into what properties are key to the emergence of order.

NetworkX includes several random graph generators. The next code generates a random graph following the procedure described by Erdős and Rényi:

  1. Choose a probability p for edge creation. The value of p will determine how densely connected the graph is.
  2. Start with a set of nodes, but with no connections between them.
  3. Pick two nodes at random and add an edge between them with probability p.

In [7]:
er = nx.erdos_renyi_graph(100, 0.75)
nx.draw_spectral(er)


The previous cell draws the graph with a spectral layout; a force-spring layout (nx.draw_spring()) would instead pull linked nodes toward the center. If we watched the Erdős-Rényi process unfold, adding edges one at a time, we would see components begin to merge whenever an edge happens to be drawn between two of their nodes, until nearly all the nodes have been merged into one giant component.

Since every link added increases the degree of two nodes, the average degree rises twice as fast as the number of links per node. A somewhat surprising result derived by Erdős & Rényi is that after the average degree of a node exceeds one, the growth of the giant component must accelerate very rapidly, leading in very few steps to the creation of a single giant component. This sudden global transformation at a very precise parameter value is called a phase transition, and the value of the parameter (average degree) is called a critical point. Phase transitions are studied by physicists concerned with phenomena like the transition of a material from a liquid to a solid state at its freezing point. So random graphs play a key role in the mathematics of critical phenomena, which is called percolation theory. This is one example of the strong conceptual connections between the study of large networks and statistical physics.
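
A minimal sketch of the phase transition (the node count and degree values are arbitrary choices): generate random graphs whose average degree crosses the critical point of 1 and watch the size of the largest component jump.


In [ ]:
# Sketch: size of the largest connected component as the average degree
# crosses the critical point (average degree = 1).
n = 1000
for avg_degree in [0.5, 0.8, 1.0, 1.2, 2.0, 4.0]:
    g = nx.erdos_renyi_graph(n, avg_degree / n)
    giant = max(nx.connected_components(g), key=len)
    print(avg_degree, len(giant))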

A less surprising derived result has to do with the degree distribution of a random graph. The degree distribution of a random graph is approximately a normal distribution, or rather its discrete counterpart, a binomial distribution. This tells us that real world networks are not generated by the kind of process that generates a random graph, as they usually don't follow normal distributions by degree.
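
A small sketch of that claim, using the random graph er generated above: its degree histogram should look like a discrete bell curve.


In [ ]:
# Sketch: degree histogram of the random graph generated above; it should
# resemble a (discrete) bell curve, i.e. a binomial distribution.
degrees = list(dict(er.degree()).values())
plt.hist(degrees, bins=20)
plt.xlabel("degree")
plt.ylabel("number of nodes")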

Activity

Plot the histogram for the degree distribution of the main component of the actors network.

Modularity and Community Structure

In 1977, anthropologist Wayne Zachary published a study of a small karate club. He represented the findings of his ethnographic study as a graph, defined as follows:

A line is drawn between two points when the two individuals being represented consistently interacted in contexts outside those of karate classes, workouts, and club meetings.

NetworkX includes this dataset. Let's take a look at it by using a circular layout algorithm for positioning the nodes.


In [8]:
karate = nx.karate_club_graph()
nx.draw_circular(karate)


Using the circular layout it seems like there is no hidden structure. However, if we apply a Fruchterman-Reingold layout (.draw_spring() in NetworkX), the results are different.


In [9]:
karate_positions = nx.fruchterman_reingold_layout(karate)
nx.draw(karate, scale=1.0, pos=karate_positions)


Graph layout algorithms use no information about the nodes. That is, the two communities that start to be apparent now are discovered by the layout algorithm because of the structure of their mutual links.

Let's see what happens if we add coloring according to the club attribute in the nodes' data.


In [10]:
faction1_col, faction2_col = 'coral', 'lightblue'
# Now, using the faction membership annotations from Zachary's data, change the known faction2 nodes to faction2_col.
node_color = [faction1_col] * len(karate.nodes())
node_dict = dict(karate.nodes(data=True))
for n in karate.nodes():
    if node_dict[n]['club'] == 'Officer':
        node_color[n] = faction2_col

new_labels = dict((x, x + 1) for x in karate.nodes())
nx.draw_networkx_labels(karate, karate_positions, new_labels, font_size=10, font_color='black')
# Highlight the two leaders (nodes 0 and 33) with larger grey markers.
nx.draw_networkx_nodes(karate, karate_positions, {0: 0, 33: 33}, node_color=['grey', 'grey'], node_size=500)
# Now draw all the nodes, including leaders, using faction color scheme.
nx.draw_networkx_nodes(karate, karate_positions, new_labels, node_color=node_color)
nx.draw_networkx_edges(karate, karate_positions)


Out[10]:
<matplotlib.collections.LineCollection at 0x7ffb4ed8feb8>

When we add the coloring, we see that the two communities correspond very closely to the factions in the graph. There are two key nodes, node 1 (actually the karate club instructor) and node 34 (actually the club president), which are central to different groups within the club. As can be seen, the two subgroups Zachary found (which he called factions) correspond closely to the grouping the force-spring algorithm finds, based on the interconnectedness of the nodes.

The layout algorithm used is the Fruchterman-Reingold algorithm, which is one of a family of spring-directed algorithms. The idea of such algorithms is to assume two competing forces, a repulsive force driving all nodes apart and an attractive force keeping the nodes linked by edges together. You can think of the edges as strong springs. Such algorithms are good at discovering symmetries.

We would like to see whether it is possible to define the two factions purely on the basis of structural properties of the graph.

Centrality measures can tell us what nodes might be at the centers of communities, but they don't directly tell us what those communities are. What we need beyond that is an analytical method for splitting a network into communities (a divisive algorithm) — or conversely, a method for collecting individual nodes into larger and larger communities (an agglomerative algorithm). We will sketch one of the many algorithms that exist as an example, choosing an elegant divisive algorithm known as the Girvan-Newman method that depends on computing edge-betweenness values. The method relies on the idea that edges of high edge-betweenness tend to be community-spanning.

  1. Find the edge of highest betweenness — or multiple edges of highest betweenness, if there is a tie — and remove these edges from the graph. This may cause the graph to separate into multiple components. If so, these are the largest communities in the partitioning of the graph.
  2. With these crucial edges missing, the betweenness levels of all edges may have changed, so recalculate all betweennesses, and again remove the edge or edges of highest betweenness. This may break some of the existing components into smaller components; if so, these are communities nested within the larger communities found thus far.
  3. Proceed in this way as long as edges remain in the graph, in each step recalculating all betweennesses and removing the edge or edges of highest betweenness.

This very elegant algorithm thus finds the largest communities first, and then successively smaller communities nested within those. As simple as it is to describe in words, it is quite computationally costly, because recomputing the betweenness values requires finding the shortest paths between all pairs of remaining nodes at each step.

In Python, using NetworkX, it looks like this.


In [11]:
def girvan_newman(G):
    # Work on a copy so that the caller's graph keeps all of its edges.
    G = G.copy()

    if len(G.nodes()) == 1:
        return [G.nodes()]

    def find_best_edge(G0):
        """
        Networkx implementation of edge_betweenness
        returns a dictionary. Make this into a list,
        sort it and return the edge with highest betweenness.
        """
        eb = nx.edge_betweenness_centrality(G0)
        return sorted(eb.items(), key=lambda x: x[1], reverse=True)[0][0]

    components = list(nx.connected_component_subgraphs(G))

    while len(components) == 1:
        G.remove_edge(*find_best_edge(G))
        components = list(nx.connected_component_subgraphs(G))

    result = [c.nodes() for c in components]

    for c in components:
        result.extend(girvan_newman(c))

    return result

Let's try it with the karate network.


In [12]:
community1, community2 = girvan_newman(karate)[:2]
community1, community2


Out[12]:
([0, 1, 3, 4, 5, 6, 7, 10, 11, 12, 13, 16, 17, 19, 21],
 [32, 33, 2, 8, 9, 14, 15, 18, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31])

This corresponds very closely to Zachary's data-driven split, but not exactly. Nodes 3 and 9, the boundary nodes that seem to participate in the most connections between both groups, have been placed in the group centered round 34 instead of the group centered round 1.

The notion of local bridge, or community-spanning edge, or high betweenness edge, corresponds very closely to an important concept in sociology known as a weak tie. The point is that members of Ego's (as in ego-network subgraph) own densely knit group are less likely to know something Ego doesn't know, or to have contacts Ego doesn’t already have.

Activity

Write the code to extract the different communities in the actors network and the number of nodes in each one. Try also using Louvain's method (introduced in [assignment 3](assignment3.ipynb)), and compare the results.

For the next class