Network Analysis


Introduction

Networks are mathematical or graphical representations of patterns of relationships between entities. These relationships are defined by some measure of "closeness" between individuals, and can exist in an abstract or actual space (for example, whether you are related to someone versus how far away you live from each other). Networks have been used to model everything from airplane traffic to supply chains, and even amorphous materials like window glass, cells, and proteins. They can also be used to model relationships among people. Social networks are patterns of relationships among people or organizations that affect and are affected by actions of individuals within the network. Network analysis captures the effect of the complete pattern of connections among individuals in a group to help us perform structural analysis of outcomes of interest for individuals and the group as a whole.

Networks can be represented as graphs, where a graph is made up of nodes connected by ties. The flexibility of network analysis means that the first step toward analysis is to clearly define what constitutes a node and what constitutes a tie in your network. There are several type of graphs: connected, unconnected, directional, and many more (see glossary for a list of terms).

This tutorial is based on Chapter 8 of Big Data and Social Science.

Glossary of Terms

  • A node is an individual entity within a graph.

  • A tie is a link between nodes. Ties can be undirected, meaning they represent a symmetrical relationship, or directed, meaning they represent an asymmetrical relationship (one that doesn't necessarily go both ways).

  • A directed tie is known as an arc. An undirected tie is known as an edge. tth Barack Obama, then he is also Facebook friends with me.

  • A cutpoint is a node that cannot be removed without disconnecting the network.

  • A bridge is a tie that cannot be removed without disconnecting the network.

  • Two nodes are said to be reachable when they are connected by an unbroken chain of relationships through other nodes.

  • Network density is the number of actual connections in a network divided by the number of potential connections in that network.

  • Average distance is the average path length between nodes in a graph. It is a measure of how many nodes it takes to transmit information across the network. This metric is only valid for fully connected graphs.

  • Centrality is the degree to which a given node influences the entire network.

Table of Contents

  1. Loading the Data
  2. Representations of Networks
    1. Adjacency Matrix
    2. List of Edges
    3. Graphs
  3. Network Measures
    1. Summary Statistics
    2. Degree Distribution
    3. Components and Reachability
    4. Path Length
  4. Centrality Metrics
    1. Degree Centrality
    2. Closeness Centrality
    3. Betweenness Centrality
  5. Cliques
  6. Community Detection
  7. Exercises
  8. Resources

In [ ]:
%pylab inline
from __future__ import print_function
import sys
import community
import networkx as nx
import seaborn as sns
import pandas as pd
from sqlalchemy import create_engine

Creating a Network

The first step in creating a network is defining the question or questions we want to explore using the network. This then allows us to define what a node and tie will be.

Loading the Data

In this tutorial we will explore graphical representations of this network, degree metrics, centrality metrics, how to calculate the shortest path between nodes, and community detection. We will be using the NetworkX Python Library developed at Los Alamos National Laboratory (LANL).

First we have to load the data from the database. Note we did the hard work of creating the network in SQL and now doing our more complex analysis in Python.


In [ ]:
engine = create_engine("")
df_network = pd.read_sql('',
                        engine)
df_network.head()

In [ ]:
network = list(zip(df_network.ssn_l, df_network.ssn_r))

In [ ]:
G = nx.Graph()
G.add_edges_from(network)

Representations of Networks

Adjacency Matrix

One way to represent networks is an adjacency matrix, a binary (all entries either 0 or 1) square matrix. Each row represents the connections between one node and the other nodes in the network. For instance, the first row represents the first node. Each entry in a row corresponding to a node represents possible connections to the other nodes as indicated by 1 (connected) or 0 (not connected).


In [ ]:
plt.figure(figsize=(30,30))
plt.spy(nx.adjacency_matrix(G))

List of Edges

Graphs can also be represented as edge lists, where you list the connections between nodes exhaustively. If we know the graph is undirected, we only need to list each relationship one time. For example, we say that 1 is connected to 32, but it would be redundant to also say that 32 is connected to 1. Representing a network as an edge list is typically preferable to an adjacency matrix in the case of a sparse matrix -- where most of the entries of the matrix are 0 due to taking much less space to store. An edge list is typically how a network is stored in a database.


In [ ]:
network[:10]

Graphs

Networks can also be displayed as graphs, which is probably the most intuitive way to visualize them. The top visualization below emphasizes the nodes, or individuals, how close they are to one another, and the groups that emerge.

The visualization below emphasizes the edges, or the connections themselves. Note: this network is too large to visualize

nx.draw(G)

Due to the large number of nodes this visualization is not helpful. Given that we can't derive much information from this particular visualization we need to turn to other network measures.

Network Measures

It is useful to know the size (in terms of nodes and ties) of the network, both to have an idea of the size and connectivity of the network, and because most of the measures you will use to describe the network will need to be standardized by the number of nodes or the number of potential connections.

One of the most important things to understand about larger networks is the pattern of indirect connections among nodes, because it is these chains of indirect connections that make the network function as a whole, and make networks a useful level of analysis. Much of the power of networks is due to indirect ties that create reachability. Two nodes can reach each other if they are connected by an unbroken chain of relationships, often called indirect ties.

Structural differences between node positions, the presence and characteristics of smaller "communities" within larger networks, and properties of the structure of the whole group can be quantified using different network measures.

Summary Statistics


In [ ]:
# Print out some summary statistics on the network
print( nx.info(G) )

We see that there are 568892 ties (relationships) and 13716 nodes (individuals).

The average degree of the network is the average number of edges connected to each node.

We see that the average degree of this network is 83., meaning that the average individual in the network is connected to 83 individuals.


In [ ]:
# Print out the average density of the netwo
print(nx.density(G))

The average density is calculated as the $$\text{average density} = \frac{\text{actual ties}}{\text{possible number of ties}} $$

where the possible number of ties for an undirected graph (if every node had a tie to every other node) is $\frac{n(n-1)}{2}$.

If every node were connected to every other node, the average density would be 1. If there were no ties between any of the nodes, the average density would be 0. The average density of this network is 0.0006, which indicates it is not a very dense network. In this example, we can interpret this to mean that individuals are mostly in small groups, and the groups don't overlap very much.

Now that we have looked at some summary statistics as a whole we are going to drill down to the individual actors in our network.

Degree Distribution (Who has the most relationships?)

We can cast this question as a network analysis problem by asking which node has the most ties.


In [ ]:
dict_degree = G.degree()
df_degree = pd.DataFrame.from_dict(dict_degree, orient='index')
df_degree.columns=['degree']
df_degree.index.name = 'node_id'

In [ ]:
sns.set_style("whitegrid")
plt.figure(figsize=(22, 12))
sns.set_context("poster", font_scale=1.00, rc={"lines.linewidth": 1.00,"lines.markersize":8})
df_degree.sort_values(by='degree', ascending=False)[:10].plot(kind='barh')

In [ ]:
df_degree.sort_values(by='degree', ascending=False)[:10]

In [ ]:
G.neighbors()

Components and Reachability

Two nodes are said to be reachable when they are connected by an unbroken chain of relationships through other nodes. Networks in which more of the possible connections (direct and indirect) among nodes are realized are denser and more cohesive than networks in which fewer of these connections are realized.

The reachability of individuals in a network is determined by membership in components, which are subsets of the larger network in which every member of the group is indirectly connected to every other. Imagining the standard node and line drawing of a graph, a component is a portion of the network where you can trace a path between every pair of nodes without ever lifting your pen.

Many larger networks consist of a single dominant component including anywhere from 50% to 90% of the individuals, and a few smaller components that are not connected. In this case, is common to perform analysis on only the main connected component of the graph, because there is not a convenient way to mathematically represent how "far away" unconnected nodes are. In our karate class example, our graph is connected, meaning that you can reach any individual from any other individual by moving along the edges of the graph, so we don't need to worry about that problem.

Path Length

A shortest path between two nodes is a path from one node to the other, not repeating any nodes. One way to think of a shortest path between two individuals is how many people it would take to broker an introduction between them (think six degrees of Kevin Bacon).

Most pairs will have several "shortest paths" between them; the longest shortest path is called the geodesic.


In [ ]:
# Calculate the length of the shortest path between 12 and 15
ls_path = nx.shortest_path(G,)
print('The path length from {} to {} is {}.'.format(
                                            '',
                                            '',
                                            len(ls_path)))
print('path length: ', ls_path)

In this case there is no path between the two nodes.


In [ ]:
# Calculate the length of the shortest path between 12 and 15
ls_path = nx.shortest_path(G, '',
                           '')
print('The path length from {} to {} is {}.'.format(
                                            '',
                                            '',
                                            len(ls_path)))
print('path length: ', ls_path)

The average shortest path length describes how quickly information or goods can disburse through the network.

The average shortest length $l$ is defined as $$ l = \frac{1}{n(n-1)} \sum_{i \ne j}d(v_{i},v_{j}) $$ where $n$ is the number of nodes in the graph and $d(v_{i},v_{j})$ is the shortest path length between nodes $i$ and $j$.


In [ ]:
print(nx.average_shortest_path_length(G))

In this case, we cannot calculate the average shortest path, since our network is not fully connected (the network has islands within it that are cut off from the rest of the network). Since there is no way to calculate the distance between two nodes that can't be reached from one another, there is no way to calculate the average shortest distance across all pairs.

Centrality Metrics

Centrality metrics measure how important, or "central," a node is to the network. These can indicate what individual has the most social contacts, who is closest to people, or the person where information most transfers through. There are many centrality metrics -- degree centrality, betweenness centrality, closeness centrality, eigenvalue centrality, percolation centrality, PageRank -- all capturing different aspects of a node's contribution to a network.

Centrality measures are the most commonly used means to explore network effects at the level of certain individual participants. Typically, these metrics identify and describe a few important nodes, but don't tell us much about the rest of the nodes in the network. This is akin to Google's search results: the first few matches are the most relevant, but if you go a few pages in to the search results, you might as well have been searching for something else entirely.

Degree Centrality (Who has the most relationships?)

The most basic and intuitive measure of centrality, degree centrality, simply counts the number of ties that each node has. Degree centrality represents a clear measure of the prominence or visibility of a node. The degree centrality $C_{D}(x)$ of a node $x$ is

$$C_{D}(x) = \frac{deg(x)}{n-1}$$

where $deg(x)$ is the number of connections that node $x$ has, and $n-1$ is a normalization factor for the total amount of possible connections.

If a node has no connections to any other nodes, its degree centrality will be 0. If it is directly connected to every other node, its degree centrality will be 1.


In [ ]:
dict_degree_centrality = nx.degree_centrality(G)
df_degree_centrality = pd.DataFrame.from_dict(dict_degree_centrality, orient='index')
df_degree_centrality.columns=['degree_centrality']
df_degree_centrality.index.name = 'node_id'

In [ ]:
df_degree_centrality.sort_values(by='degree_centrality',
                                 ascending=False)[:10].plot(kind='barh')

As we can see, this is simply a recasting of the degree distribution.

Closeness Centrality (Who has the shortest of shortest paths going between them?)

Closeness centrality is based on the idea that networks position some individuals closer to or farther away from other individuals, and that shorter paths between actors increase the likelihood of communication, and consequently the ability to coordinate complicated activities. The closeness centrality $C_C(x)$ of a node $x$ is calculated as:

$$C_C(x) = \frac{n-1}{\sum_{y}d(x,y)} $$

where $d(x,y)$ is the length of the geodesic between nodes $x$ and $y$.


In [ ]:
dict_closeness_centrality = {}

In [ ]:
for ssn_hash in zip(*network[:25])[0]:
    dict_closeness_centrality[ssn_hash] = nx.closeness_centrality(G,u=ssn_hash)

In [ ]:
df_closeness_centrality = pd.DataFrame.from_dict(dict_closeness_centrality,
                                                 orient='index')
df_closeness_centrality.columns=['closeness_centrality']
df_closeness_centrality.index.name = 'node_id'

In [ ]:
df_closeness_centrality.sort_values(by='closeness_centrality',
                                   ascending=False)[:10].plot(kind='barh')

Betweenness Centrality (Who has the most shortest paths between them?)

Where closeness assumes that communication and information flow increase with proximity, betweenness centrality captures "brokerage," or the idea that a node that is positioned "in between" many other pairs of nodes gains some individual advantage. To calculate betweenness, we must assume that when people search for new information through networks, they are capable of identifying the shortest path (so that we know that the path between two nodes actually includes the "in between" node); additionally, we must assume that when multiple shortest paths exist, each path is equally likely to be chosen.

The betweenness centrality $C_B(x)$ of a node $x$ is given by

$$ C_B{x} = \sum_{s,t} \frac{\sigma_{st}(x)}{\sigma_{st}}$$

where $\sigma_{st}$ is the number of shortest paths from node $s$ to node $t$ and $\sigma_{st}(x)$ is the number of shortest paths $\sigma_{st}$ passing through node $x$. Intuitively, for each node, we look at how many of the shortest paths between every other pair of nodes includes that node.


In [ ]:
dict_betweenness_centrality = nx.betweenness_centrality(G, k=50)
df_betweenness_centrality = pd.DataFrame.from_dict(dict_betweenness_centrality,
                                                 orient='index')
df_betweenness_centrality.columns=['betweeness_centrality']
df_betweenness_centrality.index.name = 'node_id'

In [ ]:
df_betweenness_centrality.sort_values(by='betweeness_centrality',
                                   ascending=False)[:10].plot(kind='barh')

Given the small values for betweenness centrality, it appears that there is no large single broker in this network.

Cliques

A clique is a maximally connected sub-network, or a group of individuals who are all connected to one another.

In our case, this would be a group of individuals that are all connected to each other: all released in the same month, to the same zipcode, with the same gang affiliation. We might expect to see a lot of cliques in this network, because we defined the relationships within our network based on these groupings.


In [ ]:
cliques = list(nx.find_cliques(G))

In [ ]:
import functools

In [ ]:
#summary stats of cliques
num_cliques = len(cliques)
ls_len_cliqs = [len(cliq) for cliq in cliques ]
max_clique_size = max(ls_len_cliqs)
avg_clique_size = np.mean(ls_len_cliqs)
max_cliques = [c for c in cliques if len(c) == max_clique_size]
max_clique_sets = [set(c) for c in max_cliques]
people_in_max_cliques = list(functools.reduce(lambda x,y: x.intersection(y), max_clique_sets))

In [ ]:
print(num_cliques)
print(max_clique_size)
print(avg_clique_size)

There are 2231 cliques in the network. The maximum clique size is 689 people and the average clique size is 7.60, ~8 people.

Let's see what the maximum cliques look like


In [ ]:
max_cliques

In [ ]:
Graph_max_clique1 = G.subgraph(max_cliques[0])

In [ ]:
nx.draw(Graph_max_clique1, with_labels=False)

In [ ]:
df_network[ df_network['ssn_l'].isin(max_cliques[0]) & df_network['ssn_r'].isin(max_cliques[0])]

Community Detection (This may take some time)

In community detection, we try to find sub-networks, or communities, of densely populated connections. Community detection is similar to clustering, in that strong communities will display an abundance of intra-community connections and few inter-communityk connections.

The technical implementation of the algorithm can be found here.


In [ ]:
dict_clusters = community.best_partition(G)
clusters = [dict_clusters.get(node) for node in G.nodes()]
plt.axis("off")
#nx.draw_networkx(G,
#                 cmap = plt.get_cmap("terrain"),
#                 node_color = clusters, 
#                 node_size = 600,
#                 with_labels = True,
#                 fontsize=200)

In [ ]:
dict_clusters

Resources