Social Network Analysis

Written by Jin Cheong & Luke Chang


In [ ]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
try:
    import networkx as nx
except:
    # Install NetworkX
    !pip install networkx

Primer to Network Analysis

A network is made up of two main components: nodes and edges.

The nodes are the individuals, words, or entities that compose a network, and the edges are the links that connect one node to another.

Here is an example of a node and an edge.

Now we can try drawing our own network using the NetworkX package. Let's start off by initializing a Network, and call it G.


In [ ]:
# Initialize Graph object
G = nx.Graph()

Now we can add a node using the .add_node('nodename') method


In [ ]:
G = nx.Graph()
G.add_node('Jin')
G.add_node('Luke')
G.add_node('Eshin')
plt.figure(figsize=(5,3))
nx.draw(G,with_labels=True,node_size=5000,font_size=20,alpha=.5)

Notice there are no connections between the two nodes because we haven't added any edges yet.

To add edges between nodes, for example node1 and node 2, we can use the .add_edge('node1','node2') method on our graph.


In [ ]:
G = nx.Graph()
G.add_edge('Jin','Luke',weight=1)
G.add_edge('Jin','Eshin',weight=1)
G.add_edge('Luke','Eshin',weight=1)
plt.figure(figsize=(5,3))
pos = nx.spring_layout(G) # One way of specifiying a layout for nodes.
nx.draw(G,pos,with_labels=True,node_size=5000,font_size=20,alpha=.5)

Now you can see that we have an edge connecting Jin and Luke.

There are several different types of graphs.

  1. weighted vs unweighted graphs
    • a graph is said to be unweighted if every connection is either 1 or 0
    • it is weighted if the edges take on other values indicating the strength of the relationship.
  1. directed vs undirected graphs
    • a graphs is undirected if the edges are symmetrical
    • it is directed if the edges are asymmetrical, meaning that each edge can indicate the direction of the relationship

For simplicity, today we will only cover unweighted and undirected graphs.

There are multiple ways to indicate nodes and edges in a graph.
As illustrated above, we can manually add each node and edge. This approach is fine for small graphs, but won't scale well. It is also possible to add nodes is to use a python dictionary. The key is the node and the values indicate what that key(node) connects to.


In [ ]:
d = {'Jin':['Luke','Eshin','Antonia','Andy','Sophie','Rob','Charlie','Vanessa'], 
     'Luke':['Eshin','Antonia','Seth','Andy','Sophie','Rob','Charlie','Vanessa'],
     'Antonia':['Luke','Jin','Eshin','Seth','Andy'],
     'Eshin':['Heidi','Antonia','Jin','Sam','Andy'],
     'Sophie':['Rob','Vanessa'],
     'Rob':['Sophie','Vanessa']}
G = nx.Graph(d)
plt.figure(figsize=(15,8))
np.random.seed(2) # Just to keep things same
pos = pos = nx.fruchterman_reingold_layout(G) # Another way of specifiying a layout for nodes.
nx.draw(G,with_labels=True,node_size=1500,font_size=20,alpha=.3,width=2)

What should we look for in a network ?

Micromeasures

Centrality measures track the importance of a node derived from its position in the network.

There are four main groups of centrality depending on the type of statistics used:

  1. degree - how connected a node is
  2. betweenness - how important a node is in connecting other nodes
  3. closeness - how easily a node can reach other nodes
  4. neighbors' characteristics - how important, central, or influential a node's neighbors are.

<img src="Figures/centrality.png",width=300>

Reference: Social and Economic Networks by Matthew O. Jackson. Princeton University Press.
Picture: Wikipedia article
Other examples: http://www.orgnet.com/sna.html

Degree Centrality

The degree centrality measures the number of edges of a given node.

In other words, it measures how connected a node is.


In [ ]:
def print_network(val_map,title=None):
    print('\x1b[1;31m'+title+'\x1b[0m')
    for k, v in val_map.iteritems():
        print k.ljust(15), str(round(v,3)).ljust(30)

def plot_network(val_map, title=None):
    values = [val_map.get(node, 0.25) for node in G.nodes()]
    plt.figure(figsize=(12,5))
    np.random.seed(2)
#     pos = nx.spring_layout(G)
    pos = nx.fruchterman_reingold_layout(G,dim=2)
    ec = nx.draw_networkx_edges(G,pos,alpha=.2)
    nc = nx.draw_networkx_nodes(G,pos,node_size=1000,with_labels=True,alpha=.5,cmap=plt.get_cmap('jet'),node_color=values)
    nx.draw_networkx_labels(G,pos,font_size=18)
    # nx.draw(G,pos,node_size=400,with_labels=True,alpha=.5,cmap=plt.get_cmap('jet'),node_color=values)
    plt.colorbar(nc)
    plt.axis('off')
    plt.suptitle(title,fontsize=18)
    plt.show()

# Get degree centrality values
d = nx.degree_centrality(G)
title = 'Degree Centrality Map'
# print centrality values
print_network(d,title)
# plot graph with values
plot_network(d,title)

Luke has the highest number of connections and therefore has the highest degree centrality. Jin is next, and then Eshin.

Betweenness Centrality

Think of Between Centrality as a bottleneck or a broker of two separate networks.

The measure is the fraction of the total number of shortest paths a node lie on between two other nodes.

Here Eshin has the highest betweeness centrality because he connects the most different people.


In [ ]:
d = nx.betweenness_centrality(G)
title = "Betweenness Centrality"
print_network(d,title)
plot_network(d,title)

Closeness Centrality

The closeness centrality measures how close a given node is to any other node.
This is measured by the average length of the shortest path between a node and all other nodes.
Thus you have higher closeness centrality if you can get to everyone faster than others.


In [ ]:
d = nx.closeness_centrality(G)
title = "Closeness Centrality"
print_network(d,title)
plot_network(d,title)

Eigenvector Centrality

The underlying assumption of the Eigenvector centrality is that a node's importance is determined by how important its neighbors are.

For instance, having more influential friends (betweeness centrality) can be more important than just having more number of friends (degree centrality).

Now we can observe that Luke is back to being more important than Eshin (who had higher betweeness centrality), because Luke is friends with Eshin who also has high centrality.


In [ ]:
d = nx.eigenvector_centrality(G)
title = "Eigenvector Centrality"
print_network(d,title)
plot_network(d,title)

Macromeasures

Macromeasures are useful to understand the network as a whole or to compare networks.

We will cover three fundamental measures of the network structure.

  1. Degree distribution
  2. Average shortest path
  3. Clustering Coefficients

Degree Distribution

One fundamental characteristic is the distribution of degrees, number of edges for each node, which we can easily plot.
From this distribution we can discern attributes such as whether everyone is connected to everybody are there many lonely nodes and such.


In [ ]:
degree_sequence=sorted([d for n,d in G.degree().iteritems()], reverse=True) # degree sequence

G2=nx.complete_graph(5)
degree_sequence2=sorted([d for n,d in G2.degree().iteritems()], reverse=True) # degree sequence

fig, [(ax1,ax2),(ax3,ax4)] = plt.subplots(2,2,figsize=(15,10))
ax1.hist(degree_sequence,bins=range(0,7),rwidth=.8)
ax1.set_title("Degree Histogram for Graph 1",fontsize=16)
ax1.set_ylabel("Count",fontsize=14)
ax1.set_xlabel("Degree",fontsize=14)
ax1.set_xticks([d+0.4 for d in degree_sequence])
ax1.set_xticklabels([d for d in degree_sequence])
ax1.set_ylim((0,6))

nx.draw(G,pos=nx.circular_layout(G),ax=ax2)
ax2.set_title('Network 1: Graph of our class',fontsize=16)

ax3.hist(degree_sequence2,bins=range(0,7),rwidth=.8)
ax3.set_title("Degree Histogram for Complete Graph",fontsize=16)
ax3.set_ylabel("Count",fontsize=14)
ax3.set_xlabel("Degree",fontsize=14)
ax3.set_xticks([d+0.4 for d in degree_sequence2])
ax3.set_xticklabels([d for d in degree_sequence2])
ax3.set_ylim((0,6))

nx.draw(G2,pos=nx.circular_layout(G2),ax=ax4)
ax4.set_title('Network 2: Graph of a Complete Graph',fontsize=16)

plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
plt.show()

Average Shortest Path

The average shortest path length is the average of calculating the shortest number of edges it takes for every pair of nodes in the network.

This can be a mesure of how efficient/fast the network is in the distribution of information, rumors, disease, etc.


In [ ]:
print 'Network 1 average shortest path: ' + str(round(nx.average_shortest_path_length(G),2))
print 'Network 2 average shortest path: ' + str(nx.average_shortest_path_length(G2))

Clustering coefficient.

Another way to look at how tight-knit a network is to look at how clustered it is.
To estimate clustering, we start from cliques, which can be considered as a subnetwork composed of three nodes forming a triangle.

We can first calculate for each node, how many cliques(triangles) it is a member of.
Then we calculate transitivity, which is simply the number of triangles in the network divided by the number of all possible traingles.


In [ ]:
d = nx.triangles(G)
print_network(d,'Number of cliques(triangles)')
print 
print 'Transitivity : ' + str(nx.transitivity(G))

plot_network(d,'Transitivity')

A similar approach is to calculate clustering coefficient for each node and for the graph. Clustering is the measure of how likely it is if A has edges to B and C, that B and C related?

For instance, Charlie form a triangle with Luke and Jin, but Sam and Heidi don't form a single triangle.

We can calculate this for each node and then get an average value overall to get the average clustering coefficient.


In [ ]:
d = nx.clustering(G)
print_network(d,'Clustering Coefficients per node')
print 
print 'Average clustering coefficient : ' + str(nx.average_clustering(G))

In [ ]:

Exercise

Your assignment this week is to construct your own social network. It could be based on a sports team, your close friends, or family. It could also reflect more abstract aspects of relationships such as social support, gossip, problem solver, etc.

Calculate two different metrics of the network to provide insight into how different members play different roles.


In [ ]: