Networks are measurable representations of patterns of relationships connecting entities in an abstract or actual space. Networks have been used to model airplane traffic from airports, supply chains, amorphous materials like window glass, cells, and proteins. They can also be used to model relationships among people. Social networks are patterns of relationships among people or organizations that affect and are affected by actions of individuals within the network. Network analysis captures the effect of the complete pattern of conections among individuals in a group to help us perform structural analysis of outcomes of interest for individuals and the group as a whole.

Networks can be represented as **graphs**, where a graph is made up of **nodes** connected by **ties**. The flexibility of network analysis means that the first step toward analysis is to clearly define what constitutes a node and what constitutes a tie in your network. There are several type of graphs: connected, unconnected, directional, and many more (see glossary for a list of terms).

This tutorial is based on Chapter 8 of Big Data and Social Science.

A

**node**is an individual entity within a graph.A

**tie**is a link between nodes. Ties can be**undirected**, meaning they represent a symmetrical relationship, or**directed**, meaning they represent an asymmetrical relationship (one that doesn't necessarily go both ways).A directed tie is known as an

**arc**. An undirected tie is known as an**edge**.The

**degree**of a node is the number of ties connected to that node.A

**directed graph**has directed ties, which depict an asymmetrical relationship. For instance, I may follow Barack Obama on Instagram, but that does not necessarily mean that he follows me.An

**undirected graph**has ties that depict a symmetrical or reciprocal relationship. For instance, if I am Facebook friends with Barack Obama, then he is also Facebook friends with me.A

**cutpoint**is a*node*that cannot be removed without disconnecting the network.A

**bridge**is a*tie*that cannot be removed without disconnecting the network.Two nodes are said to be

**reachable**when they are connected by an unbroken chain of relationships through other nodes.**Network density**is the number of*actual*connections in a network divided by the number of*potential*connections in that network.**Average distance**is the average path length between nodes in a graph. It is a measure of how many nodes it takes to transmit information across the network.**Centrality**is the degree to which a network revolves around a given node.

```
In [1]:
```%pylab inline
from __future__ import print_function
import sys
import community
import networkx
import seaborn
import pandas
import numpy

```
```

In this tutorial we are going to examine the friendship network of 34 students in a karate class. A political rivalry arose dividing the class into two factions, eventually leading to the fissure of the club into two separate karate classes.

The club held periodic meetings to vote on policy decisions. When one of the faction heads, individuals 1 and 34, called a meeting, they would communicate the information only to members in their own faction, in order to ensure that they would hold the majority of members present at the meeting. Meeting times were not publicly announced, but passed through the social network from friend to friend.

In this tutorial we will explore graphical representations of this network, degree metrics, centrality metrics, how to calculate the shortest path between nodes, and community detection. We will be using the NetworkX Python Library developed at Los Alamos National Laboratory (LANL). The nodes represent individuals and the ties represent friendships. The data is stored in gml format.

```
In [2]:
```Graph_Karate = networkx.read_gml('data/karate.gml', label='id')
Graph_Karate.name = 'SocialCircles_KarateClass'

One way to represent networks is an **adjacency matrix**, a binary (all entries either 0 or 1) square matrix. Each row represents the connections between one node and the other nodes in the network. For instance, the first row represents the first node. Each entry in a row corresponding to a node represents possible connections to the other nodes as indicated by 1 (connected) or 0 (not connected).

```
In [3]:
```adj_matrix = networkx.adjacency_matrix(Graph_Karate).todense()
numpy.set_printoptions(threshold=numpy.nan)
print(adj_matrix)

```
```

Note that in the matrix above, all the diagonal entries are 0. This means that we do not consider a node "connected to itself."

Is this graph directed? It's hard to tell this - or much else about the network's structure - by looking at this sea of 0s and 1s. Below we demonstrate plotting the adjacency matrix as a heat map, which shows symmetry across the diagonal. This shows that, in the karate class, there are no one-way friendships: if someone is your friend, you are also their friend.

```
In [4]:
```# Plot the adjacency matrix as a heat map
seaborn.heatmap(adj_matrix,
cbar=False,
cmap = plt.get_cmap('Greys'),
yticklabels=Graph_Karate.nodes(),
xticklabels=Graph_Karate.nodes(),
fmt='d')

```
Out[4]:
```

Graphs can also be represented as **edge lists**, where you list the connections between nodes exhaustively. If we know the graph is undirected, we only need to list each relationship one time. For example, we say that 1 is connected to 32, but it would be redundant to also say that 32 is connected to 1. Representing a network as an edge list is typically preferable to an adjacency matrix in the case of a sparse matrix -- where most of the entries of the matrix are 0. An edge list is typically how a network is stored in a database.

```
In [5]:
```# Writing the karate class graph as a list of edges
networkx.write_edgelist(Graph_Karate,
sys.stdout)

```
```

```
In [6]:
```# Displaying the karate class as a graph
spring_pos = networkx.spring_layout(Graph_Karate)
plt.axis("off")
networkx.draw_networkx(Graph_Karate,
pos=spring_pos,
with_labels = True,
node_size=650,
label='Friendship Network')

```
```

The visualization below emphasizes the edges, or the connections themselves.

```
In [7]:
```# Displaying the class as a circle
networkx.draw_circular(Graph_Karate,
with_labels=True,
node_size=600)

```
```

It is useful to know the size (in terms of nodes and ties) of the network, both to have an idea of the size and connectivity of the network, and because most of the measures you will use to describe the network will need to be standardized by the number of nodes or the number of potential connections.

The most important thing to understand about larger networks is the pattern of indirect connections among nodes, because it is these chains of indirect connections that make the network function as a whole, and make networks a
useful level of analysis. Much of the power of networks is due to indirect ties that create **reachability.** Two nodes can reach each other if they are connected by an unbroken chain of relationships, often called **indirect ties**.

Structural differences between node positions, the presence and characteristics of smaller "communities" within larger networks, and properties of the structure of the whole group can be quantified using different **network measures.**

```
In [8]:
```# Print out some summary statistics on the network
print( networkx.info(Graph_Karate) )

```
```

We see that there are 34 edges (students) and 78 edges (friendship).

The **average degree** of the network is the average number of edges connected to each node. This is not the same as 78/34, because each "friendship" includes two nodes, but is only counted once in the total number of edges.

We see that the average degree of this network is 4.5882, meaning that the average individual in the network has about 4.6 friends.

```
In [9]:
```# Print out the average density of the network
print(networkx.density(Graph_Karate))

```
```

The average density is calculated as the $$\text{average density} = \frac{\text{actual ties}}{\text{possible number of ties}} $$

where the possible number of ties for an undirected graph (if every node had a tie to every other node) is $\frac{n(n-1)}{2}$.

If every node were connected to every other node, the average density would be 1. If there were no ties between any of the nodes, the average density would be 0. The average density of this network is 0.14, which indicates it is not a very dense network. In this example, we can interpret this to mean that individuals are mostly friends only with others that belong to the same faction - there is not much interaction or overlap between the two factions.

```
In [10]:
```dict_degree = Graph_Karate.degree()
df_degree = pandas.DataFrame.from_dict(dict_degree, orient='index')
df_degree.columns=['degree']
df_degree.index.name = 'node_id'

```
In [11]:
```seaborn.set_style("whitegrid")
plt.figure(figsize=(22, 12))
seaborn.set_context("poster", font_scale=1.00, rc={"lines.linewidth": 1.00,"lines.markersize":8})
df_degree.plot(kind='barh')

```
Out[11]:
```

**most connected**, meaning they have the most friendships out of the group. 12 is the only individual that has only one connection. There are a few individuals who have 10 friends, and the rest in the class have around 2 to 5 friendships. It makes sense that 1 and 34 have the most social connections, because they are the leaders of the rival factions.

Two nodes are said to be **reachable** when they are connected by an unbroken chain of relationships through other nodes. Networks in which more of the possible connections (direct and indirect) among nodes are realized are denser and more cohesive than networks in which fewer of these connections are realized.

The reachability of individuals in a netowrk is determined by membership in **components**, which are subsets of the
larger network in which every member of the group is indirectly connected to every other. Imagining the standard node and line drawing of a graph, a component is a portion of the network where you can trace a path between every pair of nodes without ever lifting your pen.

Many larger networks consist of a single dominant component including anywhere from 50% to 90% of the individuals, and a few smaller components that are not connected. In this case, is common to perform analysis on only the main connected component of the graph, because there is not a convenient way to mathematically represent how "far away" unconnected nodes are. In our karate class example, our graph is connected, meaning that you can reach any individual from any other individual by moving along the edges of the graph, so we don't need to worry about that problem.

A **shortest path** between two nodes is a path from one node to the other, not repeating any nodes. One way to think of a shortest path between two individuals is how many people it would take to broker an introduction between them (think six degrees of Kevin Bacon).

Most pairs will have several shortest paths between them; the *longest shortest path* is called the **geodesic**.

```
In [12]:
```# Calculate the length of the shortest path between 12 and 15
ls_path = networkx.shortest_path(Graph_Karate, 12,15)
print('The path length from {} to {} is {}.'.format(
12,15,len(ls_path)))
print('path length: ', ls_path)

```
```

The **average shortest path length** describes how quickly information or goods can disburse through the network.

The average shortest length $l$ is defined as $$ l = \frac{1}{n(n-1)} \sum_{i \ne j}d(v_{i},v_{j}) $$ where $n$ is the number of nodes in the graph and $d(v_{i},v_{j})$ is the shortest path length between nodes $i$ and $j$.

```
In [13]:
```print(networkx.average_shortest_path_length(Graph_Karate))

```
```

Centrality metrics measure how important, or "central," a node is to the network. These can indicate what individual has the most social contacts, who is closest to people, or the person where information most transfers through. There are many **centrality metrics** -- degree centrality, betweenness centrality, closeness centrality, eigenvalue centrality, percolation centrality, PageRank -- all capturing different aspects of a node's contribution to a network.

Centrality measures are the most commonly used means to explore network effects at the level of certain individual participants. Typically, these metrics identify and describe a few import nodes, but don't tell us much about the rest of the nodes in the network. This is akin to Google's search results: the first few matches are the most relevant, but if you go a few pages in to the search results, you might as well have been searching for something else entirely.

The most basic and intuitive measure of centrality, **degree centrality**, simply counts the number of ties that each node has. Degree centrality represents a clear measure of the prominence or visibility of a node. The degree centrality $C_{D}(x)$ of a node $x$ is

where $deg(x)$ is the number of connections that node $x$ has, and $n-1$ is a normalization factor for the total amount of possible connections.

In the case of the karate class, the node with the highest degree centrality is the person with the most friends.

```
In [14]:
```dict_degree_centrality = networkx.degree_centrality(Graph_Karate)
df_degree_centrality = pandas.DataFrame.from_dict(dict_degree_centrality, orient='index')
df_degree_centrality.columns=['degree_centrality']
df_degree_centrality.index.name = 'node_id'

```
In [15]:
```df_degree_centrality.plot(kind='barh')

```
Out[15]:
```

**Closeness centrality** is based on the idea that networks position some individuals closer to or farther away
from other individuals, and that shorter paths between actors increase the likelihood of communication and
consequently the abiltiy to coordinate complicated activities. The closeness centrality $C_C(x)$ of a node $x$ is calculated as:

where $d(x,y)$ is the length of the geodesic between nodes $x$ and $y$.

```
In [16]:
```dict_closeness_centrality = networkx.closeness_centrality(Graph_Karate)
df_closeness_centrality = pandas.DataFrame.from_dict(dict_closeness_centrality,
orient='index')
df_closeness_centrality.columns=['closeness_centrality']
df_closeness_centrality.index.name = 'node_id'

```
In [17]:
```df_closeness_centrality.sort_values(by='closeness_centrality',
ascending=False).plot(kind='barh')

```
Out[17]:
```

Where closeness assumes that communication and information flow increase with proximity, **betweenness centrality**
captures the idea of "brokerage," or that there is an individual benefit which a node derives from being positioned
"in between" many other pairs of nodes. To calculate betweenness, we must assume that when people search for new
information through networks, they are capable of identifying the shortest path; additionally, we must assume
that when multiple shortest paths exist, each path is equally likely to be chosen.

The betweenness centrality $C_B(x)$ of a node $x$ is given by

$$ C_B{x} = \sum_{s,t} \frac{\sigma_{st}(x)}{\sigma_{st}}$$where $\sigma_{st}$ is the number of shortest paths from node $s$ to node $t$ and $\sigma_{st}(x)$ is the number of shortest paths $\sigma_{st}$ passing through node $x$.

```
In [18]:
```dict_betweenness_centrality = networkx.betweenness_centrality(Graph_Karate)
df_betweenness_centrality = pandas.DataFrame.from_dict(dict_betweenness_centrality,
orient='index')
df_betweenness_centrality.columns=['betweeness_centrality']
df_betweenness_centrality.index.name = 'node_id'

```
In [19]:
```df_betweenness_centrality.sort_values(by='betweeness_centrality',
ascending=False).plot(kind='barh')

```
Out[19]:
```

```
In [20]:
```cliques = list(networkx.find_cliques(Graph_Karate))

```
In [21]:
```import functools

```
In [22]:
```#summary stats of cliques
num_cliques = len(cliques)
ls_len_cliqs = [len(cliq) for cliq in cliques ]
max_clique_size = max(ls_len_cliqs)
avg_clique_size = np.mean(ls_len_cliqs)
max_cliques = [c for c in cliques if len(c) == max_clique_size]
max_clique_sets = [set(c) for c in max_cliques]
people_in_max_cliques = list(functools.reduce(lambda x,y: x.intersection(y), max_clique_sets))

```
In [23]:
```print(num_cliques)
print(max_clique_size)
print(avg_clique_size)

```
```

*36* cliques in the network. The maximum clique size is *5* people and the average clique size is *2.86*, ~3 people.

Let's see what the maximum cliques look like

```
In [24]:
``````
max_cliques
```

```
Out[24]:
```

```
In [25]:
```Graph_max_clique1 = Graph_Karate.subgraph(max_cliques[0])
Graph_max_clique2 = Graph_Karate.subgraph(max_cliques[1])

```
In [26]:
```networkx.draw(Graph_max_clique1, with_labels=True)

```
```

```
In [27]:
```networkx.draw(Graph_max_clique2, with_labels=True)

```
```

In **community detection**, we try to find sub-networks, or communities, of densely populated connections. Community detection is similiar to clustering, in that strong communities will display an abundance of intra-community connections and few inter-community connections.

In the case of the karate class, we know beforehand that there are two communities which kept mostly to themselves
and didn't cross-pollinate much. Assuming we *didn't* know that, let's see if we can identify these two communities from looking at the data.

The technical implementation of the algorithm can be found here.

1's Faction: [1,2,3,4,5,6,7,8,11,12,13,14,17,18,20,22]

34's Faction: [9,10,15,16,19,21,23,24,25,26,27,28,29,30,31,32,33,34]

```
In [28]:
```dict_clusters = community.best_partition(Graph_Karate,resolution=1.0)
clusters = [dict_clusters.get(node) for node in Graph_Karate.nodes()]
plt.axis("off")
networkx.draw_networkx(Graph_Karate,
pos = spring_pos,
cmap = plt.get_cmap("terrain"),
node_color = clusters,
node_size = 600,
with_labels = True,
fontsize=200)

```
```

Here we've encountered a common problem in clustering: it can be hard to decide on *how many* clusters there are,
as well as the clusters themselves. Often deciding on the number of clusters is done in an ad hoc way.

We've identified four different sub-clusters within the network, but it's not that evident from the graphic that the green cluster or white cluster deserve to be their own separate entity. Here it seems reasonable to combine the green and navy blue clusters into one faction, and the tan and white clusters into another faction.

Let's combine the four sub-clusters into two sub-clusters and see how we do.

```
In [29]:
```make_new_clust = lambda x: 0 if x < 2 else 1
dict_new_clust = {key: make_new_clust(dict_clusters[key]) for key in dict_clusters}

```
In [30]:
```ls_new_clust = [dict_new_clust.get(node) for node in Graph_Karate.nodes()]
plt.axis("off")
networkx.draw_networkx(Graph_Karate,
pos = spring_pos,
cmap=plt.get_cmap("terrain"),
node_color = ls_new_clust,
node_size = 600,
with_labels = True,
fontsize=200)

```
```

- International Network for Social Network Analysis is a large, interdisciplinary association dedicated to network analysis.
- Pajek is a freeware package for network analysis and visualization.
- Gephi is another freeware package that supports large-scale network visualization.
- Network Workbench is a freeware package that supports extensive analysis and visualization of networks.
- NetworkX is the Python package used in this tutorial to analyze and visualize networks.
- iGraph is a network analysis package with implementations in R, Python, and C libraries.
- A Fast and Dirty Intro to NetworkX (and D3)