Networks are mathematical or graphical representations of patterns of relationships between entities. These relationships are defined by some measure of "closeness" between individuals, and can exist in an abstract or actual space (for example, whether you are related to someone versus how far away you live from each other). Networks have been used to model everything from airplane traffic to supply chains, and even amorphous materials like window glass, cells, and proteins. They can also be used to model relationships among people. Social networks are patterns of relationships among people or organizations that affect and are affected by actions of individuals within the network. Network analysis captures the effect of the complete pattern of connections among individuals in a group to help us perform structural analysis of outcomes of interest for individuals and the group as a whole.

Networks can be represented as **graphs**, where a graph is made up of **nodes** connected by **ties**. The flexibility of network analysis means that the first step toward analysis is to clearly define what constitutes a node and what constitutes a tie in your network. There are several type of graphs: connected, unconnected, directional, and many more (see glossary for a list of terms).

This tutorial is based on Chapter 8 of Big Data and Social Science.

A

**node**is an individual entity within a graph.A

**tie**is a link between nodes. Ties can be**undirected**, meaning they represent a symmetrical relationship, or**directed**, meaning they represent an asymmetrical relationship (one that doesn't necessarily go both ways).A directed tie is known as an

**arc**. An undirected tie is known as an**edge**. tth Barack Obama, then he is also Facebook friends with me.A

**cutpoint**is a*node*that cannot be removed without disconnecting the network.A

**bridge**is a*tie*that cannot be removed without disconnecting the network.Two nodes are said to be

**reachable**when they are connected by an unbroken chain of relationships through other nodes.**Network density**is the number of*actual*connections in a network divided by the number of*potential*connections in that network.**Average distance**is the average path length between nodes in a graph. It is a measure of how many nodes it takes to transmit information across the network. This metric is only valid for fully connected graphs.**Centrality**is the degree to which a given node influences the entire network.

```
In [ ]:
```%pylab inline
from __future__ import print_function
import sys
import community
import networkx as nx
import seaborn as sns
import pandas as pd
from sqlalchemy import create_engine

In this tutorial we will explore graphical representations of this network, degree metrics, centrality metrics, how to calculate the shortest path between nodes, and community detection. We will be using the NetworkX Python Library developed at Los Alamos National Laboratory (LANL).

First we have to load the data from the database. *Note we did the hard work of creating the network in SQL and now doing our more complex analysis in Python.*

```
In [ ]:
```engine = create_engine("")
df_network = pd.read_sql('',
engine)
df_network.head()

```
In [ ]:
```network = list(zip(df_network.ssn_l, df_network.ssn_r))

```
In [ ]:
```G = nx.Graph()
G.add_edges_from(network)

One way to represent networks is an **adjacency matrix**, a binary (all entries either 0 or 1) square matrix. Each row represents the connections between one node and the other nodes in the network. For instance, the first row represents the first node. Each entry in a row corresponding to a node represents possible connections to the other nodes as indicated by 1 (connected) or 0 (not connected).

```
In [ ]:
```plt.figure(figsize=(30,30))
plt.spy(nx.adjacency_matrix(G))

Graphs can also be represented as **edge lists**, where you list the connections between nodes exhaustively. If we know the graph is undirected, we only need to list each relationship one time. For example, we say that 1 is connected to 32, but it would be redundant to also say that 32 is connected to 1. Representing a network as an edge list is typically preferable to an adjacency matrix in the case of a sparse matrix -- where most of the entries of the matrix are 0 due to taking much less space to store. An edge list is typically how a network is stored in a database.

```
In [ ]:
```network[:10]

*Note: this network is too large to visualize*

nx.draw(G)

It is useful to know the size (in terms of nodes and ties) of the network, both to have an idea of the size and connectivity of the network, and because most of the measures you will use to describe the network will need to be standardized by the number of nodes or the number of potential connections.

One of the most important things to understand about larger networks is the pattern of indirect connections among nodes, because it is these chains of indirect connections that make the network function as a whole, and make networks a
useful level of analysis. Much of the power of networks is due to indirect ties that create **reachability.** Two nodes can reach each other if they are connected by an unbroken chain of relationships, often called **indirect ties**.

Structural differences between node positions, the presence and characteristics of smaller "communities" within larger networks, and properties of the structure of the whole group can be quantified using different **network measures.**

```
In [ ]:
```# Print out some summary statistics on the network
print( nx.info(G) )

We see that there are 568892 ties (relationships) and 13716 nodes (individuals).

The **average degree** of the network is the average number of edges connected to each node.

We see that the average degree of this network is 83., meaning that the average individual in the network is connected to 83 individuals.

```
In [ ]:
```# Print out the average density of the netwo
print(nx.density(G))

The average density is calculated as the $$\text{average density} = \frac{\text{actual ties}}{\text{possible number of ties}} $$

where the possible number of ties for an undirected graph (if every node had a tie to every other node) is $\frac{n(n-1)}{2}$.

If every node were connected to every other node, the average density would be 1. If there were no ties between any of the nodes, the average density would be 0. The average density of this network is 0.0006, which indicates it is not a very dense network. In this example, we can interpret this to mean that individuals are mostly in small groups, and the groups don't overlap very much.

```
In [ ]:
```dict_degree = G.degree()
df_degree = pd.DataFrame.from_dict(dict_degree, orient='index')
df_degree.columns=['degree']
df_degree.index.name = 'node_id'

```
In [ ]:
```sns.set_style("whitegrid")
plt.figure(figsize=(22, 12))
sns.set_context("poster", font_scale=1.00, rc={"lines.linewidth": 1.00,"lines.markersize":8})
df_degree.sort_values(by='degree', ascending=False)[:10].plot(kind='barh')

```
In [ ]:
```df_degree.sort_values(by='degree', ascending=False)[:10]

```
In [ ]:
```G.neighbors()

Two nodes are said to be **reachable** when they are connected by an unbroken chain of relationships through other nodes. Networks in which more of the possible connections (direct and indirect) among nodes are realized are denser and more cohesive than networks in which fewer of these connections are realized.

The reachability of individuals in a network is determined by membership in **components**, which are subsets of the
larger network in which every member of the group is indirectly connected to every other. Imagining the standard node and line drawing of a graph, a component is a portion of the network where you can trace a path between every pair of nodes without ever lifting your pen.

Many larger networks consist of a single dominant component including anywhere from 50% to 90% of the individuals, and a few smaller components that are not connected. In this case, is common to perform analysis on only the main connected component of the graph, because there is not a convenient way to mathematically represent how "far away" unconnected nodes are. In our karate class example, our graph is connected, meaning that you can reach any individual from any other individual by moving along the edges of the graph, so we don't need to worry about that problem.

A **shortest path** between two nodes is a path from one node to the other, not repeating any nodes. One way to think of a shortest path between two individuals is how many people it would take to broker an introduction between them (think six degrees of Kevin Bacon).

Most pairs will have several "shortest paths" between them; the *longest shortest path* is called the **geodesic**.

```
In [ ]:
```# Calculate the length of the shortest path between 12 and 15
ls_path = nx.shortest_path(G,)
print('The path length from {} to {} is {}.'.format(
'',
'',
len(ls_path)))
print('path length: ', ls_path)

In this case there is no path between the two nodes.

```
In [ ]:
```# Calculate the length of the shortest path between 12 and 15
ls_path = nx.shortest_path(G, '',
'')
print('The path length from {} to {} is {}.'.format(
'',
'',
len(ls_path)))
print('path length: ', ls_path)

The **average shortest path length** describes how quickly information or goods can disburse through the network.

The average shortest length $l$ is defined as $$ l = \frac{1}{n(n-1)} \sum_{i \ne j}d(v_{i},v_{j}) $$ where $n$ is the number of nodes in the graph and $d(v_{i},v_{j})$ is the shortest path length between nodes $i$ and $j$.

```
In [ ]:
```print(nx.average_shortest_path_length(G))

Centrality metrics measure how important, or "central," a node is to the network. These can indicate what individual has the most social contacts, who is closest to people, or the person where information most transfers through. There are many **centrality metrics** -- degree centrality, betweenness centrality, closeness centrality, eigenvalue centrality, percolation centrality, PageRank -- all capturing different aspects of a node's contribution to a network.

Centrality measures are the most commonly used means to explore network effects at the level of certain individual participants. Typically, these metrics identify and describe a few important nodes, but don't tell us much about the rest of the nodes in the network. This is akin to Google's search results: the first few matches are the most relevant, but if you go a few pages in to the search results, you might as well have been searching for something else entirely.

The most basic and intuitive measure of centrality, **degree centrality**, simply counts the number of ties that each node has. Degree centrality represents a clear measure of the prominence or visibility of a node. The degree centrality $C_{D}(x)$ of a node $x$ is

where $deg(x)$ is the number of connections that node $x$ has, and $n-1$ is a normalization factor for the total amount of possible connections.

If a node has no connections to any other nodes, its degree centrality will be 0. If it is directly connected to every other node, its degree centrality will be 1.

```
In [ ]:
```dict_degree_centrality = nx.degree_centrality(G)
df_degree_centrality = pd.DataFrame.from_dict(dict_degree_centrality, orient='index')
df_degree_centrality.columns=['degree_centrality']
df_degree_centrality.index.name = 'node_id'

```
In [ ]:
```df_degree_centrality.sort_values(by='degree_centrality',
ascending=False)[:10].plot(kind='barh')

As we can see, this is simply a recasting of the degree distribution.

**Closeness centrality** is based on the idea that networks position some individuals closer to or farther away
from other individuals, and that shorter paths between actors increase the likelihood of communication, and
consequently the ability to coordinate complicated activities. The closeness centrality $C_C(x)$ of a node $x$ is calculated as:

where $d(x,y)$ is the length of the geodesic between nodes $x$ and $y$.

```
In [ ]:
```dict_closeness_centrality = {}

```
In [ ]:
```for ssn_hash in zip(*network[:25])[0]:
dict_closeness_centrality[ssn_hash] = nx.closeness_centrality(G,u=ssn_hash)

```
In [ ]:
```df_closeness_centrality = pd.DataFrame.from_dict(dict_closeness_centrality,
orient='index')
df_closeness_centrality.columns=['closeness_centrality']
df_closeness_centrality.index.name = 'node_id'

```
In [ ]:
```df_closeness_centrality.sort_values(by='closeness_centrality',
ascending=False)[:10].plot(kind='barh')

Where closeness assumes that communication and information flow increase with proximity, **betweenness centrality**
captures "brokerage," or the idea that a node that is positioned "in between" many other pairs of nodes gains some individual advantage. To calculate betweenness, we must assume that when people search for new
information through networks, they are capable of identifying the shortest path (so that we know that the path between two nodes actually includes the "in between" node); additionally, we must assume
that when multiple shortest paths exist, each path is equally likely to be chosen.

The betweenness centrality $C_B(x)$ of a node $x$ is given by

$$ C_B{x} = \sum_{s,t} \frac{\sigma_{st}(x)}{\sigma_{st}}$$where $\sigma_{st}$ is the number of shortest paths from node $s$ to node $t$ and $\sigma_{st}(x)$ is the number of shortest paths $\sigma_{st}$ passing through node $x$. Intuitively, for each node, we look at how many of the shortest paths between every other pair of nodes includes that node.

```
In [ ]:
```dict_betweenness_centrality = nx.betweenness_centrality(G, k=50)
df_betweenness_centrality = pd.DataFrame.from_dict(dict_betweenness_centrality,
orient='index')
df_betweenness_centrality.columns=['betweeness_centrality']
df_betweenness_centrality.index.name = 'node_id'

```
In [ ]:
```df_betweenness_centrality.sort_values(by='betweeness_centrality',
ascending=False)[:10].plot(kind='barh')

A clique is a maximally connected sub-network, or a group of individuals who are all connected to one another.

In our case, this would be a group of individuals that are all connected to each other: all released in the same month, to the same zipcode, with the same gang affiliation. We might expect to see a lot of cliques in this network, because we defined the relationships within our network based on these groupings.

```
In [ ]:
```cliques = list(nx.find_cliques(G))

```
In [ ]:
```import functools

```
In [ ]:
```#summary stats of cliques
num_cliques = len(cliques)
ls_len_cliqs = [len(cliq) for cliq in cliques ]
max_clique_size = max(ls_len_cliqs)
avg_clique_size = np.mean(ls_len_cliqs)
max_cliques = [c for c in cliques if len(c) == max_clique_size]
max_clique_sets = [set(c) for c in max_cliques]
people_in_max_cliques = list(functools.reduce(lambda x,y: x.intersection(y), max_clique_sets))

```
In [ ]:
```print(num_cliques)
print(max_clique_size)
print(avg_clique_size)

There are *2231* cliques in the network. The maximum clique size is *689* people and the average clique size is *7.60*, ~8 people.

Let's see what the maximum cliques look like

```
In [ ]:
``````
max_cliques
```

```
In [ ]:
```Graph_max_clique1 = G.subgraph(max_cliques[0])

```
In [ ]:
```nx.draw(Graph_max_clique1, with_labels=False)

```
In [ ]:
```df_network[ df_network['ssn_l'].isin(max_cliques[0]) & df_network['ssn_r'].isin(max_cliques[0])]

In **community detection**, we try to find sub-networks, or communities, of densely populated connections. Community detection is similar to clustering, in that strong communities will display an abundance of intra-community connections and few inter-communityk connections.

The technical implementation of the algorithm can be found here.

```
In [ ]:
```dict_clusters = community.best_partition(G)
clusters = [dict_clusters.get(node) for node in G.nodes()]
plt.axis("off")
#nx.draw_networkx(G,
# cmap = plt.get_cmap("terrain"),
# node_color = clusters,
# node_size = 600,
# with_labels = True,
# fontsize=200)

```
In [ ]:
``````
dict_clusters
```

- International Network for Social Network Analysis is a large, interdisciplinary association dedicated to network analysis.
- Pajek is a freeware package for network analysis and visualization.
- Gephi is another freeware package that supports large-scale network visualization.
- Network Workbench is a freeware package that supports extensive analysis and visualization of networks.
- NetworkX is the Python package used in this tutorial to analyze and visualize networks.
- iGraph is a network analysis package with implementations in R, Python, and C libraries.
- A Fast and Dirty Intro to NetworkX (and D3)