[Data, the Humanist's New Best Friend](index.ipynb)
*Class 17*

In this class you are expected to learn:

  • Networks
  • Vertices and edges
  • Graph Theory
  • NetworkX
  • Network Analysis
*You must master layouts*

Networks

In case you don't know it, we are all connected in same way or the other. Every time you find a connection or relationship between two or more entities, like friendship among people, bank transactions, or postal mail, chances for that being analyzed as a network are high. Facebook, Twitter, tumblr, Instagram are all social networks, thus networks. In recent years, vast amounts of network data are being generated and collected in different fields:

  • Sociology: web pages, mobile phones, social networks.
  • Technology: Internet routers, vehicular flows, power grids.
  • Academia: co-authorship, mentions.
  • Literature: play networks, publication networks.

But how can we analyze these networks? And most importantly, why would we want to do such a thing?

To answer the first question, we need to know that networks are what mathematicians call graphs. And as usual in the world of Mathematics, there is a huge field of study for that, Graph Theory. So almost all the tools we need are already there, at least in theory.

About the latter, studying and analyzing networks can give us insights about data that are hard to find by other means. Network problems can involve finding an optimal way of doing something, like network flow, shortest path problem, transport problem, location problem, or routing problem. Other times we focus on finding the most important guy in a network, and under what criterias, or the most famous, or the least popular. All of these are examples of how analyzing a problem in terms of networks can give us valuable information that is almost invisible in other ways.


In [1]:
from IPython.display import IFrame
IFrame("http://zoom.it/A1wx#full", width=800, height=378)


Out[1]:
*Source: [Towards a digital geography of Hispanic Baroque art](http://www.cultureplex.ca/publications/towards-a-digital-geography-of-hispanic-baroque-art/)*

Activity

Try to find at least two more examples of networks that are not listed before.


In [2]:
# We will need this later
%pylab inline
pylab.rcParams['figure.figsize'] = (12.0, 6.0)


Populating the interactive namespace from numpy and matplotlib

Seven Bridges of Königsberg

The origins take us back in time to the Künigsberg of the 18th century. Königsberg was a city in Prussia that time. The river Pregel flowed through the town, creating two islands. The city and the islands were connected by seven bridges as shown. The inhabitants of the city were moved by the question, if it was possible to take a walk through the town by visiting each area of the town and crossing each bridge only once? Every bridge must have been crossed completely, i.e. it is not allowed to walk halfway onto a bridge and then turn around and later cross the other half from the other side. The walk need not start and end at the same spot.

Leonhard Euler solved the problem in 1735 by proving that it is not possible. He found out that the choice of a route inside each land area is irrelevant and that the only thing which mattered is the order (or the sequence) in which the bridges are crossed. He had formulated an abstraction of the problem, eliminating unnecessary facts and focussing on the land areas and the bridges connecting them. This way, he created the foundations of Graph Theory. If we see a "land area" as a vertex and each bridge as an edge, we have "reduced" the problem to a graph.

*Source: [Seven Bridges of Königsberg](http://en.wikipedia.org/wiki/Seven_Bridges_of_K%C3%B6nigsberg)*

Vertices and edges

At this point I guess that is clear what we mean by vertex or node, and by edge or relationship. Before entering more complicated, but still easy to get, concepts, we need to clarify something about graphs and networks. There is no difference between vertex and node, and between edge or relationship. How you use it's just a matter of your background: in Mathematics they use graph. vertex and edge; and Sociology, network, node and relationship. So whatever is better for you. We will be using both.

*Don't get crazy, please*

Graph Theory

A graph, in Mathematics and Computer Science consists of vertices, also known as nodes. Nodes may or may not be connected with one another. The connecting line between two nodes is called an edge or a relationship. If the edges between the nodes are undirected, the graph is called an undirected graph. If an edge is directed from one vertex to another, a graph is called a directed graph. An directed edge is called an arc or a directed relationship.

Though graphs may look very theoretical, many practical problems can be represented by graphs. They are often used to model problems or situations in physics, biology, psychology and above all in computer science. In computer science, graphs are used to represent networks of communication, data organization, computational devices, the flow of computation, etc. In the latter case, they are used to represent the data organisation, like the file system of an operating system, or communication networks. The link structure of websites can be seen as a graph as well, i.e. a directed graph, because a link is a directed relationship.

Graphs

Python has no built-in data type or class for graphs, but it is easy to implement them in Python. Dictionaries are the ideal for representing graphs in Python.


In [3]:
graph = { "a" : ["c"],
          "b" : ["c", "e"],
          "c" : ["a", "b", "d", "e"],
          "d" : ["c"],
          "e" : ["c", "b"],
          "f" : []
        }
graph


Out[3]:
{'d': ['c'],
 'e': ['c', 'b'],
 'f': [],
 'a': ['c'],
 'b': ['c', 'e'],
 'c': ['a', 'b', 'd', 'e']}

The keys of the dictionary above are the nodes of our graph. The corresponding values are lists with the vertices, which are connected by an edge. There is no simpler and more elegant way to represent a graph.

An edge can be seen as a 2-tuple with nodes as elements, i.e. ("a", "b"). Let's now create a function to generate the list of all edges, given a graph.


In [4]:
def generate_edges(graph):
    edges = []
    for vertex in graph:
        for neighbour in graph[vertex]:
            edges.append((vertex, neighbour))
    return edges

print(generate_edges(graph))


[('d', 'c'), ('e', 'c'), ('e', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'e'), ('c', 'a'), ('c', 'b'), ('c', 'd'), ('c', 'e')]

As we can see, there is no edge containing the node "f". "f" is an isolated vertex in our graph. We might need as well a function that calculates the isolated vertices of a given graph.


In [5]:
def find_isolated_vertex(graph):
    """ returns a list of isolated vertices. """
    isolated = []
    for vertex in graph:
        if not graph[vertex]:
            isolated += vertex
    return isolated

find_isolated_vertex(graph)


Out[5]:
['f']

Furthermore, two vertices are adjacent when they are both incident to a common edge.

Paths in graphs

Another common concept in graphs is the path, that can be informally defined as the sequence of vertices you need to pass by to connect from one vertex to another.

For example, in our graph example, ['a', 'c', 'e'] is a path in our graph, as well as ['a', 'c', 'e', 'b']. Ideally, paths shouldn't pass twice by the same vertex.

Activity

Think about writing a function to calculate the shortest path between two vertices. How should it work?

The distance between two vertices in a graph is the length of the shortest path between these vertices. No backtracks, detours, or loops are allowed for the calculation of a distance. To find the diameter of a graph, first find the shortest path between each pair of vertices. The greatest length of any of these paths is the diameter of the graph.

Degree

The degree of a vertex is the number of edges connecting it. If a network is directed, meaning that edges point in one direction from one node to another node, then nodes have two different degrees, the in-degree, which is the number of incoming edges, and the out-degree, which is the number of outgoing edges. The maximum and minimum degree of a graph are the maximum and minimum degree of its vertices.

In our graph, the maximum degree is 4 at vertex c and the minimum degree is 0, i.e the isolated vertex f.

Activity

Write two functions, `max_degree(graph)` and `min_degree(graph)`, that take a graph and return the maximum and minimum degree of the graph.

There is a property that says that the sum of degrees of all the vertices is equal to the number of edges multiplied by 2.

If the degree of a node in a network is the number of edges the node has to other nodes, we can now define a degree distribution as the fraction of nodes in the network with a certain degree.

Density

If we denot $E$ as the total number of edges, and $V$ as the total numbel of vertices, for undirected simple graphs, the graph density is defined as:

$$\frac{2E}{V(V - 1)}$$

This gives an idea on how sparse or dense is a graph. A dense graph is a graph in which the number of edges is close to the maximal number of possible edges. A graph with only a few edges, is called a sparse graph. The difference between those two terms is not very sharp.

The maximal density is 1, if a graph is complete, that means all nodes are connected to all the other nodes. The minimal density on the other hand is 0, for graphs with only isolated nodes.

Activity

Write a functions, `density(graph)`, that returns the density of a graph. For example, `density(graph)` should return `0.6666666666`.

NetworkX

We have seen some basic concepts of graphs. And now we have a clue on how difficult some of the algorithms can be. Fortunately for us, there exists the amazing NetworkX, a "Python package for the creation, manipulation and study of the structure, dynamics and functions of complex networks."

For us, NetworkX is a tool to study the structure and dynamics of social, biological, and infrastructure networks. And it's ease to learn, teach and use for rapid development in a collaborative, multidisciplinary environment.

NetworkX defines no custom node (it calls vertices nodes) objects or edge objects:

  • Node-centric view of network.
  • Nodes can be any hashable object, while edges are tuples with optional edge data (stored in dictionary).
  • Any Python object is allowed as edge data and it is assigned and stored in a Python dictionary (default empty).

In [6]:
import networkx as nx  # As nx as convention

Let's see an example.


In [7]:
g = nx.Graph()
g.add_edge('A','B');
g.add_edge('B','C');
g.add_edge('C','A');
nx.draw(g)


That simple, that's how you create a graph. We create an instance of the class nx.Graph and then start adding edges. There are others classes for other types of graphs, like nx.DiGraph for directed graphs.

But the methods and functions (in other words, the API) of NetworkX allow us for much more. We can also add nodes first, and nodes can be whatever we need: a number, a string, a file, etc.


In [8]:
g = nx.Graph()
g.add_node(1)

Or directly add a list of nodes at once.


In [9]:
g.add_nodes_from([2 ,3])

To remove a node we need to pass the node.


In [10]:
g.remove_node(2)

In [11]:
nx.draw(g)


And almost same thing for the edges.


In [12]:
g.add_edge(1,2)
g.add_edges_from([(1 ,2) ,(1 ,3)])
g.remove_edge(1,2)
nx.draw(g)


In some cases we might have the information of an edge in a tuple, e, but instead of passing the two values like g.add_edge(e[0], e[1]), we can do g.add_edge(*e), which has the same effect. In Python, that's called unpacking, and works for list, tuples, and even dictionaries.


In [13]:
e = (2, 3)
g.add_edge(*e)  # unpack edge tuple

Some useful methods for accessing nodes and edges are shown below.


In [14]:
g.number_of_nodes()  # also g.order()


Out[14]:
3

In [15]:
g.number_of_edges()  # also g.size()


Out[15]:
2

In [16]:
g.nodes()


Out[16]:
[1, 2, 3]

In [17]:
g.edges()


Out[17]:
[(1, 3), (2, 3)]

In [18]:
g.edges(data=True)  # To return the properties associated with the nodes as dictionaries


Out[18]:
[(1, 3, {}), (2, 3, {})]

In [19]:
g.neighbors(1)  # Get the neighbors nodes of node 1


Out[19]:
[3]

In [20]:
g.degree(1)


Out[20]:
1

Any NetworkX graph behaves like a Python dictionary with nodes as primary keys. We can add any property to the node by just typing the name and the value.


In [21]:
g.add_node(1, time='5pm')
g.node[1]['time']


Out[21]:
'5pm'

In [22]:
g.node[1]  # Python dictionary


Out[22]:
{'time': '5pm'}

The special edge attribute weight should always be numeric and holds values used by algorithms requiring weighted edges. We will see that this is important when calculating shortest paths.


In [23]:
g.add_edge(1, 2, weight=4.0 )
g[1][2]['weight'] = 5.0  # edge already added
g[1][2]


Out[23]:
{'weight': 5.0}

Many applications require iteration over nodes or over edges: simple and easy in NetworkX. The next code prints the degree for each node in g.


In [24]:
g.add_edge(1,2)
for node in g.nodes():
    print(node, g.degree(node))


1 2
2 2
3 2

In [25]:
g.add_edge(1, 3, weight=2.5)
g[1][2]['weight'] = 1.5
for n1, n2, attr in g.edges(data=True):
    if 'weight' in attr:
        print(n1, n2, attr['weight'])
    else:
        print(n1, n2)


1 2 1.5
1 3 2.5
2 3

When we have a directed graph, we can also use that information to filter out, for a specific node, nodes that are pointing to it, and nodes that are being pointed by it.


In [26]:
dg = nx.DiGraph()
dg.add_weighted_edges_from([(1, 4, 0.5), (3, 1, 0.75)])  # Third element in the tuples is weight

In [27]:
dg.successors(1)


Out[27]:
[4]

In [28]:
dg.predecessors(1)


Out[28]:
[3]

Some algorithms work only for directed graphs and others are not well defined for directed graphs. If you want to treat a directed graph as undirected for some measurement you should probably convert it using Graph.to_undirected().

Network Analysis

In order to do some real network analysis, we first need some real network. KONECT (the Koblenz Network Collection) is a project to collect large network datasets of all types in order to perform research in network science and related fields, collected by the Institute of Web Science and Technologies at the University of Koblenz–Landau.

From that collection, we will use Hamsterster friendships network dataset, an undirected and unweighted network that contains friendships between users of the website hamsterster.com.

NetworkX is able to handle a variety of graph formats, both for reading and writing:

  • edge lists
  • adjacency lists
  • GML
  • GEXF
  • Python pickle
  • GraphML
  • Pajek
  • LEDA
  • YAML

Our Hamsterster dataset is defined by an edge list, so the file is a plain text file with one edge per line. For your commodity, the file is uncompressed and ready in data/petster-friendships-hamster.txt


In [29]:
!head -n 5 data/petster-friendships-hamster.txt


99 98
999 550
999 42
999 25
998 996

Let's load it into NetworkX!


In [30]:
hamsterster = nx.read_edgelist("data/petster-friendships-hamster.txt")
hamsterster


Out[30]:
<networkx.classes.graph.Graph at 0x7f850ee950d0>

After you go over thousands nodes, there is no point in drawing the graph, as you can see below.


In [31]:
nx.draw(hamsterster)


We need better ways to shape the graph, or layouts, and fine tune the parameters. But not today :-P

On the other hand, we can extract information from the graph without seeing it. Basic graph properties include number of nodes, of edges and average degree.


In [32]:
hamsterster_n, hamsterster_k = hamsterster.order(), hamsterster.size()
hamsterster_avg_deg = hamsterster_k / hamsterster_n
print("Nodes: ", hamsterster_n)
print("Edges: ", hamsterster_k)
print("Average degree: ", hamsterster_avg_deg)


Nodes:  1858
Edges:  12534
Average degree:  6.745963401506997

We are starting to know the graph. Let's take a look to the degrees. A good way of doing that, is by computing a degree distributions of the graph and plot them. If the graph were directed, we would need to generate two distributions, one for the in-degree and othre for the out-degree.

The key for a degree distribution is to obtain the degree sequence, which is just an list contained all the degrees but sorted.


In [33]:
degree_sequence = sorted(hamsterster.degree().values() ,reverse=True)  # degree sequence
plt.plot(degree_sequence, 'b-', marker='o')
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.title('Hamsterster friendships network')


Out[33]:
<matplotlib.text.Text at 0x7f850d326f10>

Finally, as we have learnt already, it might be useful to represent distributions using base 10 for the axes, and remember, we do that with matplotlib.loglog() function.


In [36]:
plt.loglog(degree_sequence, 'b-', marker='o')
plt.xlabel('Degree (log)')
plt.ylabel('Number of nodes (log)')
plt.title('Hamsterster friendships network')


Out[36]:
<matplotlib.text.Text at 0x7f850c7afd10>

Activity

Using another small graph from KONECT, download the dataset, load it into NetworkX, run these very basic analysis, and plot the degree distribution. If you choose a directed graph, remember to plot in-degree and out-degree.

For the next class