In this assignment you will get to work with graphs using the graph-tool python module.
Learning goals are:
For further background on graph-tool
refer to the graph tutorial slides and the tutorial notebook. There you can also find instructions on how to use the py36 conda
environment in the Big Data Lab, which has all required dependencies installed.
Submission: Fill your answers into this notebook and submit on CourSys.
In [187]:
import matplotlib as mpl
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import display, Markdown
%matplotlib inline
import graph_tool.all as gt
print("graph-tool version: {}".format(gt.__version__.split(' ')[0]))
In [188]:
g = gt.collection.data['power']
display(Markdown(gt.collection.descriptions['power']))
In this graph an edge represents a power supply line. A node is either a generator, a transformator, or a substation.
Task 1a: Create a drawing of this graph that emphasizes nodes that have more than 10 incident power supply lines. Set the size of all other nodes to 0, but retain visibility of the power lines.
In [189]:
# TODO
Out[189]:
Task 1b: Identify one of the centrality measures that can be used to indicate powerlines that act as a bridge between different parts of the network. Use this to emphasize structurally important nodes and powerlines.
In [195]:
# TODO
Out[195]:
In [206]:
X_knows = {
'Mary': ['Peter', 'Albert', 'DavidF', 'Peter'],
'Judy': ['Bob', 'Alan'],
'Peter': ['Mary', 'DavidF', 'Jon'],
'DavidF': ['Albert', 'Joseph', 'Peter', 'Mary'],
'Jon': ['Peter', 'Joseph', 'DavidE'],
'DavidE': ['Jon', 'Joseph', 'Albert'],
'Joseph': ['DavidE', 'Jon', 'DavidF'],
'Bob': ['Judy', 'Alan'],
'Alan': ['Bob', 'Mary', 'Judy'],
'Albert': ['DavidF', 'Mary', 'DavidE'],
}
Task: Create an undirected graph based on the information above, remove parallel edges, and draw it using a layout that resembles the tidy example given in the lecture.
In [224]:
# TODO
Out[224]:
For the following question let us work with a real social graph of facebook friendship connections. Please download facebook_combined.txt
from SNAP, the Stanford Large Network Dataset Collection and create a Graph
object with graph-tool
. The dataset contains the ego networks of 10 facebook users, i.e. the friends of each of these users and the connections among those friends.
Goal of this questions is to use centrality measures to determine influencers among the users, not including the ego nodes themselves.
Task 3a: Load the dataset and create a drawing of the graph.
In [227]:
# TODO
In [229]:
# TODO
Out[229]:
Task 3b: Calculate and show a histogram of pairwise distances among users, i.e. on the shortest paths of friendship connections among any pair of users. Use a log-scale to show the frequencies in the histogram. What is the diameter of this graph?
Hint: Calculating and drawing the histograms amounts to two lines of code using gt
and plt
modules.
In [230]:
# TODO
Task 3c: Determine influencers, i.e. people withing the ego network that are well connected among friends.
There are several steps to this analysis. First, remove the ego nodes. You can do this in an approximate way, simply by calculating a measure of influence of a user and removing the highest scoring nodes, assuming that these naturally are the ego nodes whose friends this network consists of.
Use PageRank
as a measure of influence of a node.
Step 1 - Create a GraphView and drawing what only retains nodes with a pagerank
$< 0.002$.
In [ ]:
# TODO
In [234]:
# TODO
Out[234]:
Step 2 - Only retain the largest connected component of this graph, i.e. create another GraphView
of only this largest component.
In [237]:
# TODO
Step 3 - Recalculate pagerank for each node, choose a threshold (e.g. 0.0005) above which only a small part of the users are selected. Create a drawing that emphasizes these users that play central roles among their friends.
In [244]:
# TODO
In [245]:
# TODO
Out[245]:
Consider a graph $G$ consisting of $N$ nodes and $M$ relations (edges) having weights. Forcedirected graph layouts (FDL) try to create a (2D) embedding of $G$ so that strongly related nodes are placed close to each other. Now, consider a $K$-dimensional dataset $D$ of $N$ data points. Dimensionality reduction (DR) methods try to create a (2D) embedding of $D$ so that strongly related data points (in terms of their $K$-dimensional distances) are placed close to each other. Given the similarities of the two types of techniques, how could we use a FDL method to create a DR embedding for a given $K$-dimensional dataset $D$? And how could we conversely use a DR method to create a graph layout for a given graph $G$?
Hints: To answer the question, first elaborate on the notation of both the graph $G$ and dataset $D$. Next, define, in terms of the specific aspects of these datasets, what would be the corresponding graph $G$ (for a $K$-dimensional dataset $D$) so that we could use a FDL on $G$ to create a 2D embedding of $D$; and conversely, what would be the corresponding dataset $D$ (for a graph $G$), so we could use a DR method to create a 2D embedding of $G$.
[A. Telea, Data Visualization - Principles and Practice, Chapter 11]