Visualising and Analysing Networks

By now you should have a decent understanding of how bookworm assembles a list of character relationships and assesses their strength.
The real point of this project, though, is to give the user a tactile, intuitive view of the network of characters and how they interact. This notebook should cover the methods I've used to achieve that.

Let's start by importing all the usual stuff and loading in the Harry Potter network:



In [5]:

    
from bookworm import *



In [6]:

    
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12,9)

import pandas as pd
import numpy as np



In [7]:

    
book = load_book('data/raw/hp_philosophers_stone.txt')
characters = extract_character_names(book)
sequences = get_sentence_sequences(book)

df = find_connections(sequences, characters)
cooccurence = calculate_cooccurence(df)

Visualisation with NetworkX

NetworkX is a very nice Python library which is built to handle graphs and networks. We can load our data into a NetworkX Graph object by building up a table of character interactions as follows:



In [8]:

    
import networkx as nx
interaction_df = get_interaction_df(cooccurence, threshold=2)
interaction_df.sample(5)









    Out[8]:







  
    
      
     source            target           value
49   ('Gryffindor ',)  ('Slytherin ',)  3
91   ('Hooch ',)       ('Madam ',)      7
43   ('George ',)      ('Weasley ',)    6
106  ('Nimbus ',)      ('Thousand ',)   7
89   ('Hermione ',)    ('Snape ',)      4

get_interaction_df() is defined in bookworm/build_network.py, and works by searching through the provided co-occurrence matrix for interactions with strength above a specified threshold.
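
To make the thresholding idea concrete, here is a minimal sketch in plain Python. It is not bookworm's implementation: the function name `interactions_above_threshold`, the toy names and the toy counts are all made up for illustration, and I'm assuming "above" means strictly greater than the threshold and that the matrix is symmetric.

```python
def interactions_above_threshold(names, matrix, threshold):
    """Return (source, target, value) rows for pairs stronger than threshold.

    matrix[i][j] holds the co-occurrence count of names[i] and names[j].
    The matrix is assumed to be symmetric, so only the upper triangle is scanned.
    """
    rows = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if matrix[i][j] > threshold:
                rows.append((names[i], names[j], matrix[i][j]))
    return rows


# Toy data, invented for this example
names = ["Harry", "Ron", "Hedwig"]
matrix = [
    [0, 9, 3],
    [9, 0, 1],
    [3, 1, 0],
]

rows = interactions_above_threshold(names, matrix, threshold=2)
print(rows)
```

The Ron/Hedwig pair has a count of 1, so it falls below the threshold and never becomes an edge in the network.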

We can load that interaction dataframe into a NetworkX Graph using the super simple from_pandas_dataframe() function (in NetworkX 2.0 and later, the equivalent function is called from_pandas_edgelist()):



In [9]:

    
G = nx.from_pandas_dataframe(interaction_df,
                             source='source',
                             target='target')

And, just as easily, visualise it with draw_spring(), where 'spring' is a reference to the idea that edges in the network are treated like physical springs, with elasticity/compressibility related to the weights of the connections:



In [10]:

    
nx.draw_spring(G, with_labels=True)









    




Very nice... ish. There's more that could be done to clean up the visualisation and make it pretty, but it's fine for now.

One of the nicest things about NetworkX is all of its built-in network analysis functionality. For example, we can use PageRank or HITS to give us the most 'important' or 'central' nodes in the network. These algorithms were originally developed to analyse linked networks of websites, but they can just as easily be applied to stations in transport networks, streets in cities, similar products in ecommerce systems, friends in social circles, or connected characters in books.



In [11]:

    
pd.Series(nx.pagerank(G)).sort_values(ascending=False)[:5]









    Out[11]:





('Harry ',)        0.127126
('Professor ',)    0.032636
('Snape ',)        0.031137
('Hermione ',)     0.026902
('Malfoy ',)       0.024660
dtype: float64
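
To give a feel for what PageRank is doing under the hood, here is a minimal power-iteration sketch in plain Python. It is not NetworkX's implementation: the function name `pagerank_sketch` and the toy graph are made up for illustration, and it assumes an undirected graph in which every node has at least one neighbour.

```python
def pagerank_sketch(adjacency, damping=0.85, iterations=100):
    """Approximate PageRank by repeatedly redistributing rank along edges.

    adjacency maps each node to the set of its neighbours (undirected graph,
    no isolated nodes assumed).
    """
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbour shares its current rank equally among its own neighbours
            incoming = sum(rank[nb] / len(adjacency[nb]) for nb in adjacency[node])
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


# Toy graph, invented for this example
toy = {
    "Harry": {"Ron", "Hermione", "Snape"},
    "Ron": {"Harry", "Hermione"},
    "Hermione": {"Harry", "Ron"},
    "Snape": {"Harry"},
}

ranks = pagerank_sketch(toy)
print(max(ranks, key=ranks.get))  # prints Harry
```

Harry has the most connections in the toy graph, so rank keeps flowing back to him on every iteration, which is the same effect we see in the real network above.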



In [12]:

    
# nx.hits() returns two dictionaries: hub scores and authority scores
hubs, authorities = nx.hits(G)
pd.Series(hubs).sort_values(ascending=False)[:5]









    Out[12]:





('Harry ',)        0.101015
('Snape ',)        0.048103
('Professor ',)    0.044831
('Ron ',)          0.036281
('Hermione ',)     0.035260
dtype: float64

We can ask NetworkX for cliques in the graph (groups of nodes in which every member is directly connected to every other member), which are especially relevant to social networks like this. enumerate_all_cliques() gives us a massive list of all the cliques it finds, ordered by size - we'll just return the last one, which is one of the largest, because it's most illustrative of what a clique is in this context...



In [13]:

    
list(nx.enumerate_all_cliques(G))[-1]









    Out[13]:





["('Petunia ',)",
 "('Vernon ',)",
 "('Harry ',)",
 "('Uncle ',)",
 "('Aunt ',)",
 "('Dudley ',)"]

It's isolated the people who appear in the book at Number 4, Privet Drive. Fun!
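
If the definition of a clique feels abstract, the check itself is tiny. Here is a minimal sketch in plain Python, using a toy edge list that I've invented for illustration (it is not taken from the network above); `build_adjacency` and `is_clique` are hypothetical helper names, not NetworkX functions.

```python
from itertools import combinations


def build_adjacency(edges):
    """Turn a list of undirected (a, b) edges into a dict of neighbour sets."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def is_clique(adjacency, members):
    """A group is a clique if every pair of members is directly connected."""
    return all(b in adjacency[a] for a, b in combinations(members, 2))


# Toy data: a fully connected household, plus one outsider who only knows Harry
household = ["Harry", "Vernon", "Petunia", "Dudley"]
edges = list(combinations(household, 2)) + [("Harry", "Hagrid")]
adjacency = build_adjacency(edges)

print(is_clique(adjacency, household))               # True
print(is_clique(adjacency, household + ["Hagrid"]))  # False
```

Adding the outsider breaks the clique because he is only connected to one member of the household.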

We can also illustrate the communicability of one character with another. We would expect characters who don't spend much time together in the book to have a harder time communicating with one another than those who spend a lot of time together, which shows up as a smaller communicability value:



In [14]:

    
comms = nx.communicability(G)

print(comms["('Vernon ',)"]["('Dumbledore ',)"])
print(comms["('Harry ',)"]["('Hermione ',)"])









    



49.34499406953878
382.85523290242776
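
Communicability counts the walks of every length between two nodes, with a walk of length k weighted by 1/k!, which works out to an entry of the matrix exponential of the adjacency matrix. Here is a minimal sketch of that sum in plain Python. It is not how NetworkX computes it: the function name `communicability_sketch` and the three-node toy graph are made up for illustration.

```python
import math


def communicability_sketch(matrix, terms=30):
    """Approximate exp(A) for adjacency matrix A with a truncated power series."""
    n = len(matrix)
    # power starts as the identity matrix, i.e. A**0 / 0!
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [row[:] for row in power]
    for k in range(1, terms):
        # Multiply the previous term by A / k, so power becomes A**k / k!
        power = [
            [sum(power[i][m] * matrix[m][j] for m in range(n)) / k for j in range(n)]
            for i in range(n)
        ]
        for i in range(n):
            for j in range(n):
                total[i][j] += power[i][j]
    return total


# Toy chain of three characters: 0 - 1 - 2
chain = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

comm = communicability_sketch(chain)
print(comm[0][1])  # direct neighbours
print(comm[0][2])  # connected only through node 1
```

The direct neighbours get the larger value, because short walks carry the most weight in the sum.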

Similarly, we can use NetworkX's implementation of classic pathfinding algorithms like Dijkstra's algorithm and A* to return paths between characters. For example, if Hedwig was interested in getting to know Nicolas Flamel, and wanted to do so with as few new introductions as possible along the way, these are the shoulders she would need to tap on for introductions:



In [15]:

    
nx.dijkstra_path(G, 
                 source="('Hedwig ',)", 
                 target="('Flamel ',)")









    Out[15]:





["('Hedwig ',)",
 "('Harry ',)",
 "('Dumbledore ',)",
 "('Nicolas ',)",
 "('Flamel ',)"]

Pathfinding is clearly better suited to transport networks and the like, but it's still interesting to see it applied here...

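
When every edge counts the same (as it does when no edge weights are supplied), Dijkstra's algorithm finds the same kind of path as a plain breadth-first search: the one with the fewest hops. Here is a minimal sketch of that search in plain Python. It is not NetworkX's implementation: the function name `fewest_introductions` and the toy adjacency lists are made up for illustration.

```python
from collections import deque


def fewest_introductions(adjacency, source, target):
    """Breadth-first search for a path with the fewest edges, or None."""
    previous = {source: None}  # remembers how each node was first reached
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            # Walk back through the recorded steps to rebuild the path
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbour in adjacency[current]:
            if neighbour not in previous:
                previous[neighbour] = current
                queue.append(neighbour)
    return None


# Toy graph, invented for this example
toy = {
    "Hedwig": ["Harry"],
    "Harry": ["Hedwig", "Ron", "Dumbledore"],
    "Ron": ["Harry"],
    "Dumbledore": ["Harry", "Flamel"],
    "Flamel": ["Dumbledore"],
}

print(fewest_introductions(toy, "Hedwig", "Flamel"))
# ['Hedwig', 'Harry', 'Dumbledore', 'Flamel']
```
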
There's an anecdote which gets passed around about a young South Korean computer scientist in academia who wanted to rise to the top of his field as quickly as possible. By building a network of the academics in his field and the people they had published with, he was able to quickly work out which authors were most influential, and the path of introductions and coauthorship that he would need to take from his own weak position in the network to publishing papers with the most influential academics and becoming a central node himself. I have no idea whether the anecdote is true or not, but it's a nice story, and illustrative of where and why this stuff might be useful to think about. Applying it to owls and alchemists is fun, but it can be useful in the real world too...

All of this stuff dates back to the 1730s and the origins of graph theory, with Euler and the Seven Bridges of Königsberg. It's a subject worth reading about if you haven't already - it's fascinating, and the world opens up to you in entirely new ways when you develop some intuition around when and where networks appear in nature and how they can be analysed. Clever applications of graph theory are absolutely key to the success of companies like Google, Facebook, and Amazon.

More dynamic visualisations with d3.js

The approach above is fast and fun, and allows us to run a load of interesting algorithms over the network, but it all feels very static... The point of this project is to visualise these networks in a way which gives the user an intuitive sense of the relationships between characters.

We can get closer to that intuitive, touchy-feely sense of the network by putting together a force-directed graph with d3.js, like the one by Mike Bostock (the creator of d3) shown here. Bostock is visualising the boring old Les Mis dataset - we're going to feed d3 our freshly made Harry Potter one.

First we need to set up the data structure which the d3 script requires.



In [35]:

    
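# Collect names from both columns so every link endpoint has a matching node
names = set(interaction_df['source']) | set(interaction_df['target'])
nodes = [{"id": str(name), "group": 1} for name in names]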
links = interaction_df.to_dict(orient='records')
d3_dict = {'nodes': nodes, 'links': links}
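
The d3 script looks up each link's source and target by node id, so every name that appears in a link needs its own entry in the nodes list. Here is a minimal sketch of the structure and a sanity check in plain Python. The helper names `build_d3_dict` and `missing_endpoints` and the toy records are made up for illustration, they are not part of bookworm.

```python
import json


def build_d3_dict(records):
    """Build the {'nodes': [...], 'links': [...]} structure from link records."""
    names = sorted({r["source"] for r in records} | {r["target"] for r in records})
    nodes = [{"id": name, "group": 1} for name in names]
    return {"nodes": nodes, "links": records}


def missing_endpoints(d3_dict):
    """Return link endpoints which have no matching node id."""
    ids = {node["id"] for node in d3_dict["nodes"]}
    endpoints = {
        link[key] for link in d3_dict["links"] for key in ("source", "target")
    }
    return endpoints - ids


# Toy records, invented for this example
records = [
    {"source": "Harry", "target": "Ron", "value": 9},
    {"source": "Harry", "target": "Hedwig", "value": 3},
]

d3_dict = build_d3_dict(records)
print(json.dumps(d3_dict, indent=2))
print(missing_endpoints(d3_dict))  # set()
```

An empty set from the check means the force simulation should be able to resolve every link.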

We can write that dictionary out to a .json file in the project's d3 directory using the json package:



In [36]:

    
import json

with open('bookworm/d3/bookworm.json', 'w') as fp:
    json.dump(d3_dict, fp)

Jupyter notebooks allow us to run commands in other languages, so we'll use bash to do a few things from here on. For example, we can list the files in the d3 directory:



In [37]:

    
%%bash
ls bookworm/d3/









    



bookworm.json
index.html

or print out one of those files:



In [38]:

    
%%bash
cat bookworm/d3/index.html









    



<!DOCTYPE html>
<meta charset="utf-8">
<style>

.links line {
  stroke: #999;
  stroke-opacity: 0.6;
}

.nodes circle {
  stroke: #fff;
  stroke-width: 1.5px;
}

</style>
<svg width="1000" height="1000"></svg>
<script src="https://d3js.org/d3.v4.min.js"></script>
<script>

var svg = d3.select("svg"),
    width = +svg.attr("width"),
    height = +svg.attr("height");

var color = d3.scaleOrdinal(d3.schemeCategory20);

var simulation = d3.forceSimulation()
    .force("link", d3.forceLink().id(function(d) { return d.id; }))
    .force("charge", d3.forceManyBody())
    .force("center", d3.forceCenter(width / 2, height / 2));

d3.json("bookworm.json", function(error, graph) {
  if (error) throw error;

  var link = svg.append("g")
      .attr("class", "links")
    .selectAll("line")
    .data(graph.links)
    .enter().append("line")
      .attr("stroke-width", function(d) { return Math.sqrt(d.value); });

  var node = svg.append("g")
      .attr("class", "nodes")
    .selectAll("circle")
    .data(graph.nodes)
    .enter().append("circle")
      .attr("r", 5)
      .attr("fill", function(d) { return color(d.group); })
      .call(d3.drag()
          .on("start", dragstarted)
          .on("drag", dragged)
          .on("end", dragended));

  node.append("title")
      .text(function(d) { return d.id; });

  simulation
      .nodes(graph.nodes)
      .on("tick", ticked);

  simulation.force("link")
      .links(graph.links);

  function ticked() {
    link
        .attr("x1", function(d) { return d.source.x; })
        .attr("y1", function(d) { return d.source.y; })
        .attr("x2", function(d) { return d.target.x; })
        .attr("y2", function(d) { return d.target.y; });

    node
        .attr("cx", function(d) { return d.x; })
        .attr("cy", function(d) { return d.y; });
  }
});

function dragstarted(d) {
  if (!d3.event.active) simulation.alphaTarget(0.3).restart();
  d.fx = d.x;
  d.fy = d.y;
}

function dragged(d) {
  d.fx = d3.event.x;
  d.fy = d3.event.y;
}

function dragended(d) {
  if (!d3.event.active) simulation.alphaTarget(0);
  d.fx = null;
  d.fy = null;
}

</script>

The next cell can be used to set up a locally hosted version of that d3.js script.

It's a super-simple, two-line bash script which uses Python's built-in http.server module to serve the d3 directory locally, so that the JavaScript visualisation can run in the browser on your machine.

We dumped our graph data into a file called 'bookworm.json' in one of the cells above - that file can now be processed by 'index.html' (printed above), which displays the data using the d3.js JavaScript library.



In [39]:

    
%%bash
cd bookworm/d3/ 
python -m http.server









    



Process is interrupted.

When you've run the cell, open a new tab and go to the following address:

localhost:8000

You should see a pretty graph representation of our network bouncing around. Hover over a node to see which character it corresponds to. Click and drag nodes to play around with it. (This is super fun to do with your hands if you're running this on a touchscreen laptop. Playing with two hands also works!)

Note:

When you're finished playing, remember to navigate back to the two-line %%bash cell above and push the STOP button to kill the local server. You won't be able to run any more code in this notebook until you do.
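
If you'd rather not block the notebook at all, one alternative is to run the same built-in server in a background thread from Python. This is a minimal sketch rather than part of bookworm: the function name `serve_in_background` is made up, and it assumes Python 3.7 or later (for `ThreadingHTTPServer` and the handler's `directory` argument).

```python
import functools
import http.server
import threading


def serve_in_background(directory, port=8000):
    """Serve the files in directory over HTTP without blocking the caller."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    # A daemon thread stops automatically when the kernel shuts down
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server


# Usage, assuming the same directory layout as above:
# server = serve_in_background("bookworm/d3", port=8000)
# ...then visit http://127.0.0.1:8000 and, when you're done:
# server.shutdown()
# server.server_close()
```

The trade-off is that you have to remember to call shutdown() yourself, instead of pressing the STOP button.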

In the next notebook, we'll start considering the effect of time in novels and ways of representing temporal networks.


