In [ ]:
%%capture
!rm -rf shakespeare_data/plays_xml
!unzip -P rtad shakespeare_data/plays.zip -d shakespeare_data/plays_xml
In [ ]:
%%capture
import numpy as np
import networkx as nx
from lxml import etree
import itertools
from datascience import *
import matplotlib.pyplot as plt
%matplotlib inline
Mark Algee-Hewitt looks at thousands of plays across centuries. But as we've learned so far, to do this we first have to figure out how to calculate the metrics we're interested in for a single text. So let's take a look at a single play. Luckily, there are databases that have already annotated many plays in a markup language called XML, and especially well-researched corpora have extensive metadata. We'll look at the Shakespeare corpus with data obtained from https://www.playshakespeare.com/ .
We'll start by looking at Othello.
In [ ]:
with open("shakespeare_data/plays_xml/othello_ps_v3.xml") as f:
    othello_xml = etree.fromstring(f.read().encode())
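To get a feel for what we just parsed, here is a simplified, hand-written fragment in the same markup (the tag names match the ones we'll query below; the real file nests scenes inside acts and carries much more metadata):
In [ ]:
# a toy fragment illustrating the markup structure (not from the real file)
toy_fragment = etree.fromstring("""<scene>
  <scenetitle>Scene 1</scenetitle>
  <speech>
    <speaker>Iago</speaker>
    <line>I am not what I am.</line>
  </speech>
</scene>""".encode())
[(e.tag, e.text) for e in toy_fragment.iter()]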
If we're trying to build a network, we need two things: 1) nodes and 2) edges. For Algee-Hewitt, and for us today, that means we need to know the characters in Othello and with whom they communicate. We'd also like to know how often each interaction occurs.
We can get all elements of the XML tree by iterating over all the nodes:
In [ ]:
all_elements = list(othello_xml.iter())
all_elements
That's a lot of information! Let's pull out all of the speakers. Every speaker element has a text attribute containing the character's name, or an abbreviation of it.
In [ ]:
[e.text for e in all_elements if e.tag == "speaker"]
To get a unique list of characters, we'll use a set:
In [ ]:
set([e.text for e in all_elements if e.tag == "speaker"])
Great start! In network analysis there are two fundamental concepts. The first is a node: an entity that can have relationships with other entities. In literature this is often a character, but it could be a Twitter user, an organization, a geographic location, or even a word!
We may also be interested in a node's properties. If it's a character, we might want to know how often they speak, their age, etc. We can add these to the network as further layers.
The second concept is an edge. An edge connects two nodes. We're foremost interested in the volume of connections between nodes; for literature, this would be the number of times two characters interact.
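As a minimal sketch of both concepts in NetworkX (the names and numbers below are invented for illustration):
In [ ]:
toy = nx.Graph()
# nodes are entities; keyword arguments become node attributes (extra layers)
toy.add_node("Iago", num_lines=100)
toy.add_node("Othello", num_lines=80)
# an edge connects two nodes; its weight records the volume of interaction
toy.add_edge("Iago", "Othello", weight=9)
toy.nodes(data=True), toy.edges(data=True)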
As we learned from Moretti and our readings for today, this is a very difficult task for most texts. Where does one character's speech end and another's begin? Luckily, in plays this is somewhat easier to identify (though still not perfectly clear-cut).
For Shakespeare, we'll settle for co-presence: if two characters appear in the same scene together, we'll increase our measure of their interaction.
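The counting itself reduces to a set intersection: each shared scene adds one to the pair's interaction count. A quick sketch with made-up scene lists:
In [ ]:
# made-up scene lists for two characters
scenes_a = {"Act 1 Scene 1", "Act 1 Scene 3", "Act 2 Scene 1"}
scenes_b = {"Act 1 Scene 3", "Act 2 Scene 1", "Act 5 Scene 2"}
# the number of shared scenes is the size of the set intersection
len(scenes_a & scenes_b)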
Thus, for each character, we want to know how many lines they speak in the entire play, along with which scenes they appear in. We can then collate this with the other characters.
The get_cast_dict function below will parse the XML data and extract this information.
In [ ]:
cast_dict = {}
for c in set([e.text for e in all_elements if e.tag == "speaker"]):
cast_dict[c] = {"num_lines": 0,
"scenes": []}
cast_dict
In [ ]:
# extract all scene elements from the xml
scenes = [e for e in all_elements if e.tag == "scene"]
scenes
In [ ]:
elements = [e.find("acttitle").text for e in all_elements if e.tag == "act"]
In [ ]:
def get_cast_dict(all_elements):
    '''
    Returns a dictionary with the total number of lines and the scenes each
    character appears in.
    '''
    cast_dict = {}
    # first get a unique set of all characters appearing in the play
    for c in set([e.text for e in all_elements if e.tag == "speaker"]):
        cast_dict[c] = {"num_lines": 0,
                        "scenes": []}
    # extract all act elements from the xml
    acts = [e for e in all_elements if e.tag == "act"]
    for a in acts:
        # get the title of the act
        act_title = a.find("acttitle").text
        # get the scene elements within the act
        scenes = [e for e in a if e.tag == "scene"]
        # iterate through each scene
        for sc in scenes:
            # grab all the speeches in the scene
            speeches = [s for s in sc if s.tag == "speech"]
            # iterate through speeches
            for s in speeches:
                # increment the number of lines for the speaker
                cast_dict[s.find("speaker").text]["num_lines"] += len(s.findall("line"))
            # find the speaker of each speech
            speakers = [s.find("speaker").text for s in speeches]
            # add the act and scene title for each speaker appearing in the scene
            for s in set(speakers):
                cast_dict[s]["scenes"].append(act_title + " " + sc.find("scenetitle").text)
    # reduce each character's scene list to a unique set
    for c in cast_dict.keys():
        cast_dict[c]["scenes"] = list(set(cast_dict[c]["scenes"]))
    return cast_dict
In [ ]:
cast_dict = get_cast_dict(all_elements)
cast_dict
That's all we need to make a basic network and do some analysis! We have all the character names and the scenes in which they appear. We can collate some of this information to find out in which scenes certain characters appear together. This will happen in our make_graph function.
We'll use the NetworkX Python library to turn this dictionary into a graph object. Let's write a function:
In [ ]:
def make_graph(c_dict):
    '''
    Accepts a dictionary with the number of lines and scenes per character
    and returns a NetworkX graph object.
    '''
    # set up the graph object
    G = nx.Graph()
    # add nodes with attributes for number of lines and scenes
    for c in c_dict.keys():
        if c_dict[c]["num_lines"] > 0:
            G.add_node(
                c,
                number_of_lines=c_dict[c]["num_lines"],
                scenes=c_dict[c]["scenes"]
            )
    # make edges by iterating over all combinations of nodes
    for (node1, data1), (node2, data2) in itertools.combinations(G.nodes(data=True), 2):
        # count scenes together by taking the intersection of their scene sets
        scenes_together = len(set(data1['scenes']) & set(data2['scenes']))
        if scenes_together:
            # weight the edge by the number of shared scenes
            G.add_edge(node1, node2, weight=scenes_together)
    return G
In [ ]:
G = make_graph(cast_dict)
We can graph this using matplotlib:
In [ ]:
# nodes should be sized by number of lines
node_size = [data['number_of_lines'] for __, data in G.nodes(data=True)]
node_color = 'blue'
plt.figure(figsize=(13,8)) # make the figure size a little larger
plt.axis('off') # remove the axis, which isn't meaningful in this case
plt.title("Othello's Social Network", fontsize=20)
# The 'k' argument determines how spaced out the nodes will be from
# one another on the graph.
pos = nx.spring_layout(G, k=0.5)
nx.draw_networkx(
    G,
    pos=pos,
    node_size=node_size,
    node_color=node_color,
    edge_color='gray',  # change edge color
    alpha=0.3,  # make nodes more transparent to make labels clearer
    font_size=14,
)
Our graph, G, is a powerful object. We can calculate many of the standard network analysis statistics. There are various measures of centrality, many of which were referenced in the reading.
In [ ]:
network_tab = Table()
network_tab.append_column(label="Characters", values=[c for c in sorted(cast_dict.keys())])
network_tab.show()
Wikipedia defines "degree centrality":
Historically first and conceptually simplest is degree centrality, which is defined as the number of links incident upon a node (i.e., the number of ties that a node has).
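Before computing this for Othello, a quick toy check of the definition (NetworkX normalizes by the maximum possible degree, n - 1):
In [ ]:
# a path graph a - b - c: b has two ties, a and c have one each
toy = nx.Graph([("a", "b"), ("b", "c")])
nx.degree_centrality(toy)  # {'a': 0.5, 'b': 1.0, 'c': 0.5}, since n - 1 = 2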
In [ ]:
dc = [x[1] for x in sorted(nx.degree_centrality(G).items(), key=lambda x: x[0])]
network_tab.append_column(label="Degree Centrality", values=dc)
network_tab.show()
Wikipedia defines "betweeness centrality":
Betweenness is a centrality measure of a vertex within a graph (there is also edge betweenness, which is not discussed here). Betweenness centrality quantifies the number of times a node acts as a bridge along the shortest path between two other nodes.
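Concretely: in the same toy path a - b - c, the only shortest path between a and c runs through b, so b acts as a bridge while the endpoints bridge nothing (NetworkX normalizes the counts by default):
In [ ]:
toy = nx.Graph([("a", "b"), ("b", "c")])
nx.betweenness_centrality(toy)  # {'a': 0.0, 'b': 1.0, 'c': 0.0}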
In [ ]:
bc = [x[1] for x in sorted(nx.betweenness_centrality(G).items(), key=lambda x: x[0])]
network_tab.append_column(label="Betweenness Centrality", values=bc)
network_tab.show()
Wikipedia defines "eigenvector centrality":
Eigenvector centrality (also called eigencentrality) is a measure of the influence of a node in a network. It assigns relative scores to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes.
$x_v = \frac{1}{\lambda} \sum_{t \in M(v)}x_t = \frac{1}{\lambda} \sum_{t \in G} a_{v,t}x_t$
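In other words, a node's score is proportional to the sum of its neighbors' scores (M(v) is the set of v's neighbors and λ the largest eigenvalue of the adjacency matrix), which makes the scores the leading eigenvector of that matrix. A minimal numpy sketch on a toy star graph:
In [ ]:
# adjacency matrix of a toy star graph: node 0 linked to nodes 1, 2, 3
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]])
# eigh returns eigenvalues in ascending order for a symmetric matrix
vals, vecs = np.linalg.eigh(A)
centrality = np.abs(vecs[:, -1])  # eigenvector of the largest eigenvalue
centrality  # the hub (node 0) scores highest, roughly [0.71, 0.41, 0.41, 0.41]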
In [ ]:
ec = [x[1] for x in sorted(nx.eigenvector_centrality(G).items(), key=lambda x: x[0])]
network_tab.append_column(label="Eigenvector Centrality", values=ec)
network_tab.show()
matplotlib isn't always the most beautiful option. A popular way of visualizing networks is with JavaScript's D3 library. Luckily, networkx allows us to export the network information to JSON:
In [ ]:
from networkx.readwrite import json_graph
import json
d3_data = json_graph.node_link_data(G)
d3_data
We can then add this to a D3 template:
In [ ]:
import re
with open('network.html', 'r') as f:
    net_html = f.read()
# find the JSON payload embedded in the template and replace it with our network data
pattern = re.compile(r'(<script type="application/json" id="net">)(\s*.*)')
net_html = net_html.replace(re.findall(pattern, net_html)[-1][-1].strip(), json.dumps(d3_data).strip())
with open('network.html', 'w') as f:
    f.write(net_html)
We can then display the HTML file in an IFrame:
In [ ]:
from IPython.display import IFrame
IFrame('network.html', width=700, height=900)
Algee-Hewitt calculated the Gini coefficient of the eigenvector centralities. He essentially wanted to know whether importance in a network was evenly distributed or concentrated in the hands of a few. The closer the Gini coefficient is to 0, the more equal the distribution; the closer to 1, the closer one gets to complete inequality. I've found a function online that will calculate the Gini coefficient for you!
In [ ]:
def gini(array):
    """Calculate the Gini coefficient of a numpy array."""
    # https://github.com/oliviaguest/gini
    array = np.sort(array)  # values must be sorted
    index = np.arange(1, array.shape[0] + 1)  # index per array element
    n = array.shape[0]  # number of array elements
    return (np.sum((2 * index - n - 1) * array)) / (n * np.sum(array))  # Gini coefficient
Just to demonstrate, let's make a very unequal array:
In [ ]:
np.concatenate((np.zeros(99), np.ones(1)))
The Gini coefficient should be close to 1:
In [ ]:
gini(np.concatenate((np.zeros(99), np.ones(1))))
What if we have half zeroes and half ones?
In [ ]:
gini(np.concatenate((np.zeros(50), np.ones(50))))
All ones?
In [ ]:
gini(np.ones(50))
Now we can use the gini function on Othello to see how evenly distributed centrality is:
In [ ]:
gini(network_tab['Eigenvector Centrality'])
Great, but that's not terribly interesting by itself; we want to see how it compares to other plays. We'll do that for homework.
First, let's write a function to calculate Algee-Hewitt's second measure: the percentage of characters whose eigenvector centrality falls in the top quartile. You'll want to use the np.percentile function!
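If you need a refresher on np.percentile, here's how it behaves on a toy array (values invented):
In [ ]:
toy_values = np.array([1, 2, 3, 4])
# the value below which 75% of the data falls (linear interpolation by default)
np.percentile(toy_values, 75)  # 3.25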
In [ ]:
def percentage_top_quartile(character_table):
    # YOUR CODE HERE
    return percentage
In [ ]:
percentage_top_quartile(network_tab['Eigenvector Centrality'])
In [ ]:
!ls shakespeare_data/plays_xml/
Write some code to loop through at least 5 of these plays and print the most central character in each play according to eigenvector centrality:
In [ ]:
Now use the gini function to calculate the Gini coefficient of the eigenvector centralities for each of the 5 plays and create a bar chart. Do the same for the percentage in the top quartile. What do these results mean?
In [ ]:
Much of this code is adapted from http://www.adampalay.com/blog/2015/04/17/shakespeare-social-networks/ .