David Robinson posted a great article Analyzing networks of characters in 'Love Actually' on 25th December 2015, which uses R to analyse the connections between the characters in the film Love Actually.
This notebook and the associated python code attempts to reproduce his analysis using tools from the Python ecosystem. The code in this notebook is copied from love_actually.py as of 29th December 2015 (ideally I'd keep them in sync but no promises)
To start we need to import some useful packages, shown below. Some of these needed to be installed using the Anaconda distribution or pip if conda failed:
In [1]:
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from scipy.cluster.hierarchy import dendrogram, linkage
import ggplot as gg
import networkx as nx
In [2]:
%matplotlib inline
First we need to define a couple of functions to read the 'Love Actually' script into a list of lines and the cast into a Pandas DataFrame. The data_dir variable gets the location of the directory containing the input files from an environment variable so set this environment variable to the location appropriate for you if you're following along.
In [3]:
data_dir = os.path.join(os.getenv('MDA_DATA_DIR', '/home/mattmcd/Work/Data'), 'LoveActually')
def read_script():
"""Read the Love Actually script from text file into list of lines
The script is first Google hit for 'Love Actually script' as a doc
file. Use catdoc or Libre Office to save to text format.
"""
with open(os.path.join(data_dir, 'love_actually.txt'), 'r') as f:
lines = [line.strip() for line in f]
return lines
def read_actors():
"""Read the mapping from character to actor using the varianceexplained data file
Used curl -O http://varianceexplained.org/files/love_actually_cast.csv to get a local copy
"""
return pd.read_csv(os.path.join(data_dir, 'love_actually_cast.csv'))
The cell below reproduces the logic in the first cell of the original article. It doesn't feel quite as nice to me as the dplyr syntax but is not too bad.
In [4]:
def parse_script(raw):
df = pd.DataFrame(raw, columns=['raw'])
df = df.query('raw != ""')
df = df[~df.raw.str.contains("(song)")]
lines = (df.
assign(is_scene=lambda d: d.raw.str.contains(" Scene ")).
assign(scene=lambda d: d.is_scene.cumsum()).
query('not is_scene'))
speakers = lines.raw.str.extract('(?P<speaker>[^:]*):(?P<dialogue>.*)')
lines = (pd.concat([lines, speakers], axis=1).
dropna().
assign(line=lambda d: np.cumsum(~d.speaker.isnull())))
lines.drop(['raw', 'is_scene'], axis=1, inplace=True)
return lines
In [5]:
def read_all():
lines = parse_script(read_script())
cast = read_actors()
combined = lines.merge(cast).sort('line').assign(
character=lambda d: d.speaker + ' (' + d.actor + ')').reindex()
# Decode bytes to unicode
combined['character'] = map(lambda s: s.decode('utf-8'), combined['character'])
return combined
In [6]:
# Read in script and cast into a dataframe
lines = read_all()
# Print the first few rows
lines.head(10)
Out[6]:
Constructing the n_character x n_scene matrix showing how many lines each character has in each scene is quite easy using pandas groupby method to create a hierarchical index, followed by the unstack method to convert the second level of the index into columns.
In [7]:
def get_scene_speaker_matrix(lines):
by_speaker_scene = lines.groupby(['character', 'scene'])['line'].count()
speaker_scene_matrix = by_speaker_scene.unstack().fillna(0)
return by_speaker_scene, speaker_scene_matrix
In [8]:
# Group by speaker and scene and construct the speaker-scene matrix
by_speaker_scene, speaker_scene_matrix = get_scene_speaker_matrix(lines)
Now we get to the analysis itself. First we perform a hierarchical clustering of the data using the same data normalization and clustering method as the original article. The leaf order in the dendrogram is returned for use in later steps of the analysis, as it has similar characters close to each other.
In [9]:
def plot_dendrogram(mat, normalize=True):
# Cluster and plot dendrogram. Return order after clustering.
if normalize:
# Normalize by number of lines
mat = mat.div(mat.sum(axis=1), axis=0)
Z = linkage(mat, method='complete', metric='cityblock')
labels = mat.index
f = plt.figure()
ax = f.add_subplot(111)
R = dendrogram(Z, leaf_rotation=90, leaf_font_size=8,
labels=labels, ax=ax, color_threshold=-1)
f.tight_layout()
ordering = R['ivl']
return ordering
In [10]:
# Hierarchical cluster and return order of leaves
ordering = plot_dendrogram(speaker_scene_matrix)
In [11]:
print(ordering)
Plotting the timeline of character versus scene is very similar to the R version since we make use of the Python ggplot port from yhat. It seems like there are a few differences between this package and the R ggplot2 library.
In particular:
(note that these points may just be limitations of my understanding of ggplot in Python)
In [12]:
def get_scenes_with_multiple_characters(by_speaker_scene):
# Filter speaker scene dataframe to remove scenes with only one speaker
# n_scene x 1 Series with index 'scene'
filt = by_speaker_scene.count('scene') > 1
# n_scene x n_character Index
scene_index = by_speaker_scene.index.get_level_values('scene')
# n_scene x n_character boolean vector
ind = filt[scene_index].values
return by_speaker_scene[ind]
In [13]:
def order_scenes(scenes, ordering=None):
# Order scenes by e.g. leaf order after hierarchical clustering
scenes = scenes.reset_index()
scenes['scene'] = scenes['scene'].astype('category')
scenes['character'] = scenes['character'].astype('category', categories=ordering)
scenes['character_code'] = scenes['character'].cat.codes
return scenes
In [14]:
# Order the scenes by cluster leaves order
scenes = order_scenes(get_scenes_with_multiple_characters(by_speaker_scene), ordering)
In [15]:
def plot_timeline(scenes):
# Plot character vs scene timelime
# NB: due to limitations in Python ggplot we need to plot with scene on y-axis
# in order to label x-ticks by character.
# scale_x_continuous and scale_y_continuous behave slightly differently.
print (gg.ggplot(gg.aes(y='scene', x='character_code'), data=scenes) +
gg.geom_point() + gg.labs(x='Character', y='Scene') +
gg.scale_x_continuous(
labels=scenes['character'].cat.categories.values.tolist(),
breaks=range(len(scenes['character'].cat.categories))) +
gg.theme(axis_text_x=gg.element_text(angle=30, hjust=1, size=10)))
In [16]:
# Plot a timeline of characters vs scene
plot_timeline(scenes);
In [17]:
def get_cooccurrence_matrix(speaker_scene_matrix, ordering=None):
# Co-occurrence matrix for the characters, ignoring last scene where all are present
scene_ind = speaker_scene_matrix.astype(bool).sum() < 10
if ordering:
mat = speaker_scene_matrix.loc[ordering, scene_ind]
else:
mat = speaker_scene_matrix.loc[:, scene_ind]
return mat.dot(mat.T)
In [18]:
cooccur_mat = get_cooccurrence_matrix(speaker_scene_matrix, ordering)
The heatmap below is not as nice as the default R heatmap as it is missing the dendrograms on each axis and also the character names, so could be extended e.g. following Hierarchical Clustering Heatmaps in Python.
It otherwise shows a similar result to the original article in that ordering using the dendrogram leaf order has resulted in a co-occurrence matrix predominantly of block diagonal form.
In [19]:
def plot_heatmap(cooccur_mat):
# Plot co-ccurrence matrix as heatmap
plt.figure()
plt.pcolor(cooccur_mat)
In [20]:
# Plot heatmap of co-occurrence matrix
plot_heatmap(cooccur_mat)
The network plot gives similar results to the original article. This could be extended, for example by adding weights to the graph edges.
In [21]:
def plot_network(cooccur_mat):
# Plot co-occurence matrix as network diagram
G = nx.Graph(cooccur_mat.values)
pos = nx.graphviz_layout(G) # NB: needs pydot installed
plt.figure()
nx.draw_networkx_nodes(G, pos, node_size=700, node_color='c')
nx.draw_networkx_edges(G, pos)
nx.draw_networkx_labels(
G, pos,
labels={i: s for (i, s) in enumerate(cooccur_mat.index.values)},
font_size=10)
plt.axis('off')
plt.show()
In [22]:
# Plot network graph of co-occurrence matrix
plot_network(cooccur_mat)
This notebook attempted to reproduce using Python the R analysis and visualization of the character network in 'Love Actually'. Overall this was a useful exercise in learning similarities and differences between the tools, as well as becoming more familiar with the Pandas syntax. The output currently doesn't look quite as nice as the R original but is a good starting point for future tinkering.
I am Matt McDonnell and am currently working as a Data Scientist for a Cambridge (UK) based startup. Previously my main language has been MATLAB used in quantitative analyst and quantitative developer roles for a London based asset management firm, as well as in a Technical Consultant role for MathWorks. In my current role the main languages used are R, Python and SQL, with Clojure being the main language used outside the Data Science team.