Note: This notebook requires GraphLab Create 1.1 or higher
Welcome to the GraphLab Graph Analytics toolkit. In this notebook we'll use the toolkit to explore American films released between 2004 and 2013 and answer the question of whether Kevin Bacon is really the best actor to put at the center of the Kevin Bacon game.
Before we start playing with the data, we need to import some key libraries: graphlab of course, and IPython display utilities. We also tell IPython notebook and GraphLab Canvas to produce plots directly in this notebook.
In [1]:
import graphlab
from IPython.display import display
from IPython.display import Image
graphlab.canvas.set_target('ipynb')
We'll load data on performances in American movies for the last ten years, pulled from the Freebase API's film/film and film/performance topics. The Freebase data is crowd-sourced, so it's a bit messy, but it is freely available under the Creative Commons license.
Our curated data live in an Amazon S3 bucket, from which we could create an SFrame directly, but for this demo we'll first download the CSV file and save it locally. Please note that running this notebook on your machine will download the 8MB csv file to your working directory.
In [2]:
import urllib
url = 'https://static.turi.com/datasets/americanMovies/freebase_performances.csv'
urllib.urlretrieve(url, filename='freebase_performances.csv') # downloads an 8MB file to the working directory
Out[2]:
There are a few data preprocessing steps to do, for which we'll use an SFrame (for more on SFrames, see the Introduction to SFrames notebook). First, we drop a superfluous column which is created because of a missing column header. Next, we drop actor names that are equal to an empty list, which is obviously an error. Finally, we add a column with 0.5 in each row, which will come in handy later for computing graph distances between actors.
After the data is clean, let's take a peek, with GraphLab Canvas.
In [3]:
data = graphlab.SFrame.read_csv('remote://freebase_performances.csv',
column_type_hints={'year': int})
data = data[data['actor_name'] != '[]']
data['weight'] = .5
data.show()
Now we construct the graph. For the first half of our analysis we will use a bipartite graph, where actors and movies are vertices and film performances are the edges that connect actors to movies.
In [4]:
actors = data['actor_name'].unique()
films = data['film_name'].unique()
g = graphlab.SGraph()
g = g.add_edges(data, src_field='actor_name', dst_field='film_name')
g = g.add_edges(data, src_field='film_name', dst_field='actor_name')
print "Movie graph summary:\n", g.summary(), "\n"
By using the get_vertices() and get_edges() methods we can verify the data was entered correctly. Note the SGraph uses directed edges, so we enter them in both directions to get an undirected graph and the correct output from the toolkits.
In [5]:
print "Actor vertex sample:"
g.get_vertices(ids=actors).tail(5)
Out[5]:
In [6]:
print "Film vertex sample:"
g.get_vertices(ids=films).head(5)
Out[6]:
In [7]:
print "Sample edges (performances):"
g.get_edges().head(5)
Out[7]:
This graph can (and should) be saved so we can come back to it later without the data cleaning.
In [8]:
# g.save('sample_graph')
# new_g = graphlab.load_graph(filename='sample_graph')
This graph is too large to visualize, so we'll pull a small subgraph to see what the data look like.
In [9]:
selection = ['The Great Gatsby', 'The Wolf of Wall Street']
subgraph = graphlab.SGraph()
subgraph = subgraph.add_edges(g.get_edges(dst_ids=selection),
src_field='__src_id', dst_field='__dst_id')
subgraph.show(highlight=selection)
Can you guess which node is in the middle?
In [10]:
subgraph.show(vlabel='id', highlight=selection)
First, let's find the number of connected components. We'll do this first on our bipartite graph with both films and actors, but in the second half of this notebook we'll do it again with only actors.
In [11]:
cc = graphlab.connected_components.create(g, verbose=False)
cc_out = cc['component_id']
print "Connected components summary:\n", cc.summary()
There are over 2,000 components. With the 'component_size' field we can see that there is really only one very large component.
In [12]:
cc_size = cc['component_size'].sort('Count', ascending=False)
cc_size
Out[12]:
Let's pull one of the smaller connected components to see if there's anything interesting. The cc_out object is an SFrame, which acts a lot like a pandas DataFrame, but is on disk.
In [13]:
tgt = cc_size['component_id'][1]
tgt_names = cc_out[cc_out['component_id'] == tgt]['__id']
subgraph = graphlab.SGraph()
subgraph = subgraph.add_edges(g.get_edges(src_ids=tgt_names),
src_field='__src_id', dst_field='__dst_id')
film_selector = subgraph.get_vertices(ids=films)['__id']
subgraph.show(vlabel='id', highlight=film_selector)
This component corresponds to a handful of Japanese anime series. To help ourselves out later, we'll also pull out the names of the actors in the giant connected component. Here we use the SFrame.filter_by method because we're looking for matches against a set of names.
In [14]:
big_label = cc_size['component_id'][0]
big_names = cc_out[cc_out['component_id'] == big_label]
mainstream_actors = big_names.filter_by(actors, column_name='__id')['__id']
OK, let's play the Kevin Bacon game. First, let's see what movies he's been in over the last decade...
In [15]:
bacon_films = g.get_edges(src_ids=['Kevin Bacon'])
subgraph = graphlab.SGraph()
subgraph = subgraph.add_edges(bacon_films, src_field='__src_id',
dst_field='__dst_id')
subgraph.show(vlabel='id', elabel='character', highlight=['Kevin Bacon'])
... and with whom Kevin Bacon has co-starred.
In [16]:
subgraph = graphlab.SGraph()
for f in bacon_films['__dst_id']:
subgraph = subgraph.add_edges(g.get_edges(src_ids=[f], dst_ids=None),
src_field='__src_id', dst_field='__dst_id')
subgraph.show(highlight=list(bacon_films['__dst_id']), vlabel='__id', vlabel_hover=True)
We can find the shortest path distance between every other vertex and Mr. Bacon. Keep in mind that this is a bipartite graph, so movies will have half distances from our target and actors will have whole number distances from the target.
In [17]:
sp = graphlab.shortest_path.create(g, source_vid='Kevin Bacon', weight_field='weight', verbose=False)
sp_graph = sp['graph']
The computation is very quick. And we can now get the distance from Kevin Bacon to any other actor or movie. Kevin has a distance of 0, of course, and the films he's been in lately are all at distance 0.5. His co-stars in those movies are at distance 1, so they have Kevin Bacon number of 1.
Querying this for another actor is very fast because the result is already computed. Robert De Niro, for example, has a Kevin Bacon number (for the last decade) of 2. There are many paths of length two between these two actors, and 'get_path' will plot one of them.
If we go back to the connected component of Japanese films and pick one of the actors, Yuzuru Fujimoto, we see he has an infinite Kevin Bacon number over the last decade, which fits the earlier result that Yuzuru is in a different component. If we try to find the path between these two actors, we get an astronomically high number, indicating there is no path.
In [18]:
path = [x[0] for x in sp.get_path('Robert De Niro', show=True, highlight=list(films))]
query = sp_graph.get_vertices(ids='Yuzuru Fujimoto')
query.head()
Out[18]:
Who else do we want to check?? Because we're not using name disambiguation, we should check first to make sure a suggestion is in the list of actors.
In [19]:
target = 'Lydia Fox'
target in actors
path = [x[0] for x in sp.get_path(target, show=True, highlight=list(films))]
To visualize the distribution of Kevin Bacon distances, we narrow the result down to only mainstream actors (i.e. those in the same connected component as Mr. Bacon). The most common Kevin Bacon number is 3. Note there are still some half distances in this result; this is because we're working with messy data. The messiness comes from three sources:
We use the Canvas quantile histogram tool to visualize the distribution of graph distances from Kevin Bacon to all other nodes in the graph. The quantile histogram is a very fast visualization that stacks quantiles of an SArray into a pre-set number of buckets.
In [20]:
bacon_sf = sp_graph.get_vertices(ids=mainstream_actors)
bacon_sf['distance'].show()
Where there is some approximation in the quantiles shown in the quantile histogram, it's clear that the modal distance from Kevin Bacon is three hops, with 99% of the nodes falling between 1.5 and 5 hops.
Why is Kevin Bacon the target of the Kevin Bacon game? Why not Robert De Niro or Dennis Hopper, both of whom have been in a lot of movies. In fact, let's figure out just how central Kevin Bacon is, and who might be a better center for the Kevin Bacon game. First, let's find out who's been in the most movies. We'll start with some good guesses about prolific actors.
First, let's find out the number of movies for each actor. To do this we use the SGraph.triple_apply method, which loops over edges in the graph. For each edge, the method applies a user-specified function that changes some attributes of the vertices incident to the edge. In this case, there are three steps, encapsulated in the get_degree function.
In [21]:
def count_in_degree(src, edge, dst):
dst['in_degree'] += 1
return (src, edge, dst)
def get_degree(g):
new_g = graphlab.SGraph(g.vertices, g.edges)
new_g.vertices['in_degree'] = 0
return new_g.triple_apply(count_in_degree, ['in_degree']).get_vertices()
degree = get_degree(g)
In [22]:
comparisons = ['Kevin Bacon', 'Robert De Niro', 'Dennis Hopper', 'Samuel L. Jackson']
degree.filter_by(comparisons, '__id').sort('in_degree', ascending=False)
Out[22]:
At least for the last decade, Samuel L. Jackson seems to be a better candidate than Kevin Bacon. But we're left with several questions: were our candidate guesses good? who are the top actors overall? how does Kevin Bacon stack up? To answer these we still need to separate the actors from the films, which we can do by filtering with the actors list.
In [23]:
actor_degree = degree.filter_by(actors, '__id')
actor_degree['in_degree'].show()
The quantile histogram from Canvas illustrates clearly that Kevin Bacon's network degree of 21 is in the far end of the degree distribution's tail. So Kevin Bacon---with 21 movies in the last decade---is indeed prolific, but he's far from the top. Who is at the top??
In [24]:
actor_degree.topk('in_degree')
Out[24]:
In [25]:
print "** Danny Trejo**"
display(Image(url='https://static.turi.com/datasets/americanMovies/Danny-Trejo.jpg'))
print "Source: Glenn Francis http://www.PacificProDigital.com"
Danny Trejo is the winner!! But number of movies is a crude measure of centrality. John Cena, for example, is a professional wrestler, whose 70 movies are almost entirely WWE performances. Let's also find the mean shortest path distance for our selected comparisons, which is the average shortest path distance from each source actor to all other actors.
In [26]:
# Add top-degree actors to the comparison list
comparisons += ['Danny Trejo', 'Frank Welker', 'John Cena']
In [27]:
# # Make a container for the centrality statistics
mean_dists = {}
# # Get statistics for Kevin Bacon - use the already computed KB shortest paths
mean_dists['Kevin Bacon'] = bacon_sf['distance'].mean()
## Get statistics for the other comparison actors
for person in comparisons[1:]:
# get single-source shortest paths
sp2 = graphlab.shortest_path.create(g, source_vid=person,
weight_field='weight',
verbose=False)
sp2_graph = sp2.get('graph')
sp2_out = sp2_graph.get_vertices(ids=mainstream_actors)
# Compute some statistics about the distribution of distances
mean_dists[person] = sp2_out['distance'].mean()
In [28]:
mean_dists
Out[28]:
Once again, Danny Trejo comes out ahead, with the smallest mean distance to all other network nodes. Let's take a look at the whole distribution of shortest path distances for him.
In [29]:
sp2 = graphlab.shortest_path.create(g, source_vid='Danny Trejo',
weight_field='weight',
verbose=False)
sp2_graph = sp2.get('graph')
sp2_out = sp2_graph.get_vertices(ids=mainstream_actors)
sp2_out['distance'].show()
The shape of the distribution is similar to same plot for Kevin Bacon, but there are far more nodes reachable in only two hops for Danny Trejo. All in all, Danny Trejo seems to be a much better candidate for the center of the Kevin Bacon game.
However, there are many other measures of centrality or importance in a graph. Some of these require converting our bipartite graph into an actor network where edges join two actors who have been in a movie together.
To compute other interesting statistics about the network of actors, it is useful to eliminate the film vertices, so that actor vertices share an edge of the two actors co-starred in a movie. If A is the adjacency matrix for the bipartite graph where actors are rows and movies are columns, the social network (with weighted edges) is computed by AA^T. Because this operation is memory intensive, we first pull out the subset of movies from 2013.
In [30]:
## Pull out the data for 2013 films
year_data = data[data['year'] == 2013]
year_actors = graphlab.SFrame({'actor': year_data['actor_name'].unique()})
year_actors = year_actors.add_row_number()
year_films = year_data['film_name'].unique()
Next we construct the A matrix and multiply it by itself. This is a computational bottleneck that takes some time and is the reason we subset the data down to just 2013 release year movies. To do this computation we'll use a pandas Data Frame and a numpy array.
In [31]:
import numpy as np
import pandas as pd
A = pd.DataFrame(np.zeros((len(year_actors), len(year_films)), dtype=np.int),
columns=year_films, index=year_actors['actor'])
for row in year_data:
A[row['film_name']][row['actor_name']] = 1
In [32]:
A = A.values
adjacency = np.triu(np.dot(A, A.T))
Finally, we construct a new graph which is the actor network, where two actors are connected if they've been in a movie together. The edge attribute 'count' is the number of movies shared by two actors in 2013 - with a handful of exceptions, this is 1 for all actor pairs (in the 2013 data).
In [33]:
year_actors
Out[33]:
In [34]:
edge_idx = np.nonzero(adjacency)
graphlab.SFrame({'idx_source': adjacency[edge_idx]})
Out[34]:
In [35]:
edge_idx = np.nonzero(adjacency)
graphlab.SFrame({'idx_source': adjacency[edge_idx]})
Out[35]:
In [36]:
graphlab.deps.HAS_NUMPY = True
In [37]:
sf_edge = graphlab.SFrame({'idx_source': edge_idx[0],
'idx_dest': edge_idx[1],
'weight': adjacency[edge_idx]})
In [38]:
sf_edge = sf_edge.join(year_actors, on={'idx_source': 'id'}, how='left')
sf_edge.rename({'actor': 'actor1'})
sf_edge = sf_edge.join(year_actors, on={'idx_dest': 'id'}, how='left')
sf_edge.rename({'actor': 'actor2'})
sf_edge.remove_column('idx_dest')
sf_edge.remove_column('idx_source')
net = graphlab.SGraph()
net = net.add_edges(sf_edge, src_field='actor1', dst_field='actor2')
net = net.add_edges(sf_edge, src_field='actor2', dst_field='actor1')
In [39]:
print "Sample actor edges:"
net.edges
Out[39]:
We can also find the connected components in the actor network.
In [40]:
cc = graphlab.connected_components.create(net)
cc_out = cc.get('component_id')
print "Connected component summary:"
cc.summary()
Again, there are is one dominant component with "mainstream" actors. As with the bipartite graph, let's isolate and explore some of the smaller components.
In [41]:
cc_size = cc['component_size'].sort('Count', ascending=False)
cc_size
Out[41]:
To keep things simple going forward, we'll work with only the big connected component.
In [42]:
big_label = cc_size['component_id'][0]
big_names = cc_out[cc_out['component_id'] == big_label]
mainstream_actors = big_names.filter_by(actors, column_name='__id')['__id']
mainstream_edges = net.get_edges(src_ids=mainstream_actors)
net = graphlab.SGraph()
net = net.add_edges(mainstream_edges, src_field='__src_id', dst_field='__dst_id')
net.summary()
Out[42]:
Kevin Bacon appears to have been eclipsed by many actors as the center of the Kevin Bacon game over the past decade, but we used very crude graph metrics to determine this. In this smaller network of 2013 actors we can compute more sophisticated things to measure centrality. We'll start with vertex degree once again, but now an actor's degree is the number of actors who share a movie with him (rather than the number of movies done by the actor as in the bipartite graph).
We'll use an SFrame---initialized with the actor degree---to store the results.
In [43]:
centrality = get_degree(net)
Triangles in a graph are complete subgraphs with only three vertices. The number of triangles to which an actor belongs is a measure of the connectivity of his or her social network.
In [44]:
tc = graphlab.triangle_counting.create(net)
print "Triangle count summary:\n", tc.summary()
centrality = centrality.join(tc['triangle_count'], on='__id', how='left')
Pagerank is a popular method for calculating the "importance" of nodes in a graph.
In [45]:
pr = graphlab.pagerank.create(net, verbose=False)
print "Pagerank summary:\n", pr.summary()
centrality = centrality.join(pr['pagerank'], on='__id', how='left')
centrality.sort('pagerank', ascending=False)
Out[45]:
James Franco crushes the competition with pagerank. If we plot the histogram of pagerank quantiles, we can see just how far ahead Mr. Franco really is.
In [46]:
centrality['pagerank'].show()
With 99% of the pagerank scores below 2.225, James Franco's pagerank of 4.78 is almost literally off the chart!
Finally, another very common way to measure the centrality of a vertex is the mean distance from the vertex to all other nodes. This is a relatively expensive thing to compute, so we'll find it just for the top of our leader board.
In [47]:
centrality = centrality.sort('pagerank', ascending=False)
mean_dists = [int(1e12)] * centrality.num_rows()
for i in range(10):
a = centrality[i]['__id']
sp = graphlab.shortest_path.create(net, source_vid=a, verbose=False)
sp_out = sp['distance']
mean_dists[i] = sp_out['distance'].mean()
centrality['mean_dist'] = mean_dists
centrality.sort('mean_dist')
Out[47]:
We have a three way tie, between Anthony Mackie, James Franco , and Paul Rudd.
In [48]:
display(Image(url='https://static.turi.com/datasets/americanMovies/Anthony-Mackie.jpg'))
print "Source: David Shankbone http://flickr.com/photos/27865228@N06/4543207958"
display(Image(url='https://static.turi.com/datasets/americanMovies/James-Franco.jpg'))
print "Source: Vanessa Lua http://www.flickr.com/photos/vanessalua/6286991960/"
display(Image(url='https://static.turi.com/datasets/americanMovies/Paul-Rudd.jpg'))
print "Source: Eva Rinaldi http://www.flickr.com/photos/evarinaldiphotography/11024133765/"
TL; DR: Danny Trejo is the real Kevin Bacon of the last decade, while Paul Rudd, Anthony Mackie and James Franco took over the most central spot for 2013.
(Looking for more details about the modules and functions? Check out the API docs.)