Six Degrees of Kevin Bacon

GraphLab Graph Analytics Toolkit - Exploring the graph of American Films

Note: This notebook requires GraphLab Create 1.1 or higher

Welcome to the GraphLab Graph Analytics toolkit. In this notebook we'll use the toolkit to explore American films released between 2004 and 2013 and answer the question of whether Kevin Bacon is really the best actor to put at the center of the Kevin Bacon game.

Set up and exploratory data analysis

Before we start playing with the data, we need to import some key libraries: graphlab of course, and IPython display utilities. We also tell IPython notebook and GraphLab Canvas to produce plots directly in this notebook.


In [1]:
import graphlab

from IPython.display import display
from IPython.display import Image
graphlab.canvas.set_target('ipynb')

We'll load data on performances in American movies for the last ten years, pulled from the Freebase API's film/film and film/performance topics. The Freebase data is crowd-sourced, so it's a bit messy, but it is freely available under the Creative Commons license.

Our curated data live in an Amazon S3 bucket, from which we could create an SFrame directly, but for this demo we'll first download the CSV file and save it locally. Please note that running this notebook on your machine will download the 8MB csv file to your working directory.


In [2]:
import urllib
url = 'https://static.turi.com/datasets/americanMovies/freebase_performances.csv'
urllib.urlretrieve(url, filename='freebase_performances.csv')  # downloads an 8MB file to the working directory


Out[2]:
('freebase_performances.csv', <httplib.HTTPMessage instance at 0x10e63b200>)

There are a few data preprocessing steps to do, for which we'll use an SFrame (for more on SFrames, see the Introduction to SFrames notebook). First, we drop a superfluous column which is created because of a missing column header. Next, we drop actor names that are equal to an empty list, which is obviously an error. Finally, we add a column with 0.5 in each row, which will come in handy later for computing graph distances between actors.

After the data is clean, let's take a peek, with GraphLab Canvas.


In [3]:
data = graphlab.SFrame.read_csv('remote://freebase_performances.csv',
                                column_type_hints={'year': int})
data = data[data['actor_name'] != '[]']
data['weight'] = .5
data.show()


[INFO] This commercial license of GraphLab Create is assigned to engr@turi.com.

[INFO] Start server at: ipc:///tmp/graphlab_server-41273 - Server binary: /Users/zach/anaconda/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1439501248.log
[INFO] GraphLab Server Version: 1.5.2
PROGRESS: Finished parsing file /Users/zach/turi.com/src/learn/gallery/notebooks/freebase_performances.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.283318 secs.
PROGRESS: Finished parsing file /Users/zach/turi.com/src/learn/gallery/notebooks/freebase_performances.csv
PROGRESS: Parsing completed. Parsed 156468 lines in 0.281008 secs.

Now we construct the graph. For the first half of our analysis we will use a bipartite graph, where actors and movies are vertices and film performances are the edges that connect actors to movies.

  • First we get SArrays of the unique actor and film names. This will help us identify which vertices in our graph belong to each of these two classes.
  • The SGraph is directed, so to create an undirected graph we add edges in each direction.
  • The SGraph constructor automatically creates vertices based on the source and destination fields in the edge constructor.

In [4]:
actors = data['actor_name'].unique()
films = data['film_name'].unique()

g = graphlab.SGraph()
g = g.add_edges(data, src_field='actor_name', dst_field='film_name')
g = g.add_edges(data, src_field='film_name', dst_field='actor_name')

print "Movie graph summary:\n", g.summary(), "\n"


Movie graph summary:
{'num_edges': 312732, 'num_vertices': 97723} 

By using the get_vertices() and get_edges() methods we can verify the data was entered correctly. Note the SGraph uses directed edges, so we enter them in both directions to get an undirected graph and the correct output from the toolkits.


In [5]:
print "Actor vertex sample:"
g.get_vertices(ids=actors).tail(5)


Actor vertex sample:
Out[5]:
__id
Zorba Paster
Zsa Zsa Gershick
Zsuzsanna Szabados
Zuhair Haddad
mc chris
[5 rows x 1 columns]

In [6]:
print "Film vertex sample:"
g.get_vertices(ids=films).head(5)


Film vertex sample:
Out[6]:
__id
$100 and a T-Shirt: A
Documentary About Zines ...
'77
(__)
10 Minute Solution:
Kickbox Bootcamp ...
10 Rules for Sleeping
Around ...
[5 rows x 1 columns]


In [7]:
print "Sample edges (performances):"
g.get_edges().head(5)


Sample edges (performances):
Out[7]:
__src_id __dst_id id character year weight
Tom Arnold The Kid &amp; I 94 Bill Williams 2005 0.5
David Arden Engel The Albino Code 211 Captain Fascist 2006 0.5
Alyssa Chamberlain The Albino Code 220 English Girl 2006 0.5
Ezra Stevens The Albino Code 222 Driver 2006 0.5
Eric Mill The Albino Code 223 Hooligan 2 2006 0.5
[5 rows x 6 columns]

This graph can (and should) be saved so we can come back to it later without the data cleaning.


In [8]:
# g.save('sample_graph')
# new_g = graphlab.load_graph(filename='sample_graph')

This graph is too large to visualize, so we'll pull a small subgraph to see what the data look like.


In [9]:
selection = ['The Great Gatsby', 'The Wolf of Wall Street']

subgraph = graphlab.SGraph()
subgraph = subgraph.add_edges(g.get_edges(dst_ids=selection),
                              src_field='__src_id', dst_field='__dst_id')
subgraph.show(highlight=selection)


Can you guess which node is in the middle?


In [10]:
subgraph.show(vlabel='id', highlight=selection)


Connected Components

First, let's find the number of connected components. We'll do this first on our bipartite graph with both films and actors, but in the second half of this notebook we'll do it again with only actors.


In [11]:
cc = graphlab.connected_components.create(g, verbose=False)
cc_out = cc['component_id']
print "Connected components summary:\n", cc.summary()


Connected components summary:
Class                                   : ConnectedComponentsModel

Graph
-----
num_edges                               : 312732
num_vertices                            : 97723

Results
-------
graph                                   : SGraph. See m['graph']
component size                          : SFrame. See m['component_size']
number of connected components          : 2352
vertex component id                     : SFrame. See m['componentid']

Metrics
-------
training time (secs)                    : 1.3695

Queryable Fields
----------------
graph                                   : A new SGraph with the color id as a vertex property
component_id                            : An SFrame with each vertex's component id
component_size                          : An SFrame with the size of each component
training_time                           : Total training time of the model

None

There are over 2,000 components. With the 'component_size' field we can see that there is really only one very large component.


In [12]:
cc_size = cc['component_size'].sort('Count', ascending=False)
cc_size


Out[12]:
component_id Count
21644 87233
4101 48
12905 46
72315 44
73764 39
72062 34
16055 33
11485 32
21271 25
60080 24
[2352 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Let's pull one of the smaller connected components to see if there's anything interesting. The cc_out object is an SFrame, which acts a lot like a pandas DataFrame, but is on disk.


In [13]:
tgt = cc_size['component_id'][1]
tgt_names = cc_out[cc_out['component_id'] == tgt]['__id']

subgraph = graphlab.SGraph()
subgraph = subgraph.add_edges(g.get_edges(src_ids=tgt_names),
                              src_field='__src_id', dst_field='__dst_id')

film_selector = subgraph.get_vertices(ids=films)['__id']
subgraph.show(vlabel='id', highlight=film_selector)


This component corresponds to a handful of Japanese anime series. To help ourselves out later, we'll also pull out the names of the actors in the giant connected component. Here we use the SFrame.filter_by method because we're looking for matches against a set of names.


In [14]:
big_label = cc_size['component_id'][0]
big_names = cc_out[cc_out['component_id'] == big_label]
mainstream_actors = big_names.filter_by(actors, column_name='__id')['__id']

The Kevin Bacon game

OK, let's play the Kevin Bacon game. First, let's see what movies he's been in over the last decade...


In [15]:
bacon_films = g.get_edges(src_ids=['Kevin Bacon'])

subgraph = graphlab.SGraph()
subgraph = subgraph.add_edges(bacon_films, src_field='__src_id',
                              dst_field='__dst_id')
subgraph.show(vlabel='id', elabel='character', highlight=['Kevin Bacon'])


... and with whom Kevin Bacon has co-starred.


In [16]:
subgraph = graphlab.SGraph()

for f in bacon_films['__dst_id']:
    subgraph = subgraph.add_edges(g.get_edges(src_ids=[f], dst_ids=None),
                                  src_field='__src_id', dst_field='__dst_id')
    
subgraph.show(highlight=list(bacon_films['__dst_id']), vlabel='__id', vlabel_hover=True)


We can find the shortest path distance between every other vertex and Mr. Bacon. Keep in mind that this is a bipartite graph, so movies will have half distances from our target and actors will have whole number distances from the target.


In [17]:
sp = graphlab.shortest_path.create(g, source_vid='Kevin Bacon', weight_field='weight', verbose=False)
sp_graph = sp['graph']

The computation is very quick. And we can now get the distance from Kevin Bacon to any other actor or movie. Kevin has a distance of 0, of course, and the films he's been in lately are all at distance 0.5. His co-stars in those movies are at distance 1, so they have Kevin Bacon number of 1.

Querying this for another actor is very fast because the result is already computed. Robert De Niro, for example, has a Kevin Bacon number (for the last decade) of 2. There are many paths of length two between these two actors, and 'get_path' will plot one of them.

If we go back to the connected component of Japanese films and pick one of the actors, Yuzuru Fujimoto, we see he has an infinite Kevin Bacon number over the last decade, which fits the earlier result that Yuzuru is in a different component. If we try to find the path between these two actors, we get an astronomically high number, indicating there is no path.


In [18]:
path = [x[0] for x in sp.get_path('Robert De Niro', show=True, highlight=list(films))]

query = sp_graph.get_vertices(ids='Yuzuru Fujimoto')
query.head()


None
Out[18]:
__id distance
Yuzuru Fujimoto 1e+30
[1 rows x 2 columns]

Who else do we want to check?? Because we're not using name disambiguation, we should check first to make sure a suggestion is in the list of actors.


In [19]:
target = 'Lydia Fox'
target in actors

path = [x[0] for x in sp.get_path(target, show=True, highlight=list(films))]


None

To visualize the distribution of Kevin Bacon distances, we narrow the result down to only mainstream actors (i.e. those in the same connected component as Mr. Bacon). The most common Kevin Bacon number is 3. Note there are still some half distances in this result; this is because we're working with messy data. The messiness comes from three sources:

  1. Some names are both actors and films (e.g. Secretariat)
  2. Freebase is crowd-sourced data with plenty of mistakes
  3. For the demo we're using proper names as vertex IDs, so when there is an overlapping name, we have problems. In production we would use a vertex ID known to be unique.

We use the Canvas quantile histogram tool to visualize the distribution of graph distances from Kevin Bacon to all other nodes in the graph. The quantile histogram is a very fast visualization that stacks quantiles of an SArray into a pre-set number of buckets.


In [20]:
bacon_sf = sp_graph.get_vertices(ids=mainstream_actors)
bacon_sf['distance'].show()


Where there is some approximation in the quantiles shown in the quantile histogram, it's clear that the modal distance from Kevin Bacon is three hops, with 99% of the nodes falling between 1.5 and 5 hops.

Why Kevin Bacon?

Why is Kevin Bacon the target of the Kevin Bacon game? Why not Robert De Niro or Dennis Hopper, both of whom have been in a lot of movies. In fact, let's figure out just how central Kevin Bacon is, and who might be a better center for the Kevin Bacon game. First, let's find out who's been in the most movies. We'll start with some good guesses about prolific actors.

First, let's find out the number of movies for each actor. To do this we use the SGraph.triple_apply method, which loops over edges in the graph. For each edge, the method applies a user-specified function that changes some attributes of the vertices incident to the edge. In this case, there are three steps, encapsulated in the get_degree function.

  1. make a copy of the graph.
  2. make a new vertex attribute in_degree to hold the result (i.e. the number of movies for each actor and the number of actors for each film).
  3. run the triple_apply function, which applies the count_in_degree function to each edge of the graph. In this case we add 1 to the in-degree of each edge's destination node. Because we added edges in both directions, this is sufficient to compute the degree.

In [21]:
def count_in_degree(src, edge, dst):
    dst['in_degree'] += 1
    return (src, edge, dst)

def get_degree(g):
    new_g = graphlab.SGraph(g.vertices, g.edges)
    new_g.vertices['in_degree'] = 0
    return new_g.triple_apply(count_in_degree, ['in_degree']).get_vertices()

degree = get_degree(g)

In [22]:
comparisons = ['Kevin Bacon', 'Robert De Niro', 'Dennis Hopper', 'Samuel L. Jackson']
degree.filter_by(comparisons, '__id').sort('in_degree', ascending=False)


Out[22]:
__id in_degree
Samuel L. Jackson 42
Robert De Niro 29
Kevin Bacon 21
Dennis Hopper 21
[4 rows x 2 columns]

At least for the last decade, Samuel L. Jackson seems to be a better candidate than Kevin Bacon. But we're left with several questions: were our candidate guesses good? who are the top actors overall? how does Kevin Bacon stack up? To answer these we still need to separate the actors from the films, which we can do by filtering with the actors list.


In [23]:
actor_degree = degree.filter_by(actors, '__id')
actor_degree['in_degree'].show()


The quantile histogram from Canvas illustrates clearly that Kevin Bacon's network degree of 21 is in the far end of the degree distribution's tail. So Kevin Bacon---with 21 movies in the last decade---is indeed prolific, but he's far from the top. Who is at the top??


In [24]:
actor_degree.topk('in_degree')


Out[24]:
__id in_degree
Danny Trejo 81
Frank Welker 78
John Cena 70
Michael Madsen 63
Jeff Bennett 59
Keith David 56
Eric Roberts 56
Tom Kenny 54
Vampiro 53
Grey DeLisle 52
[10 rows x 2 columns]

In [25]:
print "** Danny Trejo**"
display(Image(url='https://static.turi.com/datasets/americanMovies/Danny-Trejo.jpg'))
print "Source: Glenn Francis http://www.PacificProDigital.com"


** Danny Trejo**
Source: Glenn Francis http://www.PacificProDigital.com

Danny Trejo is the winner!! But number of movies is a crude measure of centrality. John Cena, for example, is a professional wrestler, whose 70 movies are almost entirely WWE performances. Let's also find the mean shortest path distance for our selected comparisons, which is the average shortest path distance from each source actor to all other actors.


In [26]:
# Add top-degree actors to the comparison list
comparisons += ['Danny Trejo', 'Frank Welker', 'John Cena']

In [27]:
# # Make a container for the centrality statistics
mean_dists = {}

# # Get statistics for Kevin Bacon - use the already computed KB shortest paths
mean_dists['Kevin Bacon'] = bacon_sf['distance'].mean()


## Get statistics for the other comparison actors
for person in comparisons[1:]:

    # get single-source shortest paths
    sp2 = graphlab.shortest_path.create(g, source_vid=person,
                                        weight_field='weight',
                                        verbose=False)
    sp2_graph = sp2.get('graph')
    sp2_out = sp2_graph.get_vertices(ids=mainstream_actors)

    # Compute some statistics about the distribution of distances
    mean_dists[person] = sp2_out['distance'].mean()

In [28]:
mean_dists


Out[28]:
{'Danny Trejo': 2.6244607155980866,
 'Dennis Hopper': 2.967030499339929,
 'Frank Welker': 2.909965023068448,
 'John Cena': 3.1741837581828287,
 'Kevin Bacon': 2.888808742871924,
 'Robert De Niro': 2.828014208527843,
 'Samuel L. Jackson': 2.7586455625569934}

Once again, Danny Trejo comes out ahead, with the smallest mean distance to all other network nodes. Let's take a look at the whole distribution of shortest path distances for him.


In [29]:
sp2 = graphlab.shortest_path.create(g, source_vid='Danny Trejo',
                                    weight_field='weight',
                                    verbose=False)
sp2_graph = sp2.get('graph')
sp2_out = sp2_graph.get_vertices(ids=mainstream_actors)
sp2_out['distance'].show()


The shape of the distribution is similar to same plot for Kevin Bacon, but there are far more nodes reachable in only two hops for Danny Trejo. All in all, Danny Trejo seems to be a much better candidate for the center of the Kevin Bacon game.

However, there are many other measures of centrality or importance in a graph. Some of these require converting our bipartite graph into an actor network where edges join two actors who have been in a movie together.

Computing the actor network

To compute other interesting statistics about the network of actors, it is useful to eliminate the film vertices, so that actor vertices share an edge of the two actors co-starred in a movie. If A is the adjacency matrix for the bipartite graph where actors are rows and movies are columns, the social network (with weighted edges) is computed by AA^T. Because this operation is memory intensive, we first pull out the subset of movies from 2013.


In [30]:
## Pull out the data for 2013 films
year_data = data[data['year'] == 2013]

year_actors = graphlab.SFrame({'actor': year_data['actor_name'].unique()})
year_actors = year_actors.add_row_number()

year_films = year_data['film_name'].unique()

Next we construct the A matrix and multiply it by itself. This is a computational bottleneck that takes some time and is the reason we subset the data down to just 2013 release year movies. To do this computation we'll use a pandas Data Frame and a numpy array.


In [31]:
import numpy as np
import pandas as pd

A = pd.DataFrame(np.zeros((len(year_actors), len(year_films)), dtype=np.int),
                 columns=year_films, index=year_actors['actor'])

for row in year_data:
    A[row['film_name']][row['actor_name']] = 1

In [32]:
A = A.values
adjacency = np.triu(np.dot(A, A.T))

Finally, we construct a new graph which is the actor network, where two actors are connected if they've been in a movie together. The edge attribute 'count' is the number of movies shared by two actors in 2013 - with a handful of exceptions, this is 1 for all actor pairs (in the 2013 data).


In [33]:
year_actors


Out[33]:
id actor
0 George J. Vezina
1 Lange Siegstad
2 Paula Malcomson
3 David Mattia
4 June Diane Raphael
5 Melonie Diaz
6 Roger Edwards
7 David Kim
8 James Ransone
9 Chris J. Knight
[6427 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [34]:
edge_idx = np.nonzero(adjacency)
graphlab.SFrame({'idx_source': adjacency[edge_idx]})


Out[34]:
idx_source
1
1
1
1
1
1
1
1
1
1
[90898 rows x 1 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [35]:
edge_idx = np.nonzero(adjacency)
graphlab.SFrame({'idx_source': adjacency[edge_idx]})


Out[35]:
idx_source
1
1
1
1
1
1
1
1
1
1
[90898 rows x 1 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [36]:
graphlab.deps.HAS_NUMPY = True

In [37]:
sf_edge = graphlab.SFrame({'idx_source': edge_idx[0], 
                           'idx_dest': edge_idx[1],
                           'weight': adjacency[edge_idx]})

In [38]:
sf_edge = sf_edge.join(year_actors, on={'idx_source': 'id'}, how='left')
sf_edge.rename({'actor': 'actor1'})

sf_edge = sf_edge.join(year_actors, on={'idx_dest': 'id'}, how='left')
sf_edge.rename({'actor': 'actor2'})

sf_edge.remove_column('idx_dest')
sf_edge.remove_column('idx_source')

net = graphlab.SGraph()
net = net.add_edges(sf_edge, src_field='actor1', dst_field='actor2')
net = net.add_edges(sf_edge, src_field='actor2', dst_field='actor1')

In [39]:
print "Sample actor edges:"
net.edges


Sample actor edges:
Out[39]:
__src_id __dst_id weight
David Delagarza David Delagarza 1
David Delagarza Navarro Hall 1
Clark Gregg Clark Gregg 4
Clark Gregg Richard Dreyfuss 1
Clark Gregg Dylan Minnette 1
Clark Gregg Griffin Gluck 1
Marc Andrew Smith Marc Andrew Smith 1
Scott Haze Scott Haze 2
David Belle David Belle 1
David Belle Domenick Lombardozzi 1
[181796 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Connected components (again)

We can also find the connected components in the actor network.


In [40]:
cc = graphlab.connected_components.create(net)
cc_out = cc.get('component_id')

print "Connected component summary:"
cc.summary()


PROGRESS: +-----------------------------+
PROGRESS: | Number of components merged |
PROGRESS: +-----------------------------+
PROGRESS: | 6302                        |
PROGRESS: | 0                           |
PROGRESS: +-----------------------------+
Connected component summary:
Class                                   : ConnectedComponentsModel

Graph
-----
num_edges                               : 181796
num_vertices                            : 6427

Results
-------
graph                                   : SGraph. See m['graph']
component size                          : SFrame. See m['component_size']
number of connected components          : 180
vertex component id                     : SFrame. See m['componentid']

Metrics
-------
training time (secs)                    : 0.0792

Queryable Fields
----------------
graph                                   : A new SGraph with the color id as a vertex property
component_id                            : An SFrame with each vertex's component id
component_size                          : An SFrame with the size of each component
training_time                           : Total training time of the model

Again, there are is one dominant component with "mainstream" actors. As with the bipartite graph, let's isolate and explore some of the smaller components.


In [41]:
cc_size = cc['component_size'].sort('Count', ascending=False)
cc_size


Out[41]:
component_id Count
663 5305
726 40
363 24
491 22
1186 20
439 20
161 19
1005 17
948 16
1203 15
[180 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

To keep things simple going forward, we'll work with only the big connected component.


In [42]:
big_label = cc_size['component_id'][0]
big_names = cc_out[cc_out['component_id'] == big_label]
mainstream_actors = big_names.filter_by(actors, column_name='__id')['__id']

mainstream_edges = net.get_edges(src_ids=mainstream_actors)
net = graphlab.SGraph()
net = net.add_edges(mainstream_edges, src_field='__src_id', dst_field='__dst_id')

net.summary()


Out[42]:
{'num_edges': 167952, 'num_vertices': 5305}

Back to Kevin Bacon

Kevin Bacon appears to have been eclipsed by many actors as the center of the Kevin Bacon game over the past decade, but we used very crude graph metrics to determine this. In this smaller network of 2013 actors we can compute more sophisticated things to measure centrality. We'll start with vertex degree once again, but now an actor's degree is the number of actors who share a movie with him (rather than the number of movies done by the actor as in the bipartite graph).

We'll use an SFrame---initialized with the actor degree---to store the results.


In [43]:
centrality = get_degree(net)

Triangles in a graph are complete subgraphs with only three vertices. The number of triangles to which an actor belongs is a measure of the connectivity of his or her social network.


In [44]:
tc = graphlab.triangle_counting.create(net)
print "Triangle count summary:\n", tc.summary()

centrality = centrality.join(tc['triangle_count'], on='__id', how='left')


PROGRESS: Initializing vertex ids.
PROGRESS: Removing duplicate (bidirectional) edges.
PROGRESS: Counting triangles...
PROGRESS: Finished in 0.630832 secs.
PROGRESS: Total triangles in the graph : 1367881
Triangle count summary:
Class                                   : TriangleCountingModel

Graph
-----
num_edges                               : 167952
num_vertices                            : 5305

Results
-------
graph                                   : SGraph. See m['graph']
total number of triangles               : 1367881
vertex triangle count                   : SFrame. See m['triangle_count']

Metrics
-------
training time (secs)                    : 0.6391

Queryable Fields
----------------
graph                                   : A new SGraph with the triangle count as a vertex property.
num_triangles                           : Total number of triangles in the graph.
triangle_count                          : An SFrame with the triangle count for each vertex.
training_time                           : Total training time of the model

None

Pagerank is a popular method for calculating the "importance" of nodes in a graph.


In [45]:
pr = graphlab.pagerank.create(net, verbose=False)
print "Pagerank summary:\n", pr.summary()

centrality = centrality.join(pr['pagerank'], on='__id', how='left')
centrality.sort('pagerank', ascending=False)


Pagerank summary:
Class                                   : PagerankModel

Graph
-----
num_edges                               : 167952
num_vertices                            : 5305

Results
-------
graph                                   : SGraph. See m['graph']
change in last iteration (L1 norm)      : 1.5828
vertex pagerank                         : SFrame. See m['pagerank']

Settings
--------
maximun number of iterations            : 20
convergence threshold (L1 norm)         : 0.01
probablity of random jumps to any node in the graph: 0.15

Metrics
-------
training time (secs)                    : 0.4068
number of iterations                    : 20

Queryable Fields
----------------
training_time                           : Total training time of the model
graph                                   : A new SGraph with the pagerank as a vertex property
delta                                   : Change in pagerank for the last iteration in L1 norm
reset_probability                       : The probablity of randomly jumps to any node in the graph
pagerank                                : An SFrame with each vertex's pagerank
num_iterations                          : Number of iterations
threshold                               : The convergence threshold in L1 norm
max_iterations                          : The maximun number of iterations to run

None
Out[45]:
__id in_degree triangle_count pagerank delta
James Franco 151 2982 4.77719241601 6.77217050331e-05
Rob Corddry 105 1127 4.11442446464 0.000878799198492
Channing Tatum 134 2949 3.78775951804 0.000374050088186
Paul Rudd 184 6235 3.77233182078 0.00157863734389
Andrew R. Kaplan 205 6979 3.62503016161 0.00271149544128
Anthony Mackie 124 2378 3.59796557524 9.26603227924e-05
John Cusack 103 1955 3.37184310389 3.23909995932e-05
Jeffrey Wright 85 963 3.27329650664 0.000817767631977
Amy Seimetz 68 611 3.26483531164 0.000862322961037
Melissa Leo 109 2526 3.11387894674 0.00028271047215
[5305 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

James Franco crushes the competition with pagerank. If we plot the histogram of pagerank quantiles, we can see just how far ahead Mr. Franco really is.


In [46]:
centrality['pagerank'].show()


With 99% of the pagerank scores below 2.225, James Franco's pagerank of 4.78 is almost literally off the chart!

Finally, another very common way to measure the centrality of a vertex is the mean distance from the vertex to all other nodes. This is a relatively expensive thing to compute, so we'll find it just for the top of our leader board.


In [47]:
centrality = centrality.sort('pagerank', ascending=False)

mean_dists = [int(1e12)] * centrality.num_rows()

for i in range(10):
    a = centrality[i]['__id']
    sp = graphlab.shortest_path.create(net, source_vid=a, verbose=False)
    sp_out = sp['distance']
    mean_dists[i] = sp_out['distance'].mean()

centrality['mean_dist'] = mean_dists
centrality.sort('mean_dist')


Out[47]:
__id in_degree triangle_count pagerank delta mean_dist
Anthony Mackie 124 2378 3.59796557524 9.26603227924e-05 3.11008482564
James Franco 151 2982 4.77719241601 6.77217050331e-05 3.12082940622
Paul Rudd 184 6235 3.77233182078 0.00157863734389 3.13779453346
Channing Tatum 134 2949 3.78775951804 0.000374050088186 3.20263901979
Rob Corddry 105 1127 4.11442446464 0.000878799198492 3.27389255419
Jeffrey Wright 85 963 3.27329650664 0.000817767631977 3.27464655985
Andrew R. Kaplan 205 6979 3.62503016161 0.00271149544128 3.38812441093
John Cusack 103 1955 3.37184310389 3.23909995932e-05 3.44882186616
Melissa Leo 109 2526 3.11387894674 0.00028271047215 3.53195098963
Amy Seimetz 68 611 3.26483531164 0.000862322961037 3.89198868992
[5305 rows x 6 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

We have a three way tie, between Anthony Mackie, James Franco , and Paul Rudd.


In [48]:
display(Image(url='https://static.turi.com/datasets/americanMovies/Anthony-Mackie.jpg'))
print "Source: David Shankbone http://flickr.com/photos/27865228@N06/4543207958"
display(Image(url='https://static.turi.com/datasets/americanMovies/James-Franco.jpg'))
print "Source: Vanessa Lua http://www.flickr.com/photos/vanessalua/6286991960/"
display(Image(url='https://static.turi.com/datasets/americanMovies/Paul-Rudd.jpg'))
print "Source: Eva Rinaldi http://www.flickr.com/photos/evarinaldiphotography/11024133765/"


Source: David Shankbone http://flickr.com/photos/27865228@N06/4543207958
Source: Vanessa Lua http://www.flickr.com/photos/vanessalua/6286991960/
Source: Eva Rinaldi http://www.flickr.com/photos/evarinaldiphotography/11024133765/

TL; DR: Danny Trejo is the real Kevin Bacon of the last decade, while Paul Rudd, Anthony Mackie and James Franco took over the most central spot for 2013.

(Looking for more details about the modules and functions? Check out the API docs.)