Six Degrees of Kevin Bacon

GraphLab Graph Analytics Toolkit - Exploring the graph of American Films

Note: This notebook requires GraphLab Create 1.1 or higher

Welcome to the GraphLab Graph Analytics toolkit. In this notebook we'll use the toolkit to explore American films released between 2004 and 2013 and answer the question of whether Kevin Bacon is really the best actor to put at the center of the Kevin Bacon game.

Set up and exploratory data analysis

Before we start playing with the data, we need to import some key libraries: graphlab of course, and IPython display utilities. We also tell IPython notebook and GraphLab Canvas to produce plots directly in this notebook.



In [1]:

    
import graphlab

from IPython.display import display
from IPython.display import Image
graphlab.canvas.set_target('ipynb')

We'll load data on performances in American movies for the last ten years, pulled from the Freebase API's film/film and film/performance topics. The Freebase data is crowd-sourced, so it's a bit messy, but it is freely available under the Creative Commons license.

Our curated data live in an Amazon S3 bucket, from which we could create an SFrame directly, but for this demo we'll first download the CSV file and save it locally. Please note that running this notebook on your machine will download the 8MB csv file to your working directory.



In [2]:

    
import urllib
url = 'https://static.turi.com/datasets/americanMovies/freebase_performances.csv'
urllib.urlretrieve(url, filename='freebase_performances.csv')  # downloads an 8MB file to the working directory









    Out[2]:





('freebase_performances.csv', <httplib.HTTPMessage instance at 0x10e63b200>)

There are a few data preprocessing steps to do, for which we'll use an SFrame (for more on SFrames, see the Introduction to SFrames notebook). First, we drop a superfluous column which is created because of a missing column header. Next, we drop actor names that are equal to an empty list, which is obviously an error. Finally, we add a column with 0.5 in each row, which will come in handy later for computing graph distances between actors.

After the data is clean, let's take a peek, with GraphLab Canvas.



In [3]:

    
data = graphlab.SFrame.read_csv('remote://freebase_performances.csv',
                                column_type_hints={'year': int})
data = data[data['actor_name'] != '[]']
data['weight'] = .5
data.show()









    



[INFO] This commercial license of GraphLab Create is assigned to engr@turi.com.

[INFO] Start server at: ipc:///tmp/graphlab_server-41273 - Server binary: /Users/zach/anaconda/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1439501248.log
[INFO] GraphLab Server Version: 1.5.2






    




PROGRESS: Finished parsing file /Users/zach/turi.com/src/learn/gallery/notebooks/freebase_performances.csv






    




PROGRESS: Parsing completed. Parsed 100 lines in 0.283318 secs.






    




PROGRESS: Finished parsing file /Users/zach/turi.com/src/learn/gallery/notebooks/freebase_performances.csv






    




PROGRESS: Parsing completed. Parsed 156468 lines in 0.281008 secs.

Now we construct the graph. For the first half of our analysis we will use a bipartite graph, where actors and movies are vertices and film performances are the edges that connect actors to movies.

First we get SArrays of the unique actor and film names. This will help us identify which vertices in our graph belong to each of these two classes.
The SGraph is directed, so to create an undirected graph we add edges in each direction.
The SGraph constructor automatically creates vertices based on the source and destination fields in the edge constructor.



In [4]:

    
actors = data['actor_name'].unique()
films = data['film_name'].unique()

g = graphlab.SGraph()
g = g.add_edges(data, src_field='actor_name', dst_field='film_name')
g = g.add_edges(data, src_field='film_name', dst_field='actor_name')

print "Movie graph summary:\n", g.summary(), "\n"









    



Movie graph summary:
{'num_edges': 312732, 'num_vertices': 97723}

By using the get_vertices() and get_edges() methods we can verify the data was entered correctly. Note the SGraph uses directed edges, so we enter them in both directions to get an undirected graph and the correct output from the toolkits.



In [5]:

    
print "Actor vertex sample:"
g.get_vertices(ids=actors).tail(5)









    



Actor vertex sample:






    Out[5]:





    
        __id
    
    
        Zorba Paster
    
    
        Zsa Zsa Gershick
    
    
        Zsuzsanna Szabados
    
    
        Zuhair Haddad
    
    
        mc chris
    

[5 rows x 1 columns]



In [6]:

    
print "Film vertex sample:"
g.get_vertices(ids=films).head(5)









    



Film vertex sample:






    Out[6]:





    
        __id
    
    
        $100 and a T-Shirt: A
Documentary About Zines ...
    
    
        '77
    
    
        (__)
    
    
        10 Minute Solution:
Kickbox Bootcamp ...
    
    
        10 Rules for Sleeping
Around ...
    

[5 rows x 1 columns]



In [7]:

    
print "Sample edges (performances):"
g.get_edges().head(5)









    



Sample edges (performances):






    Out[7]:





    
        __src_id
        __dst_id
        id
        character
        year
        weight
    
    
        Tom Arnold
        The Kid &amp; I
        94
        Bill Williams
        2005
        0.5
    
    
        David Arden Engel
        The Albino Code
        211
        Captain Fascist
        2006
        0.5
    
    
        Alyssa Chamberlain
        The Albino Code
        220
        English Girl
        2006
        0.5
    
    
        Ezra Stevens
        The Albino Code
        222
        Driver
        2006
        0.5
    
    
        Eric Mill
        The Albino Code
        223
        Hooligan 2
        2006
        0.5
    

[5 rows x 6 columns]

This graph can (and should) be saved so we can come back to it later without the data cleaning.



In [8]:

    
# g.save('sample_graph')
# new_g = graphlab.load_graph(filename='sample_graph')

This graph is too large to visualize, so we'll pull a small subgraph to see what the data look like.



In [9]:

    
selection = ['The Great Gatsby', 'The Wolf of Wall Street']

subgraph = graphlab.SGraph()
subgraph = subgraph.add_edges(g.get_edges(dst_ids=selection),
                              src_field='__src_id', dst_field='__dst_id')
subgraph.show(highlight=selection)

Can you guess which node is in the middle?



In [10]:

    
subgraph.show(vlabel='id', highlight=selection)

Connected Components

First, let's find the number of connected components. We'll do this first on our bipartite graph with both films and actors, but in the second half of this notebook we'll do it again with only actors.



In [11]:

    
cc = graphlab.connected_components.create(g, verbose=False)
cc_out = cc['component_id']
print "Connected components summary:\n", cc.summary()









    



Connected components summary:
Class                                   : ConnectedComponentsModel

Graph
-----
num_edges                               : 312732
num_vertices                            : 97723

Results
-------
graph                                   : SGraph. See m['graph']
component size                          : SFrame. See m['component_size']
number of connected components          : 2352
vertex component id                     : SFrame. See m['componentid']

Metrics
-------
training time (secs)                    : 1.3695

Queryable Fields
----------------
graph                                   : A new SGraph with the color id as a vertex property
component_id                            : An SFrame with each vertex's component id
component_size                          : An SFrame with the size of each component
training_time                           : Total training time of the model

None

There are over 2,000 components. With the 'component_size' field we can see that there is really only one very large component.



In [12]:

    
cc_size = cc['component_size'].sort('Count', ascending=False)
cc_size









    Out[12]:





    
        component_id
        Count
    
    
        21644
        87233
    
    
        4101
        48
    
    
        12905
        46
    
    
        72315
        44
    
    
        73764
        39
    
    
        72062
        34
    
    
        16055
        33
    
    
        11485
        32
    
    
        21271
        25
    
    
        60080
        24
    

[2352 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Let's pull one of the smaller connected components to see if there's anything interesting. The cc_out object is an SFrame, which acts a lot like a pandas DataFrame, but is on disk.



In [13]:

    
tgt = cc_size['component_id'][1]
tgt_names = cc_out[cc_out['component_id'] == tgt]['__id']

subgraph = graphlab.SGraph()
subgraph = subgraph.add_edges(g.get_edges(src_ids=tgt_names),
                              src_field='__src_id', dst_field='__dst_id')

film_selector = subgraph.get_vertices(ids=films)['__id']
subgraph.show(vlabel='id', highlight=film_selector)

This component corresponds to a handful of Japanese anime series. To help ourselves out later, we'll also pull out the names of the actors in the giant connected component. Here we use the SFrame.filter_by method because we're looking for matches against a set of names.



In [14]:

    
big_label = cc_size['component_id'][0]
big_names = cc_out[cc_out['component_id'] == big_label]
mainstream_actors = big_names.filter_by(actors, column_name='__id')['__id']

The Kevin Bacon game

OK, let's play the Kevin Bacon game. First, let's see what movies he's been in over the last decade...



In [15]:

    
bacon_films = g.get_edges(src_ids=['Kevin Bacon'])

subgraph = graphlab.SGraph()
subgraph = subgraph.add_edges(bacon_films, src_field='__src_id',
                              dst_field='__dst_id')
subgraph.show(vlabel='id', elabel='character', highlight=['Kevin Bacon'])

... and with whom Kevin Bacon has co-starred.



In [16]:

    
subgraph = graphlab.SGraph()

for f in bacon_films['__dst_id']:
    subgraph = subgraph.add_edges(g.get_edges(src_ids=[f], dst_ids=None),
                                  src_field='__src_id', dst_field='__dst_id')
    
subgraph.show(highlight=list(bacon_films['__dst_id']), vlabel='__id', vlabel_hover=True)

We can find the shortest path distance between every other vertex and Mr. Bacon. Keep in mind that this is a bipartite graph, so movies will have half distances from our target and actors will have whole number distances from the target.



In [17]:

    
sp = graphlab.shortest_path.create(g, source_vid='Kevin Bacon', weight_field='weight', verbose=False)
sp_graph = sp['graph']

The computation is very quick. And we can now get the distance from Kevin Bacon to any other actor or movie. Kevin has a distance of 0, of course, and the films he's been in lately are all at distance 0.5. His co-stars in those movies are at distance 1, so they have Kevin Bacon number of 1.

Querying this for another actor is very fast because the result is already computed. Robert De Niro, for example, has a Kevin Bacon number (for the last decade) of 2. There are many paths of length two between these two actors, and 'get_path' will plot one of them.

If we go back to the connected component of Japanese films and pick one of the actors, Yuzuru Fujimoto, we see he has an infinite Kevin Bacon number over the last decade, which fits the earlier result that Yuzuru is in a different component. If we try to find the path between these two actors, we get an astronomically high number, indicating there is no path.



In [18]:

    
path = [x[0] for x in sp.get_path('Robert De Niro', show=True, highlight=list(films))]

query = sp_graph.get_vertices(ids='Yuzuru Fujimoto')
query.head()









    














    





None






    Out[18]:





    
        __id
        distance
    
    
        Yuzuru Fujimoto
        1e+30
    

[1 rows x 2 columns]

Who else do we want to check?? Because we're not using name disambiguation, we should check first to make sure a suggestion is in the list of actors.



In [19]:

    
target = 'Lydia Fox'
target in actors

path = [x[0] for x in sp.get_path(target, show=True, highlight=list(films))]

To visualize the distribution of Kevin Bacon distances, we narrow the result down to only mainstream actors (i.e. those in the same connected component as Mr. Bacon). The most common Kevin Bacon number is 3. Note there are still some half distances in this result; this is because we're working with messy data. The messiness comes from three sources:

Some names are both actors and films (e.g. Secretariat)
Freebase is crowd-sourced data with plenty of mistakes
For the demo we're using proper names as vertex IDs, so when there is an overlapping name, we have problems. In production we would use a vertex ID known to be unique.

We use the Canvas quantile histogram tool to visualize the distribution of graph distances from Kevin Bacon to all other nodes in the graph. The quantile histogram is a very fast visualization that stacks quantiles of an SArray into a pre-set number of buckets.



In [20]:

    
bacon_sf = sp_graph.get_vertices(ids=mainstream_actors)
bacon_sf['distance'].show()

Where there is some approximation in the quantiles shown in the quantile histogram, it's clear that the modal distance from Kevin Bacon is three hops, with 99% of the nodes falling between 1.5 and 5 hops.

Why Kevin Bacon?

Why is Kevin Bacon the target of the Kevin Bacon game? Why not Robert De Niro or Dennis Hopper, both of whom have been in a lot of movies. In fact, let's figure out just how central Kevin Bacon is, and who might be a better center for the Kevin Bacon game. First, let's find out who's been in the most movies. We'll start with some good guesses about prolific actors.

First, let's find out the number of movies for each actor. To do this we use the SGraph.triple_apply method, which loops over edges in the graph. For each edge, the method applies a user-specified function that changes some attributes of the vertices incident to the edge. In this case, there are three steps, encapsulated in the get_degree function.

make a copy of the graph.
make a new vertex attribute in_degree to hold the result (i.e. the number of movies for each actor and the number of actors for each film).
run the triple_apply function, which applies the count_in_degree function to each edge of the graph. In this case we add 1 to the in-degree of each edge's destination node. Because we added edges in both directions, this is sufficient to compute the degree.



In [21]:

    
def count_in_degree(src, edge, dst):
    dst['in_degree'] += 1
    return (src, edge, dst)

def get_degree(g):
    new_g = graphlab.SGraph(g.vertices, g.edges)
    new_g.vertices['in_degree'] = 0
    return new_g.triple_apply(count_in_degree, ['in_degree']).get_vertices()

degree = get_degree(g)



In [22]:

    
comparisons = ['Kevin Bacon', 'Robert De Niro', 'Dennis Hopper', 'Samuel L. Jackson']
degree.filter_by(comparisons, '__id').sort('in_degree', ascending=False)









    Out[22]:





    
        __id
        in_degree
    
    
        Samuel L. Jackson
        42
    
    
        Robert De Niro
        29
    
    
        Kevin Bacon
        21
    
    
        Dennis Hopper
        21
    

[4 rows x 2 columns]

At least for the last decade, Samuel L. Jackson seems to be a better candidate than Kevin Bacon. But we're left with several questions: were our candidate guesses good? who are the top actors overall? how does Kevin Bacon stack up? To answer these we still need to separate the actors from the films, which we can do by filtering with the actors list.



In [23]:

    
actor_degree = degree.filter_by(actors, '__id')
actor_degree['in_degree'].show()

The quantile histogram from Canvas illustrates clearly that Kevin Bacon's network degree of 21 is in the far end of the degree distribution's tail. So Kevin Bacon---with 21 movies in the last decade---is indeed prolific, but he's far from the top. Who is at the top??



In [24]:

    
actor_degree.topk('in_degree')









    Out[24]:





    
        __id
        in_degree
    
    
        Danny Trejo
        81
    
    
        Frank Welker
        78
    
    
        John Cena
        70
    
    
        Michael Madsen
        63
    
    
        Jeff Bennett
        59
    
    
        Keith David
        56
    
    
        Eric Roberts
        56
    
    
        Tom Kenny
        54
    
    
        Vampiro
        53
    
    
        Grey DeLisle
        52
    

[10 rows x 2 columns]



In [25]:

    
print "** Danny Trejo**"
display(Image(url='https://static.turi.com/datasets/americanMovies/Danny-Trejo.jpg'))
print "Source: Glenn Francis http://www.PacificProDigital.com"









    



** Danny Trejo**






    











    



Source: Glenn Francis http://www.PacificProDigital.com

Danny Trejo is the winner!! But number of movies is a crude measure of centrality. John Cena, for example, is a professional wrestler, whose 70 movies are almost entirely WWE performances. Let's also find the mean shortest path distance for our selected comparisons, which is the average shortest path distance from each source actor to all other actors.



In [26]:

    
# Add top-degree actors to the comparison list
comparisons += ['Danny Trejo', 'Frank Welker', 'John Cena']



In [27]:

    
# # Make a container for the centrality statistics
mean_dists = {}

# # Get statistics for Kevin Bacon - use the already computed KB shortest paths
mean_dists['Kevin Bacon'] = bacon_sf['distance'].mean()


## Get statistics for the other comparison actors
for person in comparisons[1:]:

    # get single-source shortest paths
    sp2 = graphlab.shortest_path.create(g, source_vid=person,
                                        weight_field='weight',
                                        verbose=False)
    sp2_graph = sp2.get('graph')
    sp2_out = sp2_graph.get_vertices(ids=mainstream_actors)

    # Compute some statistics about the distribution of distances
    mean_dists[person] = sp2_out['distance'].mean()



In [28]:

    
mean_dists









    Out[28]:





{'Danny Trejo': 2.6244607155980866,
 'Dennis Hopper': 2.967030499339929,
 'Frank Welker': 2.909965023068448,
 'John Cena': 3.1741837581828287,
 'Kevin Bacon': 2.888808742871924,
 'Robert De Niro': 2.828014208527843,
 'Samuel L. Jackson': 2.7586455625569934}

Once again, Danny Trejo comes out ahead, with the smallest mean distance to all other network nodes. Let's take a look at the whole distribution of shortest path distances for him.



In [29]:

    
sp2 = graphlab.shortest_path.create(g, source_vid='Danny Trejo',
                                    weight_field='weight',
                                    verbose=False)
sp2_graph = sp2.get('graph')
sp2_out = sp2_graph.get_vertices(ids=mainstream_actors)
sp2_out['distance'].show()

The shape of the distribution is similar to same plot for Kevin Bacon, but there are far more nodes reachable in only two hops for Danny Trejo. All in all, Danny Trejo seems to be a much better candidate for the center of the Kevin Bacon game.

However, there are many other measures of centrality or importance in a graph. Some of these require converting our bipartite graph into an actor network where edges join two actors who have been in a movie together.

Computing the actor network

To compute other interesting statistics about the network of actors, it is useful to eliminate the film vertices, so that actor vertices share an edge of the two actors co-starred in a movie. If A is the adjacency matrix for the bipartite graph where actors are rows and movies are columns, the social network (with weighted edges) is computed by AA^T. Because this operation is memory intensive, we first pull out the subset of movies from 2013.



In [30]:

    
## Pull out the data for 2013 films
year_data = data[data['year'] == 2013]

year_actors = graphlab.SFrame({'actor': year_data['actor_name'].unique()})
year_actors = year_actors.add_row_number()

year_films = year_data['film_name'].unique()

Next we construct the A matrix and multiply it by itself. This is a computational bottleneck that takes some time and is the reason we subset the data down to just 2013 release year movies. To do this computation we'll use a pandas Data Frame and a numpy array.



In [31]:

    
import numpy as np
import pandas as pd

A = pd.DataFrame(np.zeros((len(year_actors), len(year_films)), dtype=np.int),
                 columns=year_films, index=year_actors['actor'])

for row in year_data:
    A[row['film_name']][row['actor_name']] = 1



In [32]:

    
A = A.values
adjacency = np.triu(np.dot(A, A.T))

Finally, we construct a new graph which is the actor network, where two actors are connected if they've been in a movie together. The edge attribute 'count' is the number of movies shared by two actors in 2013 - with a handful of exceptions, this is 1 for all actor pairs (in the 2013 data).



In [33]:

    
year_actors









    Out[33]:





    
        id
        actor
    
    
        0
        George J. Vezina
    
    
        1
        Lange Siegstad
    
    
        2
        Paula Malcomson
    
    
        3
        David Mattia
    
    
        4
        June Diane Raphael
    
    
        5
        Melonie Diaz
    
    
        6
        Roger Edwards
    
    
        7
        David Kim
    
    
        8
        James Ransone
    
    
        9
        Chris J. Knight
    

[6427 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.



In [34]:

    
edge_idx = np.nonzero(adjacency)
graphlab.SFrame({'idx_source': adjacency[edge_idx]})









    Out[34]:





    
        idx_source
    
    
        1
    
    
        1
    
    
        1
    
    
        1
    
    
        1
    
    
        1
    
    
        1
    
    
        1
    
    
        1
    
    
        1
    

[90898 rows x 1 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.



In [35]:

    
edge_idx = np.nonzero(adjacency)
graphlab.SFrame({'idx_source': adjacency[edge_idx]})









    Out[35]:





    
        idx_source
    
    
        1
    
    
        1
    
    
        1
    
    
        1
    
    
        1
    
    
        1
    
    
        1
    
    
        1
    
    
        1
    
    
        1
    

[90898 rows x 1 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.



In [36]:

    
graphlab.deps.HAS_NUMPY = True



In [37]:

    
sf_edge = graphlab.SFrame({'idx_source': edge_idx[0], 
                           'idx_dest': edge_idx[1],
                           'weight': adjacency[edge_idx]})



In [38]:

    
sf_edge = sf_edge.join(year_actors, on={'idx_source': 'id'}, how='left')
sf_edge.rename({'actor': 'actor1'})

sf_edge = sf_edge.join(year_actors, on={'idx_dest': 'id'}, how='left')
sf_edge.rename({'actor': 'actor2'})

sf_edge.remove_column('idx_dest')
sf_edge.remove_column('idx_source')

net = graphlab.SGraph()
net = net.add_edges(sf_edge, src_field='actor1', dst_field='actor2')
net = net.add_edges(sf_edge, src_field='actor2', dst_field='actor1')



In [39]:

    
print "Sample actor edges:"
net.edges









    



Sample actor edges:






    Out[39]:





    
        __src_id
        __dst_id
        weight
    
    
        David Delagarza
        David Delagarza
        1
    
    
        David Delagarza
        Navarro Hall
        1
    
    
        Clark Gregg
        Clark Gregg
        4
    
    
        Clark Gregg
        Richard Dreyfuss
        1
    
    
        Clark Gregg
        Dylan Minnette
        1
    
    
        Clark Gregg
        Griffin Gluck
        1
    
    
        Marc Andrew Smith
        Marc Andrew Smith
        1
    
    
        Scott Haze
        Scott Haze
        2
    
    
        David Belle
        David Belle
        1
    
    
        David Belle
        Domenick Lombardozzi
        1
    

[181796 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Connected components (again)

We can also find the connected components in the actor network.



In [40]:

    
cc = graphlab.connected_components.create(net)
cc_out = cc.get('component_id')

print "Connected component summary:"
cc.summary()









    




PROGRESS: +-----------------------------+






    




PROGRESS: | Number of components merged |






    




PROGRESS: +-----------------------------+






    




PROGRESS: | 6302                        |






    




PROGRESS: | 0                           |






    




PROGRESS: +-----------------------------+






    



Connected component summary:
Class                                   : ConnectedComponentsModel

Graph
-----
num_edges                               : 181796
num_vertices                            : 6427

Results
-------
graph                                   : SGraph. See m['graph']
component size                          : SFrame. See m['component_size']
number of connected components          : 180
vertex component id                     : SFrame. See m['componentid']

Metrics
-------
training time (secs)                    : 0.0792

Queryable Fields
----------------
graph                                   : A new SGraph with the color id as a vertex property
component_id                            : An SFrame with each vertex's component id
component_size                          : An SFrame with the size of each component
training_time                           : Total training time of the model

Again, there are is one dominant component with "mainstream" actors. As with the bipartite graph, let's isolate and explore some of the smaller components.



In [41]:

    
cc_size = cc['component_size'].sort('Count', ascending=False)
cc_size









    Out[41]:





    
        component_id
        Count
    
    
        663
        5305
    
    
        726
        40
    
    
        363
        24
    
    
        491
        22
    
    
        1186
        20
    
    
        439
        20
    
    
        161
        19
    
    
        1005
        17
    
    
        948
        16
    
    
        1203
        15
    

[180 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

To keep things simple going forward, we'll work with only the big connected component.



In [42]:

    
big_label = cc_size['component_id'][0]
big_names = cc_out[cc_out['component_id'] == big_label]
mainstream_actors = big_names.filter_by(actors, column_name='__id')['__id']

mainstream_edges = net.get_edges(src_ids=mainstream_actors)
net = graphlab.SGraph()
net = net.add_edges(mainstream_edges, src_field='__src_id', dst_field='__dst_id')

net.summary()









    Out[42]:





{'num_edges': 167952, 'num_vertices': 5305}

Back to Kevin Bacon

Kevin Bacon appears to have been eclipsed by many actors as the center of the Kevin Bacon game over the past decade, but we used very crude graph metrics to determine this. In this smaller network of 2013 actors we can compute more sophisticated things to measure centrality. We'll start with vertex degree once again, but now an actor's degree is the number of actors who share a movie with him (rather than the number of movies done by the actor as in the bipartite graph).

We'll use an SFrame---initialized with the actor degree---to store the results.



In [43]:

    
centrality = get_degree(net)

Triangles in a graph are complete subgraphs with only three vertices. The number of triangles to which an actor belongs is a measure of the connectivity of his or her social network.



In [44]:

    
tc = graphlab.triangle_counting.create(net)
print "Triangle count summary:\n", tc.summary()

centrality = centrality.join(tc['triangle_count'], on='__id', how='left')









    




PROGRESS: Initializing vertex ids.






    




PROGRESS: Removing duplicate (bidirectional) edges.






    




PROGRESS: Counting triangles...






    




PROGRESS: Finished in 0.630832 secs.






    




PROGRESS: Total triangles in the graph : 1367881






    



Triangle count summary:
Class                                   : TriangleCountingModel

Graph
-----
num_edges                               : 167952
num_vertices                            : 5305

Results
-------
graph                                   : SGraph. See m['graph']
total number of triangles               : 1367881
vertex triangle count                   : SFrame. See m['triangle_count']

Metrics
-------
training time (secs)                    : 0.6391

Queryable Fields
----------------
graph                                   : A new SGraph with the triangle count as a vertex property.
num_triangles                           : Total number of triangles in the graph.
triangle_count                          : An SFrame with the triangle count for each vertex.
training_time                           : Total training time of the model

None

Pagerank is a popular method for calculating the "importance" of nodes in a graph.



In [45]:

    
pr = graphlab.pagerank.create(net, verbose=False)
print "Pagerank summary:\n", pr.summary()

centrality = centrality.join(pr['pagerank'], on='__id', how='left')
centrality.sort('pagerank', ascending=False)









    



Pagerank summary:
Class                                   : PagerankModel

Graph
-----
num_edges                               : 167952
num_vertices                            : 5305

Results
-------
graph                                   : SGraph. See m['graph']
change in last iteration (L1 norm)      : 1.5828
vertex pagerank                         : SFrame. See m['pagerank']

Settings
--------
maximun number of iterations            : 20
convergence threshold (L1 norm)         : 0.01
probablity of random jumps to any node in the graph: 0.15

Metrics
-------
training time (secs)                    : 0.4068
number of iterations                    : 20

Queryable Fields
----------------
training_time                           : Total training time of the model
graph                                   : A new SGraph with the pagerank as a vertex property
delta                                   : Change in pagerank for the last iteration in L1 norm
reset_probability                       : The probablity of randomly jumps to any node in the graph
pagerank                                : An SFrame with each vertex's pagerank
num_iterations                          : Number of iterations
threshold                               : The convergence threshold in L1 norm
max_iterations                          : The maximun number of iterations to run

None






    Out[45]:





    
        __id
        in_degree
        triangle_count
        pagerank
        delta
    
    
        James Franco
        151
        2982
        4.77719241601
        6.77217050331e-05
    
    
        Rob Corddry
        105
        1127
        4.11442446464
        0.000878799198492
    
    
        Channing Tatum
        134
        2949
        3.78775951804
        0.000374050088186
    
    
        Paul Rudd
        184
        6235
        3.77233182078
        0.00157863734389
    
    
        Andrew R. Kaplan
        205
        6979
        3.62503016161
        0.00271149544128
    
    
        Anthony Mackie
        124
        2378
        3.59796557524
        9.26603227924e-05
    
    
        John Cusack
        103
        1955
        3.37184310389
        3.23909995932e-05
    
    
        Jeffrey Wright
        85
        963
        3.27329650664
        0.000817767631977
    
    
        Amy Seimetz
        68
        611
        3.26483531164
        0.000862322961037
    
    
        Melissa Leo
        109
        2526
        3.11387894674
        0.00028271047215
    

[5305 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

James Franco crushes the competition with pagerank. If we plot the histogram of pagerank quantiles, we can see just how far ahead Mr. Franco really is.



In [46]:

    
centrality['pagerank'].show()

With 99% of the pagerank scores below 2.225, James Franco's pagerank of 4.78 is almost literally off the chart!

Finally, another very common way to measure the centrality of a vertex is the mean distance from the vertex to all other nodes. This is a relatively expensive thing to compute, so we'll find it just for the top of our leader board.



In [47]:

    
centrality = centrality.sort('pagerank', ascending=False)

mean_dists = [int(1e12)] * centrality.num_rows()

for i in range(10):
    a = centrality[i]['__id']
    sp = graphlab.shortest_path.create(net, source_vid=a, verbose=False)
    sp_out = sp['distance']
    mean_dists[i] = sp_out['distance'].mean()

centrality['mean_dist'] = mean_dists
centrality.sort('mean_dist')









    Out[47]:





    
        __id
        in_degree
        triangle_count
        pagerank
        delta
        mean_dist
    
    
        Anthony Mackie
        124
        2378
        3.59796557524
        9.26603227924e-05
        3.11008482564
    
    
        James Franco
        151
        2982
        4.77719241601
        6.77217050331e-05
        3.12082940622
    
    
        Paul Rudd
        184
        6235
        3.77233182078
        0.00157863734389
        3.13779453346
    
    
        Channing Tatum
        134
        2949
        3.78775951804
        0.000374050088186
        3.20263901979
    
    
        Rob Corddry
        105
        1127
        4.11442446464
        0.000878799198492
        3.27389255419
    
    
        Jeffrey Wright
        85
        963
        3.27329650664
        0.000817767631977
        3.27464655985
    
    
        Andrew R. Kaplan
        205
        6979
        3.62503016161
        0.00271149544128
        3.38812441093
    
    
        John Cusack
        103
        1955
        3.37184310389
        3.23909995932e-05
        3.44882186616
    
    
        Melissa Leo
        109
        2526
        3.11387894674
        0.00028271047215
        3.53195098963
    
    
        Amy Seimetz
        68
        611
        3.26483531164
        0.000862322961037
        3.89198868992
    

[5305 rows x 6 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

We have a three way tie, between Anthony Mackie, James Franco , and Paul Rudd.



In [48]:

    
display(Image(url='https://static.turi.com/datasets/americanMovies/Anthony-Mackie.jpg'))
print "Source: David Shankbone http://flickr.com/photos/27865228@N06/4543207958"
display(Image(url='https://static.turi.com/datasets/americanMovies/James-Franco.jpg'))
print "Source: Vanessa Lua http://www.flickr.com/photos/vanessalua/6286991960/"
display(Image(url='https://static.turi.com/datasets/americanMovies/Paul-Rudd.jpg'))
print "Source: Eva Rinaldi http://www.flickr.com/photos/evarinaldiphotography/11024133765/"









    











    



Source: David Shankbone http://flickr.com/photos/27865228@N06/4543207958






    











    



Source: Vanessa Lua http://www.flickr.com/photos/vanessalua/6286991960/






    











    



Source: Eva Rinaldi http://www.flickr.com/photos/evarinaldiphotography/11024133765/

TL; DR: Danny Trejo is the real Kevin Bacon of the last decade, while Paul Rudd, Anthony Mackie and James Franco took over the most central spot for 2013.

(Looking for more details about the modules and functions? Check out the API docs.)

__src_id	__dst_id	id	character	year	weight
Tom Arnold	The Kid & I	94	Bill Williams	2005	0.5
David Arden Engel	The Albino Code	211	Captain Fascist	2006	0.5
Alyssa Chamberlain	The Albino Code	220	English Girl	2006	0.5
Ezra Stevens	The Albino Code	222	Driver	2006	0.5
Eric Mill	The Albino Code	223	Hooligan 2	2006	0.5

component_id	Count
21644	87233
4101	48
12905	46
72315	44
73764	39
72062	34
16055	33
11485	32
21271	25
60080	24

__id	in_degree
Samuel L. Jackson	42
Robert De Niro	29
Kevin Bacon	21
Dennis Hopper	21

__id	in_degree
Danny Trejo	81
Frank Welker	78
John Cena	70
Michael Madsen	63
Jeff Bennett	59
Keith David	56
Eric Roberts	56
Tom Kenny	54
Vampiro	53
Grey DeLisle	52

id	actor
0	George J. Vezina
1	Lange Siegstad
2	Paula Malcomson
3	David Mattia
4	June Diane Raphael
5	Melonie Diaz
6	Roger Edwards
7	David Kim
8	James Ransone
9	Chris J. Knight

__src_id	__dst_id	weight
David Delagarza	David Delagarza	1
David Delagarza	Navarro Hall	1
Clark Gregg	Clark Gregg	4
Clark Gregg	Richard Dreyfuss	1
Clark Gregg	Dylan Minnette	1
Clark Gregg	Griffin Gluck	1
Marc Andrew Smith	Marc Andrew Smith	1
Scott Haze	Scott Haze	2
David Belle	David Belle	1
David Belle	Domenick Lombardozzi	1

__id	in_degree	triangle_count	pagerank	delta
James Franco	151	2982	4.77719241601	6.77217050331e-05
Rob Corddry	105	1127	4.11442446464	0.000878799198492
Channing Tatum	134	2949	3.78775951804	0.000374050088186
Paul Rudd	184	6235	3.77233182078	0.00157863734389
Andrew R. Kaplan	205	6979	3.62503016161	0.00271149544128
Anthony Mackie	124	2378	3.59796557524	9.26603227924e-05
John Cusack	103	1955	3.37184310389	3.23909995932e-05
Jeffrey Wright	85	963	3.27329650664	0.000817767631977
Amy Seimetz	68	611	3.26483531164	0.000862322961037
Melissa Leo	109	2526	3.11387894674	0.00028271047215