Getting Started with GraphLab Create

First, a note about IPython Notebook

Most of our tutorials are written as IPython notebooks. This allows you to download and run the tutorials on your own machine, either as notebooks (.ipynb) or Python files (.py). To run the notebooks you'll need to install IPython and IPython Notebook; for installation details, visit www.ipython.org. A couple of the notebooks depend on matplotlib for custom plots; this library can be installed with the terminal command 'pip install matplotlib'.

Overview

In this tutorial, you'll get a feel for some of the fundamental tasks that GraphLab Create is built for.

You will learn how to:

  • load data into SFrames
  • create a Graph data structure from these frames
  • write simple graph queries
  • apply a machine learning model from the Graph Analytics Toolkit

We also have many other toolkits to explore, including recommender systems, data matching, graph analytics, and more. Explore these and the rest of GraphLab Create in our User Guide.

...oh yeah, you'll also learn that some of us at Dato have a thing for Bond...yes...James Bond...


In [3]:
import graphlab as gl
gl.canvas.set_target('ipynb') # use IPython Notebook output for GraphLab Canvas

Load data into an SFrame

GraphLab Create uses two scalable data structures:

  • the SFrame, a tabular structure ideal for data munging & feature building
  • the Graph, a structure ideal for sparse data

In [2]:
vertices = gl.SFrame.read_csv('http://s3.amazonaws.com/dato-datasets/bond/bond_vertices.csv')
edges = gl.SFrame.read_csv('http://s3.amazonaws.com/dato-datasets/bond/bond_edges.csv')


PROGRESS: Downloading http://s3.amazonaws.com/dato-datasets/bond/bond_vertices.csv to /var/tmp/graphlab-piotrteterwak/84908/25be5354-9362-461c-987e-8f459f80350c.csv
PROGRESS: Finished parsing file http://s3.amazonaws.com/dato-datasets/bond/bond_vertices.csv
PROGRESS: Parsing completed. Parsed 10 lines in 0.054139 secs.
PROGRESS: Finished parsing file http://s3.amazonaws.com/dato-datasets/bond/bond_vertices.csv
PROGRESS: Parsing completed. Parsed 10 lines in 0.01183 secs.
PROGRESS: Downloading http://s3.amazonaws.com/dato-datasets/bond/bond_edges.csv to /var/tmp/graphlab-piotrteterwak/84908/d41d23b0-38bf-4605-9bc5-b93ab1ceed78.csv
PROGRESS: Finished parsing file http://s3.amazonaws.com/dato-datasets/bond/bond_edges.csv
PROGRESS: Parsing completed. Parsed 20 lines in 0.011795 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[str,str,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
------------------------------------------------------
PROGRESS: Read 20 lines. Lines per second: 1352.17
PROGRESS: Finished parsing file http://s3.amazonaws.com/dato-datasets/bond/bond_edges.csv
PROGRESS: Parsing completed. Parsed 20 lines in 0.015271 secs.
Inferred types from first line of file as 
column_type_hints=[str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
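The log above says the column types were inferred from the first line and can be overridden through column_type_hints. As a rough illustration of what such hints do, here is a minimal standard-library sketch (Python 3) that applies one type per column while parsing. It is not GraphLab code: read_with_hints and SAMPLE_CSV are hypothetical names, and the header layout (name, gender, license_to_kill, villian) is an assumption pieced together from the vid_field and output shown later in this tutorial.

```python
import csv
import io

# Hypothetical sample; the two data rows reuse values shown in this tutorial,
# and the header order is an assumption.
SAMPLE_CSV = """name,gender,license_to_kill,villian
James Bond,M,1,0
Elliot Carver,M,0,1
"""


def read_with_hints(text, type_hints):
    """Parse CSV text, converting each column with the matching type hint."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if len(header) != len(type_hints):
        raise ValueError("need exactly one type hint per column")
    return [
        {col: cast(value) for col, cast, value in zip(header, type_hints, row)}
        for row in reader
    ]


rows = read_with_hints(SAMPLE_CSV, [str, str, int, int])
print(rows[0])
```

If a hint does not fit the data (say, int for the name column), the conversion raises a ValueError, which is the kind of parse failure the log message is warning about.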

In [3]:
# SFrame has a number of methods to explore and transform your data
vertices.show()



In [4]:
# this shows the summary of the edges SFrame
edges.show()


Create a graph object


In [5]:
g = gl.SGraph()

Add vertices and edges to this graph


In [6]:
# add_vertices returns a new SGraph, so reassign the result to g
g = g.add_vertices(vertices=vertices, vid_field='name')

In [7]:
# add the edges the same way; src_field and dst_field name the endpoint columns
g = g.add_edges(edges=edges, src_field='src', dst_field='dst')
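To make the two steps above more concrete, here is a minimal plain-Python sketch of the same pattern: table rows go in, a graph with attributed vertices and edges comes out, and each call returns a new graph rather than modifying the old one. This is only an illustration, not how SGraph is implemented. The dict-based graph, the function bodies, and the column name 'name' in the sample rows are assumptions; the sample values are taken from the outputs in this tutorial.

```python
# A few rows in the shape of the tutorial's vertex and edge tables.
vertex_rows = [
    {"name": "James Bond", "gender": "M", "license_to_kill": 1, "villian": 0},
    {"name": "M", "gender": "M", "license_to_kill": 1, "villian": 0},
    {"name": "Moneypenny", "gender": "F", "license_to_kill": 1, "villian": 0},
]
edge_rows = [
    {"src": "M", "dst": "James Bond", "relation": "worksfor"},
    {"src": "James Bond", "dst": "M", "relation": "managed_by"},
    {"src": "Moneypenny", "dst": "M", "relation": "managed_by"},
]


def add_vertices(graph, rows, vid_field):
    """Return a new graph with one vertex per row, keyed by the vid_field column."""
    vertices = dict(graph["vertices"])
    for row in rows:
        vertices[row[vid_field]] = {k: v for k, v in row.items() if k != vid_field}
    return {"vertices": vertices, "edges": list(graph["edges"])}


def add_edges(graph, rows, src_field, dst_field):
    """Return a new graph with one directed edge per row."""
    vertices = dict(graph["vertices"])
    edges = list(graph["edges"])
    for row in rows:
        src, dst = row[src_field], row[dst_field]
        attrs = {k: v for k, v in row.items() if k not in (src_field, dst_field)}
        # Sketch-level choice: unseen endpoints become vertices without attributes.
        vertices.setdefault(src, {})
        vertices.setdefault(dst, {})
        edges.append((src, dst, attrs))
    return {"vertices": vertices, "edges": edges}


empty = {"vertices": {}, "edges": []}
g = add_vertices(empty, vertex_rows, vid_field="name")
g = add_edges(g, edge_rows, src_field="src", dst_field="dst")
print(len(g["vertices"]), "vertices,", len(g["edges"]), "edges")
```

Because every call builds a fresh graph, the original empty graph is left untouched, which is why the result has to be assigned back to g.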

Do some basic graph querying


In [8]:
# Show all the vertices
g.get_vertices()


Out[8]:
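+----------------+--------+-----------------+---------+
| __id           | gender | license_to_kill | villian |
+----------------+--------+-----------------+---------+
| Inga Bergstorm | F      | 0               | 0       |
| Moneypenny     | F      | 1               | 0       |
| Henry Gupta    | M      | 0               | 1       |
| Wai Lin        | F      | 1               | 0       |
| M              | M      | 1               | 0       |
| Paris Carver   | F      | 0               | 1       |
| James Bond     | M      | 1               | 0       |
| Q              | M      | 1               | 0       |
| Elliot Carver  | M      | 0               | 1       |
| Gotz Otto      | M      | 0               | 1       |
+----------------+--------+-----------------+---------+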
[10 rows x 4 columns]

In [9]:
# Show all the edges
g.get_edges()


Out[9]:
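+----------------+----------------+------------+
| __src_id       | __dst_id       | relation   |
+----------------+----------------+------------+
| Inga Bergstorm | James Bond     | friend     |
| Moneypenny     | M              | managed_by |
| Moneypenny     | Q              | colleague  |
| Henry Gupta    | Elliot Carver  | killed_by  |
| Q              | Moneypenny     | colleague  |
| M              | Moneypenny     | worksfor   |
| James Bond     | Inga Bergstorm | friend     |
| Wai Lin        | James Bond     | friend     |
| M              | James Bond     | worksfor   |
| James Bond     | M              | managed_by |
+----------------+----------------+------------+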
[20 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [10]:
# Get all the "friend" edges
g.get_edges(fields={'relation': 'friend'})


Out[10]:
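+----------------+----------------+----------+
| __src_id       | __dst_id       | relation |
+----------------+----------------+----------+
| Inga Bergstorm | James Bond     | friend   |
| James Bond     | Inga Bergstorm | friend   |
| Wai Lin        | James Bond     | friend   |
| James Bond     | Wai Lin        | friend   |
+----------------+----------------+----------+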
[4 rows x 3 columns]
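The query above keeps only the edges whose attributes match every entry in the fields dictionary. A minimal plain-Python sketch of that kind of filter follows. It is an illustration of the idea, not GraphLab's implementation: filter_edges is a hypothetical helper, and the edge list is a subset of the edges printed in this tutorial.

```python
# A subset of the edges shown in the outputs above.
edges = [
    {"__src_id": "Inga Bergstorm", "__dst_id": "James Bond", "relation": "friend"},
    {"__src_id": "James Bond", "__dst_id": "Inga Bergstorm", "relation": "friend"},
    {"__src_id": "Wai Lin", "__dst_id": "James Bond", "relation": "friend"},
    {"__src_id": "James Bond", "__dst_id": "Wai Lin", "relation": "friend"},
    {"__src_id": "Moneypenny", "__dst_id": "M", "relation": "managed_by"},
    {"__src_id": "M", "__dst_id": "James Bond", "relation": "worksfor"},
]


def filter_edges(edge_list, fields):
    """Keep the edges whose attributes equal every key/value pair in fields."""
    return [
        edge for edge in edge_list
        if all(edge.get(key) == value for key, value in fields.items())
    ]


friends = filter_edges(edges, {"relation": "friend"})
for edge in friends:
    print(edge["__src_id"], "->", edge["__dst_id"])
```

Passing more than one key narrows the result further, since an edge has to match all of the constraints at once.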

Apply the PageRank algorithm to our graph


In [11]:
pr = gl.pagerank.create(g)


PROGRESS: Counting out degree
PROGRESS: Done counting out degree
PROGRESS: +-----------+-----------------------+
PROGRESS: | Iteration | L1 change in pagerank |
PROGRESS: +-----------+-----------------------+
PROGRESS: | 1         | 6.65833               |
PROGRESS: | 2         | 4.65611               |
PROGRESS: | 3         | 3.46298               |
PROGRESS: | 4         | 2.55686               |
PROGRESS: | 5         | 1.95422               |
PROGRESS: | 6         | 1.42139               |
PROGRESS: | 7         | 1.10464               |
PROGRESS: | 8         | 0.806704              |
PROGRESS: | 9         | 0.631771              |
PROGRESS: | 10        | 0.465388              |
PROGRESS: | 11        | 0.364898              |
PROGRESS: | 12        | 0.271257              |
PROGRESS: | 13        | 0.212255              |
PROGRESS: | 14        | 0.159062              |
PROGRESS: | 15        | 0.124071              |
PROGRESS: | 16        | 0.0935911             |
PROGRESS: | 17        | 0.0727674             |
PROGRESS: | 18        | 0.0551714             |
PROGRESS: | 19        | 0.0427744             |
PROGRESS: | 20        | 0.0325555             |
PROGRESS: +-----------+-----------------------+

In [12]:
# list the vertices with the highest PageRank scores
pr.get('pagerank').topk(column_name='pagerank')


Out[12]:
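+----------------+----------------+-------------------+
| __id           | pagerank       | delta             |
+----------------+----------------+-------------------+
| James Bond     | 2.52743578524  | 0.0132914517076   |
| M              | 1.87718696576  | 0.00666194771763  |
| Moneypenny     | 1.18363921275  | 0.00143637385736  |
| Q              | 1.18363921275  | 0.00143637385736  |
| Inga Bergstorm | 0.869872717136 | 0.00477951418076  |
| Wai Lin        | 0.869872717136 | 0.00477951418076  |
| Elliot Carver  | 0.634064732205 | 0.000113553313724 |
| Paris Carver   | 0.284762885673 | 1.89255522873e-05 |
| Henry Gupta    | 0.284762885673 | 1.89255522873e-05 |
| Gotz Otto      | 0.284762885673 | 1.89255522873e-05 |
+----------------+----------------+-------------------+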
[10 rows x 3 columns]

We see, not unexpectedly, that James Bond has the highest PageRank, and that the bad guys (Elliot Carver, Paris Carver, Henry Gupta and Gotz Otto) sit at the bottom of the ranking...
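For readers curious about what an iterative PageRank computation looks like, here is a small plain-Python sketch. It is a textbook-style power iteration, not GraphLab's implementation. The update rule (reset_probability plus the damped sum of each in-neighbor's rank divided by its out-degree), the starting value of 1.0 per vertex, and the parameter names and defaults are all assumptions made for this sketch. The edge list contains only the 11 edges that appear in the printed outputs above, out of 20 in the full dataset, so the scores will not match Out[12].

```python
# Only the edges visible in the printed outputs (11 of the 20 in the dataset).
visible_edges = [
    ("Inga Bergstorm", "James Bond"), ("Moneypenny", "M"), ("Moneypenny", "Q"),
    ("Henry Gupta", "Elliot Carver"), ("Q", "Moneypenny"), ("M", "Moneypenny"),
    ("James Bond", "Inga Bergstorm"), ("Wai Lin", "James Bond"),
    ("M", "James Bond"), ("James Bond", "M"), ("James Bond", "Wai Lin"),
]


def pagerank(edges, reset_probability=0.15, threshold=1e-2, max_iterations=20):
    """Unnormalized PageRank by power iteration (assumed form, see lead-in)."""
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    ranks = {v: 1.0 for v in vertices}
    for _ in range(max_iterations):
        # Each vertex splits its current rank evenly across its out-edges.
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        new_ranks = {
            v: reset_probability + (1 - reset_probability) * incoming[v]
            for v in vertices
        }
        # Stop once the total (L1) change between iterations is small enough.
        l1_change = sum(abs(new_ranks[v] - ranks[v]) for v in vertices)
        ranks = new_ranks
        if l1_change < threshold:
            break
    return ranks


ranks = pagerank(visible_edges)
for name in sorted(ranks, key=ranks.get, reverse=True):
    print(name, round(ranks[name], 3))
```

Even on this partial edge list, James Bond comes out on top, because several vertices point to him and most of them have few other outgoing edges.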

(Looking for more details about the modules and functions? Check out the API docs.)