Getting Started with GraphLab Create

First a note about IPython Notebook

Most of our tutorials are written as IPython notebooks. This allows you to download and run the tutorials on your own machine, either as notebooks (.ipynb) or Python files (.py). To run the notebooks you'll need to install IPython and IPython Notebook; for installation details, visit www.ipython.org. A couple of the notebooks depend on matplotlib for custom plots; this library can be installed with the terminal command 'pip install matplotlib'.

Overview

In this tutorial, you'll get a feel for some of the fundamental tasks that GraphLab Create is built for.

You will learn how to:

load data into SFrames
create a Graph data structure from these frames
write simple graph queries
apply a machine learning model from the Graph Analytics Toolkit

We also have many other toolkits to explore, including recommender systems, data matching, graph analytics, and more. Explore these and the rest of GraphLab Create in our User Guide.

...oh yeah, you'll also learn that some of us at Dato have a thing for Bond...yes...James Bond...



In [3]:

    
import graphlab as gl
gl.canvas.set_target('ipynb') # use IPython Notebook output for GraphLab Canvas

Load data into an SFrame

GraphLab Create uses two scalable data structures:

the SFrame, a tabular structure ideal for data munging & feature building
the SGraph, a graph structure ideal for sparse data



In [2]:

    
vertices = gl.SFrame.read_csv('http://s3.amazonaws.com/dato-datasets/bond/bond_vertices.csv')
edges = gl.SFrame.read_csv('http://s3.amazonaws.com/dato-datasets/bond/bond_edges.csv')

PROGRESS: Downloading http://s3.amazonaws.com/dato-datasets/bond/bond_vertices.csv to /var/tmp/graphlab-piotrteterwak/84908/25be5354-9362-461c-987e-8f459f80350c.csv
PROGRESS: Finished parsing file http://s3.amazonaws.com/dato-datasets/bond/bond_vertices.csv
PROGRESS: Parsing completed. Parsed 10 lines in 0.054139 secs.
PROGRESS: Finished parsing file http://s3.amazonaws.com/dato-datasets/bond/bond_vertices.csv
PROGRESS: Parsing completed. Parsed 10 lines in 0.01183 secs.
PROGRESS: Downloading http://s3.amazonaws.com/dato-datasets/bond/bond_edges.csv to /var/tmp/graphlab-piotrteterwak/84908/d41d23b0-38bf-4605-9bc5-b93ab1ceed78.csv
PROGRESS: Finished parsing file http://s3.amazonaws.com/dato-datasets/bond/bond_edges.csv
PROGRESS: Parsing completed. Parsed 20 lines in 0.011795 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[str,str,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
------------------------------------------------------
PROGRESS: Read 20 lines. Lines per second: 1352.17
PROGRESS: Finished parsing file http://s3.amazonaws.com/dato-datasets/bond/bond_edges.csv
PROGRESS: Parsing completed. Parsed 20 lines in 0.015271 secs.
Inferred types from first line of file as 
column_type_hints=[str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
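
The log output reports that column types were inferred from the first line of each file. As a rough illustration of that idea, here is a plain-Python sketch using only the standard library. It is not GraphLab Create's implementation: `infer_type` and `infer_column_types` are hypothetical helpers, and the CSV snippet is an assumed sample that mirrors the vertex columns used in this tutorial.

```python
import csv
import io

def infer_type(value):
    """Return int, float, or str: the narrowest of the three that parses `value`."""
    for candidate in (int, float):
        try:
            candidate(value)
            return candidate
        except ValueError:
            continue
    return str

def infer_column_types(csv_text):
    """Infer one type per column from the first data row (the row after the header)."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)              # skip the header row
    first_row = next(reader)  # only this row is inspected
    return [infer_type(cell) for cell in first_row]

# Assumed sample: a header plus one row shaped like the vertex data.
sample = "name,gender,license_to_kill,villian\nJames Bond,M,1,0\n"
hints = infer_column_types(sample)
print([t.__name__ for t in hints])
```

Because only one row is inspected, a guess like this can be wrong for later rows, which is why the log suggests correcting the list and passing it to read_csv through the column_type_hints argument.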



In [3]:

    
# SFrame has a number of methods to explore and transform your data
vertices.show()



In [4]:

    
# this shows the summary of the edges SFrame
edges.show()

Create a graph object



In [5]:

    
g = gl.SGraph()

Add vertices and edges to this graph



In [6]:

    
# add some vertices in a dataflow-ish way
g = g.add_vertices(vertices=vertices, vid_field='name')



In [7]:

    
# add the edges the same way, reassigning g to the returned graph
g = g.add_edges(edges=edges, src_field='src', dst_field='dst')
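
To make the roles of vid_field, src_field, and dst_field concrete, here is a minimal plain-Python sketch of turning a vertex table and an edge table into a graph. It only illustrates the idea and says nothing about how SGraph stores data internally; `build_graph` is a hypothetical helper, and the rows are a small subset of the Bond data.

```python
from collections import defaultdict

def build_graph(vertex_rows, edge_rows, vid_field, src_field, dst_field):
    """Return (vertices, adjacency) built from lists of row dictionaries."""
    # Key each vertex by its ID column; the remaining columns become attributes.
    vertices = {
        row[vid_field]: {k: v for k, v in row.items() if k != vid_field}
        for row in vertex_rows
    }
    # Map each source vertex to a list of (destination, edge attributes) pairs.
    adjacency = defaultdict(list)
    for row in edge_rows:
        attrs = {k: v for k, v in row.items() if k not in (src_field, dst_field)}
        adjacency[row[src_field]].append((row[dst_field], attrs))
    return vertices, dict(adjacency)

vertex_rows = [
    {"name": "James Bond", "gender": "M", "license_to_kill": 1, "villian": 0},
    {"name": "M", "gender": "M", "license_to_kill": 1, "villian": 0},
]
edge_rows = [
    {"src": "M", "dst": "James Bond", "relation": "worksfor"},
    {"src": "James Bond", "dst": "M", "relation": "managed_by"},
]

vertices, adjacency = build_graph(vertex_rows, edge_rows, "name", "src", "dst")
print(adjacency["James Bond"])
```

Like the notebook cells, the helper returns new objects rather than modifying its inputs.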

Do some basic graph querying



In [8]:

    
# Show all the vertices
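g.get_vertices()

Out[8]:

+----------------+--------+-----------------+---------+
| __id           | gender | license_to_kill | villian |
+----------------+--------+-----------------+---------+
| Inga Bergstorm | F      | 0               | 0       |
| Moneypenny     | F      | 1               | 0       |
| Henry Gupta    | M      | 0               | 1       |
| Wai Lin        | F      | 1               | 0       |
| M              | M      | 1               | 0       |
| Paris Carver   | F      | 0               | 1       |
| James Bond     | M      | 1               | 0       |
| Q              | M      | 1               | 0       |
| Elliot Carver  | M      | 0               | 1       |
| Gotz Otto      | M      | 0               | 1       |
+----------------+--------+-----------------+---------+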
[10 rows x 4 columns]



In [9]:

    
# Show all the edges
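g.get_edges()

Out[9]:

+----------------+----------------+------------+
| __src_id       | __dst_id       | relation   |
+----------------+----------------+------------+
| Inga Bergstorm | James Bond     | friend     |
| Moneypenny     | M              | managed_by |
| Moneypenny     | Q              | colleague  |
| Henry Gupta    | Elliot Carver  | killed_by  |
| Q              | Moneypenny     | colleague  |
| M              | Moneypenny     | worksfor   |
| James Bond     | Inga Bergstorm | friend     |
| Wai Lin        | James Bond     | friend     |
| M              | James Bond     | worksfor   |
| James Bond     | M              | managed_by |
+----------------+----------------+------------+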
[20 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.



In [10]:

    
# Get all the "friend" edges
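g.get_edges(fields={'relation': 'friend'})

Out[10]:

+----------------+----------------+----------+
| __src_id       | __dst_id       | relation |
+----------------+----------------+----------+
| Inga Bergstorm | James Bond     | friend   |
| James Bond     | Inga Bergstorm | friend   |
| Wai Lin        | James Bond     | friend   |
| James Bond     | Wai Lin        | friend   |
+----------------+----------------+----------+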
[4 rows x 3 columns]
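
The fields argument keeps only the edges whose attributes match the given values. A plain-Python sketch of the same kind of filter follows. The `filter_edges` helper is hypothetical and is not the SGraph implementation; the edge list contains only the 11 edges printed in this tutorial, not all 20 in the dataset.

```python
# Edges printed in this tutorial (11 of the dataset's 20; the rest are not shown).
EDGES = [
    ("Inga Bergstorm", "James Bond", "friend"),
    ("Moneypenny", "M", "managed_by"),
    ("Moneypenny", "Q", "colleague"),
    ("Henry Gupta", "Elliot Carver", "killed_by"),
    ("Q", "Moneypenny", "colleague"),
    ("M", "Moneypenny", "worksfor"),
    ("James Bond", "Inga Bergstorm", "friend"),
    ("Wai Lin", "James Bond", "friend"),
    ("M", "James Bond", "worksfor"),
    ("James Bond", "M", "managed_by"),
    ("James Bond", "Wai Lin", "friend"),
]

def filter_edges(edges, fields=None):
    """Return edge rows (as dicts) whose columns equal every value in `fields`."""
    fields = fields or {}
    columns = ("__src_id", "__dst_id", "relation")
    rows = [dict(zip(columns, edge)) for edge in edges]
    return [row for row in rows
            if all(row.get(name) == value for name, value in fields.items())]

friends = filter_edges(EDGES, fields={"relation": "friend"})
for row in friends:
    print(row["__src_id"], "->", row["__dst_id"])
```

Passing several keys in fields narrows the result further, since a row must match all of them.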

Apply the pagerank algorithm to our graph



In [11]:

    
pr = gl.pagerank.create(g)

PROGRESS: Counting out degree
PROGRESS: Done counting out degree
PROGRESS: +-----------+-----------------------+
PROGRESS: | Iteration | L1 change in pagerank |
PROGRESS: +-----------+-----------------------+
PROGRESS: | 1         | 6.65833               |
PROGRESS: | 2         | 4.65611               |
PROGRESS: | 3         | 3.46298               |
PROGRESS: | 4         | 2.55686               |
PROGRESS: | 5         | 1.95422               |
PROGRESS: | 6         | 1.42139               |
PROGRESS: | 7         | 1.10464               |
PROGRESS: | 8         | 0.806704              |
PROGRESS: | 9         | 0.631771              |
PROGRESS: | 10        | 0.465388              |
PROGRESS: | 11        | 0.364898              |
PROGRESS: | 12        | 0.271257              |
PROGRESS: | 13        | 0.212255              |
PROGRESS: | 14        | 0.159062              |
PROGRESS: | 15        | 0.124071              |
PROGRESS: | 16        | 0.0935911             |
PROGRESS: | 17        | 0.0727674             |
PROGRESS: | 18        | 0.0551714             |
PROGRESS: | 19        | 0.0427744             |
PROGRESS: | 20        | 0.0325555             |
PROGRESS: +-----------+-----------------------+
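
The progress table tracks the L1 change in PageRank between iterations. To show what such a loop can look like, here is a plain-Python sketch of a common damped PageRank formulation, starting every vertex at 1.0. The update rule, the default values, and the parameter names `reset_probability`, `threshold`, and `max_iterations` are assumptions of this sketch; the toolkit's internals may differ. The edge list is a subset of the edges printed earlier, chosen so that every vertex has an outgoing edge, so the scores will not match the toolkit's output.

```python
def pagerank(edges, reset_probability=0.15, threshold=1e-2, max_iterations=20):
    """Iterate damped PageRank; return (ranks, l1_changes)."""
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    ranks = {v: 1.0 for v in vertices}  # assumed starting value
    l1_changes = []
    for _ in range(max_iterations):
        # Each vertex splits its current rank evenly across its out-edges.
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        new_ranks = {
            v: reset_probability + (1 - reset_probability) * incoming[v]
            for v in vertices
        }
        # L1 change: the sum of absolute differences from the previous iteration.
        change = sum(abs(new_ranks[v] - ranks[v]) for v in vertices)
        l1_changes.append(change)
        ranks = new_ranks
        if change < threshold:
            break
    return ranks, l1_changes

# Subset of the printed edges (source, destination); relation labels are dropped.
EDGE_PAIRS = [
    ("Inga Bergstorm", "James Bond"), ("Moneypenny", "M"), ("Moneypenny", "Q"),
    ("Q", "Moneypenny"), ("M", "Moneypenny"), ("James Bond", "Inga Bergstorm"),
    ("Wai Lin", "James Bond"), ("M", "James Bond"), ("James Bond", "M"),
    ("James Bond", "Wai Lin"),
]

ranks, l1_changes = pagerank(EDGE_PAIRS)
for name, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(name, round(score, 4))
```

In this formulation the scores sum to the number of vertices rather than to 1, as long as every vertex has at least one outgoing edge.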



In [12]:

    
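pr.get('pagerank').topk(column_name='pagerank')

Out[12]:

+----------------+----------------+-------------------+
| __id           | pagerank       | delta             |
+----------------+----------------+-------------------+
| James Bond     | 2.52743578524  | 0.0132914517076   |
| M              | 1.87718696576  | 0.00666194771763  |
| Moneypenny     | 1.18363921275  | 0.00143637385736  |
| Q              | 1.18363921275  | 0.00143637385736  |
| Inga Bergstorm | 0.869872717136 | 0.00477951418076  |
| Wai Lin        | 0.869872717136 | 0.00477951418076  |
| Elliot Carver  | 0.634064732205 | 0.000113553313724 |
| Paris Carver   | 0.284762885673 | 1.89255522873e-05 |
| Henry Gupta    | 0.284762885673 | 1.89255522873e-05 |
| Gotz Otto      | 0.284762885673 | 1.89255522873e-05 |
+----------------+----------------+-------------------+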

[10 rows x 3 columns]

We see, not unexpectedly, that James Bond is a very important person, and that bad guys aren't that popular...
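
As a quick check of that observation, the sketch below joins the PageRank scores with the villian flag from the vertex table and compares group averages. The numbers are copied from the outputs in this tutorial; the `mean` helper and the variable names are illustrative only.

```python
# PageRank scores copied from the topk output in this tutorial.
PAGERANK = {
    "James Bond": 2.52743578524, "M": 1.87718696576,
    "Moneypenny": 1.18363921275, "Q": 1.18363921275,
    "Inga Bergstorm": 0.869872717136, "Wai Lin": 0.869872717136,
    "Elliot Carver": 0.634064732205, "Paris Carver": 0.284762885673,
    "Henry Gupta": 0.284762885673, "Gotz Otto": 0.284762885673,
}

# Vertices whose villian column is 1 in the vertex table.
VILLAINS = {"Henry Gupta", "Paris Carver", "Elliot Carver", "Gotz Otto"}

def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)

villain_mean = mean(score for name, score in PAGERANK.items() if name in VILLAINS)
other_mean = mean(score for name, score in PAGERANK.items() if name not in VILLAINS)
print("villains:", round(villain_mean, 3))
print("others:  ", round(other_mean, 3))
```

With these scores, the villains' average comes out well below that of the other characters.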

(Looking for more details about the modules and functions? Check out the API docs.)
