Import GraphLab Create

Always run this import before using any part of GraphLab Create.


In [1]:
import graphlab

Load a tabular data set


In [2]:
sf = graphlab.SFrame('../data/books/book-data.csv')


2016-03-27 21:51:23,435 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.8.5 started. Logging: /tmp/graphlab_server_1459140681.log
This commercial license of GraphLab Create is assigned to engr@turi.com.
Finished parsing file /Users/srikris/workspace/tutorials/strata-sj-2016/data/books/book-data.csv
Parsing completed. Parsed 100 lines in 0.086887 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[str,str,int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Finished parsing file /Users/srikris/workspace/tutorials/strata-sj-2016/data/books/book-data.csv
Parsing completed. Parsed 18776 lines in 0.064677 secs.
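The log above reports that the column types were inferred as column_type_hints=[str,str,int,str]. As a rough illustration of what such inference amounts to, here is a minimal plain-Python sketch that needs no GraphLab install. It is not GraphLab's actual implementation: the infer_type helper is hypothetical, the header line is assumed from the column names, and the two data rows are copied from the table shown in this notebook.

```python
import csv

# Assumed layout of book-data.csv: a header line followed by data rows.
SAMPLE_LINES = [
    "book,author,year,publisher",
    "The Law of Love,Laura Esquivel,1997,Three Rivers Press (CA)",
    "Undercurrents,Frances Fyfield,2001,Viking Penguin Inc",
]

def infer_type(value):
    """Hypothetical helper: try int, then float, and fall back to str."""
    for candidate in (int, float):
        try:
            candidate(value)
            return candidate
        except ValueError:
            pass
    return str

rows = list(csv.reader(SAMPLE_LINES))
header, first_row = rows[0], rows[1]

# One type per column, guessed from the first data row only.
type_hints = [infer_type(value) for value in first_row]

# Apply the guessed types to every data row.
parsed = [[t(v) for t, v in zip(type_hints, row)] for row in rows[1:]]
```

If a later row held a non-numeric year, the int conversion in the last step would fail, which is the situation the log message describes when it suggests correcting the type list by hand.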

SFrame basics


In [3]:
sf  # view the first few rows of the table


Out[3]:
book | author | year | publisher
The Law of Love | Laura Esquivel | 1997 | Three Rivers Press (CA)
Undercurrents | Frances Fyfield | 2001 | Viking Penguin Inc
Swept Away | Cay David | 1992 | Meteor Publishing Corporation ...
Interesting Women: Stories ... | Andrea Lee | 2002 | Random House
Sweet Revenge | Kate Clemens | 2004 | Kensington Publishing Corporation ...
Somewhere, Someday | Josephine Cox | 2001 | Trafalgar Square
Living by the Word: Selected Writings ... | Alice Walker | 1989 | Harvest Books
Hinds' Feet on High Places ... | Hannah Hurnard | 1993 | Tyndale House Publishers
By Possession | Madeline Hunter | 2000 | Bantam Books
The Love of a Cowboy | Anna Jeffrey | 2003 | Onyx Books
[18776 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [4]:
sf.tail()  # view end of the table


Out[4]:
book | author | year | publisher
Silken Bondage | Nan Ryan | 1991 | Dell Publishing Company
Das Magische Messer / The Magic Knife ... | Philip Pullman | 2002 | Distribooks
A Walk Across America | Peter Jenkins | 1979 | Harpercollins
Princess Charming | Elizabeth Thornton | 2001 | Bantam Books
Roger's Version | John Updike | 1996 | Ballantine Books
Sheep in a Jeep | Nancy E. Shaw | 1988 | Houghton Mifflin
Babyhood | Paul Reiser | 1997 | Random House Audio
The Lighthouse Keeper | James Michael Pratt | 2000 | St Martins Pr
Spandau Phoenix | Greg Iles | 1993 | Penguin USA
Glory in Death | J. D. Robb | 2004 | Berkley Publishing Group
[10 rows x 4 columns]

GraphLab Canvas


In [5]:
# .show() visualizes any data structure in GraphLab Create
sf.show()


Canvas is accessible via web browser at the URL: http://localhost:52612/index.html
Opening Canvas in default web browser.

In [6]:
# To render Canvas visualizations inside this notebook
# instead of opening a new browser window, add this line:
graphlab.canvas.set_target('ipynb')

In [7]:
sf['year'].show(view='Categorical')


Inspect columns of the dataset


In [8]:
sf['author']


Out[8]:
dtype: str
Rows: 18776
['Laura Esquivel', 'Frances Fyfield', 'Cay David', 'Andrea Lee', 'Kate Clemens', 'Josephine Cox', 'Alice Walker', 'Hannah Hurnard', 'Madeline Hunter', 'Anna Jeffrey', 'Jack Prelutsky', 'China Mieville', 'Randy Shilts', 'P. C. Doherty', 'Darlene Marie Wilkinson', 'Robert N. Munsch', 'Sean Stewart', 'Anne McCaffrey', 'Iris Johansen', 'Carl Sagan', 'Christopher Pike', 'Lorene Cary', 'Timothy Findley', 'Louise Erdrich', 'Michele Albert', 'Sherryl Woods', 'Oliver W. Sacks', "Des O'Connor", 'Suzanne Schlosberg', 'Frank Herbert', 'Lorraine Heath', 'Stephanie Coontz', 'Robin Cook', 'Rosa Montero', 'Robert Barnard', 'Echo Heron', 'Peter Gethers', 'Charles Dickens', 'C. E. Crimmins', 'Richard Preston', 'Catherine Coulter', "Lesley O'Mara", 'Ellen Gilchrist', 'Danielle Steel', 'Jack Heffron', 'Lilian Jackson Braun', 'Anthony Horowitz', 'Sue Townsend', 'James Ellroy', 'Stephen King', 'Lori Foster', 'Robert K. Tanenbaum', 'Margaret Truman', 'Chitra Banerjee Divakaruni', 'Stephen King', 'Leo Buscaglia', 'Stephen Jay Gould', 'Roderick MacLeish', 'Danielle Steel', 'John M. Ford', 'Ray Bradbury', 'Jonathan Kellerman', 'Eric Carle', 'Andrew M. Greeley', 'Philip Yancey', 'Daniel Pennac', 'David Baldacci', 'Patricia Highsmith', 'Philip Shelby', 'Mickey Pearlman', 'Linda Fairstein', 'Stephanie Laurens', 'Shannon Holmes', 'Katharine Kerr', 'Carolyn See', 'Beverly Lewis', 'Octavia E. Butler', 'Jean Stone', 'Erma Bombeck', 'David Klass', 'Erma Bombeck', 'Dave Barry', 'Lee Child', 'C. J. Cherryh', 'Andy Riley', 'Isaac Asimov', 'Barbara Bretton', 'Spider Robinson', 'Laura Joh Rowland', 'Cassie Edwards', 'Louis Bayard', 'Oscar Hijuelos', 'John Grisham', 'Jennifer Crusie', 'Tom Magliozzi', 'John De Graaf', 'Jan Burke', 'Dean R. Koontz', 'Barbara Kingsolver', 'Paul Reiser', ... ]

In [9]:
sf['year']


Out[9]:
dtype: int
Rows: 18776
[1997, 2001, 1992, 2002, 2004, 2001, 1989, 1993, 2000, 2003, 1990, 2000, 1987, 2000, 2002, 1986, 1996, 1992, 1999, 1997, 1998, 1996, 1995, 1998, 2002, 2004, 2000, 2002, 1996, 1981, 1999, 1992, 1997, 1998, 1984, 1998, 1996, 1991, 2000, 1999, 1993, 1990, 0, 1984, 2002, 1992, 2002, 1993, 1991, 2001, 2001, 1996, 1997, 2000, 1986, 1983, 1982, 2002, 1991, 1993, 1992, 1987, 1993, 1986, 2001, 1995, 2002, 2001, 1999, 1994, 2002, 2000, 2001, 1993, 1999, 2001, 1999, 2003, 1993, 2002, 1992, 1993, 2003, 2000, 2003, 1991, 1996, 2002, 2001, 1996, 2003, 1992, 2000, 2001, 2001, 2002, 2002, 2000, 1990, 1994, ... ]

Some simple columnar operations


In [10]:
sf['year'].mean()


Out[10]:
1945.7877077119708

In [11]:
sf['year'].max()


Out[11]:
2030
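A mean of about 1945.79 is well below most of the years printed earlier, and the sf['year'] output contains a 0, which looks like a placeholder for an unknown year (an assumption; the source does not say). The small plain-Python sketch below, using a handful of values copied from that output, shows how a single such placeholder can pull a mean down.

```python
# Values copied from the sf['year'] output above; the trailing 0 is
# assumed to be a placeholder for an unknown publication year.
sample_years = [1997, 2001, 1992, 2002, 2004, 0]

def mean(values):
    # float() avoids integer division on Python 2.
    return sum(values) / float(len(values))

mean_with_placeholder = mean(sample_years)
mean_without_placeholder = mean([y for y in sample_years if y != 0])
```

On this sample the placeholder lowers the mean by several hundred years. The maximum of 2030 reported above points to implausible values at the other end of the range as well.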

Create new columns in our SFrame


In [12]:
sf


Out[12]:
book | author | year | publisher
The Law of Love | Laura Esquivel | 1997 | Three Rivers Press (CA)
Undercurrents | Frances Fyfield | 2001 | Viking Penguin Inc
Swept Away | Cay David | 1992 | Meteor Publishing Corporation ...
Interesting Women: Stories ... | Andrea Lee | 2002 | Random House
Sweet Revenge | Kate Clemens | 2004 | Kensington Publishing Corporation ...
Somewhere, Someday | Josephine Cox | 2001 | Trafalgar Square
Living by the Word: Selected Writings ... | Alice Walker | 1989 | Harvest Books
Hinds' Feet on High Places ... | Hannah Hurnard | 1993 | Tyndale House Publishers
By Possession | Madeline Hunter | 2000 | Bantam Books
The Love of a Cowboy | Anna Jeffrey | 2003 | Onyx Books
[18776 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [13]:
sf['full_name'] = sf['book'] + ' ' + sf['author'] + ' ' + sf['publisher']
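The `+` on string columns works element by element: row i of the result is built from row i of each input column. A minimal plain-Python analogue, with ordinary lists standing in for the columns (the list names are illustrative and the two rows are copied from the table above):

```python
# One plain list per column, holding the first two rows of the table.
books = ["The Law of Love", "Undercurrents"]
authors = ["Laura Esquivel", "Frances Fyfield"]
publishers = ["Three Rivers Press (CA)", "Viking Penguin Inc"]

# zip() pairs up the values that belong to the same row.
full_names = [
    book + " " + author + " " + publisher
    for book, author, publisher in zip(books, authors, publishers)
]
```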

In [14]:
sf


Out[14]:
book | author | year | publisher | full_name
The Law of Love | Laura Esquivel | 1997 | Three Rivers Press (CA) | The Law of Love Laura Esquivel Three Rivers ...
Undercurrents | Frances Fyfield | 2001 | Viking Penguin Inc | Undercurrents Frances Fyfield Viking Penguin ...
Swept Away | Cay David | 1992 | Meteor Publishing Corporation ... | Swept Away Cay David Meteor Publishing ...
Interesting Women: Stories ... | Andrea Lee | 2002 | Random House | Interesting Women: Stories Andrea Lee Ra ...
Sweet Revenge | Kate Clemens | 2004 | Kensington Publishing Corporation ... | Sweet Revenge Kate Clemens Kensington ...
Somewhere, Someday | Josephine Cox | 2001 | Trafalgar Square | Somewhere, Someday Josephine Cox Trafalgar ...
Living by the Word: Selected Writings ... | Alice Walker | 1989 | Harvest Books | Living by the Word: Selected Writings ...
Hinds' Feet on High Places ... | Hannah Hurnard | 1993 | Tyndale House Publishers | Hinds' Feet on High Places Hannah Hurnard ...
By Possession | Madeline Hunter | 2000 | Bantam Books | By Possession Madeline Hunter Bantam Books ...
The Love of a Cowboy | Anna Jeffrey | 2003 | Onyx Books | The Love of a Cowboy Anna Jeffrey Onyx Books ...
[18776 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [16]:
sf['year'] + 2


Out[16]:
dtype: int
Rows: 18776
[1999, 2003, 1994, 2004, 2006, 2003, 1991, 1995, 2002, 2005, 1992, 2002, 1989, 2002, 2004, 1988, 1998, 1994, 2001, 1999, 2000, 1998, 1997, 2000, 2004, 2006, 2002, 2004, 1998, 1983, 2001, 1994, 1999, 2000, 1986, 2000, 1998, 1993, 2002, 2001, 1995, 1992, 2, 1986, 2004, 1994, 2004, 1995, 1993, 2003, 2003, 1998, 1999, 2002, 1988, 1985, 1984, 2004, 1993, 1995, 1994, 1989, 1995, 1988, 2003, 1997, 2004, 2003, 2001, 1996, 2004, 2002, 2003, 1995, 2001, 2003, 2001, 2005, 1995, 2004, 1994, 1995, 2005, 2002, 2005, 1993, 1998, 2004, 2003, 1998, 2005, 1994, 2002, 2003, 2003, 2004, 2004, 2002, 1992, 1996, ... ]
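Adding a scalar to a numeric column applies the addition to every element. In plain Python terms this is roughly a list comprehension, sketched here on a few values copied from the sf['year'] output:

```python
# A few values copied from the sf['year'] output, including the 0 entry.
sample_years = [1997, 2001, 1992, 2002, 2004, 0]

# Build a new list; the original list is left untouched.
shifted = [year + 2 for year in sample_years]
```

The 0 entry becomes 2, which matches the stray 2 in the output above. The original values are unchanged because the result is never assigned back.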

Use the apply function to do an advanced transformation of our data


In [17]:
sf['year'].show()



In [18]:
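def transform_year(year):
    # Keep plausible publication years (strictly between 1945 and 2016);
    # return None for anything else so it is treated as a missing value.
    if 1945 < year < 2016:
        return year
    return None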

In [22]:
print(transform_year(1960))


1960

In [21]:
print(transform_year(2022))


None

In [24]:
print(transform_year(1900))


None

In [27]:
sf['year'].apply(transform_year)


Out[27]:
dtype: int
Rows: 18776
[1997, 2001, 1992, 2002, 2004, 2001, 1989, 1993, 2000, 2003, 1990, 2000, 1987, 2000, 2002, 1986, 1996, 1992, 1999, 1997, 1998, 1996, 1995, 1998, 2002, 2004, 2000, 2002, 1996, 1981, 1999, 1992, 1997, 1998, 1984, 1998, 1996, 1991, 2000, 1999, 1993, 1990, None, 1984, 2002, 1992, 2002, 1993, 1991, 2001, 2001, 1996, 1997, 2000, 1986, 1983, 1982, 2002, 1991, 1993, 1992, 1987, 1993, 1986, 2001, 1995, 2002, 2001, 1999, 1994, 2002, 2000, 2001, 1993, 1999, 2001, 1999, 2003, 1993, 2002, 1992, 1993, 2003, 2000, 2003, 1991, 1996, 2002, 2001, 1996, 2003, 1992, 2000, 2001, 2001, 2002, 2002, 2000, 1990, 1994, ... ]
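Conceptually, apply calls the function once per element and collects the results, so out-of-range years come back as None. The sketch below mimics that with a plain Python list. The sample values are hand-picked from outputs shown earlier (0 from the column, 2030 from the max), and the mean at the end skips the missing entries explicitly rather than relying on any GraphLab behavior.

```python
def transform_year(year):
    """Same rule as the notebook: keep years strictly between 1945 and 2016."""
    if 1945 < year < 2016:
        return year
    return None

# Hand-picked values that appear in the outputs above.
sample_years = [1997, 2001, 0, 1984, 2030]

# Element-wise application, analogous to sf['year'].apply(transform_year).
cleaned = [transform_year(y) for y in sample_years]

# Drop the missing entries before averaging; float() avoids
# integer division on Python 2.
valid = [y for y in cleaned if y is not None]
cleaned_mean = sum(valid) / float(len(valid))
```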

In [28]:
sf['year'] = sf['year'].apply(transform_year)

In [30]:
sf['year'].show()


