Reading data from Impala

GraphLab Create supports loading data from many standard data formats (CSV, Avro, JSON) and from data stores such as S3 and HDFS. We also have an ODBC connector, which works seamlessly for reading data directly from Cloudera's Impala.

Before trying this on your own computer, you'll need to make sure that you have the Cloudera ODBC driver installed.

Let's take a look at how simple it is to stream results from Impala queries directly into our scalable data structure, the SFrame.



In [4]:

    
import graphlab as gl



In [3]:

    
# configure your ODBC connection
db = gl.connect_odbc("DRIVER=/opt/cloudera/impalaodbc/lib/universal/" \
                     "libclouderaimpalaodbc.dylib;HOST=10.10.2.15;PORT=21050")

[INFO] Start server at: ipc:///tmp/graphlab_server-29804 - Server binary: /Users/rlvoyer/Envs/glc_pypi_1.3/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1423871383.log
[INFO] GraphLab Server Version: 1.3.0
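The connection string above is a set of semicolon-separated `KEY=value` pairs. As a minimal sketch, it can be assembled from its parts so that the host and port are easy to change. The helper name `build_odbc_connection_string` is hypothetical and not part of GraphLab Create; the driver path, host, and port are the ones used above.

```python
def build_odbc_connection_string(driver, host, port):
    """Join DRIVER, HOST and PORT into a semicolon-separated ODBC string.

    Hypothetical helper for illustration; not part of GraphLab Create.
    """
    parts = [("DRIVER", driver), ("HOST", host), ("PORT", str(port))]
    return ";".join("{0}={1}".format(key, value) for key, value in parts)


conn_str = build_odbc_connection_string(
    driver="/opt/cloudera/impalaodbc/lib/universal/libclouderaimpalaodbc.dylib",
    host="10.10.2.15",
    port=21050
)

# With GraphLab Create and the driver installed, the string would be used as:
# db = gl.connect_odbc(conn_str)
```

The result is the same string that is passed to `gl.connect_odbc` above, so the two forms are interchangeable.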

Cloudera Impala uses SQL as its query language. We can run a standard SQL DESCRIBE query to get a sense for what the data looks like.



In [5]:

    
# run a DESCRIBE query against the Amazon product titles table
gl.SFrame.from_odbc(db, "DESCRIBE titles")

    Out[5]:

    name             type    comment
    idx              bigint
    product_id       string
    num_reviews      int
    price            string
    simple_category  string
    title            string
    category_list_0  string
    category_list_1  string
    category_list_2  string
    category_list_3  string
    ...              ...     ...

[15 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
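The DESCRIBE result can also be inspected programmatically, for example to see which columns Impala stores as strings (note that `price` is one of them). The sketch below assumes that iterating over the resulting SFrame yields one dict per row; the helper names are hypothetical, and the sample rows are copied from the output above so that the snippet runs without an Impala connection.

```python
def schema_to_dict(rows):
    """Map column name -> column type from DESCRIBE result rows.

    `rows` is any iterable of dicts with "name" and "type" keys.
    """
    return dict((row["name"], row["type"]) for row in rows)


def columns_of_type(schema, type_name):
    """Return the sorted names of all columns with the given type."""
    return sorted(name for name, col_type in schema.items()
                  if col_type == type_name)


# A few rows copied from the DESCRIBE output above.
sample_rows = [
    {"name": "idx", "type": "bigint", "comment": ""},
    {"name": "product_id", "type": "string", "comment": ""},
    {"name": "num_reviews", "type": "int", "comment": ""},
    {"name": "price", "type": "string", "comment": ""},
]

schema = schema_to_dict(sample_rows)
string_columns = columns_of_type(schema, "string")

# Against a live connection, the rows would come from the query instead:
# schema = schema_to_dict(gl.SFrame.from_odbc(db, "DESCRIBE titles"))
```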

Cool! Now let's stream some data into an SFrame.



In [14]:

    
# run a simple SELECT to get titles for all products with more than 25 reviews
titles_sf = gl.SFrame.from_odbc(db, "SELECT title, num_reviews, simple_category FROM titles WHERE num_reviews > 25")
titles_sf

    Out[14]:

    title                                             num_reviews  simple_category
    reality                                           166          Music
    keeping heart on pine ridg ...                    26           Books
    eric meyer on css: mastering the languag ...      68           Books
    pierrot le fou (1969)                             52           Movies & TV
    the life of john wesley hardin as written by ...  27           Books
    snakes on a train (unrated director's ...         26           Movies & TV
    t2 : infiltra                                     35           Books
    drop dead fred [region 2] (1991) ...              161          Movies & TV
    loser goes first: my thirty-something year ...    32           Books
    irresistible (banning sisters trilogy) ...        29           Books
    ...                                               ...          ...

[71639 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
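If you want to try different review thresholds, the query text can be generated rather than typed by hand. This is a minimal sketch: `build_titles_query` is a hypothetical helper, not a GraphLab Create function, and it only covers the simple filter on `num_reviews` used above.

```python
def build_titles_query(columns, table, min_reviews):
    """Build a SELECT that keeps rows with more than `min_reviews` reviews.

    Hypothetical helper for illustration. The threshold is coerced to int
    so that only a number is ever interpolated into the SQL text.
    """
    threshold = int(min_reviews)
    return "SELECT {0} FROM {1} WHERE num_reviews > {2}".format(
        ", ".join(columns), table, threshold)


query = build_titles_query(
    ["title", "num_reviews", "simple_category"], "titles", 25)

# With a live connection, the query would be run the same way as above:
# titles_sf = gl.SFrame.from_odbc(db, query)
```

Column and table names are inserted as-is, so they should come from trusted code rather than from user input.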

We can use GraphLab Canvas to visualize the data.



In [15]:

    
titles_sf.show()

Canvas is accessible via web browser at the URL: http://localhost:63103/index.html

Now that we have our data in an SFrame, we're ready to start training predictive models and deploying them to production!
