Reading data from Impala

GraphLab Create supports loading data from many standard data formats (CSV, Avro, JSON) and from data stores such as S3 and HDFS. It also has an ODBC connector, which can read data directly from Cloudera's Impala.

Before trying this on your own computer, you'll need to make sure that you have the Cloudera ODBC driver installed.

Let's take a look at how simple it is to stream results from Impala queries directly into our scalable data structure, the SFrame.


In [4]:
import graphlab as gl

In [3]:
# configure your ODBC connection
db = gl.connect_odbc("DRIVER=/opt/cloudera/impalaodbc/lib/universal/" \
                     "libclouderaimpalaodbc.dylib;HOST=10.10.2.15;PORT=21050")


[INFO] Start server at: ipc:///tmp/graphlab_server-29804 - Server binary: /Users/rlvoyer/Envs/glc_pypi_1.3/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1423871383.log
[INFO] GraphLab Server Version: 1.3.0
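
The connection string above is a semicolon-separated list of KEY=VALUE pairs. As a minimal sketch, it can be assembled from its parts so that the host and port are easy to change. The helper name build_odbc_conn_str is hypothetical and not part of GraphLab Create; the driver path, host, and port are the ones used in the cell above.

```python
def build_odbc_conn_str(driver, host, port):
    """Join ODBC settings into a 'KEY=VALUE;KEY=VALUE' connection string.

    Hypothetical helper for illustration; not part of GraphLab Create.
    """
    parts = [("DRIVER", driver), ("HOST", host), ("PORT", str(port))]
    return ";".join("{}={}".format(key, value) for key, value in parts)


conn_str = build_odbc_conn_str(
    "/opt/cloudera/impalaodbc/lib/universal/libclouderaimpalaodbc.dylib",
    "10.10.2.15",
    21050,
)
print(conn_str)

# With GraphLab Create and the Cloudera ODBC driver installed, the string
# would be passed to the connector exactly as in the cell above:
# db = gl.connect_odbc(conn_str)
```

The resulting string is the same one passed to gl.connect_odbc earlier, so the two forms are interchangeable.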

Cloudera Impala uses SQL as its query language. We can run a standard SQL DESCRIBE query to get a sense of what the data looks like.


In [5]:
# run a DESCRIBE query against the Amazon product titles table
gl.SFrame.from_odbc(db, "DESCRIBE titles")


Out[5]:
name             type    comment
idx              bigint
product_id       string
num_reviews      int
price            string
simple_category  string
title            string
category_list_0  string
category_list_1  string
category_list_2  string
category_list_3  string
...              ...     ...
[15 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Cool! Now let's stream some data into an SFrame.


In [14]:
# run a simple SELECT to get titles for all products with more than 25 reviews
titles_sf = gl.SFrame.from_odbc(db, "SELECT title, num_reviews, simple_category FROM titles WHERE num_reviews > 25")
titles_sf


Out[14]:
title                                             num_reviews  simple_category
reality                                           166          Music
keeping heart on pine ridg ...                    26           Books
eric meyer on css: mastering the languag ...      68           Books
pierrot le fou (1969)                             52           Movies & TV
the life of john wesley hardin as written by ...  27           Books
snakes on a train (unrated director's ...         26           Movies & TV
t2 : infiltra                                     35           Books
drop dead fred [region 2] (1991) ...              161          Movies & TV
loser goes first: my thirty-something year ...    32           Books
irresistible (banning sisters trilogy) ...        29           Books
...                                               ...          ...
[71639 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
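
The review threshold in the SELECT statement is hard-coded in the query string. One way to vary it, sketched here under the assumption that the query text is built in Python before being handed to the connector, is a small function that validates the threshold as an integer and formats it into the SQL. The function name titles_query is hypothetical; the table and column names come from the query above.

```python
def titles_query(min_reviews):
    """Build the SELECT used above for a given review-count threshold.

    Hypothetical helper for illustration. int() rejects anything that is
    not a plain integer, so arbitrary text cannot leak into the SQL.
    """
    threshold = int(min_reviews)
    return ("SELECT title, num_reviews, simple_category "
            "FROM titles WHERE num_reviews > {}".format(threshold))


query = titles_query(25)
print(query)

# With an open connection `db`, the query would be run as before:
# titles_sf = gl.SFrame.from_odbc(db, query)
```

Calling titles_query(100) would produce the stricter "more than 100 reviews" variant without editing the SQL by hand.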

We can use GraphLab Canvas to visualize the data.


In [15]:
titles_sf.show()


Canvas is accessible via web browser at the URL: http://localhost:63103/index.html

Now that we have our data in an SFrame, we're ready to start training predictive models and deploying them to production!