GraphLab Create supports loading data from many standard data formats (CSV, Avro, JSON) and from data stores such as S3 and HDFS. We also have an ODBC connector, which works seamlessly for reading data directly from Cloudera's Impala.
Before trying this on your own computer, you'll need to make sure that you have the Cloudera ODBC driver installed.
Let's take a look at how simple it is to stream results from Impala queries directly into our scalable data structure, the SFrame.
In [4]:
import graphlab as gl
In [3]:
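# configure your ODBC connection: point DRIVER at your installed Cloudera
# Impala ODBC library, and HOST/PORT at your own Impala daemon
db = gl.connect_odbc("DRIVER=/opt/cloudera/impalaodbc/lib/universal/"
                     "libclouderaimpalaodbc.dylib;HOST=10.10.2.15;PORT=21050")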
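The connection string above is a semicolon-separated list of DRIVER, HOST, and PORT settings. If you connect to more than one cluster, it can be easier to assemble that string from its parts. Here is a minimal sketch; `build_impala_conn_str` is a hypothetical helper written for this example, not part of GraphLab Create, and the driver path and host are simply the values used above.

```python
def build_impala_conn_str(driver, host, port=21050):
    """Assemble a DRIVER/HOST/PORT ODBC connection string (hypothetical helper)."""
    return "DRIVER={0};HOST={1};PORT={2}".format(driver, host, port)


conn_str = build_impala_conn_str(
    driver="/opt/cloudera/impalaodbc/lib/universal/libclouderaimpalaodbc.dylib",
    host="10.10.2.15",
)

# With GraphLab Create and the driver installed, pass the string on as before:
# db = gl.connect_odbc(conn_str)
```

Swap in the driver path for your own platform and the address of your own Impala daemon.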
Cloudera Impala uses SQL as its query language. We can run a standard SQL DESCRIBE query to see the table's columns and their types.
In [5]:
# run a DESCRIBE query against the Amazon product titles table
gl.SFrame.from_odbc(db, "DESCRIBE titles")
Out[5]:
Cool! Now let's stream some data into an SFrame.
In [14]:
# run a simple SELECT to get titles for all products with more than 25 reviews
titles_sf = gl.SFrame.from_odbc(db, "SELECT title, num_reviews, simple_category "
                                    "FROM titles WHERE num_reviews > 25")
titles_sf
Out[14]:
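If you expect to change the review threshold, one option is to keep it in a variable and build the query text from it, so the number lives in a single place. This is only a sketch: `MIN_REVIEWS` and `query` are names made up for this example, and the table and column names are the ones used in the query above.

```python
# Hypothetical threshold variable; 25 matches the WHERE clause used above
MIN_REVIEWS = 25

# Format the threshold as an integer so only a number can end up in the SQL text
query = ("SELECT title, num_reviews, simple_category "
         "FROM titles WHERE num_reviews > {0:d}".format(MIN_REVIEWS))

# With an open connection, run it exactly as before:
# titles_sf = gl.SFrame.from_odbc(db, query)
```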
We can use GraphLab Canvas to visualize the data.
In [15]:
titles_sf.show()
Now that we have our data in an SFrame, we're ready to start training predictive models and deploying them to production!