<img src="images/continuum_analytics_logo.png" alt="Continuum Logo" align="right" width="30%">

Introduction to Blaze

In this 45-minute tutorial we'll learn how to use Blaze to discover, migrate, and query data living in other databases. The tutorial has three parts:

  1. into - Move data to database
  2. blaze - Query data in database
  3. remote - What if the data and database are on an HDFS-backed cluster?

Goal: Accessible, Interactive, Analytic Queries

NumPy and Pandas provide accessible, interactive, analytic queries; this is valuable.


In [1]:
import pandas as pd
df = pd.read_csv('iris.csv')
df.head()


Out[1]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

In [5]:
df.groupby(df.species).petal_length.mean()  # Average petal length per species


Out[5]:
species
Iris-setosa        1.462
Iris-versicolor    4.260
Iris-virginica     5.552
Name: petal_length, dtype: float64

If your data fits on your computer then this is probably the way to go, and you can stop reading right now.

From now on, we're going to assume one of the following:

  1. You have an inconvenient amount of data
  2. That data should live someplace other than your computer

Databases and Python

When in-memory arrays/dataframes cease to be an option, we turn to databases. These live outside of the Python process and so might be less convenient. The open source Python ecosystem includes libraries to interact with these databases and with foreign data in general.

Examples:

Today we're going to use some of these indirectly with into and Blaze. We'll try to point out these libraries as we automate them so that, if you'd like, you can use them independently.

Blaze and into

The Blaze and into projects give a consistent interface over many of the libraries above. They strive to trivialize common tasks.

  • into moves data from place to place and from format to format
  • blaze queries data in databases

We're going to start with into, learning how to migrate data between formats, between computers, and into databases. We'll then use Blaze to perform analytic queries on that data.
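To make those two steps concrete, here is a minimal by-hand sketch of "migrate, then query" using only Python's standard library (`csv` and `sqlite3`). It is not how into or Blaze work internally; it only illustrates the kind of boilerplate they aim to remove. The helper names `load_iris` and `mean_petal_length`, the in-memory SQLite database, and the inlined CSV text (the five rows shown earlier, rewritten as comma-separated values) are assumptions made for illustration.

```python
import csv
import io
import sqlite3

# The five iris rows displayed earlier, inlined so the sketch is self-contained.
# (Assumption: the real iris.csv is comma-separated with this header.)
IRIS_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
"""


def load_iris(conn, csv_text):
    """Step 1 (migrate): copy CSV rows into a SQLite table named iris."""
    conn.execute(
        "CREATE TABLE iris ("
        "sepal_length REAL, sepal_width REAL, "
        "petal_length REAL, petal_width REAL, species TEXT)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        (
            float(r["sepal_length"]),
            float(r["sepal_width"]),
            float(r["petal_length"]),
            float(r["petal_width"]),
            r["species"],
        )
        for r in reader
    ]
    conn.executemany("INSERT INTO iris VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def mean_petal_length(conn):
    """Step 2 (query): average petal length per species, computed by the database."""
    cur = conn.execute(
        "SELECT species, AVG(petal_length) FROM iris "
        "GROUP BY species ORDER BY species"
    )
    return dict(cur.fetchall())


conn = sqlite3.connect(":memory:")  # hypothetical throwaway database
n_rows = load_iris(conn, IRIS_CSV)
averages = mean_petal_length(conn)
print(n_rows, averages)
```

Even for one small table, the schema, the type conversions, and the SQL all have to be written by hand. That is the repetitive work the rest of this tutorial hands off to into and Blaze.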

Teaser

We'll eventually do things like this:

>>> from into import into
>>> into('hive://hostname/default::iris', 'iris.csv')  # Move local data onto HDFS and register with Hive

>>> from blaze import Data, by
>>> db = Data('hive://hostname/default')
>>> by(db.species, avg=db.petal_length.mean())
...

Not a Magic Bullet

Blaze and into make easy things trivial, but they are not a replacement for intimate knowledge of your database or of your problem. These projects are intended to enable non-expert users to do everyday tasks.