Blaze - Symbolic Data Analysis

We demonstrate Blaze's symbolic nature by comparing its use to Pandas.


In [1]:
import blaze as bz
import pandas as pd

In [2]:
bz.__version__


Out[2]:
'0.6.5'

On The Surface

Pandas and Blaze feel similar


In [3]:
df = pd.read_csv('/home/mrocklin/workspace/blaze/examples/data/iris.csv')
df.head(10)


Out[3]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa

In [4]:
t = bz.Table('/home/mrocklin/workspace/blaze/examples/data/iris.csv')
t.head(10)


Out[4]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa

In [5]:
t.species.distinct()


Out[5]:
species
0 Iris-setosa
1 Iris-versicolor
2 Iris-virginica

Both perform analytic queries

For example, lets compute the minimum sepal length per species


In [6]:
df.groupby('species').sepal_length.min()


Out[6]:
species
Iris-setosa        4.3
Iris-versicolor    4.9
Iris-virginica     4.9
Name: sepal_length, dtype: float64

In [7]:
bz.by(t.species, 
      smallest=t.sepal_length.min())


Out[7]:
species smallest
0 Iris-virginica 4.9
1 Iris-setosa 4.3
2 Iris-versicolor 4.9

But internally they're quite different

As an example, lets see how they handle interaction with SQL databases

SQL Interaction


In [8]:
# Open up SQLAlchemy engine
import sqlalchemy
engine = sqlalchemy.create_engine('sqlite:////home/mrocklin/workspace/blaze/examples/data/iris.db')

# Pull data from SQLite to Pandas
df = pd.read_sql('SELECT * FROM iris', engine)
df.head(10)


Out[8]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa

In [9]:
# Point Blaze to SQLite table
t = bz.Table('sqlite:////home/mrocklin/workspace/blaze/examples/data/iris.db::iris')
t.head(10)


Out[9]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa

And again our analytic query


In [10]:
df.groupby('species').sepal_length.min()


Out[10]:
species
Iris-setosa        4.3
Iris-versicolor    4.9
Iris-virginica     4.9
Name: sepal_length, dtype: float64

In [11]:
bz.by(t.species, 
      smallest=t.sepal_length.min())


Out[11]:
species smallest
0 Iris-setosa 4.3
1 Iris-versicolor 4.9
2 Iris-virginica 4.9

Externally things looks the same

Internally very different things happened

Pandas pulled data from the SQLite database into local memory, then used pandas algorithms on that data.

  1. Pandas $\leftarrow_\textrm{data}$ SQLite
  2. Pandas did the work

Blaze generated SQL and passed that back to the database

  1. Blaze $\leftarrow_\textrm{metadata}$ SQLite
  2. Blaze $\rightarrow_\textrm{SQL}$ SQLite
  3. SQLite did the work

Lets see the generated SQL

Blaze speaks SQLAlchemy


In [12]:
expr = bz.by(t.species, 
             smallest=t.sepal_length.min())

result = bz.compute(expr, {t: t.data.table})
result


Out[12]:
<sqlalchemy.sql.selectable.Select at 0x7f1c030aa3d0; Select object>

SQLAlchemy in turn speaks SQLite


In [13]:
print result


SELECT iris.species, min(iris.sepal_length) AS smallest 
FROM iris GROUP BY iris.species

A more complex example in SQLAlchemy


In [14]:
print bz.compute(t[t.sepal_length > 5].species.count_values(), 
                 {t: t.data.table})


SELECT iris.species, count(iris.species) AS count 
FROM iris 
WHERE iris.sepal_length > :sepal_length_1 GROUP BY iris.species ORDER BY count DESC

What users actually experience


In [15]:
t[t.sepal_length > 5].species.count_values()


Out[15]:
species count
0 Iris-virginica 49
1 Iris-versicolor 47
2 Iris-setosa 22

Main Points

  1. Blaze feels like NumPy/Pandas
  2. Today, Blaze doesn't perform computation
  3. Instead, Blaze manages metadata and APIs