Blaze - Symbolic Data Analysis

We demonstrate Blaze's symbolic nature by comparing its use to Pandas.



In [1]:

    
import blaze as bz
import pandas as pd



In [2]:

    
bz.__version__









    Out[2]:





'0.6.5'

On The Surface

Pandas and Blaze feel similar



In [3]:

    
df = pd.read_csv('/home/mrocklin/workspace/blaze/examples/data/iris.csv')
df.head(10)









    Out[3]:






  
    
      
      sepal_length
      sepal_width
      petal_length
      petal_width
      species
    
  
  
    
      0
       5.1
       3.5
       1.4
       0.2
       Iris-setosa
    
    
      1
       4.9
       3.0
       1.4
       0.2
       Iris-setosa
    
    
      2
       4.7
       3.2
       1.3
       0.2
       Iris-setosa
    
    
      3
       4.6
       3.1
       1.5
       0.2
       Iris-setosa
    
    
      4
       5.0
       3.6
       1.4
       0.2
       Iris-setosa
    
    
      5
       5.4
       3.9
       1.7
       0.4
       Iris-setosa
    
    
      6
       4.6
       3.4
       1.4
       0.3
       Iris-setosa
    
    
      7
       5.0
       3.4
       1.5
       0.2
       Iris-setosa
    
    
      8
       4.4
       2.9
       1.4
       0.2
       Iris-setosa
    
    
      9
       4.9
       3.1
       1.5
       0.1
       Iris-setosa



In [4]:

    
t = bz.Table('/home/mrocklin/workspace/blaze/examples/data/iris.csv')
t.head(10)









    Out[4]:





  
    
      
      sepal_length
      sepal_width
      petal_length
      petal_width
      species
    
  
  
    
      0
       5.1
       3.5
       1.4
       0.2
       Iris-setosa
    
    
      1
       4.9
       3.0
       1.4
       0.2
       Iris-setosa
    
    
      2
       4.7
       3.2
       1.3
       0.2
       Iris-setosa
    
    
      3
       4.6
       3.1
       1.5
       0.2
       Iris-setosa
    
    
      4
       5.0
       3.6
       1.4
       0.2
       Iris-setosa
    
    
      5
       5.4
       3.9
       1.7
       0.4
       Iris-setosa
    
    
      6
       4.6
       3.4
       1.4
       0.3
       Iris-setosa
    
    
      7
       5.0
       3.4
       1.5
       0.2
       Iris-setosa
    
    
      8
       4.4
       2.9
       1.4
       0.2
       Iris-setosa
    
    
      9
       4.9
       3.1
       1.5
       0.1
       Iris-setosa



In [5]:

    
t.species.distinct()









    Out[5]:





  
    
      
      species
    
  
  
    
      0
           Iris-setosa
    
    
      1
       Iris-versicolor
    
    
      2
        Iris-virginica

Both perform analytic queries

For example, lets compute the minimum sepal length per species



In [6]:

    
df.groupby('species').sepal_length.min()









    Out[6]:





species
Iris-setosa        4.3
Iris-versicolor    4.9
Iris-virginica     4.9
Name: sepal_length, dtype: float64



In [7]:

    
bz.by(t.species, 
      smallest=t.sepal_length.min())









    Out[7]:





  
    
      
      species
      smallest
    
  
  
    
      0
        Iris-virginica
       4.9
    
    
      1
           Iris-setosa
       4.3
    
    
      2
       Iris-versicolor
       4.9

But internally they're quite different

As an example, lets see how they handle interaction with SQL databases

SQL Interaction



In [8]:

    
# Open up SQLAlchemy engine
import sqlalchemy
engine = sqlalchemy.create_engine('sqlite:////home/mrocklin/workspace/blaze/examples/data/iris.db')

# Pull data from SQLite to Pandas
df = pd.read_sql('SELECT * FROM iris', engine)
df.head(10)









    Out[8]:






  
    
      
      sepal_length
      sepal_width
      petal_length
      petal_width
      species
    
  
  
    
      0
       5.1
       3.5
       1.4
       0.2
       Iris-setosa
    
    
      1
       4.9
       3.0
       1.4
       0.2
       Iris-setosa
    
    
      2
       4.7
       3.2
       1.3
       0.2
       Iris-setosa
    
    
      3
       4.6
       3.1
       1.5
       0.2
       Iris-setosa
    
    
      4
       5.0
       3.6
       1.4
       0.2
       Iris-setosa
    
    
      5
       5.4
       3.9
       1.7
       0.4
       Iris-setosa
    
    
      6
       4.6
       3.4
       1.4
       0.3
       Iris-setosa
    
    
      7
       5.0
       3.4
       1.5
       0.2
       Iris-setosa
    
    
      8
       4.4
       2.9
       1.4
       0.2
       Iris-setosa
    
    
      9
       4.9
       3.1
       1.5
       0.1
       Iris-setosa



In [9]:

    
# Point Blaze to SQLite table
t = bz.Table('sqlite:////home/mrocklin/workspace/blaze/examples/data/iris.db::iris')
t.head(10)









    Out[9]:





  
    
      
      sepal_length
      sepal_width
      petal_length
      petal_width
      species
    
  
  
    
      0 
       5.1
       3.5
       1.4
       0.2
       Iris-setosa
    
    
      1 
       4.9
       3.0
       1.4
       0.2
       Iris-setosa
    
    
      2 
       4.7
       3.2
       1.3
       0.2
       Iris-setosa
    
    
      3 
       4.6
       3.1
       1.5
       0.2
       Iris-setosa
    
    
      4 
       5.0
       3.6
       1.4
       0.2
       Iris-setosa
    
    
      5 
       5.4
       3.9
       1.7
       0.4
       Iris-setosa
    
    
      6 
       4.6
       3.4
       1.4
       0.3
       Iris-setosa
    
    
      7 
       5.0
       3.4
       1.5
       0.2
       Iris-setosa
    
    
      8 
       4.4
       2.9
       1.4
       0.2
       Iris-setosa
    
    
      9 
       4.9
       3.1
       1.5
       0.1
       Iris-setosa
    
    
      10
       5.4
       3.7
       1.5
       0.2
       Iris-setosa

And again our analytic query



In [10]:

    
df.groupby('species').sepal_length.min()









    Out[10]:





species
Iris-setosa        4.3
Iris-versicolor    4.9
Iris-virginica     4.9
Name: sepal_length, dtype: float64



In [11]:

    
bz.by(t.species, 
      smallest=t.sepal_length.min())









    Out[11]:





  
    
      
      species
      smallest
    
  
  
    
      0
           Iris-setosa
       4.3
    
    
      1
       Iris-versicolor
       4.9
    
    
      2
        Iris-virginica
       4.9

Externally things looks the same

Internally very different things happened

Pandas pulled data from the SQLite database into local memory, then used pandas algorithms on that data.

Pandas $\leftarrow_\textrm{data}$ SQLite
Pandas did the work

Blaze generated SQL and passed that back to the database

Blaze $\leftarrow_\textrm{metadata}$ SQLite
Blaze $\rightarrow_\textrm{SQL}$ SQLite
SQLite did the work

Lets see the generated SQL

Blaze speaks SQLAlchemy



In [12]:

    
expr = bz.by(t.species, 
             smallest=t.sepal_length.min())

result = bz.compute(expr, {t: t.data.table})
result









    Out[12]:





<sqlalchemy.sql.selectable.Select at 0x7f1c030aa3d0; Select object>

SQLAlchemy in turn speaks SQLite



In [13]:

    
print result









    



SELECT iris.species, min(iris.sepal_length) AS smallest 
FROM iris GROUP BY iris.species

A more complex example in SQLAlchemy



In [14]:

    
print bz.compute(t[t.sepal_length > 5].species.count_values(), 
                 {t: t.data.table})









    



SELECT iris.species, count(iris.species) AS count 
FROM iris 
WHERE iris.sepal_length > :sepal_length_1 GROUP BY iris.species ORDER BY count DESC

What users actually experience



In [15]:

    
t[t.sepal_length > 5].species.count_values()









    Out[15]:





  
    
      
      species
      count
    
  
  
    
      0
        Iris-virginica
       49
    
    
      1
       Iris-versicolor
       47
    
    
      2
           Iris-setosa
       22

Main Points

Blaze feels like NumPy/Pandas
Today, Blaze doesn't perform computation
Instead, Blaze manages metadata and APIs

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa
5	5.4	3.9	1.7	0.4	Iris-setosa
6	4.6	3.4	1.4	0.3	Iris-setosa
7	5.0	3.4	1.5	0.2	Iris-setosa
8	4.4	2.9	1.4	0.2	Iris-setosa
9	4.9	3.1	1.5	0.1	Iris-setosa