Blaze Expressions

Blaze expressions describe what the user wants to compute, without referring to any particular data.

Blaze compute functions interpret those expressions against a specific backend.

The interface in this notebook is not intended for interactive use. You may find interactive expressions (like Data) more useful for data exploration.


In [1]:
from blaze import symbol, compute, join

In [2]:
bank = symbol('bank', '''1000 * {id: int, 
                                 name: string, 
                                 balance: int,
                                 lastseen: datetime}''')
bank  # no data to see here


/home/mrocklin/Software/anaconda/lib/python2.7/site-packages/IPython/core/formatters.py:239: FormatterWarning: Exception in text/html formatter: Expression does not contain data resources
  FormatterWarning,
Out[2]:
bank

In [3]:
deadbeats = bank[bank.balance < 0][['name', 'lastseen']]
deadbeats


Out[3]:
bank[bank.balance < 0][['name', 'lastseen']]

In [4]:
deadbeats.dshape


Out[4]:
dshape("var * {name: string, lastseen: datetime}")
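
The cells above build an expression without touching any data. A minimal sketch of that idea in plain Python follows; the classes `Symbol`, `Field`, `Less`, and `Selection` are hypothetical stand-ins written for this illustration, not Blaze's own classes. Each node only records what was asked for, so the resulting tree can be printed or inspected before any backend is involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    """A named placeholder for a dataset; it holds no rows."""
    name: str

    def __str__(self):
        return self.name


@dataclass(frozen=True)
class Field:
    """Selects one column of a child expression."""
    child: object
    column: str

    def __str__(self):
        return f"{self.child}.{self.column}"


@dataclass(frozen=True)
class Less:
    """Records a `<` comparison without evaluating it."""
    lhs: object
    rhs: object

    def __str__(self):
        return f"{self.lhs} < {self.rhs}"


@dataclass(frozen=True)
class Selection:
    """Records a row filter of `child` by `predicate`."""
    child: object
    predicate: object

    def __str__(self):
        return f"{self.child}[{self.predicate}]"


# Build the tree for "rows of bank with a negative balance".
bank_toy = Symbol("bank")
deadbeats_toy = Selection(bank_toy, Less(Field(bank_toy, "balance"), 0))
text = str(deadbeats_toy)
print(text)
```

Nothing here computes a result; the tree is only a description that an interpreter could later walk.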

Compute recipes

Blaze interprets expressions against backends by consulting a repository of small recipes.

We look at some simple recipes for Python, Pandas, and Spark.


In [5]:
L = [[1, 'Alice',   100],
     [2, 'Bob',    -200],
     [3, 'Charlie', 300],
     [4, 'Dennis',  400],
     [5, 'Edith',  -500]]

from pandas import DataFrame

df = DataFrame([[1, 'Alice',   100],
                [2, 'Bob',    -200],
                [3, 'Charlie', 300],
                [4, 'Dennis',  400],
                [5, 'Edith',  -500]], columns=['id', 'name', 'balance'])

import pyspark

sc = pyspark.SparkContext('local', 'blaze-app')
rdd = sc.parallelize(L)

bank = symbol('bank', '''1000 * {id: int, 
                                 name: string, 
                                 balance: int}''')

deadbeats = bank[bank.balance < 0].name

In [6]:
compute(deadbeats, L)


Out[6]:
<itertools.chain at 0x7f1c47d14dd0>

In [7]:
compute(deadbeats, df)


Out[7]:
1      Bob
4    Edith
Name: name, dtype: object

In [8]:
compute(deadbeats, rdd)


Out[8]:
PythonRDD[1] at RDD at PythonRDD.scala:43
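
Two of the three results above are lazy: the Python backend returns an iterator and the Spark backend returns an RDD, so neither shows any names until it is consumed. The sketch below imitates the Python case with a plain generator rather than calling Blaze; the function `deadbeat_names` is a hypothetical helper for this illustration. For the Spark result, calling `collect()` on the returned RDD would play the same role as `list()` does here.

```python
rows = [[1, 'Alice',   100],
        [2, 'Bob',    -200],
        [3, 'Charlie', 300],
        [4, 'Dennis',  400],
        [5, 'Edith',  -500]]


def deadbeat_names(records):
    """Lazily yield the name of every record with a negative balance."""
    return (name for _id, name, balance in records if balance < 0)


lazy = deadbeat_names(rows)   # nothing has been filtered yet
names = list(lazy)            # consuming the generator does the work
print(names)
```

The generator can be consumed only once, so store the list if the result is needed again.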

How Blaze handles numeric evaluation

or, how to stay sane while trying to engage the entire Python ecosystem


In [9]:
from blaze.compute.core import compute_up

In [10]:
compute_up.source(bank.head(), df)


File: /home/mrocklin/workspace/blaze/blaze/compute/pandas.py

@dispatch(Head, (Series, DataFrame))
def compute_up(t, df, **kwargs):
    return df.head(t.n)


In [11]:
compute_up.source(bank.head(), L)


File: /home/mrocklin/workspace/blaze/blaze/compute/python.py

@dispatch(Head, Sequence)
def compute_up(t, seq, **kwargs):
    if t.n < 100:
        return tuple(take(t.n, seq))
    else:
        return take(t.n, seq)


In [12]:
compute_up.source(bank.head(), rdd)


File: /home/mrocklin/workspace/blaze/blaze/compute/spark.py

@dispatch(Head, RDD)
def compute_up(t, rdd, **kwargs):
    return rdd.take(t.n)
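
The three recipes above share one name and differ only in the types they are registered for. A minimal sketch of that selection mechanism follows, assuming a simple first-match registry; `recipe`, `toy_compute_up`, and this `Head` class are hypothetical names for the illustration and are not how Blaze's `dispatch` decorator is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Head:
    """Toy expression node: take the first n elements."""
    n: int


_recipes = []  # (expression type, data type, function) in registration order


def recipe(expr_type, data_type):
    """Register a function for one (expression type, data type) pair."""
    def register(func):
        _recipes.append((expr_type, data_type, func))
        return func
    return register


def toy_compute_up(expr, data):
    """Run the first registered recipe whose types match the arguments."""
    for expr_type, data_type, func in _recipes:
        if isinstance(expr, expr_type) and isinstance(data, data_type):
            return func(expr, data)
    raise NotImplementedError(
        f"no recipe for ({type(expr).__name__}, {type(data).__name__})")


@recipe(Head, list)
def head_of_list(expr, seq):
    return seq[:expr.n]


@recipe(Head, dict)
def head_of_columns(expr, columns):
    # Column-oriented data: truncate every column to n entries.
    return {key: values[:expr.n] for key, values in columns.items()}


from_list = toy_compute_up(Head(2), [10, 20, 30])
from_columns = toy_compute_up(Head(1), {'id': [1, 2], 'name': ['Alice', 'Bob']})
print(from_list, from_columns)
```

Supporting a new backend in this scheme means registering more small functions, with no change to the expression classes or to existing recipes.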

N-Dimensional example


In [13]:
x = symbol('x', '1000 * 1000 * {measurement: float32, timestamp: datetime}')
x


Out[13]:
x

In [14]:
expr = x[:500].measurement.sum(axis=1)
expr


Out[14]:
sum(x[:500].measurement, axis=(1,))

In [15]:
expr.dshape


Out[15]:
dshape("500 * float32")
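
The leading dimension of this dshape can be worked out from the shape alone, with no data present. The sketch below traces that arithmetic in plain Python; `slice_first_axis` and `reduce_axis` are hypothetical helpers for this illustration, not Blaze functions, and they track only the dimensions, not the record type.

```python
def slice_first_axis(shape, s):
    """Shape after applying slice `s` to the first axis."""
    first = len(range(*s.indices(shape[0])))
    return (first,) + tuple(shape[1:])


def reduce_axis(shape, axis):
    """Shape after a reduction such as sum removes `axis`."""
    return tuple(n for i, n in enumerate(shape) if i != axis)


shape = (1000, 1000)                                     # dimensions of x
after_slice = slice_first_axis(shape, slice(None, 500))  # x[:500]
after_sum = reduce_axis(after_slice, axis=1)             # .sum(axis=1)
print(after_slice, after_sum)
```

Slicing keeps 500 of the 1000 rows, and summing over axis 1 removes the second dimension, leaving a single dimension of length 500.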