<img src="images/continuum_analytics_logo.png" alt="Continuum Logo", align="right", width="30%">,

Blaze

Blaze translates a subset of numpy/pandas syntax into database queries. It hides away the database.

On simple datasets, like CSV files, Blaze acts like Pandas with slightly different syntax. In this case Blaze is just using Pandas.

Pandas example


In [ ]:
import pandas as pd

df = pd.read_csv('iris.csv')
df.head(5)

In [ ]:
df.species.drop_duplicates()

Blaze example


In [ ]:
import blaze as bz

d = bz.Data('iris.csv')
d.head(5)

In [ ]:
d.species.distinct()

Foreign Data

Blaze does different things under-the-hood on different kinds of data

  • CSV files: Pandas DataFrames (or iterators of DataFrames)
  • SQL tables: SQLAlchemy.
  • Mongo collections: PyMongo
  • ...

SQL

We'll play with SQL a lot during this tutorial. Blaze translates your query to SQLAlchemy. SQLAlchemy then translates to the SQL dialect of your database, your database then executes that query intelligently.

  • Blaze $\rightarrow$ SQLAlchemy $\rightarrow$ SQL $\rightarrow$ Database computation

This translation process lets analysts interact with a familiar interface while leveraging a potentially powerful database.

To keep things local we'll use SQLite, but this works with any database with a SQLAlchemy dialect. Examples in this section use the iris dataset. Exercises use the Lahman Baseball statistics database, year 2013.

If you have not downloaded this dataset you could do so here - https://github.com/jknecht/baseball-archive-sqlite/raw/master/lahman2013.sqlite.

PostgreSQL

Alternatively we have set up a PostgreSQL instance on EC2 with this same data.

postgresql://postgres:postgres@ec2-54-159-160-163.compute-1.amazonaws.com

If you want to use the Postgres instance you will have to install psycopg2. If you are on Mac/Linux you can probably just

conda install psycopg2

If you are on Windows you should download an installer here http://www.stickpeople.com/projects/python/win-psycopg/

Examples

Lets dive into Blaze Syntax. For simple queries it looks and feels similar to Pandas


In [ ]:
db = bz.Data('sqlite:///my.db')
db.iris

In [ ]:
db.iris.species.distinct()

In [ ]:
db.iris[db.iris.species == 'Iris-versicolor'][['species', 'sepal_length']]

Work happens on the database

If we were using pandas we would read the table into pandas, then use pandas' fast in-memory algorithms for computation. Here we translate your query into SQL and then send that query to the database to do the work.

  • Pandas $\leftarrow_\textrm{data}$ SQL, then Pandas computes
  • Blaze $\rightarrow_\textrm{query}$ SQL, then database computes

If we want to dive into the internal API we can inspect the query that Blaze transmits.


In [ ]:
# Inspect SQL query
query = db.iris[db.iris.species == 'Iris-versicolor'][['species', 'sepal_length']]
print bz.compute(query)

In [ ]:
query = bz.by(db.iris.species, longest=db.iris.petal_length.max(),
                              shortest=db.iris.petal_length.min())
print bz.compute(query)

Exercises

Now we load the Lahman baseball database and perform similar queries


In [ ]:
# db = bz.Data('postgresql://postgres:postgres@ec2-54-159-160-163.compute-1.amazonaws.com')  # Use Postgres if you don't have the sqlite file
db = bz.Data('sqlite:///lahman2013.sqlite')
db.dshape

In [ ]:
# View the Salaries table

In [ ]:
# What are the distinct teamIDs in the Salaries table?

In [ ]:
# What is the minimum and maximum yearID in the Sarlaries table?

In [ ]:
# For the Oakland Athletics (teamID OAK), pick out the playerID, salary, and yearID columns

In [ ]:
# Sort that result by salary.  
# Use the ascending=False keyword argument to the sort function to find the highest paid players

Example: Split-apply-combine

In Pandas we perform computations on a per-group basis with the groupby operator. In Blaze our syntax is slightly different, using instead the by function.


In [ ]:
import pandas as pd
iris = pd.read_csv('iris.csv')
iris.groupby('species').petal_length.min()

In [ ]:
iris = bz.Data('sqlite:///my.db::iris')
bz.by(iris.species, largest=iris.petal_length.max(), 
                    smallest=iris.petal_length.min())

Exercise: Split-apply-combine


In [ ]:
# Find the average and maximum salary by team
# If you like, also find the ratio between the highest and lowest paid players, sort by that ratio

In [ ]:
# Track the average and maximum salary over time

Store Results

By default Blaze only shows us the first ten lines of a result. This provides a more interactive feel and stops us from accidentally crushing our system. Sometimes we do want to compute all of the results and store them someplace.

Blaze expressions are valid sources for into. So we can store our results in any format.


In [ ]:
iris = bz.Data('sqlite:///my.db::iris')
query = bz.by(iris.species, largest=iris.petal_length.max(),   # A lazily evaluated result
                            smallest=iris.petal_length.min())  

bz.into(list, query)  # A concrete result

In [ ]:
bz.into('iris-min-max.json', result)

In [ ]:
!head iris-min-max.json

Exercise: Storage

The solution to the first split-apply-combine problem is below. Store that result in a list, a CSV file, and in a new SQL table in our database (use a uri like sqlite://... to specify the SQL table.)


In [ ]:
result = bz.by(db.Salaries.teamID, avg=db.Salaries.salary.mean(), 
                                   max=db.Salaries.salary.max(), 
                                   ratio=db.Salaries.salary.max() / db.Salaries.salary.min()
                ).sort('ratio', ascending=False)

In [ ]:


In [ ]:


In [ ]: