<img src="images/continuum_analytics_logo.png" alt="Continuum Logo", align="right", width="30%">,
Blaze translates a subset of numpy/pandas syntax into database queries. It hides away the database.
On simple datasets, like CSV files, Blaze acts like Pandas with slightly different syntax. In this case Blaze is just using Pandas.
In [ ]:
import pandas as pd
df = pd.read_csv('iris.csv')
df.head(5)
In [ ]:
df.species.drop_duplicates()
In [ ]:
import blaze as bz
d = bz.Data('iris.csv')
d.head(5)
In [ ]:
d.species.distinct()
Blaze does different things under-the-hood on different kinds of data
We'll play with SQL a lot during this tutorial. Blaze translates your query to SQLAlchemy. SQLAlchemy then translates to the SQL dialect of your database, your database then executes that query intelligently.
This translation process lets analysts interact with a familiar interface while leveraging a potentially powerful database.
To keep things local we'll use SQLite, but this works with any database with a SQLAlchemy dialect. Examples in this section use the iris dataset. Exercises use the Lahman Baseball statistics database, year 2013.
If you have not downloaded this dataset you could do so here - https://github.com/jknecht/baseball-archive-sqlite/raw/master/lahman2013.sqlite.
Alternatively we have set up a PostgreSQL instance on EC2 with this same data.
postgresql://postgres:postgres@ec2-54-159-160-163.compute-1.amazonaws.com
If you want to use the Postgres instance you will have to install psycopg2
. If you are on Mac/Linux you can probably just
conda install psycopg2
If you are on Windows you should download an installer here http://www.stickpeople.com/projects/python/win-psycopg/
In [ ]:
db = bz.Data('sqlite:///my.db')
db.iris
In [ ]:
db.iris.species.distinct()
In [ ]:
db.iris[db.iris.species == 'Iris-versicolor'][['species', 'sepal_length']]
If we were using pandas we would read the table into pandas, then use pandas' fast in-memory algorithms for computation. Here we translate your query into SQL and then send that query to the database to do the work.
If we want to dive into the internal API we can inspect the query that Blaze transmits.
In [ ]:
# Inspect SQL query
query = db.iris[db.iris.species == 'Iris-versicolor'][['species', 'sepal_length']]
print bz.compute(query)
In [ ]:
query = bz.by(db.iris.species, longest=db.iris.petal_length.max(),
shortest=db.iris.petal_length.min())
print bz.compute(query)
In [ ]:
# db = bz.Data('postgresql://postgres:postgres@ec2-54-159-160-163.compute-1.amazonaws.com') # Use Postgres if you don't have the sqlite file
db = bz.Data('sqlite:///lahman2013.sqlite')
db.dshape
In [ ]:
# View the Salaries table
In [ ]:
# What are the distinct teamIDs in the Salaries table?
In [ ]:
# What is the minimum and maximum yearID in the Sarlaries table?
In [ ]:
# For the Oakland Athletics (teamID OAK), pick out the playerID, salary, and yearID columns
In [ ]:
# Sort that result by salary.
# Use the ascending=False keyword argument to the sort function to find the highest paid players
In [ ]:
import pandas as pd
iris = pd.read_csv('iris.csv')
iris.groupby('species').petal_length.min()
In [ ]:
iris = bz.Data('sqlite:///my.db::iris')
bz.by(iris.species, largest=iris.petal_length.max(),
smallest=iris.petal_length.min())
In [ ]:
# Find the average and maximum salary by team
# If you like, also find the ratio between the highest and lowest paid players, sort by that ratio
In [ ]:
# Track the average and maximum salary over time
By default Blaze only shows us the first ten lines of a result. This provides a more interactive feel and stops us from accidentally crushing our system. Sometimes we do want to compute all of the results and store them someplace.
Blaze expressions are valid sources for into
. So we can store our results in any format.
In [ ]:
iris = bz.Data('sqlite:///my.db::iris')
query = bz.by(iris.species, largest=iris.petal_length.max(), # A lazily evaluated result
smallest=iris.petal_length.min())
bz.into(list, query) # A concrete result
In [ ]:
bz.into('iris-min-max.json', result)
In [ ]:
!head iris-min-max.json
In [ ]:
result = bz.by(db.Salaries.teamID, avg=db.Salaries.salary.mean(),
max=db.Salaries.salary.max(),
ratio=db.Salaries.salary.max() / db.Salaries.salary.min()
).sort('ratio', ascending=False)
In [ ]:
In [ ]:
In [ ]: