<img src="images/continuum_analytics_logo.png" alt="Continuum Logo", align="right", width="30%">,

Introduction to Blaze

In this tutorial we'll learn how to use Blaze to discover, migrate, and query data living in other databases. Generally this tutorial will have the following format

odo - Move data to database
blaze - Query data in database

Install

This tutorial uses many different libraries that are all available with the Anaconda Distribution. Once you have Anaconda install, please run these commands from a terminal:

$ conda install -y blaze
$ conda install -y bokeh
$ conda install -y odo

nbviewer: http://nbviewer.ipython.org/github/ContinuumIO/pydata-apps/blob/master/Section-1_blaze.ipynb

github: https://github.com/ContinuumIO/pydata-apps

Goal: Accessible, Interactive, Analytic Queries

NumPy and Pandas provide accessible, interactive, analytic queries; this is valuable.



In [1]:

    
import pandas as pd
df = pd.read_csv('iris.csv')
df.head()









    Out[1]:






  
    
      
      Unnamed: 0
      SepalLength
      SepalWidth
      PetalLength
      PetalWidth
      Species
    
  
  
    
      0
      1
      5.1
      3.5
      1.4
      0.2
      setosa
    
    
      1
      2
      4.9
      3.0
      1.4
      0.2
      setosa
    
    
      2
      3
      4.7
      3.2
      1.3
      0.2
      setosa
    
    
      3
      4
      4.6
      3.1
      1.5
      0.2
      setosa
    
    
      4
      5
      5.0
      3.6
      1.4
      0.2
      setosa



In [2]:

    
df.groupby(df.Species).PetalLength.mean()  # Average petal length per species









    Out[2]:





Species
setosa        1.462
versicolor    4.260
virginica     5.552
Name: PetalLength, dtype: float64

But as data grows and systems become more complex, moving data and querying data become more difficult. Python already has excellent tools for data that fits in memory, but we want to hook up to data that is inconvenient.

From now on, we're going to assume one of the following:

You have an inconvenient amount of data
That data should live someplace other than your computer

Databases and Python

When in-memory arrays/dataframes cease to be an option, we turn to databases. These live outside of the Python process and so might be less convenient. The open source Python ecosystem includes libraries to interact with these databases and with foreign data in general.

Examples:

SQL - sqlalchemy
- Hive/Cassandra - pyhive
- Impala - impyla
- RedShift - redshift-sqlalchemy
- ...
MongoDB - pymongo
HBase - happybase
Spark - pyspark
SSH - paramiko
HDFS - pywebhdfs
Amazon S3 - boto

Today we're going to use some of these indirectly with odo (was into) and Blaze. We'll try to point out these libraries as we automate them so that, if you'd like, you can use them independently.

<img src="images/continuum_analytics_logo.png" alt="Continuum Logo", align="right", width="30%">,

`odo` (formerly `into`)

Odo migrates data between formats and locations.

Before we can use a database we need to move data into it. The odo project provides a single consistent interface to move data between formats and between locations.

We'll start with local data and eventually move out to remote data.

odo docs

Examples

Odo moves data into a target from a source

>>> odo(source, target)

The target and source can be either a Python object or a string URI. The following are all valid calls to into

>>> odo('iris.csv', pd.DataFrame)  # Load CSV file into new DataFrame
>>> odo(my_df, 'iris.json')        # Write DataFrame into JSON file
>>> odo('iris.csv', 'iris.json')   # Migrate data from CSV to JSON

Exercise

Use odo to load the iris.csv file into a Python list, a np.ndarray, and a pd.DataFrame



In [3]:

    
from odo import odo
import numpy as np
import pandas as pd



In [4]:

    
odo("iris.csv", pd.DataFrame)









    Out[4]:






  
    
      
      Unnamed: 0
      SepalLength
      SepalWidth
      PetalLength
      PetalWidth
      Species
    
  
  
    
      0
      1
      5.1
      3.5
      1.4
      0.2
      setosa
    
    
      1
      2
      4.9
      3.0
      1.4
      0.2
      setosa
    
    
      2
      3
      4.7
      3.2
      1.3
      0.2
      setosa
    
    
      3
      4
      4.6
      3.1
      1.5
      0.2
      setosa
    
    
      4
      5
      5.0
      3.6
      1.4
      0.2
      setosa
    
    
      5
      6
      5.4
      3.9
      1.7
      0.4
      setosa
    
    
      6
      7
      4.6
      3.4
      1.4
      0.3
      setosa
    
    
      7
      8
      5.0
      3.4
      1.5
      0.2
      setosa
    
    
      8
      9
      4.4
      2.9
      1.4
      0.2
      setosa
    
    
      9
      10
      4.9
      3.1
      1.5
      0.1
      setosa
    
    
      10
      11
      5.4
      3.7
      1.5
      0.2
      setosa
    
    
      11
      12
      4.8
      3.4
      1.6
      0.2
      setosa
    
    
      12
      13
      4.8
      3.0
      1.4
      0.1
      setosa
    
    
      13
      14
      4.3
      3.0
      1.1
      0.1
      setosa
    
    
      14
      15
      5.8
      4.0
      1.2
      0.2
      setosa
    
    
      15
      16
      5.7
      4.4
      1.5
      0.4
      setosa
    
    
      16
      17
      5.4
      3.9
      1.3
      0.4
      setosa
    
    
      17
      18
      5.1
      3.5
      1.4
      0.3
      setosa
    
    
      18
      19
      5.7
      3.8
      1.7
      0.3
      setosa
    
    
      19
      20
      5.1
      3.8
      1.5
      0.3
      setosa
    
    
      20
      21
      5.4
      3.4
      1.7
      0.2
      setosa
    
    
      21
      22
      5.1
      3.7
      1.5
      0.4
      setosa
    
    
      22
      23
      4.6
      3.6
      1.0
      0.2
      setosa
    
    
      23
      24
      5.1
      3.3
      1.7
      0.5
      setosa
    
    
      24
      25
      4.8
      3.4
      1.9
      0.2
      setosa
    
    
      25
      26
      5.0
      3.0
      1.6
      0.2
      setosa
    
    
      26
      27
      5.0
      3.4
      1.6
      0.4
      setosa
    
    
      27
      28
      5.2
      3.5
      1.5
      0.2
      setosa
    
    
      28
      29
      5.2
      3.4
      1.4
      0.2
      setosa
    
    
      29
      30
      4.7
      3.2
      1.6
      0.2
      setosa
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      120
      121
      6.9
      3.2
      5.7
      2.3
      virginica
    
    
      121
      122
      5.6
      2.8
      4.9
      2.0
      virginica
    
    
      122
      123
      7.7
      2.8
      6.7
      2.0
      virginica
    
    
      123
      124
      6.3
      2.7
      4.9
      1.8
      virginica
    
    
      124
      125
      6.7
      3.3
      5.7
      2.1
      virginica
    
    
      125
      126
      7.2
      3.2
      6.0
      1.8
      virginica
    
    
      126
      127
      6.2
      2.8
      4.8
      1.8
      virginica
    
    
      127
      128
      6.1
      3.0
      4.9
      1.8
      virginica
    
    
      128
      129
      6.4
      2.8
      5.6
      2.1
      virginica
    
    
      129
      130
      7.2
      3.0
      5.8
      1.6
      virginica
    
    
      130
      131
      7.4
      2.8
      6.1
      1.9
      virginica
    
    
      131
      132
      7.9
      3.8
      6.4
      2.0
      virginica
    
    
      132
      133
      6.4
      2.8
      5.6
      2.2
      virginica
    
    
      133
      134
      6.3
      2.8
      5.1
      1.5
      virginica
    
    
      134
      135
      6.1
      2.6
      5.6
      1.4
      virginica
    
    
      135
      136
      7.7
      3.0
      6.1
      2.3
      virginica
    
    
      136
      137
      6.3
      3.4
      5.6
      2.4
      virginica
    
    
      137
      138
      6.4
      3.1
      5.5
      1.8
      virginica
    
    
      138
      139
      6.0
      3.0
      4.8
      1.8
      virginica
    
    
      139
      140
      6.9
      3.1
      5.4
      2.1
      virginica
    
    
      140
      141
      6.7
      3.1
      5.6
      2.4
      virginica
    
    
      141
      142
      6.9
      3.1
      5.1
      2.3
      virginica
    
    
      142
      143
      5.8
      2.7
      5.1
      1.9
      virginica
    
    
      143
      144
      6.8
      3.2
      5.9
      2.3
      virginica
    
    
      144
      145
      6.7
      3.3
      5.7
      2.5
      virginica
    
    
      145
      146
      6.7
      3.0
      5.2
      2.3
      virginica
    
    
      146
      147
      6.3
      2.5
      5.0
      1.9
      virginica
    
    
      147
      148
      6.5
      3.0
      5.2
      2.0
      virginica
    
    
      148
      149
      6.2
      3.4
      5.4
      2.3
      virginica
    
    
      149
      150
      5.9
      3.0
      5.1
      1.8
      virginica
    
  

150 rows × 6 columns

URI Strings

Odo refers to foreign data either with a Python object like a sqlalchemy.Table object for a SQL table, or with a string URI, like postgresql://hostname::tablename.

URI's often take on the following form

protocol://path-to-resource::path-within-resource

Where path-to-resource might point to a file, a database hostname, etc. while path-within-resource might refer to a datapath or table name. Note the two main separators

:// separates the protocol on the left (sqlite, mongodb, ssh, hdfs, hive, ...)
:: separates the path within the database on the right (e.g. tablename)

odo docs on uri strings

Examples

Here are some example URIs

myfile.json
myfiles.*.csv'
postgresql://hostname::tablename
mongodb://hostname/db::collection
ssh://user@host:/path/to/myfile.csv
hdfs://user@host:/path/to/*.csv

Exercise

Migrate your CSV file into a table named iris in a new SQLite database at sqlite:///my.db. Remember to use the :: separator and to separate your database name from your table name.

odo docs on SQL



In [5]:

    
odo("iris.csv", "sqlite:///my.db::iris")









    Out[5]:





Table('iris', MetaData(bind=Engine(sqlite:///my.db)), Column('Unnamed: 0', BIGINT(), table=<iris>, nullable=False), Column('SepalLength', FLOAT(), table=<iris>), Column('SepalWidth', FLOAT(), table=<iris>), Column('PetalLength', FLOAT(), table=<iris>), Column('PetalWidth', FLOAT(), table=<iris>), Column('Species', TEXT(), table=<iris>), schema=None)

What kind of object did you get receive as output? Call type on your result.



In [6]:

    
type(_)









    Out[6]:





sqlalchemy.sql.schema.Table

How it works

Odo is a network of fast pairwise conversions between pairs of formats. We when we migrate between two formats we traverse a path of pairwise conversions.

We visualize that network below:

Each node represents a data format. Each directed edge represents a function to transform data between two formats. A single call to into may traverse multiple edges and multiple intermediate formats. Red nodes support larger-than-memory data.

A single call to into may traverse several intermediate formats calling on several conversion functions. For example, we when migrate a CSV file to a Mongo database we might take the following route:

Load in to a DataFrame (pandas.read_csv)
Convert to np.recarray (DataFrame.to_records)
Then to a Python Iterator (np.ndarray.tolist)
Finally to Mongo (pymongo.Collection.insert)

Alternatively we could write a special function that uses MongoDB's native CSV loader and shortcut this entire process with a direct edge CSV -> Mongo.

These functions are chosen because they are fast, often far faster than converting through a central serialization format.

This picture is actually from an older version of odo, when the graph was still small enough to visualize pleasantly. See odo docs for a more updated version.

Remote Data

We can interact with remote data in three locations

On Amazon's S3 (this will be quick)
On a remote machine via ssh
On the Hadoop File System (HDFS)

For most of this we'll wait until we've seen Blaze, briefly we'll use S3.

S3

For now, we quickly grab a file from Amazon's S3.

This example depends on boto to interact with S3.

conda install boto

odo docs on aws



In [8]:

    
odo('s3://nyqpug/tips.csv', pd.DataFrame)









    Out[8]:






  
    
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
    
  
  
    
      0
      16.99
      1.01
      Female
      No
      Sun
      Dinner
      2
    
    
      1
      10.34
      1.66
      Male
      No
      Sun
      Dinner
      3
    
    
      2
      21.01
      3.50
      Male
      No
      Sun
      Dinner
      3
    
    
      3
      23.68
      3.31
      Male
      No
      Sun
      Dinner
      2
    
    
      4
      24.59
      3.61
      Female
      No
      Sun
      Dinner
      4
    
    
      5
      25.29
      4.71
      Male
      No
      Sun
      Dinner
      4
    
    
      6
      8.77
      2.00
      Male
      No
      Sun
      Dinner
      2
    
    
      7
      26.88
      3.12
      Male
      No
      Sun
      Dinner
      4
    
    
      8
      15.04
      1.96
      Male
      No
      Sun
      Dinner
      2
    
    
      9
      14.78
      3.23
      Male
      No
      Sun
      Dinner
      2
    
    
      10
      10.27
      1.71
      Male
      No
      Sun
      Dinner
      2
    
    
      11
      35.26
      5.00
      Female
      No
      Sun
      Dinner
      4
    
    
      12
      15.42
      1.57
      Male
      No
      Sun
      Dinner
      2
    
    
      13
      18.43
      3.00
      Male
      No
      Sun
      Dinner
      4
    
    
      14
      14.83
      3.02
      Female
      No
      Sun
      Dinner
      2
    
    
      15
      21.58
      3.92
      Male
      No
      Sun
      Dinner
      2
    
    
      16
      10.33
      1.67
      Female
      No
      Sun
      Dinner
      3
    
    
      17
      16.29
      3.71
      Male
      No
      Sun
      Dinner
      3
    
    
      18
      16.97
      3.50
      Female
      No
      Sun
      Dinner
      3
    
    
      19
      20.65
      3.35
      Male
      No
      Sat
      Dinner
      3
    
    
      20
      17.92
      4.08
      Male
      No
      Sat
      Dinner
      2
    
    
      21
      20.29
      2.75
      Female
      No
      Sat
      Dinner
      2
    
    
      22
      15.77
      2.23
      Female
      No
      Sat
      Dinner
      2
    
    
      23
      39.42
      7.58
      Male
      No
      Sat
      Dinner
      4
    
    
      24
      19.82
      3.18
      Male
      No
      Sat
      Dinner
      2
    
    
      25
      17.81
      2.34
      Male
      No
      Sat
      Dinner
      4
    
    
      26
      13.37
      2.00
      Male
      No
      Sat
      Dinner
      2
    
    
      27
      12.69
      2.00
      Male
      No
      Sat
      Dinner
      2
    
    
      28
      21.70
      4.30
      Male
      No
      Sat
      Dinner
      2
    
    
      29
      19.65
      3.00
      Female
      No
      Sat
      Dinner
      2
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      214
      28.17
      6.50
      Female
      Yes
      Sat
      Dinner
      3
    
    
      215
      12.90
      1.10
      Female
      Yes
      Sat
      Dinner
      2
    
    
      216
      28.15
      3.00
      Male
      Yes
      Sat
      Dinner
      5
    
    
      217
      11.59
      1.50
      Male
      Yes
      Sat
      Dinner
      2
    
    
      218
      7.74
      1.44
      Male
      Yes
      Sat
      Dinner
      2
    
    
      219
      30.14
      3.09
      Female
      Yes
      Sat
      Dinner
      4
    
    
      220
      12.16
      2.20
      Male
      Yes
      Fri
      Lunch
      2
    
    
      221
      13.42
      3.48
      Female
      Yes
      Fri
      Lunch
      2
    
    
      222
      8.58
      1.92
      Male
      Yes
      Fri
      Lunch
      1
    
    
      223
      15.98
      3.00
      Female
      No
      Fri
      Lunch
      3
    
    
      224
      13.42
      1.58
      Male
      Yes
      Fri
      Lunch
      2
    
    
      225
      16.27
      2.50
      Female
      Yes
      Fri
      Lunch
      2
    
    
      226
      10.09
      2.00
      Female
      Yes
      Fri
      Lunch
      2
    
    
      227
      20.45
      3.00
      Male
      No
      Sat
      Dinner
      4
    
    
      228
      13.28
      2.72
      Male
      No
      Sat
      Dinner
      2
    
    
      229
      22.12
      2.88
      Female
      Yes
      Sat
      Dinner
      2
    
    
      230
      24.01
      2.00
      Male
      Yes
      Sat
      Dinner
      4
    
    
      231
      15.69
      3.00
      Male
      Yes
      Sat
      Dinner
      3
    
    
      232
      11.61
      3.39
      Male
      No
      Sat
      Dinner
      2
    
    
      233
      10.77
      1.47
      Male
      No
      Sat
      Dinner
      2
    
    
      234
      15.53
      3.00
      Male
      Yes
      Sat
      Dinner
      2
    
    
      235
      10.07
      1.25
      Male
      No
      Sat
      Dinner
      2
    
    
      236
      12.60
      1.00
      Male
      Yes
      Sat
      Dinner
      2
    
    
      237
      32.83
      1.17
      Male
      Yes
      Sat
      Dinner
      2
    
    
      238
      35.83
      4.67
      Female
      No
      Sat
      Dinner
      3
    
    
      239
      29.03
      5.92
      Male
      No
      Sat
      Dinner
      3
    
    
      240
      27.18
      2.00
      Female
      Yes
      Sat
      Dinner
      2
    
    
      241
      22.67
      2.00
      Male
      Yes
      Sat
      Dinner
      2
    
    
      242
      17.82
      1.75
      Male
      No
      Sat
      Dinner
      2
    
    
      243
      18.78
      3.00
      Female
      No
      Thur
      Dinner
      2
    
  

244 rows × 7 columns

<img src="images/continuum_analytics_logo.png" alt="Continuum Logo", align="right", width="30%">,

Blaze

Blaze translates a subset of numpy/pandas syntax into database queries. It hides away the database.

On simple datasets, like CSV files, Blaze acts like Pandas with slightly different syntax. In this case Blaze is just using Pandas.

Pandas example



In [9]:

    
import pandas as pd

df = pd.read_csv('iris.csv')
df.head(5)









    Out[9]:






  
    
      
      Unnamed: 0
      SepalLength
      SepalWidth
      PetalLength
      PetalWidth
      Species
    
  
  
    
      0
      1
      5.1
      3.5
      1.4
      0.2
      setosa
    
    
      1
      2
      4.9
      3.0
      1.4
      0.2
      setosa
    
    
      2
      3
      4.7
      3.2
      1.3
      0.2
      setosa
    
    
      3
      4
      4.6
      3.1
      1.5
      0.2
      setosa
    
    
      4
      5
      5.0
      3.6
      1.4
      0.2
      setosa



In [10]:

    
df.Species.unique()









    Out[10]:





array(['setosa', 'versicolor', 'virginica'], dtype=object)



In [11]:

    
df.Species.drop_duplicates()









    Out[11]:





0          setosa
50     versicolor
100     virginica
Name: Species, dtype: object

Blaze example



In [12]:

    
import blaze as bz

d = bz.Data('iris.csv')
d.head(5)









    Out[12]:





  
    
      
      Unnamed: 0
      SepalLength
      SepalWidth
      PetalLength
      PetalWidth
      Species
    
  
  
    
      0
      1
      5.1
      3.5
      1.4
      0.2
      setosa
    
    
      1
      2
      4.9
      3.0
      1.4
      0.2
      setosa
    
    
      2
      3
      4.7
      3.2
      1.3
      0.2
      setosa
    
    
      3
      4
      4.6
      3.1
      1.5
      0.2
      setosa
    
    
      4
      5
      5.0
      3.6
      1.4
      0.2
      setosa



In [13]:

    
d.Species.distinct()









    Out[13]:





  
    
      
      Species
    
  
  
    
      0
      setosa
    
    
      1
      versicolor
    
    
      2
      virginica

Foreign Data

Blaze does different things under-the-hood on different kinds of data

CSV files: Pandas DataFrames (or iterators of DataFrames)
SQL tables: SQLAlchemy.
Mongo collections: PyMongo
...

SQL

We'll play with SQL a lot during this tutorial. Blaze translates your query to SQLAlchemy. SQLAlchemy then translates to the SQL dialect of your database, your database then executes that query intelligently.

Blaze $\rightarrow$ SQLAlchemy $\rightarrow$ SQL $\rightarrow$ Database computation

This translation process lets analysts interact with a familiar interface while leveraging a potentially powerful database.

To keep things local we'll use SQLite, but this works with any database with a SQLAlchemy dialect. Examples in this section use the iris dataset. Exercises use the Lahman Baseball statistics database, year 2013.

If you have not downloaded this dataset you could do so here - https://github.com/jknecht/baseball-archive-sqlite/raw/master/lahman2013.sqlite.



In [14]:

    
!ls









    



06-simple-apps                  Section3-2_bokeh.ipynb
07-final-app                    images
LICENSE                         iris.csv
Section-1_blaze-solutions.ipynb lahman2013.sqlite
Section3-1_blaze.ipynb          my.db

Examples

Lets dive into Blaze Syntax. For simple queries it looks and feels similar to Pandas



In [15]:

    
db = bz.Data('sqlite:///my.db')
#db.iris
#db.iris.head()



In [16]:

    
db.iris.Species.distinct()









    Out[16]:





  
    
      
      Species
    
  
  
    
      0
      Species
    
    
      1
      setosa
    
    
      2
      versicolor
    
    
      3
      virginica



In [17]:

    
db.iris[db.iris.Species == 'versicolor'][['Species', 'SepalLength']]









    Out[17]:





  
    
      
      Species
      SepalLength
    
  
  
    
      0
      versicolor
      7.0
    
    
      1
      versicolor
      6.4
    
    
      2
      versicolor
      6.9
    
    
      3
      versicolor
      5.5
    
    
      4
      versicolor
      6.5
    
    
      5
      versicolor
      5.7
    
    
      6
      versicolor
      6.3
    
    
      7
      versicolor
      4.9
    
    
      8
      versicolor
      6.6
    
    
      9
      versicolor
      5.2
    
    
      10
      versicolor
      5.0

Work happens on the database

If we were using pandas we would read the table into pandas, then use pandas' fast in-memory algorithms for computation. Here we translate your query into SQL and then send that query to the database to do the work.

Pandas $\leftarrow_\textrm{data}$ SQL, then Pandas computes
Blaze $\rightarrow_\textrm{query}$ SQL, then database computes

If we want to dive into the internal API we can inspect the query that Blaze transmits.



In [18]:

    
# Inspect SQL query
query = db.iris[db.iris.Species == 'versicolor'][['Species', 'SepalLength']]
print bz.compute(query)









    



SELECT iris."Species", iris."SepalLength" 
FROM iris 
WHERE iris."Species" = ?



In [19]:

    
query = bz.by(db.iris.Species, longest=db.iris.PetalLength.max(),
                              shortest=db.iris.PetalLength.min())
print bz.compute(query)









    



SELECT iris."Species", max(iris."PetalLength") AS longest, min(iris."PetalLength") AS shortest 
FROM iris GROUP BY iris."Species"



In [20]:

    
odo(query, list)









    Out[20]:





[(u'Species', u'PetalLength', u'PetalLength'),
 (u'setosa', 1.9, 1.0),
 (u'versicolor', 5.1, 3.0),
 (u'virginica', 6.9, 4.5)]

Exercises

Now we load the Lahman baseball database and perform similar queries



In [21]:

    
# db = bz.Data('postgresql://postgres:postgres@ec2-54-159-160-163.compute-1.amazonaws.com')  # Use Postgres if you don't have the sqlite file
db = bz.Data('sqlite:///lahman2013.sqlite')
db.dshape









    Out[21]:





dshape("""{
  AllstarFull: var * {
    playerID: ?string,
    yearID: ?int32,
    gameNum: ?int32,
    gameID: ?string,
    teamID: ?string,
    lgID: ?string,
    GP: ?int32,
    startingPos: ?int32
    },
  Appearances: var * {
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    playerID: ?string,
    G_all: ?int32,
    GS: ?int32,
    G_batting: ?int32,
    G_defense: ?int32,
    G_p: ?int32,
    G_c: ?int32,
    G_1b: ?int32,
    G_2b: ?int32,
    G_3b: ?int32,
    G_ss: ?int32,
    G_lf: ?int32,
    G_cf: ?int32,
    G_rf: ?int32,
    G_of: ?int32,
    G_dh: ?int32,
    G_ph: ?int32,
    G_pr: ?int32
    },
  AwardsManagers: var * {
    playerID: ?string,
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    tie: ?string,
    notes: ?string
    },
  AwardsPlayers: var * {
    playerID: ?string,
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    tie: ?string,
    notes: ?string
    },
  AwardsShareManagers: var * {
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    playerID: ?string,
    pointsWon: ?int32,
    pointsMax: ?int32,
    votesFirst: ?int32
    },
  AwardsSharePlayers: var * {
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    playerID: ?string,
    pointsWon: ?float64,
    pointsMax: ?int32,
    votesFirst: ?float64
    },
  Batting: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    teamID: ?string,
    lgID: ?string,
    G: ?int32,
    G_batting: ?int32,
    AB: ?int32,
    R: ?int32,
    H: ?int32,
    2B: ?int32,
    3B: ?int32,
    HR: ?int32,
    RBI: ?int32,
    SB: ?int32,
    CS: ?int32,
    BB: ?int32,
    SO: ?int32,
    IBB: ?int32,
    HBP: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32,
    G_old: ?int32
    },
  BattingPost: var * {
    yearID: ?int32,
    round: ?string,
    playerID: ?string,
    teamID: ?string,
    lgID: ?string,
    G: ?int32,
    AB: ?int32,
    R: ?int32,
    H: ?int32,
    2B: ?int32,
    3B: ?int32,
    HR: ?int32,
    RBI: ?int32,
    SB: ?int32,
    CS: ?int32,
    BB: ?int32,
    SO: ?int32,
    IBB: ?int32,
    HBP: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32
    },
  Fielding: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    teamID: ?string,
    lgID: ?string,
    POS: ?string,
    G: ?int32,
    GS: ?int32,
    InnOuts: ?int32,
    PO: ?int32,
    A: ?int32,
    E: ?int32,
    DP: ?int32,
    PB: ?int32,
    WP: ?int32,
    SB: ?int32,
    CS: ?int32,
    ZR: ?float64
    },
  FieldingOF: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    Glf: ?int32,
    Gcf: ?int32,
    Grf: ?int32
    },
  FieldingPost: var * {
    playerID: ?string,
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    round: ?string,
    POS: ?string,
    G: ?int32,
    GS: ?int32,
    InnOuts: ?int32,
    PO: ?int32,
    A: ?int32,
    E: ?int32,
    DP: ?int32,
    TP: ?int32,
    PB: ?int32,
    SB: ?int32,
    CS: ?int32
    },
  HallOfFame: var * {
    playerID: ?string,
    yearid: ?int32,
    votedBy: ?string,
    ballots: ?int32,
    needed: ?int32,
    votes: ?int32,
    inducted: ?string,
    category: ?string,
    needed_note: ?string
    },
  Managers: var * {
    playerID: ?string,
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    inseason: ?int32,
    G: ?int32,
    W: ?int32,
    L: ?int32,
    rank: ?int32,
    plyrMgr: ?string
    },
  ManagersHalf: var * {
    playerID: ?string,
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    inseason: ?int32,
    half: ?int32,
    G: ?int32,
    W: ?int32,
    L: ?int32,
    rank: ?int32
    },
  Master: var * {
    playerID: ?string,
    birthYear: ?int32,
    birthMonth: ?int32,
    birthDay: ?int32,
    birthCountry: ?string,
    birthState: ?string,
    birthCity: ?string,
    deathYear: ?int32,
    deathMonth: ?int32,
    deathDay: ?int32,
    deathCountry: ?string,
    deathState: ?string,
    deathCity: ?string,
    nameFirst: ?string,
    nameLast: ?string,
    nameGiven: ?string,
    weight: ?int32,
    height: ?float64,
    bats: ?string,
    throws: ?string,
    debut: ?float64,
    finalGame: ?float64,
    retroID: ?string,
    bbrefID: ?string
    },
  Pitching: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    teamID: ?string,
    lgID: ?string,
    W: ?int32,
    L: ?int32,
    G: ?int32,
    GS: ?int32,
    CG: ?int32,
    SHO: ?int32,
    SV: ?int32,
    IPouts: ?int32,
    H: ?int32,
    ER: ?int32,
    HR: ?int32,
    BB: ?int32,
    SO: ?int32,
    BAOpp: ?float64,
    ERA: ?float64,
    IBB: ?int32,
    WP: ?int32,
    HBP: ?int32,
    BK: ?int32,
    BFP: ?int32,
    GF: ?int32,
    R: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32
    },
  PitchingPost: var * {
    playerID: ?string,
    yearID: ?int32,
    round: ?string,
    teamID: ?string,
    lgID: ?string,
    W: ?int32,
    L: ?int32,
    G: ?int32,
    GS: ?int32,
    CG: ?int32,
    SHO: ?int32,
    SV: ?int32,
    IPouts: ?int32,
    H: ?int32,
    ER: ?int32,
    HR: ?int32,
    BB: ?int32,
    SO: ?int32,
    BAOpp: ?float64,
    ERA: ?float64,
    IBB: ?int32,
    WP: ?int32,
    HBP: ?int32,
    BK: ?int32,
    BFP: ?int32,
    GF: ?int32,
    R: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32
    },
  Salaries: var * {
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    playerID: ?string,
    salary: ?float64
    },
  Schools: var * {
    schoolID: ?string,
    schoolName: ?string,
    schoolCity: ?string,
    schoolState: ?string,
    schoolNick: ?string
    },
  SchoolsPlayers: var * {
    playerID: ?string,
    schoolID: ?string,
    yearMin: ?int32,
    yearMax: ?int32
    },
  SeriesPost: var * {
    yearID: ?int32,
    round: ?string,
    teamIDwinner: ?string,
    lgIDwinner: ?string,
    teamIDloser: ?string,
    lgIDloser: ?string,
    wins: ?int32,
    losses: ?int32,
    ties: ?int32
    },
  Teams: var * {
    yearID: ?int32,
    lgID: ?string,
    teamID: ?string,
    franchID: ?string,
    divID: ?string,
    Rank: ?int32,
    G: ?int32,
    Ghome: ?int32,
    W: ?int32,
    L: ?int32,
    DivWin: ?string,
    WCWin: ?string,
    LgWin: ?string,
    WSWin: ?string,
    R: ?int32,
    AB: ?int32,
    H: ?int32,
    2B: ?int32,
    3B: ?int32,
    HR: ?int32,
    BB: ?int32,
    SO: ?int32,
    SB: ?int32,
    CS: ?int32,
    HBP: ?int32,
    SF: ?int32,
    RA: ?int32,
    ER: ?int32,
    ERA: ?float64,
    CG: ?int32,
    SHO: ?int32,
    SV: ?int32,
    IPouts: ?int32,
    HA: ?int32,
    HRA: ?int32,
    BBA: ?int32,
    SOA: ?int32,
    E: ?int32,
    DP: ?int32,
    FP: ?float64,
    name: ?string,
    park: ?string,
    attendance: ?int32,
    BPF: ?int32,
    PPF: ?int32,
    teamIDBR: ?string,
    teamIDlahman45: ?string,
    teamIDretro: ?string
    },
  TeamsFranchises: var * {
    franchID: ?string,
    franchName: ?string,
    active: ?string,
    NAassoc: ?string
    },
  TeamsHalf: var * {
    yearID: ?int32,
    lgID: ?string,
    teamID: ?string,
    Half: ?string,
    divID: ?string,
    DivWin: ?string,
    Rank: ?int32,
    G: ?int32,
    W: ?int32,
    L: ?int32
    },
  temp: var * {ID: ?int32, namefull: ?string, born: ?float64}
  }""")



In [ ]:

    
# View the Salaries table



In [ ]:

    
# What are the distinct teamIDs in the Salaries table?



In [ ]:

    
# What is the minimum and maximum yearID in the Sarlaries table?



In [ ]:

    
# For the Oakland Athletics (teamID OAK), pick out the playerID, salary, and yearID columns



In [ ]:

    
# Sort that result by salary.  
# Use the ascending=False keyword argument to the sort function to find the highest paid players

Example: Split-apply-combine

In Pandas we perform computations on a per-group basis with the groupby operator. In Blaze our syntax is slightly different, using instead the by function.



In [23]:

    
import pandas as pd
iris = pd.read_csv('iris.csv')
iris.groupby('Species').PetalLength.min()









    Out[23]:





Species
setosa        1.0
versicolor    3.0
virginica     4.5
Name: PetalLength, dtype: float64



In [24]:

    
iris = bz.Data('sqlite:///my.db::iris')
bz.by(iris.Species, largest=iris.PetalLength.max(), 
                    smallest=iris.PetalLength.min())
print(_)









    



Species
setosa        1.0
versicolor    3.0
virginica     4.5
Name: PetalLength, dtype: float64

Store Results

By default Blaze only shows us the first ten lines of a result. This provides a more interactive feel and stops us from accidentally crushing our system. Sometimes we do want to compute all of the results and store them someplace.

Blaze expressions are valid sources for odo. So we can store our results in any format.



In [25]:

    
iris = bz.Data('sqlite:///my.db::iris')
query = bz.by(iris.Species, largest=iris.PetalLength.max(),   # A lazily evaluated result
                            smallest=iris.PetalLength.min())  

odo(query, list)  # A concrete result









    Out[25]:





[(u'Species', u'PetalLength', u'PetalLength'),
 (u'setosa', 1.9, 1.0),
 (u'versicolor', 5.1, 3.0),
 (u'virginica', 6.9, 4.5)]

Exercise: Storage

The solution to the first split-apply-combine problem is below. Store that result in a list, a CSV file, and in a new SQL table in our database (use a uri like sqlite://... to specify the SQL table.)



In [26]:

    
result = bz.by(db.Salaries.teamID, avg=db.Salaries.salary.mean(), 
                                   max=db.Salaries.salary.max(), 
                                   ratio=db.Salaries.salary.max() / db.Salaries.salary.min()
                ).sort('ratio', ascending=False)



In [27]:

    
odo(result, list)[:10]









    Out[27]:





[(u'PHI', 2092230.9326364691, 25000000.0, 416.6666666666667),
 (u'LAN', 2346982.6980255516, 23854494.0, 397.5749),
 (u'NYN', 2317349.977245509, 23145011.0, 385.7501833333333),
 (u'DET', 1980834.9902080784, 23000000.0, 383.3333333333333),
 (u'MIN', 1525031.650385604, 23000000.0, 383.3333333333333),
 (u'BOS', 2692113.8568075118, 22500000.0, 375.0),
 (u'SFN', 2044198.6841463414, 22250000.0, 370.8333333333333),
 (u'TEX', 1874651.551143201, 22000000.0, 366.6666666666667),
 (u'SEA', 1932288.913202934, 20557143.0, 342.61905),
 (u'CHA', 1992653.5012315272, 17000000.0, 340.0)]

	Unnamed: 0	SepalLength	SepalWidth	PetalLength	PetalWidth	Species
0	1	5.1	3.5	1.4	0.2	setosa
1	2	4.9	3.0	1.4	0.2	setosa
2	3	4.7	3.2	1.3	0.2	setosa
3	4	4.6	3.1	1.5	0.2	setosa
4	5	5.0	3.6	1.4	0.2	setosa

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4
5	25.29	4.71	Male	No	Sun	Dinner	4
6	8.77	2.00	Male	No	Sun	Dinner	2
7	26.88	3.12	Male	No	Sun	Dinner	4
8	15.04	1.96	Male	No	Sun	Dinner	2
9	14.78	3.23	Male	No	Sun	Dinner	2
10	10.27	1.71	Male	No	Sun	Dinner	2
11	35.26	5.00	Female	No	Sun	Dinner	4
12	15.42	1.57	Male	No	Sun	Dinner	2
13	18.43	3.00	Male	No	Sun	Dinner	4
14	14.83	3.02	Female	No	Sun	Dinner	2
15	21.58	3.92	Male	No	Sun	Dinner	2
16	10.33	1.67	Female	No	Sun	Dinner	3
17	16.29	3.71	Male	No	Sun	Dinner	3
18	16.97	3.50	Female	No	Sun	Dinner	3
19	20.65	3.35	Male	No	Sat	Dinner	3
20	17.92	4.08	Male	No	Sat	Dinner	2
21	20.29	2.75	Female	No	Sat	Dinner	2
22	15.77	2.23	Female	No	Sat	Dinner	2
23	39.42	7.58	Male	No	Sat	Dinner	4
24	19.82	3.18	Male	No	Sat	Dinner	2
25	17.81	2.34	Male	No	Sat	Dinner	4
26	13.37	2.00	Male	No	Sat	Dinner	2
27	12.69	2.00	Male	No	Sat	Dinner	2
28	21.70	4.30	Male	No	Sat	Dinner	2
29	19.65	3.00	Female	No	Sat	Dinner	2
...	...	...	...	...	...	...	...
214	28.17	6.50	Female	Yes	Sat	Dinner	3
215	12.90	1.10	Female	Yes	Sat	Dinner	2
216	28.15	3.00	Male	Yes	Sat	Dinner	5
217	11.59	1.50	Male	Yes	Sat	Dinner	2
218	7.74	1.44	Male	Yes	Sat	Dinner	2
219	30.14	3.09	Female	Yes	Sat	Dinner	4
220	12.16	2.20	Male	Yes	Fri	Lunch	2
221	13.42	3.48	Female	Yes	Fri	Lunch	2
222	8.58	1.92	Male	Yes	Fri	Lunch	1
223	15.98	3.00	Female	No	Fri	Lunch	3
224	13.42	1.58	Male	Yes	Fri	Lunch	2
225	16.27	2.50	Female	Yes	Fri	Lunch	2
226	10.09	2.00	Female	Yes	Fri	Lunch	2
227	20.45	3.00	Male	No	Sat	Dinner	4
228	13.28	2.72	Male	No	Sat	Dinner	2
229	22.12	2.88	Female	Yes	Sat	Dinner	2
230	24.01	2.00	Male	Yes	Sat	Dinner	4
231	15.69	3.00	Male	Yes	Sat	Dinner	3
232	11.61	3.39	Male	No	Sat	Dinner	2
233	10.77	1.47	Male	No	Sat	Dinner	2
234	15.53	3.00	Male	Yes	Sat	Dinner	2
235	10.07	1.25	Male	No	Sat	Dinner	2
236	12.60	1.00	Male	Yes	Sat	Dinner	2
237	32.83	1.17	Male	Yes	Sat	Dinner	2
238	35.83	4.67	Female	No	Sat	Dinner	3
239	29.03	5.92	Male	No	Sat	Dinner	3
240	27.18	2.00	Female	Yes	Sat	Dinner	2
241	22.67	2.00	Male	Yes	Sat	Dinner	2
242	17.82	1.75	Male	No	Sat	Dinner	2
243	18.78	3.00	Female	No	Thur	Dinner	2

	Species	SepalLength
0	versicolor	7.0
1	versicolor	6.4
2	versicolor	6.9
3	versicolor	5.5
4	versicolor	6.5
5	versicolor	5.7
6	versicolor	6.3
7	versicolor	4.9
8	versicolor	6.6
9	versicolor	5.2
10	versicolor	5.0

Introduction to Blaze

Install

Goal: Accessible, Interactive, Analytic Queries

Databases and Python

odo (formerly into)

Examples

Exercise

URI Strings

Examples

Exercise

How it works

Remote Data

S3

Blaze

Pandas example

Blaze example

Foreign Data

SQL

Examples

Work happens on the database

Exercises

Example: Split-apply-combine

Store Results

Exercise: Storage

`odo` (formerly `into`)