Blaze - A Quick Tour

Blaze provides a lightweight interface on top of pre-existing computational infrastructure. This notebook gives a quick overview of how Blaze interacts with a variety of data types.


In [1]:
from blaze import Data, by, compute

Blaze wraps pre-existing data

Blaze interacts with normal Python objects. Operations on Blaze Data objects create expression trees.

These expressions deliver an intuitive numpy/pandas-like feel.
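To make the idea of an expression tree concrete, here is a toy sketch in plain Python. The classes `Expr`, `Literal`, `Add`, `Mul` and the function `evaluate` are hypothetical names invented for this illustration; they are not Blaze's implementation. The point is only that arithmetic on such an object can build a tree instead of computing a value, with evaluation as a separate, later step.

```python
# Toy illustration only: these classes are hypothetical, not Blaze's own.

class Expr(object):
    """Base class: arithmetic builds a new tree node instead of a value."""

    def __add__(self, other):
        return Add(self, wrap(other))

    def __mul__(self, other):
        return Mul(self, wrap(other))


class Literal(Expr):
    """Leaf node holding an ordinary Python value."""

    def __init__(self, value):
        self.value = value


class Add(Expr):
    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs


class Mul(Expr):
    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs


def wrap(obj):
    """Turn plain Python values into leaf nodes; leave expressions alone."""
    return obj if isinstance(obj, Expr) else Literal(obj)


def evaluate(expr):
    """Walk the tree and produce a concrete Python value."""
    if isinstance(expr, Literal):
        return expr.value
    if isinstance(expr, Add):
        return evaluate(expr.lhs) + evaluate(expr.rhs)
    if isinstance(expr, Mul):
        return evaluate(expr.lhs) * evaluate(expr.rhs)
    raise TypeError("unknown node: %r" % (expr,))


x = Literal(1)
tree = x + 1              # builds an Add node; nothing is computed yet
result = evaluate(tree)   # the actual arithmetic happens here
print(type(tree).__name__, result)
```

Separating the description of a computation from its execution is what lets the same expression be handed to different backends.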


In [2]:
x = Data(1)
x


Out[2]:
1

In [3]:
x.dshape


Out[3]:
dshape("int64")

In [4]:
x + 1


Out[4]:
2

In [5]:
print(type(x + 1))           # the unevaluated expression
print(type(compute(x + 1)))  # the concrete result


<class 'blaze.expr.arithmetic.Add'>
<type 'int'>

Lists

Starting small, Blaze interacts happily with collections of data.

It uses Pandas for pretty notebook printing.


In [6]:
x = Data([1, 2, 3, 4, 5])
x


Out[6]:
_2
0 1
1 2
2 3
3 4
4 5

In [7]:
x[x > 2] * 10


Out[7]:
_2
0 30
1 40
2 50

In [8]:
x.dshape


Out[8]:
dshape("5 * int64")

Or Tabular, Pandas-like datasets

Slightly more exciting: Blaze also operates on tabular data.


In [9]:
L = [[1, 'Alice',   100],
     [2, 'Bob',    -200],
     [3, 'Charlie', 300],
     [4, 'Dennis',  400],
     [5, 'Edith',  -500]]

In [10]:
x = Data(L, fields=['id', 'name', 'amount'])
x.dshape


Out[10]:
dshape("5 * {id: int64, name: string, amount: int64}")
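As a reading aid for the datashape string above, the following toy sketch builds a string of the same form from a list of rows. The helper `describe` and the `TYPE_NAMES` mapping are hypothetical and far simpler than real datashape discovery (they only look at the first row and know three Python types), so treat this as an illustration of the notation, not of how Blaze infers it.

```python
# Hypothetical helper, for illustrating the "N * {field: type, ...}" notation.
TYPE_NAMES = {int: 'int64', float: 'float64', str: 'string'}


def describe(rows, fields):
    """Return 'N * {field: type, ...}' using the types found in the first row."""
    first = rows[0]
    record = ', '.join('%s: %s' % (field, TYPE_NAMES[type(value)])
                       for field, value in zip(fields, first))
    return '%d * {%s}' % (len(rows), record)


rows = [[1, 'Alice',   100],
        [2, 'Bob',    -200],
        [3, 'Charlie', 300],
        [4, 'Dennis',  400],
        [5, 'Edith',  -500]]

shape = describe(rows, ['id', 'name', 'amount'])
print(shape)
```

The leading `5 *` is the number of rows, and the braces describe the record type of each row.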

In [11]:
x


Out[11]:
id name amount
0 1 Alice 100
1 2 Bob -200
2 3 Charlie 300
3 4 Dennis 400
4 5 Edith -500

In [12]:
deadbeats = x[x.amount < 0].name
deadbeats


Out[12]:
name
0 Bob
1 Edith

Or it can even just drive pandas

Blaze doesn't do the work itself; it tells other systems to do the work.

In the previous example, Blaze told Python which for-loops to write. In this example, it calls the right functions in Pandas.

The user experience is identical; only performance differs.
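For the list-backed example, the selection `x[x.amount < 0].name` corresponds roughly to a loop like the one below. This is a hand-written sketch of equivalent logic, not the code Blaze actually generates, and the variable names are illustrative.

```python
rows = [[1, 'Alice',   100],
        [2, 'Bob',    -200],
        [3, 'Charlie', 300],
        [4, 'Dennis',  400],
        [5, 'Edith',  -500]]
fields = ['id', 'name', 'amount']

# Look up column positions once, outside the loop.
name_idx = fields.index('name')
amount_idx = fields.index('amount')

deadbeat_names = []
for row in rows:
    if row[amount_idx] < 0:                   # the filter: amount < 0
        deadbeat_names.append(row[name_idx])  # the projection: .name

print(deadbeat_names)
```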


In [13]:
from pandas import DataFrame

df = DataFrame([[1, 'Alice',   100],                         
                [2, 'Bob',    -200],
                [3, 'Charlie', 300],
                [4, 'Denis',   400],
                [5, 'Edith',  -500]], columns=['id', 'name', 'amount'])

In [14]:
df


Out[14]:
id name amount
0 1 Alice 100
1 2 Bob -200
2 3 Charlie 300
3 4 Denis 400
4 5 Edith -500

In [15]:
x = Data(df)
x


Out[15]:
id name amount
0 1 Alice 100
1 2 Bob -200
2 3 Charlie 300
3 4 Denis 400
4 5 Edith -500

In [16]:
deadbeats = x[x.amount < 0].name
deadbeats


Out[16]:
name
1 Bob
4 Edith

Calling compute shows that Blaze returns the same kind of object it was given; here, a pandas Series.


In [17]:
type(compute(deadbeats))


Out[17]:
pandas.core.series.Series
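One way to picture "same expression, different backend, result in the backend's own form" is type-based dispatch. The sketch below uses the standard library's `functools.singledispatch` as a stand-in; it does not show Blaze's actual dispatch machinery, and the function `negative_names` and both toy backends (row lists and column dicts) are assumptions made up for this example.

```python
from functools import singledispatch


@singledispatch
def negative_names(data):
    """Names of accounts with a negative amount; fallback for unknown types."""
    raise NotImplementedError("no backend for %s" % type(data).__name__)


@negative_names.register(list)
def _(rows):
    # Row-oriented backend: a list of (id, name, amount) rows; returns a list.
    return [name for _id, name, amount in rows if amount < 0]


@negative_names.register(dict)
def _(columns):
    # Column-oriented backend: a dict of column name -> values; returns a dict.
    pairs = zip(columns['name'], columns['amount'])
    return {'name': [name for name, amount in pairs if amount < 0]}


rows = [(1, 'Alice', 100), (2, 'Bob', -200), (5, 'Edith', -500)]
columns = {'name': ['Alice', 'Bob', 'Edith'], 'amount': [100, -200, -500]}

from_rows = negative_names(rows)
from_columns = negative_names(columns)
print(from_rows)
print(from_columns)
```

The caller writes the same call in both cases, and each backend answers in its own container type.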

Other data types like SQLAlchemy Tables

Blaze extends beyond Python and Pandas (that's the main motivation).

Here it drives SQLAlchemy.


In [18]:
from sqlalchemy import Table, Column, MetaData, Integer, String, create_engine

tab = Table('bank', MetaData(),
            Column('id', Integer),
            Column('name', String),
            Column('amount', Integer))

In [19]:
x = Data(tab)
x.dshape


Out[19]:
dshape("var * {id: ?int32, name: ?string, amount: ?int32}")

Just as computations on pandas objects produce pandas objects, computations on SQLAlchemy tables produce SQLAlchemy Select statements.


In [20]:
deadbeats = x[x.amount < 0].name
compute(deadbeats)


Out[20]:
<sqlalchemy.sql.selectable.Select at 0x7f2543f2fc10; Select object>

In [21]:
print(compute(deadbeats))  # SQLAlchemy generates actual SQL


SELECT bank.name 
FROM bank 
WHERE bank.amount < :amount_1
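To see what the generated statement does once its `:amount_1` parameter is bound, the sketch below runs the same SQL text against a throwaway in-memory SQLite database using only the standard library's `sqlite3` module. The table contents are copied from the earlier tabular example; binding `amount_1` to 0 is an assumption that mirrors the `amount < 0` filter.

```python
import sqlite3

# Throwaway in-memory database with the same columns as the 'bank' table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE bank (id INTEGER, name TEXT, amount INTEGER)')
conn.executemany('INSERT INTO bank VALUES (?, ?, ?)',
                 [(1, 'Alice',   100),
                  (2, 'Bob',    -200),
                  (3, 'Charlie', 300),
                  (4, 'Dennis',  400),
                  (5, 'Edith',  -500)])

# Same SQL text as printed above; :amount_1 is a named bind parameter.
query = """
SELECT bank.name
FROM bank
WHERE bank.amount < :amount_1
"""

# The query has no ORDER BY, so sort in Python for a deterministic result.
names = sorted(row[0] for row in conn.execute(query, {'amount_1': 0}))
print(names)
conn.close()
```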

Connect to a real database

When we drive a SQLAlchemy table that is connected to a database, we get actual computation.


In [22]:
engine = create_engine('sqlite:////home/mrocklin/workspace/blaze/blaze/examples/data/iris.db')

In [23]:
x = Data(engine)
x


Out[23]:
Data: Engine(sqlite:////home/mrocklin/workspace/blaze/blaze/examples/data/iris.db)
DataShape: {
  iris: var * {
    sepal_length: ?float64,
    sepal_width: ?float64,
    petal_length: ?float64,
    petal_width: ?float64,
    species: ?string
  ...

In [24]:
x.iris


Out[24]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa

In [25]:
by(x.iris.species, shortest=x.iris.sepal_length.min(), 
                    longest=x.iris.sepal_length.max())


Out[25]:
species longest shortest
0 Iris-setosa 5.8 4.3
1 Iris-versicolor 7.0 4.9
2 Iris-virginica 7.9 4.9
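The `by` expression above is a split-apply-combine: group rows by species, then reduce each group with `min` and `max`. The plain-Python sketch below shows the same logic on a handful of rows. The setosa values are taken from the table printed earlier; the two versicolor rows are made up for illustration, so the numbers do not reproduce the real output.

```python
# (sepal_length, species) pairs.
rows = [
    (5.1, 'Iris-setosa'),      # setosa values from the table shown earlier
    (4.9, 'Iris-setosa'),
    (4.4, 'Iris-setosa'),
    (5.4, 'Iris-setosa'),
    (6.1, 'Iris-versicolor'),  # made-up rows, for illustration only
    (5.6, 'Iris-versicolor'),
]

# species -> (shortest, longest), updated one row at a time.
summary = {}
for sepal_length, species in rows:
    shortest, longest = summary.get(species, (sepal_length, sepal_length))
    summary[species] = (min(shortest, sepal_length),
                        max(longest, sepal_length))

for species in sorted(summary):
    print(species, summary[species])
```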

Use URI strings to ease access

Often just figuring out how to produce the relevant Python object can be a challenge.

Blaze supports many formats of URI strings.


In [26]:
x = Data('sqlite:////home/mrocklin/workspace/blaze/blaze/examples/data/iris.db::iris')
x


Out[26]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa
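The URI above has two parts joined by `::`: a database location and a table name. A minimal sketch of splitting such a string is shown below. The helper `split_uri` and the example path `/tmp/iris.db` are hypothetical; this is not how Blaze resolves URIs internally, only a way to read the convention.

```python
def split_uri(uri):
    """Split 'resource::name' into (resource, name); name is None if absent."""
    if '::' in uri:
        # rpartition splits on the last '::', leaving the resource part intact.
        resource, _, name = uri.rpartition('::')
        return resource, name
    return uri, None


with_table = split_uri('sqlite:////tmp/iris.db::iris')
without_table = split_uri('sqlite:////tmp/iris.db')
print(with_table)
print(without_table)
```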

Once you have SQL, might as well go big


In [27]:
x = Data('impala://ec2-54-90-201-28.compute-1.amazonaws.com')

MongoDB

GitHub's database is mirrored in a Mongo collection hosted in the Netherlands.

This example connects via an SSH tunnel. See http://ghtorrent.org/ to obtain access.


In [28]:
users = Data('mongodb://ghtorrentro:ghtorrentro@localhost/github::users')
users


Out[28]:
avatar_url bio blog company created_at email followers following gravatar_id hireable html_url id location login name public_gists public_repos type url
0 https://secure.gravatar.com/avatar/a7e55f31bb4... None None None 2012-05-04T13:59:54Z None 0 0 a7e55f31bb45321f30211e901cd89ffa None https://github.com/Michaelwussler 1706010 None Michaelwussler None 0 3 User https://api.github.com/users/Michaelwussler
1 https://secure.gravatar.com/avatar/eb8139078bc... None None None 2012-05-03T18:47:13Z None 0 0 eb8139078bc623dee103ed3917c080dc None https://github.com/praiser 1703505 None praiser None 0 3 User https://api.github.com/users/praiser
2 https://secure.gravatar.com/avatar/13c7b665e0c... None 2010-04-07T12:15:00Z vad.viktor@gmail.com 2 3 13c7b665e0cbd94e0155387c35957d13 False https://github.com/vadviktor 238703 Budapest vadviktor Vad Viktor 0 10 User https://api.github.com/users/vadviktor
3 https://secure.gravatar.com/avatar/b7937805411... None Appcelerator 2012-04-02T16:13:58Z yjin@appcelerator.com 0 0 b7937805411d278ceb839175e251e2a0 False https://github.com/ypjin 1598831 Beijing ypjin Yuping 0 5 User https://api.github.com/users/ypjin
4 https://secure.gravatar.com/avatar/89e109fca84... http://blogs.perl.org/users/steven_haryanto - 2010-02-26T01:28:09Z stevenharyanto@gmail.com 39 307 89e109fca8474e5636c9feef7a8422ea False https://github.com/sharyanto 211084 Jakarta, Indonesia sharyanto Steven Haryanto 5 195 User https://api.github.com/users/sharyanto
5 https://secure.gravatar.com/avatar/7490b4e3e9c... Perl, C, C++, JavaScript, PHP, Haskell, Ruby, ... http://c9s.me 2009-02-01T15:20:08Z cornelius.howl@gmail.com 330 599 7490b4e3e9cb85a1f7dc0c8ea01a86e5 True https://github.com/c9s 50894 Taipei, Taiwan c9s Yo-An Lin 281 206 User https://api.github.com/users/c9s
6 https://secure.gravatar.com/avatar/dc078ac4dbd... None azhari.harahap.us CapungRiders 2010-10-31T05:53:40Z azhari@harahap.us 26 11 dc078ac4dbdc06d3e3c0ec0b6801b53d False https://github.com/back2arie 461397 Indonesia back2arie Azhari Harahap 1 15 User https://api.github.com/users/back2arie
7 https://secure.gravatar.com/avatar/fb844ffed6c... Git Ninja and language-agnostic problem solver... http://dukeleto.pl Leto Labs LLC 2008-10-22T03:02:15Z jonathan@leto.net 175 635 fb844ffed6c5a2e69638627e3b721308 True https://github.com/leto 30298 Portland, OR leto Jonathan "Duke" Leto 276 112 User https://api.github.com/users/leto
8 https://secure.gravatar.com/avatar/3843ec7861e... http://alanhaggai.org/ Thought Ripples 2009-01-13T16:25:15Z haggai@cpan.org 46 365 3843ec7861e271e803ea076035d683dd False https://github.com/alanhaggai 46288 IN alanhaggai Alan Haggai Alavi 4 54 User https://api.github.com/users/alanhaggai
9 https://secure.gravatar.com/avatar/f611628c558... None arisdottle.net Team Rooster Pirates 2009-05-12T19:29:09Z amiri@roosterpirates.com 16 87 f611628c5588f7a0a72c65ec1f94dfb8 False https://github.com/amiri 83806 Los Angeles, CA amiri Amiri Barksdale 16 18 User https://api.github.com/users/amiri
10 https://secure.gravatar.com/avatar/c57483c5cfe... None http://www.geekfarm.org/wu/muse/WebHome.html None 2009-02-08T03:28:54Z git-c@geekfarm.org 16 87 c57483c5cfe159b98a6e33ee7e9eec38 False https://github.com/wu 52700 None wu Alex White 0 15 User https://api.github.com/users/wu

Handle NumPy-like computations


In [29]:
import h5py
f = h5py.File('/home/mrocklin/Downloads/OMI-Aura_L2-OMAERO_2014m1105t2304-o54838_v003-2014m1106t215558.he5')

In [30]:
x = Data(f)
x.dshape


Out[30]:
dshape("""{
  HDFEOS: {
    ADDITIONAL: {FILE_ATTRIBUTES: {}},
    SWATHS: {
      ColumnAmountAerosol: {
        Data Fields: {
          AerosolIndexUV: 1643 * 60 * int16,
          AerosolIndexVIS: 1643 * 60 * int16,
          AerosolModelMW: 1643 * 60 * uint16,
          AerosolModelsPassedThreshold: 1643 * 60 * 10 * uint16,
          AerosolOpticalThicknessMW: 1643 * 60 * 14 * int16,
          AerosolOpticalThicknessMWPrecision: 1643 * 60 * int16,
          AerosolOpticalThicknessNUV: 1643 * 60 * 2 * int16,
          AerosolOpticalThicknessPassedThreshold: 1643 * 60 * 10 * 9 * int16,
          AerosolOpticalThicknessPassedThresholdMean: 1643 * 60 * 9 * int16,
          AerosolOpticalThicknessPassedThresholdStd: 1643 * 60 * 9 * int16,
          CloudFlags: 1643 * 60 * uint8,
          CloudPressure: 1643 * 60 * int16,
          EffectiveCloudFraction: 1643 * 60 * int8,
          InstrumentConfigurationId: 1643 * uint8,
          MeasurementQualityFlags: 1643 * uint8,
          NumberOfModelsPassedThreshold: 1643 * 60 * uint8,
          ProcessingQualityFlagsMW: 1643 * 60 * uint16,
          ProcessingQualityFlagsNUV: 1643 * 60 * uint16,
          RootMeanSquareErrorOfFitPassedThreshold: 1643 * 60 * 10 * int16,
          SingleScatteringAlbedoMW: 1643 * 60 * 14 * int16,
          SingleScatteringAlbedoMWPrecision: 1643 * 60 * int16,
          SingleScatteringAlbedoNUV: 1643 * 60 * 2 * int16,
          SingleScatteringAlbedoPassedThreshold: 1643 * 60 * 10 * 9 * int16,
          SingleScatteringAlbedoPassedThresholdMean: 1643 * 60 * 9 * int16,
          SingleScatteringAlbedoPassedThresholdStd: 1643 * 60 * 9 * int16,
          SmallPixelRadiancePointerUV: 1643 * 2 * int16,
          SmallPixelRadiancePointerVIS: 1643 * 2 * int16,
          SmallPixelRadianceUV: 6783 * 60 * float32,
          SmallPixelRadianceVIS: 6786 * 60 * float32,
          SmallPixelWavelengthUV: 6783 * 60 * uint16,
          SmallPixelWavelengthVIS: 6786 * 60 * uint16,
          TerrainPressure: 1643 * 60 * int16,
          TerrainReflectivity: 1643 * 60 * 9 * int16,
          XTrackQualityFlags: 1643 * 60 * uint8
          },
        Geolocation Fields: {
          GroundPixelQualityFlags: 1643 * 60 * uint16,
          Latitude: 1643 * 60 * float32,
          Longitude: 1643 * 60 * float32,
          OrbitPhase: 1643 * float32,
          SolarAzimuthAngle: 1643 * 60 * float32,
          SolarZenithAngle: 1643 * 60 * float32,
          SpacecraftAltitude: 1643 * float32,
          SpacecraftLatitude: 1643 * float32,
          SpacecraftLongitude: 1643 * float32,
          TerrainHeight: 1643 * 60 * int16,
          Time: 1643 * float64,
          ViewingAzimuthAngle: 1643 * 60 * float32,
          ViewingZenithAngle: 1643 * 60 * float32
          }
        }
      }
    },
  HDFEOS INFORMATION: {
    ArchiveMetadata.0: string[65535, 'A'],
    CoreMetadata.0: string[65535, 'A'],
    StructMetadata.0: string[32000, 'A']
    }
  }""")

In [31]:
x.HDFEOS.SWATHS.ColumnAmountAerosol.Data_Fields.CloudPressure


Out[31]:
array([[-32767, -32767, -32767, ..., -32767, -32767, -32767],
       [-32767, -32767, -32767, ..., -32767, -32767, -32767],
       [-32767, -32767, -32767, ..., -32767, -32767, -32767],
       ...,
       [-32767, -32767, -32767, ..., -32767, -32767, -32767],
       [-32767, -32767, -32767, ..., -32767, -32767, -32767],
       [-32767, -32767, -32767, ..., -32767, -32767, -32767]], dtype=int16)

In [32]:
x.HDFEOS.SWATHS.ColumnAmountAerosol.Data_Fields.CloudPressure.max()


Out[32]:
1013