<img src="images/continuum_analytics_logo.png" alt="Continuum Logo" align="right" width="30%">
In this 45-minute tutorial we'll learn how to use Blaze to discover, migrate, and query data living in other databases. The tutorial is organized as follows:
- `into` - Move data to a database
- `blaze` - Query data in a database
- `remote` - What if the data and database are on an HDFS-backed cluster?

NumPy and Pandas provide accessible, interactive, analytic queries; this is valuable.
In [1]:
import pandas as pd
df = pd.read_csv('iris.csv')
df.head()
Out[1]:
In [5]:
df.groupby(df.species).petal_length.mean() # Average petal length per species
Out[5]:
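If `iris.csv` is not at hand, the same group-by can be tried on a small in-memory frame. The rows below are made-up placeholders, not real iris measurements; only the column names `species` and `petal_length` come from the cell above.

```python
import pandas as pd

# Made-up stand-in rows (not real iris data), reusing the column names from above
df = pd.DataFrame({
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor'],
    'petal_length': [1.0, 2.0, 4.0, 5.0],
})

# Average petal length per species, same expression as in the cell above
means = df.groupby(df.species).petal_length.mean()
print(means)
```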
If your data fits on your computer then this is probably the way to go, and you can stop reading right now.
From now on, we're going to assume one of the following:
When in-memory arrays/dataframes cease to be an option, we turn to databases. These live outside of the Python process and so might be less convenient. The open source Python ecosystem includes libraries to interact with these databases and with foreign data in general.
Examples:

- sqlalchemy
- pyhive
- impyla
- redshift-sqlalchemy
- pymongo
- happybase
- pyspark
- paramiko
- pywebhdfs
- boto
Today we're going to use some of these indirectly with `into` and Blaze. We'll try to point out these libraries as we automate them so that, if you'd like, you can use them independently.
into
The Blaze and `into` projects give a consistent interface over many of the libraries above. They strive to trivialize common tasks.
- `into` moves data from place to place and from format to format
- `blaze` queries data in databases

We're going to start with `into`, learning how to migrate data between formats, between computers, and into databases. We'll then use Blaze to perform analytic queries on that data.
We'll eventually do things like this:
>>> from into import into
>>> into('hive://hostname/default::iris', 'iris.csv') # Move local data onto HDFS and register with Hive
>>> from blaze import Data, by
>>> db = Data('hive://hostname/default')
>>> by(db.species, avg=db.petal_length.mean())
...
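For a sense of what these two steps automate, here is a rough hand-written equivalent using only the standard library: `csv` to parse the file and `sqlite3` as a stand-in database. This is a sketch under assumptions, not a description of how `into` or Blaze work internally; the in-memory SQLite database, the table name `iris`, and the inline CSV rows are all made up for illustration.

```python
import csv
import io
import sqlite3

# Made-up CSV text standing in for iris.csv (not real measurements)
csv_text = """species,petal_length
setosa,1.0
setosa,2.0
versicolor,4.0
versicolor,5.0
"""

# Step 1: move the data into a database (here an in-memory SQLite database)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE iris (species TEXT, petal_length REAL)')
rows = [(r['species'], float(r['petal_length']))
        for r in csv.DictReader(io.StringIO(csv_text))]
conn.executemany('INSERT INTO iris VALUES (?, ?)', rows)
conn.commit()

# Step 2: query the data where it lives, averaging petal length per species
query = ('SELECT species, AVG(petal_length) AS avg_petal_length '
         'FROM iris GROUP BY species ORDER BY species')
result = conn.execute(query).fetchall()
print(result)
conn.close()
```

Even in this small case the schema, type conversion, and SQL all have to be written by hand, which is the kind of routine work the two projects aim to take over.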
Blaze and `into` make easy things trivial but are not a replacement for intimate knowledge of your database or of your problem. These projects are intended to enable non-expert users to do everyday tasks.