This notebook builds on the HMDA dataset introduced in the following blog post: http://continuum.io/blog/blaze-hmda
In [1]:
import blaze
from blaze import Table, into
In [2]:
blaze.__version__
Out[2]:
In [3]:
hmda = Table('mongodb://localhost/db::hmda')
hmda
Out[3]:
In [4]:
import pyspark
sc = pyspark.SparkContext('local', 'test-app')
In [5]:
rdd = into(sc, hmda.head(10000))
In [6]:
rdd.take(3)
Out[6]:
Grab the distinct elements of the first column.
In [7]:
rdd.map(lambda x: x[0]).distinct()
Out[7]:
That returned an RDD. Let's bring it into local memory with .collect().
In [8]:
rdd.map(lambda x: x[0]).distinct().collect()
Out[8]:
Now let's use blaze.into to write those results to a file.
In [9]:
into('myfile.csv', rdd.map(lambda x: x[0]).distinct().collect())
Out[9]:
In [10]:
!head myfile.csv
In [11]:
t = Table(rdd, columns=hmda.columns)
In [12]:
t
Out[12]:
We can easily inspect the table, just like we would in pandas.
In [13]:
t.action_taken_name
Out[13]:
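These column expressions compose much like pandas expressions. Below is a minimal sketch of a couple more queries against the same Spark-backed table, assuming the head, selection, and count operations used elsewhere in this notebook; the value 'Loan originated' is purely illustrative and not read from the output above.

# Peek at the first few values of the column, pandas-style
t.action_taken_name.head(5)

# Boolean row selection followed by a reduction
# ('Loan originated' is an illustrative value, not taken from the data shown above)
t[t.action_taken_name == 'Loan originated'].action_taken_name.count()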
All of the data and metadata movement between MongoDB, Spark, and local memory is handled for us, giving the user a natural interactive experience.
In [14]:
t.action_taken_name.distinct()
Out[14]:
In [15]:
into(list, t.action_taken_name.distinct())
Out[15]:
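The same into dispatching lets us push these results wherever we like. Here is a minimal sketch that mirrors the earlier into('myfile.csv', ...) call, collecting the distinct values into a list and then writing them out; 'actions.csv' is just an illustrative filename.

# Collect the distinct values into local memory, then write them to a CSV,
# exactly as we did with the raw RDD results above
actions = into(list, t.action_taken_name.distinct())
into('actions.csv', actions)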