Playing with Spark

This notebook builds on the HMDA dataset introduced in the following blog post: http://continuum.io/blog/blaze-hmda


In [1]:
import blaze
from blaze import Table, into

In [2]:
blaze.__version__


Out[2]:
'0.6.5'

Load HMDA data from local Mongo database


In [3]:
hmda = Table('mongodb://localhost/db::hmda')
hmda


Out[3]:
action_taken_name agency_abbr applicant_ethnicity_name applicant_race_name_1 applicant_sex_name county_name loan_purpose_name state_abbr
0 Loan originated HUD Not Hispanic or Latino White Male Will County Refinancing IL
1 Loan originated NCUA Not Hispanic or Latino White Male Midland County Refinancing MI
2 Loan purchased by the institution CFPB Not applicable Not applicable Not applicable Benton County Refinancing AR
3 Loan purchased by the institution CFPB Not Hispanic or Latino White Female Ramsey County Refinancing MN
4 Loan originated FDIC Not Hispanic or Latino White Male Allen County Home improvement IN
5 Loan originated HUD Not Hispanic or Latino White Male Cook County Refinancing IL
6 Loan originated HUD Not Hispanic or Latino Black or African American Male Calcasieu Parish Home purchase LA
7 Loan originated HUD Not Hispanic or Latino White Male Grand County Refinancing CO
8 Loan originated FDIC Not Hispanic or Latino White Female Allen County Refinancing IN
9 Loan originated CFPB Not Hispanic or Latino White Male Talbot County Refinancing MD
10 Loan originated HUD Not Hispanic or Latino White Male Calcasieu Parish Home purchase LA

Create a local PySpark instance

Here the context runs locally; the master argument could instead point to a cluster.


In [4]:
import pyspark
sc = pyspark.SparkContext('local', 'test-app')
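
The first argument to SparkContext is the master URL. As a minimal sketch of how it might be switched between local mode and a standalone cluster, the helper below builds that string. The helper name and the host are hypothetical; 7077 is the default port of a Spark standalone master.

```python
def spark_master_url(host=None, port=7077):
    """Return a Spark master URL: 'local' or 'spark://HOST:PORT'."""
    if host is None:
        # Local mode, as used in the cell above.
        return 'local'
    # Standalone cluster master; the host is supplied by the caller.
    return 'spark://{0}:{1}'.format(host, port)


local_url = spark_master_url()
cluster_url = spark_master_url('master.example.com')  # hypothetical host

# Hypothetical usage (needs pyspark and a running master):
# sc = pyspark.SparkContext(cluster_url, 'test-app')
```

Only the string changes; the rest of the notebook would stay the same.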

Playing with PySpark

  1. Move 10000 rows from Mongo to Spark
  2. Find the possible actions taken (first column)
  3. Put this result into a csv file

In [5]:
rdd = into(sc, hmda.head(10000))

In [6]:
rdd.take(3)


Out[6]:
[(u'Loan originated',
  u'HUD',
  u'Not Hispanic or Latino',
  u'White',
  u'Male',
  u'Will County',
  u'Refinancing',
  u'IL'),
 (u'Loan originated',
  u'NCUA',
  u'Not Hispanic or Latino',
  u'White',
  u'Male',
  u'Midland County',
  u'Refinancing',
  u'MI'),
 (u'Loan purchased by the institution',
  u'CFPB',
  u'Not applicable',
  u'Not applicable',
  u'Not applicable',
  u'Benton County',
  u'Refinancing',
  u'AR')]

Grab the distinct elements of the first column


In [7]:
rdd.map(lambda x: x[0]).distinct()


Out[7]:
PythonRDD[7] at RDD at PythonRDD.scala:43

That returned an RDD. Let's bring it into local memory with .collect()


In [8]:
rdd.map(lambda x: x[0]).distinct().collect()


Out[8]:
[u'Loan originated',
 u'Application denied by financial institution',
 u'Application approved but not accepted',
 u'Loan purchased by the institution',
 u'Application withdrawn by applicant',
 u'File closed for incompleteness']
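
The difference between cells 7 and 8 is that map and distinct only describe work, while collect runs it. The sketch below is a rough pure-Python analogy using generators, not PySpark itself; the sample rows and the helper names are made up for illustration. Unlike this sketch, Spark's distinct does not promise any particular output order.

```python
rows = [
    ('Loan originated', 'HUD'),
    ('Loan originated', 'NCUA'),
    ('Loan purchased by the institution', 'CFPB'),
]

calls = []  # records each time the mapped function actually runs


def first_column(row):
    calls.append(row)
    return row[0]


def distinct(items):
    """Lazily yield each item the first time it is seen."""
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item


# Building the pipeline runs nothing yet, like the RDD in Out[7].
pipeline = distinct(first_column(row) for row in rows)
calls_before = len(calls)

# Consuming the pipeline plays the role of .collect().
result = list(pipeline)
calls_after = len(calls)
```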

Now let's use blaze.into to write those results to a file.


In [9]:
into('myfile.csv', rdd.map(lambda x: x[0]).distinct().collect())


Out[9]:
<blaze.data.csv.CSV at 0x7f8648810290>

In [10]:
!head myfile.csv
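
For comparison, here is a minimal sketch of writing a list of strings as a one-column CSV file with only the standard library. It is not what blaze.into does internally, and whether Blaze's file has the same layout is not shown in this notebook. The sketch targets Python 3 (the notebook itself ran on Python 2) and writes to a temporary directory so it does not overwrite myfile.csv.

```python
import csv
import os
import tempfile

actions = [
    'Loan originated',
    'Application denied by financial institution',
    'File closed for incompleteness',
]

path = os.path.join(tempfile.mkdtemp(), 'myfile.csv')

# Write one single-column row per value.
with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    for action in actions:
        writer.writerow([action])

# Read the file back to check the round trip.
with open(path, newline='') as f:
    written = [row[0] for row in csv.reader(f)]
```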


Blaze drives PySpark

We do the exact same work, but now drive it through Blaze. The computation is the same; only the interface is different.

  1. Wrap a Table around the rdd
  2. Find the possible actions taken
  3. Put this result into another container (here, a Python list)
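
The idea that only the interface changes can be illustrated with a toy wrapper that maps column names to tuple positions. MiniTable and MiniColumn are hypothetical names invented for this sketch; it is not how Blaze is implemented, and it works on a plain list instead of an RDD.

```python
class MiniColumn(object):
    """One column of a row collection, addressed by position."""

    def __init__(self, rows, index):
        self.rows = rows
        self.index = index

    def distinct(self):
        # Same work as rdd.map(lambda x: x[i]).distinct(), done locally.
        seen = set()
        out = []
        for row in self.rows:
            value = row[self.index]
            if value not in seen:
                seen.add(value)
                out.append(value)
        return out


class MiniTable(object):
    """Gives name-based access to a list of tuples."""

    def __init__(self, rows, columns):
        self.rows = rows
        self.columns = list(columns)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        columns = self.__dict__.get('columns', ())
        if name not in columns:
            raise AttributeError(name)
        return MiniColumn(self.rows, columns.index(name))


rows = [
    ('Loan originated', 'HUD'),
    ('Loan originated', 'NCUA'),
    ('Loan purchased by the institution', 'CFPB'),
]
t = MiniTable(rows, columns=['action_taken_name', 'agency_abbr'])

by_position = sorted(set(row[0] for row in rows))  # positional style
by_name = sorted(t.action_taken_name.distinct())   # named-column style
```

Both styles run the same computation; the wrapper only translates a column name into an index.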

In [11]:
t = Table(rdd, columns=hmda.columns)

In [12]:
t


Out[12]:
action_taken_name agency_abbr applicant_ethnicity_name applicant_race_name_1 applicant_sex_name county_name loan_purpose_name state_abbr
0 Loan originated HUD Not Hispanic or Latino White Male Will County Refinancing IL
1 Loan originated NCUA Not Hispanic or Latino White Male Midland County Refinancing MI
2 Loan purchased by the institution CFPB Not applicable Not applicable Not applicable Benton County Refinancing AR
3 Loan purchased by the institution CFPB Not Hispanic or Latino White Female Ramsey County Refinancing MN
4 Loan originated FDIC Not Hispanic or Latino White Male Allen County Home improvement IN
5 Loan originated HUD Not Hispanic or Latino White Male Cook County Refinancing IL
6 Loan originated HUD Not Hispanic or Latino Black or African American Male Calcasieu Parish Home purchase LA
7 Loan originated HUD Not Hispanic or Latino White Male Grand County Refinancing CO
8 Loan originated FDIC Not Hispanic or Latino White Female Allen County Refinancing IN
9 Loan originated CFPB Not Hispanic or Latino White Male Talbot County Refinancing MD
10 Loan originated HUD Not Hispanic or Latino White Male Calcasieu Parish Home purchase LA

We can easily inspect the table, just like we would in pandas.


In [13]:
t.action_taken_name


Out[13]:
action_taken_name
0 Loan originated
1 Loan originated
2 Loan purchased by the institution
3 Loan purchased by the institution
4 Loan originated
5 Loan originated
6 Loan originated
7 Loan originated
8 Loan originated
9 Loan originated
10 Loan originated

Blaze handles all of the (meta)data movement, giving the user a natural interactive experience.


In [14]:
t.action_taken_name.distinct()


Out[14]:
action_taken_name
0 Loan originated
1 Application denied by financial institution
2 Application approved but not accepted
3 Loan purchased by the institution
4 Application withdrawn by applicant
5 File closed for incompleteness

In [15]:
into(list, t.action_taken_name.distinct())


Out[15]:
[u'Loan originated',
 u'Application denied by financial institution',
 u'Application approved but not accepted',
 u'Loan purchased by the institution',
 u'Application withdrawn by applicant',
 u'File closed for incompleteness']
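
The call into(list, ...) names the target container by its type. One way such a function could be organised is a registry of converters keyed on (target type, source type). The sketch below shows that pattern with hypothetical names; how Blaze actually implements into is not shown in this notebook.

```python
_converters = {}  # maps (target type, source type) to a function


def register(target, source):
    """Decorator that records a converter for one type pair."""
    def decorator(func):
        _converters[(target, source)] = func
        return func
    return decorator


def convert(target, data):
    """Convert data into the target type using a registered converter."""
    for (tgt, src), func in _converters.items():
        if tgt is target and isinstance(data, src):
            return func(data)
    raise NotImplementedError(
        'no converter from %s to %s'
        % (type(data).__name__, target.__name__))


@register(list, tuple)
def tuple_to_list(data):
    return list(data)


@register(set, list)
def list_to_set(data):
    return set(data)


as_list = convert(list, ('Loan originated', 'Refinancing'))
as_set = convert(set, ['HUD', 'HUD', 'CFPB'])
```

Adding support for a new container then means registering one more function, without touching the callers.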

Main Points

Blaze provides a lightweight wrapper around PySpark, giving a familiar interface to a powerful platform.