Playing with Spark

This notebook builds on the HMDA dataset introduced in the following blog post: http://continuum.io/blog/blaze-hmda



In [1]:

    
import blaze
from blaze import Table, into



In [2]:

    
blaze.__version__









    Out[2]:





'0.6.5'

Load HMDA data from a local Mongo database



In [3]:

    
hmda = Table('mongodb://localhost/db::hmda')
hmda









    Out[3]:





  
    
      
    action_taken_name                  agency_abbr  applicant_ethnicity_name  applicant_race_name_1      applicant_sex_name  county_name       loan_purpose_name  state_abbr
0   Loan originated                    HUD          Not Hispanic or Latino    White                      Male                Will County       Refinancing        IL
1   Loan originated                    NCUA         Not Hispanic or Latino    White                      Male                Midland County    Refinancing        MI
2   Loan purchased by the institution  CFPB         Not applicable            Not applicable             Not applicable      Benton County     Refinancing        AR
3   Loan purchased by the institution  CFPB         Not Hispanic or Latino    White                      Female              Ramsey County     Refinancing        MN
4   Loan originated                    FDIC         Not Hispanic or Latino    White                      Male                Allen County      Home improvement   IN
5   Loan originated                    HUD          Not Hispanic or Latino    White                      Male                Cook County       Refinancing        IL
6   Loan originated                    HUD          Not Hispanic or Latino    Black or African American  Male                Calcasieu Parish  Home purchase      LA
7   Loan originated                    HUD          Not Hispanic or Latino    White                      Male                Grand County      Refinancing        CO
8   Loan originated                    FDIC         Not Hispanic or Latino    White                      Female              Allen County      Refinancing        IN
9   Loan originated                    CFPB         Not Hispanic or Latino    White                      Male                Talbot County     Refinancing        MD
10  Loan originated                    HUD          Not Hispanic or Latino    White                      Male                Calcasieu Parish  Home purchase      LA

Create a local PySpark instance

The master argument ('local' here) could instead point to a cluster.



In [4]:

    
import pyspark
sc = pyspark.SparkContext('local', 'test-app')
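As a rough sketch of what "pointing to a cluster" could look like, the helpers below build the master URL strings that Spark accepts. The host name in the usage note is hypothetical, and `standalone_master` and `make_context` are illustrative names, not part of PySpark or Blaze.

```python
# Master URL strings understood by Spark.
LOCAL_SINGLE_THREAD = 'local'   # what this notebook uses
LOCAL_ALL_CORES = 'local[*]'    # one worker thread per available core


def standalone_master(host, port=7077):
    """Build a master URL for a Spark standalone cluster.

    7077 is the standalone master's default port; `host` is whichever
    machine runs your master.
    """
    return 'spark://{0}:{1}'.format(host, port)


def make_context(master=LOCAL_SINGLE_THREAD, app_name='test-app'):
    """Create a SparkContext for the given master URL."""
    import pyspark  # imported lazily so the helpers above work without Spark
    return pyspark.SparkContext(master, app_name)
```

For example, `make_context(standalone_master('spark-master.example.com'))` would try to connect to a standalone master on that (made-up) host. The rest of the notebook would stay the same.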

Playing with PySpark

Move 10000 rows from Mongo to Spark
Find the possible actions taken (first column)
Put this result into a CSV file



In [5]:

    
rdd = into(sc, hmda.head(10000))



In [6]:

    
rdd.take(3)









    Out[6]:





[(u'Loan originated',
  u'HUD',
  u'Not Hispanic or Latino',
  u'White',
  u'Male',
  u'Will County',
  u'Refinancing',
  u'IL'),
 (u'Loan originated',
  u'NCUA',
  u'Not Hispanic or Latino',
  u'White',
  u'Male',
  u'Midland County',
  u'Refinancing',
  u'MI'),
 (u'Loan purchased by the institution',
  u'CFPB',
  u'Not applicable',
  u'Not applicable',
  u'Not applicable',
  u'Benton County',
  u'Refinancing',
  u'AR')]

Grab distinct elements of the first column



In [7]:

    
rdd.map(lambda x: x[0]).distinct()









    Out[7]:





PythonRDD[7] at RDD at PythonRDD.scala:43

That returned an RDD rather than the values themselves. Let's bring the result into local memory with .collect()



In [8]:

    
rdd.map(lambda x: x[0]).distinct().collect()









    Out[8]:





[u'Loan originated',
 u'Application denied by financial institution',
 u'Application approved but not accepted',
 u'Loan purchased by the institution',
 u'Application withdrawn by applicant',
 u'File closed for incompleteness']
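In [7] returned an RDD while In [8] returned a list. The difference can be illustrated without Spark: transformations such as map and distinct only describe work, and an action such as collect runs it. The `LazySeq` class below is a toy stand-in written for this illustration. It is not how Spark is implemented, and unlike Spark's distinct, which makes no ordering promise, it keeps first-seen order.

```python
class LazySeq(object):
    """Toy stand-in for an RDD: remembers steps, runs them on collect()."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, func):
        # Transformation: record the step and return a new LazySeq.
        return LazySeq(self._source, self._steps + (('map', func),))

    def distinct(self):
        # Transformation: also only recorded, not executed.
        return LazySeq(self._source, self._steps + (('distinct', None),))

    def collect(self):
        # Action: replay every recorded step and return a plain list.
        items = list(self._source)
        for kind, func in self._steps:
            if kind == 'map':
                items = [func(item) for item in items]
            else:  # 'distinct', keeping first-seen order
                seen = set()
                unique = []
                for item in items:
                    if item not in seen:
                        seen.add(item)
                        unique.append(item)
                items = unique
        return items


rows = [
    ('Loan originated', 'HUD'),
    ('Loan originated', 'NCUA'),
    ('Loan purchased by the institution', 'CFPB'),
]
pending = LazySeq(rows).map(lambda x: x[0]).distinct()  # nothing computed yet
actions = pending.collect()                             # now the work happens
```

Here `pending` plays the role of the PythonRDD shown in Out[7], and `actions` the role of the list in Out[8].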

Now let's use blaze.into to write those results to a file.



In [9]:

    
into('myfile.csv', rdd.map(lambda x: x[0]).distinct().collect())









    Out[9]:





<blaze.data.csv.CSV at 0x7f8648810290>



In [10]:

    
!head myfile.csv
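For comparison, a minimal standard-library sketch of the same step, written for Python 3, puts each distinct value on its own CSV row and reads the file back. It writes no header row and uses a temporary directory. Both choices are assumptions made for the sketch and are not taken from what Blaze produced above.

```python
import csv
import os
import tempfile

# The six distinct values from Out[8].
actions = [
    'Loan originated',
    'Application denied by financial institution',
    'Application approved but not accepted',
    'Loan purchased by the institution',
    'Application withdrawn by applicant',
    'File closed for incompleteness',
]

# Write one value per row into a scratch copy of myfile.csv.
path = os.path.join(tempfile.mkdtemp(), 'myfile.csv')
with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    for action in actions:
        writer.writerow([action])

# Read the file back to confirm the round trip.
with open(path, newline='') as f:
    round_trip = [row[0] for row in csv.reader(f)]
```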

Blaze drives PySpark

We now do exactly the same work, but drive it with Blaze. The computation is the same; only the interface is different.

Wrap a Table around the RDD
Find the possible actions taken
Put this result into a variety of formats



In [11]:

    
t = Table(rdd, columns=hmda.columns)
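To show what attaching column names to an RDD of plain tuples buys us, here is a toy wrapper that maps attribute access onto tuple positions. `MiniTable` is a hypothetical class written for this sketch. It says nothing about how Blaze's Table is implemented, and it works on an in-memory list rather than an RDD.

```python
class MiniTable(object):
    """Toy wrapper pairing row tuples with column names (not Blaze's Table)."""

    def __init__(self, rows, columns):
        self.rows = list(rows)
        self.columns = list(columns)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so `rows` and
        # `columns` themselves are never routed through here once set.
        columns = self.__dict__.get('columns', [])
        if name not in columns:
            raise AttributeError(name)
        index = columns.index(name)
        return [row[index] for row in self.rows]


rows = [
    ('Loan originated', 'HUD', 'IL'),
    ('Loan originated', 'NCUA', 'MI'),
    ('Loan purchased by the institution', 'CFPB', 'AR'),
]
mini = MiniTable(rows, columns=['action_taken_name', 'agency_abbr', 'state_abbr'])
agencies = mini.agency_abbr
```

With names attached, `mini.agency_abbr` replaces the positional `lambda x: x[1]` style used on the raw RDD earlier.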



In [12]:

    
t









    Out[12]:





  
    
      
    action_taken_name                  agency_abbr  applicant_ethnicity_name  applicant_race_name_1      applicant_sex_name  county_name       loan_purpose_name  state_abbr
0   Loan originated                    HUD          Not Hispanic or Latino    White                      Male                Will County       Refinancing        IL
1   Loan originated                    NCUA         Not Hispanic or Latino    White                      Male                Midland County    Refinancing        MI
2   Loan purchased by the institution  CFPB         Not applicable            Not applicable             Not applicable      Benton County     Refinancing        AR
3   Loan purchased by the institution  CFPB         Not Hispanic or Latino    White                      Female              Ramsey County     Refinancing        MN
4   Loan originated                    FDIC         Not Hispanic or Latino    White                      Male                Allen County      Home improvement   IN
5   Loan originated                    HUD          Not Hispanic or Latino    White                      Male                Cook County       Refinancing        IL
6   Loan originated                    HUD          Not Hispanic or Latino    Black or African American  Male                Calcasieu Parish  Home purchase      LA
7   Loan originated                    HUD          Not Hispanic or Latino    White                      Male                Grand County      Refinancing        CO
8   Loan originated                    FDIC         Not Hispanic or Latino    White                      Female              Allen County      Refinancing        IN
9   Loan originated                    CFPB         Not Hispanic or Latino    White                      Male                Talbot County     Refinancing        MD
10  Loan originated                    HUD          Not Hispanic or Latino    White                      Male                Calcasieu Parish  Home purchase      LA

We can easily inspect the table, just like we would in pandas.



In [13]:

    
t.action_taken_name









    Out[13]:





  
    
      
    action_taken_name
0   Loan originated
1   Loan originated
2   Loan purchased by the institution
3   Loan purchased by the institution
4   Loan originated
5   Loan originated
6   Loan originated
7   Loan originated
8   Loan originated
9   Loan originated
10  Loan originated

All of the (meta)data movement is handled, giving the user a natural interactive experience.



In [14]:

    
t.action_taken_name.distinct()









    Out[14]:





  
    
      
    action_taken_name
0   Loan originated
1   Application denied by financial institution
2   Application approved but not accepted
3   Loan purchased by the institution
4   Application withdrawn by applicant
5   File closed for incompleteness



In [15]:

    
into(list, t.action_taken_name.distinct())









    Out[15]:





[u'Loan originated',
 u'Application denied by financial institution',
 u'Application approved but not accepted',
 u'Loan purchased by the institution',
 u'Application withdrawn by applicant',
 u'File closed for incompleteness']
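The calling convention used throughout this notebook is into(target, source): target first, source second. One way to picture such a function is as a lookup from (target type, source type) pairs to converter functions. The sketch below is only a toy illustrating that convention. The names `toy_into` and `register` are made up here, it accepts only types as targets (Blaze's into also took a SparkContext and a filename above), and it is not a description of Blaze's internals.

```python
_converters = {}


def register(target_type, source_type):
    """Decorator that records a converter for one (target, source) pair."""
    def decorator(func):
        _converters[(target_type, source_type)] = func
        return func
    return decorator


def toy_into(target, source):
    """Convert `source` into the container type `target`."""
    for (target_type, source_type), func in _converters.items():
        if target is target_type and isinstance(source, source_type):
            return func(source)
    raise NotImplementedError(
        'no converter from {0} into {1!r}'.format(type(source).__name__, target))


@register(list, tuple)
def tuple_to_list(source):
    return list(source)


@register(set, list)
def list_to_set(source):
    return set(source)


as_list = toy_into(list, ('Loan originated', 'Loan originated'))
as_set = toy_into(set, as_list)
```

Supporting a new pair of containers then means registering one more converter, without changing the call sites.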

Main Points

Blaze provides a lightweight wrapper around PySpark, giving a familiar interface to a powerful platform.
