Binary Classification

This notebook closely follows the data analysis framework tutorial from Kaggle. Thanks to the author LD Freeman for creating such a great tutorial! My goal is to create an Apache Spark version using the same framework.

Create Spark entry points


In [1]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

In [2]:
sc = SparkContext(conf=SparkConf())
spark = SparkSession(sparkContext=sc)
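
Side note: the same entry points can also be obtained through the builder API, which creates a session or reuses one that is already running. A minimal equivalent sketch (the app name is arbitrary):

# equivalent entry point via the builder API; getOrCreate() reuses a running session if present
spark = SparkSession.builder.appName('titanic-binary-classification').getOrCreate()
sc = spark.sparkContext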

Step 1: define the problem

What sorts of people were likely to survive the sinking of the Titanic?

Step 2: gather the data

The datasets can be found at https://www.kaggle.com/c/titanic/data. They are also available in this GitHub repository.

Step 3: prepare data for consumption

3.1 Import libraries

3.11 Import Python packages


In [3]:
# load packages
import sys
print('Python version: {}'.format(sys.version))

import pandas as pd
print('pandas version: {}'.format(pd.__version__))

import matplotlib
print('matplotlib version: {}'.format(matplotlib.__version__))

import numpy as np
print('numpy version: {}'.format(np.__version__))

import scipy as sp
print('scipy version: {}'.format(sp.__version__))

import IPython
from IPython import display # pretty printing of DataFrames in Jupyter notebooks
print('IPython version: {}'.format(IPython.__version__))

import pyspark
print('Apache Spark Pyspark version: {}'.format(pyspark.__version__)) # pyspark version

# misc libraries
import random
import time

# ignore warnings
import warnings
warnings.filterwarnings('ignore')
print('-'*25)


Python version: 3.6.4 |Anaconda custom (64-bit)| (default, Dec 21 2017, 21:42:08) 
[GCC 7.2.0]
pandas version: 0.20.3
matplotlib version: 2.1.0
numpy version: 1.13.3
scipy version: 1.0.0
IPython version: 6.1.0
Apache Spark Pyspark version: 2.2.1
-------------------------

3.12 Import PySpark models for binary classification


In [4]:
from pyspark.ml.classification import LinearSVC
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.classification import OneVsRest
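
All of these classifiers follow the same Estimator/Transformer pattern: fit on a DataFrame that has a vector column of features and a numeric label column, then transform to obtain predictions. A minimal usage sketch (train_df and test_df are placeholders for DataFrames assembled in later sections):

# hypothetical usage pattern; assumes train_df/test_df with 'features' and 'label' columns
lr = LogisticRegression(featuresCol='features', labelCol='label', maxIter=10)
lr_model = lr.fit(train_df)
predictions = lr_model.transform(test_df)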

3.2 Meet and greet data


In [5]:
from subprocess import check_output
print('-'*10, 'datasets', '-'*10)
print(check_output(['ls', 'data/titanic']).decode('utf8'))


---------- datasets ----------
gender_submission.csv
test.csv
train.csv


In [6]:
# import data
# we will split the train data into train and test data in future sections
data_raw = spark.read.csv('data/titanic/train.csv', inferSchema=True, header=True)

# the test file provided is for validation of the final model.
data_val = spark.read.csv('data/titanic/test.csv', inferSchema=True, header=True)

# preview the data
# data type
print('-'*10, 'data types', '-'*10)
pd.DataFrame(data_raw.dtypes)


---------- data types ----------
Out[6]:
0 1
0 PassengerId int
1 Survived int
2 Pclass int
3 Name string
4 Sex string
5 Age double
6 SibSp int
7 Parch int
8 Ticket string
9 Fare double
10 Cabin string
11 Embarked string
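
If the schema is known up front, the extra pass over the file that inferSchema makes can be avoided by supplying an explicit schema. A minimal sketch built from the inferred types above (nullability is assumed):

from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType

# explicit schema matching the inferred column types above
titanic_schema = StructType([
    StructField('PassengerId', IntegerType(), True),
    StructField('Survived', IntegerType(), True),
    StructField('Pclass', IntegerType(), True),
    StructField('Name', StringType(), True),
    StructField('Sex', StringType(), True),
    StructField('Age', DoubleType(), True),
    StructField('SibSp', IntegerType(), True),
    StructField('Parch', IntegerType(), True),
    StructField('Ticket', StringType(), True),
    StructField('Fare', DoubleType(), True),
    StructField('Cabin', StringType(), True),
    StructField('Embarked', StringType(), True),
])

# data_raw = spark.read.csv('data/titanic/train.csv', schema=titanic_schema, header=True)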

In [7]:
# data summary
print('-'*10, 'data summary', '-'*10)
data_raw.describe().toPandas()


---------- data summary ----------
Out[7]:
summary PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 count 891 891 891 891 891 714 891 891 891 891 204 889
1 mean 446.0 0.3838383838383838 2.308641975308642 None None 29.69911764705882 0.5230078563411896 0.38159371492704824 260318.54916792738 32.2042079685746 None None
2 stddev 257.3538420152301 0.48659245426485753 0.8360712409770491 None None 14.526497332334035 1.1027434322934315 0.8060572211299488 471609.26868834975 49.69342859718089 None None
3 min 1 0 1 "Andersson, Mr. August Edvard (""Wennerstrom"")" female 0.42 0 0 110152 0.0 A10 C
4 max 891 1 3 van Melkebeke, Mr. Philemon male 80.0 8 6 WE/P 5735 512.3292 T S

In [8]:
# view a small subset of the data
print('-'*10, 'randomly sample 1% of the data to view', '-'*10)
data_raw.randomSplit([0.01, 0.99])[0].toPandas()


---------- randomly sample 1% of the data to view ----------
Out[8]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 71 0 2 Jenkin, Mr. Stephen Curnow male 32.0 0 0 C.A. 33111 10.5000 None S
1 80 1 3 Dowdell, Miss. Elizabeth female 30.0 0 0 364516 12.4750 None S
2 133 0 3 Robins, Mrs. Alexander A (Grace Charity Laury) female 47.0 1 0 A/5. 3337 14.5000 None S
3 138 0 1 Futrelle, Mr. Jacques Heath male 37.0 1 0 113803 53.1000 C123 S
4 193 1 3 Andersen-Jensen, Miss. Carla Christine Nielsine female 19.0 1 0 350046 7.8542 None S
5 459 1 2 Toomey, Miss. Ellen female 50.0 0 0 F.C.C. 13531 10.5000 None S
6 526 0 3 Farrell, Mr. James male 40.5 0 0 367232 7.7500 None Q
7 774 0 3 Elias, Mr. Dibo male NaN 0 0 2674 7.2250 None C
8 777 0 3 Tobin, Mr. Roger male NaN 0 0 383121 7.7500 F38 Q
9 784 0 3 Johnston, Mr. Andrew G male NaN 1 2 W./C. 6607 23.4500 None S
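
As an aside, a similar random peek can be taken with DataFrame.sample in a single call; a minimal sketch (fraction and seed are arbitrary):

# roughly 1% of rows; the fraction is approximate, not exact
data_raw.sample(withReplacement=False, fraction=0.01, seed=42).toPandas()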

3.21 The 4 C's of data cleaning: Correcting, Completing, Creating, and Converting


In [9]:
# we first check which values are NULL in each column
# then we convert the boolean result to int (0 and 1) so we can count how many 1's exist in each column.
print('-'*25)
print('0: is not NULL')
print('1: is NULL')
print('-'*25)
print(' '*25)
# we build column strings and then use eval() to convert strings to column expressions.
data_raw.select([eval('data_raw.' + x + '.isNull().cast("int").alias("' + x + '")') for x in data_raw.columns]).show(n=10)


-------------------------
0: is not NULL
1: is NULL
-------------------------
                         
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    1|       0|
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    0|       0|
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    1|       0|
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    0|       0|
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    1|       0|
|          0|       0|     0|   0|  0|  1|    0|    0|     0|   0|    1|       0|
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    0|       0|
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    1|       0|
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    1|       0|
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    1|       0|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
only showing top 10 rows


In [10]:
print('Train columns with null values:')
print('-'*25)
data_raw.select([eval('data_raw.' + x + '.isNull().cast("int").alias("' + x + '")') for x in data_raw.columns]).\
    groupBy().sum().toPandas()


Train columns with null values:
-------------------------
Out[10]:
sum(PassengerId) sum(Survived) sum(Pclass) sum(Name) sum(Sex) sum(Age) sum(SibSp) sum(Parch) sum(Ticket) sum(Fare) sum(Cabin) sum(Embarked)
0 0 0 0 0 0 177 0 0 0 0 687 2

In [11]:
print('Test columns with null values:')
print('-'*25)
data_val.select([eval('data_val.' + x + '.isNull().cast("int").alias("' + x + '")') for x in data_val.columns]).\
    groupBy().sum().toPandas()


Test columns with null values:
-------------------------
Out[11]:
sum(PassengerId) sum(Pclass) sum(Name) sum(Sex) sum(Age) sum(SibSp) sum(Parch) sum(Ticket) sum(Fare) sum(Cabin) sum(Embarked)
0 0 0 0 0 86 0 0 0 1 327 0
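
For reference, the same per-column null counts can be computed without building strings for eval(), using pyspark.sql.functions directly; a minimal equivalent sketch:

from pyspark.sql import functions as F

# sum the 0/1 null indicator of every column in a single aggregation
data_raw.agg(*[F.sum(F.col(c).isNull().cast('int')).alias(c) for c in data_raw.columns]).toPandas()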

3.22 Clean data

COMPLETE


In [23]:
# COMPLETE: complete or delete missing values in the train and test/validation datasets.

# complete missing age with median

# complete missing embarked with mode

# complete missing fare with median
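
A minimal sketch of how these completion steps could look, assuming the fill values are computed from the training data only (the 0.01 relative error passed to approxQuantile is arbitrary):

# median age from the non-null training values (approximate quantile)
age_median = data_raw.dropna(subset=['Age']).approxQuantile('Age', [0.5], 0.01)[0]

# mode of Embarked: the most frequent non-null value
embarked_mode = (data_raw.dropna(subset=['Embarked'])
                 .groupBy('Embarked').count()
                 .orderBy('count', ascending=False)
                 .first()['Embarked'])

# median fare (only the validation set has a missing Fare)
fare_median = data_raw.approxQuantile('Fare', [0.5], 0.01)[0]

# fill missing values in both datasets
data_raw = data_raw.na.fill({'Age': age_median, 'Embarked': embarked_mode})
data_val = data_val.na.fill({'Age': age_median, 'Embarked': embarked_mode, 'Fare': fare_median})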

In [24]:
data_raw.select('Age')


Out[24]:
DataFrame[Age: double]

In [ ]: