Binary Classification

This notebook closely follows the data analysis framework tutorial from Kaggle. Thanks to the author LD Freeman for creating such a great tutorial! My goal is to create an Apache Spark version using the same framework.

Create Spark entry points


In [1]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

In [2]:
sc = SparkContext(conf=SparkConf())
spark = SparkSession(sparkContext=sc)
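
Side note: the same entry points can also be obtained through the builder API, which creates a session or reuses one that is already running. A minimal equivalent sketch (the app name is arbitrary):

# equivalent entry point via the builder API; getOrCreate() reuses a running session if present
spark = SparkSession.builder.appName('titanic-binary-classification').getOrCreate()
sc = spark.sparkContext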

Step 1: define the problem

What sorts of people were likely to survive the sinking of the Titanic?

Step 2: gather the data

The datasets can be found at https://www.kaggle.com/c/titanic/data. They are also available in this GitHub repository.

Step 3: prepare data for consumption

3.1 Import libraries

3.11 Import Python packages


In [3]:
# load packages
import sys
print('Python version: {}'.format(sys.version))

import pandas as pd
print('pandas version: {}'.format(pd.__version__))

import matplotlib
print('matplotlib version: {}'.format(matplotlib.__version__))

import numpy as np
print('numpy version: {}'.format(np.__version__))

import scipy as sp
print('scipy version: {}'.format(sp.__version__))

import IPython
from IPython import display # pretty printing of DataFrames in Jupyter notebooks
print('IPython version: {}'.format(IPython.__version__))

import pyspark
print('Apache Spark Pyspark version: {}'.format(pyspark.__version__)) # pyspark version

# misc libraries
import random
import time

# ignore warnings
import warnings
warnings.filterwarnings('ignore')
print('-'*25)


Python version: 3.6.4 |Anaconda custom (64-bit)| (default, Dec 21 2017, 21:42:08) 
[GCC 7.2.0]
pandas version: 0.20.3
matplotlib version: 2.1.0
numpy version: 1.13.3
scipy version: 1.0.0
IPython version: 6.1.0
Apache Spark Pyspark version: 2.2.1
-------------------------

3.12 Import PySpark models for binary classification


In [4]:
from pyspark.ml.classification import LinearSVC
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.classification import OneVsRest
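
All of these classifiers follow the same Estimator/Transformer pattern: fit on a DataFrame that has a vector column of features and a numeric label column, then transform to obtain predictions. A minimal usage sketch (train_df and test_df are placeholders for DataFrames assembled in later sections):

# hypothetical usage pattern; assumes train_df/test_df with 'features' and 'label' columns
lr = LogisticRegression(featuresCol='features', labelCol='label', maxIter=10)
lr_model = lr.fit(train_df)
predictions = lr_model.transform(test_df)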

3.2 Meet and greet data


In [5]:
from subprocess import check_output
print('-'*10, 'datasets', '-'*10)
print(check_output(['ls', 'data/titanic']).decode('utf8'))


---------- datasets ----------
gender_submission.csv
test.csv
train.csv


In [6]:
# import data
# we will split the train data into train and test data in future sections
data_raw = spark.read.csv('data/titanic/train.csv', inferSchema=True, header=True)

# the test file provided is for validation of the final model.
data_val = spark.read.csv('data/titanic/test.csv', inferSchema=True, header=True)

# preview the data
# data type
print('-'*10, 'data types', '-'*10)
pd.DataFrame(data_raw.dtypes)


---------- data types ----------
Out[6]:
0 1
0 PassengerId int
1 Survived int
2 Pclass int
3 Name string
4 Sex string
5 Age double
6 SibSp int
7 Parch int
8 Ticket string
9 Fare double
10 Cabin string
11 Embarked string
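
If the schema is known up front, the extra pass over the file that inferSchema makes can be avoided by supplying an explicit schema. A minimal sketch built from the inferred types above (nullability is assumed):

from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType

# explicit schema matching the inferred column types above
titanic_schema = StructType([
    StructField('PassengerId', IntegerType(), True),
    StructField('Survived', IntegerType(), True),
    StructField('Pclass', IntegerType(), True),
    StructField('Name', StringType(), True),
    StructField('Sex', StringType(), True),
    StructField('Age', DoubleType(), True),
    StructField('SibSp', IntegerType(), True),
    StructField('Parch', IntegerType(), True),
    StructField('Ticket', StringType(), True),
    StructField('Fare', DoubleType(), True),
    StructField('Cabin', StringType(), True),
    StructField('Embarked', StringType(), True),
])

# data_raw = spark.read.csv('data/titanic/train.csv', schema=titanic_schema, header=True)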

In [7]:
# data summary
print('-'*10, 'data summary', '-'*10)
data_raw.describe().toPandas()


---------- data summary ----------
Out[7]:
summary PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 count 891 891 891 891 891 714 891 891 891 891 204 889
1 mean 446.0 0.3838383838383838 2.308641975308642 None None 29.69911764705882 0.5230078563411896 0.38159371492704824 260318.54916792738 32.2042079685746 None None
2 stddev 257.3538420152301 0.48659245426485753 0.8360712409770491 None None 14.526497332334035 1.1027434322934315 0.8060572211299488 471609.26868834975 49.69342859718089 None None
3 min 1 0 1 "Andersson, Mr. August Edvard (""Wennerstrom"")" female 0.42 0 0 110152 0.0 A10 C
4 max 891 1 3 van Melkebeke, Mr. Philemon male 80.0 8 6 WE/P 5735 512.3292 T S

In [8]:
# view a small subset of the data
print('-'*10, 'randomly sample 1% of the data to view', '-'*10)
data_raw.randomSplit([0.01, 0.99])[0].toPandas()


---------- randomly sample 1% of the data to view ----------
Out[8]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 71 0 2 Jenkin, Mr. Stephen Curnow male 32.0 0 0 C.A. 33111 10.5000 None S
1 80 1 3 Dowdell, Miss. Elizabeth female 30.0 0 0 364516 12.4750 None S
2 133 0 3 Robins, Mrs. Alexander A (Grace Charity Laury) female 47.0 1 0 A/5. 3337 14.5000 None S
3 138 0 1 Futrelle, Mr. Jacques Heath male 37.0 1 0 113803 53.1000 C123 S
4 193 1 3 Andersen-Jensen, Miss. Carla Christine Nielsine female 19.0 1 0 350046 7.8542 None S
5 459 1 2 Toomey, Miss. Ellen female 50.0 0 0 F.C.C. 13531 10.5000 None S
6 526 0 3 Farrell, Mr. James male 40.5 0 0 367232 7.7500 None Q
7 774 0 3 Elias, Mr. Dibo male NaN 0 0 2674 7.2250 None C
8 777 0 3 Tobin, Mr. Roger male NaN 0 0 383121 7.7500 F38 Q
9 784 0 3 Johnston, Mr. Andrew G male NaN 1 2 W./C. 6607 23.4500 None S
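
As an aside, a similar random peek can be taken with DataFrame.sample in a single call; a minimal sketch (fraction and seed are arbitrary):

# roughly 1% of rows; the fraction is approximate, not exact
data_raw.sample(withReplacement=False, fraction=0.01, seed=42).toPandas()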

3.21 The 4 C's of data cleaning: Correcting, Completing, Creating, and Converting


In [9]:
# we first check which values are NULL in each column
# then we convert the boolean result to int (0 and 1) so we can count how many 1's exist in each column.
print('-'*25)
print('0: is not NULL')
print('1: is NULL')
print('-'*25)
print(' '*25)
# we build column strings and then use eval() to convert strings to column expressions.
data_raw.select([eval('data_raw.' + x + '.isNull().cast("int").alias("' + x + '")') for x in data_raw.columns]).show(n=10)


-------------------------
0: is not NULL
1: is NULL
-------------------------
                         
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    1|       0|
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    0|       0|
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    1|       0|
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    0|       0|
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    1|       0|
|          0|       0|     0|   0|  0|  1|    0|    0|     0|   0|    1|       0|
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    0|       0|
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    1|       0|
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    1|       0|
|          0|       0|     0|   0|  0|  0|    0|    0|     0|   0|    1|       0|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
only showing top 10 rows


In [10]:
print('Train columns with null values:')
print('-'*25)
data_raw.select([eval('data_raw.' + x + '.isNull().cast("int").alias("' + x + '")') for x in data_raw.columns]).\
    groupBy().sum().toPandas()


Train columns with null values:
-------------------------
Out[10]:
sum(PassengerId) sum(Survived) sum(Pclass) sum(Name) sum(Sex) sum(Age) sum(SibSp) sum(Parch) sum(Ticket) sum(Fare) sum(Cabin) sum(Embarked)
0 0 0 0 0 0 177 0 0 0 0 687 2

In [11]:
print('Test columns with null values:')
print('-'*25)
data_val.select([eval('data_val.' + x + '.isNull().cast("int").alias("' + x + '")') for x in data_val.columns]).\
    groupBy().sum().toPandas()


Test columns with null values:
-------------------------
Out[11]:
sum(PassengerId) sum(Pclass) sum(Name) sum(Sex) sum(Age) sum(SibSp) sum(Parch) sum(Ticket) sum(Fare) sum(Cabin) sum(Embarked)
0 0 0 0 0 86 0 0 0 1 327 0
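
For reference, the same per-column null counts can be computed without building strings for eval(), using pyspark.sql.functions directly; a minimal equivalent sketch:

from pyspark.sql import functions as F

# sum the 0/1 null indicator of every column in a single aggregation
data_raw.agg(*[F.sum(F.col(c).isNull().cast('int')).alias(c) for c in data_raw.columns]).toPandas()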

3.22 Clean data

COMPLETE


In [23]:
# COMPLETE: complete or delete missing values in the train and test/validation datasets.

# complete missing age with median

# complete missing embarked with mode

# complete missing fare with median
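
A minimal sketch of how these completion steps could look, assuming the fill values are computed from the training data only (the 0.01 relative error passed to approxQuantile is arbitrary):

# median age from the non-null training values (approximate quantile)
age_median = data_raw.dropna(subset=['Age']).approxQuantile('Age', [0.5], 0.01)[0]

# mode of Embarked: the most frequent non-null value
embarked_mode = (data_raw.dropna(subset=['Embarked'])
                 .groupBy('Embarked').count()
                 .orderBy('count', ascending=False)
                 .first()['Embarked'])

# median fare (only the validation set has a missing Fare)
fare_median = data_raw.approxQuantile('Fare', [0.5], 0.01)[0]

# fill missing values in both datasets
data_raw = data_raw.na.fill({'Age': age_median, 'Embarked': embarked_mode})
data_val = data_val.na.fill({'Age': age_median, 'Embarked': embarked_mode, 'Fare': fare_median})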

In [24]:
data_raw.select('Age')


Out[24]:
DataFrame[Age: double]

In [ ]: