Titanic Dataset

In which we explore Disasters, Trees, Classification & the Kaggle Competition

  1. Visit Kaggle and download the data from http://www.kaggle.com/c/titanic-gettingStarted
  2. Read Titanic Data
  3. Transform and select features
  4. Create a simple model & Predict
  5. Submit to Kaggle & check out the leaderboard
  6. Decision Tree Model, Predict & Submit
  7. Random Forest Model, Predict & Submit
  8. Discussion

In [16]:
import datetime
from pytz import timezone
print "Last run @%s" % (datetime.datetime.now(timezone('US/Pacific')))
#
from pyspark.context import SparkContext
print "Running Spark Version %s" % (sc.version)
#
from pyspark.conf import SparkConf
conf = SparkConf()
print conf.toDebugString()


Last run @2015-12-17 21:42:57.865440-08:00
Running Spark Version 1.6.0
spark.app.name=PySparkShell
spark.files=file:/Users/ksankar/.ivy2/jars/com.databricks_spark-csv_2.10-1.3.0.jar,file:/Users/ksankar/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,file:/Users/ksankar/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar
spark.jars=file:/Users/ksankar/.ivy2/jars/com.databricks_spark-csv_2.10-1.3.0.jar,file:/Users/ksankar/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,file:/Users/ksankar/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar
spark.master=local[*]
spark.submit.deployMode=client
spark.submit.pyFiles=/Users/ksankar/.ivy2/jars/com.databricks_spark-csv_2.10-1.3.0.jar,/Users/ksankar/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,/Users/ksankar/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar

Read Titanic Data

The data is part of the Kaggle competition "Titanic: Machine Learning from Disaster".

Download data from http://www.kaggle.com/c/titanic-gettingStarted


In [17]:
# Read Train & Test Datasets
train = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('titanic-r/train.csv')
test = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('titanic-r/test.csv')

In [18]:
train.dtypes


Out[18]:
[('PassengerId', 'string'),
 ('Survived', 'string'),
 ('Pclass', 'string'),
 ('Name', 'string'),
 ('Sex', 'string'),
 ('Age', 'string'),
 ('SibSp', 'string'),
 ('Parch', 'string'),
 ('Ticket', 'string'),
 ('Fare', 'string'),
 ('Cabin', 'string'),
 ('Embarked', 'string')]

In [19]:
train.describe().show()


+-------+
|summary|
+-------+
|  count|
|   mean|
| stddev|
|    min|
|    max|
+-------+

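Note: describe() has nothing to summarize here because spark-csv loaded every column as a string (see the dtypes above). A hedged alternative, not run in this notebook, is to ask spark-csv to infer column types at load time, at the cost of an extra pass over the data:

train_typed = sqlContext.read.format('com.databricks.spark.csv')\
    .options(header='true', inferSchema='true')\
    .load('titanic-r/train.csv')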

In [20]:
train.show(2)


+-----------+--------+------+--------------------+------+---+-----+-----+---------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex|Age|SibSp|Parch|   Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+---+-----+-----+---------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male| 22|    1|    0|A/5 21171|   7.25|     |       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female| 38|    1|    0| PC 17599|71.2833|  C85|       C|
+-----------+--------+------+--------------------+------+---+-----+-----+---------+-------+-----+--------+
only showing top 2 rows


In [21]:
import pyspark.sql.functions as F
# Cast the numeric columns and recode Sex as a 0/1 Gender flag (female = 1)
train_1 = train.select(train['PassengerId'], 
                 train['Survived'].cast("integer").alias("Survived"),
                 train['Pclass'].cast("integer").alias("Pclass"),
                 F.when(train['Sex'] == 'female', 1).otherwise(0).alias("Gender"), 
                 train['Age'].cast("integer").alias("Age"),
                 train['SibSp'].cast("integer").alias("SibSp"),
                 train['Parch'].cast("integer").alias("Parch"),
                 train['Fare'].cast("float").alias("Fare"))

In [22]:
train.count()


Out[22]:
891

In [23]:
train_1.count()


Out[23]:
891

In [24]:
train_1.show(2)


+-----------+--------+------+------+---+-----+-----+-------+
|PassengerId|Survived|Pclass|Gender|Age|SibSp|Parch|   Fare|
+-----------+--------+------+------+---+-----+-----+-------+
|          1|       0|     3|     0| 22|    1|    0|   7.25|
|          2|       1|     1|     1| 38|    1|    0|71.2833|
+-----------+--------+------+------+---+-----+-----+-------+
only showing top 2 rows


In [25]:
train_1.describe().show()


+-------+-------------------+------------------+-------------------+------------------+------------------+-------------------+-----------------+
|summary|           Survived|            Pclass|             Gender|               Age|             SibSp|              Parch|             Fare|
+-------+-------------------+------------------+-------------------+------------------+------------------+-------------------+-----------------+
|  count|                891|               891|                891|               714|               891|                891|              891|
|   mean| 0.3838383838383838| 2.308641975308642|0.35241301907968575|29.712885154061624|0.5230078563411896|0.38159371492704824|32.20420804114722|
| stddev|0.48659245426485737|0.8360712409770491| 0.4779900708960981|14.529273128376586| 1.102743432293432| 0.8060572211299486|49.69342916316157|
|    min|                  0|                 1|                  0|                 0|                 0|                  0|              0.0|
|    max|                  1|                 3|                  1|                80|                 8|                  6|         512.3292|
+-------+-------------------+------------------+-------------------+------------------+------------------+-------------------+-----------------+
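
Note that the Age count is 714 against 891 rows, so 177 ages are missing; the next cells deal with those nulls.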


In [26]:
# Replace null Age values with 30
# First, do we have nulls?
train_1.filter(train_1['Age'].isNull()).show(40)


+-----------+--------+------+------+----+-----+-----+--------+
|PassengerId|Survived|Pclass|Gender| Age|SibSp|Parch|    Fare|
+-----------+--------+------+------+----+-----+-----+--------+
|          6|       0|     3|     0|null|    0|    0|  8.4583|
|         18|       1|     2|     0|null|    0|    0|    13.0|
|         20|       1|     3|     1|null|    0|    0|   7.225|
|         27|       0|     3|     0|null|    0|    0|   7.225|
|         29|       1|     3|     1|null|    0|    0|  7.8792|
|         30|       0|     3|     0|null|    0|    0|  7.8958|
|         32|       1|     1|     1|null|    1|    0|146.5208|
|         33|       1|     3|     1|null|    0|    0|    7.75|
|         37|       1|     3|     0|null|    0|    0|  7.2292|
|         43|       0|     3|     0|null|    0|    0|  7.8958|
|         46|       0|     3|     0|null|    0|    0|    8.05|
|         47|       0|     3|     0|null|    1|    0|    15.5|
|         48|       1|     3|     1|null|    0|    0|    7.75|
|         49|       0|     3|     0|null|    2|    0| 21.6792|
|         56|       1|     1|     0|null|    0|    0|    35.5|
|         65|       0|     1|     0|null|    0|    0| 27.7208|
|         66|       1|     3|     0|null|    1|    1| 15.2458|
|         77|       0|     3|     0|null|    0|    0|  7.8958|
|         78|       0|     3|     0|null|    0|    0|    8.05|
|         83|       1|     3|     1|null|    0|    0|  7.7875|
|         88|       0|     3|     0|null|    0|    0|    8.05|
|         96|       0|     3|     0|null|    0|    0|    8.05|
|        102|       0|     3|     0|null|    0|    0|  7.8958|
|        108|       1|     3|     0|null|    0|    0|   7.775|
|        110|       1|     3|     1|null|    1|    0|   24.15|
|        122|       0|     3|     0|null|    0|    0|    8.05|
|        127|       0|     3|     0|null|    0|    0|    7.75|
|        129|       1|     3|     1|null|    1|    1| 22.3583|
|        141|       0|     3|     1|null|    0|    2| 15.2458|
|        155|       0|     3|     0|null|    0|    0|  7.3125|
|        159|       0|     3|     0|null|    0|    0|  8.6625|
|        160|       0|     3|     0|null|    8|    2|   69.55|
|        167|       1|     1|     1|null|    0|    1|    55.0|
|        169|       0|     1|     0|null|    0|    0|  25.925|
|        177|       0|     3|     0|null|    3|    1| 25.4667|
|        181|       0|     3|     1|null|    8|    2|   69.55|
|        182|       0|     2|     0|null|    0|    0|   15.05|
|        186|       0|     1|     0|null|    0|    0|    50.0|
|        187|       1|     3|     1|null|    1|    0|    15.5|
|        197|       0|     3|     0|null|    0|    0|    7.75|
+-----------+--------+------+------+----+-----+-----+--------+
only showing top 40 rows


In [27]:
# Replace null Age values with 30
train_1.na.fill(30,'Age').show(40)


+-----------+--------+------+------+---+-----+-----+--------+
|PassengerId|Survived|Pclass|Gender|Age|SibSp|Parch|    Fare|
+-----------+--------+------+------+---+-----+-----+--------+
|          1|       0|     3|     0| 22|    1|    0|    7.25|
|          2|       1|     1|     1| 38|    1|    0| 71.2833|
|          3|       1|     3|     1| 26|    0|    0|   7.925|
|          4|       1|     1|     1| 35|    1|    0|    53.1|
|          5|       0|     3|     0| 35|    0|    0|    8.05|
|          6|       0|     3|     0| 30|    0|    0|  8.4583|
|          7|       0|     1|     0| 54|    0|    0| 51.8625|
|          8|       0|     3|     0|  2|    3|    1|  21.075|
|          9|       1|     3|     1| 27|    0|    2| 11.1333|
|         10|       1|     2|     1| 14|    1|    0| 30.0708|
|         11|       1|     3|     1|  4|    1|    1|    16.7|
|         12|       1|     1|     1| 58|    0|    0|   26.55|
|         13|       0|     3|     0| 20|    0|    0|    8.05|
|         14|       0|     3|     0| 39|    1|    5|  31.275|
|         15|       0|     3|     1| 14|    0|    0|  7.8542|
|         16|       1|     2|     1| 55|    0|    0|    16.0|
|         17|       0|     3|     0|  2|    4|    1|  29.125|
|         18|       1|     2|     0| 30|    0|    0|    13.0|
|         19|       0|     3|     1| 31|    1|    0|    18.0|
|         20|       1|     3|     1| 30|    0|    0|   7.225|
|         21|       0|     2|     0| 35|    0|    0|    26.0|
|         22|       1|     2|     0| 34|    0|    0|    13.0|
|         23|       1|     3|     1| 15|    0|    0|  8.0292|
|         24|       1|     1|     0| 28|    0|    0|    35.5|
|         25|       0|     3|     1|  8|    3|    1|  21.075|
|         26|       1|     3|     1| 38|    1|    5| 31.3875|
|         27|       0|     3|     0| 30|    0|    0|   7.225|
|         28|       0|     1|     0| 19|    3|    2|   263.0|
|         29|       1|     3|     1| 30|    0|    0|  7.8792|
|         30|       0|     3|     0| 30|    0|    0|  7.8958|
|         31|       0|     1|     0| 40|    0|    0| 27.7208|
|         32|       1|     1|     1| 30|    1|    0|146.5208|
|         33|       1|     3|     1| 30|    0|    0|    7.75|
|         34|       0|     2|     0| 66|    0|    0|    10.5|
|         35|       0|     1|     0| 28|    1|    0| 82.1708|
|         36|       0|     1|     0| 42|    1|    0|    52.0|
|         37|       1|     3|     0| 30|    0|    0|  7.2292|
|         38|       0|     3|     0| 21|    0|    0|    8.05|
|         39|       0|     3|     1| 18|    2|    0|    18.0|
|         40|       1|     3|     1| 14|    1|    0| 11.2417|
+-----------+--------+------+------+---+-----+-----+--------+
only showing top 40 rows


In [28]:
# Replace null Age values with 30
train_2 = train_1.na.fill(30,'Age')
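
An alternative sketch (illustrative name train_2_alt, not part of the original run): compute the mean Age and use it as the fill value instead of the round number 30.

mean_age = train_1.groupBy().avg('Age').first()[0]
train_2_alt = train_1.na.fill({'Age': int(round(mean_age))})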

In [29]:
train_2.crosstab("Gender","Survived").show()


+---------------+---+---+
|Gender_Survived|  0|  1|
+---------------+---+---+
|              1| 81|233|
|              0|468|109|
+---------------+---+---+


In [30]:
print "F = %3.2f%% M = %3.2f%%" % ( (100*233.0/(233+81)), (100*109.0/(109+468)) )


F = 74.20% M = 18.89%
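
The same gender survival rates can be read directly off the DataFrame (a sketch, not run in the original): the mean of the 0/1 Survived column per Gender group.

train_2.groupBy('Gender').avg('Survived').show()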

Dick the Butcher, to Jack Cade (with apologies to Shakespeare):

Dick: The first thing we do, let's kill all the men.

Cade: Nay, that I mean to do.

Ref: http://www.enotes.com/shakespeare-quotes/lets-kill-all-lawyers


In [31]:
#
# 1 : Simple model (F = Survived, M = not survived)
#
test.show(2)


+-----------+------+--------------------+------+----+-----+-----+------+------+-----+--------+
|PassengerId|Pclass|                Name|   Sex| Age|SibSp|Parch|Ticket|  Fare|Cabin|Embarked|
+-----------+------+--------------------+------+----+-----+-----+------+------+-----+--------+
|        892|     3|    Kelly, Mr. James|  male|34.5|    0|    0|330911|7.8292|     |       Q|
|        893|     3|Wilkes, Mrs. Jame...|female|  47|    1|    0|363272|     7|     |       S|
+-----------+------+--------------------+------+----+-----+-----+------+------+-----+--------+
only showing top 2 rows


In [32]:
out = test.select(test['PassengerId'], 
                 F.when(test['Sex'] == 'female', 1).otherwise(0).alias("Survived"))

In [33]:
out.show(2)


+-----------+--------+
|PassengerId|Survived|
+-----------+--------+
|        892|       0|
|        893|       1|
+-----------+--------+
only showing top 2 rows


In [34]:
out.coalesce(1).write.mode('overwrite').format('com.databricks.spark.csv')\
.options(header='true').save('titanic-r/spark-sub-01.csv')
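
coalesce(1) collapses the output to a single partition, so Spark writes one part file (inside the spark-sub-01.csv directory) that can be uploaded to Kaggle.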

In [35]:
# Submit
# Rank : 2586 Score : 0.76555

In [36]:
#
# Would Age be a better predictor?
#
train_1.na.drop().crosstab("Age","Survived").show()


+------------+---+---+
|Age_Survived|  1|  0|
+------------+---+---+
|           0|  1|  0|
|           5|  4|  0|
|          10|  0|  2|
|          56|  2|  3|
|          42|  6|  7|
|          24| 15| 16|
|          37|  1|  6|
|          25|  6| 18|
|          52|  3|  3|
|          14|  3|  3|
|          20|  3| 12|
|          46|  0|  5|
|          57|  0|  2|
|          29|  8| 14|
|          61|  0|  3|
|           1| 11|  2|
|          74|  0|  1|
|           6|  2|  1|
|          60|  2|  2|
|          28|  7| 18|
+------------+---+---+
only showing top 20 rows


In [37]:
#
# *** Homework: see if Pclass, SibSp or Parch is a better indicator and change the survival prediction accordingly
#

In [38]:
from pyspark.mllib.regression import LabeledPoint

def parse_passenger_list(r):
    # label = Survived; features = [Pclass, Gender, Age, SibSp, Parch, Fare]
    return LabeledPoint(r[1], [r[2], r[3], r[4], r[5], r[6], r[7]])

In [39]:
train_rdd = train_2.map(parse_passenger_list)

In [40]:
train_rdd.count()


Out[40]:
891

In [41]:
train_rdd.first()


Out[41]:
LabeledPoint(0.0, [3.0,0.0,22.0,1.0,0.0,7.25])

In [42]:
from pyspark.mllib.tree import DecisionTree
model = DecisionTree.trainClassifier(train_rdd, numClasses=2,categoricalFeaturesInfo={})
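
Before predicting on the test set, a quick sanity check (a sketch following the standard MLlib pattern; training accuracy will overestimate the leaderboard score):

# Predict on the training features and compare with the known labels
preds = model.predict(train_rdd.map(lambda lp: lp.features))
labels_and_preds = train_rdd.map(lambda lp: lp.label).zip(preds)
accuracy = labels_and_preds.filter(lambda (l, p): l == p).count() / float(train_rdd.count())
print "Training accuracy = %3.2f%%" % (100.0 * accuracy)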

In [43]:
print(model)
# print(model.toDebugString())


DecisionTreeModel classifier of depth 5 with 49 nodes

In [44]:
# Transform test and predict
import pyspark.sql.functions as F
test_1 = test.select(test['PassengerId'], 
                 test['Pclass'].cast("integer").alias("Pclass"),
                 F.when(test['Sex'] == 'female', 1).otherwise(0).alias("Gender"), 
                 test['Age'].cast("integer").alias("Age"),
                 test['SibSp'].cast("integer").alias("SibSp"),
                 test['Parch'].cast("integer").alias("Parch"),
                 test['Fare'].cast("float").alias("Fare"))

In [45]:
test_1.show(2)


+-----------+------+------+---+-----+-----+------+
|PassengerId|Pclass|Gender|Age|SibSp|Parch|  Fare|
+-----------+------+------+---+-----+-----+------+
|        892|     3|     0| 35|    0|    0|7.8292|
|        893|     3|     1| 47|    1|    0|   7.0|
+-----------+------+------+---+-----+-----+------+
only showing top 2 rows


In [46]:
# Do we have nulls?
test_1.filter(test_1['Age'].isNull()).show(40)


+-----------+------+------+----+-----+-----+-------+
|PassengerId|Pclass|Gender| Age|SibSp|Parch|   Fare|
+-----------+------+------+----+-----+-----+-------+
|        902|     3|     0|null|    0|    0| 7.8958|
|        914|     1|     1|null|    0|    0|31.6833|
|        921|     3|     0|null|    2|    0|21.6792|
|        925|     3|     1|null|    1|    2|  23.45|
|        928|     3|     1|null|    0|    0|   8.05|
|        931|     3|     0|null|    0|    0|56.4958|
|        933|     1|     0|null|    0|    0|  26.55|
|        939|     3|     0|null|    0|    0|   7.75|
|        946|     2|     0|null|    0|    0|15.5792|
|        950|     3|     0|null|    1|    0|   16.1|
|        957|     2|     1|null|    0|    0|   21.0|
|        968|     3|     0|null|    0|    0|   8.05|
|        975|     3|     0|null|    0|    0| 7.8958|
|        976|     2|     0|null|    0|    0|10.7083|
|        977|     3|     0|null|    1|    0|14.4542|
|        980|     3|     1|null|    0|    0|   7.75|
|        983|     3|     0|null|    0|    0|  7.775|
|        985|     3|     0|null|    0|    0|   8.05|
|        994|     3|     0|null|    0|    0|   7.75|
|        999|     3|     0|null|    0|    0|   7.75|
|       1000|     3|     0|null|    0|    0| 8.7125|
|       1003|     3|     1|null|    0|    0| 7.7792|
|       1008|     3|     0|null|    0|    0| 6.4375|
|       1013|     3|     0|null|    1|    0|   7.75|
|       1016|     3|     0|null|    0|    0|   7.75|
|       1019|     3|     1|null|    2|    0|  23.25|
|       1024|     3|     1|null|    0|    4|25.4667|
|       1025|     3|     0|null|    1|    0| 6.4375|
|       1038|     1|     0|null|    0|    0|51.8625|
|       1040|     1|     0|null|    0|    0|  26.55|
|       1043|     3|     0|null|    0|    0| 7.8958|
|       1052|     3|     1|null|    0|    0| 7.7333|
|       1055|     3|     0|null|    0|    0|    7.0|
|       1060|     1|     1|null|    0|    0|27.7208|
|       1062|     3|     0|null|    0|    0|   7.55|
|       1065|     3|     0|null|    0|    0| 7.2292|
|       1075|     3|     0|null|    0|    0|   7.75|
|       1080|     3|     1|null|    8|    2|  69.55|
|       1083|     1|     0|null|    0|    0|   26.0|
|       1091|     3|     1|null|    0|    0| 8.1125|
+-----------+------+------+----+-----+-----+-------+
only showing top 40 rows


In [47]:
test_1.groupBy().avg('Age').show()


+------------------+
|          avg(Age)|
+------------------+
|30.295180722891565|
+------------------+


In [48]:
# Replace null Age values with 30 (the computed mean is ~30.3)
test_2 = test_1.na.fill(30,'Age')

In [49]:
# Parse test rows into feature tuples (the test set has no Survived label)
def parse_test(r):
    # features = (Pclass, Gender, Age, SibSp, Parch, Fare)
    return (r[1], r[2], r[3], r[4], r[5], r[6])

In [50]:
test_rdd = test_2.map(parse_test)

In [51]:
test_rdd.count()


Out[51]:
418

In [52]:
predictions = model.predict(test_rdd)

In [53]:
predictions.first()


Out[53]:
0.0

In [54]:
out_rdd = test_2.map(lambda x: x[0]).zip(predictions)

In [55]:
out_rdd.first()


Out[55]:
(u'892', 0.0)

In [56]:
out_df = out_rdd.toDF(['PassengerId','Survived'])

In [57]:
out_df.show(2)


+-----------+--------+
|PassengerId|Survived|
+-----------+--------+
|        892|     0.0|
|        893|     0.0|
+-----------+--------+
only showing top 2 rows


In [58]:
out_1 = out_df.select(out_df['PassengerId'],
                      out_df['Survived'].cast('integer').alias('Survived'))

In [59]:
out_1.show(2)


+-----------+--------+
|PassengerId|Survived|
+-----------+--------+
|        892|       0|
|        893|       0|
+-----------+--------+
only showing top 2 rows


In [60]:
out_1.coalesce(1).write.mode('overwrite').format('com.databricks.spark.csv')\
.options(header='true').save('titanic-r/spark-sub-02.csv')

In [61]:
# Submit
# Rank : 2038 +549 Score : 0.77512

In [62]:
from pyspark.mllib.tree import RandomForest
model_rf = RandomForest.trainClassifier(train_rdd, numClasses=2,categoricalFeaturesInfo={},numTrees=42)

In [63]:
print(model_rf)
#print(model_rf.toDebugString())


TreeEnsembleModel classifier with 42 trees


In [64]:
pred_rf = model_rf.predict(test_rdd).coalesce(1)

In [65]:
pred_rf.first()


Out[65]:
0.0

In [66]:
out_rf = test_2.map(lambda x: x[0]).coalesce(1).zip(pred_rf)
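
Note: RDD.zip requires both sides to have the same number of partitions with the same number of elements in each; coalescing both the PassengerIds and the predictions to a single partition is a blunt but reliable way to guarantee that alignment.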

In [67]:
out_rf.first()


Out[67]:
(u'892', 0.0)

In [68]:
out_df_rf = out_rf.toDF(['PassengerId','Survived'])

In [69]:
out_2 = out_df_rf.select(out_df_rf['PassengerId'],
                      out_df_rf['Survived'].cast('integer').alias('Survived'))

In [70]:
out_2.coalesce(1).write.mode('overwrite').format('com.databricks.spark.csv')\
.options(header='true').save('titanic-r/spark-sub-03.csv')

In [71]:
# Submit
# Rank : 1550 +488 Score : 0.78469

In [72]:
# Looks like we are on a roll! Let's try an SVM!

In [73]:
from pyspark.mllib.classification import SVMWithSGD
model_svm = SVMWithSGD.train(train_rdd, iterations=100)

In [74]:
pred_svm = model_svm.predict(test_rdd).coalesce(1)
out_svm = test_2.map(lambda x: x[0]).coalesce(1).zip(pred_svm)
out_df_svm = out_svm.toDF(['PassengerId','Survived'])

In [75]:
out_3 = out_df_svm.select(out_df_svm['PassengerId'],
                      out_df_svm['Survived'].cast('integer').alias('Survived'))

In [76]:
out_3.coalesce(1).write.mode('overwrite').format('com.databricks.spark.csv')\
.options(header='true').save('titanic-r/spark-sub-04.csv')

In [77]:
# Not good. Only 0.39713!
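
The drop is not surprising: SVMWithSGD is sensitive to feature scale, and Fare (up to 512) dwarfs the 0/1 Gender flag. A hedged sketch of one remedy (illustrative names, not part of the original run): standardize the features before training.

from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint

# Scale features to zero mean and unit variance, then retrain
features = train_rdd.map(lambda lp: lp.features)
labels = train_rdd.map(lambda lp: lp.label)
scaler = StandardScaler(withMean=True, withStd=True).fit(features)
scaled_train = labels.zip(scaler.transform(features))\
    .map(lambda (l, f): LabeledPoint(l, f))
model_svm_scaled = SVMWithSGD.train(scaled_train, iterations=100)
# Apply the same scaler to the test features before predicting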

Did the Random Forest or the SVM do better?

Why? Why not?

