In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# IPython Notebook to get students started with the Kaggle Titanic competition:
# http://www.kaggle.com/c/titanic-gettingStarted/
# MIT licensed. It reproduces the Excel demo that Kaggle uses to introduce
# machine learning with only the 'Sex' field:
# http://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-excel

# Start the Notebook at the command line using the --pylab flag
# so that the charts are drawn in-line inside the browser, rather than being
# drawn in a separate pop-up window
# e.g.
# $ ipython notebook --pylab=inline

# maybe start a second IPython shell attached to the same kernel; the id comes
# from the output printed in the shell when you start the notebook
# $ ipython console --existing c09440a7-190d-4b84-a073-db3905711aa8

In [2]:
# Get the train.csv and test.csv files here:
# http://www.kaggle.com/c/titanic-gettingStarted/data
train = pd.read_csv("train.csv")  # Gives a DataFrame
print len(train)
train.head()  # should generate an HTML table object rather than raw text


891
Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S
https://www.kaggle.com/c/titanic-gettingStarted/data?train.csv

VARIABLE DESCRIPTIONS:
survival    Survival (0 = No; 1 = Yes)
pclass      Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name        Name
sex         Sex
age         Age
sibsp       Number of Siblings/Spouses Aboard
parch       Number of Parents/Children Aboard
ticket      Ticket Number
fare        Passenger Fare
cabin       Cabin
embarked    Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.

Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, therefore parch=0 for them. As well, some travelled with very close friends or neighbors in a village, however, the definitions do not support such relations.

In [3]:
train.dtypes


Out[3]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [4]:
train['Fare'].hist(bins=100)


Out[4]:
<matplotlib.axes.AxesSubplot at 0x3646890>

In [5]:
train['Fare'].hist(bins=100)
plt.xlim((0, 100))


Out[5]:
(0, 100)

In [6]:
train['Age'].hist()


Out[6]:
<matplotlib.axes.AxesSubplot at 0x3787550>
# TODO
# visualise survival versus sex - maybe a stacked bar chart? at least do a simple percentage calculation and print the result
# visualise age vs fare as a scatter chart and colour the markers to show survival - do we see a pattern?
# maybe modify the scatter plot to show Pclass using a symbol (e.g. square, triangle, circle) as an addition, so we show 4 dimensions (age, fare, Pclass, survival) on a 2D chart
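
A minimal sketch of the "simple percentage calculation" from the TODO list. So that it runs without train.csv, it uses a hypothetical stand-in DataFrame called `sample`, built from the five rows that train.head() displays above. The name `sample` and the variable `survival_pct` are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for `train`: the five rows shown by train.head() above.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Survived": [0, 1, 1, 1, 0],
})

# Survived is 0 or 1, so its mean within each group is that group's survival rate.
survival_pct = sample.groupby("Sex")["Survived"].mean() * 100
print(survival_pct)
```

With the real data you would group `train` instead of `sample`. Five rows are far too few to draw any conclusion from, so treat the numbers printed here as a check that the code runs and nothing more.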

In [7]:
# Note that 'Sex' column is an object (it is a string)
print train['Sex'].map(lambda x: x=='male')[:5]


0     True
1    False
2    False
3    False
4     True
Name: Sex, dtype: bool

In [8]:
def is_male(cell_value):
    return cell_value == "male"
print train['Sex'].map(is_male)[:5]


0     True
1    False
2    False
3    False
4     True
Name: Sex, dtype: bool

In [9]:
# add a new is_male column to the DataFrame by calling our function
train['is_male'] = train['Sex'].map(is_male)
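
pandas can also compare a whole column against a value in one operation, which avoids calling a Python function once per cell. The sketch below checks that both approaches agree on a hypothetical stand-in DataFrame called `sample` (the Sex values of the five rows shown by train.head()); `mapped` and `vectorised` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for `train`, using the Sex values from train.head().
sample = pd.DataFrame({"Sex": ["male", "female", "female", "female", "male"]})


def is_male(cell_value):
    return cell_value == "male"


# Element-wise version: calls is_male once for every cell.
mapped = sample["Sex"].map(is_male)

# Vectorised version: compares the whole column in one operation.
vectorised = sample["Sex"] == "male"

# Check that every row agrees between the two versions.
print((mapped == vectorised).all())
```

On 891 rows the speed difference is negligible, so the choice here is mostly about readability.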

In [10]:
train.dtypes  # what dtypes do we have now?


Out[10]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
is_male           bool
dtype: object

In [11]:
train.head()  # show new 'is_male' column


Out[11]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked is_male
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S True
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C False
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S False
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S False
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S True

In [12]:
# Let's start with a tiny bit of statistics - how many males survived?
# This lets us ask - did women have a better chance of survival?
nbr_male = sum(train['is_male'])
nbr_people = float(len(train))
print "{} ({:0.2f}%) males of {} people".format(nbr_male, nbr_male/nbr_people * 100, nbr_people)

print "{} people survived".format(sum(train.Survived == 1))
print "{} males survived".format(sum(train.is_male[train.Survived == 1]))
# Do women have an advantage?


577 (64.76%) males of 891.0 people
342 people survived
109 males survived
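
The three printed numbers are enough to answer the question by subtraction. This is a worked sketch which assumes that the Sex column only ever holds 'male' or 'female', so that everyone who is not male is counted as female. The variable names are illustrative.

```python
# Counts copied from the output printed above.
nbr_people = 891
nbr_male = 577
nbr_survived = 342
nbr_male_survived = 109

# Assumption: every passenger who is not male is female.
nbr_female = nbr_people - nbr_male
nbr_female_survived = nbr_survived - nbr_male_survived

# 100.0 forces floating point division on Python 2 as well as Python 3.
male_rate = 100.0 * nbr_male_survived / nbr_male
female_rate = 100.0 * nbr_female_survived / nbr_female

print("{:0.1f}% of males survived".format(male_rate))
print("{:0.1f}% of females survived".format(female_rate))
```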

In [13]:
train[['Survived', 'is_male']].head()


Out[13]:
Survived is_male
0 0 True
1 1 False
2 1 False
3 1 False
4 0 True

In [14]:
train[['Survived', 'Pclass']].head()


Out[14]:
Survived Pclass
0 0 3
1 1 1
2 1 3
3 1 1
4 0 3

In [15]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()

In [16]:
print train['is_male'].head()  # sanity-check what the column looks like


0     True
1    False
2    False
3    False
4     True
Name: is_male, dtype: bool

In [17]:
clf.fit(train[['is_male']], train['Survived'])


Out[17]:
DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features=None, min_density=None,
            min_samples_leaf=1, min_samples_split=2, random_state=None,
            splitter='best')

In [18]:
# Get the score (the mean accuracy on the same training data we fitted on)
print "Score:", clf.score(train[['is_male']], train['Survived'])
print "Feature importances", clf.feature_importances_


Score: 0.786756453423
Feature importances [ 1.]
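
For a classifier, `score` reports the fraction of rows predicted correctly, so the value can be reproduced by hand. The sketch below assumes that the fitted tree predicts 0 for every male and 1 for every female, which the output above does not state explicitly. It reuses the counts printed earlier in the notebook.

```python
# Counts copied from the statistics cell earlier in the notebook.
nbr_people = 891
nbr_male = 577
nbr_survived = 342
nbr_male_survived = 109

# Assumed rule: predict "died" for males and "survived" for females.
# The rule is right for every male who died and every female who survived.
correct_males = nbr_male - nbr_male_survived
correct_females = nbr_survived - nbr_male_survived

accuracy = (correct_males + correct_females) / float(nbr_people)
print("{:0.6f}".format(accuracy))
```

If the assumption holds, the result should agree with the score printed above to the digits shown.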

In [19]:
# load the test data set
# you have to get this file from Kaggle (it is in the same place as train.csv above)
test = pd.read_csv("test.csv")
test['is_male'] = test['Sex'].map(is_male)

test.head()


Out[19]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked is_male
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q True
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S False
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q True
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S True
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S False

In [20]:
# we can't score the test set as it doesn't have a Survived column
# this means we can't cheat, so we only get a score when we upload to Kaggle
#clf.score(test[['is_male']], test['Survived'])  # would raise a KeyError

In [21]:
# predict using the test set, then count the predicted survivors
predictions = clf.predict(test[['is_male']])
predictions[:5]  # first few answers (not echoed, as this is not the last line of the cell)
print "Total predicted survivors:", sum(predictions)  # this should print 152


Total predicted survivors: 152
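
Because the tree was fitted on a single boolean feature, its predictions can be compared against a hand-written rule. The sketch below assumes that the tree predicts 1 (survived) for females and 0 for males; treat that as an assumption to check, not something the output above confirms. `sample_test` and `rule_predictions` are hypothetical names, and the data is the is_male column of the five rows shown by test.head().

```python
import pandas as pd

# Hypothetical stand-in for `test`: is_male for the five rows shown by test.head().
sample_test = pd.DataFrame({"is_male": [True, False, True, True, False]})

# Assumed rule: predict 1 (survived) for females and 0 (died) for males.
rule_predictions = (~sample_test["is_male"]).astype(int)
print(rule_predictions.tolist())
```

Under that assumption, the 152 printed above would simply be the number of female passengers in test.csv.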

In [24]:
#from sklearn.externals.six import StringIO  
import StringIO
import pydot 
root_filename = "clf_is_male"
dot_data = StringIO.StringIO()
tree.export_graphviz(clf, out_file=dot_data, feature_names=['is_male']) 
with open(root_filename + '.dot', 'w') as f:
    f.write(dot_data.getvalue())
# output as a PNG
import os
os.system("dot -Tpng {}.dot -o {}.png".format(root_filename, root_filename))

# note! if pyparsing==2.0.1 then we get:
# NameError: global name 'dot_parser' is not defined
# so we have to uninstall pyparsing (pip uninstall pyparsing) and
# install the older more compatible version:
# pip install pyparsing==1.5.7
# if you installed the 'requirements.txt' as specified in the README then you'll
# have this version (1.5.7) already

# write a PDF locally - you can comment out these 2 lines if it doesn't work
graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
graph.write_pdf("{}.pdf".format(root_filename)) 

# draw the PNG file into the page below this box so we can visualise
# it here in the Notebook
from IPython.display import Image  # public import path for the same Image class
Image(filename='{}.png'.format(root_filename))


Out[24]:
In the diagram above the field `is_male` is boolean, but the Decision Tree represents it as a number, so the <= 0.5 test is saying "if 0 then go left, if 1 then go right". 0 is False, so the left branch is for females and the right branch is for males. The two leaf nodes (on the second row) report their counts in the same `value` format: column 0 is for Survived==0 (i.e. the number who died) and column 1 is for Survived==1 (i.e. those who lived). You can see that 109 males survived. Confirm this against the result a few cells above, where we calculated the number of men who survived. Question - how many females died?
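
The per-leaf counts can also be tabulated directly with pandas, which gives a way to cross-check the `value` arrays in the diagram. This is a minimal sketch using `pd.crosstab` on a hypothetical stand-in DataFrame called `sample`, built from the five rows shown by train.head(); `counts` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for `train`: the five rows shown by train.head().
sample = pd.DataFrame({
    "is_male": [True, False, False, False, True],
    "Survived": [0, 1, 1, 1, 0],
})

# Rows are is_male (False/True), columns are Survived (0/1), cells are counts.
counts = pd.crosstab(sample["is_male"], sample["Survived"])
print(counts)
```

Running the same call on the full `train` DataFrame would give one row per branch of the tree, with the columns in the same order as the leaf `value` arrays.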

In [27]:
# write out a predictions_is_male.csv file, this can be directly uploaded
# to Kaggle as a first result. This gives an identical result to anyone who uses the
# Sex column (e.g. anyone who follows Kaggle's Excel or Python introduction)
predictions_df = pd.DataFrame(predictions, columns=['Survived'])
predictions_df['PassengerId'] = test['PassengerId']
# write PassengerId first, then Survived; index=False leaves out the row index
# (newer pandas releases name this argument 'columns', older ones called it 'cols')
predictions_df.to_csv('predictions_is_male.csv', index=False, columns=['PassengerId', 'Survived'])

# the above scores 0.76555 for rank 4000 or so

# if you downloaded Kaggle's gendermodel.csv then the file we've just generated and
# gendermodel.csv should be identical
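
One way to check that two submission files are identical is to load both and compare them with `DataFrame.equals`. The sketch below uses two hypothetical in-memory CSV strings in place of the real files so that it runs on its own; `ours_csv`, `theirs_csv`, `ours` and `theirs` are illustrative names, and the three rows are made up for the example.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for predictions_is_male.csv and gendermodel.csv.
ours_csv = u"PassengerId,Survived\n892,0\n893,1\n894,0\n"
theirs_csv = u"PassengerId,Survived\n892,0\n893,1\n894,0\n"

ours = pd.read_csv(io.StringIO(ours_csv))
theirs = pd.read_csv(io.StringIO(theirs_csv))

# equals() is True when both frames hold the same values in the same layout.
print(ours.equals(theirs))
```

With the real files you would pass the two filenames to `pd.read_csv` instead of the `io.StringIO` objects.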
Next you should read more on the Kaggle website about "Getting Started" for this competition; it gives some hints about where to go next.