Introduction

This is the third installment of Applying Machine Learning to Kaggle Datasets, a series of IPython notebooks demonstrating the methods described in the Stanford Machine Learning Course. In each notebook, I apply one method taught in the course to an open Kaggle competition.

In this notebook, I demonstrate the use of an artificial neural network in the Titanic competition.

Outline

  1. Import and examine the data
  2. Create input and output vectors for the neural network
  3. Set up the network using the neurolab library in Python
  4. Evaluate model results
  5. Run the test data through the networks
  6. Submit predictions to the Kaggle competition

Import Necessary Modules


In [122]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import code.Neural_Net_Funcs as NNF
import neurolab as nl

In [123]:
reload(NNF)


Out[123]:
<module 'code.Neural_Net_Funcs' from 'code/Neural_Net_Funcs.pyc'>

1. Read Titanic Data


In [124]:
train = pd.read_csv("./data/titanic/train.csv", index_col="PassengerId")
train.head()


Out[124]:
             Survived  Pclass  Name                                                Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked
PassengerId
1            0         3       Braund, Mr. Owen Harris                             male    22   1      0      A/5 21171         7.2500   NaN    S
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38   1      0      PC 17599          71.2833  C85    C
3            1         3       Heikkinen, Miss. Laina                              female  26   0      0      STON/O2. 3101282  7.9250   NaN    S
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35   1      0      113803            53.1000  C123   S
5            0         3       Allen, Mr. William Henry                            male    35   0      0      373450            8.0500   NaN    S

In [125]:
# Optional exploration: cross-tabulate survival counts by class and sex
#temp = pd.crosstab([train.Pclass, train.Sex],train.Survived.astype(bool))
#temp

In [126]:
# Optional exploration: survival rates by class, sex, and embarkation point,
# and age distributions by class (left commented out)
#sb.set(style="white")
#sb.factorplot('Pclass','Survived','Sex',data=train,palette="muted")
#sb.factorplot('Embarked','Survived','Pclass',data=train,palette="muted")
#sb.factorplot('Embarked','Survived','Sex',data=train,palette="muted")
#fg = sb.FacetGrid(train,hue="Pclass",aspect=3,palette="muted")
#fg.map(sb.kdeplot,"Age",bw=4,shade=True,legend=True)
#fg.set(xlim=(0,80))
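
Even without the plots above, a quick groupby summary shows how strongly survival depends on the categorical features, and how many passengers are missing an age, which motivates the two-network approach below. A minimal sketch using only the columns shown in train.head():

# Survival rate by sex and passenger class
print train.groupby(['Sex','Pclass'])['Survived'].mean()

# Survival rate by port of embarkation
print train.groupby('Embarked')['Survived'].mean()

# Count passengers with no recorded age
print train['Age'].isnull().sum(), "of", len(train), "ages are missing"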

2. Create input and output vectors for the neural network


In [171]:
reload(NNF)
# Inputs for passengers with a recorded age (Age used as a feature)
datain_age,dataout_age,min_max_list_age,pid = NNF.make_input_output(train)
# Inputs for passengers with no recorded age (Age excluded)
datain,dataout,min_max_list,pid = NNF.make_input_output(train,Age=False)

In [172]:
print len(datain_age), len(datain)


714 177
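
The helper module code.Neural_Net_Funcs is not listed in this notebook. As a rough, hedged sketch of what a make_input_output-style helper could look like given the row counts above, one might write something like the following; the exact feature columns, scaling, and -1/+1 target coding are assumptions, not the actual implementation.

def make_input_output_sketch(df, Age=True, Test=False):
    # Keep passengers with a recorded age when Age=True, those without one otherwise
    df = df[df.Age.notnull()] if Age else df[df.Age.isnull()]
    feats = df[['Pclass','Fare'] + (['Age'] if Age else [])].copy()
    feats['Sex'] = (df['Sex'] == 'male').astype(float)   # encode sex as 0/1
    feats = feats.fillna(feats.mean())                    # patch any stray NaNs (e.g. Fare)
    # neurolab's newff needs a [min, max] pair for every input column
    min_max = [[feats[c].min(), feats[c].max()] for c in feats.columns]
    # Targets coded -1/+1 to match the tanh output layer (None for the test set)
    targets = None if Test else np.where(df['Survived']==1, 1.0, -1.0).reshape(-1,1)
    return feats.values, targets, min_max, df.index.values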

3. Set up the neural network using the neurolab library in Python


In [173]:
# Set up a feed-forward network with neurolab.
# By default, newff uses the hyperbolic tangent (TanSig) activation function
# in every layer, and every layer has a bias node.
# The training function can also be overridden, e.g.:
#net.trainf = nl.train.train_gdm
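
For reference, those defaults can be overridden when the network is built. A small example (the layer sizes here are arbitrary and not the ones used below) of how neurolab exposes transfer and training functions:

# Example only: two inputs in [0,1], a 5-node hidden layer, one output node
example_net = nl.net.newff([[0,1],[0,1]], [5,1],
                           transf=[nl.trans.TanSig(), nl.trans.TanSig()])
example_net.trainf = nl.train.train_gdm   # gradient descent with momentum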

In [178]:
# Build and train the first network on the training rows with no recorded age.
m = datain.shape[0]     # number of observations
ci = datain.shape[1]    # number of input nodes
layers = [ci,1]         # one hidden layer with ci nodes and a single output node
net = nl.net.newff(min_max_list,layers)
# show=2 prints the error every 2 epochs; training stops at goal or after epochs
err = net.train(datain, dataout, show=2,goal=0.01,epochs=20)
net.save('myfirst_net_noage.sav')


Epoch: 2; Error: 61.0199410394;
Epoch: 4; Error: 60.3566937146;
Epoch: 6; Error: 57.4157801794;
Epoch: 8; Error: 53.8256491684;
Epoch: 10; Error: 46.4204596104;
Epoch: 12; Error: 44.4684620818;
Epoch: 14; Error: 42.1227134574;
Epoch: 16; Error: 41.2205662617;
Epoch: 18; Error: 40.3514747192;
Epoch: 20; Error: 40.0138357294;
The maximum number of train epochs is reached
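
Because the network was saved above, it can be restored later without retraining; for example:

# Reload the saved no-age network and rerun it on the same inputs
restored_net = nl.load('myfirst_net_noage.sav')
restored_sim = np.sign(restored_net.sim(datain))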

In [180]:
# Build and train the second network on the training rows with a recorded age.
m_age = datain_age.shape[0]     # number of observations
ci_age = datain_age.shape[1]    # number of input nodes
layers_age = [ci_age,1]         # one hidden layer with ci_age nodes and a single output node
net_age = nl.net.newff(min_max_list_age,layers_age)
err_age = net_age.train(datain_age, dataout_age, show=2,goal=0.01,epochs=20)
net_age.save('myfirst_net_age.sav')


Epoch: 2; Error: 217.539895838;
Epoch: 4; Error: 210.2968521;
Epoch: 6; Error: 205.309284427;
Epoch: 8; Error: 199.91987307;
Epoch: 10; Error: 191.917091424;
Epoch: 12; Error: 186.467781263;
Epoch: 14; Error: 183.414792882;
Epoch: 16; Error: 180.644522543;
Epoch: 18; Error: 179.948153752;
Epoch: 20; Error: 178.4347148;
The maximum number of train epochs is reached

4. Evaluate Model Results


In [181]:
# Per-sample training error for each network, per epoch
plt.plot(np.array(err)/len(datain),label='No Age')
plt.hold(True)
plt.plot(np.array(err_age)/len(datain_age), label='Age')
plt.legend()


Out[181]:
<matplotlib.legend.Legend at 0x10a4d8790>
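
The curves are easier to compare numerically at the last epoch; the same per-sample normalization used in the plot gives:

print "final per-sample error (no age): ", err[-1]/float(len(datain))
print "final per-sample error (w/ age): ", err_age[-1]/float(len(datain_age))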

In [182]:
# Fraction of training examples classified correctly by each network
trainsim = np.sign(net.sim(datain))
correct = trainsim==dataout
print "Fraction correct (no age): ",np.sum(correct)/ np.float(len(correct)), len(correct)
trainsim = np.sign(net_age.sim(datain_age))
correct = trainsim==dataout_age
print "Fraction correct (w/ age): ",np.sum(correct)/ np.float(len(correct)), len(correct)


Fraction correct (no age):  0.853107344633 177
Fraction correct (w/ age):  0.831932773109 714
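
The overall fraction correct does not show how the errors split between the two classes. Assuming the targets are coded -1/+1, as the np.sign comparison above implies, a confusion-count breakdown for the no-age network would look like:

# Confusion counts for the no-age network on its training rows
pred = np.sign(net.sim(datain)).ravel()
truth = np.asarray(dataout).ravel()
print "predicted survived, actually survived:", np.sum((pred==1) & (truth==1))
print "predicted survived, actually died:    ", np.sum((pred==1) & (truth==-1))
print "predicted died, actually survived:    ", np.sum((pred==-1) & (truth==1))
print "predicted died, actually died:        ", np.sum((pred==-1) & (truth==-1))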

5. Run the test data through the networks


In [183]:
test = pd.read_csv("./data/titanic/test.csv", index_col="PassengerId")
reload(NNF)


Out[183]:
<module 'code.Neural_Net_Funcs' from 'code/Neural_Net_Funcs.pyc'>

In [184]:
# Test-set inputs for passengers with a recorded age
datain_age,dataout_age,min_max_list_age,pid_age = NNF.make_input_output(test,Test=True)

In [185]:
# Test-set inputs for passengers with no recorded age
datain,dataout,min_max_list,pid = NNF.make_input_output(test,Test=True,Age=False)

In [186]:
# Predict survival with the matching network for each group (+1 survived, -1 did not)
predict_age = np.sign(net_age.sim(datain_age))
predict_noage = np.sign(net.sim(datain))

6. Submit predictions to the Kaggle competition


In [187]:
# Combine the two groups' predictions and convert -1/+1 to Kaggle's 0/1 labels
predictions = np.concatenate([predict_age,predict_noage])
predictions = np.where(predictions==1,predictions,0)
passengerid = np.concatenate([pid_age,pid])
dfout = pd.DataFrame(predictions,index=passengerid,columns=['Survived'])
dfout.index.name = 'PassengerID'
dfout = dfout.astype(int)
dfout = dfout.sort_index()    # order rows by passenger id
dfout.to_csv('./predictions/Neural_Network_Prediction.csv',sep=',')
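
Before uploading, it is worth reading the file back to confirm it has one row per test passenger (the Kaggle test set has 418 rows) and only 0/1 labels:

# Quick sanity check on the submission file
check = pd.read_csv('./predictions/Neural_Network_Prediction.csv')
print check.shape                       # expect (418, 2)
print check['Survived'].value_counts()  # only 0s and 1s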

This submission scored 0.78469, placing 1037th out of 2102 submissions. This is just better than the "Gender, Price, and Class Based Model" benchmark.