Titanic

Goal: predict survival on the Titanic.
It's a basic learning competition on the ML platform Kaggle and a simple introduction to machine learning concepts, specifically binary classification (survived / not survived).
Here we look at how to apply Logistic Regression to the Titanic dataset.

1. Collect and understand the data

The data can be downloaded directly from Kaggle.


In [45]:
import pandas as pd

In [46]:
# get titanic training file as a DataFrame
titanic = pd.read_csv("../datasets/titanic_train.csv")

In [47]:
titanic.shape


Out[47]:
(891, 12)

In [48]:
# preview the data
titanic.head()


Out[48]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Variable Description

Survived: Survived (1) or died (0); this is the target variable
Pclass: Passenger's class (1st, 2nd or 3rd class)
Name: Passenger's name
Sex: Passenger's sex
Age: Passenger's age
SibSp: Number of siblings/spouses aboard
Parch: Number of parents/children aboard
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of embarkation


In [49]:
titanic.describe()


Out[49]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

Not all features are numeric:


In [50]:
titanic.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

2. Process the Data

Categorical variables need to be transformed into numeric variables.

Transform the embarkation port

There are three ports: C = Cherbourg, Q = Queenstown, S = Southampton


In [51]:
ports = pd.get_dummies(titanic.Embarked, prefix='Embarked')
ports.head()


Out[51]:
Embarked_C Embarked_Q Embarked_S
0 0 0 1
1 1 0 0
2 0 0 1
3 0 0 1
4 0 0 1

Now the feature Embarked (a category) has been transformed into 3 binary features, e.g. Embarked_C = 0 means not embarked in Cherbourg, 1 = embarked in Cherbourg.
Finally, the 3 new binary features replace the original one in the data frame:


In [52]:
titanic = titanic.join(ports)
titanic.drop(['Embarked'], axis=1, inplace=True) # then drop the original column
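
Since the three dummy columns always sum to one, one of them is redundant for a linear model; get_dummies can drop one level directly. A minimal sketch of an alternative version of the cell above (it replaces that cell, it is not meant to be run in addition to it):

# alternative: drop one dummy level to avoid perfectly collinear columns
ports = pd.get_dummies(titanic.Embarked, prefix='Embarked', drop_first=True)
titanic = titanic.join(ports).drop(['Embarked'], axis=1)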

Transform the gender feature

This one is easier, since the feature is already binary (male or female).
This was 1912.


In [53]:
titanic.Sex = titanic.Sex.map({'male':0, 'female':1})
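
A quick sanity check (a sketch): any value of Sex not covered by the map would become NaN, so the count below should be zero.

# verify that the mapping produced no missing values
print(titanic.Sex.isnull().sum())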

Extract the target variable


In [54]:
y = titanic.Survived.copy() # copy the target ("y") column values out

In [55]:
X = titanic.drop(['Survived'], axis=1) # then, drop y column

Drop the less important features

For this first model, we ignore some categorical features that are unlikely to add much signal.


In [56]:
X.drop(['Cabin'], axis=1, inplace=True)

In [57]:
X.drop(['Ticket'], axis=1, inplace=True)

In [58]:
X.drop(['Name'], axis=1, inplace=True)

In [59]:
X.drop(['PassengerId'], axis=1, inplace=True)
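
The four drops above are equivalent to a single call, sketched here for reference (it replaces the four cells above, it should not be run in addition to them):

# drop the four low-signal columns in one call
X = X.drop(['Cabin', 'Ticket', 'Name', 'PassengerId'], axis=1)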

In [60]:
X.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
Pclass        891 non-null int64
Sex           891 non-null int64
Age           714 non-null float64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Embarked_C    891 non-null uint8
Embarked_Q    891 non-null uint8
Embarked_S    891 non-null uint8
dtypes: float64(2), int64(4), uint8(3)
memory usage: 44.5 KB

All features are now numeric, ready for regression.
But there are still a couple of processing steps to do.

Check if there are any missing values


In [61]:
X.isnull().values.any()


Out[61]:
True

In [62]:
# X[pd.isnull(X).any(axis=1)]  # uncomment to inspect the rows that contain missing values

True, there are missing values (NaN) in the data, and a quick look reveals that they are all in the Age feature.
One possibility would be to remove the feature; another is to fill the missing values with a fixed number or with the average age.


In [63]:
X.Age.fillna(X.Age.mean(), inplace=True)  # replace NaN with average age
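
The mean is a simple choice; a slightly more informed alternative (a sketch on the same X) is to impute the median age per passenger class, since Age and Pclass are correlated:

# alternative: fill missing ages with the median age of the passenger's class
X['Age'] = X.groupby('Pclass')['Age'].transform(lambda s: s.fillna(s.median()))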

In [64]:
X.isnull().values.any()


Out[64]:
False

Now all missing values have been filled in.
The logistic regression would otherwise not work, since it cannot handle missing values.

Split the dataset into training and validation

The training set will be used to build the machine learning model. The model is based on features such as the passengers' gender and class, together with the known survival flag.

The validation set is used to see how well the model performs on unseen data: for each passenger in the validation set, the trained model predicts whether or not they survived the sinking of the Titanic, and the prediction is then compared with the actual survival flag.


In [65]:
from sklearn.model_selection import train_test_split
  # 80% go into the training set, 20% into the validation set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=7)
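
Since only about 38% of the passengers survived, it can help to keep the same class proportions in both splits; train_test_split supports this through the stratify argument (a sketch of the alternative call):

# alternative: stratified split keeps the survived/died ratio equal in both sets
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)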

3. Modelling

Get a baseline

A baseline is always useful to check whether the trained model performs significantly better than something easy to obtain, such as a random guess or a simple heuristic like "all and only female passengers survived". In this case, after quickly looking at the training dataset - where the survival outcome is present - I am going to use the following:


In [66]:
def simple_heuristic(titanicDF):
    '''
    Predict whether each passenger survived or perished.
    The heuristic predicts that a passenger survived:
    1) if the passenger is female, or
    2) if the passenger's socioeconomic status is high (1st class) AND the passenger is under 18
    '''

    predictions = [] # a list

    for passenger_index, passenger in titanicDF.iterrows():

        if passenger['Sex'] == 1:
                    # female
            predictions.append(1)  # survived
        elif passenger['Age'] < 18 and passenger['Pclass'] == 1:
                    # male, but a minor travelling in 1st class
            predictions.append(1)  # survived
        else:
            predictions.append(0)  # everyone else perished

    return predictions
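
A similar sanity check (a minimal sketch) can be obtained with scikit-learn's DummyClassifier, whose majority-class strategy here simply predicts that nobody survived:

from sklearn.dummy import DummyClassifier

# majority-class baseline: always predict the most frequent label (perished)
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(dummy.score(X_valid, y_valid))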

Let's see how the simple heuristic performs on the validation dataset; we will keep that number as our baseline:


In [67]:
simplePredictions = simple_heuristic(X_valid)
correct = sum(simplePredictions == y_valid)  # count the predictions that match the true labels
print("Baseline: ", correct/len(y_valid))


Baseline:  0.731843575419

Baseline: the simple heuristic predicts 73% of the validation cases correctly.
Now let's see if the model can do better.

Logistic Regression


In [68]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

In [69]:
model.fit(X_train, y_train)


Out[69]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

4. Evaluate the model


In [70]:
model.score(X_train, y_train)


Out[70]:
0.8103932584269663

In [71]:
model.score(X_valid, y_valid)


Out[71]:
0.75977653631284914

Two things to note:

  • the score on the training set is noticeably better than on the validation set, an indication that the model could be overfitting and may not generalise, e.g. to other ship sinkings (a cross-validation check is sketched below);
  • the score on the validation set is better than the baseline, so the model adds some value at a minimal cost (logistic regression is not computationally expensive, at least not for smaller datasets).
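
To get a less split-dependent estimate of the accuracy (and a better feel for the possible overfitting), a quick cross-validation sketch on the full dataset:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: mean and spread of the accuracy across folds
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean(), scores.std())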

An advantage of logistic regression (e.g. compared to a neural network) is that it's easily interpretable. It can be written as a math formula:


In [72]:
model.intercept_ # the fitted intercept


Out[72]:
array([ 1.42242591])

In [73]:
model.coef_  # the fitted coefficients


Out[73]:
array([[ -9.31919207e-01,   2.83123046e+00,  -3.92725788e-02,
         -3.92811214e-01,   1.93182645e-02,   1.90387275e-03,
          7.44068256e-01,   4.55523662e-01,   2.22833991e-01]])

Which means that the formula is:

$$ P(\text{survive}) = \frac{1}{1+e^{-\text{logit}}} $$

where the logit is:

$$ \text{logit} = \beta_{0} + \beta_{1}\cdot x_{1} + \ldots + \beta_{n}\cdot x_{n} $$

where $\beta_{0}$ is the model intercept and the other beta parameters are the model coefficients from above, each multiplied by the related feature:

$$ \text{logit} = 1.4224 - 0.9319 \cdot \text{Pclass} + \ldots + 0.2228 \cdot \text{Embarked\_S} $$
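
As a quick check of the formula (a sketch using the fitted model and the first passenger in the validation set), the logit can be computed by hand and passed through the sigmoid; the result should match predict_proba:

import numpy as np

x0 = X_valid.iloc[[0]]                    # first validation passenger, as a 1-row frame
logit = model.intercept_[0] + np.dot(model.coef_[0], x0.values[0])
p_manual = 1 / (1 + np.exp(-logit))       # sigmoid of the logit
p_model = model.predict_proba(x0)[0, 1]   # model's probability of class 1 (survived)
print(p_manual, p_model)                  # the two values should coincide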

5. Iterate on the model

The model could be improved, for example by transforming the features excluded above or by creating new ones (e.g. titles could be extracted from the names as another indicator of socio-economic status).
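
For example, a title feature could be extracted from the Name column with a regular expression (a sketch; note that Name is still available in the titanic data frame, it was only dropped from X):

# extract the title (Mr, Mrs, Miss, ...) as the first word followed by a period
titles = titanic.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.value_counts().head())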

A heat map of the correlation matrix may give us an understanding of which variables are important:


In [74]:
titanic.corr()


Out[74]:
PassengerId Survived Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q Embarked_S
PassengerId 1.000000 -0.005007 -0.035144 -0.042939 0.036847 -0.057527 -0.001652 0.012658 -0.001205 -0.033606 0.022148
Survived -0.005007 1.000000 -0.338481 0.543351 -0.077221 -0.035322 0.081629 0.257307 0.168240 0.003650 -0.155660
Pclass -0.035144 -0.338481 1.000000 -0.131900 -0.369226 0.083081 0.018443 -0.549500 -0.243292 0.221009 0.081720
Sex -0.042939 0.543351 -0.131900 1.000000 -0.093254 0.114631 0.245489 0.182333 0.082853 0.074115 -0.125722
Age 0.036847 -0.077221 -0.369226 -0.093254 1.000000 -0.308247 -0.189119 0.096067 0.036261 -0.022405 -0.032523
SibSp -0.057527 -0.035322 0.083081 0.114631 -0.308247 1.000000 0.414838 0.159651 -0.059528 -0.026354 0.070941
Parch -0.001652 0.081629 0.018443 0.245489 -0.189119 0.414838 1.000000 0.216225 -0.011069 -0.081228 0.063036
Fare 0.012658 0.257307 -0.549500 0.182333 0.096067 0.159651 0.216225 1.000000 0.269335 -0.117216 -0.166603
Embarked_C -0.001205 0.168240 -0.243292 0.082853 0.036261 -0.059528 -0.011069 0.269335 1.000000 -0.148258 -0.778359
Embarked_Q -0.033606 0.003650 0.221009 0.074115 -0.022405 -0.026354 -0.081228 -0.117216 -0.148258 1.000000 -0.496624
Embarked_S 0.022148 -0.155660 0.081720 -0.125722 -0.032523 0.070941 0.063036 -0.166603 -0.778359 -0.496624 1.000000
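
The same matrix can also be rendered as an actual heat map (a sketch, assuming seaborn and matplotlib are available in the environment):

import matplotlib.pyplot as plt
import seaborn as sns

# colour-coded view of the correlation matrix above
plt.figure(figsize=(10, 8))
sns.heatmap(titanic.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()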

In [ ]: