Titanic

Goal: predict survival on the Titanic.
It's a basic learning competition on the ML platform Kaggle and a simple introduction to machine learning concepts, specifically binary classification (survived / not survived).
Here we look at how to apply Logistic Regression to the Titanic dataset.

1. Collect and understand the data

The data can be downloaded directly from Kaggle.


In [45]:
import pandas as pd

In [46]:
# get titanic training file as a DataFrame
titanic = pd.read_csv("../datasets/titanic_train.csv")

In [47]:
titanic.shape


Out[47]:
(891, 12)

In [48]:
# preview the data
titanic.head()


Out[48]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Variable Description

Survived: Survived (1) or died (0); this is the target variable
Pclass: Passenger's class (1st, 2nd or 3rd class)
Name: Passenger's name
Sex: Passenger's sex
Age: Passenger's age
SibSp: Number of siblings/spouses aboard
Parch: Number of parents/children aboard
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of embarkation


In [49]:
titanic.describe()


Out[49]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

Not all features are numeric:


In [50]:
titanic.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

2. Process the Data

Categorical variables need to be transformed into numeric variables.

Transform the embarkation port

There are three ports: C = Cherbourg, Q = Queenstown, S = Southampton


In [51]:
ports = pd.get_dummies(titanic.Embarked, prefix='Embarked')
ports.head()


Out[51]:
Embarked_C Embarked_Q Embarked_S
0 0 0 1
1 1 0 0
2 0 0 1
3 0 0 1
4 0 0 1

Now the feature Embarked (a category) has been transformed into 3 binary features, e.g. Embarked_C = 0 means not embarked in Cherbourg, 1 = embarked in Cherbourg.
Finally, the 3 new binary features replace the original one in the data frame:


In [52]:
titanic = titanic.join(ports)
titanic.drop(['Embarked'], axis=1, inplace=True) # then drop the original column
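
Since the three dummy columns always sum to one, one of them is redundant for a linear model; get_dummies can drop one level directly. A minimal sketch of an alternative version of the cell above (it replaces that cell, it is not meant to be run in addition to it):

# alternative: drop one dummy level to avoid perfectly collinear columns
ports = pd.get_dummies(titanic.Embarked, prefix='Embarked', drop_first=True)
titanic = titanic.join(ports).drop(['Embarked'], axis=1)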

Transform the gender feature

This one is easier, since the feature is already binary (male or female).
This was 1912.


In [53]:
titanic.Sex = titanic.Sex.map({'male':0, 'female':1})
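
A quick sanity check (a sketch): any value of Sex not covered by the map would become NaN, so the count below should be zero.

# verify that the mapping produced no missing values
print(titanic.Sex.isnull().sum())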

Extract the target variable


In [54]:
y = titanic.Survived.copy() # copy the target ("y") column values out

In [55]:
X = titanic.drop(['Survived'], axis=1) # then, drop y column

Drop the less important features

For this first model, we ignore some categorical features that are unlikely to add much signal.


In [56]:
X.drop(['Cabin'], axis=1, inplace=True)

In [57]:
X.drop(['Ticket'], axis=1, inplace=True)

In [58]:
X.drop(['Name'], axis=1, inplace=True)

In [59]:
X.drop(['PassengerId'], axis=1, inplace=True)
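
The four drops above are equivalent to a single call, sketched here for reference (it replaces the four cells above, it should not be run in addition to them):

# drop the four low-signal columns in one call
X = X.drop(['Cabin', 'Ticket', 'Name', 'PassengerId'], axis=1)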

In [60]:
X.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
Pclass        891 non-null int64
Sex           891 non-null int64
Age           714 non-null float64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Embarked_C    891 non-null uint8
Embarked_Q    891 non-null uint8
Embarked_S    891 non-null uint8
dtypes: float64(2), int64(4), uint8(3)
memory usage: 44.5 KB

All features are now numeric, ready for regression.
But there are still a couple of processing steps to do.

Check if there are any missing values


In [61]:
X.isnull().values.any()


Out[61]:
True

In [62]:
# X[pd.isnull(X).any(axis=1)]  # uncomment to inspect the rows that contain missing values

True, there are missing values (NaN) in the data, and a quick look reveals that they are all in the Age feature.
One possibility would be to remove the feature; another is to fill the missing values with a fixed number or with the average age.


In [63]:
X.Age.fillna(X.Age.mean(), inplace=True)  # replace NaN with average age
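
The mean is a simple choice; a slightly more informed alternative (a sketch on the same X) is to impute the median age per passenger class, since Age and Pclass are correlated:

# alternative: fill missing ages with the median age of the passenger's class
X['Age'] = X.groupby('Pclass')['Age'].transform(lambda s: s.fillna(s.median()))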

In [64]:
X.isnull().values.any()


Out[64]:
False

Now all missing values have been filled in.
The logistic regression would otherwise not work, since it cannot handle missing values.

Split the dataset into training and validation

The training set will be used to build the machine learning model. The model is based on features such as the passengers' gender and class, together with the known survival flag.

The validation set is used to see how well the model performs on unseen data: for each passenger in the validation set, the trained model predicts whether or not they survived the sinking of the Titanic, and the prediction is then compared with the actual survival flag.


In [65]:
from sklearn.model_selection import train_test_split
  # 80% go into the training set, 20% into the validation set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=7)
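
Since only about 38% of the passengers survived, it can help to keep the same class proportions in both splits; train_test_split supports this through the stratify argument (a sketch of the alternative call):

# alternative: stratified split keeps the survived/died ratio equal in both sets
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)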

3. Modelling

Get a baseline

A baseline is always useful to check whether the trained model performs significantly better than something easy to obtain, such as a random guess or a simple heuristic like "all and only female passengers survived". In this case, after quickly looking at the training dataset - where the survival outcome is present - I am going to use the following:


In [66]:
def simple_heuristic(titanicDF):
    '''
    Predict whether each passenger survived or perished.
    The heuristic predicts that a passenger survived:
    1) if the passenger is female, or
    2) if the passenger's socioeconomic status is high (1st class) AND the passenger is under 18
    '''

    predictions = [] # a list

    for passenger_index, passenger in titanicDF.iterrows():

        if passenger['Sex'] == 1:
                    # female
            predictions.append(1)  # survived
        elif passenger['Age'] < 18 and passenger['Pclass'] == 1:
                    # male, but a minor travelling in 1st class
            predictions.append(1)  # survived
        else:
            predictions.append(0)  # everyone else perished

    return predictions
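
A similar sanity check (a minimal sketch) can be obtained with scikit-learn's DummyClassifier, whose majority-class strategy here simply predicts that nobody survived:

from sklearn.dummy import DummyClassifier

# majority-class baseline: always predict the most frequent label (perished)
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(dummy.score(X_valid, y_valid))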

Let's see how the simple heuristic performs on the validation dataset; we will keep that number as our baseline:


In [67]:
simplePredictions = simple_heuristic(X_valid)
correct = sum(simplePredictions == y_valid)  # count the predictions that match the true labels
print("Baseline: ", correct/len(y_valid))


Baseline:  0.731843575419

Baseline: the simple heuristic predicts 73% of the validation cases correctly.
Now let's see if the model can do better.

Logistic Regression


In [68]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

In [69]:
model.fit(X_train, y_train)


Out[69]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

4. Evaluate the model


In [70]:
model.score(X_train, y_train)


Out[70]:
0.8103932584269663

In [71]:
model.score(X_valid, y_valid)


Out[71]:
0.75977653631284914

Two things to note:

  • the score on the training set is noticeably better than on the validation set, an indication that the model could be overfitting and may not generalise, e.g. to other ship sinkings (a cross-validation check is sketched below);
  • the score on the validation set is better than the baseline, so the model adds some value at a minimal cost (logistic regression is not computationally expensive, at least not for smaller datasets).
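
To get a less split-dependent estimate of the accuracy (and a better feel for the possible overfitting), a quick cross-validation sketch on the full dataset:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: mean and spread of the accuracy across folds
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean(), scores.std())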

An advantage of logistic regression (e.g. compared to a neural network) is that it's easily interpretable. It can be written as a math formula:


In [72]:
model.intercept_ # the fitted intercept


Out[72]:
array([ 1.42242591])

In [73]:
model.coef_  # the fitted coefficients


Out[73]:
array([[ -9.31919207e-01,   2.83123046e+00,  -3.92725788e-02,
         -3.92811214e-01,   1.93182645e-02,   1.90387275e-03,
          7.44068256e-01,   4.55523662e-01,   2.22833991e-01]])

Which means that the formula is:

$$ P(\text{survive}) = \frac{1}{1+e^{-\text{logit}}} $$

where the logit is:

$$ \text{logit} = \beta_{0} + \beta_{1}\cdot x_{1} + \ldots + \beta_{n}\cdot x_{n} $$

where $\beta_{0}$ is the model intercept and the other beta parameters are the model coefficients from above, each multiplied by the related feature:

$$ \text{logit} = 1.4224 - 0.9319 \cdot \text{Pclass} + \ldots + 0.2228 \cdot \text{Embarked\_S} $$
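
As a quick check of the formula (a sketch using the fitted model and the first passenger in the validation set), the logit can be computed by hand and passed through the sigmoid; the result should match predict_proba:

import numpy as np

x0 = X_valid.iloc[[0]]                    # first validation passenger, as a 1-row frame
logit = model.intercept_[0] + np.dot(model.coef_[0], x0.values[0])
p_manual = 1 / (1 + np.exp(-logit))       # sigmoid of the logit
p_model = model.predict_proba(x0)[0, 1]   # model's probability of class 1 (survived)
print(p_manual, p_model)                  # the two values should coincide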

5. Iterate on the model

The model could be improved, for example by transforming the features excluded above or by creating new ones (e.g. titles could be extracted from the names as another indicator of socio-economic status).
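
For example, a title feature could be extracted from the Name column with a regular expression (a sketch; note that Name is still available in the titanic data frame, it was only dropped from X):

# extract the title (Mr, Mrs, Miss, ...) as the first word followed by a period
titles = titanic.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.value_counts().head())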

A heat map of the correlation matrix may give us an understanding of which variables are important:


In [74]:
titanic.corr()


Out[74]:
PassengerId Survived Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q Embarked_S
PassengerId 1.000000 -0.005007 -0.035144 -0.042939 0.036847 -0.057527 -0.001652 0.012658 -0.001205 -0.033606 0.022148
Survived -0.005007 1.000000 -0.338481 0.543351 -0.077221 -0.035322 0.081629 0.257307 0.168240 0.003650 -0.155660
Pclass -0.035144 -0.338481 1.000000 -0.131900 -0.369226 0.083081 0.018443 -0.549500 -0.243292 0.221009 0.081720
Sex -0.042939 0.543351 -0.131900 1.000000 -0.093254 0.114631 0.245489 0.182333 0.082853 0.074115 -0.125722
Age 0.036847 -0.077221 -0.369226 -0.093254 1.000000 -0.308247 -0.189119 0.096067 0.036261 -0.022405 -0.032523
SibSp -0.057527 -0.035322 0.083081 0.114631 -0.308247 1.000000 0.414838 0.159651 -0.059528 -0.026354 0.070941
Parch -0.001652 0.081629 0.018443 0.245489 -0.189119 0.414838 1.000000 0.216225 -0.011069 -0.081228 0.063036
Fare 0.012658 0.257307 -0.549500 0.182333 0.096067 0.159651 0.216225 1.000000 0.269335 -0.117216 -0.166603
Embarked_C -0.001205 0.168240 -0.243292 0.082853 0.036261 -0.059528 -0.011069 0.269335 1.000000 -0.148258 -0.778359
Embarked_Q -0.033606 0.003650 0.221009 0.074115 -0.022405 -0.026354 -0.081228 -0.117216 -0.148258 1.000000 -0.496624
Embarked_S 0.022148 -0.155660 0.081720 -0.125722 -0.032523 0.070941 0.063036 -0.166603 -0.778359 -0.496624 1.000000
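
The same matrix can also be rendered as an actual heat map (a sketch, assuming seaborn and matplotlib are available in the environment):

import matplotlib.pyplot as plt
import seaborn as sns

# colour-coded view of the correlation matrix above
plt.figure(figsize=(10, 8))
sns.heatmap(titanic.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()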

In [ ]: