Goal: predict survival on the Titanic
It's a basic learning competition on the ML platform Kaggle and a simple introduction to machine learning concepts, specifically binary classification (survived / not survived).
Here we are looking into how to apply Logistic Regression to the Titanic dataset.
The data can be downloaded directly from Kaggle.
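For instance, with the Kaggle command-line tool installed and an API token configured (an assumption; any other download method works just as well), the competition files can be fetched from a notebook cell. Note that Kaggle names the training file train.csv, so it may need to be renamed or the path below adjusted:

!kaggle competitions download -c titanic
!unzip -o titanic.zip -d ../datasets/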
In [45]:
import pandas as pd
In [46]:
# get titanic training file as a DataFrame
titanic = pd.read_csv("../datasets/titanic_train.csv")
In [47]:
titanic.shape
Out[47]:
In [48]:
# preview the data
titanic.head()
Out[48]:
Survived: Survived (1) or died (0); this is the target variable
Pclass: Passenger's class (1st, 2nd or 3rd class)
Name: Passenger's name
Sex: Passenger's sex
Age: Passenger's age
SibSp: Number of siblings/spouses aboard
Parch: Number of parents/children aboard
Ticket: Ticket number
Fare: Fare paid by the passenger
Cabin: Cabin number
Embarked: Port of embarkation
In [49]:
titanic.describe()
Out[49]:
Not all features are numeric:
In [50]:
titanic.info()
There are three ports: C = Cherbourg, Q = Queenstown, S = Southampton
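A quick optional check of how the passengers are distributed across the ports (including any missing values):

titanic.Embarked.value_counts(dropna=False)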
In [51]:
ports = pd.get_dummies(titanic.Embarked, prefix='Embarked')  # one binary column per port
ports.head()
Out[51]:
Now the feature Embarked (a category) has been transformed into 3 binary features, e.g. Embarked_C = 0 means the passenger did not embark in Cherbourg, Embarked_C = 1 means they did.
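As a side note, one of the three columns could be dropped to avoid perfect collinearity between the indicators (the so-called dummy variable trap); a minimal variant, not used here:

ports_alt = pd.get_dummies(titanic.Embarked, prefix='Embarked', drop_first=True)  # keeps only Embarked_Q and Embarked_S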
Finally, the 3 new binary features replace the original one in the data frame:
In [52]:
titanic = titanic.join(ports)
titanic.drop(['Embarked'], axis=1, inplace=True) # then drop the original column
In [53]:
titanic.Sex = titanic.Sex.map({'male':0, 'female':1})  # encode Sex as numeric: 0 = male, 1 = female
In [54]:
y = titanic.Survived.copy()  # copy the "y" column values out
In [55]:
X = titanic.drop(['Survived'], axis=1) # then, drop y column
In [56]:
X.drop(['Cabin'], axis=1, inplace=True)   # mostly missing values
In [57]:
X.drop(['Ticket'], axis=1, inplace=True)  # raw ticket number, not used here
In [58]:
X.drop(['Name'], axis=1, inplace=True)    # free text, not used here
In [59]:
X.drop(['PassengerId'], axis=1, inplace=True)  # just a row identifier
In [60]:
X.info()
All features are now numeric, ready for regression.
But there are still a couple of processing steps to do.
In [61]:
X.isnull().values.any()
Out[61]:
In [62]:
# X[pd.isnull(X).any(axis=1)]  # uncomment to display the rows containing missing values
The check returns True: there are missing values (NaN) in the data, and a quick look reveals that they are all in the Age feature.
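An optional per-column count makes this easy to see:

X.isnull().sum()  # number of missing values in each column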
One possibility could be to remove the feature; another one is to fill the missing values with a fixed number or with the average age.
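As an alternative sketch (assuming scikit-learn's impute module; not used in the rest of the notebook), the gap-filling could also be done with SimpleImputer, e.g. with the median, which is more robust to outliers than the mean:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
# X[['Age']] = imputer.fit_transform(X[['Age']])  # would replace the fillna step below

Below, the mean is used instead: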
In [63]:
X['Age'] = X['Age'].fillna(X['Age'].mean())  # replace NaN with the average age (assignment is safer than chained inplace in recent pandas)
In [64]:
X.isnull().values.any()
Out[64]:
Now all missing values have been filled in; logistic regression would otherwise not work with them.
The training set will be used to build the machine learning model, based on features such as the passengers' gender and class, together with the known survival flag.
The validation set will be used to see how well the model performs on unseen data: for each passenger in the validation set, the trained model predicts whether or not they survived the sinking of the Titanic, and the prediction is then compared with the actual survival flag.
In [65]:
from sklearn.model_selection import train_test_split
# 80% go into the training set, 20% into the validation set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=7)
A baseline is always useful to see if the trained model behaves significantly better than an easy-to-obtain reference, such as a random guess or a simple heuristic like "all and only female passengers survived". In this case, after quickly looking at the training dataset - where the survival outcome is present - I am going to use the following:
In [66]:
def simple_heuristic(titanicDF):
    '''
    Predict whether each passenger survived or perished.
    The algorithm predicts that a passenger survived:
    1) if the passenger is female, or
    2) if his socioeconomic status is high (1st class) AND the passenger is under 18
    '''
    predictions = []  # a list
    for passenger_index, passenger in titanicDF.iterrows():
        if passenger['Sex'] == 1:
            # female
            predictions.append(1)  # survived
        elif passenger['Age'] < 18 and passenger['Pclass'] == 1:
            # male but minor and rich
            predictions.append(1)  # survived
        else:
            predictions.append(0)  # everyone else perished
    return predictions
Let's see how this simple algorithm will behave on the validation dataset and we will keep that number as our baseline:
In [67]:
simplePredictions = simple_heuristic(X_valid)
correct = sum(simplePredictions == y_valid)
print ("Baseline: ", correct/len(y_valid))
Baseline: the simple heuristic correctly predicts about 73% of the validation cases.
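The same accuracy could equivalently be computed with scikit-learn's accuracy_score (an optional alternative, not used here):

from sklearn.metrics import accuracy_score
accuracy_score(y_valid, simplePredictions)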
Now let's see if the model can do better.
In [68]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
In [69]:
model.fit(X_train, y_train)
Out[69]:
In [70]:
model.score(X_train, y_train)
Out[70]:
In [71]:
model.score(X_valid, y_valid)
Out[71]:
A couple of remarks: the scores above show how the model performs on the training set and on the unseen validation set, to be compared with the baseline. Moreover, an advantage of logistic regression (e.g. compared to a neural network) is that it is easily interpretable: it can be written as a mathematical formula.
In [72]:
model.intercept_ # the fitted intercept
Out[72]:
In [73]:
model.coef_ # the fitted coefficients
Out[73]:
Which means that the formula is:
$$ P(survive) = \frac{1}{1+e^{-logit}} $$

where the logit is:

$$ logit = \beta_{0} + \beta_{1}\cdot x_{1} + ... + \beta_{n}\cdot x_{n} $$

where $\beta_{0}$ is the model intercept and the other beta parameters are the model coefficients from above, each multiplied by the related feature:

$$ logit = 1.4224 - 0.9319 \cdot Pclass + ... + 0.2228 \cdot Embarked\_S $$

A heat map of the correlations may give us an understanding of which variables are important.
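Before computing the correlations, a quick sanity check that the formula above reproduces the model's predicted probabilities (a minimal sketch using the fitted model and X_valid from the cells above; the comparison should return True):

import numpy as np

logit = model.intercept_ + X_valid.values @ model.coef_.ravel()  # beta_0 + beta . x for each passenger
p_survive = 1 / (1 + np.exp(-logit))                             # logistic (sigmoid) function
np.allclose(p_survive, model.predict_proba(X_valid)[:, 1])       # compare with sklearn's probabilities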
In [74]:
titanic.corr(numeric_only=True)  # correlation between the numeric features (numeric_only is required in recent pandas)
Out[74]:
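A possible way to visualise the correlation matrix as a heat map, assuming seaborn and matplotlib are available (not shown in the original output):

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
sns.heatmap(titanic.corr(numeric_only=True), annot=True, cmap='coolwarm')  # annotate each cell with its correlation
plt.show()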