Introduction

This case study is about predicting which passengers survived the sinking of the Titanic. Our goal is to build a model that predicts, for each passenger, whether they survived. To do this, we use a dataset that describes each passenger through multiple features and indicates whether or not they survived.

Data description

In this section, we load and explore the dataset. First, we import the libraries needed throughout the project.


In [1]:
# Import numerical and data processing libraries
import numpy as np
import pandas as pd

# Import helpers that make it easy to do cross-validation
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# Import machine learning models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Import visualisation libraries
import matplotlib.pyplot as plt
%matplotlib inline

# Import a method in order to make deep copies
from copy import deepcopy

# Import other useful libraries
import itertools

# Set the paths for inputs and outputs
local = 1
if(local == 0):
    inputPath = "../input/"
    outputPath = "../output/"
else:
    inputPath = "data/"
    outputPath = "data/"

Here, we load the datasets. We actually have two datasets:

  • "train.csv": contains information about a set of passengers (multiple columns) together with whether or not they survived (one column). You may download this dataset here in CSV format.
  • "test.csv": contains information about other passengers (multiple columns) but without the survival information. You may download this dataset here in CSV format.

Note that the only structural difference between the two datasets is that the "test" dataset does not contain the "Survived" column (the "label" or "class" to which the passenger belongs).

We describe the columns of the "train" dataset in what follows.


In [2]:
# Create pandas dataframes from the CSV files (and keep untouched copies of the originals)
titanicOrigTrainDS = pd.read_csv(inputPath + "train.csv")
titanicTrainDS = deepcopy(titanicOrigTrainDS)

titanicOrigTestDS = pd.read_csv(inputPath + "test.csv")
titanicTestDS = deepcopy(titanicOrigTestDS)

# Print the first five rows of the dataframe
titanicTrainDS.head(5)


Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Here is a short description of the different columns:

  • PassengerId: Id of the passenger
  • Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • Name: Name
  • Sex: Sex
  • Age: Age in years
  • SibSp: Number of siblings / spouses aboard the Titanic
  • Parch: Number of parents / children aboard the Titanic
  • Ticket: Ticket number
  • Fare: Passenger fare
  • Cabin: Cabin number
  • Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
  • Survived: Survival (0 = No, 1 = Yes)

Hypothesis

Let's consider which variables might affect the outcome of survival (feature selection). In this section, we examine how the survival rate varies with each feature. Note that strong variability suggests the feature has some influence; the absence of variability, however, does not automatically mean the feature has none.

We consider the following features:

  • "Pclass": knowing that first class cabins were closer to the deck of the ship, are passengers from the first class more likely to survive? If yes, passenger class "Pclass" might affect the outcome.
  • "Fare": the passenger fare is probably tied to passenger class and could have a correlation too.
  • "Sex": are women more likely to survive? If yes, "Sex" would be a good predictor.
  • "Age": are children more likely to survive? If yes, "Age" would be a good predictor.
  • "Embarked": people who boarded at certain ports may have had cabins closer or farther away exits.
  • "Sibsp": does being alone give more chance of surviving or less because no one is thinking about you.
  • "Parch": same remark.

However, we do not consider these features:

  • PassengerId
  • Name
  • Ticket
  • Cabin

Let us explore the pre-selected features and their correlations with the variable "Survived".


In [3]:
# What is the percentage of survival by class (1st, 2nd, 3rd)?
titanicTrainDS[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean() 

# We find a large variability: first-class passengers definitely had a better chance of surviving.
# This means that "Pclass" is an important feature.


Out[3]:
Pclass Survived
0 1 0.629630
1 2 0.472826
2 3 0.242363

In [4]:
# What is the percentage of survival by sex?
titanicTrainDS[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean()

# We find a huge variability: women had a much better chance of surviving.
# This is definitely an important feature.


Out[4]:
Sex Survived
0 female 0.742038
1 male 0.188908

In [5]:
# What is the percentage of survival according to the port of embarkation?
titanicTrainDS[["Embarked", "Survived"]].groupby(['Embarked'], as_index=False).mean()


Out[5]:
Embarked Survived
0 C 0.553571
1 Q 0.389610
2 S 0.336957

In [6]:
# What is the percentage of survival according to the number of siblings/spouses aboard (SibSp)?
titanicTrainDS[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean()


Out[6]:
SibSp Survived
0 0 0.345395
1 1 0.535885
2 2 0.464286
3 3 0.250000
4 4 0.166667
5 5 0.000000
6 8 0.000000

In [7]:
# What is the percentage of survival according to the number of parents/children aboard (Parch)?
titanicTrainDS[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean()


Out[7]:
Parch Survived
0 0 0.343658
1 1 0.550847
2 2 0.500000
3 3 0.600000
4 4 0.000000
5 5 0.200000
6 6 0.000000

In [8]:
# What is the percentage of survival according to the age (grouped)?
interval = 10
TempV = round(titanicTrainDS["Age"]//interval)*interval
titanicTrainDS["AgeIntervalMin"] = TempV
titanicTrainDS["AgeIntervalMax"] = TempV + interval
titanicTrainDS[["AgeIntervalMin", "AgeIntervalMax", "Survived"]].groupby(["AgeIntervalMin"], as_index=False).mean()


Out[8]:
AgeIntervalMin AgeIntervalMax Survived
0 0.0 10.0 0.612903
1 10.0 20.0 0.401961
2 20.0 30.0 0.350000
3 30.0 40.0 0.437126
4 40.0 50.0 0.382022
5 50.0 60.0 0.416667
6 60.0 70.0 0.315789
7 70.0 80.0 0.000000
8 80.0 90.0 1.000000

In [9]:
# What is the percentage of survival according to the fare (grouped)?
interval = 25
TempV = round(titanicTrainDS["Fare"]//interval)*interval
titanicTrainDS["FareIntervalMin"] = TempV
titanicTrainDS["FareIntervalMax"] = TempV + interval
titanicTrainDS[["FareIntervalMin", "FareIntervalMax", "Survived"]].groupby(["FareIntervalMin"], as_index=False).mean()


Out[9]:
FareIntervalMin FareIntervalMax Survived
0 0.0 25.0 0.287253
1 25.0 50.0 0.421965
2 50.0 75.0 0.546875
3 75.0 100.0 0.795455
4 100.0 125.0 0.733333
5 125.0 150.0 0.888889
6 150.0 175.0 0.666667
7 200.0 225.0 0.600000
8 225.0 250.0 0.666667
9 250.0 275.0 0.666667
10 500.0 525.0 1.000000
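
As a side note, pandas' pd.cut offers a more idiomatic way to build such age or fare intervals; here is a minimal sketch (not used in the rest of the notebook), assuming the same 10-year bins for "Age":

# Hypothetical alternative: bin "Age" into 10-year intervals with pd.cut
ageBin = pd.cut(titanicTrainDS["Age"], bins=range(0, 100, 10), right=False)
titanicTrainDS.groupby(ageBin)["Survived"].mean()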

We decide to keep all pre-selected features. However, some of them need to be "cleaned" before running models on our datasets.

Data cleaning

Let us have a look at the two datasets ("train" and "test"). Note that the cleaning has to be done in parallel on both of them.


In [10]:
titanicDSs = [titanicTrainDS, titanicTestDS]

In [11]:
# Length of the dataframe
len(titanicTrainDS)


Out[11]:
891

In [12]:
# Summary statistics of the dataframe
titanicTrainDS.describe()


Out[12]:
PassengerId Survived Pclass Age SibSp Parch Fare AgeIntervalMin AgeIntervalMax FareIntervalMin FareIntervalMax
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000 714.000000 714.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208 25.252101 35.252101 22.615039 47.615039
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429 14.970969 14.970969 49.512306 49.512306
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000 0.000000 10.000000 0.000000 25.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400 20.000000 30.000000 0.000000 25.000000
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200 20.000000 30.000000 0.000000 25.000000
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000 30.000000 40.000000 25.000000 50.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200 80.000000 90.000000 500.000000 525.000000

In [13]:
# Length of the dataframe
len(titanicTestDS)


Out[13]:
418

In [14]:
# Summary statistics of the dataframe
titanicTestDS.describe()


Out[14]:
PassengerId Pclass Age SibSp Parch Fare
count 418.000000 418.000000 332.000000 418.000000 418.000000 417.000000
mean 1100.500000 2.265550 30.272590 0.447368 0.392344 35.627188
std 120.810458 0.841838 14.181209 0.896760 0.981429 55.907576
min 892.000000 1.000000 0.170000 0.000000 0.000000 0.000000
25% 996.250000 1.000000 21.000000 0.000000 0.000000 7.895800
50% 1100.500000 3.000000 27.000000 0.000000 0.000000 14.454200
75% 1204.750000 3.000000 39.000000 1.000000 0.000000 31.500000
max 1309.000000 3.000000 76.000000 8.000000 9.000000 512.329200

If we look at the first dataset (the "train" one), we see that all the numerical columns have a count of 891 except the "Age" column, which has a count of 714. This indicates that "Age" contains missing values (null, NA, or not a number).
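
A quick way to quantify the missing values per column (a small check that is not part of the original run):

# Count missing values per column in both datasets
for dataset in titanicDSs:
    print(dataset.isnull().sum())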

Since we don't want to remove the rows with missing values, we choose instead to fill them in. Before doing so, it is worth checking whether missing "Age" values are correlated with other variables. For example, the next cell shows that the share of passengers with a known age is much lower for the "Q" port of embarkation.


In [15]:
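# Flag rows where "Age" is present: 1 if the age is known, 0 if it is missing
# (NaN fails the comparison x >= 0, so missing ages are mapped to 0)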
titanicTrainDS["AgeEmptyOrNot"] =  titanicTrainDS["Age"].apply(lambda x: 1 if x>=0  else 0)
titanicTrainDS[['Embarked', 'AgeEmptyOrNot']].groupby(['Embarked'], as_index=False).mean()


Out[15]:
Embarked AgeEmptyOrNot
0 C 0.773810
1 Q 0.363636
2 S 0.860248

However, the mean age does not seem to differ strongly according to the port of embarkation.


In [16]:
titanicTrainDS[['Embarked', 'Age']].groupby(['Embarked'], as_index=False).mean()


Out[16]:
Embarked Age
0 C 30.814769
1 Q 28.089286
2 S 29.445397

Finally, we decide to fill in all of the missing "Age" values with the median of the column.


In [17]:
# Fill missing values with the median value
for dataset in titanicDSs:
    dataset["Age"] = dataset["Age"].fillna(dataset["Age"].median())

The "Sex" column is non-numeric, we need to convert it. But first, we confirm that this column does not have empty values. then we make the conversion.


In [18]:
# What are the values for this column?
for dataset in titanicDSs:
    print(dataset["Sex"].unique())


['male' 'female']
['male' 'female']

In [19]:
# Convert to numerical values
for dataset in titanicDSs:
    dataset.loc[dataset["Sex"] == "male", "Sex"] = 0
    dataset.loc[dataset["Sex"] == "female", "Sex"] = 1

We do the same with the "Embarked" column. We first check whether there are missing values; there are some in the "train" dataset, and we choose to fill them with the most frequent value.


In [20]:
# What are the values for this column?
for dataset in titanicDSs:
    print(dataset["Embarked"].unique())


['S' 'C' 'Q' nan]
['Q' 'S' 'C']

In [21]:
# Fill missing "Embarked" values (train dataset only) with the most frequent value
mostFrequentOccurrence = titanicTrainDS["Embarked"].dropna().mode()[0]
titanicTrainDS["Embarked"] = titanicTrainDS["Embarked"].fillna(mostFrequentOccurrence)

# Convert to numerical values
for dataset in titanicDSs:
    dataset.loc[dataset["Embarked"] == "S", "Embarked"] = 0
    dataset.loc[dataset["Embarked"] == "C", "Embarked"] = 1
    dataset.loc[dataset["Embarked"] == "Q", "Embarked"] = 2

Finally, we clean the "Fare" variable of the "test" dataset.


In [22]:
titanicTestDS["Fare"] = titanicTestDS["Fare"].fillna(titanicTestDS["Fare"].median())

Model application

Now we can turn to the core of the analysis. We introduce two functions. The first evaluates the accuracy of one type of classification method using cross-validation. The second runs the first function on every combination of predictors (e.g. ["Sex", "Age", "Embarked"], ["Age", "SibSp", "Parch", "Fare"], etc.).

In what follows, we build the list of combinations and then define these functions.


In [23]:
# The columns that can be used in the prediction
predictorsAll = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"] 

# Create all combinations of predictors
predictorCombinations = [] # all combinations of predictors
for index in range(1, len(predictorsAll)+1):
    for subset in itertools.combinations(predictorsAll, index):
         predictorCombinations.append(list(subset))  
            
#predictorCombinations
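
With 7 candidate predictors, this yields 2^7 - 1 = 127 non-empty combinations, so each model type below is evaluated 127 times.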

In [24]:
# Function: Evaluate one algorithm type (and return nbKF fitted algorithms)
# -input
    # predictorsDs: the dataset projected onto the predictors of interest
    # targetDs: the target or label vector (the "Survived" column in our work)
    # algModel: the "template" or model of the algorithm to apply
    # nbKF: the number of cross-validation folds
# -output
    # algs: nbKF fitted algorithms
    # accuracy: the estimated accuracy
def binClassifModel_kf(predictorsDs, targetDs, algModel, nbKF):
    # List of algorithms
    algs = []
    
    # Generate cross-validation folds for the Titanic dataset
    # It returns the row indices corresponding to the train and test folds
    # shuffle is left at its default (False), so the splits are the same every time we run this
    kf = KFold(n_splits=nbKF)

    # List of predictions
    predictions = []

    for trainIndexes, testIndexes in kf.split(predictorsDs):
        # The predictors we're using to train the algorithm  
        # Note how we only take the rows in the train folds
        predictorsTrainDs = (predictorsDs.iloc[trainIndexes,:])
        # The target we're using to train the algorithm
        train_target = targetDs.iloc[trainIndexes]
        
        # Initialize our algorithm class
        alg = deepcopy(algModel)
        # Training the algorithm using the predictors and target
        alg.fit(predictorsTrainDs, train_target)
        algs.append(alg)
        
        # We can now make predictions on the test fold
        thisSplitPredictions = alg.predict(predictorsDs.iloc[testIndexes,:])
        predictions.append(thisSplitPredictions)


    # The predictions are in nbKF separate NumPy arrays
    # Concatenate them into a single array along axis 0 (the only axis)
    predictions = np.concatenate(predictions, axis=0)

    # Map predictions to outcomes (the only possible outcomes are 1 and 0)
    predictions[predictions > .5] = 1
    predictions[predictions <=.5] = 0
    accuracy = len(predictions[predictions == targetDs]) / len(predictions)
    
    # Return the fitted algorithms and the accuracy
    return [algs, accuracy]
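
To illustrate how this function is meant to be called, here is a minimal one-off example (illustrative only, not part of the pipeline below):

# Example: 5-fold evaluation of a logistic regression on two predictors
exampleAlgs, exampleAccuracy = binClassifModel_kf(
    titanicTrainDS[["Pclass", "Sex"]], titanicTrainDS["Survived"], LogisticRegression(), 5)
print(exampleAccuracy)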

In [25]:
# Helper that returns the indices that would sort the list in ascending order
def sort_list(myList):
    return sorted(range(len(myList)), key=lambda i:myList[i])

# Function: Run multiple evaluations for one algorithm type (one for each combination of predictors)
# -input
    # algModel: the "template" or model of the algorithm to apply
    # nbKF: the number of cross-validation folds
# -output
    # none (prints the five best predictor combinations and the accuracy obtained with all predictors)
def getAccuracy_forEachPredictor(algModel, nbKF):
    accuracyList = []
    
    # For each combination of predictors
    for combination in predictorCombinations:
        result = binClassifModel_kf(titanicTrainDS[combination], titanicTrainDS["Survived"], algModel, nbKF)
        accuracy = result[1]
        accuracyList.append(accuracy)

    # Sort the accuracies
    accuracySortedList = sort_list(accuracyList)

    # Display the five best combinations
    for i in range(-5, 0):
        print(predictorCombinations[accuracySortedList[i]], ": ", accuracyList[accuracySortedList[i]])
    #for elementIndex in sort_list(accuracyList1):
    #    print(predictorCombinations[elementIndex], ": ", accuracyList1[elementIndex])
        
    print("--------------------------------------------------")

    # Display the accuracy corresponding to the combination that uses all the predictors
    lastIndex = len(predictorCombinations)-1
    print(predictorCombinations[lastIndex], ":", accuracyList[lastIndex])

Now that we have introduced the above functions, we evaluate a set of classification methods on each combination of predictors. The evaluated methods are:

  • LinearRegression
  • LogisticRegression
  • GaussianNB
  • KNeighborsClassifier
  • DecisionTreeClassifier
  • RandomForestClassifier

In [26]:
algModel = LinearRegression(fit_intercept=True, normalize=True)
getAccuracy_forEachPredictor(algModel, 5)


/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scipy/linalg/basic.py:884: RuntimeWarning: internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver.
  warnings.warn(mesg, RuntimeWarning)
['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked'] :  0.7912457912457912
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare'] :  0.792368125701459
['Pclass', 'Sex', 'Age', 'SibSp'] :  0.7934904601571269
['Pclass', 'Sex', 'Age', 'SibSp', 'Fare'] :  0.7934904601571269
['Pclass', 'Sex', 'Age', 'Parch', 'Embarked'] :  0.7934904601571269
--------------------------------------------------
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] : 0.7878787878787878

In [27]:
algModel = LogisticRegression()
getAccuracy_forEachPredictor(algModel, 5)


['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked'] :  0.7946127946127947
['Pclass', 'Sex', 'Age', 'SibSp', 'Fare', 'Embarked'] :  0.7946127946127947
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] :  0.7946127946127947
['Pclass', 'Sex', 'SibSp', 'Parch'] :  0.7968574635241302
['Pclass', 'Sex', 'SibSp'] :  0.7991021324354658
--------------------------------------------------
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] : 0.7946127946127947

In [28]:
algModel = GaussianNB()
getAccuracy_forEachPredictor(algModel, 5)


['Sex', 'SibSp', 'Parch', 'Embarked'] :  0.7957351290684624
['Sex', 'Age', 'SibSp', 'Fare', 'Embarked'] :  0.7957351290684624
['Sex', 'SibSp'] :  0.7968574635241302
['Sex', 'SibSp', 'Embarked'] :  0.7968574635241302
['Sex', 'SibSp', 'Parch'] :  0.797979797979798
--------------------------------------------------
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] : 0.7856341189674523

In [29]:
algModel = KNeighborsClassifier(n_neighbors=5)
getAccuracy_forEachPredictor(algModel, 5)


['Pclass', 'Sex', 'Parch', 'Embarked'] :  0.7878787878787878
['Sex', 'SibSp', 'Embarked'] :  0.7890011223344556
['Pclass', 'Sex'] :  0.7901234567901234
['Pclass', 'Sex', 'Age', 'SibSp'] :  0.7912457912457912
['Sex', 'SibSp', 'Parch'] :  0.8002244668911336
--------------------------------------------------
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] : 0.691358024691358

In [30]:
algModel = DecisionTreeClassifier(min_samples_split=4, min_samples_leaf=2)
getAccuracy_forEachPredictor(algModel, 5)


['Pclass', 'Sex', 'Age', 'SibSp', 'Parch'] :  0.8047138047138047
['Pclass', 'Sex', 'Parch', 'Embarked'] :  0.8058361391694725
['Pclass', 'Sex', 'Parch', 'Fare', 'Embarked'] :  0.8058361391694725
['Pclass', 'Sex', 'Fare', 'Embarked'] :  0.8103254769921436
['Pclass', 'Sex', 'Embarked'] :  0.8114478114478114
--------------------------------------------------
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] : 0.797979797979798

In [34]:
algModel = RandomForestClassifier(n_estimators=100, min_samples_split=4, min_samples_leaf=2)
getAccuracy_forEachPredictor(algModel, 5)


['Pclass', 'Sex', 'Age', 'SibSp', 'Fare', 'Embarked'] :  0.8249158249158249
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] :  0.8282828282828283
['Pclass', 'Sex', 'Age', 'Fare'] :  0.8305274971941639
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare'] :  0.8305274971941639
['Pclass', 'Sex', 'Age', 'Parch', 'Fare'] :  0.835016835016835
--------------------------------------------------
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] : 0.8282828282828283

Prediction application

After having run all the models, we choose the one that gave the best performance: the "RandomForestClassifier" with the parameters above. We use it with the best combination of predictors, ['Pclass', 'Sex', 'Age', 'Parch', 'Fare'], which reached approximately 83% accuracy.


In [32]:
# Run the model again with the tuned parameters, using the best combination of predictors
algModel = RandomForestClassifier(n_estimators=100, min_samples_split=4, min_samples_leaf=2)
predictors = ['Pclass', 'Sex', 'Age', 'Parch', 'Fare']
result = binClassifModel_kf(titanicTrainDS[predictors], titanicTrainDS["Survived"], algModel, 5)
algList = result[0] # the set of algorithms

predictionsList = []
for alg in algList:
    predictions = alg.predict(titanicTestDS[predictors])
    predictionsList.append(predictions)    

# Each fold produced its own fitted model, so we average their predictions (a voting-like system)
predictionsFinal = np.mean(predictionsList, axis=0)

# Map predictions to outcomes (the only possible outcomes are 1 and 0)
predictionsFinal[predictionsFinal > .5] = 1
predictionsFinal[predictionsFinal <=.5] = 0

# Cast as int
predictionsFinal = predictionsFinal.astype(int)

Finally, we can generate the submission file with the survival prediction for each passenger of the test dataset.


In [33]:
# Create a new dataset with only the id and the target column
submission = pd.DataFrame({
        "PassengerId": titanicTestDS["PassengerId"],
        "Survived": predictionsFinal
    })

# submission.to_csv(outputPath + 'submission.csv', index=False)
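
A small optional sanity check (not in the original notebook): the submission should contain one row per test passenger and exactly the two expected columns.

# Optional sanity check on the submission dataframe
assert len(submission) == len(titanicTestDS)
assert list(submission.columns) == ["PassengerId", "Survived"]
submission.head()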

Conclusion

Throughout this work, we tried to build a good model for predicting the survival of passengers in the Titanic disaster. As future work, we could investigate the influence of further feature cleaning and scaling (such as scaling "Fare") on the overall performance, as sketched below.
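
For instance, a minimal sketch of scaling "Fare" with scikit-learn's StandardScaler could look as follows (a hypothetical extension, not evaluated in this work):

# Hypothetical sketch: standardise the "Fare" column of both datasets
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
titanicTrainDS["FareScaled"] = scaler.fit_transform(titanicTrainDS[["Fare"]]).ravel()
titanicTestDS["FareScaled"] = scaler.transform(titanicTestDS[["Fare"]]).ravel()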

References

Many ideas in this work are inspired by the great tutorials of the Titanic competition and other sources.

