This case study is about predicting which passengers survived the sinking of the Titanic. We would like to build a model that predicts the survival of each passenger. To do this, we use a dataset that describes each passenger (multiple features) and whether they survived or not.
In [1]:
# Import numerical and data processing libraries
import numpy as np
import pandas as pd
# Import helpers that make it easy to do cross-validation
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# Import machine learning models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
# Import visualisation libraries
import matplotlib.pyplot as plt
%matplotlib inline
# Import a method in order to make deep copies
from copy import deepcopy
# Import other useful libraries
import itertools
# Set the paths for inputs and outputs
local = 1
if local == 0:
    inputPath = "../input/"
    outputPath = "../output/"
else:
    inputPath = "data/"
    outputPath = "data/"
Here, we load the datasets. We actually have 2 datasets: a "train" dataset and a "test" dataset.
Note that the only difference in the structure of the 2 datasets is that the "test" dataset does not contain "Survived" column (the "label" or the "class" to which the passenger belongs).
We describe in what follows the columns of the "train" dataset.
In [2]:
# This creates pandas dataframes from the CSV files and keeps a working copy of each one
titanicOrigTrainDS = pd.read_csv(inputPath + "train.csv")
titanicTrainDS = deepcopy(titanicOrigTrainDS)
titanicOrigTestDS = pd.read_csv(inputPath + "test.csv")
titanicTestDS = deepcopy(titanicOrigTestDS)
# Print the first five rows of the dataframe
titanicTrainDS.head(5)
Out[2]:
Here is a short description of the different columns:
Let's consider which variables might affect the outcome of survival (feature selection). In this section, we test the variability of the survival percentage according to each feature. Note that variability suggests that the feature has some influence, but the opposite is not automatically true: a feature whose survival percentage barely varies may still be influential, for example in combination with other features.
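To illustrate why a flat survival percentage does not rule a feature out, here is a minimal sketch on a hypothetical toy table (not the Titanic data) where the outcome is the exclusive-or of two binary features. The names `toy`, `a`, `b` and `outcome` are assumptions made for the example.

```python
import pandas as pd

# Toy data (hypothetical, not the Titanic set): outcome = a XOR b
toy = pd.DataFrame({
    "a": [0, 0, 1, 1],
    "b": [0, 1, 0, 1],
})
toy["outcome"] = toy["a"] ^ toy["b"]

# Outcome rate per value of each feature taken alone: flat at 0.5
rate_by_a = toy.groupby("a")["outcome"].mean()
rate_by_b = toy.groupby("b")["outcome"].mean()

# Outcome rate per (a, b) pair: the two features together decide the outcome
rate_by_ab = toy.groupby(["a", "b"])["outcome"].mean()
print(rate_by_a, rate_by_b, rate_by_ab, sep="\n\n")
```

Taken one at a time, neither feature shows any variability, yet together they determine the outcome completely.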
We consider the following features: "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare" and "Embarked".
However, we do not consider these features:
Let us explore the pre-selected features and their correlations with the variable "Survived".
In [3]:
# What is the percentage of survival by class (1st, 2nd, 3rd)?
titanicTrainDS[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean()
# We find a big variability. The first class passengers definitely had a better chance of surviving.
# This means that "Pclass" is an important feature.
Out[3]:
In [4]:
# What is the percentage of survival by sex?
titanicTrainDS[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean()
# We find a huge variability. Women had a better chance of surviving.
# This is definitely an important feature.
Out[4]:
In [5]:
# What is the percentage of survival according to the port of embarkation
titanicTrainDS[["Embarked", "Survived"]].groupby(['Embarked'], as_index=False).mean()
Out[5]:
In [6]:
# What is the percentage of survival according to the number of siblings/spouses aboard?
titanicTrainDS[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean()
Out[6]:
In [7]:
# What is the percentage of survival according to the number of parents/children aboard?
titanicTrainDS[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean()
Out[7]:
In [8]:
# What is the percentage of survival according to the age (grouped)?
interval = 10
TempV = round(titanicTrainDS["Age"]//interval)*interval
titanicTrainDS["AgeIntervalMin"] = TempV
titanicTrainDS["AgeIntervalMax"] = TempV + interval
titanicTrainDS[["AgeIntervalMin", "AgeIntervalMax", "Survived"]].groupby(["AgeIntervalMin"], as_index=False).mean()
Out[8]:
In [9]:
# What is the percentage of survival according to the fare (grouped)?
interval = 25
TempV = round(titanicTrainDS["Fare"]//interval)*interval
titanicTrainDS["FareIntervalMin"] = TempV
titanicTrainDS["FareIntervalMax"] = TempV + interval
titanicTrainDS[["FareIntervalMin", "FareIntervalMax", "Survived"]].groupby(["FareIntervalMin"], as_index=False).mean()
Out[9]:
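The interval grouping used above for "Age" and "Fare" can also be sketched with `pandas.cut`, which labels each row with its interval directly. This is only an alternative sketch on a hypothetical sample; the names `sample`, `edges` and `AgeBand` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample, not the real data
sample = pd.DataFrame({
    "Age": [4, 15, 22, 29, 35, 38, 61],
    "Survived": [1, 0, 1, 0, 1, 1, 0],
})
interval = 10
# Bin edges 0, 10, ..., 70; right=False gives [0, 10), [10, 20), ...
edges = list(range(0, 80, interval))
sample["AgeBand"] = pd.cut(sample["Age"], bins=edges, right=False)
# observed=True keeps only the bands that actually occur in the sample
rate_by_band = sample.groupby("AgeBand", observed=True)["Survived"].mean()
print(rate_by_band)
```

Rows with a missing age would get no band and would therefore be left out of the grouped means.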
We decide to keep all pre-selected features. However, some of them need to be "cleaned" before running models on our datasets.
In [10]:
titanicDSs = [titanicTrainDS, titanicTestDS]
In [11]:
# Length of the dataframe
len(titanicTrainDS)
Out[11]:
In [12]:
# Summary on the dataframe
titanicTrainDS.describe()
Out[12]:
In [13]:
# Length of the dataframe
len(titanicTestDS)
Out[13]:
In [14]:
# Summary on the dataframe
titanicTestDS.describe()
Out[14]:
If we look at the first dataset (the "train" one), we see that all the numerical columns have a count of 891 except the "Age" column, which has a count of 714. This indicates that there are missing values (null, NA, or not a number).
As we don't want to remove the rows with missing values, we choose to clean the data by filling in all of the missing values. It is a good idea to first test whether the missing values for "Age" are correlated with another variable. For example, we see below that there are far more missing values for the "Q" port of embarkation.
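Besides reading the counts from `describe()`, the missing values can be counted directly, including in the non-numeric columns. Here is a minimal sketch on a hypothetical sample; the values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a few missing entries
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.1],
    "Embarked": ["S", "C", None, "S"],
})
# isna() flags NaN and None; summing the flags counts them per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```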
In [15]:
# Flag whether the age is known (1) or missing (0); a comparison with NaN is always False
titanicTrainDS["AgeEmptyOrNot"] = titanicTrainDS["Age"].apply(lambda x: 1 if x>=0 else 0)
titanicTrainDS[['Embarked', 'AgeEmptyOrNot']].groupby(['Embarked'], as_index=False).mean()
Out[15]:
However, the mean age does not seem to differ strongly according to the port of embarkation.
In [16]:
titanicTrainDS[['Embarked', 'Age']].groupby(['Embarked'], as_index=False).mean()
Out[16]:
Finally, we decide to clean the data by simply filling in all of the missing values with the median of the values in the column.
In [17]:
# Fill missing values with the median value
for dataset in titanicDSs:
    dataset["Age"] = dataset["Age"].fillna(dataset["Age"].median())
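The cell above fills each dataset with its own median. A common variation, sketched here on hypothetical data, is to compute the median on the "train" dataset only and reuse it for the "test" dataset, so that nothing is learned from the test rows. The names `train`, `test` and `train_median` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two datasets
train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"Age": [np.nan, 50.0]})

# median() skips missing values, so this is the median of 20, 30 and 40
train_median = train["Age"].median()
for frame in (train, test):
    frame["Age"] = frame["Age"].fillna(train_median)
print(train["Age"].tolist(), test["Age"].tolist())
```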
The "Sex" column is non-numeric, so we need to convert it. But first, we confirm that this column does not have empty values. Then we make the conversion.
In [18]:
# What are the values for this column?
for dataset in titanicDSs:
    print(dataset["Sex"].unique())
In [19]:
# Convert to numerical values
for dataset in titanicDSs:
    dataset.loc[dataset["Sex"] == "male", "Sex"] = 0
    dataset.loc[dataset["Sex"] == "female", "Sex"] = 1
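The same conversion can be sketched with `Series.map`, which builds a new numeric column in one step, whereas element-wise assignment with `.loc` may leave the column with the generic object dtype. The sample and the name `sex_codes` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample of the "Sex" column
sample = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})
# Same coding as in the notebook: male -> 0, female -> 1
sex_codes = {"male": 0, "female": 1}
sample["Sex"] = sample["Sex"].map(sex_codes)
print(sample["Sex"].dtype)
```

Any value absent from the dictionary would become NaN, which makes an unexpected category easy to spot.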
We do the same with the "Embarked" column. We first check whether there are missing values. There are some, and we choose to fill them with the most frequent value.
In [20]:
# What are the values for this column?
for dataset in titanicDSs:
    print(dataset["Embarked"].unique())
In [21]:
# Fill missing values with most frequent value
mostFrequentOccurrence = titanicTrainDS["Embarked"].dropna().mode()[0]
titanicTrainDS["Embarked"] = titanicTrainDS["Embarked"].fillna(mostFrequentOccurrence)
# Convert to numerical values
for dataset in titanicDSs:
    dataset.loc[dataset["Embarked"] == "S", "Embarked"] = 0
    dataset.loc[dataset["Embarked"] == "C", "Embarked"] = 1
    dataset.loc[dataset["Embarked"] == "Q", "Embarked"] = 2
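The cell above fills the missing ports only in the "train" dataset. If the "test" dataset also contained a missing port, that row would keep a non-numeric value. A slightly more defensive sketch, on hypothetical data, applies the most frequent "train" value to both datasets before the conversion; the names `most_frequent` and `port_codes` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets
train = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q"]})
test = pd.DataFrame({"Embarked": ["Q", None, "C"]})

# mode() ignores missing values by default; take the first most frequent value
most_frequent = train["Embarked"].mode()[0]
# Same coding as in the notebook: S -> 0, C -> 1, Q -> 2
port_codes = {"S": 0, "C": 1, "Q": 2}
for frame in (train, test):
    frame["Embarked"] = frame["Embarked"].fillna(most_frequent).map(port_codes)
print(train["Embarked"].tolist(), test["Embarked"].tolist())
```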
Finally, we clean the "Fare" variable of the "test" dataset.
In [22]:
titanicTestDS["Fare"] = titanicTestDS["Fare"].fillna(titanicTestDS["Fare"].median())
Now, we can turn to the core of the analysis. We introduce a couple of functions. The first function evaluates the accuracy of one type of classification method. The second function runs the first one on each combination of predictors (e.g. ["Sex", "Age", "Embarked"] or ["Age", "SibSp", "Parch", "Fare"]).
In what follows, we build the list of combinations and then introduce these functions.
In [23]:
# The columns that can be used in the prediction
predictorsAll = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Create all combinations of predictors
predictorCombinations = [] # all combinations of predictors
for index in range(1, len(predictorsAll)+1):
    for subset in itertools.combinations(predictorsAll, index):
        predictorCombinations.append(list(subset))
#predictorCombinations
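As a quick sanity check on the loop above: 7 predictors give 2^7 - 1 = 127 non-empty subsets, so each classification method is evaluated 127 times. The sketch below rebuilds the list under the assumed names `predictors` and `combos`.

```python
import itertools

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Every non-empty subset, from single predictors up to the full list
combos = [
    list(subset)
    for size in range(1, len(predictors) + 1)
    for subset in itertools.combinations(predictors, size)
]
print(len(combos))
```

The last element of the list is the combination that uses all the predictors.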
In [24]:
# Function: Evaluate one algorithm type (and return nbKF fitted algorithms)
# -input
#   predictorsDs: the dataset projected to the predictors of interest
#   targetDs: the target or label vector of interest (the column "Survived" in our work)
#   algModel: the "template" or model of the algorithm to apply
#   nbKF: the number of cross-validation folds
# -output
#   algs: nbKF fitted algorithms
#   accuracy: the evaluation of the accuracy
def binClassifModel_kf(predictorsDs, targetDs, algModel, nbKF):
    # List of algorithms
    algs = []
    # Generate cross-validation folds for the titanic data set
    # It returns the row indices corresponding to train and test
    # Without shuffling, the folds are consecutive blocks of rows, so we get the same splits
    # every time (recent scikit-learn versions reject random_state when shuffle is False)
    kf = KFold(n_splits=nbKF)
    # List of predictions
    predictions = []
    for trainIndexes, testIndexes in kf.split(predictorsDs):
        # The predictors we're using to train the algorithm
        # Note how we only take the rows in the train folds
        predictorsTrainDs = predictorsDs.iloc[trainIndexes, :]
        # The target we're using to train the algorithm
        train_target = targetDs.iloc[trainIndexes]
        # Initialize our algorithm class
        alg = deepcopy(algModel)
        # Training the algorithm using the predictors and target
        alg.fit(predictorsTrainDs, train_target)
        algs.append(alg)
        # We can now make predictions on the test fold
        thisSplitPredictions = alg.predict(predictorsDs.iloc[testIndexes, :])
        predictions.append(thisSplitPredictions)
    # The predictions are in nbKF separate NumPy arrays (one per fold)
    # Concatenate them into a single array along axis 0 (the only axis);
    # the folds are consecutive, so the row order matches targetDs
    predictions = np.concatenate(predictions, axis=0)
    # Map predictions to outcomes (the only possible outcomes are 1 and 0)
    predictions[predictions > .5] = 1
    predictions[predictions <= .5] = 0
    # Accuracy: the proportion of predictions equal to the true outcome
    accuracy = float(np.mean(predictions == targetDs.values))
    # Return the fitted algorithms and the accuracy
    return [algs, accuracy]
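The notebook imports `cross_val_score` but builds the folds by hand in the function above. For comparison, here is a minimal sketch of the built-in helper on synthetic data (generated with `make_classification`, not the Titanic data); the names `X`, `y`, `scores` and `mean_accuracy` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification data (not the Titanic data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
# Same kind of split as in the function above: 5 consecutive folds
kf = KFold(n_splits=5)
# One accuracy per fold, each computed on the rows held out for that fold
scores = cross_val_score(LogisticRegression(), X, y, cv=kf, scoring="accuracy")
mean_accuracy = scores.mean()
print(scores, mean_accuracy)
```

Two differences are worth keeping in mind. The helper averages one accuracy per fold, while the function above pools all the predictions before computing a single accuracy; the two agree when the folds have the same size. The helper also does not return the fitted algorithms, which the notebook reuses later for the final predictions.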
In [25]:
# Helper that returns the indexes that sort the list in ascending order
def sort_list(myList):
    return sorted(range(len(myList)), key=lambda i: myList[i])
# Function: Run multiple evaluations for one algorithm type (one for each combination of predictors)
# -input
#   algModel: the "template" or model of the algorithm to apply
#   nbKF: the number of cross-validation folds
# -output
#   none (the results are printed)
def getAccuracy_forEachPredictor(algModel, nbKF):
    accuracyList = []
    # For each combination of predictors
    for combination in predictorCombinations:
        result = binClassifModel_kf(titanicTrainDS[combination], titanicTrainDS["Survived"], algModel, nbKF)
        accuracy = result[1]
        accuracyList.append(accuracy)
    # Sort the accuracies
    accuracySortedList = sort_list(accuracyList)
    # Display the five best combinations (the best one is printed last)
    for i in range(-5, 0):
        print(predictorCombinations[accuracySortedList[i]], ": ", accuracyList[accuracySortedList[i]])
    #for elementIndex in sort_list(accuracyList1):
    #    print(predictorCombinations[elementIndex], ": ", accuracyList1[elementIndex])
    print("--------------------------------------------------")
    # Display the accuracy corresponding to the combination that uses all the predictors
    lastIndex = len(predictorCombinations)-1
    print(predictorCombinations[lastIndex], ":", accuracyList[lastIndex])
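Now that we have introduced the above functions, we evaluate a set of classification methods on each combination of predictors. Here are the evaluated classification methods: linear regression, logistic regression, Gaussian naive Bayes, k-nearest neighbours, decision tree and random forest.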
In [26]:
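# The "normalize" argument was removed in recent scikit-learn versions;
# for an ordinary least squares fit it should not change the predictions
algModel = LinearRegression(fit_intercept=True)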
getAccuracy_forEachPredictor(algModel, 5)
In [27]:
algModel = LogisticRegression()
getAccuracy_forEachPredictor(algModel, 5)
In [28]:
algModel = GaussianNB()
getAccuracy_forEachPredictor(algModel, 5)
In [29]:
algModel = KNeighborsClassifier(n_neighbors=5)
getAccuracy_forEachPredictor(algModel, 5)
In [30]:
algModel = DecisionTreeClassifier(min_samples_split=4, min_samples_leaf=2)
getAccuracy_forEachPredictor(algModel, 5)
In [34]:
algModel = RandomForestClassifier(n_estimators=100, min_samples_split=4, min_samples_leaf=2)
getAccuracy_forEachPredictor(algModel, 5)
After having run all the models, we choose the model that gave the best performance: the "RandomForestClassifier" with the specific parameters above. Furthermore, we use it with the best combination of predictors, ['Pclass', 'Sex', 'Age', 'Parch', 'Fare'], which gave an accuracy of approximately 83%.
In [32]:
# Run the model again with the tuned parameters on the dataset, using the best combination of predictors
algModel = RandomForestClassifier(n_estimators=100, min_samples_split=4, min_samples_leaf=2)
predictors = ['Pclass', 'Sex', 'Age', 'Parch', 'Fare']
result = binClassifModel_kf(titanicTrainDS[predictors], titanicTrainDS["Survived"], algModel, 5)
algList = result[0] # the set of fitted algorithms (one per fold)
predictionsList = []
for alg in algList:
    predictions = alg.predict(titanicTestDS[predictors])
    predictionsList.append(predictions)
# There are different predictions, we take the mean (a voting-like system)
predictionsFinal = np.mean(predictionsList, axis=0)
# Map predictions to outcomes (the only possible outcomes are 1 and 0)
predictionsFinal[predictionsFinal > .5] = 1
predictionsFinal[predictionsFinal <= .5] = 0
# Cast as int
predictionsFinal = predictionsFinal.astype(int)
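The voting step can be checked on a small hand-made example. With five fitted algorithms, taking the mean and thresholding at 0.5 amounts to a majority vote. The predictions below are hypothetical.

```python
import numpy as np

# Hypothetical predictions of 5 fitted algorithms for 3 passengers
fold_predictions = [
    np.array([1, 0, 1]),
    np.array([1, 1, 0]),
    np.array([0, 0, 1]),
    np.array([1, 0, 0]),
    np.array([1, 1, 1]),
]
# Share of algorithms voting "survived" for each passenger
mean_vote = np.mean(fold_predictions, axis=0)
# A passenger is predicted to survive when more than half of the votes say so
final = (mean_vote > 0.5).astype(int)
print(mean_vote, final)
```

With an odd number of algorithms there can be no tie, so the choice between `>` and `>=` at the threshold does not matter here.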
Finally, we can generate the submission file with the prediction of the survival for each passenger of the test dataset.
In [33]:
# Create a new dataset with only the id and the target column
submission = pd.DataFrame({
    "PassengerId": titanicTestDS["PassengerId"],
    "Survived": predictionsFinal
})
# submission.to_csv(outputPath + 'submission.csv', index=False)
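Before uploading, it can be worth checking that the file reads back with the expected two columns. Here is a minimal sketch that writes a hypothetical submission to a temporary directory; the identifiers and predictions are made up for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical submission with made-up identifiers and predictions
submission = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 0],
})
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    # index=False keeps the row index out of the file
    submission.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
print(reloaded)
```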
Many ideas in this work are inspired by the great tutorials of the Titanic competition and other sources.