Introduction

This case study is about predicting which passengers survived the sinking of the famous Titanic. In our work, we would like to establish a model that predicts the survival of each passenger. In order to do this, we will use a dataset that describe each passenger (multiple features) and if they survived or not.

Data description

In this section, we load and explore the dataset. First, we import the libraries needed along the project.



In [1]:

    
# Import numerical and data processing libraries
import numpy as np
import pandas as pd

# Import helpers that make it easy to do cross-validation
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# Import machine learning models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Import visualisation libraries
import matplotlib.pyplot as plt
%matplotlib inline

# Import a method in order to make deep copies
from copy import deepcopy

# Import an other usefull libraries
import itertools

# Set the paths for inputs and outputs
local = 1
if(local == 0):
    inputPath = "../input/"
    outputPath = "../output/"
else:
    inputPath = "data/"
    outputPath = "data/"

Here, we load the datasets. We actually have 2 datasets:

"train.csv": contains informations about some passengers (multiple columns) and the fact that they survived or not (one column). You may download this dataset here in CSV format.
"test.csv": contains informations about some other passengers (multiple columns) but without the survival information. You may download this dataset here in CSV format.

Note that the only difference in the structure of the 2 datasets is that the "test" dataset does not contain "Survived" column (the "label" or the "class" to which the passenger belongs).

We describe in what follows the columns of the "train" dataset.



In [2]:

    
# This creates a pandas dataframe and assigns it to the titanic variable
titanicOrigTrainDS = pd.read_csv(inputPath + "train.csv")
titanicTrainDS = deepcopy(titanicOrigTrainDS)

titanicOrigTestDS = pd.read_csv(inputPath + "test.csv")
titanicTestDS = deepcopy(titanicOrigTestDS)

# Print the first five rows of the dataframe
titanicTrainDS.head(5)









    Out[2]:






  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22.0
      1
      0
      A/5 21171
      7.2500
      NaN
      S
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38.0
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26.0
      0
      0
      STON/O2. 3101282
      7.9250
      NaN
      S
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35.0
      1
      0
      113803
      53.1000
      C123
      S
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35.0
      0
      0
      373450
      8.0500
      NaN
      S

Here is a short description of the different columns:

PassengerId: Id of the passenger
Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
Name: Name
Sex: Sex
Age: Age in years
Sibsp: Number of siblings / spouses aboard the Titanic
Parch: Number of parents / children aboard the Titanic
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Survival: Survival (0 = No, 1 = Yes)

Hypothesis

Let's consider which variables might affect the outcome of survival (feature selection). In this section, we test the variability of the survival percentage according to each feature. It is to be noted that a variability induce that the feature has some influence. But the opposite is not automatically true.

We consider the following features:

"Pclass": knowing that first class cabins were closer to the deck of the ship, are passengers from the first class more likely to survive? If yes, passenger class "Pclass" might affect the outcome.
"Fare": the passenger fare is probably tied to passenger class and could have a correlation too.
"Sex": are women more likely to survive? If yes, "Sex" would be a good predictor.
"Age": are children more likely to survive? If yes, "Age" would be a good predictor.
"Embarked": people who boarded at certain ports may have had cabins closer or farther away exits.
"Sibsp": does being alone give more chance of surviving or less because no one is thinking about you.
"Parch": same remark.

However, we do not consider these features:

PassengerId
Name
Ticket
Cabin

Let us explore the pre-selected features and their correlations with the variable "Survived".



In [3]:

    
# What is the percentage of survival by class (1st, 2nd, 3rd)?
titanicTrainDS[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean() 

# We find a big variability. The first class passengers had definetely more chances to survive. 
# This means that "Pclass" is an important feature.



In [4]:

    
# What is the percentage of survival by sex?
titanicTrainDS[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean()

# We find a huge variability. Woman had more chances to survive. 
# This is definitely an important feature.



In [5]:

    
# What is the percentage of survival according to the port of embarkation
titanicTrainDS[["Embarked", "Survived"]].groupby(['Embarked'], as_index=False).mean()



In [6]:

    
# What is the percentage of survival according to the number of siblings?
titanicTrainDS[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean()



In [7]:

    
# What is the percentage of survival according to the number of parents?
titanicTrainDS[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean()



In [8]:

    
# What is the percentage of survival according to the age (grouped)?
interval = 10
TempV = round(titanicTrainDS["Age"]//interval)*interval
titanicTrainDS["AgeIntervalMin"] = TempV
titanicTrainDS["AgeIntervalMax"] = TempV + interval
titanicTrainDS[["AgeIntervalMin", "AgeIntervalMax", "Survived"]].groupby(["AgeIntervalMin"], as_index=False).mean()









    Out[8]:






  
    
      
      AgeIntervalMin
      AgeIntervalMax
      Survived
    
  
  
    
      0
      0.0
      10.0
      0.612903
    
    
      1
      10.0
      20.0
      0.401961
    
    
      2
      20.0
      30.0
      0.350000
    
    
      3
      30.0
      40.0
      0.437126
    
    
      4
      40.0
      50.0
      0.382022
    
    
      5
      50.0
      60.0
      0.416667
    
    
      6
      60.0
      70.0
      0.315789
    
    
      7
      70.0
      80.0
      0.000000
    
    
      8
      80.0
      90.0
      1.000000



In [9]:

    
# What is the percentage of survival according to the fare (grouped)?
interval = 25
TempV = round(titanicTrainDS["Fare"]//interval)*interval
titanicTrainDS["FareIntervalMin"] = TempV
titanicTrainDS["FareIntervalMax"] = TempV + interval
titanicTrainDS[["FareIntervalMin", "FareIntervalMax", "Survived"]].groupby(["FareIntervalMin"], as_index=False).mean()









    Out[9]:






  
    
      
      FareIntervalMin
      FareIntervalMax
      Survived
    
  
  
    
      0
      0.0
      25.0
      0.287253
    
    
      1
      25.0
      50.0
      0.421965
    
    
      2
      50.0
      75.0
      0.546875
    
    
      3
      75.0
      100.0
      0.795455
    
    
      4
      100.0
      125.0
      0.733333
    
    
      5
      125.0
      150.0
      0.888889
    
    
      6
      150.0
      175.0
      0.666667
    
    
      7
      200.0
      225.0
      0.600000
    
    
      8
      225.0
      250.0
      0.666667
    
    
      9
      250.0
      275.0
      0.666667
    
    
      10
      500.0
      525.0
      1.000000

We decide to keep all pre-selected features. However, some of them need to be "cleaned" before running models on our datasets.

Data cleaning

Let us have a look to the datasets ("train" and "test"). Note that we need to do the cleaning part in parallel for both datasets.



In [10]:

    
titanicDSs = [titanicTrainDS, titanicTestDS]



In [11]:

    
# lenght of the dataframe
len(titanicTrainDS)









    Out[11]:





891



In [12]:

    
# Summary on the dataframe
titanicTrainDS.describe()









    Out[12]:






  
    
      
      PassengerId
      Survived
      Pclass
      Age
      SibSp
      Parch
      Fare
      AgeIntervalMin
      AgeIntervalMax
      FareIntervalMin
      FareIntervalMax
    
  
  
    
      count
      891.000000
      891.000000
      891.000000
      714.000000
      891.000000
      891.000000
      891.000000
      714.000000
      714.000000
      891.000000
      891.000000
    
    
      mean
      446.000000
      0.383838
      2.308642
      29.699118
      0.523008
      0.381594
      32.204208
      25.252101
      35.252101
      22.615039
      47.615039
    
    
      std
      257.353842
      0.486592
      0.836071
      14.526497
      1.102743
      0.806057
      49.693429
      14.970969
      14.970969
      49.512306
      49.512306
    
    
      min
      1.000000
      0.000000
      1.000000
      0.420000
      0.000000
      0.000000
      0.000000
      0.000000
      10.000000
      0.000000
      25.000000
    
    
      25%
      223.500000
      0.000000
      2.000000
      20.125000
      0.000000
      0.000000
      7.910400
      20.000000
      30.000000
      0.000000
      25.000000
    
    
      50%
      446.000000
      0.000000
      3.000000
      28.000000
      0.000000
      0.000000
      14.454200
      20.000000
      30.000000
      0.000000
      25.000000
    
    
      75%
      668.500000
      1.000000
      3.000000
      38.000000
      1.000000
      0.000000
      31.000000
      30.000000
      40.000000
      25.000000
      50.000000
    
    
      max
      891.000000
      1.000000
      3.000000
      80.000000
      8.000000
      6.000000
      512.329200
      80.000000
      90.000000
      500.000000
      525.000000



In [13]:

    
# lenght of the dataframe
len(titanicTestDS)









    Out[13]:





418



In [14]:

    
# Summary on the dataframe
titanicTestDS.describe()









    Out[14]:






  
    
      
      PassengerId
      Pclass
      Age
      SibSp
      Parch
      Fare
    
  
  
    
      count
      418.000000
      418.000000
      332.000000
      418.000000
      418.000000
      417.000000
    
    
      mean
      1100.500000
      2.265550
      30.272590
      0.447368
      0.392344
      35.627188
    
    
      std
      120.810458
      0.841838
      14.181209
      0.896760
      0.981429
      55.907576
    
    
      min
      892.000000
      1.000000
      0.170000
      0.000000
      0.000000
      0.000000
    
    
      25%
      996.250000
      1.000000
      21.000000
      0.000000
      0.000000
      7.895800
    
    
      50%
      1100.500000
      3.000000
      27.000000
      0.000000
      0.000000
      14.454200
    
    
      75%
      1204.750000
      3.000000
      39.000000
      1.000000
      0.000000
      31.500000
    
    
      max
      1309.000000
      3.000000
      76.000000
      8.000000
      9.000000
      512.329200

If we have a look to the first dataset (the "train" one), we see that all the numerical columns have indeed a count of 891 except the "Age" column that has a count of 714. This indicates that there are missing values (null, NA, or not a number).

As we don't want to remove the rows with missing values, we choose to clean the data by filling in all of the missing values. It would be a good idea to test if the missing value for "Age" is correlated with other variable. For example, we see that it is there are way more missing values for the "Q" port of embarkation.



In [15]:

    
titanicTrainDS["AgeEmptyOrNot"] =  titanicTrainDS["Age"].apply(lambda x: 1 if x>=0  else 0)
titanicTrainDS[['Embarked', 'AgeEmptyOrNot']].groupby(['Embarked'], as_index=False).mean()









    Out[15]:






  
    
      
      Embarked
      AgeEmptyOrNot
    
  
  
    
      0
      C
      0.773810
    
    
      1
      Q
      0.363636
    
    
      2
      S
      0.860248

However, the mean age does not seem to differ strongly according to the port of embarkation.



In [16]:

    
titanicTrainDS[['Embarked', 'Age']].groupby(['Embarked'], as_index=False).mean()

Finally, we decide to clean the data by filling in all of the missing values with simply the median of all the values in the column



In [17]:

    
# Fill missing values with the median value
for dataset in titanicDSs:
    dataset["Age"] = dataset["Age"].fillna(dataset["Age"].median())

The "Sex" column is non-numeric, we need to convert it. But first, we confirm that this column does not have empty values. then we make the conversion.



In [18]:

    
# What are the values for this column?
for dataset in titanicDSs:
    print(dataset["Sex"].unique())









    



['male' 'female']
['male' 'female']



In [19]:

    
# Convert to numerical values
for dataset in titanicDSs:
    dataset.loc[dataset["Sex"] == "male", "Sex"] = 0
    dataset.loc[dataset["Sex"] == "female", "Sex"] = 1

We do the same with the "Embarked" column. We First analyse if there are missing values. We will see that yes and choose to fill the missing values with the most frequent value.



In [20]:

    
# What are the values for this column?
for dataset in titanicDSs:
    print(dataset["Embarked"].unique())









    



['S' 'C' 'Q' nan]
['Q' 'S' 'C']



In [21]:

    
# Fill missing values with most frequent value
mostFrequentOccurrence = titanicTrainDS["Embarked"].dropna().mode()[0]
titanicTrainDS["Embarked"] = titanicTrainDS["Embarked"].fillna(mostFrequentOccurrence)

# Convert to numerical values
for dataset in titanicDSs:
    dataset.loc[dataset["Embarked"] == "S", "Embarked"] = 0
    dataset.loc[dataset["Embarked"] == "C", "Embarked"] = 1
    dataset.loc[dataset["Embarked"] == "Q", "Embarked"] = 2

Finally, we clean the "Fare" variable of the "test" dataset.



In [22]:

    
titanicTestDS["Fare"] = titanicTestDS["Fare"].fillna(titanicTestDS["Fare"].median())

Model application

Now, we can turn to the core of the analysis. We will introduce a couple of functions. The first function is the one that will enable evaluating the accuracy of one classification method type. However, we introduce a second function that enables to run the first function on each combination of predictors (ex: ["Sex", "Age", "Embarked"] or ["Age", "SibSp", "Parch", "Fare"] etc.).

In what follows, we build the list of combinations and then introduce the these functions.



In [23]:

    
# The columns that can be used in the prediction
predictorsAll = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"] 

# Create all combinations of predictors
predictorCombinations = [] # all combination of predictord
for index in range(1, len(predictorsAll)+1):
    for subset in itertools.combinations(predictorsAll, index):
         predictorCombinations.append(list(subset))  
            
#predictorCombinations



In [24]:

    
# Function: Evaluate one algorithm type (and return n fitted algorithms)
# -input
    # predictorsDs: the dataset projected to the predictors of interest
    # targetDs: the target or label vector of interest (the column "Survived" in our work)
    # algModel: the "template" or model of the algorithm to apply
    # nbFK: the number of cross validation folders
# -output
    # algs: nbKF fitted algorithms 
    # accuracy: the evaluation of the accuracy
def binClassifModel_kf(predictorsDs, targetDs, algModel, nbKF):
    # List of algorithms
    algs = []
    
    # Generate cross-validation folds for the titanic data set
    # It returns the row indices corresponding to train and test
    # We set random_state to ensure we get the same splits every time we run this
    kf = KFold(nbKF, random_state=1)

    # List of predictions
    predictions = []

    for trainIndexes, testIndexes in kf.split(predictorsDs):
        # The predictors we're using to train the algorithm  
        # Note how we only take the rows in the train folds
        predictorsTrainDs = (predictorsDs.iloc[trainIndexes,:])
        # The target we're using to train the algorithm
        train_target = targetDs.iloc[trainIndexes]
        
        # Initialize our algorithm class
        alg = deepcopy(algModel)
        # Training the algorithm using the predictors and target
        alg.fit(predictorsTrainDs, train_target)
        algs.append(alg)
        
        # We can now make predictions on the test fold
        thisSlitpredictions = alg.predict(predictorsDs.iloc[testIndexes,:])
        predictions.append(thisSlitpredictions)


    # The predictions are in three separate NumPy arrays  
    # Concatenate them into a single array, along the axis 0 (the only 1 axis) 
    predictions = np.concatenate(predictions, axis=0)

    # Map predictions to outcomes (the only possible outcomes are 1 and 0)
    predictions[predictions > .5] = 1
    predictions[predictions <=.5] = 0
    accuracy = len(predictions[predictions == targetDs]) / len(predictions)
    
    # return the multiple algoriths and the accuracy
    return [algs, accuracy]



In [25]:

    
# Helper that return the indexed of the sorted list
def sort_list(myList):
    return sorted(range(len(myList)), key=lambda i:myList[i])

# Function: Run multiple evaluations for one algorithm type (one for each combination of predictors)
# -input
    # algModel: the "template" or model of the algorithm to apply
    # nbFK: the number of cross validation folders
# -output
    # {}
def getAccuracy_forEachPredictor(algModel, nbKF):
    accuracyList = []
    
    # For each combination of predictors
    for combination in predictorCombinations:
        result = binClassifModel_kf(titanicTrainDS[combination], titanicTrainDS["Survived"], algModel, nbKF)
        accuracy = result[1]
        accuracyList.append(accuracy)

    # Sort the accuracies
    accuracySortedList = sort_list(accuracyList)

    # Diplay the best combinations
    for i in range(-5, 0):
        print(predictorCombinations[accuracySortedList[i]], ": ", accuracyList[accuracySortedList[i]])
    #for elementIndex in sort_list(accuracyList1):
    #    print(predictorCombinations[elementIndex], ": ", accuracyList1[elementIndex])
        
    print("--------------------------------------------------")

    # Display the accuracy corresponding to combination that uses all the predictors
    lastIndex = len(predictorCombinations)-1
    print(predictorCombinations[lastIndex], ":", accuracyList[lastIndex])

Now that we have introduce the above functions, we evaluate a set of classification methods on each combination of predictors. Here are the evaluated classification methods:

LinearRegression
LogisticRegression
GaussianNB
KNeighborsClassifier
DecisionTreeClassifier
RandomForestClassifier



In [26]:

    
algModel = LinearRegression(fit_intercept=True, normalize=True)
getAccuracy_forEachPredictor(algModel, 5)









    



/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scipy/linalg/basic.py:884: RuntimeWarning: internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver.
  warnings.warn(mesg, RuntimeWarning)






    



['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked'] :  0.7912457912457912
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare'] :  0.792368125701459
['Pclass', 'Sex', 'Age', 'SibSp'] :  0.7934904601571269
['Pclass', 'Sex', 'Age', 'SibSp', 'Fare'] :  0.7934904601571269
['Pclass', 'Sex', 'Age', 'Parch', 'Embarked'] :  0.7934904601571269
--------------------------------------------------
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] : 0.7878787878787878



In [27]:

    
algModel = LogisticRegression()
getAccuracy_forEachPredictor(algModel, 5)









    



['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked'] :  0.7946127946127947
['Pclass', 'Sex', 'Age', 'SibSp', 'Fare', 'Embarked'] :  0.7946127946127947
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] :  0.7946127946127947
['Pclass', 'Sex', 'SibSp', 'Parch'] :  0.7968574635241302
['Pclass', 'Sex', 'SibSp'] :  0.7991021324354658
--------------------------------------------------
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] : 0.7946127946127947



In [28]:

    
algModel = GaussianNB()
getAccuracy_forEachPredictor(algModel, 5)









    



['Sex', 'SibSp', 'Parch', 'Embarked'] :  0.7957351290684624
['Sex', 'Age', 'SibSp', 'Fare', 'Embarked'] :  0.7957351290684624
['Sex', 'SibSp'] :  0.7968574635241302
['Sex', 'SibSp', 'Embarked'] :  0.7968574635241302
['Sex', 'SibSp', 'Parch'] :  0.797979797979798
--------------------------------------------------
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] : 0.7856341189674523



In [29]:

    
algModel = KNeighborsClassifier(n_neighbors=5)
getAccuracy_forEachPredictor(algModel, 5)









    



['Pclass', 'Sex', 'Parch', 'Embarked'] :  0.7878787878787878
['Sex', 'SibSp', 'Embarked'] :  0.7890011223344556
['Pclass', 'Sex'] :  0.7901234567901234
['Pclass', 'Sex', 'Age', 'SibSp'] :  0.7912457912457912
['Sex', 'SibSp', 'Parch'] :  0.8002244668911336
--------------------------------------------------
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] : 0.691358024691358



In [30]:

    
algModel = DecisionTreeClassifier(min_samples_split=4, min_samples_leaf=2)
getAccuracy_forEachPredictor(algModel, 5)









    



['Pclass', 'Sex', 'Age', 'SibSp', 'Parch'] :  0.8047138047138047
['Pclass', 'Sex', 'Parch', 'Embarked'] :  0.8058361391694725
['Pclass', 'Sex', 'Parch', 'Fare', 'Embarked'] :  0.8058361391694725
['Pclass', 'Sex', 'Fare', 'Embarked'] :  0.8103254769921436
['Pclass', 'Sex', 'Embarked'] :  0.8114478114478114
--------------------------------------------------
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] : 0.797979797979798



In [34]:

    
algModel = RandomForestClassifier(n_estimators=100, min_samples_split=4, min_samples_leaf=2)
getAccuracy_forEachPredictor(algModel, 5)









    



['Pclass', 'Sex', 'Age', 'SibSp', 'Fare', 'Embarked'] :  0.8249158249158249
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] :  0.8282828282828283
['Pclass', 'Sex', 'Age', 'Fare'] :  0.8305274971941639
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare'] :  0.8305274971941639
['Pclass', 'Sex', 'Age', 'Parch', 'Fare'] :  0.835016835016835
--------------------------------------------------
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] : 0.8282828282828283

Prediction application

After having run all the models, we decide to choose the model that gave the best performance. This model is the "RandomForestClassifier" with the specific parameters above. Furthermore, we will use it with the best combination of predictors which is ['Pclass', 'Sex', 'Age', 'Parch', 'Fare'] that gave approximately 83% of accuracy.



In [32]:

    
# Run again the model with the tuned parameters on the dataset using the best combination of predictors
algModel = RandomForestClassifier(n_estimators=100, min_samples_split=4, min_samples_leaf=2)
predictors = ['Pclass', 'Sex', 'Age', 'Parch', 'Fare']
result = binClassifModel_kf(titanicTrainDS[predictors], titanicTrainDS["Survived"], algModel, 5)
algList = result[0] # the set of algorithms

predictionsList = []
for alg in algList:
    predictions = alg.predict(titanicTestDS[predictors])
    predictionsList.append(predictions)    

# There are different preditions, we take the mean (a voting-like system)
predictionsFinal = np.mean(predictionsList, axis=0)

# Map predictions to outcomes (the only possible outcomes are 1 and 0)
predictionsFinal[predictionsFinal > .5] = 1
predictionsFinal[predictionsFinal <=.5] = 0

# Cast as int
predictionsFinal = predictionsFinal.astype(int)

Finally, we can generate the submission file with the prediction of the survival for each passenger of the test dataset.



In [33]:

    
# Create a new dataset with only the id and the target column
submission = pd.DataFrame({
        "PassengerId": titanicTestDS["PassengerId"],
        "Survived": predictionsFinal
    })

# submission.to_csv(outputPath + 'submission.csv', index=False)

Conclusion

Throughout this work, we tried to establish a good model for predicting the survival of passengers in the Titanic disaster. As outlooks, we could investigate the influence of some features cleaning and scaling (such as the "Fare" scaling) on the overall performance.

References

Many ideas in this work are inspired by the great tutorials of the Titanic competition and other sources.



In [ ]:

	Pclass	Survived
0	1	0.629630
1	2	0.472826
2	3	0.242363

	Embarked	Survived
0	C	0.553571
1	Q	0.389610
2	S	0.336957

	SibSp	Survived
0	0	0.345395
1	1	0.535885
2	2	0.464286
3	3	0.250000
4	4	0.166667
5	5	0.000000
6	8	0.000000

	Embarked	Age
0	C	30.814769
1	Q	28.089286
2	S	29.445397

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

	Parch	Survived
0	0	0.343658
1	1	0.550847
2	2	0.500000
3	3	0.600000
4	4	0.000000
5	5	0.200000
6	6	0.000000

	AgeIntervalMin	AgeIntervalMax	Survived
0	0.0	10.0	0.612903
1	10.0	20.0	0.401961
2	20.0	30.0	0.350000
3	30.0	40.0	0.437126
4	40.0	50.0	0.382022
5	50.0	60.0	0.416667
6	60.0	70.0	0.315789
7	70.0	80.0	0.000000
8	80.0	90.0	1.000000

	FareIntervalMin	FareIntervalMax	Survived
0	0.0	25.0	0.287253
1	25.0	50.0	0.421965
2	50.0	75.0	0.546875
3	75.0	100.0	0.795455
4	100.0	125.0	0.733333
5	125.0	150.0	0.888889
6	150.0	175.0	0.666667
7	200.0	225.0	0.600000
8	225.0	250.0	0.666667
9	250.0	275.0	0.666667
10	500.0	525.0	1.000000

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare	AgeIntervalMin	AgeIntervalMax	FareIntervalMin	FareIntervalMax
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000	714.000000	714.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208	25.252101	35.252101	22.615039	47.615039
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429	14.970969	14.970969	49.512306	49.512306
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000	0.000000	10.000000	0.000000	25.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400	20.000000	30.000000	0.000000	25.000000
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200	20.000000	30.000000	0.000000	25.000000
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000	30.000000	40.000000	25.000000	50.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200	80.000000	90.000000	500.000000	525.000000

	PassengerId	Pclass	Age	SibSp	Parch	Fare
count	418.000000	418.000000	332.000000	418.000000	418.000000	417.000000
mean	1100.500000	2.265550	30.272590	0.447368	0.392344	35.627188
std	120.810458	0.841838	14.181209	0.896760	0.981429	55.907576
min	892.000000	1.000000	0.170000	0.000000	0.000000	0.000000
25%	996.250000	1.000000	21.000000	0.000000	0.000000	7.895800
50%	1100.500000	3.000000	27.000000	0.000000	0.000000	14.454200
75%	1204.750000	3.000000	39.000000	1.000000	0.000000	31.500000
max	1309.000000	3.000000	76.000000	8.000000	9.000000	512.329200