Titanic - survival probability

This notebook is a compilation of the analysis steps described in the DataCamp course.

First quest: Using machine learning, identify the set of factors (age, sex, demographics, status, cabin location, etc.) that had the most significant impact on the survival probability of the Titanic's passengers. Show how you came to your conclusion.

The data

Variable Description
PassengerId Passenger ID
Survived Survival (0 = No; 1 = Yes)
Pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name Name
Sex Sex
Age Age
SibSp Number of Siblings/Spouses Aboard
Parch Number of Parents/Children Aboard
Ticket Ticket Number
Fare Passenger Fare
Cabin Cabin Number
Embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Import the pandas library

In [ ]:
import pandas as pd

Downloading the data

We create two data frames, one for training and one for testing.

In [ ]:
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
# train_url = "out_test.csv"  # local name (already has the "Child" column)
train = pd.read_csv(train_url)

In [ ]:
test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
# test_url = "out_train.csv" # local name
test = pd.read_csv(test_url)

Printing the head of both data frames:

In [ ]:

Notice that the test data frame doesn't have a "Survived" column; that is for us to predict with our model.
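
As a minimal sketch of that comparison, the cell below uses two tiny made-up frames (train_demo and test_demo are illustrative names, not the real files) to print the first rows and list the columns that exist only in the training data:

```python
import pandas as pd

# Tiny stand-in frames with made-up rows; the real data comes from train.csv and test.csv.
train_demo = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
})
test_demo = pd.DataFrame({
    "PassengerId": [892, 893],
    "Sex": ["male", "female"],
})

print(train_demo.head())  # first rows (up to five) of the training frame
print(test_demo.head())   # first rows (up to five) of the test frame

# Columns present in the training frame but absent from the test frame
missing_in_test = set(train_demo.columns) - set(test_demo.columns)
print(missing_in_test)
```

On the real data, the same calls would be train.head(), test.head() and the set difference of train.columns and test.columns.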

Interrogating the training data set

How many people actually survived?

In [ ]:

Only ~38% of the people survived. How many of the survivors were women?
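
Both questions can be answered with value_counts(). The sketch below runs on a small made-up frame (train_demo is an illustrative name), so the numbers it prints are not the Titanic figures:

```python
import pandas as pd

# Made-up rows, only to show the calls; the real frame is `train`.
train_demo = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1],
    "Sex": ["male", "male", "female", "female", "male"],
})

# Absolute counts and proportions of each outcome
survived_counts = train_demo["Survived"].value_counts()
survived_share = train_demo["Survived"].value_counts(normalize=True)
print(survived_counts)
print(survived_share)

# Among the survivors only, how many were male or female?
survivors_by_sex = train_demo["Sex"][train_demo["Survived"] == 1].value_counts()
print(survivors_by_sex)
```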

In [ ]:

233 people out of the 342 survivors were women, so gender could be a good predictor. What if you were a child? What was the chance of surviving? Well, not as definitive. First we have to create a column in the data frame that discriminates adults from children.

We can initialize a column named "Child" like this:

In [ ]:
train["Child"] = float('NaN')

We will populate it according to the following rule: 1 if Age < 18, 0 if Age >= 18

In [ ]:
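train.loc[train["Age"] < 18, "Child"] = 1
train.loc[train["Age"] >= 18, "Child"] = 0

The chained-indexing form train["Child"][train["Age"]<18] = 1 gave me warnings, so the assignment above goes through .loc instead. We can check that the column was changed successfully: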
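
One possible way to check the result is to compare survival rates between the two groups. This is a sketch on made-up rows (train_demo is an illustrative name); passengers with an unknown age keep NaN in "Child" and are left out of the comparison:

```python
import pandas as pd

# Made-up rows; the last passenger has no recorded age.
train_demo = pd.DataFrame({
    "Age": [4.0, 16.0, 30.0, 52.0, None],
    "Survived": [1, 1, 1, 0, 0],
})

train_demo["Child"] = float("NaN")
train_demo.loc[train_demo["Age"] < 18, "Child"] = 1
train_demo.loc[train_demo["Age"] >= 18, "Child"] = 0

# Share of survivors within each group; groupby skips rows where "Child" is NaN
rate_by_child = train_demo.groupby("Child")["Survived"].mean()
print(rate_by_child)
```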

In [ ]:

We can write any data frame to a CSV file. For example, we could export the train data frame now and notice that it has the "Child" column at the end.
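
A minimal sketch of such an export, assuming a throwaway temporary directory and an illustrative file name (train_with_child.csv is not a name used elsewhere in this notebook):

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the real training frame
train_demo = pd.DataFrame({
    "PassengerId": [1, 2],
    "Age": [22.0, 9.0],
    "Child": [0, 1],
})

# Write to a temporary location so nothing in the working directory is overwritten
out_path = os.path.join(tempfile.mkdtemp(), "train_with_child.csv")
train_demo.to_csv(out_path, index=False)

# Read the file back and confirm the new column is the last one
reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())
```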

In [ ]:

A better use of the CSV output is to make a prediction. For example, we can predict who would survive by taking only gender into account.

In [ ]:
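test_one = test.copy()  # work on a copy so the guess is not added to test itself
test_one["Survived"] = 0
test_one.loc[test_one["Sex"] == "female", "Survived"] = 1
prediction = pd.DataFrame()
prediction["PassengerId"] = test_one["PassengerId"]
prediction["Survived"] = test_one["Survived"]  # the file needs the prediction as well as the ID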

In [ ]:
import csv
prediction.to_csv("myPrediction_genderOnly.csv",index=False, quoting=csv.QUOTE_NONNUMERIC) #this seems to be the right format

In [ ]:

Cleaning and formatting

Decision trees are very useful to do classification on structured data. Here we will feed our data to a tree (actually several trees later on) but first there is some cleaning and formatting that we need to do.

1) There are some missing values; for example, "Age" and "Embarked" were not filled in for all the passengers. A common way to deal with missing values in a numerical variable is to fill the gaps with the median of the values that we do have. We will use the fillna() function and the median() method. For the categorical variables, we will use the most common value.

In [ ]:
train["Age"] = train["Age"].fillna(train["Age"].median())

In [ ]:

In [ ]:
# Most people embarked in Southampton
train["Embarked"] = train["Embarked"].fillna('S')
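
Instead of hard-coding 'S', the most common value can also be looked up with mode(). The sketch below uses made-up values and shows both fills side by side; it also prints the mean next to the median, since a single large value pulls the two apart:

```python
import pandas as pd

# Made-up values with one gap each
ages = pd.Series([20.0, 30.0, None, 80.0])
embarked = pd.Series(["S", "C", "S", None, "Q", "S"])

# Numerical column: the median is less sensitive to the one large value than the mean
print(ages.mean(), ages.median())
ages_filled = ages.fillna(ages.median())

# Categorical column: mode() returns the most frequent value(s); take the first
most_common = embarked.mode()[0]
embarked_filled = embarked.fillna(most_common)
print(most_common)
```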

2) It is also better to change data from categorical format to numerical. We will do this for "Sex" and "Embarked".

In [ ]:
# map() replaces every label in one step and avoids chained assignment
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2})

Adding a little tree

We will add a Decision Tree (DT) here. In order to train the tree, we need to specify what column of our data is the target and what features we will use. We do so by defining the following arrays:

In [ ]:
target = train["Survived"].values
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values

In [ ]:
from sklearn import tree
import numpy as np

In [ ]:
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one,target)

We can determine how well the model did on the train set by checking the score() and the list of feature importances.

In [ ]:
print(["Pclass", "Sex", "Age", "Fare"], my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))

From here we were able to see that "Fare" is the most important feature; that is, we can find a good value (or threshold) on the "Fare" column that helps classify who would survive.
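
To make the importances easier to read, each value can be paired with its feature name. This is a sketch on a handful of made-up rows (X, y and clf are illustrative names), so the ranking it prints says nothing about the real data; random_state is fixed only to make the run repeatable:

```python
import numpy as np
from sklearn import tree

feature_names = ["Pclass", "Sex", "Age", "Fare"]

# Made-up passengers: Pclass, Sex (0 = male, 1 = female), Age, Fare
X = np.array([
    [3, 0, 22.0, 7.25],
    [1, 1, 38.0, 71.28],
    [3, 1, 26.0, 7.92],
    [1, 1, 35.0, 53.10],
    [3, 0, 35.0, 8.05],
    [2, 0, 54.0, 13.00],
])
y = np.array([0, 1, 1, 1, 0, 0])

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# One importance per feature, in the same order as the columns of X
for name, importance in zip(feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")

# Accuracy on the same rows the tree was trained on
print("training accuracy:", clf.score(X, y))
```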

Making a prediction with one DT

To assess our model, we have to make a prediction for the survival rates using the "test" set, which is a clean set that our model hasn't been exposed to.
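
Because the test set has no "Survived" column, its accuracy is only known after submitting. A local estimate can be had by holding back part of the training data, which is not something this notebook does. The sketch below assumes scikit-learn's train_test_split and uses randomly generated stand-in data, so its numbers are purely illustrative:

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# Randomly generated stand-in data: 200 rows, 4 features, a noisy binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Hold back 25% of the rows; the tree never sees them during fitting
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

clf = tree.DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)
fit_acc = clf.score(X_fit, y_fit)  # accuracy on rows used for fitting
val_acc = clf.score(X_val, y_val)  # accuracy on held-back rows
print(fit_acc, val_acc)
```

An unconstrained tree can memorise the rows it was fitted on, so the held-back accuracy is usually the more honest of the two numbers.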

1) Cleaning

Fill the NaNs in "Fare" with the mean fare of the corresponding passenger class (Pclass).

In [ ]:
Fare1 = test["Fare"][test.Pclass == 1].mean()
Fare2 = test["Fare"][test.Pclass == 2].mean()
Fare3 = test["Fare"][test.Pclass == 3].mean()
print(Fare1, Fare2, Fare3)

In [ ]:
print(test["Fare"][(pd.isnull(test.Fare)) & (test.Pclass == 1)])
print(test["Fare"][(pd.isnull(test.Fare)) & (test.Pclass == 2)])
print(test["Fare"][(pd.isnull(test.Fare)) & (test.Pclass == 3)])

So we know there is only one missing value for "Fare", and we know it's for someone in 3rd class:

In [ ]:
# test.Fare[pd.isnull(test.Fare)] = Fare3
test.Fare = test.Fare.fillna(Fare3)
# print(test[150:160])
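
The per-class fill can also be written without computing each class by hand. This is a sketch using groupby() and transform() on made-up rows (test_demo is an illustrative name); every gap receives the mean fare of its own class:

```python
import pandas as pd

# Made-up rows with one missing fare in 3rd class
test_demo = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Fare": [80.0, 100.0, 8.0, None, 12.0],
})

# For every row, the mean fare of that row's class (NaNs are ignored in the mean)
class_mean = test_demo.groupby("Pclass")["Fare"].transform("mean")

# Only the missing entries are replaced; existing fares stay untouched
test_demo["Fare"] = test_demo["Fare"].fillna(class_mean)
print(test_demo)
```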

2) Run all the transformations we ran on the training set

In [ ]:
# Fill up missing values on any of the parameters: 
# "Pclass", "Sex", "Age" and "Fare"
pd.isnull(test.Pclass).value_counts() # all False

pd.isnull(test.Sex).value_counts() # also all False, but this has to be changed to numeric
test["Sex"] = test["Sex"].map({"male": 0, "female": 1})

pd.isnull(test.Age).value_counts() #86/418 ~ 20%
# pd.isnull(train.Age).value_counts() #177/891 in the original data ~ 19% OK
test["Age"] = test["Age"].fillna(test["Age"].median())

# pd.isnull(test.Fare).value_counts() # all false now... of course
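
One way to keep the two data sets in step is to put the shared steps in a single function. The helper below, prepare, is hypothetical and not part of the notebook; it is shown on made-up rows and, like the cells above, fills each set with its own median age:

```python
import pandas as pd

def prepare(df, age_fill):
    """Hypothetical helper: return a cleaned copy with numeric Sex and filled Age."""
    out = df.copy()  # leave the caller's frame untouched
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Age"] = out["Age"].fillna(age_fill)
    return out

# Made-up rows standing in for the real frames
train_demo = pd.DataFrame({"Sex": ["male", "female"], "Age": [30.0, None]})
test_demo = pd.DataFrame({"Sex": ["female", "male"], "Age": [None, 40.0]})

train_clean = prepare(train_demo, train_demo["Age"].median())
test_clean = prepare(test_demo, test_demo["Age"].median())
print(train_clean)
print(test_clean)
```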

3) Define the array of relevant features to be passed to our model:

In [ ]:
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values

Make the prediction applying our model "my_tree_one" to the test set:

In [ ]:
my_prediction = my_tree_one.predict(test_features)

In [ ]:
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])

In [ ]:

In [ ]:
my_solution.to_csv("my_solution_DT_one.csv", index_label = ["PassengerId"])
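
Before submitting, it can help to look at the text that will be written. The sketch below uses three made-up predictions and an in-memory buffer instead of a file (passenger_ids, predictions and solution are illustrative names):

```python
import io

import numpy as np
import pandas as pd

# Made-up IDs and predictions standing in for the real arrays
passenger_ids = np.array([892, 893, 894])
predictions = np.array([0, 1, 0])

# The IDs become the index, the predictions the single "Survived" column
solution = pd.DataFrame(predictions, passenger_ids, columns=["Survived"])

# Write to a string buffer so the output can be inspected without touching disk
buffer = io.StringIO()
solution.to_csv(buffer, index_label=["PassengerId"])
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

The first line should be the header with both column names, followed by one line per passenger.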

I submitted this and got 0.71770 on Kaggle. It says this is not my team's best submission, so keep on trying!!! :)

In [ ]: