The following notebook is a compilation of the analysis steps described in the DataCamp tutorial.
First quest: use machine learning to identify the set of factors (age, sex, demographics, status, cabin location, etc.) that had the greatest impact on survival probability for passengers of the Titanic. Show how you came to your conclusion.
Variable | Description |
---|---|
PassengerId | Passenger ID |
Survival | Survival (0 = No; 1 = Yes) |
Pclass | Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) |
Name | Name |
Sex | Sex |
Age | Age |
Sibsp | Number of Siblings/Spouses Aboard |
Parch | Number of Parents/Children Aboard |
Ticket | Ticket Number |
Fare | Passenger Fare |
Cabin | Cabin Number |
Embarked | Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) |
Import the pandas library
In [ ]:
import pandas as pd
In [ ]:
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
# train_url = "out_test.csv" # local name (already has the "Child" column)
train = pd.read_csv(train_url)
In [ ]:
test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
# test_url = "out_train.csv" # local name
test = pd.read_csv(test_url)
Printing the head of the train and test dataframes:
In [ ]:
print(train.head())
print(test.head())
Notice that the test data frame doesn't have a "Survived" column; that is what our model has to predict.
How many people actually survived?
In [ ]:
print(train["Survived"].value_counts())
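The raw counts can also be read directly as proportions with pandas' `normalize` option; a minimal sketch on a toy frame (the `Survived` values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the train frame: 5 passengers, 2 survivors
toy = pd.DataFrame({"Survived": [0, 1, 0, 1, 0]})

# normalize=True returns fractions of the total instead of raw counts
props = toy["Survived"].value_counts(normalize=True)
print(props)  # 0 -> 0.6, 1 -> 0.4
```

On the real train set the same call shows roughly 62% perished vs 38% survived.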
Only ~38% of the people survived. How many of the survivors were women?
In [ ]:
print(train["Survived"][train["Sex"]=='female'].value_counts())
233 of the 342 survivors were women, so gender could be a good predictor. What if you were a child? What was the chance of surviving? Well, not as definitive. First we have to create a column on the data frame that discriminates adults from children.
We can initialize a column named "Child" like this:
In [ ]:
train["Child"] = float('NaN')
print(train["Child"][0:10])
We will populate it according to the following rule: 1 if Age < 18, 0 if Age >= 18
In [ ]:
train.loc[train["Age"] < 18, "Child"] = 1
train.loc[train["Age"] >= 18, "Child"] = 0
The chained-indexing version (train["Child"][train["Age"] < 18] = 1) gave me SettingWithCopyWarning, so I use .loc here instead; the column was changed successfully:
In [ ]:
print(train["Child"][0:10])
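The two conditional assignments can also be collapsed into a single vectorized expression with `numpy.where`; a minimal sketch on a toy frame, with an extra guard so that missing ages stay `NaN` (by itself, `np.where` would map them to 0, since `NaN < 18` is False):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Age": [4.0, 30.0, np.nan, 17.0]})

# 1 if Age < 18, otherwise 0; then restore NaN for missing ages
toy["Child"] = np.where(toy["Age"] < 18, 1.0, 0.0)
toy.loc[toy["Age"].isna(), "Child"] = np.nan
print(toy["Child"].tolist())  # [1.0, 0.0, nan, 1.0]
```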
We can write any data frame to a csv file. For example, we could export the train frame now and notice that it has the "Child" column at the end:
In [ ]:
train.to_csv("out_train.csv")
But a better use of the csv output is to make predictions. For example, we can make a prediction of who would survive by taking only gender into account.
In [ ]:
test_one = test.copy()  # copy, so the original test frame is left untouched
test_one["Survived"] = 0
test_one.loc[test_one["Sex"] == "female", "Survived"] = 1
prediction = pd.DataFrame()
prediction["PassengerId"] = test_one["PassengerId"]
prediction["Survived"] = test_one["Survived"]
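Before submitting, a rule like this can be scored on the train set, where the labels are known; a small sketch on a toy frame (the values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex":      ["female", "male", "female", "male", "male"],
    "Survived": [1, 0, 0, 0, 1],
})

# Predict 1 for every woman, 0 for every man, then compare to the labels
pred = (toy["Sex"] == "female").astype(int)
accuracy = (pred == toy["Survived"]).mean()
print(accuracy)  # 0.6 on this toy frame
```

On the full train set the gender-only rule already scores roughly 0.79, a useful baseline to beat.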
In [ ]:
import csv
prediction.to_csv("myPrediction_genderOnly.csv",index=False, quoting=csv.QUOTE_NONNUMERIC) #this seems to be the right format
In [ ]:
prediction.head()
Decision trees are very useful to do classification on structured data. Here we will feed our data to a tree (actually several trees later on) but first there is some cleaning and formatting that we need to do.
1) There are some missing values; for example, "Age" and "Embarked" were not filled for all the passengers. A common way to deal with missing values in a numerical variable is to fill these gaps with the median of the values that we do have. We will use the fillna() function together with median(). For the categorical variables, we will use the most common value.
In [ ]:
train["Age"] = train["Age"].fillna(train["Age"].median())
In [ ]:
train["Embarked"].value_counts()
In [ ]:
# Most people embarked in Southampton
train["Embarked"] = train["Embarked"].fillna('S')
2) It is also better to change data from categorical format to numerical. We will do this for "Sex" and "Embarked".
In [ ]:
train.loc[train["Sex"] == "male", "Sex"] = 0
train.loc[train["Sex"] == "female", "Sex"] = 1
train.loc[train["Embarked"] == "S", "Embarked"] = 0
train.loc[train["Embarked"] == "C", "Embarked"] = 1
train.loc[train["Embarked"] == "Q", "Embarked"] = 2
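An equivalent, arguably more idiomatic way to encode these categories is `Series.map` with an explicit dictionary, which does each column in one pass; a sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                    "Embarked": ["S", "C", "Q"]})

# map replaces each category with its numeric code in a single assignment
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy["Embarked"] = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2})
print(toy["Sex"].tolist(), toy["Embarked"].tolist())  # [0, 1, 1] [0, 1, 2]
```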
We will add a Decision Tree (DT) here. In order to train the tree, we need to specify what column of our data is the target and what features we will use. We do so by defining the following arrays:
In [ ]:
target = train["Survived"].values
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values
In [ ]:
from sklearn import tree
import numpy as np
In [ ]:
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one,target)
We can determine how well the model did on the train set by checking the score() and the list of feature importances.
In [ ]:
print(["Pclass", "Sex", "Age", "Fare"])
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one,target))
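A near-perfect training score from an unconstrained tree usually means it has memorized the data rather than learned a general rule. One way to estimate out-of-sample performance is k-fold cross-validation; a sketch on synthetic stand-in data (the array shapes mimic our four features, but the values are randomly generated):

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))  # stand-in for [Pclass, Sex, Age, Fare]
# Noisy binary target driven mostly by the second feature
y = (X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(int)

# max_depth limits the tree so it cannot simply memorize the samples
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())  # average held-out accuracy across the 5 folds
```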
To assess our model, we have to make a prediction of the survival outcomes using the "test" set, which is a clean set that our model hasn't been exposed to.
1) Cleaning
Fill the NaNs in "Fare" with the mean fare of the corresponding passenger class (Pclass):
In [ ]:
Fare1 = test["Fare"][test.Pclass == 1].mean()
Fare2 = test["Fare"][test.Pclass == 2].mean()
Fare3 = test["Fare"][test.Pclass == 3].mean()
print(Fare1, Fare2, Fare3)
In [ ]:
print(test["Fare"][(pd.isnull(test.Fare)) & (test.Pclass == 1)])
print(test["Fare"][(pd.isnull(test.Fare)) & (test.Pclass == 2)])
print(test["Fare"][(pd.isnull(test.Fare)) & (test.Pclass == 3)])
So we know there is only one missing value for "Fare", and we know it's for someone in third class:
In [ ]:
# test.Fare[pd.isnull(test.Fare)] = Fare3
test.Fare = test.Fare.fillna(Fare3)
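The three per-class means can also be computed and applied in one shot with `groupby(...).transform`, which broadcasts each group's mean back onto the matching rows; a sketch on a toy frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Pclass": [1, 1, 3, 3, 3],
                    "Fare":   [80.0, 100.0, 10.0, np.nan, 14.0]})

# Replace each missing fare with the mean fare of that passenger's class
class_mean = toy.groupby("Pclass")["Fare"].transform("mean")
toy["Fare"] = toy["Fare"].fillna(class_mean)
print(toy["Fare"].tolist())  # [80.0, 100.0, 10.0, 12.0, 14.0]
```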
# print(test[150:160])
2) Apply to the test set all the transformations we ran on the train set:
In [ ]:
# Fill up missing values in any of the features:
# "Pclass", "Sex", "Age" and "Fare"
pd.isnull(test.Pclass).value_counts()  # all False
pd.isnull(test.Sex).value_counts()     # also all False, but this has to be changed to numeric
test.loc[test["Sex"] == "male", "Sex"] = 0
test.loc[test["Sex"] == "female", "Sex"] = 1
pd.isnull(test.Age).value_counts()  # 86/418 ~ 20%
# pd.isnull(train.Age).value_counts()  # 177/891 in the original data ~ 19%, OK
test["Age"] = test["Age"].fillna(test["Age"].median())
# pd.isnull(test.Fare).value_counts()  # all False now... of course
3) Define the array of relevant features to be passed to our model:
In [ ]:
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values
Make the prediction applying our model "my_tree_one" to the test set:
In [ ]:
my_prediction = my_tree_one.predict(test_features)
In [ ]:
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
print(my_solution)
In [ ]:
print(my_solution.shape)
In [ ]:
my_solution.to_csv("my_solution_DT_one.csv", index_label = ["PassengerId"])
Submitted this and I got 0.71770 on Kaggle; they say it's not the best submission of my team, so keep on trying!!! :)