Doing Machine Learning with Kaggle's Titanic Problem (SAIL ON, October 2016)

We're finally here: Let's predict who lives and who dies on the Titanic given the passenger manifest records.

If you want them for reference, tonight's materials are available:

However, this notebook will treat the math as a given and focus on the applied data engineering (getting the data into the right format) and critical thinking (figuring out which features to include in the model). NOTE: You don't need to learn git to download this notebook. Just click 'Raw' and save the .ipynb file, then run Jupyter Notebook on your machine and open the file within Jupyter.

Getting and examining the data

The first step is always reading in data and looking at it to get a sense for what it contains.

Here's one way to do that in Python:


In [35]:
from IPython.display import display, HTML  # this lets us get pretty tabular displays of data
import pandas as pd  # this is a useful library for working with data

all_data = pd.read_csv("train.csv", index_col=0)  # assumes train.csv is in the same folder as this notebook
display(all_data[0:5])


Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [36]:
# We can name which rows to display -- here, rows 1 (inclusive) through 4 (exclusive)
# Remember that Python counts from 0...
display(all_data[1:4])


Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S

In [37]:
# We can name a column and rows
print all_data["Survived"][1:4]


PassengerId
2    1
3    1
4    1
Name: Survived, dtype: int64

In [38]:
# We can get multiple columns by putting them in a list: ["Survived", "Pclass", "Sex"]
print all_data[["Survived", "Pclass", "Sex"]][1:4]


             Survived  Pclass     Sex
PassengerId                          
2                   1       1  female
3                   1       3  female
4                   1       1  female

In [39]:
# The opposite order for "which rows" - "which columns" also works (but it's weird)
print all_data[1:4][["Survived", "Pclass", "Sex"]]


             Survived  Pclass     Sex
PassengerId                          
2                   1       1  female
3                   1       3  female
4                   1       1  female

Separating out the features from the target variable

We need to pull out the target variable into a separate data structure from the rest of the features.

(Why are we putting the target variable into its own Python variable? Well, mostly because the implementation of Naive Bayes we're going to use later wants it that way. But also we want to make sure we don't include the target variable as one of our inputs. If the target variable got included as an input by accident, how good would the model's accuracy look?)
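To see why that matters, here's a toy sketch (made-up random data, nothing from the Titanic files): a classifier trained on pure-noise features scores near chance, but the moment the target is smuggled in as an input column, training accuracy jumps to essentially perfect.

```python
# Toy illustration only -- not part of this notebook's pipeline.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
X_legit = rng.randint(0, 2, size=(200, 3))   # three random 0/1 "features"
y_toy = rng.randint(0, 2, size=200)          # a random 0/1 target

leaky = np.column_stack([X_legit, y_toy])    # oops: the target included as a column

clf = BernoulliNB()
print(clf.fit(X_legit, y_toy).score(X_legit, y_toy))  # near chance -- the features are noise
print(clf.fit(leaky, y_toy).score(leaky, y_toy))      # essentially perfect -- it just reads the answer back
```

That "perfect" score tells you nothing about real predictive power, which is exactly why we keep `y` out of the inputs.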


In [40]:
# Let's grab just the survived column and store it as `y` (our target variable)
y = all_data["Survived"]

In [41]:
# Let's check it worked
display(y[0:5])


PassengerId
1    0
2    1
3    1
4    1
5    0
Name: Survived, dtype: int64

Creating a derived matrix containing the bucketed features

We'll use Sex, Embarked, Cabin, and Age in the model we build together -- and you'll still have many other directions to go when we're done.

We need to discretize (bucket) our features for Naive Bayes, so now we will come up with a scheme for discretizing those 4 variables. In the process, you'll see part of what the pandas library makes easy.


In [42]:
X = all_data[["Sex", "Embarked", "Cabin", "Age"]]
display(X[0:5])


Sex Embarked Cabin Age
PassengerId
1 male S NaN 22.0
2 female C C85 38.0
3 female S NaN 26.0
4 female S C123 35.0
5 male S NaN 35.0

Hmm. This data won't easily support the kind of "counts" we need, like the ones we did for the class grades example. We need every cell to contain either a 0 or a 1.

The pandas library has a lot of ways to help us with that. First and foremost, the get_dummies function will convert from a single column containing n unique strings into n columns of 1s and 0s.


In [43]:
# We can use the get_dummies function to turn a column of words into a
# separate column of 1s and 0s for each distinct value
pd.get_dummies(X["Sex"])[0:5]


Out[43]:
female male
PassengerId
1 0.0 1.0
2 1.0 0.0
3 1.0 0.0
4 1.0 0.0
5 0.0 1.0

In [44]:
# Since anyone who isn't female is male, we only need to keep one of them
female_dummy_variable = pd.get_dummies(X["Sex"])["female"]        # get the female dummy info
X = X.assign(female=female_dummy_variable)  # incorporate it into the X variable

display(X[0:5])


Sex Embarked Cabin Age female
PassengerId
1 male S NaN 22.0 0.0
2 female C C85 38.0 1.0
3 female S NaN 26.0 1.0
4 female S C123 35.0 1.0
5 male S NaN 35.0 0.0

In [45]:
# Let's do the same thing with embarked, which also needs a separate column for each entry
embarked_dummy_variables = pd.get_dummies(X["Embarked"])
display(embarked_dummy_variables[0:5])


C Q S
PassengerId
1 0.0 0.0 1.0
2 1.0 0.0 0.0
3 0.0 0.0 1.0
4 0.0 0.0 1.0
5 0.0 0.0 1.0

In [46]:
X = X.assign(C=embarked_dummy_variables["C"])  # incorporate it into the X variable
X = X.assign(Q=embarked_dummy_variables["Q"])  # incorporate it into the X variable
X = X.assign(S=embarked_dummy_variables["S"])  # incorporate it into the X variable

display(X[0:5])


Sex Embarked Cabin Age female C Q S
PassengerId
1 male S NaN 22.0 0.0 0.0 0.0 1.0
2 female C C85 38.0 1.0 1.0 0.0 0.0
3 female S NaN 26.0 1.0 0.0 0.0 1.0
4 female S C123 35.0 1.0 0.0 0.0 1.0
5 male S NaN 35.0 0.0 0.0 0.0 1.0

Now for another challenge. One way to encode cabin information is to note whether or not the passenger has a cabin. Let's do that.

Values of NaN stand for "Not a Number" -- here, they mark the passengers with no recorded cabin. Let's identify them.


In [47]:
missing_cabin = pd.isnull(X["Cabin"])
print missing_cabin[0:5]


PassengerId
1     True
2    False
3     True
4    False
5     True
Name: Cabin, dtype: bool

We wanted numbers, but True/False will work just as well in Python: True evaluates to 1, and False evaluates to 0. So let's add in the cabin possession information as well.
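As a quick aside (toy data, not part of our pipeline), you can verify this yourself -- and if you'd rather see literal 0s and 1s, astype(int) does the conversion explicitly:

```python
import pandas as pd

s = pd.Series([True, False, True])
print(s.sum())                  # 2 -- True already counts as 1 in arithmetic
print(s.astype(int).tolist())   # [1, 0, 1] -- explicit conversion, if you prefer
```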


In [48]:
X = X.assign(nocabin=missing_cabin)  # incorporate it into the X variable
display(X[0:5])


Sex Embarked Cabin Age female C Q S nocabin
PassengerId
1 male S NaN 22.0 0.0 0.0 0.0 1.0 True
2 female C C85 38.0 1.0 1.0 0.0 0.0 False
3 female S NaN 26.0 1.0 0.0 0.0 1.0 True
4 female S C123 35.0 1.0 0.0 0.0 1.0 False
5 male S NaN 35.0 0.0 0.0 0.0 1.0 True

And for our final variable conversion, let's bucket Age by time of life.


In [49]:
# Pandas lets us discretize with the cut() function
binned_ages = pd.cut(X["Age"], 
             bins=[0, 14, 25, 50, 150],  # I use a ludicrous maximum age 
             labels=["child","youngadult","adult","olderadult"]) # roughly: children, young adults and young parents, 
                                                                 # established folk, older folk

print binned_ages[3:8]


PassengerId
4         adult
5         adult
6           NaN
7    olderadult
8         child
Name: Age, dtype: category
Categories (4, object): [child < youngadult < adult < olderadult]

Uh oh. There are NaN values for age -- we forgot about that. Let's create another age category for NaN.


In [50]:
binned_ages = binned_ages.cat.add_categories(["noage"])   # create a new category
binned_ages.fillna("noage", inplace=True)        # fill it "in-place" -- without having to use an = to capture it

print binned_ages[3:8]

# If you get a ValueError when you execute this cell, 
# it's because you're trying to re-add "noage" back into binned_age's categories.
# Fix it by rerunning the above cell and then rerunning this one.


PassengerId
4         adult
5         adult
6         noage
7    olderadult
8         child
dtype: category
Categories (5, object): [child < youngadult < adult < olderadult < noage]

In [51]:
# Now let's create dummies for age & add them into the dataset X
age_dummy_variables = pd.get_dummies(binned_ages)
X = X.assign(child=age_dummy_variables["child"])  # incorporate it into the X variable
X = X.assign(youngadult=age_dummy_variables["youngadult"])  # incorporate it into the X variable
X = X.assign(adult=age_dummy_variables["adult"])  # incorporate it into the X variable
X = X.assign(olderadult=age_dummy_variables["olderadult"])  # incorporate it into the X variable
X = X.assign(noage=age_dummy_variables["noage"])  # incorporate it into the X variable
display(X[0:5])


Sex Embarked Cabin Age female C Q S nocabin child youngadult adult olderadult noage
PassengerId
1 male S NaN 22.0 0.0 0.0 0.0 1.0 True 0.0 1.0 0.0 0.0 0.0
2 female C C85 38.0 1.0 1.0 0.0 0.0 False 0.0 0.0 1.0 0.0 0.0
3 female S NaN 26.0 1.0 0.0 0.0 1.0 True 0.0 0.0 1.0 0.0 0.0
4 female S C123 35.0 1.0 0.0 0.0 1.0 False 0.0 0.0 1.0 0.0 0.0
5 male S NaN 35.0 0.0 0.0 0.0 1.0 True 0.0 0.0 1.0 0.0 0.0

Almost there! Let's keep only the columns with 1s and 0s that we intend to model with.


In [52]:
X = X[["female", "C", "Q", "S", "nocabin", "child", "youngadult", "adult", "olderadult", "noage"]]
display(X[0:5])


female C Q S nocabin child youngadult adult olderadult noage
PassengerId
1 0.0 0.0 0.0 1.0 True 0.0 1.0 0.0 0.0 0.0
2 1.0 1.0 0.0 0.0 False 0.0 0.0 1.0 0.0 0.0
3 1.0 0.0 0.0 1.0 True 0.0 0.0 1.0 0.0 0.0
4 1.0 0.0 0.0 1.0 False 0.0 0.0 1.0 0.0 0.0
5 0.0 0.0 0.0 1.0 True 0.0 0.0 1.0 0.0 0.0

Pulling aside a subset of rows so we can evaluate ourselves

We could train a model on all 891 rows in the training data. That would give us a lot of data to use.* But once we trained the model, we would have no way to know if it was a good model or a poor model! Maybe the model predicts everyone's survival entirely opposite to what actually happened -- how would we know?

So we are going to extract 20% of the training data at random, and reserve it as "development" data. We aren't going to use it in building the models. Instead we are going to use it to evaluate the models. Essentially we're going to say, "build a model with 80% of the data", "now here's 20% that we are pretending has no labels -- what do you predict?", "are those predictions actually right?", "hmm, okay, let's make this change", "build a model with 80% of the data", ....

(* 891 rows isn't actually a lot of data)


In [53]:
# The easiest way to get this split is to use a built-in function 
# from the machine learning library sklearn.  They call it a
# "train-test" split, but we're going to call it a "train-development"
# split, since there is a test.csv dataset available from Kaggle.
import sklearn
from sklearn.cross_validation import train_test_split  # in newer sklearn versions this lives in sklearn.model_selection
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2, random_state=42)

print X_train.shape  # the training data inputs have 712 rows and 10 columns
print y_train.shape  # the training data outputs have 712 rows
print
print X_dev.shape   # the development data we just created has 179 rows and 10 columns
print y_dev.shape   # the development data has 179 rows


(712, 10)
(712L,)

(179, 10)
(179L,)

Fit a model

Now we are going to "fit" a basic model. The mathematical operations come for free inside the models given by sklearn, but before we can actually use the model, we need to learn from data what the right parameters for this problem are. The right values are different for every problem (Titanic vs. grades vs. leaf classification vs. ...).

"Fitting" a model means learning the correct variables to have the math embedded in the classifier succeed. For Naive Bayes, that means we're going to learn all of the conditional probabilities in the training data.
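To make that concrete, here's a toy sketch (the tiny X_toy/y_toy arrays below are made up for illustration) showing that a fitted BernoulliNB really does just store learned probabilities, which sklearn exposes as class_log_prior_ and feature_log_prob_:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# A tiny made-up dataset: feature 0 tracks the target, feature 1 is uninformative
X_toy = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y_toy = np.array([1, 1, 0, 0])

clf = BernoulliNB()
clf.fit(X_toy, y_toy)

# The "fitted" state is just these learned (Laplace-smoothed) probabilities:
print(np.exp(clf.class_log_prior_))    # P(class): [0.5, 0.5]
print(np.exp(clf.feature_log_prob_))   # P(feature=1 | class): [[0.25, 0.5], [0.75, 0.5]]
```

Note how the informative feature ends up with very different probabilities per class (0.25 vs 0.75), while the noise feature sits at 0.5 for both -- that asymmetry is what drives the predictions.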


In [54]:
# Bernoulli Naive Bayes is what we want to use (because all our cell values are either 0 or 1)
# We can import an implementation of the model from sklearn
from sklearn.naive_bayes import BernoulliNB         

clf = BernoulliNB()       # Create a classifier
clf.fit(X_train, y_train) # Learn the parameters needed to get the training examples right


Out[54]:
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

Let's do a prediction, just for fun...


In [55]:
print "For this person in the training data:"
display(X_train[0:1])
print "who is apparently an adult male with a cabin who embarked at S..."
print
print "Our model predicts:"
print clf.predict(X_train[0:1])
print "(that he died)"
print
print "And the truth is:"
print y_train[0:1]
print "(he unfortunately did die)"


For this person in the training data:
female C Q S nocabin child youngadult adult olderadult noage
PassengerId
332 0.0 0.0 0.0 1.0 False 0.0 0.0 1.0 0.0 0.0
who is apparently an adult male with a cabin who embarked at S...

Our model predicts:
[0]
(that he died)

And the truth is:
PassengerId
332    0
Name: Survived, dtype: int64
(he unfortunately did die)

Estimate how well we're doing

Now that we have a model, let's see how well it does on the 20% of data we held out earlier.

We'll use accuracy as our metric for "goodness of model" for now: What percent of the rows did our model get right? (As a side note -- accuracy isn't the best measure, but it's very easy to understand.)
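Accuracy is simple enough to compute by hand. A quick sketch with made-up labels (not our actual dev set):

```python
import numpy as np
from sklearn.metrics import accuracy_score

truth       = np.array([0, 1, 1, 0, 1])
predictions = np.array([0, 1, 0, 0, 1])  # made-up predictions: 4 of 5 match

print(accuracy_score(truth, predictions))  # 0.8
print((truth == predictions).mean())       # same number, computed by hand
```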


In [56]:
from sklearn.metrics import accuracy_score

predictions_on_dev = clf.predict(X_dev)  # Let's run the dev dataset through the predictor function
print accuracy_score(y_dev, predictions_on_dev)


0.787709497207

Woohoo! We're getting about 79% correct!

Remember this is an estimate based on the 20% of the training data that we are calling "dev". The test data available on the Kaggle website is the official test. Their test data includes only the input variables -- the "Survived" column has been redacted. The only way to see how well your model is doing on entirely unseen data is to post a score on Kaggle.

An accuracy of 79% is good, but there are still improvements to be had -- a very good score is around 82%. How high can you score?

Going further

Some ideas you might want to investigate:

  • Read the documentation about the variables online if you haven't already -- knowing your data always helps.
  • Choose one or more of SibSp, Parch, Fare and include it. Does model performance go up?
  • Vary how variables are bucketed and see how performance is affected.
  • Think about and include some new features based directly on information in your training data. (Maybe information from their names, like presence of "O'" or "ski" -- this could get at the effect of ethnicity on survival? Maybe information derived from their cabin assignment or lack thereof? Something else?)
  • If you create entirely new features (like "is a female with a cheap fare" or "number of family members on board"), does that help performance?
  • If you had to choose just a single variable to build a model on, what would it be? How well would it do?
  • Can you translate what the current model is doing into English? What sorts of things does the model indicate are driving survival?
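For the name-based ideas above, here's one sketch of how pandas string methods can help. The three-row mini DataFrame is made up (column names mirror train.csv); the regular expression pulls out the title that sits between the comma and the first period:

```python
import pandas as pd

# A made-up mini-manifest just to sketch the idea (column names match train.csv)
mini = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
             "Heikkinen, Miss. Laina"],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
})

# Pull the title out of the name with a regular expression...
mini = mini.assign(Title=mini["Name"].str.extract(r",\s*([^.]+)\.", expand=False))

# ...and build a "number of family members on board" feature
mini = mini.assign(FamilySize=mini["SibSp"] + mini["Parch"])

print(mini[["Title", "FamilySize"]])   # Titles: Mr, Mrs, Miss
```

From there you could one-hot the titles with get_dummies and bucket FamilySize, exactly as we did for Sex and Age.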

And if you're feeling very bold:

  • Can you figure out any ways to fill in NaN values with reasonable guesses based on the other information we have?
  • Can you write the code that applies the model you learned on the training data to the test data? (This requires reading in a new file, transforming it in exactly the same way as your previous data, and running your trained model against the new data. The output file format that Kaggle wants is a .csv file with two headers: PassengerId and Survived. The file genderclassmodel.csv on their website is in the format they need for submissions.)
  • What does the alpha parameter to the BernoulliNB model do in the docs? Is it helpful to use alpha? (Try googling for the terms in the description alongside the word "Naive Bayes", maybe with a helpful term like "example" thrown in.)
  • Instead of predicting survival, can you predict sex?
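For the NaN-filling idea, one common starting point is to replace missing ages with the median of the known ages. A sketch on a made-up Age column:

```python
import numpy as np
import pandas as pd

# A made-up Age column with a gap, standing in for the real data
ages = pd.Series([22.0, 38.0, np.nan, 35.0])

# One simple guess: use the median of the known ages
filled = ages.fillna(ages.median())
print(filled.tolist())   # [22.0, 38.0, 35.0, 35.0]
```

Fancier versions fill with the median within a group (say, per Sex, or per title extracted from the name), which usually gives better guesses than one global number.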

In the interest of giving you a deep understanding of Naive Bayes, we skipped over some general modeling "musts" that we'll touch on more in the future. These include: (1) don't ever look at your test data, (2) cross-validation is your friend, and (3) there are better scoring metrics than accuracy. Don't worry about these for now -- we will get there. But if you're done with everything else and still thirsting for more, you might also try googling for explanations of these.