Machine Learning Engineer Nanodegree

Introduction and Foundations

Project 0: Titanic Survival Exploration

In 1912, the ship RMS Titanic struck an iceberg on its maiden voyage and sank, resulting in the deaths of most of its passengers and crew. In this introductory project, we will explore a subset of the RMS Titanic passenger manifest to determine which features best predict whether someone survived or did not survive. To complete this project, you will need to implement several conditional predictions and answer the questions below. Your project submission will be evaluated based on the completion of the code and your responses to the questions.

Tip: Quoted sections like this will provide helpful instructions on how to navigate and use an iPython notebook.

Getting Started

To begin working with the RMS Titanic passenger data, we'll first need to import the functionality we need, and load our data into a pandas DataFrame.
Run the code cell below to load our data and display the first few entries (passengers) for examination using the .head() function.

Tip: You can run a code cell by clicking on the cell and using the keyboard shortcut Shift + Enter or Shift + Return. Alternatively, a code cell can be executed using the Play button in the hotbar after selecting it. Markdown cells (text cells like this one) can be edited by double-clicking, and saved using these same shortcuts. Markdown allows you to write easy-to-read plain text that can be converted to HTML.



In [329]:

    
import numpy as np
import pandas as pd

# RMS Titanic data visualization code 
from titanic_visualizations import survival_stats
from IPython.display import display

%matplotlib inline

# Load the dataset
in_file = 'titanic_data.csv'
full_data = pd.read_csv(in_file)

# Print the first few entries of the RMS Titanic data
display(full_data.head())

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

From a sample of the RMS Titanic data, we can see the various features present for each passenger on the ship:

Survived: Outcome of survival (0 = No; 1 = Yes)
Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
Name: Name of passenger
Sex: Sex of the passenger
Age: Age of the passenger (Some entries contain NaN)
SibSp: Number of siblings and spouses of the passenger aboard
Parch: Number of parents and children of the passenger aboard
Ticket: Ticket number of the passenger
Fare: Fare paid by the passenger
Cabin: Cabin number of the passenger (Some entries contain NaN)
Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)

Since we're interested in the outcome of survival for each passenger or crew member, we can remove the Survived feature from this dataset and store it as its own separate variable outcomes. We will use these outcomes as our prediction targets.
Run the code cell below to remove Survived as a feature of the dataset and store it in outcomes.



In [330]:

    
# Store the 'Survived' feature in a new variable and remove it from the dataset
outcomes = full_data['Survived']
data = full_data.drop('Survived', axis = 1)

# Show the new dataset with 'Survived' removed
display(data.head())

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

The very same sample of the RMS Titanic data now shows the Survived feature removed from the DataFrame. Note that data (the passenger data) and outcomes (the outcomes of survival) are now paired. That means for any passenger data.loc[i], they have the survival outcome outcomes[i].
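
The pairing can be checked on a small stand-in frame. The sketch below rebuilds only the five rows shown above with a few of their columns (the names `sample`, `sample_data` and `sample_outcomes` are hypothetical, not part of the project), and shows that dropping a column leaves the index untouched, so the same label addresses the same passenger in both objects.

```python
import pandas as pd

# Hypothetical stand-in: only the five displayed rows, with three of their columns.
sample = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5],
    'Survived': [0, 1, 1, 1, 0],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
})

sample_outcomes = sample['Survived']
sample_data = sample.drop('Survived', axis=1)

# Both objects share the same index, so label i refers to the same passenger in each.
for i in sample_data.index:
    print(sample_data.loc[i, 'PassengerId'], sample_outcomes[i])
```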

To measure the performance of our predictions, we need a metric to score our predictions against the true outcomes of survival. Since we are interested in how accurate our predictions are, we will calculate the proportion of passengers where our prediction of their survival is correct. Run the code cell below to create our accuracy_score function and test a prediction on the first five passengers.

Think: Out of the first five passengers, if we predict that all of them survived, what would you expect the accuracy of our predictions to be?



In [331]:

    
def accuracy_score(truth, pred):
    """ Returns accuracy score for input truth and predictions. """
    
    # Ensure that the number of predictions matches number of outcomes
    if len(truth) == len(pred): 
        
        # Calculate and return the accuracy as a percent
        return "Predictions have an accuracy of {:.2f}%.".format((truth == pred).mean()*100)
    
    else:
        return "Number of predictions does not match number of outcomes!"
    
# Test the 'accuracy_score' function
predictions = pd.Series(np.ones(5, dtype = int))
print(accuracy_score(outcomes[:5], predictions))

Predictions have an accuracy of 60.00%.
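
The 60% figure can be reproduced by hand from the Survived column of the first five passengers. The following is a minimal sketch with plain Python lists; the variable names are illustrative only.

```python
# Outcomes of the first five passengers, copied from the Survived column above.
truth = [0, 1, 1, 1, 0]
pred = [1, 1, 1, 1, 1]  # predict that everyone survived

# Accuracy is the share of positions where prediction and truth agree.
matches = sum(t == p for t, p in zip(truth, pred))
accuracy = 100.0 * matches / len(truth)
print("{} of {} correct = {:.2f}%".format(matches, len(truth), accuracy))
```

Three of the five passengers survived, so predicting survival for all of them is right three times out of five.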

Tip: If you save an iPython Notebook, the output from running code blocks will also be saved. However, the state of your workspace will be reset once a new session is started. Make sure that you run all of the code blocks from your previous session to reestablish variables and functions before picking up where you last left off.

Making Predictions

If we were asked to make a prediction about any passenger aboard the RMS Titanic whom we knew nothing about, then the best prediction we could make would be that they did not survive. This is because we can assume that a majority of the passengers (more than 50%) did not survive the ship sinking.
The predictions_0 function below will always predict that a passenger did not survive.



In [332]:

    
def predictions_0(data):
    """ Model with no features. Always predicts a passenger did not survive. """

    predictions = []
    for _, passenger in data.iterrows():
        
        # Predict the survival of 'passenger'
        predictions.append(0)
    
    # Return our predictions
    return pd.Series(predictions)

# Make the predictions
predictions = predictions_0(data)

Question 1

Using the RMS Titanic data, how accurate would a prediction be that none of the passengers survived?
Hint: Run the code cell below to see the accuracy of this prediction.



In [333]:

    
print(accuracy_score(outcomes, predictions))

Predictions have an accuracy of 61.62%.

Answer: Assuming that no passenger survived, our predictions have an accuracy of 61.62%.

Let's take a look at whether the feature Sex has any indication of survival rates among passengers using the survival_stats function. This function is defined in the titanic_visualizations.py Python script included with this project. The first two parameters passed to the function are the RMS Titanic data and passenger survival outcomes, respectively. The third parameter indicates which feature we want to plot survival statistics across.
Run the code cell below to plot the survival outcomes of passengers based on their sex.



In [334]:

    
survival_stats(data, outcomes, 'Sex')
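
The same comparison can be made numerically with a groupby. This is a minimal sketch that assumes only the five sample rows shown earlier (the `sample` frame is a hypothetical stand-in), so the rates it prints are not representative of the full dataset.

```python
import pandas as pd

# Hypothetical stand-in: Sex and Survived for the five displayed passengers only.
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Survived': [0, 1, 1, 1, 0],
})

# The mean of a 0/1 column within each group is the survival rate of that group.
rates = sample.groupby('Sex')['Survived'].mean()
print(rates)
```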

Examining the survival statistics, a large majority of males did not survive the ship sinking. However, a majority of females did survive the ship sinking. Let's build on our previous prediction: If a passenger was female, then we will predict that they survived. Otherwise, we will predict the passenger did not survive.
Fill in the missing code below so that the function will make this prediction.
Hint: You can access the values of each feature for a passenger like a dictionary. For example, passenger['Sex'] is the sex of the passenger.



In [335]:

    
def predictions_1(data):
    """ Model with one feature: 
            - Predict a passenger survived if they are female. """
    
    predictions = []
    for _, passenger in data.iterrows():
        
        if passenger['Sex'] == 'female':
            predictions.append(1)
        else:
            predictions.append(0)
    
    # Return our predictions
    return pd.Series(predictions)

# Make the predictions
predictions = predictions_1(data)
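
The row-by-row loop can also be expressed as a single vectorized comparison. The sketch below is one possible alternative, not the required solution; `predictions_1_vectorized` and `toy` are hypothetical names introduced for illustration.

```python
import pandas as pd

def predictions_1_vectorized(data):
    """Predict 1 (survived) for female passengers and 0 otherwise."""
    # The comparison yields a boolean Series; astype(int) maps True/False to 1/0.
    # reset_index mimics the fresh 0..n-1 index of pd.Series(list).
    return (data['Sex'] == 'female').astype(int).reset_index(drop=True)

# Hypothetical toy input with the Sex values of the five displayed passengers.
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'female', 'male']})
print(predictions_1_vectorized(toy).tolist())
```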

Question 2

How accurate would a prediction be that all female passengers survived and the remaining passengers did not survive?
Hint: Run the code cell below to see the accuracy of this prediction.



In [336]:

    
print(accuracy_score(outcomes, predictions))

Predictions have an accuracy of 78.68%.

Answer: If we take the Sex feature into account and predict that only female passengers survived, the accuracy increases to 78.68%, up from 61.62% with the previous approach.

Using just the Sex feature for each passenger, we are able to increase the accuracy of our predictions by a significant margin. Now, let's consider using an additional feature to see if we can further improve our predictions. For example, consider all of the male passengers aboard the RMS Titanic: Can we find a subset of those passengers that had a higher rate of survival? Let's start by looking at the Age of each male, by again using the survival_stats function. This time, we'll use a fourth parameter to filter out the data so that only passengers with the Sex 'male' will be included.
Run the code cell below to plot the survival outcomes of male passengers based on their age.



In [22]:

    
survival_stats(data, outcomes, 'Age', ["Sex == 'male'"])

Examining the survival statistics, the majority of males younger than 10 survived the ship sinking, whereas most males age 10 or older did not survive the ship sinking. Let's continue to build on our previous prediction: If a passenger was female, then we will predict they survive. If a passenger was male and younger than 10, then we will also predict they survive. Otherwise, we will predict they do not survive.
Fill in the missing code below so that the function will make this prediction.
Hint: You can start your implementation of this function using the prediction code you wrote earlier from predictions_1.



In [337]:

    
def predictions_2(data):
    """ Model with two features: 
            - Predict a passenger survived if they are female.
            - Predict a passenger survived if they are male and younger than 10. """
    
    predictions = []
    for _, passenger in data.iterrows():

        if passenger['Sex'] == 'female':
            predictions.append(1)
        elif passenger['Age'] < 10:
            predictions.append(1)
        else:
            predictions.append(0)

    return pd.Series(predictions)

# Make the predictions
predictions = predictions_2(data)
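
Because some Age entries are NaN, it helps to see how the age condition treats them. Any ordering comparison with NaN evaluates to False, so a male passenger of unknown age falls through to the "did not survive" branch. The sketch below illustrates this on made-up rows; `toy` and `toy_predictions` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up rows, for illustration only: two entries have a missing Age.
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'female'],
    'Age': [5.0, 40.0, np.nan, np.nan],
})

is_female = toy['Sex'] == 'female'
is_young = toy['Age'] < 10  # NaN < 10 evaluates to False

# Survive if female, or if younger than 10 (which only matters for males).
toy_predictions = (is_female | is_young).astype(int)
print(toy_predictions.tolist())
```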

Question 3

How accurate would a prediction be that all female passengers and all male passengers younger than 10 survived?
Hint: Run the code cell below to see the accuracy of this prediction.



In [338]:

    
print(accuracy_score(outcomes, predictions))

Predictions have an accuracy of 79.35%.

Answer: With the Age condition added for males, we get a slightly better accuracy of 79.35%.

Adding the feature Age as a condition in conjunction with Sex improves the accuracy by a small margin more than with simply using the feature Sex alone. Now it's your turn: Find a series of features and conditions to split the data on to obtain an outcome prediction accuracy of at least 80%. This may require multiple features and multiple levels of conditional statements to succeed. You can use the same feature multiple times with different conditions.
Pclass, Sex, Age, SibSp, and Parch are some suggested features to try.

Use the survival_stats function below to examine various survival statistics.
Hint: To use multiple filter conditions, put each condition in the list passed as the last argument. Example: ["Sex == 'male'", "Age < 18"]
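
How survival_stats applies these filters internally is not shown in this notebook. As an assumption-laden stand-in, the sketch below shows how condition strings of the same shape can be combined and applied with pandas' DataFrame.query, so that a row is kept only if it satisfies every condition. The `toy` frame is made up.

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'female', 'male'],
    'Age': [12.0, 30.0, 15.0, 17.0],
})
conditions = ["Sex == 'male'", "Age < 18"]

# Parenthesize each condition and join with 'and': a row must satisfy all of them.
expression = " and ".join("({})".format(c) for c in conditions)
subset = toy.query(expression)
print(subset)
```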



In [340]:

    
survival_stats(data, outcomes, 'Pclass')
survival_stats(data, outcomes, 'SibSp',["Sex == 'male'"])
survival_stats(data, outcomes, 'Parch')

After exploring the survival statistics visualization, fill in the missing code below so that the function will make your prediction.
Make sure to keep track of the various features and conditions you tried before arriving at your final prediction model.
Hint: You can start your implementation of this function using the prediction code you wrote earlier from predictions_2.



In [341]:

    
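def predictions_3(data):
    """ Model with multiple features. Makes a prediction with an accuracy of at least 80%. """
    
    # First- and second-class males younger than this age are predicted to survive
    age_threshold = 12
    
    predictions = []
    for _, passenger in data.iterrows():
        
        if passenger['Sex'] == 'female':
            # Females are predicted to survive unless SibSp is 3 or more
            if passenger['SibSp'] < 3:
                predictions.append(1)
            else:
                predictions.append(0)
                
        elif passenger['Pclass'] == 3:
            # Third-class males are predicted not to survive
            predictions.append(0)
        else:
            # Remaining males: a missing Age compares as False, so the prediction is 0
            if passenger['Age'] < age_threshold:
                predictions.append(1)
            else:
                predictions.append(0)
    
    # Return our predictions
    return pd.Series(predictions)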

# Make the predictions
predictions = predictions_3(data)

Question 4

Describe the steps you took to implement the final prediction model so that it got an accuracy of at least 80%. What features did you look at? Were certain features more informative than others? Which conditions did you use to split the survival outcomes in the data? How accurate are your predictions?
Hint: Run the code cell below to see the accuracy of your predictions.



In [342]:

    
print(accuracy_score(outcomes, predictions))

Predictions have an accuracy of 81.03%.

Answer:

To get better results, we need to understand which of the features are more informative. I concentrated on discrete valued features only. The set considered is: Sex, Pclass, Parch and SibSp. We will also consider at the end the Age variable for "male" passengers only.

I first considered each feature separately and evaluated its information content, similarly to what was done in Question 2 with the Sex feature. I evaluated the information content by computing the prediction accuracy obtained using only one of these features at a time. These are the resulting accuracy scores:

Sex = 0.787
Pclass = 0.68
Parch = 0.63
SibSp = 0.632

Sex is clearly the most important feature. What is the next one? Pclass? Not necessarily, because the information content can change once we split our dataset according to the Sex feature. Therefore we need to re-evaluate the information content for the subsets of data with Sex == "female" and Sex == "male". If we do that, we find the following:

For Sex == "female":

Pclass = 0.74
Parch = 0.76
SibSp = 0.77

We can see that Parch and SibSp bring more information than Pclass once we condition over the values of Sex.

For Sex == "male" instead:

Pclass = 0.811
Parch = 0.811
SibSp = 0.811

This last result arises because, for males, no value of Pclass, Parch or SibSp favours survival over death. Each of the three features therefore leads to the same prediction (death for every male) and to the same score. We know, however, that the Age feature improves the prediction for males when Age < 10 (15 is a better threshold), so we take it into account.

Regarding the Sex == "female" branch, it is therefore favourable to use SibSp as the second variable. In particular, we choose a threshold t and predict survival for women with SibSp < t and death otherwise. We find that the optimal value is t = 3. We do not proceed further on the "female" branch.

Finally, let's see whether we can improve the prediction by going one step further in the "male" branch. By simple trials we found that no binary split on SibSp or Parch improves the result. We are left with the Age and Pclass variables. Since most men are in third class, and third-class men seem unlikely to survive, we first predict death for them. Then, for the remaining passengers, i.e. the men with Pclass = 1 or 2, we threshold the decision on the Age variable. The optimal threshold found is Age = 12.
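
A threshold search of this kind can be sketched as a simple scan over candidate values, keeping the one with the highest accuracy. The helper name `best_threshold` and the ages and outcomes below are made up for illustration; they are not taken from the Titanic data and do not reproduce the thresholds reported here.

```python
import pandas as pd

def best_threshold(values, outcomes, candidates):
    """Return (threshold, accuracy) for the rule 'predict 1 if value < threshold'."""
    best = (None, -1.0)
    for t in candidates:
        pred = (values < t).astype(int)
        acc = float((pred == outcomes).mean())
        # Strict comparison keeps the smallest threshold among ties.
        if acc > best[1]:
            best = (t, acc)
    return best

# Made-up ages and outcomes, for illustration only.
ages = pd.Series([2.0, 4.0, 6.0, 25.0, 40.0, 60.0])
survived = pd.Series([1, 1, 1, 0, 0, 0])
print(best_threshold(ages, survived, range(1, 31)))
```

Scanning on the same data that is later used to report accuracy tends to give an optimistic score, which is the overfitting concern raised in the conclusion of this answer.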

To conclude, our decision classifier (implemented in the previous cell as the predictions_3() function) achieves a prediction accuracy of 81.03%. We believe there is still some room for improvement by increasing the depth of the classifier. However, without a validation/test set we risk overfitting the training set. The classifier built so far is simple enough that, we think, it should generalize to other datasets of the same kind.
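
One way to check for overfitting is to hold out part of the data when choosing the rules and to score the final rules on the held-out part. The following is a minimal sketch of such a split using pandas' DataFrame.sample; the function `holdout_split`, the 80/20 ratio, the `rule` function and the toy rows are all assumptions for illustration, not part of the project.

```python
import pandas as pd

def holdout_split(frame, frac=0.8, seed=0):
    """Shuffle the rows and return (train, held_out) frames with no overlap."""
    train = frame.sample(frac=frac, random_state=seed)
    held_out = frame.drop(train.index)
    return train, held_out

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    'Sex': ['male', 'female'] * 5,
    'Survived': [0, 1, 0, 1, 0, 1, 1, 1, 0, 0],
})
train, held_out = holdout_split(toy)

def rule(frame):
    """Same one-feature rule as predictions_1: predict survival for females."""
    return (frame['Sex'] == 'female').astype(int)

# A large gap between the two scores would suggest the rules overfit the training part.
for name, part in [('train', train), ('held out', held_out)]:
    acc = (rule(part) == part['Survived']).mean()
    print("{} accuracy: {:.2f}".format(name, acc))
```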

P.S.: In the following cell we implement two functions for a "naive", but already useful, evaluation of the information content of the variables.



In [258]:

    
def information_content(data, feat):
    """ Print the accuracy of a majority-vote prediction within each value of 'feat'. """
    MAX = 0
    MIN = 0
    for cl in data[feat].unique():
        # Combine the conditions in one boolean mask instead of chaining indexers
        in_class = data[feat] == cl
        survived = len(data[in_class & (data['Survived'] == 1)])
        dead = len(data[in_class & (data['Survived'] == 0)])
        MAX += max(survived, dead)
        MIN += min(survived, dead)
    print("Information content for {}  =  {}".format(feat, float(MAX)/(MAX + MIN)))

information_content(full_data, 'Sex')
information_content(full_data, 'Pclass')
information_content(full_data, 'Parch')
information_content(full_data, 'SibSp')

def information_content_conditioned(data, feat, cond_feat):
    """ Same score as 'information_content', computed for each value of 'cond_feat'. """
    for cond_cl in data[cond_feat].unique():
        MAX = 0
        MIN = 0
        for cl in data[feat].unique():
            # Rows with this feature value inside the conditioned subset
            mask = (data[feat] == cl) & (data[cond_feat] == cond_cl)
            survived = len(data[mask & (data['Survived'] == 1)])
            dead = len(data[mask & (data['Survived'] == 0)])
            MAX += max(survived, dead)
            MIN += min(survived, dead)
        print("Condition {} = {} Information content for {}  =  {}".format(
            cond_feat, cond_cl, feat, float(MAX)/(MAX + MIN)))

        
information_content_conditioned(full_data, 'Pclass', 'Sex')
information_content_conditioned(full_data, 'Parch', 'Sex')
information_content_conditioned(full_data, 'SibSp', 'Sex')









    



Information content for Sex  =  0.786756453423
Information content for Pclass  =  0.679012345679
Information content for Parch  =  0.630751964085
Information content for SibSp  =  0.632996632997
Condition Sex = male Information content for Pclass  =  0.811091854419
Condition Sex = female Information content for Pclass  =  0.742038216561
Condition Sex = male Information content for Parch  =  0.811091854419
Condition Sex = female Information content for Parch  =  0.757961783439
Condition Sex = male Information content for SibSp  =  0.811091854419
Condition Sex = female Information content for SibSp  =  0.770700636943
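
The score printed above is the accuracy of a majority-vote predictor: within each value of the feature, predict the more frequent outcome. As a sketch of an equivalent formulation, the same quantity can be computed with a groupby. The name `majority_vote_accuracy` is hypothetical and the `toy` rows are made up.

```python
import pandas as pd

def majority_vote_accuracy(frame, feat, target='Survived'):
    """Accuracy obtained by predicting the majority outcome within each value of feat."""
    # Rows: feature values; columns: outcome values; cells: passenger counts.
    counts = frame.groupby([feat, target]).size().unstack(fill_value=0)
    # For each feature value, the larger count is the number of correct majority votes.
    return counts.max(axis=1).sum() / float(counts.values.sum())

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'female', 'female'],
    'Survived': [0, 0, 1, 1, 1],
})
print(majority_vote_accuracy(toy, 'Sex'))
```

In the toy frame the majority vote is correct for two of three males and for both females, so four of five predictions are right.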






    




Conclusion

After several iterations of exploring and conditioning on the data, you have built a useful algorithm for predicting the survival of each passenger aboard the RMS Titanic. The technique applied in this project is a manual implementation of a simple machine learning model, the decision tree. A decision tree splits a set of data into smaller and smaller groups (called nodes), by one feature at a time. Each time a subset of the data is split, our predictions become more accurate if each of the resulting subgroups is more homogeneous (contains similar labels) than before. The advantage of having a computer do things for us is that it will be more exhaustive and more precise than our manual exploration above. This link provides another introduction into machine learning using a decision tree.
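
Homogeneity can be made concrete with an impurity measure. The notebook does not name one; as an assumption, the sketch below uses Gini impurity, a common choice for decision trees, and applies it to the outcomes of the first five passengers before and after splitting on Sex. The function names are illustrative.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels; 0 means perfectly homogeneous."""
    if not labels:
        return 0.0
    p = sum(labels) / float(len(labels))
    return 1.0 - p ** 2 - (1.0 - p) ** 2

def weighted_gini(groups):
    """Size-weighted average impurity over the groups produced by a split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / float(total) * gini(g) for g in groups)

# Outcomes of the first five passengers, before and after splitting on Sex.
all_five = [0, 1, 1, 1, 0]
males, females = [0, 0], [1, 1, 1]
print(gini(all_five), weighted_gini([males, females]))
```

On these five rows the split on Sex yields two perfectly homogeneous groups, so the weighted impurity drops to zero; on the full dataset the drop is smaller.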

A decision tree is just one of many models that come from supervised learning. In supervised learning, we attempt to use features of the data to predict or model things with objective outcome labels. That is to say, each of our data points has a known outcome value, such as a categorical, discrete label like 'Survived', or a numerical, continuous value like predicting the price of a house.

Question 5

Think of a real-world scenario where supervised learning could be applied. What would be the outcome variable that you are trying to predict? Name two features about the data used in this scenario that might be helpful for making the predictions.

Answer:

An example of a real-world scenario is forecasting the power produced by wind power plants.

Producers of wind energy need to provide day-ahead or week-ahead forecasts of the power produced by their plants. The reason is that the market needs to adapt to production from volatile sources (wind and solar) and balance the energy across the whole power grid. If producers provide a bad forecast, they have to pay fees. If they underproduce, polluting sources of energy such as coal, oil or gas need to be used; if they overproduce, part of the grid needs to be shut down, since energy cannot be stored efficiently in batteries.

Standard forecasting techniques are based on weather predictions. Typical features are wind speed at different heights (10 m up to 200 m) and wind direction. Wind speed directly governs production, because the power extracted by a wind turbine is proportional to the third power of the wind speed. Wind direction is also fundamental, for two reasons. First, it characterizes the state of the atmosphere and therefore helps forecast how the weather will change in the following days. Second, due to screening effects from other turbines (some parks have turbines belonging to different, competing companies), production could be overestimated or underestimated if direction is ignored.
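
The cubic dependence on wind speed can be illustrated with the standard expression for the power extracted by a turbine, P = 0.5 * rho * A * v^3 * C_p. In the sketch below, the default air density (1.225 kg/m^3, a typical sea-level value) and the power coefficient of 0.4 are assumptions chosen for illustration, and effects such as cut-in, rated and cut-out speeds are ignored.

```python
import math

def wind_power_watts(wind_speed, rotor_diameter, air_density=1.225, power_coefficient=0.4):
    """Power in watts from wind speed (m/s) and rotor diameter (m).

    air_density and power_coefficient are illustrative assumptions.
    """
    swept_area = math.pi * (rotor_diameter / 2.0) ** 2
    return 0.5 * air_density * swept_area * wind_speed ** 3 * power_coefficient

# Doubling the wind speed multiplies the power by 2**3.
ratio = wind_power_watts(10.0, 100.0) / wind_power_watts(5.0, 100.0)
print(ratio)
```

This sensitivity is why a small error in the wind speed forecast translates into a much larger error in the power forecast.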

Note: Once you have completed all of the code implementations and successfully answered each question above, you may finalize your work by exporting the iPython Notebook as an HTML document. You can do this by using the menu above and navigating to
File -> Download as -> HTML (.html). Include the finished document along with this notebook as your submission.
