Example 3: Final Project Report—Predicting Survival

Load clean data sets

Note: The data wrangling performed in Example 1 would be included in the full report, but the focus of this example is on the model fitting stage, so the report starts with loading the cleaned data sets.


In [2]:
/* Reduce log output */
options nosource nonotes;
/* Fetch the file from the website */
filename titanic temp;
proc http
    url="https://raw.githubusercontent.com/sascommunities/sas-global-forum-2019/master/3133-2019-Gaines/titanicTrainClean.csv"
    method="GET"
    out=titanic;
run;
/* Import the file */
proc import
    file=titanic
    out=work.titanicTrainClean replace
    dbms=csv;
run;

filename titanic2 temp;
proc http
    url="https://raw.githubusercontent.com/sascommunities/sas-global-forum-2019/master/3133-2019-Gaines/titanicTestClean.csv"
    method="GET"
    out=titanic2;
run;

proc import
    file=titanic2
    out=work.titanicTestClean replace
    dbms=csv;
run;
options source notes;


Out[2]:

1048 rows created in WORK.TITANICTRAINCLEAN from TITANIC.



260 rows created in WORK.TITANICTESTCLEAN from TITANIC2.



Feature Engineering

Before building a statistical model, we perform feature (variable) engineering, an important part of the analytical process. Ideally, we want to include only the most relevant variables in the model. Doing so helps maintain model interpretability, and a simpler model also guards against overfitting the training data set, which improves predictive accuracy on new data.

One type of feature engineering reduces the number of model inputs by using existing variables to calculate new ones. This can be done automatically or by leveraging domain knowledge. In the Titanic data set, the variable sibsp is the number of siblings or spouses a passenger has aboard, and parch is the number of parents or children. We can combine these variables (plus one, to count the passenger) into a new family size variable, famSize, because it is reasonable to suspect a relationship between the number of family members on board and survival; a quick check of this relationship follows the output below.


In [3]:
/* Create the family-size feature: siblings/spouses + parents/children + 1 for the passenger */
data titanicTrainClean;
    set titanicTrainClean;
    famSize = parch + sibsp + 1;
run;

data titanicTestClean;
    set titanicTestClean;
    famSize = parch + sibsp + 1;
run;


Out[3]:

NOTE: There were 1048 observations read from the data set WORK.TITANICTRAINCLEAN.
NOTE: The data set WORK.TITANICTRAINCLEAN has 1048 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

NOTE: There were 260 observations read from the data set WORK.TITANICTESTCLEAN.
NOTE: The data set WORK.TITANICTESTCLEAN has 260 observations and 10 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
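
Before specifying a model, it can be helpful to eyeball the suspected relationship between family size and survival. The following is a minimal sketch, not part of the original report, that cross-tabulates the new famSize variable against survived on the training set (the PROC FREQ options simply suppress the column and overall percentages so that the row percentages, the survival rates within each family size, stand out):

/* Quick check: survival rates across family sizes on the training set */
proc freq data=work.titanicTrainClean;
    tables famSize*survived / nocol nopercent;
run;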

Statistical Model

To predict survival on the Titanic, we use a binary logistic regression model. This widely used model is applicable when the target, or response, variable of interest, $y$, has only two levels (survived or perished, in this case). Logistic regression models the probability of the outcome of interest given the values of the predictor variables, $x$. This probability is denoted $P(y=1|x)$, where $y = 1$ corresponds to survival.

Because the quantity being modeled is a probability, a traditional multiple linear regression should not be used: its fitted values are not guaranteed to lie between $0$ and $1$, so the estimated probabilities might not be valid. Logistic regression instead uses the logit function to link the probability of interest to a linear model. That is,

$$\text{logit}(\pi) = \log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \sum_{j=1}^p \beta_j x_j,$$

where $\pi = P(y=1|x)$.
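
Solving the logit equation for $\pi$ shows why the estimated probabilities are always valid: the inverse of the logit is the logistic (sigmoid) function,

$$\pi = \frac{1}{1 + \exp\big(-(\beta_0 + \sum_{j=1}^p \beta_j x_j)\big)},$$

which lies strictly between $0$ and $1$ for any values of the predictors, unlike the fitted values of a linear regression.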

Ultimately, the goal is to use the model to classify a passenger into either the survived or the perished group. One approach to using a binary logistic regression model for classification is to estimate the survival probability and then assign a passenger to the survived group if the estimated probability is greater than $0.5$. Alternatively, a data-driven approach, such as cross-validation, can be used to determine the cut-off point, because a value other than $0.5$ might produce better results. This is easily done with the LOGISTIC procedure by specifying the CTABLE option in the MODEL statement; to reduce the bias in the error estimates, the procedure constructs the table by using an approximation to leave-one-out cross-validation.
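
Written out, the classification rule for a cut-off point $c$ is

$$\hat{y} = \begin{cases} 1 \ (\text{survived}) & \text{if } \hat{\pi} > c, \\ 0 \ (\text{perished}) & \text{otherwise}, \end{cases}$$

where $\hat{\pi}$ is the estimated survival probability and the default choice is $c = 0.5$.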

Here, we estimate the logistic regression model

$$ \log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1 \text{sex} + \beta_2 \text{age} + \beta_3 \text{sex*age} + \beta_4 \text{pclass} + \beta_5 \text{fare} + \beta_6 \text{famSize}$$

by using the training set. The results include the Classification Table containing error estimates for different cut-off points, in addition to fit statistics for the training and test sets.


In [4]:
/* Fit the logistic regression model to the training set */
proc logistic data=WORK.titanicTrainClean;
    class sex / param=glm;
    model survived(event='1')=sex age sex*age pclass fare famSize / link=logit 
        technique=fisher ctable pprob=(0.45 to 0.65 by 0.01);
    /* Score the test and training sets, requesting fit statistics for each */
    score data=work.titanicTestClean out=work.titanicTestPred fitstat;
    score data=work.titanicTrainClean out=work.titanicTrainPred fitstat;
    /* Display only the Classification Table and the score fit statistics */
    ods select Classification ScoreFitStat;
run;


Out[4]:

The LOGISTIC Procedure

Classification Table

Prob     Correct           Incorrect         Percentages
Level    Event  Non-Event  Event  Non-Event  Correct  Sensitivity  Specificity  Pos Pred  Neg Pred
0.450 305 538 107 98 80.4 75.7 83.4 74.0 84.6
0.460 303 543 102 100 80.7 75.2 84.2 74.8 84.4
0.470 303 544 101 100 80.8 75.2 84.3 75.0 84.5
0.480 303 546 99 100 81.0 75.2 84.7 75.4 84.5
0.490 299 550 95 104 81.0 74.2 85.3 75.9 84.1
0.500 298 553 92 105 81.2 73.9 85.7 76.4 84.0
0.510 297 555 90 106 81.3 73.7 86.0 76.7 84.0
0.520 295 559 86 108 81.5 73.2 86.7 77.4 83.8
0.530 290 563 82 113 81.4 72.0 87.3 78.0 83.3
0.540 285 563 82 118 80.9 70.7 87.3 77.7 82.7
0.550 278 566 79 125 80.5 69.0 87.8 77.9 81.9
0.560 271 570 75 132 80.2 67.2 88.4 78.3 81.2
0.570 265 574 71 138 80.1 65.8 89.0 78.9 80.6
0.580 262 575 70 141 79.9 65.0 89.1 78.9 80.3
0.590 258 579 66 145 79.9 64.0 89.8 79.6 80.0
0.600 251 586 59 152 79.9 62.3 90.9 81.0 79.4
0.610 244 589 56 159 79.5 60.5 91.3 81.3 78.7
0.620 237 597 48 166 79.6 58.8 92.6 83.2 78.2
0.630 237 599 46 166 79.8 58.8 92.9 83.7 78.3
0.640 209 604 41 194 77.6 51.9 93.6 83.6 75.7
0.650 205 619 26 198 78.6 50.9 96.0 88.7 75.8
Fit Statistics for SCORE Data
Data Set Total Frequency Log Likelihood Error Rate AIC AICC BIC SC R-Square Max-Rescaled R-Square AUC Brier Score
WORK.TITANICTESTCLEAN 260 -132.0 0.2308 278.0976 278.5421 303.0224 303.0224 0.263161 0.358935 0.800866 0.163313
WORK.TITANICTRAINCLEAN 1048 -463.2 0.1861 940.4222 940.5299 975.1046 975.1046 0.361413 0.49093 0.852436 0.139288

Based on the Classification Table above, a cut-off point of $0.52$ instead of $0.50$ is used to make predictions on the test set and assess the model's performance. In this situation, the resulting misclassification error rate ($0.2308$) is the same as the test-set error rate reported in the fit statistics above, and the model correctly classifies passengers about $77\%$ of the time.


In [5]:
/* Apply the chosen 0.52 cut-off to the scored test set and flag errors */
data work.titanicTestPred;
    set work.titanicTestPred;
    if P_1 > 0.52 then survivedPred = 1;
        else survivedPred = 0;
    if survivedPred = survived then predError = 0;
        else predError = 1;
run;
/* The mean of the 0/1 error flag is the misclassification error rate */
proc means data=titanicTestPred mean; 
    var predError;
run;


Out[5]:

The MEANS Procedure

Analysis Variable : predError
Mean
0.2307692
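
As a check on the claim that the $0.52$ cut-off performs as well as the default, the same error calculation can be repeated with a $0.50$ cut-off. The following is a minimal sketch, not part of the original report; it reuses the scored data set work.titanicTestPred created by PROC LOGISTIC above, and the names titanicTestPred50, survivedPred50, and predError50 are illustrative.

/* Recompute the test-set error rate with the default 0.50 cut-off (illustrative names) */
data work.titanicTestPred50;
    set work.titanicTestPred;
    survivedPred50 = (P_1 > 0.50);           /* boolean expression evaluates to 1/0 */
    predError50 = (survivedPred50 ne survived);
run;
proc means data=work.titanicTestPred50 mean;
    var predError50;
run;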