Example 3: Final Project Report—Predicting Survival

Load clean data sets

Note: The data wrangling performed in Example 1 would be included in the full report, but the focus of this example is on the model fitting stage, so the report starts with loading the cleaned data sets.


In [2]:
/* Reduce log output */
options nosource nonotes;
/* Fetch the file from the website */
filename titanic temp;
proc http
    url="https://raw.githubusercontent.com/sascommunities/sas-global-forum-2019/master/3133-2019-Gaines/titanicTrainClean.csv"
    method="GET"
    out=titanic;
run;
/* Import the file */
proc import
    file=titanic
    out=work.titanicTrainClean replace
    dbms=csv;
run;

filename titanic2 temp;
proc http
    url="https://raw.githubusercontent.com/sascommunities/sas-global-forum-2019/master/3133-2019-Gaines/titanicTestClean.csv"
    method="GET"
    out=titanic2;
run;

proc import
    file=titanic2
    out=work.titanicTestClean replace
    dbms=csv;
run;
options source notes;


Out[2]:

1048 rows created in WORK.TITANICTRAINCLEAN from TITANIC.



260 rows created in WORK.TITANICTESTCLEAN from TITANIC2.



Feature Engineering

Before building a statistical model, we perform feature (variable) engineering, an important part of the analytical process. Ideally, we want to include only the most relevant variables in the model. Doing so helps maintain model interpretability, and a simpler model also guards against overfitting the training data set, which improves predictive accuracy on new data.

One type of feature engineering reduces the number of model inputs by using existing variables to calculate new ones. This can be done automatically or by leveraging domain knowledge. In the Titanic data set, the variable sibsp is the number of siblings or spouses a passenger has aboard, and parch is the number of parents or children. We can combine these variables (plus one, to count the passenger) into a new family size variable, famSize, because it is reasonable to suspect a relationship between the number of family members on board and survival; a quick check of this relationship follows the output below.


In [3]:
/* Create the family-size feature: siblings/spouses + parents/children + 1 for the passenger */
data titanicTrainClean;
    set titanicTrainClean;
    famSize = parch + sibsp + 1;
run;

data titanicTestClean;
    set titanicTestClean;
    famSize = parch + sibsp + 1;
run;


Out[3]:

NOTE: There were 1048 observations read from the data set WORK.TITANICTRAINCLEAN.
NOTE: The data set WORK.TITANICTRAINCLEAN has 1048 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

NOTE: There were 260 observations read from the data set WORK.TITANICTESTCLEAN.
NOTE: The data set WORK.TITANICTESTCLEAN has 260 observations and 10 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
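
Before specifying a model, it can be helpful to eyeball the suspected relationship between family size and survival. The following is a minimal sketch, not part of the original report, that cross-tabulates the new famSize variable against survived on the training set (the PROC FREQ options simply suppress the column and overall percentages so that the row percentages, the survival rates within each family size, stand out):

/* Quick check: survival rates across family sizes on the training set */
proc freq data=work.titanicTrainClean;
    tables famSize*survived / nocol nopercent;
run;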

Statistical Model

To predict survival on the Titanic, we use a binary logistic regression model. This widely used model is applicable when the target, or response, variable of interest, $y$, has only two levels (survived or perished, in this case). Logistic regression models the probability of the outcome of interest given the values of the predictor variables, $x$. This probability is denoted $P(y=1|x)$, where $y = 1$ corresponds to survival.

Because the quantity being modeled is a probability, a traditional multiple linear regression should not be used: its fitted values are not guaranteed to lie between $0$ and $1$, so the estimated probabilities might not be valid. Logistic regression instead uses the logit function to link the probability of interest to a linear model. That is,

$$\text{logit}(\pi) = \log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \sum_{j=1}^p \beta_j x_j,$$

where $\pi = P(y=1|x)$.
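
Solving the logit equation for $\pi$ shows why the estimated probabilities are always valid: the inverse of the logit is the logistic (sigmoid) function,

$$\pi = \frac{1}{1 + \exp\big(-(\beta_0 + \sum_{j=1}^p \beta_j x_j)\big)},$$

which lies strictly between $0$ and $1$ for any values of the predictors, unlike the fitted values of a linear regression.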

Ultimately, the goal is to use the model to classify a passenger into either the survived or the perished group. One approach to using a binary logistic regression model for classification is to estimate the survival probability and then assign a passenger to the survived group if the estimated probability is greater than $0.5$. Alternatively, a data-driven approach, such as cross-validation, can be used to determine the cut-off point, because a value other than $0.5$ might produce better results. This is easily done with the LOGISTIC procedure by specifying the CTABLE option in the MODEL statement; to reduce the bias in the error estimates, the procedure constructs the table by using an approximation to leave-one-out cross-validation.
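
Written out, the classification rule for a cut-off point $c$ is

$$\hat{y} = \begin{cases} 1 \ (\text{survived}) & \text{if } \hat{\pi} > c, \\ 0 \ (\text{perished}) & \text{otherwise}, \end{cases}$$

where $\hat{\pi}$ is the estimated survival probability and the default choice is $c = 0.5$.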

Here, we estimate the logistic regression model

$$ \log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1 \text{sex} + \beta_2 \text{age} + \beta_3 \text{sex*age} + \beta_4 \text{pclass} + \beta_5 \text{fare} + \beta_6 \text{famSize}$$

by using the training set. The results include the Classification Table containing error estimates for different cut-off points, in addition to fit statistics for the training and test sets.


In [4]:
/* Fit the logistic regression model to the training set */
proc logistic data=WORK.titanicTrainClean;
    class sex / param=glm;
    model survived(event='1')=sex age sex*age pclass fare famSize / link=logit 
        technique=fisher ctable pprob=(0.45 to 0.65 by 0.01);
    /* Score the test and training sets, requesting fit statistics for each */
    score data=work.titanicTestClean out=work.titanicTestPred fitstat;
    score data=work.titanicTrainClean out=work.titanicTrainPred fitstat;
    /* Display only the Classification Table and the score fit statistics */
    ods select Classification ScoreFitStat;
run;


Out[4]:

The LOGISTIC Procedure

Classification Table

Prob     Correct           Incorrect         Percentages
Level    Event  Non-Event  Event  Non-Event  Correct  Sensitivity  Specificity  Pos Pred  Neg Pred
0.450 305 538 107 98 80.4 75.7 83.4 74.0 84.6
0.460 303 543 102 100 80.7 75.2 84.2 74.8 84.4
0.470 303 544 101 100 80.8 75.2 84.3 75.0 84.5
0.480 303 546 99 100 81.0 75.2 84.7 75.4 84.5
0.490 299 550 95 104 81.0 74.2 85.3 75.9 84.1
0.500 298 553 92 105 81.2 73.9 85.7 76.4 84.0
0.510 297 555 90 106 81.3 73.7 86.0 76.7 84.0
0.520 295 559 86 108 81.5 73.2 86.7 77.4 83.8
0.530 290 563 82 113 81.4 72.0 87.3 78.0 83.3
0.540 285 563 82 118 80.9 70.7 87.3 77.7 82.7
0.550 278 566 79 125 80.5 69.0 87.8 77.9 81.9
0.560 271 570 75 132 80.2 67.2 88.4 78.3 81.2
0.570 265 574 71 138 80.1 65.8 89.0 78.9 80.6
0.580 262 575 70 141 79.9 65.0 89.1 78.9 80.3
0.590 258 579 66 145 79.9 64.0 89.8 79.6 80.0
0.600 251 586 59 152 79.9 62.3 90.9 81.0 79.4
0.610 244 589 56 159 79.5 60.5 91.3 81.3 78.7
0.620 237 597 48 166 79.6 58.8 92.6 83.2 78.2
0.630 237 599 46 166 79.8 58.8 92.9 83.7 78.3
0.640 209 604 41 194 77.6 51.9 93.6 83.6 75.7
0.650 205 619 26 198 78.6 50.9 96.0 88.7 75.8
Fit Statistics for SCORE Data
Data Set Total Frequency Log Likelihood Error Rate AIC AICC BIC SC R-Square Max-Rescaled R-Square AUC Brier Score
WORK.TITANICTESTCLEAN 260 -132.0 0.2308 278.0976 278.5421 303.0224 303.0224 0.263161 0.358935 0.800866 0.163313
WORK.TITANICTRAINCLEAN 1048 -463.2 0.1861 940.4222 940.5299 975.1046 975.1046 0.361413 0.49093 0.852436 0.139288

Based on the Classification Table above, a cut-off point of $0.52$ instead of $0.50$ is used to make predictions on the test set and assess the model's performance. In this situation, the resulting misclassification error rate ($0.2308$) is the same as the test-set error rate reported in the fit statistics above, and the model correctly classifies passengers about $77\%$ of the time.


In [5]:
/* Apply the chosen 0.52 cut-off to the scored test set and flag errors */
data work.titanicTestPred;
    set work.titanicTestPred;
    if P_1 > 0.52 then survivedPred = 1;
        else survivedPred = 0;
    if survivedPred = survived then predError = 0;
        else predError = 1;
run;
/* The mean of the 0/1 error flag is the misclassification error rate */
proc means data=titanicTestPred mean; 
    var predError;
run;


Out[5]:

The MEANS Procedure

Analysis Variable : predError
Mean
0.2307692
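
As a check on the claim that the $0.52$ cut-off performs as well as the default, the same error calculation can be repeated with a $0.50$ cut-off. The following is a minimal sketch, not part of the original report; it reuses the scored data set work.titanicTestPred created by PROC LOGISTIC above, and the names titanicTestPred50, survivedPred50, and predError50 are illustrative.

/* Recompute the test-set error rate with the default 0.50 cut-off (illustrative names) */
data work.titanicTestPred50;
    set work.titanicTestPred;
    survivedPred50 = (P_1 > 0.50);           /* boolean expression evaluates to 1/0 */
    predError50 = (survivedPred50 ne survived);
run;
proc means data=work.titanicTestPred50 mean;
    var predError50;
run;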