In [2]:
/* Reduce log output */
options nosource nonotes;

/* Fetch the training data from the website */
filename titanic temp;
proc http
   url="https://raw.githubusercontent.com/sascommunities/sas-global-forum-2019/master/3133-2019-Gaines/titanicTrainClean.csv"
   method="GET"
   out=titanic;
run;

/* Import the training data */
proc import
   file=titanic
   out=work.titanicTrainClean replace
   dbms=csv;
run;

/* Fetch and import the test data */
filename titanic2 temp;
proc http
   url="https://raw.githubusercontent.com/sascommunities/sas-global-forum-2019/master/3133-2019-Gaines/titanicTestClean.csv"
   method="GET"
   out=titanic2;
run;

proc import
   file=titanic2
   out=work.titanicTestClean replace
   dbms=csv;
run;
options notes;
Out[2]:
Before building a statistical model, we first explore feature (variable) engineering, which is an important part of the analytical process. Ideally, we want to include only the most relevant variables in our model. Doing so helps maintain model interpretability, and a simpler model can also guard against overfitting the training dataset, which improves predictive accuracy on new data.
One type of feature engineering reduces the number of model inputs by using existing variables to calculate new variables. This can be done automatically or by leveraging domain knowledge. In the Titanic dataset, the variable sibsp is the number of siblings or spouses aboard and parch is the number of parents or children aboard. We can combine these variables into a new family size variable, famSize, because it is reasonable to suspect that there is a relationship between the number of family members on board and survival.
In [3]:
/* Add family size (including the passenger) to the training and test sets */
data titanicTrainClean;
   set titanicTrainClean;
   famSize = parch + sibsp + 1;  /* +1 counts the passenger */
run;

data titanicTestClean;
   set titanicTestClean;
   famSize = parch + sibsp + 1;
run;
Out[3]:
To predict survival on the Titanic, we use a binary logistic regression model. This widely used model is applicable when the target or response variable of interest, $y$, contains only two levels (survived or perished, in this case). A logistic regression models the probability of the outcome of interest given the values of the predictor variables, $x$. This probability is denoted $P(y=1|x)$, where $y = 1$ corresponds to survival.
Because the quantity being modeled is a probability, a traditional multiple linear regression should not be used: it provides no guarantee that the estimated probabilities will fall between $0$ and $1$. Logistic regression instead uses the logit function to link the probability of interest with a linear model. That is,
$$\text{logit}(\pi) = \log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \sum_{j=1}^p \beta_j x_j,$$

where $\pi = P(y=1|x)$.
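Inverting the logit link shows why this construction always yields a valid probability: the linear predictor, which can take any real value, is mapped into the interval $(0,1)$ by

$$\pi = \frac{1}{1 + e^{-(\beta_0 + \sum_{j=1}^p \beta_j x_j)}}.$$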
Ultimately, the goal is to use the model to classify a passenger into either the survived or perished group. One approach to using a binary logistic regression model for classification is to estimate the survival probability and then assign a passenger to the survived group if the estimated probability exceeds $0.5$. Alternatively, a data-driven approach, such as cross-validation, can be used to determine the cut-off point, because something other than $0.5$ might produce better results. This is easily achieved with the LOGISTIC procedure by specifying the CTABLE option in the MODEL statement. To reduce the bias in the error estimates, the procedure constructs the table by using an approximation to leave-one-out cross-validation.
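Formally, for a chosen cut-off point $c$, the classification rule is

$$\hat{y} = \begin{cases} 1 \text{ (survived)} & \text{if } \hat{\pi} > c \\ 0 \text{ (perished)} & \text{otherwise,} \end{cases}$$

where $\hat{\pi}$ is the estimated survival probability and $c = 0.5$ is the natural default.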
Here, we estimate the logistic regression model
$$ \log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1 \text{sex} + \beta_2 \text{age} + \beta_3 \text{sex*age} + \beta_4 \text{pclass} + \beta_5 \text{fare} + \beta_6 \text{famSize}$$

by using the training set. The results include the Classification Table, which contains error estimates for different cut-off points, in addition to fit statistics for the training and test sets.
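A convenient consequence of the logit link is that the coefficients can be interpreted on the odds scale: exponentiating both sides gives

$$\frac{\pi}{1-\pi} = e^{\beta_0}\, e^{\beta_1 \text{sex}} \cdots e^{\beta_6 \text{famSize}},$$

so, holding the other predictors fixed, a one-unit increase in a predictor multiplies the odds of survival by $e^{\beta_j}$; for example, each additional family member multiplies the odds by $e^{\beta_6}$. Because of the sex*age interaction, the effect of age on the log odds differs between male and female passengers rather than being a single coefficient.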
In [4]:
proc logistic data=WORK.titanicTrainClean;
   class sex / param=glm;
   /* Request the classification table for cut-off points from 0.45 to 0.65 */
   model survived(event='1') = sex age sex*age pclass fare famSize / link=logit
         technique=fisher ctable pprob=(0.45 to 0.65 by 0.01);
   /* Score the test and training sets and compute fit statistics */
   score data=work.titanicTestClean out=work.titanicTestPred fitstat;
   score data=work.titanicTrainClean out=work.titanicTrainPred fitstat;
   ods select Classification ScoreFitStat;
run;
Out[4]:
Based on the Classification Table above, a cut-off point of $0.52$ instead of $0.50$ is used to make predictions on the test set to assess the model's performance. In this case, the two cut-off points produce the same misclassification rate, and the model correctly classifies passengers about $77\%$ of the time.
In [5]:
/* Classify test-set passengers by using the 0.52 cut-off point */
data work.titanicTestPred;
   set work.titanicTestPred;
   if P_1 > 0.52 then survivedPred = 1;
   else survivedPred = 0;
   /* Flag misclassified passengers */
   if survivedPred = survived then predError = 0;
   else predError = 1;
run;

/* The mean of the 0/1 error flag is the misclassification rate */
proc means data=titanicTestPred mean;
   var predError;
run;
Out[5]: