Titanic: Machine Learning from Disaster

Author: Aurélie Daviaud

  • Language: R
  • Methods: Classification (logistic regression, CART, Random Forest, SVM, neural network, XGBoost), cross-validation

Table of Contents

1 Exploratory data analysis
...1.1 Check data
...1.2 Check correlations among variables
...1.3 Visualize some potentially important variables
......1.3.1 Passenger's class
......1.3.2 Women and children first!
2 Feature engineering
...2.1 Create new features 1
......2.1.1 Create Title from Name
...2.2 Impute missing data
......2.2.1 Fare
......2.2.2 Embarked
......2.2.3 Age
...2.3 Create new features 2
......2.3.1 Create Child and Young from Age
......2.3.2 Create Title2 from Title
......2.3.3 Create dummy variables
3 Machine learning
...3.1 Logistic regression
......3.1.1 A first quick and dirty model
......3.1.2 Check learning curves
......3.1.3 Feature selection
.........3.1.3.1 Manual selection
.........3.1.3.2 Automatic selection
...3.2 Classification tree: CART (Recursive Partitioning)
...3.3 Random forest
...3.4 Support Vector Machine (SVM)
......3.4.1 With linear kernel
......3.4.2 With Gaussian kernel
...3.5 Neural network
...3.6 XGBoost
......3.6.1 Linear
......3.6.2 Tree
4 Compare the best models
...4.1 Compare the performances
...4.2 Check correlations among models
5 Predictions on the test set
...5.1 GLM
...5.2 CART
...5.3 Random forest
...5.4 SVM
...5.5 XGBoost (linear)
...5.6 XGBoost (tree)
6 Going further

Context
The aim of this project is to predict the survivors of the Titanic. I will follow the typical steps of a machine learning project: exploratory data analysis, pre-processing (data imputation, scaling, feature engineering, feature selection...) and machine learning.

Data description

Variable Description
Survived Survival (0 = No, 1 = Yes)
Pclass Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
Sex Sex (male/female)
Age Age (in years)
SibSp # of siblings/spouses aboard the Titanic
Parch # of parents/children aboard the Titanic
Ticket Ticket number
Fare Passenger fare
Cabin Cabin number
Embarked Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

In [2]:
# Load libraries
library(polycor)      # hetcor()
library(ggplot2)      # ggplot()
library(caret)        # train(), trainControl(), resamples(), varImp()
library(mice)         # mice()
library(lattice)      # densityplot()
library(gridExtra)    # grid.arrange()
library(dummies)      # dummy.data.frame()
library(plyr)         # revalue()
library(rpart)        # rpart()
library(rpart.plot)   # rpart.plot()
library(randomForest) # method = 'rf'
library(party)        # method = 'cforest'
library(kernlab)      # method = 'svmLinear' and method = 'svmRadial' 
library(nnet)         # method = 'nnet'
library(corrplot)     # corrplot()
library(caretEnsemble)# caretList(), caretStack()

In [3]:
setwd("C:/Users/AurelieD/Documents/Boulot/0-Data scientist/Kaggle/Titanic")

In [4]:
## Load data (some cells of our table are empty; we tell R to treat them as missing values (NA) via na.strings)
Train <- read.csv("train.csv", na.strings=c(""))
Test <- read.csv("test.csv", na.strings=c(""))

In [91]:
str(Train)


'data.frame':	891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
 $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

In [92]:
str(Test)


'data.frame':	418 obs. of  11 variables:
 $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
 $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
 $ Name       : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
 $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
 $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
 $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
 $ Ticket     : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
 $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
 $ Cabin      : Factor w/ 76 levels "A11","A18","A21",..: NA NA NA NA NA NA NA NA NA NA ...
 $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...

1 Exploratory data analysis

First, let's take a look at the Train data.

1.1 Check data


In [93]:
## Check data
head(Train)


PassengerId Survived Pclass Name                                                 Sex    Age SibSp Parch Ticket           Fare    Cabin Embarked
          1        0      3 Braund, Mr. Owen Harris                              male    22     1     0 A/5 21171         7.2500 NA    S
          2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38     1     0 PC 17599         71.2833 C85   C
          3        1      3 Heikkinen, Miss. Laina                               female  26     0     0 STON/O2. 3101282  7.9250 NA    S
          4        1      1 Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35     1     0 113803           53.1000 C123  S
          5        0      3 Allen, Mr. William Henry                             male    35     0     0 373450            8.0500 NA    S
          6        0      3 Moran, Mr. James                                     male    NA     0     0 330877            8.4583 NA    Q

In [94]:
str(Train)          # check factors' levels


'data.frame':	891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
 $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

In [5]:
# Specify proper classes to variables
Train$Survived <- factor(Train$Survived)
Train$Pclass <- factor(Train$Pclass, levels=c("1", "2", "3"), ordered=TRUE)

In [96]:
summary(Train)      # check min/max/mean... of each variable


  PassengerId    Survived Pclass                                     Name    
 Min.   :  1.0   0:549    1:216   Abbing, Mr. Anthony                  :  1  
 1st Qu.:223.5   1:342    2:184   Abbott, Mr. Rossmore Edward          :  1  
 Median :446.0            3:491   Abbott, Mrs. Stanton (Rosa Hunt)     :  1  
 Mean   :446.0                    Abelson, Mr. Samuel                  :  1  
 3rd Qu.:668.5                    Abelson, Mrs. Samuel (Hannah Wizosky):  1  
 Max.   :891.0                    Adahl, Mr. Mauritz Nils Martin       :  1  
                                  (Other)                              :885  
     Sex           Age            SibSp           Parch             Ticket   
 female:314   Min.   : 0.42   Min.   :0.000   Min.   :0.0000   1601    :  7  
 male  :577   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   347082  :  7  
              Median :28.00   Median :0.000   Median :0.0000   CA. 2343:  7  
              Mean   :29.70   Mean   :0.523   Mean   :0.3816   3101295 :  6  
              3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6  
              Max.   :80.00   Max.   :8.000   Max.   :6.0000   CA 2144 :  6  
              NA's   :177                                      (Other) :852  
      Fare                Cabin     Embarked  
 Min.   :  0.00   B96 B98    :  4   C   :168  
 1st Qu.:  7.91   C23 C25 C27:  4   Q   : 77  
 Median : 14.45   G6         :  4   S   :644  
 Mean   : 32.20   C22 C26    :  3   NA's:  2  
 3rd Qu.: 31.00   D          :  3             
 Max.   :512.33   (Other)    :186             
                  NA's       :687             

1.2 Check correlations among variables

This will allow us to:

  • see which variables may be good predictors of our dependent variable (Survived) (this can be helpful when we don't have any domain knowledge);
  • check multicollinearity among predictors, i.e. when one predictor can be linearly predicted from the others.

Since we have factors, ordered factors and numeric variables, we will use the hetcor() function of the 'polycor' package. It "computes a heterogenous correlation matrix, consisting of Pearson product-moment correlations between numeric variables, polyserial correlations between numeric and ordinal variables, and polychoric correlations between ordinal variables." (https://cran.r-project.org/web/packages/polycor/polycor.pdf). We will ignore the variables Cabin and Name for this preliminary exploration, since they contain many missing values and require some processing.


In [97]:
Train_sub <- subset(Train, select=c(Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked))

In [98]:
hetcor(Train_sub, use = "pairwise.complete.obs")$correlations


            Survived      Pclass         Sex         Age       SibSp       Parch        Fare    Embarked
Survived  1.00000000 -0.48055675 -0.76090008 -0.09788042 -0.04897746  0.10128585  0.38166338 -0.26734299
Pclass   -0.48055675  1.00000000  0.19924719 -0.41238334  0.12283381  0.02263785 -0.78299896  0.19342079
Sex      -0.76090008  0.19924719  1.00000000  0.12085141 -0.14124694 -0.30412276 -0.21952881  0.18706902
Age      -0.09788042 -0.41238334  0.12085141  1.00000000 -0.30824676 -0.18911926  0.09606669 -0.04053557
SibSp    -0.04897746  0.12283381 -0.14124694 -0.30824676  1.00000000  0.41483770  0.15965104  0.11114924
Parch     0.10128585  0.02263785 -0.30412276 -0.18911926  0.41483770  1.00000000  0.21622494  0.06730955
Fare      0.38166338 -0.78299896 -0.21952881  0.09606669  0.15965104  0.21622494  1.00000000 -0.25808764
Embarked -0.26734299  0.19342079  0.18706902 -0.04053557  0.11114924  0.06730955 -0.25808764  1.00000000

We observe that:

  • Survived is highly correlated with Sex (r=-0.76). So Sex may be a good predictor for Survived. To a lesser extent, Pclass is also correlated with Survived (r=-0.48).
  • Fare is highly correlated with Pclass (r=-0.78). So Fare and Pclass are collinear.

1.3 Visualize some potentially important variables

1.3.1 Passenger's class

It is well known that first-class passengers were the first ones allowed to board the lifeboats, so the correlation we have just seen between survival and passenger class was expected. Let's inspect this relationship in more depth.


In [6]:
# Give a proper size to plots
options(repr.plot.width=4, repr.plot.height=4)

In [7]:
# Prepare a nice theme (i.e. general appearance of graphs)
theme_new <- theme_classic() + 
    theme(plot.title = element_text(size=11, face="bold", hjust = 0.5), axis.text.x = element_text(size = 11))

In [101]:
ggplot(data=Train) + 
  geom_bar(aes(x = Survived, fill = Pclass), position = "fill") +
  labs(title = "Distribution of classes\namong dead and safe passengers", y = "Proportion of passengers", x = "") +
  scale_x_discrete(labels=c("Died","Survived")) +
  scale_fill_brewer(palette="Blues") +
  theme_new



In [102]:
ggplot(data=Train) + 
  geom_bar(aes(x = Pclass, fill = Survived), position = "fill") +
  labs(title = "Distribution of dead and safe passengers\namong classes", y = "Proportion of passengers", x = "") +
  scale_x_discrete(labels=c("1","2","3")) +
  scale_fill_brewer(palette="Blues", labels=c("Died", "Survived"), name = "Survival") +
  theme_new


We can see that most of the people who survived were in 1st class; conversely, most of the people who died were in 3rd class.

Similarly, here we can see that most 1st-class passengers survived, whereas most 3rd-class passengers died.

1.3.2 Women and children first!


In [103]:
ggplot(data=Train) + 
  geom_bar(aes(x = Survived, fill = Sex), position = "fill") +
  labs(title = "Distribution of classes\namong dead and safe passengers", y = "Proportion of passengers", x = "") +
  scale_x_discrete(labels=c("Died","Survived")) +
  scale_fill_brewer(palette="Blues") +
  theme_new


Most of the passengers who died were men, while most of the survivors were women.
As expected (from our domain knowledge and our preliminary correlation test), Sex is a good predictor for the survival of passengers.

What about age? Children were supposed to board the lifeboats first.


In [104]:
ggplot(data=Train) + 
  geom_boxplot(aes(x = Survived, y = Age)) +
  labs(title = "Distribution of ages\namong dead and safe passengers", x = "") +
  scale_x_discrete(labels=c("Died","Survived")) +
  theme_new


Warning message:
"Removed 177 rows containing non-finite values (stat_boxplot)."

There seems to be only a slight difference in the distribution of ages between dead and safe passengers. Although the medians are equal, the upper and lower quartiles are a bit lower for the surviving passengers, meaning that the survivors were slightly younger than the passengers who died.

Let's now use a histogram to look at the details.


In [105]:
ggplot(data = Train, aes(x = Age, group = Survived, fill = Survived)) + 
   geom_histogram(position="dodge", binwidth = 5, boundary=15) + 
   labs(title = "Distribution of ages\namong dead and safe passengers") + 
   scale_fill_manual(labels = c("Died", "Survived"), name = "Survival", values = c("steelblue3", "lightblue")) +
   theme_new


Warning message:
"Removed 177 rows containing non-finite values (stat_bin)."

Now we clearly see a difference for some ages: most children under 5 survived, whereas most young passengers between 15 and 30 died.


2 Feature engineering

Our aim is to give as much information as we can to the model: the more useful information the model gets, the better it is likely to be.

First, let's combine the Train and Test datasets (temporarily removing the Survived column from the Train dataset).


In [8]:
Survived <- Train$Survived   # We will add it back to the train dataset after processing of the combined data set.

In [9]:
Train$Survived <- NULL
titanic = rbind(Train, Test)

In [108]:
str(titanic)


'data.frame':	1309 obs. of  11 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Pclass     : Ord.factor w/ 3 levels "1"<"2"<"3": 3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 186 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
 $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

2.1 Create new features 1

2.1.1 Create Title from Name

The name can provide some interesting information. It has the form: Last name, Title. First name Middle name. For a married woman, her own first and last names are added in parentheses.
Let's retrieve the title, which may reflect the age as well as the social status of the passengers.


In [10]:
titanic$Title <- sub(".+,\\s", "", titanic$Name)                 # drop everything up to ", "
titanic$Title <- as.factor(sub("\\.\\s.+", "", titanic$Title))   # drop everything from ". " onward

2.2 Impute missing data

First, we can increase the number of observations used in the model by imputing the missing values.

What is the number of missing values in each variable?


In [110]:
nbNA <- function(x){sum(is.na(x))}   # count the NA values in a vector

In [111]:
apply(titanic,2,nbNA)


PassengerId      Pclass        Name         Sex         Age       SibSp 
          0           0           0           0         263           0 
      Parch      Ticket        Fare       Cabin    Embarked       Title 
          0           0           1        1014           2           0 
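
As an aside, colSums(is.na(titanic)) returns the same counts in a single line.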

2.2.1 Fare


In [112]:
hist(titanic$Fare, breaks=50)



In [113]:
summary(titanic$Fare)


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   7.896  14.450  33.300  31.280 512.300       1 

Given that the distribution of Fare is highly skewed, the best option seems to be to impute the missing value with the median.


In [11]:
# Impute Fare with the median
titanic[is.na(titanic[,"Fare"]), "Fare"] <- median(titanic[,"Fare"], na.rm = TRUE)

2.2.2 Embarked


For Embarked, there are only two passengers with missing values. Who are those passengers?


In [115]:
titanic[is.na(titanic$Embarked), ]


    PassengerId Pclass Name                                       Sex    Age SibSp Parch Ticket Fare Cabin Embarked Title
62           62      1 Icard, Miss. Amelie                        female  38     0     0 113572   80 B28   NA       Miss
830         830      1 Stone, Mrs. George Nelson (Martha Evelyn)  female  62     0     0 113572   80 B28   NA       Mrs

They are two women over 30 years old, sharing the same first-class cabin. Let's see whether we can use this information to impute the missing data.


In [116]:
table(titanic[titanic$Sex == "female" & titanic$Pclass == "1" & titanic$Age > 30,"Embarked"])


 C  Q  S 
44  2 36 

Women with the same profile mainly embarked at either Cherbourg (France) or Southampton (UK). Since the two passengers with missing data have English names, we can assume that they embarked at Southampton.


In [12]:
titanic$Embarked[is.na(titanic$Embarked)] = "S"

2.2.3 Age


In [118]:
# Percentage of missing values in Age
263/nrow(titanic)*100


20.0916730328495

Given the percentage of missing values for Age, we will impute this variable using Multivariate Imputation by Chained Equations (MICE).

First, let's check the missing pattern of Age.


In [119]:
# Are the missing ages equally distributed across male and female? 
table(titanic[is.na(titanic$Age), "Sex"])


female   male 
    78    185 

In [120]:
# Are the missing ages equally distributed across classes? 
table(titanic[is.na(titanic$Age), "Pclass"])


  1   2   3 
 39  16 208 

We observe that the missing ages are not randomly distributed across sexes and classes: there are more missing values for males and for 3rd-class passengers. So we probably have a MAR (missing at random) pattern rather than an MCAR (missing completely at random) pattern.
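
A quick way to quantify this dependence (a minimal sketch):

# Proportion of missing vs. observed ages within each class (rows sum to 1)
prop.table(table(titanic$Pclass, is.na(titanic$Age)), margin = 1)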


In [121]:
names(titanic)


  1. 'PassengerId'
  2. 'Pclass'
  3. 'Name'
  4. 'Sex'
  5. 'Age'
  6. 'SibSp'
  7. 'Parch'
  8. 'Ticket'
  9. 'Fare'
  10. 'Cabin'
  11. 'Embarked'
  12. 'Title'

Let's initialize the mice function parameters.


In [13]:
# Create a mids object containing the default settings
init <- mice(titanic[, !names(titanic) %in% c("PassengerId", "Name")], maxit=0, seed = 1)

In [14]:
# Selects the best predictors
new.pred <- quickpred(titanic[, !names(titanic) %in% c("PassengerId", "Name")])

In [15]:
# Define the methods
meth <- init$method
meth[c("Cabin")]=""    # Skip a variable from imputation while this variable will still be used for prediction.

In [16]:
# Perform imputation (here we create 10 imputed datasets via m = 10; mice's default is m = 5)
imputed_Data <- mice(titanic[, !names(titanic) %in% c("PassengerId", "Name")], method=meth, m=10, pred=new.pred, seed = 1, print=FALSE)

Let's inspect the distributions of the original and imputed data.


In [126]:
densityplot(imputed_Data)


The density of the imputed data for each imputed dataset is shown in magenta, while the density of the observed data is shown in blue. The central tendencies of the imputed data appear relatively similar to those of the observed data.


In [127]:
# Give a proper size to plots
options(repr.plot.width=10, repr.plot.height=10)

In [128]:
# Build one density plot per imputed dataset (blue = completed data, red = observed data)
for (i in 1:10){
    namePlot <- paste("plot", i, sep="")
    plot <- ggplot(titanic,aes(x=Age)) + 
          geom_density(data=data.frame(titanic$PassengerId, complete(imputed_Data,i)), alpha = 0.2, fill = "blue")+
          geom_density(data=titanic, alpha = 0.2, fill = "Red")+
          labs(title=paste("Age Distribution, imputation ", i, sep=""))+
          labs(x="Age") +
          theme_new
     assign(namePlot, plot)
}

In [129]:
grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, plot7, plot8, plot9, plot10, nrow=4, ncol=3)


Warning message:
"Removed 263 rows containing non-finite values (stat_density)." (repeated 10 times, once per panel)

The fifth imputed dataset seems to be the best one, based on the density distribution. Let's use it.


In [17]:
titanic_imp <- complete(imputed_Data, 5)
titanic$Age <- titanic_imp$Age

The Cabin variable may also be of interest: it contains the deck followed by the cabin number (e.g. 'C85'), which may reflect proximity to the lifeboats (deck A being the closest and deck G the farthest). However, this variable is mostly missing (NA).
We could still try to impute it using the family names, ticket numbers and classes: 1) families are probably people sharing the same name and the same ticket number; 2) people from the same family were probably on the same deck; 3) first class had the top decks (A-E), second class D-F, and third class E-G.
However, given that the deck is highly correlated with Pclass, we will simply set it aside for now.
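
For reference, here is a minimal sketch of how the deck could be extracted if we wanted to revisit this later (an illustrative deck variable, not part of our feature set):

# Deck letter = first character of Cabin; missing cabins stay NA
deck <- factor(substr(as.character(titanic$Cabin), 1, 1))
table(deck, titanic$Pclass, useNA = "ifany")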

2.3 Create new features 2

2.3.1 Create Child and Young from Age

We have seen that the age classes were not equal with respect to survival: most children under 5 survived, whereas most young passengers between 15 and 30 died. Let's create a variable Child (age <= 5) and a variable Young (15 < age <= 30).


In [18]:
titanic$Child <- factor(titanic$Age <= 5)
titanic$Young <- factor(titanic$Age > 15 & titanic$Age <= 30)

2.3.2 Create Title2 from Title

In Title, many levels have only one or a few observations. This is not optimal for modeling, since those levels have near-zero variance. Let's merge the titles into a smaller number of levels, using the age and sex usually associated with each title (a quick way to flag the problematic levels is sketched below).
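
As a quick check of this near-zero-variance concern, here is a sketch using caret's nearZeroVar() on one-hot-encoded Title levels (default thresholds):

# Flag near-zero-variance columns among the expanded Title dummies
nzv <- nearZeroVar(model.matrix(~ Title - 1, data = titanic), saveMetrics = TRUE)
rownames(nzv)[nzv$nzv]   # the rare titles should show up here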


In [132]:
summary(titanic$Title)


        Capt          Col          Don         Dona           Dr     Jonkheer 
           1            4            1            1            8            1 
        Lady        Major       Master         Miss         Mlle          Mme 
           1            2           61          260            2            1 
          Mr          Mrs           Ms          Rev          Sir the Countess 
         757          197            2            8            1            1 

In [133]:
titles <- as.character(unique(titanic$Title))

In [134]:
# For each title: counts by sex and the observed age range
as.data.frame(cbind("Title" = titles, 
   "Female" = sapply(titles, function(x) nrow(titanic[titanic$Title == x & titanic$Sex == "female",])),
   "Male" = sapply(titles, function(x) nrow(titanic[titanic$Title == x & titanic$Sex == "male",])),
   "Minimum_Age" = sapply(titles, function(x) min(titanic[titanic$Title == x ,'Age'], na.rm = TRUE)),
   "Maximum_Age" = sapply(titles, function(x) max(titanic[titanic$Title == x,'Age'], na.rm = TRUE))), row.names = F)


Title        Female Male Minimum_Age Maximum_Age
Mr                0  757           2          80
Mrs             197    0          14          76
Miss            260    0        0.17          63
Master            0   61        0.33        14.5
Don               0    1          40          40
Rev               0    8          27          57
Dr                1    7          23          54
Mme               1    0          24          24
Ms                2    0          26          28
Major             0    2          45          52
Lady              1    0          48          48
Sir               0    1          49          49
Mlle              2    0          24          24
Col               0    4          47          60
Capt              0    1          70          70
the Countess      1    0          33          33
Jonkheer          0    1          38          38
Dona              1    0          39          39

We can merge titles into:

  • Sir : men above 14.5 years old with an important social status
  • Mr: men above 14.5 years old without status
  • Master: boys up to 14.5 years old
  • Mrs: women above 14 years old, married or not
  • Miss: girls up to 14 years old

In [19]:
titanic$Title2 <- titanic$Title

# First, assign a title from sex and age alone
titanic[(titanic$Sex == "male" & titanic$Age <= 14.5),]$Title2 = "Master" 
titanic[(titanic$Sex == "male" & titanic$Age > 14.5),]$Title2 = "Mr"
titanic[(titanic$Sex == "female" & titanic$Age <= 14),]$Title2 = "Miss"
titanic[(titanic$Sex == "female" & titanic$Age > 14),]$Title2 = "Mrs"

# Then override with "Sir" for men holding an honorific or professional title
titanic[titanic$Title == "Capt"|
   titanic$Title == "Col"|
   titanic$Title == "Don"|
   titanic$Title == "Major"|
   titanic$Title == "Rev"|      
   titanic$Title == "Jonkheer"|
   (titanic$Title == "Dr" &  titanic$Sex == "male"),]$Title2 = "Sir"

titanic$Title2 <- factor(titanic$Title2)

In [136]:
summary(titanic$Title2)


Master   Miss     Mr    Mrs    Sir 
    69     68    750    398     24 

Now, we have levels that are much more balanced.

2.3.3 Create dummy variables

Let's convert our factor variables into dummy variables. Having dummy variables is not absolutely necessary, however:

  • it will make the interpretation easier;
  • it may also help the model perform better;
  • unlike glm() or train(), which create dummy variables automatically, most functions don't.

In [20]:
titanic_dum <- dummy.data.frame(titanic, names= c("Sex","Embarked", "Title2") , sep = "_")
names(titanic_dum)


  1. 'PassengerId'
  2. 'Pclass'
  3. 'Name'
  4. 'Sex_female'
  5. 'Sex_male'
  6. 'Age'
  7. 'SibSp'
  8. 'Parch'
  9. 'Ticket'
  10. 'Fare'
  11. 'Cabin'
  12. 'Embarked_C'
  13. 'Embarked_Q'
  14. 'Embarked_S'
  15. 'Title'
  16. 'Child'
  17. 'Young'
  18. 'Title2_Master'
  19. 'Title2_Miss'
  20. 'Title2_Mr'
  21. 'Title2_Mrs'
  22. 'Title2_Sir'

In [21]:
# Convert the dummy variables into factors
vars = c("Sex_female", "Sex_male", "Embarked_C", "Embarked_Q", "Embarked_S", "Title2_Master", "Title2_Miss", "Title2_Mr", 
         "Title2_Mrs", "Title2_Sir")
titanic_dum[,vars] = lapply(titanic_dum[,vars], function(x) as.factor(x))

In [139]:
str(titanic_dum)


'data.frame':	1309 obs. of  22 variables:
 $ PassengerId  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Pclass       : Ord.factor w/ 3 levels "1"<"2"<"3": 3 1 3 1 3 3 1 3 3 2 ...
 $ Name         : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex_female   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
 $ Sex_male     : Factor w/ 2 levels "0","1": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age          : num  22 38 26 35 35 24 54 2 27 14 ...
 $ SibSp        : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch        : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket       : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare         : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin        : Factor w/ 186 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
 $ Embarked_C   : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 2 ...
 $ Embarked_Q   : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
 $ Embarked_S   : Factor w/ 2 levels "0","1": 2 1 2 2 2 1 2 2 2 1 ...
 $ Title        : Factor w/ 18 levels "Capt","Col","Don",..: 13 14 10 14 13 13 13 9 14 14 ...
 $ Child        : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 2 1 1 ...
 $ Young        : Factor w/ 2 levels "FALSE","TRUE": 2 1 2 1 1 2 1 1 2 1 ...
 $ Title2_Master: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
 $ Title2_Miss  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
 $ Title2_Mr    : Factor w/ 2 levels "0","1": 2 1 1 1 2 2 2 1 1 1 ...
 $ Title2_Mrs   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 1 ...
 $ Title2_Sir   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "dummies")=List of 3
  ..$ Sex     : int  4 5
  ..$ Embarked: int  12 13 14
  ..$ Title2  : int  18 19 20 21 22

3 Machine learning

Now, we can split our data back into the training and test sets.


In [22]:
Train = titanic_dum[1:891,]
Test = titanic_dum[892:1309,]

And add back the variable Survived to the training set.


In [23]:
Train$Survived <- as.factor(Survived)
Train$Survived <- as.factor(revalue(Train$Survived, c("0" = "No", "1"="Yes")))
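
(Recoding the levels as "No"/"Yes" also makes them syntactically valid R names, which caret requires whenever class probabilities are computed.)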

In [24]:
# Let's remove variables that we won't use
Train$PassengerId <- NULL
Train$Cabin <- NULL
Train$Ticket <- NULL
Train$Name <- NULL
Train$Title <- NULL

We will use the train() function of the 'caret' package. It is a very convenient wrapper for modeling, allowing automatic preprocessing (e.g. scaling) and model tuning using resampling (cross-validation, bootstrap, etc.). It provides a uniform interface to the underlying functions, as well as a way to standardize common tasks. Thus, we will be able to compare various algorithms easily.

Note that we will set the same seed prior to each training. This ensures that the same resampling sets are used, which will come in handy when we compare the resampling profiles between models.
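
To see why, here is a minimal sketch with caret's createFolds() (train() draws its resamples in the same way once the seed is set):

# Setting the same seed reproduces the same fold assignments
set.seed(1); foldsA <- createFolds(Train$Survived, k = 10)
set.seed(1); foldsB <- createFolds(Train$Survived, k = 10)
identical(foldsA, foldsB)   # TRUE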

3.1 Logistic regression

3.1.1 A first quick and dirty model

We build a first quick and dirty model, using all the variables. That will give us a benchmark accuracy.


In [143]:
names(Train)


  1. 'Pclass'
  2. 'Sex_female'
  3. 'Sex_male'
  4. 'Age'
  5. 'SibSp'
  6. 'Parch'
  7. 'Fare'
  8. 'Embarked_C'
  9. 'Embarked_Q'
  10. 'Embarked_S'
  11. 'Child'
  12. 'Young'
  13. 'Title2_Master'
  14. 'Title2_Miss'
  15. 'Title2_Mr'
  16. 'Title2_Mrs'
  17. 'Title2_Sir'
  18. 'Survived'

In [25]:
## Define cross-validation experiment
numFolds = trainControl(method = "cv", number = 10) # method='cv' for cross-validation, number=10 for 10 folds

In [145]:
## Perform the training with cross-validation
set.seed(1)
logmod1 <- train(Survived ~ .-Sex_female -Embarked_C -Title2_Master,    # We have to remove the "reference" level of dummy variables
                data = Train, 
                method = "glm", 
                trControl = numFolds, 
                na.action = na.omit, 
                preProcess=c("center", "scale"))


Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :
"prediction from a rank-deficient fit may be misleading" (repeated 10 times, once per fold)

In [68]:
summary(logmod1)     # with Pclass as ordered factor


Call:
NULL

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4176  -0.5318  -0.3675   0.5541   2.5701  

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.737246   0.106584  -6.917 4.61e-12 ***
Pclass.L     -0.940760   0.160721  -5.853 4.82e-09 ***
Pclass.Q      0.001987   0.101244   0.020 0.984340    
Sex_male1    -0.084721   0.410383  -0.206 0.836444    
Age           0.221354   0.731684   0.303 0.762250    
SibSp        -0.838129   0.566057  -1.481 0.138702    
Parch         0.323158   0.342626   0.943 0.345588    
Fare         -0.383011   0.540686  -0.708 0.478709    
Embarked_Q1  -0.019203   0.114395  -0.168 0.866690    
Embarked_S1  -0.208026   0.114543  -1.816 0.069349 .  
ChildTRUE     0.323772   0.160619   2.016 0.043823 *  
YoungTRUE    -0.185078   0.161228  -1.148 0.250997    
Age_Sq       -0.618905   0.596661  -1.037 0.299605    
AgeXFare      0.305446   0.410229   0.745 0.456529    
Fare_Sq       0.269560   0.305260   0.883 0.377209    
AgeXSibSp     0.434261   0.244652   1.775 0.075896 .  
FareXSibSp    0.243983   0.240591   1.014 0.310537    
SibSp_sq     -0.880481   0.755610  -1.165 0.243915    
AgeXParch    -0.338084   0.311986  -1.084 0.278520    
FareXParch   -0.246033   0.268287  -0.917 0.359117    
SibSpXParch   0.222713   0.368423   0.605 0.545508    
Parch_sp     -0.153459   0.253153  -0.606 0.544387    
Title2_Miss1 -0.166523   0.175341  -0.950 0.342259    
Title2_Mr1   -1.471975   0.429721  -3.425 0.000614 ***
Title2_Mrs1         NA         NA      NA       NA    
Title2_Sir1  -0.512864   0.155144  -3.306 0.000947 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1186.66  on 890  degrees of freedom
Residual deviance:  710.79  on 866  degrees of freedom
AIC: 760.79

Number of Fisher Scoring iterations: 6

Note 1: lm() solves linear least squares directly via QR decomposition, which guarantees a global optimum and is more efficient than gradient descent, though it is sensitive to outliers; glm() fits the logistic model by iteratively reweighted least squares (Fisher scoring).
Note 2: We have to normalize the data so the independent variables are on similar scales. With preProcess=c("center", "scale"), the centering and scaling parameters are estimated on the 9 training folds only (not on the whole training set), which limits accidental contamination of the held-out fold.
Note 3: Since Pclass is an ordered factor, the L (linear) contrast measures the linear trend and the Q contrast the quadratic one (close to zero here because the pattern is linear).
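
To inspect the polynomial contrast coding behind Note 3:

contrasts(Train$Pclass)   # the .L (linear) and .Q (quadratic) columns of the ordered factor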

As expected, the class, being a child, and the title (which encodes the sex) have a significant impact on the survival of passengers. The probability of survival is significantly higher if the passenger is a child under 5, and lower if the passenger is in 3rd class and/or has a male title (i.e. Mr or Sir).

Let's see what the accuracy of the model is (i.e. the mean accuracy across the cross-validation folds).


In [146]:
logmod1$results$Accuracy


0.822667688117126
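
The per-fold values behind this mean are stored in the train object; for a model with no tuning grid such as glm, the following should match:

logmod1$resample$Accuracy         # one accuracy per fold
mean(logmod1$resample$Accuracy)   # equals logmod1$results$Accuracy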

That was a really quick and dirty model, and the performance is already good! But I think we can do much better.

Now, how can we improve our model?

  1. Get more training examples?
  2. Try smaller sets of features (to avoid overfitting)?
  3. Try getting additional features?
  4. Try adding polynomial features?

Before rushing into any of these options (which would likely waste our time), let's be smart and check what is wrong with our model.

3.1.2 Check learning curves

Let's check for high bias or high variance using learning curves.


In [152]:
set.seed(1)
lc <- learing_curve_dat(Train[, !names(Train) %in% c("Sex_female", "Embarked_C", "Title2_Master")], outcome = "Survived", 
                        proportion = (1:10)/10, 
                        test_prop = 0, 
                        method = "glm", 
                        trControl = numFolds, 
                        na.action = na.omit,                        
                        verbose = FALSE)


Warning message in eval(expr, envir, enclos):
"model fit failed for Fold01: parameter=none Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels
"
Warning message:
"glm.fit: fitted probabilities numerically 0 or 1 occurred"
Warning message in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
"There were missing values in resampled performance measures."
Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :
"prediction from a rank-deficient fit may be misleading" (repeated for each fold and training-set size)

In [153]:
# Give a proper size to plots
options(repr.plot.width=5, repr.plot.height=4)

In [154]:
ggplot(lc, aes(x = Training_Size, y = Accuracy, color = Data)) +
    geom_smooth(method = loess, span = .8) +   
    theme_classic()


Warning message:
"Removed 1 rows containing non-finite values (stat_smooth)."

This graph shows the learning curves of the model (accuracy vs. training-set size). "Training" and "Resampling" represent the performance of the model on the training set and on the resampling set respectively, as the training set grows. (The resampling set is the set held out as a "test" set during the cross-validation process.)

We observe that the performance on the training set decreases as the training set grows: the more data we get, the less likely the linear model is to fit the data exactly. Conversely, the performance on the resampling set increases with the training-set size: the more data we train the model with, the better it generalizes to unknown data.

Performance is slightly lower on the resampling set than on the training set. It seems that we have a high-variance problem, even if the learning curve is not very clear here... In other words, our model seems to be overfitting. Let's remove some features.

3.1.3 Feature selection

We can perform feature selection, to keep only relevant predictors, either manually or automatically.
As written by Guyon & Elisseeff (2003) (http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf),

There are many potential benefits of feature selection:

  • facilitating data visualization and data understanding,
  • reducing the measurement and storage requirements,
  • reducing training and utilization times,
  • defying the curse of dimensionality to improve prediction performance.

3.1.3.1 Manual selection

First, let's check the warning that appeared when we ran our quick and dirty model. It seems that we have a problem related to rank deficiency. It can have several causes, one of which is collinearity among variables. We know that Fare is highly correlated with Pclass; however, removing Fare does not remove the warning.
We created Title2 by taking Sex and Age into account, so Sex and Title2 are likely highly correlated.


In [161]:
# For example: 
hetcor(Train[,c("Sex_male", "Title2_Mr")])$correlations


Warning message in polychor(x, y, ML = ML, std.err = std.err):
"inadmissible correlation set to 0.9999"
          Sex_male Title2_Mr
Sex_male    1.0000    0.9999
Title2_Mr   0.9999    1.0000

Let's try to remove Sex_male then.


In [155]:
names(Train)


  1. 'Pclass'
  2. 'Sex_female'
  3. 'Sex_male'
  4. 'Age'
  5. 'SibSp'
  6. 'Parch'
  7. 'Fare'
  8. 'Embarked_C'
  9. 'Embarked_Q'
  10. 'Embarked_S'
  11. 'Child'
  12. 'Young'
  13. 'Title2_Master'
  14. 'Title2_Miss'
  15. 'Title2_Mr'
  16. 'Title2_Mrs'
  17. 'Title2_Sir'
  18. 'Survived'

In [156]:
## Perform the cross validation
set.seed(1)
logmod2 <- train(Survived ~ .-Sex_female -Embarked_C -Title2_Master -Sex_male, 
                 data = Train, 
                 method = "glm", 
                 trControl = numFolds, 
                 na.action = na.omit, 
                 preProcess=c("center", "scale"))

In [157]:
summary(logmod2)


Call:
NULL

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5066  -0.5717  -0.3620   0.5418   2.5770  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.68573    0.09680  -7.084 1.40e-12 ***
Pclass.L     -0.90930    0.13440  -6.766 1.33e-11 ***
Pclass.Q     -0.03550    0.09676  -0.367  0.71373    
Age          -0.44081    0.18545  -2.377  0.01746 *  
SibSp        -0.56822    0.13495  -4.211 2.55e-05 ***
Parch        -0.23011    0.10352  -2.223  0.02624 *  
Fare          0.13828    0.13140   1.052  0.29266    
Embarked_Q1  -0.06132    0.11192  -0.548  0.58376    
Embarked_S1  -0.20492    0.11155  -1.837  0.06620 .  
ChildTRUE     0.20463    0.11924   1.716  0.08615 .  
YoungTRUE    -0.27831    0.15120  -1.841  0.06568 .  
Title2_Miss1 -0.02414    0.12063  -0.200  0.84141    
Title2_Mr1   -1.07443    0.32704  -3.285  0.00102 ** 
Title2_Mrs1   0.50812    0.29677   1.712  0.08686 .  
Title2_Sir1  -0.37218    0.13046  -2.853  0.00433 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1186.66  on 890  degrees of freedom
Residual deviance:  727.25  on 876  degrees of freedom
AIC: 757.25

Number of Fisher Scoring iterations: 5

Let's check the accuracy of the model.


In [158]:
logmod2$results$Accuracy


0.822667688117126

Removing Sex_male did not change the accuracy of our model.

Let's have a look at the importance of each variable.


In [163]:
# absolute value of the t-statistic
varImp(logmod2, scale=FALSE)


glm variable importance

             Overall
Pclass.L      6.7659
SibSp         4.2108
Title2_Mr1    3.2853
Title2_Sir1   2.8528
Age           2.3769
Parch         2.2227
YoungTRUE     1.8406
Embarked_S1   1.8370
ChildTRUE     1.7161
Title2_Mrs1   1.7122
Fare          1.0523
Embarked_Q1   0.5479
Pclass.Q      0.3668
Title2_Miss1  0.2001

We can remove some of the least important variables: Title2_Miss, Title2_Mrs, Embarked_S and Embarked_Q.


In [164]:
names(Train)


  1. 'Pclass'
  2. 'Sex_female'
  3. 'Sex_male'
  4. 'Age'
  5. 'SibSp'
  6. 'Parch'
  7. 'Fare'
  8. 'Embarked_C'
  9. 'Embarked_Q'
  10. 'Embarked_S'
  11. 'Child'
  12. 'Young'
  13. 'Title2_Master'
  14. 'Title2_Miss'
  15. 'Title2_Mr'
  16. 'Title2_Mrs'
  17. 'Title2_Sir'
  18. 'Survived'

In [189]:
## Perform the cross validation
set.seed(1)
logmod3 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare  + Child + Young + Title2_Mr + Title2_Sir, 
                 data = Train, 
                 method = "glm", 
                 trControl = numFolds, 
                 na.action = na.omit, 
                 preProcess=c("center", "scale"))

In [190]:
logmod3$results$Accuracy


0.829409828623312

Now let's also remove Fare, which is among the least important variables and is collinear with Pclass.


In [34]:
## Perform the cross validation
set.seed(1)
logmod4 <- train(Survived ~ Pclass + Age + SibSp + Parch + Child + Young + Title2_Mr + Title2_Sir, 
                  data = Train, 
                  method = "glm", 
                  trControl = numFolds, 
                  na.action = na.omit, 
                  preProcess=c("center", "scale"))

In [35]:
logmod4$results$Accuracy


0.831669787765293

Still better!

3.1.3.2 Automatic selection

Another way to perform feature selection is to use stepwise feature selection based on AIC. Let's try it starting with all the variables.
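
For reference, here is a minimal sketch of what method = "glmStepAIC" does inside each resample, using MASS::stepAIC() directly on a plain glm fit (assuming Survived is a two-level factor and Train is complete after imputation):

In [ ]:
## Stepwise AIC selection on a single glm fit
library(MASS)   # stepAIC()
full_glm <- glm(Survived ~ . -Sex_female -Sex_male -Embarked_C -Title2_Master, 
                data = Train, family = binomial)
step_glm <- stepAIC(full_glm, direction = "both", trace = FALSE)  # drop/add terms while AIC decreases
formula(step_glm)   # the retained predictors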


In [174]:
names(Train)


  1. 'Pclass'
  2. 'Sex_female'
  3. 'Sex_male'
  4. 'Age'
  5. 'SibSp'
  6. 'Parch'
  7. 'Fare'
  8. 'Embarked_C'
  9. 'Embarked_Q'
  10. 'Embarked_S'
  11. 'Child'
  12. 'Young'
  13. 'Title2_Master'
  14. 'Title2_Miss'
  15. 'Title2_Mr'
  16. 'Title2_Mrs'
  17. 'Title2_Sir'
  18. 'Survived'

In [184]:
## Perform the training with cross-validation
set.seed(1)
AIC_glm <- train(Survived ~ .-Sex_female -Title2_Master -Embarked_C -Sex_male, 
                 data = Train, 
                 method = "glmStepAIC", 
                 trControl = numFolds, 
                 na.action = na.omit, 
                 preProcess=c("center", "scale"),
                 trace=FALSE)

In [185]:
summary(AIC_glm)


Call:
NULL

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5318  -0.5549  -0.3578   0.5499   2.5366  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.69743    0.09620  -7.249 4.19e-13 ***
Pclass.L    -1.00514    0.10990  -9.146  < 2e-16 ***
Age         -0.45852    0.18387  -2.494 0.012642 *  
SibSp       -0.54236    0.12913  -4.200 2.67e-05 ***
Parch       -0.19795    0.09945  -1.991 0.046532 *  
Embarked_S1 -0.18664    0.09243  -2.019 0.043458 *  
ChildTRUE    0.20270    0.11816   1.716 0.086246 .  
YoungTRUE   -0.27762    0.15057  -1.844 0.065208 .  
Title2_Mr1  -1.01982    0.27945  -3.649 0.000263 ***
Title2_Mrs1  0.55479    0.25418   2.183 0.029059 *  
Title2_Sir1 -0.36575    0.12075  -3.029 0.002453 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1186.66  on 890  degrees of freedom
Residual deviance:  728.95  on 880  degrees of freedom
AIC: 750.95

Number of Fisher Scoring iterations: 5

In [186]:
AIC_glm$results$Accuracy


0.828260980592441

Stepwise selection slightly decreased the performance of our model. In fact, feature selection does not necessarily improve a model's accuracy; its primary aim is to reduce the model's complexity. Since our goal is the best accuracy, we will keep the variables retained by the manual selection.


3.2 Classification tree: CART (Recursive Partitioning)


In [191]:
names(Train)


  1. 'Pclass'
  2. 'Sex_female'
  3. 'Sex_male'
  4. 'Age'
  5. 'SibSp'
  6. 'Parch'
  7. 'Fare'
  8. 'Embarked_C'
  9. 'Embarked_Q'
  10. 'Embarked_S'
  11. 'Child'
  12. 'Young'
  13. 'Title2_Master'
  14. 'Title2_Miss'
  15. 'Title2_Mr'
  16. 'Title2_Mrs'
  17. 'Title2_Sir'
  18. 'Survived'

In [26]:
## Define the cross-validation experiment
numFolds = trainControl(method = "cv", number = 10 ) # method='cv' for cross-validation, number=10 for 10 folds
cpGrid = expand.grid( .cp = seq(0.01,0.5,0.01))

For this model, we are not going to use the 'formula' interface, so that our tree will be clearer when we plot it.


In [27]:
DV = Train[, "Survived"]
pred = Train[, c('Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_Q', 'Embarked_S', 'Child', 
                     'Young', 'Title2_Miss', 'Title2_Mr', 'Title2_Mrs', 'Title2_Sir')]

In [195]:
## Perform the cross validation
set.seed(1)
CART1 <- train(pred, DV, method = "rpart", trControl = numFolds, na.action = na.omit, tuneGrid = cpGrid)

In [196]:
CART1$bestTune


    cp
2 0.02

In [197]:
# Give a proper size to plots
options(repr.plot.width=4, repr.plot.height=4)

In [198]:
plot(CART1)



In [199]:
# Give a proper size to plots
options(repr.plot.width=6, repr.plot.height=6)

In [200]:
rpart.plot(CART1$finalModel, type=0, extra=106, under=TRUE, tweak=0.8)


The plot shows the probability of survival and the percentage of observations in each leaf.
The title was important: a passenger who was a 'Mr' was more likely to die, and among 1st- and 2nd-class passengers, a 'Sir' was also more likely to die. These results mirror what we said earlier about the lower probability of survival in men. As expected, the class of the passenger had an impact on survival: first- and second-class passengers were more likely to survive.
Among 3rd-class passengers, Fare came into play: interestingly, having paid a fare of at least 23 made a passenger more likely to die...


In [201]:
# Best accuracy
max(CART1$results$Accuracy)


0.829435648621042

This CART model does not perform better than our best logistic regression model.


3.3 Random forest

Random forest = tree bagging + feature sampling
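
To make the "tree bagging + feature sampling" idea concrete, here is a minimal hand-rolled sketch (illustrative only, not the randomForest implementation; note that randomForest() resamples mtry features at every split, whereas this sketch samples them once per tree):

In [ ]:
## Minimal sketch of a random forest: bootstrap rows + random feature subsets
simple_forest <- function(dv, data, predictors, n_trees = 100, m_features = 4) {
    lapply(1:n_trees, function(i) {
        boot  <- data[sample(nrow(data), replace = TRUE), ]   # bootstrap sample (bagging)
        feats <- sample(predictors, m_features)               # random feature subset
        rpart(reformulate(feats, response = dv), data = boot, method = "class")
    })
}
forest_predict <- function(forest, newdata) {
    votes <- sapply(forest, function(tree) as.character(predict(tree, newdata, type = "class")))
    apply(votes, 1, function(v) names(which.max(table(v))))   # majority vote over trees
}
## e.g. forest <- simple_forest("Survived", Train, colnames(pred), n_trees = 50)
##      (pred is the predictor dataframe defined in section 3.2)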


In [28]:
## Define the cross-validation experiment
mtryGrid = expand.grid( .mtry = 1:10)
numFolds = trainControl(method = "cv", number = 10 ) # method='cv' for cross-validation, number=10 for 10 folds

In [224]:
names(Train)


  1. 'Pclass'
  2. 'Sex_female'
  3. 'Sex_male'
  4. 'Age'
  5. 'SibSp'
  6. 'Parch'
  7. 'Fare'
  8. 'Embarked_C'
  9. 'Embarked_Q'
  10. 'Embarked_S'
  11. 'Child'
  12. 'Young'
  13. 'Title2_Master'
  14. 'Title2_Miss'
  15. 'Title2_Mr'
  16. 'Title2_Mrs'
  17. 'Title2_Sir'
  18. 'Survived'
  19. 'Deck'

In [37]:
## Perform the cross validation
set.seed(1)
RF1 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young + 
                         Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir, 
             data = Train, 
             method = "rf", 
             trControl = numFolds, 
             na.action = na.omit, 
             preProcess=c("center", "scale"), 
             tuneGrid = mtryGrid, 
             nodesize = 15, 
             importance = TRUE)

In [225]:
RF1$bestTune


   mtry
10   10

In [226]:
# Best accuracy
max(RF1$results$Accuracy)


0.836163318579049

Our first random forest model is better than our best logistic regression and CART models.
However, we lose some of the interpretability that comes with CART. Still, we can compute metrics that give us insight into which variables are important. One metric is the number of times, aggregated over all of the trees in the forest, that a given variable is selected for a split (i.e. its selection frequency).


In [227]:
# Give a proper size to plots
options(repr.plot.width=4, repr.plot.height=4)

In [228]:
vu = varUsed(RF1$finalModel, count=TRUE)
vusorted = sort(vu, decreasing = FALSE, index.return = TRUE)
dotchart(vusorted$x, RF1$finalModel$xNames[vusorted$ix], main = "RF1", xlab = "Selection frequency", cex=0.7)


Here, Fare and Age appear to be important variables.

Another way to assess the importance of a variable is to look at the reduction in impurity (i.e. how homogeneous each bucket or leaf of the tree is) after that variable has been selected for splitting, summed over all of the trees in the forest (i.e. Gini importance). Similarly, we can look at the mean decrease in accuracy when the variable's values are permuted (i.e. permutation importance).
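
As a reminder, the Gini impurity of a node with class proportions p_k is 1 - sum(p_k^2): 0 for a pure node, 0.5 for a 50/50 split between two classes. A quick illustration:

In [ ]:
## Gini impurity of a node
gini <- function(labels) {
    p <- table(labels) / length(labels)   # class proportions
    1 - sum(p^2)
}
gini(c("No", "No", "Yes", "Yes"))   # 0.5 (maximally impure for 2 classes)
gini(rep("No", 4))                  # 0 (pure node)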


In [229]:
# Give a proper size to plots
options(repr.plot.width=8, repr.plot.height=4)

In [230]:
par(mfrow=c(1,2))

# Compute the reduction in impurity (i.e. Gini importance)
Gini <- varImp(RF1, type=2, scale=FALSE) 
Ginisorted = sort(Gini$importance$Overall, decreasing = FALSE, index.return = TRUE)
dotchart(Ginisorted$x, rownames(Gini$importance)[Ginisorted$ix], 
         main = "RF1", 
         xlab = "Gini importance", 
         cex=0.7)

# Compute the reduction in accuracy (i.e. permutation accuracy importance)
Perm <- varImp(RF1, type=1, scale=FALSE) 
Permsorted = sort(Perm$importance$Overall, decreasing = FALSE, index.return = TRUE)
dotchart(Permsorted$x, rownames(Perm$importance)[Permsorted$ix], 
         main = "RF1", 
         xlab = "Permutation accuracy importance", 
         cex=0.7)


Again, Fare and Age appear among the top variables, but we can see that Title2_Mr and Pclass are also important.

WARNING! We have to be careful here: isn't it a strange coincidence that all the continuous variables are in the top 5 variables used for splitting?...

Checking the literature, we can find an explanation to this behaviour:

  • Strobl et al (2007) Bias in random forest variable importance measures: Illustrations, sources and a solution, BMC Bioinformatics, 8: 25
  • Strobl et al (2009) Party on!, The R Journal, 1/2.

In standard tree algorithms, variable selection is biased in favor of variables offering many potential cut-points, so that variables with many categories and continuous variables are artificially preferred.

Therefore, authors suggested that

The randomForest function should not be used when the potential predictor variables vary in their scale of measurement or their number of categories.


In [231]:
pred3 <- subset(Train, select=c(Pclass, Age, SibSp, Parch, Fare, Embarked_Q, Embarked_S, Child, Young, 
                                    Title2_Miss, Title2_Mr, Title2_Mrs, Title2_Sir))

In [232]:
# Number of "levels" (i.e. potential cut-points) for each variable
sort(apply(pred3, 2, function(x) length(unique(x))), decreasing = TRUE)


Fare         248
Age           89
SibSp          7
Parch          7
Pclass         3
Embarked_Q     2
Embarked_S     2
Child          2
Young          2
Title2_Miss    2
Title2_Mr      2
Title2_Mrs     2
Title2_Sir     2

Given our dataset, we have to use an unbiased tree algorithm (i.e. the ctree and cforest functions in the party package). These algorithms use subsampling without replacement, which provides more reliable variable importance measures.


In [233]:
names(Train)


  1. 'Pclass'
  2. 'Sex_female'
  3. 'Sex_male'
  4. 'Age'
  5. 'SibSp'
  6. 'Parch'
  7. 'Fare'
  8. 'Embarked_C'
  9. 'Embarked_Q'
  10. 'Embarked_S'
  11. 'Child'
  12. 'Young'
  13. 'Title2_Master'
  14. 'Title2_Miss'
  15. 'Title2_Mr'
  16. 'Title2_Mrs'
  17. 'Title2_Sir'
  18. 'Survived'
  19. 'Deck'

In [234]:
## Perform the cross validation
set.seed(1)
RF2 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young + 
                         Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir, 
             data = Train, 
             method = "cforest", 
             trControl = numFolds, 
             preProcess=c("center", "scale"), 
             tuneGrid = mtryGrid, 
             controls = cforest_unbiased(ntree=500))

In [235]:
RF2$bestTune


  mtry
8    8

In [236]:
import <- varimp(RF2$finalModel)

In [237]:
# Give a proper size to plots
options(repr.plot.width=4, repr.plot.height=4)

In [238]:
importSorted = sort(import, decreasing = FALSE, index.return = TRUE)
dotchart(importSorted$x, 
         main = "RF2", 
         xlab = "Permutation accuracy importance", 
         cex=0.7)
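
Note that party::varimp() also offers conditional permutation importance (Strobl et al. 2008), which further adjusts for correlated predictors. It is considerably slower, so it is only sketched here:

In [ ]:
## Conditional permutation importance (slow; adjusts for correlated predictors)
# import_cond <- varimp(RF2$finalModel, conditional = TRUE)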


With 'cforest()', retrieving the variable selection frequency is not as straightforward as with 'randomForest()'. Let's compute it ourselves.

The 'cutpoints_list()' function returns the list of all the cutpoints used for a given variable in a given tree built with the 'party' package (it can be found at: https://github.com/cran/party/blob/master/R/varimp.R). The list's length therefore corresponds to the number of times the variable has been used for splitting.


In [239]:
cutpoints_list <- function(tree, variableID) {

    cutp <- function(node) {
       if (node[[4]]) return(NULL)        # terminal node: no split here
       cp <- NULL
       if (node[[5]][[1]] == variableID)  # this node splits on our variable
           cp <- node[[5]][[3]]           # record its cutpoint
       nl <- cutp(node[[8]])              # recurse into the left child
       nr <- cutp(node[[9]])              # recurse into the right child
       return(c(cp, nl, nr))
    }
    return(cutp(tree))
}

Now we can use this function to compute the selection frequency of each variable.


In [240]:
# Retrieve the variables' names
inputs <- RF2$finalModel@data@get("input")
xnames <- colnames(inputs)

# Number of variables
nbvar <- length(xnames)

# Number of trees in the forest
nbtree <- length(RF2$finalModel@ensemble)

# Prepare container
nbtimes <- data.frame()

# Compute the number of times a given variable has been used to split a given tree
for (i in 1:nbvar) {                      # for each variable
    for (j in 1:nbtree) {                 # for each tree in the forest
        cutpts <- cutpoints_list(RF2$finalModel@ensemble[[j]], i)
        nbtimes[j,i] <- length(cutpts)
    }
}   
colnames(nbtimes) <- xnames

# Frequency selection of each variable
nbtimesTot <- apply(nbtimes, 2, sum)

In [241]:
nbtimesTot


Pclass.L      1441
Pclass.Q       738
Age           3482
SibSp         1683
Parch         1129
Fare          4183
Embarked_Q1    539
Embarked_S1   1333
ChildTRUE      382
YoungTRUE     1651
Title2_Miss1   203
Title2_Mr1     546
Title2_Mrs1    505
Title2_Sir1    307

In [242]:
nbtimesSorted = sort(nbtimesTot, decreasing = FALSE, index.return = TRUE)
dotchart(nbtimesSorted$x, 
         main = "RF2", 
         xlab = "Selection frequency", 
         cex=0.7)


Let's compare the selection frequencies from our randomForest and cforest models side by side.


In [243]:
# Give a proper size to plots
options(repr.plot.width=8, repr.plot.height=4)

In [244]:
par(mfrow=c(1,2))
dotchart(vusorted$x, 
         RF1$finalModel$xNames[vusorted$ix], 
         main = "RF1 (randomForest)", 
         xlab = "Selection frequency", 
         cex=0.7)
dotchart(nbtimesSorted$x, 
         main = "RF2 (cforest)", 
         xlab = "Selection frequency", 
         cex=0.7)


The use of 'cforest()' has clearly decreased the gap between Age/Fare and other variables such as Pclass or Embarked_S.
Fare and Age are still the most frequently used variables, although their contribution to either the Gini importance or the permutation accuracy importance is lower than that of other variables.


In [245]:
# Best accuracy
max(RF2$results$Accuracy)


0.824864090341618

The accuracy of this model is lower than that of the randomForest model, and even lower than the CART and logistic regression models...


3.4 Support Vector Machine (SVM)

Since we have a relatively small number of features (1-1,000) and an intermediate number of observations (10-10,000), an SVM with a Gaussian kernel seems to be a reasonable option, potentially better than logistic regression.

3.4.1 With linear kernel


In [29]:
## Define the cross-validation experiment
costGrid = expand.grid( .C = c(0.01, 0.03, 0.1, 0.3, 1, 3, 10))
numFolds = trainControl(method = "cv", number = 10 )

In [247]:
names(Train)


  1. 'Pclass'
  2. 'Sex_female'
  3. 'Sex_male'
  4. 'Age'
  5. 'SibSp'
  6. 'Parch'
  7. 'Fare'
  8. 'Embarked_C'
  9. 'Embarked_Q'
  10. 'Embarked_S'
  11. 'Child'
  12. 'Young'
  13. 'Title2_Master'
  14. 'Title2_Miss'
  15. 'Title2_Mr'
  16. 'Title2_Mrs'
  17. 'Title2_Sir'
  18. 'Survived'
  19. 'Deck'

In [248]:
## Perform the cross validation
set.seed(1)
svm1 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young + 
                          Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir, 
              data = Train, 
              method = "svmLinear", 
              trControl = numFolds, 
              na.action = na.omit, 
              preProcess=c("center", "scale"), 
              tuneGrid = costGrid)

In [249]:
svm1$bestTune


  C
6 3

In [250]:
# Give a proper size to plots
options(repr.plot.width=4, repr.plot.height=4)

In [251]:
max(svm1$results$Accuracy)


0.82717568947906

3.4.2 With Gaussian kernel
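
As a reminder, kernlab parameterizes the Gaussian (RBF) kernel as k(x, y) = exp(-sigma * ||x - y||^2), so sigma controls how quickly similarity decays with distance. A quick illustration (the sigma value is arbitrary):

In [ ]:
## kernlab's RBF kernel: rbfdot() vs the same value by hand
rbf <- rbfdot(sigma = 0.3)
rbf(c(1, 2), c(2, 4))                    # kernel value for two points
exp(-0.3 * sum((c(1, 2) - c(2, 4))^2))   # exp(-0.3 * 5), approx. 0.22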


In [30]:
## Define the cross-validation experiment
costGrid = expand.grid( .C = c(0.01, 0.03, 0.1, 0.3, 1, 3, 10), .sigma = c(0.01, 0.03, 0.1, 0.3, 1, 3, 10))
numFolds = trainControl(method = "cv", number = 10)

In [253]:
## Perform the cross validation
set.seed(1)
svm2 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young + 
                          Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir, 
              data = Train, 
              method = "svmRadial", 
              trControl = numFolds, 
              na.action = na.omit, 
              preProcess=c("center", "scale"), 
              tuneGrid = costGrid)

In [254]:
svm2$bestTune


   sigma   C
25   0.3 0.3

In [255]:
# Give a proper size to plots
options(repr.plot.width=6, repr.plot.height=4)

In [256]:
plot(svm2)



In [257]:
max(svm2$results$Accuracy)


0.830520655998184

This is better than CART but not better than our logistic regression and random forest models.

Even though SVM already uses regularisation to avoid overfitting (the regularisation parameter C), let's remove some of the features that the random forest identified as the least important (i.e. Title2_Miss, Child, Title2_Mrs, Embarked_Q).




In [269]:
## Perform the cross validation
set.seed(1)
svm3 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_S +
                          Title2_Mr + Title2_Sir, 
              data = Train, 
              method = "svmRadial", 
              trControl = numFolds, 
              na.action = na.omit, 
              preProcess=c("center", "scale"), 
              tuneGrid = costGrid)

In [270]:
svm3$bestTune


   sigma  C
43  0.01 10

In [271]:
max(svm3$results$Accuracy)


0.836164737260243

Now, we reach the same performance as our random forest model.


3.5 Neural network


In [31]:
## Define the cross-validation experiment
layerGrid = expand.grid( .size = seq(5, 105, by=10), .decay = c(0, 10^seq(-3, 0, by=1)))
numFolds = trainControl(method = "cv", number = 10 )

In [ ]:
## Perform the cross validation
set.seed(1)
nn1 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young + 
                         Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir, 
             data = Train, 
             method = "nnet", 
             trControl = numFolds, 
             na.action = na.omit, 
             preProcess=c("center", "scale"), 
             tuneGrid = layerGrid, 
             MaxNWts=1900,
             trace=FALSE)

In [ ]:
nn1$bestTune

In [ ]:
max(nn1$results$Accuracy, na.rm=TRUE)

Neural networks are very sensitive to multicollinearity. Let's remove Fare, which is highly correlated with Pclass.


In [40]:
## Perform the cross validation
set.seed(1)
nn2 <- train(Survived ~ Pclass + Age + SibSp + Parch + Embarked_Q + Embarked_S + Child + Young + 
                         Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir, 
             data = Train, 
             method = "nnet", 
             trControl = numFolds, 
             na.action = na.omit, 
             preProcess=c("center", "scale"), 
             tuneGrid = layerGrid,
             MaxNWts=1700,
             trace=FALSE)

In [229]:
nn2$bestTune


   size decay
40   75     1

In [230]:
max(nn2$results$Accuracy, na.rm=TRUE)


0.832042253521127

Now, let's remove some features that were defined as the least important ones by the random forest (i.e. Title2_Miss, Child, Title2_Mrs, Embarked_Q).


In [234]:
## Perform the cross validation
set.seed(1)
nn3 <- train(Survived ~ Pclass + Age + SibSp + Parch + Embarked_S + Young + 
                        Title2_Mr + Title2_Sir,
             data = Train, 
             method = "nnet", 
             trControl = numFolds, 
             na.action = na.omit, 
             preProcess=c("center", "scale"), 
             tuneGrid = layerGrid, 
             MaxNWts=1900,
             trace=FALSE)

In [235]:
nn3$bestTune


   size decay
35   65     1

In [236]:
max(nn3$results$Accuracy, na.rm=TRUE)


0.826388888888889

That's not better. Let's keep them in the model.

3.6 XGBoost

3.6.1 Linear


In [41]:
## Define the cross-validation experiment
tuneGridL = expand.grid(.nrounds = 1:3, 
                       .lambda = c(0.01, 0.03, 0.1, 0.3, 1, 3, 10), 
                       .alpha = c(0.01, 0.03, 0.1, 0.3, 1, 3, 10), #1e-5, 1e-2, 0.1, 1, 100
                       .eta = c(0.01, 0.21, 0.05))   # 0.01-0.2
numFolds = trainControl(method = "cv", number = 10 )

In [42]:
## Perform the cross validation
set.seed(1)
xgb1 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young + 
                         Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir, 
             data = Train, 
             method = "xgbLinear", 
             trControl = numFolds, 
             na.action = na.omit, 
             preProcess=c("center", "scale"), 
             tuneGrid = tuneGridL)


Loading required package: xgboost
Warning message:
"package 'xgboost' was built under R version 3.3.3"

In [43]:
xgb1$bestTune


    nrounds lambda alpha  eta
244       2      1     1 0.01

In [44]:
max(xgb1$results$Accuracy)


0.839509420043128

3.6.2 Tree


In [45]:
## Define the cross-validation experiment
tuneGridT = expand.grid(.nrounds = 1:3, 
                       .gamma = c(0.01, 0.03, 0.1, 0.3, 1),
                       .max_depth = seq(3,10,2), 
                       .eta = c(0.01, 0.21, 0.05),
                       .colsample_bytree = c(0.6, 0.7, 0.8, 0.9, 1), 
                       .min_child_weight = seq(1,10,2), 
                       .subsample = c(0.6, 0.7, 0.8, 0.9, 1)) 
numFolds = trainControl(method = "cv", number = 10 )

In [46]:
## Perform the cross validation
set.seed(1)
xgb2 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young + 
                         Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir, 
             data = Train, 
             method = "xgbTree", 
             trControl = numFolds, 
             na.action = na.omit, 
             preProcess=c("center", "scale"), 
             tuneGrid = tuneGridT)

In [47]:
xgb2$bestTune


      nrounds max_depth  eta gamma colsample_bytree min_child_weight subsample
14351       2         9 0.05   0.3              0.7                3       0.9

In [48]:
max(xgb2$results$Accuracy)


0.847362387924186

4 Compare the best models

4.1 Compare the performances

First, we have to rerun the CART model (as CART1b), this time using the formula interface and the Train dataframe, so that all models use the same training dataset.


In [36]:
set.seed(1)
CART1b <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young + 
                         Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir, 
                data = Train, 
                method = "rpart", 
                trControl = numFolds, 
                na.action = na.omit, 
                tuneGrid = cpGrid)

In [49]:
# Collect the resampling results
resamps <- resamples(list(GLM = logmod4,
                          CART = CART1b,
                          RF = RF1,
                          SVM = svm3,
                          NN = nn2,
                          XGBL = xgb1,
                          XGBT = xgb2))
summary(resamps)


Call:
summary.resamples(object = resamps)

Models: GLM, CART, RF, SVM, NN, XGBL, XGBT 
Number of resamples: 10 

Accuracy 
       Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
GLM  0.7416  0.8230 0.8324 0.8317  0.8551 0.8764    0
CART 0.7303  0.8118 0.8323 0.8294  0.8551 0.8977    0
RF   0.7528  0.8338 0.8427 0.8362  0.8516 0.8764    0
SVM  0.7303  0.8118 0.8380 0.8328  0.8551 0.8977    0
NN   0.7191  0.8187 0.8258 0.8226  0.8417 0.8764    0
XGBL 0.7528  0.8315 0.8428 0.8395  0.8551 0.8876    0
XGBT 0.7753  0.8455 0.8556 0.8474  0.8648 0.8764    0

Kappa 
       Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
GLM  0.4303  0.6150 0.6405 0.6369  0.6887 0.7338    0
CART 0.4158  0.5882 0.6407 0.6334  0.6887 0.7807    0
RF   0.4453  0.6405 0.6580 0.6438  0.6758 0.7338    0
SVM  0.4090  0.5962 0.6453 0.6406  0.6913 0.7783    0
NN   0.3807  0.6025 0.6253 0.6156  0.6569 0.7338    0
XGBL 0.4453  0.6386 0.6577 0.6523  0.6893 0.7593    0
XGBT 0.5075  0.6597 0.6913 0.6714  0.7089 0.7454    0

Let's plot the results.


In [50]:
# Give a proper size to plots
options(repr.plot.width=6, repr.plot.height=4)

In [51]:
bwplot(resamps, layout = c(2, 1))



In [243]:
dotplot(resamps, metric = "Accuracy")



In [52]:
# Give a proper size to plots
options(repr.plot.width=5, repr.plot.height=5)

In [53]:
splom(resamps, varname.cex=0.8, varname.fontface="bold", axis.text.cex=0.4)


Since models are fit on the same versions of the training data, it makes sense to make inferences on the differences between models. In this way we reduce the within-resample correlation that may exist. We can compute the differences, then use a simple t-test to evaluate the null hypothesis that there is no difference between models. (Max Kuhn)


In [54]:
difValues <- diff(resamps)
difValues
summary(difValues)


Call:
diff.resamples(x = resamps)

Models: GLM, CART, RF, SVM, NN, XGBL, XGBT 
Metrics: Accuracy, Kappa 
Number of differences: 21 
p-value adjustment: bonferroni 
Call:
summary.diff.resamples(object = difValues)

p-value adjustment: bonferroni 
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0

Accuracy 
     GLM    CART      RF        SVM       NN        XGBL      XGBT     
GLM          0.002234 -0.004494 -0.001137  0.009040 -0.007840 -0.015693
CART 1.0000           -0.006728 -0.003371  0.006806 -0.010074 -0.017927
RF   1.0000 1.0000               0.003357  0.013534 -0.003346 -0.011199
SVM  1.0000 1.0000    1.0000               0.010177 -0.006703 -0.014556
NN   1.0000 1.0000    1.0000    1.0000              -0.016880 -0.024733
XGBL 1.0000 1.0000    1.0000    1.0000    0.1399              -0.007853
XGBT 1.0000 1.0000    1.0000    1.0000    0.3687    1.0000             

Kappa 
     GLM    CART      RF        SVM       NN        XGBL      XGBT     
GLM          0.003542 -0.006866 -0.003710  0.021340 -0.015408 -0.034500
CART 1.0000           -0.010408 -0.007252  0.017797 -0.018950 -0.038042
RF   1.0000 1.0000               0.003156  0.028206 -0.008542 -0.027634
SVM  1.0000 1.0000    1.0000               0.025050 -0.011698 -0.030790
NN   1.0000 1.0000    1.0000    1.0000              -0.036748 -0.055840
XGBL 1.0000 1.0000    1.0000    1.0000    0.1176              -0.019092
XGBT 1.0000 1.0000    1.0000    1.0000    0.3191    1.0000             

In [57]:
# Give a proper size to plots
options(repr.plot.width=4, repr.plot.height=5)

In [58]:
bwplot(difValues, layout = c(3, 1), cex=0.8)


We can see that all these models have very similar accuracies.

4.2 Check correlations among models

Let's check the correlation among models. We will use this information to decide whether we want to combine some models to build a stacked one.


In [59]:
model_cor <- modelCor(resamps)
model_cor


           GLM      CART        RF       SVM        NN      XGBL      XGBT
GLM  1.0000000 0.9810310 0.7835333 0.9701062 0.8736794 0.9266126 0.7585487
CART 0.9810310 1.0000000 0.6854624 0.9876399 0.8076806 0.8580292 0.7312503
RF   0.7835333 0.6854624 1.0000000 0.7069643 0.8535902 0.9313125 0.8111155
SVM  0.9701062 0.9876399 0.7069643 1.0000000 0.8353027 0.8787428 0.7721849
NN   0.8736794 0.8076806 0.8535902 0.8353027 1.0000000 0.9333186 0.7599231
XGBL 0.9266126 0.8580292 0.9313125 0.8787428 0.9333186 1.0000000 0.8692395
XGBT 0.7585487 0.7312503 0.8111155 0.7721849 0.7599231 0.8692395 1.0000000

In [60]:
corrplot(model_cor)


The models are highly correlated with each other, so assembling them would likely bring little benefit.
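
For reference, had the models been less correlated, a stacked ensemble could be built along these lines with caretEnsemble (a sketch, not run here; the base learners and the meta-model are illustrative choices):

In [ ]:
## Sketch of model stacking with caretEnsemble
stackControl <- trainControl(method = "cv", number = 10, 
                             savePredictions = "final", classProbs = TRUE)
model_list <- caretList(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + 
                          Child + Young + Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir, 
                        data = Train, 
                        trControl = stackControl, 
                        methodList = c("glm", "rf", "svmRadial"))
stack <- caretStack(model_list, method = "glm", 
                    trControl = trainControl(method = "cv", number = 10))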

5 Predictions on the test set


In [61]:
str(Test)


'data.frame':	418 obs. of  22 variables:
 $ PassengerId  : int  892 893 894 895 896 897 898 899 900 901 ...
 $ Pclass       : Ord.factor w/ 3 levels "1"<"2"<"3": 3 3 2 3 3 3 3 2 3 3 ...
 $ Name         : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 438 1298 1162 1303 1072 1259 178 949 896 994 ...
 $ Sex_female   : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 2 1 ...
 $ Sex_male     : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 1 2 1 2 ...
 $ Age          : num  34.5 47 62 27 22 14 30 26 18 21 ...
 $ SibSp        : int  0 1 0 0 1 0 0 1 0 2 ...
 $ Parch        : int  0 0 0 0 1 0 0 1 0 0 ...
 $ Ticket       : Factor w/ 929 levels "110152","110413",..: 781 841 726 776 252 869 787 159 745 520 ...
 $ Fare         : num  7.83 7 9.69 8.66 12.29 ...
 $ Cabin        : Factor w/ 186 levels "A10","A14","A16",..: NA NA NA NA NA NA NA NA NA NA ...
 $ Embarked_C   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
 $ Embarked_Q   : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 2 1 1 1 ...
 $ Embarked_S   : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 2 1 2 ...
 $ Title        : Factor w/ 18 levels "Capt","Col","Don",..: 13 14 13 13 14 13 10 13 14 13 ...
 $ Child        : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
 $ Young        : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 2 2 1 2 2 2 2 ...
 $ Title2_Master: Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
 $ Title2_Miss  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ Title2_Mr    : Factor w/ 2 levels "0","1": 2 1 2 2 1 1 1 2 1 2 ...
 $ Title2_Mrs   : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 2 1 ...
 $ Title2_Sir   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "dummies")=List of 3
  ..$ Sex     : int  4 5
  ..$ Embarked: int  12 13 14
  ..$ Title2  : int  18 19 20 21 22

5.1 GLM


In [101]:
pred_GLM_Test = predict(logmod4, newdata=Test)

In [102]:
pred_GLM_final <- cbind(Test$PassengerId, as.character(pred_GLM_Test))

In [103]:
colnames(pred_GLM_final) <- c("PassengerId", "Survived")
head(pred_GLM_final)


PassengerId Survived
        892       No
        893      Yes
        894       No
        895       No
        896       No
        897      Yes

In [104]:
pred_GLM_final <- as.data.frame(pred_GLM_final)

In [105]:
pred_GLM_final$Survived <- revalue(pred_GLM_final$Survived, c("No"="0", "Yes"="1"))

In [107]:
write.table(pred_GLM_final, file="pred_glm_final2.csv", sep=",", row.names=FALSE)

leaderboard score on Kaggle: 0.78469
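
The same steps repeat for each model below; a small helper (hypothetical name make_submission, assuming Test has no missing predictor values) could factor them out:

In [ ]:
## Hypothetical helper factoring out the predict/recode/export steps
make_submission <- function(model, newdata, file) {
    pred <- predict(model, newdata = newdata)
    out  <- data.frame(PassengerId = newdata$PassengerId,
                       Survived    = ifelse(pred == "Yes", "1", "0"))
    write.table(out, file = file, sep = ",", row.names = FALSE)
    invisible(out)
}
## e.g. make_submission(CART1b, Test, "pred_cart_final2.csv")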

5.2 CART


In [109]:
pred_CART_Test = predict(CART1b, newdata=Test)

In [110]:
pred_CART_final <- cbind(Test$PassengerId, as.character(pred_CART_Test))

In [111]:
colnames(pred_CART_final) <- c("PassengerId", "Survived")

In [112]:
pred_CART_final <- as.data.frame(pred_CART_final)

In [113]:
pred_CART_final$Survived <- revalue(pred_CART_final$Survived, c("No"="0", "Yes"="1"))

In [114]:
write.table(pred_CART_final, file="pred_cart_final2.csv", sep=",", row.names=FALSE)

leaderboard score on Kaggle: 0.79426

5.3 Random forest


In [115]:
pred_RF_Test = predict(RF1, newdata=Test)

In [116]:
pred_RF_final <- cbind(Test$PassengerId, as.character(pred_RF_Test))

In [117]:
colnames(pred_RF_final) <- c("PassengerId", "Survived")

In [118]:
pred_RF_final <- as.data.frame(pred_RF_final)

In [119]:
pred_RF_final$Survived <- revalue(pred_RF_final$Survived, c("No"="0", "Yes"="1"))

In [120]:
write.table(pred_RF_final, file="pred_rf_final2.csv", sep=",", row.names=FALSE)

leaderboard score on Kaggle: 0.77512

5.4 SVM


In [78]:
pred_SVM_Test = predict(svm3, newdata=Test)

In [79]:
pred_SVM_final <- cbind(Test$PassengerId, as.character(pred_SVM_Test))

In [80]:
colnames(pred_SVM_final) <- c("PassengerId", "Survived")

In [81]:
pred_SVM_final <- as.data.frame(pred_SVM_final)

In [85]:
pred_SVM_final$Survived <- revalue(pred_SVM_final$Survived, c("No"="0", "Yes"="1"))

In [87]:
write.table(pred_SVM_final, file="pred_svm_final2.csv", sep=",", row.names=FALSE)

leaderboard score on Kaggle: 0.78469

5.5 XGBoost (linear)


In [88]:
pred_XGBL_Test = predict(xgb1, newdata=Test)

In [89]:
pred_XGBL_final <- cbind(Test$PassengerId, as.character(pred_XGBL_Test))

In [90]:
colnames(pred_XGBL_final) <- c("PassengerId", "Survived")

In [91]:
pred_XGBL_final <- as.data.frame(pred_XGBL_final)

In [93]:
pred_XGBL_final$Survived <- revalue(pred_XGBL_final$Survived, c("No"="0", "Yes"="1"))

In [94]:
write.table(pred_XGBL_final, file="pred_xgbL_final2.csv", sep=",", row.names=FALSE)

leaderboard score on Kaggle: 0.79904

5.6 XGBoost (tree)


In [95]:
pred_XGBT_Test = predict(xgb2, newdata=Test)

In [96]:
pred_XGBT_final <- cbind(Test$PassengerId, as.character(pred_XGBT_Test))

In [97]:
colnames(pred_XGBT_final) <- c("PassengerId", "Survived")

In [98]:
pred_XGBT_final <- as.data.frame(pred_XGBT_final)

In [99]:
pred_XGBT_final$Survived <- revalue(pred_XGBT_final$Survived, c("No"="0", "Yes"="1"))

In [100]:
write.table(pred_XGBT_final, file="pred_xgbT_final2.csv", sep=",", row.names=FALSE)

leaderboard score on Kaggle: 0.73206

6 Going further

To improve the accuracy, we can try to:

  • avoid combining the Train and Test datasets at the beginning, and split our Train dataset into Train2 and Validation sets. In doing so, we would avoid contamination of the training data through data imputation, parameter optimization and feature selection. I chose not to do so for this first attempt because of the small sample size of the training set.
  • use separate predictive models for males and females given the difference in survival for each sex.
  • impute Deck.
  • create a variable Family (from Name and Ticket) and then use it to create the variables Single, SmallFamily and LargeFamily (a rough sketch based on SibSp and Parch follows below).
  • transform the data (e.g. log-transform skewed variables such as Fare).
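
As a rough sketch of the family-size idea (based on SibSp and Parch only, since grouping by Name and Ticket is more involved; the size thresholds are one common choice, not a rule):

In [ ]:
## Sketch: family-size features from SibSp and Parch
Train$FamilySize  <- Train$SibSp + Train$Parch + 1   # the passenger plus relatives aboard
Train$Single      <- Train$FamilySize == 1
Train$SmallFamily <- Train$FamilySize >= 2 & Train$FamilySize <= 4
Train$LargeFamily <- Train$FamilySize >= 5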