1 Exploratory data analysis
...1.1 Check data
...1.2 Check correlations among variables
...1.3 Visualize some potentially important variables
......1.3.1 Passenger's class
......1.3.2 Women and children first!
2 Feature engineering
....2.1 Create new features 1
......2.1.1 Create Title from Name
....2.2 Impute missing data
......2.2.1 Fare
......2.2.2 Embarked
......2.2.3 Age
....2.3 Create new features 2
......2.3.1 Create Child and Young from Age
......2.3.2 Create Title2 from Title
......2.3.3 Create dummy variables
3 Machine learning
...3.1 Logistic regression
......3.1.1 A first quick and dirty model
......3.1.2 Check learning curves
......3.1.3 Feature selection
.........3.1.3.1 Manual selection
.........3.1.3.2 Automatic selection
...3.2 Classification tree: CART (Recursive Partitioning)
...3.3 Random forest
...3.4 Support Vector Machine (SVM)
......3.4.1 With linear kernel
......3.4.2 With Gaussian kernel
...3.5 Neural network
...3.6 XGBoost
......3.6.1 Linear
......3.6.2 Tree
4 Compare the best models
...4.1 Compare the performances
...4.2 Check correlations among models
5 Predictions on the test set
...5.1 GLM
...5.2 CART
...5.3 Random forest
...5.4 SVM
...5.5 XGBoost (linear)
...5.6 XGBoost (tree)
6 Going further
Context
The aim of this project is to predict survivors in the Titanic dataset. I will follow typical steps of a machine learning project: exploratory data analysis, pre-processing (data imputation, scaling, feature engineering, feature selection...) and machine learning.
Data description
Variable | Description |
---|---|
Survived | Survival (0 = No, 1 = Yes) |
Pclass | Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd) |
Sex | Sex (male/female) |
Age | Age (in years) |
SibSp | # of siblings/spouses aboard the Titanic |
Parch | # of parents/children aboard the Titanic |
Ticket | Ticket number |
Fare | Passenger fare |
Cabin | Cabin number |
Embarked | Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |
In [2]:
# Load libraries
library(polycor) # hetcor()
library(ggplot2) # ggplot()
library(caret) # train(), trainControl(), resamples(), varImp()
library(mice) # mice()
library(lattice) # densityplot()
library(gridExtra) # grid.arrange()
library(dummies) # dummy.data.frame()
library(plyr) # revalue()
library(rpart) # rpart()
library(rpart.plot) # rpart.plot()
library(randomForest) # method = 'rf'
library(party) # method = 'cforest'
library(kernlab) # method = 'svmLinear' and method = 'svmRadial'
library(nnet) # method = 'nnet'
library(corrplot) # corrplot()
library(caretEnsemble)# caretList(), caretStack()
In [3]:
setwd("C:/Users/AurelieD/Documents/Boulot/0-Data scientist/Kaggle/Titanic")
In [4]:
## Load data (some cells of our table are empty, we have to specify to R to treat them as missing values (NA) using na.string)
Train <- read.csv("train.csv", na.strings=c(""))
Test <- read.csv("test.csv", na.strings=c(""))
In [91]:
str(Train)
In [92]:
str(Test)
First, let's take a look at the Train data.
In [93]:
## Check data
head(Train)
In [94]:
str(Train) # check factors' levels
In [5]:
# Specify proper classes to variables
Train$Survived <- factor(Train$Survived)
Train$Pclass <- factor(Train$Pclass, levels=c("1", "2", "3"), ordered=TRUE)
In [96]:
summary(Train) # check min/max/mean... of each variable
Checking the correlations among variables will give us a first idea of which variables are related to Survived and which predictors are redundant (collinear) with each other.
Since we have factors, ordered factors and numeric variables, we will use the hetcor() function of the 'polycor' package. It "computes a heterogenous correlation matrix, consisting of Pearson product-moment correlations between numeric variables, polyserial correlations between numeric and ordinal variables, and polychoric correlations between ordinal variables." (https://cran.r-project.org/web/packages/polycor/polycor.pdf). We will ignore the variables Cabin and Name for this preliminary exploration of the data since they contain many missing values and need some processing.
In [97]:
Train_sub <- subset(Train, select=c(Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked))
In [98]:
hetcor(Train_sub, use = "pairwise.complete.obs")$correlations
We observe that Survived is notably correlated with Sex and Pclass, and that Fare is highly correlated with Pclass — a collinearity we will come back to later.
In [6]:
# Give a proper size to plots
options(repr.plot.width=4, repr.plot.height=4)
In [7]:
# Prepare a nice theme (i.e. general appearance of graphs)
theme_new <- theme_classic() +
theme(plot.title = element_text(size=11, face="bold", hjust = 0.5), axis.text.x = element_text(size = 11))
In [101]:
ggplot(data=Train) +
geom_bar(aes(x = Survived, fill = Pclass), position = "fill") +
labs(title = "Distribution of classes\namong dead and safe passengers", y = "Proportion of passengers", x = "") +
scale_x_discrete(labels=c("Died","Survived")) +
scale_fill_brewer(palette="Blues") +
theme_new
In [102]:
ggplot(data=Train) +
geom_bar(aes(x = Pclass, fill = Survived), position = "fill") +
labs(title = "Distribution of dead and safe passengers\namong classes", y = "Proportion of passengers", x = "") +
scale_x_discrete(labels=c("1","2","3")) +
scale_fill_brewer(palette="Blues", labels=c("Died", "Survived"), name = "Survival") +
theme_new
We can see that most of the people who survived were in the first class. On the contrary, most of the people who died were in the 3rd class.
Similarly, here, we can see that most of the first class people survived whereas most of the 3rd class people died.
In [103]:
ggplot(data=Train) +
geom_bar(aes(x = Survived, fill = Sex), position = "fill") +
labs(title = "Distribution of classes\namong dead and safe passengers", y = "Proportion of passengers", x = "") +
scale_x_discrete(labels=c("Died","Survived")) +
scale_fill_brewer(palette="Blues") +
theme_new
Most dead people were men and most surviving people were women.
As expected (from our domain knowledge and our preliminary correlation test), Sex is a good predictor for the survival of passengers.
What about age? Children were supposed to board the lifeboats first.
In [104]:
ggplot(data=Train) +
geom_boxplot(aes(x = Survived, y = Age)) +
labs(title = "Distribution of ages\namong dead and safe passengers", x = "") +
scale_x_discrete(labels=c("Died","Survived")) +
theme_new
There seems to be only a slight difference in the distribution of ages between dead and surviving passengers. Although the medians are equal, the upper and lower quartiles are a bit lower for the surviving passengers, meaning that survivors tended to be slightly younger than those who died.
Let's use a histogram now to look at the details.
In [105]:
ggplot(data = Train, aes(x = Age, group = Survived, fill = Survived)) +
geom_histogram(position="dodge", binwidth = 5, boundary=15) +
labs(title = "Distribution of ages\namong dead and safe passengers") +
scale_fill_manual(labels = c("Died", "Survived"), name = "Survival", values = c("steelblue3", "lightblue")) +
theme_new
Now, we clearly see a difference for some ages: most children under 5 survived, whereas most young passengers between 15 and 30 died.
Our aim is to give as much information as we can to the model: the more useful information the model gets, the better it is likely to be.
First, let's combine the Train and Test datasets (while removing temporarily the Survived column of the Train dataset).
In [8]:
Survived <- Train$Survived # We will add it back to the train dataset after processing of the combined data set.
In [9]:
Train$Survived <- NULL
titanic = rbind(Train, Test)
In [108]:
str(titanic)
The name can provide some interesting information. It has the form: Last name, Title First name Middle name. For a married woman, her own first and maiden names are added in parentheses.
Let's retrieve the title, which may reflect the age as well as the social status of the passengers.
In [10]:
titanic$Title <- sub(".+,\\s", "", titanic$Name)
titanic$Title <- as.factor(sub("\\.\\s.+", "", titanic$Title))
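As a quick check on one name from the training set, the two substitutions work as follows (illustration only; this cell is not part of the pipeline):
In [ ]:
# Illustration on a single name: strip everything up to the comma, then everything after the title's period
example <- "Braund, Mr. Owen Harris"
sub(".+,\\s", "", example)                        # "Mr. Owen Harris"
sub("\\.\\s.+", "", sub(".+,\\s", "", example))   # "Mr"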
First, we can increase the number of observations used in the model by imputing the missing values.
What is the number of missing values in each variable?
In [110]:
nbNA <- function(x){sum(is.na(x))}
In [111]:
apply(titanic,2,nbNA)
In [112]:
hist(titanic$Fare, breaks=50)
In [113]:
summary(titanic$Fare)
Given that the distribution of Fare is highly skewed, the best option seems to be to impute the missing value with the median.
In [11]:
# Impute Fare with the median
titanic[is.na(titanic[,"Fare"]), "Fare"] <- median(titanic[,"Fare"], na.rm = TRUE)
For Embarked, there are only two passengers with missing values. Who are those passengers?
In [115]:
titanic[is.na(titanic$Embarked), ]
We can see that they are two women > 30 years old, sharing the same first class cabin. Let's see whether we can use this information to impute the missing data.
In [116]:
table(titanic[titanic$Sex == "female" & titanic$Pclass == "1" & titanic$Age > 30,"Embarked"])
Women sharing the same traits have mainly embarked at either Cherbourg (France) or Southampton (UK). Since the two passengers with missing data have an English name, we can assume that they embarked at Southampton.
In [12]:
titanic$Embarked[is.na(titanic$Embarked)] = "S"
In [118]:
# Percentage of missing values in Age
263/nrow(titanic)*100
Given the percentage of missing values for Age, we will impute this variable using Multivariate Imputation by Chained Equations (MICE).
First, let's check the missing pattern of Age.
In [119]:
# Are the missing ages equally distributed across male and female?
table(titanic[is.na(titanic$Age), "Sex"])
In [120]:
# Are the missing ages equally distributed across classes?
table(titanic[is.na(titanic$Age), "Pclass"])
We observe that the missing ages are not randomly distributed across sexes and classes, with more missing data for males and 3rd-class passengers. So we probably have a MAR (missing at random) pattern rather than an MCAR (missing completely at random) pattern.
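To quantify this, a quick sketch (not in the original notebook) of the proportion of missing ages within each class and each sex:
In [ ]:
# Proportion of missing Age values within each class and each sex
round(tapply(is.na(titanic$Age), titanic$Pclass, mean), 2)
round(tapply(is.na(titanic$Age), titanic$Sex, mean), 2)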
In [121]:
names(titanic)
Let's initialize the mice function parameters.
In [13]:
# Create a mids object containing the default settings
init <- mice(titanic[, !names(titanic) %in% c("PassengerId", "Name")], maxit=0, seed = 1)
In [14]:
# Selects the best predictors
new.pred <- quickpred(titanic[, !names(titanic) %in% c("PassengerId", "Name")])
In [15]:
# Define the methods
meth <- init$method
meth[c("Cabin")]="" # Skip a variable from imputation while this variable will still be used for prediction.
In [16]:
# Perform imputation (here, we create 10 imputed datasets, m = 10)
imputed_Data <- mice(titanic[, !names(titanic) %in% c("PassengerId", "Name")], method=meth, m=10, pred=new.pred, seed = 1, print=FALSE)
Let's inspect the distribution of original and imputed data
In [126]:
densityplot(imputed_Data)
The density of the imputed data for each imputed dataset is shown in magenta while the density of the observed data is shown in blue. We observe that the central tendencies of the imputed densities are relatively similar to those of the observed data.
In [127]:
# Give a proper size to plots
options(repr.plot.width=10, repr.plot.height=10)
In [128]:
for (i in 1:10){
namePlot <- paste("plot", i, sep="")
plot <- ggplot(titanic,aes(x=Age)) +
geom_density(data=data.frame(titanic$PassengerId, complete(imputed_Data,i)), alpha = 0.2, fill = "blue")+
geom_density(data=titanic, alpha = 0.2, fill = "Red")+
labs(title=paste("Age Distribution, imputation ", i, sep=""))+
labs(x="Age") +
theme_new
assign(namePlot, plot)
}
In [129]:
grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, plot7, plot8, plot9, plot10, nrow=4, ncol=3)
The fifth imputed dataset seems to be the best one, based on the density distribution. Let's use it.
In [17]:
titanic_imp <- complete(imputed_Data, 5)
titanic$Age <- titanic_imp$Age
The Cabin variable may also be of interest: it contains the deck (the deck letter followed by the cabin number), which may reflect some proximity to the lifeboats (deck A being the closest and deck G the farthest). However, this variable contains a majority of missing values (NA).
We could still try to impute it using the family names, ticket numbers and classes: 1) families are probably people sharing the same name and the same ticket number; 2) people from the same family were probably on the same deck; 3) first class had the top decks (A-E), second class D-F, and third class E-G.
However, given that the deck is highly correlated with Pclass, we will just leave it aside for now.
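For reference, extracting the deck letter would be a one-liner; a minimal sketch, not used further in this notebook:
In [ ]:
# Extract the deck letter (first character of Cabin); missing cabins stay NA (sketch only)
deck <- factor(substr(as.character(titanic$Cabin), 1, 1))
summary(deck)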
In [18]:
titanic$Child <- factor(titanic$Age <= 5)
titanic$Young <- factor(titanic$Age > 15 & titanic$Age <= 30)
In Title, we have many levels with only one or a few observations. This is not optimal for modeling since those levels may have near-zero variance. Let's merge the titles into a smaller number of levels, using the age and sex usually associated with each title.
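caret's nearZeroVar() can flag such quasi-constant columns; a hedged sketch applied to a one-hot encoding of Title (not part of the original pipeline), as an alternative way to spot the rare titles before merging them:
In [ ]:
# Flag rare Title levels via near-zero variance on a one-hot encoding (sketch only)
title_dummies <- model.matrix(~ Title - 1, data = titanic)
nzv <- nearZeroVar(title_dummies, saveMetrics = TRUE)
nzv[nzv$nzv, ]   # levels flagged as near-zero variance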
In [132]:
summary(titanic$Title)
In [133]:
titles <- as.character(unique(titanic$Title))
In [134]:
as.data.frame(cbind("Title" = titles,
"Female" = sapply(titles, function(x) nrow(titanic[titanic$Title == x & titanic$Sex == "female",])),
"Male" = sapply(titles, function(x) nrow(titanic[titanic$Title == x & titanic$Sex == "male",])),
"Minimum_Age" = sapply(titles, function(x) min(titanic[titanic$Title == x ,'Age'], na.rm = TRUE)),
"Maximum_Age" = sapply(titles, function(x) max(titanic[titanic$Title == x,'Age'], na.rm = TRUE))), row.names = F)
We can merge the titles into five levels: Master, Mr, Miss, Mrs, and Sir (this last level grouping the rare male titles such as Capt, Col, Don, Major, Rev, Jonkheer and male Dr).
In [19]:
titanic$Title2 <- titanic$Title
titanic[(titanic$Sex == "male" & titanic$Age <= 14.5),]$Title2 = "Master"
titanic[(titanic$Sex == "male" & titanic$Age > 14.5),]$Title2 = "Mr"
titanic[(titanic$Sex == "female" & titanic$Age <= 14),]$Title2 = "Miss"
titanic[(titanic$Sex == "female" & titanic$Age > 14),]$Title2 = "Mrs"
titanic[titanic$Title == "Capt"|
titanic$Title == "Col"|
titanic$Title == "Don"|
titanic$Title == "Major"|
titanic$Title == "Rev"|
titanic$Title == "Jonkheer"|
(titanic$Title == "Dr" & titanic$Sex == "male"),]$Title2 = "Sir"
titanic$Title2 <- factor(titanic$Title2)
In [136]:
summary(titanic$Title2)
Now, we have levels that are much more balanced.
Let's convert our factor variables into dummy variables. Having dummy variables is not absolutely necessary; however, it will let us keep or drop individual levels of a factor during feature selection later on.
In [20]:
titanic_dum <- dummy.data.frame(titanic, names= c("Sex","Embarked", "Title2") , sep = "_")
names(titanic_dum)
In [21]:
# Convert the dummy variables into factors
vars = c("Sex_female", "Sex_male", "Embarked_C", "Embarked_Q", "Embarked_S", "Title2_Master", "Title2_Miss", "Title2_Mr",
"Title2_Mrs", "Title2_Sir")
titanic_dum[,vars] = lapply(titanic_dum[,vars], function(x) as.factor(x))
In [139]:
str(titanic_dum)
Now, we can split again our data into the training and test sets.
In [22]:
Train = titanic_dum[1:891,]
Test = titanic_dum[892:1309,]
And add back the variable Survived to the training set.
In [23]:
Train$Survived <- as.factor(Survived)
Train$Survived <- as.factor(revalue(Train$Survived, c("0" = "No", "1"="Yes")))
In [24]:
# Let's remove variables that we won't use
Train$PassengerId <- NULL
Train$Cabin <- NULL
Train$Ticket <- NULL
Train$Name <- NULL
Train$Title <- NULL
We will use the train() function of the 'caret' package. It is a very convenient wrapper for modeling, allowing automatic preprocessing (e.g. scaling) and model tuning using resampling (cross-validation, bootstrap, etc.). It provides a uniform interface to the underlying functions, as well as a way to standardize common tasks. Thus, we will be able to compare various algorithms easily.
Note that we will set the same seed prior to each training. This ensures that the same resampling sets are used, which will come in handy when we compare the resampling profiles between models.
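An alternative, more explicit way to guarantee identical folds across models would be to create the folds once and pass them to trainControl() through its index argument; a minimal sketch (the notebook itself relies on set.seed(1) instead):
In [ ]:
# Sketch: build one set of folds and reuse it for every model via the index argument
set.seed(1)
folds <- createFolds(Train$Survived, k = 10, returnTrain = TRUE)
numFolds_fixed <- trainControl(method = "cv", number = 10, index = folds)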
In [143]:
names(Train)
In [25]:
## Define cross-validation experiment
numFolds = trainControl(method = "cv", number = 10) # method='cv' for cross-validation, number=10 for 10 folds
In [145]:
## Perform the training with cross-validation
set.seed(1)
logmod1 <- train(Survived ~ .-Sex_female -Embarked_C -Title2_Master, # We have to remove the "reference" level of dummy variables
data = Train,
method = "glm",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"))
In [68]:
summary(logmod1) # with Pclass as ordered factor
Note 1: lm() solves linear least squares directly via a QR decomposition, a direct optimization method (global-optimum guarantee and good computational performance), while glm() uses iteratively re-weighted least squares, which repeatedly solves a weighted least-squares problem. These direct approaches are more efficient than gradient descent here, but least squares remains very sensitive to outliers.
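As a small illustration (on simulated data, not the Titanic set), the coefficients returned by lm() match those obtained by solving the least-squares problem directly through a QR decomposition:
In [ ]:
# Least squares via QR decomposition vs lm(), on simulated data (illustration only)
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
X <- cbind(1, x)                    # design matrix with intercept
cbind(QR = qr.coef(qr(X), y),       # least-squares solution via QR
      lm = coef(lm(y ~ x)))         # identical up to numerical precision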
Note 2: We have to normalize our data so that the independent variables are on similar scales. By using the parameter preProcess=c("center", "scale"), the normalization will be computed on the 9 training folds only (not on the whole original training set), which limits accidental information leakage from the held-out fold.
Note 3: Since Pclass is an ordered factor, the L (linear) term measures the linear trend and Q the quadratic trend (which is close to zero in this case because the pattern is essentially linear).
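These L and Q terms come from the polynomial contrasts that R uses by default for ordered factors; for a 3-level ordered factor like Pclass they look like this:
In [ ]:
# Default polynomial contrasts for a 3-level ordered factor (columns .L and .Q)
contr.poly(3)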
As expected, the class, being a child, and the title (which encodes the sex) have a significant impact on the survival of passengers. We observe that the probability of survival is significantly higher if the passenger is a child under 5 and lower if the passenger is in 3rd class and/or has a male title (i.e. Mr or Sir).
Let's see what the accuracy of the model is (i.e. the mean accuracy across the cross-validated folds).
In [146]:
logmod1$results$Accuracy
That was a really quick and dirty model and the performance is already good! But I think we can do better.
Now, how can we improve our model? There are several options (add features, remove features, add regularization, get more data), but before rushing into any of them (which is likely to waste our time), let's be smart and check what is wrong with our model by looking at its learning curves.
In [152]:
set.seed(1)
lc <- learning_curve_dat(Train[, !names(Train) %in% c("Sex_female", "Embarked_C", "Title2_Master")], outcome = "Survived",
proportion = (1:10)/10,
test_prop = 0,
method = "glm",
trControl = numFolds,
na.action = na.omit,
verbose = FALSE)
In [153]:
# Give a proper size to plots
options(repr.plot.width=5, repr.plot.height=4)
In [154]:
ggplot(lc, aes(x = Training_Size, y = Accuracy, color = Data)) +
geom_smooth(method = loess, span = .8) +
theme_classic()
This graph shows the learning curves of the model (accuracy vs. training set size). "Training" and "Resampling" represent the performance of the model on the training set and on the resampling set respectively, as the size of the training set increases. (The resampling set is the set that was held out as a "test" set during the cross-validation process.)
We observe that the performance on the training set decreases as the training set grows: the more data we have, the less likely the model is to fit it exactly. Conversely, the performance on the resampling set increases with the training set size: the more data we train the model with, the better it generalizes to unseen data.
The performance on the resampling set is slightly lower than on the training set. It seems that we have a (mild) high-variance problem, even if the learning curve is not entirely clear here... In other words, our model seems to be overfitting. Let's remove some features.
We can perform feature selection, to keep only relevant predictors, either manually or automatically.
As written by Guyon & Elisseeff (2003) (http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf),
There are many potential benefits of feature selection:
- facilitating data visualization and data understanding,
- reducing the measurement and storage requirements,
- reducing training and utilization times,
- defying the curse of dimensionality to improve prediction performance.
First, let's check the warning that appeared when we ran our quick and dirty model. It seems that we have a problem related to rank-deficiency. It may stem from many origins, one of which may be that we have some collinear variables. We know that Fare is highly correlated with Pclass; however, removing Fare does not remove the warning.
We have created Title2 by taking into account Sex and Age, so Sex and Title2 are likely highly correlated.
In [161]:
# For example:
hetcor(Train[,c("Sex_male", "Title2_Mr")])$correlations
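Beyond pairwise correlations, caret's findLinearCombos() can hunt for exact linear dependencies in the design matrix, which is another way to track down the rank-deficiency warning; a hedged sketch, not run in the original analysis:
In [ ]:
# Look for exact linear dependencies among the predictors (sketch only)
X <- model.matrix(Survived ~ . -Sex_female -Embarked_C -Title2_Master, data = Train)
combos <- findLinearCombos(X)
colnames(X)[combos$remove]   # columns suggested for removal to restore full rank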
Let's try to remove Sex_male then.
In [155]:
names(Train)
In [156]:
## Perform the cross validation
set.seed(1)
logmod2 <- train(Survived ~ .-Sex_female -Embarked_C -Title2_Master -Sex_male,
data = Train,
method = "glm",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"))
In [157]:
summary(logmod2)
Let's see what is the accuracy of the model.
In [158]:
logmod2$results$Accuracy
Removing Sex_male did not change the accuracy of our model.
Let's have a look at the importance of each variable.
In [163]:
# absolute value of the t-statistic
varImp(logmod2, scale=FALSE)
We can remove some of the least important variables: Title2_Miss, Title2_Mrs, Embarked_S and Embarked_Q.
In [164]:
names(Train)
In [189]:
## Perform the cross validation
set.seed(1)
logmod3 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Child + Young + Title2_Mr + Title2_Sir,
data = Train,
method = "glm",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"))
In [190]:
logmod3$results$Accuracy
And, now, let's remove Fare which is among the least important variables and collinear with Pclass.
In [34]:
## Perform the cross validation
set.seed(1)
logmod4 <- train(Survived ~ Pclass + Age + SibSp + Parch + Child + Young + Title2_Mr + Title2_Sir,
data = Train,
method = "glm",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"))
In [35]:
logmod4$results$Accuracy
Still better!
Another way to perform feature selection is to use stepwise feature selection based on AIC. Let's try it starting with all the variables.
In [174]:
names(Train)
In [184]:
## Perform the training with cross-validation
set.seed(1)
AIC_glm <- train(Survived ~ .-Sex_female -Title2_Master -Embarked_C -Sex_male,
data = Train,
method = "glmStepAIC",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
trace=FALSE)
In [185]:
summary(AIC_glm)
In [186]:
AIC_glm$results$Accuracy
We observe that performing stepwise selection slightly decreased the performance of our model. Actually, feature selection does not necessarily improve the accuracy of a model. Its aim is primarily to decrease the model's complexity. Since our aim is to have the best accuracy, we will keep all the variables selected through the manual selection.
In [191]:
names(Train)
In [26]:
## Define the cross-validation experiment
numFolds = trainControl(method = "cv", number = 10 ) # method='cv' for cross-validation, number=10 for 10 folds
cpGrid = expand.grid( .cp = seq(0.01,0.5,0.01))
For this model, we will not use the 'formula' interface, so that our tree will be clearer when we plot it.
In [27]:
DV = Train[, "Survived"]
pred = Train[, c('Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_Q', 'Embarked_S', 'Child',
'Young', 'Title2_Miss', 'Title2_Mr', 'Title2_Mrs', 'Title2_Sir')]
In [195]:
## Perform the cross validation
set.seed(1)
CART1 <- train(pred, DV, method = "rpart", trControl = numFolds, na.action = na.omit, tuneGrid = cpGrid)
In [196]:
CART1$bestTune
In [197]:
# Give a proper size to plots
options(repr.plot.width=4, repr.plot.height=4)
In [198]:
plot(CART1)
In [199]:
# Give a proper size to plots
options(repr.plot.width=6, repr.plot.height=6)
In [200]:
rpart.plot(CART1$finalModel, type=0, extra=106, under=TRUE, tweak=0.8)
The plot shows the probability of survival and the percentage of observations in each leaf.
The title was important: a passenger who was a 'Mr' was more likely to die, and in the 2nd and 1st class, a 'Sir' was also more likely to die. These results mirror what we said earlier about the lower probability of survival in men. As expected, the class of the passenger had an impact on their survival: first and second class passengers were more likely to survive.
Then, among passengers from the 3rd class, the Fare was of interest: interestingly, having paid at least 23 made a person more likely to die...
In [201]:
# Best accuracy
max(CART1$results$Accuracy)
This CART model does not have a better performance than our best logistic regression model.
Random forest = tree bagging + feature sampling: each tree is grown on a bootstrap sample of the observations, and only a random subset of the predictors (the mtry parameter tuned below) is considered at each split.
In [28]:
## Define the cross-validation experiment
mtryGrid = expand.grid( .mtry = 1:10)
numFolds = trainControl(method = "cv", number = 10 ) # method='cv' for cross-validation, number=10 for 10 folds
In [224]:
names(Train)
In [37]:
## Perform the cross validation
set.seed(1)
RF1 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young +
Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "rf",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = mtryGrid,
nodesize = 15,
importance = TRUE)
In [225]:
RF1$bestTune
In [226]:
# Best accuracy
max(RF1$results$Accuracy)
Our first random forest model is better than our best log and CART models.
However, we lose some of the interpretability that comes with CART.
Actually, we can still compute metrics that give us insight into which variables are important. One metric that we can look at is the number of times, aggregated over all of the trees in the random forest model, that a certain variable is selected for a split (i.e. selection frequency).
In [227]:
# Give a proper size to plots
options(repr.plot.width=4, repr.plot.height=4)
In [228]:
vu = varUsed(RF1$finalModel, count=TRUE)
vusorted = sort(vu, decreasing = FALSE, index.return = TRUE)
dotchart(vusorted$x, RF1$finalModel$xNames[vusorted$ix], main = "RF1", xlab = "Selection frequency", cex=0.7)
Here, Fare and Age appear to be important variables.
Another way to see the importance of a variable is to look at the reduction in impurity (i.e. how homogeneous each bucket or leaf of the tree is) after that variable has been selected for splitting in all of the trees in the forest (i.e. Gini importance). Similarly, we can look at the mean decrease in accuracy (i.e. permutation importance).
In [229]:
# Give a proper size to plots
options(repr.plot.width=8, repr.plot.height=4)
In [230]:
par(mfrow=c(1,2))
# Compute the reduction in impurity (i.e. Gini importance)
Gini <- varImp(RF1, type=2, scale=FALSE)
Ginisorted = sort(Gini$importance$Overall, decreasing = FALSE, index.return = TRUE)
dotchart(Ginisorted$x, rownames(Gini$importance)[Ginisorted$ix],
main = "RF1",
xlab = "Gini importance",
cex=0.7)
# Compute the reduction in accuracy (i.e. permutation accuracy importance)
Perm <- varImp(RF1, type=1, scale=FALSE)
Permsorted = sort(Perm$importance$Overall, decreasing = FALSE, index.return = TRUE)
dotchart(Permsorted$x, rownames(Perm$importance)[Permsorted$ix],
main = "RF1",
xlab = "Permutation accuracy importance",
cex=0.7)
Again, Fare and Age appear among the top variables, but we can see that Title2_Mr and Pclass are also important.
WARNING! We have to be careful here: isn't it a strange coincidence that all the continuous variables are in the top 5 variables used for splitting?...
Checking the literature, we can find an explanation to this behaviour:
In standard tree algorithms, variable selection is biased in favor of variables offering many potential cut-points, so that variables with many categories and continuous variables are artificially preferred.
Therefore, authors suggested that
The randomForest function should not be used when the potential predictor variables vary in their scale of measurement or their number of categories.
In [231]:
pred3 <- subset(Train, select=c(Pclass, Age, SibSp, Parch, Fare, Embarked_Q, Embarked_S, Child, Young,
Title2_Miss, Title2_Mr, Title2_Mrs, Title2_Sir))
In [232]:
# Number of "levels" (i.e. potential cut-points) for each variable
sort(apply(pred3, 2, function(x) length(unique(x))), decreasing = TRUE)
Given our dataset, we have to use an unbiased tree algorithm (i.e. the ctree() and cforest() functions of the 'party' package). These use subsampling without replacement, which provides more reliable variable importance measures.
In [233]:
names(Train)
In [234]:
## Perform the cross validation
set.seed(1)
RF2 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young +
Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "cforest",
trControl = numFolds,
preProcess=c("center", "scale"),
tuneGrid = mtryGrid,
controls = cforest_unbiased(ntree=500))
In [235]:
RF2$bestTune
In [236]:
import <- varimp(RF2$finalModel)
In [237]:
# Give a proper size to plots
options(repr.plot.width=4, repr.plot.height=4)
In [238]:
importSorted = sort(import, decreasing = FALSE, index.return = TRUE)
dotchart(importSorted$x,
main = "RF2",
xlab = "Permutation accuracy importance",
cex=0.7)
Retrieving the variable selection frequency is not as straightforward with 'cforest()' as with 'randomForest()'. Let's compute it ourselves.
The 'cutpoints_list()' function returns the list of all cutpoints used for a given variable in a given tree built with the 'party' package (it can be found at: https://github.com/cran/party/blob/master/R/varimp.R). The length of this list therefore corresponds to the number of times the variable has been used for splitting.
In [239]:
cutpoints_list <- function(tree, variableID) {
cutp <- function(node) {
if (node[[4]]) return(NULL)
cp <- NULL
if (node[[5]][[1]] == variableID)
cp <- node[[5]][[3]]
nl <- cutp(node[[8]])
nr <- cutp(node[[9]])
return(c(cp, nl, nr))
}
return(cutp(tree))
}
Now we can use this function to compute the frequency selection of each variable.
In [240]:
# Retrieve the variables' names
inputs <- RF2$finalModel@data@get("input")
xnames <- colnames(inputs)
# Number of variables
nbvar <- length(xnames)
# Number of trees in the forest
nbtree <- length(RF2$finalModel@ensemble)
# Prepare container
nbtimes <- data.frame()
# Compute the number of times a given variable has been used to split a given tree
for (i in 1:nbvar) { # for each variable
for (j in 1:nbtree) { # for each tree in the forest
cutpts <- cutpoints_list(RF2$finalModel@ensemble[[j]], i)
nbtimes[j,i] <- length(cutpts)
}
}
colnames(nbtimes) <- xnames
# Frequency selection of each variable
nbtimesTot <- apply(nbtimes, 2, sum)
In [241]:
nbtimesTot
In [242]:
nbtimesSorted = sort(nbtimesTot, decreasing = FALSE, index.return = TRUE)
dotchart(nbtimesSorted$x,
main = "RF2",
xlab = "Selection frequency",
cex=0.7)
Let's check our randomForest and cforest models together.
In [243]:
# Give a proper size to plots
options(repr.plot.width=8, repr.plot.height=4)
In [244]:
par(mfrow=c(1,2))
dotchart(vusorted$x,
RF1$finalModel$xNames[vusorted$ix],
main = "RF1 (randomForest)",
xlab = "Selection frequency",
cex=0.7)
dotchart(nbtimesSorted$x,
main = "RF2 (cforest)",
xlab = "Selection frequency",
cex=0.7)
The use of 'cforest()' has clearly decreased the difference between Age/Fare and other variables such as Pclass or Embarked_S.
We can see that Fare and Age are still the most frequently used variables for splitting, although their contribution to the Gini importance and the permutation accuracy importance is lower than that of other variables.
In [245]:
# Best accuracy
max(RF2$results$Accuracy)
The accuracy of this model is lower than the random forest model, and even lower than the CART and log models...
In [29]:
## Define the cross-validation experiment
costGrid = expand.grid( .C = c(0.01, 0.03, 0.1, 0.3, 1, 3, 10))
numFolds = trainControl(method = "cv", number = 10 )
In [247]:
names(Train)
In [248]:
## Perform the cross validation
set.seed(1)
svm1 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young +
Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "svmLinear",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = costGrid)
In [249]:
svm1$bestTune
In [250]:
# Give a proper size to plots
options(repr.plot.width=4, repr.plot.height=4)
In [251]:
max(svm1$results$Accuracy)
In [30]:
## Define the cross-validation experiment
costGrid = expand.grid( .C = c(0.01, 0.03, 0.1, 0.3, 1, 3, 10), .sigma = c(0.01, 0.03, 0.1, 0.3, 1, 3, 10))
numFolds = trainControl(method = "cv", number = 10)
In [253]:
## Perform the cross validation
set.seed(1)
svm2 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young +
Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "svmRadial",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = costGrid)
In [254]:
svm2$bestTune
In [255]:
# Give a proper size to plots
options(repr.plot.width=6, repr.plot.height=4)
In [256]:
plot(svm2)
In [257]:
max(svm2$results$Accuracy)
This is better than CART but not better than our log and random forest models.
Even if SVM already uses regularisation to avoid overfitting (regularisation parameter C), let's remove some features that were defined as the least important ones by the random forest (i.e. Title2_Miss, Child, Title2_Mrs, Embarked_Q).
In [38]:
## Perform the cross validation
set.seed(1)
svm3 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Child + Title2_Miss +
Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "svmRadial",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = costGrid)
In [269]:
## Perform the cross validation
set.seed(1)
svm3 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_S +
Title2_Mr + Title2_Sir,
data = Train,
method = "svmRadial",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = costGrid)
In [270]:
svm3$bestTune
In [271]:
max(svm3$results$Accuracy)
Now, we reach the same performance as our random forest model.
In [31]:
## Define the cross-validation experiment
layerGrid = expand.grid( .size = seq(5, 105, by=10), .decay = c(0, 10^seq(-3, 0, by=1)))
numFolds = trainControl(method = "cv", number = 10 )
In [ ]:
## Perform the cross validation
set.seed(1)
nn1 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young +
Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "nnet",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = layerGrid,
MaxNWts=1900,
trace=FALSE)
In [ ]:
nn1$bestTune
In [ ]:
max(nn1$results$Accuracy, na.rm=TRUE)
Neural networks are very sensitive to multicollinearity. Let's remove Fare, which is highly correlated with Pclass.
In [40]:
## Perform the cross validation
set.seed(1)
nn2 <- train(Survived ~ Pclass + Age + SibSp + Parch + Embarked_Q + Embarked_S + Child + Young +
Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "nnet",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = layerGrid,
MaxNWts=1700,
trace=FALSE)
In [229]:
nn2$bestTune
In [230]:
max(nn2$results$Accuracy, na.rm=TRUE)
Now, let's remove some features that were defined as the least important ones by the random forest (i.e. Title2_Miss, Child, Title2_Mrs, Embarked_Q).
In [234]:
## Perform the cross validation
set.seed(1)
nn3 <- train(Survived ~ Pclass + Age + SibSp + Parch + Embarked_S + Young +
Title2_Mr + Title2_Sir,
data = Train,
method = "nnet",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = layerGrid,
MaxNWts=1900,
trace=FALSE)
In [235]:
nn3$bestTune
In [236]:
max(nn3$results$Accuracy, na.rm=TRUE)
That's not better. Let's keep them in the model.
In [41]:
## Define the cross-validation experiment
tuneGridL = expand.grid(.nrounds = 1:3,
.lambda = c(0.01, 0.03, 0.1, 0.3, 1, 3, 10),
.alpha = c(0.01, 0.03, 0.1, 0.3, 1, 3, 10), #1e-5, 1e-2, 0.1, 1, 100
.eta = c(0.01, 0.21, 0.05)) # 0.01-0.2
numFolds = trainControl(method = "cv", number = 10 )
In [42]:
## Perform the cross validation
set.seed(1)
xgb1 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young +
Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "xgbLinear",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = tuneGridL)
In [43]:
xgb1$bestTune
In [44]:
max(xgb1$results$Accuracy)
In [45]:
## Define the cross-validation experiment
tuneGridT = expand.grid(.nrounds = 1:3,
.gamma = c(0.01, 0.03, 0.1, 0.3, 1),
.max_depth = seq(3,10,2),
.eta = c(0.01, 0.21, 0.05),
.colsample_bytree = c(0.6, 0.7, 0.8, 0.9, 1),
.min_child_weight = seq(1,10,2),
.subsample = c(0.6, 0.7, 0.8, 0.9, 1))
numFolds = trainControl(method = "cv", number = 10 )
In [46]:
## Perform the cross validation
set.seed(1)
xgb2 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young +
Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "xgbTree",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = tuneGridT)
In [47]:
xgb2$bestTune
In [48]:
max(xgb2$results$Accuracy)
First, we have to run the CART model again (as CART1b), this time using the formula interface and the dummy-encoded Train data frame, so that all models are fit on the same training dataset.
In [36]:
set.seed(1)
CART1b <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young +
Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "rpart",
trControl = numFolds,
na.action = na.omit,
tuneGrid = cpGrid)
In [49]:
# Collect the resampling results
resamps <- resamples(list(GLM = logmod4,
CART = CART1b,
RF = RF1,
SVM = svm3,
NN = nn2,
XGBL = xgb1,
XGBT = xgb2))
summary(resamps)
Let's plot the results
In [50]:
# Give a proper size to plots
options(repr.plot.width=6, repr.plot.height=4)
In [51]:
bwplot(resamps, layout = c(2, 1))
In [243]:
dotplot(resamps, metric = "Accuracy")
In [52]:
# Give a proper size to plots
options(repr.plot.width=5, repr.plot.height=5)
In [53]:
splom(resamps, varname.cex=0.8, varname.fontface="bold", axis.text.cex=0.4)
Since models are fit on the same versions of the training data, it makes sense to make inferences on the differences between models. In this way we reduce the within-resample correlation that may exist. We can compute the differences, then use a simple t-test to evaluate the null hypothesis that there is no difference between models. (Max Kuhn)
In [54]:
difValues <- diff(resamps)
difValues
summary(difValues)
In [57]:
# Give a proper size to plots
options(repr.plot.width=4, repr.plot.height=5)
In [58]:
bwplot(difValues, layout = c(3, 1), cex=0.8)
We can see that all those models have a very similar accuracy.
Let's check the correlation among models. We will use this information to decide whether we want to combine some models to build a stacked one.
In [59]:
model_cor <- modelCor(resamps)
model_cor
In [60]:
corrplot(model_cor)
The models are highly correlated with each other, so stacking them would likely bring little benefit.
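For completeness, the 'caretEnsemble' package loaded at the top is the natural tool if we did want to try stacking; a hedged sketch of what that could look like (not pursued here, given the high correlations):
In [ ]:
# Sketch of a stacked model with caretEnsemble (not pursued here)
stackCtrl <- trainControl(method = "cv", number = 10, savePredictions = "final", classProbs = TRUE)
set.seed(1)
model_list <- caretList(Survived ~ Pclass + Age + SibSp + Parch + Fare + Title2_Mr + Title2_Sir,
                        data = Train,
                        trControl = stackCtrl,
                        methodList = c("glm", "rpart", "rf"))
stack_glm <- caretStack(model_list, method = "glm",
                        trControl = trainControl(method = "cv", number = 5))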
In [61]:
str(Test)
In [101]:
pred_GLM_Test = predict(logmod4, newdata=Test)
In [102]:
pred_GLM_final <- cbind(Test$PassengerId, as.character(pred_GLM_Test))
In [103]:
colnames(pred_GLM_final) <- c("PassengerId", "Survived")
head(pred_GLM_final)
In [104]:
pred_GLM_final <- as.data.frame(pred_GLM_final)
In [105]:
pred_GLM_final$Survived <- revalue(pred_GLM_final$Survived, c("No"="0", "Yes"="1"))
In [107]:
write.table(pred_GLM_final, file="pred_glm_final2.csv", sep=",", row.names=FALSE)
leaderboard score on Kaggle: 0.78469
In [109]:
pred_CART_Test = predict(CART1b, newdata=Test)
In [110]:
pred_CART_final <- cbind(Test$PassengerId, as.character(pred_CART_Test))
In [111]:
colnames(pred_CART_final) <- c("PassengerId", "Survived")
In [112]:
pred_CART_final <- as.data.frame(pred_CART_final)
In [113]:
pred_CART_final$Survived <- revalue(pred_CART_final$Survived, c("No"="0", "Yes"="1"))
In [114]:
write.table(pred_CART_final, file="pred_cart_final2.csv", sep=",", row.names=FALSE)
leaderboard score on Kaggle: 0.79426
In [115]:
pred_RF_Test = predict(RF1, newdata=Test)
In [116]:
pred_RF_final <- cbind(Test$PassengerId, as.character(pred_RF_Test))
In [117]:
colnames(pred_RF_final) <- c("PassengerId", "Survived")
In [118]:
pred_RF_final <- as.data.frame(pred_RF_final)
In [119]:
pred_RF_final$Survived <- revalue(pred_RF_final$Survived, c("No"="0", "Yes"="1"))
In [120]:
write.table(pred_RF_final, file="pred_rf_final2.csv", sep=",", row.names=FALSE)
leaderboard score on Kaggle: 0.77512
In [78]:
pred_SVM_Test = predict(svm3, newdata=Test)
In [79]:
pred_SVM_final <- cbind(Test$PassengerId, as.character(pred_SVM_Test))
In [80]:
colnames(pred_SVM_final) <- c("PassengerId", "Survived")
In [81]:
pred_SVM_final <- as.data.frame(pred_SVM_final)
In [85]:
pred_SVM_final$Survived <- revalue(pred_SVM_final$Survived, c("No"="0", "Yes"="1"))
In [87]:
write.table(pred_SVM_final, file="pred_svm_final2.csv", sep=",", row.names=FALSE)
leaderboard score on Kaggle: 0.78469
In [88]:
pred_XGBL_Test = predict(xgb1, newdata=Test)
In [89]:
pred_XGBL_final <- cbind(Test$PassengerId, as.character(pred_XGBL_Test))
In [90]:
colnames(pred_XGBL_final) <- c("PassengerId", "Survived")
In [91]:
pred_XGBL_final <- as.data.frame(pred_XGBL_final)
In [93]:
pred_XGBL_final$Survived <- revalue(pred_XGBL_final$Survived, c("No"="0", "Yes"="1"))
In [94]:
write.table(pred_XGBL_final, file="pred_xgbL_final2.csv", sep=",", row.names=FALSE)
leaderboard score on Kaggle: 0.79904
In [95]:
pred_XGBT_Test = predict(xgb2, newdata=Test)
In [96]:
pred_XGBT_final <- cbind(Test$PassengerId, as.character(pred_XGBT_Test))
In [97]:
colnames(pred_XGBT_final) <- c("PassengerId", "Survived")
In [98]:
pred_XGBT_final <- as.data.frame(pred_XGBT_final)
In [99]:
pred_XGBT_final$Survived <- revalue(pred_XGBT_final$Survived, c("No"="0", "Yes"="1"))
In [100]:
write.table(pred_XGBT_final, file="pred_xgbT_final2.csv", sep=",", row.names=FALSE)
leaderboard score on Kaggle: 0.73206
To improve the accuracy, we could try to: engineer more features (for instance, extract and impute the deck from Cabin), explore finer hyperparameter grids, or revisit model stacking with less correlated base learners.