1 Exploratory data analysis
...1.1 Check data
...1.2 Check correlations among variables
...1.3 Visualize some potentially important variables
......1.3.1 Passenger's class
......1.3.2 Women and children first!
2 Feature engineering
....2.1 Create new features 1
......2.1.1 Create Title from Name
....2.2 Impute missing data
......2.2.1 Fare
......2.2.2 Embarked
......2.2.3 Age
....2.3 Create new features 2
......2.3.1 Create Child and Young from Age
......2.3.2 Create Title2 from Title
......2.3.3 Create dummy variables
3 Machine learning
...3.1 Logistic regression
......3.1.1 A first quick and dirty model
......3.1.2 Check learning curves
......3.1.3 Feature selection
.........3.1.3.1 Manual selection
.........3.1.3.2 Automatic selection
...3.2 Classification tree: CART (Recursive Partitioning)
...3.3 Random forest
...3.4 Support Vector Machine (SVM)
......3.4.1 With linear kernel
......3.4.2 With Gaussian kernel
...3.5 Neural network
...3.6 XGBoost
......3.6.1 Linear
......3.6.2 Tree
4 Compare the best models
...4.1 Compare the performances
...4.2 Check correlations among models
5 Predictions on the test set
...5.1 GLM
...5.2 CART
...5.3 Random forest
...5.4 SVM
...5.5 XGBoost (linear)
...5.6 XGBoost (tree)
6 Going further
Context
The aim of this project is to predict survivors in the Titanic dataset. I will follow typical steps of a machine learning project: exploratory data analysis, pre-processing (data imputation, scaling, feature engineering, feature selection...) and machine learning.
Data description
Variable | Description |
---|---|
Survived | Survival (0 = No, 1 = Yes) |
Pclass | Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd) |
Sex | Sex (male/female) |
Age | Age (in years) |
SibSp | # of siblings/spouses aboard the Titanic |
Parch | # of parents/children aboard the Titanic |
Ticket | Ticket number |
Fare | Passenger fare |
Cabin | Cabin number |
Embarked | Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |
In [2]:
# Load libraries
library(polycor) # hetcor()
library(ggplot2) # ggplot()
library(caret) # train(), trainControl(), resamples(), varImp()
library(mice) # mice()
library(lattice) # densityplot()
library(gridExtra) # grid.arrange()
library(dummies) # dummy.data.frame()
library(plyr) # revalue()
library(rpart) # rpart()
library(rpart.plot) # rpart.plot()
library(randomForest) # method = 'rf'
library(party) # method = 'cforest'
library(kernlab) # method = 'svmLinear' and method = 'svmRadial'
library(nnet) # method = 'nnet'
library(corrplot) # corrplot()
library(caretEnsemble)# caretList(), caretStack()
In [3]:
setwd("C:/Users/AurelieD/Documents/Boulot/0-Data scientist/Kaggle/Titanic")
In [4]:
## Load data (some cells of our table are empty, we have to specify to R to treat them as missing values (NA) using na.string)
Train <- read.csv("train.csv", na.strings=c(""))
Test <- read.csv("test.csv", na.strings=c(""))
In [91]:
str(Train)
In [92]:
str(Test)
First, let's take a look at the Train data.
In [93]:
## Check data
head(Train)
In [94]:
str(Train) # check factors' levels
In [5]:
# Specify proper classes to variables
Train$Survived <- factor(Train$Survived)
Train$Pclass <- factor(Train$Pclass, levels=c("1", "2", "3"), ordered=TRUE)
In [96]:
summary(Train) # check min/max/mean... of each variable
Checking the correlations among variables will give us a first idea of which variables are related to Survived and which predictors are redundant (collinear) with each other.
Since we have factors, ordered factors and numeric variables, we will use the hetcor() function of the 'polycor' package. It "computes a heterogenous correlation matrix, consisting of Pearson product-moment correlations between numeric variables, polyserial correlations between numeric and ordinal variables, and polychoric correlations between ordinal variables." (https://cran.r-project.org/web/packages/polycor/polycor.pdf). We will ignore the variables Cabin and Name for this preliminary exploration of the data since they contain many missing values and need some processing.
In [97]:
Train_sub <- subset(Train, select=c(Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked))
In [98]:
hetcor(Train_sub, use = "pairwise.complete.obs")$correlations
We observe that Survived is notably correlated with Sex and Pclass, and that Fare is highly correlated with Pclass — a collinearity we will come back to later.
In [6]:
# Give a proper size to plots
options(repr.plot.width=4, repr.plot.height=4)
In [7]:
# Prepare a nice theme (i.e. general appearance of graphs)
theme_new <- theme_classic() +
theme(plot.title = element_text(size=11, face="bold", hjust = 0.5), axis.text.x = element_text(size = 11))
In [101]:
ggplot(data=Train) +
geom_bar(aes(x = Survived, fill = Pclass), position = "fill") +
labs(title = "Distribution of classes\namong dead and safe passengers", y = "Proportion of passengers", x = "") +
scale_x_discrete(labels=c("Died","Survived")) +
scale_fill_brewer(palette="Blues") +
theme_new
In [102]:
ggplot(data=Train) +
geom_bar(aes(x = Pclass, fill = Survived), position = "fill") +
labs(title = "Distribution of dead and safe passengers\namong classes", y = "Proportion of passengers", x = "") +
scale_x_discrete(labels=c("1","2","3")) +
scale_fill_brewer(palette="Blues", labels=c("Died", "Survived"), name = "Survival") +
theme_new
We can see that most of the people who survived were in the first class. On the contrary, most of the people who died were in the 3rd class.
Similarly, here, we can see that most of the first class people survived whereas most of the 3rd class people died.
In [103]:
ggplot(data=Train) +
geom_bar(aes(x = Survived, fill = Sex), position = "fill") +
labs(title = "Distribution of classes\namong dead and safe passengers", y = "Proportion of passengers", x = "") +
scale_x_discrete(labels=c("Died","Survived")) +
scale_fill_brewer(palette="Blues") +
theme_new
Most dead people were men and most surviving people were women.
As expected (from our domain knowledge and our preliminary correlation test), Sex is a good predictor for the survival of passengers.
What about age? Children were supposed to board the lifeboats first.
In [104]:
ggplot(data=Train) +
geom_boxplot(aes(x = Survived, y = Age)) +
labs(title = "Distribution of ages\namong dead and safe passengers", x = "") +
scale_x_discrete(labels=c("Died","Survived")) +
theme_new
There seems to be only a slight difference in the distribution of ages between dead and surviving passengers. Although the medians are equal, the upper and lower quartiles are a bit lower for the surviving passengers, meaning that survivors tended to be slightly younger than those who died.
Let's use a histogram now to look at the details.
In [105]:
ggplot(data = Train, aes(x = Age, group = Survived, fill = Survived)) +
geom_histogram(position="dodge", binwidth = 5, boundary=15) +
labs(title = "Distribution of ages\namong dead and safe passengers") +
scale_fill_manual(labels = c("Died", "Survived"), name = "Survival", values = c("steelblue3", "lightblue")) +
theme_new
Now, we clearly see a difference for some ages: most children under 5 survived, whereas most young passengers between 15 and 30 died.
Our aim is to give as much information as we can to the model: the more useful information the model gets, the better it is likely to be.
First, let's combine the Train and Test datasets (while removing temporarily the Survived column of the Train dataset).
In [8]:
Survived <- Train$Survived # We will add it back to the train dataset after processing of the combined data set.
In [9]:
Train$Survived <- NULL
titanic = rbind(Train, Test)
In [108]:
str(titanic)
The name can provide some interesting information. It has the form: Last name, Title First name Middle name. For a married woman, her own first and maiden names are added in parentheses.
Let's retrieve the title, which may reflect the age as well as the social status of the passengers.
In [10]:
titanic$Title <- sub(".+,\\s", "", titanic$Name)
titanic$Title <- as.factor(sub("\\.\\s.+", "", titanic$Title))
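As a quick check on one name from the training set, the two substitutions work as follows (illustration only; this cell is not part of the pipeline):
In [ ]:
# Illustration on a single name: strip everything up to the comma, then everything after the title's period
example <- "Braund, Mr. Owen Harris"
sub(".+,\\s", "", example)                        # "Mr. Owen Harris"
sub("\\.\\s.+", "", sub(".+,\\s", "", example))   # "Mr"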
First, we can increase the number of observations used in the model by imputing the missing values.
What is the number of missing values in each variable?
In [110]:
nbNA <- function(x){sum(is.na(x))}
In [111]:
apply(titanic,2,nbNA)
In [112]:
hist(titanic$Fare, breaks=50)
In [113]:
summary(titanic$Fare)
Given that the distribution of Fare is highly skewed, the best option seems to be to impute the missing value with the median.
In [11]:
# Impute Fare with the median
titanic[is.na(titanic[,"Fare"]), "Fare"] <- median(titanic[,"Fare"], na.rm = TRUE)
For Embarked, there are only two passengers with missing values. Who are those passengers?
In [115]:
titanic[is.na(titanic$Embarked), ]
We can see that they are two women > 30 years old, sharing the same first class cabin. Let's see whether we can use this information to impute the missing data.
In [116]:
table(titanic[titanic$Sex == "female" & titanic$Pclass == "1" & titanic$Age > 30,"Embarked"])
Women sharing the same traits have mainly embarked at either Cherbourg (France) or Southampton (UK). Since the two passengers with missing data have an English name, we can assume that they embarked at Southampton.
In [12]:
titanic$Embarked[is.na(titanic$Embarked)] = "S"
In [118]:
# Percentage of missing values in Age
263/nrow(titanic)*100
Given the percentage of missing values for Age, we will impute this variable using Multivariate Imputation by Chained Equations (MICE).
First, let's check the missing pattern of Age.
In [119]:
# Are the missing ages equally distributed across male and female?
table(titanic[is.na(titanic$Age), "Sex"])
In [120]:
# Are the missing ages equally distributed across classes?
table(titanic[is.na(titanic$Age), "Pclass"])
We observe that the missing ages are not randomly distributed across sexes and classes, with more missing data for males and 3rd-class passengers. So we probably have a MAR (missing at random) pattern rather than an MCAR (missing completely at random) pattern.
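To quantify this, a quick sketch (not in the original notebook) of the proportion of missing ages within each class and each sex:
In [ ]:
# Proportion of missing Age values within each class and each sex
round(tapply(is.na(titanic$Age), titanic$Pclass, mean), 2)
round(tapply(is.na(titanic$Age), titanic$Sex, mean), 2)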
In [121]:
names(titanic)
Let's initialize the mice function parameters.
In [13]:
# Create a mids object containing the default settings
init <- mice(titanic[, !names(titanic) %in% c("PassengerId", "Name")], maxit=0, seed = 1)
In [14]:
# Selects the best predictors
new.pred <- quickpred(titanic[, !names(titanic) %in% c("PassengerId", "Name")])
In [15]:
# Define the methods
meth <- init$method
meth[c("Cabin")]="" # Skip a variable from imputation while this variable will still be used for prediction.
In [16]:
# Perform imputation (here, we create 10 imputed datasets, m = 10)
imputed_Data <- mice(titanic[, !names(titanic) %in% c("PassengerId", "Name")], method=meth, m=10, pred=new.pred, seed = 1, print=FALSE)
Let's inspect the distribution of original and imputed data
In [126]:
densityplot(imputed_Data)
The density of the imputed data for each imputed dataset is shown in magenta while the density of the observed data is shown in blue. We observe that the central tendencies of the imputed densities are relatively similar to those of the observed data.
In [127]:
# Give a proper size to plots
options(repr.plot.width=10, repr.plot.height=10)
In [128]:
for (i in 1:10){
namePlot <- paste("plot", i, sep="")
plot <- ggplot(titanic,aes(x=Age)) +
geom_density(data=data.frame(titanic$PassengerId, complete(imputed_Data,i)), alpha = 0.2, fill = "blue")+
geom_density(data=titanic, alpha = 0.2, fill = "Red")+
labs(title=paste("Age Distribution, imputation ", i, sep=""))+
labs(x="Age") +
theme_new
assign(namePlot, plot)
}
In [129]:
grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, plot7, plot8, plot9, plot10, nrow=4, ncol=3)
The fifth imputed dataset seems to be the best one, based on the density distribution. Let's use it.
In [17]:
titanic_imp <- complete(imputed_Data, 5)
titanic$Age <- titanic_imp$Age
The Cabin variable may also be of interest: it contains the deck (the deck letter followed by the cabin number), which may reflect some proximity to the lifeboats (deck A being the closest and deck G the farthest). However, this variable contains a majority of missing values (NA).
We could still try to impute it using the family names, ticket numbers and classes: 1) families are probably people sharing the same name and the same ticket number; 2) people from the same family were probably on the same deck; 3) first class had the top decks (A-E), second class D-F, and third class E-G.
However, given that the deck is highly correlated with Pclass, we will just leave it aside for now.
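For reference, extracting the deck letter would be a one-liner; a minimal sketch, not used further in this notebook:
In [ ]:
# Extract the deck letter (first character of Cabin); missing cabins stay NA (sketch only)
deck <- factor(substr(as.character(titanic$Cabin), 1, 1))
summary(deck)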
In [18]:
titanic$Child <- factor(titanic$Age <= 5)
titanic$Young <- factor(titanic$Age > 15 & titanic$Age <= 30)
In Title, we have many levels with only one or a few observations. This is not optimal for modeling since those levels may have near-zero variance. Let's merge the titles into a smaller number of levels, using the age and sex usually associated with each title.
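caret's nearZeroVar() can flag such quasi-constant columns; a hedged sketch applied to a one-hot encoding of Title (not part of the original pipeline), as an alternative way to spot the rare titles before merging them:
In [ ]:
# Flag rare Title levels via near-zero variance on a one-hot encoding (sketch only)
title_dummies <- model.matrix(~ Title - 1, data = titanic)
nzv <- nearZeroVar(title_dummies, saveMetrics = TRUE)
nzv[nzv$nzv, ]   # levels flagged as near-zero variance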
In [132]:
summary(titanic$Title)
In [133]:
titles <- as.character(unique(titanic$Title))
In [134]:
as.data.frame(cbind("Title" = titles,
"Female" = sapply(titles, function(x) nrow(titanic[titanic$Title == x & titanic$Sex == "female",])),
"Male" = sapply(titles, function(x) nrow(titanic[titanic$Title == x & titanic$Sex == "male",])),
"Minimum_Age" = sapply(titles, function(x) min(titanic[titanic$Title == x ,'Age'], na.rm = TRUE)),
"Maximum_Age" = sapply(titles, function(x) max(titanic[titanic$Title == x,'Age'], na.rm = TRUE))), row.names = F)
We can merge the titles into five levels: Master, Mr, Miss, Mrs, and Sir (this last level grouping the rare male titles such as Capt, Col, Don, Major, Rev, Jonkheer and male Dr).
In [19]:
titanic$Title2 <- titanic$Title
titanic[(titanic$Sex == "male" & titanic$Age <= 14.5),]$Title2 = "Master"
titanic[(titanic$Sex == "male" & titanic$Age > 14.5),]$Title2 = "Mr"
titanic[(titanic$Sex == "female" & titanic$Age <= 14),]$Title2 = "Miss"
titanic[(titanic$Sex == "female" & titanic$Age > 14),]$Title2 = "Mrs"
titanic[titanic$Title == "Capt"|
titanic$Title == "Col"|
titanic$Title == "Don"|
titanic$Title == "Major"|
titanic$Title == "Rev"|
titanic$Title == "Jonkheer"|
(titanic$Title == "Dr" & titanic$Sex == "male"),]$Title2 = "Sir"
titanic$Title2 <- factor(titanic$Title2)
In [136]:
summary(titanic$Title2)
Now, we have levels that are much more balanced.
Let's convert our factor variables into dummy variables. Having dummy variables is not absolutely necessary; however, it will let us keep or drop individual levels of a factor during feature selection later on.
In [20]:
titanic_dum <- dummy.data.frame(titanic, names= c("Sex","Embarked", "Title2") , sep = "_")
names(titanic_dum)
In [21]:
# Convert the dummy variables into factors
vars = c("Sex_female", "Sex_male", "Embarked_C", "Embarked_Q", "Embarked_S", "Title2_Master", "Title2_Miss", "Title2_Mr",
"Title2_Mrs", "Title2_Sir")
titanic_dum[,vars] = lapply(titanic_dum[,vars], function(x) as.factor(x))
In [139]:
str(titanic_dum)
Now, we can split again our data into the training and test sets.
In [22]:
Train = titanic_dum[1:891,]
Test = titanic_dum[892:1309,]
And add back the variable Survived to the training set.
In [23]:
Train$Survived <- as.factor(Survived)
Train$Survived <- as.factor(revalue(Train$Survived, c("0" = "No", "1"="Yes")))
In [24]:
# Let's remove variables that we won't use
Train$PassengerId <- NULL
Train$Cabin <- NULL
Train$Ticket <- NULL
Train$Name <- NULL
Train$Title <- NULL
We will use the train() function of the 'caret' package. It is a very convenient wrapper for modeling, allowing automatic preprocessing (e.g. scaling) and model tuning using resampling (cross-validation, bootstrap, etc.). It provides a uniform interface to the underlying functions, as well as a way to standardize common tasks. Thus, we will be able to compare various algorithms easily.
Note that we will set the same seed prior to each training. This ensures that the same resampling sets are used, which will come in handy when we compare the resampling profiles between models.
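An alternative, more explicit way to guarantee identical folds across models would be to create the folds once and pass them to trainControl() through its index argument; a minimal sketch (the notebook itself relies on set.seed(1) instead):
In [ ]:
# Sketch: build one set of folds and reuse it for every model via the index argument
set.seed(1)
folds <- createFolds(Train$Survived, k = 10, returnTrain = TRUE)
numFolds_fixed <- trainControl(method = "cv", number = 10, index = folds)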
In [143]:
names(Train)
In [25]:
## Define cross-validation experiment
numFolds = trainControl(method = "cv", number = 10) # method='cv' for cross-validation, number=10 for 10 folds
In [145]:
## Perform the training with cross-validation
set.seed(1)
logmod1 <- train(Survived ~ .-Sex_female -Embarked_C -Title2_Master, # We have to remove the "reference" level of dummy variables
data = Train,
method = "glm",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"))
In [68]:
summary(logmod1) # with Pclass as ordered factor
Note 1: lm() solves linear least squares directly via a QR decomposition, a direct optimization method (global-optimum guarantee and good computational performance), while glm() uses iteratively re-weighted least squares, which repeatedly solves a weighted least-squares problem. These direct approaches are more efficient than gradient descent here, but least squares remains very sensitive to outliers.
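As a small illustration (on simulated data, not the Titanic set), the coefficients returned by lm() match those obtained by solving the least-squares problem directly through a QR decomposition:
In [ ]:
# Least squares via QR decomposition vs lm(), on simulated data (illustration only)
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
X <- cbind(1, x)                    # design matrix with intercept
cbind(QR = qr.coef(qr(X), y),       # least-squares solution via QR
      lm = coef(lm(y ~ x)))         # identical up to numerical precision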
Note 2: We have to normalize our data so that the independent variables are on similar scales. By using the parameter preProcess=c("center", "scale"), the normalization will be computed on the 9 training folds only (not on the whole original training set), which limits accidental information leakage from the held-out fold.
Note 3: Since Pclass is an ordered factor, the L (linear) term measures the linear trend and Q the quadratic trend (which is close to zero in this case because the pattern is essentially linear).
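These L and Q terms come from the polynomial contrasts that R uses by default for ordered factors; for a 3-level ordered factor like Pclass they look like this:
In [ ]:
# Default polynomial contrasts for a 3-level ordered factor (columns .L and .Q)
contr.poly(3)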
As expected, the class, being a child, and the title (which encodes the sex) have a significant impact on the survival of passengers. We observe that the probability of survival is significantly higher if the passenger is a child under 5 and lower if the passenger is in 3rd class and/or has a male title (i.e. Mr or Sir).
Let's see what the accuracy of the model is (i.e. the mean accuracy across the cross-validated folds).
In [146]:
logmod1$results$Accuracy
That was a really quick and dirty model and the performance is already good! But I think we can do better.
Now, how can we improve our model? There are several options (add features, remove features, add regularization, get more data), but before rushing into any of them (which is likely to waste our time), let's be smart and check what is wrong with our model by looking at its learning curves.
In [152]:
set.seed(1)
lc <- learning_curve_dat(Train[, !names(Train) %in% c("Sex_female", "Embarked_C", "Title2_Master")], outcome = "Survived",
proportion = (1:10)/10,
test_prop = 0,
method = "glm",
trControl = numFolds,
na.action = na.omit,
verbose = FALSE)
In [153]:
# Give a proper size to plots
options(repr.plot.width=5, repr.plot.height=4)
In [154]:
ggplot(lc, aes(x = Training_Size, y = Accuracy, color = Data)) +
geom_smooth(method = loess, span = .8) +
theme_classic()
This graph shows the learning curves of the model (accuracy vs. training set size). "Training" and "Resampling" represent the performance of the model on the training set and on the resampling set respectively, as the size of the training set increases. (The resampling set is the set that was held out as a "test" set during the cross-validation process.)
We observe that the performance on the training set decreases as the training set grows: the more data we have, the less likely the model is to fit it exactly. Conversely, the performance on the resampling set increases with the training set size: the more data we train the model with, the better it generalizes to unseen data.
The performance on the resampling set is slightly lower than on the training set. It seems that we have a (mild) high-variance problem, even if the learning curve is not entirely clear here... In other words, our model seems to be overfitting. Let's remove some features.
We can perform feature selection, to keep only relevant predictors, either manually or automatically.
As written by Guyon & Elisseeff (2003) (http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf),
There are many potential benefits of feature selection:
- facilitating data visualization and data understanding,
- reducing the measurement and storage requirements,
- reducing training and utilization times,
- defying the curse of dimensionality to improve prediction performance.
First, let's check the warning that appeared when we ran our quick and dirty model. It seems that we have a problem related to rank-deficiency. It may stem from many origins, one of which may be that we have some collinear variables. We know that Fare is highly correlated with Pclass; however, removing Fare does not remove the warning.
We have created Title2 by taking into account Sex and Age, so Sex and Title2 are likely highly correlated.
In [161]:
# For example:
hetcor(Train[,c("Sex_male", "Title2_Mr")])$correlations
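Beyond pairwise correlations, caret's findLinearCombos() can hunt for exact linear dependencies in the design matrix, which is another way to track down the rank-deficiency warning; a hedged sketch, not run in the original analysis:
In [ ]:
# Look for exact linear dependencies among the predictors (sketch only)
X <- model.matrix(Survived ~ . -Sex_female -Embarked_C -Title2_Master, data = Train)
combos <- findLinearCombos(X)
colnames(X)[combos$remove]   # columns suggested for removal to restore full rank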
Let's try to remove Sex_male then.
In [155]:
names(Train)
In [156]:
## Perform the cross validation
set.seed(1)
logmod2 <- train(Survived ~ .-Sex_female -Embarked_C -Title2_Master -Sex_male,
data = Train,
method = "glm",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"))
In [157]:
summary(logmod2)
Let's see what is the accuracy of the model.
In [158]:
logmod2$results$Accuracy
Removing Sex_male did not change the accuracy of our model.
Let's have a look at the importance of each variable.
In [163]:
# absolute value of the t-statistic
varImp(logmod2, scale=FALSE)
We can remove some of the least important variables: Title2_Miss, Title2_Mrs, Embarked_S and Embarked_Q.
In [164]:
names(Train)
In [189]:
## Perform the cross validation
set.seed(1)
logmod3 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Child + Young + Title2_Mr + Title2_Sir,
data = Train,
method = "glm",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"))
In [190]:
logmod3$results$Accuracy
And, now, let's remove Fare which is among the least important variables and collinear with Pclass.
In [34]:
## Perform the cross validation
set.seed(1)
logmod4 <- train(Survived ~ Pclass + Age + SibSp + Parch + Child + Young + Title2_Mr + Title2_Sir,
data = Train,
method = "glm",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"))
In [35]:
logmod4$results$Accuracy
Still better!
Another way to perform feature selection is to use stepwise feature selection based on AIC. Let's try it starting with all the variables.
In [174]:
names(Train)
In [184]:
## Perform the training with cross-validation
set.seed(1)
AIC_glm <- train(Survived ~ .-Sex_female -Title2_Master -Embarked_C -Sex_male,
data = Train,
method = "glmStepAIC",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
trace=FALSE)
In [185]:
summary(AIC_glm)
In [186]:
AIC_glm$results$Accuracy
We observe that performing stepwise selection slightly decreased the performance of our model. Actually, feature selection does not necessarily improve the accuracy of a model. Its aim is primarily to decrease the model's complexity. Since our aim is to have the best accuracy, we will keep all the variables selected through the manual selection.
In [191]:
names(Train)
In [26]:
## Define the cross-validation experiment
numFolds = trainControl(method = "cv", number = 10 ) # method='cv' for cross-validation, number=10 for 10 folds
cpGrid = expand.grid( .cp = seq(0.01,0.5,0.01))
For this model, we will not use the 'formula' interface, so that our tree will be clearer when we plot it.
In [27]:
DV = Train[, "Survived"]
pred = Train[, c('Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_Q', 'Embarked_S', 'Child',
'Young', 'Title2_Miss', 'Title2_Mr', 'Title2_Mrs', 'Title2_Sir')]
In [195]:
## Perform the cross validation
set.seed(1)
CART1 <- train(pred, DV, method = "rpart", trControl = numFolds, na.action = na.omit, tuneGrid = cpGrid)
In [196]:
CART1$bestTune
In [197]:
# Give a proper size to plots
options(repr.plot.width=4, repr.plot.height=4)
In [198]:
plot(CART1)
In [199]:
# Give a proper size to plots
options(repr.plot.width=6, repr.plot.height=6)
In [200]:
rpart.plot(CART1$finalModel, type=0, extra=106, under=TRUE, tweak=0.8)
The plot shows the probability of survival and the percentage of observations in each leaf.
The title was important: a passenger who was a 'Mr' was more likely to die, and in the 2nd and 1st class, a 'Sir' was also more likely to die. These results mirror what we said earlier about the lower probability of survival in men. As expected, the class of the passenger had an impact on their survival: first and second class passengers were more likely to survive.
Then, among passengers from the 3rd class, the Fare was of interest: interestingly, having paid at least 23 made a person more likely to die...
In [201]:
# Best accuracy
max(CART1$results$Accuracy)
This CART model does not have a better performance than our best logistic regression model.
Random forest = tree bagging + feature sampling: each tree is grown on a bootstrap sample of the observations, and only a random subset of the predictors (the mtry parameter tuned below) is considered at each split.
In [28]:
## Define the cross-validation experiment
mtryGrid = expand.grid( .mtry = 1:10)
numFolds = trainControl(method = "cv", number = 10 ) # method='cv' for cross-validation, number=10 for 10 folds
In [224]:
names(Train)
In [37]:
## Perform the cross validation
set.seed(1)
RF1 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young +
Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "rf",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = mtryGrid,
nodesize = 15,
importance = TRUE)
In [225]:
RF1$bestTune
In [226]:
# Best accuracy
max(RF1$results$Accuracy)
Our first random forest model is better than our best log and CART models.
However, we lose some of the interpretability that comes with CART.
Actually, we can still compute metrics that give us insight into which variables are important. One metric that we can look at is the number of times, aggregated over all of the trees in the random forest model, that a certain variable is selected for a split (i.e. selection frequency).
In [227]:
# Give a proper size to plots
options(repr.plot.width=4, repr.plot.height=4)
In [228]:
vu = varUsed(RF1$finalModel, count=TRUE)
vusorted = sort(vu, decreasing = FALSE, index.return = TRUE)
dotchart(vusorted$x, RF1$finalModel$xNames[vusorted$ix], main = "RF1", xlab = "Selection frequency", cex=0.7)
Here, Fare and Age appear to be important variables.
Another way to see the importance of a variable is to look at the reduction in impurity (i.e. how homogeneous each bucket or leaf of the tree is) after that variable has been selected for splitting in all of the trees in the forest (i.e. Gini importance). Similarly, we can look at the mean decrease in accuracy (i.e. permutation importance).
In [229]:
# Give a proper size to plots
options(repr.plot.width=8, repr.plot.height=4)
In [230]:
par(mfrow=c(1,2))
# Compute the reduction in impurity (i.e. Gini importance)
Gini <- varImp(RF1, type=2, scale=FALSE)
Ginisorted = sort(Gini$importance$Overall, decreasing = FALSE, index.return = TRUE)
dotchart(Ginisorted$x, rownames(Gini$importance)[Ginisorted$ix],
main = "RF1",
xlab = "Gini importance",
cex=0.7)
# Compute the reduction in accuracy (i.e. permutation accuracy importance)
Perm <- varImp(RF1, type=1, scale=FALSE)
Permsorted = sort(Perm$importance$Overall, decreasing = FALSE, index.return = TRUE)
dotchart(Permsorted$x, rownames(Perm$importance)[Permsorted$ix],
main = "RF1",
xlab = "Permutation accuracy importance",
cex=0.7)
Again, Fare and Age appear among the top variables, but we can see that Title2_Mr and Pclass are also important.
WARNING! We have to be careful here: isn't it a strange coincidence that all the continuous variables are in the top 5 variables used for splitting?...
Checking the literature, we can find an explanation to this behaviour:
In standard tree algorithms, variable selection is biased in favor of variables offering many potential cut-points, so that variables with many categories and continuous variables are artificially preferred.
Therefore, authors suggested that
The randomForest function should not be used when the potential predictor variables vary in their scale of measurement or their number of categories.
In [231]:
pred3 <- subset(Train, select=c(Pclass, Age, SibSp, Parch, Fare, Embarked_Q, Embarked_S, Child, Young,
Title2_Miss, Title2_Mr, Title2_Mrs, Title2_Sir))
In [232]:
# Number of "levels" (i.e. potential cut-points) for each variable
sort(apply(pred3, 2, function(x) length(unique(x))), decreasing = TRUE)
Given our dataset, we have to use an unbiased tree algorithm (i.e. the ctree() and cforest() functions of the 'party' package). These use subsampling without replacement, which provides more reliable variable importance measures.
In [233]:
names(Train)
In [234]:
## Perform the cross validation
set.seed(1)
RF2 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young +
Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "cforest",
trControl = numFolds,
preProcess=c("center", "scale"),
tuneGrid = mtryGrid,
controls = cforest_unbiased(ntree=500))
In [235]:
RF2$bestTune
In [236]:
import <- varimp(RF2$finalModel)
In [237]:
# Give a proper size to plots
options(repr.plot.width=4, repr.plot.height=4)
In [238]:
importSorted = sort(import, decreasing = FALSE, index.return = TRUE)
dotchart(importSorted$x,
main = "RF2",
xlab = "Permutation accuracy importance",
cex=0.7)
Retrieving the variable selection frequency is not as straightforward with 'cforest()' as with 'randomForest()'. Let's compute it ourselves.
The 'cutpoints_list()' function returns the list of all cutpoints used for a given variable in a given tree built with the 'party' package (it can be found at: https://github.com/cran/party/blob/master/R/varimp.R). The length of this list therefore corresponds to the number of times the variable has been used for splitting.
In [239]:
cutpoints_list <- function(tree, variableID) {
cutp <- function(node) {
if (node[[4]]) return(NULL)
cp <- NULL
if (node[[5]][[1]] == variableID)
cp <- node[[5]][[3]]
nl <- cutp(node[[8]])
nr <- cutp(node[[9]])
return(c(cp, nl, nr))
}
return(cutp(tree))
}
Now we can use this function to compute the frequency selection of each variable.
In [240]:
# Retrieve the variables' names
inputs <- RF2$finalModel@data@get("input")
xnames <- colnames(inputs)
# Number of variables
nbvar <- length(xnames)
# Number of trees in the forest
nbtree <- length(RF2$finalModel@ensemble)
# Prepare container
nbtimes <- data.frame()
# Compute the number of times a given variable has been used to split a given tree
for (i in 1:nbvar) { # for each variable
for (j in 1:nbtree) { # for each tree in the forest
cutpts <- cutpoints_list(RF2$finalModel@ensemble[[j]], i)
nbtimes[j,i] <- length(cutpts)
}
}
colnames(nbtimes) <- xnames
# Frequency selection of each variable
nbtimesTot <- apply(nbtimes, 2, sum)
In [241]:
nbtimesTot
In [242]:
nbtimesSorted = sort(nbtimesTot, decreasing = FALSE, index.return = TRUE)
dotchart(nbtimesSorted$x,
main = "RF2",
xlab = "Selection frequency",
cex=0.7)
Let's check our randomForest and cforest models together.
In [243]:
# Give a proper size to plots
options(repr.plot.width=8, repr.plot.height=4)
In [244]:
par(mfrow=c(1,2))
dotchart(vusorted$x,
RF1$finalModel$xNames[vusorted$ix],
main = "RF1 (randomForest)",
xlab = "Selection frequency",
cex=0.7)
dotchart(nbtimesSorted$x,
main = "RF2 (cforest)",
xlab = "Selection frequency",
cex=0.7)
The use of 'cforest()' has clearly decreased the difference between Age/Fare and other variables such as Pclass or Embarked_S.
We can see that Fare and Age are still the most frequently used variables for splitting, although their contribution to the Gini importance and the permutation accuracy importance is lower than that of other variables.
In [245]:
# Best accuracy
max(RF2$results$Accuracy)
The accuracy of this model is lower than the random forest model, and even lower than the CART and log models...
In [29]:
## Define the cross-validation experiment
costGrid = expand.grid( .C = c(0.01, 0.03, 0.1, 0.3, 1, 3, 10))
numFolds = trainControl(method = "cv", number = 10 )
In [247]:
names(Train)
In [248]:
## Perform the cross validation
set.seed(1)
svm1 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young +
Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "svmLinear",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = costGrid)
In [249]:
svm1$bestTune
In [250]:
# Give a proper size to plots
options(repr.plot.width=4, repr.plot.height=4)
In [251]:
max(svm1$results$Accuracy)
In [30]:
## Define the cross-validation experiment
costGrid = expand.grid( .C = c(0.01, 0.03, 0.1, 0.3, 1, 3, 10), .sigma = c(0.01, 0.03, 0.1, 0.3, 1, 3, 10))
numFolds = trainControl(method = "cv", number = 10)
In [253]:
## Perform the cross validation
set.seed(1)
svm2 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young +
Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "svmRadial",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = costGrid)
In [254]:
svm2$bestTune
In [255]:
# Give a proper size to plots
options(repr.plot.width=6, repr.plot.height=4)
In [256]:
plot(svm2)
In [257]:
max(svm2$results$Accuracy)
This is better than CART but not better than our log and random forest models.
Even if SVM already uses regularisation to avoid overfitting (regularisation parameter C), let's remove some features that were defined as the least important ones by the random forest (i.e. Title2_Miss, Child, Title2_Mrs, Embarked_Q).
In [38]:
## Perform the cross validation
set.seed(1)
svm3 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Child + Title2_Miss +
Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "svmRadial",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = costGrid)
In [269]:
## Perform the cross validation
set.seed(1)
svm3 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_S +
Title2_Mr + Title2_Sir,
data = Train,
method = "svmRadial",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = costGrid)
In [270]:
svm3$bestTune
In [271]:
max(svm3$results$Accuracy)
Now, we reach the same performance as our random forest model.
In [31]:
## Define the cross-validation experiment
layerGrid = expand.grid( .size = seq(5, 105, by=10), .decay = c(0, 10^seq(-3, 0, by=1)))
numFolds = trainControl(method = "cv", number = 10 )
In [ ]:
## Perform the cross validation
set.seed(1)
nn1 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young +
Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "nnet",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = layerGrid,
MaxNWts=1900,
trace=FALSE)
In [ ]:
nn1$bestTune
In [ ]:
max(nn1$results$Accuracy, na.rm=TRUE)
Neural networks are very sensitive to multicollinearity. Let's remove Fare, which is highly correlated with Pclass.
In [40]:
## Perform the cross validation
set.seed(1)
nn2 <- train(Survived ~ Pclass + Age + SibSp + Parch + Embarked_Q + Embarked_S + Child + Young +
Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "nnet",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = layerGrid,
MaxNWts=1700,
trace=FALSE)
In [229]:
nn2$bestTune
In [230]:
max(nn2$results$Accuracy, na.rm=TRUE)
Now, let's remove some features that were defined as the least important ones by the random forest (i.e. Title2_Miss, Child, Title2_Mrs, Embarked_Q).
In [234]:
## Perform the cross validation
set.seed(1)
nn3 <- train(Survived ~ Pclass + Age + SibSp + Parch + Embarked_S + Young +
Title2_Mr + Title2_Sir,
data = Train,
method = "nnet",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = layerGrid,
MaxNWts=1900,
trace=FALSE)
In [235]:
nn3$bestTune
In [236]:
max(nn3$results$Accuracy, na.rm=TRUE)
That's not better. Let's keep them in the model.
In [41]:
## Define the cross-validation experiment
tuneGridL = expand.grid(.nrounds = 1:3,
.lambda = c(0.01, 0.03, 0.1, 0.3, 1, 3, 10),
.alpha = c(0.01, 0.03, 0.1, 0.3, 1, 3, 10), #1e-5, 1e-2, 0.1, 1, 100
.eta = c(0.01, 0.21, 0.05)) # 0.01-0.2
numFolds = trainControl(method = "cv", number = 10 )
In [42]:
## Perform the cross validation
set.seed(1)
xgb1 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young +
Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "xgbLinear",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = tuneGridL)
In [43]:
xgb1$bestTune
In [44]:
max(xgb1$results$Accuracy)
In [45]:
## Define the cross-validation experiment
tuneGridT = expand.grid(.nrounds = 1:3,
.gamma = c(0.01, 0.03, 0.1, 0.3, 1),
.max_depth = seq(3,10,2),
.eta = c(0.01, 0.21, 0.05),
.colsample_bytree = c(0.6, 0.7, 0.8, 0.9, 1),
.min_child_weight = seq(1,10,2),
.subsample = c(0.6, 0.7, 0.8, 0.9, 1))
numFolds = trainControl(method = "cv", number = 10 )
In [46]:
## Perform the cross validation
set.seed(1)
xgb2 <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young +
Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "xgbTree",
trControl = numFolds,
na.action = na.omit,
preProcess=c("center", "scale"),
tuneGrid = tuneGridT)
In [47]:
xgb2$bestTune
In [48]:
max(xgb2$results$Accuracy)
First, we have to run the CART model again (as CART1b), this time using the formula interface and the dummy-encoded Train data frame, so that all models are fit on the same training dataset.
In [36]:
set.seed(1)
CART1b <- train(Survived ~ Pclass + Age + SibSp + Parch + Fare + Embarked_Q + Embarked_S + Child + Young +
Title2_Miss + Title2_Mr + Title2_Mrs + Title2_Sir,
data = Train,
method = "rpart",
trControl = numFolds,
na.action = na.omit,
tuneGrid = cpGrid)
In [49]:
# Collect the resampling results
resamps <- resamples(list(GLM = logmod4,
CART = CART1b,
RF = RF1,
SVM = svm3,
NN = nn2,
XGBL = xgb1,
XGBT = xgb2))
summary(resamps)
Let's plot the results
In [50]:
# Give a proper size to plots
options(repr.plot.width=6, repr.plot.height=4)
In [51]:
bwplot(resamps, layout = c(2, 1))
In [243]:
dotplot(resamps, metric = "Accuracy")
In [52]:
# Give a proper size to plots
options(repr.plot.width=5, repr.plot.height=5)
In [53]:
splom(resamps, varname.cex=0.8, varname.fontface="bold", axis.text.cex=0.4)
Since models are fit on the same versions of the training data, it makes sense to make inferences on the differences between models. In this way we reduce the within-resample correlation that may exist. We can compute the differences, then use a simple t-test to evaluate the null hypothesis that there is no difference between models. (Max Kuhn)
In [54]:
difValues <- diff(resamps)
difValues
summary(difValues)
In [57]:
# Give a proper size to plots
options(repr.plot.width=4, repr.plot.height=5)
In [58]:
bwplot(difValues, layout = c(3, 1), cex=0.8)
We can see that all those models have a very similar accuracy.
Let's check the correlation among models. We will use this information to decide whether we want to combine some models to build a stacked one.
In [59]:
model_cor <- modelCor(resamps)
model_cor
In [60]:
corrplot(model_cor)
The models are highly correlated with each other, so stacking them would likely bring little benefit.
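For completeness, the 'caretEnsemble' package loaded at the top is the natural tool if we did want to try stacking; a hedged sketch of what that could look like (not pursued here, given the high correlations):
In [ ]:
# Sketch of a stacked model with caretEnsemble (not pursued here)
stackCtrl <- trainControl(method = "cv", number = 10, savePredictions = "final", classProbs = TRUE)
set.seed(1)
model_list <- caretList(Survived ~ Pclass + Age + SibSp + Parch + Fare + Title2_Mr + Title2_Sir,
                        data = Train,
                        trControl = stackCtrl,
                        methodList = c("glm", "rpart", "rf"))
stack_glm <- caretStack(model_list, method = "glm",
                        trControl = trainControl(method = "cv", number = 5))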
In [61]:
str(Test)
In [101]:
pred_GLM_Test = predict(logmod4, newdata=Test)
In [102]:
pred_GLM_final <- cbind(Test$PassengerId, as.character(pred_GLM_Test))
In [103]:
colnames(pred_GLM_final) <- c("PassengerId", "Survived")
head(pred_GLM_final)
In [104]:
pred_GLM_final <- as.data.frame(pred_GLM_final)
In [105]:
pred_GLM_final$Survived <- revalue(pred_GLM_final$Survived, c("No"="0", "Yes"="1"))
In [107]:
write.table(pred_GLM_final, file="pred_glm_final2.csv", sep=",", row.names=FALSE)
leaderboard score on Kaggle: 0.78469
In [109]:
pred_CART_Test = predict(CART1b, newdata=Test)
In [110]:
pred_CART_final <- cbind(Test$PassengerId, as.character(pred_CART_Test))
In [111]:
colnames(pred_CART_final) <- c("PassengerId", "Survived")
In [112]:
pred_CART_final <- as.data.frame(pred_CART_final)
In [113]:
pred_CART_final$Survived <- revalue(pred_CART_final$Survived, c("No"="0", "Yes"="1"))
In [114]:
write.table(pred_CART_final, file="pred_cart_final2.csv", sep=",", row.names=FALSE)
leaderboard score on Kaggle: 0.79426
In [115]:
pred_RF_Test = predict(RF1, newdata=Test)
In [116]:
pred_RF_final <- cbind(Test$PassengerId, as.character(pred_RF_Test))
In [117]:
colnames(pred_RF_final) <- c("PassengerId", "Survived")
In [118]:
pred_RF_final <- as.data.frame(pred_RF_final)
In [119]:
pred_RF_final$Survived <- revalue(pred_RF_final$Survived, c("No"="0", "Yes"="1"))
In [120]:
write.table(pred_RF_final, file="pred_rf_final2.csv", sep=",", row.names=FALSE)
leaderboard score on Kaggle: 0.77512
In [78]:
pred_SVM_Test = predict(svm3, newdata=Test)
In [79]:
pred_SVM_final <- cbind(Test$PassengerId, as.character(pred_SVM_Test))
In [80]:
colnames(pred_SVM_final) <- c("PassengerId", "Survived")
In [81]:
pred_SVM_final <- as.data.frame(pred_SVM_final)
In [85]:
pred_SVM_final$Survived <- revalue(pred_SVM_final$Survived, c("No"="0", "Yes"="1"))
In [87]:
write.table(pred_SVM_final, file="pred_svm_final2.csv", sep=",", row.names=FALSE)
leaderboard score on Kaggle: 0.78469
In [88]:
pred_XGBL_Test = predict(xgb1, newdata=Test)
In [89]:
pred_XGBL_final <- cbind(Test$PassengerId, as.character(pred_XGBL_Test))
In [90]:
colnames(pred_XGBL_final) <- c("PassengerId", "Survived")
In [91]:
pred_XGBL_final <- as.data.frame(pred_XGBL_final)
In [93]:
pred_XGBL_final$Survived <- revalue(pred_XGBL_final$Survived, c("No"="0", "Yes"="1"))
In [94]:
write.table(pred_XGBL_final, file="pred_xgbL_final2.csv", sep=",", row.names=FALSE)
leaderboard score on Kaggle: 0.79904
In [95]:
pred_XGBT_Test = predict(xgb2, newdata=Test)
In [96]:
pred_XGBT_final <- cbind(Test$PassengerId, as.character(pred_XGBT_Test))
In [97]:
colnames(pred_XGBT_final) <- c("PassengerId", "Survived")
In [98]:
pred_XGBT_final <- as.data.frame(pred_XGBT_final)
In [99]:
pred_XGBT_final$Survived <- revalue(pred_XGBT_final$Survived, c("No"="0", "Yes"="1"))
In [100]:
write.table(pred_XGBT_final, file="pred_xgbT_final2.csv", sep=",", row.names=FALSE)
leaderboard score on Kaggle: 0.73206
To improve the accuracy, we could try to: engineer more features (for instance, extract and impute the deck from Cabin), explore finer hyperparameter grids, or revisit model stacking with less correlated base learners.