Predict Titanic survival: Kaggle Competition
VARIABLE DESCRIPTIONS:
survival    Survival (0 = No; 1 = Yes)
pclass      Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name        Name
sex         Sex
age         Age
sibsp       Number of Siblings/Spouses Aboard
parch       Number of Parents/Children Aboard
ticket      Ticket Number
fare        Passenger Fare
cabin       Cabin
embarked    Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
Age is in Years; Fractional if Age less than One (1)
If the Age is Estimated, it is in the form xx.5
With respect to the family relation variables (i.e. sibsp and parch),
some relations were ignored. The following definitions were used
for sibsp and parch.
Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws. Some children travelled
only with a nanny, therefore parch=0 for them. As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
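Per the notes above, an age of the form xx.5 marks an estimate, while fractions below 1 are real infant ages, so estimated ages can be flagged programmatically. A minimal sketch on example values (not rows from the data set):

```r
# Flag estimated ages: per the data dictionary, ages ending in .5 are
# estimates, except fractional ages under 1, which are real infant ages
ages <- c(22, 28.5, 0.83, 45, 33.5)   # example values
estimated <- ages >= 1 & (ages %% 1) == 0.5
estimated   # FALSE TRUE FALSE FALSE TRUE
```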
Import the test and training sets
In [912]:
# Import the training and test sets. Use an if with a flag variable so I don't have to re-download every time I run the script
var1 <- 2
if (var1 == 2) {
library(data.table)
train_data_raw <- fread(
"https://drive.google.com/uc?export=download&id=0B1HWpqllmsS8Y01HRjVaanRQTlE",
stringsAsFactors=F,
na.strings = c("NA","")
)
test_data_raw <- fread(
"https://drive.google.com/uc?export=download&id=0B1HWpqllmsS8bmY4V3d2RWRqNkE",
stringsAsFactors=F,
na.strings = c("NA","")
)
}
train.data <- train_data_raw
test.data <- test_data_raw
In [913]:
#overview of the data
str(train.data)
str(test.data)
Combine the test and train data for analysis, cleaning, and manipulation. Any munging done to the training set needs to be applied to the test set as well.
In [914]:
# give the training and test sets a column identifying which set they belong to
# so I can easily separate them later
train.data$set <- "Train"
test.data$set <- "Test"
# Recode the target as character: 1 -> 'Y', 0 -> 'N'
train.data$Survived <- as.character(train.data$Survived)
train.data$Survived[train.data$Survived == '1'] <- 'Y'
train.data$Survived[train.data$Survived == '0'] <- 'N'
# Bind training and test sets together
# the test data is missing the Survived column
train_test_data <- dplyr::bind_rows(train.data, test.data)
head(train.data)
tail(train_test_data)
In [915]:
train_test_data$Survived = factor(train_test_data$Survived)
train_test_data$Pclass = factor(train_test_data$Pclass)
str(train_test_data)
In [916]:
#To find missing values in Age:
sum(is.na(train_test_data$Age)) #a count of NA's in Age
#To see it as a proportion, count missing values against the combined data set:
sum(is.na(train_test_data$Age)) / length(train_test_data$Age)
#So about 20% of our Age data is missing
In [917]:
#I'd like to see that applied to each feature. Here I make it an actual percentage and round to 2 decimal places
sapply(train_test_data, function(df) {
round(( sum(is.na(df) == TRUE) / length(df) * 100 ),2)
})
Missing 77% of Cabin data.
Visualize the missing data using the Amelia package, which is designed specifically for missing data. We want to see where data is missing across the whole data set. Note: typing AmeliaView() in the console brings up an interactive GUI.
In [918]:
library(Amelia)
missmap(train_test_data, main = "Missing Map", col = c("lightyellow", "darkred"), legend = T)
In [919]:
table(train_test_data$Embarked, useNA = "always") # useNA = "always" shows the count of NA values
In [920]:
barplot(table(train_test_data$Embarked, useNA = "always"))
A reasonable way of imputing the 2 missing values is to use the most common port, which is Southampton.
In [921]:
train_test_data$Embarked[which(is.na(train_test_data$Embarked))] <- 'S'
table(train_test_data$Embarked, useNA = "always") # view the distribution to verify no NA's
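As a quick sanity check on a mode-based imputation like this, it can help to compare fares by port and class, since fare and embarkation point are related. A sketch on made-up numbers (the real call would use train_test_data):

```r
# Median fare by embarkation port and passenger class -- compare against the
# fares of the rows that were missing Embarked before settling on 'S'
toy <- data.frame(Embarked = c("S", "S", "C", "C", "Q"),
                  Pclass   = c(1, 3, 1, 1, 3),
                  Fare     = c(26, 8, 80, 60, 7))
aggregate(Fare ~ Embarked + Pclass, data = toy, FUN = median)
```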
In [922]:
# Extract title from Name, creating a new variable
train_test_data$Title <- gsub('(.*, )|(\\..*)', '', train_test_data$Name)
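To see what that regex is doing, here it is applied to a sample string in the data set's name format:

```r
# '(.*, )' strips everything up to and including ", ";
# '(\\..*)' strips everything from the first "." onward -- leaving the title
name <- "Braund, Mr. Owen Harris"
gsub('(.*, )|(\\..*)', '', name)   # "Mr"
```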
Predict Age for NA's. Use a decision tree and view it.
In [923]:
hist(train_test_data$Age, sqrt(nrow(train_test_data)), main = "Passenger Age", xlab = "Age")
Let's see the counts for each title along with a max and min age for each
In [924]:
library(sqldf)
#first get a total count with min and max ages for each title
sqldf("select Title,
count(*) as TitleCount,
min(Age) as Min_Age,
max(Age) as Max_Age
from 'train_test_data'
group by Title
order by count(*) desc")
sqldf("select Title,
count(*) as AgeNACount
from 'train_test_data'
where Age is null
group by Title
order by count(*) desc")
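If you'd rather stay out of SQL, the same kind of summary can be written with dplyr. A sketch on a toy frame (the toy data is made up, not from the Titanic set):

```r
library(dplyr)
# Toy stand-in for train_test_data with the two columns the query uses
toy <- data.frame(Title = c("Mr", "Mr", "Miss", "Master"),
                  Age   = c(30, NA, 22, 4))
title_summary <- toy %>%
  group_by(Title) %>%
  summarise(TitleCount = n(),
            Min_Age    = min(Age, na.rm = TRUE),
            Max_Age    = max(Age, na.rm = TRUE)) %>%
  arrange(desc(TitleCount))
title_summary
```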
In [925]:
#Title seems important for Age prediction. I'll change Mr to Master for boys under a certain age.
#if Title is Mr, Age is 14.5 or less, and Age is not NA, assign Master
train_test_data$Title[train_test_data$Title %in% c('Mr') & train_test_data$Age <= 14.5 & !is.na(train_test_data$Age)] <- "Master"
In [926]:
#Also create FamilyName column from passenger name
# create FamilyName from passenger name. FamilyName is the first name in the field
train_test_data$FamilyName <- gsub('(, .*)', '', train_test_data$Name)
train_test_data$FamilyName <- factor(train_test_data$FamilyName)
Family size. Maybe someone who belongs to a family is more likely to die while trying to find other family members instead of jumping into a lifeboat.
In [927]:
# Create a family size variable including the passenger themselves
train_test_data$Fsize <- train_test_data$SibSp + train_test_data$Parch + 1
train_test_data$Fsize <- as.integer(train_test_data$Fsize)
# create a small family variable showing if less than 3 members or not
#train.data$FamilySizeSmall = ifelse(train.data$Fsize + 1 <= 3, 1,0)
# Create a family variable. this will be more unique than just last name. It combines name with family size in one string
train_test_data$Family <- paste(train_test_data$FamilyName, train_test_data$Fsize, sep='_')
train_test_data$Family <- factor(train_test_data$Family)
In [928]:
#Mothers
train_test_data$Mother = ifelse(train_test_data$Title=="Mrs" & train_test_data$Parch > 0, 1,0)
train_test_data$Mother <- factor(train_test_data$Mother)
#singles
train_test_data$Single = ifelse(train_test_data$Fsize == 1, 1,0) # People travelling alone
train_test_data$Single <- factor(train_test_data$Single)
Create variable Deck that is the first letter in the Cabin variable
In [929]:
# Create a Deck variable. Get passenger deck A - F:
train_test_data$DeckT <- substr(train_test_data$Cabin,1,1) # take the first character of Cabin
train_test_data$DeckT <- factor(train_test_data$DeckT)
#http://rfunction.com/archives/1692
Since the Family variable is unique to each family, and families tend to have cabins near each other, I'll build a distinct list of families that have a cabin listed, then join that on Family to assign the same deck to all family members.
In [930]:
library(sqldf)
library(dplyr)
#there's a family with two decks, which creates a couple of extra rows when joined
DeckFam <- sqldf("select distinct Family, DeckT as Deck_Imp from 'train_test_data' where DeckT is not null")
#Identify duplicates
DeckFam$dup <- duplicated(DeckFam$Family)
#select only where duplicates equals false
DeckFam <- sqldf("select * from 'DeckFam' where dup = 0")
train_test_data <- left_join(train_test_data, DeckFam, by = "Family")
train_test_data$DeckT <- NULL # I don't need this column now
train_test_data$Deck_Imp <- factor(train_test_data$Deck_Imp)
In [931]:
str(train_test_data)
Use the anova method to predict continuous variables.
In [932]:
# Continuous features into bins
train_test_data$Fare_bins <- cut(train_test_data$Fare, 20, include.lowest = T) #bins for Age did not improve model accuracy
# Look at the bins
table(train_test_data$Fare_bins)
barplot(table(train_test_data$Fare_bins))
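cut(x, 20) makes 20 equal-width bins, so most fares pile into the lowest few; equal-frequency (quantile) bins are a common alternative worth comparing. A sketch with example fares, not the code used above:

```r
# Equal-frequency bins: quartile breaks put roughly the same number of
# observations in each bin, unlike equal-width cut(x, 20)
fares  <- c(7.25, 7.9, 8.05, 13, 26, 31, 71.3, 151.55, 263, 512.33)  # example values
breaks <- quantile(fares, probs = seq(0, 1, 0.25), na.rm = TRUE)
table(cut(fares, breaks, include.lowest = TRUE))
```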
In [933]:
#How many NA's now?
sapply(train_test_data, function(df) {
sum(is.na(df) == TRUE) / length(df)
})
In [934]:
#Generalize the rare titles (<= 2 instances) to Mr or Miss.
#Some titles appear in one set and not the other; e.g. Dona is in the
#test set but not the training set, so they need generalizing for prediction.
train_test_data$Title[train_test_data$Title %in% c("Lady","Mme","the Countess","Dona","Mlle","Ms")] <- "Miss"
train_test_data$Title[train_test_data$Title %in% c("Capt","Don","Major","Sir","Jonkheer")] <- "Mr"
In [935]:
library(sqldf)
#first get a total count with min and max ages for each title
sqldf("select Title,
count(*) as TitleCount,
min(Age) as Min_Age,
max(Age) as Max_Age
from 'train_test_data'
group by Title
order by count(*) desc")
Rename before creating dummies
In [936]:
#Rename columns for easy reference later
library(plyr)
train_test_data <- rename(train_test_data, c(
"Survived"="label_Survived",
"Pclass"="feature_Pclass",
"Title" = "feature_Title",
"Age" = "feature_Age",
"Fare" = "feature_Fare",
"SibSp" = "feature_SibSp",
"Parch" = "feature_Parch",
"Deck_Imp" = "feature_DeckImp",
"Embarked" = "feature_Embarked",
"Fsize" = "feature_Fsize"
))
In [937]:
str(train_test_data)
In [938]:
# some algorithms can't handle NA's; if needed, replace them, e.g.:
#train_test_data[is.na(train_test_data)] <- 0
In [939]:
library(caret)
dmy <- dummyVars(" ~ feature_Pclass +
feature_Title +
feature_SibSp +
feature_Parch +
feature_Embarked
"
, data = train_test_data, fullRank=T)
#use " ~ ." for all chr and factor columns
# use fullRank = T to eliminate 1 of the levels
trsf <- data.frame(predict(dmy, newdata = train_test_data))
str(trsf)
str(train_test_data)
In [940]:
#Now that we're done cleaning, we need a simple split of the data back into train and test
trsf <- cbind(
label_Survived = train_test_data$label_Survived,
set = train_test_data$set , trsf,
feature_Age = train_test_data$feature_Age,
feature_Fare = train_test_data$feature_Fare
)
train.data <- trsf %>%
filter(set == "Train")
test.data <- trsf %>%
filter(set == "Test")
test.data$label_Survived <- NULL #get rid of the Survived column. it was added when combined with training
test.data$set <- NULL
train.data$set <- NULL
str(train.data)
Impute Age and Fare on the train and test sets separately: I don't want to impute a field based on data outside the training set, and the same goes for the test set. Use anova for continuous variables.
In [941]:
#predict Age and Fare for train data NAs
library(rpart)
predicted_age <- rpart(
feature_Age ~ .,
data=train.data[!is.na(train.data$feature_Age),], method="anova"
)
train.data$feature_Age[is.na(train.data$feature_Age)] <- predict(
predicted_age, train.data[is.na(train.data$feature_Age),]
)
#predict Fare
predicted_fare <- rpart(
feature_Fare ~ .,
data=train.data[!is.na(train.data$feature_Fare),], method="anova"
)
train.data$feature_Fare[is.na(train.data$feature_Fare)] <- predict(
predicted_fare, train.data[is.na(train.data$feature_Fare),]
)
#Predict Age and Fare for test data NAs
library(rpart)
predicted_age <- rpart(
feature_Age ~ .,
data=test.data[!is.na(test.data$feature_Age),], method="anova"
)
test.data$feature_Age[is.na(test.data$feature_Age)] <- predict(
predicted_age, test.data[is.na(test.data$feature_Age),]
)
#predict Fare
predicted_fare <- rpart(
feature_Fare ~ .,
data=test.data[!is.na(test.data$feature_Fare),], method="anova"
)
test.data$feature_Fare[is.na(test.data$feature_Fare)] <- predict(
predicted_fare, test.data[is.na(test.data$feature_Fare),]
)
#Look at the decision trees to see which features were good for predicting Age
# Load the packages to build a fancy plot
library(rattle) # install.packages("rattle") if needed
library(rpart.plot)
library(RColorBrewer)
# Visualize your new decision tree
fancyRpartPlot(predicted_age)
fancyRpartPlot(predicted_fare)
In [942]:
#And let's see a histogram of Age after imputation:
hist(train.data$feature_Age, sqrt(nrow(train.data)), main = "Passenger Age", xlab = "Age")
Split the data into training and testing sets
In [943]:
# here's a training_set and test_set for accuracy testing
#need a stratified random split using the caret package
library(caret)
set.seed(1)
train.index <- createDataPartition(train.data$label_Survived, p = .75, list = FALSE)
trainSet <- train.data[ train.index,]
testSet <- train.data[-train.index,] #the test set gets data not in the training set
In [944]:
library(caret)
#Set the random seed
set.seed(1)
#Defining the training controls for multiple models
fitControl <- trainControl(
method = "cv",
number = 5,
savePredictions = 'final',
classProbs = T)
#Training the random forest model
model_rf <- train(
trainSet[ , names(trainSet) %like% c("feature")], #feature columns
trainSet[ , names(trainSet) %like% c("label")], #label columns
method='rf',
trControl=fitControl,
tuneLength=3)
#Predicting using random forest model
#Apply the model to the testSet
testSet$pred_rf<-predict(
object = model_rf,
testSet[ , names(testSet) %like% c("feature")])
#Checking the accuracy of the random forest model
confusionMatrix(testSet$label_Survived,testSet$pred_rf)
In [945]:
#Training the Logistic regression model
model_lr<-train(
trainSet[ , names(trainSet) %like% c("feature")], #feature columns
trainSet[ , names(trainSet) %like% c("label")], #label columns
method='glm',
trControl=fitControl,
tuneLength=3)
#Predicting using the logistic regression model
testSet$pred_lr<-predict(
object = model_lr,
testSet[ , names(testSet) %like% c("feature")])
#Checking the accuracy of the logistic regression model
confusionMatrix(testSet$label_Survived,testSet$pred_lr)
In [946]:
#Training the knn model
model_knn<-train(
trainSet[ , names(trainSet) %like% c("feature")], #feature columns
trainSet[ , names(trainSet) %like% c("label")], #label columns
method='knn',
trControl=fitControl,
tuneLength=3)
#Predicting using knn model
testSet$pred_knn<-predict(
object = model_knn,
testSet[ , names(testSet) %like% c("feature")]
)
#Checking the accuracy of the knn model
confusionMatrix(testSet$label_Survived,testSet$pred_knn)
In [947]:
# Averaging
#Predicting the probabilities
testSet$pred_rf_prob<-predict(
object = model_rf,
testSet[ , names(testSet) %like% c("feature")],
type='prob')
#add Title to predictors for lr
testSet$pred_lr_prob<-predict(
object = model_lr,
testSet[ , names(testSet) %like% c("feature")],
type='prob')
#knn
testSet$pred_knn_prob<-predict(
object = model_knn,
testSet[ , names(testSet) %like% c("feature")],
type='prob')
#Taking average of predictions
testSet$pred_avg<-(
testSet$pred_rf_prob$Y +
testSet$pred_lr_prob$Y +
testSet$pred_knn_prob$Y) / 3
#Splitting into binary classes at 0.5
testSet$pred_avg<-as.factor(ifelse(testSet$pred_avg>0.5,'Y','N'))
#Checking the accuracy of the averaged ensemble
confusionMatrix(testSet$label_Survived,testSet$pred_avg)
In [948]:
#Majority Voting
#The majority vote
testSet$pred_majority<-as.factor(ifelse(
testSet$pred_rf=='Y' & testSet$pred_knn=='Y','Y',
ifelse(testSet$pred_rf=='Y' & testSet$pred_lr=='Y','Y',
ifelse(testSet$pred_knn=='Y' & testSet$pred_lr=='Y','Y','N'))))
#Checking the accuracy of the majority-vote ensemble
confusionMatrix(testSet$label_Survived,testSet$pred_majority)
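The nested ifelse above works for exactly three models, but it doesn't scale; counting 'Y' votes generalizes to any odd number of models. A sketch with made-up predictions (the pred_* column names are illustrative):

```r
# Generalized majority vote: a row is 'Y' when more than half the models say 'Y'
votes <- data.frame(pred_rf  = c("Y", "N", "Y"),
                    pred_lr  = c("Y", "N", "N"),
                    pred_knn = c("N", "Y", "Y"))
n_yes    <- rowSums(votes == "Y")
majority <- factor(ifelse(n_yes > ncol(votes) / 2, "Y", "N"))
majority   # Y N Y
```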
In [949]:
#Weighted average
#Taking weighted average of predictions
testSet$pred_weighted_avg<-(testSet$pred_rf_prob$Y*0.25)+(testSet$pred_knn_prob$Y*0.25)+(testSet$pred_lr_prob$Y*0.5)
#Splitting into binary classes at 0.5
testSet$pred_weighted_avg<-as.factor(ifelse(testSet$pred_weighted_avg>0.5,'Y','N'))
#Checking the accuracy of the weighted-average ensemble
confusionMatrix(testSet$label_Survived,testSet$pred_weighted_avg)
In [950]:
#Stacking
#Predict using each base layer model for training data and test data
#Predicting the out of fold prediction probabilities for training data
trainSet$OOF_pred_rf<-model_rf$pred$Y[order(model_rf$pred$rowIndex)]
trainSet$OOF_pred_knn<-model_knn$pred$Y[order(model_knn$pred$rowIndex)]
trainSet$OOF_pred_lr<-model_lr$pred$Y[order(model_lr$pred$rowIndex)]
#Predicting probabilities for the test data
testSet$OOF_pred_rf<-predict(
model_rf,
testSet[ , names(testSet) %like% c("feature")],
type='prob')$Y
testSet$OOF_pred_knn<-predict(
model_knn,
testSet[ , names(testSet) %like% c("feature")],
type='prob')$Y
testSet$OOF_pred_lr<-predict(
model_lr,
testSet[ , names(testSet) %like% c("feature")],
type='prob')$Y
#Now train the top layer model on the predictions that the bottom layer models made on the training data
#Predictors for top layer models
predictors_top<-c('OOF_pred_rf','OOF_pred_knn','OOF_pred_lr')
#GBM as top layer model
model_gbm<-
train(
trainSet[ , names(trainSet) %like% c("OOF_pred")],
trainSet[ , names(trainSet) %like% c("label")], #label columns
method='gbm',
trControl=fitControl,tuneLength=3)
#predict using GBM top layer model
testSet$gbm_stacked<-predict(
model_gbm,
testSet[ , names(testSet) %like% c("OOF_pred")]
)
#Checking the accuracy of the stacked GBM model
confusionMatrix(testSet$label_Survived,testSet$gbm_stacked)
#----------------------
testSet$pred_gbm_prob<-predict(
object = model_gbm,
testSet[ , names(testSet) %like% c("OOF_pred")], #the GBM was trained on the OOF_pred columns, not the raw features
type='prob')
probs <- testSet$pred_gbm_prob$Y
# Load the ROCR library
library(ROCR)
# Make a prediction object: predi
predi <- prediction(probs, testSet$label_Survived)
# Make a performance object: perf
perf <- performance(predi, "tpr", "fpr")
auc <- performance(predi, measure = "auc")
auc <- auc@y.values[[1]]
roc.data <- data.frame(fpr=unlist(perf@x.values),
tpr=unlist(perf@y.values))
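roc.data is ready for plotting. A minimal self-contained sketch using toy probabilities and labels (the real call would use probs and testSet$label_Survived from above):

```r
library(ROCR)
library(ggplot2)
# Toy probabilities and labels standing in for the real model output
probs  <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)
labels <- c("Y", "Y", "N", "Y", "N", "N")
predi  <- prediction(probs, labels)
perf   <- performance(predi, "tpr", "fpr")
auc    <- performance(predi, measure = "auc")@y.values[[1]]
roc.df <- data.frame(fpr = unlist(perf@x.values), tpr = unlist(perf@y.values))
ggplot(roc.df, aes(x = fpr, y = tpr)) +
  geom_line(color = "darkred") +
  geom_abline(linetype = "dashed") +   # chance diagonal
  labs(title = sprintf("ROC curve (AUC = %.3f)", auc),
       x = "False positive rate", y = "True positive rate")
```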
In [951]:
str(testSet)