In [ ]:
require(randomForest)

Score board

No. score
1 0.75120
2 0.75598
  • The first try scored even worse than the ladies-first baseline.
  • The second try adds the "Embarked" feature.

To improve

  • Below I initially ignored Embarked: in the training data some Embarked values are empty, while in the test data Embarked is always present, so the factor levels of the two columns differ and randomForest cannot handle the mismatch.
    • This has been fixed by dropping the training rows with an empty Embarked.
  • I ignored "Name"/"Cabin"/"Ticket" because they have more than 53 levels, which randomForest cannot handle.
  • Some training rows contain NA entries; I dropped them during training, but they might be worth including.
  • Try to figure out why some of the predictions are NA.
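The factor-level mismatch described above can also be fixed without dropping rows, by re-leveling both columns over the union of their levels. A minimal sketch (the values below are invented stand-ins for the real Embarked column):

```r
# Toy data standing in for the Embarked column (values invented)
train_embarked <- factor(c("S", "C", "Q", ""))   # training has an empty level
test_embarked  <- factor(c("S", "C", "Q"))       # test does not

# Drop the empty entries, then re-level both columns over the union of levels
train_embarked <- droplevels(train_embarked[train_embarked != ""])
common <- union(levels(train_embarked), levels(test_embarked))
train_embarked <- factor(train_embarked, levels = common)
test_embarked  <- factor(test_embarked,  levels = common)

identical(levels(train_embarked), levels(test_embarked))  # TRUE
```

Once both factors share identical levels, randomForest can train on one and predict on the other without complaint.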

In [ ]:
set.seed(20121228)
train <- read.csv("train.csv")
# train.nona <- na.omit(train)
# drops <- c("PassengerId", "Name", "Cabin", "Ticket", "Embarked")
drops <- c("PassengerId", "Name", "Cabin", "Ticket")
train <- transform(train, Survived = as.factor(Survived))
train <- train[ , !(names(train) %in% drops)]
train <- subset(train, Embarked != "")
train <- transform(train, Embarked = as.factor(as.character(Embarked)))
# train <- train[train$Embarked != "",]
summary(train)

In [ ]:
# This is for checking the "Embarked" feature
# e <- train$Embarked
# str(e)

In [ ]:
# There are about 889 rows; take 720 of them for training and the rest for validation
train_row <- sample(1:nrow(train), 720)

In [ ]:
rf.titanic <- randomForest(Survived ~ ., data = train, na.action = na.omit, subset = train_row)

In [ ]:
# Check the data structure
str(rf.titanic)

In [ ]:
train[train_row[c(2,4)],]

In [ ]:
# Out-of-bag accuracy: rf.titanic$predicted holds the OOB prediction for each training row
sum(train[names(rf.titanic$predicted), ]$Survived == rf.titanic$predicted) / length(rf.titanic$predicted)

In [ ]:
# Inspect the first tree in the forest; see ?getTree for what the columns mean
getTree(rf.titanic, 1)

In [ ]:
res <- predict(rf.titanic, train[train_row[2],], predict.all = TRUE)
str(res)

Make the prediction on training data


In [ ]:
pred_train <- predict(rf.titanic, train[train_row,])
sum(is.na(pred_train))
head(pred_train)
pred_train[is.na(pred_train)] <- 0  # replace NA predictions with 0 (did not survive)
print("Correct rate on training set: ")
rate <- with(data = train[train_row,], sum(Survived == pred_train) / length(pred_train))
print(rate)
length(pred_train)

Make the prediction on validation data


In [ ]:
pred_validation <- predict(rf.titanic, train[-train_row,])
length(pred_validation)
sum(is.na(pred_validation))
pred_validation[is.na(pred_validation)] <- 0
res_validation <- train$Survived[-train_row]
# pred_validation <- as.numeric(pred_validation) - 1
# pred_validation
# res_validation
print("Correct rate on validation set: ")
print(sum(res_validation == pred_validation) / length(res_validation))

Make the prediction on test data


In [ ]:
test <- read.csv("test.csv")
test <- transform(test, Survived = 0)  # placeholder Survived column
test[1, "Survived"] <- 1               # ensure the Survived factor gets both levels "0" and "1"
test <- transform(test, Survived = as.factor(Survived))
test <- test[ , !(names(test) %in% drops)]
summary(test)

In [ ]:
pred <- predict(rf.titanic, test)
str(pred)

In [ ]:
result.titanic <- data.frame(PassengerId = 892:1309, Survived = as.numeric(pred) - 1)
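The `as.numeric(pred) - 1` conversion above relies on the fact that `as.numeric()` on a factor returns the 1-based level *index*, which only matches the labels when the levels happen to be "0" and "1". A small sketch of the safer label-based conversion:

```r
# Toy predictions with levels "0","1" (values invented)
pred <- factor(c("1", "0", "1"), levels = c("0", "1"))

unsafe <- as.numeric(pred) - 1            # relies on the level order
safe   <- as.numeric(as.character(pred))  # uses the labels themselves

identical(unsafe, safe)  # TRUE here, but only because the levels are "0","1"
```

Going through `as.character()` first stays correct even if the factor levels were something like "No"/"Yes" mapped elsewhere.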

In [ ]:
summary(result.titanic)
str(result.titanic)
head(result.titanic)
  • We can see that quite a few predictions are NA, so assume those passengers did not survive.
  • This might be improved: try to figure out why they are NA first.

In [ ]:
sum(is.na(result.titanic$Survived))
result.titanic$Survived[is.na(result.titanic$Survived)] <- 0

In [ ]:
result.titanic

In [ ]:
write.csv(result.titanic, file = "my_random_forest_2.csv", row.names = FALSE)

The following part tries to find out why some predictions are NA.

The reason looks pretty simple: there are NA values in the training set.
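One way to avoid the NA predictions is to impute the missing values before calling predict (the randomForest package also ships `na.roughfix()` for this). A minimal sketch of plain median imputation, using an invented toy frame in place of the Titanic data:

```r
# Median-impute every numeric column so predict() never sees an NA
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

# Toy frame standing in for the Titanic data (values invented)
df <- data.frame(Age  = c(22, NA, 35, 28),
                 Fare = c(7.25, 71.3, NA, 8.05))
df <- impute_median(df)

sum(is.na(df))  # 0
```

Imputing this way keeps every row, instead of silently dropping them with `na.omit` and getting NA back for the dropped rows at prediction time.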