In this dataset we are asked to apply machine learning techniques in R to predict a passenger's chance of surviving.
The data Kaggle provides is stored as CSV files. We can load the data using the read.csv()
function. We will use the training set to build our model, and the test set to validate it.
In [1]:
# load the train data using read.csv()
train_url <- "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train <- read.csv(train_url)
In [2]:
# load the test data using read.csv()
test_url <- "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test <- read.csv(test_url)
In [6]:
# just have a glimpse at our train data
head(train)
Out[6]:
In [7]:
# just have a glimpse at our test data
tail(test)
Out[7]:
Understanding our Data:
Understanding the data we have is the most essential step before analyzing it.
test and train are data frames, R's way of representing a dataset. The str()
function lets us explore a data frame: it reports the data types, the number of observations, and the number of variables.
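As an optional complementary check (base R functions, not part of the original notebook flow), dim() and summary() give the dimensions and a quick per-column summary:
dim(train)      # number of rows and columns
summary(train)  # per-column summary statistics, including NA counts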
In [8]:
# Observing train data
str(train)
In [9]:
# Observing test data
str(test)
How many people in our training set survived the Titanic disaster?
The table() function gives us the absolute numbers, and wrapping it in prop.table() gives us the proportions. These two functions can also help us explore which variables have predictive value.
Let's start examining the data. The first things to look at are the overall survival rate and how survival rates differ by gender.
In [10]:
# absolute numbers
table(train$Survived)
Out[10]:
In [11]:
# proportions
prop.table(table(train$Survived)) # get proportions according to specified table
Out[11]:
0 represents people who died, and 1 who lived.
We can also do two-way comparison including gender as well:
In [13]:
# Two-way table: Sex (rows) vs. Survived (columns)
table(train$Sex, train$Survived)
Out[13]:
In [16]:
# row-wise comparison with proportions
prop.table(table(train$Sex, train$Survived),1)
Out[16]:
The second thing to consider is whether age plays a role in survival. Children probably have the highest survival rate, since they would be saved first by their families.
Let's test this idea by creating a new column called child.
In [19]:
train$child <- NA # create a new column filled with NA values
train$child[train$Age < 18 ] <- 1 # passengers younger than 18 are marked as children (1)
train$child[train$Age >= 18] <- 0 # all other passengers with a known age are marked as adults (0)
# Then do a two-way comparison on the number of children vs adults
prop.table(table(train$child, train$Survived),1) # row-wise
Out[19]:
Though less obvious than for gender, age also seems to have an impact on survival.
From these observations we discovered that, in our training set, females had over a 50% chance of surviving and males had less than a 50% chance.
Using this very basic discovery, we can make our first prediction: "all females in the test set survive and all males in the test set die."
To test this prediction we use the test data we downloaded. There is no Survived
column in the test data, so we add that column with our predicted values. When we submit our results, Kaggle will use this Survived
column to score our performance.
In [21]:
# First we prepare a copy of test
test_one <- test
In [22]:
test_one$Survived <- 0 # create the column and initialize it to 0
test_one$Survived[test_one$Sex == 'female'] <- 1
In [23]:
head(test_one)
# writing to csv file:
# write.csv(test_one, file = "submission1.csv")
Out[23]:
In the previous part we did all the slicing and dicing ourselves to find subsets with a higher chance of surviving. A decision tree automates this process for us and outputs a flowchart-like structure that is easy to interpret (we'll build one ourselves next).
Conceptually, the decision tree algorithm starts with all the data at the root node and scans all the variables for the best one to split on. Once a variable is chosen, the data is split and the algorithm moves down one level (one node) and repeats. The final nodes at the bottom of the tree are known as terminal nodes, and the majority vote of the training observations in a terminal node determines the prediction for new observations that end up there.
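To make the majority-vote idea concrete, here is a tiny illustrative sketch with made-up outcomes (toy values, not taken from the Titanic data):
# Survived values of the training observations that ended up in one terminal node (toy example)
node_outcomes <- c(1, 1, 0, 1, 0)
# The class with the most votes becomes the prediction for new observations reaching this node
names(which.max(table(node_outcomes)))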
To create our first decision tree, we will use R's rpart
package.
In [24]:
# Calling the library
library('rpart')
The rpart()
function creates our first decision tree. The function takes multiple arguments:
- formula: specifies the variable of interest and the variables used for prediction, e.g. formula = Survived ~ Sex + Age
- data: the data set used to build the decision tree (train in our case)
- method: the type of prediction we want. Since we are predicting a categorical variable, we use classification: method = "class"
# Example Syntax of rpart()
my_tree <- rpart(Survived ~ Sex + Age,
data = train,
method ="class")
We can also plot the resulting tree using plot(my_tree)
and add the split labels with text(my_tree)
In [ ]:
# Creating the decision tree
my_tree <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
data = train,
method = "class")
plot(my_tree) # plot the decision tree
text(my_tree) # add the split labels to the plot
# Same formula again, stored under a second name for the fancier plot below
my_tree_two <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
data = train,
method = "class")
plot(my_tree_two) # plot the decision tree
text(my_tree_two) # add the split labels to the plot
# Load in the packages to build a fancy plot
library(rattle)
library(rpart.plot)
library(RColorBrewer)
# Time to plot your fancy tree
fancyRpartPlot(my_tree_two)
We can use the predict()
function to generate our answers:
predict(my_tree, test, type = "class")
Before submitting to Kaggle, we have to convert our predictions to a CSV file with exactly 418 entries and 2 columns (PassengerId and Survived):
In [36]:
# Making a prediction
my_prediction <- predict(my_tree, test, type="class")
In [37]:
# Creating a solution data.frame
my_solution <- data.frame(PassengerId = test$PassengerId,
Survived = my_prediction)
In [39]:
# Checking the solution data.frame, just in case:
head(my_solution)
Out[39]:
In [38]:
nrow(my_solution)
Out[38]:
In [40]:
# Making a csv file for solution
write.csv(my_solution, file = "my_solution.csv", row.names = FALSE)
When we submit this to Kaggle, we get a score of about 0.78, which placed me around rank 1999 on the leaderboard.
We can improve our model by making the rpart() call more complex with two more parameters, passed via rpart.control():
- cp: the complexity parameter, which determines when the splitting of the decision tree stops (cp = 0 lets the tree grow as far as the other constraints allow).
- minsplit: the minimum number of observations a node must contain before a split is attempted.
In [42]:
model2 <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
data = train, method = "class",
control = rpart.control(minsplit = 50, cp = 0))
# fancyRpartPlot(model2)
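Because cp = 0 lets the tree grow until the minsplit limit alone stops it, it can be worth checking the cross-validated error and pruning back. As an optional sketch (not part of the original submission), the rpart package's printcp() and prune() can be used for this:
printcp(model2)  # cross-validated error (xerror) for each complexity value
best_cp <- model2$cptable[which.min(model2$cptable[, "xerror"]), "CP"]
model2_pruned <- prune(model2, cp = best_cp)  # prune the tree back to the best cp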
The rpart.control() arguments let us tune how far the tree grows, which can squeeze out a bit more accuracy. But that's not all: we can also re-engineer the data to get better results.
Enter feature engineering: creatively building your own features by combining the existing variables.
Let's see an example of feature engineering. Here we will create family_size.
A reasonable assumption is that larger families need more time to get together on a sinking ship, and hence have a lower chance of surviving.
Family size is determined by the variables SibSp
and Parch
, which indicate the number of family members a passenger is traveling with. So when doing feature engineering, we add a new variable family_size
, which is the sum of SibSp
and Parch
plus one (the passenger themselves), to the train and test sets.
In [43]:
train_two <- train
In [45]:
train_two$family_size <- train_two$SibSp + train_two$Parch + 1
In [47]:
model3 <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + family_size,
data = train_two, method = "class")
# fancyRpartPlot(model3)
If you take a close look at the resulting decision tree, you will see that family_size
is not included: apparently other variables played a more important role. This is part of the game as well. Sometimes things will not turn out as expected, and it is here that you can make the difference.
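One quick way to verify this is to look at the importance scores that rpart stores on the fitted model (note that this vector also credits variables used only in surrogate splits):
model3$variable.importance  # named vector of importance scores; unused variables rank low or are absent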
Another angle is social class, which is reflected in a passenger's title. Let's have a look.
First we need to create new train and test data with a Title column included.
In [48]:
getTitle <- function(data) {
    # Match the ", Title." part of each name, e.g. ", Mr." in "Braund, Mr. Owen Harris"
    title.comma.start <- regexpr("\\,[A-Z ]{1,20}\\.", data$Name, TRUE)
    title.dot.end <- title.comma.start + attr(title.comma.start, "match.length") - 1
    # Extract the text between the ", " and the "." (the title itself)
    data$Title <- substr(data$Name, title.comma.start + 2, title.dot.end - 1)
    return(data$Title)
}
In [56]:
train_new <- train
test_new <- test
In [57]:
train_new$Title <- getTitle(train_new)
test_new$Title <- getTitle(test_new)
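As an optional sanity check on the extracted titles:
table(train_new$Title)  # frequency of each extracted title, e.g. Mr, Mrs, Miss, Master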
In [58]:
model4 <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title,
data = train_new, method = "class")
In [ ]:
# Visualize model4
fancyRpartPlot(model4)
In [ ]:
my_prediction2 <- predict(model4, test_new, type = "class")
In [ ]:
my_solution2 <- data.frame(PassengerId = test_new$PassengerId, Survived = my_prediction2)
write.csv(my_solution2, file = "my_solution2.csv", row.names = FALSE)
Random Forest is a very popular machine learning technique, often used in prediction tasks like this one. To keep things simple we will only cover the basics here.
In layman's terms, the Random Forest technique handles the overfitting problem we faced with a single decision tree. It grows multiple (very deep) classification trees using the training set.
At prediction time, each tree comes up with its own prediction and every outcome is counted as a vote. For example, if we have trained 3 trees and 2 say a passenger in the test set will survive while 1 says he will not, the passenger is classified as a survivor.
This approach of overgrowing individual trees, but letting the majority vote decide the actual classification, avoids overfitting.
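The 2-versus-1 example above, written out as a tiny sketch with made-up votes (not the output of a fitted model):
votes <- c(tree1 = 1, tree2 = 1, tree3 = 0)  # hypothetical per-tree predictions for one passenger
ifelse(sum(votes) > length(votes) / 2, "survives", "does not survive")  # majority vote decides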
Random forest requires our data to have no missing values.
From here on we work with a combined data frame, all_data
, that stacks the train and test sets, and we fill in the missing (NA) values, for example with the most common value or a model-based estimate.
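The notebook loads a prepared all_data.RData file below. For reference, here is a minimal sketch of how such a combined frame could be assembled, assuming both sets carry the same engineered columns; the prepared file already contains columns like Title and family_size:
test_new$Survived <- NA                                    # test has no Survived column, add a placeholder
common_cols <- intersect(names(train_new), names(test_new)) # keep only columns both frames share
all_data_sketch <- rbind(train_new[, common_cols], test_new[, common_cols])
colSums(is.na(all_data_sketch))                            # count missing values per column before filling them in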
In [64]:
load('all_data.RData')
In [65]:
tail(all_data)
Out[65]:
In [66]:
# Passengers on rows 62 and 830 do not have a value for embarkment.
# Since many passengers embarked at Southampton, we give them the value S.
all_data$Embarked[c(62, 830)] <- "S"
In [67]:
# Factorize embarkment codes.
all_data$Embarked <- factor(all_data$Embarked)
In [68]:
# Passenger on row 1044 has an NA Fare value. Let's replace it with the median fare value.
all_data$Fare[1044] <- median(all_data$Fare, na.rm = TRUE)
In [69]:
# How to fill in missing Age values?
# We predict a passenger's Age using the other variables and a decision tree model.
# This time we use method = "anova" since we are predicting a continuous variable.
library(rpart)
predicted_age <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + family_size,
data = all_data[!is.na(all_data$Age),], method = "anova")
In [70]:
all_data$Age[is.na(all_data$Age)] <- predict(predicted_age, all_data[is.na(all_data$Age),])
In [71]:
# Split the data back into a train set and a test set
train <- all_data[1:891,]
test <- all_data[892:1309,]
Okay, now we have gotten rid of the NA values in all_data
and split it back into train and test sets, just as before.
For Random Forest analysis, R has a function called randomForest()
in the randomForest
package.
We call it much like we called rpart():
- formula: there is no method = "class" argument here to tell the function that we are predicting a categorical variable, so we need to turn Survived into a factor with two levels: as.factor(Survived) ~ Pclass + Sex + Age
- data: takes the train data frame.
- importance: setting this argument to TRUE lets us inspect variable importance afterwards.
- ntree: specifies the number of trees to grow. Limit this when you have only limited computational power at your disposal.
In [73]:
library('randomForest')
In [74]:
set.seed(111) # set seed for reproducibility
my_forest <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title,
data = train, importance=TRUE, ntree=1000)
In [75]:
my_prediction <- predict(my_forest, test)
In [76]:
my_solution <- data.frame(PassengerId = test$PassengerId, Survived = my_prediction)
write.csv(my_solution, file = "my_solution3.csv", row.names = FALSE)
We set importance=TRUE
, so we can now see which variables are important using:
varImpPlot(my_forest)
When running the function, two graphs appear. The first, the accuracy plot, shows how much worse the model would perform without each variable, so a large decrease (a high value on the x-axis) indicates a highly predictive variable. The second plot shows the mean decrease in the Gini index; the higher a variable scores here, the more important it is for the model.
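If we prefer numbers to the plot, the same information is available as a matrix via importance() from the randomForest package:
importance(my_forest)  # includes MeanDecreaseAccuracy and MeanDecreaseGini columns per variable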