Kaggle R Tutorial on Machine Learning

A free course from DataCamp

In this tutorial we will work through Kaggle's Titanic competition in R, using machine learning algorithms.

Getting Familiar with the Kaggle Data Sets

In this competition we are asked to apply machine learning techniques in R to predict a passenger's chance of surviving.

The data that Kaggle provides is stored in CSV files. We can load them with the read.csv() function. We will use the training set to build our model and the test set to validate it.


In [1]:
# load the train data using read.csv()
train_url <- "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train <- read.csv(train_url)

In [2]:
# load the test data using read.csv()
test_url <- "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test <- read.csv(test_url)

In [6]:
# take a quick look at the train data
head(train)


Out[6]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 |  | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 |  | S
6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 |  | Q

In [7]:
# take a quick look at the test data
tail(test)


Out[7]:
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1304 | 3 | Henriksson, Miss. Jenny Lovisa | female | 28 | 0 | 0 | 347086 | 7.775 |  | S
1305 | 3 | Spector, Mr. Woolf | male | NA | 0 | 0 | A.5. 3236 | 8.05 |  | S
1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39 | 0 | 0 | PC 17758 | 108.9 | C105 | C
1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.25 |  | S
1308 | 3 | Ware, Mr. Frederick | male | NA | 0 | 0 | 359309 | 8.05 |  | S
1309 | 3 | Peter, Master. Michael J | male | NA | 1 | 1 | 2668 | 22.3583 |  | C

Understanding our Data:

Understanding the data we have is the most essential step before analyzing it.

  • test and train are data frames, R's way of representing a dataset
  • str() lets you explore a data frame: it shows the data types, the number of observations, and the number of variables

In [8]:
# Observing train data
str(train)


'data.frame':	891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 416 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 525 596 662 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
 $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

In [9]:
# Observing test data
str(test)


'data.frame':	418 obs. of  11 variables:
 $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
 $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
 $ Name       : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
 $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
 $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
 $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
 $ Ticket     : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 268 ...
 $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
 $ Cabin      : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...

How many people in our training set survived the Titanic disaster?

  • the table() function gives us the absolute numbers
  • the prop.table() function, combined with table(), gives us the proportions

The table() command can also help us explore which variables have predictive value.

Let's start examining the data. The first things to look at are the overall survival rate and how the survival rates differ by gender.


In [10]:
# absolute numbers
table(train$Survived)


Out[10]:
  0   1 
549 342 

In [11]:
# proportions
prop.table(table(train$Survived)) # get proportions according to specified table


Out[11]:
        0         1 
0.6161616 0.3838384 

0 represents people who died, and 1 represents people who survived.

We can also do a two-way comparison that includes gender:


In [13]:
# two-way table: sex vs. survival
table(train$Sex, train$Survived)


Out[13]:
        
           0   1
  female  81 233
  male   468 109

In [16]:
# row-wise comparison with proportions
prop.table(table(train$Sex, train$Survived),1)


Out[16]:
        
                 0         1
  female 0.2579618 0.7420382
  male   0.8110919 0.1889081
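
For comparison, the second argument of prop.table() sets the margin: 1 gives row-wise proportions (the survival rate within each sex, as above), while 2 gives column-wise proportions (the sex breakdown within each outcome):

prop.table(table(train$Sex, train$Survived), 2) # column-wise: share of each sex among the deceased (0) and the survivors (1)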

The second thing to consider is whether age plays a role in survival. Children probably have the highest survival rate, since they are the first to be saved by their families.

Let's test this idea by creating a new column called child.


In [19]:
train$child <- NA # create a new column filled with NA values
train$child[train$Age < 18] <- 1 # assign 1 when the age is below 18 (child)
train$child[train$Age >= 18] <- 0 # assign 0 otherwise (adult)
# Then do a two-way comparison on the number of children vs adults 
prop.table(table(train$child, train$Survived),1) # row-wise


Out[19]:
   
            0         1
  0 0.6189684 0.3810316
  1 0.4601770 0.5398230

While less pronounced than for gender, age also seems to have an impact on survival.

From these observations we learned that, in our training set, females had over a 50% chance of surviving and males had less than a 50% chance of surviving.

Using this very basic discovery we can make our first prediction: "all females in the test set survive and all males in the test set die."

We can test this first, very basic prediction on the test data we downloaded. There is no Survived column in the test data, so we have to add one and fill it with our predicted values. When we submit our results, Kaggle uses this Survived column to score our performance.


In [21]:
# First we prepare a copy of test
test_one <- test

In [22]:
test_one$Survived <- 0 # create the column and initialize it to 0
test_one$Survived[test_one$Sex == 'female'] <- 1

In [23]:
head(test_one)
# writing to csv file:
# write.csv(test_one, file = "submission1.csv")


Out[23]:
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived
892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 |  | Q | 0
893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47 | 1 | 0 | 363272 | 7 |  | S | 1
894 | 2 | Myles, Mr. Thomas Francis | male | 62 | 0 | 0 | 240276 | 9.6875 |  | Q | 0
895 | 3 | Wirz, Mr. Albert | male | 27 | 0 | 0 | 315154 | 8.6625 |  | S | 0
896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22 | 1 | 1 | 3101298 | 12.2875 |  | S | 1
897 | 3 | Svensson, Mr. Johan Cervin | male | 14 | 0 | 0 | 7538 | 9.225 |  | S | 0

Part 2: Machine Learning with Decision Trees

In the previous part we did all the slicing and dicing ourselves to find subsets that have a higher chance of surviving. A decision tree automates this process for us and outputs a flowchart-like structure that is easy to interpret (we'll build one ourselves in the next exercise).

Conceptually, the decision tree algorithm starts with all the data at the root node and scans all the variables for the best one to split on. Once a variable is chosen, the split is made and the process repeats one level (one node) down. The final nodes at the bottom of the decision tree are known as terminal nodes, and the majority vote of the observations in a terminal node determines the prediction for new observations that end up there.

To create our first decision tree, we will use R's rpart package.


In [24]:
# Calling the library
library('rpart')

The rpart() function creates our first decision tree. It takes several arguments:

  • formula: the variable of interest and the variables used to predict it (formula = Survived ~ Sex + Age)
  • data: the data set used to build the decision tree (train in our case)
  • method: the type of prediction you want; we want to predict a categorical variable, so classification: method = "class"

# Example syntax of rpart()
my_tree <- rpart(Survived ~ Sex + Age,
                 data = train,
                 method = "class")

We can also plot the resulting tree using plot(my_tree) and text(my_tree).


In [ ]:
# Creating the decision tree
my_tree <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
                 data = train,
                 method = "class")
plot(my_tree) # draw the tree structure
text(my_tree) # add the split labels to the plot

# Load in the packages to build a fancy plot
library(rattle)
library(rpart.plot)
library(RColorBrewer)

# Time to plot your fancy tree
fancyRpartPlot(my_tree)

We can use the predict() function to generate our answers:

predict(my_tree, test, type = "class")

Before submitting to Kaggle, you'll have to convert your predictions to a CSV file with exactly 418 entries and 2 columns:

  • PassengerId
  • Survived

In [36]:
# Making a prediction
my_prediction <- predict(my_tree, test, type="class")

In [37]:
# Creating a solution data.frame
my_solution <- data.frame(PassengerId = test$PassengerId,
                          Survived = my_prediction)

In [39]:
# Check the solution data.frame
head(my_solution)


Out[39]:
PassengerId | Survived
892 | 0
893 | 0
894 | 0
895 | 0
896 | 1
897 | 0

In [38]:
nrow(my_solution)


Out[38]:
418

In [40]:
# Making a csv file for solution
write.csv(my_solution, file = "my_solution.csv", row.names = FALSE)

When we submit this to Kaggle, we get a score of roughly 0.78, which placed me at rank 1999 on the leaderboard.

We can improve our model by giving rpart() a more complex control setting, using two more parameters:

  • cp determines when the splitting of the decision tree stops.
  • minsplit determines the minimum number of observations a node must contain before a split is attempted.

In [42]:
model2 <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
                data = train, method = "class",
                control = rpart.control(minsplit = 50, cp = 0))

# fancyRpartPlot(model2)
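
To see how this cp setting plays out, you can inspect the complexity table that rpart stores with the fitted model; printcp() from the rpart package prints it (just a quick inspection, not needed for the submission):

printcp(model2) # cp values, number of splits, and cross-validated error for each subtree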

Tuning rpart.control() can give more precise results. But that's not all: we can also re-engineer the data to get more out of it.

Enter feature engineering: creatively engineering your own features by combining the different existing variables.

Let's see an example of feature engineering. In this example we will create a family_size variable.

A valid assumption is that larger families need more time to get together on a sinking ship, and hence have less chance of surviving.

Family size is determined by the variables SibSp and Parch, which indicate the number of family members a certain passenger is traveling with. So when doing feature engineering, you add a new variable family_size, which is the sum of SibSp and Parch plus one (the observation itself), to the test and train set.


In [43]:
train_two <- train

In [45]:
train_two$family_size <- train_two$SibSp + train_two$Parch + 1

In [47]:
model3 <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + family_size,
                data = train_two, method = "class")
# fancyRpartPlot(model3)

Well done. If you have a close look at the resulting decision tree, you will see that family_size is not included. Apparently other variables played a more important role. This is part of the game as well: sometimes things will not turn out as expected, but this is where you can make the difference.
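
One quick way to confirm which variables the tree actually uses is to print the fitted rpart object and scan the splits it lists:

print(model3) # the listed splits are the variables the tree uses; per the note above, family_size is not among them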

Another factor might be class differences, which are partly reflected in passengers' titles. Let's have a look.

First we need to create new test and train sets with a Title column included.


In [48]:
getTitle <- function(data) {
  # The title sits between the comma after the surname and the following period,
  # e.g. "Braund, Mr. Owen Harris" -> "Mr"
  title.start <- regexpr("\\,[A-Z ]{1,20}\\.", data$Name, TRUE)       # position of the comma
  title.end <- title.start + attr(title.start, "match.length") - 1    # position of the period
  data$Title <- substr(data$Name, title.start + 2, title.end - 1)     # drop the ", " and the "."
  return(data$Title)
}

In [56]:
train_new <-train
test_new <- test

In [57]:
train_new$Title <- getTitle(train_new)
test_new$Title <- getTitle(test_new)
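
Before feeding Title into a model, it is worth a quick look at what the extraction produced, for example:

table(train_new$Title) # counts per extracted title (Mr, Mrs, Miss, Master, ...)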

In [58]:
model4 <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title, 
                data = train_new, method = "class")

In [ ]:
# Visualize model4
fancyRpartPlot(model4)

In [ ]:
my_prediction2 <- predict(model4, test_new, type = "class")

In [ ]:
my_solution2 <- data.frame(PassengerId = test_new$PassengerId, Survived = my_prediction2)
write.csv(my_solution2, file = "my_solution2.csv", row.names = FALSE)

Part 3: More Improvements

Random Forest is a very popular machine learning technique, often used in prediction tasks like this one. We will only touch on it briefly here to keep things simple.

In layman's terms, the Random Forest technique handles the overfitting problem you faced with decision trees. It grows multiple (very deep) classification trees using the training set.

At prediction time, each tree is used to come up with a prediction and every outcome is counted as a vote. For example, if you have trained 3 trees, with 2 saying a passenger in the test set will survive and 1 saying he will not, the passenger will be classified as a survivor.

This approach of overtraining trees, but letting the majority vote count as the actual classification decision, avoids overfitting.

Random Forest requires your data to have no missing values.

From this point on we will work with train and test combined into a single data frame called all_data.

We will fill in the missing values, mostly with the most common values, and with a small model where that makes more sense.


In [64]:
load('all_data.RData')

In [65]:
tail(all_data)


Out[65]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | family_size | Title
1304 | NA | 3 | Henriksson, Miss. Jenny Lovisa | female | 28 | 0 | 0 | 347086 | 7.775 |  | S | 1 | Miss
1305 | NA | 3 | Spector, Mr. Woolf | male | NA | 0 | 0 | A.5. 3236 | 8.05 |  | S | 1 | Mr
1306 | NA | 1 | Oliva y Ocana, Dona. Fermina | female | 39 | 0 | 0 | PC 17758 | 108.9 | C105 | C | 1 | Lady
1307 | NA | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.25 |  | S | 1 | Mr
1308 | NA | 3 | Ware, Mr. Frederick | male | NA | 0 | 0 | 359309 | 8.05 |  | S | 1 | Mr
1309 | NA | 3 | Peter, Master. Michael J | male | NA | 1 | 1 | 2668 | 22.3583 |  | C | 3 | Master
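
Before filling anything in, it helps to see where the gaps actually are. A quick check (note that is.na() does not flag empty strings, so the blank embarkment codes are looked up separately):

colSums(is.na(all_data))       # NA counts per column (Survived is NA for the test rows by construction)
which(all_data$Embarked == "") # rows with a blank embarkment code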

In [66]:
# Passenger on row 62 and 830 do not have a value for embarkment. 
# Since many passengers embarked at Southampton, we give them the value S.
all_data$Embarked[c(62, 830)] <- "S"

In [67]:
# Factorize embarkment codes.
all_data$Embarked <- factor(all_data$Embarked)

In [68]:
# Passenger on row 1044 has an NA Fare value. Let's replace it with the median fare value.
all_data$Fare[1044] <- median(all_data$Fare, na.rm = TRUE)

In [69]:
# How to fill in missing Age values?
# We predict a passenger's Age from the other variables with a decision tree model.
# This time we use method = "anova", since we are predicting a continuous variable.
library(rpart)
predicted_age <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + family_size,
                       data = all_data[!is.na(all_data$Age),], method = "anova")

In [70]:
all_data$Age[is.na(all_data$Age)] <- predict(predicted_age, all_data[is.na(all_data$Age),])
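
A quick check that the imputation left no missing ages behind:

sum(is.na(all_data$Age)) # should be 0 now that the predicted ages have been filled in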

In [71]:
# Split the data back into a train set and a test set
train <- all_data[1:891,]
test <- all_data[892:1309,]

Okay, now we have gotten rid of the NA values in all_data and split it back into train and test, just as it was before.

For Random Forest analysis, R provides the randomForest() function in the randomForest package.

We call it as we did with rpart():

  • First you provide the formula. There is no class argument here to tell the function you're predicting a categorical variable, so you need to turn Survived into a factor with two levels: as.factor(Survived) ~ Pclass + Sex + Age
  • The data argument takes the train data frame.
  • When you set the importance argument to TRUE, you can inspect variable importance.
  • The ntree argument specifies the number of trees to grow. Keep this number lower if you only have limited computational power at your disposal.

In [73]:
library('randomForest')

In [74]:
set.seed(111) # set seed for reproducibility
my_forest <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title,
                         data = train, importance=TRUE, ntree=1000)
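
Printing the fitted forest gives a rough idea of performance without touching the test set: randomForest reports an out-of-bag (OOB) error estimate and a confusion matrix for the training data:

print(my_forest) # OOB error estimate and confusion matrix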

In [75]:
my_prediction <- predict(my_forest, test)

In [76]:
my_solution <- data.frame(PassengerId = test$PassengerId, Survived = my_prediction)
write.csv(my_solution, file = "my_solution3.csv", row.names = FALSE)

We set importance=TRUE, so we can now see which variables are important using:

varImpPlot(my_forest)

Running this function produces two plots. The first shows the mean decrease in accuracy: how much worse the model would perform without each variable, so a large decrease (a high value on the x-axis) indicates a highly predictive variable. The second plot shows the mean decrease in the Gini measure; the higher a variable scores here, the more important it is to the model.
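
If you prefer the numbers behind the plot, importance() returns the same information as a matrix, with one row per variable:

importance(my_forest) # numeric variable-importance scores backing varImpPlot()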