Predict Titanic survival: Kaggle Competition
VARIABLE DESCRIPTIONS:
survival    Survival (0 = No; 1 = Yes)
pclass      Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name        Name
sex         Sex
age         Age
sibsp       Number of Siblings/Spouses Aboard
parch       Number of Parents/Children Aboard
ticket      Ticket Number
fare        Passenger Fare
cabin       Cabin
embarked    Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
Age is in Years; Fractional if Age less than One (1)
If the Age is Estimated, it is in the form xx.5
With respect to the family relation variables (i.e. sibsp and parch),
some relations were ignored. The following definitions were used
for sibsp and parch.
Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws. Some children travelled
only with a nanny, therefore parch=0 for them. As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
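Per the notes above, an age of the form xx.5 marks an estimate, while fractions below 1 are real infant ages, so estimated ages can be flagged programmatically. A minimal sketch on example values (not rows from the data set):

```r
# Flag estimated ages: per the data dictionary, ages ending in .5 are
# estimates, except fractional ages under 1, which are real infant ages
ages <- c(22, 28.5, 0.83, 45, 33.5)   # example values
estimated <- ages >= 1 & (ages %% 1) == 0.5
estimated   # FALSE TRUE FALSE FALSE TRUE
```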
Import the test and training sets
In [912]:
# Import the training and test sets. Use an if with a flag variable so I don't have to re-download every time I run the script
var1 <- 2
if (var1 == 2) {
library(data.table)
train_data_raw <- fread(
"https://drive.google.com/uc?export=download&id=0B1HWpqllmsS8Y01HRjVaanRQTlE",
stringsAsFactors=F,
na.strings = c("NA","")
)
test_data_raw <- fread(
"https://drive.google.com/uc?export=download&id=0B1HWpqllmsS8bmY4V3d2RWRqNkE",
stringsAsFactors=F,
na.strings = c("NA","")
)
}
train.data <- train_data_raw
test.data <- test_data_raw
In [913]:
#overview of the data
str(train.data)
str(test.data)
Combine the test and train data for analysis, cleaning, and manipulation. Any munging done to the training set needs to be applied to the test set as well.
In [914]:
# give the training and test sets a column identifying which set they belong to
# so I can easily separate them later
train.data$set <- "Train"
test.data$set <- "Test"
# Recode the target as character: 1 -> 'Y', 0 -> 'N'
train.data$Survived <- as.character(train.data$Survived)
train.data$Survived[train.data$Survived == '1'] <- 'Y'
train.data$Survived[train.data$Survived == '0'] <- 'N'
# Bind training and test sets together
# the test data is missing the Survived column
train_test_data <- dplyr::bind_rows(train.data, test.data)
head(train.data)
tail(train_test_data)
In [915]:
train_test_data$Survived = factor(train_test_data$Survived)
train_test_data$Pclass = factor(train_test_data$Pclass)
str(train_test_data)
In [916]:
#To find missing values in Age:
sum(is.na(train_test_data$Age)) #a count of NA's in Age
#To see it as a proportion, count missing values against the combined data set:
sum(is.na(train_test_data$Age)) / length(train_test_data$Age)
#So about 20% of our Age data is missing
In [917]:
#I'd like to see that applied to each feature. Here I make it an actual percentage and round to 2 decimal places
sapply(train_test_data, function(df) {
round(( sum(is.na(df) == TRUE) / length(df) * 100 ),2)
})
Missing 77% of Cabin data.
Visualize the missing data using the Amelia package, which is designed specifically for missing data. We want to see where data is missing across the whole data set. Note: typing AmeliaView() in the console brings up an interactive GUI.
In [918]:
library(Amelia)
missmap(train_test_data, main = "Missing Map", col = c("lightyellow", "darkred"), legend = T)
In [919]:
table(train_test_data$Embarked, useNA = "always") # useNA = "always" shows the count of NA values
In [920]:
barplot(table(train_test_data$Embarked, useNA = "always"))
A reasonable way of imputing the 2 missing values is to use the most common port, which is Southampton.
In [921]:
train_test_data$Embarked[which(is.na(train_test_data$Embarked))] <- 'S'
table(train_test_data$Embarked, useNA = "always") # view the distribution to verify no NA's
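As a quick sanity check on a mode-based imputation like this, it can help to compare fares by port and class, since fare and embarkation point are related. A sketch on made-up numbers (the real call would use train_test_data):

```r
# Median fare by embarkation port and passenger class -- compare against the
# fares of the rows that were missing Embarked before settling on 'S'
toy <- data.frame(Embarked = c("S", "S", "C", "C", "Q"),
                  Pclass   = c(1, 3, 1, 1, 3),
                  Fare     = c(26, 8, 80, 60, 7))
aggregate(Fare ~ Embarked + Pclass, data = toy, FUN = median)
```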
In [922]:
# Extract title from Name, creating a new variable
train_test_data$Title <- gsub('(.*, )|(\\..*)', '', train_test_data$Name)
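To see what that regex is doing, here it is applied to a sample string in the data set's name format:

```r
# '(.*, )' strips everything up to and including ", ";
# '(\\..*)' strips everything from the first "." onward -- leaving the title
name <- "Braund, Mr. Owen Harris"
gsub('(.*, )|(\\..*)', '', name)   # "Mr"
```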
Predict Age for NA's. Use a decision tree and view it.
In [923]:
hist(train_test_data$Age, sqrt(nrow(train_test_data)), main = "Passenger Age", xlab = "Age")
Let's see the counts for each title along with a max and min age for each
In [924]:
library(sqldf)
#first get a total count with min and max ages for each title
sqldf("select Title,
count(*) as TitleCount,
min(Age) as Min_Age,
max(Age) as Max_Age
from 'train_test_data'
group by Title
order by count(*) desc")
sqldf("select Title,
count(*) as AgeNACount
from 'train_test_data'
where Age is null
group by Title
order by count(*) desc")
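If you'd rather stay out of SQL, the same kind of summary can be written with dplyr. A sketch on a toy frame (the toy data is made up, not from the Titanic set):

```r
library(dplyr)
# Toy stand-in for train_test_data with the two columns the query uses
toy <- data.frame(Title = c("Mr", "Mr", "Miss", "Master"),
                  Age   = c(30, NA, 22, 4))
title_summary <- toy %>%
  group_by(Title) %>%
  summarise(TitleCount = n(),
            Min_Age    = min(Age, na.rm = TRUE),
            Max_Age    = max(Age, na.rm = TRUE)) %>%
  arrange(desc(TitleCount))
title_summary
```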
In [925]:
#Title seems important for Age prediction. I'll change Mr to Master for boys under a certain age.
#if Title is Mr, Age is 14.5 or less, and Age is not NA, assign Master
train_test_data$Title[train_test_data$Title %in% c('Mr') & train_test_data$Age <= 14.5 & !is.na(train_test_data$Age)] <- "Master"
In [926]:
#Also create FamilyName column from passenger name
# create FamilyName from passenger name. FamilyName is the first name in the field
train_test_data$FamilyName <- gsub('(, .*)', '', train_test_data$Name)
train_test_data$FamilyName <- factor(train_test_data$FamilyName)
Family size. Maybe someone who belongs to a family is more likely to die while trying to find other family members instead of jumping into a lifeboat.
In [927]:
# Create a family size variable including the passenger themselves
train_test_data$Fsize <- train_test_data$SibSp + train_test_data$Parch + 1
train_test_data$Fsize <- as.integer(train_test_data$Fsize)
# create a small family variable showing if less than 3 members or not
#train.data$FamilySizeSmall = ifelse(train.data$Fsize + 1 <= 3, 1,0)
# Create a family variable. this will be more unique than just last name. It combines name with family size in one string
train_test_data$Family <- paste(train_test_data$FamilyName, train_test_data$Fsize, sep='_')
train_test_data$Family <- factor(train_test_data$Family)
In [928]:
#Mothers
train_test_data$Mother = ifelse(train_test_data$Title=="Mrs" & train_test_data$Parch > 0, 1,0)
train_test_data$Mother <- factor(train_test_data$Mother)
#singles
train_test_data$Single = ifelse(train_test_data$Fsize == 1, 1,0) # People travelling alone
train_test_data$Single <- factor(train_test_data$Single)
Create variable Deck that is the first letter in the Cabin variable
In [929]:
# Create a Deck variable. Get passenger deck A - F:
train_test_data$DeckT <- substr(train_test_data$Cabin,1,1) # take the first character of Cabin
train_test_data$DeckT <- factor(train_test_data$DeckT)
#http://rfunction.com/archives/1692
Since the Family variable is unique to each family, and families tend to have cabins near each other, I'll build a distinct list of families that have a cabin listed, then join that on Family to assign the same deck to all family members.
In [930]:
library(sqldf)
library(dplyr)
#there's a family with two decks, which creates a couple of extra rows when joined
DeckFam <- sqldf("select distinct Family, DeckT as Deck_Imp from 'train_test_data' where DeckT is not null")
#Identify duplicates
DeckFam$dup <- duplicated(DeckFam$Family)
#select only where duplicates equals false
DeckFam <- sqldf("select * from 'DeckFam' where dup = 0")
train_test_data <- left_join(train_test_data, DeckFam, by = "Family")
train_test_data$DeckT <- NULL # I don't need this column now
train_test_data$Deck_Imp <- factor(train_test_data$Deck_Imp)
In [931]:
str(train_test_data)
Use the anova method to predict continuous variables.
In [932]:
# Continuous features into bins
train_test_data$Fare_bins <- cut(train_test_data$Fare, 20, include.lowest = T) #bins for Age did not improve model accuracy
# Look at the bins
table(train_test_data$Fare_bins)
barplot(table(train_test_data$Fare_bins))
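cut(x, 20) makes 20 equal-width bins, so most fares pile into the lowest few; equal-frequency (quantile) bins are a common alternative worth comparing. A sketch with example fares, not the code used above:

```r
# Equal-frequency bins: quartile breaks put roughly the same number of
# observations in each bin, unlike equal-width cut(x, 20)
fares  <- c(7.25, 7.9, 8.05, 13, 26, 31, 71.3, 151.55, 263, 512.33)  # example values
breaks <- quantile(fares, probs = seq(0, 1, 0.25), na.rm = TRUE)
table(cut(fares, breaks, include.lowest = TRUE))
```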
In [933]:
#How many NA's now?
sapply(train_test_data, function(df) {
sum(is.na(df) == TRUE) / length(df)
})
In [934]:
#Generalize the rare titles (<= 2 instances) to Mr or Miss.
#Some titles appear in one set and not the other; e.g. Dona is in the
#test set but not the training set, so they need generalizing for prediction.
train_test_data$Title[train_test_data$Title %in% c("Lady","Mme","the Countess","Dona","Mlle","Ms")] <- "Miss"
train_test_data$Title[train_test_data$Title %in% c("Capt","Don","Major","Sir","Jonkheer")] <- "Mr"
In [935]:
library(sqldf)
#first get a total count with min and max ages for each title
sqldf("select Title,
count(*) as TitleCount,
min(Age) as Min_Age,
max(Age) as Max_Age
from 'train_test_data'
group by Title
order by count(*) desc")
Rename before creating dummies
In [936]:
#Rename columns for easy reference later
library(plyr)
train_test_data <- rename(train_test_data, c(
"Survived"="label_Survived",
"Pclass"="feature_Pclass",
"Title" = "feature_Title",
"Age" = "feature_Age",
"Fare" = "feature_Fare",
"SibSp" = "feature_SibSp",
"Parch" = "feature_Parch",
"Deck_Imp" = "feature_DeckImp",
"Embarked" = "feature_Embarked",
"Fsize" = "feature_Fsize"
))
In [937]:
str(train_test_data)
In [938]:
# some algorithms can't handle NA's; if needed, replace them, e.g.:
#train_test_data[is.na(train_test_data)] <- 0
In [939]:
library(caret)
dmy <- dummyVars(" ~ feature_Pclass +
feature_Title +
feature_SibSp +
feature_Parch +
feature_Embarked
"
, data = train_test_data, fullRank=T)
#use " ~ ." for all chr and factor columns
# use fullRank = T to eliminate 1 of the levels
trsf <- data.frame(predict(dmy, newdata = train_test_data))
str(trsf)
str(train_test_data)
In [940]:
#Now that we're done cleaning, we need a simple split of the data back into train and test
trsf <- cbind(
label_Survived = train_test_data$label_Survived,
set = train_test_data$set , trsf,
feature_Age = train_test_data$feature_Age,
feature_Fare = train_test_data$feature_Fare
)
train.data <- trsf %>%
filter(set == "Train")
test.data <- trsf %>%
filter(set == "Test")
test.data$label_Survived <- NULL #get rid of the Survived column. it was added when combined with training
test.data$set <- NULL
train.data$set <- NULL
str(train.data)
Impute Age and Fare on the train and test sets separately: I don't want to impute a field based on data outside the training set, and the same goes for the test set. Use anova for continuous variables.
In [941]:
#predict Age and Fare for train data NAs
library(rpart)
predicted_age <- rpart(
feature_Age ~ .,
data=train.data[!is.na(train.data$feature_Age),], method="anova"
)
train.data$feature_Age[is.na(train.data$feature_Age)] <- predict(
predicted_age, train.data[is.na(train.data$feature_Age),]
)
#predict Fare
predicted_fare <- rpart(
feature_Fare ~ .,
data=train.data[!is.na(train.data$feature_Fare),], method="anova"
)
train.data$feature_Fare[is.na(train.data$feature_Fare)] <- predict(
predicted_fare, train.data[is.na(train.data$feature_Fare),]
)
#Predict Age and Fare for test data NAs
library(rpart)
predicted_age <- rpart(
feature_Age ~ .,
data=test.data[!is.na(test.data$feature_Age),], method="anova"
)
test.data$feature_Age[is.na(test.data$feature_Age)] <- predict(
predicted_age, test.data[is.na(test.data$feature_Age),]
)
#predict Fare
predicted_fare <- rpart(
feature_Fare ~ .,
data=test.data[!is.na(test.data$feature_Fare),], method="anova"
)
test.data$feature_Fare[is.na(test.data$feature_Fare)] <- predict(
predicted_fare, test.data[is.na(test.data$feature_Fare),]
)
#Look at the decision trees to see which features were good for predicting Age
# Load the packages to build a fancy plot
library(rattle) # install.packages("rattle") if needed
library(rpart.plot)
library(RColorBrewer)
# Visualize your new decision tree
fancyRpartPlot(predicted_age)
fancyRpartPlot(predicted_fare)
In [942]:
#And let's see a histogram of Age after imputation:
hist(train.data$feature_Age, sqrt(nrow(train.data)), main = "Passenger Age", xlab = "Age")
Split the data into training and testing sets
In [943]:
# here's a training_set and test_set for accuracy testing
#need a stratified random split using the caret package
library(caret)
set.seed(1)
train.index <- createDataPartition(train.data$label_Survived, p = .75, list = FALSE)
trainSet <- train.data[ train.index,]
testSet <- train.data[-train.index,] #the test set gets data not in the training set
In [944]:
library(caret)
#Set the random seed
set.seed(1)
#Defining the training controls for multiple models
fitControl <- trainControl(
method = "cv",
number = 5,
savePredictions = 'final',
classProbs = T)
#Training the random forest model
model_rf <- train(
trainSet[ , names(trainSet) %like% c("feature")], #feature columns
trainSet[ , names(trainSet) %like% c("label")], #label columns
method='rf',
trControl=fitControl,
tuneLength=3)
#Predicting using random forest model
#Apply the model to the testSet
testSet$pred_rf<-predict(
object = model_rf,
testSet[ , names(testSet) %like% c("feature")])
#Checking the accuracy of the random forest model
confusionMatrix(testSet$label_Survived,testSet$pred_rf)
In [945]:
#Training the Logistic regression model
model_lr<-train(
trainSet[ , names(trainSet) %like% c("feature")], #feature columns
trainSet[ , names(trainSet) %like% c("label")], #label columns
method='glm',
trControl=fitControl,
tuneLength=3)
#Predicting using the logistic regression model
testSet$pred_lr<-predict(
object = model_lr,
testSet[ , names(testSet) %like% c("feature")])
#Checking the accuracy of the logistic regression model
confusionMatrix(testSet$label_Survived,testSet$pred_lr)
In [946]:
#Training the knn model
model_knn<-train(
trainSet[ , names(trainSet) %like% c("feature")], #feature columns
trainSet[ , names(trainSet) %like% c("label")], #label columns
method='knn',
trControl=fitControl,
tuneLength=3)
#Predicting using knn model
testSet$pred_knn<-predict(
object = model_knn,
testSet[ , names(testSet) %like% c("feature")]
)
#Checking the accuracy of the knn model
confusionMatrix(testSet$label_Survived,testSet$pred_knn)
In [947]:
# Averaging
#Predicting the probabilities
testSet$pred_rf_prob<-predict(
object = model_rf,
testSet[ , names(testSet) %like% c("feature")],
type='prob')
#add Title to predictors for lr
testSet$pred_lr_prob<-predict(
object = model_lr,
testSet[ , names(testSet) %like% c("feature")],
type='prob')
#knn
testSet$pred_knn_prob<-predict(
object = model_knn,
testSet[ , names(testSet) %like% c("feature")],
type='prob')
#Taking average of predictions
testSet$pred_avg<-(
testSet$pred_rf_prob$Y +
testSet$pred_lr_prob$Y +
testSet$pred_knn_prob$Y) / 3
#Splitting into binary classes at 0.5
testSet$pred_avg<-as.factor(ifelse(testSet$pred_avg>0.5,'Y','N'))
#Checking the accuracy of the averaged ensemble
confusionMatrix(testSet$label_Survived,testSet$pred_avg)
In [948]:
#Majority Voting
#The majority vote
testSet$pred_majority<-as.factor(ifelse(
testSet$pred_rf=='Y' & testSet$pred_knn=='Y','Y',
ifelse(testSet$pred_rf=='Y' & testSet$pred_lr=='Y','Y',
ifelse(testSet$pred_knn=='Y' & testSet$pred_lr=='Y','Y','N'))))
#Checking the accuracy of the majority-vote ensemble
confusionMatrix(testSet$label_Survived,testSet$pred_majority)
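The nested ifelse above works for exactly three models, but it doesn't scale; counting 'Y' votes generalizes to any odd number of models. A sketch with made-up predictions (the pred_* column names are illustrative):

```r
# Generalized majority vote: a row is 'Y' when more than half the models say 'Y'
votes <- data.frame(pred_rf  = c("Y", "N", "Y"),
                    pred_lr  = c("Y", "N", "N"),
                    pred_knn = c("N", "Y", "Y"))
n_yes    <- rowSums(votes == "Y")
majority <- factor(ifelse(n_yes > ncol(votes) / 2, "Y", "N"))
majority   # Y N Y
```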
In [949]:
#Weighted average
#Taking weighted average of predictions
testSet$pred_weighted_avg<-(testSet$pred_rf_prob$Y*0.25)+(testSet$pred_knn_prob$Y*0.25)+(testSet$pred_lr_prob$Y*0.5)
#Splitting into binary classes at 0.5
testSet$pred_weighted_avg<-as.factor(ifelse(testSet$pred_weighted_avg>0.5,'Y','N'))
#Checking the accuracy of the weighted-average ensemble
confusionMatrix(testSet$label_Survived,testSet$pred_weighted_avg)
In [950]:
#Stacking
#Predict using each base layer model for training data and test data
#Predicting the out of fold prediction probabilities for training data
trainSet$OOF_pred_rf<-model_rf$pred$Y[order(model_rf$pred$rowIndex)]
trainSet$OOF_pred_knn<-model_knn$pred$Y[order(model_knn$pred$rowIndex)]
trainSet$OOF_pred_lr<-model_lr$pred$Y[order(model_lr$pred$rowIndex)]
#Predicting probabilities for the test data
testSet$OOF_pred_rf<-predict(
model_rf,
testSet[ , names(testSet) %like% c("feature")],
type='prob')$Y
testSet$OOF_pred_knn<-predict(
model_knn,
testSet[ , names(testSet) %like% c("feature")],
type='prob')$Y
testSet$OOF_pred_lr<-predict(
model_lr,
testSet[ , names(testSet) %like% c("feature")],
type='prob')$Y
#Now train the top layer model on the predictions that the bottom layer models made on the training data
#Predictors for top layer models
predictors_top<-c('OOF_pred_rf','OOF_pred_knn','OOF_pred_lr')
#GBM as top layer model
model_gbm<-
train(
trainSet[ , names(trainSet) %like% c("OOF_pred")],
trainSet[ , names(trainSet) %like% c("label")], #label columns
method='gbm',
trControl=fitControl,tuneLength=3)
#predict using GBM top layer model
testSet$gbm_stacked<-predict(
model_gbm,
testSet[ , names(testSet) %like% c("OOF_pred")]
)
#Checking the accuracy of the stacked GBM model
confusionMatrix(testSet$label_Survived,testSet$gbm_stacked)
#----------------------
testSet$pred_gbm_prob<-predict(
object = model_gbm,
testSet[ , names(testSet) %like% c("OOF_pred")], #the GBM was trained on the OOF_pred columns, not the raw features
type='prob')
probs <- testSet$pred_gbm_prob$Y
# Load the ROCR library
library(ROCR)
# Make a prediction object: predi
predi <- prediction(probs, testSet$label_Survived)
# Make a performance object: perf
perf <- performance(predi, "tpr", "fpr")
auc <- performance(predi, measure = "auc")
auc <- auc@y.values[[1]]
roc.data <- data.frame(fpr=unlist(perf@x.values),
tpr=unlist(perf@y.values))
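roc.data is ready for plotting. A minimal self-contained sketch using toy probabilities and labels (the real call would use probs and testSet$label_Survived from above):

```r
library(ROCR)
library(ggplot2)
# Toy probabilities and labels standing in for the real model output
probs  <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)
labels <- c("Y", "Y", "N", "Y", "N", "N")
predi  <- prediction(probs, labels)
perf   <- performance(predi, "tpr", "fpr")
auc    <- performance(predi, measure = "auc")@y.values[[1]]
roc.df <- data.frame(fpr = unlist(perf@x.values), tpr = unlist(perf@y.values))
ggplot(roc.df, aes(x = fpr, y = tpr)) +
  geom_line(color = "darkred") +
  geom_abline(linetype = "dashed") +   # chance diagonal
  labs(title = sprintf("ROC curve (AUC = %.3f)", auc),
       x = "False positive rate", y = "True positive rate")
```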
In [951]:
str(testSet)