Predict survival on the Titanic


About:

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.


In [87]:
# Load the training data into a data frame, treating empty strings as missing
titanic_train = read.csv("train.csv", header = TRUE, na.strings = c("", "NA"))

In [88]:
# Structure of the data
str(titanic_train)


'data.frame':	891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
 $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

In [89]:
head(titanic_train)


Out[89]:
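| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | NA | S |
| 2 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NA | S |
| 4 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | NA | S |
| 6 | 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | NA | Q |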

In [90]:
summary(titanic_train)


Out[90]:
  PassengerId       Survived          Pclass     
 Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
 1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
 Median :446.0   Median :0.0000   Median :3.000  
 Mean   :446.0   Mean   :0.3838   Mean   :2.309  
 3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
 Max.   :891.0   Max.   :1.0000   Max.   :3.000  
                                                 
                                    Name         Sex           Age       
 Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
 Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
 Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
 Abelson, Mr. Samuel                  :  1                Mean   :29.70  
 Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
 Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
 (Other)                              :885                NA's   :177    
     SibSp           Parch             Ticket         Fare       
 Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
 1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
 Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
 Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
 3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
 Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
                                  (Other) :852                   
         Cabin     Embarked  
 B96 B98    :  4   C   :168  
 C23 C25 C27:  4   Q   : 77  
 G6         :  4   S   :644  
 C22 C26    :  3   NA's:  2  
 D          :  3             
 (Other)    :186             
 NA's       :687             
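The summary above reports 177 missing Age values, 687 missing Cabin values, and 2 missing Embarked values. A quick way to get such counts directly is `colSums(is.na(...))`. The sketch below runs on a small made-up data frame (`toy` is hypothetical and is not the Kaggle data) so that it works on its own; on the real data one would pass `titanic_train` instead.

```r
# Hypothetical toy data standing in for titanic_train
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c(NA, "C85", NA, NA),
  Embarked = c("S", "C", NA, "Q"),
  stringsAsFactors = TRUE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows with no missing value in Age or Embarked, i.e. the rows a model
# using only those two columns could keep
n_complete <- sum(complete.cases(toy[, c("Age", "Embarked")]))
print(n_complete)
```

Counting complete rows this way is useful before modelling, because a model fitted with default settings silently drops every row that has a missing predictor.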

In [91]:
# The dependent variable is Survived (0 = died, 1 = survived); all the other variables are the independent variables

In [102]:
# Drop the identifier-like columns (Name, Ticket, Cabin, PassengerId) before modelling
titanic_train_cleaned = titanic_train[, !(colnames(titanic_train) %in% c("Name", "Ticket", "Cabin", "PassengerId"))]

In [103]:
head(titanic_train_cleaned)


Out[103]:
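| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | male | 22 | 1 | 0 | 7.25 | S |
| 2 | 1 | 1 | female | 38 | 1 | 0 | 71.2833 | C |
| 3 | 1 | 3 | female | 26 | 0 | 0 | 7.925 | S |
| 4 | 1 | 1 | female | 35 | 1 | 0 | 53.1 | S |
| 5 | 0 | 3 | male | 35 | 0 | 0 | 8.05 | S |
| 6 | 0 | 3 | male | NA | 0 | 0 | 8.4583 | Q |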

In [104]:
# Compare the number of men and women who survived (rows: 0 = died, 1 = survived)
table(titanic_train_cleaned$Survived, titanic_train_cleaned$Sex)


Out[104]:
   
    female male
  0     81  468
  1    233  109

In [105]:
# Percentage of women who survived (233 of 314)
233*100/(233+81)


Out[105]:
74.203821656051

In [106]:
# Percentage of men who survived (109 of 577)
109*100/(109+468)


Out[106]:
18.8908145580589
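The two percentages above were typed in by hand from the cross-tabulation. As a sketch of how to get them without copying numbers, `prop.table()` with `margin = 2` converts each column of a table into proportions. The matrix below is rebuilt from the counts shown earlier so the example runs on its own; the names `counts` and `pct` are illustrative.

```r
# Counts copied from the cross-tabulation (rows: Survived 0/1, columns: Sex)
counts <- matrix(c(81, 233, 468, 109), nrow = 2,
                 dimnames = list(Survived = c("0", "1"),
                                 Sex = c("female", "male")))

# Column-wise percentages: share of each sex that died or survived
pct <- 100 * prop.table(counts, margin = 2)
print(round(pct, 2))
```

On the real data the same call could be applied directly to `table(titanic_train_cleaned$Survived, titanic_train_cleaned$Sex)`; the row labelled 1 then holds the survival percentages for women and men.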

In [107]:
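# Logistic regression of Survived on all remaining columns; rows with missing values are dropped by default
model_1 = glm(Survived ~ ., data = titanic_train_cleaned, family = "binomial")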

In [108]:
summary(model_1)


Out[108]:
Call:
glm(formula = Survived ~ ., family = "binomial", data = titanic_train_cleaned)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7233  -0.6447  -0.3799   0.6326   2.4457  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  5.637407   0.634550   8.884  < 2e-16 ***
Pclass      -1.199251   0.164619  -7.285 3.22e-13 ***
Sexmale     -2.638476   0.222256 -11.871  < 2e-16 ***
Age         -0.043350   0.008232  -5.266 1.39e-07 ***
SibSp       -0.363208   0.129017  -2.815  0.00487 ** 
Parch       -0.060270   0.123900  -0.486  0.62666    
Fare         0.001432   0.002531   0.566  0.57165    
EmbarkedQ   -0.823545   0.600229  -1.372  0.17005    
EmbarkedS   -0.401213   0.270283  -1.484  0.13770    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 960.90  on 711  degrees of freedom
Residual deviance: 632.34  on 703  degrees of freedom
  (179 observations deleted due to missingness)
AIC: 650.34

Number of Fisher Scoring iterations: 5
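To see how these coefficients turn into a survival probability, the sketch below plugs the estimates printed above into the logistic function for one passenger: a 34.5-year-old man in third class, travelling alone, who paid a fare of 7.8292 and embarked at Q (the first row of the test file). The vector names are illustrative, and the coefficients are the rounded values from the summary, so the result is approximate.

```r
# Coefficients copied (rounded) from summary(model_1)
coefs <- c(Intercept = 5.637407, Pclass = -1.199251, Sexmale = -2.638476,
           Age = -0.043350, SibSp = -0.363208, Parch = -0.060270,
           Fare = 0.001432, EmbarkedQ = -0.823545, EmbarkedS = -0.401213)

# One passenger in the model's dummy coding: male = 1, Embarked Q = 1
x <- c(Intercept = 1, Pclass = 3, Sexmale = 1, Age = 34.5, SibSp = 0,
       Parch = 0, Fare = 7.8292, EmbarkedQ = 1, EmbarkedS = 0)

log_odds  <- sum(coefs * x)     # linear predictor
p_survive <- plogis(log_odds)   # inverse logit: 1 / (1 + exp(-log_odds))
print(p_survive)

# Odds ratio for being male, holding the other variables fixed
odds_ratio_male <- exp(coefs[["Sexmale"]])
print(odds_ratio_male)
```

The probability comes out at roughly 0.052, and the odds ratio of roughly 0.07 says that, with the other variables held fixed, the odds of survival for a man are about 7% of those for a woman.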

In [109]:
titanic_test = read.csv("test.csv")

In [110]:
titanic_test


Out[110]:
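| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | | Q |
| 2 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47 | 1 | 0 | 363272 | 7 | | S |
| 3 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62 | 0 | 0 | 240276 | 9.6875 | | Q |
| 4 | 895 | 3 | Wirz, Mr. Albert | male | 27 | 0 | 0 | 315154 | 8.6625 | | S |
| 5 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22 | 1 | 1 | 3101298 | 12.2875 | | S |
| 6 | 897 | 3 | Svensson, Mr. Johan Cervin | male | 14 | 0 | 0 | 7538 | 9.225 | | S |
| 7 | 898 | 3 | Connolly, Miss. Kate | female | 30 | 0 | 0 | 330972 | 7.6292 | | Q |
| 8 | 899 | 2 | Caldwell, Mr. Albert Francis | male | 26 | 1 | 1 | 248738 | 29 | | S |
| 9 | 900 | 3 | Abrahim, Mrs. Joseph (Sophie Halaut Easu) | female | 18 | 0 | 0 | 2657 | 7.2292 | | C |
| 10 | 901 | 3 | Davies, Mr. John Samuel | male | 21 | 2 | 0 | A/4 48871 | 24.15 | | S |
| 11 | 902 | 3 | Ilieff, Mr. Ylio | male | NA | 0 | 0 | 349220 | 7.8958 | | S |
| 12 | 903 | 1 | Jones, Mr. Charles Cresson | male | 46 | 0 | 0 | 694 | 26 | | S |
| 13 | 904 | 1 | Snyder, Mrs. John Pillsbury (Nelle Stevenson) | female | 23 | 1 | 0 | 21228 | 82.2667 | B45 | S |
| 14 | 905 | 2 | Howard, Mr. Benjamin | male | 63 | 1 | 0 | 24065 | 26 | | S |
| 15 | 906 | 1 | Chaffee, Mrs. Herbert Fuller (Carrie Constance Toogood) | female | 47 | 1 | 0 | W.E.P. 5734 | 61.175 | E31 | S |
| 16 | 907 | 2 | del Carlo, Mrs. Sebastiano (Argenia Genovesi) | female | 24 | 1 | 0 | SC/PARIS 2167 | 27.7208 | | C |
| 17 | 908 | 2 | Keane, Mr. Daniel | male | 35 | 0 | 0 | 233734 | 12.35 | | Q |
| 18 | 909 | 3 | Assaf, Mr. Gerios | male | 21 | 0 | 0 | 2692 | 7.225 | | C |
| 19 | 910 | 3 | Ilmakangas, Miss. Ida Livija | female | 27 | 1 | 0 | STON/O2. 3101270 | 7.925 | | S |
| 20 | 911 | 3 | Assaf Khalil, Mrs. Mariana (Miriam")" | female | 45 | 0 | 0 | 2696 | 7.225 | | C |
| 21 | 912 | 1 | Rothschild, Mr. Martin | male | 55 | 1 | 0 | PC 17603 | 59.4 | | C |
| 22 | 913 | 3 | Olsen, Master. Artur Karl | male | 9 | 0 | 1 | C 17368 | 3.1708 | | S |
| 23 | 914 | 1 | Flegenheim, Mrs. Alfred (Antoinette) | female | NA | 0 | 0 | PC 17598 | 31.6833 | | S |
| 24 | 915 | 1 | Williams, Mr. Richard Norris II | male | 21 | 0 | 1 | PC 17597 | 61.3792 | | C |
| 25 | 916 | 1 | Ryerson, Mrs. Arthur Larned (Emily Maria Borie) | female | 48 | 1 | 3 | PC 17608 | 262.375 | B57 B59 B63 B66 | C |
| 26 | 917 | 3 | Robins, Mr. Alexander A | male | 50 | 1 | 0 | A/5. 3337 | 14.5 | | S |
| 27 | 918 | 1 | Ostby, Miss. Helene Ragnhild | female | 22 | 0 | 1 | 113509 | 61.9792 | B36 | C |
| 28 | 919 | 3 | Daher, Mr. Shedid | male | 22.5 | 0 | 0 | 2698 | 7.225 | | C |
| 29 | 920 | 1 | Brady, Mr. John Bertram | male | 41 | 0 | 0 | 113054 | 30.5 | A21 | S |
| 30 | 921 | 3 | Samaan, Mr. Elias | male | NA | 2 | 0 | 2662 | 21.6792 | | C |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 389 | 1280 | 3 | Canavan, Mr. Patrick | male | 21 | 0 | 0 | 364858 | 7.75 | | Q |
| 390 | 1281 | 3 | Palsson, Master. Paul Folke | male | 6 | 3 | 1 | 349909 | 21.075 | | S |
| 391 | 1282 | 1 | Payne, Mr. Vivian Ponsonby | male | 23 | 0 | 0 | 12749 | 93.5 | B24 | S |
| 392 | 1283 | 1 | Lines, Mrs. Ernest H (Elizabeth Lindsey James) | female | 51 | 0 | 1 | PC 17592 | 39.4 | D28 | S |
| 393 | 1284 | 3 | Abbott, Master. Eugene Joseph | male | 13 | 0 | 2 | C.A. 2673 | 20.25 | | S |
| 394 | 1285 | 2 | Gilbert, Mr. William | male | 47 | 0 | 0 | C.A. 30769 | 10.5 | | S |
| 395 | 1286 | 3 | Kink-Heilmann, Mr. Anton | male | 29 | 3 | 1 | 315153 | 22.025 | | S |
| 396 | 1287 | 1 | Smith, Mrs. Lucien Philip (Mary Eloise Hughes) | female | 18 | 1 | 0 | 13695 | 60 | C31 | S |
| 397 | 1288 | 3 | Colbert, Mr. Patrick | male | 24 | 0 | 0 | 371109 | 7.25 | | Q |
| 398 | 1289 | 1 | Frolicher-Stehli, Mrs. Maxmillian (Margaretha Emerentia Stehli) | female | 48 | 1 | 1 | 13567 | 79.2 | B41 | C |
| 399 | 1290 | 3 | Larsson-Rondberg, Mr. Edvard A | male | 22 | 0 | 0 | 347065 | 7.775 | | S |
| 400 | 1291 | 3 | Conlon, Mr. Thomas Henry | male | 31 | 0 | 0 | 21332 | 7.7333 | | Q |
| 401 | 1292 | 1 | Bonnell, Miss. Caroline | female | 30 | 0 | 0 | 36928 | 164.8667 | C7 | S |
| 402 | 1293 | 2 | Gale, Mr. Harry | male | 38 | 1 | 0 | 28664 | 21 | | S |
| 403 | 1294 | 1 | Gibson, Miss. Dorothy Winifred | female | 22 | 0 | 1 | 112378 | 59.4 | | C |
| 404 | 1295 | 1 | Carrau, Mr. Jose Pedro | male | 17 | 0 | 0 | 113059 | 47.1 | | S |
| 405 | 1296 | 1 | Frauenthal, Mr. Isaac Gerald | male | 43 | 1 | 0 | 17765 | 27.7208 | D40 | C |
| 406 | 1297 | 2 | Nourney, Mr. Alfred (Baron von Drachstedt")" | male | 20 | 0 | 0 | SC/PARIS 2166 | 13.8625 | D38 | C |
| 407 | 1298 | 2 | Ware, Mr. William Jeffery | male | 23 | 1 | 0 | 28666 | 10.5 | | S |
| 408 | 1299 | 1 | Widener, Mr. George Dunton | male | 50 | 1 | 1 | 113503 | 211.5 | C80 | C |
| 409 | 1300 | 3 | Riordan, Miss. Johanna Hannah"" | female | NA | 0 | 0 | 334915 | 7.7208 | | Q |
| 410 | 1301 | 3 | Peacock, Miss. Treasteall | female | 3 | 1 | 1 | SOTON/O.Q. 3101315 | 13.775 | | S |
| 411 | 1302 | 3 | Naughton, Miss. Hannah | female | NA | 0 | 0 | 365237 | 7.75 | | Q |
| 412 | 1303 | 1 | Minahan, Mrs. William Edward (Lillian E Thorpe) | female | 37 | 1 | 0 | 19928 | 90 | C78 | Q |
| 413 | 1304 | 3 | Henriksson, Miss. Jenny Lovisa | female | 28 | 0 | 0 | 347086 | 7.775 | | S |
| 414 | 1305 | 3 | Spector, Mr. Woolf | male | NA | 0 | 0 | A.5. 3236 | 8.05 | | S |
| 415 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39 | 0 | 0 | PC 17758 | 108.9 | C105 | C |
| 416 | 1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.25 | | S |
| 417 | 1308 | 3 | Ware, Mr. Frederick | male | NA | 0 | 0 | 359309 | 8.05 | | S |
| 418 | 1309 | 3 | Peter, Master. Michael J | male | NA | 1 | 1 | 2668 | 22.3583 | | C |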

In [119]:
# Predicted survival probabilities for the test passengers (predict() dispatches to predict.glm)
predictTest = predict(model_1, newdata = titanic_test, type = "response")

In [120]:
# Classify a passenger as a survivor when the predicted probability is at least 0.5
survival = ifelse(predictTest >= 0.5, 1, 0)

In [121]:
output = cbind(titanic_test, predictTest, survival)

In [126]:
output


Out[126]:
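| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | predictTest | survival |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | | Q | 0.051821732991821 | 0 |
| 2 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47 | 1 | 0 | 363272 | 7 | | S | 0.320343155869419 | 0 |
| 3 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62 | 0 | 0 | 240276 | 9.6875 | | Q | 0.0523047502494676 | 0 |
| 4 | 895 | 3 | Wirz, Mr. Albert | male | 27 | 0 | 0 | 315154 | 8.6625 | | S | 0.103578228987168 | 0 |
| 5 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22 | 1 | 1 | 3101298 | 12.2875 | | S | 0.569265167731991 | 1 |
| 6 | 897 | 3 | Svensson, Mr. Johan Cervin | male | 14 | 0 | 0 | 7538 | 9.225 | | S | 0.16885989640421 | 0 |
| 7 | 898 | 3 | Connolly, Miss. Kate | female | 30 | 0 | 0 | 330972 | 7.6292 | | Q | 0.481641250566062 | 0 |
| 8 | 899 | 2 | Caldwell, Mr. Albert Francis | male | 26 | 1 | 1 | 248738 | 29 | | S | 0.212512575616221 | 0 |
| 9 | 900 | 3 | Abrahim, Mrs. Joseph (Sophie Halaut Easu) | female | 18 | 0 | 0 | 2657 | 7.2292 | | C | 0.780692018910152 | 1 |
| 10 | 901 | 3 | Davies, Mr. John Samuel | male | 21 | 2 | 0 | A/4 48871 | 24.15 | | S | 0.0689950965340154 | 0 |
| 11 | 902 | 3 | Ilieff, Mr. Ylio | male | NA | 0 | 0 | 349220 | 7.8958 | | S | NA | NA |
| 12 | 903 | 1 | Jones, Mr. Charles Cresson | male | 46 | 0 | 0 | 694 | 26 | | S | 0.363915134558356 | 0 |
| 13 | 904 | 1 | Snyder, Mrs. John Pillsbury (Nelle Stevenson) | female | 23 | 1 | 0 | 21228 | 82.2667 | B45 | S | 0.942375531681908 | 1 |
| 14 | 905 | 2 | Howard, Mr. Benjamin | male | 63 | 1 | 0 | 24065 | 26 | | S | 0.0542784950000937 | 0 |
| 15 | 906 | 1 | Chaffee, Mrs. Herbert Fuller (Carrie Constance Toogood) | female | 47 | 1 | 0 | W.E.P. 5734 | 61.175 | E31 | S | 0.848625716114988 | 1 |
| 16 | 907 | 2 | del Carlo, Mrs. Sebastiano (Argenia Genovesi) | female | 24 | 1 | 0 | SC/PARIS 2167 | 27.7208 | | C | 0.867033674672601 | 1 |
| 17 | 908 | 2 | Keane, Mr. Daniel | male | 35 | 0 | 0 | 233734 | 12.35 | | Q | 0.151525632961046 | 0 |
| 18 | 909 | 3 | Assaf, Mr. Gerios | male | 21 | 0 | 0 | 2692 | 7.225 | | C | 0.182600301321856 | 0 |
| 19 | 910 | 3 | Ilmakangas, Miss. Ida Livija | female | 27 | 1 | 0 | STON/O2. 3101270 | 7.925 | | S | 0.528999497398422 | 1 |
| 20 | 911 | 3 | Assaf Khalil, Mrs. Mariana (Miriam")" | female | 45 | 0 | 0 | 2696 | 7.225 | | C | 0.524791789580254 | 1 |
| 21 | 912 | 1 | Rothschild, Mr. Martin | male | 55 | 1 | 0 | PC 17603 | 59.4 | | C | 0.296766988626945 | 0 |
| 22 | 913 | 3 | Olsen, Master. Artur Karl | male | 9 | 0 | 1 | C 17368 | 3.1708 | | S | 0.190630884239747 | 0 |
| 23 | 914 | 1 | Flegenheim, Mrs. Alfred (Antoinette) | female | NA | 0 | 0 | PC 17598 | 31.6833 | | S | NA | NA |
| 24 | 915 | 1 | Williams, Mr. Richard Norris II | male | 21 | 0 | 1 | PC 17597 | 61.3792 | | C | 0.714416135753092 | 1 |
| 25 | 916 | 1 | Ryerson, Mrs. Arthur Larned (Emily Maria Borie) | female | 48 | 1 | 3 | PC 17608 | 262.375 | B57 B59 B63 B66 | C | 0.899253004696105 | 1 |
| 26 | 917 | 3 | Robins, Mr. Alexander A | male | 50 | 1 | 0 | A/5. 3337 | 14.5 | | S | 0.0290294698379009 | 0 |
| 27 | 918 | 1 | Ostby, Miss. Helene Ragnhild | female | 22 | 0 | 1 | 113509 | 61.9792 | B36 | C | 0.971053020850646 | 1 |
| 28 | 919 | 3 | Daher, Mr. Shedid | male | 22.5 | 0 | 0 | 2698 | 7.225 | | C | 0.173094384995819 | 0 |
| 29 | 920 | 1 | Brady, Mr. John Bertram | male | 41 | 0 | 0 | 113054 | 30.5 | A21 | S | 0.416971226480464 | 0 |
| 30 | 921 | 3 | Samaan, Mr. Elias | male | NA | 2 | 0 | 2662 | 21.6792 | | C | NA | NA |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 389 | 1280 | 3 | Canavan, Mr. Patrick | male | 21 | 0 | 0 | 364858 | 7.75 | | Q | 0.0893480191736864 | 0 |
| 390 | 1281 | 3 | Palsson, Master. Paul Folke | male | 6 | 3 | 1 | 349909 | 21.075 | | S | 0.0847214322477844 | 0 |
| 391 | 1282 | 1 | Payne, Mr. Vivian Ponsonby | male | 23 | 0 | 0 | 12749 | 93.5 | B24 | S | 0.63071142200867 | 1 |
| 392 | 1283 | 1 | Lines, Mrs. Ernest H (Elizabeth Lindsey James) | female | 51 | 0 | 1 | PC 17592 | 39.4 | D28 | S | 0.860833339338297 | 1 |
| 393 | 1284 | 3 | Abbott, Master. Eugene Joseph | male | 13 | 0 | 2 | C.A. 2673 | 20.25 | | S | 0.160416100080293 | 0 |
| 394 | 1285 | 2 | Gilbert, Mr. William | male | 47 | 0 | 0 | C.A. 30769 | 10.5 | | S | 0.139050130003816 | 0 |
| 395 | 1286 | 3 | Kink-Heilmann, Mr. Anton | male | 29 | 3 | 1 | 315153 | 22.025 | | S | 0.0330684163733309 | 0 |
| 396 | 1287 | 1 | Smith, Mrs. Lucien Philip (Mary Eloise Hughes) | female | 18 | 1 | 0 | 13695 | 60 | C31 | S | 0.951631527803109 | 1 |
| 397 | 1288 | 3 | Colbert, Mr. Patrick | male | 24 | 0 | 0 | 371109 | 7.25 | | Q | 0.0792641363237416 | 0 |
| 398 | 1289 | 1 | Frolicher-Stehli, Mrs. Maxmillian (Margaretha Emerentia Stehli) | female | 48 | 1 | 1 | 13567 | 79.2 | B41 | C | 0.885670540717274 | 1 |
| 399 | 1290 | 3 | Larsson-Rondberg, Mr. Edvard A | male | 22 | 0 | 0 | 347065 | 7.775 | | S | 0.125361997054514 | 0 |
| 400 | 1291 | 3 | Conlon, Mr. Thomas Henry | male | 31 | 0 | 0 | 21332 | 7.7333 | | Q | 0.0597967189598365 | 0 |
| 401 | 1292 | 1 | Bonnell, Miss. Caroline | female | 30 | 0 | 0 | 36928 | 164.8667 | C7 | S | 0.951314533903176 | 1 |
| 402 | 1293 | 2 | Gale, Mr. Harry | male | 38 | 1 | 0 | 28664 | 21 | | S | 0.144151472218659 | 0 |
| 403 | 1294 | 1 | Gibson, Miss. Dorothy Winifred | female | 22 | 0 | 1 | 112378 | 59.4 | | C | 0.970949051752717 | 1 |
| 404 | 1295 | 1 | Carrau, Mr. Jose Pedro | male | 17 | 0 | 0 | 113059 | 47.1 | | S | 0.674573548658884 | 1 |
| 405 | 1296 | 1 | Frauenthal, Mr. Isaac Gerald | male | 43 | 1 | 0 | 17765 | 27.7208 | D40 | C | 0.404224642940994 | 0 |
| 406 | 1297 | 2 | Nourney, Mr. Alfred (Baron von Drachstedt")" | male | 20 | 0 | 0 | SC/PARIS 2166 | 13.8625 | D38 | C | 0.438629616081759 | 0 |
| 407 | 1298 | 2 | Ware, Mr. William Jeffery | male | 23 | 1 | 0 | 28666 | 10.5 | | S | 0.241218460049096 | 0 |
| 408 | 1299 | 1 | Widener, Mr. George Dunton | male | 50 | 1 | 1 | 113503 | 211.5 | C80 | C | 0.380243066792599 | 0 |
| 409 | 1300 | 3 | Riordan, Miss. Johanna Hannah"" | female | NA | 0 | 0 | 334915 | 7.7208 | | Q | NA | NA |
| 410 | 1301 | 3 | Peacock, Miss. Treasteall | female | 3 | 1 | 1 | SOTON/O.Q. 3101315 | 13.775 | | S | 0.751127100031508 | 1 |
| 411 | 1302 | 3 | Naughton, Miss. Hannah | female | NA | 0 | 0 | 365237 | 7.75 | | Q | NA | NA |
| 412 | 1303 | 1 | Minahan, Mrs. William Edward (Lillian E Thorpe) | female | 37 | 1 | 0 | 19928 | 90 | C78 | Q | 0.855238964139487 | 1 |
| 413 | 1304 | 3 | Henriksson, Miss. Jenny Lovisa | female | 28 | 0 | 0 | 347086 | 7.775 | | S | 0.607251119463523 | 1 |
| 414 | 1305 | 3 | Spector, Mr. Woolf | male | NA | 0 | 0 | A.5. 3236 | 8.05 | | S | NA | NA |
| 415 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39 | 0 | 0 | PC 17758 | 108.9 | C105 | C | 0.948014586213436 | 1 |
| 416 | 1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.25 | | S | 0.0654590550228509 | 0 |
| 417 | 1308 | 3 | Ware, Mr. Frederick | male | NA | 0 | 0 | 359309 | 8.05 | | S | NA | NA |
| 418 | 1309 | 3 | Peter, Master. Michael J | male | NA | 1 | 1 | 2668 | 22.3583 | | C | NA | NA |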
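The NA predictions in this output belong to passengers whose Age is missing, because a prediction cannot be computed for a row with a missing predictor. One common workaround, which is an assumption here and not something this notebook does, is to fill missing ages with the training median before fitting and predicting. The sketch below shows the idea on simulated data; `toy_train`, `toy_test`, and `toy_model` are hypothetical names.

```r
set.seed(1)  # make the simulated data reproducible
n <- 200

# Simulated training data with some ages removed
toy_train <- data.frame(
  Age = round(runif(n, 1, 80)),
  Sex = factor(sample(c("female", "male"), n, replace = TRUE))
)
toy_train$Survived <- rbinom(
  n, 1, plogis(1 - 2 * (toy_train$Sex == "male") - 0.02 * toy_train$Age)
)
toy_train$Age[sample(n, 30)] <- NA

# Simulated test data, one row with a missing age
toy_test <- data.frame(
  Age = c(30, NA, 5),
  Sex = factor(c("male", "female", "male"), levels = c("female", "male"))
)

# Impute with the training median in both sets (never the test median)
age_median <- median(toy_train$Age, na.rm = TRUE)
toy_train$Age[is.na(toy_train$Age)] <- age_median
toy_test$Age[is.na(toy_test$Age)]   <- age_median

toy_model <- glm(Survived ~ Age + Sex, data = toy_train, family = "binomial")
pred <- predict(toy_model, newdata = toy_test, type = "response")
print(pred)
```

With the real data, the same two imputation lines would be applied to `titanic_train_cleaned` and `titanic_test` before calling `glm()` and `predict()`. Whether median imputation is appropriate is a modelling choice; it simply guarantees that every test passenger receives a prediction.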

In [127]:
write.csv(output, "output_1.csv")
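The call above writes every column plus R's row names. A Kaggle submission for this challenge expects only two columns, PassengerId and Survived, so a trimmed file is usually needed. A minimal sketch, assuming a data frame shaped like `output` (replaced here by a three-row stand-in, `output_demo`, and written to a temporary path so the example runs on its own):

```r
# Hypothetical stand-in for the notebook's `output` data frame
output_demo <- data.frame(PassengerId = c(892, 893, 896),
                          predictTest = c(0.052, 0.320, 0.569),
                          survival    = c(0, 0, 1))

# Keep only the two columns a submission needs, under the expected names
submission <- data.frame(PassengerId = output_demo$PassengerId,
                         Survived    = output_demo$survival)

# row.names = FALSE keeps the unnamed index column out of the file
path <- tempfile(fileext = ".csv")
write.csv(submission, path, row.names = FALSE)

print(readLines(path))
```

Any NA left in the Survived column would also need to be resolved before submitting, since each passenger needs a 0 or 1.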
