Predict survival on the Titanic


In [ ]:

About:

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.


In [ ]:


In [ ]:


In [59]:
# #Import Python Libraries
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
%matplotlib inline

In [ ]:


In [10]:
# #Read the Training Dataset into a Pandas Dataframe
titanic_train = pd.read_csv("train.csv")

In [9]:
type(titanic_train)


Out[9]:
pandas.core.frame.DataFrame

In [ ]:


In [20]:
# #Overview of the data
titanic_train


Out[20]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14 0 0 350406 7.8542 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55 0 0 248706 16.0000 NaN S
16 17 0 3 Rice, Master. Eugene male 2 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31 1 0 345763 18.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
20 21 0 2 Fynney, Mr. Joseph J male 35 0 0 239865 26.0000 NaN S
21 22 1 2 Beesley, Mr. Lawrence male 34 0 0 248698 13.0000 D56 S
22 23 1 3 McGowan, Miss. Anna "Annie" female 15 0 0 330923 8.0292 NaN Q
23 24 1 1 Sloper, Mr. William Thompson male 28 0 0 113788 35.5000 A6 S
24 25 0 3 Palsson, Miss. Torborg Danira female 8 3 1 349909 21.0750 NaN S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38 1 5 347077 31.3875 NaN S
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
27 28 0 1 Fortune, Mr. Charles Alexander male 19 3 2 19950 263.0000 C23 C25 C27 S
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
29 30 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
861 862 0 2 Giles, Mr. Frederick Edward male 21 1 0 28134 11.5000 NaN S
862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48 0 0 17466 25.9292 D17 S
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
864 865 0 2 Gill, Mr. John William male 24 0 0 233866 13.0000 NaN S
865 866 1 2 Bystrom, Mrs. (Karolina) female 42 0 0 236852 13.0000 NaN S
866 867 1 2 Duran y More, Miss. Asuncion female 27 1 0 SC/PARIS 2149 13.8583 NaN C
867 868 0 1 Roebling, Mr. Washington Augustus II male 31 0 0 PC 17590 50.4958 A24 S
868 869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
869 870 1 3 Johnson, Master. Harold Theodor male 4 1 1 347742 11.1333 NaN S
870 871 0 3 Balkic, Mr. Cerin male 26 0 0 349248 7.8958 NaN S
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47 1 1 11751 52.5542 D35 S
872 873 0 1 Carlsson, Mr. Frans Olof male 33 0 0 695 5.0000 B51 B53 B55 S
873 874 0 3 Vander Cruyssen, Mr. Victor male 47 0 0 345765 9.0000 NaN S
874 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28 1 0 P/PP 3381 24.0000 NaN C
875 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15 0 0 2667 7.2250 NaN C
876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20 0 0 7534 9.8458 NaN S
877 878 0 3 Petroff, Mr. Nedelio male 19 0 0 349212 7.8958 NaN S
878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56 0 1 11767 83.1583 C50 C
880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25 0 1 230433 26.0000 NaN S
881 882 0 3 Markun, Mr. Johann male 33 0 0 349257 7.8958 NaN S
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22 0 0 7552 10.5167 NaN S
883 884 0 2 Banfield, Mr. Frederick James male 28 0 0 C.A./SOTON 34068 10.5000 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32 0 0 370376 7.7500 NaN Q

891 rows × 12 columns


In [ ]:


In [22]:
# #Shape of the Dataframe - The dataframe contains 891 Rows and 12 Columns
titanic_train.shape


Out[22]:
(891, 12)

In [ ]:

Description about the 12 columns/features in the dataframe

  • PassengerId - Numerical - Contains a unique id for each passenger(auto-incremented)

  • Survived - Categorical - 0 = Didn't Survive | 1 = Survived

  • Pclass(Passenger Class) - Categorical - 1 = 1st | 2 = 2nd | 3 = 3rd;

    Pclass serves as a proxy for socio-economic status - 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
  • Name - Name

  • Sex - Categorical - Male | Female

  • Age - Numeical

  • SibSp(Number of Siblings/Spouses Aboard)

    Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic

    Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)

  • Parch(Number of Parents/Children Aboard)

    Parent: Mother or Father of Passenger Aboard Titanic

    Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

  • Ticket - Ticket Number

  • Fare - Passenger Fare

  • Cabin - Cabin

  • Embarked - Port of Embarkation - C = Cherbourg | Q = Queenstown | S = Southampton


In [ ]:


In [24]:
# #The essence of the problem is that we are trying to predict whether a passenger aboard the titanic survived or not
# #depending on the various features in this training dataset. Therefore, depending on independent variables such as 
# #- Pclass, Sex, Age etc. our goal is to predict the dependent variable - Survived; that contains the value 0 for 
# #Didn't Survive and 1 for Survived. Furthermore, after building the model we will check this prediction on the Test dataset.

In [ ]:


In [31]:
pd.value_counts(titanic_train["Survived"])


Out[31]:
0    549
1    342
dtype: int64

In [ ]:


In [35]:
# #There are certain columns in our dataset that might be very helpful as features and can therefore be dropped.
titanic_train_reduced = titanic_train.drop(["Name", "Ticket", "Cabin"], axis = 1)

In [68]:
# #To clean the dataset, we could use methods such as multiple imputation to fill in the NA elements.
# #BUT, to start with we can have an extremely clean dataset by dropping all the rows which contains one or more NA elements.
titanic_train_cleaned = titanic_train_reduced.dropna()

In [70]:
# #Convert Categorical Variables into Numerical Variables
titanic_train_cleaned.Sex = titanic_train_cleaned.Sex.apply(lambda sex: 1 if sex == "male" else 0)
titanic_train_cleaned


Out[70]:
PassengerId Survived Pclass Sex Age SibSp Parch Fare Embarked
0 1 0 3 0 22 1 0 7.2500 S
1 2 1 1 0 38 1 0 71.2833 C
2 3 1 3 0 26 0 0 7.9250 S
3 4 1 1 0 35 1 0 53.1000 S
4 5 0 3 0 35 0 0 8.0500 S
6 7 0 1 0 54 0 0 51.8625 S
7 8 0 3 0 2 3 1 21.0750 S
8 9 1 3 0 27 0 2 11.1333 S
9 10 1 2 0 14 1 0 30.0708 C
10 11 1 3 0 4 1 1 16.7000 S
11 12 1 1 0 58 0 0 26.5500 S
12 13 0 3 0 20 0 0 8.0500 S
13 14 0 3 0 39 1 5 31.2750 S
14 15 0 3 0 14 0 0 7.8542 S
15 16 1 2 0 55 0 0 16.0000 S
16 17 0 3 0 2 4 1 29.1250 Q
18 19 0 3 0 31 1 0 18.0000 S
20 21 0 2 0 35 0 0 26.0000 S
21 22 1 2 0 34 0 0 13.0000 S
22 23 1 3 0 15 0 0 8.0292 Q
23 24 1 1 0 28 0 0 35.5000 S
24 25 0 3 0 8 3 1 21.0750 S
25 26 1 3 0 38 1 5 31.3875 S
27 28 0 1 0 19 3 2 263.0000 S
30 31 0 1 0 40 0 0 27.7208 C
33 34 0 2 0 66 0 0 10.5000 S
34 35 0 1 0 28 1 0 82.1708 C
35 36 0 1 0 42 1 0 52.0000 S
37 38 0 3 0 21 0 0 8.0500 S
38 39 0 3 0 18 2 0 18.0000 S
... ... ... ... ... ... ... ... ... ...
856 857 1 1 0 45 1 1 164.8667 S
857 858 1 1 0 51 0 0 26.5500 S
858 859 1 3 0 24 0 3 19.2583 C
860 861 0 3 0 41 2 0 14.1083 S
861 862 0 2 0 21 1 0 11.5000 S
862 863 1 1 0 48 0 0 25.9292 S
864 865 0 2 0 24 0 0 13.0000 S
865 866 1 2 0 42 0 0 13.0000 S
866 867 1 2 0 27 1 0 13.8583 C
867 868 0 1 0 31 0 0 50.4958 S
869 870 1 3 0 4 1 1 11.1333 S
870 871 0 3 0 26 0 0 7.8958 S
871 872 1 1 0 47 1 1 52.5542 S
872 873 0 1 0 33 0 0 5.0000 S
873 874 0 3 0 47 0 0 9.0000 S
874 875 1 2 0 28 1 0 24.0000 C
875 876 1 3 0 15 0 0 7.2250 C
876 877 0 3 0 20 0 0 9.8458 S
877 878 0 3 0 19 0 0 7.8958 S
879 880 1 1 0 56 0 1 83.1583 C
880 881 1 2 0 25 0 1 26.0000 S
881 882 0 3 0 33 0 0 7.8958 S
882 883 0 3 0 22 0 0 10.5167 S
883 884 0 2 0 28 0 0 10.5000 S
884 885 0 3 0 25 0 0 7.0500 S
885 886 0 3 0 39 0 5 29.1250 Q
886 887 0 2 0 27 0 0 13.0000 S
887 888 1 1 0 19 0 0 30.0000 S
889 890 1 1 0 26 0 0 30.0000 C
890 891 0 3 0 32 0 0 7.7500 Q

712 rows × 9 columns


In [ ]:


In [71]:
# #From 891 rows in our original dataset, we have come down to 712 "clean" rows
titanic_train_cleaned.shape


Out[71]:
(712, 9)

In [ ]:


In [ ]:


In [ ]:


In [72]:
titanic_train_cleaned.groupby([titanic_train_cleaned.Survived, titanic_train_cleaned.Sex]).size()


Out[72]:
Survived  Sex
0         0      424
1         0      288
dtype: int64

In [73]:
# #Percentage of Women who survived
195*100/(195+64)


Out[73]:
75

In [74]:
# #Percentage of Men who survived
93*100/(93+360)


Out[74]:
20

In [ ]:


In [90]:
# #TESTING DATASET
titanic_test = pd.read_csv("test.csv")

In [92]:
titanic_test_reduced = titanic_test.drop(["Name", "Ticket", "Cabin"], axis = 1)
titanic_test_cleaned = titanic_test_reduced.dropna()
titanic_test_cleaned.Sex = titanic_test_cleaned.Sex.apply(lambda sex: 1 if sex == "male" else 0)
titanic_test_cleaned


Out[92]:
PassengerId Pclass Sex Age SibSp Parch Fare Embarked
0 892 3 1 34.5 0 0 7.8292 Q
1 893 3 0 47.0 1 0 7.0000 S
2 894 2 1 62.0 0 0 9.6875 Q
3 895 3 1 27.0 0 0 8.6625 S
4 896 3 0 22.0 1 1 12.2875 S
5 897 3 1 14.0 0 0 9.2250 S
6 898 3 0 30.0 0 0 7.6292 Q
7 899 2 1 26.0 1 1 29.0000 S
8 900 3 0 18.0 0 0 7.2292 C
9 901 3 1 21.0 2 0 24.1500 S
11 903 1 1 46.0 0 0 26.0000 S
12 904 1 0 23.0 1 0 82.2667 S
13 905 2 1 63.0 1 0 26.0000 S
14 906 1 0 47.0 1 0 61.1750 S
15 907 2 0 24.0 1 0 27.7208 C
16 908 2 1 35.0 0 0 12.3500 Q
17 909 3 1 21.0 0 0 7.2250 C
18 910 3 0 27.0 1 0 7.9250 S
19 911 3 0 45.0 0 0 7.2250 C
20 912 1 1 55.0 1 0 59.4000 C
21 913 3 1 9.0 0 1 3.1708 S
23 915 1 1 21.0 0 1 61.3792 C
24 916 1 0 48.0 1 3 262.3750 C
25 917 3 1 50.0 1 0 14.5000 S
26 918 1 0 22.0 0 1 61.9792 C
27 919 3 1 22.5 0 0 7.2250 C
28 920 1 1 41.0 0 0 30.5000 S
30 922 2 1 50.0 1 0 26.0000 S
31 923 2 1 24.0 2 0 31.5000 S
32 924 3 0 33.0 1 2 20.5750 S
... ... ... ... ... ... ... ... ...
381 1273 3 1 26.0 0 0 7.8792 Q
383 1275 3 0 19.0 1 0 16.1000 S
385 1277 2 0 24.0 1 2 65.0000 S
386 1278 3 1 24.0 0 0 7.7750 S
387 1279 2 1 57.0 0 0 13.0000 S
388 1280 3 1 21.0 0 0 7.7500 Q
389 1281 3 1 6.0 3 1 21.0750 S
390 1282 1 1 23.0 0 0 93.5000 S
391 1283 1 0 51.0 0 1 39.4000 S
392 1284 3 1 13.0 0 2 20.2500 S
393 1285 2 1 47.0 0 0 10.5000 S
394 1286 3 1 29.0 3 1 22.0250 S
395 1287 1 0 18.0 1 0 60.0000 S
396 1288 3 1 24.0 0 0 7.2500 Q
397 1289 1 0 48.0 1 1 79.2000 C
398 1290 3 1 22.0 0 0 7.7750 S
399 1291 3 1 31.0 0 0 7.7333 Q
400 1292 1 0 30.0 0 0 164.8667 S
401 1293 2 1 38.0 1 0 21.0000 S
402 1294 1 0 22.0 0 1 59.4000 C
403 1295 1 1 17.0 0 0 47.1000 S
404 1296 1 1 43.0 1 0 27.7208 C
405 1297 2 1 20.0 0 0 13.8625 C
406 1298 2 1 23.0 1 0 10.5000 S
407 1299 1 1 50.0 1 1 211.5000 C
409 1301 3 0 3.0 1 1 13.7750 S
411 1303 1 0 37.0 1 0 90.0000 Q
412 1304 3 0 28.0 0 0 7.7750 S
414 1306 1 0 39.0 0 0 108.9000 C
415 1307 3 1 38.5 0 0 7.2500 S

331 rows × 8 columns


In [ ]:


In [ ]:


In [ ]:


In [75]:
# #LOGISTIC REGRESSION

In [76]:
model_1 = linear_model.LogisticRegression()

In [80]:
model_1_dependent_vars = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]

In [84]:
titanic_train_cleaned[model_1_dependent_vars].values


Out[84]:
array([[  3.    ,   0.    ,  22.    ,   1.    ,   0.    ,   7.25  ],
       [  1.    ,   0.    ,  38.    ,   1.    ,   0.    ,  71.2833],
       [  3.    ,   0.    ,  26.    ,   0.    ,   0.    ,   7.925 ],
       ..., 
       [  1.    ,   0.    ,  19.    ,   0.    ,   0.    ,  30.    ],
       [  1.    ,   0.    ,  26.    ,   0.    ,   0.    ,  30.    ],
       [  3.    ,   0.    ,  32.    ,   0.    ,   0.    ,   7.75  ]])

In [ ]:


In [85]:
model_1.fit(titanic_train_cleaned[model_1_dependent_vars].values, titanic_train_cleaned.Survived.values)


Out[85]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)

In [97]:
model_1_result = model_1.predict(titanic_test_cleaned[model_1_dependent_vars])
model_1_result


Out[97]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,
       1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=int64)

In [ ]:


In [98]:
len(model_1_result)


Out[98]:
331

In [ ]: