Questions about Titanic dataset

Highlights:

  • On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. This translates to a survival rate of about 32%.
  • One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
  • Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others.
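
The survival rate in the first highlight follows directly from the two figures quoted there. A quick check, using only those numbers (the variable names are illustrative):

```python
# Figures quoted in the first highlight.
total_on_board = 2224
deaths = 1502

survivors = total_on_board - deaths
survival_rate = survivors / total_on_board

print(f"{survivors} survivors, survival rate = {survival_rate:.1%}")
# 722 survivors, survival rate = 32.5%
```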

In [82]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

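def plot_correlation_map(df):
    """Draw an annotated heatmap of the pairwise correlations in df."""
    # Use the argument df (not the global data) and keep only numeric
    # columns, since text columns cannot be correlated.
    corr = df.select_dtypes(include='number').corr()
    _, ax = plt.subplots(figsize=(12, 10))
    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    sns.heatmap(
        corr,
        cmap=cmap,
        square=True,
        cbar_kws={'shrink': .9},
        ax=ax,
        annot=True,
        annot_kws={'fontsize': 12},
    )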

Dataset


In [83]:
data = pd.read_csv('titanic.csv', na_filter=False)

data.head(20)


Out[83]:
    PassengerId  Survived  Pclass  Name                                               Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked
0   1            0         3       Braund, Mr. Owen Harris                            male    22   1      0      A/5 21171         7.2500          S
1   2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38   1      0      PC 17599          71.2833  C85    C
2   3            1         3       Heikkinen, Miss. Laina                             female  26   0      0      STON/O2. 3101282  7.9250          S
3   4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35   1      0      113803            53.1000  C123   S
4   5            0         3       Allen, Mr. William Henry                           male    35   0      0      373450            8.0500          S
5   6            0         3       Moran, Mr. James                                   male         0      0      330877            8.4583          Q
6   7            0         1       McCarthy, Mr. Timothy J                            male    54   0      0      17463             51.8625  E46    S
7   8            0         3       Palsson, Master. Gosta Leonard                     male    2    3      1      349909            21.0750         S
8   9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27   0      2      347742            11.1333         S
9   10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                female  14   1      0      237736            30.0708         C
10  11           1         3       Sandstrom, Miss. Marguerite Rut                    female  4    1      1      PP 9549           16.7000  G6     S
11  12           1         1       Bonnell, Miss. Elizabeth                           female  58   0      0      113783            26.5500  C103   S
12  13           0         3       Saundercock, Mr. William Henry                     male    20   0      0      A/5. 2151         8.0500          S
13  14           0         3       Andersson, Mr. Anders Johan                        male    39   1      5      347082            31.2750         S
14  15           0         3       Vestrom, Miss. Hulda Amanda Adolfina               female  14   0      0      350406            7.8542          S
15  16           1         2       Hewlett, Mrs. (Mary D Kingcome)                    female  55   0      0      248706            16.0000         S
16  17           0         3       Rice, Master. Eugene                               male    2    4      1      382652            29.1250         Q
17  18           1         2       Williams, Mr. Charles Eugene                       male         0      0      244373            13.0000         S
18  19           0         3       Vander Planke, Mrs. Julius (Emelia Maria Vande...  female  31   1      0      345763            18.0000         S
19  20           1         3       Masselmani, Mrs. Fatima                            female       0      0      2649              7.2250          C

Variable descriptions

  • Survived: Survival (0 = No; 1 = Yes).
  • Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd).
  • Name: Name.
  • Sex: Sex.
  • Age: Age.
  • SibSp: Number of Siblings/Spouses Aboard.
  • Parch: Number of Parents/Children Aboard.
  • Ticket: Ticket Number.
  • Fare: Passenger Fare.
  • Cabin: Cabin.
  • Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton).

Questions

1.- Which features are available in the dataset?


In [84]:
print(data.columns.values)


['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']

2.- Which features are categorical or numerical?

  • Categorical: Survived, Pclass, Sex, Embarked.
  • Numerical: PassengerId, Age, SibSp, Parch, Fare.

In [85]:
data.describe()


Out[85]:
       PassengerId    Survived      Pclass       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000    8.000000    6.000000  512.329200
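
Age is listed as numerical, yet it is missing from the describe() output above. A likely explanation, assuming missing ages are stored as empty fields in the file, is that na_filter=False keeps them as empty strings, so the whole column is read as text. The sketch below reproduces this on a three-row CSV in the same layout (the inline csv_text and the names raw and numerical are illustrative, not part of the notebook):

```python
import io
import pandas as pd

# Three rows in the same layout as the dataset; the second has no Age.
csv_text = """PassengerId,Survived,Pclass,Age,Fare
1,0,3,22,7.25
6,0,3,,8.4583
7,0,1,54,51.8625
"""

raw = pd.read_csv(io.StringIO(csv_text), na_filter=False)
print(raw.dtypes)  # Age is read as text because of the empty field

# Convert Age to numbers; anything that is not a number becomes NaN.
raw["Age"] = pd.to_numeric(raw["Age"], errors="coerce")

# List the columns that pandas now treats as numeric.
numerical = raw.select_dtypes(include="number").columns.tolist()
print(numerical)
```

After the conversion, Age shows up among the numeric columns and would be included by describe().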

3.- Which features are mixed data types?

  • Ticket, Cabin.
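
One way to see why Ticket counts as mixed is to check which values are purely numeric and which carry a letter prefix. A minimal sketch on five ticket values copied from the table above (the variable names are illustrative):

```python
import pandas as pd

# Ticket values taken from the first rows of the dataset.
tickets = pd.Series(
    ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"],
    name="Ticket",
)

has_letters = tickets.str.contains(r"[A-Za-z]")  # alphanumeric tickets
digits_only = tickets.str.isdigit()              # purely numeric tickets

print(pd.DataFrame({"Ticket": tickets,
                    "has_letters": has_letters,
                    "digits_only": digits_only}))
```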

4.- Which features may contain errors or typos?

What can we do in those cases?
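
One possible approach for the coded features is to compare each value against the codes listed in the variable descriptions and inspect whatever falls outside them. This is only a sketch: the three-row sample is hypothetical, and its last Embarked value is a made-up typo added to show what a hit looks like.

```python
import pandas as pd

# Allowed codes, taken from the variable descriptions.
allowed = {
    "Survived": {0, 1},
    "Pclass": {1, 2, 3},
    "Embarked": {"C", "Q", "S"},
}

# Hypothetical sample; the lowercase "s" is an invented typo.
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Embarked": ["S", "C", "s"],
})

# For every coded column, collect the values that are not allowed.
invalid = {
    col: sample.loc[~sample[col].isin(values), col].tolist()
    for col, values in allowed.items()
}
print(invalid)
# {'Survived': [], 'Pclass': [], 'Embarked': ['s']}
```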

5.- Which features contain blank, null or empty values?

What can we do in those cases?
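
Because the data was loaded with na_filter=False, blanks are stored as empty strings rather than NaN, so isnull() would not find them. A minimal sketch of two ways to count them, using a four-row CSV built from rows of the table above (csv_text and the variable names are illustrative):

```python
import io
import pandas as pd

# Rows 0, 1, 5 and 6 of the dataset, reduced to four columns.
csv_text = """PassengerId,Age,Cabin,Embarked
1,22,,S
2,38,C85,C
6,,,Q
7,54,E46,S
"""

# Option 1: keep na_filter=False and count empty strings per column.
as_text = pd.read_csv(io.StringIO(csv_text), na_filter=False)
empty_counts = as_text.eq("").sum()
print(empty_counts)

# Option 2: let pandas turn blanks into NaN and count those instead.
with_nan = pd.read_csv(io.StringIO(csv_text))
nan_counts = with_nan.isna().sum()
print(nan_counts)
```

Both counts agree. Once the blanks are identified, typical choices are to drop the rows, fill in a default such as the median, or keep a separate "missing" category.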

6.- What do you think are the most important reasons passengers survived the Titanic sinking?

Who has not seen the Titanic film?


In [86]:
plot_correlation_map(data)



In [87]:
sns.countplot(x='Pclass', hue='Survived', data=data)


Out[87]:
<matplotlib.axes._subplots.AxesSubplot at 0x11437b190>

In [88]:
sns.countplot(x='Sex', hue='Survived', data=data)


Out[88]:
<matplotlib.axes._subplots.AxesSubplot at 0x113f14550>

In [89]:
sns.countplot(x='Embarked', hue='Survived', data=data)


Out[89]:
<matplotlib.axes._subplots.AxesSubplot at 0x11437b050>
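
The count plots can be complemented with numbers: the mean of Survived within each group is that group's survival rate. A minimal sketch on the first ten rows shown earlier (ten rows are far too few to draw conclusions, so run the same groupby on data for the real figures):

```python
import pandas as pd

# Survived, Pclass and Sex for the first ten passengers in the dataset.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
    "Pclass":   [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    "Sex": ["male", "female", "female", "female", "male",
            "male", "male", "male", "female", "female"],
})

# Survived is 0/1, so its mean per group is the survival rate.
rate_by_sex = sample.groupby("Sex")["Survived"].mean()
rate_by_class = sample.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```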

7.- How can we use the feature Name?


In [90]:
data.head(10)


Out[90]:
    PassengerId  Survived  Pclass  Name                                               Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked
0   1            0         3       Braund, Mr. Owen Harris                            male    22   1      0      A/5 21171         7.2500          S
1   2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38   1      0      PC 17599          71.2833  C85    C
2   3            1         3       Heikkinen, Miss. Laina                             female  26   0      0      STON/O2. 3101282  7.9250          S
3   4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35   1      0      113803            53.1000  C123   S
4   5            0         3       Allen, Mr. William Henry                           male    35   0      0      373450            8.0500          S
5   6            0         3       Moran, Mr. James                                   male         0      0      330877            8.4583          Q
6   7            0         1       McCarthy, Mr. Timothy J                            male    54   0      0      17463             51.8625  E46    S
7   8            0         3       Palsson, Master. Gosta Leonard                     male    2    3      1      349909            21.0750         S
8   9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27   0      2      347742            11.1333         S
9   10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                female  14   1      0      237736            30.0708         C

Tip: can we detect married passengers?

Or can we use the title? Mr, Master, etc.
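
The names shown above appear to follow the pattern "Surname, Title. Given names", so the title can be pulled out as the text between the comma and the first period. A minimal sketch, assuming that pattern holds for every row (the names below are copied from the table; the variable names are illustrative):

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Th...",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
], name="Name")

# Capture everything after ", " up to (not including) the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())
# ['Mr', 'Mrs', 'Miss', 'Master']

# "Mrs" suggests a married woman; "Mr" says nothing about marital status.
is_mrs = titles.eq("Mrs")
```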

8.- How can we use the features SibSp and Parch?


In [91]:
data.head(10)


Out[91]:
    PassengerId  Survived  Pclass  Name                                               Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked
0   1            0         3       Braund, Mr. Owen Harris                            male    22   1      0      A/5 21171         7.2500          S
1   2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38   1      0      PC 17599          71.2833  C85    C
2   3            1         3       Heikkinen, Miss. Laina                             female  26   0      0      STON/O2. 3101282  7.9250          S
3   4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35   1      0      113803            53.1000  C123   S
4   5            0         3       Allen, Mr. William Henry                           male    35   0      0      373450            8.0500          S
5   6            0         3       Moran, Mr. James                                   male         0      0      330877            8.4583          Q
6   7            0         1       McCarthy, Mr. Timothy J                            male    54   0      0      17463             51.8625  E46    S
7   8            0         3       Palsson, Master. Gosta Leonard                     male    2    3      1      349909            21.0750         S
8   9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27   0      2      347742            11.1333         S
9   10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                female  14   1      0      237736            30.0708         C

Tip: can we use them to work out the size of the family?

Can you guess how the probability of survival differs between a singleton, a small family and a large family?


In [92]:
# Family size = parents/children + siblings/spouses + the passenger.
data['Family'] = data['Parch'] + data['SibSp'] + 1
data.loc[data['Family'] == 1, 'FamilySize'] = 'singleton'
data.loc[(data['Family'] > 1) & (data['Family'] < 5), 'FamilySize'] = 'small'
data.loc[data['Family'] > 4, 'FamilySize'] = 'large'
sns.countplot(x='FamilySize', hue='Survived', data=data)


Out[92]:
<matplotlib.axes._subplots.AxesSubplot at 0x113269610>
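
The same three groups can also be built in one step with pd.cut, and the survival rate per group read off with a groupby. This is an alternative sketch to the cell above, not a replacement for it. It uses the first ten rows of the dataset, so the rates are only illustrative:

```python
import pandas as pd

# Survived, SibSp and Parch for the first ten passengers in the dataset.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
    "SibSp":    [1, 1, 0, 1, 0, 0, 0, 3, 0, 1],
    "Parch":    [0, 0, 0, 0, 0, 0, 0, 1, 2, 0],
})

sample["Family"] = sample["SibSp"] + sample["Parch"] + 1

# Bins are right-inclusive: (0, 1] singleton, (1, 4] small, (4, inf) large,
# which matches the thresholds used with data.loc above.
sample["FamilySize"] = pd.cut(
    sample["Family"],
    bins=[0, 1, 4, float("inf")],
    labels=["singleton", "small", "large"],
)

rate = sample.groupby("FamilySize", observed=True)["Survived"].mean()
print(rate)
```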