Questions about Titanic dataset

Highlights:

On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. That translates to a survival rate of about 32%.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others.
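
The 32% figure follows directly from the two counts quoted above. A minimal check of the arithmetic (the variable names are illustrative only):

```python
# Counts quoted in the highlights above.
deaths = 1502
total_on_board = 2224

survivors = total_on_board - deaths
survival_rate = survivors / total_on_board

print(f"{survivors} survivors, survival rate = {survival_rate:.1%}")
# prints "722 survivors, survival rate = 32.5%"
```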



In [82]:

    
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

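def plot_correlation_map( df ):
    # Use the function argument (not the global `data`) and keep only
    # numeric columns, since string columns cannot be correlated.
    corr = df.select_dtypes( include = 'number' ).corr()
    _ , ax = plt.subplots( figsize =( 12 , 10 ) )
    cmap = sns.diverging_palette( 220 , 10 , as_cmap = True )
    _ = sns.heatmap(
        corr, 
        cmap = cmap,
        square=True, 
        cbar_kws={ 'shrink' : .9 }, 
        ax=ax, 
        annot = True, 
        annot_kws = { 'fontsize' : 12 }
    )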

Dataset



In [83]:

    
data = pd.read_csv('titanic.csv', na_filter=False)

data.head(20)









    Out[83]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22	1	0	A/5 21171	7.2500		S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26	0	0	STON/O2. 3101282	7.9250		S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35	0	0	373450	8.0500		S
5	6	0	3	Moran, Mr. James	male		0	0	330877	8.4583		Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2	3	1	349909	21.0750		S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27	0	2	347742	11.1333		S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14	1	0	237736	30.0708		C
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4	1	1	PP 9549	16.7000	G6	S
11	12	1	1	Bonnell, Miss. Elizabeth	female	58	0	0	113783	26.5500	C103	S
12	13	0	3	Saundercock, Mr. William Henry	male	20	0	0	A/5. 2151	8.0500		S
13	14	0	3	Andersson, Mr. Anders Johan	male	39	1	5	347082	31.2750		S
14	15	0	3	Vestrom, Miss. Hulda Amanda Adolfina	female	14	0	0	350406	7.8542		S
15	16	1	2	Hewlett, Mrs. (Mary D Kingcome)	female	55	0	0	248706	16.0000		S
16	17	0	3	Rice, Master. Eugene	male	2	4	1	382652	29.1250		Q
17	18	1	2	Williams, Mr. Charles Eugene	male		0	0	244373	13.0000		S
18	19	0	3	Vander Planke, Mrs. Julius (Emelia Maria Vande...	female	31	1	0	345763	18.0000		S
19	20	1	3	Masselmani, Mrs. Fatima	female		0	0	2649	7.2250		C

Variable descriptions

Survived: Survival (0 = No; 1 = Yes).
Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd).
Name: Name.
Sex: Sex.
Age: Age.
SibSp: Number of Siblings/Spouses Aboard.
Parch: Number of Parents/Children Aboard.
Ticket: Ticket Number.
Fare: Passenger Fare.
Cabin: Cabin.
Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton).

Questions

1.- Which features are available in the dataset?



In [84]:

    
print(data.columns.values)









    



['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']

2.- Which features are categorical or numerical?

Categorical: Survived, Pclass, Sex, Embarked.
Numerical: PassengerId, Age, SibSp, Parch, Fare.



In [85]:

    
data.describe()









    Out[85]:

	PassengerId	Survived	Pclass	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	8.000000	6.000000	512.329200
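
Age is listed as numerical but does not appear in the describe() output. A likely reason is that the file was read with na_filter=False, so blank ages stay as empty strings and the whole column is loaded as text. The sketch below shows one possible way to recover a numeric column with pd.to_numeric; the three-row `sample_csv` is a hypothetical stand-in built from values shown in the table, not the real file.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for titanic.csv (one blank Age).
sample_csv = io.StringIO(
    "PassengerId,Age,Fare\n"
    "5,35,8.05\n"
    "6,,8.4583\n"
    "7,54,51.8625\n"
)

# Same option as in the notebook: blanks are kept as empty strings.
sample = pd.read_csv(sample_csv, na_filter=False)
age_dtype_before = sample["Age"].dtype  # a text dtype, not a number

# Convert to numbers; anything unparseable (such as '') becomes NaN.
sample["Age"] = pd.to_numeric(sample["Age"], errors="coerce")

print(age_dtype_before, "->", sample["Age"].dtype)
print(sample["Age"].describe())
```

After the conversion, describe() includes Age and ignores the missing entries when computing the statistics.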

3.- Which features are mixed data types?

Ticket, Cabin.
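
Ticket and Cabin mix letters and digits in a single value. As an illustration only, the sketch below splits a few values taken from the table into a text part and a numeric part; the regular expressions and the names `ticket_parts` and `deck` are assumptions made for this example, not part of the dataset.

```python
import pandas as pd

# A few values copied from the table above.
tickets = pd.Series(["A/5 21171", "PC 17599", "113803", "STON/O2. 3101282"])
cabins = pd.Series(["C85", "", "E46", "G6"])

# Optional prefix, then a final run of digits.
ticket_parts = tickets.str.extract(r"^(?:(?P<prefix>.*)\s)?(?P<number>\d+)$")

# First letter of the cabin, read here as a deck code (an assumption);
# blank cabins yield NaN.
deck = cabins.str.extract(r"^([A-Za-z])", expand=False)

print(ticket_parts)
print(deck)
```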

4.- Which features may contain errors or typos?

What can we do in those cases?
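
One simple starting point is to compare each categorical column against the values listed in the variable descriptions and report anything else. This is only a sketch: the `expected` dictionary and the six-row `sample` frame (copied from the first rows of the table) are assumptions for the example.

```python
import pandas as pd

# First six rows of three categorical columns, copied from the table above.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "S", "S", "Q"],
    "Pclass": [3, 1, 3, 1, 3, 3],
})

# Allowed values according to the variable descriptions.
expected = {
    "Sex": {"male", "female"},
    "Embarked": {"C", "Q", "S"},
    "Pclass": {1, 2, 3},
}

# For each column, list the values that are not in the allowed set.
unexpected = {
    col: sorted(set(sample[col].unique().tolist()) - allowed)
    for col, allowed in expected.items()
}
print(unexpected)
```

On the full dataset, a typo such as "femal" or a blank port would show up in the corresponding list.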

5.- Which features contain blank, null or empty values?

What can we do in those cases?
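
Because the file was read with na_filter=False, blanks are stored as empty strings rather than NaN, so isnull() would not find them. A minimal sketch of counting blanks per column, using a hypothetical three-row `sample_csv` built from rows 4 to 6 of the table:

```python
import io
import pandas as pd

# Hypothetical stand-in for titanic.csv: rows 4-6 of the table, four columns.
sample_csv = io.StringIO(
    "PassengerId,Age,Cabin,Embarked\n"
    "5,35,,S\n"
    "6,,,Q\n"
    "7,54,E46,S\n"
)
sample = pd.read_csv(sample_csv, na_filter=False)

# Count cells that are empty (or whitespace only) in every column.
blank_counts = sample.apply(lambda col: col.astype(str).str.strip().eq("").sum())
print(blank_counts)

# One option afterwards: mark blanks as missing so pandas tools recognise them.
cleaned = sample.replace("", pd.NA)
print(cleaned.isna().sum())
```

Whether to drop, impute, or flag the missing values then depends on the column.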

6.- What do you think are the most important reasons passengers survived the Titanic sinking?

Who has not seen the Titanic film?
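
A guess can be checked by tabulating the survival rate per group. The sketch below uses pd.crosstab on the Sex and Survived values of the first ten rows shown earlier; ten rows are far too few to draw conclusions, so this only illustrates the technique.

```python
import pandas as pd

# Sex and Survived for the first ten passengers in the table above.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male",
            "male", "male", "male", "female", "female"],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

# Share of each Survived value within every Sex group (rows sum to 1).
rates = pd.crosstab(sample["Sex"], sample["Survived"], normalize="index")
print(rates)
```

The same call with "Pclass" or "Embarked" in place of "Sex" gives the other breakdowns.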



In [86]:

    
plot_correlation_map(data)



In [87]:

    
sns.countplot(x='Pclass', hue='Survived', data=data)









    Out[87]:





<matplotlib.axes._subplots.AxesSubplot at 0x11437b190>



In [88]:

    
sns.countplot(x='Sex', hue='Survived', data=data)









    Out[88]:





<matplotlib.axes._subplots.AxesSubplot at 0x113f14550>



In [89]:

    
sns.countplot(x='Embarked', hue='Survived', data=data)









    Out[89]:





<matplotlib.axes._subplots.AxesSubplot at 0x11437b050>

7.- How can we use the feature name?



In [90]:

    
data.head(10)









    Out[90]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22	1	0	A/5 21171	7.2500		S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26	0	0	STON/O2. 3101282	7.9250		S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35	0	0	373450	8.0500		S
5	6	0	3	Moran, Mr. James	male		0	0	330877	8.4583		Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2	3	1	349909	21.0750		S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27	0	2	347742	11.1333		S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14	1	0	237736	30.0708		C

Tip: can we detect married passengers?

Or can we use the title? Mr, Master, etc.
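
Every name in the table follows the pattern "Surname, Title. Given names", so the title can be pulled out with a regular expression. A minimal sketch, assuming that pattern holds for all rows; the names `titles` and `is_mrs` are illustrative.

```python
import pandas as pd

# Four names copied from the table above.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Th...",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# Capture the text between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Rough proxy for married women only; titles say nothing about married men.
is_mrs = titles.eq("Mrs")

print(titles.tolist())
# prints ['Mr', 'Mrs', 'Miss', 'Master']
```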

8.- How can we use the features SibSp and Parch?



In [91]:

    
data.head(10)









    Out[91]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22	1	0	A/5 21171	7.2500		S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26	0	0	STON/O2. 3101282	7.9250		S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35	0	0	373450	8.0500		S
5	6	0	3	Moran, Mr. James	male		0	0	330877	8.4583		Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2	3	1	349909	21.0750		S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27	0	2	347742	11.1333		S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14	1	0	237736	30.0708		C

Tip: can we use it to detect the size of the family?

Can you suppose the probability of survival if: singleton, small family and large family?
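
One way to put numbers on that guess is to bin the family size and average Survived within each bin. The sketch below uses pd.cut with the thresholds 1, 2 to 4, and 5 or more, and the first ten rows of the table as data; ten rows are not representative, so the resulting rates only demonstrate the mechanics.

```python
import pandas as pd

# SibSp, Parch and Survived for the first ten passengers in the table above.
sample = pd.DataFrame({
    "SibSp":    [1, 1, 0, 1, 0, 0, 0, 3, 0, 1],
    "Parch":    [0, 0, 0, 0, 0, 0, 0, 1, 2, 0],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

# Family size counts the passenger plus relatives aboard.
sample["Family"] = sample["SibSp"] + sample["Parch"] + 1

# Bins: (0, 1] singleton, (1, 4] small, (4, inf) large.
sample["FamilySize"] = pd.cut(
    sample["Family"],
    bins=[0, 1, 4, float("inf")],
    labels=["singleton", "small", "large"],
)

# Mean of a 0/1 column is the share of survivors in each group.
rates = sample.groupby("FamilySize", observed=True)["Survived"].mean()
print(rates)
```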



In [92]:

    
data['Family']= data['Parch']+ data['SibSp']+1
data.loc[data["Family"] == 1, "FamilySize"] = 'singleton'
data.loc[(data["Family"] > 1)  &  (data["Family"] < 5) , "FamilySize"] = 'small'
data.loc[data["Family"] >4, "FamilySize"] = 'large'
sns.countplot(x='FamilySize', hue='Survived', data=data)









    Out[92]:





<matplotlib.axes._subplots.AxesSubplot at 0x113269610>

Source and more information at:
