Titanic data exercise



In [1]:

    
import pandas as pd
import numpy as np
import glob # to find all files in folder
from datetime import datetime
from datetime import date, time
from dateutil.parser import parse
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_context('notebook')
pd.options.mode.chained_assignment = None  # default='warn'



In [2]:

    
from IPython.display import HTML  # public import path; IPython.core.display is deprecated for this
HTML(filename='Data/titanic.html')









    Out[2]:




Data frame:titanic3
1309 observations and 14 variables, maximum # NAs:1188

Name       Labels                             Units              Levels  Storage    NAs
pclass                                                           3       integer    0
survived   Survived                                                      double     0
name       Name                                                          character  0
sex                                                              2       integer    0
age        Age                                Year                       double     263
sibsp      Number of Siblings/Spouses Aboard                             double     0
parch      Number of Parents/Children Aboard                             double     0
ticket     Ticket Number                                                 character  0
fare       Passenger Fare                     British Pound (£)          double     1
cabin                                                            187     integer    0
embarked                                                         3       integer    2
boat                                                             28      integer    0
body       Body Identification Number                                    double     1188
home.dest  Home/Destination                                              character  0



Variable Levels
pclass 1st
2nd
3rd
sex female
male
cabin 
A10
A11
A14
A16
A18
A19
A20
A21
A23
A24
A26
A29
A31
A32
A34
A36
A5
A6
A7
A9
B10
B101
B102
B11
B18
B19
B20
B22
B24
B26
B28
B3
B30
B35
B36
B37
B38
B39
B4
B41
B42
B45
B49
B5
B50
B51 B53 B55
B52 B54 B56
B57 B59 B63 B66
B58 B60
B61
B69
B71
B73
B77
B78
B79
B80
B82 B84
B86
B94
B96 B98
C101
C103
C104
C105
C106
C110
C111
C116
C118
C123
C124
C125
C126
C128
C130
C132
C148
C2
C22 C26
C23 C25 C27
C28
C30
C31
C32
C39
C45
C46
C47
C49
C50
C51
C52
C53
C54
C55 C57
C6
C62 C64
C65
C68
C7
C70
C78
C80
C82
C83
C85
C86
C87
C89
C90
C91
C92
C93
C95
C97
C99
D
D10 D12
D11
D15
D17
D19
D20
D21
D22
D26
D28
D30
D33
D34
D35
D36
D37
D38
D40
D43
D45
D46
D47
D48
D49
D50
D56
D6
D7
D9
E10
E101
E12
E121
E17
E24
E25
E31
E33
E34
E36
E38
E39 E41
E40
E44
E45
E46
E49
E50
E52
E58
E60
E63
E67
E68
E77
E8
F
F E46
F E57
F E69
F G63
F G73
F2
F33
F38
F4
G6
T
embarked Cherbourg
Queenstown
Southampton
boat 
1
10
11
12
13
13 15
13 15 B
14
15
15 16
16
2
3
4
5
5 7
5 9
6
7
8
8 10
9
A
B
C
C D
D



In [3]:

    
original_data = pd.read_excel('Data/titanic.xls')
original_data['total'] = 1 # add a column consisting only of 1s to make counting easier
original_data.head(2)









    Out[3]:






  
    
      
   pclass  survived  name                            sex     age      sibsp  parch  ticket  fare      cabin    embarked  boat  body  home.dest                        total
0  1       1         Allen, Miss. Elisabeth Walton   female  29.0000  0      0      24160   211.3375  B5       S         2     NaN   St Louis, MO                     1
1  1       1         Allison, Master. Hudson Trevor  male    0.9167   1      2      113781  151.5500  C22 C26  S         11    NaN   Montreal, PQ / Chesterville, ON  1

1. Describe each attribute, both with basic statistics and plots. State clearly your assumptions and discuss your findings.

pclass

The passenger class (1st, 2nd or 3rd) a person traveled in.



In [4]:

    
pclass = original_data['pclass']
pclass.unique()









    Out[4]:





array([1, 2, 3])

there are 3 different classes



In [5]:

    
for c in pclass.unique():
    print('nbr in class '+str(c)+': '+str(len(pclass[pclass == c])))









    



nbr in class 1: 323
nbr in class 2: 277
nbr in class 3: 709

Most passengers are in class 3, but surprisingly class 1 has more passengers than class 2.
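The per-class counts can also be obtained without an explicit loop. A minimal sketch, assuming a small made-up Series (`toy_pclass` is hypothetical and only stands in for `original_data['pclass']`):

```python
import pandas as pd

# Hypothetical toy data standing in for original_data['pclass']
toy_pclass = pd.Series([1, 1, 2, 3, 3, 3], name='pclass')

# Absolute number of passengers per class, ordered by class label
class_counts = toy_pclass.value_counts().sort_index()

# Share of each class in percent
class_shares = (toy_pclass.value_counts(normalize=True).sort_index() * 100).round(1)

print(class_counts)
print(class_shares)
```

On the real column, the same two calls should reproduce the counts printed above along with their percentages.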



In [6]:

    
plt.hist(pclass.values)









    Out[6]:





(array([ 323.,    0.,    0.,    0.,    0.,  277.,    0.,    0.,    0.,  709.]),
 array([ 1. ,  1.2,  1.4,  1.6,  1.8,  2. ,  2.2,  2.4,  2.6,  2.8,  3. ]),
 <a list of 10 Patch objects>)

survived

States whether the passenger survived the sinking of the Titanic (1 = survived, 0 = died)



In [7]:

    
surv = original_data['survived']
surv.unique() # to make sure there are only 1 and 0









    Out[7]:





array([1, 0])



In [8]:

    
#how many survived?
surv.sum()









    Out[8]:





500



In [9]:

    
#how many died?
len(surv[surv == 0])









    Out[9]:





809

most died :(



In [10]:

    
100/len(surv.values) * surv.sum()









    Out[10]:





38.19709702062643

Only about 38% survived.
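Because `survived` only contains 0 and 1, its mean is the survival proportion. A small sketch that rebuilds a 0/1 Series from the two counts shown above (500 survivors, 809 deaths) rather than loading the file:

```python
import pandas as pd

# Rebuild a 0/1 column from the counts reported above (500 survived, 809 died)
survived = pd.Series([1] * 500 + [0] * 809, name='survived')

# The mean of a 0/1 column is the fraction of ones
survival_rate = survived.mean() * 100

print(round(survival_rate, 2))
```

On the real data this would simply be `surv.mean() * 100`.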

name

The name of the passenger



In [11]:

    
name = original_data['name']
len(name.unique()) == len(name.values)









    Out[11]:





False

Apparently some passengers share the same name.



In [12]:

    
len(name.values) - len(name.unique())









    Out[12]:





2



In [13]:

    
# let's find them
original_data[name.isin(name[name.duplicated()].values)]









    Out[13]:






  
    
      
     pclass  survived  name                  sex     age   sibsp  parch  ticket  fare    cabin  embarked  boat  body  home.dest  total
725  3       1         Connolly, Miss. Kate  female  22.0  0      0      370373  7.7500  NaN    Q         13    NaN   Ireland    1
726  3       0         Connolly, Miss. Kate  female  30.0  0      0      330972  7.6292  NaN    Q         NaN   NaN   Ireland    1
924  3       0         Kelly, Mr. James      male    34.5  0      0      330911  7.8292  NaN    Q         NaN   70.0  NaN        1
925  3       0         Kelly, Mr. James      male    44.0  0      0      363592  8.0500  NaN    S         NaN   NaN   NaN        1
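The rows above share a name but differ in age and ticket, so they look like different people rather than double entries. A hedged sketch of how both checks could be written with `duplicated`, using a made-up frame (all names and ages here are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame; names and ages are invented for illustration
toy = pd.DataFrame({
    'name': ['Doe, Mr. John', 'Roe, Miss. Jane', 'Doe, Mr. John'],
    'age': [30.0, 25.0, 41.0],
})

# keep=False marks every member of a duplicate group, not only the repeats
same_name = toy[toy['name'].duplicated(keep=False)]

# Rows that agree on name AND age would be more likely to be true double entries
exact_repeats = toy[toy.duplicated(subset=['name', 'age'], keep=False)]

print(same_name)
print(len(exact_repeats))
```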

sex

the sex of the passenger



In [14]:

    
sex = original_data['sex']
sex.unique()









    Out[14]:





array(['female', 'male'], dtype=object)



In [15]:

    
nbr_males = len(sex[sex == 'male'])



In [16]:

    
nbr_females= len(sex[sex == 'female'])



In [17]:

    
100/len(sex) * nbr_males









    Out[17]:





64.40030557677616

64.4% are male

age

How old the passenger is



In [18]:

    
age = original_data['age']
age.unique()









    Out[18]:





array([ 29.    ,   0.9167,   2.    ,  30.    ,  25.    ,  48.    ,
        63.    ,  39.    ,  53.    ,  71.    ,  47.    ,  18.    ,
        24.    ,  26.    ,  80.    ,      nan,  50.    ,  32.    ,
        36.    ,  37.    ,  42.    ,  19.    ,  35.    ,  28.    ,
        45.    ,  40.    ,  58.    ,  22.    ,  41.    ,  44.    ,
        59.    ,  60.    ,  33.    ,  17.    ,  11.    ,  14.    ,
        49.    ,  76.    ,  46.    ,  27.    ,  64.    ,  55.    ,
        70.    ,  38.    ,  51.    ,  31.    ,   4.    ,  54.    ,
        23.    ,  43.    ,  52.    ,  16.    ,  32.5   ,  21.    ,
        15.    ,  65.    ,  28.5   ,  45.5   ,  56.    ,  13.    ,
        61.    ,  34.    ,   6.    ,  57.    ,  62.    ,  67.    ,
         1.    ,  12.    ,  20.    ,   0.8333,   8.    ,   0.6667,
         7.    ,   3.    ,  36.5   ,  18.5   ,   5.    ,  66.    ,
         9.    ,   0.75  ,  70.5   ,  22.5   ,   0.3333,   0.1667,
        40.5   ,  10.    ,  23.5   ,  34.5   ,  20.5   ,  30.5   ,
        55.5   ,  38.5   ,  14.5   ,  24.5   ,  60.5   ,  74.    ,
         0.4167,  11.5   ,  26.5   ])

There are NaN values! But also floating point values, which is somewhat unusual but not a problem per se.
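Before computing statistics it helps to know how many ages are missing. A minimal sketch on a made-up Series (`toy_age` is hypothetical; on the real data one would use `age` instead):

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages, including two missing values
toy_age = pd.Series([29.0, 0.9167, np.nan, 30.0, np.nan, 48.0], name='age')

n_missing = toy_age.isna().sum()             # number of NaN entries
share_missing = toy_age.isna().mean() * 100  # percentage of NaN entries

# describe() skips NaN and reports count, mean, std, min, quartiles and max
summary = toy_age.describe()

print(n_missing, round(share_missing, 1))
print(summary)
```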



In [19]:

    
age.min() # a baby?









    Out[19]:





0.16669999999999999



In [20]:

    
age.max()









    Out[20]:





80.0



In [21]:

    
age.mean()









    Out[21]:





29.8811345124283

Age distribution in a boxplot:



In [22]:

    
sns.boxplot(age.dropna().values)









    Out[22]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fb95b4907b8>

And the distribution of age plotted:



In [23]:

    
plt.hist(age.dropna().values)  # drop NaN ages first, otherwise the histogram fails

sibsp

The number of siblings or spouses on the ship



In [24]:

    
sipsp = original_data['sibsp']
sipsp.unique()









    Out[24]:





array([0, 1, 2, 3, 4, 5, 8])



In [25]:

    
sipsp.mean()









    Out[25]:





0.4988540870893812

Plot histogram: Most passengers (891 of 1309) traveled without siblings or spouses. There is apparently one large family that traveled together: 9 passengers each have 8 siblings on board.



In [26]:

    
plt.hist(sipsp)









    Out[26]:





(array([ 891.,  319.,   42.,   20.,    0.,   22.,    6.,    0.,    0.,    9.]),
 array([ 0. ,  0.8,  1.6,  2.4,  3.2,  4. ,  4.8,  5.6,  6.4,  7.2,  8. ]),
 <a list of 10 Patch objects>)

parch

The number of parents or children on the ship



In [27]:

    
parch = original_data['parch']
parch.unique()









    Out[27]:





array([0, 2, 1, 4, 3, 5, 6, 9])



In [28]:

    
parch.mean()









    Out[28]:





0.3850267379679144

Histogram: Again, most passengers (1002 of 1309) traveled without parents or children. The one big family is seen here again.



In [29]:

    
plt.hist(parch)









    Out[29]:





(array([ 1002.,   170.,   113.,     8.,     6.,     6.,     2.,     0.,
            0.,     2.]),
 array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ]),
 <a list of 10 Patch objects>)

Let's find the family



In [30]:

    
# the kids
original_data[original_data['sibsp'] == 8]









    Out[30]:






  
    
      
      pclass  survived  name                               sex     age   sibsp  parch  ticket    fare   cabin  embarked  boat  body  home.dest  total
1170  3       0         Sage, Master. Thomas Henry         male    NaN   8      2      CA. 2343  69.55  NaN    S         NaN   NaN   NaN        1
1171  3       0         Sage, Master. William Henry        male    14.5  8      2      CA. 2343  69.55  NaN    S         NaN   67.0  NaN        1
1172  3       0         Sage, Miss. Ada                    female  NaN   8      2      CA. 2343  69.55  NaN    S         NaN   NaN   NaN        1
1173  3       0         Sage, Miss. Constance Gladys       female  NaN   8      2      CA. 2343  69.55  NaN    S         NaN   NaN   NaN        1
1174  3       0         Sage, Miss. Dorothy Edith "Dolly"  female  NaN   8      2      CA. 2343  69.55  NaN    S         NaN   NaN   NaN        1
1175  3       0         Sage, Miss. Stella Anna            female  NaN   8      2      CA. 2343  69.55  NaN    S         NaN   NaN   NaN        1
1176  3       0         Sage, Mr. Douglas Bullen           male    NaN   8      2      CA. 2343  69.55  NaN    S         NaN   NaN   NaN        1
1177  3       0         Sage, Mr. Frederick                male    NaN   8      2      CA. 2343  69.55  NaN    S         NaN   NaN   NaN        1
1178  3       0         Sage, Mr. George John Jr           male    NaN   8      2      CA. 2343  69.55  NaN    S         NaN   NaN   NaN        1



In [31]:

    
#  the parents
original_data[original_data['parch'] == 9]









    Out[31]:






  
    
      
      pclass  survived  name                            sex     age  sibsp  parch  ticket    fare   cabin  embarked  boat  body  home.dest  total
1179  3       0         Sage, Mr. John George           male    NaN  1      9      CA. 2343  69.55  NaN    S         NaN   NaN   NaN        1
1180  3       0         Sage, Mrs. John (Annie Bullen)  female  NaN  1      9      CA. 2343  69.55  NaN    S         NaN   NaN   NaN        1

These are the children and the parents of the 'big' family. Sadly, all of them died :(
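Families like this one can also be spotted without guessing the right `sibsp` or `parch` value. A hedged sketch: the first three rows copy `sibsp`, `parch` and `ticket` from the tables above, the last row is a made-up solo traveler, and the column names `family_size` and `ticket_group` are hypothetical helpers:

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['Sage, Miss. Ada', 'Sage, Mr. Frederick',
             'Sage, Mr. John George', 'Solo, Mr. Example'],  # last row is invented
    'sibsp': [8, 8, 1, 0],
    'parch': [2, 2, 9, 0],
    'ticket': ['CA. 2343', 'CA. 2343', 'CA. 2343', 'X 1'],
})

# Family size as seen from each passenger: relatives on board plus the person
toy['family_size'] = toy['sibsp'] + toy['parch'] + 1

# Number of rows sharing the same ticket
toy['ticket_group'] = toy['ticket'].map(toy['ticket'].value_counts())

print(toy[['name', 'family_size', 'ticket_group']])
```

For the family above both the children (8 + 2 + 1) and the parents (1 + 9 + 1) give a family size of 11, which matches the 9 children and 2 parents listed.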

ticket

The ticket number the passenger had



In [32]:

    
ticket = original_data['ticket']
len(ticket.unique())









    Out[32]:





939



In [33]:

    
ticket.dtype









    Out[33]:





dtype('O')



In [34]:

    
len(ticket[ticket.isnull()])









    Out[34]:





0

All (registered) passengers had a ticket ;)

fare

How much they paid



In [35]:

    
fare = original_data['fare']
fare.mean()









    Out[35]:





33.29547928134572



In [36]:

    
fare.max()









    Out[36]:





512.32920000000001



In [37]:

    
fare.min()









    Out[37]:





0.0

There are people that did not pay anything



In [38]:

    
original_data[fare == 0]









    Out[38]:






  
    
      
      pclass  survived  name                                   sex     age   sibsp  parch  ticket  fare  cabin        embarked  boat  body   home.dest                     total
7     1       0         Andrews, Mr. Thomas Jr                 male    39.0  0      0      112050  0.0   A36          S         NaN   NaN    Belfast, NI                   1
70    1       0         Chisholm, Mr. Roderick Robert Crispin  male    NaN   0      0      112051  0.0   NaN          S         NaN   NaN    Liverpool, England / Belfast  1
125   1       0         Fry, Mr. Richard                       male    NaN   0      0      112058  0.0   B102         S         NaN   NaN    NaN                           1
150   1       0         Harrison, Mr. William                  male    40.0  0      0      112059  0.0   B94          S         NaN   110.0  NaN                           1
170   1       1         Ismay, Mr. Joseph Bruce                male    49.0  0      0      112058  0.0   B52 B54 B56  S         C     NaN    Liverpool                     1
223   1       0         Parr, Mr. William Henry Marsh          male    NaN   0      0      112052  0.0   NaN          S         NaN   NaN    Belfast                       1
234   1       0         Reuchlin, Jonkheer. John George        male    38.0  0      0      19972   0.0   NaN          S         NaN   NaN    Rotterdam, Netherlands        1
363   2       0         Campbell, Mr. William                  male    NaN   0      0      239853  0.0   NaN          S         NaN   NaN    Belfast                       1
384   2       0         Cunningham, Mr. Alfred Fleming         male    NaN   0      0      239853  0.0   NaN          S         NaN   NaN    Belfast                       1
410   2       0         Frost, Mr. Anthony Wood "Archie"       male    NaN   0      0      239854  0.0   NaN          S         NaN   NaN    Belfast                       1
473   2       0         Knight, Mr. Robert J                   male    NaN   0      0      239855  0.0   NaN          S         NaN   NaN    Belfast                       1
528   2       0         Parkes, Mr. Francis "Frank"            male    NaN   0      0      239853  0.0   NaN          S         NaN   NaN    Belfast                       1
581   2       0         Watson, Mr. Ennis Hastings             male    NaN   0      0      239856  0.0   NaN          S         NaN   NaN    Belfast                       1
896   3       0         Johnson, Mr. Alfred                    male    49.0  0      0      LINE    0.0   NaN          S         NaN   NaN    NaN                           1
898   3       0         Johnson, Mr. William Cahoone Jr        male    19.0  0      0      LINE    0.0   NaN          S         NaN   NaN    NaN                           1
963   3       0         Leonard, Mr. Lionel                    male    36.0  0      0      LINE    0.0   NaN          S         NaN   NaN    NaN                           1
1254  3       1         Tornquist, Mr. William Henry           male    25.0  0      0      LINE    0.0   NaN          S         15    NaN    NaN                           1



In [39]:

    
fare.dtypes









    Out[39]:





dtype('float64')



In [40]:

    
original_data[fare.isnull()]









    Out[40]:






  
    
      
      pclass  survived  name                sex   age   sibsp  parch  ticket  fare  cabin  embarked  boat  body   home.dest  total
1225  3       0         Storey, Mr. Thomas  male  60.5  0      0      3701    NaN   NaN    S         NaN   261.0  NaN        1

there is one NaN value



In [41]:

    
plt.hist(fare.dropna())









    Out[41]:





(array([ 1070.,   154.,    42.,     4.,    21.,    13.,     0.,     0.,
            0.,     4.]),
 array([   0.     ,   51.23292,  102.46584,  153.69876,  204.93168,
         256.1646 ,  307.39752,  358.63044,  409.86336,  461.09628,
         512.3292 ]),
 <a list of 10 Patch objects>)

A few passengers (4 in the highest bin) either got ripped off or got the best rooms.

cabin

What cabin they are in



In [42]:

    
cabin = original_data['cabin']
cabin.isnull().sum()









    Out[42]:





1014

1014 people have no cabin (all class 3?)



In [43]:

    
plt.hist(original_data[cabin.isnull()]['pclass'])









    Out[43]:





(array([  67.,    0.,    0.,    0.,    0.,  254.,    0.,    0.,    0.,  693.]),
 array([ 1. ,  1.2,  1.4,  1.6,  1.8,  2. ,  2.2,  2.4,  2.6,  2.8,  3. ]),
 <a list of 10 Patch objects>)

Even people in class 1 have no cabin (or it is unknown)
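The histogram counts become easier to compare as shares of each class. A small worked example using only numbers printed earlier in this notebook: the missing-cabin counts per class (67, 254, 693) and the class sizes (323, 277, 709):

```python
import pandas as pd

# Passengers without a cabin entry per class (from the histogram output above)
missing_cabin = pd.Series({1: 67, 2: 254, 3: 693})

# Total passengers per class (from the pclass section)
class_size = pd.Series({1: 323, 2: 277, 3: 709})

# Percentage of each class that has no cabin recorded
missing_share = (missing_cabin / class_size * 100).round(1)

print(missing_share)
```

Roughly a fifth of class 1 has no cabin recorded, against more than nine in ten for classes 2 and 3.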

Some people have several cabins, but these are also shared by several people, probably families. It would be quite complicated to take those 'multiple cabin' entries apart. With more time we could have done it.
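One possible way to take multi-cabin entries apart is to split on whitespace and then give each cabin its own row. This is only a sketch on a made-up frame (passenger names `A`, `B`, `C` and the helper columns are hypothetical), and it assumes that a space always separates two cabins, which may not hold for entries such as `F E46`:

```python
import pandas as pd

# Hypothetical toy frame; the cabin strings follow the format seen in the data
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'cabin': ['B5', 'C22 C26', None],
})

# Split each entry on whitespace into a list of cabins (missing stays missing)
toy['cabin_list'] = toy['cabin'].str.split()

# Number of cabins per passenger
toy['n_cabins'] = toy['cabin_list'].str.len()

# One row per (passenger, cabin) pair
per_cabin = toy.explode('cabin_list').rename(columns={'cabin_list': 'single_cabin'})

print(per_cabin[['name', 'single_cabin']])
```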



In [44]:

    
cabin.head()









    Out[44]:





0         B5
1    C22 C26
2    C22 C26
3    C22 C26
4    C22 C26
Name: cabin, dtype: object

embarked



In [45]:

    
embarked = original_data['embarked']
embarked.unique()









    Out[45]:





array(['S', 'C', nan, 'Q'], dtype=object)



In [46]:

    
len(embarked[embarked.isnull()])









    Out[46]:





2

two people have NaN in 'embarked'



In [47]:

    
sns.countplot(y="embarked", data=original_data, color="c");

boat

Which lifeboat they were rescued on



In [48]:

    
boat = original_data['boat']
boat.unique()









    Out[48]:





array([2, '11', nan, '3', '10', 'D', '4', '9', '6', 'B', '8', 'A', '5',
       '7', 'C', '14', '2', '5 9', '13', '1', '15', '5 7', '8 10', '12',
       '16', '13 15 B', 'C D', '15 16', '13 15'], dtype=object)

Some passengers are listed with several boats. Note that boat 2 appears both as the integer 2 and as the string '2'.
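The `unique()` output above mixes the integer `2` with strings such as `'2'`, so the same boat is counted twice. A hedged sketch of one way to normalize this, on a made-up Series that imitates the mixed types (`toy_boat` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with the same mix of int, str and NaN as the real one
toy_boat = pd.Series([2, '11', np.nan, '2', '5 9', 'C D'], dtype=object)

# Convert every non-missing entry to a string so 2 and '2' become equal
boat_str = toy_boat.dropna().astype(str)

# Split multi-boat entries and put each boat label on its own row
single_boats = boat_str.str.split().explode()

print(toy_boat.nunique(), boat_str.nunique())
print(single_boats.value_counts())
```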

body

the identification number of a body



In [49]:

    
body = original_data['body']
body.count()









    Out[49]:





121

121 bodies were assigned a number.

home dest



In [50]:

    
homedest = original_data['home.dest']
len(homedest.dropna().unique())









    Out[50]:





369

There are 369 different home destinations. Let's find the most common one.



In [51]:

    
original_data[['home.dest', 'total']].groupby(by='home.dest').sum().sort_values(by='total', ascending=False)









    Out[51]:






  
    
      
home.dest                                     total
New York, NY                                  64
London                                        14
Montreal, PQ                                  10
Paris, France                                 9
Cornwall / Akron, OH                          9
Wiltshire, England Niagara Falls, NY          8
Winnipeg, MB                                  8
Philadelphia, PA                              8
Belfast                                       7
Brooklyn, NY                                  7
Sweden Winnipeg, MN                           7
Haverford, PA / Cooperstown, NY               5
Somerset / Bernardsville, NJ                  5
Bulgaria Chicago, IL                          5
Ottawa, ON                                    5
Rotherfield, Sussex, England Essex Co, MA     5
Sweden Worcester, MA                          5
San Francisco, CA                             4
Guernsey / Elizabeth, NJ                      4
Washington, DC                                4
Bournmouth, England                           4
Haverford, PA                                 4
Guntur, India / Benton Harbour, MI            4
Bryn Mawr, PA                                 4
St Louis, MO                                  4
Ruotsinphytaa, Finland New York, NY           4
Chicago, IL                                   4
Minneapolis, MN                               4
London, England                               4
Ireland Chicago, IL                           4
...                                           ...
Goteborg, Sweden Huntley, IL                  1
Goteborg, Sweden / Rockford, IL               1
Glen Ridge, NJ                                1
Glasgow / Bangor, ME                          1
Germantown, Philadelphia, PA                  1
India / Pittsburgh, PA                        1
Gallipolis, Ohio / ? Paris / New York         1
Frankfort, KY                                 1
Foresvik, Norway Portland, ND                 1
Folkstone, Kent / New York, NY                1
Finland Sudbury, ON                           1
Finland / Washington, DC                      1
Guernsey / Montclair, NJ and/or Toledo, Ohio  1
Guernsey / Wilmington, DE                     1
Guernsey, England / Edgewood, RI              1
Gunnislake, England / Butte, MT               1
Haddenfield, NJ                               1
Halesworth, England                           1
Hamilton, ON                                  1
Harrisburg, PA                                1
Harrow, England                               1
Harrow-on-the-Hill, Middlesex                 1
Hartford, CT                                  1
Hartford, Huntingdonshire                     1
Helsinki, Finland Ashtabula, Ohio             1
Hessle, Yorks                                 1
Holley, NY                                    1
Hornsey, England                              1
Illinois, USA                                 1
Kontiolahti, Finland / Detroit, MI            1

369 rows × 1 columns

The most common entry is New York, NY (64 passengers).
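The helper column `total` is not strictly needed for this kind of count. A minimal sketch, on a made-up frame (`toy` is hypothetical), of counting group sizes directly:

```python
import pandas as pd

# Hypothetical toy frame with one missing home.dest
toy = pd.DataFrame({
    'home.dest': ['New York, NY', 'London', 'New York, NY', None, 'Belfast'],
})

# size() counts rows per group; rows with a missing key are dropped by default
dest_counts = toy.groupby('home.dest').size().sort_values(ascending=False)

print(dest_counts)
```

`toy['home.dest'].value_counts()` would give the same numbers, already sorted in descending order.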

2. Use the `groupby` method to calculate the proportion of passengers that survived by sex:

First gather the numbers



In [52]:

    
survived_by_sex = original_data[['survived', 'sex']].groupby('sex').sum()
nbr_males = len(original_data[original_data['sex'] == 'male'])
nbr_females = len(original_data[original_data['sex'] == 'female'])
nbr_total = len(original_data['sex'])
survived_by_sex



In [53]:

    
print(nbr_total == nbr_females + nbr_males) # to check if consistent









    



True

Then calculate the percentages



In [54]:

    
female_survived_percentage = (100/nbr_females) * survived_by_sex.at['female', 'survived']
male_survived_percentage = (100/nbr_males) * survived_by_sex.at['male', 'survived']
print('female surv: '+str(round(female_survived_percentage, 3))+'%')
print('male surv: '+str(round(male_survived_percentage, 3))+'%')









    



female surv: 72.747%
male surv: 19.098%
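Since `survived` is coded as 0/1, `groupby` can return the proportion in one step by taking the mean per group. A minimal sketch on made-up data (`toy` is hypothetical and does not reproduce the real numbers):

```python
import pandas as pd

# Hypothetical toy data: 3 women (2 survived) and 4 men (1 survived)
toy = pd.DataFrame({
    'sex': ['female', 'female', 'female', 'male', 'male', 'male', 'male'],
    'survived': [1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column per group = proportion of survivors in that group
rate_by_sex = toy.groupby('sex')['survived'].mean() * 100

print(rate_by_sex.round(2))
```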

3. Calculate the same proportion, but by class and sex.



In [55]:

    
# make use of the 'total' column (which is all 1's in the original_data)
survived_by_class = original_data[['pclass', 'sex', 'survived', 'total']].groupby(['pclass', 'sex']).sum()
survived_by_class



In [56]:

    
def combine_surv_total(row):
    #print(row)
    return 100.0/row.total * row.survived

create a new column with the apply method



In [57]:

    
survived_by_class['survived in %'] = survived_by_class.apply(combine_surv_total, axis=1)
survived_by_class









    Out[57]:






  
    
      
      
pclass  sex     survived  total  survived in %
1       female  139       144    96.527778
        male    61        179    34.078212
2       female  94        106    88.679245
        male    25        171    14.619883
3       female  106       216    49.074074
        male    75        493    15.212982
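The same table can be produced in a single aggregation, without the `total` column or a row-wise `apply`. The sketch below rebuilds a passenger-level frame from the `survived` and `total` counts in the table above (the variable names are hypothetical) and then aggregates it:

```python
import pandas as pd

# (survived, total) per (pclass, sex), copied from the table above
counts = {
    (1, 'female'): (139, 144), (1, 'male'): (61, 179),
    (2, 'female'): (94, 106),  (2, 'male'): (25, 171),
    (3, 'female'): (106, 216), (3, 'male'): (75, 493),
}

# Expand the counts into one row per passenger
rows = []
for (pclass, sex), (n_survived, n_total) in counts.items():
    rows += [(pclass, sex, 1)] * n_survived
    rows += [(pclass, sex, 0)] * (n_total - n_survived)
passengers = pd.DataFrame(rows, columns=['pclass', 'sex', 'survived'])

# sum = survivors, count = group size, mean = survival proportion
summary = passengers.groupby(['pclass', 'sex'])['survived'].agg(['sum', 'count', 'mean'])
summary['mean'] = (summary['mean'] * 100).round(2)

# One row per class, one column per sex
wide = summary['mean'].unstack('sex')
print(wide)
```

The wide layout makes the gap between the sexes within each class easy to read off.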

Here is a plot showing the survival rates. Note that the plot is not based on the table calculated above; seaborn computes the mean of `survived` per group directly from `original_data`.



In [67]:

    
type(original_data['sex'])









    Out[67]:





pandas.core.series.Series



In [58]:

    
sns.barplot(x='sex', y='survived', hue='pclass', data=original_data);

We can see that 'women first' is true, but also 'class 1 first'

4. Create age categories: children (under 14 years), adolescents (14-20), adult (21-64), and senior(65+), and calculate survival proportions by age category, class and sex.

Create the categories. We use the value -1 to mark passengers whose age is NaN and put them in the category 'No age'.



In [59]:

    
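# mark missing ages with -1 so they fall into the 'No age' bin
original_data['age'] = original_data['age'].fillna(-1)
# bins are right-closed, e.g. (0, 14]: an age of exactly 14 is binned as 'child'
age_cats = pd.cut(original_data.age, [-2, 0+1e-6,14+1e-6,20+1e-6,64+1e-6,120], labels=['No age', 'child','adolescent','adult','senior'], include_lowest=True)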
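The exercise asks for children under 14, while the bins used here put an age of exactly 14 into 'child'. A hedged alternative, not the code used for the tables in this notebook, is to use left-closed bins via `right=False`. It assumes that 'adult' starts at 21 and 'senior' at 65, and `toy_age` is made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on or near the category boundaries
toy_age = pd.Series([0.1667, 13.9, 14.0, 20.5, 21.0, 64.0, 65.0, np.nan])

labels = ['child', 'adolescent', 'adult', 'senior']

# right=False gives the intervals [0, 14), [14, 21), [21, 65), [65, 120)
cats = pd.cut(toy_age, bins=[0, 14, 21, 65, 120], right=False, labels=labels)

# Missing ages stay NaN after cut; give them their own category
cats = cats.cat.add_categories('No age').fillna('No age')

print(cats.tolist())
```

With these bins no sentinel value such as -1 is needed, but the resulting group sizes would differ slightly from the tables in this notebook.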



In [60]:

    
original_data['age-category'] = age_cats



In [61]:

    
catsdata = original_data[['sex', 'age-category', 'pclass', 'survived', 'total']]

Then group the data in a sensible way to get the table below.



In [62]:

    
grouped = catsdata.groupby(['sex', 'age-category', 'pclass']).sum().fillna(0)
grouped









    Out[62]:






  
    
      
      
      
sex     age-category  pclass  survived  total
female  No age        1       11.0      11.0
                      2       2.0       3.0
                      3       34.0      64.0
        child         1       1.0       2.0
                      2       15.0      15.0
                      3       16.0      33.0
        adolescent    1       14.0      14.0
                      2       11.0      12.0
                      3       18.0      33.0
        adult         1       112.0     116.0
                      2       66.0      76.0
                      3       38.0      86.0
        senior        1       1.0       1.0
                      2       0.0       0.0
                      3       0.0       0.0
male    No age        1       8.0       28.0
                      2       2.0       13.0
                      3       16.0      144.0
        child         1       5.0       5.0
                      2       11.0      12.0
                      3       13.0      40.0
        adolescent    1       1.0       5.0
                      2       2.0       16.0
                      3       7.0       61.0
        adult         1       46.0      134.0
                      2       10.0      128.0
                      3       39.0      245.0
        senior        1       1.0       7.0
                      2       0.0       2.0
                      3       0.0       3.0

And finally calculate the survival proportion for all cases



In [63]:

    
def surv_proportions(row):
    if row.total == 0:
        return np.nan
    return round(100.0/row.total * row.survived, 2)

grouped['survive-portion (%)'] = grouped.apply(surv_proportions, axis=1)
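The row-wise `apply` can also be written as a column operation. In pandas, dividing 0.0 by 0.0 yields NaN rather than raising an error, so empty groups need no special case. A minimal sketch on a made-up frame (`toy_grouped` and its index labels are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the grouped table: one normal group,
# one empty group and one group without survivors
toy_grouped = pd.DataFrame(
    {'survived': [1.0, 0.0, 0.0], 'total': [2.0, 0.0, 3.0]},
    index=['group a', 'empty group', 'group c'],
)

# Column-wise division; 0.0 / 0.0 becomes NaN for the empty group
portion = (toy_grouped['survived'] / toy_grouped['total'] * 100).round(2)

print(portion)
```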



In [64]:

    
grouped









    Out[64]:






  
    
      
      
      
sex     age-category  pclass  survived  total  survive-portion (%)
female  No age        1       11.0      11.0   100.00
                      2       2.0       3.0    66.67
                      3       34.0      64.0   53.12
        child         1       1.0       2.0    50.00
                      2       15.0      15.0   100.00
                      3       16.0      33.0   48.48
        adolescent    1       14.0      14.0   100.00
                      2       11.0      12.0   91.67
                      3       18.0      33.0   54.55
        adult         1       112.0     116.0  96.55
                      2       66.0      76.0   86.84
                      3       38.0      86.0   44.19
        senior        1       1.0       1.0    100.00
                      2       0.0       0.0    NaN
                      3       0.0       0.0    NaN
male    No age        1       8.0       28.0   28.57
                      2       2.0       13.0   15.38
                      3       16.0      144.0  11.11
        child         1       5.0       5.0    100.00
                      2       11.0      12.0   91.67
                      3       13.0      40.0   32.50
        adolescent    1       1.0       5.0    20.00
                      2       2.0       16.0   12.50
                      3       7.0       61.0   11.48
        adult         1       46.0      134.0  34.33
                      2       10.0      128.0  7.81
                      3       39.0      245.0  15.92
        senior        1       1.0       7.0    14.29
                      2       0.0       2.0    0.00
                      3       0.0       3.0    0.00

Plots

Two plots showing this. The first shows the female and the second the male passengers.



In [65]:

    
sns.barplot(x="pclass", y="survived", hue="age-category", data=original_data[original_data['sex'] == 'female'])









    Out[65]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fb9584cbe48>

Almost all women from class 1 and 2 survived; in class 3 about 50% survived.



In [66]:

    
sns.barplot(x="pclass", y="survived", hue="age-category", data=original_data[original_data['sex'] == 'male'])









    Out[66]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fb958387cf8>

It is interesting to see that few men survived, except children (at least in class 1 and 2). So 'children before adults' was certainly a thing.



581	2	0	Watson, Mr. Ennis Hastings	male	NaN	239856	NaN	S	NaN	NaN	Belfast	1
896	3	0	Johnson, Mr. Alfred	male	49.0	LINE	NaN	S	NaN	NaN	NaN	1
898	3	0	Johnson, Mr. William Cahoone Jr	male	19.0	LINE	NaN	S	NaN	NaN	NaN	1
963	3	0	Leonard, Mr. Lionel	male	36.0	LINE	NaN	S	NaN	NaN	NaN	1
1254	3	1	Tornquist, Mr. William Henry	male	25.0	LINE	NaN	S	15	NaN	NaN	1

home.dest	total
New York, NY	64
London	14
Montreal, PQ	10
Paris, France	9
Cornwall / Akron, OH	9
Wiltshire, England Niagara Falls, NY	8
Winnipeg, MB	8
Philadelphia, PA	8
Belfast	7
Brooklyn, NY	7
Sweden Winnipeg, MN	7
Haverford, PA / Cooperstown, NY	5
Somerset / Bernardsville, NJ	5
Bulgaria Chicago, IL	5
Ottawa, ON	5
Rotherfield, Sussex, England Essex Co, MA	5
Sweden Worcester, MA	5
San Francisco, CA	4
Guernsey / Elizabeth, NJ	4
Washington, DC	4
Bournmouth, England	4
Haverford, PA	4
Guntur, India / Benton Harbour, MI	4
Bryn Mawr, PA	4
St Louis, MO	4
Ruotsinphytaa, Finland New York, NY	4
Chicago, IL	4
Minneapolis, MN	4
London, England	4
Ireland Chicago, IL	4
...	...
Goteborg, Sweden Huntley, IL	1
Goteborg, Sweden / Rockford, IL	1
Glen Ridge, NJ	1
Glasgow / Bangor, ME	1
Germantown, Philadelphia, PA	1
India / Pittsburgh, PA	1
Gallipolis, Ohio / ? Paris / New York	1
Frankfort, KY	1
Foresvik, Norway Portland, ND	1
Folkstone, Kent / New York, NY	1
Finland Sudbury, ON	1
Finland / Washington, DC	1
Guernsey / Montclair, NJ and/or Toledo, Ohio	1
Guernsey / Wilmington, DE	1
Guernsey, England / Edgewood, RI	1
Gunnislake, England / Butte, MT	1
Haddenfield, NJ	1
Halesworth, England	1
Hamilton, ON	1
Harrisburg, PA	1
Harrow, England	1
Harrow-on-the-Hill, Middlesex	1
Hartford, CT	1
Hartford, Huntingdonshire	1
Helsinki, Finland Ashtabula, Ohio	1
Hessle, Yorks	1
Holley, NY	1
Hornsey, England	1
Illinois, USA	1
Kontiolahti, Finland / Detroit, MI	1

pclass	sex	survived	total	survived in %
1	female	139	144	96.527778
1	male	61	179	34.078212
2	female	94	106	88.679245
2	male	25	171	14.619883
3	female	106	216	49.074074
3	male	75	493	15.212982
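
The `survived in %` column above is the number of survivors divided by the number of passengers in each class/sex group, times 100. The notebook's own pandas code is not reproduced here, so the following is only a minimal plain-Python sketch that re-derives the figures from the two count columns. The names `counts` and `survival_pct` are illustrative and do not come from the notebook.

```python
# Counts copied from the table above: (pclass, sex) -> (survived, total).
counts = {
    (1, "female"): (139, 144),
    (1, "male"): (61, 179),
    (2, "female"): (94, 106),
    (2, "male"): (25, 171),
    (3, "female"): (106, 216),
    (3, "male"): (75, 493),
}


def survival_pct(survived, total):
    """Return the share of survivors in a group, in percent."""
    return 100.0 * survived / total


# Recompute the percentage column, rounded to six decimals like the table.
percentages = {
    group: round(survival_pct(survived, total), 6)
    for group, (survived, total) in counts.items()
}

# Grand totals across all six groups.
total_survived = sum(survived for survived, _ in counts.values())
total_passengers = sum(total for _, total in counts.values())

for (pclass, sex), pct in percentages.items():
    print(pclass, sex, pct)
print("survived:", total_survived, "of", total_passengers)
```

The group totals add up to 1309, which matches the number of observations reported for the `titanic3` data frame at the top of the notebook.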

sex	age-category	pclass	survived	total
female	No age	1	11.0	11.0
		2	2.0	3.0
		3	34.0	64.0
	child	1	1.0	2.0
		2	15.0	15.0
		3	16.0	33.0
	adolescent	1	14.0	14.0
		2	11.0	12.0
		3	18.0	33.0
	adult	1	112.0	116.0
		2	66.0	76.0
		3	38.0	86.0
	senior	1	1.0	1.0
		2	0.0	0.0
		3	0.0	0.0
male	No age	1	8.0	28.0
		2	2.0	13.0
		3	16.0	144.0
	child	1	5.0	5.0
		2	11.0	12.0
		3	13.0	40.0
	adolescent	1	1.0	5.0
		2	2.0	16.0
		3	7.0	61.0
	adult	1	46.0	134.0
		2	10.0	128.0
		3	39.0	245.0
	senior	1	1.0	7.0
		2	0.0	2.0
		3	0.0	3.0
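
As a sanity check, the age-category breakdown above should sum back to the per-class, per-sex counts in the earlier `pclass`/`sex` table. The sketch below does that in plain Python, with the values copied from the two tables. It is an illustrative check, not the notebook's own code, and the names `breakdown`, `expected` and `mismatches` are assumptions introduced here.

```python
from collections import defaultdict

# (sex, age_category, pclass) -> (survived, total), copied from the table above.
breakdown = {
    ("female", "No age", 1): (11, 11), ("female", "No age", 2): (2, 3),
    ("female", "No age", 3): (34, 64),
    ("female", "child", 1): (1, 2), ("female", "child", 2): (15, 15),
    ("female", "child", 3): (16, 33),
    ("female", "adolescent", 1): (14, 14), ("female", "adolescent", 2): (11, 12),
    ("female", "adolescent", 3): (18, 33),
    ("female", "adult", 1): (112, 116), ("female", "adult", 2): (66, 76),
    ("female", "adult", 3): (38, 86),
    ("female", "senior", 1): (1, 1), ("female", "senior", 2): (0, 0),
    ("female", "senior", 3): (0, 0),
    ("male", "No age", 1): (8, 28), ("male", "No age", 2): (2, 13),
    ("male", "No age", 3): (16, 144),
    ("male", "child", 1): (5, 5), ("male", "child", 2): (11, 12),
    ("male", "child", 3): (13, 40),
    ("male", "adolescent", 1): (1, 5), ("male", "adolescent", 2): (2, 16),
    ("male", "adolescent", 3): (7, 61),
    ("male", "adult", 1): (46, 134), ("male", "adult", 2): (10, 128),
    ("male", "adult", 3): (39, 245),
    ("male", "senior", 1): (1, 7), ("male", "senior", 2): (0, 2),
    ("male", "senior", 3): (0, 3),
}

# (pclass, sex) -> (survived, total), copied from the earlier pclass/sex table.
expected = {
    (1, "female"): (139, 144), (1, "male"): (61, 179),
    (2, "female"): (94, 106), (2, "male"): (25, 171),
    (3, "female"): (106, 216), (3, "male"): (75, 493),
}

# Collapse the age categories by summing within each (pclass, sex) group.
by_group = defaultdict(lambda: [0, 0])
for (sex, _age, pclass), (survived, total) in breakdown.items():
    by_group[(pclass, sex)][0] += survived
    by_group[(pclass, sex)][1] += total

# Collect any group whose summed counts differ from the earlier table.
mismatches = [
    group for group, pair in expected.items() if tuple(by_group[group]) != pair
]
print("mismatching groups:", mismatches)
```

An empty `mismatches` list means the two tables agree, so passengers with a missing age have been kept in their own `No age` category rather than dropped.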