Titanic data exercise


In [1]:
import pandas as pd
import numpy as np
import glob # to find all files in folder
from datetime import datetime
from datetime import date, time
from dateutil.parser import parse
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_context('notebook')
pd.options.mode.chained_assignment = None  # default='warn'

In [2]:
from IPython.core.display import HTML
HTML(filename='Data/titanic.html')


Out[2]:

Data frame: titanic3

1309 observations and 14 variables, maximum # NAs: 1188

Name       Labels                             Units              Levels  Storage    NAs
pclass                                                           3       integer    0
survived   Survived                                                      double     0
name       Name                                                          character  0
sex                                                              2       integer    0
age        Age                                Year                       double     263
sibsp      Number of Siblings/Spouses Aboard                             double     0
parch      Number of Parents/Children Aboard                             double     0
ticket     Ticket Number                                                 character  0
fare       Passenger Fare                     British Pound (£)          double     1
cabin                                                            187     integer    0
embarked                                                         3       integer    2
boat                                                             28      integer    0
body       Body Identification Number                                    double     1188
home.dest  Home/Destination                                              character  0

Variable   Levels
pclass     1st, 2nd, 3rd
sex        female, male
cabin      A10, A11, A14, A16, A18, A19, A20, A21, A23, A24, A26, A29, A31, A32, A34, A36, A5, A6, A7, A9,
           B10, B101, B102, B11, B18, B19, B20, B22, B24, B26, B28, B3, B30, B35, B36, B37, B38, B39, B4,
           B41, B42, B45, B49, B5, B50, B51 B53 B55, B52 B54 B56, B57 B59 B63 B66, B58 B60, B61, B69, B71,
           B73, B77, B78, B79, B80, B82 B84, B86, B94, B96 B98,
           C101, C103, C104, C105, C106, C110, C111, C116, C118, C123, C124, C125, C126, C128, C130, C132,
           C148, C2, C22 C26, C23 C25 C27, C28, C30, C31, C32, C39, C45, C46, C47, C49, C50, C51, C52, C53,
           C54, C55 C57, C6, C62 C64, C65, C68, C7, C70, C78, C80, C82, C83, C85, C86, C87, C89, C90, C91,
           C92, C93, C95, C97, C99,
           D, D10 D12, D11, D15, D17, D19, D20, D21, D22, D26, D28, D30, D33, D34, D35, D36, D37, D38, D40,
           D43, D45, D46, D47, D48, D49, D50, D56, D6, D7, D9,
           E10, E101, E12, E121, E17, E24, E25, E31, E33, E34, E36, E38, E39 E41, E40, E44, E45, E46, E49,
           E50, E52, E58, E60, E63, E67, E68, E77, E8,
           F, F E46, F E57, F E69, F G63, F G73, F2, F33, F38, F4, G6, T
embarked   Cherbourg, Queenstown, Southampton
boat       1, 10, 11, 12, 13, 13 15, 13 15 B, 14, 15, 15 16, 16, 2, 3, 4, 5, 5 7, 5 9, 6, 7, 8, 8 10, 9,
           A, B, C, C D, D


In [3]:
original_data = pd.read_excel('Data/titanic.xls')
original_data['total'] = 1 # add a column consisting only of 1s to make counting easier
original_data.head(2)


Out[3]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest total
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2 NaN St Louis, MO 1
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 NaN Montreal, PQ / Chesterville, ON 1

1. Describe each attribute, both with basic statistics and plots. State clearly your assumptions and discuss your findings.

pclass

the class the passenger traveled in


In [4]:
pclass = original_data['pclass']
pclass.unique()


Out[4]:
array([1, 2, 3])

there are 3 different classes


In [5]:
for c in pclass.unique():
    print('nbr in class '+str(c)+': '+str(len(pclass[pclass == c])))


nbr in class 1: 323
nbr in class 2: 277
nbr in class 3: 709

most are in class 3, but surprisingly class 1 has more passengers than class 2
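
The same counts can be obtained without an explicit loop. The following is a minimal sketch on a small made-up Series (`pclass_demo` is a hypothetical stand-in for `original_data['pclass']`, not the real data), using `value_counts`, which can also return relative shares via `normalize=True`:

```python
import pandas as pd

# Hypothetical toy data standing in for original_data['pclass']
pclass_demo = pd.Series([1, 3, 3, 2, 3, 1, 3])

# Absolute number of passengers per class, ordered by class label
counts = pclass_demo.value_counts().sort_index()

# Share of each class in percent, rounded to one decimal
shares = (pclass_demo.value_counts(normalize=True).sort_index() * 100).round(1)

print(counts)
print(shares)
```

Sorting by index keeps the classes in the order 1, 2, 3 instead of ordering them by frequency.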


In [6]:
plt.hist(pclass.values)


Out[6]:
(array([ 323.,    0.,    0.,    0.,    0.,  277.,    0.,    0.,    0.,  709.]),
 array([ 1. ,  1.2,  1.4,  1.6,  1.8,  2. ,  2.2,  2.4,  2.6,  2.8,  3. ]),
 <a list of 10 Patch objects>)

survived

States whether the passenger survived the sinking of the Titanic


In [7]:
surv = original_data['survived']
surv.unique() # to make sure there are only 1 and 0


Out[7]:
array([1, 0])

In [8]:
#how many survived?
surv.sum()


Out[8]:
500

In [9]:
#how many died?
len(surv[surv == 0])


Out[9]:
809

most died :(


In [10]:
100/len(surv.values) * surv.sum()


Out[10]:
38.19709702062643

only 38% survived

name

the name of the passenger


In [11]:
name = original_data['name']
len(name.unique()) == len(name.values)


Out[11]:
False

apparently there are some with the same name


In [12]:
len(name.values) - len(name.unique())


Out[12]:
2

In [13]:
# let's find them: keep every row whose name occurs more than once
original_data[name.isin(name[name.duplicated()].values)]


Out[13]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest total
725 3 1 Connolly, Miss. Kate female 22.0 0 0 370373 7.7500 NaN Q 13 NaN Ireland 1
726 3 0 Connolly, Miss. Kate female 30.0 0 0 330972 7.6292 NaN Q NaN NaN Ireland 1
924 3 0 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q NaN 70.0 NaN 1
925 3 0 Kelly, Mr. James male 44.0 0 0 363592 8.0500 NaN S NaN NaN NaN 1
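
A shorter way to list all rows that share a name is `duplicated(keep=False)`, which marks every member of a duplicate group instead of only the later occurrences. A minimal sketch on a hypothetical three-row frame (`demo` is made up for illustration and only borrows a few names from the data):

```python
import pandas as pd

# Hypothetical toy frame; only the 'name' column matters here
demo = pd.DataFrame({
    'name': ['Kelly, Mr. James', 'Allen, Miss. Elisabeth Walton', 'Kelly, Mr. James'],
    'age': [34.5, 29.0, 44.0],
})

# keep=False flags all rows of a duplicated name, not just the repeats
dupes = demo[demo['name'].duplicated(keep=False)]
print(dupes)
```

As the real output above shows, the duplicated names belong to different people (different ages and tickets), so these rows should not be dropped as duplicates.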

sex

the sex of the passenger


In [14]:
sex = original_data['sex']
sex.unique()


Out[14]:
array(['female', 'male'], dtype=object)

In [15]:
nbr_males = len(sex[sex == 'male'])

In [16]:
nbr_females= len(sex[sex == 'female'])

In [17]:
100/len(sex) * nbr_males


Out[17]:
64.40030557677616

64.4% are male

age

How old the passenger is


In [18]:
age = original_data['age']
age.unique()


Out[18]:
array([ 29.    ,   0.9167,   2.    ,  30.    ,  25.    ,  48.    ,
        63.    ,  39.    ,  53.    ,  71.    ,  47.    ,  18.    ,
        24.    ,  26.    ,  80.    ,      nan,  50.    ,  32.    ,
        36.    ,  37.    ,  42.    ,  19.    ,  35.    ,  28.    ,
        45.    ,  40.    ,  58.    ,  22.    ,  41.    ,  44.    ,
        59.    ,  60.    ,  33.    ,  17.    ,  11.    ,  14.    ,
        49.    ,  76.    ,  46.    ,  27.    ,  64.    ,  55.    ,
        70.    ,  38.    ,  51.    ,  31.    ,   4.    ,  54.    ,
        23.    ,  43.    ,  52.    ,  16.    ,  32.5   ,  21.    ,
        15.    ,  65.    ,  28.5   ,  45.5   ,  56.    ,  13.    ,
        61.    ,  34.    ,   6.    ,  57.    ,  62.    ,  67.    ,
         1.    ,  12.    ,  20.    ,   0.8333,   8.    ,   0.6667,
         7.    ,   3.    ,  36.5   ,  18.5   ,   5.    ,  66.    ,
         9.    ,   0.75  ,  70.5   ,  22.5   ,   0.3333,   0.1667,
        40.5   ,  10.    ,  23.5   ,  34.5   ,  20.5   ,  30.5   ,
        55.5   ,  38.5   ,  14.5   ,  24.5   ,  60.5   ,  74.    ,
         0.4167,  11.5   ,  26.5   ])

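There are NaN values! There are also fractional values, which is somewhat unusual for ages but not a problem per se: the values below 1 look like infant ages given in months (0.1667 ≈ 2/12, 0.9167 ≈ 11/12).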
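
To make both observations concrete, here is a small sketch on made-up values (`age_demo` is hypothetical). It counts the missing ages and converts the ages below one year into months; reading these fractions as months/12 is an assumption based on the values seen above, not something the data set states here:

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages, including a missing value and two infants
age_demo = pd.Series([29.0, 0.9167, np.nan, 0.1667, 32.5])

# Number of missing ages
n_missing = int(age_demo.isnull().sum())

# Ages below one year, converted to whole months (assumed encoding: months / 12)
infants = age_demo[age_demo < 1]
months = (infants * 12).round().astype(int)

print(n_missing)
print(months)
```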


In [19]:
age.min() # a baby?


Out[19]:
0.16669999999999999

In [20]:
age.max()


Out[20]:
80.0

In [21]:
age.mean()


Out[21]:
29.8811345124283
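
The minimum, maximum and mean computed in the three cells above can also be read off a single call to `describe()`, which skips NaN values. A minimal sketch on a hypothetical Series (`age_demo` is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages with one missing value
age_demo = pd.Series([2.0, 30.0, np.nan, 40.0, 48.0])

# count, mean, std, min, quartiles and max in one go; NaN is ignored
summary = age_demo.describe()
print(summary)
```

The `count` entry reports only the non-missing values, so it doubles as a check on how many ages are known.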

Age distribution in a boxplot:


In [22]:
sns.boxplot(age.dropna().values)


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb95b4907b8>

And the distribution of age plotted:


In [23]:
plt.hist(age.dropna().values)  # drop the missing ages before plotting

sibsp

The number of siblings or spouses on the ship


In [24]:
sipsp = original_data['sibsp']
sipsp.unique()


Out[24]:
array([0, 1, 2, 3, 4, 5, 8])

In [25]:
sipsp.mean()


Out[25]:
0.4988540870893812

Plot histogram: Most passengers (891 of 1309) traveled without siblings or spouses. There is apparently one large family that traveled together (nine passengers each have 8 siblings on board).


In [26]:
plt.hist(sipsp)


Out[26]:
(array([ 891.,  319.,   42.,   20.,    0.,   22.,    6.,    0.,    0.,    9.]),
 array([ 0. ,  0.8,  1.6,  2.4,  3.2,  4. ,  4.8,  5.6,  6.4,  7.2,  8. ]),
 <a list of 10 Patch objects>)

parch

The number of parents or children on the ship


In [27]:
parch = original_data['parch']
parch.unique()


Out[27]:
array([0, 2, 1, 4, 3, 5, 6, 9])

In [28]:
parch.mean()


Out[28]:
0.3850267379679144

Histogram: Again, most passengers (1002 of 1309) traveled without parents or children. The one big family shows up here as well.


In [29]:
plt.hist(parch)


Out[29]:
(array([ 1002.,   170.,   113.,     8.,     6.,     6.,     2.,     0.,
            0.,     2.]),
 array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ]),
 <a list of 10 Patch objects>)

Let's find the family


In [30]:
# the kids
original_data[original_data['sibsp'] == 8]


Out[30]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest total
1170 3 0 Sage, Master. Thomas Henry male NaN 8 2 CA. 2343 69.55 NaN S NaN NaN NaN 1
1171 3 0 Sage, Master. William Henry male 14.5 8 2 CA. 2343 69.55 NaN S NaN 67.0 NaN 1
1172 3 0 Sage, Miss. Ada female NaN 8 2 CA. 2343 69.55 NaN S NaN NaN NaN 1
1173 3 0 Sage, Miss. Constance Gladys female NaN 8 2 CA. 2343 69.55 NaN S NaN NaN NaN 1
1174 3 0 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.55 NaN S NaN NaN NaN 1
1175 3 0 Sage, Miss. Stella Anna female NaN 8 2 CA. 2343 69.55 NaN S NaN NaN NaN 1
1176 3 0 Sage, Mr. Douglas Bullen male NaN 8 2 CA. 2343 69.55 NaN S NaN NaN NaN 1
1177 3 0 Sage, Mr. Frederick male NaN 8 2 CA. 2343 69.55 NaN S NaN NaN NaN 1
1178 3 0 Sage, Mr. George John Jr male NaN 8 2 CA. 2343 69.55 NaN S NaN NaN NaN 1

In [31]:
#  the parents
original_data[original_data['parch'] == 9]


Out[31]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest total
1179 3 0 Sage, Mr. John George male NaN 1 9 CA. 2343 69.55 NaN S NaN NaN NaN 1
1180 3 0 Sage, Mrs. John (Annie Bullen) female NaN 1 9 CA. 2343 69.55 NaN S NaN NaN NaN 1

These are the children and the parents of the 'big' family. Sadly, all of them died :(

ticket

the ticket number of the passenger


In [32]:
ticket = original_data['ticket']
len(ticket.unique())


Out[32]:
939

In [33]:
ticket.dtype


Out[33]:
dtype('O')

In [34]:
len(ticket[ticket.isnull()])


Out[34]:
0

All (registered) passengers had a ticket ;)
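
Since there are only 939 distinct tickets for 1309 passengers, some tickets must be shared (the Sage family above, for example, all travel on ticket CA. 2343). A minimal sketch of how shared tickets could be found, on a hypothetical three-row frame (`demo` and the `group_size` column are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with two passengers on the same ticket
demo = pd.DataFrame({
    'name': ['Sage, Mr. Frederick', 'Sage, Miss. Ada', 'Storey, Mr. Thomas'],
    'ticket': ['CA. 2343', 'CA. 2343', '3701'],
})

# For every row, the number of passengers travelling on the same ticket
demo['group_size'] = demo.groupby('ticket')['name'].transform('count')

# Passengers who share their ticket with at least one other person
shared = demo[demo['group_size'] > 1]
print(shared)
```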

fare

How much they paid


In [35]:
fare = original_data['fare']
fare.mean()


Out[35]:
33.29547928134572

In [36]:
fare.max()


Out[36]:
512.32920000000001

In [37]:
fare.min()


Out[37]:
0.0

There are people that did not pay anything


In [38]:
original_data[fare == 0]


Out[38]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest total
7 1 0 Andrews, Mr. Thomas Jr male 39.0 0 0 112050 0.0 A36 S NaN NaN Belfast, NI 1
70 1 0 Chisholm, Mr. Roderick Robert Crispin male NaN 0 0 112051 0.0 NaN S NaN NaN Liverpool, England / Belfast 1
125 1 0 Fry, Mr. Richard male NaN 0 0 112058 0.0 B102 S NaN NaN NaN 1
150 1 0 Harrison, Mr. William male 40.0 0 0 112059 0.0 B94 S NaN 110.0 NaN 1
170 1 1 Ismay, Mr. Joseph Bruce male 49.0 0 0 112058 0.0 B52 B54 B56 S C NaN Liverpool 1
223 1 0 Parr, Mr. William Henry Marsh male NaN 0 0 112052 0.0 NaN S NaN NaN Belfast 1
234 1 0 Reuchlin, Jonkheer. John George male 38.0 0 0 19972 0.0 NaN S NaN NaN Rotterdam, Netherlands 1
363 2 0 Campbell, Mr. William male NaN 0 0 239853 0.0 NaN S NaN NaN Belfast 1
384 2 0 Cunningham, Mr. Alfred Fleming male NaN 0 0 239853 0.0 NaN S NaN NaN Belfast 1
410 2 0 Frost, Mr. Anthony Wood "Archie" male NaN 0 0 239854 0.0 NaN S NaN NaN Belfast 1
473 2 0 Knight, Mr. Robert J male NaN 0 0 239855 0.0 NaN S NaN NaN Belfast 1
528 2 0 Parkes, Mr. Francis "Frank" male NaN 0 0 239853 0.0 NaN S NaN NaN Belfast 1
581 2 0 Watson, Mr. Ennis Hastings male NaN 0 0 239856 0.0 NaN S NaN NaN Belfast 1
896 3 0 Johnson, Mr. Alfred male 49.0 0 0 LINE 0.0 NaN S NaN NaN NaN 1
898 3 0 Johnson, Mr. William Cahoone Jr male 19.0 0 0 LINE 0.0 NaN S NaN NaN NaN 1
963 3 0 Leonard, Mr. Lionel male 36.0 0 0 LINE 0.0 NaN S NaN NaN NaN 1
1254 3 1 Tornquist, Mr. William Henry male 25.0 0 0 LINE 0.0 NaN S 15 NaN NaN 1

In [39]:
fare.dtypes


Out[39]:
dtype('float64')

In [40]:
original_data[fare.isnull()]


Out[40]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest total
1225 3 0 Storey, Mr. Thomas male 60.5 0 0 3701 NaN NaN S NaN 261.0 NaN 1

there is one NaN value


In [41]:
plt.hist(fare.dropna())


Out[41]:
(array([ 1070.,   154.,    42.,     4.,    21.,    13.,     0.,     0.,
            0.,     4.]),
 array([   0.     ,   51.23292,  102.46584,  153.69876,  204.93168,
         256.1646 ,  307.39752,  358.63044,  409.86336,  461.09628,
         512.3292 ]),
 <a list of 10 Patch objects>)

Someone got ripped off, or got the best room.

cabin

What cabin they are in


In [42]:
cabin = original_data['cabin']
cabin.isnull().sum()


Out[42]:
1014

1014 people have no cabin (all class 3?)


In [43]:
plt.hist(original_data[cabin.isnull()]['pclass'])


Out[43]:
(array([  67.,    0.,    0.,    0.,    0.,  254.,    0.,    0.,    0.,  693.]),
 array([ 1. ,  1.2,  1.4,  1.6,  1.8,  2. ,  2.2,  2.4,  2.6,  2.8,  3. ]),
 <a list of 10 Patch objects>)

Even people in class 1 (67 of them) have no cabin (or it is unknown)

Some people have several cabins, but those are also occupied by several people, probably families. It would be quite complicated to take those 'multiple cabin' entries apart. With more time we could have done it.
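
As a starting point for taking those entries apart, the string methods of pandas can split on whitespace. This is only a sketch on hypothetical values (`cabin_demo` is made up); it assumes that every whitespace-separated token is a cabin, which is questionable for entries such as 'F G63':

```python
import numpy as np
import pandas as pd

# Hypothetical toy cabin column with a missing value and multi-cabin entries
cabin_demo = pd.Series(['B5', 'C22 C26', np.nan, 'F G63'])

# Split each entry on whitespace; missing entries stay NaN
cabin_lists = cabin_demo.str.split()

# Number of tokens per entry (assumption: one token = one cabin)
n_cabins = cabin_lists.str.len()

# First character of the entry, used here as a rough deck letter
deck = cabin_demo.str[0]

print(n_cabins)
print(deck)
```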


In [44]:
cabin.head()


Out[44]:
0         B5
1    C22 C26
2    C22 C26
3    C22 C26
4    C22 C26
Name: cabin, dtype: object

embarked


In [45]:
embarked = original_data['embarked']
embarked.unique()


Out[45]:
array(['S', 'C', nan, 'Q'], dtype=object)

In [46]:
len(embarked[embarked.isnull()])


Out[46]:
2

two people have NaN in 'embarked'


In [47]:
sns.countplot(y="embarked", data=original_data, color="c");


boat

The lifeboat in which the passenger was rescued


In [48]:
boat = original_data['boat']
boat.unique()


Out[48]:
array([2, '11', nan, '3', '10', 'D', '4', '9', '6', 'B', '8', 'A', '5',
       '7', 'C', '14', '2', '5 9', '13', '1', '15', '5 7', '8 10', '12',
       '16', '13 15 B', 'C D', '15 16', '13 15'], dtype=object)

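Some passengers are listed with several boats. Note also the mixed types: the first value is the integer 2, while all other values are strings (including '2').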
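
Because an integer 2 and a string '2' count as different values, the boat labels would need to be normalized before counting passengers per boat. A minimal sketch on hypothetical values (`boat_demo` is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy boat column mixing an integer, strings and a missing value
boat_demo = pd.Series([2, '11', np.nan, '2', '13 15 B'], dtype=object)

# Convert every non-missing label to a string, leave NaN untouched
boat_clean = boat_demo.map(lambda v: v if pd.isnull(v) else str(v))

print(boat_demo.nunique(), boat_clean.nunique())
```

After the conversion the integer 2 and the string '2' collapse into one label, so the number of distinct boats drops by one in this toy example.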

body

the identification number of a body


In [49]:
body = original_data['body']
body.count()


Out[49]:
121

121 bodies were assigned an identification number

home.dest


In [50]:
homedest = original_data['home.dest']
len(homedest.dropna().unique())


Out[50]:
369

There are 369 different home/destination values. Let's find the most common one


In [51]:
original_data[['home.dest', 'total']].groupby(by='home.dest').sum().sort_values(by='total', ascending=False)


Out[51]:
total
home.dest
New York, NY 64
London 14
Montreal, PQ 10
Paris, France 9
Cornwall / Akron, OH 9
Wiltshire, England Niagara Falls, NY 8
Winnipeg, MB 8
Philadelphia, PA 8
Belfast 7
Brooklyn, NY 7
Sweden Winnipeg, MN 7
Haverford, PA / Cooperstown, NY 5
Somerset / Bernardsville, NJ 5
Bulgaria Chicago, IL 5
Ottawa, ON 5
Rotherfield, Sussex, England Essex Co, MA 5
Sweden Worcester, MA 5
San Francisco, CA 4
Guernsey / Elizabeth, NJ 4
Washington, DC 4
Bournmouth, England 4
Haverford, PA 4
Guntur, India / Benton Harbour, MI 4
Bryn Mawr, PA 4
St Louis, MO 4
Ruotsinphytaa, Finland New York, NY 4
Chicago, IL 4
Minneapolis, MN 4
London, England 4
Ireland Chicago, IL 4
... ...
Goteborg, Sweden Huntley, IL 1
Goteborg, Sweden / Rockford, IL 1
Glen Ridge, NJ 1
Glasgow / Bangor, ME 1
Germantown, Philadelphia, PA 1
India / Pittsburgh, PA 1
Gallipolis, Ohio / ? Paris / New York 1
Frankfort, KY 1
Foresvik, Norway Portland, ND 1
Folkstone, Kent / New York, NY 1
Finland Sudbury, ON 1
Finland / Washington, DC 1
Guernsey / Montclair, NJ and/or Toledo, Ohio 1
Guernsey / Wilmington, DE 1
Guernsey, England / Edgewood, RI 1
Gunnislake, England / Butte, MT 1
Haddenfield, NJ 1
Halesworth, England 1
Hamilton, ON 1
Harrisburg, PA 1
Harrow, England 1
Harrow-on-the-Hill, Middlesex 1
Hartford, CT 1
Hartford, Huntingdonshire 1
Helsinki, Finland Ashtabula, Ohio 1
Hessle, Yorks 1
Holley, NY 1
Hornsey, England 1
Illinois, USA 1
Kontiolahti, Finland / Detroit, MI 1

369 rows × 1 columns

The most common value is New York, NY (64 passengers)

2. Use the groupby method to calculate the proportion of passengers that survived by sex:

First gather the numbers


In [52]:
survived_by_sex = original_data[['survived', 'sex']].groupby('sex').sum()
nbr_males = len(original_data[original_data['sex'] == 'male'])
nbr_females = len(original_data[original_data['sex'] == 'female'])
nbr_total = len(original_data['sex'])
survived_by_sex


Out[52]:
survived
sex
female 339
male 161

In [53]:
print(nbr_total == nbr_females + nbr_males) # to check if consistent


True

Then calculate the percentages


In [54]:
female_survived_percentage = (100/nbr_females) * survived_by_sex.at['female', 'survived']
male_survived_percentage = (100/nbr_males) * survived_by_sex.at['male', 'survived']
print('female surv: '+str(round(female_survived_percentage, 3))+'%')
print('male surv: '+str(round(male_survived_percentage, 3))+'%')


female surv: 72.747%
male surv: 19.098%
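
Because `survived` only contains 0 and 1, the proportion of survivors in a group is simply the mean of that column, so the same percentages can be computed in one step. A minimal sketch on a hypothetical six-row frame (`demo` is made up, its numbers are not the Titanic figures):

```python
import pandas as pd

# Hypothetical toy data: 2 women (both survived), 4 men (one survived)
demo = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male', 'male'],
    'survived': [1, 1, 0, 1, 0, 0],
})

# Mean of a 0/1 column = share of ones; times 100 gives percent
rate_by_sex = demo.groupby('sex')['survived'].mean() * 100
print(rate_by_sex)
```

This avoids counting males and females separately and cannot get out of sync with the grouping.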

3. Calculate the same proportion, but by class and sex.


In [55]:
# make use of the 'total' column (which is all 1's in the original_data)
survived_by_class = original_data[['pclass', 'sex', 'survived', 'total']].groupby(['pclass', 'sex']).sum()
survived_by_class


Out[55]:
survived total
pclass sex
1 female 139 144
male 61 179
2 female 94 106
male 25 171
3 female 106 216
male 75 493

In [56]:
def combine_surv_total(row):
    #print(row)
    return 100.0/row.total * row.survived

create a new column with the apply method


In [57]:
survived_by_class['survived in %'] = survived_by_class.apply(combine_surv_total, axis=1)
survived_by_class


Out[57]:
survived total survived in %
pclass sex
1 female 139 144 96.527778
male 61 179 34.078212
2 female 94 106 88.679245
male 25 171 14.619883
3 female 106 216 49.074074
male 75 493 15.212982
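
The row-wise `apply` is not strictly needed here, since pandas can divide whole columns at once. The sketch below rebuilds the table from the counts shown in Out[55] (the name `table` is hypothetical) and computes the same percentage column with plain column arithmetic:

```python
import pandas as pd

# Index and counts copied from the grouped table above
index = pd.MultiIndex.from_product(
    [[1, 2, 3], ['female', 'male']], names=['pclass', 'sex']
)
table = pd.DataFrame(
    {'survived': [139, 61, 94, 25, 106, 75],
     'total': [144, 179, 106, 171, 216, 493]},
    index=index,
)

# Column-wise division instead of a Python function applied to each row
table['survived in %'] = 100.0 * table['survived'] / table['total']
print(table)
```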

Here is a plot showing the survival rates. Note that the plot is not based on the table calculated above: seaborn's barplot computes the mean of 'survived' per group directly from original_data.


In [67]:
type(original_data['sex'])


Out[67]:
pandas.core.series.Series

In [58]:
sns.barplot(x='sex', y='survived', hue='pclass', data=original_data);


We can see that 'women first' is true, but also 'class 1 first'

4. Create age categories: children (under 14 years), adolescents (14-20), adult (21-64), and senior(65+), and calculate survival proportions by age category, class and sex.

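Create the categories. We use the value -1 to show that the person has a NaN value as age (and put them in the category 'No age'). Note that the bins below are closed on the right, so a passenger aged exactly 14 is counted as 'child', whereas the task statement counts 14-year-olds as adolescents.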


In [59]:
original_data['age'] = original_data['age'].fillna(-1)  # assign back instead of filling the column in place
age_cats = pd.cut(original_data.age, [-2, 0+1e-6,14+1e-6,20+1e-6,64+1e-6,120], labels=['No age', 'child','adolescent','adult','senior'], include_lowest=True)
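
An alternative that follows the boundaries of the task statement literally is to use left-closed bins (`right=False`) and to label the missing ages after cutting, which avoids both the -1 sentinel and the small epsilon offsets. This is only a sketch on hypothetical ages (`age_demo` is made up), and the upper edge of 121 is an arbitrary assumption:

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages around the category boundaries
age_demo = pd.Series([np.nan, 0.5, 13.0, 14.0, 20.0, 21.0, 64.0, 65.0])

# Left-closed intervals: [0, 14), [14, 21), [21, 65), [65, 121)
bins = [0, 14, 21, 65, 121]
labels = ['child', 'adolescent', 'adult', 'senior']
cats = pd.cut(age_demo, bins=bins, labels=labels, right=False)

# NaN ages fall into no bin; give them their own category afterwards
cats = cats.cat.add_categories(['No age']).fillna('No age')
print(cats)
```

With these edges fractional ages such as 20.5 count as adolescent, which is one possible reading of "14-20"; the cell above assigns them to 'adult' instead.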

In [60]:
original_data['age-category'] = age_cats

In [61]:
catsdata = original_data[['sex', 'age-category', 'pclass', 'survived', 'total']]

Then group the data in a sensible way to get the nice Table below.


In [62]:
grouped = catsdata.groupby(['sex', 'age-category', 'pclass']).sum().fillna(0)
grouped


Out[62]:
survived total
sex age-category pclass
female No age 1 11.0 11.0
2 2.0 3.0
3 34.0 64.0
child 1 1.0 2.0
2 15.0 15.0
3 16.0 33.0
adolescent 1 14.0 14.0
2 11.0 12.0
3 18.0 33.0
adult 1 112.0 116.0
2 66.0 76.0
3 38.0 86.0
senior 1 1.0 1.0
2 0.0 0.0
3 0.0 0.0
male No age 1 8.0 28.0
2 2.0 13.0
3 16.0 144.0
child 1 5.0 5.0
2 11.0 12.0
3 13.0 40.0
adolescent 1 1.0 5.0
2 2.0 16.0
3 7.0 61.0
adult 1 46.0 134.0
2 10.0 128.0
3 39.0 245.0
senior 1 1.0 7.0
2 0.0 2.0
3 0.0 3.0

And finally calculate the survival proportion for all cases


In [63]:
def surv_proportions(row):
    if row.total == 0:
        return np.nan
    return round(100.0/row.total * row.survived, 2)

grouped['survive-portion (%)'] = grouped.apply(surv_proportions, axis=1)

In [64]:
grouped


Out[64]:
survived total survive-portion (%)
sex age-category pclass
female No age 1 11.0 11.0 100.00
2 2.0 3.0 66.67
3 34.0 64.0 53.12
child 1 1.0 2.0 50.00
2 15.0 15.0 100.00
3 16.0 33.0 48.48
adolescent 1 14.0 14.0 100.00
2 11.0 12.0 91.67
3 18.0 33.0 54.55
adult 1 112.0 116.0 96.55
2 66.0 76.0 86.84
3 38.0 86.0 44.19
senior 1 1.0 1.0 100.00
2 0.0 0.0 NaN
3 0.0 0.0 NaN
male No age 1 8.0 28.0 28.57
2 2.0 13.0 15.38
3 16.0 144.0 11.11
child 1 5.0 5.0 100.00
2 11.0 12.0 91.67
3 13.0 40.0 32.50
adolescent 1 1.0 5.0 20.00
2 2.0 16.0 12.50
3 7.0 61.0 11.48
adult 1 46.0 134.0 34.33
2 10.0 128.0 7.81
3 39.0 245.0 15.92
senior 1 1.0 7.0 14.29
2 0.0 2.0 0.00
3 0.0 3.0 0.00

Plots

Two plots showing this: the first shows the female passengers and the second the male passengers


In [65]:
sns.barplot(x="pclass", y="survived", hue="age-category", data=original_data[original_data['sex'] == 'female'])


Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb9584cbe48>

Almost all women from class 1 and 2 survived, in class 3 about 50% survived


In [66]:
sns.barplot(x="pclass", y="survived", hue="age-category", data=original_data[original_data['sex'] == 'male'])


Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb958387cf8>

It is interesting to see that few men survived, except children (and among those mostly the ones in class 1 and 2). So 'children before adults' was certainly a thing.

