Titanic Data Set - Statistics Review

Variable Descriptions.

Describe the data.

  • How big?
  • What are the columns and what do they mean?

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import random
%matplotlib inline

In [2]:
titanic = pd.read_csv('titanic.csv')

In [3]:
titanic.shape


Out[3]:
(891, 12)

In [4]:
titanic.dtypes


Out[4]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [5]:
titanic.describe(include='all')


Out[5]:
        PassengerId    Survived      Pclass                                          Name   Sex         Age
count    891.000000  891.000000  891.000000                                           891   891  714.000000
unique          NaN         NaN         NaN                                           891     2         NaN
top             NaN         NaN         NaN  Jerwan, Mrs. Amin S (Marie Marthe Thuillard)  male         NaN
freq            NaN         NaN         NaN                                             1   577         NaN
mean     446.000000    0.383838    2.308642                                           NaN   NaN   29.699118
std      257.353842    0.486592    0.836071                                           NaN   NaN   14.526497
min        1.000000    0.000000    1.000000                                           NaN   NaN    0.420000
25%      223.500000    0.000000    2.000000                                           NaN   NaN   20.125000
50%      446.000000    0.000000    3.000000                                           NaN   NaN   28.000000
75%      668.500000    1.000000    3.000000                                           NaN   NaN   38.000000
max      891.000000    1.000000    3.000000                                           NaN   NaN   80.000000

             SibSp       Parch    Ticket        Fare  Cabin  Embarked
count   891.000000  891.000000       891  891.000000    204       889
unique         NaN         NaN       681         NaN    147         3
top            NaN         NaN  CA. 2343         NaN     G6         S
freq           NaN         NaN         7         NaN      4       644
mean      0.523008    0.381594       NaN   32.204208    NaN       NaN
std       1.102743    0.806057       NaN   49.693429    NaN       NaN
min       0.000000    0.000000       NaN    0.000000    NaN       NaN
25%       0.000000    0.000000       NaN    7.910400    NaN       NaN
50%       0.000000    0.000000       NaN   14.454200    NaN       NaN
75%       1.000000    0.000000       NaN   31.000000    NaN       NaN
max       8.000000    6.000000       NaN  512.329200    NaN       NaN

  1. PassengerId: unique identifier of the passenger
  2. Survived: 1 if yes, 0 if no
  3. Pclass: passenger class, 1 through 3
  4. Name: passenger name
  5. Sex: passenger sex, male or female
  6. Age: passenger age in years (missing for 177 of the 891 passengers)
  7. SibSp: number of siblings/spouses on board
  8. Parch: number of parents/children on board
  9. Ticket: ticket number
  10. Fare: cost of the ticket
  11. Cabin: cabin number
  12. Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

What’s the average age of:

  • Any Titanic passenger
  • A survivor
  • A non-surviving first-class passenger
  • Male survivors older than 30 from anywhere but Queenstown

In [6]:
print('Age average:', titanic['Age'].mean())
print('Survivor age average:', titanic.where(titanic['Survived'] == 1)['Age'].mean())
print('Non-surviving first-class age average:', titanic.where((titanic['Survived'] == 0) & (titanic['Pclass'] == 1))['Age'].mean())
print('Male survivors older than 30 not from Queenstown age average:', titanic.where((titanic['Sex'] == 'male') & (titanic['Survived'] == 1) & (titanic['Age'] > 30) & (titanic['Embarked'] != 'Q'))['Age'].mean())


Age average: 29.69911764705882
Survivor age average: 28.343689655172415
Non-surviving first-class age average: 43.6953125
Male survivors older than 30 not from Queenstown age average: 41.48780487804878
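
The same filters can also be written with plain boolean indexing, which drops the non-matching rows instead of filling them with NaN as `where` does. A minimal sketch on a small made-up frame; the `toy` data and the `mean_age` helper are illustrative assumptions, not part of the data set:

```python
import pandas as pd

# Hypothetical miniature frame reusing the Titanic column names
toy = pd.DataFrame({
    'Survived': [1, 0, 1, 0, 1],
    'Pclass':   [1, 1, 3, 1, 2],
    'Age':      [30.0, 50.0, None, 40.0, 20.0],
})

def mean_age(df, mask):
    """Mean age of the rows selected by a boolean mask (missing ages are skipped)."""
    return df.loc[mask, 'Age'].mean()

survivor_age = mean_age(toy, toy['Survived'] == 1)
lost_first_class_age = mean_age(toy, (toy['Survived'] == 0) & (toy['Pclass'] == 1))
print(survivor_age, lost_first_class_age)
```

Both forms give the same mean, because `mean()` ignores NaN either way; the difference is only in how many rows the intermediate frame carries.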

For the groups from the previous task, how far (in years) are the average ages from the median ages?


In [7]:
print(titanic['Age'].mean() - titanic['Age'].median())
print(titanic.where(titanic['Survived'] == 1)['Age'].mean() - titanic.where(titanic['Survived'] == 1)['Age'].median())
print(titanic.where((titanic['Survived'] == 0) & (titanic['Pclass'] == 1))['Age'].mean() - titanic.where((titanic['Survived'] == 0) & (titanic['Pclass'] == 1))['Age'].median())
print(titanic.where((titanic['Sex'] == 'male') & (titanic['Survived'] == 1) & (titanic['Age'] > 30) & (titanic['Embarked'] != 'Q'))['Age'].mean() - titanic.where((titanic['Sex'] == 'male') & (titanic['Survived'] == 1) & (titanic['Age'] > 30) & (titanic['Embarked'] != 'Q'))['Age'].median())


1.69911764705882
0.34368965517241534
-1.5546875
3.4878048780487774
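
Each line above repeats its filter twice, once for the mean and once for the median. One way to avoid the repetition, sketched here on made-up data, is to keep the masks in a dictionary and apply a single helper; `groups` and `mean_minus_median` are hypothetical names:

```python
import pandas as pd

# Hypothetical miniature frame reusing the Titanic column names
toy = pd.DataFrame({
    'Survived': [1, 1, 1, 0, 0],
    'Age':      [10.0, 20.0, 60.0, 30.0, 50.0],
})

def mean_minus_median(ages):
    """Signed gap in years between the mean and the median of a Series."""
    return ages.mean() - ages.median()

# One boolean mask per group of interest
groups = {
    'all': pd.Series(True, index=toy.index),
    'survivors': toy['Survived'] == 1,
}

gaps = {name: mean_minus_median(toy.loc[mask, 'Age']) for name, mask in groups.items()}
print(gaps)
```

A positive gap means the mean sits above the median, which usually points to a tail of older passengers in that group.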

What’s the most common:

  • Passenger class
  • Port of Embarkation
  • Number of siblings or spouses aboard for survivors

In [8]:
print('Most common passenger class:', titanic['Pclass'].mode()[0])
print('Most common port of embarkation:', titanic['Embarked'].mode()[0])
print('Most common number of siblings/spouses for survivors:', titanic[titanic['Survived'] == 1]['SibSp'].mode()[0])


Most common passenger class: 3
Most common port of embarkation: S
Most common number of siblings/spouses for survivors: 0

Within what range of standard deviations from the mean (0-1, 1-2, 2-3) is the median ticket price? Is it above or below the mean?

It's between 0 and 1 standard deviations and below the mean:


In [9]:
print((titanic['Fare'].mean() - titanic['Fare'].median()) / titanic['Fare'].std())
print(titanic['Fare'].mean() > titanic['Fare'].median())


0.3571902456652297
True
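
The first printed value is the gap between the mean and the median fare, measured in standard deviations. As a check, plugging in the rounded values from the summary table above (mean 32.204208, median 14.4542, standard deviation 49.693429):

```latex
z = \frac{\bar{x} - \tilde{x}}{s}
  = \frac{32.204208 - 14.4542}{49.693429}
  = \frac{17.750008}{49.693429}
  \approx 0.357
```

Here $\bar{x}$ is the mean, $\tilde{x}$ the median and $s$ the sample standard deviation. Since $0 < z < 1$, the median lies less than one standard deviation below the mean.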

How much more expensive was the 90th percentile ticket than the 5th percentile ticket? Are they the same class?


In [10]:
perc5 = titanic['Fare'].quantile(0.05)
perc90 = titanic['Fare'].quantile(0.9)

print('5th percentile:', perc5)
print('Class of the 5th percentile:', titanic[titanic['Fare'] == perc5]['Pclass'].unique()[0])
print('90th percentile:', perc90)
print('Class of the 90th percentile:', titanic[titanic['Fare'] == perc90]['Pclass'].unique()[0])


5th percentile: 7.225
Class of the 5th percentile: 3
90th percentile: 77.9583
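Class of the 90th percentile: 1

The 90th percentile ticket cost about 70.73 more than the 5th percentile ticket (77.9583 - 7.225), and the two fares belong to different classes (first versus third).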
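
The lookup above relies on the interpolated quantile being exactly equal to a fare that occurs in the data, which is not guaranteed in general. A more defensive sketch, on made-up data, picks the rows whose fare is closest to the quantile; `class_at_quantile` is a hypothetical helper:

```python
import pandas as pd

def class_at_quantile(df, q):
    """Return the q-th fare quantile and the classes of the rows closest to it."""
    fare_q = df['Fare'].quantile(q)
    distance = (df['Fare'] - fare_q).abs()
    nearest = df.loc[distance == distance.min(), 'Pclass']
    return fare_q, sorted(nearest.unique())

# Hypothetical miniature frame reusing the Titanic column names
toy = pd.DataFrame({
    'Fare':   [5.0, 7.0, 8.0, 30.0, 80.0],
    'Pclass': [3, 3, 3, 2, 1],
})

fare_q, classes = class_at_quantile(toy, 0.9)
print(fare_q, classes)
```

In this toy frame the 90th percentile falls between two observed fares, so an exact equality test would match no row at all, while the nearest-fare lookup still returns a class.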

The highest average ticket price was paid by passengers from which port? Null ports don’t count.


In [11]:
# idxmax returns the index label (the port); rows with a null port are dropped by groupby
titanic.groupby('Embarked')['Fare'].mean().idxmax()


Out[11]:
'C'

What is the most common passenger class for each port?


In [12]:
for port in titanic['Embarked'].dropna().unique():
    print('Most common class for {}: {}'.format(port, titanic.where(titanic['Embarked'] == port)['Pclass'].mode()[0]))


Most common class for S: 3.0
Most common class for C: 1.0
Most common class for Q: 3.0
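
The loop can also be expressed as a single `groupby`, which keeps the class as an integer and skips null ports automatically. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical miniature frame reusing the Titanic column names
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'C', 'C', 'Q', None],
    'Pclass':   [3, 3, 1, 1, 1, 3, 2],
})

# mode() can return several values on ties; iloc[0] takes the smallest of them
most_common = toy.groupby('Embarked')['Pclass'].agg(lambda s: s.mode().iloc[0])
print(most_common)
```

The row with a missing port does not appear in the result, because `groupby` drops NaN keys by default.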

What fraction of surviving 1st-class males paid lower than double the overall median ticket price?


In [13]:
titanic.where((titanic['Survived'] == 1) &
              (titanic['Sex'] == 'male') &
              (titanic['Pclass'] == 1) &
              (titanic['Fare'] < 2 * titanic['Fare'].median())
             )['PassengerId'].count() / titanic.where((titanic['Survived'] == 1) &
                                                      (titanic['Sex'] == 'male') &
                                                      (titanic['Pclass'] == 1))['PassengerId'].count()


Out[13]:
0.24444444444444444
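
A fraction like this is the mean of a boolean column, so the group only has to be selected once. A minimal sketch on made-up data; the variable names are illustrative:

```python
import pandas as pd

# Hypothetical miniature frame reusing the Titanic column names
toy = pd.DataFrame({
    'Survived': [1, 1, 1, 1, 0],
    'Sex':      ['male', 'male', 'male', 'male', 'male'],
    'Pclass':   [1, 1, 1, 1, 1],
    'Fare':     [10.0, 20.0, 100.0, 200.0, 5.0],
})

# Select the group once, then average a boolean condition over it
group = toy[(toy['Survived'] == 1) & (toy['Sex'] == 'male') & (toy['Pclass'] == 1)]
threshold = 2 * toy['Fare'].median()   # the median is taken over all passengers
fraction = (group['Fare'] < threshold).mean()
print(fraction)
```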

How much older/younger was the average surviving passenger with family members than the average non-surviving passenger without them?


In [14]:
print('Survivor with family members average age:', titanic.where((titanic['Survived'] == 1) & (titanic['SibSp'] + titanic['Parch'] > 0))['Age'].mean())
print('Non-survivor without family members average age:', titanic.where((titanic['Survived'] == 0) & (titanic['SibSp'] + titanic['Parch'] == 0))['Age'].mean())


Survivor with family members average age: 25.526062500000002
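Non-survivor without family members average age: 32.41423357664234

The average survivor with family members was therefore about 6.9 years younger than the average non-survivor travelling alone.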

Display the relationship (i.e. make a plot) between survival rate and the quantile of the ticket price for 20 integer quantiles.

  • To be clearer, what I want is for you to specify 20 quantiles, and for each of those quantiles divide the number of survivors in that quantile by the total number of people in that quantile. That’ll give you the survival rate in that quantile.
  • Then plot a line of the survival rate against the ticket fare quantiles.
  • Make sure you label your axes.

In [36]:
import math

# Sort df by fare and reset index from 0
titanic_sortbyfare = titanic.sort_values('Fare').reset_index()
# Assign each row to one of 20 bins by its position in the fare ordering
# (45 rows per bin; the last bin is smaller because 891 is not a multiple of 20)
bin_size = math.ceil(len(titanic_sortbyfare) / 20)
titanic_sortbyfare['FareQuantile'] = titanic_sortbyfare.index.values // bin_size
# Calculate survival rate
titanic_fareq = titanic_sortbyfare.groupby('FareQuantile')['Survived'].apply(lambda x : x.sum() / x.count())

In [37]:
# Note: on Matplotlib 3.6 and later this style is named 'seaborn-v0_8'
with plt.style.context('seaborn'):
    fig = plt.figure(figsize=(16, 6))
    ax = plt.axes()
    ax.plot(titanic_fareq)
    # Set a locator and formatter for each quantile
    ax.xaxis.set_major_locator(plt.FixedLocator(titanic_fareq.index.values))
    ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda val, pos: (val+1) / 20))
    # Set x axis limits
    ax.set_xlim(min(titanic_fareq.index.values), max(titanic_fareq.index.values))
    # Add labels
    ax.set_xlabel('Fare Quantile')
    ax.set_ylabel('Survival Rate');
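
The binning step can also be done with `pd.qcut`. The sketch below, on made-up data, ranks the fares first so that tied fares cannot produce duplicate bin edges; this is an alternative to the index arithmetic above and not the method used in this notebook:

```python
import pandas as pd

# Hypothetical data: 40 passengers, the cheaper half died, the dearer half survived
toy = pd.DataFrame({
    'Fare':     [float(f) for f in range(1, 41)],
    'Survived': [0] * 20 + [1] * 20,
})

# Rank with method='first' so ties get distinct ranks, then cut into 20 equal bins
ranks = toy['Fare'].rank(method='first')
toy['FareQuantile'] = pd.qcut(ranks, 20, labels=False)

# The mean of a 0/1 column is the survival rate within each bin
rate = toy.groupby('FareQuantile')['Survived'].mean()
print(rate)
```

With `labels=False` the bins are numbered 0 to 19, matching the integer quantile index used for the plot.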


For each of the following characteristics, find the median in the data:

  • Age
  • Ticket price
  • Siblings/spouses
  • Parents/children

In [17]:
titanic[['Age', 'Fare', 'SibSp', 'Parch']].median()


Out[17]:
Age      28.0000
Fare     14.4542
SibSp     0.0000
Parch     0.0000
dtype: float64

If you were to use these medians to draw numerical boundaries separating survivors from non-survivors, which of these characteristics would be the best choice and why?


In [83]:
# Return 1 if a value is at or below the median of an attribute, 0 if above, None if missing
def below_median(x, attr):
    if x <= titanic_survrate[attr].median():
        return 1
    elif x > titanic_survrate[attr].median():
        return 0
    else:
        return None

titanic_survrate = titanic[['Survived', 'Age', 'Fare', 'SibSp', 'Parch']].copy()
# Apply function to the attributes
titanic_survrate['AgeBelowMedian'] = titanic_survrate['Age'].apply(lambda x: below_median(x, 'Age'))
titanic_survrate['FareBelowMedian'] = titanic_survrate['Fare'].apply(lambda x: below_median(x, 'Fare'))
titanic_survrate['SibSpBelowMedian'] = titanic_survrate['SibSp'].apply(lambda x: below_median(x, 'SibSp'))
titanic_survrate['ParchBelowMedian'] = titanic_survrate['Parch'].apply(lambda x: below_median(x, 'Parch'))
# Calculate the survival rate above and below the median
print(titanic_survrate.groupby('AgeBelowMedian')['Survived'].sum() / titanic_survrate.groupby('AgeBelowMedian')['Survived'].count())
print(titanic_survrate.groupby('FareBelowMedian')['Survived'].sum() / titanic_survrate.groupby('FareBelowMedian')['Survived'].count())
print(titanic_survrate.groupby('SibSpBelowMedian')['Survived'].sum() / titanic_survrate.groupby('SibSpBelowMedian')['Survived'].count())
print(titanic_survrate.groupby('ParchBelowMedian')['Survived'].sum() / titanic_survrate.groupby('ParchBelowMedian')['Survived'].count())


AgeBelowMedian
0.0    0.403409
1.0    0.408840
Name: Survived, dtype: float64
FareBelowMedian
0    0.518018
1    0.250559
Name: Survived, dtype: float64
SibSpBelowMedian
0    0.466431
1    0.345395
Name: Survived, dtype: float64
ParchBelowMedian
0    0.511737
1    0.343658
Name: Survived, dtype: float64

In [64]:
# This is from the solution
def survival_ratio(predicate):
    series = titanic[predicate]
    return len(series[series['Survived'] == True]) / len(series)

# An explicit dtype avoids the warning pandas raises for empty Series
below_at_median = pd.Series(name='Surv. below/at the median', dtype=float)
above_median = pd.Series(name='Surv. above the median', dtype=float)

below_at_median['Age'] = survival_ratio(titanic['Age'] <= titanic['Age'].median())
below_at_median['Fare'] = survival_ratio(titanic['Fare'] <= titanic['Fare'].median())
below_at_median['SibSp'] = survival_ratio(titanic['SibSp'] == titanic['SibSp'].median())
below_at_median['Parch'] = survival_ratio(titanic['Parch'] == titanic['Parch'].median())

above_median['Age'] = survival_ratio(titanic['Age'] > titanic['Age'].median())
above_median['Fare'] = survival_ratio(titanic['Fare'] > titanic['Fare'].median())
above_median['SibSp'] = survival_ratio(titanic['SibSp'] > titanic['SibSp'].median())
above_median['Parch'] = survival_ratio(titanic['Parch'] > titanic['Parch'].median())

survival_median = pd.DataFrame([below_at_median, above_median],
                              columns=['Age', 'Fare', 'SibSp', 'Parch']).transpose()

survival_median['above - below'] = above_median - below_at_median
survival_median


Out[64]:
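       Surv. below/at the median  Surv. above the median  above - below
Age                     0.408840                0.403409      -0.005431
Fare                    0.250559                0.518018       0.267459
SibSp                   0.345395                0.466431       0.121036
Parch                   0.343658                0.511737       0.168079

Fare is the best choice: splitting at its median gives the largest gap in survival rate (about 0.27, from 0.25 at or below the median to 0.52 above it), while splitting at the median age hardly separates survivors from non-survivors at all.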

Plot the distribution of passenger ages. Choose visually-meaningful bin sizes and label your axes.


In [84]:
fig = plt.figure(figsize=(16, 8))
ax = plt.axes()
# Plot ages using 20 bins
ax.hist(titanic['Age'].dropna(), bins=20)
# Set labels
ax.set_xlabel('Age')
ax.set_ylabel('Number of Passengers')
# Set locators
ax.xaxis.set_major_locator(plt.MaxNLocator(20))


Find the probability that:

  • A passenger survived
  • A passenger was male
  • A passenger was female and had at least one sibling or spouse on board
  • A survivor was from Cherbourg
  • A passenger was less than 10 years old
  • A passenger was between 25 and 40 years old
  • A passenger was either younger than 20 years old or older than 50

In [85]:
print('Survival probability:', titanic['Survived'].sum() / len(titanic))
print('Male probability:', titanic[titanic['Sex'] == 'male']['PassengerId'].count() / len(titanic))
print('Female and at least one sibling/spouse probability:', titanic[(titanic['Sex'] == 'female') & (titanic['SibSp'] > 0)]['PassengerId'].count() / len(titanic))
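# Note: this divides all Cherbourg passengers by all passengers, so it is
# P(embarked at Cherbourg), not the conditional P(Cherbourg | survived)
print('Survivor from Cherbourg probability:', titanic[titanic['Embarked'] == 'C']['PassengerId'].count() / len(titanic))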
print('Less than 10 years old probability:', titanic[titanic['Age'] < 10]['PassengerId'].count() / len(titanic))
print('Between 25 and 40 years old probability:', titanic[(titanic['Age'] > 25) & (titanic['Age'] < 40)]['PassengerId'].count() / len(titanic))
print('Less than 20 or more than 50 years old probability:', titanic[(titanic['Age'] < 20) | (titanic['Age'] > 50)]['PassengerId'].count() / len(titanic))


Survival probability: 0.3838383838383838
Male probability: 0.64758698092
Female and at least one sibling/spouse probability: 0.157126823793
Survivor from Cherbourg probability: 0.188552188552
Less than 10 years old probability: 0.0695847362514
Between 25 and 40 years old probability: 0.280583613917
Less than 20 or more than 50 years old probability: 0.255892255892
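
"A survivor was from Cherbourg" reads as a conditional probability: among survivors, the share who embarked at Cherbourg. The Cherbourg figure printed above is instead the share of all passengers who embarked there. A minimal sketch of the difference on made-up data:

```python
import pandas as pd

# Hypothetical miniature frame reusing the Titanic column names
toy = pd.DataFrame({
    'Survived': [1, 1, 1, 0, 0, 0],
    'Embarked': ['C', 'C', 'S', 'C', 'S', 'Q'],
})

# Unconditional: share of all passengers who embarked at Cherbourg
p_cherbourg = (toy['Embarked'] == 'C').mean()

# Conditional: restrict to survivors first, then take the Cherbourg share
survivors = toy[toy['Survived'] == 1]
p_cherbourg_given_survived = (survivors['Embarked'] == 'C').mean()

print(p_cherbourg, p_cherbourg_given_survived)
```

On the real data the same two lines would be applied to `titanic`; the result is not reproduced here because it was not computed in this notebook.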

Knowing nothing else about the passengers aside from the survival rate of the population (see question above), if I choose 100 passengers at random from the passenger list, what’s the probability that exactly 42 passengers survive?


In [87]:
survival_rate = titanic['Survived'].sum() / len(titanic)
stats.binom.pmf(42, 100, survival_rate)


Out[87]:
0.061330411815167886

What’s the probability that at least 42 of those 100 passengers survive?


In [88]:
1 - stats.binom.cdf(41, 100, survival_rate)


Out[88]:
0.25940724207261701
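
Both answers follow from the binomial formula P(X = k) = C(n, k) p^k (1 - p)^(n - k), with "at least k" being the sum of that term from k to n. A minimal pure-Python sketch of what `stats.binom.pmf` and `1 - stats.binom.cdf` compute; the helper names are hypothetical:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_at_least(k, n, p):
    """Probability of k or more successes: sum the pmf from k up to n."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

# Small case that is easy to verify by hand: 4 fair coin flips
exactly_two = binom_pmf(2, 4, 0.5)           # 6 / 16
at_least_three = binom_at_least(3, 4, 0.5)   # (4 + 1) / 16
print(exactly_two, at_least_three)
```

Calling `binom_pmf(42, 100, survival_rate)` and `binom_at_least(42, 100, survival_rate)` should agree with the SciPy results above up to floating-point rounding.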

Take random samples of 100 passengers and find out how many you need before the fraction of those samples where at least 42 passengers survive matches the probability you calculated previously (within Δp≈0.05).

Answers will vary based on chosen seeds. What would happen if you drew every sample with the same seed?

Plot the survival fraction vs the number of random samples.


In [95]:
# Seed once, before the loop. Re-seeding with the same value before every draw would
# reproduce the same sample each time, so the fraction would stay at 0 or 1.
random.seed(42)

# Set the target probability from above and Δp
target_prob = 0.2594
delta = 0.05

# Initialize list of fraction of samples with at least 42 survivors
# and counters for number of samples drawn and number of samples with at least 42 passengers
survival_frac = []
n_samples = 0
n_over = 0

# Iterate until the fraction of samples is within delta
while True:
    # Take a new sample
    n_samples += 1
    # Sample row positions without replacement (random.sample no longer accepts a set)
    samp = random.sample(range(len(titanic)), 100)
    # Check if survivors >= 42 and add to n_over
    if titanic.iloc[samp, 1].sum() >= 42:
        n_over += 1
    # Calculate the fraction of samples
    survival_frac.append(n_over / n_samples)
    if abs(n_over / n_samples - target_prob) < delta:
        break

In [96]:
survival_frac


Out[96]:
[0.0, 0.0, 0.0, 0.0, 0.2, 0.16666666666666666, 0.14285714285714285, 0.25]

In [93]:
print('Number of samples needed:', len(survival_frac))


Number of samples needed: 8
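
Stopping the first time the running fraction comes within 0.05 of the target can happen early by chance, as it does here after 8 samples. The sketch below tracks the running fraction over a fixed, larger number of draws instead. It uses NumPy's `default_rng` and a hypothetical stand-in population of 342 ones and 549 zeros, the survivor count implied by the 0.3838 survival rate over 891 passengers:

```python
import numpy as np

def running_fraction(survived, n_draws, sample_size=100, threshold=42, seed=0):
    """Running share of samples that contain at least `threshold` survivors."""
    rng = np.random.default_rng(seed)   # seeded once, before the loop
    survived = np.asarray(survived)
    hits = np.empty(n_draws, dtype=bool)
    for i in range(n_draws):
        # Draw passengers without replacement, as random.sample does
        sample = rng.choice(survived, size=sample_size, replace=False)
        hits[i] = sample.sum() >= threshold
    return np.cumsum(hits) / np.arange(1, n_draws + 1)

# Stand-in for titanic['Survived']: 342 survivors among 891 passengers
population = np.array([1] * 342 + [0] * 549)
fractions = running_fraction(population, n_draws=500)
print(fractions[-1])
```

Because the passengers are drawn without replacement from a finite list, the simulated fraction need not settle exactly on the binomial figure, which assumes independent draws.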

In [98]:
fig = plt.figure(figsize=(16, 6))
ax = plt.axes()
ax.plot(survival_frac)
# Set labels
ax.set_xlabel('Number of samples')
ax.set_ylabel('Samples with at least 42 survivors')
# Set x axis limits, locators and formatters
ax.set_xlim(0, len(survival_frac) - 1)
ax.xaxis.set_major_locator(plt.MaxNLocator(8))
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda value, pos: int(value + 1)))
# Add a reference line for the target probability
ax.hlines(0.2594, 0, len(survival_frac) - 1, colors='red', linewidth=.5);


Is there a statistically significant difference between:

  • The ages of male and female survivors?
  • The fares paid by passengers from Queenstown and the passengers from Cherbourg?

Use a 95% confidence level.

The difference between the ages of male and female survivors is not statistically significant (the p-value is about 0.43, well above 0.05):


In [100]:
print('Male survivors age average:', titanic[(titanic['Sex'] == 'male') & (titanic['Survived'] == 1)]['Age'].mean())
print('Female survivors age average:', titanic[(titanic['Sex'] == 'female') & (titanic['Survived'] == 1)]['Age'].mean())

print(titanic[(titanic['Sex'] == 'male') & (titanic['Survived'] == 1)]['Age'].std())
print(titanic[(titanic['Sex'] == 'female') & (titanic['Survived'] == 1)]['Age'].std())

stats.ttest_ind(titanic[(titanic['Sex'] == 'male') & (titanic['Survived'] == 1)]['Age'].dropna(),
                titanic[(titanic['Sex'] == 'female') & (titanic['Survived'] == 1)]['Age'].dropna(),
                equal_var=False)


Male survivors age average: 27.276021505376345
Female survivors age average: 28.84771573604061
16.50480299921846
14.175072701120337
Out[100]:
Ttest_indResult(statistic=-0.79089662277024664, pvalue=0.43018823932007377)
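
With `equal_var=False`, `ttest_ind` performs Welch's t-test, which does not assume the two groups share a variance. A minimal pure-Python sketch of the statistic and its degrees of freedom on made-up samples; `welch_t` is a hypothetical helper:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each sample mean (sample variance / n)
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

# Made-up samples, not Titanic data
t, dof = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
print(t, dof)
```

A negative statistic means the first group has the lower mean, as with the male survivors above.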

The difference between the fares paid by passengers from Queenstown and Cherbourg is statistically significant (the p-value is less than 0.001):


In [102]:
print('Fares paid by passengers embarked in Queenstown average:', titanic[titanic['Embarked'] == 'Q']['Fare'].mean())
print('Fares paid by passengers embarked in Cherbourg average:', titanic[titanic['Embarked'] == 'C']['Fare'].mean())

print(titanic[titanic['Embarked'] == 'Q']['Fare'].std())
print(titanic[titanic['Embarked'] == 'C']['Fare'].std())

stats.ttest_ind(titanic[titanic['Embarked'] == 'Q']['Fare'].dropna(),
                titanic[titanic['Embarked'] == 'C']['Fare'].dropna(),
                equal_var=False)


Fares paid by passengers embarked in Queenstown average: 13.276029870129872
Fares paid by passengers embarked in Cherbourg average: 59.95414404761905
14.188046974998139
83.91299426548599
Out[102]:
Ttest_indResult(statistic=-6.9951971047186809, pvalue=4.5792033919567422e-11)

Accompany your p-values with histograms showing the distributions of both compared populations.


In [109]:
bins0_100 = range(0, 100, 10)

fig = plt.figure(figsize=(16, 6))
ax = plt.axes()
# Plot the distribution of ages for the two groups in 10-year bins
ax.hist(titanic[(titanic['Sex'] == 'male') & (titanic['Survived'] == 1)]['Age'].dropna(), bins=bins0_100, color='green', alpha=.5, label='male')
ax.hist(titanic[(titanic['Sex'] == 'female') & (titanic['Survived'] == 1)]['Age'].dropna(), bins=bins0_100, color='yellow', alpha=.5, label='female')
# Add labels and title
ax.set_xlabel('Age')
ax.set_ylabel('Number of Survivors')
ax.set_title('Male and female survivors ages')
# Add legend
ax.legend();



In [110]:
bins0_600 = range(0, 600, 50)

fig = plt.figure(figsize=(16, 6))
ax = plt.axes()
# Plot the distribution of fares for the two groups in bins of width 50
ax.hist(titanic[titanic['Embarked'] == 'Q']['Fare'].dropna(), bins=bins0_600, color='green', alpha=.5, label='Queenstown')
ax.hist(titanic[titanic['Embarked'] == 'C']['Fare'].dropna(), bins=bins0_600, color='yellow', alpha=.5, label='Cherbourg')
# Add labels and title
ax.set_xlabel('Fare')
ax.set_ylabel('Number of Passengers')
ax.set_title('Fares of passengers from Queenstown and Cherbourg')
# Add legend
ax.legend();


Did survivors pay more for their tickets than those that did not? Use a 95% confidence level.

Yes. Survivors paid significantly more on average: the t statistic is positive and the one-sided p-value, half of the two-sided value reported by the test, is far below 0.001:


In [112]:
print('Survivors average fare:', titanic[titanic['Survived'] == 1]['Fare'].mean())
print('Non-survivors average fare:', titanic[titanic['Survived'] == 0]['Fare'].mean())

print(titanic[titanic['Survived'] == 1]['Fare'].std())
print(titanic[titanic['Survived'] == 0]['Fare'].std())

# The question is one-sided, so halve the two-sided p-value that ttest_ind reports
stats.ttest_ind(titanic[titanic['Survived'] == 1]['Fare'].dropna(),
                titanic[titanic['Survived'] == 0]['Fare'].dropna(),
                equal_var=False)


Survivors average fare: 48.39540760233917
Non-survivors average fare: 22.117886885245877
66.59699811829472
31.388206530563984
Out[112]:
Ttest_indResult(statistic=6.8390992590852537, pvalue=2.6993323503141236e-11)
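
Halving the two-sided p-value is only valid when the t statistic points in the direction of the hypothesis; otherwise the one-sided p-value is one minus that half. A minimal sketch of the conversion, using the statistic and p-value printed above; `one_sided_p` is a hypothetical helper:

```python
def one_sided_p(t_stat, p_two_sided, alternative='greater'):
    """Convert a two-sided t-test p-value into a one-sided one."""
    if alternative not in ('greater', 'less'):
        raise ValueError("alternative must be 'greater' or 'less'")
    # Does the observed difference point the way the hypothesis says?
    expected_sign = t_stat > 0 if alternative == 'greater' else t_stat < 0
    half = p_two_sided / 2
    return half if expected_sign else 1 - half

# Survivors versus non-survivors, hypothesis: survivors paid more
p_fare = one_sided_p(6.8390992590852537, 2.6993323503141236e-11, 'greater')
print(p_fare, p_fare < 0.05)
```

Recent SciPy releases also accept an `alternative` argument in `ttest_ind`, which returns the one-sided p-value directly.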

Did a given first-class passenger have fewer family members on board than a given third-class passenger? Use a 95% confidence level.

Yes. First-class passengers had significantly fewer family members on board than third-class passengers: the t statistic is negative and the one-sided p-value (half of the two-sided 0.038) is about 0.019, below 0.05:


In [114]:
print('First class average number of family members:', (titanic[titanic['Pclass'] == 1]['Parch'] + titanic[titanic['Pclass'] == 1]['SibSp']).mean())
print('Third class average number of family members:', (titanic[titanic['Pclass'] == 3]['Parch'] + titanic[titanic['Pclass'] == 3]['SibSp']).mean())

print((titanic[titanic['Pclass'] == 1]['Parch'] + titanic[titanic['Pclass'] == 1]['SibSp']).std())
print((titanic[titanic['Pclass'] == 3]['Parch'] + titanic[titanic['Pclass'] == 3]['SibSp']).std())

# The question is one-sided, so halve the two-sided p-value that ttest_ind reports
stats.ttest_ind((titanic[titanic['Pclass'] == 1]['Parch'] + titanic[titanic['Pclass'] == 1]['SibSp']).dropna(),
                (titanic[titanic['Pclass'] == 3]['Parch'] + titanic[titanic['Pclass'] == 3]['SibSp']).dropna(),
                equal_var=False)


First class average number of family members: 0.7731481481481481
Third class average number of family members: 1.0081466395112015
1.0385236821638482
1.9535250260574035
Out[114]:
Ttest_indResult(statistic=-2.0799075748873195, pvalue=0.037907385748521927)