PROBLEM SET 2

Find the probability that a passenger survived.



In [1]:

    
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
titanic_data = pd.read_csv('../knn/train.csv', header=0)
titanic_data.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB



In [2]:

    
titanic_data.Survived.mean()









    Out[2]:





0.3838383838383838

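About 38% of the passengers in the data set survived, so a randomly chosen passenger had roughly a 38% chance of surviving the Titanic.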
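
Why the mean works here: Survived is coded 0/1, so its mean equals the share of rows with a 1. A minimal sketch on a made-up five-row frame (the `toy` data is hypothetical, not the Titanic file):

```python
import pandas as pd

# Hypothetical toy data: 1 = survived, 0 = did not survive.
toy = pd.DataFrame({"Survived": [1, 0, 0, 1, 0]})

# The mean of a 0/1 column is the fraction of ones.
p_mean = toy.Survived.mean()

# Same quantity computed by counting survivors explicitly.
p_count = (toy.Survived == 1).sum() / len(toy)

print(p_mean, p_count)
```

Both expressions give the same proportion, which is why `titanic_data.Survived.mean()` can be read as a survival probability.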

Probability that a passenger is female with at least one sibling or spouse on board



In [3]:

    
fem_sibsp = titanic_data[(titanic_data.Sex == 'female') & (titanic_data.SibSp > 0)]

ans = len(fem_sibsp)/len(titanic_data.PassengerId)
print(ans)









    



0.15712682379349047

There is roughly a 16% chance that a randomly chosen passenger is female with at least one sibling or spouse on board.

What is the probability that a passenger embarked at Cherbourg and survived?



In [4]:

    
from_cherb = titanic_data[(titanic_data.Embarked == 'C') & (titanic_data.Survived == 1)]
print(len(from_cherb))



In [5]:

    
print(len(from_cherb)/len(titanic_data.PassengerId))









    



0.10437710437710437

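About a 10% chance that a randomly chosen passenger both embarked at Cherbourg and survived. This is a joint probability over all passengers, not the survival rate among Cherbourg passengers.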
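
The joint probability P(Cherbourg and survived) and the conditional probability P(survived | Cherbourg) are easy to mix up. A small sketch of the difference, assuming a made-up six-row frame with the same column names (the values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file.
toy = pd.DataFrame({
    "Embarked": ["C", "C", "S", "S", "Q", "C"],
    "Survived": [1, 0, 0, 1, 0, 1],
})

from_c = toy.Embarked == "C"
survived = toy.Survived == 1

# Joint: share of ALL passengers who embarked at C and survived.
p_joint = (from_c & survived).mean()

# Conditional: share of survivors AMONG passengers who embarked at C.
p_cond = toy.loc[from_c, "Survived"].mean()

print(p_joint, p_cond)
```

The joint version divides by the whole data set, as the cell above does; the conditional version divides only by the Cherbourg rows.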

Plot the distribution of passenger ages.



In [6]:

    
%matplotlib inline
# Keep only the passengers whose age is known.
titanic_data_known_ages = titanic_data.dropna(subset=['Age'])
h, edges = np.histogram(titanic_data_known_ages.Age.values, bins=20)
plt.figure(figsize=(10, 4))
ax = plt.subplot(111)
# align='edge' starts each bar at its bin's left edge instead of centering it there.
ax.bar(edges[:-1], h, width=edges[1] - edges[0], align='edge')
ax.text(0.9, 0.9, '*Known Ages', horizontalalignment='right', transform=ax.transAxes)
ax.set_xlabel('Age Range')
ax.set_ylabel('Number of People')
ax.minorticks_on()
plt.show()
print(len(titanic_data_known_ages))

Probability that a passenger was less than 10 years old



In [7]:

    
print(len(titanic_data[titanic_data.Age < 10])/len(titanic_data.PassengerId))









    



0.06958473625140292

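About a 7% probability that a passenger was recorded as under 10 years old (62 of 891). The 177 passengers with unknown age count in the denominator but never in the numerator; among the 714 passengers with known ages the share is 62/714, about 8.7%.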
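
Missing ages affect this estimate because a comparison against NaN is False. A minimal sketch of the two possible denominators, using a made-up Age column (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing ages.
toy = pd.DataFrame({"Age": [4.0, 35.0, np.nan, 8.0, np.nan, 60.0]})

# NaN < 10 evaluates to False, so unknown ages never count as "under 10".
under_10 = toy.Age < 10

# Denominator 1: every row, including rows with unknown age.
p_all = under_10.sum() / len(toy)

# Denominator 2: only rows where the age is known.
p_known = under_10.sum() / toy.Age.notna().sum()

print(p_all, p_known)
```

Which denominator is appropriate depends on whether the passengers with unknown age are assumed to resemble the rest.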

What is the probability of exactly 42 passengers surviving if 100 passengers are chosen at random?



In [8]:

    
from scipy.stats import binom

binom.pmf(42, 100, 0.38)









    Out[8]:





0.057647821612310038

Roughly a 6% chance that exactly 42 of the 100 survive, using the rounded survival probability p = 0.38.

What's the probability that at least 42 of those 100 passengers survive?



In [9]:

    
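# "At least 42" means P(X >= 42) = 1 - P(X <= 41), which is the survival function at 41.
prob_at_least_42 = binom.sf(41, 100, 0.38)
print("%.4f" % prob_at_least_42)




0.2343

Roughly 23% chance that at least 42 survive. Note that 1 - binom.cdf(42, 100, 0.38) gives P(X >= 43), about 0.177, which leaves out the case of exactly 42 survivors.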
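
The binomial numbers can be cross-checked from the formula P(X = k) = C(n, k) p^k (1 - p)^(n - k) using only the standard library. A sketch, where `binom_pmf` and `binom_at_least` are hypothetical helper names, not scipy functions:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), straight from the formula."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_at_least(k, n, p):
    """P(X >= k): sum the pmf over k, k+1, ..., n."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

n, p = 100, 0.38
p_exact = binom_pmf(42, n, p)           # exactly 42 survivors
p_at_least = binom_at_least(42, n, p)   # 42 or more survivors
p_more_than = binom_at_least(43, n, p)  # strictly more than 42

print(p_exact, p_at_least, p_more_than)
```

The "at least 42" sum starts at 42, so it exceeds the "more than 42" sum by exactly the probability of 42 survivors.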

Is there a statistically significant difference between the ages of male and female survivors?



In [26]:

    
from scipy.stats import ttest_ind

# Age > 0 also drops rows with a missing age, because comparisons against NaN are False.
fem_surv_avg_age = titanic_data[(titanic_data.Survived == 1) & (titanic_data.Sex =='female') & (titanic_data.Age > 0)].Age
male_surv_avg_age = titanic_data[(titanic_data.Survived == 1) & (titanic_data.Sex =='male') & (titanic_data.Age > 0)].Age

t_stat, p_value = ttest_ind(male_surv_avg_age, fem_surv_avg_age)

print("Results: %.5f"% p_value)









    



Results: 0.40434

The age difference between the male and female survivors is not statistically significant: the p-value (0.40) is above .05, so we cannot reject the hypothesis that the two groups have the same mean age.
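
The decision rule is: reject the null hypothesis of equal means only when the p-value falls below the chosen threshold. A sketch on made-up samples (the arrays and the `is_significant` helper are hypothetical). It passes `equal_var=False` to run Welch's t-test, which does not assume equal variances and is a swapped-in variant, not the default test used in the cell above:

```python
import numpy as np
from scipy.stats import ttest_ind

def is_significant(p_value, alpha=0.05):
    """Hypothetical helper: True when the null hypothesis is rejected at level alpha."""
    return p_value < alpha

# Hypothetical samples, not the Titanic ages.
group_a = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
group_b = np.array([21.0, 24.0, 31.0, 34.0, 40.0])  # same mean as group_a
group_c = group_a + 100.0                            # clearly shifted mean

# equal_var=False selects Welch's t-test.
_, p_same = ttest_ind(group_a, group_b, equal_var=False)
_, p_shifted = ttest_ind(group_a, group_c, equal_var=False)

print(is_significant(p_same), is_significant(p_shifted))
```

A large p-value, as for the two groups with the same mean, means the data give no evidence of a difference; it does not prove the means are equal.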

Is there a statistically significant difference between fares paid by the passengers from Queenstown and the ones from Cherbourg?



In [25]:

    
queen_fare = titanic_data[(titanic_data['Embarked'] == 'Q')].Fare
cherb_fare = titanic_data[(titanic_data['Embarked'] == 'C')].Fare

t_stat, p_value = ttest_ind(cherb_fare, queen_fare)
print("Results: %.5f"% p_value)









    



Results: 0.00000

There is a statistically significant difference between the fares paid by passengers from Queenstown and Cherbourg: the p-value rounds to 0.00000, far below .05.
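
A result printed as 0.00000 is not literally zero; the `%.5f` format just cannot show anything smaller than 0.000005. A short sketch of the formatting difference, using a made-up p-value (the number is hypothetical, not the test result above):

```python
# Hypothetical small p-value, chosen only to show the formatting.
p_value = 3.2e-9

# Fixed-point with five decimals rounds it to zero.
fixed = "Results: %.5f" % p_value

# Scientific notation keeps the order of magnitude visible.
sci = "Results: %.3e" % p_value

print(fixed)
print(sci)
```

Scientific notation is usually the safer choice when reporting very small p-values.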

Graph the age distributions of male and female survivors



In [32]:

    
plt.figure(figsize=(10, 4))
opacity = 0.5

plt.hist(fem_surv_avg_age, bins=np.arange(0, 80, 4), alpha=opacity, label='Female')
plt.hist(male_surv_avg_age, bins=np.arange(0, 80, 4), alpha=opacity, label='Male')
plt.legend()
plt.xlabel('Age')
plt.ylabel('Number of Survivors')
plt.show()

Graph the fare distributions of passengers from Cherbourg and Queenstown



In [37]:

    
plt.figure(figsize=(10, 4))
opacity = 0.5
plt.hist(cherb_fare, bins=np.arange(0, 150, 4), alpha=opacity, label='Cherbourg')
plt.hist(queen_fare, bins=np.arange(0, 150, 4), alpha=opacity, label='Queenstown')

plt.legend()
plt.xlabel('Fare')
plt.ylabel('Number of people paying')
plt.show()


