Using Titanic Dataset from Kaggle: link
About Dataset:
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)
SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
Age is in Years; Fractional if Age less than One (1)
If the Age is Estimated, it is in the form xx.5
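The age conventions above can be turned into simple flags. The following is a minimal sketch on made-up values, not on train.csv; the names `is_estimated` and `is_infant` are our own, and we assume that only ages of at least one year ending in .5 should count as estimates.

```python
import numpy as np
import pandas as pd

# Toy ages (hypothetical values, not taken from train.csv)
ages = pd.Series([22.0, 28.5, 0.42, np.nan])

# Assumption: an age of at least one year ending in .5 marks an estimate;
# ages below one are fractional by design and are not flagged.
is_estimated = (ages >= 1) & (ages % 1 == 0.5)

# Ages below one year are recorded as fractions of a year
is_infant = ages < 1

print(is_estimated.tolist())
print(is_infant.tolist())
```

Missing ages compare as False in both tests, so they are never flagged.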
With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored. The following are the definitions used
for sibsp and parch.
Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws. Some children travelled
only with a nanny, therefore parch=0 for them. As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
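Given these definitions, one way to summarise family relations is to add sibsp and parch. The sketch below uses made-up rows with the column names from train.csv; `FamilySize` and `HasFamily` are illustrative names that do not exist in the dataset.

```python
import pandas as pd

# Toy rows (hypothetical values) using the train.csv column names
toy = pd.DataFrame({
    'SibSp': [1, 0, 0, 3],
    'Parch': [0, 0, 2, 1],
})

# FamilySize and HasFamily are illustrative names, not dataset columns
toy['FamilySize'] = toy['SibSp'] + toy['Parch']
toy['HasFamily'] = toy['FamilySize'] > 0

print(toy)
```

A child travelling only with a nanny would have `FamilySize` equal to 0 under these definitions, so the flag undercounts companions.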
Questions we will answer:
In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
In [6]:
df = pd.read_csv('titanic_kaggle/train.csv')
df.head()
Out[6]:
In [7]:
df.shape
Out[7]:
In [9]:
df['Pclass'].isnull().value_counts() # Check whether the column has any null values
Out[9]:
In [10]:
df['Survived'].isnull().value_counts()
Out[10]:
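The per-column checks above can also be done for every column in one call. This is a small sketch on a made-up frame rather than on train.csv; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with some missing ages
toy = pd.DataFrame({
    'Pclass': [1, 2, 3, 3],
    'Survived': [1, 0, 0, 1],
    'Age': [38.0, np.nan, 26.0, np.nan],
})

# isnull() marks each missing cell; sum() counts them per column
null_counts = toy.isnull().sum()
print(null_counts)
```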
In [12]:
# Passengers survived in each class
survivors = df.groupby('Pclass')['Survived'].sum()
survivors
Out[12]:
In [14]:
# Total Passengers in each class
total_passengers = df.groupby('Pclass')['PassengerId'].count()
survivor_percentage = survivors / total_passengers
survivor_percentage
Out[14]:
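Because Survived is coded as 0 or 1, the ratio of survivors to total passengers equals the mean of the column within each class. The sketch below checks this on made-up rows, not on train.csv.

```python
import pandas as pd

# Toy rows (hypothetical values) using the train.csv column names
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5, 6],
    'Pclass': [1, 1, 2, 2, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0],
})

# Ratio of survivors to passengers, as computed in the cells above
by_ratio = (toy.groupby('Pclass')['Survived'].sum()
            / toy.groupby('Pclass')['PassengerId'].count())

# The mean of a 0/1 column gives the same survival rate in one step
by_mean = toy.groupby('Pclass')['Survived'].mean()

print(by_ratio.tolist())
print(by_mean.tolist())
```

Note that both results are fractions between 0 and 1; multiply by 100 if a percentage scale is wanted on the plot.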
In [15]:
# Plotting the Total number of survivors
fig = plt.figure()
ax = fig.add_subplot(111)
rect = ax.bar(survivors.index.values.tolist(), survivors, color='blue', width=0.5)
ax.set_ylabel('No. of survivors')
ax.set_title('Total number of survivors based on class')
xTickMarks = survivors.index.values.tolist()
ax.set_xticks(survivors.index.values.tolist())
xtickNames = ax.set_xticklabels(xTickMarks)
plt.setp(xtickNames, fontsize=20)
plt.show()
In [16]:
#Plotting the percentage of survivors in each class
fig = plt.figure()
ax = fig.add_subplot(111)
rect = ax.bar(survivor_percentage.index.values.tolist(),
survivor_percentage, color='blue', width=0.5)
ax.set_ylabel('Survivor Percentage')
ax.set_title('Percentage of survivors based on class')
xTickMarks = survivors.index.values.tolist()
ax.set_xticks(survivors.index.values.tolist())
xtickNames = ax.set_xticklabels(xTickMarks)
plt.setp(xtickNames, fontsize=20)
plt.show()
These are our observations:
This is our key takeaway:
In [17]:
# Checking for any null values
df['Sex'].isnull().value_counts()
Out[17]:
In [19]:
# Male passengers survived in each class
male_survivors = df[df['Sex'] == 'male'].groupby('Pclass')['Survived'].sum()
male_survivors
Out[19]:
In [20]:
# Total Male Passengers in each class
male_total_passengers = df[df['Sex'] == 'male'].groupby('Pclass')['PassengerId'].count()
male_total_passengers
Out[20]:
In [21]:
male_survivor_percentage = male_survivors / male_total_passengers
male_survivor_percentage
Out[21]:
In [22]:
# Female Passengers survived in each class
female_survivors = df[df['Sex'] == 'female'].groupby('Pclass')['Survived'].sum()
female_survivors
Out[22]:
In [23]:
# Total Female Passengers in each class
female_total_passengers = df[df['Sex'] == 'female'].groupby('Pclass')['PassengerId'].count()
In [24]:
female_survivor_percentage = female_survivors / female_total_passengers
female_survivor_percentage
Out[24]:
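The male and female rates can also be computed together by grouping on both columns. This is a sketch on made-up rows, not on train.csv; the variable name `rates` is our own.

```python
import pandas as pd

# Toy rows (hypothetical values) using the train.csv column names
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 1, 3, 3, 3, 3],
    'Sex': ['male', 'male', 'female', 'female',
            'male', 'male', 'female', 'female'],
    'Survived': [1, 0, 1, 1, 0, 0, 1, 0],
})

# Group by class and sex, then pivot sex into columns:
# one row per class, one column per sex
rates = toy.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack('Sex')
print(rates)
```

The resulting table has the same numbers as the separate male and female series, which makes side-by-side comparison easier.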
In [25]:
# Plotting the total passengers who survived based on Gender
fig = plt.figure()
ax = fig.add_subplot(111)
index = np.arange(male_survivors.count())
bar_width = 0.35
rect1 = ax.bar(index, male_survivors, bar_width, color='blue',label='Men')
rect2 = ax.bar(index + bar_width, female_survivors, bar_width, color='y', label='Women')
ax.set_ylabel('Survivor Numbers')
ax.set_title('Male and Female survivors based on class')
xTickMarks = male_survivors.index.values.tolist()
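# Bars are center-aligned by default in Matplotlib 2.0+, so the middle of each pair is half a bar width to the right
ax.set_xticks(index + bar_width / 2)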
xtickNames = ax.set_xticklabels(xTickMarks)
plt.setp(xtickNames, fontsize=20)
plt.legend()
plt.tight_layout()
plt.show()
In [26]:
# Plotting the percentage of passengers who survived based on Gender
fig = plt.figure()
ax = fig.add_subplot(111)
index = np.arange(male_survivor_percentage.count())
bar_width = 0.35
rect1 = ax.bar(index, male_survivor_percentage, bar_width, color='blue', label='Men')
rect2 = ax.bar(index + bar_width, female_survivor_percentage, bar_width, color='y', label='Women')
ax.set_ylabel('Survivor Percentage')
ax.set_title('Percentage of male and female survivors based on class')
xTickMarks = male_survivor_percentage.index.values.tolist()
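# Bars are center-aligned by default in Matplotlib 2.0+, so the middle of each pair is half a bar width to the right
ax.set_xticks(index + bar_width / 2)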
xtickNames = ax.set_xticklabels(xTickMarks)
plt.setp(xtickNames, fontsize=20)
plt.legend()
plt.tight_layout()
plt.show()
These are our observations:
The percentages of male passengers who survived in first and third class are comparable.
This is our key takeaway:
Female passengers were given preference for lifeboats and the majority were saved.
In [27]:
# Checking for the null values
df['SibSp'].isnull().value_counts()
Out[27]:
In [28]:
# Checking for the null values
df['Parch'].isnull().value_counts()
Out[28]:
In [29]:
# Total number of non-survivors with family in each class
# The OR is parenthesized because & binds more tightly than |
non_survivors = df[((df['SibSp'] > 0) | (df['Parch'] > 0)) & (df['Survived'] == 0)].groupby('Pclass')['Survived'].agg('count')
non_survivors
Out[29]:
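When a filter mixes `|` and `&`, Python evaluates `&` first, so the grouping of the conditions matters. The sketch below uses made-up rows, not train.csv, to show how an ungrouped filter lets a survivor with a sibling or spouse slip into a non-survivor selection.

```python
import pandas as pd

# Toy rows (hypothetical values) using the train.csv column names
toy = pd.DataFrame({
    'SibSp':    [1, 1, 0, 0],
    'Parch':    [0, 0, 2, 0],
    'Survived': [1, 0, 0, 0],
})

has_sibsp = toy['SibSp'] > 0
has_parch = toy['Parch'] > 0
died = toy['Survived'] == 0

# Without grouping this is read as: has_sibsp | (has_parch & died)
ungrouped = has_sibsp | has_parch & died

# With grouping: (has_sibsp | has_parch) & died
grouped = (has_sibsp | has_parch) & died

print(ungrouped.tolist())
print(grouped.tolist())
```

The first row survived, yet the ungrouped mask still selects it; only the grouped mask restricts the result to non-survivors with family.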
In [30]:
# Total passengers in each class
total_passengers = df.groupby('Pclass')['PassengerId'].count()
total_passengers
Out[30]:
In [31]:
non_survivor_percentage = non_survivors / total_passengers
non_survivor_percentage
Out[31]:
In [32]:
# Total number of non survivors with family based on class
fig = plt.figure()
ax = fig.add_subplot(111)
rect = ax.bar(non_survivors.index.values.tolist(), non_survivors, color='blue', width=0.5)
ax.set_ylabel('No. of non survivors')
ax.set_title('Total number of non survivors with family based on class')
xTickMarks = non_survivors.index.values.tolist()
ax.set_xticks(non_survivors.index.values.tolist())
xtickNames = ax.set_xticklabels(xTickMarks)
plt.setp(xtickNames, fontsize=20)
plt.show()
In [33]:
# Plot of percentage of non survivors with family based on class
fig = plt.figure()
ax = fig.add_subplot(111)
rect = ax.bar(non_survivor_percentage.index.values.tolist(), non_survivor_percentage, color='blue', width=0.5)
ax.set_ylabel('Non Survivor Percentage')
ax.set_title('Percentage of non survivors with family based on class')
xTickMarks = non_survivor_percentage.index.values.tolist()
ax.set_xticks(non_survivor_percentage.index.values.tolist())
xtickNames = ax.set_xticklabels(xTickMarks)
plt.setp(xtickNames, fontsize=20)
plt.show()
These are our observations:
This is our key takeaway:
In [34]:
# Checking for null values
df['Age'].isnull().value_counts()
Out[34]:
In [35]:
# Defining the age binning interval
age_bin = [0, 18, 25, 40, 60, 100]
# Creating the bins
df['AgeBin'] = pd.cut(df.Age, bins=age_bin)
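By default `pd.cut` builds intervals that are open on the left and closed on the right, and it leaves missing ages unbinned. The sketch below applies the same bin edges to made-up ages, not to train.csv, to show where boundary values land.

```python
import numpy as np
import pandas as pd

# Same bin edges as in the cell above
age_bin = [0, 18, 25, 40, 60, 100]

# Toy ages (hypothetical values), including a boundary and a missing value
ages = pd.Series([0.42, 18.0, 18.5, 30.0, np.nan])

binned = pd.cut(ages, bins=age_bin)

# Category codes: 0 is the first bin, -1 marks an unbinned (missing) value
codes = binned.cat.codes.tolist()

print(binned)
print(codes)
```

An age of exactly 18 falls in the first bin, while a missing age gets no bin at all, which is why rows without an age need to be handled before counting per bin.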
In [42]:
d_temp = df[np.isfinite(df['Age'])]
In [43]:
# Number of survivors based on Age bin
survivors = d_temp.groupby('AgeBin')['Survived'].sum()
survivors
Out[43]:
In [45]:
# Total passengers in each bin
total_passengers = d_temp.groupby('AgeBin')['Survived'].agg('count')
total_passengers
Out[45]:
In [50]:
list(total_passengers.index.values)
Out[50]:
In [51]:
# Plotting the pie chart of total passengers in each bin
plt.pie(total_passengers, labels=list(total_passengers.index.values),
autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('Total Passengers in different age groups')
plt.show()
In [52]:
# Plotting the pie chart of survivors in each age bin
plt.pie(survivors, labels=list(survivors.index.values),
autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('Survivors in different age groups')
plt.show()
These are our observations:
This is our key takeaway: