In [1]:
!pip install plotly_express
In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly_express as px
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [3]:
train_df= pd.read_csv('titanic_train.csv')
train_df.head()
Out[3]:
In [4]:
test_df = pd.read_csv('titanic_test.csv')
test_df.head()
Out[4]:
In [5]:
(train_df.shape, test_df.shape)
Out[5]:
In [6]:
gender_sub_df = pd.read_csv('titanic_gender_submission.csv')
gender_sub_df.head()
Out[6]:
In [7]:
gender_sub_df.shape
Out[7]:
Let us look at the data type of the columns present in this data set.
In [8]:
# datatype of columns
train_df.info()
In [9]:
# find missing data
sns.heatmap(train_df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
Out[9]:
Age and Cabin have some missing data; we will impute these later.
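The heatmap shows where values are missing but not how many. Below is a minimal sketch that counts them per column. The helper name `missing_summary` and the toy frame are illustrative assumptions, not part of the original notebook; in the notebook you would pass `train_df` instead.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and share of missing values, for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "share": counts / len(df),
    })
    # Keep only columns with at least one gap, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy stand-in with made-up values, only to show the call.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})
print(missing_summary(toy_df))
```

Columns without gaps (Fare in the toy frame) are dropped from the summary, so the result lists only the columns that need imputation.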
In [10]:
import warnings
warnings.filterwarnings("ignore")
In [11]:
sns.pairplot(train_df[['Survived','Pclass','Age','Fare']], dropna=True)
Out[11]:
In [12]:
plot1 = px.histogram(train_df, x='Age', color='Pclass')
plot1
In [13]:
px.histogram(train_df, x='Age', color='Survived', facet_col='Pclass')
The third class skews younger than the 2nd and 1st classes. First class has a mostly uniform age distribution, while 2nd class is centered around 30. Further, kids (younger than 20) are most likely to be in second and third classes.
It's clear that most of 3rd class did not survive, whereas most of first class did. Most of the children in 2nd class survived.
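The histograms suggest these patterns visually; a grouped mean puts numbers on them. Below is a minimal sketch, assuming a 0/1 `Survived` column as in the training data. The helper `survival_rate_by`, the `IsChild` flag and the toy values are hypothetical and only illustrate the call.

```python
import pandas as pd

def survival_rate_by(df, keys):
    """Mean of the 0/1 Survived column per group, i.e. the survival rate."""
    return df.groupby(keys)["Survived"].mean()

# Toy stand-in with made-up values; in the notebook use train_df.
toy_df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Age":      [38, 54, 4, 30, 22, 2, 26, 19],
    "Survived": [1, 0, 1, 0, 0, 0, 1, 0],
})

# Same under-20 cutoff as in the text. Note that a missing Age
# compares as False, so such rows would be counted as non-children.
toy_df["IsChild"] = toy_df["Age"] < 20

print(survival_rate_by(toy_df, "Pclass"))
print(survival_rate_by(toy_df, ["Pclass", "IsChild"]))
```

On the real data, dropping rows with a missing Age before building the flag may be the safer choice, since about a fifth of the ages are absent.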
In [14]:
px.histogram(train_df, x='Fare', color='Pclass')
In [15]:
px.histogram(train_df, x='Fare', facet_col='Pclass', color='Survived', range_x=[0,200], nbins=50)
Surprisingly, many 1st class passengers paid about the same fare as 2nd and 3rd class passengers, and they survived at a high rate. 3rd class passengers mostly paid under $50, but only a few survived. Thus, passenger class is a better indicator of survival than age or fare.
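To check the fare claims without reading them off the bars, one option is a per-class summary of the median fare and the share of tickets under a threshold. This is a sketch under assumptions: the helper `fare_summary`, its `threshold` parameter and the toy values are illustrative, not from the original notebook.

```python
import pandas as pd

def fare_summary(df, threshold=50):
    """Median fare and share of fares below `threshold`, per passenger class."""
    flagged = df.assign(under=df["Fare"] < threshold)
    return flagged.groupby("Pclass").agg(
        median_fare=("Fare", "median"),
        share_under=("under", "mean"),  # mean of booleans = proportion
    )

# Toy stand-in with made-up values; in the notebook use train_df.
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Fare":   [30.0, 80.0, 150.0, 13.0, 26.0, 7.25, 8.05, 60.0],
})
print(fare_summary(toy_df))
```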
In [16]:
px.histogram(train_df, x='Age', color='Sex')
The age distributions of men and women look similar, but there are more men than women in almost every age group. Next, let us see how many men survived.
In [17]:
px.histogram(train_df, x='Age', color='Sex', facet_col='Survived')
Well, the men of the early 20th century were quite chivalrous. Most women survived irrespective of their age. Next, let us see how survival by gender is linked with survival by passenger class.
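A row-normalized cross-tabulation gives the survival share for each sex directly, which backs up the reading of the faceted histogram. This is a minimal sketch; the toy values are made up, and in the notebook the same call would take the columns of `train_df`.

```python
import pandas as pd

# Toy stand-in with made-up values; in the notebook use train_df.
toy_df = pd.DataFrame({
    "Sex":      ["male", "male", "male", "male", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 0],
})

# normalize="index" makes each row sum to 1, so column 1 is the
# survival rate for that sex and column 0 is the share who died.
rates = pd.crosstab(toy_df["Sex"], toy_df["Survived"], normalize="index")
print(rates)
```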
In [18]:
px.histogram(train_df, x='Age', color='Sex', facet_col='Survived', facet_row='Pclass')
This is a detailed plot to unpack.
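One way to unpack a plot with this many panels is to read it next to a pivot table of survival rate and group size by class and sex. The sketch below uses made-up toy values, and the variable names are illustrative; in the notebook you would build the table from `train_df`.

```python
import pandas as pd

# Toy stand-in with made-up values; in the notebook use train_df.
toy_df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "Sex":      ["female", "male", "male", "female", "male",
                 "female", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0],
})

# "mean" is the survival rate; "count" shows how many passengers
# each rate is based on, which matters for small groups.
table = toy_df.pivot_table(
    index="Pclass",
    columns="Sex",
    values="Survived",
    aggfunc=["mean", "count"],
)
print(table)
```

Keeping the count next to the mean guards against reading too much into a rate computed from only a handful of passengers.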
In [19]:
sns.countplot(x='Survived', hue='Pclass', data=train_df)
Out[19]:
In [20]:
sns.heatmap(train_df[['Survived','Pclass']])
Out[20]:
In [21]: