rms-titanic-the-unsinkable-eda-and-predictions



In [1]:
!pip install plotly_express



In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly_express as px
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
train_df = pd.read_csv('titanic_train.csv')
train_df.head()


Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [4]:
test_df = pd.read_csv('titanic_test.csv')
test_df.head()


Out[4]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S

In [5]:
(train_df.shape, test_df.shape)


Out[5]:
((891, 12), (418, 11))

In [6]:
gender_sub_df = pd.read_csv('titanic_gender_submission.csv')
gender_sub_df.head()


Out[6]:
PassengerId Survived
0 892 0
1 893 1
2 894 0
3 895 0
4 896 1

In [7]:
gender_sub_df.shape


Out[7]:
(418, 2)

Exploratory analysis

Let us look at the data types of the columns in this data set.


In [8]:
# datatype of columns
train_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

Visualize missing data


In [9]:
# find missing data
sns.heatmap(train_df.isnull(), yticklabels=False, cbar=False, cmap='viridis')


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2e9c5432b0>

Age and Cabin have missing values: only 714 and 204 of the 891 rows are populated, respectively. We will impute these later.
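
The heatmap gives a visual impression of the gaps; a numeric tally per column can be sketched as below. This is a minimal sketch, not part of the original notebook: `missing_summary` is a hypothetical helper, and `sample_df` holds only the five rows printed by `train_df.head()` so that the snippet runs without the CSV file. In the notebook you would pass `train_df` instead.

```python
import numpy as np
import pandas as pd

# The five rows shown by train_df.head(), restricted to a few columns.
# Assumption: in the notebook, pass train_df itself instead of this sample.
sample_df = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan],
    'Embarked': ['S', 'C', 'S', 'S', 'S'],
})


def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': (100 * counts / len(df)).round(1),
    })
    return summary.sort_values('missing', ascending=False)


summary = missing_summary(sample_df)
print(summary)
```

On the full training set the same call should agree with the non-null counts reported by `train_df.info()`.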


In [10]:
import warnings
warnings.filterwarnings("ignore")

Visualize distributions and correlations


In [11]:
sns.pairplot(train_df[['Survived','Pclass','Age','Fare']], dropna=True)


Out[11]:
<seaborn.axisgrid.PairGrid at 0x7f2e9c8322b0>
  • Does age show any correlation with fare? Do older people pay more?
  • Does passenger class show any correlation with fare? Does first class cost more?
  • Do older passengers prefer a better class?
  • Does the passenger class / fare / age / gender determine whether or not you survived?
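
The questions above about age, fare and class can also be approached numerically with a correlation matrix. The snippet below is a rough sketch rather than the notebook's own analysis; `sample_df` contains just the five rows printed by `train_df.head()`, which is far too few to draw conclusions from, so treat the numbers as a demonstration of the call only.

```python
import pandas as pd

# The five rows printed by train_df.head(); swap in
# train_df[['Survived', 'Pclass', 'Age', 'Fare']] for real use.
sample_df = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Pairwise Pearson correlation between the numeric columns.
corr = sample_df.corr()
print(corr.round(2))

# Optional: draw the same matrix as an annotated heatmap with seaborn.
# sns.heatmap(corr, annot=True, cmap='viridis')
```

A negative Pclass/Fare coefficient means that a lower class number (a better class) goes with a higher fare.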

Visualize distribution of class vs Age


In [9]:
plot1 = px.histogram(train_df, x='Age', color='Pclass')
plot1



In [13]:
px.histogram(train_df, x='Age', color='Survived', facet_col='Pclass')


The third class skews younger than the 2nd and 1st classes. First class has a roughly uniform age distribution, while 2nd class is centered around 30. Further, kids (younger than 20) are most likely to be in second and third classes.

It's clear that most of 3rd class did not survive, whereas most of 1st class did. The majority of children survived, particularly those in 2nd class.
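
Reading survival rates off stacked histograms is error-prone, so the same comparison can be tabulated by age band and class. This is a sketch under assumptions: the band edges (20, 40, 60) and the helper name `survival_by_age_band` are illustrative choices, not taken from the notebook, and `sample_df` is only the five rows printed by `train_df.head()`.

```python
import pandas as pd

# The five rows printed by train_df.head(); pass train_df in the notebook.
sample_df = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
})


def survival_by_age_band(df, edges=(0, 20, 40, 60, 100)):
    """Survival rate and passenger count per (Pclass, age band)."""
    labels = ['<20', '20-39', '40-59', '60+']  # assumed bands, left-inclusive
    banded = df.assign(
        AgeBand=pd.cut(df['Age'], bins=list(edges), labels=labels, right=False)
    )
    grouped = banded.groupby(['Pclass', 'AgeBand'], observed=True)['Survived']
    return grouped.agg(['mean', 'count']).reset_index()


rates = survival_by_age_band(sample_df)
print(rates)
```

Rows with a missing Age fall outside every band and are dropped by the groupby, so the counts on the full data will not add up to 891.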

Visualize fare distribution across passenger class


In [14]:
px.histogram(train_df, x='Fare', color='Pclass')



In [15]:
px.histogram(train_df, x='Fare', facet_col='Pclass', color='Survived', range_x=[0,200], nbins=50)


It is surprising that many 1st class passengers paid about the same as 2nd and 3rd class passengers, and they survived well. 3rd class passengers mostly paid a fare under 50, yet only a few survived. Thus, passenger class is a better indicator of survival than age or fare.
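
To check how much the fare ranges of the classes overlap, a per-class summary of Fare is a useful complement to the histograms. The following is a minimal sketch on the five rows printed by `train_df.head()`; on the real data you would run the same `groupby` on `train_df`.

```python
import pandas as pd

# The five rows printed by train_df.head(); pass train_df in the notebook.
sample_df = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Count, minimum, median and maximum fare within each passenger class.
fare_by_class = sample_df.groupby('Pclass')['Fare'].agg(
    ['count', 'min', 'median', 'max']
)
print(fare_by_class)
```

If the minimum fare of one class sits below the maximum of another, the fare ranges overlap, which is the pattern the paragraph above describes.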

What influence does gender play?

Let us first look at the distribution of age across the genders.


In [16]:
px.histogram(train_df, x='Age', color='Sex')


The age distributions of men and women look similar, but there are more men in just about every age group. Next, let us see how many men survived.


In [17]:
px.histogram(train_df, x='Age', color='Sex', facet_col='Survived')


Well, the men of the early 20th century were quite chivalrous. Most women survived, irrespective of their age. Next, let us see how survival by gender is linked with survival by passenger class.


In [18]:
px.histogram(train_df, x='Age', color='Sex', facet_col='Survived', facet_row='Pclass')


This is a detailed plot to unpack.

  • In first class, almost all women survived, but a lot of men died.
  • In second class, likewise, most women survived and few to no men survived.
  • In third class, a lot of men and women died. About half the women survived. Even kids died.
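
The three observations above can be condensed into a single table of survival rates by class and gender. This is a sketch, not the notebook's own code; `sample_df` is the five rows printed by `train_df.head()`, so one cell (1st class men) is empty here and shows up as NaN.

```python
import pandas as pd

# The five rows printed by train_df.head(); pass train_df in the notebook.
sample_df = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
})

# Mean of the 0/1 Survived flag = fraction of each group that survived.
survival_table = sample_df.pivot_table(
    values='Survived', index='Pclass', columns='Sex', aggfunc='mean'
)
print(survival_table)
```

Passing `aggfunc='count'` instead gives the group sizes, which helps judge how much weight each rate deserves.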

In [19]:
sns.countplot(x='Survived', hue='Pclass', data=train_df)


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2e80164748>

Impute for missing age


In [20]:
sns.heatmap(train_df[['Survived','Pclass']])


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2e8047d588>
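
One common way to impute a missing age is to fill it with the median age of the passenger's class. The sketch below illustrates that idea only; it is an assumption, not necessarily the approach this notebook takes. `impute_age_by_class` is a hypothetical helper, and the last age in `sample_df` has been blanked on purpose to simulate a gap (the other values are the rows printed by `train_df.head()`).

```python
import numpy as np
import pandas as pd

# Rows from train_df.head(); the last Age is deliberately set to NaN
# to simulate a missing value for this demonstration.
sample_df = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3],
    'Age': [22.0, 38.0, 26.0, 35.0, np.nan],
})


def impute_age_by_class(df):
    """Fill missing Age with the median Age of the same Pclass."""
    # transform() returns one median per row, aligned with df's index.
    class_median = df.groupby('Pclass')['Age'].transform('median')
    return df['Age'].fillna(class_median)


sample_df['AgeFilled'] = impute_age_by_class(sample_df)
print(sample_df)
```

The median is used rather than the mean so that a few very old or very young passengers do not pull the fill value.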

In [21]: