- Visualize missing data
- Visualize distributions and correlations
- Impute for missing age



In [1]:

    
!pip install plotly_express









    



Collecting plotly_express
  Downloading https://files.pythonhosted.org/packages/6d/13/749461981bc356fb71df247585b7e2c1848fb332ac1d728be15627941e19/plotly_express-0.2.2-py2.py3-none-any.whl (74kB)
    100% |████████████████████████████████| 81kB 3.7MB/s ta 0:00:011
Requirement already satisfied: patsy>=0.5 in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from plotly_express) (0.5.1)
Requirement already satisfied: statsmodels>=0.9.0 in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from plotly_express) (0.9.0)
Requirement already satisfied: scipy>=0.18 in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from plotly_express) (1.2.1)
Requirement already satisfied: numpy>=1.11 in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from plotly_express) (1.16.2)
Requirement already satisfied: pandas>=0.20.0 in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from plotly_express) (0.24.2)
Collecting plotly>=3.9.0 (from plotly_express)
  Downloading https://files.pythonhosted.org/packages/ff/75/3982bac5076d0ce6d23103c03840fcaec90c533409f9d82c19f54512a38a/plotly-3.10.0-py2.py3-none-any.whl (41.5MB)
    100% |████████████████████████████████| 41.5MB 1.7MB/s eta 0:00:01
Requirement already satisfied: six in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from patsy>=0.5->plotly_express) (1.12.0)
Requirement already satisfied: pytz>=2011k in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from pandas>=0.20.0->plotly_express) (2018.9)
Requirement already satisfied: python-dateutil>=2.5.0 in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from pandas>=0.20.0->plotly_express) (2.8.0)
Requirement already satisfied: decorator>=4.0.6 in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from plotly>=3.9.0->plotly_express) (4.4.0)
Requirement already satisfied: requests in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from plotly>=3.9.0->plotly_express) (2.21.0)
Collecting retrying>=1.3.3 (from plotly>=3.9.0->plotly_express)
Requirement already satisfied: nbformat>=4.2 in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from plotly>=3.9.0->plotly_express) (4.4.0)
Requirement already satisfied: certifi>=2017.4.17 in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from requests->plotly>=3.9.0->plotly_express) (2019.3.9)
Requirement already satisfied: idna<2.9,>=2.5 in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from requests->plotly>=3.9.0->plotly_express) (2.8)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from requests->plotly>=3.9.0->plotly_express) (3.0.4)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from requests->plotly>=3.9.0->plotly_express) (1.24.1)
Requirement already satisfied: ipython_genutils in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from nbformat>=4.2->plotly>=3.9.0->plotly_express) (0.2.0)
Requirement already satisfied: traitlets>=4.1 in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from nbformat>=4.2->plotly>=3.9.0->plotly_express) (4.3.2)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from nbformat>=4.2->plotly>=3.9.0->plotly_express) (3.0.1)
Requirement already satisfied: jupyter_core in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from nbformat>=4.2->plotly>=3.9.0->plotly_express) (4.4.0)
Requirement already satisfied: attrs>=17.4.0 in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2->plotly>=3.9.0->plotly_express) (19.1.0)
Requirement already satisfied: pyrsistent>=0.14.0 in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2->plotly>=3.9.0->plotly_express) (0.14.11)
Requirement already satisfied: setuptools in /Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2->plotly>=3.9.0->plotly_express) (40.8.0)
Installing collected packages: retrying, plotly, plotly-express
  Found existing installation: plotly 2.5.1
    Uninstalling plotly-2.5.1:
      Successfully uninstalled plotly-2.5.1
Successfully installed plotly-3.10.0 plotly-express-0.2.2 retrying-1.3.3



In [2]:

    
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly_express as px
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline



In [3]:

    
train_df = pd.read_csv('titanic_train.csv')
train_df.head()









    Out[3]:







  
    
      
	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S



In [4]:

    
test_df = pd.read_csv('titanic_test.csv')
test_df.head()









    Out[4]:







  
    
      
	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.8292	NaN	Q
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.0000	NaN	S
2	894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.6875	NaN	Q
3	895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.6625	NaN	S
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.2875	NaN	S



In [5]:

    
(train_df.shape, test_df.shape)









    Out[5]:





((891, 12), (418, 11))



In [6]:

    
gender_sub_df = pd.read_csv('titanic_gender_submission.csv')
gender_sub_df.head()









    Out[6]:







  
    
      
	PassengerId	Survived
0	892	0
1	893	1
2	894	0
3	895	0
4	896	1



In [7]:

    
gender_sub_df.shape









    Out[7]:





(418, 2)

Exploratory analysis

Let us look at the data type of the columns present in this data set.



In [8]:

    
# datatype of columns
train_df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
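
The non-null counts above can be turned into explicit missing-value counts. The following is a minimal sketch: `missing_summary` is a hypothetical helper name, and the demo frame holds only a few columns of the five rows shown by train_df.head(), so its numbers describe that sample rather than the full file.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Demo on the five head() rows shown earlier (subset of columns only).
sample_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", "S", "S"],
})
summary = missing_summary(sample_df)
print(summary)
```

Calling `missing_summary(train_df)` on the full training set should agree with the info() output: 891 - 714 = 177 missing ages, 891 - 204 = 687 missing cabins, and 891 - 889 = 2 missing embarkation ports.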

Visualize missing data



In [9]:

    
# find missing data
sns.heatmap(train_df.isnull(), yticklabels=False, cbar=False, cmap='viridis')









    Out[9]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f2e9c5432b0>

Age and Cabin have missing data (177 and 687 of 891 values, per the info() output above), and Embarked is missing 2 values. We will impute for these later.



In [10]:

    
import warnings
warnings.filterwarnings("ignore")

Visualize distributions and correlations



In [11]:

    
sns.pairplot(train_df[['Survived','Pclass','Age','Fare']], dropna=True)









    Out[11]:





<seaborn.axisgrid.PairGrid at 0x7f2e9c8322b0>

- Does age show any correlation with fare? Do older people pay more?
- Does passenger class show any correlation with fare? Does first class cost more?
- Do older passengers prefer a better class?
- Does passenger class, fare, age, or gender determine whether or not a passenger survived?
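
A correlation matrix is one quick way to put numbers on these questions before reading the plots. This is a rough sketch, not part of the original notebook: it uses Pearson correlation (the default of `DataFrame.corr`), and the demo frame contains only the five head() rows shown earlier, which is far too few to draw conclusions from.

```python
import pandas as pd

# The same four columns passed to sns.pairplot, taken from the five head() rows.
sample_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Pairwise Pearson correlation; rows with missing values are skipped per pair.
corr = sample_df.corr()
print(corr.round(2))
```

On the real data, `train_df[['Survived','Pclass','Age','Fare']].corr()` gives the full picture. A negative Pclass/Fare value means a lower class number (a better class) goes with a higher fare.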

Visualize distribution of class vs Age



In [12]:

    
plot1 = px.histogram(train_df, x='Age', color='Pclass')
plot1



In [13]:

    
px.histogram(train_df, x='Age', color='Survived', facet_col='Pclass')

Third class skews younger than 2nd and 1st class. First class has a mostly uniform age distribution, while 2nd class is centered around 30. Further, kids (younger than 20) are most likely to be in 2nd and 3rd class.

It's clear that most 3rd class passengers did not survive, whereas most 1st class passengers did. The majority of children in 2nd class survived.
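
The survival claims read off the histograms can be checked with a group-wise survival rate. The snippet below is a sketch under the assumption that Survived is coded 0/1 (as in the head() output), so its mean per group is the fraction that survived. The demo values are the five head() rows only.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
})

# Mean of a 0/1 column = survival rate; count = passengers in each class.
by_class = sample_df.groupby("Pclass")["Survived"].agg(["mean", "count"])
print(by_class)
```

Replacing `sample_df` with `train_df` gives the rates for all 891 passengers. Keeping the count next to the mean makes it obvious when a rate rests on only a handful of rows.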

Visualize fare distribution across passenger class



In [14]:

    
px.histogram(train_df, x='Fare', color='Pclass')



In [15]:

    
px.histogram(train_df, x='Fare', facet_col='Pclass', color='Survived', range_x=[0,200], nbins=50)

It is surprising that many 1st class passengers paid about the same as 2nd and 3rd class passengers, and they survived well. 3rd class passengers primarily paid a fare under 50, but only a few survived. Thus, passenger class is a better indicator of survival than age or fare.

What influence does gender play?

Let us first look at the distribution of age across the genders.



In [16]:

    
px.histogram(train_df, x='Age', color='Sex')

The distributions of male and female ages look similar, but there are more men in just about every age group. Next, let us see how many men survived.



In [17]:

    
px.histogram(train_df, x='Age', color='Sex', facet_col='Survived')

Well, the men of the early 20th century were quite chivalrous. Most women survived irrespective of their age. Next, let us see how survival by gender is linked with survival by passenger class.



In [18]:

    
px.histogram(train_df, x='Age', color='Sex', facet_col='Survived', facet_row='Pclass')

This is a detailed plot to unpack.

- In first class, almost all women survived, but a lot of men died.
- In second class, most women likewise survived, and few to no men did.
- In third class, a lot of men and women died. About half the women survived. Even kids died.
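
The class-by-gender reading above can also be laid out as a small table of survival rates. This is an illustrative sketch using `DataFrame.pivot_table`, with only the five head() rows as demo data, so some cells are empty (NaN) simply because the sample has no passengers in that group.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
})

# Rows: passenger class, columns: gender, cells: fraction that survived.
rates = sample_df.pivot_table(
    index="Pclass", columns="Sex", values="Survived", aggfunc="mean"
)
print(rates)
```

Run against `train_df`, each cell is the survival rate for one class and gender combination, which is the same breakdown the faceted histogram shows visually.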



In [19]:

    
sns.countplot(x='Survived', hue='Pclass', data=train_df)









    Out[19]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f2e80164748>

Impute for missing age



In [20]:

    
sns.heatmap(train_df[['Survived','Pclass']])









    Out[20]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f2e8047d588>
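
The notebook stops before the imputation itself. One common approach, shown here as an assumption rather than as the author's method, is to fill each missing Age with the median age of that passenger's class. `impute_age_by_class` is a hypothetical helper, and the toy frame uses made-up ages purely to show the mechanics.

```python
import numpy as np
import pandas as pd


def impute_age_by_class(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with missing Age filled by the median Age of the row's Pclass."""
    out = df.copy()
    # transform() broadcasts each class median back onto the rows of that class.
    class_median = out.groupby("Pclass")["Age"].transform("median")
    out["Age"] = out["Age"].fillna(class_median)
    return out


# Toy values, NOT taken from the Titanic file.
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, 30.0, np.nan],
})
filled = impute_age_by_class(toy_df)
print(filled)
```

Working on a copy leaves the input untouched, so `train_df` could be compared before and after with `impute_age_by_class(train_df)`. Whether the class median is the right fill value is a modeling choice. The age histograms by class earlier in the notebook are the motivation for using it here.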




Table of Contents