In [1]:
!pip install plotly_express


Successfully installed plotly-3.10.0 plotly-express-0.2.2 retrying-1.3.3

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly_express as px
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
train_df= pd.read_csv('titanic_train.csv')
train_df.head()


Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [4]:
test_df = pd.read_csv('titanic_test.csv')
test_df.head()


Out[4]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S

In [5]:
(train_df.shape, test_df.shape)


Out[5]:
((891, 12), (418, 11))

In [6]:
gender_sub_df = pd.read_csv('titanic_gender_submission.csv')
gender_sub_df.head()


Out[6]:
PassengerId Survived
0 892 0
1 893 1
2 894 0
3 895 0
4 896 1

In [7]:
gender_sub_df.shape


Out[7]:
(418, 2)

Exploratory analysis

Let us look at the data types and non-null counts of the columns in this data set.


In [8]:
# datatype of columns
train_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
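
The non-null counts above imply gaps in some columns. A minimal sketch of tabulating those gaps directly is shown here; missing_summary is a hypothetical helper (not part of pandas), and the small frame with made-up values only stands in for train_df.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have gaps
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)

# Toy stand-in for train_df (values are made up for illustration)
demo_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.93, 53.10],
})
summary = missing_summary(demo_df)
print(summary)
```

Applied to train_df, the counts should match the info() output: 891 - 204 = 687 missing for Cabin, 891 - 714 = 177 for Age, and 891 - 889 = 2 for Embarked.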

Visualize missing data


In [9]:
# find missing data
sns.heatmap(train_df.isnull(), yticklabels=False, cbar=False, cmap='viridis')


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2e9c5432b0>

Age and Cabin have missing data, and the info() output above shows that Embarked is missing 2 values as well. We will impute these later.
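
As a rough illustration of what imputation could look like, the sketch below fills Age with the median age of each passenger class and reduces Cabin to a has-cabin flag. This is one common approach and an assumption on my part, not necessarily the method used later in this notebook; impute_sketch and the HasCabin column are hypothetical names, and the demo values are made up.

```python
import numpy as np
import pandas as pd

def impute_sketch(df):
    """Fill Age with the median age of each passenger class; flag Cabin presence."""
    out = df.copy()
    # Median age per Pclass, broadcast back to every row of that class
    class_median = out.groupby('Pclass')['Age'].transform('median')
    out['Age'] = out['Age'].fillna(class_median)
    # Cabin is mostly empty, so keep only an indicator of whether it was recorded
    out['HasCabin'] = out['Cabin'].notnull().astype(int)
    return out.drop(columns=['Cabin'])

# Toy stand-in for train_df (values are made up for illustration)
demo_df = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age': [38.0, np.nan, 40.0, 22.0, 26.0, np.nan],
    'Cabin': ['C85', np.nan, 'C123', np.nan, np.nan, np.nan],
})
imputed = impute_sketch(demo_df)
print(imputed)
```

Grouping by Pclass rather than using one global median is a design choice: it assumes age differs by class, which the plots in the following cells can help confirm or refute.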


In [10]:
import warnings
warnings.filterwarnings("ignore")

Visualize distributions and correlations


In [11]:
sns.pairplot(train_df[['Survived','Pclass','Age','Fare']], dropna=True)


Out[11]:
<seaborn.axisgrid.PairGrid at 0x7f2e9c8322b0>
  • Does Age show any correlation with Fare? Do older people pay more?
  • Does passenger class show any correlation with Fare? Does first class cost more?
  • Do older passengers prefer a better class?
  • Do passenger class, fare, age, or gender determine whether or not a passenger survived?
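
The questions above can also be checked numerically alongside the pair plot. The following is a minimal sketch, assuming a frame with the same column names as train_df; survival_rate_by is a hypothetical helper and the demo values are made up. Pearson correlation on a binary column such as Survived or an ordinal one such as Pclass is only a rough indicator.

```python
import pandas as pd

def survival_rate_by(df, column):
    """Fraction of passengers who survived within each group of `column`."""
    return df.groupby(column)['Survived'].mean()

# Toy stand-in for train_df (values are made up for illustration)
demo_df = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0, 0],
    'Pclass':   [3, 1, 3, 1, 3, 2],
    'Sex':      ['male', 'female', 'female', 'female', 'male', 'male'],
    'Age':      [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
    'Fare':     [7.25, 71.28, 7.93, 53.10, 8.05, 13.00],
})

# Pairwise Pearson correlations between the numeric columns in the pair plot
corr = demo_df[['Survived', 'Pclass', 'Age', 'Fare']].corr()
print(corr)

# Survival rate split by a categorical column
by_sex = survival_rate_by(demo_df, 'Sex')
by_class = survival_rate_by(demo_df, 'Pclass')
print(by_sex)
print(by_class)
```

On the real data, replacing demo_df with train_df gives the figures to compare against what the pair plot suggests.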

Visualize distribution of Age by passenger class


In [9]:
plot1 = px.histogram(train_df, x='Age', color='Pclass')
plot1