# Logistic Regression with Python

For this lecture we will be working with the Titanic Data Set from Kaggle. This is a very famous data set and very often is a student's first step in machine learning!

We'll be trying to predict a classification: survival or deceased. Let's begin our understanding of implementing Logistic Regression in Python for classification.

We'll use a "semi-cleaned" version of the Titanic data set. If you use the data set hosted directly on Kaggle, you may need to do some additional cleaning not shown in this lecture notebook.

## Import Libraries

Let's import some libraries to get started!

``````

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

``````

## The Data

Let's start by reading in the titanic_train.csv file into a pandas dataframe.

``````

In [2]:

train = pd.read_csv('titanic_train.csv')

``````
``````

In [3]:

train.head(25)

``````
``````

Out[3]:

|    | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|----|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 0  | 1  | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1  | 2  | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2  | 3  | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3  | 4  | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4  | 5  | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 5  | 6  | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6  | 7  | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7  | 8  | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 8  | 9  | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9  | 10 | 1 | 2 |  | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
| 11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
| 12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S |
| 13 | 14 | 0 | 3 |  | male | 39.0 | 1 | 5 | 347082 | 31.2750 | NaN | S |
| 14 | 15 | 0 | 3 |  | female | 14.0 | 0 | 0 | 350406 | 7.8542 | NaN | S |
| 15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0000 | NaN | S |
| 16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | NaN | Q |
| 17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S |
| 18 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... | female | 31.0 | 1 | 0 | 345763 | 18.0000 | NaN | S |
| 19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C |
| 20 | 21 | 0 | 2 | Fynney, Mr. Joseph J | male | 35.0 | 0 | 0 | 239865 | 26.0000 | NaN | S |
| 21 | 22 | 1 | 2 | Beesley, Mr. Lawrence | male | 34.0 | 0 | 0 | 248698 | 13.0000 | D56 | S |
| 22 | 23 | 1 | 3 | McGowan, Miss. Anna "Annie" | female | 15.0 | 0 | 0 | 330923 | 8.0292 | NaN | Q |
| 23 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.0 | 0 | 0 | 113788 | 35.5000 | A6 | S |
| 24 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
``````

# Exploratory Data Analysis

Let's begin some exploratory data analysis! We'll start by checking out missing data!

## Missing Data

We can use seaborn to create a simple heatmap to see where we are missing data!

``````

In [4]:

sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

``````
``````

Out[4]:

<matplotlib.axes._subplots.AxesSubplot at 0x1a166095f8>

``````

Roughly 20 percent of the Age data is missing. That proportion is likely small enough to fill in reasonably with some form of imputation. Looking at the Cabin column, though, we are missing too much of that data to do anything useful with it at a basic level. We'll probably drop it later, or change it to another feature like "Cabin Known: 1 or 0".
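The missing-data fractions, and the "Cabin Known: 1 or 0" idea, can be sketched on a toy frame (a minimal sketch; with the notebook's `train` the same calls apply):

``````python
import pandas as pd
import numpy as np

# Toy stand-in for `train`, just to show the calls
df = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, np.nan, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan, np.nan],
})

# Fraction of missing values per column
missing_frac = df.isnull().mean()

# The "Cabin Known: 1 or 0" idea as a new feature
df['CabinKnown'] = df['Cabin'].notnull().astype(int)
``````

On the real data, `train.isnull().mean()` gives the ~20% Age figure directly, without eyeballing the heatmap.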

Let's continue on by visualizing some more of the data! Check out the video for full explanations of these plots; this code is just to serve as a reference.

``````

In [5]:

sns.set_style('whitegrid')
sns.countplot(x='Survived',data=train,palette='RdBu_r')

``````
``````

Out[5]:

<matplotlib.axes._subplots.AxesSubplot at 0x1a16bbd240>

``````
``````

In [9]:

# sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')

``````
``````

Out[9]:

<matplotlib.axes._subplots.AxesSubplot at 0x2550eaab748>

``````
``````

In [10]:

# sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Pclass',data=train,palette='rainbow')

``````
``````

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x2550fbbb358>

``````
``````

In [11]:

sns.distplot(train['Age'].dropna(),kde=False,color='darkred',bins=30)

``````
``````

Out[11]:

<matplotlib.axes._subplots.AxesSubplot at 0x2550fc64748>

``````
``````

In [12]:

train['Age'].hist(bins=30,color='darkred',alpha=0.7)

``````
``````

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x2550fde4b70>

``````
``````

In [13]:

sns.countplot(x='SibSp',data=train)

``````
``````

Out[13]:

<matplotlib.axes._subplots.AxesSubplot at 0x2550ff16208>

``````
``````

In [14]:

train['Fare'].hist(color='green',bins=40,figsize=(8,4))

``````
``````

Out[14]:

<matplotlib.axes._subplots.AxesSubplot at 0x2550ff817b8>

``````

Let's take a quick moment to show an example of Plotly Express!

``````

In [15]:

import plotly_express as pex

``````
``````

In [17]:

pex.histogram(data_frame=train, x='Fare', nbins=30)

``````
``````

[Interactive Plotly histogram of Fare (30 bins) rendered here]

``````

## Data Cleaning

We want to fill in the missing age data instead of just dropping the rows with missing ages. One way to do this is by filling in the mean age of all the passengers (imputation). However, we can be smarter about this and check the average age by passenger class. For example:

``````

In [6]:

plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=train,palette='winter')

``````
``````

Out[6]:

<matplotlib.axes._subplots.AxesSubplot at 0x1a16bfa7b8>

``````

We can see the wealthier passengers in the higher classes tend to be older, which makes sense. We'll use these average age values to impute Age based on Pclass.
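The per-class averages come straight out of a `groupby`. A minimal sketch on a toy frame (with the real `train`, the rounded class means land near 37, 29 and 24, which are the hard-coded values used for imputation below):

``````python
import pandas as pd
import numpy as np

# Toy stand-in for `train`, chosen so the class means match 37/29/24
df = pd.DataFrame({'Pclass': [1, 1, 2, 2, 3, 3],
                   'Age':    [40.0, 34.0, 30.0, 28.0, 24.0, np.nan]})

# Mean age per passenger class (NaN ages are skipped automatically)
class_means = df.groupby('Pclass')['Age'].mean()
``````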

``````

In [7]:

def impute_age(cols):
    Age = cols[0]
    Pclass = cols[1]

    if pd.isnull(Age):

        if Pclass == 1:
            return 37

        elif Pclass == 2:
            return 29

        else:
            return 24

    else:
        return Age

``````

Now apply that function!

``````

In [8]:

train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)

``````
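As an aside, the same imputation can be done without `apply`, by mapping Pclass to the hard-coded ages and using `fillna` (a vectorized sketch, not the lecture's approach):

``````python
import pandas as pd
import numpy as np

# Toy stand-in for `train` with some missing ages
df = pd.DataFrame({'Pclass': [1, 2, 3, 1],
                   'Age':    [np.nan, np.nan, np.nan, 37.0]})

# Fill each missing Age from its class's value; existing ages are untouched
class_age = {1: 37, 2: 29, 3: 24}
df['Age'] = df['Age'].fillna(df['Pclass'].map(class_age))
``````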

Now let's check that heat map again!

``````

In [9]:

sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

``````
``````

Out[9]:

<matplotlib.axes._subplots.AxesSubplot at 0x1a16c95908>

``````

Great! Let's go ahead and drop the Cabin column and the row in Embarked that is NaN.

``````

In [10]:

train.drop('Cabin',axis=1,inplace=True)

``````
``````

In [11]:

train.head(50)

``````
``````

Out[11]:

|    | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|----|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|----------|
| 0  | 1  | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S |
| 1  | 2  | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 2  | 3  | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S |
| 3  | 4  | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S |
| 4  | 5  | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S |
| 5  | 6  | 0 | 3 | Moran, Mr. James | male | 24.0 | 0 | 0 | 330877 | 8.4583 | Q |
| 6  | 7  | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | S |
| 7  | 8  | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | S |
| 8  | 9  | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | S |
| 9  | 10 | 1 | 2 |  | female | 14.0 | 1 | 0 | 237736 | 30.0708 | C |
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | S |
| 11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | S |
| 12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | S |
| 13 | 14 | 0 | 3 |  | male | 39.0 | 1 | 5 | 347082 | 31.2750 | S |
| 14 | 15 | 0 | 3 |  | female | 14.0 | 0 | 0 | 350406 | 7.8542 | S |
| 15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0000 | S |
| 16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | Q |
| 17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | 29.0 | 0 | 0 | 244373 | 13.0000 | S |
| 18 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... | female | 31.0 | 1 | 0 | 345763 | 18.0000 | S |
| 19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | 24.0 | 0 | 0 | 2649 | 7.2250 | C |
| 20 | 21 | 0 | 2 | Fynney, Mr. Joseph J | male | 35.0 | 0 | 0 | 239865 | 26.0000 | S |
| 21 | 22 | 1 | 2 | Beesley, Mr. Lawrence | male | 34.0 | 0 | 0 | 248698 | 13.0000 | S |
| 22 | 23 | 1 | 3 | McGowan, Miss. Anna "Annie" | female | 15.0 | 0 | 0 | 330923 | 8.0292 | Q |
| 23 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.0 | 0 | 0 | 113788 | 35.5000 | S |
| 24 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.0 | 3 | 1 | 349909 | 21.0750 | S |
| 25 | 26 | 1 | 3 | Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... | female | 38.0 | 1 | 5 | 347077 | 31.3875 | S |
| 26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | 24.0 | 0 | 0 | 2631 | 7.2250 | C |
| 27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.0 | 3 | 2 | 19950 | 263.0000 | S |
| 28 | 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | 24.0 | 0 | 0 | 330959 | 7.8792 | Q |
| 29 | 30 | 0 | 3 | Todoroff, Mr. Lalio | male | 24.0 | 0 | 0 | 349216 | 7.8958 | S |
| 30 | 31 | 0 | 1 | Uruchurtu, Don. Manuel E | male | 40.0 | 0 | 0 | PC 17601 | 27.7208 | C |
| 31 | 32 | 1 | 1 | Spencer, Mrs. William Augustus (Marie Eugenie) | female | 37.0 | 1 | 0 | PC 17569 | 146.5208 | C |
| 32 | 33 | 1 | 3 | Glynn, Miss. Mary Agatha | female | 24.0 | 0 | 0 | 335677 | 7.7500 | Q |
| 33 | 34 | 0 | 2 |  | male | 66.0 | 0 | 0 | C.A. 24579 | 10.5000 | S |
| 34 | 35 | 0 | 1 | Meyer, Mr. Edgar Joseph | male | 28.0 | 1 | 0 | PC 17604 | 82.1708 | C |
| 35 | 36 | 0 | 1 | Holverson, Mr. Alexander Oskar | male | 42.0 | 1 | 0 | 113789 | 52.0000 | S |
| 36 | 37 | 1 | 3 | Mamee, Mr. Hanna | male | 24.0 | 0 | 0 | 2677 | 7.2292 | C |
| 37 | 38 | 0 | 3 | Cann, Mr. Ernest Charles | male | 21.0 | 0 | 0 | A./5. 2152 | 8.0500 | S |
| 38 | 39 | 0 | 3 | Vander Planke, Miss. Augusta Maria | female | 18.0 | 2 | 0 | 345764 | 18.0000 | S |
| 39 | 40 | 1 | 3 | Nicola-Yarred, Miss. Jamila | female | 14.0 | 1 | 0 | 2651 | 11.2417 | C |
| 40 | 41 | 0 | 3 | Ahlin, Mrs. Johan (Johanna Persdotter Larsson) | female | 40.0 | 1 | 0 | 7546 | 9.4750 | S |
| 41 | 42 | 0 | 2 | Turpin, Mrs. William John Robert (Dorothy Ann ... | female | 27.0 | 1 | 0 | 11668 | 21.0000 | S |
| 42 | 43 | 0 | 3 | Kraeff, Mr. Theodor | male | 24.0 | 0 | 0 | 349253 | 7.8958 | C |
| 43 | 44 | 1 | 2 | Laroche, Miss. Simonne Marie Anne Andree | female | 3.0 | 1 | 2 | SC/Paris 2123 | 41.5792 | C |
| 44 | 45 | 1 | 3 | Devaney, Miss. Margaret Delia | female | 19.0 | 0 | 0 | 330958 | 7.8792 | Q |
| 45 | 46 | 0 | 3 | Rogers, Mr. William John | male | 24.0 | 0 | 0 | S.C./A.4. 23567 | 8.0500 | S |
| 46 | 47 | 0 | 3 | Lennon, Mr. Denis | male | 24.0 | 1 | 0 | 370371 | 15.5000 | Q |
| 47 | 48 | 1 | 3 | O'Driscoll, Miss. Bridget | female | 24.0 | 0 | 0 | 14311 | 7.7500 | Q |
| 48 | 49 | 0 | 3 | Samaan, Mr. Youssef | male | 24.0 | 2 | 0 | 2662 | 21.6792 | C |
| 49 | 50 | 0 | 3 | Arnold-Franchi, Mrs. Josef (Josefine Franchi) | female | 18.0 | 1 | 0 | 349237 | 17.8000 | S |
``````
``````

In [12]:

train.shape

``````
``````

Out[12]:

(891, 11)

``````
``````

In [13]:

train.dropna(inplace=True)

``````
``````

In [14]:

train.shape

``````
``````

Out[14]:

(889, 11)

``````

## Converting Categorical Features

We'll need to convert categorical features to dummy variables using pandas! Otherwise our machine learning algorithm won't be able to take those features in directly as inputs.
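As a quick illustration of why `drop_first=True` matters below (a minimal sketch on a toy Series): with k categories, keeping all k dummy columns makes them perfectly collinear, the so-called dummy variable trap, so one column is dropped.

``````python
import pandas as pd

s = pd.Series(['male', 'female', 'female', 'male'], name='Sex')

# drop_first=True drops the first category ('female'), keeping one
# column per remaining category; 'female' is implied when male == 0
dummies = pd.get_dummies(s, drop_first=True)
``````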

``````

In [15]:

train.info()

``````
``````

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    889 non-null int64
Survived       889 non-null int64
Pclass         889 non-null int64
Name           889 non-null object
Sex            889 non-null object
Age            889 non-null float64
SibSp          889 non-null int64
Parch          889 non-null int64
Ticket         889 non-null object
Fare           889 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 83.3+ KB

``````
``````

In [16]:

sex = pd.get_dummies(train['Sex'],drop_first=True)
embark = pd.get_dummies(train['Embarked'],drop_first=True)

``````
``````

In [17]:

embark.head()

``````
``````

Out[17]:

|   | Q | S |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 0 |
| 2 | 0 | 1 |
| 3 | 0 | 1 |
| 4 | 0 | 1 |

``````
``````

In [18]:

sex.head()

``````
``````

Out[18]:

|   | male |
|---|------|
| 0 | 1 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 1 |

``````
``````

In [19]:

train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)

``````
``````

In [20]:

train = pd.concat([train,sex,embark],axis=1)

``````
``````

In [21]:

train.head()

``````
``````

Out[21]:

|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | male | Q | S |
|---|-------------|----------|--------|-----|-------|-------|------|------|---|---|
| 0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 1 | 0 | 1 |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 |
| 2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 0 | 0 | 1 |
| 4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 | 1 | 0 | 1 |
``````

Great! Our data is ready for our model!

# Building a Logistic Regression model

Let's start by splitting our data into a training set and test set (there is another test.csv file that you can play around with in case you want to use all this data for training).

## Train Test Split

``````

In [22]:

from sklearn.model_selection import train_test_split

``````
``````

In [23]:

X_train, X_test, y_train, y_test = train_test_split(train.drop('Survived',axis=1),
                                                    train['Survived'], test_size=0.30,
                                                    random_state=101)

``````

## Training and Predicting

``````

In [24]:

from sklearn.linear_model import LogisticRegression

``````
``````

In [29]:

logmodel = LogisticRegression()
logmodel.verbose = 1

``````
``````

In [30]:

logmodel.fit(X_train,y_train)

``````
``````

[LibLinear]

/Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)

Out[30]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='warn',
n_jobs=None, penalty='l2', random_state=None, solver='warn',
tol=0.0001, verbose=1, warm_start=False)

``````
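The FutureWarning above comes from not specifying a solver. Passing one explicitly silences it (a sketch; `'liblinear'` matches what older scikit-learn used by default):

``````python
from sklearn.linear_model import LogisticRegression

# Naming the solver avoids the "Default solver will be changed" warning
logmodel = LogisticRegression(solver='liblinear')
``````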
``````

In [31]:

logmodel.coef_

``````
``````

Out[31]:

array([[ 4.10170317e-04, -7.83334719e-01, -2.61257205e-02,
-2.09907780e-01, -9.55518385e-02,  4.63201983e-03,
-2.33696636e+00, -1.21716646e-02, -2.02780740e-01]])

``````
``````

In [32]:

logmodel.intercept_

``````
``````

Out[32]:

array([3.36140356])

``````
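The nine coefficients in `coef_` line up with the columns of `X_train` in order. Each is a change in log-odds, so exponentiating gives an odds ratio. A minimal sketch on a toy fit (the frame and labels here are made up for illustration; with the notebook's model, use `logmodel` and `X_train.columns` instead):

``````python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny illustrative data: males labelled 0, females (higher fares) labelled 1
X = pd.DataFrame({'male': [1, 0, 1, 0, 1, 0],
                  'Fare': [7.3, 71.3, 8.1, 53.1, 8.5, 30.1]})
y = [0, 1, 0, 1, 0, 1]

model = LogisticRegression(solver='liblinear').fit(X, y)

# Pair each coefficient with its feature name and exponentiate to odds ratios
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns)
``````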
``````

In [52]:

predictions = logmodel.predict(X_test)

``````

Let's move on to evaluate our model!

## Evaluation

We can check precision, recall, and f1-score using a classification report!

``````

In [42]:

from sklearn.metrics import classification_report

``````
``````

In [43]:

print(classification_report(y_test,predictions))

``````
``````

precision    recall  f1-score   support

0       0.81      0.93      0.86       163
1       0.85      0.65      0.74       104

avg / total       0.82      0.82      0.81       267

``````
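A confusion matrix is a useful companion to the classification report. A minimal sketch on made-up labels (in the notebook, pass `y_test` and `predictions`):

``````python
from sklearn.metrics import confusion_matrix, accuracy_score

# Illustrative labels only, to show the shape of the output
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
``````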

Not so bad! You might want to explore more feature engineering, as well as the other titanic_test.csv file. Some suggestions for feature engineering:

• Try grabbing the Title (Dr., Mr., Mrs., etc.) from the name as a feature
• Maybe the Cabin letter could be a feature
• Is there any info you can get from the ticket?
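The first suggestion can be sketched with a regex over the Name column; the title sits between the comma and the period (sample names below are from the data shown earlier):

``````python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
])

# Capture the text between ", " and the next "." (e.g. Mr, Mrs, Miss)
titles = names.str.extract(r',\s*([^\.]+)\.', expand=False)
``````

On the real data you would then dummy-encode `titles` like Sex and Embarked above.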

## Validate against test dataset

``````

In [38]:

test_df = pd.read_csv('titanic_test.csv')
test_df.head()

``````
``````

Out[38]:

|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|-------------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
``````
``````

In [39]:

test_df.shape

``````
``````

Out[39]:

(418, 11)

``````
``````

In [48]:

test_df.iloc[0]

``````
``````

Out[48]:

PassengerId                 892
Pclass                        3
Name           Kelly, Mr. James
Sex                        male
Age                        34.5
SibSp                         0
Parch                         0
Ticket                   330911
Fare                     7.8292
Cabin                       NaN
Embarked                      Q
Name: 0, dtype: object

``````
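To actually predict on the test file, the same cleaning steps from training must be repeated so the columns line up with what the model was fit on. A minimal sketch on a two-row stand-in for `test_df` (values taken from the head shown above):

``````python
import pandas as pd

# Two-row stand-in for the real test_df
test_df = pd.DataFrame({
    'PassengerId': [892, 893], 'Pclass': [3, 3],
    'Name': ['Kelly, Mr. James', 'Wilkes, Mrs. James (Ellen Needs)'],
    'Sex': ['male', 'female'], 'Age': [34.5, 47.0],
    'SibSp': [0, 1], 'Parch': [0, 0],
    'Ticket': ['330911', '363272'], 'Fare': [7.8292, 7.0],
    'Cabin': [None, None], 'Embarked': ['Q', 'S'],
})

# Repeat the training-set cleaning: impute Age by class, drop Cabin,
# dummy-encode Sex and Embarked, drop the text columns
class_age = {1: 37, 2: 29, 3: 24}
test_df['Age'] = test_df['Age'].fillna(test_df['Pclass'].map(class_age))
test_df.drop('Cabin', axis=1, inplace=True)

sex = pd.get_dummies(test_df['Sex'], drop_first=True)
embark = pd.get_dummies(test_df['Embarked'], drop_first=True)
test_df.drop(['Sex', 'Embarked', 'Name', 'Ticket'], axis=1, inplace=True)
test_df = pd.concat([test_df, sex, embark], axis=1)
``````

One caveat: if the test set lacks a category seen in training (here the stand-in has no 'C' embarkations), `get_dummies` produces fewer columns, so in practice reindex with `test_df.reindex(columns=X_train.columns, fill_value=0)` before calling `logmodel.predict(test_df)`.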
``````

In [ ]:

``````