Logistic Regression with Python

For this lecture we will be working with the Titanic data set from Kaggle. It is a very well-known data set and is often a student's first step in machine learning!

We'll be trying to predict a binary class: survived or deceased. Let's begin by implementing logistic regression in Python for classification.

We'll use a "semi-cleaned" version of the Titanic data set. If you use the data set hosted directly on Kaggle, you may need to do some additional cleaning not shown in this lecture notebook.

Import Libraries

Let's import some libraries to get started!


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The Data

Let's start by reading in the titanic_train.csv file into a pandas dataframe.


In [2]:
train = pd.read_csv('titanic_train.csv')

In [3]:
train.head(25)


Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
22 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 NaN Q
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S

Exploratory Data Analysis

Let's begin some exploratory data analysis! We'll start by checking out missing data!

Missing Data

We can use seaborn to create a simple heatmap to see where we are missing data!


In [4]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')


Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a166095f8>

Roughly 20 percent of the Age data is missing. That proportion is likely small enough for reasonable replacement with some form of imputation. The Cabin column, on the other hand, is missing too much data to be useful at a basic level. We'll probably drop it later, or change it to another feature like "Cabin Known: 1 or 0".
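The heatmap gives a visual impression; the same information can also be read off numerically. Here is a minimal sketch, using a small made-up frame (`toy`) in place of `train` and a hypothetical `CabinKnown` column name for the 1/0 feature mentioned above:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train`; the values are illustrative, not the real data.
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, 35.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan],
    'Fare':  [7.25, 71.28, 7.93, 53.10, 8.05],
})

# Fraction of missing values per column:
# isnull() gives booleans, and mean() averages them column by column.
missing_fraction = toy.isnull().mean()
print(missing_fraction)

# Hypothetical "Cabin Known" indicator: 1 if a cabin is recorded, 0 otherwise.
toy['CabinKnown'] = toy['Cabin'].notnull().astype(int)
print(toy[['Cabin', 'CabinKnown']])
```

Running `train.isnull().mean()` on the real frame would give the exact share of missing Age and Cabin values behind the "roughly 20 percent" estimate.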

Let's continue by visualizing some more of the data! Check out the video for full explanations of these plots; this code just serves as a reference.


In [5]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',data=train,palette='RdBu_r')


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a16bbd240>

In [9]:
# sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x2550eaab748>

In [10]:
# sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Pclass',data=train,palette='rainbow')


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x2550fbbb358>

In [11]:
# Note: distplot is deprecated in newer seaborn releases; sns.histplot is its replacement there.
sns.distplot(train['Age'].dropna(),kde=False,color='darkred',bins=30)


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x2550fc64748>

In [12]:
train['Age'].hist(bins=30,color='darkred',alpha=0.7)


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x2550fde4b70>

In [13]:
sns.countplot(x='SibSp',data=train)


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x2550ff16208>

In [14]:
train['Fare'].hist(color='green',bins=40,figsize=(8,4))


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x2550ff817b8>


Let's take a quick moment to show an example of an interactive histogram with plotly_express!


In [15]:
import plotly_express as pex

In [17]:
pex.histogram(data_frame=train, x='Fare', nbins=30)


Data Cleaning

We want to fill in the missing Age values instead of just dropping those rows. One way to do this is to fill in the mean age of all the passengers (imputation). However, we can be smarter about this and check the average age by passenger class. For example:


In [6]:
plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=train,palette='winter')


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a16bfa7b8>

We can see the wealthier passengers in the higher classes tend to be older, which makes sense. We'll use these average age values to impute based on Pclass for Age.


In [7]:
def impute_age(cols):
    # Access by label: integer keys on a labelled Series are deprecated in recent pandas.
    Age = cols['Age']
    Pclass = cols['Pclass']
    
    if pd.isnull(Age):

        if Pclass == 1:
            return 37

        elif Pclass == 2:
            return 29

        else:
            return 24

    else:
        return Age

Now apply that function!


In [8]:
train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
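The hard-coded values 37, 29 and 24 were read off the box plot. As an alternative sketch (not the approach used in the lecture), the per-class value can be computed from the data and filled in with `groupby` and `transform`. The toy frame and the choice of the median are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train`, with one missing Age in each class.
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 3, 3, 3],
    'Age':    [38.0, 36.0, np.nan, 30.0, np.nan, 20.0, 28.0, np.nan],
})

# Median age within each class, broadcast back to the original row order.
class_median = toy.groupby('Pclass')['Age'].transform('median')

# Fill only the missing ages; known ages are left unchanged.
toy['Age'] = toy['Age'].fillna(class_median)
print(toy)
```

This avoids a row-by-row `apply` and keeps the imputed values in sync with the data if the rows change.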

Now let's check that heat map again!


In [9]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a16c95908>

Great! Let's go ahead and drop the Cabin column and the rows where Embarked is NaN.


In [10]:
train.drop('Cabin',axis=1,inplace=True)

In [11]:
train.head(50)


Out[11]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S
5 6 0 3 Moran, Mr. James male 24.0 0 0 330877 8.4583 Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 Q
17 18 1 2 Williams, Mr. Charles Eugene male 29.0 0 0 244373 13.0000 S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 S
19 20 1 3 Masselmani, Mrs. Fatima female 24.0 0 0 2649 7.2250 C
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 S
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 S
22 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 Q
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 S
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 S
26 27 0 3 Emir, Mr. Farred Chehab male 24.0 0 0 2631 7.2250 C
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 S
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female 24.0 0 0 330959 7.8792 Q
29 30 0 3 Todoroff, Mr. Lalio male 24.0 0 0 349216 7.8958 S
30 31 0 1 Uruchurtu, Don. Manuel E male 40.0 0 0 PC 17601 27.7208 C
31 32 1 1 Spencer, Mrs. William Augustus (Marie Eugenie) female 37.0 1 0 PC 17569 146.5208 C
32 33 1 3 Glynn, Miss. Mary Agatha female 24.0 0 0 335677 7.7500 Q
33 34 0 2 Wheadon, Mr. Edward H male 66.0 0 0 C.A. 24579 10.5000 S
34 35 0 1 Meyer, Mr. Edgar Joseph male 28.0 1 0 PC 17604 82.1708 C
35 36 0 1 Holverson, Mr. Alexander Oskar male 42.0 1 0 113789 52.0000 S
36 37 1 3 Mamee, Mr. Hanna male 24.0 0 0 2677 7.2292 C
37 38 0 3 Cann, Mr. Ernest Charles male 21.0 0 0 A./5. 2152 8.0500 S
38 39 0 3 Vander Planke, Miss. Augusta Maria female 18.0 2 0 345764 18.0000 S
39 40 1 3 Nicola-Yarred, Miss. Jamila female 14.0 1 0 2651 11.2417 C
40 41 0 3 Ahlin, Mrs. Johan (Johanna Persdotter Larsson) female 40.0 1 0 7546 9.4750 S
41 42 0 2 Turpin, Mrs. William John Robert (Dorothy Ann ... female 27.0 1 0 11668 21.0000 S
42 43 0 3 Kraeff, Mr. Theodor male 24.0 0 0 349253 7.8958 C
43 44 1 2 Laroche, Miss. Simonne Marie Anne Andree female 3.0 1 2 SC/Paris 2123 41.5792 C
44 45 1 3 Devaney, Miss. Margaret Delia female 19.0 0 0 330958 7.8792 Q
45 46 0 3 Rogers, Mr. William John male 24.0 0 0 S.C./A.4. 23567 8.0500 S
46 47 0 3 Lennon, Mr. Denis male 24.0 1 0 370371 15.5000 Q
47 48 1 3 O'Driscoll, Miss. Bridget female 24.0 0 0 14311 7.7500 Q
48 49 0 3 Samaan, Mr. Youssef male 24.0 2 0 2662 21.6792 C
49 50 0 3 Arnold-Franchi, Mrs. Josef (Josefine Franchi) female 18.0 1 0 349237 17.8000 S

In [12]:
train.shape


Out[12]:
(891, 11)

In [13]:
train.dropna(inplace=True)

In [14]:
train.shape


Out[14]:
(889, 11)

Converting Categorical Features

We'll need to convert categorical features to dummy variables using pandas! Otherwise our machine learning algorithm won't be able to directly take in those features as inputs.
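The cells that follow pass `drop_first=True` to `pd.get_dummies`. The reason is that with k categories the k dummy columns always sum to 1, so one of them is redundant. Here is a small sketch on made-up values (`dtype=int` is passed only so the output prints as 0/1 regardless of pandas version):

```python
import pandas as pd

ports = pd.Series(['S', 'C', 'S', 'Q'], name='Embarked')

full = pd.get_dummies(ports, dtype=int)                      # columns C, Q, S
reduced = pd.get_dummies(ports, drop_first=True, dtype=int)  # columns Q, S

print(full)
print(reduced)

# Each row of the full encoding sums to 1, so the dropped column 'C'
# can be recovered from the remaining ones as 1 - Q - S.
recovered_C = 1 - reduced['Q'] - reduced['S']
print(recovered_C)
```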


In [15]:
train.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    889 non-null int64
Survived       889 non-null int64
Pclass         889 non-null int64
Name           889 non-null object
Sex            889 non-null object
Age            889 non-null float64
SibSp          889 non-null int64
Parch          889 non-null int64
Ticket         889 non-null object
Fare           889 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 83.3+ KB

In [16]:
sex = pd.get_dummies(train['Sex'],drop_first=True)
embark = pd.get_dummies(train['Embarked'],drop_first=True)

In [17]:
embark.head()


Out[17]:
Q S
0 0 1
1 0 0
2 0 1
3 0 1
4 0 1

In [18]:
sex.head()


Out[18]:
male
0 1
1 0
2 0
3 0
4 1

In [19]:
train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)

In [20]:
train = pd.concat([train,sex,embark],axis=1)

In [21]:
train.head()


Out[21]:
PassengerId Survived Pclass Age SibSp Parch Fare male Q S
0 1 0 3 22.0 1 0 7.2500 1 0 1
1 2 1 1 38.0 1 0 71.2833 0 0 0
2 3 1 3 26.0 0 0 7.9250 0 0 1
3 4 1 1 35.0 1 0 53.1000 0 0 1
4 5 0 3 35.0 0 0 8.0500 1 0 1

Great! Our data is ready for our model!

Building a Logistic Regression model

Let's start by splitting our data into a training set and test set (there is another test.csv file that you can play around with in case you want to use all this data for training).

Train Test Split


In [22]:
from sklearn.model_selection import train_test_split

In [23]:
X_train, X_test, y_train, y_test = train_test_split(train.drop('Survived',axis=1), 
                                                    train['Survived'], test_size=0.30, 
                                                    random_state=101)
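With `test_size=0.30`, scikit-learn holds out 30% of the 889 rows for testing, rounding the test count up. A quick sketch on a placeholder array of the same length checks the resulting sizes; the array contents are dummy values, not Titanic features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features/labels with the same number of rows as the cleaned data (889).
X = np.arange(889).reshape(-1, 1)
y = np.zeros(889, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                          random_state=101)
print(len(X_tr), len(X_te))
```

The 267 held-out rows correspond to the total support reported in the classification report in the Evaluation section.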

Training and Predicting


In [24]:
from sklearn.linear_model import LogisticRegression

In [29]:
logmodel = LogisticRegression()
logmodel.verbose = 1

In [30]:
logmodel.fit(X_train,y_train)


[LibLinear]
/Users/atma6951/anaconda3/envs/pychakras/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Out[30]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=1, warm_start=False)
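The FutureWarning above can be silenced by naming the solver explicitly. Here is a hedged sketch on synthetic data from `make_classification` (not the Titanic frame). `solver='liblinear'` is chosen because the `[LibLinear]` line in the output indicates that is the solver the default resolved to in this run:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     random_state=0)

# Passing the solver explicitly avoids relying on a version-dependent default.
demo_model = LogisticRegression(solver='liblinear')
demo_model.fit(X_demo, y_demo)

# One row of coefficients (binary problem), one column per feature.
print(demo_model.coef_.shape)
```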

In [31]:
logmodel.coef_


Out[31]:
array([[ 4.10170317e-04, -7.83334719e-01, -2.61257205e-02,
        -2.09907780e-01, -9.55518385e-02,  4.63201983e-03,
        -2.33696636e+00, -1.21716646e-02, -2.02780740e-01]])

In [32]:
logmodel.intercept_


Out[32]:
array([3.36140356])
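The coefficients are on the log-odds scale and follow the column order of `X_train` (PassengerId, Pclass, Age, SibSp, Parch, Fare, male, Q, S, as in the `train.head()` output above). Exponentiating each one turns it into an odds ratio. This sketch uses the values printed above, copied by hand:

```python
import numpy as np

feature_names = ['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch',
                 'Fare', 'male', 'Q', 'S']
coef = np.array([4.10170317e-04, -7.83334719e-01, -2.61257205e-02,
                 -2.09907780e-01, -9.55518385e-02, 4.63201983e-03,
                 -2.33696636e+00, -1.21716646e-02, -2.02780740e-01])

# exp(coefficient) = multiplicative change in the odds of survival
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = dict(zip(feature_names, np.exp(coef)))
for name, ratio in odds_ratios.items():
    print(f'{name:12s} {ratio:.3f}')
```

For example, the `male` odds ratio comes out below 0.1: the model estimates the odds of survival for a man at under a tenth of those for an otherwise identical woman.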

In [52]:
predictions = logmodel.predict(X_test)

Let's move on to evaluate our model!

Evaluation

We can check precision, recall, and f1-score using the classification report!


In [42]:
from sklearn.metrics import classification_report

In [43]:
print(classification_report(y_test,predictions))


             precision    recall  f1-score   support

          0       0.81      0.93      0.86       163
          1       0.85      0.65      0.74       104

avg / total       0.82      0.82      0.81       267
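As a reminder of what the columns of the report mean, here is a sketch that derives precision, recall and F1 for class 1 from a confusion matrix. The labels are made up and unrelated to the Titanic predictions:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```

Support is simply the number of true instances of each class in `y_test`.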

Not so bad! You might want to explore other feature engineering and the other titanic_test.csv file. Some suggestions for feature engineering:

  • Try grabbing the Title (Dr., Mr., Mrs., etc.) from the name as a feature
  • Maybe the Cabin letter could be a feature
  • Is there any info you can get from the ticket?
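The first two suggestions can be sketched with pandas string methods. The names and cabin values below are taken from the `train.head()` output earlier; the column names `Title` and `CabinLetter` and the regular expression are assumptions for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Name':  ['Braund, Mr. Owen Harris',
              'Heikkinen, Miss. Laina',
              'Palsson, Master. Gosta Leonard',
              'McCarthy, Mr. Timothy J'],
    'Cabin': [np.nan, np.nan, np.nan, 'E46'],
})

# Title: the text between the comma and the first period after it.
toy['Title'] = toy['Name'].str.extract(r',\s*([^.]+)\.', expand=False)

# Cabin letter: first character of the cabin string
# (stays NaN when Cabin is missing).
toy['CabinLetter'] = toy['Cabin'].str[0]

print(toy[['Title', 'CabinLetter']])
```

Both new columns are categorical, so they would still need to go through `pd.get_dummies` before being fed to the model.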

Great Job!

Validate against test dataset


In [38]:
test_df = pd.read_csv('titanic_test.csv')
test_df.head()


Out[38]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S

In [39]:
test_df.shape


Out[39]:
(418, 11)

In [48]:
test_df.iloc[0]


Out[48]:
PassengerId                 892
Pclass                        3
Name           Kelly, Mr. James
Sex                        male
Age                        34.5
SibSp                         0
Parch                         0
Ticket                   330911
Fare                     7.8292
Cabin                       NaN
Embarked                      Q
Name: 0, dtype: object
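Before `logmodel.predict` can be called on `test_df`, the test frame has to go through the same cleaning steps and end up with the same columns, in the same order, as `X_train`. Here is a rough sketch of one way to do that. `prepare` is a hypothetical helper, the toy rows stand in for the real file (the second Age is blanked out on purpose to exercise the imputation), and missing values in other columns would need separate handling:

```python
import numpy as np
import pandas as pd

AGE_BY_CLASS = {1: 37, 2: 29, 3: 24}  # the per-class values used in impute_age

def prepare(df, feature_columns):
    """Apply the notebook's cleaning steps and align columns (hypothetical helper)."""
    out = df.copy()
    # Fill missing ages with the value for the passenger's class.
    out['Age'] = out['Age'].fillna(out['Pclass'].map(AGE_BY_CLASS))
    sex = pd.get_dummies(out['Sex'], dtype=int)
    embark = pd.get_dummies(out['Embarked'], dtype=int)
    out = out.drop(['Sex', 'Embarked', 'Name', 'Ticket', 'Cabin'],
                   axis=1, errors='ignore')
    out = pd.concat([out, sex, embark], axis=1)
    # reindex keeps only the training columns, in training order,
    # and fills any dummy column absent from this frame with 0.
    return out.reindex(columns=feature_columns, fill_value=0)

feature_columns = ['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch',
                   'Fare', 'male', 'Q', 'S']

# Toy stand-in for test_df; Age of the second row is deliberately set to NaN.
toy_test = pd.DataFrame({
    'PassengerId': [892, 893], 'Pclass': [3, 3],
    'Name': ['Kelly, Mr. James', 'Wilkes, Mrs. James (Ellen Needs)'],
    'Sex': ['male', 'female'], 'Age': [34.5, np.nan],
    'SibSp': [0, 1], 'Parch': [0, 0], 'Ticket': ['330911', '363272'],
    'Fare': [7.8292, 7.0], 'Cabin': [np.nan, np.nan], 'Embarked': ['Q', 'S'],
})

X_new = prepare(toy_test, feature_columns)
print(X_new)
```

With the real data, something like `logmodel.predict(prepare(test_df, X_train.columns))` would then produce predictions, provided no NaN values remain in the prepared frame.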
