Introduction to Logistic Regression

Learning Objectives

Create Seaborn plots for Exploratory Data Analysis
Train a Logistic Regression Model using Scikit-Learn

Introduction

This lab is an introduction to logistic regression using Python and Scikit-Learn. It serves as a foundation for the more complex algorithms and machine learning models that you will encounter in the course. We will use a synthetic advertising data set that indicates whether or not a particular internet user clicked on an advertisement on a company website, and we will build a model that predicts whether a user will click on an ad based on that user's features.

Each learning objective will correspond to a #TODO in this student lab notebook -- try to complete this notebook first and then review the solution notebook.

Import Libraries



In [ ]:

    
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst



In [3]:

    
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Load the Dataset

We will use a synthetic advertising dataset. This data set contains the following features:

'Daily Time Spent on Site': consumer time on site in minutes
'Age': customer age in years
'Area Income': Avg. Income of geographical area of consumer
'Daily Internet Usage': Avg. minutes a day consumer is on the internet
'Ad Topic Line': Headline of the advertisement
'City': City of consumer
'Male': Whether or not consumer was male
'Country': Country of consumer
'Timestamp': Time at which consumer clicked on Ad or closed window
'Clicked on Ad': 0 or 1, indicating whether the consumer clicked on the Ad



In [18]:

    
# TODO 1: Read in the advertising.csv file and set it to a data frame called ad_data.
# TODO: Your code goes here
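
If you get stuck on this step, the sketch below shows the general shape of reading CSV data with pandas. It is only an illustration: it reads a small in-memory stand-in built from two rows of the head() output shown in this lab rather than the real advertising.csv, and the names csv_text and sample are hypothetical.

```python
import io

import pandas as pd

# In-memory stand-in for advertising.csv (an assumption made so the sketch
# runs anywhere); the two data rows are copied from the head() output.
csv_text = io.StringIO(
    "Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,"
    "Ad Topic Line,City,Male,Country,Timestamp,Clicked on Ad\n"
    "68.95,35,61833.90,256.09,Cloned 5thgeneration orchestration,"
    "Wrightburgh,0,Tunisia,2016-03-27 00:53:11,0\n"
    "80.23,31,68441.85,193.77,Monitored national standardization,"
    "West Jodi,1,Nauru,2016-04-04 01:39:02,0\n"
)

# pd.read_csv accepts a file path or any file-like object.
sample = pd.read_csv(csv_text)
print(sample.shape)
print(sample.dtypes)
```

In the notebook, the same call would take the path to advertising.csv and the result would be assigned to ad_data; where the file lives depends on your environment.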

Check the head of ad_data



In [19]:

    
ad_data.head()









    Out[19]:

   Daily Time Spent on Site  Age  Area Income  Daily Internet Usage  Ad Topic Line                          City            Male  Country     Timestamp            Clicked on Ad
0                     68.95   35     61833.90                256.09  Cloned 5thgeneration orchestration     Wrightburgh        0  Tunisia     2016-03-27 00:53:11              0
1                     80.23   31     68441.85                193.77  Monitored national standardization     West Jodi          1  Nauru       2016-04-04 01:39:02              0
2                     69.47   26     59785.94                236.50  Organic bottom-line service-desk       Davidton           0  San Marino  2016-03-13 20:35:42              0
3                     74.15   29     54806.18                245.89  Triple-buffered reciprocal time-frame  West Terrifurt     1  Italy       2016-01-10 02:31:19              0
4                     68.37   35     73889.99                225.58  Robust logistical utilization          South Manuel       0  Iceland     2016-06-03 03:36:18              0

Use info() and describe() on ad_data



In [20]:

    
ad_data.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
Daily Time Spent on Site    1000 non-null float64
Age                         1000 non-null int64
Area Income                 1000 non-null float64
Daily Internet Usage        1000 non-null float64
Ad Topic Line               1000 non-null object
City                        1000 non-null object
Male                        1000 non-null int64
Country                     1000 non-null object
Timestamp                   1000 non-null object
Clicked on Ad               1000 non-null int64
dtypes: float64(3), int64(3), object(4)
memory usage: 78.2+ KB



In [21]:

    
ad_data.describe()









    Out[21]:

       Daily Time Spent on Site          Age   Area Income  Daily Internet Usage         Male  Clicked on Ad
count               1000.000000  1000.000000   1000.000000           1000.000000  1000.000000     1000.00000
mean                  65.000200    36.009000  55000.000080            180.000100     0.481000        0.50000
std                   15.853615     8.785562  13414.634022             43.902339     0.499889        0.50025
min                   32.600000    19.000000  13996.500000            104.780000     0.000000        0.00000
25%                   51.360000    29.000000  47031.802500            138.830000     0.000000        0.00000
50%                   68.215000    35.000000  57012.300000            183.130000     0.000000        0.50000
75%                   78.547500    42.000000  65470.635000            218.792500     1.000000        1.00000
max                   91.430000    61.000000  79484.800000            269.960000     1.000000        1.00000

Let's check for any null values.



In [22]:

    
ad_data.isnull().sum()









    Out[22]:





Daily Time Spent on Site    0
Age                         0
Area Income                 0
Daily Internet Usage        0
Ad Topic Line               0
City                        0
Male                        0
Country                     0
Timestamp                   0
Clicked on Ad               0
dtype: int64

Exploratory Data Analysis (EDA)

Let's use seaborn to explore the data! Try recreating the plots shown below!

TODO 1: Create a histogram of the Age



In [28]:

    
# TODO: Your code goes here









    Out[28]:





Text(0.5, 0, 'Age')

TODO 1: Create a jointplot showing Area Income versus Age.



In [29]:

    
# TODO: Your code goes here









    Out[29]:





<seaborn.axisgrid.JointGrid at 0x7f9391624d68>

TODO 2: Create a jointplot showing the KDE distributions of 'Daily Time Spent on Site' vs. 'Age'.



In [30]:

    
# TODO: Your code goes here

TODO 1: Create a jointplot of 'Daily Time Spent on Site' vs. 'Daily Internet Usage'



In [31]:

    
# TODO: Your code goes here









    Out[31]:





<seaborn.axisgrid.JointGrid at 0x7f939100da90>

Logistic Regression

Logistic regression is a supervised machine learning method. It is similar to linear regression, but rather than predicting a continuous value, it estimates the probability that an observation belongs to a class by passing a linear combination of the features through the logistic function. Despite having "regression" in its name, it is used for classification: linear regression suits estimating continuous values, while logistic regression suits predicting the class of an observation.
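
To make the role of the logistic function concrete, here is a minimal NumPy sketch. The weights, bias, and feature values are made up for illustration and are not fitted to the advertising data. The 0.5 threshold is the conventional default for turning a probability into a class.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical coefficients and a single two-feature observation.
weights = np.array([0.8, -0.5])
bias = 0.1
x = np.array([1.0, 2.0])

z = weights @ x + bias   # linear score, as in linear regression
p = sigmoid(z)           # estimated probability of class 1
label = int(p >= 0.5)    # predicted class using a 0.5 threshold

print(z, p, label)
```

A score of zero maps to a probability of exactly 0.5, so the sign of the linear score decides the predicted class.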

Now it's time to do a train test split, and train our model! You'll have the freedom here to choose columns that you want to train on!



In [44]:

    
from sklearn.model_selection import train_test_split

Next, let's define the features and the label. Briefly, the features are the model's inputs and the label is the output it predicts. This applies to both classification and regression problems.



In [45]:

    
X = ad_data[['Daily Time Spent on Site', 'Age', 'Area Income','Daily Internet Usage', 'Male']]
y = ad_data['Clicked on Ad']

TODO 2: Split the data into training set and testing set using train_test_split



In [46]:

    
# TODO: Your code goes here
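
A possible shape for this step is sketched below on randomly generated arrays (X_demo and y_demo are hypothetical stand-ins for X and y). The lab does not state a split ratio, so test_size=0.33 is an assumption; it happens to be consistent with the 330 test rows in the classification report. The random_state value is arbitrary and only makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for X and y: 1000 rows, 5 numeric features,
# and a binary label.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.integers(0, 2, size=1000)

# test_size is the fraction of rows held out for testing (an assumed value).
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

print(X_tr.shape, X_te.shape)
```

In the notebook, the four results would be named X_train, X_test, y_train, and y_test, since the later cells refer to those names.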

Train and fit a logistic regression model on the training set.



In [47]:

    
from sklearn.linear_model import LogisticRegression



In [48]:

    
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)









    Out[48]:





LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Predictions and Evaluations

Now predict values for the testing data.



In [41]:

    
predictions = logmodel.predict(X_test)

Create a classification report for the model.



In [42]:

    
from sklearn.metrics import classification_report



In [49]:

    
print(classification_report(y_test,predictions))









    



             precision    recall  f1-score   support

          0       0.87      0.96      0.91       162
          1       0.96      0.86      0.91       168

avg / total       0.91      0.91      0.91       330
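
To see where the precision, recall, and f1-score columns come from, the sketch below computes them for class 1 on a tiny set of made-up labels. These are not the lab's predictions.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 4 true positives, 1 false positive, 2 false negatives
# for class 1.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 4 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 4 / 6
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two = 8 / 11

print(round(precision, 2), round(recall, 2), round(f1, 2))
```

The support column in the report is the number of true observations of each class in the test set.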

Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
