This lab is an introduction to logistic regression using Python and Scikit-Learn. It serves as a foundation for the more complex algorithms and machine learning models you will encounter in the course. In this lab, we will use a synthetic advertising dataset indicating whether or not a particular internet user clicked on an advertisement on a company website. We will try to create a model that predicts whether or not a user will click on an ad based on the features of that user.
Each learning objective will correspond to a #TODO in the student lab notebook -- try to complete that notebook first before reviewing this solution notebook.
In [ ]:
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst
In [3]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
We will use a synthetic advertising dataset. It includes user features such as 'Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage', and 'Male', along with the target label 'Clicked on Ad'.
In [18]:
# TODO 1: Read in the advertising.csv file and set it to a data frame called ad_data.
ad_data = pd.read_csv('../advertising.csv')
Check the head of ad_data
In [19]:
ad_data.head()
Out[19]:
Use info() and describe() on ad_data
In [20]:
ad_data.info()
In [21]:
ad_data.describe()
Out[21]:
Let's check for any null values.
In [22]:
ad_data.isnull().sum()
Out[22]:
TODO 1: Create a histogram of the 'Age' column
In [28]:
# TODO 1
sns.set_style('whitegrid')
ad_data['Age'].hist(bins=30)
plt.xlabel('Age')
Out[28]:
TODO 1: Create a jointplot showing Area Income versus Age.
In [29]:
# TODO 1
sns.jointplot(x='Age',y='Area Income',data=ad_data)
Out[29]:
TODO 2: Create a jointplot showing the KDE distributions of 'Daily Time Spent on Site' vs. 'Age'.
In [30]:
# TODO 2
sns.jointplot(x='Age',y='Daily Time Spent on Site',data=ad_data,color='red',kind='kde');
TODO 1: Create a jointplot of 'Daily Time Spent on Site' vs. 'Daily Internet Usage'
In [31]:
# TODO 1
sns.jointplot(x='Daily Time Spent on Site',y='Daily Internet Usage',data=ad_data,color='green')
Out[31]:
Logistic regression is a supervised machine learning algorithm. It is similar to linear regression, but rather than predicting a continuous value, it estimates probabilities using a logistic function. Note that even though it has "regression" in the name, it is used for classification. While linear regression is suitable for estimating continuous values, logistic regression is best for predicting the class of an observation.
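As a minimal sketch (separate from this lab's model code), the logistic function squashes a linear combination of features into a probability between 0 and 1, which is then thresholded to pick a class:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A linear score w·x + b of 0 corresponds to a probability of exactly 0.5;
# large positive scores approach 1, large negative scores approach 0.
scores = np.array([-4.0, 0.0, 4.0])
probabilities = sigmoid(scores)

# Classify by thresholding at 0.5 (the default decision rule).
predicted_classes = (probabilities >= 0.5).astype(int)
```

This is why logistic regression outputs can be read as class probabilities even though the underlying model is linear in the features.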
Now it's time to do a train test split, and train our model! You'll have the freedom here to choose columns that you want to train on!
In [44]:
from sklearn.model_selection import train_test_split
Next, let's define the features and the label. Briefly, features are the inputs and the label is the output; this applies to both classification and regression problems.
In [45]:
X = ad_data[['Daily Time Spent on Site', 'Age', 'Area Income','Daily Internet Usage', 'Male']]
y = ad_data['Clicked on Ad']
TODO 2: Split the data into training set and testing set using train_test_split
In [46]:
# TODO 2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Fit a logistic regression model on the training set.
In [47]:
from sklearn.linear_model import LogisticRegression
In [48]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
Out[48]:
In [41]:
predictions = logmodel.predict(X_test)
Create a classification report for the model.
In [42]:
from sklearn.metrics import classification_report
In [49]:
print(classification_report(y_test,predictions))
Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.