In this project we will work with a fake advertising data set indicating whether or not a particular internet user clicked on an advertisement. We will try to create a model that predicts whether or not a user will click on an ad based on that user's features.
This data set contains the following features (the columns used in this notebook):

- 'Daily Time Spent on Site'
- 'Age'
- 'Area Income'
- 'Daily Internet Usage'
- 'Male'

plus the target label, 'Clicked on Ad'.
Import a few libraries you think you'll need (Or just import them as you go along!)
In [6]:
import pandas as pd
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Read in the advertising.csv file and set it to a data frame called ad_data.
In [2]:
ad_data = pd.read_csv('./advertising.csv')
Check the head of ad_data
In [3]:
ad_data.head()
Out[3]:
Use info() and describe() on ad_data
In [4]:
ad_data.info()
In [5]:
ad_data.describe()
Out[5]:
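Since advertising.csv isn't bundled with this notebook, here is a minimal sketch of the same sanity checks on a tiny synthetic stand-in frame (the column names are assumed to match the real dataset; the values are made up):

```python
import pandas as pd

# Tiny synthetic stand-in for advertising.csv (assumed column names, made-up values)
demo = pd.DataFrame({
    'Daily Time Spent on Site': [68.95, 80.23, 69.47],
    'Age': [35, 31, 26],
    'Area Income': [61833.90, 68441.85, 59785.94],
    'Daily Internet Usage': [256.09, 193.77, 236.50],
    'Male': [0, 1, 0],
    'Clicked on Ad': [0, 0, 1],
})

# The usual first checks: shape, missing values, summary statistics
print(demo.shape)                 # (3, 6)
print(demo.isnull().sum().sum())  # 0 -> no missing values
print(demo.describe().loc['mean'])
```

On the real data, `ad_data.info()` reports the same things (row count, dtypes, non-null counts) and `ad_data.describe()` the summary statistics.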
Create a histogram of the Age
In [11]:
ad_data['Age'].plot(kind='hist', bins=40)
Out[11]:
Create a jointplot showing Area Income versus Age.
In [15]:
sns.jointplot(x='Age', y='Area Income', data=ad_data)
Out[15]:
Create a jointplot showing the KDE distributions of 'Daily Time Spent on Site' vs. Age.
In [16]:
sns.jointplot(x='Age', y='Daily Time Spent on Site', data=ad_data, kind='kde', color='red')
Out[16]:
Create a jointplot of 'Daily Time Spent on Site' vs. 'Daily Internet Usage'
In [18]:
sns.jointplot(x='Daily Time Spent on Site', y='Daily Internet Usage', data=ad_data, color='green')
Out[18]:
Finally, create a pairplot with the hue defined by the 'Clicked on Ad' column feature.
In [19]:
sns.pairplot(data = ad_data, hue='Clicked on Ad')
Out[19]:
Split the data into a training set and a testing set using train_test_split
In [20]:
from sklearn.model_selection import train_test_split
In [22]:
ad_data.info()
In [36]:
X = ad_data[['Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage', 'Male']]
y = ad_data['Clicked on Ad']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=33)
In [37]:
(X_train.shape, X_test.shape)
Out[37]:
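With test_size=0.33, roughly a third of the rows go to the test set (scikit-learn rounds the test size up). A quick sketch of the same split on synthetic arrays (shapes and names here are illustrative, not the ad data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 100 rows, 5 features, binary target
X_demo = np.random.RandomState(0).rand(100, 5)
y_demo = np.random.RandomState(1).randint(0, 2, size=100)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo,
                                      test_size=0.33, random_state=33)

print(Xtr.shape, Xte.shape)  # (67, 5) (33, 5)
```

Fixing random_state makes the split reproducible; rerunning with a different seed reshuffles which rows land in each set.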
Fit a logistic regression model on the training set.
In [38]:
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
In [39]:
logmodel.fit(X_train, y_train)
Out[39]:
In [40]:
logmodel.intercept_
Out[40]:
In [41]:
# Note: intercept_scaling is a constructor hyperparameter (default 1), not a fitted attribute
logmodel.intercept_scaling
Out[41]:
In [42]:
logmodel.coef_
Out[42]:
In [43]:
# rounded coefficients; these change if the train/test split is re-shuffled
logmodel.coef_.round(4)
Out[43]:
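Because logistic regression models the log-odds, exponentiating a coefficient gives the multiplicative change in the odds of clicking per unit increase in that feature. A minimal sketch on synthetic data (the features and values here are made up, not the ad dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(42)
X_demo = rng.rand(200, 2)
# The target depends positively on the first feature only
y_demo = (X_demo[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# exp(coefficient) = odds ratio per unit increase in that feature
odds_ratios = np.exp(model.coef_[0])
print(odds_ratios)  # first entry > 1: higher feature 0 raises the odds of class 1
```

The same transformation applied to logmodel.coef_ above would tell you, e.g., how the odds of clicking change per extra year of Age.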
In [47]:
train_predictions = logmodel.predict(X_train)
test_predictions = logmodel.predict(X_test)
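predict() is just predict_proba() thresholded at 0.5, which matters if you ever want to trade precision against recall by moving the cutoff. A sketch on synthetic data (not the ad dataset) showing the two agree:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)
y_demo = (X_demo.sum(axis=1) > 1.5).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

proba = model.predict_proba(X_demo)[:, 1]  # P(class 1) for each row
hard = model.predict(X_demo)               # hard labels, cutoff at 0.5

print(np.array_equal(hard, (proba > 0.5).astype(int)))  # True
```

Lowering the cutoff (e.g. `proba > 0.3`) flags more rows as clicks, raising recall at the cost of precision.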
Create a classification report for the model.
In [46]:
from sklearn.metrics import classification_report
In [55]:
print("Training set classification report")
print(classification_report(y_true=y_train,
                            y_pred=train_predictions,
                            labels=[0, 1],
                            target_names=['Not clicked on ad', 'Clicked on ad']))
In [56]:
print("Test set classification report")
print(classification_report(y_true=y_test,
                            y_pred=test_predictions,
                            labels=[0, 1],
                            target_names=['Not clicked on ad', 'Clicked on ad']))
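The precision and recall figures in the report come straight from the confusion matrix. A self-contained sketch on hand-made labels (not the actual model output) showing how the numbers connect:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made labels for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3

# Precision = tp / (tp + fp), recall = tp / (tp + fn)
print(precision_score(y_true, y_pred))  # 3 / 4 = 0.75
print(recall_score(y_true, y_pred))     # 3 / 4 = 0.75
```

Running `confusion_matrix(y_test, test_predictions)` on the model above gives the counts behind its report the same way.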