Timothy Helton
The goal of this assignment is to build a simple linear regression algorithm from scratch. Linear regression is a useful and easy-to-understand method for predicting values given a set of training data. The outcome of regression is a best-fitting line, which, by definition, is the line that minimizes the sum of the squared errors. When plotted on a two-dimensional coordinate system, the errors are the vertical distances between the actual Y and the predicted Y' on the line. In machine learning, the line equation Y' = bX + A is often solved using gradient descent, which approaches the solution gradually. We will use the statistical approach here, which solves the line equation directly without an iterative algorithm.
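For reference, the direct solution is usually written as the standard closed-form least-squares estimates below. The symbols x_i, y_i, and n (the individual observations and their count) and the bars denoting sample means are notation introduced here for illustration; b and A are the slope and intercept of the line equation above.

```latex
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
A = \bar{y} - b\,\bar{x},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - Y'_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```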
NOTE:
This notebook uses code found in the k2datascience.linear_regression module.
To execute all the cells do one of the following items:
In [ ]:
from k2datascience import linear_regression
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
In [ ]:
ad = linear_regression.AdvertisingSimple()
The Advertising data set consists of the sales of a product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. Explore the data and decide which variable you would like to use to predict Sales.
In [ ]:
ad.data.info()
ad.data.head()
ad.data.describe()
In [ ]:
ad.plot_correlation_joint_plots()
In [ ]:
ad.plot_correlation_heatmap()
The derivation can be found here on Wikipedia.
The general steps are:
The class should do the following:
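As a rough illustration of the statistical approach, the sketch below computes the slope, intercept, and R-squared directly from sample means and sums of squares. It is not the k2datascience implementation: the function name `simple_linear_fit` and the demonstration arrays are hypothetical and were made up for this example.

```python
import numpy as np


def simple_linear_fit(x, y):
    """Closed-form least-squares fit of y = slope * x + intercept.

    Hypothetical helper for illustration; not part of k2datascience.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()

    # Slope: co-variation of x and y divided by the variation of x.
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: forces the line through the point of means.
    intercept = y_mean - slope * x_mean

    # R-squared: 1 - (residual sum of squares / total sum of squares).
    residuals = y - (slope * x + intercept)
    r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y_mean) ** 2)
    return slope, intercept, r2


# Made-up data lying close to y = 2x + 1 (not the Advertising data).
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = np.array([1.1, 2.9, 5.2, 6.8, 9.0])
slope, intercept, r2 = simple_linear_fit(x_demo, y_demo)
```

No iteration or learning rate is involved, which is the main contrast with gradient descent; the result should agree with what a library fit returns for the same data.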
In [ ]:
ad.simple_stats_fit()
f'Coefficient: {ad.coefficients[0]:.4f}'
f'Intercept: {ad.intercept:.4f}'
f'R-Squared Value: {ad.r2:.3f}'
In [ ]:
ad.plot_simple_stats()
In [ ]:
import statsmodels.formula.api as smf
ln_reg = smf.ols(formula='sales ~ tv', data=ad.data).fit()
ln_reg.params
ln_reg.summary()
In [ ]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
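# Convert to a NumPy array first; recent pandas versions do not allow
# multi-dimensional indexing directly on a Series.
advertising_X = ad.data.tv.to_numpy()[:, np.newaxis]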
ad_X_train = advertising_X[:-20]
ad_X_test = advertising_X[-20:]
ad_y_train = ad.data.sales[:-20]
ad_y_test = ad.data.sales[-20:]
ln_reg = linear_model.LinearRegression()
ln_reg.fit(ad_X_train, ad_y_train)
f'Coefficients: {ln_reg.coef_}'
f'Intercept: {ln_reg.intercept_}'
mse = np.mean((ln_reg.predict(ad_X_test) - ad_y_test)**2)
f'Mean Squared Error: {mse:.2f}'
# score() returns the coefficient of determination (R-squared).
variance = ln_reg.score(ad_X_test, ad_y_test)
f'Variance Score (R-Squared): {variance:.2f}'
In [ ]:
fig = plt.figure('Sales vs TV Advertising', figsize=(8, 6),
                 facecolor='white', edgecolor='black')
rows, cols = (1, 1)
ax = plt.subplot2grid((rows, cols), (0, 0))
test_sort = np.argsort(ad_X_test.flatten())
ax.scatter(ad_X_test, ad_y_test, alpha=0.5, marker='d')
ax.plot(ad_X_test[test_sort], ln_reg.predict(ad_X_test)[test_sort],
        color='black', linestyle='--')
ax.set_title('Sales vs TV Advertising', fontsize=20)
ax.set_xlabel('TV Advertising', fontsize=14)
ax.set_ylabel('Sales', fontsize=14)
plt.show();