Linear Regression


Timothy Helton

The goal of this assignment is to build a simple linear regression algorithm from scratch. Linear regression is a useful and easy-to-understand method for predicting values, given a set of training data. The outcome of regression is a best-fit line, which, by definition, is the line that minimizes the sum of the squared errors. When plotted on a two-dimensional coordinate system, each error is the vertical distance between the actual value Y and the value Y' predicted by the line. In machine learning, the line equation Y' = b(x) + A is often solved with gradient descent, which approaches the best-fit coefficients gradually. We will use the statistical approach here, which solves the line equation directly without an iterative algorithm.
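To make the contrast between the two approaches concrete, here is a minimal sketch that fits the same line both ways. The five data points, the learning rate, and the iteration count are made up for illustration and are not part of the assignment.

```python
# Toy data lying near y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
n = len(xs)

# Direct (statistical) solution: slope = cov(x, y) / var(x).
x_mean = sum(xs) / n
y_mean = sum(ys) / n
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean

# Iterative solution: gradient descent on the mean squared error.
b, a = 0.0, 0.0        # initial guesses for slope and intercept
learning_rate = 0.05   # assumed value, small enough to converge here
for _ in range(5000):
    errors = [(b * x + a) - y for x, y in zip(xs, ys)]
    grad_b = 2 / n * sum(e * x for e, x in zip(errors, xs))
    grad_a = 2 / n * sum(errors)
    b -= learning_rate * grad_b
    a -= learning_rate * grad_a

print(f'direct:  slope={slope:.4f}, intercept={intercept:.4f}')
print(f'descent: slope={b:.4f}, intercept={a:.4f}')
```

On this toy data both approaches arrive at the same line; the direct solution simply gets there in one step.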



NOTE:
This notebook uses code found in the k2datascience.linear_regression module. To execute all the cells do one of the following items:

  • Install the k2datascience package to the active Python interpreter.
  • Add k2datascience/k2datascience to the PYTHONPATH environment variable.
  • Create a link to the linear_regression.py file in the same directory as this notebook.


Imports


In [ ]:
from k2datascience import linear_regression

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline

Load Data


In [ ]:
ad = linear_regression.AdvertisingSimple()

Exercise 1 - Explore the Data

The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. Explore the data and decide on which variable you would like to use to predict Sales.


In [ ]:
ad.data.info()
ad.data.head()
ad.data.describe()

In [ ]:
ad.plot_correlation_joint_plots()

In [ ]:
ad.plot_correlation_heatmap()

Findings

  • TV advertising has the largest correlation with sales, and will be used for prediction.

Exercise 2 - Build a Simple Linear Regression Class

The derivation can be found here on Wikipedia.

The general steps are:

  • Calculate mean and variance
  • Calculate covariance
  • Estimate coefficients
  • Make predictions on out-of-sample data
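
The steps above can be sketched in plain Python as follows. This is one possible layout, not the k2datascience implementation; the function names and the sample points are hypothetical. Variance and covariance both divide by n here, which gives the same slope as dividing both by n - 1 because the factor cancels in the ratio.

```python
def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)


def variance(values):
    """Population variance (divides by n)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    x_mean, y_mean = mean(xs), mean(ys)
    return sum((x - x_mean) * (y - y_mean)
               for x, y in zip(xs, ys)) / len(xs)


def estimate_coefficients(xs, ys):
    """Return (slope, intercept) of the least squares line."""
    slope = covariance(xs, ys) / variance(xs)
    intercept = mean(ys) - slope * mean(xs)
    return slope, intercept


def predict(new_xs, slope, intercept):
    """Apply the fitted line to out-of-sample x values."""
    return [slope * x + intercept for x in new_xs]


# Hypothetical sample points for illustration.
xs = [1, 2, 3, 4, 5]
ys = [1, 3, 2, 3, 5]
slope, intercept = estimate_coefficients(xs, ys)
prediction = predict([6], slope, intercept)[0]
print(f'slope={slope:.2f}, intercept={intercept:.2f}, y(6)={prediction:.2f}')
```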

The class should do the following:

  • Fit a set of x,y points
  • Predict the values for new x values based on the coefficients
  • Plot the best fit line on the points
  • Return the coefficient and intercept
  • Return the coefficient of determination (R^2)
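
A minimal sketch of a class that meets these requirements is shown below. The class name, attribute names, and sample points are hypothetical and do not reflect the k2datascience.linear_regression module. The plot method assumes matplotlib is installed and imports it only when called.

```python
class SimpleLinearRegression:
    """Hypothetical from-scratch simple linear regression (one feature)."""

    def __init__(self):
        self.coefficient = None
        self.intercept = None
        self.r2 = None

    def fit(self, xs, ys):
        """Estimate slope, intercept, and R^2 from x, y points."""
        n = len(xs)
        x_mean = sum(xs) / n
        y_mean = sum(ys) / n
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
        sxx = sum((x - x_mean) ** 2 for x in xs)
        self.coefficient = sxy / sxx
        self.intercept = y_mean - self.coefficient * x_mean
        # R^2 = 1 - (residual sum of squares / total sum of squares)
        ss_res = sum((y - p) ** 2 for y, p in zip(ys, self.predict(xs)))
        ss_tot = sum((y - y_mean) ** 2 for y in ys)
        self.r2 = 1 - ss_res / ss_tot
        return self

    def predict(self, xs):
        """Predict y for each new x using the fitted line."""
        return [self.coefficient * x + self.intercept for x in xs]

    def plot(self, xs, ys):
        """Scatter the points and overlay the best fit line."""
        import matplotlib.pyplot as plt  # imported lazily; optional
        fig, ax = plt.subplots()
        ax.scatter(xs, ys, alpha=0.5)
        line_xs = sorted(xs)
        ax.plot(line_xs, self.predict(line_xs),
                color='black', linestyle='--')
        return ax


# Hypothetical sample points for illustration.
model = SimpleLinearRegression().fit([1, 2, 3, 4, 5], [1, 3, 2, 3, 5])
print(f'Coefficient: {model.coefficient:.4f}')
print(f'Intercept: {model.intercept:.4f}')
print(f'R-Squared Value: {model.r2:.3f}')
```

Returning self from fit allows the fit and a later call to be chained, which mirrors the convention used by scikit-learn estimators.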

Exercise 3 - Try it out on the Advertising Data Set


In [ ]:
ad.simple_stats_fit()
f'Coefficient: {ad.coefficients[0]:.4f}'
f'Intercept: {ad.intercept:.4f}'
f'R-Squared Value: {ad.r2:.3f}'

In [ ]:
ad.plot_simple_stats()

Exercise 4 - Check via Statsmodels and Scikit-learn


In [ ]:
import statsmodels.formula.api as smf

ln_reg = smf.ols(formula='sales ~ tv', data=ad.data).fit()
ln_reg.params
ln_reg.summary()

In [ ]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model

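# Convert to a NumPy array before adding an axis; recent pandas versions
# do not support indexing a Series with [:, np.newaxis].
advertising_X = ad.data.tv.to_numpy()[:, np.newaxis]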
ad_X_train = advertising_X[:-20]
ad_X_test = advertising_X[-20:]

ad_y_train = ad.data.sales[:-20]
ad_y_test = ad.data.sales[-20:]

ln_reg = linear_model.LinearRegression()
ln_reg.fit(ad_X_train, ad_y_train)

f'Coefficients: {ln_reg.coef_}'
f'Intercept: {ln_reg.intercept_}'
mse = np.mean((ln_reg.predict(ad_X_test) - ad_y_test)**2)
f'Mean Squared Error: {mse:.2f}'
# score() returns the coefficient of determination (R^2) on the test data.
variance = ln_reg.score(ad_X_test, ad_y_test)
f'Variance Score: {variance:.2f}'

In [ ]:
fig = plt.figure('Sales vs TV Advertising', figsize=(8, 6),
                 facecolor='white', edgecolor='black')
rows, cols = (1, 1)
ax = plt.subplot2grid((rows, cols), (0, 0))

test_sort = np.argsort(ad_X_test.flatten())

ax.scatter(ad_X_test, ad_y_test, alpha=0.5, marker='d')
ax.plot(ad_X_test[test_sort], ln_reg.predict(ad_X_test)[test_sort],
        color='black', linestyle='--')

ax.set_title('Sales vs TV Advertising', fontsize=20)
ax.set_xlabel('TV Advertising', fontsize=14)
ax.set_ylabel('Sales', fontsize=14)

plt.show();

Findings

  • Statsmodels and scikit-learn both use Ordinary Least Squares to perform the linear regression, but the coefficient and intercept values are slightly different.
    • Statsmodels uses the entire data set to create the fit.
    • scikit-learn uses only a portion of the data (all but the last 20 rows) to create the fit. LinearRegression solves the least squares problem directly rather than with an iterative algorithm, so the smaller training set accounts for the difference.
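
The effect of fitting on a subset can be illustrated without either library. The sketch below uses synthetic data (the slope, intercept, and noise level are invented, not the Advertising values) and two hypothetical helper functions that both solve least squares directly. On the same data the two direct methods agree; holding out the last 20 rows shifts the result.

```python
import random


def fit_cov_var(xs, ys):
    """Slope and intercept from covariance / variance."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def fit_normal_equations(xs, ys):
    """Slope and intercept from the 2x2 normal equations (Cramer's rule)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    slope = (n * sxy - sx * sy) / det
    intercept = (sy * sxx - sx * sxy) / det
    return slope, intercept


# Synthetic stand-in for 200 markets (all values are assumptions).
random.seed(0)
xs = [random.uniform(0, 300) for _ in range(200)]
ys = [0.05 * x + 7 + random.gauss(0, 3) for x in xs]

full_a = fit_cov_var(xs, ys)
full_b = fit_normal_equations(xs, ys)
subset = fit_cov_var(xs[:-20], ys[:-20])

print(f'full data, cov/var:          {full_a}')
print(f'full data, normal equations: {full_b}')
print(f'first 180 rows only:         {subset}')
```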

Additional Optional Exercises

  • Train / test split with RMSE calculation
  • Proper documentation for class methods and attributes
  • Build with NumPy methods and compare computation time
  • Multiple Linear Regression (SGD covered in Advanced Regression Unit)
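
As a starting point for the first optional exercise, here is a minimal sketch of a shuffled train / test split with an RMSE calculation. It uses synthetic data and a hypothetical fit helper rather than the Advertising data set; the 80 / 20 split ratio and the random seed are assumptions.

```python
import math
import random


def fit(xs, ys):
    """Least squares slope and intercept for one feature."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Synthetic data (values are assumptions for illustration).
rng = random.Random(42)
xs = [rng.uniform(0, 300) for _ in range(200)]
ys = [0.05 * x + 7 + rng.gauss(0, 3) for x in xs]

# Shuffle the row indices, then keep 80% for training and 20% for testing.
indices = list(range(len(xs)))
rng.shuffle(indices)
split = len(indices) * 4 // 5
train_idx, test_idx = indices[:split], indices[split:]

slope, intercept = fit([xs[i] for i in train_idx],
                       [ys[i] for i in train_idx])

# RMSE: square root of the mean squared error on the held-out rows.
squared_errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2
                  for i in test_idx]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

print(f'Train rows: {len(train_idx)}, test rows: {len(test_idx)}')
print(f'RMSE: {rmse:.2f}')
```

Shuffling before splitting avoids any ordering in the rows leaking into the test set, which taking the last 20 rows does not guard against.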