Linear Regression

Timothy Helton

The goal of this assignment is to build a simple linear regression algorithm from scratch. Linear regression is a very useful and easy-to-understand method for predicting values, given a set of training data. The outcome of regression is a best-fitting line, which, by definition, is the line that minimizes the sum of the squared errors. When plotted on a two-dimensional coordinate system, the errors are the vertical distances between the actual Y and the predicted Y' on the line. In machine learning, the line equation Y' = bx + A is often solved using gradient descent, which approaches the solution gradually. We will use the statistical approach here, which solves for the line directly without an iterative algorithm.
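
For reference, the direct solution can be written out as below. This is the standard closed-form result for simple least squares, stated in the notation of the line equation above; the symbols n (number of observations) and x̄, ȳ (sample means) are introduced here for illustration.

```latex
\begin{aligned}
\mathrm{SSE}(A, b) &= \sum_{i=1}^{n} \bigl(y_i - (b x_i + A)\bigr)^2 \\
b &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
A &= \bar{y} - b\,\bar{x}
\end{aligned}
```

The slope b is the covariance of x and y divided by the variance of x, and the intercept A forces the line through the point of means.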

NOTE:
This notebook uses code found in the k2datascience.linear_regression module. To execute all the cells, do one of the following:

Install the k2datascience package to the active Python interpreter.
Add k2datascience/k2datascience to the PYTHONPATH environment variable.
Create a link to the linear_regression.py file in the same directory as this notebook.

Imports



In [ ]:

    
from k2datascience import linear_regression

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline

Load Data



In [ ]:

    
ad = linear_regression.AdvertisingSimple()

Exercise 1 - Explore the Data

The Advertising data set consists of the sales of a product in 200 different markets, along with the advertising budgets for that product in each market for three different media: TV, radio, and newspaper. Explore the data and decide which variable you would like to use to predict Sales.



In [ ]:

    
ad.data.info()
ad.data.head()
ad.data.describe()



In [ ]:

    
ad.plot_correlation_joint_plots()



In [ ]:

    
ad.plot_correlation_heatmap()

Findings

TV advertising has the strongest correlation with sales and will be used for prediction.
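
As a rough illustration of how such a comparison works, the sketch below computes Pearson's correlation coefficient by hand and picks the feature with the largest absolute value. The numbers are made-up toy values, not the Advertising data, and the names pearson_r, toy_features, and toy_sales are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(spread_x * spread_y)


# Hypothetical toy budgets and sales, for illustration only.
toy_features = {
    'tv': [10, 20, 30, 40, 50],
    'radio': [5, 1, 4, 2, 3],
}
toy_sales = [12, 19, 33, 38, 52]

correlations = {name: pearson_r(values, toy_sales)
                for name, values in toy_features.items()}
# The strongest linear relationship is the largest absolute correlation.
best_feature = max(correlations, key=lambda name: abs(correlations[name]))
```

In these toy numbers the 'tv' column tracks sales almost linearly while 'radio' does not, so 'tv' is selected.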

Exercise 2 - Build a Simple Linear Regression Class

The derivation can be found on Wikipedia.

The general steps are:

Calculate mean and variance
Calculate covariance
Estimate coefficients
Make predictions on out-of-sample data
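
The four steps above can be sketched in plain Python as follows. This is a minimal illustration on made-up points, not the implementation in the k2datascience module; all function names are hypothetical. The variance and covariance are left unnormalized because the 1/n (or 1/(n-1)) factor cancels in the slope.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Sum of squared deviations from the mean (unnormalized)."""
    center = mean(values)
    return sum((v - center) ** 2 for v in values)


def covariance(x, y):
    """Sum of cross-deviations from the means (unnormalized)."""
    mean_x, mean_y = mean(x), mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))


def estimate_coefficients(x, y):
    """Return (slope, intercept) of the least-squares line."""
    slope = covariance(x, y) / variance(x)
    intercept = mean(y) - slope * mean(x)
    return slope, intercept


# Hypothetical training points, for illustration only.
x_train = [1, 2, 3, 4, 5]
y_train = [1, 3, 2, 3, 5]

slope, intercept = estimate_coefficients(x_train, y_train)
# Predict for x values that were not part of the training data.
predictions = [intercept + slope * x for x in [6, 7]]
```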

The class should do the following:

Fit a set of x, y points
Predict the value of new x values based on the coefficients
Plot the best-fit line on the points
Return the coefficient and intercept
Return the coefficient of determination (R^2)
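
One possible shape for such a class is sketched below. It is an assumption about the interface, not the class shipped in k2datascience.linear_regression; the name SimpleLinearRegression and its methods are hypothetical. Plotting imports matplotlib lazily, so the rest of the class works without it.

```python
class SimpleLinearRegression:
    """Closed-form simple linear regression (illustrative sketch)."""

    def __init__(self):
        self.coefficient = None
        self.intercept = None

    def fit(self, x, y):
        """Estimate the slope and intercept from paired observations."""
        n = len(x)
        mean_x = sum(x) / n
        mean_y = sum(y) / n
        s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        s_xx = sum((a - mean_x) ** 2 for a in x)
        self.coefficient = s_xy / s_xx
        self.intercept = mean_y - self.coefficient * mean_x
        return self

    def predict(self, x):
        """Return predicted y values for a sequence of x values."""
        return [self.intercept + self.coefficient * xi for xi in x]

    def r_squared(self, x, y):
        """Coefficient of determination: 1 - SS_res / SS_tot."""
        mean_y = sum(y) / len(y)
        ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, self.predict(x)))
        ss_tot = sum((yi - mean_y) ** 2 for yi in y)
        return 1 - ss_res / ss_tot

    def plot(self, x, y):
        """Scatter the points and overlay the fitted line."""
        import matplotlib.pyplot as plt  # only needed for plotting
        fig, ax = plt.subplots()
        ax.scatter(x, y, alpha=0.5)
        xs = sorted(x)
        ax.plot(xs, self.predict(xs), color='black', linestyle='--')
        return ax


# Hypothetical points, for illustration only.
xs_demo = [1, 2, 3, 4, 5]
ys_demo = [1, 3, 2, 3, 5]
model = SimpleLinearRegression().fit(xs_demo, ys_demo)
r2 = model.r_squared(xs_demo, ys_demo)
```

Returning self from fit allows the construction and fit to be chained on one line, which mirrors the convention used by scikit-learn estimators.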

Exercise 3 - Try it out on the Advertising Data Set



In [ ]:

    
ad.simple_stats_fit()
f'Coefficient: {ad.coefficients[0]:.4f}'
f'Intercept: {ad.intercept:.4f}'
f'R-Squared Value: {ad.r2:.3f}'



In [ ]:

    
ad.plot_simple_stats()

Exercise 4 - Check via Statsmodels and Scikit-learn



In [ ]:

    
import statsmodels.formula.api as smf

ln_reg = smf.ols(formula='sales ~ tv', data=ad.data).fit()
ln_reg.params
ln_reg.summary()



In [ ]:

    
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model

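# Convert the Series to a NumPy array before adding the column axis;
# recent pandas versions reject multi-dimensional indexing on a Series.
advertising_X = ad.data.tv.to_numpy()[:, np.newaxis]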
ad_X_train = advertising_X[:-20]
ad_X_test = advertising_X[-20:]

ad_y_train = ad.data.sales[:-20]
ad_y_test = ad.data.sales[-20:]

ln_reg = linear_model.LinearRegression()
ln_reg.fit(ad_X_train, ad_y_train)

f'Coefficients: {ln_reg.coef_}'
f'Intercept: {ln_reg.intercept_}'
mse = np.mean((ln_reg.predict(ad_X_test) - ad_y_test)**2)
f'Mean Squared Error: {mse:.2f}'
# score() returns the coefficient of determination (R^2) on the test set.
r2 = ln_reg.score(ad_X_test, ad_y_test)
f'R-Squared Value (test set): {r2:.2f}'



In [ ]:

    
fig = plt.figure('Sales vs TV Advertising', figsize=(8, 6),
                 facecolor='white', edgecolor='black')
rows, cols = (1, 1)
ax = plt.subplot2grid((rows, cols), (0, 0))

test_sort = np.argsort(ad_X_test.flatten())

ax.scatter(ad_X_test, ad_y_test, alpha=0.5, marker='d')
ax.plot(ad_X_test[test_sort], ln_reg.predict(ad_X_test)[test_sort],
        color='black', linestyle='--')

ax.set_title('Sales vs TV Advertising', fontsize=20)
ax.set_xlabel('TV Advertising', fontsize=14)
ax.set_ylabel('Sales', fontsize=14)

plt.show();

Findings

Statsmodels and scikit-learn both use ordinary least squares to perform the linear regression, but the coefficient and intercept values are slightly different.
- Statsmodels uses the entire data set to create the fit.
- Scikit-learn was fit on only the first 180 observations, because the last 20 were held out for testing. Its LinearRegression estimator also solves the least-squares problem directly rather than iteratively, so the different training data accounts for the gap.
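
The effect of fitting on a subset can be seen in isolation with the sketch below, which applies the same closed-form fit to all of the points and then to all but the last two. The data are made-up and the helper fit_line is hypothetical; no library solver is involved, so any difference comes purely from the training data.

```python
def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


# Hypothetical data: the last point departs from the earlier trend.
x_all = [1, 2, 3, 4, 5, 6]
y_all = [2, 4, 6, 8, 10, 20]

# Fit on everything, as the statsmodels cell does with ad.data.
slope_full, intercept_full = fit_line(x_all, y_all)

# Fit on all but the last two points, mirroring the [:-20] hold-out.
slope_sub, intercept_sub = fit_line(x_all[:-2], y_all[:-2])
```

With the same solver in both cases, the two fits still disagree because the held-out points never influence the second one.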

Additional Optional Exercises

Train / test split with RMSE calculation
Proper documentation for class methods and attributes
Build with NumPy methods and compare computation time
Multiple Linear Regression (SGD covered in Advanced Regression Unit)
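
As a starting point for the first optional exercise, the sketch below fits on a training slice and reports the root mean squared error (RMSE) on the held-out slice. The data and the helper names fit_line and rmse are hypothetical. The split here is sequential, like the [:-20] split used earlier; if the rows are ordered in some meaningful way, shuffling before the split would usually be appropriate.

```python
import math


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [3, 5, 7, 9, 12, 12]

n_test = 2  # hold out the last two observations
x_train, x_test = x[:-n_test], x[-n_test:]
y_train, y_test = y[:-n_test], y[-n_test:]

slope, intercept = fit_line(x_train, y_train)
test_predictions = [intercept + slope * xi for xi in x_test]
test_rmse = rmse(y_test, test_predictions)
```

Unlike the mean squared error, the RMSE is in the same units as the target, which makes it easier to compare against typical sales values.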