Ordinary least squares (OLS) is a common method for estimating the unknown parameters of a linear model. It minimizes the sum of squared errors (the sum of squared differences between observed and predicted response values) in order to produce the linear model whose parameters best fit the data. To illustrate this, the following figure shows a collection of data points (blue dots) and a line of best fit (solid blue line) determined using OLS, which is our linear model.
The data points fall above, below, and sometimes right on the line. The vertical distance between a data point and the line (at the point of our prediction) is the error. Two dotted red vertical lines in the figure show this error for two of the points.
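As a minimal sketch (the function and variable names here are hypothetical, not taken from the figure's code), the bivariate line of best fit can be computed directly from the data using the usual closed-form OLS formulas:

import numpy as np

def fit_line(x, y):
    # slope: covariance of x and y divided by the variance of x
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    # intercept: forces the line through the point of means
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

The pair (b0, b1) returned here is exactly the intercept and slope that minimize the sum of squared errors for the sample.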
Assumptions
The ordinary least squares method assumes several conditions to be true (don't worry too much about these right now):

(1) The relationship between the independent variables and the response is linear in the parameters.
(2) There is no perfect multicollinearity among the independent variables.
(3) The errors have a conditional mean of zero.
(4) The errors have constant variance (homoscedasticity).
(5) The errors are uncorrelated with one another.

Condition (2) is always true in the bivariate case, since there is only one independent variable (see next section).
The OLS Regression Equations
The simplest form of OLS is the bivariate case, where we have only one independent variable and the dependent variable:
$$y = b_0 + b_1x + \varepsilon$$

Beyond this we have the multivariate case, which we refer to as multiple regression. The multiple regression model can be written in the expanded form
$$y = b_0 + b_1X_1 + b_2X_2 + \dots + b_nX_n + \varepsilon$$

The interpretation is similar to the bivariate case, only now we have a parameter for every $X_i$, and instead of a line we have what is called a hyperplane. If we had only two independent variables we would have a regular plane. The compact version of this equation is the matrix form (from linear algebra)
$$\boldsymbol y = \boldsymbol X\boldsymbol b + \boldsymbol\varepsilon$$

where $\boldsymbol X\boldsymbol b$ represents matrix multiplication.
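As a rough sketch (toy data, not one of the datasets used later in this notebook), the matrix form can be solved directly with NumPy's least-squares routine:

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 2)                                        # two independent variables
y = 1.5 + 2.0*X[:, 0] - 3.0*X[:, 1] + 0.1*rng.randn(100)    # known parameters plus noise

# prepend a column of ones so b[0] plays the role of the intercept b_0
X_design = np.column_stack([np.ones(len(X)), X])

# least-squares solution of y = Xb + e
b, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(b)    # close to [1.5, 2.0, -3.0]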
A Notable Property
Examples
Figure (a) shows a random distribution of residuals with no heteroscedasticity -- a linear model is a good fit here.
In Figure (b) the residuals are not randomly distributed and they increase as the independent variable increases. This could indicate lurking variables that haven't been included in the model.
Figure (c) has non-random residuals that curve around the regression line. This can indicate the need for a polynomial fit (more on this later).
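A residual plot like the ones described above can be produced in a few lines (a sketch, assuming x and y are NumPy arrays like the ones generated later in this notebook):

import numpy as np
import matplotlib.pyplot as plt

b1, b0 = np.polyfit(x, y, 1)        # slope and intercept of the fitted line
residuals = y - (b0 + b1*x)         # observed minus predicted

plt.scatter(x, residuals, alpha=0.6)
plt.axhline(0, color='r', linestyle=':')
plt.xlabel('x')
plt.ylabel('residual');

seaborn's residplot produces essentially the same picture in a single call.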
R-squared: Also called the coefficient of determination, this is an often-used measure of fit that represents the proportion of variance in the response that can be accounted for by the model. The R-squared value ranges from 0 to 1.
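A hand-rolled computation of R-squared looks like this (a sketch; y and y_hat are hypothetical arrays of observed and predicted values):

import numpy as np

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares
    return 1 - ss_res / ss_tot               # proportion of variance explained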
If residual plots indicate non-linear data, we may be able to apply a transformation that makes the relationship more linear. Such transformations can enable us to use linear regression methods on non-linear data.
A couple of transformations
As you can see, in cases where the transformation is applied to the dependent variable, getting the prediction in terms of the original units requires applying the inverse of the original transform. Other transforms (such as quadratic) can be used. Perhaps the most important thing to note about these transformations is that they must maintain a linear form similar to $y' = b_0 + b_1x'$, where the $'$ indicates possibly transformed versions of the original variables. Without this form, we could not fit linear models.
It will usually not be clear exactly which transformation to apply, so trial and error is often needed to determine the best one to use.
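For example, here is a sketch of a log transformation applied to the dependent variable of some made-up exponential data (none of this comes from the datasets used below), with the predictions back-transformed into the original units:

import numpy as np

rng = np.random.RandomState(1)
xt = np.linspace(1, 10, 50)
yt = 2.0 * np.exp(0.4*xt) * rng.lognormal(sigma=0.1, size=50)   # non-linear (exponential) data

yt_log = np.log(yt)                   # transform: y' = log(y) is roughly linear in x
b1, b0 = np.polyfit(xt, yt_log, 1)    # fit a line in the transformed space

yt_hat = np.exp(b0 + b1*xt)           # invert the transform to get predictions in the original units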
In [59]:
e = np.random.randn(50)                               # standard normal noise
w = 3                                                 # true slope
x = np.random.rand(50)*np.random.randint(0,10,50)     # 50 random x values in [0, 10)
y = w*x + 2*e                                         # linear relationship plus noise
In [78]:
x
Out[78]:
In [100]:
x[41], y[41]
Out[100]:
In [105]:
sns.regplot(x=x, y=y, ci=None)                       # scatter plot with the OLS line of best fit
plt.plot((x[25], x[25]), (13, y[25]-0.3), 'r:');     # dotted line marking one point's error (endpoints hand-tuned for this draw)
plt.plot((x[41], x[41]), (y[41]+0.3, 14.5), 'r:');   # dotted line marking a second point's error
In [2]:
import pandas as pd
from pandas import DataFrame as DF, Series
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
gnbu = sns.color_palette('GnBu', 40)
sns.set_style('whitegrid')
In [3]:
sets = ['batting',
        'pitching',
        'player',
        'salary']
data = {}
# read each CSV into a dict of DataFrames, keyed by dataset name
for s in sets:
    file = s + '.csv'
    data[s] = pd.read_csv(file)
In [4]:
def summary(data):
    # data info
    print('DATA INFO \n')
    data.info()
    print(50*'-', '\n')
    # numeric summary
    print('NUMERIC \n')
    print(data.describe().T)
    print(50*'-', '\n')
    # categorical summary
    print('CATEGORICAL \n')
    print(data.describe(include=['O']).T)
In [5]:
# import pandas scatter_matrix to create pair-plots
from pandas.plotting import scatter_matrix
In [6]:
summary(data['batting'])
In [18]:
# features to plot
cols = list(data['batting'].loc[:1, 'r':'rbi'].columns) + ['team_id']
cols
Out[18]:
In [29]:
teams = data['batting'].team_id.value_counts().head(3).index                             # three most common teams
colors = {teams[0]: 'r', teams[1]: 'g', teams[2]: 'b'}                                   # one color per team
subset = data['batting'][data['batting'].team_id.isin(teams)][cols].sample(frac=0.15)    # 15% sample of those teams' rows
In [32]:
teams
Out[32]:
In [31]:
scatter_matrix(subset, c=subset.team_id.apply(lambda x: colors[x]), alpha=0.5, figsize=(14,14));