Jackie Zuker
The model in use has problems with heteroscedasticity and multivariate non-normality.
In [16]:
import math
import warnings
from IPython.display import display
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import linear_model
# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format
# Suppress annoying harmless error.
warnings.filterwarnings(
    action="ignore",
    module="scipy",
    message="^internal gelsd"
)
In [17]:
# Acquire, load, and preview the data.
data = pd.read_csv(
    'http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv',
    index_col=0
)
display(data.head())
# Instantiate and fit our model.
regr = linear_model.LinearRegression()
Y = data['Sales'].values.reshape(-1, 1)
X = data[['TV','Radio','Newspaper']]
regr.fit(X, Y)
# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
print(regr.score(X, Y))
In [29]:
# Extract predicted values.
predicted = regr.predict(X).ravel()
actual = data['Sales']
# Calculate the error, also called the residual.
residual = actual - predicted
sns.set_style("whitegrid")
# This looks a bit concerning.
plt.hist(residual)
plt.title('Residual counts (skewed to the right)')
plt.xlabel('Residual')
plt.ylabel('Count')
plt.show()
plt.scatter(predicted, residual)
plt.xlabel('Predicted')
plt.ylabel('Residual')
plt.axhline(y=0)
plt.title('Residual vs. Predicted (heteroscedasticity present)')
plt.show()
As shown above, the error from the model is not normally distributed. The residuals are skewed to the right, much like the raw data itself.
Additionally, the spread of the error terms is not consistent across predicted values; the residuals are heteroscedastic.
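The visual impression from the plots can be backed up with a quick numeric check. The sketch below measures the skew of the residuals and runs a D'Agostino-Pearson normality test; it assumes scipy is available, which it should be since it is referenced in the warnings filter above.
In [ ]:
from scipy import stats
# Skew of the residuals: a positive value indicates a right-skewed distribution.
print('Residual skew: {:.3f}'.format(stats.skew(residual)))
# D'Agostino-Pearson normality test: a very small p-value rejects normality.
stat, p = stats.normaltest(residual)
print('Normality test statistic: {:.3f}, p-value: {:.3g}'.format(stat, p))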
Since the data is skewed to the right, we transform it by taking its square root to see whether we can obtain a more normal distribution.
In [30]:
plt.hist(actual)
plt.show()
In [31]:
sqrt_actual = np.sqrt(actual)
In [33]:
plt.hist(sqrt_actual)
plt.show()
That's a little better. Has this helped the multivariate normality? To check, we refit the model on the square-root-transformed target and re-examine the residuals.
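As a quick numeric complement to the histograms, we can also compare the skewness of the target before and after the transform; a minimal sketch using scipy.stats:
In [ ]:
from scipy import stats
# Skewness closer to zero after the transform means a more symmetric distribution.
print('Skew of Sales:       {:.3f}'.format(stats.skew(actual)))
print('Skew of sqrt(Sales): {:.3f}'.format(stats.skew(sqrt_actual)))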
In [26]:
# Refit the model on the square-root-transformed target.
sqrt_regr = linear_model.LinearRegression()
sqrt_regr.fit(X, sqrt_actual)
# Extract predicted values on the transformed scale.
sqrt_predicted = sqrt_regr.predict(X)
# Calculate the error, also called the residual.
corr_residual = sqrt_actual - sqrt_predicted
plt.hist(corr_residual)
plt.title('Residual counts')
plt.xlabel('Residual')
plt.ylabel('Count')
plt.show()
In [34]:
plt.scatter(sqrt_predicted, corr_residual)
plt.xlabel('Predicted')
plt.ylabel('Residual')
plt.axhline(y=0)
plt.title('Residual vs. Predicted')
plt.show()
There is a big improvement here as well: the residuals are more symmetric and their spread is more consistent across predicted values. Success!
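The visual comparison can also be quantified. One rough check, in the spirit of the Breusch-Pagan test, is to regress the squared residuals on the predictors: the lower the R-squared of that auxiliary regression, the less the error variance depends on the inputs. A minimal sketch using the same scikit-learn estimator already imported above:
In [ ]:
# Auxiliary regressions of squared residuals on the predictors.
# A smaller R-squared suggests less heteroscedasticity.
aux = linear_model.LinearRegression()
print('R-squared, squared original residuals:    {:.3f}'.format(
    aux.fit(X, residual ** 2).score(X, residual ** 2)))
print('R-squared, squared transformed residuals: {:.3f}'.format(
    aux.fit(X, corr_residual ** 2).score(X, corr_residual ** 2)))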