Engineering Existing Data to Follow Multivariate Linear Regression Assumptions

Jackie Zuker

Assumptions:

  1. Linear relationship - Features should have a linear relationship with the outcome
  2. Multivariate normality - The error from the model should be normally distributed
  3. Homoscedasticity - The distribution of error should be consistent for all predicted values
  4. Low multicollinearity - correlations between features should be low or non-existent

The model in use has problems with heteroscedasticity and multivariate non-normality.
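Assumption 4 is not examined in the cells that follow, but it is cheap to check. Here is a minimal sketch that inspects pairwise feature correlations, assuming a pandas DataFrame of features like the X built below:

import pandas as pd

def correlation_check(features):
    """Return absolute pairwise correlations between features.

    Off-diagonal values near 1 flag problematic multicollinearity.
    """
    return features.corr().abs()

# Example usage once X (the TV/Radio/Newspaper DataFrame) exists:
# display(correlation_check(X))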


In [16]:
import math
import warnings

from IPython.display import display
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import linear_model
#import statsmodels.formula.api as smf
#import statsmodels as smf

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

# Suppress annoying harmless error.
warnings.filterwarnings(
    action="ignore",
    module="scipy",
    message="^internal gelsd"
)

In [17]:
# Acquire, load, and preview the data.
data = pd.read_csv(
    'http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv',
    index_col=0
)
display(data.head())

# Instantiate and fit our model.
regr = linear_model.LinearRegression()
Y = data['Sales'].values.reshape(-1, 1)
X = data[['TV','Radio','Newspaper']]
regr.fit(X, Y)

# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
print(regr.score(X, Y))


        TV   Radio  Newspaper   Sales
1  230.100  37.800     69.200  22.100
2   44.500  39.300     45.100  10.400
3   17.200  45.900     69.300   9.300
4  151.500  41.300     58.500  18.500
5  180.800  10.800     58.400  12.900
Coefficients: 
 [[ 0.04576465  0.18853002 -0.00103749]]

Intercept: 
 [ 2.93888937]

R-squared:
0.897210638179

In [29]:
# Extract predicted values.
predicted = regr.predict(X).ravel()
actual = data['Sales']

# Calculate the error, also called the residual.
residual = actual - predicted

sns.set_style("whitegrid")

# This looks a bit concerning.
plt.hist(residual)
plt.title('Residual counts (skewed to the right)')
plt.xlabel('Residual')
plt.ylabel('Count')
plt.show()

plt.scatter(predicted, residual)
plt.xlabel('Predicted')
plt.ylabel('Residual')
plt.axhline(y=0)
plt.title('Residual vs. Predicted (heteroscedasticity present)')
plt.show()


As shown above, the error from the model is not normally distributed: it is skewed to the right, much like the raw data itself.

Additionally, the spread of the error terms is not consistent across predicted values; the model is heteroscedastic.
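The visual impression can be backed up with a formal test. A minimal sketch, assuming SciPy is available and using the residual series computed above:

# D'Agostino-Pearson normality test on the residuals; a small p-value
# means the residuals are unlikely to come from a normal distribution.
from scipy import stats

stat, p_value = stats.normaltest(residual)
print('Normality test p-value: {:.4f}'.format(p_value))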

Inspect the Data and Transform

The outcome variable, Sales, is skewed to the right. We transform it by taking its square root to see whether we can obtain a more normal distribution.
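As a rough numeric companion to the histograms below, the sample skewness can be compared before and after the transform. A minimal sketch using pandas' skew() on the actual series defined earlier:

# Compare skewness of the raw and square-root-transformed outcome.
# A value near 0 indicates a roughly symmetric distribution.
print('Skew of Sales:       {:.3f}'.format(actual.skew()))
print('Skew of sqrt(Sales): {:.3f}'.format(np.sqrt(actual).skew()))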


In [30]:
plt.hist(actual)
plt.show()



In [31]:
sqrt_actual = np.sqrt(actual)

In [33]:
plt.hist(sqrt_actual)
plt.show()


That's a little better. Has this helped the multivariate normality? To check, the model is refit on the square-root-transformed outcome and its residuals are inspected.


In [26]:
# Refit the model on the square-root-transformed outcome so that
# predictions and residuals are on the same scale as sqrt(Sales).
regr_sqrt = linear_model.LinearRegression()
regr_sqrt.fit(X, sqrt_actual)
sqrt_predicted = regr_sqrt.predict(X).ravel()

# Calculate the error, also called the residual.
corr_residual = sqrt_actual - sqrt_predicted

plt.hist(corr_residual)
plt.title('Residual counts')
plt.xlabel('Residual')
plt.ylabel('Count')
plt.show()


Taking the square root of the outcome lessened the right skew and allowed the error from the model to be more normally distributed. Let's see if the transformation also helped the problem with heteroscedasticity.

Homoscedasticity


In [34]:
plt.scatter(sqrt_predicted, corr_residual)
plt.xlabel('Predicted')
plt.ylabel('Residual')
plt.axhline(y=0)
plt.title('Residual vs. Predicted')
plt.show()


There was a big improvement here as well. Success!
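For a more formal confirmation, a Breusch-Pagan test can be run on the residuals of the transformed model. A minimal sketch, assuming statsmodels is installed (its import is only commented out above):

# Breusch-Pagan heteroscedasticity test; a small p-value indicates that
# residual variance still depends on the features.
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

exog = sm.add_constant(X)  # features plus an intercept column
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(corr_residual, exog)
print('Breusch-Pagan p-value: {:.4f}'.format(lm_pvalue))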

