This notebook demonstrates how to conduct a valid regression analysis using a combination of the scikit-learn and statsmodels libraries. While scikit-learn is popular and powerful from an operational point of view, it does not provide the detailed metrics required to statistically analyze your model, evaluate the importance of predictors, or build and simplify your model. We use other libraries like statsmodels or scipy.stats to bridge this gap.
Scikit-learn is one of the science kits for the SciPy stack. Scikit has a collection of prediction and learning algorithms, grouped into categories such as classification, regression, clustering, and dimensionality reduction. Each algorithm follows a consistent pattern with fit and predict methods. In addition, you get a set of utility methods that help with splitting datasets into train-test sets and with validating the outputs.
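Every estimator is used the same way. A minimal sketch of the pattern on toy data (make_regression here only generates illustrative data; it is not part of this analysis):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# toy data, only to illustrate the fit/predict pattern
X_toy, y_toy = make_regression(n_samples=100, n_features=3, noise=10)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33)

model = LinearRegression()         # 1. instantiate an estimator
model.fit(X_tr, y_tr)              # 2. learn parameters from the training split
predictions = model.predict(X_te)  # 3. predict on the held-out split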
In [3]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
In [4]:
usa_house = pd.read_csv('../udemy_ml_bootcamp/Machine Learning Sections/Linear-Regression/USA_housing.csv')
usa_house.head(5)
Out[4]:
In [4]:
usa_house.info()
In [5]:
usa_house.describe()
Out[5]:
Find the correlation between each of the numerical columns and the house price.
In [6]:
sns.pairplot(usa_house)
Out[6]:
From this chart, we note that Price appears approximately normally distributed and shows a visible linear relationship with some of the predictors. Let us confirm with a distribution plot and a correlation heatmap.
In [7]:
fig, axs = plt.subplots(1,2, figsize=[15,5])
sns.distplot(usa_house['Price'], ax=axs[0])
sns.heatmap(usa_house.corr(), ax=axs[1], annot=True)
fig.tight_layout()
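To read the correlations against Price directly, a quick sketch that ranks them (corr() defaults to Pearson; on newer pandas you may need to pass numeric_only=True so the non-numeric Address column is skipped):

usa_house.corr()['Price'].sort_values(ascending=False)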
In [8]:
usa_house.columns
Out[8]:
In [9]:
X = usa_house[['Avg Income', 'Avg House Age', 'Avg. Number of Rooms',
'Avg. Number of Bedrooms', 'Area Population']]
y = usa_house[['Price']]
In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
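Note that train_test_split shuffles the rows randomly, so the split (and every metric computed below) will vary between runs. A sketch of a reproducible variant, with 42 as an arbitrary seed:

# fixing random_state pins the shuffle, making results repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)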
In [11]:
len(X_train)
Out[11]:
In [12]:
len(X_test)
Out[12]:
In [13]:
X_test.head()
Out[13]:
In [14]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
In [15]:
lm.fit(X_train, y_train) #no need to capture the return. All is stored in lm
Out[15]:
Create a table showing the coefficient (influence) of each of the columns
In [16]:
cdf = pd.DataFrame(lm.coef_[0], index=X_train.columns, columns=['coefficients'])
cdf
Out[16]:
Note, the coefficients for house age and number of rooms are pretty large. However, that does not really mean they are more influential than income. It is simply because our dataset has not been normalized and the data ranges of these columns vary widely.
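We can verify that the raw scales differ widely with a quick sketch:

X_train.std()  # standard deviation of each predictor, on its original scale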
In [17]:
y_predicted = lm.predict(X_test)
len(y_predicted)
Out[17]:
In [18]:
plt.scatter(y_test, y_predicted) #actual vs predicted
Out[18]:
In [19]:
sns.distplot((y_test - y_predicted))
Out[19]:
In [20]:
from sklearn import metrics
metrics.mean_absolute_error(y_test, y_predicted)
Out[20]:
RMSE
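For reference, the root mean squared error is defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i} - \hat{y}_{i}\right)^{2}}$$

which, unlike the mean absolute error above, penalizes large residuals more heavily.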
In [21]:
import numpy
numpy.sqrt(metrics.mean_squared_error(y_test, y_predicted))
Out[21]:
In [22]:
X_test['predicted_price'] = y_predicted
In [23]:
X_test.head()
Out[23]:
As seen earlier, even though sklearn will perform the regression, it is hard to compare which of the predictor variables are influential in determining the house price. To answer this better, let us standardize our data to a standard normal distribution (zero mean, unit variance) using sklearn preprocessing.
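StandardScaler transforms each column $x$ into a z-score

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the column's mean and standard deviation, so every column ends up with zero mean and unit variance.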
In [24]:
from sklearn.preprocessing import StandardScaler
s_scaler = StandardScaler()
In [25]:
# get all columns except 'Address' which is non numeric
usa_house.columns[:-1]
Out[25]:
In [26]:
usa_house_scaled = s_scaler.fit_transform(usa_house[usa_house.columns[:-1]])
usa_house_scaled = pd.DataFrame(usa_house_scaled, columns=usa_house.columns[:-1])
usa_house_scaled.head()
Out[26]:
In [37]:
usa_house_scaled.describe().round(3) # round the numbers for display
Out[37]:
In [27]:
X_scaled = usa_house_scaled[['Avg Income', 'Avg House Age', 'Avg. Number of Rooms',
'Avg. Number of Bedrooms', 'Area Population']]
y_scaled = usa_house_scaled[['Price']]
In [28]:
Xscaled_train, Xscaled_test, yscaled_train, yscaled_test = \
train_test_split(X_scaled, y_scaled, test_size=0.33)
In [29]:
lm_scaled = LinearRegression()
lm_scaled.fit(Xscaled_train, yscaled_train)
Out[29]:
In [30]:
cdf_scaled = pd.DataFrame(lm_scaled.coef_[0],
index=Xscaled_train.columns, columns=['coefficients'])
cdf_scaled
Out[30]:
In [59]:
lm_scaled.intercept_
Out[59]:
From the table above, we notice Avg Income has more influence on the Price than the other variables. This was not apparent before scaling the data. Further, this corroborates the correlation matrix produced during exploratory data analysis.
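Since the predictors now share a common scale, their coefficients can be ranked directly. A small sketch:

# rank predictors by the magnitude of their standardized coefficients
cdf_scaled['coefficients'].abs().sort_values(ascending=False)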
In [33]:
import statsmodels.api as sm
import statsmodels
In [32]:
from statsmodels.regression import linear_model
In [34]:
yscaled_train.shape
Out[34]:
In [35]:
Xscaled_train = sm.add_constant(Xscaled_train) # statsmodels does not add an intercept unless we do
sm_ols = linear_model.OLS(yscaled_train, Xscaled_train) # note: statsmodels takes (y, X), the reverse of sklearn's order
sm_model = sm_ols.fit()
In [36]:
sm_model.summary()
Out[36]:
The regression coefficients are identical between the sklearn and statsmodels libraries. The $R^{2}$ of 0.919 is as high as it gets. This indicates the predicted (train) Price varies in step with the actual. Another measure of health is the S (std. error) and p-value of the coefficients. The S of Avg. Number of Bedrooms is as low as that of the other predictors; however, it has a high p-value, indicating low confidence in the estimate of its coefficient. The same applies to the p-value of the intercept.
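The pieces of the summary table are also available programmatically on the fitted results object; a sketch that collects them into one DataFrame (params, bse, and pvalues are standard attributes of statsmodels regression results):

pd.DataFrame({'coef': sm_model.params,
              'std err': sm_model.bse,
              'p-value': sm_model.pvalues})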
In [37]:
yscaled_predicted = lm_scaled.predict(Xscaled_test)
residuals_scaled = yscaled_test - yscaled_predicted
In [42]:
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
axs[0][0].scatter(x=yscaled_test, y=yscaled_predicted)
axs[0][0].set_title('Actual vs Predicted')
axs[0][0].set_xlabel('Price - test')
axs[0][0].set_ylabel('Price - predicted')
axs[0][1].scatter(x=yscaled_test, y=residuals_scaled)
axs[0][1].hlines(0, xmin=-3, xmax=3)
axs[0][1].set_title('Fitted vs Residuals')
axs[0][1].set_xlabel('Price - test (fitted)')
axs[0][1].set_ylabel('Residuals')
from numpy import random
# hand-rolled QQ plot: sorted std. normal samples vs sorted residuals
axs[1][0].scatter(x=sorted(random.randn(len(residuals_scaled))),
                  y=sorted(residuals_scaled['Price']))
axs[1][0].set_title('QQ plot of Residuals')
axs[1][0].set_xlabel('Std. normal z scores')
axs[1][0].set_ylabel('Residuals')
sns.distplot(residuals_scaled, ax=axs[1][1])
axs[1][1].set_title('Histogram of residuals')
plt.tight_layout()
From the charts above, the residuals are centered around the 0 mean line and appear approximately normally distributed. There are no patterns evident, which means our model does not leak any systematic phenomena into the residuals (errors). Thus all the regression assumptions hold good.
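As an aside, the hand-rolled QQ plot above can also be produced with the qqplot helper that statsmodels ships; a minimal sketch (line='s' overlays a line fit to the sample quantiles):

sm.qqplot(residuals_scaled['Price'], line='s')
plt.show()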
In [45]:
Xscaled_train.columns
Out[45]:
In [55]:
usa_house_fitted = Xscaled_test.copy() # all scaled feature columns
usa_house_fitted['Price'] = yscaled_test
usa_house_fitted.head()
Out[55]:
In [59]:
usa_house_fitted_inv = s_scaler.inverse_transform(usa_house_fitted)
usa_house_fitted_inv = pd.DataFrame(usa_house_fitted_inv,
columns=usa_house_fitted.columns)
usa_house_fitted_inv.head().round(3)
Out[59]:
In [65]:
# inverse-transform the predicted price manually; 'Price' is the scaler's last column
yinv_predicted = (yscaled_predicted * s_scaler.scale_[-1]) + s_scaler.mean_[-1]
In [66]:
yinv_predicted.shape
Out[66]:
In [67]:
usa_house_fitted_inv['Price predicted'] = yinv_predicted
usa_house_fitted_inv.head().round(3)
Out[67]:
In [68]:
mse_scaled = metrics.mean_squared_error(usa_house_fitted_inv['Price'],
usa_house_fitted_inv['Price predicted'])
numpy.sqrt(mse_scaled)
Out[68]:
In [69]:
mae_scaled = metrics.mean_absolute_error(usa_house_fitted_inv['Price'],
usa_house_fitted_inv['Price predicted'])
mae_scaled
Out[69]:
In this sample we observed two methods of predicting housing prices. The first involved applying linear regression on the dataset directly. The second involved scaling the features to a standard normal distribution and applying a linear model using both the sklearn and statsmodels packages. We thoroughly inspected the model parameters and vetted that the regression assumptions hold good.
In the end, the accuracy of the models did not increase much after scaling. However, we were able to better determine which predictor variables were influential in the truest sense, without being biased by the scale of the units.
The allure of linear models is their explainability. They may not be the best when it comes to accurate predictions; however, they help us answer basic questions better, such as "which characteristics influence the cost of my home: is it the number of bedrooms or the average income of the residents?". The answer is the latter in this example.