In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn import linear_model
from sklearn.metrics import explained_variance_score, mean_squared_error
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
import statsmodels.api as sm
In [2]:
from sklearn.datasets import load_boston
In [3]:
boston = load_boston()
In [4]:
print(boston.DESCR)
In [5]:
X = pd.DataFrame(boston.data)
X.columns = boston.feature_names
y = pd.DataFrame(boston.target)
y.columns = ['MEDV']
df = pd.concat([X,y], axis=1)
The first rows of this dataset look like this, the last column, MEDV, being the target (the price).
In [6]:
df.head()
Out[6]:
The seaborn package provides a very convenient way of plotting the distributions of the features and of the target:
In [7]:
fig = plt.figure(figsize=(18,18))
# Distribution of each feature
for i, col in enumerate(X.columns):
    ax = fig.add_subplot(4,4,i+1)
    sns.distplot(X[col], ax=ax)
# Distribution of the target, in red
ax = fig.add_subplot(4,4,len(X.columns)+2)
sns.distplot(y.MEDV, ax=ax, color='r');
We could already dig a little deeper into this dataset and its features, but that belongs to the next parts. For now, we will use them untouched and focus on our first simple prediction pipeline.
Let's define a function that goes through the whole prediction, post-processing and evaluation process, so that we can run it easily.
In this pipeline, we will:

- split the dataset into a training set and a test set. This is done in a very simple way for the moment: the test set corresponds to the last 100 samples (approx. 20%) of the dataset;
- fit a linear regression model on the training data;
- use this model to make a prediction on the test data;
- compute the mean squared error between the predicted and the true targets (see the formula right after this list);
- plot 1/ the predicted values vs the true values, 2/ the distribution of the residuals.
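For reference, with $n$ test samples, true targets $y_i$ and predicted targets $\hat{y}_i$, the mean squared error used as the evaluation metric is

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,$$

and the residuals plotted in the second panel are simply the differences $y_i - \hat{y}_i$.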
In the next parts of this case study we will address the limitations of this first pipeline, but it illustrates some important steps.
In [8]:
def run_regression(X, y):
    # Split the data into training/test sets
    X_train = X.values[:-100]
    X_test = X.values[-100:]
    # Split the targets into training/test sets
    y_train = y.values[:-100]
    y_test = y.values[-100:]
    # Create linear regression object
    regr = linear_model.LinearRegression()
    # Train the model using the training sets
    regr.fit(X_train, y_train)
    # Make predictions on the test set
    y_pred = regr.predict(X_test)
    # Compute the mean squared error
    mse = mean_squared_error(y_test, y_pred)
    # Plot outputs: predicted vs actual values, and residuals distribution
    fig = plt.figure(figsize=(12,6))
    ax = fig.add_subplot(121)
    ax.scatter(y_test, y_pred, color='black')
    ax.plot([0,40],[0,40], color='g')
    ax.set_xlim([0,40])
    ax.set_ylim([0,40])
    ax.set_xlabel('Actual value', fontsize=14)
    ax.set_ylabel('Predicted value', fontsize=14)
    ax.set_title('Mean squared error: {:.2f}'.format(mse), fontsize=16)
    ax2 = fig.add_subplot(122)
    sns.distplot(y_test-y_pred, ax=ax2, bins=15)
    ax2.set_xlim([-30,30])
    ax2.axvline(0, color='k', linestyle='dotted')
    ax2.set_title('Residuals distribution', fontsize=16)
In [9]:
run_regression(pd.DataFrame(X.RM),y)
The OLS() class from the statsmodels library gives extensive insight into the characteristics of an OLS regression:
In [10]:
olsy = y.values
olsX = X.RM.values
olsX = sm.add_constant(olsX)
model = sm.OLS(olsy, olsX)
results = model.fit()
print(results.summary())
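If we only need a few of these quantities programmatically rather than the full printed table, the fitted results object exposes them directly as attributes. A minimal sketch, reusing the results object fitted in the cell above:

# A few of the quantities from the summary, accessed directly on the
# fitted OLSResults object (same `results` as in the cell above)
print('R-squared: {:.3f}'.format(results.rsquared))
print('Coefficients (const, RM):', results.params)
print('p-values:', results.pvalues)
print('Condition number: {:.1f}'.format(results.condition_number))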
We can at least state that a simple linear regression on this single feature (RM) gives rather poor results. Let's now run the regression on all the features:
In [11]:
run_regression(X,y)
Once again, we can get a more detailed insight:
In [12]:
olsy = y.values
olsX = X.values
olsX = sm.add_constant(olsX)
model = sm.OLS(olsy, olsX)
results = model.fit()
print(results.summary())
statsmodels.OLS warns us about possible strong multicollinearity. This is a serious warning: remember that the assumption of full column rank for $X$ is at the core of the derivation of the OLS estimator.
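Recall that the OLS estimator is $\hat{\beta} = (X^\top X)^{-1} X^\top y$, so near-collinear columns make $X^\top X$ close to singular. One common diagnostic (not part of the original pipeline, just a sketch using statsmodels) is the variance inflation factor of each feature:

# Sketch: variance inflation factors (VIF) as a rough multicollinearity check.
# A VIF well above 10 is commonly read as a sign of strong multicollinearity.
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_X = sm.add_constant(X)  # include the intercept column, as in the OLS fit above
vifs = pd.Series(
    [variance_inflation_factor(vif_X.values, i) for i in range(1, vif_X.shape[1])],
    index=X.columns,
)
print(vifs.round(2))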
In the next parts of this case study we will, at least, have to: