Classification: discrete output
Regression: continuous output
Both are supervised learning
More similar than they appear
Note that the optional watermark extension is a small IPython notebook plugin that I developed to make the code reproducible. You can just skip the following line(s).
In [1]:
%load_ext watermark
%watermark -a '' -u -d -v -p numpy,pandas,matplotlib,sklearn,seaborn
The use of watermark is optional. You can install this IPython extension via "pip install watermark". For more information, please see: https://github.com/rasbt/watermark.
In [2]:
from IPython.display import Image
%matplotlib inline
# version check, since some APIs changed in scikit-learn 0.18
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version
Model: $ y = \sum_{i=0}^n w_i x_i = \mathbf{w}^T \mathbf{x} $ with $x_0 = 1$.
Given a collection of sample data $\{\mathbf{x^{(i)}}, y^{(i)} \}$, find the line $\mathbf{w}$ that minimizes the regression error: $$ \begin{align} L(X, Y, \mathbf{w}) = \sum_i \left( y^{(i)} - \hat{y}^{(i)} \right)^2 = \sum_i \left( y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)} \right)^2 \end{align} $$
Simplest case, a single feature: $ y = w_0 + w_1 x $
Source: https://archive.ics.uci.edu/ml/datasets/Housing
Boston suburbs
Attributes (1-13) and target (14):
1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT: % lower status of the population
14. MEDV: Median value of owner-occupied homes in $1000's
In [3]:
import pandas as pd
# online dataset
data_src = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
# local dataset (overrides the online source above; comment out to fetch online)
data_src = '../datasets/housing/housing.data'

df = pd.read_csv(data_src,
                 header=None,
                 sep=r'\s+')  # whitespace-separated values
df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS',
'NOX', 'RM', 'AGE', 'DIS', 'RAD',
'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
df.head()
In [4]:
print(df.shape)
If the link to the Housing dataset provided above does not work for you, you can find a local copy in this repository at ./../datasets/housing/housing.data.
Or you could fetch it via
In [5]:
if False:
    df = pd.read_csv('https://raw.githubusercontent.com/1iyiwei/pyml/master/code/datasets/housing/housing.data',
                     header=None, sep=r'\s+')
    df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS',
                  'NOX', 'RM', 'AGE', 'DIS', 'RAD',
                  'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
    df.head()
Use scatter plots to visualize the correlations between pairs of features.
In the seaborn pairplot below, the plots on the diagonal are histograms of the individual features.
In [6]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid', context='notebook')
cols = ['LSTAT', 'INDUS', 'NOX', 'RM', 'MEDV']
sns.pairplot(df[cols], size=2.5)  # size= was renamed to height= in newer seaborn
plt.tight_layout()
# plt.savefig('./figures/scatter.png', dpi=300)
plt.show()
Some observations: RM and MEDV appear roughly linearly (positively) related; LSTAT and MEDV show a nonlinear, decreasing trend; MEDV appears clipped at its maximum value.
"You can observe a lot by just watching" - Yogi Berra
Scientific pipeline
Correlation: a single number to summarize the visual trends.
The correlation $r$ between a pair of underlying variables $x$ and $y$ is estimated from their samples $\{x^{(i)}, y^{(i)}\}$, $i = 1, \dots, n$:
$$ \begin{align} r &= \frac{\rho_{xy}}{\rho_x \rho_y} \\ \rho_{xy} &= \sum_{i=1}^n \left( x^{(i)} - \mu_x \right) \left( y^{(i)} - \mu_y \right) \\ \rho_x &= \sqrt{\sum_{i=1}^{n} \left( x^{(i)} - \mu_x \right)^2} \\ \rho_y &= \sqrt{\sum_{i=1}^{n} \left( y^{(i)} - \mu_y \right)^2} \end{align} $$
$\mu$: mean
$\rho_x$, $\rho_y$: standard deviations (unnormalized; the $\frac{1}{n}$ factors cancel in the ratio)
$\rho_{xy}$: covariance (likewise unnormalized)
$r \in [-1, 1]$
In [7]:
import numpy as np
# compute correlation
cm = np.corrcoef(df[cols].values.T)
# visualize correlation matrix
sns.set(font_scale=1.5)
hm = sns.heatmap(cm,
cbar=True,
annot=True,
square=True,
fmt='.2f',
annot_kws={'size': 15},
yticklabels=cols,
xticklabels=cols)
# plt.tight_layout()
# plt.savefig('./figures/corr_mat.png', dpi=300)
plt.show()
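As a sanity check, the heatmap entries can be recomputed from the definitions above; a minimal sketch for the RM-MEDV pair, reusing df from the earlier cells:

# recompute r for one pair (RM vs. MEDV) from the definitions above
x, t = df['RM'].values, df['MEDV'].values
rho_xy = np.sum((x - x.mean()) * (t - t.mean()))  # unnormalized covariance
rho_x = np.sqrt(np.sum((x - x.mean())**2))        # unnormalized std
rho_t = np.sqrt(np.sum((t - t.mean())**2))
print(rho_xy / (rho_x * rho_t))                   # should match the heatmap entry
print(np.corrcoef(x, t)[0, 1])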
Observations: LSTAT has the strongest correlation with MEDV (-0.74, negative), followed by RM (0.70, positive).
Thus RM or LSTAT can be good candidates for linear regression.
In [8]:
sns.reset_orig()
%matplotlib inline
Model: $ y = \sum_{i=0}^n w_i x_i = \mathbf{w}^T \mathbf{x} $ with $x_0 = 1$.
Given a collection of sample data $\{\mathbf{x^{(i)}}, y^{(i)} \}$, find the line $\mathbf{w}$ that minimizes the regression error: $$ \begin{align} L(X, Y, \mathbf{w}) = \frac{1}{2} \sum_i \left( y^{(i)} - \hat{y}^{(i)} \right)^2 = \frac{1}{2} \sum_i \left( y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)} \right)^2 \end{align} $$
As usual, the $\frac{1}{2}$ term is for convenience of differentiation, to cancel out the square terms: $$ \begin{align} \frac{1}{2} \frac{d x^2}{dx} = x \end{align} $$
This is called ordinary least squares (OLS).
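Taking the gradient of $L$ with respect to $\mathbf{w}$ gives the update rule that the gradient-descent code below implements ($\eta$ is the learning rate):
$$ \begin{align} \frac{\partial L}{\partial \mathbf{w}} &= -\sum_i \left( y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)} \right) \mathbf{x}^{(i)} \\ \mathbf{w} &\leftarrow \mathbf{w} + \eta \sum_i \left( y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)} \right) \mathbf{x}^{(i)} \end{align} $$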
In [9]:
class LinearRegressionGD(object):
    """Linear regression via batch gradient descent."""
    def __init__(self, eta=0.001, n_iter=20):
        self.eta = eta        # learning rate
        self.n_iter = n_iter  # number of epochs

    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])  # weights; w_[0] is the bias
        self.cost_ = []
        for i in range(self.n_iter):
            output = self.net_input(X)
            errors = (y - output)
            # gradient descent update: w <- w + eta * X^T (y - y_hat)
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            cost = (errors**2).sum() / 2.0  # SSE / 2, matching the loss above
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        return self.net_input(X)
In [10]:
X = df[['RM']].values
y = df['MEDV'].values
In [11]:
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
sc_y = StandardScaler()
X_std = sc_x.fit_transform(X)
#y_std = sc_y.fit_transform(y) # deprecation warning
#y_std = sc_y.fit_transform(y.reshape(-1, 1)).ravel()
y_std = sc_y.fit_transform(y[:, np.newaxis]).flatten()
In [12]:
lr = LinearRegressionGD()
_ = lr.fit(X_std, y_std)
In [13]:
plt.plot(range(1, lr.n_iter+1), lr.cost_)
plt.ylabel('SSE')
plt.xlabel('Epoch')
plt.tight_layout()
# plt.savefig('./figures/cost.png', dpi=300)
plt.show()
The optimization converges after about 5 epochs.
In [14]:
def lin_regplot(X, y, model):
plt.scatter(X, y, c='lightblue')
plt.plot(X, model.predict(X), color='red', linewidth=2)
return
In [15]:
lin_regplot(X_std, y_std, lr)
plt.xlabel('Average number of rooms [RM] (standardized)')
plt.ylabel('Price in $1000\'s [MEDV] (standardized)')
plt.tight_layout()
# plt.savefig('./figures/gradient_fit.png', dpi=300)
plt.show()
The red line confirms the positive correlation between median prices and the number of rooms.
But there are some oddities, such as the row of points at the ceiling (MEDV $\simeq 3$ after standardization, i.e. the cap of MEDV = 50 in the raw data), which indicates clipping.
In [16]:
print('Slope: %.3f' % lr.w_[1])
print('Intercept: %.3f' % lr.w_[0])
The correlation computed earlier was 0.70, which matches the slope: for standardized data, the OLS slope equals the Pearson correlation.
The intercept should be 0 for standardized (zero-mean) data.
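Both claims can be checked numerically; a minimal sketch, reusing X_std, y_std, and lr from the cells above:

# for standardized data, the OLS slope equals the Pearson correlation
print(np.corrcoef(X_std.flatten(), y_std)[0, 1])  # ~0.70, close to lr.w_[1]
print(lr.w_[0])  # intercept stays at (numerically) 0 for zero-mean data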
In [17]:
# use inverse transform to report back the original values
# num_rooms_std = sc_x.transform([[5.0]])
num_rooms_std = sc_x.transform(np.array([[5.0]]))
price_std = lr.predict(num_rooms_std)
print("Price in $1000's: %.3f" % sc_y.inverse_transform(price_std))
In [18]:
from sklearn.linear_model import LinearRegression
In [19]:
slr = LinearRegression()
slr.fit(X, y) # no need for standardization
y_pred = slr.predict(X)
print('Slope: %.3f' % slr.coef_[0])
print('Intercept: %.3f' % slr.intercept_)
In [20]:
lin_regplot(X, y, slr)
plt.xlabel('Average number of rooms [RM]')
plt.ylabel('Price in $1000\'s [MEDV]')
plt.tight_layout()
# plt.savefig('./figures/scikit_lr_fit.png', dpi=300)
plt.show()
Normal equations: an alternative that computes the solution analytically, without any iteration: $\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y}$
In [21]:
# add a column of ones for the bias term x_0 = 1
Xb = np.hstack((np.ones((X.shape[0], 1)), X))
# w = (Xb^T Xb)^{-1} Xb^T y
z = np.linalg.inv(np.dot(Xb.T, Xb))
w = np.dot(z, np.dot(Xb.T, y))
print('Slope: %.3f' % w[1])
print('Intercept: %.3f' % w[0])
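As a side note, explicitly inverting $X^T X$ can be numerically unstable; a least-squares solver or the pseudo-inverse is usually preferred. A minimal sketch giving the same result:

# numerically safer alternatives to the explicit inverse
w_lstsq = np.linalg.lstsq(Xb, y, rcond=None)[0]
w_pinv = np.linalg.pinv(Xb).dot(y)
print(w_lstsq, w_pinv)  # both should match w above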
Linear regression is sensitive to outliers
It is not always easy to decide which data samples are outliers
RANSAC (random sample consensus) can deal with this
Basic idea: repeatedly fit the model to a random subset of the samples, label the remaining samples as inliers or outliers by their residuals against that fit, and keep the model with the largest inlier consensus
Can work with different base regressors
In [22]:
from sklearn.linear_model import RANSACRegressor
if Version(sklearn_version) < '0.18':
ransac = RANSACRegressor(LinearRegression(),
max_trials=100,
min_samples=50,
residual_metric=lambda x: np.sum(np.abs(x), axis=1),
residual_threshold=5.0,
random_state=0)
else:
    ransac = RANSACRegressor(LinearRegression(),
                             max_trials=100,
                             min_samples=50,
                             loss='absolute_loss',  # renamed to 'absolute_error' in newer scikit-learn
                             residual_threshold=5.0,
                             random_state=0)
ransac.fit(X, y)
inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)
line_X = np.arange(3, 10, 1)
line_y_ransac = ransac.predict(line_X[:, np.newaxis])
plt.scatter(X[inlier_mask], y[inlier_mask],
c='blue', marker='o', label='Inliers')
plt.scatter(X[outlier_mask], y[outlier_mask],
c='lightgreen', marker='s', label='Outliers')
plt.plot(line_X, line_y_ransac, color='red')
plt.xlabel('Average number of rooms [RM]')
plt.ylabel('Price in $1000\'s [MEDV]')
plt.legend(loc='upper left')
plt.tight_layout()
# plt.savefig('./figures/ransac_fit.png', dpi=300)
plt.show()
In [23]:
print('Slope: %.3f' % ransac.estimator_.coef_[0])
print('Intercept: %.3f' % ransac.estimator_.intercept_)
In [24]:
# training/test data split as usual
if Version(sklearn_version) < '0.18':
from sklearn.cross_validation import train_test_split
else:
from sklearn.model_selection import train_test_split
# use all features
X = df.iloc[:, :-1].values
y = df['MEDV'].values
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=0)
In [25]:
slr = LinearRegression()
slr.fit(X_train, y_train)
y_train_pred = slr.predict(X_train)
y_test_pred = slr.predict(X_test)
In [26]:
# plot the residuals: difference between prediction and ground truth
plt.scatter(y_train_pred, y_train_pred - y_train,
c='blue', marker='o', label='Training data')
plt.scatter(y_test_pred, y_test_pred - y_test,
c='lightgreen', marker='s', label='Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend(loc='upper left')
plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red')
plt.xlim([-10, 50])
plt.tight_layout()
# plt.savefig('./figures/slr_residuals.png', dpi=300)
plt.show()
A perfect regression would have zero residuals (the red line).
A good regression has residuals scattered randomly and uniformly around that red line.
Other patterns indicate potential problems, such as nonlinearity or outliers.
In [27]:
# plot residuals against true target values
plt.scatter(y_train, y_train_pred - y_train,
c='blue', marker='o', label='Training data')
plt.scatter(y_test, y_test_pred - y_test,
c='lightgreen', marker='s', label='Test data')
plt.xlabel('Real values')
plt.ylabel('Residuals')
plt.legend(loc='best')
plt.hlines(y=0, xmin=-5, xmax=55, lw=2, color='red')
plt.xlim([-5, 55])
plt.tight_layout()
# plt.savefig('./figures/slr_residuals.png', dpi=300)
plt.show()
For $n$ data samples with prediction $y$ and ground truth $t$, the mean squared error is
$$ MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - t^{(i)} \right)^2 $$
The coefficient of determination $R^2$ is a standardized version of MSE:
$$ \begin{align} R^2 &= 1 - \frac{SSE}{SST} \\ SSE &= \sum_{i=1}^{n} \left( y^{(i)} - t^{(i)} \right)^2 \\ SST &= \sum_{i=1}^{n} \left( t^{(i)} - \mu_t \right)^2 \end{align} $$
Equivalently,
$$ R^2 = 1 - \frac{MSE}{Var(t)} $$
$R^2 = 1$ for a perfect fit.
In [28]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
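The identity $R^2 = 1 - \frac{MSE}{Var(t)}$ can be verified directly on the test split:

# sanity check of R^2 = 1 - MSE/Var(t)
mse = mean_squared_error(y_test, y_test_pred)
print(1.0 - mse / np.var(y_test))   # should equal the r2_score below
print(r2_score(y_test, y_test_pred))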
Regularization can help simplify models and reduce overfitting.
Popular methods for linear regression: Ridge (L2 penalty), LASSO (L1 penalty), and Elastic Net (a weighted combination of both), as summarized below.
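For reference, the penalized losses (up to scaling conventions, $\lambda$ plays the role of alpha in the scikit-learn code that follows):
$$ \begin{align} L_{ridge} &= SSE + \lambda \left\lVert \mathbf{w} \right\rVert_2^2 \\ L_{lasso} &= SSE + \lambda \left\lVert \mathbf{w} \right\rVert_1 \\ L_{elastic\,net} &= SSE + \lambda_1 \left\lVert \mathbf{w} \right\rVert_1 + \lambda_2 \left\lVert \mathbf{w} \right\rVert_2^2 \end{align} $$
LASSO tends to drive some weights exactly to zero (sparse, simpler models), while ridge shrinks all weights smoothly.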
In [29]:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=0.1) # alpha is like lambda above
ridge.fit(X_train, y_train)
y_train_pred = ridge.predict(X_train)
y_test_pred = ridge.predict(X_test)
print(ridge.coef_)
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
In [30]:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1) # alpha is like lambda above
lasso.fit(X_train, y_train)
y_train_pred = lasso.predict(X_train)
y_test_pred = lasso.predict(X_test)
print(lasso.coef_)
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
In [31]:
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
alphas = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
train_errors = []
test_errors = []
for alpha in alphas:
model = ElasticNet(alpha=alpha, l1_ratio=0.5)
model.fit(X_train, y_train)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
train_errors.append(mean_squared_error(y_train, y_train_pred))
test_errors.append(mean_squared_error(y_test, y_test_pred))
print(train_errors)
print(test_errors)
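The raw lists are easier to read on a log-scale plot; a minimal sketch using the values just computed:

# visualize the under-/overfitting trade-off as alpha grows
plt.plot(alphas, train_errors, marker='o', label='train MSE')
plt.plot(alphas, test_errors, marker='s', label='test MSE')
plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('MSE')
plt.legend(loc='upper left')
plt.show()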
In [32]:
X = np.array([258.0, 270.0, 294.0,
320.0, 342.0, 368.0,
396.0, 446.0, 480.0, 586.0])[:, np.newaxis]
y = np.array([236.4, 234.4, 252.8,
298.6, 314.2, 342.2,
360.8, 368.0, 391.2,
390.8])
In [33]:
from sklearn.preprocessing import PolynomialFeatures
lr = LinearRegression()
pr = LinearRegression()
quadratic = PolynomialFeatures(degree=2) # e.g. from [a, b] to [1, a, b, a*a, a*b, b*b]
X_quad = quadratic.fit_transform(X)
In [34]:
print(X[0, :])
print(X_quad[0, :])
print([1, X[0, :], X[0, :]**2])
In [35]:
# fit linear features
lr.fit(X, y)
X_fit = np.arange(250, 600, 10)[:, np.newaxis]
y_lin_fit = lr.predict(X_fit)
# fit quadratic features
pr.fit(X_quad, y)
y_quad_fit = pr.predict(quadratic.fit_transform(X_fit))
# plot results
plt.scatter(X, y, label='training points')
plt.plot(X_fit, y_lin_fit, label='linear fit', linestyle='--')
plt.plot(X_fit, y_quad_fit, label='quadratic fit')
plt.legend(loc='upper left')
plt.tight_layout()
# plt.savefig('./figures/poly_example.png', dpi=300)
plt.show()
A quadratic polynomial fits this dataset better than a linear one.
But it is not always a good idea to use higher-degree functions, which have more capacity to overfit, as sketched below.
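For example, a degree-9 polynomial can pass (nearly) through all 10 training points yet oscillate between them; a minimal sketch reusing X, y, and X_fit from the cells above (the fit is numerically ill-conditioned on these raw x values, which is part of the problem):

# degree-9 features: enough capacity to (almost) interpolate 10 points
nonic = PolynomialFeatures(degree=9)
pr9 = LinearRegression().fit(nonic.fit_transform(X), y)
plt.scatter(X, y, label='training points')
plt.plot(X_fit, pr9.predict(nonic.fit_transform(X_fit)), label='degree-9 fit')
plt.legend(loc='upper left')
plt.show()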
In [36]:
y_lin_pred = lr.predict(X)
y_quad_pred = pr.predict(X_quad)
In [37]:
print('Training MSE linear: %.3f, quadratic: %.3f' % (
mean_squared_error(y, y_lin_pred),
mean_squared_error(y, y_quad_pred)))
print('Training R^2 linear: %.3f, quadratic: %.3f' % (
r2_score(y, y_lin_pred),
r2_score(y, y_quad_pred)))
In [38]:
X = df[['LSTAT']].values
y = df['MEDV'].values
regr = LinearRegression()
# create quadratic features
quadratic = PolynomialFeatures(degree=2)
cubic = PolynomialFeatures(degree=3)
X_quad = quadratic.fit_transform(X)
X_cubic = cubic.fit_transform(X)
# fit features
X_fit = np.arange(X.min(), X.max(), 1)[:, np.newaxis]
regr = regr.fit(X, y)
y_lin_fit = regr.predict(X_fit)
linear_r2 = r2_score(y, regr.predict(X))
regr = regr.fit(X_quad, y)
y_quad_fit = regr.predict(quadratic.fit_transform(X_fit))
quadratic_r2 = r2_score(y, regr.predict(X_quad))
regr = regr.fit(X_cubic, y)
y_cubic_fit = regr.predict(cubic.fit_transform(X_fit))
cubic_r2 = r2_score(y, regr.predict(X_cubic))
# plot results
plt.scatter(X, y, label='training points', color='lightgray')
plt.plot(X_fit, y_lin_fit,
label='linear (d=1), $R^2=%.2f$' % linear_r2,
color='blue',
lw=2,
linestyle=':')
plt.plot(X_fit, y_quad_fit,
label='quadratic (d=2), $R^2=%.2f$' % quadratic_r2,
color='red',
lw=2,
linestyle='-')
plt.plot(X_fit, y_cubic_fit,
label='cubic (d=3), $R^2=%.2f$' % cubic_r2,
color='green',
lw=2,
linestyle='--')
plt.xlabel('% lower status of the population [LSTAT]')
plt.ylabel('Price in $1000\'s [MEDV]')
plt.legend(loc='upper right')
plt.tight_layout()
# plt.savefig('./figures/polyhouse_example.png', dpi=300)
plt.show()
Transforming the dataset based on the observations above, to make the relationship more linear: take the log of LSTAT and the square root of MEDV.
$$ \begin{align} X_{log} &= \log{X} \\ Y_{sqrt} &= \sqrt{Y} \end{align} $$
In [39]:
X = df[['LSTAT']].values
y = df['MEDV'].values
# transform features
X_log = np.log(X)
y_sqrt = np.sqrt(y)
# training
regr = regr.fit(X_log, y_sqrt)
linear_r2 = r2_score(y_sqrt, regr.predict(X_log))
# fit features
X_fit = np.arange(X_log.min()-1, X_log.max()+1, 1)[:, np.newaxis]
y_lin_fit = regr.predict(X_fit)
# plot results
plt.scatter(X_log, y_sqrt, label='training points', color='lightgray')
plt.plot(X_fit, y_lin_fit,
label='linear (d=1), $R^2=%.2f$' % linear_r2,
color='blue',
lw=2)
plt.xlabel('log(% lower status of the population [LSTAT])')
plt.ylabel('$\sqrt{Price \; in \; \$1000\'s [MEDV]}$')
plt.legend(loc='lower left')
plt.tight_layout()
# plt.savefig('./figures/transform_example.png', dpi=300)
plt.show()
In classification, we associate a class label for each leaf node.
In regression, we fit a function for each leaf node.
In the simplest case, the function is a constant, which will be the mean of all y values within that node if the loss function is based on MSE.
In this case, the whole tree essentially fits a piecewise constant function to the training data.
Similar to building a decision tree for classification, a decision tree for regression can be built by iteratively splitting each node to optimize an objective function, such as the MSE mentioned above. Specifically,
$$ IG(D_p) = I(D_p) - \sum_{j=1}^m \frac{N_j}{N_p} I(D_j) $$
where $IG$ is the information gain we try to maximize when splitting a parent node $p$ with dataset $D_p$, $I$ is the impurity measure for a given dataset, and $N_p$ and $N_j$ are the numbers of data vectors in the parent and child nodes; $m = 2$ for the usual binary split. For MSE, we have
$$ I(D) = \frac{1}{N_D} \sum_{i \in D} \left( y^{(i)} - \mu_D \right)^2 $$
where $\mu_D$ is the mean of the target $y$ values and $N_D$ the number of samples in dataset $D$.
In [40]:
from sklearn.tree import DecisionTreeRegressor
X = df[['LSTAT']].values
y = df['MEDV'].values
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X, y)
sort_idx = X.flatten().argsort() # sort from small to large for plotting below
lin_regplot(X[sort_idx], y[sort_idx], tree)
plt.xlabel('% lower status of the population [LSTAT]')
plt.ylabel('Price in $1000\'s [MEDV]')
# plt.savefig('./figures/tree_regression.png', dpi=300)
plt.show()
Notice the piecewise constant regions of the decision curve.
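Each constant region corresponds to a leaf predicting the mean of the training targets routed to it; this can be checked with the tree's apply method, which returns the leaf index of each sample:

# verify: every leaf predicts the mean of its training targets
leaves = tree.apply(X)  # leaf index for each training sample
for leaf in np.unique(leaves):
    print(leaf, y[leaves == leaf].mean(), tree.predict(X[leaves == leaf])[0])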
In [41]:
X = df.iloc[:, :-1].values
y = df['MEDV'].values
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.4, random_state=1)
In [42]:
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(n_estimators=1000,
                               criterion='mse',  # renamed to 'squared_error' in newer scikit-learn
                               random_state=1,
                               n_jobs=-1)
forest.fit(X_train, y_train)
y_train_pred = forest.predict(X_train)
y_test_pred = forest.predict(X_test)
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
Overfitting (the training error is much lower than the test error), but still good test performance.
In [43]:
plt.scatter(y_train_pred,
y_train_pred - y_train,
c='black',
marker='o',
s=35,
alpha=0.5,
label='Training data')
plt.scatter(y_test_pred,
y_test_pred - y_test,
c='lightgreen',
marker='s',
s=35,
alpha=0.7,
label='Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend(loc='upper left')
plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red')
plt.xlim([-10, 50])
plt.tight_layout()
# plt.savefig('./figures/slr_residuals.png', dpi=300)
plt.show()
The residuals look more randomly scattered around zero than the results from the other regressors shown above.
Regression: fit functions to explain continuous variables
Linear regression: simple and common
General function regression: more powerful but watch out for over-fitting
Analogous to classification in many aspects: supervised training from labeled samples, loss minimization, regularization against overfitting, and tree/forest models
RANSAC: a general method for fitting noisy data
We can observe a lot by watching.