Speaker: Yingzhi Gou
Decision Systems Lab,
University of Wollongong
NOTE: this Jupyter notebook is available on GitHub at https://github.com/YingzhiGou/AI-Meetup-Decision-Systems-Lab-UOW
This model is a linear function of the input features. Given an input $X = \langle x_1, x_2, \ldots, x_n \rangle$, the linear regression model is defined as
$$ \hat{y} = \theta_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n $$
where $\hat{y}$ is the predicted value, $x_i$ is the $i$-th feature value, and $\theta_j$ is the $j$-th model parameter ($\theta_0$ is the bias term, $\theta_1, \ldots, \theta_n$ are the feature weights).
In vectorized form,
$$ \hat{y} = h_{\theta}(X) = \theta^T \cdot X $$
where $X$ now includes $x_0 = 1$ to account for the bias term, and the function $h_\theta$ is commonly called the hypothesis function, parameterized by $\theta$.
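As a minimal NumPy sketch of the hypothesis function (the values of $\theta$ and the instance below are made up purely for illustration):
In [ ]:
import numpy as np
# hypothetical parameters and a single instance, with x0 = 1 prepended for the bias term
theta = np.array([4.0, 3.0])   # theta_0 (bias), theta_1
x = np.array([1.0, 1.5])       # x_0 = 1, x_1 = 1.5
y_hat = theta.T.dot(x)         # theta^T . x = 4.0 * 1 + 3.0 * 1.5 = 8.5
print(y_hat)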
Now we need a measure of how well (or poorly) the model fits the training data.
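A common choice, and the one used throughout this notebook, is the Mean Squared Error (MSE) cost function. For the linear regression hypothesis over $m$ training instances it is
$$ \mathrm{MSE}(\mathbf{X}, h_\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta^T \cdot \mathbf{x}^{(i)} - y^{(i)} \right)^2 $$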
To find the $\theta$ that minimizes the cost function (MSE), there is a closed-form solution (recall the first way of training a model), i.e. a mathematical equation that gives the result directly, called the Normal Equation:
$$ \hat{\theta} = (\mathbf{X}^T \cdot \mathbf{X})^{-1} \cdot \mathbf{X}^T \cdot \mathbf{y} $$
In [1]:
# prepare some data
import numpy as np
np.random.seed(42) # only if you want reproducible results
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(X, y)
Out[2]:
In [3]:
# compute the Normal Equation, using inv() from NumPy's linear algebra
# module to invert the matrix X_b^T . X_b
X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 for each instance (why? hint: \theta_0, the bias term)
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
In [4]:
print(theta_best)
In [5]:
X_new = np.array([[0],[2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]
y_predict = X_new_b.dot(theta_best)
print(y_predict)
In [6]:
plt.scatter(X, y)
plt.plot(X_new, y_predict, "r-")
Out[6]:
In [7]:
# using Scikit-learn
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
print(lin_reg.intercept_, lin_reg.coef_)
In [8]:
lin_reg.predict(X_new)
Out[8]:
which gives the same result.
The Normal Equation computes the inverse of $\mathbf{X}^T \cdot \mathbf{X}$, which is an $n\times n$ matrix (where $n$ is the number of features). The computational complexity of inverting such a matrix is about $O(n^{2.4})$ to $O(n^{3})$[1](#myfootnote1).
However, the equation is linear in the number of training instances, so it can handle large training sets efficiently, provided all the training data fits in memory.
1: https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations
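As a side note, the same least-squares solution can be obtained without explicitly forming the inverse. A minimal sketch using np.linalg.lstsq (which should recover essentially the same theta_best as above):
In [ ]:
# solve the least-squares problem X_b . theta = y directly,
# without explicitly inverting X_b^T . X_b
theta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X_b, y, rcond=None)
print(theta_lstsq)  # expected to be close to theta_best from the Normal Equation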
In [9]:
def plot_gradient_descent(eta, theta=None, theta_path=None, random_state=None, n_iterations=1000):
    if random_state:
        np.random.seed(random_state)
    if theta is None:
        theta = np.random.randn(2, 1)  # random initialization
    m = len(X_b)
    plt.plot(X, y, "b.")
    for iteration in range(n_iterations):
        if iteration < 10:  # only plot the first 10 iterations
            y_predict = X_new_b.dot(theta)
            style = "b-" if iteration > 0 else "r--"
            plt.plot(X_new, y_predict, style)
        gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
        theta = theta - eta * gradients
        if theta_path is not None:
            theta_path.append(theta)
    plt.xlabel("$x_1$", fontsize=18)
    plt.axis([0, 2, 0, 15])
    plt.title(r"$\eta = {}$".format(eta), fontsize=16)
In [10]:
# create temp lists to store all theta values during training
theta_path_1 = []
theta_path_2 = []
theta_path_3 = []
In [11]:
plt.figure(figsize=(10,4))
plt.subplot(131); plot_gradient_descent(eta=0.02, random_state=42, theta_path=theta_path_1)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(132); plot_gradient_descent(eta=0.1, random_state=42, theta_path=theta_path_2)
plt.subplot(133); plot_gradient_descent(eta=0.5, random_state=42, theta_path=theta_path_3)
plt.show()
In [12]:
# convert python list to numpy array
theta_path_1 = np.array(theta_path_1)
theta_path_2 = np.array(theta_path_2)
theta_path_3 = np.array(theta_path_3)
In [13]:
# sample a grid of (theta_0, theta_1) values for the heat map
cell_size = 0.05
theta_0s = np.arange(2, 5, cell_size)
theta_1s = np.arange(2, 4, cell_size)
m = len(X_b)
mses = [[np.mean(np.power(X_b.dot([[theta_0], [theta_1]])-y, 2)) for theta_0 in theta_0s] for theta_1 in theta_1s]
mses = np.array(mses)
plt.pcolor(theta_0s, theta_1s, mses)
plt.plot(4, 3, "X", label="target")
plt.xlabel(r"$\theta_0$", fontsize=20)
plt.ylabel(r"$\theta_1$", fontsize=20, rotation=0)
plt.colorbar()
plt.show()
In [14]:
# let's visualize the path of theta during training for the different learning rates
plt.figure()
plt.plot(4, 3, "X", label="target")
plt.pcolormesh(theta_0s, theta_1s, mses)
plt.plot(theta_path_1[:, 0], theta_path_1[:, 1], "-s", alpha=0.4, linewidth=1, label="eta=0.02")
plt.plot(theta_path_2[:, 0], theta_path_2[:, 1], "-+", alpha=0.4, linewidth=2, label="eta=0.1")
plt.plot(theta_path_3[:, 0], theta_path_3[:, 1], "-o", alpha=0.4, linewidth=3, label="eta=0.5")
plt.legend(loc="upper left", fontsize=16)
plt.xlabel(r"$\theta_0$", fontsize=20)
plt.ylabel(r"$\theta_1$", fontsize=20, rotation=0)
plt.axis([2.5, 4.5, 2.3, 3.9])
plt.colorbar()
plt.show()
In [15]:
import numpy as np
np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X ** 2 + X + 2 + np.random.randn(m, 1)
plt.plot(X, y, "b.")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([-3, 3, 0, 10])
plt.show()
In [16]:
# use linear model directly
lin_reg = LinearRegression()
lin_reg.fit(X, y)
Out[16]:
In [17]:
plt.scatter(X, y)
X_new = np.array([[-3], [3]])  # line endpoints spanning the data range
plt.plot(X_new, lin_reg.predict(X_new), "r-")
Out[17]:
Clearly, it does not fit.
In [18]:
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
X[0]
Out[18]:
In [19]:
X_poly[0]
Out[19]:
Now X_poly contains not only the original feature value, but also its square.
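A quick sanity check of that claim (the comparison below simply verifies the two columns against the original feature):
In [ ]:
# first column: the original feature; second column: its square
print(np.allclose(X_poly[:, 0], X[:, 0]))
print(np.allclose(X_poly[:, 1], X[:, 0] ** 2))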
In [20]:
lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_
Out[20]:
Our original formula (with Gaussian noise $\mu$):
$$ y = \frac{1}{2} x^2 + x + 2 + \mu $$
The fitted regression model:
$$ \hat{y} = 0.56456263 x^2 + 0.93366893 x + 1.78134581 $$
In [21]:
X_new=np.linspace(-3, 3, 100).reshape(100, 1)
X_new_poly = poly_features.transform(X_new)
y_new = lin_reg.predict(X_new_poly)
plt.plot(X, y, "b.")
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.legend(loc="upper left", fontsize=14)
plt.axis([-3, 3, 0, 10])
plt.show()
The preprocessing tool $PolynomialFeatures(degree=d)$ transforms an array containing $n$ features into an array containing $\frac{(n+d)!}{d!\,n!}$ features. For example, given two features $a$ and $b$, $degree=3$ would not only add $a^2, a^3, b^2$ and $b^3$, but also the combinations $ab, a^2b$ and $ab^2$.
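A small sketch of that expansion on made-up values $a=2$, $b=3$ (the comment lists the column order produced by scikit-learn here):
In [ ]:
# expand two features a and b up to degree 3 (bias column excluded)
demo = np.array([[2.0, 3.0]])  # a = 2, b = 3, illustrative values only
poly3 = PolynomialFeatures(degree=3, include_bias=False)
print(poly3.fit_transform(demo))
# columns: a, b, a^2, a*b, b^2, a^3, a^2*b, a*b^2, b^3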
In [22]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
for style, width, degree in (("g-", 1, 300), ("b--", 2, 2), ("r-+", 2, 1)):
    poly_features = PolynomialFeatures(degree=degree, include_bias=False)
    std_scaler = StandardScaler()
    lin_reg = LinearRegression()
    polynomial_regression = Pipeline([
        ("poly_features", poly_features),
        ("std_scaler", std_scaler),
        ("lin_reg", lin_reg)
    ])
    polynomial_regression.fit(X, y)
    y_newbig = polynomial_regression.predict(X_new)
    plt.plot(X_new, y_newbig, style, label=str(degree), linewidth=width)
plt.plot(X, y, "b.", linewidth=3)
plt.legend(loc="upper left")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([-3, 3, 0, 10])
plt.show()
In [23]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=10)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train_predict, y_train[:m]))
        val_errors.append(mean_squared_error(y_val_predict, y_val))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.legend(loc="upper right", fontsize=14)
    plt.xlabel("Training set size", fontsize=14)
    plt.ylabel("RMSE", fontsize=14)
In [24]:
lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)
plt.axis([0, 80, 0, 3])
plt.show()
In [25]:
from sklearn.pipeline import Pipeline
# degree 2 polynomial
polynomial_regression = Pipeline([
("poly_features", PolynomialFeatures(degree=2, include_bias=False)),
("lin_reg", LinearRegression()),
])
plot_learning_curves(polynomial_regression, X, y)
plt.axis([0, 80, 0, 3])
plt.show()
In [26]:
# degree 10 polynomial
polynomial_regression = Pipeline([
("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
("lin_reg", LinearRegression()),
])
plot_learning_curves(polynomial_regression, X, y)
plt.axis([0, 80, 0, 3])
plt.show()
In [27]:
# degree 300 polynomial
polynomial_regression = Pipeline([
("poly_features", PolynomialFeatures(degree=300, include_bias=False)),
("lin_reg", LinearRegression()),
])
plot_learning_curves(polynomial_regression, X, y)
plt.axis([0, 80, 0, 3])
plt.show()