This week we will look in detail at how to implement a supervised regression model using the base scientific computing packages available for Python.
We will also look at the different packages available for Python that implement many of the algorithms we might want to use.
Why implement algorithms from scratch when dedicated packages already exist?
The dedicated packages are very powerful and real time savers, but they can obscure some issues we might encounter if we don't know to look for them. By starting with just numpy these problems will be more obvious. We can address them here, and when we move on to the dedicated packages we will know what to look for and be less likely to miss them.
The dedicated machine learning packages implement the different algorithms but we are still responsible for getting our data in a suitable format.
In [1]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
In [2]:
n = 20
x = np.random.random((n,1))
y = 5 + 6 * x ** 2 + np.random.normal(0,0.5, size=(n,1))
plt.plot(x, y, 'b.')
plt.show()
This is a very simple dataset. There is only one input value for each record and then there is the output value. Our goal is to determine the output value or dependent variable, shown on the y-axis, from the input or independent variable, shown on the x-axis.
Our approach should scale to handle multiple input, or independent, variables. The independent variables can be stored in a vector, a 1-dimensional array:
$$X^T = (X_{1}, X_{2}, X_{3})$$

As we have multiple records these can be stacked in a 2-dimensional array. Each record becomes one row in the array. Our x variable is already set up in this way.
In linear regression we can compute the value of the dependent variable using the following formula:
$$f(X) = \beta_{0} + \sum_{j=1}^p X_j\beta_j$$

The $\beta_{0}$ term is the intercept, and represents the value of the dependent variable when all the independent variables are zero.
Calculating a solution is easier if we don't treat the intercept as special. Instead of handling the intercept coefficient separately, we can add a variable with a constant value of one to each of our records.
In [3]:
intercept_x = np.hstack((np.ones((n,1)), x))
intercept_x
Out[3]:
Numpy contains the linalg module with many common functions for performing linear algebra. Using this module finding a solution is quite simple.
In [7]:
np.linalg.lstsq(intercept_x,y)
Out[7]:
The values returned are the least-squares solution (the coefficients), the sum of squared residuals, the rank of the input matrix, and its singular values:
In [8]:
coeff, residuals, rank, sing_vals = np.linalg.lstsq(intercept_x,y)
In [9]:
print(intercept_x.shape, coeff.T.shape)
print(intercept_x, coeff.T)
In [10]:
np.sum(intercept_x * coeff.T, axis=1)
Out[10]:
In [10]:
predictions = np.sum(intercept_x * coeff.T, axis=1)
In [11]:
plt.plot(x, y, 'bo')
plt.plot(x, predictions, 'k-')
plt.show()
In [12]:
predictions.reshape((20,1)) - y
Out[12]:
In [13]:
np.sum((predictions.reshape((20,1)) - y) ** 2), residuals
Out[13]:
Least squares refers to the cost function for this algorithm. The objective is to minimize the residual sum of squares: the difference between the actual and predicted values is calculated, squared, and then summed over all records. The function is as follows:
$$RSS(\beta) = \sum_{i=1}^{N}(y_i - x_i^T\beta)^2$$

Within lstsq all the calculations are performed using matrix arithmetic rather than the more familiar element-wise arithmetic numpy arrays generally perform. Numpy does have a matrix type but matrix arithmetic can also be performed on standard arrays using dedicated methods.
In matrix multiplication the resulting value in any position is the sum of multiplying each value in a row in the first matrix by the corresponding value in a column in the second matrix.
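As a quick illustration (added here, not in the original notebook), compare element-wise multiplication with matrix multiplication on two small arrays; np.dot performs the row-by-column multiply-and-sum just described:

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])
print(a * b)         # element-wise: each value multiplied by the matching value
print(np.dot(a, b))  # matrix product: rows of a combined with columns of b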
The residual sum of squares can be calculated with the following formula:
$$RSS(\beta) = (y - X\beta)^T(y-X\beta)$$

The value of our coefficients can be calculated with:

$$\hat\beta = (X^TX)^{-1}X^Ty$$

Unfortunately, the result is not as visually appealing as in languages that use matrix arithmetic by default.
In [19]:
our_coeff = np.dot(np.dot(np.linalg.inv(np.dot(intercept_x.T, intercept_x)), intercept_x.T), y)
print(coeff, '\n\n', our_coeff)
In [25]:
our_predictions = np.dot(intercept_x, our_coeff)
np.hstack((predictions.reshape(1,20),
           our_predictions.reshape(1,20)))
Out[25]:
In [169]:
plt.plot(x, y, 'ko', label='True values')
plt.plot(x, our_predictions, 'ro', label='Predictions')
plt.legend(numpoints=1, loc=4)
plt.show()
In [33]:
plt.plot(x, y, 'ko', label='True Values')
all_x = np.linspace(0,1, 1000).reshape((1000,1))
intercept_all_x = np.hstack((np.ones((1000,1)), all_x))
print(all_x.shape, intercept_all_x.shape, our_coeff.shape)
all_x_predictions = np.dot(intercept_all_x, our_coeff)
print("predictions:", all_x_predictions.shape)
plt.plot(all_x, all_x_predictions, 'r-', label='Predictions')
plt.legend(numpoints=1, loc=4)
plt.show()
The independent variables can be many different types.
It is important to note that a linear model only needs to be linear in its coefficients: the input variables it is given can take any form, including nonlinear transformations of other variables.
One approach to improving the predictions from our model is to add the square, cube, etc. of our existing variable as additional input variables.
In [170]:
x_expanded = np.hstack([x**i for i in range(1, 20)])
b, residuals, rank, s = np.linalg.lstsq(x_expanded, y)
print(b)
plt.plot(x, y, 'ko', label='True values')
plt.plot(x, np.dot(x_expanded, b), 'ro', label='Predictions')
plt.legend(numpoints=1, loc=4)
plt.show()
There is a tradeoff with model complexity. As we add more complexity to our model we can fit our training data increasingly well but eventually will lose our ability to generalize to new data.
Very simple models underfit the data and have high bias.
Very complex models overfit the data and have high variance.
The goal is to detect true sources of variation in the data and ignore variation that is just noise.
How do we know if we have a good model? A common approach is to break up our data into a training set, a validation set, and a test set.
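On a real dataset a simple way to make such a split with numpy alone might look like the sketch below (the 60/20/20 proportions are an illustrative assumption, not something prescribed in these notes):

# Minimal sketch of a train/validation/test split using only numpy.
# The 60/20/20 proportions are illustrative.
indices = np.random.permutation(n)
train_idx = indices[:int(0.6 * n)]
val_idx = indices[int(0.6 * n):int(0.8 * n)]
test_idx = indices[int(0.8 * n):]
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]
x_test, y_test = x[test_idx], y[test_idx]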
What would our best model look like? Because we are using dummy data here we can easily make more.
In [171]:
n = 20
p = 12
training = []
val = []
for i in range(1, p):
    np.random.seed(0)
    x = np.random.random((n,1))
    y = 5 + 6 * x ** 2 + np.random.normal(0,0.5, size=(n,1))
    x = np.hstack([x**j for j in np.arange(i)])
    our_coeff = np.dot(np.dot(np.linalg.inv(np.dot(x.T, x)), x.T), y)
    our_predictions = np.dot(x, our_coeff)
    our_training_rss = np.sum((y - our_predictions) ** 2)
    training.append(our_training_rss)
    val_x = np.random.random((n,1))
    val_y = 5 + 6 * val_x ** 2 + np.random.normal(0,0.5, size=(n,1))
    val_x = np.hstack([val_x**j for j in np.arange(i)])
    our_val_pred = np.dot(val_x, our_coeff)
    our_val_rss = np.sum((val_y - our_val_pred) ** 2)
    val.append(our_val_rss)
    #print(i, our_training_rss, our_val_rss)
plt.plot(range(1, p), training, 'ko-', label='training')
plt.plot(range(1, p), val, 'ro-', label='validation')
plt.legend(loc=2)
plt.show()
One limitation of our current implementation is that it is resource intensive: solving the normal equations requires holding the full dataset in memory and inverting a matrix. For very large datasets an alternative is needed. Gradient descent is often preferred, and in particular stochastic gradient descent for very large datasets.
Gradient descent is an iterative process, repeatedly calculating the error and changing the coefficients slightly to reduce that error. It does this by calculating a gradient and then descending towards a minimum in small steps.
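To make the step concrete (this derivation is added here; it is not in the original notes), the gradient of the residual sum of squares with respect to the coefficients is:

$$\nabla_\beta RSS(\beta) = -2X^T(y - X\beta)$$

Absorbing the constant factor into the learning rate $\alpha$ and averaging over the $N$ records, as the implementation below does, gives the update rule:

$$\beta \leftarrow \beta - \frac{\alpha}{N}X^T(X\beta - y)$$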
Stochastic gradient descent calculates the gradient on a small batch of the data, updates the coefficients, loads the next chunk of the data and repeats the process.
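A rough sketch of what that might look like, reusing the same gradient step (this helper and its batch_size parameter are illustrative additions, not part of the original notebook):

def stochastic_gradient_descent(x, y, rounds=100, alpha=0.01, batch_size=20):
    # Illustrative sketch only: update the coefficients one small batch at a time.
    theta = np.zeros((x.shape[1], 1))
    for i in range(rounds):
        order = np.random.permutation(x.shape[0])
        for start in range(0, x.shape[0], batch_size):
            batch = order[start:start + batch_size]
            error = np.dot(x[batch], theta) - y[batch]
            gradient = np.dot(x[batch].T, error / batch.shape[0])
            theta -= gradient * alpha
    return theta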
In the code below we will just look at a basic batch gradient descent model.
In [38]:
np.random.seed(0)
n = 200
x = np.random.random((n,1))
y = 5 + 6 * x ** 2 + np.random.normal(0,0.5, size=(n,1))
intercept_x = np.hstack((np.ones((n,1)), x))
coeff, residuals, rank, sing_vals = np.linalg.lstsq(intercept_x,y)
print('lstsq:\n', coeff, '\n')
def gradient_descent(x, y, rounds=1000, alpha=0.01):
    theta = np.zeros((x.shape[1], 1))
    costs = []
    for i in range(rounds):
        prediction = np.dot(x, theta)
        error = prediction - y
        gradient = np.dot(x.T, error / y.shape[0])
        theta -= gradient * alpha
        costs.append(np.sum(error ** 2))
    return (theta, costs)
theta, costs = gradient_descent(intercept_x, y, rounds=10000)
print('theta:\n', theta, '\n\n', costs[::500])
In [173]:
np.random.seed(0)
n = 200
x = np.random.random((n,1))
y = 5 + 6 * x ** 2 + np.random.normal(0,0.5, size=(n,1))
x = np.hstack([x**j for j in np.arange(20)])
coeff, residuals, rank, sing_vals = np.linalg.lstsq(x,y)
print('lstsq', coeff)
theta, costs = gradient_descent(x, y, rounds=10000)
print(theta, costs[::500])
plt.plot(x[:,1], y, 'ko')
plt.plot(x[:,1], np.dot(x, coeff), 'co')
plt.plot(x[:,1], np.dot(x, theta), 'ro')
plt.show()
General
There is a collection of field-specific packages, including some with machine learning components, on the scipy website. Other packages can often be found by searching the Python Package Index.
Deep learning has been receiving a lot of attention recently, and a number of different packages have been developed for it.
Scikit-learn is now widely used. It includes modules for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
There are modules for training online models, enabling very large datasets to be analyzed.
There is also a semi-supervised module for situations when you have a large dataset, but only have labels for part of the dataset.
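As a quick illustration of how little code the dedicated packages require, a least squares fit with scikit-learn looks roughly like the snippet below (added here as an example; it is not part of the original notebook):

from sklearn.linear_model import LinearRegression

model = LinearRegression()   # fits an intercept by default, so no column of ones is needed
model.fit(x, y)              # x holds one record per row, as before
print(model.intercept_, model.coef_)
predictions = model.predict(x)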
Milk works very well with mahotas, a package for image processing. With the recent improvements in scikit-image, milk is now less attractive, although it is still a strong option.
These are both large packages but for whatever reason do not receive the attention that scikit-learn does.
Dato is a relative newcomer and has been receiving a lot of attention lately. Time will tell whether it can compete with scikit-learn.
This week we will continue working on our project ideas. As you develop the outline some points you may want to consider:
For projects developing the object oriented programming component of the course:
For projects developing GUIs or web applications:
For projects developing machine learning models:
You do not need to answer all of these questions, and each answer does not need to be complete. Your final project will likely be different from your initial idea.
The goal of the project description is to document your project as you currently envision it and to encourage planning from the earliest stage.
Your project descriptions should be sent to me by our class next week.