$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$
Given $N$ observations, $\xv_n$, for $n=1,\ldots,N$, and target values, $t_n$, for $n=1,\ldots,N$, what is the simplest model, $g(\xv)$, you can think of?
$$ g(\xv) = 0 $$ or maybe
$$ g(\xv) = c $$ What is the next simplest model?
$$ \begin{align*} g(\xv;\wv) &= w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_D x_D \\ &= w_0 + \sum_{i=1}^D w_i x_i \\ & = \sum_{i=0}^D w_i x_i \mbox{, where } x_0 = 1 \\ &= \wv^T \xv \end{align*} $$ Imagine attaching a vertical spring between each target value $t_n$ and a rigid rod representing our linear model. The force exerted by a spring is proportional to its length, and the potential energy stored in a spring is proportional to the square of its length. If we want the rod to settle at the position that minimizes the total potential energy in the springs, we should minimize the sum of squared errors. $$ \begin{align*} \sum_{n=1}^N (t_n - g(\xv_n;\wv))^2 \end{align*} $$
If $g$ is an affine (linear + constant) function of $x$, $$ g(\xv;\wv) = w_0 + w_1 x $$ with parameters $\wv = (w_0, w_1)$, which parameter values give the best fit? $$ \wv_{\mbox{best}} = \argmin{\wv} \sum_{n=1}^N (t_n - g(x_n ; \wv))^2 $$
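To get a feel for this objective before deriving the solution, here is a minimal sketch (with made-up data and candidate weights, not from this notebook) that simply evaluates the sum of squared errors for two choices of $\wv$; the better choice gives a much smaller value.

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])      # hypothetical inputs
t = np.array([1.1, 2.9, 5.2, 6.8, 9.1])      # hypothetical targets

def sse(w0, w1):
    # sum of squared errors for the affine model g(x; w) = w0 + w1 * x
    return np.sum((t - (w0 + w1 * x))**2)

print(sse(0.0, 1.0))   # a poor fit, large error
print(sse(1.0, 2.0))   # a much better fit, small error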
Set derivative (gradient) with respect to $\wv$ to zero and solve for $\wv$. Let's do this with matrices.
Collect all targets into matrix $T$ and all input samples into matrix $X$ ($N$ = number of samples, $D$ = sample dimensionality, with $x_{n,0} = 1$ providing the constant term). $$ \begin{align*} T &= \begin{bmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{bmatrix} \\ X &= \begin{bmatrix} x_{1,0} & x_{1,1} & x_{1,2} & \dotsc & x_{1,D} \\ x_{2,0} & x_{2,1} & x_{2,2} & \dotsc & x_{2,D} \\ \vdots \\ x_{N,0} & x_{N,1} & x_{N,2} & \dotsc & x_{N,D} \end{bmatrix}\\ \wv &= \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_D \end{bmatrix} \end{align*} $$
Collection of all differences is $T - X\wv$, which is an $N \times 1$ matrix. To form the square of all values and add them up, just do a dot product $(T-X\wv)^T (T-X\wv)$. This only works if the value we are predicting is a scalar, which means $T$ is a column matrix. If we want to predict more than one value for each sample, $T$ will have more than one column. Let's continue with the derivation assuming $T$ has $K$ columns, meaning we want a linear model with $K$ outputs.
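To convince yourself that the dot product really does square and sum all of the differences, here is a quick sketch with random matrices of arbitrary sizes (with $K$ outputs, the summed squared errors for each output appear on the diagonal of the $K \times K$ product):

import numpy as np

N, D, K = 5, 3, 2
X = np.random.uniform(size=(N, D + 1))   # N x (D+1) input matrix
T = np.random.uniform(size=(N, K))
w = np.random.uniform(size=(D + 1, K))

diff = T - np.dot(X, w)                  # N x K matrix of differences
sq = np.dot(diff.T, diff)                # K x K; diagonal holds the summed squared errors
print(np.allclose(np.diag(sq), np.sum(diff**2, axis=0)))   # True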
To find the best value for $\wv$, we take the derivative of the sum of squared error objective, set it equal to zero, and solve for $\wv$.
$$ \begin{align*} \frac{\partial \sum_{n=1}^N (\tv_n - g(\xv_n;\wv))^2}{\partial \wv} &= -2 \sum_{n=1}^N (\tv_n - g(\xv_n;\wv)) \frac{\partial g(\xv_n;\wv)}{\partial \wv}\\ &= -2 \sum_{n=1}^N (\tv_n - \xv_n^T \wv) \frac{\partial \xv_n^T \wv}{\partial \wv}\\ &= -2 \sum_{n=1}^N (\tv_n - \xv_n^T \wv) \xv_n^T \end{align*} $$ Here's where we get the benefit of expressing the $\xv_n$ and $t_n$ samples as matrices. The sum can be performed with a dot product:
$$ \begin{align*} \frac{\partial \sum_{n=1}^N (\tv_n - g(\xv_n;\wv))^2}{\partial \wv} &= -2 \sum_{n=1}^N (\tv_n - \xv_n^T \wv) \xv_n^T\\ &= -2 \Xv^T (\Tv - \Xv \wv) \end{align*} $$Check the sizes and shapes of each matrix in the last equation above.
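One way to do that check is to build matrices with small, arbitrary shapes and confirm that the gradient comes out with one row per weight and one column per output, matching the shape of $\wv$ (a sketch, not part of the data analysis below):

import numpy as np

N, D, K = 10, 4, 1                       # arbitrary sizes for the check
X = np.random.uniform(size=(N, D + 1))   # N x (D+1)
T = np.random.uniform(size=(N, K))       # N x K
w = np.zeros((D + 1, K))                 # (D+1) x K

grad = -2 * np.dot(X.T, T - np.dot(X, w))
print(grad.shape)                        # (D+1, K), same shape as w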
Now we can set this equal to zero and solve for $\wv$.
$$ \begin{align*} -2 \Xv^T (\Tv - \Xv \wv) &= 0\\ \Xv^T (\Tv - \Xv \wv) &= 0\\ \Xv^T \Tv &= \Xv^T \Xv \wv\\ \wv &= (\Xv^T \Xv)^{-1} \Xv^T \Tv \end{align*} $$And, in python
w = np.dot( np.linalg.inv(np.dot(X.T, X)), np.dot(X.T,T) )
or, you may use the solve function that assumes $\Xv^T \Xv$ is full rank (no linearly dependent columns),
w = np.linalg.solve(np.dot(X.T,X), np.dot(X.T, T))
or, better yet, use the lstsq function that does not make that assumption.
w = np.linalg.lstsq(np.dot(X.T,X), np.dot(X.T, T))
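As an aside, lstsq can also be handed $X$ and $T$ directly, so $\Xv^T \Xv$ is never formed explicitly; this tends to be better behaved numerically. A one-line sketch, assuming X and T are the matrices described above (recent numpy versions expect the rcond=None argument, and the [0] keeps only the solution from the tuple that lstsq returns):

w = np.linalg.lstsq(X, T, rcond=None)[0]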
Try the automobile data set at the UCI Machine Learning Repository. Download auto-mpg.data and auto-mpg.names from the "Data Folder" link near the top of the web page.
In [1]:
!wget http://mlr.cs.umass.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
!wget http://mlr.cs.umass.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names
First, take a look at auto-mpg.names. There you will learn that there are 398 samples, each with 8 numerical attributes and one string attribute. Their names are mpg, cylinders, displacement, horsepower, weight, acceleration, model year, origin, and car name.
First step, read it into python and look at it.
In [2]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Take a look at a few lines of the data file to figure out how to read it in.
In [3]:
!head -40 auto-mpg.data
Instead of relying on the pandas package to read this data, let's try using the simpler numpy.loadtxt function. We see that each line contains a sample consisting of 8 numbers and one string. However, first we have to figure out how to deal with the missing values. The documentation for numpy.loadtxt reveals that the converters argument can be used to specify a function to apply to the values from particular columns of data. We can use this to deal with the missing values. Looking at auto-mpg.data we see that the missing values are marked with question marks. These are all in the fourth column, the horsepower attribute.
Here is a simple function that translates '?' to NaN (np.nan) and anything else is converted to a floating point number.
In [4]:
def missingIsNan(s):
    if s == b'?':  # the field is read as bytes, so compare with b'?', not '?'
        return np.nan
    else:
        return float(s)
print(missingIsNan('12.32'))
print(missingIsNan(b'?'))
Hey, why not get fancy and write a one-liner?
In [5]:
def missingIsNan(s):
    return np.nan if s == b'?' else float(s)
print(missingIsNan('12.32'))
print(missingIsNan(b'?'))
The converters argument to np.loadtxt accepts a python dictionary, which is python's associative array structure.
In [6]:
d = {1: 'a', 2: 'yo', 3: 42, 4: missingIsNan}
d
Out[6]:
In [7]:
d[1]
Out[7]:
In [8]:
d[4]
Out[8]:
In [9]:
d[4](b'?')
Out[9]:
Let's also restrict np.loadtxt to reading just the first 8 columns to avoid dealing with the string in the 9th column.
In [10]:
data = np.loadtxt('auto-mpg.data', usecols=range(8), converters={3: missingIsNan})
In [11]:
data.shape
Out[11]:
In [12]:
data[:3,:]
Out[12]:
We can change the precision that ipython uses to display floating point values.
In [13]:
%precision 3
Out[13]:
In [14]:
data[:3,:]
Out[14]:
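The %precision magic is an IPython feature. Outside of IPython, numpy's own print options give similar control (a small sketch using the data array loaded above):

np.set_printoptions(precision=3, suppress=True)   # 3 decimals, no scientific notation
print(data[:3, :])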
As we have done before, we must find the missing values. Let's just remove the samples with missing values.
In [15]:
np.isnan(data)
Out[15]:
In [16]:
np.sum(np.isnan(data))
Out[16]:
In [17]:
np.sum(np.isnan(data), axis=0)
Out[17]:
What does this result tell us?
Yep, all 6 NaN values are in the fourth column, the horsepower column.
Let's just remove those 6 samples from the array. To do this, build a boolean mask of length equal to the number of rows in data that is False except for any row that contains a NaN. We can use the np.any or np.all methods to combine boolean values in each row or column. As always, try such ideas in multiple steps, checking each step.
In [18]:
nans = np.isnan(data)
nans.shape
Out[18]:
In [19]:
nans.any(axis=0).shape
Out[19]:
In [20]:
nans.any(axis=1).shape
Out[20]:
That's the one we want, a boolean value for each row, or sample.
In [21]:
goodRowsMask = nans.any(axis=1)
goodRowsMask
Out[21]:
In [22]:
dataNew = data[goodRowsMask,:]
dataNew.shape
Out[22]:
Wait a minute! This gives us only 6 samples. We wanted all samples but these 6, right?
In [23]:
dataNew
Out[23]:
So, let's change all False values to True and True values to False.
In [24]:
goodRowsMask = np.logical_not(goodRowsMask)
In [25]:
dataNew = data[goodRowsMask,:]
dataNew.shape
Out[25]:
In [26]:
dataNew[:3,:]
Out[26]:
In [27]:
np.sum(np.isnan(dataNew),axis=0)
Out[27]:
Remember, the next step after reading data into python is to visualize it. One thing we can do is just plot the value of each attribute in a separate graph. Let's make an array of column names to label the y axes.
In [28]:
names = ['mpg','cylinders','displacement','horsepower','weight',
'acceleration','year','origin']
In [29]:
plt.figure(figsize=(10,10))
nrow,ncol = dataNew.shape
for c in range(ncol):
    plt.subplot(3,3, c+1)
    plt.plot(dataNew[:,c])
    plt.ylabel(names[c])
What is interesting to you in these graphs?
Let's try to predict miles per gallon, mpg, from the other attributes. To set this up, let's make a 392 x 1 column vector, T, of target values containing all of the mpg values, and a 392 x 7 matrix, X, to hold the inputs to our model.
In [30]:
names
Out[30]:
In [31]:
T = dataNew[:, 0:1] # dataNew[:,0] results in a one-dimensional array; dataNew[:,0:1] preserves its two-dimensional nature.
In [32]:
X = dataNew[:, 1:]
In [33]:
X.shape, T.shape
Out[33]:
In [34]:
Xnames = names[1:]
Tname = names[0]
Xnames,Tname
Out[34]:
Now, let's see if a linear model makes some sense by plotting the target values versus each of the input variables.
In [35]:
plt.figure(figsize=(10,10))
for c in range(X.shape[1]):
    plt.subplot(3,3, c+1)
    plt.plot(X[:,c], T, 'o', alpha=0.5)
    plt.ylabel(Tname)
    plt.xlabel(Xnames[c])
What do you think? Are there any linear relationships between the individual input variables and the target variable? Do they make sense, given your knowledge of automobiles?
Now, to build the linear model. First, let's tack on a column of constant 1 values to the left side of X. With this addition, our matrix multiplication takes care of the $w_0$ coefficient as described above.
In [36]:
X1 = np.hstack((np.ones((X.shape[0],1)), X))
X.shape, X1.shape
Out[36]:
In [37]:
X1[:3,:]
Out[37]:
And, let's add a name to Xnames.
In [38]:
Xnames.insert(0, 'bias')
Xnames
Out[38]:
We could try to fit a linear model to all of the data and check to see how accurately we predict mpg for each sample. However, this will give a much too optimistic expectation of how well the model will do on new data.
A much better way to evaluate a model is to remove, or hold out, some of the samples from the data set used to fit the model. Then apply the model to the held out samples and compare the predicted mpg with the actual mpg. Call these held out samples the test set and the other samples used to fit the model the train set.
How should we partition the data into these two disjoint subsets? A common practice is to randomly select 80% of the samples as training samples and use the remaining 20% of the samples as testing samples.
To partition our samples (rows of X and T) into training and test sets, let's deal with just the row indices. Then we will use the same selected row indices to slice out corresponding rows of X and T.
In [39]:
nrows = X1.shape[0]
nTrain = int(round(nrows*0.8))
nTest = nrows - nTrain
nTrain,nTest,nTrain+nTest
Out[39]:
In [40]:
rows = np.arange(nrows)
np.random.shuffle(rows)
rows
Out[40]:
In [41]:
trainIndices = rows[:nTrain]
testIndices = rows[nTrain:]
trainIndices,testIndices
Out[41]:
Check that the training and testing sets are disjoint.
In [42]:
np.intersect1d(trainIndices, testIndices)
Out[42]:
In [43]:
Xtrain = X1[trainIndices,:]
Ttrain = T[trainIndices,:]
Xtest = X1[testIndices,:]
Ttest = T[testIndices,:]
Xtrain.shape,Ttrain.shape, Xtest.shape,Ttest.shape
Out[43]:
We want to solve for $\wv$ in the equation $X^T X \wv = X^T T$. This is done with the numpy.linalg.lstsq function. Don't be confused by the Ts: the target matrix is named T, while something.T is the transpose of something.
In [82]:
w = np.linalg.lstsq(np.dot(Xtrain.T,Xtrain), np.dot(Xtrain.T, Ttrain))
w = w[0] # to only keep the weights, and discard other information returned by lstsq
w,len(w)
Out[82]:
In [45]:
for wi,name in zip(w.flat,Xnames):
    print('{:8.3f} {:s}'.format(wi,name))
How can you figure out what np.linalg.lstsq is doing? Try finding the source code!
In [46]:
!locate linalg.py
In my version I see documentation for lstsq that states
Return the least-squares solution to a linear matrix equation.
Solves the equation `a x = b` by computing a vector `x` that
minimizes the Euclidean 2-norm `|| b - a x ||^2`. The equation may
be under-, well-, or over- determined (i.e., the number of
linearly independent rows of `a` can be less than, equal to, or
greater than its number of linearly independent columns). If `a`
is square and of full rank, then `x` (but for round-off error) is
the "exact" solution of the equation.
Now that we have a linear model, which is simply $w$, let's use it to predict mpg for some samples.
In [47]:
X1[:4,:]
Out[47]:
In [48]:
predict = np.dot(X1[:4,:],w)
predict
Out[48]:
How do these predictions compare with the actual mpg values? We can either make a two column matrix, or use a for loop to print them.
In [49]:
np.hstack(( predict, T[:4,:]))
Out[49]:
In [50]:
print('{:^5} {:^5}'.format('P','T'))
for (p,t) in zip(predict, T[0:4,:]):
    # print(p,t)
    print('{:5.2f} {:5.2f}'.format(p[0],t[0]))
Let's try all of the test data and plot the results.
In [51]:
predict = np.dot(Xtest, w)
predict.shape, Ttest.shape
Out[51]:
In [52]:
plt.plot(predict,Ttest,'o')
plt.xlabel('Predicted MPG')
plt.ylabel('Actual MPG')
# add a 45 degree line
a = max(min(predict),min(Ttest))
b = min(max(predict),max(Ttest))
plt.plot([a,b],[a,b], 'r', linewidth=3,alpha=0.7);
Not too shabby! But, how about a numerical measure of accuracy?
Right, just calculate the root-mean-square error (RMSE).
In [53]:
np.sqrt( np.mean( (np.dot(Xtest,w) - Ttest)**2))
Out[53]:
This means we are about 2.8 mpg off in our predictions, on average.
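Since we will be computing this error again below, it might be handy to wrap it in a small helper function (just a sketch; the rest of the notebook keeps writing the expression inline):

def rmse(inputs, weights, targets):
    # root-mean-square error of the linear model's predictions
    return np.sqrt(np.mean((np.dot(inputs, weights) - targets)**2))

print(rmse(Xtest, w, Ttest))   # same value as the expression above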
Here again are the weights we learned before for predicting mpg. Can we use these to decide which attributes are most important? Can we remove some from the model?
In [54]:
for wi,name in zip(w.flat,Xnames):
    print('{:8.3f} {:s}'.format(wi,name))
Perhaps year and origin are the most significant. Does this make sense?
A weight's magnitude is determined not only by the strength of the linear relationship between the corresponding input variable and the target, but also by the variable's range.
What are the ranges of the input variables?
In [55]:
for n,mn,mx in zip(Xnames,np.min(X1,axis=0),np.max(X1,axis=0)):
    print('{:>20} {:8.2f} {:8.2f}'.format(n,mn,mx))
The weight for weight has the smallest magnitude, but the range of its values is the largest.
The weight for origin has the largest magnitude, but the range of its values is the smallest.
If we could remove the effect of the ranges on the weight magnitudes, we could use the weight magnitudes to judge the relative importance of each input variable. How?
A common approach is to standardize the input variables by adjusting the values to have zero mean and unit variance.
In [56]:
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
Xs.shape
Out[56]:
In [57]:
Xs.mean(0)
Out[57]:
In [58]:
Xs.std(0)
Out[58]:
To do this correctly when partitioning data into training and testing sets, we must always calculate means and standard deviations using only the training set, and use the same means and standard deviations when standardizing the testing set. Remember, you must not use any information about the testing set when building a model. If you do, your test error will be lower than it will be when you truly see new data.
You can do this by storing the means and standard deviations obtained from the training data, and just using them when standardizing the testing data. Don't forget to add the column of 1's after standardizing. (The next cell just demonstrates the computation on the full data set; further below we apply it properly, using only the training partition.)
In [59]:
means = np.mean(X,axis=0)
stds = np.std(X,axis=0)
Xs = (X - means) / stds
Xs1 = np.hstack((np.ones((Xs.shape[0],1)), Xs))
w = np.linalg.lstsq( np.dot(Xs1.T,Xs1), np.dot(Xs1.T, T) )[0]
w
Out[59]:
Another way is to construct functions for standardizing that include the calculated means and standard deviations as local variables, by using function closures.
In [60]:
def makeStandardize(X):
    means = X.mean(axis=0)
    stds = X.std(axis=0)
    def standardize(origX):
        return (origX - means) / stds
    def unStandardize(stdX):
        return stds * stdX + means
    return (standardize, unStandardize)
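To see that the returned functions really do remember the statistics of the data they were created from, here is a quick sketch with a tiny made-up array; standardize keeps using that array's mean and standard deviation even when applied to new values:

tiny = np.array([[1.0], [2.0], [3.0]])
stdF, unstdF = makeStandardize(tiny)
print(stdF(tiny))                 # zero mean, unit variance
print(stdF(np.array([[4.0]])))    # standardized with tiny's mean and std, not its own
print(unstdF(stdF(tiny)))         # recovers the original values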
Let's start with X again, and tack on the column of 1's after dividing data into training and testing partitions.
In [61]:
Xtrain = X[trainIndices,:]
Ttrain = T[trainIndices,:]
Xtest = X[testIndices,:]
Ttest = T[testIndices,:]
(standardize, unStandardize) = makeStandardize(Xtrain)
XtrainS = standardize(Xtrain)
XtestS = standardize(Xtest)
np.mean(XtrainS,axis=0), np.std(XtrainS,axis=0), np.mean(XtestS,axis=0), np.std(XtestS,axis=0)
Out[61]:
Notice that the means and standard deviations for the testing set are not as close to 0 and 1 as they are for the training set. Why?
Now we can tack on the column of 1's, solve for w, and examine the weight magnitudes.
In [62]:
XtrainS1 = np.hstack((np.ones((XtrainS.shape[0],1)), XtrainS))
XtestS1 = np.hstack((np.ones((XtestS.shape[0],1)), XtestS))
w = np.linalg.lstsq( np.dot(XtrainS1.T,XtrainS1), np.dot(XtrainS1.T, Ttrain))[0] # see this [0]?
for wi,name in zip(w.flat,Xnames):
    print('{:8.3f} {:s}'.format(wi,name))
Now what do you observe about the relative magnitudes? If we had a ton of input variables, it would be easier to compare them if we sorted them by magnitude.
In [63]:
np.abs(w)
Out[63]:
In [64]:
np.argsort(np.abs(w.flat))
Out[64]:
In [65]:
np.argsort(np.abs(w.flat))[::-1]
Out[65]:
In [66]:
sortedOrder = np.argsort(np.abs(w.flat))[::-1]
Xnames = np.array(Xnames)
for wi,name in zip(w.flat[sortedOrder],Xnames[sortedOrder]):
    print('{:8.3f} {:s}'.format(wi,name))
The beauty of matrix equations now shines. Previously $T$ was an $N \times 1$ matrix of target values, one per sample.
Just add additional columns of target values for each target component. So $T$ becomes an $N \times K$ matrix, if there are $K$ output components. Each row is one $K$-dimensional sample.
Looking again at the solution for $\wv$,
$$ \begin{align*} \wv &= (X^T X)^{-1} X^T T \end{align*} $$ we see that this changes $\wv$ from its original shape of $(D+1) \times 1$ to $(D+1) \times K$.
Let's use this to predict miles per gallon and horsepower.
First, let's assemble the data for predicting MPG and HP.
In [67]:
X = dataNew[:,[1,2,4,5,6,7]]
T = dataNew[:,[0,3]]
Tnames = [names[0], names[3]]
X.shape,Xnames,T.shape,Tnames
Out[67]:
In [68]:
Xtrain = X[trainIndices,:]
Ttrain = T[trainIndices,:]
Xtest = X[testIndices,:]
Ttest = T[testIndices,:]
Xtrain.shape, Ttrain.shape, Xtest.shape, Ttest.shape
Out[68]:
In [69]:
standardize,_ = makeStandardize(Xtrain)
XtrainS = standardize(Xtrain)
XtestS = standardize(Xtest)
XtrainS1 = np.hstack((np.ones((XtrainS.shape[0],1)), XtrainS))
Xnames = np.array(['bias']+names)[[0,2,3,5,6,7,8]]
Xnames
Out[69]:
In [70]:
XtrainS1.shape,Ttrain.shape
Out[70]:
In [71]:
w = np.linalg.lstsq( np.dot(XtrainS1.T, XtrainS1), np.dot(XtrainS1.T, Ttrain))[0]
w
Out[71]:
In [72]:
Xnames = np.array(Xnames)
for targeti in range(2):
    print('\nTarget {}\n'.format(Tnames[targeti]))
    thisw = w[:,targeti]
    sortedOrder = np.argsort(np.abs(thisw))[::-1]
    for wi,name in zip(thisw[sortedOrder],Xnames[sortedOrder]):
        print('{:8.3f} {:s}'.format(wi,name))
Now try predicting both mpg and horsepower.
In [73]:
XtestS1 = np.hstack((np.ones((XtestS.shape[0],1)), XtestS))
prediction = np.dot(XtestS1,w)
prediction.shape
Out[73]:
In [74]:
plt.figure(figsize=(10,10))
for p in range(2):
    plt.subplot(2,1,p+1)
    plt.plot(prediction[:,p],Ttest[:,p],'o')
    plt.xlabel("Predicted " + Tnames[p])
    plt.ylabel("Actual " + Tnames[p])
    a = max(min(prediction[:,p]),min(Ttest[:,p]))
    b = min(max(prediction[:,p]),max(Ttest[:,p]))
    plt.plot([a,b],[a,b],'r',linewidth=3)
How well did we do in terms of RMSE?
In [75]:
rmseTrain = np.sqrt(np.mean((np.dot(XtrainS1,w) - Ttrain)**2,axis=0))
rmseTrain
Out[75]:
In [76]:
rmseTest = np.sqrt(np.mean((np.dot(XtestS1,w) - Ttest)**2,axis=0))
rmseTest
Out[76]:
In [77]:
print('Training RMSE: MPG {:4.2f} HP {:4.2f}'.format(*rmseTrain)) #what is that * doing there??
print(' Testing RMSE: MPG {:4.2f} HP {:4.2f}'.format(*rmseTest))
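In case that * is new to you: it unpacks the elements of the array into separate positional arguments, one for each {} field in the format string. A tiny sketch of the same idea:

pair = np.array([1.5, 2.5])
print('{:4.2f} {:4.2f}'.format(*pair))   # same as .format(pair[0], pair[1])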