$y_{i} = f(x_{i}) + \epsilon_{i}$
You can easily imagine that there are errors in this model, because you can have two houses with exactly the same number of square feet that sell for very different prices: they could have sold at different times, or had different numbers of bedrooms or bathrooms, different yard sizes, or different locations, neighborhoods, or school districts. Lots of things that we might not have taken into account in our model.
Screenshot taken from Coursera 6:45
We're gonna compare the actual sales price to the predicted sales price using any given $\hat{f}$, and the quality metric will tell us how well we do. So there's gonna be some error in our predicted values, and the machine learning algorithm we're gonna use to fit these regression models is gonna try to minimize that error. It's gonna search over all these functions to reduce the error in these predicted values.
Screenshot taken from Coursera 3:48
What's the equation of a line?
A line is $f(x) = w_{0} + w_{1}x$, where $w_{0}$ is the intercept and $w_{1}$ is the slope.
And what this regression model then specifies is that each one of our observations $y_{i}$ is simply that function evaluated at $x_{i}$, plus an error term: $$y_{i} = w_{0} + w_{1}x_{i} + \epsilon_{i}$$
Screenshot taken from Coursera 1:12
What is the difference between residual and error?
The error $\epsilon_{i}$ is the gap between an observation $y_{i}$ and the true (unknown) relationship $f(x_{i})$; the residual is the gap between an observation and our fitted prediction $\hat{y}_{i} = \hat{f}(x_{i})$.
Residual sum of squares (RSS)
The sum over all observations of the squared differences between the actual values and the predicted values (square each difference first, then sum): $$RSS(w_{0}, w_{1}) = \sum_{i=1}^{N}(y_{i} - [w_{0} + w_{1}x_{i}])^{2}$$
Screenshot taken from Coursera 3:26
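As a minimal sketch of this quantity in code (the function name and the toy house data below are just illustrative, not from the course):

```python
def residual_sum_of_squares(w0, w1, x, y):
    """RSS of the line w0 + w1*x over paired inputs x and outputs y."""
    return sum((yi - (w0 + w1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical example: three houses (square feet, sale price).
x = [1000, 2000, 3000]
y = [300000, 450000, 520000]
print(residual_sum_of_squares(100000, 150, x, y))
```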
Why the hat notation?
e.g., $\hat{y}$ (Y-hat) stands for the predicted value of Y (house price).
http://mathworld.wolfram.com/Hat.html
https://en.wikipedia.org/wiki/Hat_operator
Screenshot taken from Coursera 2:00
One thing I want to make very, very clear is that the magnitude of this slope depends both on the units of our input and on the units of our output. In this case, the units of the slope are dollars per square foot. So if I gave you a house that was measured in some other unit, then this coefficient would no longer be appropriate.
For example, if the input is square feet and another house was measured in square meters instead, clearly I can't just plug that number into the equation.
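For a concrete (made-up) illustration: since 1 square meter is about 10.76 square feet, a house of $x$ square meters is $10.76\,x$ square feet, so the same line written in square meters has slope $10.76\,w_{1}$ dollars per square meter: $$\text{price} = w_{0} + w_{1}(10.76\,x) = w_{0} + (10.76\,w_{1})\,x$$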
Screenshot taken from Coursera 3:54
$RSS(w_{0}, w_{1}) = g(w_{0}, w_{1})$
Our objective here is to minimize RSS over all possible combinations of $w_{0}$ and $w_{1}$.
Screenshot taken from Coursera 2:35
Screenshot taken from Coursera 3:49
Example: $g(w) = 5 - (w - 10)^{2}$
$\frac{d g(w)}{d w} = 0 - 2(w-10) \cdot 1 = -2w + 20$
When we plot this function, we can see it has a concave form. How do I find its maximum?
Screenshot taken from Coursera 6:45
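For this simple example we can find the maximum directly by setting the derivative to zero (the same "set it equal to zero" approach that comes up again later for RSS): $$-2w + 20 = 0 \implies w = 10, \qquad g(10) = 5$$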
If we're looking at these concave situations and our interest is in finding the max over all w of g(w), one thing we can look at is something called a hill-climbing algorithm. It's an iterative algorithm where we start somewhere in this space of possible w's and then keep changing w, hoping to get closer and closer to the optimum.
Okay, so let's say that we start with w here. The question is: should I increase w, moving it to the right, or should I decrease w, moving it to the left, to get closer to the optimum?
So, again, the derivative is telling me what I wanna do. We can write down this hill-climbing algorithm:
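In symbols, the update is: while not converged, $$w^{(t+1)} \leftarrow w^{(t)} + \eta\,\frac{dg(w)}{dw}\bigg|_{w^{(t)}}$$ where $\eta$ is a stepsize, discussed below.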
Example: Let's say I happen to start on this left-hand side at this w value here. At this point the derivative is pretty large; this function's pretty steep. So I'm going to be taking a big step. Then I compute the derivative, and I'm still taking a fairly big step. I keep stepping, increasing w each time, computing the derivative, and as I get closer to the optimum, the size of the derivative decreases.
Screenshot taken from Coursera 3:34
We can use the same type of algorithm to find the minimum of a function.
When the derivative is positive we want to decrease w, and when the derivative is negative we want to increase w.
The update of the hill descent algorithm is gonna look almost exactly the same as the hill climbing, except we have the minus sign, so we're going to move in the opposite direction.
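In symbols: while not converged, $$w^{(t+1)} \leftarrow w^{(t)} - \eta\,\frac{dg(w)}{dw}\bigg|_{w^{(t)}}$$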
Screenshot taken from Coursera 2:40
So in these algorithms, I said that there's a stepsize. Stepsize is denoted as $\eta$. This determines how much you're changing your W at every iteration of this algorithm.
One choice you can look at is something called fixed stepsize or constant stepsize, where, as the name implies, you just set eta equal to some constant. So for example, maybe 0.1.
But what can happen is that you can jump around quite a lot. I keep taking these big steps, end up jumping over the optimum to a point here, then jump back, and I'm going back and forth, and back and forth, and I converge very slowly to the optimum itself.
Screenshot taken from Coursera 2:00
A common choice is to decrease the stepsize as the number of iterations increases.
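For example, commonly used schedules (the specific forms here are just illustrative) include: $$\eta_t = \frac{\alpha}{t} \qquad \text{or} \qquad \eta_t = \frac{\alpha}{\sqrt{t}}$$ for some constant $\alpha > 0$.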
One thing you have to be careful about is not decreasing the stepsize too rapidly. Because if you're doing that, you're gonna, again, take a while to converge. Because you're just gonna be taking very, very, very small steps. Okay, so in summary choosing your stepsize is just a bit of an art.
Screenshot taken from Coursera 4:30
How are we going to assess our convergence?
Well, we know that the optimum occurs when the derivative of the function is equal to 0. $$\frac{dg(w)}{dw} = 0$$
But what we're gonna see in practice, is that the derivative, it's gonna get smaller and smaller, but it won't be exactly 0. At some point, we're going to want to say, okay, that's good enough. We're close enough to the optimum. I'm gonna terminate this algorithm.
In practice, stop when $$\left|\frac{dg(w)}{dw}\right| < \epsilon$$
$\epsilon$ is a threshold I'm setting. Then if this is satisfied, then I'm gonna terminate the algorithm and return whatever solution I have $w^{(t)}$. So in practice, we're just gonna choose epsilon to be very small. And I wanna emphasize that what very small means depends on the data that you're looking at, what the form of this function is.
Screenshot taken from Coursera 5:30
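Putting the pieces of this 1D discussion together, here is a minimal sketch of hill descent with a fixed stepsize and the convergence check above; the example function $g(w) = (w - 3)^2$, the stepsize, and the threshold are all made up for illustration:

```python
def hill_descent(dg, w_init, eta=0.1, epsilon=1e-6, max_iter=10000):
    """Minimize a 1-D function given its derivative dg, starting from w_init.

    eta     -- fixed stepsize
    epsilon -- convergence threshold on |dg/dw|
    """
    w = w_init
    for _ in range(max_iter):
        slope = dg(w)
        if abs(slope) < epsilon:   # derivative close enough to zero: stop
            break
        w = w - eta * slope        # step in the direction opposite the derivative
    return w

# Example: minimize g(w) = (w - 3)^2, whose derivative is 2(w - 3).
print(hill_descent(lambda w: 2 * (w - 3), w_init=0.0))   # approximately 3
```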
So up until this point we've talked about functions of just one variable and finding their minimum or maximum. But remember, when we were talking about residual sum of squares, we had two variables, the two parameters of our model, $w_{0}$ and $w_{1}$, and we wanted to minimize over both.
Let's talk about how we handle functions defined over multiple variables. Moving to multiple dimensions, when we have these functions in higher dimensions we don't talk about derivatives anymore; we talk about gradients in their place.
What is a gradient?
The definition of a gradient: it's gonna be a vector of what are called the partial derivatives of g. We look at the partial of g with respect to $w_{0}$, the partial of g with respect to $w_{1}$, all the way up to the partial of g with respect to some $w_{p}$.
A partial derivative is exactly like a derivative where we're taking the derivative with respect to, in this case, $w_{1}$. But what are we going to do with all the other w's, $w_{0}$, $w_{2}$, $w_{3}$, all the way up to $w_{p}$? Well, we're just going to treat them like constants.
So, because we're indexing starting at zero, this is a (p + 1)-dimensional vector.
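Written out: $$\nabla g(w) = \begin{bmatrix} \frac{\partial g}{\partial w_{0}} \\ \frac{\partial g}{\partial w_{1}} \\ \vdots \\ \frac{\partial g}{\partial w_{p}} \end{bmatrix}$$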
Screenshot taken from Coursera 3:05
Work through Example
We have a function: $$g(w) = 5w_{0} + 10w_{0}w_{1} + 2w_{1}^{2} $$
Let's compute the gradient of $g(w)$.
This is my gradient: $$\nabla g(w) = \begin{bmatrix} 5 + 10w_{1} \\ 10w_{0} + 4w_{1} \end{bmatrix}$$
If I want to look at the gradient at any point on this surface, I just plug in whatever the $w_{0}$ and $w_{1}$ values are at that point and compute the gradient. It'll be some number in the first component and some number in the second component, and that forms some vector.
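For instance, at the (hypothetical) point $(w_{0}, w_{1}) = (2, 1)$: $$\nabla g(w) = \begin{bmatrix} 5 + 10(1) \\ 10(2) + 4(1) \end{bmatrix} = \begin{bmatrix} 15 \\ 24 \end{bmatrix}$$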
Screenshot taken from Coursera 5:08
Instead of looking at these 3D mesh plots that we've been looking at, we can look at a contour plot, where we can kind of think of this as a bird's eye view.
Screenshot taken from Coursera 2:37
Let's talk about the gradient descent algorithm, which is the analogous algorithm to what I called the hill descent algorithm in 1D.
But in place of the derivative of the function, we've now specified the gradient of the function. Other than that, everything looks exactly the same. We now have a vector of parameters, and we're updating them all at once: we take our previous vector and subtract eta times our gradient, which is also a vector. So it's just the vector analog of the hill descent algorithm.
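In symbols: while not converged, $$\mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t)} - \eta\,\nabla g(\mathbf{w}^{(t)})$$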
If we take a look at the picture, we start at a point where the gradient is actually pointing in the direction of steepest ascent (uphill). But we're moving in the negative gradient direction.
Screenshot taken from Coursera 5:30
Now, we can think about applying these optimization notions and optimization algorithms that we described to our specific scenario of interest. Which is searching over all possible lines and finding the line that best fits our data.
So the first thing that's important to mention is the fact that our objective is convex. What this implies is that the solution to this minimization problem is unique: we know there's a unique minimum to this function. And likewise, we know that our gradient descent algorithm will converge to this minimum.
Screenshot taken from Coursera 1:20
Let's return to the definition of our cost, which is the residual sum of squares of our two parameters, $(w_{0}, w_{1})$, which I've written again right here.
But before we get to taking the gradient, which is gonna be so crucial for our gradient descent algorithm, let's just note the following fact about derivatives: if we take the derivative of a sum of N different functions, $g_{1}$ to $g_{N}$, with respect to some variable w, the derivative distributes across the sum, and we can rewrite it as the sum of the derivatives. So the derivative of a sum of functions is the same as the sum of the derivatives of the individual functions.
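In symbols: $$\frac{d}{dw}\left[\sum_{i=1}^{N} g_{i}(w)\right] = \sum_{i=1}^{N} \frac{d\,g_{i}(w)}{dw}$$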
The $g_{i}(w)$ that I'm writing here is equal to: $$ g_{i}(w) = (y_{i} - [w_{0} + w_{1} x_{i}])^{2}$$
And we see that the residual sum of squares is indeed a sum over N different functions of $w_{0}$ and $w_{1}$. So in our case, when we take the partial of the residual sum of squares with respect to, for example, $w_{0}$, this is going to be equal to: $$\frac{\partial RSS(w)}{\partial w_{0}} = \sum_{i=1}^{N} \frac{\partial}{\partial w_{0}}(y_{i} - [w_{0} + w_{1} x_{i}])^{2}$$
And the same holds for $w_{1}$.
Screenshot taken from Coursera 3:45
Let's go ahead and actually compute this gradient
And the first thing we're gonna do is take the partial with respect to $w_{0}$.
Okay, so I'm gonna use the fact that I showed on the previous slide to take the sum to the outside, and then take the partial of the inside. So I have a function raised to a power: I bring that power down, so I get a 2 here, rewrite the function, $(y_{i} - [w_{0} + w_{1}x_{i}])$, and now its power is just gonna be 1. But then I have to take the derivative of the inside. And what's the derivative of the inside when I'm taking this derivative with respect to $w_{0}$? Well, what I have is a -1 multiplying $w_{0}$, and everything else I'm just treating as a constant. So I need to multiply by -1.
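Putting that together, the partial with respect to $w_{0}$ works out to: $$\frac{\partial RSS(w_{0}, w_{1})}{\partial w_{0}} = -2\sum_{i=1}^{N}\left(y_{i} - [w_{0} + w_{1}x_{i}]\right)$$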
Screenshot taken from Coursera 4:58
Let's go ahead and now take the partial with respect to $w_{1}$.
So in this case, again I'm pulling the sum out, and the same thing happens: I bring the 2 down and rewrite the inside part of the function, raised to just the 1 power. Then, when I take the derivative of this inside part with respect to $w_{1}$, what do I have? Well, all of these things are constants with respect to $w_{1}$, but what's multiplying $w_{1}$? I have a $(-x_{i})$, so I'm going to need to multiply by $(-x_{i})$.
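So the partial with respect to $w_{1}$ works out to: $$\frac{\partial RSS(w_{0}, w_{1})}{\partial w_{1}} = -2\sum_{i=1}^{N}x_{i}\left(y_{i} - [w_{0} + w_{1}x_{i}]\right)$$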
Screenshot taken from Coursera 5:58
Let's put this all together
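Stacking the two partials computed above gives: $$\nabla RSS(w_{0}, w_{1}) = \begin{bmatrix} -2\sum_{i=1}^{N}(y_{i} - [w_{0} + w_{1}x_{i}]) \\ -2\sum_{i=1}^{N}x_{i}(y_{i} - [w_{0} + w_{1}x_{i}]) \end{bmatrix}$$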
So this is the gradient of our residual sum of squares, and it's a vector of two dimensions because we have two variables, $w_{0}$ and $w_{1}$. Now what can we think about doing? Well, of course we can think about doing the gradient descent algorithm. But let's hold off on that, because what do we know is another way to solve for the minimum of this function?
Well, we know we can, just like we talked about in 1D, take the derivative and set it equal to zero; that was the first approach for solving for the minimum. Here we can take the gradient and set it equal to zero.
Screenshot taken from Coursera 7:13
So this is gonna be our Approach 1. And this is drawn here on this 3D mesh plot, where that green surface marks where the gradient = 0, and that red dot is the optimal point that we're gonna be looking at.
Let's take this gradient, set it equal to zero and solve for W0 and W1. Those are gonna be our estimates of our two parameters of our model that define our fitted line.
I'm gonna take the top line and set it equal to 0. The reason I'm putting the hats on now is that these are our solutions, our estimated values of these parameters.
Break down the math
First, let's make our notation shorter:
$\sum\limits_{i=1}^N=\Sigma$
Top term (Row 1)
Bottom term (Row 2)
Screenshot taken from Coursera 5:11
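Carrying the algebra through (this is the standard closed-form least squares result, written with the $\Sigma$ shorthand above) gives: $$\hat{w}_{1} = \frac{\Sigma x_{i}y_{i} - \frac{(\Sigma y_{i})(\Sigma x_{i})}{N}}{\Sigma x_{i}^{2} - \frac{(\Sigma x_{i})(\Sigma x_{i})}{N}}, \qquad \hat{w}_{0} = \frac{\Sigma y_{i}}{N} - \hat{w}_{1}\frac{\Sigma x_{i}}{N}$$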
The other approach we discussed is gradient descent, where we're walking down this surface of residual sum of squares, trying to get to the minimum. Of course we might overshoot it and go back and forth, but that's the general idea of this iterative procedure. And in this case it's useful to reinterpret the gradient of the residual sum of squares that we computed previously.
A bit of notation: write $\hat y_i = w_{0}^{(t)} + w_{1}^{(t)} x_{i}$ for the predicted value of house $i$ using the current estimates of the coefficients.
Screenshot taken from Coursera 1:30
Then we can write our gradient descent algorithm
While not converged, we take our previous vector of coefficients, $w_{0}$ at iteration t and $w_{1}$ at iteration t, and subtract eta times the gradient.
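Plugging in the RSS gradient from above: $$\begin{bmatrix} w_{0}^{(t+1)} \\ w_{1}^{(t+1)} \end{bmatrix} = \begin{bmatrix} w_{0}^{(t)} \\ w_{1}^{(t)} \end{bmatrix} - \eta \begin{bmatrix} -2\sum_{i=1}^{N}(y_{i} - [w_{0}^{(t)} + w_{1}^{(t)}x_{i}]) \\ -2\sum_{i=1}^{N}x_{i}(y_{i} - [w_{0}^{(t)} + w_{1}^{(t)}x_{i}]) \end{bmatrix}$$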
Screenshot taken from Coursera 3:25
We can also rewrite the algorithm.
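In terms of the predicted values $\hat y_i = w_{0}^{(t)} + w_{1}^{(t)}x_{i}$, the same update reads: $$w_{0}^{(t+1)} = w_{0}^{(t)} + 2\eta\sum_{i=1}^{N}(y_{i} - \hat y_i), \qquad w_{1}^{(t+1)} = w_{1}^{(t)} + 2\eta\sum_{i=1}^{N}x_{i}\,(y_{i} - \hat y_i)$$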
I want it in this form to provide a little bit of intuition here. Because what happens if overall, we just tend to be underestimating our values y?
So if overall we're under-predicting, meaning $\hat y_i$ is generally below the true value $y_i$, then the sum $\sum [y_i - \hat y_i]$ is going to be positive.
And what's gonna happen? Well, this term here $\sum [y_i - \hat y_i]$ is positive. We're multiplying by a positive thing, and adding that to our vector W. So $w_0$ is going to increase. And that makes sense, because we have some current estimate of our regression fit. But if generally we're under predicting our observations that means probably that line is too low. So, we wanna shift it up. That means increasing $w_0$.
So, there's a lot of intuition in this formula for what's going on in this gradient descent algorithm. And that's just talking about this first term $w_0$, but then there's this second term $w_1$, which is the slope of the line. And in this case there's a similar intuition. But we need to multiply by this $x_i$, accounting for the fact that this is a slope term.
Screenshot taken from Coursera 6:00
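To tie the pieces together, here is a minimal runnable sketch of this gradient descent procedure for the simple linear regression fit; the toy data, stepsize, threshold, and function name are all made up for illustration:

```python
import numpy as np

def simple_linear_regression_gd(x, y, eta=0.01, epsilon=0.01, max_iter=100000):
    """Fit y ~ w0 + w1*x by gradient descent on the residual sum of squares."""
    w0, w1 = 0.0, 0.0                       # initial guess for intercept and slope
    for _ in range(max_iter):
        y_hat = w0 + w1 * x                 # predictions with current coefficients
        error = y - y_hat                   # y_i - y_hat_i
        grad_w0 = -2 * np.sum(error)        # partial of RSS w.r.t. w0
        grad_w1 = -2 * np.sum(x * error)    # partial of RSS w.r.t. w1
        if np.sqrt(grad_w0**2 + grad_w1**2) < epsilon:
            break                           # gradient magnitude below threshold: stop
        w0 -= eta * grad_w0                 # step opposite the gradient
        w1 -= eta * grad_w1
    return w0, w1

# Hypothetical toy data: size in thousands of square feet, price in thousands of dollars.
x = np.array([1.0, 1.5, 2.0, 2.5])
y = np.array([105.0, 148.0, 205.0, 248.0])
print(simple_linear_regression_gd(x, y))   # roughly (6.4, 97.2)
```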
Let's take a moment to compare the two approaches that we've gone over, either setting the gradient equal to zero or doing gradient descent.
Well, in the case of minimizing the residual sum of squares, we showed that both were fairly straightforward to do. But for a lot of the machine learning methods that we're interested in, if we take the gradient and set it equal to zero, there's just no closed-form solution to that problem. So often we have to turn to methods like gradient descent.
And likewise, as we're gonna see in the next module, where we turn to having lots of different inputs, lots of different features in our regression: even though there might be a closed-form solution to setting the gradient equal to zero, sometimes in practice it can be much more computationally efficient to implement the gradient descent approach.
Screenshot taken from Coursera 1:20
Let's discuss the intuition of what happens if we use a different measure of error.
Okay, so this residual sum of squares that we've been looking at is something that's called a symmetric cost function. That's because what we're assuming when we look at this error metric is that if we overestimate the value of our house, that has the same cost as if we underestimate the value of the house.
Screenshot taken from Coursera 2:41
What happens if there isn't a symmetric cost to these errors?
But what if the cost of listing my house sales price as too high is bigger than the cost if I listed it as too low?
So in this case it might be more appropriate to use an asymmetric cost function, where the errors are not weighted equally between these two types of mistakes. So we might choose an asymmetric cost under which I prefer to underestimate the value rather than overestimate it.
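One illustrative form of such a cost (not from the lecture, just to make the idea concrete) weights overestimates more heavily than underestimates: $$\text{Cost}(w_{0}, w_{1}) = \sum_{i=1}^{N} \begin{cases} a\,(y_{i} - \hat y_i)^{2} & \text{if } \hat y_i > y_{i} \text{ (overestimate)} \\ b\,(y_{i} - \hat y_i)^{2} & \text{if } \hat y_i \le y_{i} \text{ (underestimate)} \end{cases} \qquad \text{with } a > b > 0$$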
Screenshot taken from Coursera 3:18
Screenshot taken from Coursera 0:45