1) Regression fundamentals

1) Data and model

  • x: input
  • y: output
  • f(x): functional relationship, expected relationship between x and y

$y_{i} = f(x_{i}) + e_{i}$

  • $e_{i}$: error term

You can easily imagine that there are errors in this model: two houses can have exactly the same number of square feet but sell for very different prices, because they sold at different times, or differed in the number of bedrooms or bathrooms, the size of the yard, the specific location, the neighborhood, or the school district. There are lots of things we might not have taken into account in our model.

Screenshot taken from Coursera 6:45

2) Block diagram

  • $\hat{y}$: predicted house sales price
  • $\hat{f}$: estimated function
  • y: actual sales price

We're gonna compare the actual sales price to the predicted sales price using any given $\hat{f}$, and the quality metric will tell us how well we do. So there's gonna be some error in our predicted values, and the machine learning algorithm we're gonna use to fit these regression models is gonna try to minimize that error. It's gonna search over all these functions to reduce the error in these predicted values.

Screenshot taken from Coursera 3:48

2) The simple linear regression model, its use, and interpretation

1) The simple linear regression model

What's the equation of a line?

  • it's just intercept + slope * our variable of interest, so we're gonna write $f(x) = w_{0} + w_{1}x$

And what this regression model then specifies is that each one of our observations $y_{i}$ is simply that function evaluated at $x_{i}$ plus an error term, so: $$y_{i} = w_{0} + w_{1}x_{i} + \epsilon_{i}$$

  • $\epsilon_{i}$: error term, the distance from our specific observation back down to the line
  • $w_{0}, w_{1}$: intercept and slope respectively, they are called regression coefficients.

Screenshot taken from Coursera 1:12

2) The cost of using a given line

What is the difference between residual and error?

https://www.coursera.org/learn/ml-regression/lecture/WYPGc/the-cost-of-using-a-given-line/discussions/Lx0xn5j1EeW0dw6k4EUmPw

  • Residual is the difference between the observed value and the predicted value. Error is the difference between the observed value and the (often unknown) true value. As such, residuals refer to samples whereas errors refer to populations.
  • There is a true function f(x) that we want to learn, and the observed values $y_{i}$ we have are in fact $y_{i} = f(x_{i}) + e_{i}$, because we need to assume our measurements have some error (or noise, if you prefer that term). This $e_{i}$ is the real error, because the real value is $f(x_{i})$. On the other hand, the residual is $y_{i} - \hat f(x_{i})$, where $\hat f$ is our approximation (estimate) of the real f(x).

Residual sum of squares (RSS)

The sum of the squared differences between the actual values and the predicted values.
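
Written out for a given line $(w_{0}, w_{1})$, this is: $$RSS(w_{0}, w_{1}) = \sum_{i=1}^{N}(y_{i} - [w_{0} + w_{1}x_{i}])^{2}$$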

Screenshot taken from Coursera 3:26

3) Using the fitted line

  • a model is in terms of some parameters and a fitted line is a specific example within that model class

Why the hat notation?

https://www.coursera.org/learn/ml-regression/lecture/RjYbf/using-the-fitted-line/discussions/QOsWrZkGEeWKNwpBrKr_Fw

  • In statistics, the hat operator is used to denote the predicted value of the parameter.

eg: Y-hat stands for the predicted values of Y (house-price).

http://mathworld.wolfram.com/Hat.html

https://en.wikipedia.org/wiki/Hat_operator

  • The hat denotes a predicted value, as contrasted with an observed value. For our purposes right now, I think of the hat value as the value that sits on the regression line, because that's the value our regression analysis would predict. So, for example, the residual for a particular observation is $y_i$ - $\hat y_i$, where $y_i$ is the actual observed outcome at a particular observed value of x and $\hat y_i$ is the value that our regression analysis predicts for y at that same x value.

Screenshot taken from Coursera 2:00
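
As a tiny illustration of using a fitted line to predict, here is a sketch in Python; the coefficient values and the 2640 sq. ft. input are made up for illustration, not taken from the course data.

```python
# Hypothetical fitted coefficients (made-up values for illustration)
w0_hat = -44850.0   # intercept, in dollars
w1_hat = 280.76     # slope, in dollars per square foot

sqft = 2640                       # input: size of the house in square feet
y_hat = w0_hat + w1_hat * sqft    # predicted sales price, y-hat
print(f"Predicted price: ${y_hat:,.2f}")
```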

4) Interpreting the fitted line

  • $\hat{w}_{1}$: predicted change in the output per unit change in the input.

One thing I want to make very, very clear is that the magnitude of this slope depends both on the units of our input and on the units of our output. So in this case, the units of the slope are dollars per square foot. And if I gave you a house that was measured in some other unit, then this coefficient would no longer be appropriate.

For example, if the input is square feet and you have another house that was measured in square meters instead of square feet, then clearly I can't just plug it into that equation.

Screenshot taken from Coursera 3:54

3) An aside on optimization: one dimensional objectives

1) Defining our least squares optimization objective

$RSS(w_{0}, w_{1}) = g(w_{0}, w_{1})$

Our objective here is to minimize over all possible combinations of $w_{0}$, and $w_{1}$

Screenshot taken from Coursera 2:35

2) Finding maxima or minima analytically

  • Concave function: for any two values of w, a and b, the line segment connecting g(a) and g(b) lies below g(w) everywhere in between.
  • Convex function: the opposite property, where the line segment connecting g(a) and g(b) lies above g(w) everywhere in between.
  • There are also functions that are neither concave nor convex, where the line lies below g(w) in some places and above it in others.

Screenshot taken from Coursera 3:49

  • For a concave function, the maximum of g(w) occurs at the point where the derivative = 0. The same holds for a convex function: if we want to find the minimum over all w of g(w), the derivative = 0 at the minimum point.

Example: $g(w) = 5 - (w - 10)^{2}$

$\frac{d g(w)}{d w} = 0 - 2(w-10) \cdot 1 = -2w + 20$

When we plot this function, we can see it has a concave form. How do I find its maximum?

  • I take the derivative, set it equal to 0, and solve for w, which gives w = 10.

Screenshot taken from Coursera 6:45
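
A quick numerical sanity check of this worked example (a sketch, assuming NumPy is available):

```python
import numpy as np

# g(w) = 5 - (w - 10)^2 is concave; its derivative is dg/dw = -2(w - 10)
g = lambda w: 5 - (w - 10) ** 2
dg = lambda w: -2 * (w - 10)

w_grid = np.linspace(0, 20, 2001)
w_best = w_grid[np.argmax(g(w_grid))]   # brute-force search for the maximum
print(w_best)        # -> 10.0, matching the analytic solution
print(dg(w_best))    # -> 0.0, the derivative vanishes at the maximum
```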

3) Finding the max via hill climbing

If we're looking at these concave situations and our interest is in finding the max over all w of g(w), one thing we can look at is something called a hill-climbing algorithm. It's an iterative algorithm where we start somewhere in the space of possible w's and then keep changing w, hoping to get closer and closer to the optimum.

Okay, so let's say that we start at some w. The question is: should I increase w, moving it to the right, or should I decrease w, moving it to the left, to get closer to the optimum?

  • What I can do is I can look at the function at w and I can take the derivative and if the derivative is positive like it is here, this is the case where I want to increase w. If the derivative is negative, then I want to decrease w.
  • So, we can actually divide the space into two.
    • Where on the left of the optimal, we have that the derivative of g with respect to w is greater than 0. And these are the cases where we're gonna wanna increase w.
    • And on the right-hand side of the optimum we have that the derivative of g with respect to w is negative. And these are cases where we're gonna wanna decrease w.
    • If I'm exactly at the optimum, which maybe I'll call $w^{*}$. I do not want to move to the right or the left, because I want to stay at the optimum. The derivative at this point is 0.

So, again, the derivative is telling me what I wanna do. We can write down this hill-climbing algorithm:

  • While not converged, I'm gonna take my previous w, where I was at iteration t, so t is the iteration counter. And I'm gonna move in the direction indicated by the derivative. So, if the derivative of the function is positive, I'm going to be increasing w, and if the derivative is negative, I'm going to be decreasing w, and that's exactly what I want to be doing.
  • But instead of moving exactly the amount specified by the derivative at that point, we can introduce something, I'll call it eta ($\eta$). And $\eta$ is what's called a step size.
  • So, when I go to compute my next w value, I'm gonna take my previous w value and I'm going to move an amount based on the derivative as determined by the step size.

Example: Let's say I happen to start on the left-hand side at some w value. At this point the derivative is pretty large; the function is pretty steep, so I take a big step. Then I compute the derivative again and still take a fairly big step. I keep increasing w, taking a step, and computing the derivative, and as I get closer to the optimum the size of the derivative decreases, so the steps get smaller.

Screenshot taken from Coursera 3:34
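
A minimal sketch of this hill-climbing update, applied to the earlier example $g(w) = 5 - (w - 10)^{2}$; the step size and tolerance here are illustrative choices, not values from the course.

```python
def hill_climb(dg, w_init, eta=0.1, tol=1e-8, max_iter=100000):
    """Hill climbing for a concave g: repeatedly move w in the direction of the derivative."""
    w = w_init
    for t in range(max_iter):
        slope = dg(w)
        if abs(slope) < tol:      # derivative ~ 0: we're (approximately) at the maximum
            break
        w = w + eta * slope       # w^(t+1) = w^(t) + eta * dg/dw evaluated at w^(t)
    return w

# dg/dw for g(w) = 5 - (w - 10)^2
print(hill_climb(lambda w: -2 * (w - 10), w_init=0.0))   # -> ~10.0
```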

4) Finding the min via hill descent

We can use the same type of algorithm to find the minimum of a function.

When the derivative is positive we want to decrease w, and when the derivative is negative we want to increase w.

The update of the hill descent algorithm is gonna look almost exactly the same as the hill climbing, except we have the minus sign, so we're going to move in the opposite direction.
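
Written out, the hill descent update is: $$w^{(t+1)} = w^{(t)} - \eta \frac{d g(w^{(t)})}{d w}$$ compared to $w^{(t+1)} = w^{(t)} + \eta \frac{d g(w^{(t)})}{d w}$ for hill climbing.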

Screenshot taken from Coursera 2:40

5) Choosing stepsize and convergence criteria

So in these algorithms, I said that there's a stepsize, denoted $\eta$. This determines how much you're changing your w at every iteration of the algorithm.

One choice you can look at is something called fixed stepsize or constant stepsize, where, as the name implies, you just set eta equal to some constant. So for example, maybe 0.1.

But what can happen with a fixed stepsize is that you jump around quite a lot: you keep taking big steps, jump over the optimum, jump back, and keep going back and forth, converging very slowly to the optimum itself.

Screenshot taken from Coursera 2:00

A common choice is to decrease the stepsize as the number of iterations increases.

One thing you have to be careful about is not decreasing the stepsize too rapidly, because if you do that, you're again gonna take a while to converge, since you're taking very, very small steps. So, in summary, choosing your stepsize is a bit of an art.

Screenshot taken from Coursera 4:30

How are we going to assess our convergence?

Well, we know that the optimum occurs when the derivative of the function is equal to 0. $$\frac{dg(w)}{dw} = 0$$

But what we're gonna see in practice, is that the derivative, it's gonna get smaller and smaller, but it won't be exactly 0. At some point, we're going to want to say, okay, that's good enough. We're close enough to the optimum. I'm gonna terminate this algorithm.

In practice, stop when $$\left|\frac{dg(w)}{dw}\right| < \epsilon$$

$\epsilon$ is a threshold I'm setting. If this condition is satisfied, I'm gonna terminate the algorithm and return whatever solution I have, $w^{(t)}$. So in practice, we're just gonna choose epsilon to be very small. And I wanna emphasize that what "very small" means depends on the data that you're looking at and on the form of the function.

Screenshot taken from Coursera 5:30
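
Putting the stepsize schedule and the convergence criterion together, here is a small hill descent sketch; the decreasing schedule $\eta_t = \eta_0 / \sqrt{t}$, the tolerance, and the example function are illustrative choices.

```python
def hill_descent(dg, w_init, eta0=0.1, eps=1e-6, max_iter=100000):
    """Hill descent with a decreasing stepsize, stopping when |dg/dw| < eps."""
    w = w_init
    for t in range(1, max_iter + 1):
        slope = dg(w)
        if abs(slope) < eps:      # convergence criterion from the notes
            return w
        eta = eta0 / t ** 0.5     # stepsize shrinks as the iterations increase
        w = w - eta * slope       # move opposite the derivative to descend
    return w

# Convex example: g(w) = (w - 3)^2, so dg/dw = 2(w - 3); the minimum is at w = 3
print(hill_descent(lambda w: 2 * (w - 3), w_init=0.0))   # -> ~3.0
```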

4) An aside on optimization: multidimensional objectives

1) Gradients: derivatives in multiple dimensions

So up until this point we've talked about functions of just one variable and finding their minimum or maximum. But remember, when we were talking about residual sum of squares, we had two variables, two parameters of our model, w0 and w1, and we wanted to minimize over both.

Let's talk about how we handle functions defined over multiple variables. Moving to multiple dimensions, we no longer talk about derivatives; we talk about gradients in their place.

What is a gradient?

  • $\bigtriangledown g(w)$: notation for the gradient of a function
  • w: vector of different w's $[w_{0},w_{1},...,w_{p}]$
$$\bigtriangledown g(w) = \begin{bmatrix} \frac{\partial g}{\partial w_{0}} \\ \frac{\partial g}{\partial w_{1}} \\ \vdots \\ \frac{\partial g}{\partial w_{p}} \end{bmatrix}$$

The definition of the gradient: it's a vector whose entries are the partial derivatives of g, namely the partial of g with respect to $w_{0}$, the partial of g with respect to $w_{1}$, and so on, all the way up to the partial of g with respect to $w_{p}$.

A partial derivative is exactly like an ordinary derivative taken with respect to, in this case, $w_{1}$. But what do we do with all the other w's ($w_{0}, w_{2}, w_{3}$, all the way up to $w_{p}$)? We just treat them like constants.

So the gradient is a (p + 1)-dimensional vector, because we're indexing starting at zero.

Screenshot taken from Coursera 3:05

Work through Example

We have a function: $$g(w) = 5w_{0} + 10w_{0}w_{1} + 2w_{1}^{2} $$

Let's compute the gradient of g(w)

  • partial of G with respect to W zero: $\frac{\partial g}{\partial w_{0}} = 5 + 10w_{1}$, in this case we treat $w_{1}$ like a constant.
  • partial of G with respect to W one: $\frac{\partial g}{\partial w_{1}} = 10w_{0} + 4w_{1}$, where $w_{0}$ is a constant in this case.

This is my gradient: $$\bigtriangledown g(w) = \begin{bmatrix} 5 + 10w_{1} \\ 10w_{0} + 4w_{1} \end{bmatrix}$$

If I want to look at the gradient at any point on this surface, I just plug in whatever the $w_{0}$ and $w_{1}$ values are at that point and compute the gradient. It'll be some vector: some number in the first component and some number in the second component, and that forms a vector.

Screenshot taken from Coursera 5:08
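
A quick symbolic check of this worked example (a sketch, assuming SymPy is available):

```python
import sympy as sp

w0, w1 = sp.symbols('w0 w1')
g = 5 * w0 + 10 * w0 * w1 + 2 * w1 ** 2

gradient = [sp.diff(g, w0), sp.diff(g, w1)]   # partials with respect to w0 and w1
print(gradient)   # [10*w1 + 5, 10*w0 + 4*w1], matching the hand computation
```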

2) Gradient descent: multidimensional hill descent

Instead of looking at these 3D mesh plots that we've been looking at, we can look at a contour plot, where we can kind of think of this as a bird's eye view.

*Screenshot taken from Coursera 2:37

Let's talk about the gradient descent algorithm, which is the analogous algorithm to what I called the hill descent algorithm in 1D.

But in place of the derivative of the function, we've now specified the gradient of the function. Other than that, everything looks exactly the same. We now have a vector of parameters, and we're updating them all at once: we take our previous vector and subtract eta times our gradient, which is also a vector. So it's just the vector analog of the hill descent algorithm.

If we take a look at the picture, we start at a point where the gradient is actually pointing in the direction of steepest ascent (uphill). But we're moving in the negative gradient direction.

*Screenshot taken from Coursera 5:30
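
A minimal sketch of the gradient descent update in vector form; the quadratic used here is a made-up convex example with a unique minimum, and the stepsize and tolerance are illustrative choices.

```python
import numpy as np

def gradient_descent(grad, w_init, eta=0.1, eps=1e-6, max_iter=100000):
    """Vector analog of hill descent: w^(t+1) = w^(t) - eta * grad g(w^(t))."""
    w = np.array(w_init, dtype=float)
    for t in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < eps:   # stop once the gradient's magnitude is tiny
            break
        w = w - eta * g
    return w

# Made-up convex example: g(w) = (w0 - 1)^2 + (w1 + 2)^2, gradient [2(w0 - 1), 2(w1 + 2)]
grad = lambda w: 2 * (w - np.array([1.0, -2.0]))
print(gradient_descent(grad, w_init=[0.0, 0.0]))   # -> approximately [1, -2]
```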

5) Finding the least squares line

1) Computing the gradient of RSS

Now, we can think about applying these optimization notions and optimization algorithms that we described to our specific scenario of interest. Which is searching over all possible lines and finding the line that best fits our data.

So the first thing that's important to mention is the fact that our objective is convex. What this implies is that the solution to this minimization problem is unique: we know there's a unique minimum to this function. And likewise, we know that our gradient descent algorithm will converge to this minimum.

*Screenshot taken from Coursera 1:20

Let's return to the definition of our cost, which is the residual sum of squares of our two parameters $(w_{0}, w_{1})$, which I've written again right here.

But before we get to taking the gradient, which is gonna be so crucial for our gradient descent algorithm, let's just note the following fact about derivatives: if we take the derivative, with respect to some variable w, of a sum of N functions $g_{1}$ to $g_{N}$, the derivative distributes across the sum, and we can rewrite it as the sum of the derivatives. In other words, $\frac{d}{dw}\sum_{i=1}^N g_{i}(w) = \sum_{i=1}^N \frac{d g_{i}(w)}{dw}$.

The $g_{i}(w)$ that I'm writing here is equal to: $$ g_{i}(w) = (y_{i} - [w_{0} + w_{1} x_{i}])^{2}$$

And we see that the residual sum of squares is indeed a sum over n different functions of w0 and w1. And so in our case when we're thinking about taking the partial of the residual sum of squares. With respect to, for example w0, this is going to be equal: $$\frac{\partial RSS(w)}{\partial w_{0}} = \sum_{i=1}^n \frac{\partial}{\partial w_{0}}(y_{i} - [w_{0} + w_{1} x_{i}])^{2}$$

And the same holds for $w_{1}$.

*Screenshot taken from Coursera 3:45

Let's go ahead and actually compute this gradient

And the first thing we're gonna do is take the partial derivative with respect to $w_{0}$.

Okay, so I'm gonna use the fact that I showed on the previous slide to move the sum to the outside, and then take the partial of what's inside. I have a function raised to a power, so I bring that power down: I get a 2, rewrite the inside function $(y_{i} - [w_{0} + w_{1}x_{i}])$ with the power now just 1, and then multiply by the derivative of the inside. What's the derivative of the inside with respect to $w_{0}$? Well, I have a -1 multiplying $w_{0}$, and everything else is treated as a constant, so I need to multiply by -1.

*Screenshot taken from Coursera 4:58

Let's go ahead and now take the partial derivative with respect to $w_{1}$

In this case, again I pull the sum out and the same thing happens: I bring the 2 down and rewrite the inside part of the function raised to just the power 1. Then, when I take the derivative of that inside part with respect to $w_{1}$, what do I have? All of these terms are constants with respect to $w_{1}$, and what's multiplying $w_{1}$ is $(-x_{i})$, so I need to multiply by $(-x_{i})$.

*Screenshot taken from Coursera 5:58
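
Written out, these two partials give: $$\bigtriangledown RSS(w_{0}, w_{1}) = \begin{bmatrix} -2\sum\limits_{i=1}^N [y_{i} - (w_{0} + w_{1}x_{i})] \\ -2\sum\limits_{i=1}^N [y_{i} - (w_{0} + w_{1}x_{i})]x_{i} \end{bmatrix}$$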

Let's put this all together

So this is the gradient of our residual sum of squares, and it's a vector of two dimensions because we have two variables, w0 and w1. Now what can we think about doing? Well, of course we can think about doing the gradient descent algorithm. But let's hold off on that, because what do we know is another way to solve for the minimum of this function?

Well we know we can, just like we talked about in one D, taking the derivative and setting it equal to zero, that was the first approach for solving for the minimum. Well here we can take the gradient and set it equal to zero.

*Screenshot taken from Coursera 7:13

2) Approach 1: closed-form solution

So this is gonna be our Approach 1. And this is drawn here on this 3D mesh plot, where the green surface marks the minimum, which is where the gradient = 0, and the red dot is the optimal point that we're looking for.

Let's take this gradient, set it equal to zero and solve for W0 and W1. Those are gonna be our estimates of our two parameters of our model that define our fitted line.

I'm gonna take the top line and set it equal to 0. The reason I'm putting the hats on now is that these are our solutions, our estimated values of these parameters.

Break down the math

https://www.coursera.org/learn/ml-regression/lecture/G9oBu/approach-1-closed-form-solution/discussions/Ywv6RZfxEeWNbBIwwhtGwQ

First let's make our notation shorter:

$\sum\limits_{i=1}^N=\Sigma$

Top term (Row 1)

  • 1) Break summation down: $$-2(\Sigma y_i-\Sigma w_0 - \Sigma w_1 x_i) = 0$$
  • 2) Divide both sides by -2: $$\Sigma y_i-\Sigma w_0 - \Sigma w_1 x_i = 0$$
  • 3) Summation of a constant replacement: $$\Sigma y_i - Nw_0 - \Sigma w_1 x_i = 0$$
  • 4) Solve for $w_0$: $$Nw_0 = \Sigma y_i - \Sigma w_1 x_i$$
  • 5) Divide by N: $$w_0 = \frac{\Sigma y_i}{N} - \frac{\Sigma w_1 x_i}{N}$$
  • 6) Move $w_1$ constant out: $$w_0 = \frac{\Sigma y_i}{N} - w_1\frac{\Sigma x_i}{N}$$
  • 7) Put party hats on $w_0$ and $w_1$: $$\hat{w_0} = \frac{\Sigma y_i}{N} - \hat{w_1}\frac{\Sigma x_i}{N}$$

Bottom term (Row 2)

  • 1) Factor in $x_i$: $$-2[\Sigma( y_i x_i - w_0 x_i - w_1 x_i^2)]= 0$$
  • 2) Break summation down: $$-2[\Sigma y_i x_i - \Sigma w_0 x_i - \Sigma w_1 x_i^2]= 0$$
  • 3) Divide both sides by -2: $$\Sigma y_i x_i - \Sigma w_0 x_i - \Sigma w_1 x_i^2 = 0$$
  • 4) Move constants $w_0$ and $w_1$ out: $$\Sigma y_i x_i - w_0\Sigma x_i - w_1\Sigma x_i^2 = 0$$
  • 5) Replace $w_0$ with equation from Row 1, Line 6: $$\Sigma y_i x_i - \frac{\Sigma y_i}{N} \Sigma x_i + w_1\frac{\Sigma x_i}{N}\Sigma x_i - w_1\Sigma x_i^2 = 0$$
  • 6) Multiply by 1 namely N/N to remove denominator of N: $$\frac{N}{N}\Sigma y_i x_i - \frac{\Sigma y_i}{N} \Sigma x_i + w_1\frac{\Sigma x_i}{N}\Sigma x_i - w_1\frac{N}{N}\Sigma x_i^2 = 0$$
  • 7) Group numerator and multiply By N to remove denominator: $$N\Sigma y_i x_i - \Sigma y_i \Sigma x_i + w_1\Sigma x_i\Sigma x_i - w_1N\Sigma x_i^2 = 0$$
  • 8) Group $w_1$ terms: $$w_1(\Sigma x_i\Sigma x_i - N\Sigma x_i^2) = \Sigma y_i \Sigma x_i - N\Sigma y_i x_i$$
  • 9) Solve for $w_1$: $$w_1 =\frac{(\Sigma y_i \Sigma x_i - N\Sigma y_i x_i)}{(\Sigma x_i\Sigma x_i - N\Sigma x_i^2)}$$
  • 10) Divide top and bottom by -N (Same as multiply by 1): $$w_1 = \frac{(\Sigma y_i x_i - \frac{\Sigma y_i \Sigma x_i}{N})}{( \Sigma x_i^2 - \frac{\Sigma x_i\Sigma x_i}{N})}$$
  • 11) Put party hat on $w_1$: $$\hat{w_1} = \frac{(\Sigma y_i x_i - \frac{\Sigma y_i \Sigma x_i}{N})}{( \Sigma x_i^2 - \frac{\Sigma x_i\Sigma x_i}{N})}$$

*Screenshot taken from Coursera 5:11
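
A small sketch implementing the two closed-form expressions just derived; the data is made up for illustration (assuming NumPy is available).

```python
import numpy as np

def closed_form_slr(x, y):
    """Closed-form least squares fit, following the formulas derived above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    N = len(x)
    w1_hat = (np.sum(y * x) - np.sum(y) * np.sum(x) / N) / \
             (np.sum(x ** 2) - np.sum(x) * np.sum(x) / N)
    w0_hat = np.sum(y) / N - w1_hat * np.sum(x) / N
    return w0_hat, w1_hat

# Made-up toy data
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
print(closed_form_slr(x, y))   # -> approximately (0.15, 1.94)
```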

3) Approach 2: gradient descent

The other approach we discussed is gradient descent, where we're walking down this surface of the residual sum of squares, trying to get to the minimum. Of course we might overshoot it and go back and forth, but that's the general idea of this iterative procedure. And in this case it's useful to reinterpret the gradient of the residual sum of squares that we computed previously.

A couple of notation points:

  • $y_i$: actual house sales observation
  • $(w_0 + w_1 x_i)$: predicted value, but I am gonna write it as a function of $w_0$ and $w_1$, to make it clear that it's the prediction I'm forming when using $w_0$ and $w_1$, so: $\hat y_i(w_0,w_1)$

*Screenshot taken from Coursera 1:30

Then we can write our gradient descent algorithm

While not converged, we take our previous vector $[w_{0}^{(t)}, w_{1}^{(t)}]$ at iteration t and subtract $\eta$ times the gradient.

*Screenshot taken from Coursera 3:25

We can also rewrite the algorithm.

I want it in this form to provide a little bit of intuition here. Because what happens if overall, we just tend to be underestimating our values y?

So, if overall we're under-predicting, meaning $\hat y_i$ is generally below the true value $y_i$, then the sum $\sum [y_i - \hat y_i]$ is going to be positive.

And what's gonna happen? Well, this term $\sum [y_i - \hat y_i]$ is positive, we're multiplying it by a positive quantity, and adding that to $w_0$. So $w_0$ is going to increase. And that makes sense: we have some current estimate of our regression fit, but if we're generally under-predicting our observations, that line is probably too low, so we wanna shift it up, which means increasing $w_0$.

So there's a lot of intuition in this formula for what's going on in the gradient descent algorithm. And that's just the update for $w_0$; for the second term $w_1$, the slope of the line, there's a similar intuition, but we need to multiply by $x_i$ to account for the fact that it's a slope term.

*Screenshot taken from Coursera 6:00
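
A minimal sketch of this gradient descent algorithm for the two regression coefficients; the stepsize, tolerance, and toy data are illustrative choices, and the stepsize has to be small enough for the scale of the inputs or the updates will diverge.

```python
import numpy as np

def slr_gradient_descent(x, y, eta=0.01, eps=1e-8, max_iter=100000):
    """Gradient descent on RSS(w0, w1): w <- w - eta * grad RSS(w)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    w0, w1 = 0.0, 0.0
    for t in range(max_iter):
        y_hat = w0 + w1 * x                 # predictions y_hat_i(w0, w1)
        errors = y - y_hat                  # y_i - y_hat_i
        grad_w0 = -2 * np.sum(errors)       # partial of RSS with respect to w0
        grad_w1 = -2 * np.sum(errors * x)   # partial of RSS with respect to w1
        if np.sqrt(grad_w0 ** 2 + grad_w1 ** 2) < eps:
            break
        w0 = w0 - eta * grad_w0             # under-predicting => errors > 0 => w0 increases
        w1 = w1 - eta * grad_w1
    return w0, w1

# Same made-up toy data as the closed-form example
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
print(slr_gradient_descent(x, y))   # -> close to the closed-form answer (0.15, 1.94)
```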

4) Comparing the approaches

Let's take a moment to compare the two approaches that we've gone over, either setting the gradient equal to zero or doing gradient descent.

Well, in the case of minimizing the residual sum of squares, we showed that both were fairly straightforward to do. But for a lot of the machine learning methods that we're interested in, if we take the gradient and set it equal to zero, there's just no closed-form solution to that problem. So often we have to turn to methods like gradient descent.

And likewise, as we're gonna see in the next module, where we turn to having lots of different inputs, lots of different features in our regression: even though there might be a closed-form solution to setting the gradient equal to zero, sometimes in practice it can be much more efficient computationally to implement the gradient descent approach.

*Screenshot taken from Coursera 1:20

6) Discussion and summary of simple linear regression

1) Asymmetric cost functions

Let's discuss the intuition of what happens if we use a different measure of error.

Okay, so this residual sum of squares that we've been looking at is something that's called a symmetric cost function. That's because what we're assuming when we look at this error metric is that overestimating the value of our house has the same cost as underestimating it.

*Screenshot taken from Coursera 2:41

What happens if the costs of these errors are not symmetric?

But what if the cost of listing my house sales price as too high is bigger than the cost if I listed it as too low?

  • If I list the price too high, there will be no offers.
  • If I list the price too low, I still get offers, but not as high as I could have if I had more accurately estimated the value of the house.

So in this case it might be more appropriate to use an asymmetric cost function, where the two types of mistakes are not weighed equally. With an asymmetric cost, I might prefer to underestimate the value rather than overestimate it.

*Screenshot taken from Coursera 3:18

2) A brief recap

*Screenshot taken from Coursera 0:45