In this module, we're gonna discuss how to select amongst a set of features to include in our model. To do this, we're first gonna describe a way to explicitly search over all possible models. And then we're gonna describe a way to implicitly do feature selection using regularized regression, akin to the types of ideas we were talking about when we discussed ridge regression.
So let's start by motivating this feature selection task. The question is, why might you want to select amongst your set of features?
One is for efficiency. So let's say you have a problem which has 100 billion features. That might sound like a lot. It is actually a lot, but there are many applications out there these days where we're faced with this many features. Well, every time we go to do prediction with 100 billion features, that multiplication we have to do between our feature vector and the weights on all these features is really computationally intensive.
In contrast, if we assume that the weights on our features are sparse, and what I mean by that is that there are many zeros, then things can be done much more efficiently. Because when we're going to form our prediction, all we need to do is sum over all of the features whose weights are not zero.
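To make the efficiency point concrete, here's a minimal sketch contrasting a dense prediction with one that exploits sparsity; the function names and setup are just for illustration, not from the course:

```python
import numpy as np

def predict_dense(x, w):
    # Touches all D features, even those whose weight is zero.
    return np.dot(x, w)

def predict_sparse(x, w):
    # Only sums over the features with nonzero weights, which is
    # far cheaper when w is mostly zeros.
    nz = np.flatnonzero(w)
    return np.dot(x[nz], w[nz])
```

If only a handful of the 100 billion weights are nonzero, the second version does a handful of multiplies instead of 100 billion.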
But another reason for wanting to do this feature selection, and the one that perhaps might be more common, at least classically, is interpretability, where we wanna understand which features are relevant to, for example, a prediction task.
Screenshot taken from Coursera 1:00
So, for example, in our housing application, we might have a really long list of possible features associated with every house. This is actually an example of a set of features that were listed for a house that was listed on Zillow.
And there are lots and lots of detailed things, including what roof type the house has, and whether it includes a microwave or not when it's getting sold. And the question is, are all these features really relevant to assessing the value of the house, or at least how much somebody will pay for it? Somebody spending a couple hundred thousand dollars on a house probably wouldn't mind about a microwave; that's probably not factoring in very significantly in making their decision about the value of the house.
So when we're faced with all these features, we might wanna select a subset that are really representative or relevant to our task of predicting the value of a house. So here I've shown perhaps a reasonable subset of features that we might use for assessing the value. And one question we're gonna go through in this module is, how do we think about choosing this subset?
Screenshot taken from Coursera 2:00
And another application we talked about a couple modules ago was this reading your mind task, where we get a scan of your brain, and for our sake, we can just think of this as an image. And then we'd like to predict whether you are happy or sad in response to whatever you were shown. So when we went to take the scan of your brain, you're shown either a word, or an image, or something like this. And we wanna be able to predict how you felt about that just from your brain scan. So we talked about the fact that we treat our inputs, our features, as just the voxels. We can think of them as just pixels in this image, and wanted to relate those pixel intensities to this output, this response of happiness. And in many cases, maybe what we'd like to do is find a small subset of regions in the brain that are relevant to this prediction task. And so here again is a reason for interpretability, that we might wanna do this feature selection task.
Screenshot taken from Coursera 3:00
Okay, so how are we going to go about this feature selection task? Well, one option we have is the obvious choice, which is to search over every possible combination of features we might want to include in our model and look at the performance of each of those models. And that's exactly what the all subsets algorithm does, and we're going to describe this now. The all subsets algorithm starts by considering a model with absolutely no features in it, removing all these features we might have for our house, and asking, what's the performance of that model?
So just to be clear, start with no features and there's still a model for no features. So the model for no features, remember, is just that our observation is simply noise. Okay, so we can assess the performance of this model on our training data, and there is some training error associated with a model with no features. So, we're going to plot that point.
Screenshot taken from Coursera 1:00
And then the next thing we're going to do, is we're going to search over every possible model of just one feature.
So let's say to start with, we consider a model which is number of bedrooms. And here we're gonna plot the training error of model fit just with number of bedrooms as the feature. Then we're gonna say, well, what's the training error of a model fit just with number of bathrooms, square feet, square feet of the lot, and cycle through each one of our possible features.
And at the end of this we can say, out of all models with just one feature, which one fit the training data the best? And in this case it happened to be the model that included square feet living. So we've seen before that that's a very relevant feature. Okay, so we're gonna highlight that this is the best fitting model with only one feature. And we're gonna keep track of this model, and we can discard all the other ones.
Screenshot taken from Coursera 2:00
Then we're gonna go and search over all models with 2 features. So search over all combinations of 2 features. And we're gonna figure out which one of these has the lowest training error, and keep track of that model. And that happens to be a model that has number of bedrooms and number of bathrooms. And that might make sense, because when we're going to search for a property, often someone will say, I want a three bedroom house with two bathrooms. So that might be a reasonable choice for the best model with two features.
Screenshot taken from Coursera 3:00
And I wanna emphasize that the best model with two features doesn't have to contain any of the features that were contained in the best model with one feature. So here, our best model with two features has number of bedrooms and number of bathrooms. Whereas in contrast, our best model with just one feature has square feet living. Okay, so these aren't necessarily nested. So maybe I'll write this explicitly.
Best model of size k need not contain features of best model of size k - 1.
Screenshot taken from Coursera 4:00
And we're gonna continue our procedure, searching over all models with 3 features, all models with 4 features, 5 features, and at some point we're gonna get to a model that has capital D features. That's the model including all of the features, and there is only one such model. So it's just one point here.
Then, what we can do is draw this line connecting the points that represent the set of all best possible models, each with a given number of features. Then the question is, which of these best models with k features do we want to use for our predictions?
Well hopefully it's clear from this course, as well as from this slide, that we don't just wanna choose the model with the lowest training error, because as we know at this point, as we increase model complexity, our training error is gonna go down, and that's what we're seeing in this plot.
Screenshot taken from Coursera 5:00
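As a concrete sketch of the all subsets procedure, assuming a NumPy feature matrix X and target y (this is just an illustration, not the course's implementation), we can enumerate every subset of columns and keep the best fit per subset size:

```python
from itertools import combinations
import numpy as np

def all_subsets(X, y):
    """Return, for each subset size k, the lowest-RSS feature subset."""
    n, d = X.shape
    best = {}  # k -> (rss, tuple of column indices)
    for k in range(d + 1):
        for subset in combinations(range(d), k):
            if k == 0:
                # The "no features" model: predict the training mean.
                resid = y - y.mean()
            else:
                A = X[:, subset]
                w, *_ = np.linalg.lstsq(A, y, rcond=None)
                resid = y - A @ w
            rss = resid @ resid
            if k not in best or rss < best[k][0]:
                best[k] = (rss, subset)
    return best
```

Note the inner loop body runs once per subset, which is exactly the exponential cost discussed below.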
So instead, it's the same type of choice that we've had previously in this course for choosing between models of various complexity. One choice, if you have enough data, is to assess performance on a validation set that's separate from your training and test sets. We also talked about doing cross validation. And in this case there are many other metrics we can look at for how to think about penalizing model complexity. There are criteria like BIC and a long list of other methods that people have for choosing amongst these different models. But we're not going to go through the details of that; for our course we're gonna focus on this notion of error on the validation set.
Screenshot taken from Coursera 6:00
Well, this all subsets algorithm might seem great, or at least pretty straightforward to implement, but a question is, what's the complexity of running all subsets? How many models did we have to evaluate?
Well, clearly we evaluated all models, but let's quantify what that is. So we looked at the model that was just noise. We looked at the model with just the first feature, the second feature, all the way up to the model with the first two features, the second two features, and every possible model up to the full model of all D features.
And what we can do is we can index each one of these models that we searched over by a feature vector. And this feature vector is going to say, so for feature one, feature two, all the way up to feature D, what we're gonna enter is zero if no, that feature is not in the model, and one if yes, that feature is in the model. So it's just gonna be a binary vector indicating which features are present. So, in the case of just noise, or no features we have zeros along this whole vector.
In the case of the model with just the first feature included, we're just gonna have a one in that first feature location, and zeros everywhere else. I guess for consistency, let me index this as feature zero, feature one, all the way up to feature D. Okay, and we're gonna go through this entire set of possible feature vectors. And how many choices are there for the first entry? Two. How many for the second? Two. Two choices for every entry. And how many entries are there? Well, with my new indexing, instead of D there's really D + 1. That's just a little notational choice.
And I did a little back of the envelope calculation for a couple choices of D. So for example, if we had a total of eight different features we were looking over, then we would have to search over 256 models. That actually might be okay. But if we had 30 features, all of a sudden we have to search over a billion or so different models. And if we have 1,000 features, which really is not that many in applications we look at these days, all of a sudden we have to search over $1.07 \times 10^{301}$ models. And for the example I gave with 100 billion features, it's clearly just a huge number. So, what we can see is that in most situations we're faced with these days, it's just computationally prohibitive to do this all subsets search.
Screenshot taken from Coursera 2:00
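A quick check of these back-of-the-envelope counts (the quoted numbers correspond to $2^D$ indicator vectors over the D features):

```python
for D in (8, 30, 1000):
    print(D, float(2 ** D))
# 8    256.0
# 30   1073741824.0             (about 1 billion)
# 1000 1.0715086071862673e+301
```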
So, in cases where you can't do all subsets, what can we do?
Well, one thing we can do, which is gonna be option number two in our series of feature selection algorithms, is to take a suboptimal but tractable approach to selecting amongst features. And that's the set of greedy algorithms that we're gonna consider now. So, one simple algorithm is what's called the forward stepwise algorithm, where you start with some set of possible features, and then you just greedily select the next best feature and keep iterating. So, let's be a little bit more explicit about this.
So, first we're gonna start with the empty set, no features. Or you could start with a list of features that you know you want to be included in the model. So, for example the constant feature which might be your feature, h0. But then what you're gonna do is, you're gonna fit your model with whatever your current set of features are to get an estimate of the weights on those features. Your estimate of your regression coefficients.
And then what you're gonna do is, you're gonna search over all the possible features that you might add, and choose the one that best improves your fit. One measure of this might be training error, so you choose the feature that most reduces your training error. Then you include that in your model, and you recurse this procedure. Where now you have a model with some number of features, and you're gonna search over every possible feature you might wanna add, choose the next one that drops your training error the most, and keep going.
Screenshot taken from Coursera 1:00
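Here's a minimal sketch of forward stepwise selection under the same illustrative NumPy setup as before; names are hypothetical, and in practice you'd stop via validation error rather than running to max_features:

```python
import numpy as np

def forward_stepwise(X, y, max_features=None):
    """Greedily add, at each step, the feature that most reduces RSS."""
    n, d = X.shape
    chosen, remaining = [], set(range(d))
    max_features = d if max_features is None else max_features
    while remaining and len(chosen) < max_features:
        def rss_with(j):
            A = X[:, chosen + [j]]
            w, *_ = np.linalg.lstsq(A, y, rcond=None)
            r = y - A @ w
            return r @ r
        best_j = min(remaining, key=rss_with)  # next best feature
        chosen.append(best_j)
        remaining.remove(best_j)
    return chosen
```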
So, let's visualize this procedure. For this visualization, we're gonna start with the picture that we had in our all subsets search, and we're gonna look at how the greedy search differs from the search we would have done in all subsets. So at the first step, we start with no features in our model, so we're actually gonna be at exactly the same point as we were in all subsets. So, we start at the same training error as all subsets, which is equivalent to no features in the model.
Screenshot taken from Coursera 3:00
And then what we're gonna do is we're gonna search over all our features, and choose the one that reduces training error the most. And which one is that gonna be? It's actually going to be the same one we got for all subsets. Because at that point, when we were doing all subsets and looking over all models of one feature, we had the same choice of choosing the best single feature, even though we're now doing this greedy procedure. When we start with an empty set, our first best feature is gonna be the same best feature that you would get doing all subsets. So again, we stay at the all subsets solution for the best model with one feature. And in that case, remember, just like in all subsets, we choose square feet as the best feature when we have just one feature.
Screenshot taken from Coursera 3:30
But then things can start to diverge. Because when we go to our second step in the greedy algorithm, we're forced to keep square feet in our model. And we're just greedily gonna search over the next feature. That's why it's called greedy.
I should've mentioned this previously: it's called greedy because we're just doing the "greedy" thing of asking what's the next best thing, and not thinking globally about what might be best. Okay. So in this case, maybe the next best thing that we could add to the model, forcing ourselves to keep square feet in our model, is year renovated. So, when we think about what greedy finds with two features, we have square feet living and year renovated.
Screenshot taken from Coursera 4:00
But if we compare to what our all subsets had, it was number of bedrooms and number of bathrooms. So, this is where we see that things can be different. Because once we're forced to keep square feet, maybe the next thing that's most relevant to value, you can have these very large old houses, but those might not be worth as much. It might be the year that the house was renovated that's really relevant to the value. But as we described before, it might be reasonable to think that the best model with just two features was number of bedrooms and number of bathrooms. Because those are really key attributes that people use to value whether a house is a good fit for them.
Screenshot taken from Coursera 4:30
Well, then we continue with this procedure. We already had square feet living, we had year renovated. And now, in this case, our next feature we're gonna add is waterfront, and we're gonna keep going on this process.
Screenshot taken from Coursera 4:30
And when's the next time that our greedy procedure is gonna meet the same solution as all subsets? So, we started at the same point, and we said after the first iteration, we were at the same point. Well, again, we'll be at the same point once we get all the way, obviously, to including all features, because there's only one such model with all features.
So, there are a couple points that I wanna make about this greedy procedure. One is the fact that from iteration to iteration, the training error can never increase. And the reason for that is, even if there's nothing else that you want to add to your model, even if everything else is junk, or perhaps only making things a lot worse in terms of predictions, you can always just set the weight equal to zero on that additional feature. And then you'd have exactly the same model you had with one fewer feature, and so the training error would be exactly the same. So, we know the training error will never increase. And the other thing that we mentioned is that eventually, the training error will be equivalent to that of all subsets when you're using all of your features, and perhaps before that as well.
Screenshot taken from Coursera 6:00
Well, when do we stop this greedy procedure? Just to really hammer this home, I'll just ask the question again.
Do we stop when the training error is low enough? No. Remember that training error will decrease as we keep adding model complexity. So, as we keep adding more and more features, our training error will go down.
What about when test error's low enough? Again, no. Remember that we never touch our test data when we're searching over model complexity, or choosing any kind of tuning parameter.
Again, what we're gonna do is use our validation set or cross validation to choose when to stop this greedy procedure.
Screenshot taken from Coursera 6:30
So now let's talk about the complexity of this forward stepwise algorithm. And contrast it with the all subsets procedure. So how many models did we evaluate here?
Well, in the first step, assuming that we have a total of D features, we searched over D models. Then, in the second step, we had D-1 models to search over, so all remaining features we might think about adding. And then in the third step, we had D-2 models, and so on.
And a question is, how many steps did we take, how many of these and so ons did we have? Well, it depends. It depends on when we chose to stop, but at most we're gonna have D steps, which gets us to the full model. So, the complexity of this is $O(D^2)$. So at most D steps each on the order of looking over D models. And that's gonna be much, much less than $2^D$, which is what we had for all subsets, when D is reasonably large.
Okay, so this procedure is gonna be significantly more computationally efficient or feasible than doing all subsets.
Screenshot taken from Coursera 1:00
Well this was just one of many possible choices you have for greedy algorithms for doing feature selection.
As an example, instead of always starting from an empty model and growing and growing and growing, adding more features, you could do the opposite. You could start from a full model and then choose, at each iteration, which feature to eliminate.
But you could also think about combining these steps. So you could do a forward procedure, where you search over which feature to add, but then every so often you could go back and think about deleting features. Because like we said, you might in the forward process, add a feature that later, once you've added another feature, is no longer relevant.
So, as an example of this, and it's probably not completely true in this application, maybe once you've added number of bedrooms and number of baths, square feet is no longer so relevant, since these other things can act as a proxy for square feet, or something like that. And there are lots and lots of other variants of algorithms people have proposed, with different metrics for when you add or remove features, and different ways to search over the features, because this problem of feature selection is really important. It's not a new problem, and so a lot of different procedures have been proposed to solve it.
Screenshot taken from Coursera 2:00
Well, for our third option for feature selection, we're gonna explore a completely different approach, which is using regularized regression to implicitly perform feature selection for us. And the algorithm we're gonna explore is called lasso. And it's really fundamentally changed the fields of machine learning, statistics, and engineering. It's had a lot of impact in a number of applications. And it's a really interesting approach.
Let's recall regularized regression, in the context of ridge regression first. Where, remember, we were balancing between the fit of our model on our training data and a measure of the magnitude of our coefficients, where we said that smaller magnitudes of coefficients indicated that things were not as overfit as if you had crazy, large magnitudes. And we introduced this tuning parameter, lambda, which balanced between these two competing objectives. So for our measure of fit, we looked at residual sum of squares. And in the case of ridge regression, when we looked at our measure of the magnitude of the coefficients, we used what's called the L2 norm, so this is just the two norm squared in this case, which is the sum of each of our feature weights squared. Okay, this ridge regression penalty, we said, encourages our weights to be small. But one thing I want to emphasize is that it encourages them to be small, but not exactly 0.
Screenshot taken from Coursera 1:00
We can see this if we look at the coefficient path that we described for ridge regression, where we see the magnitude of our coefficients shrinking and shrinking towards 0, as we increase our lambda value. And we said in the limit as lambda goes to infinity, in that limit, the coefficients become exactly 0. But for any finite value of lambda, even a really really large value of lambda, we're still just going to have very, very, very small coefficients but they won't be exactly 0.
Screenshot taken from Coursera 1:30
So why does it matter that they're not exactly 0? Why am I emphasizing so much this concept of the coefficients being 0?
Well, this is this concept of sparsity that we talked about before, where if we have coefficients that are exactly 0, well then, for efficiency of our predictions, that's really important because we can just completely remove all the features where their coefficients are 0 from our prediction operation and just use the other coefficients and the other features. And likewise, for interpretability, if we say that one of the coefficients is exactly 0, what we're saying is that that feature is not in our model. So that is doing our feature selection.
Screenshot taken from Coursera 2:00
So a question, though, is can we use regularization to get at this idea of doing feature selection, instead of what we talked about before? Where before, when we were talking about all subsets or greedy algorithms, we were searching over a discrete set of possible solutions: the solution that included the first and the fifth feature, or the second and the seventh, this entire collection of discrete solutions.
But what we'd like to ask here is whether we can start with, for example, our full model, and then shrink some coefficients not just towards 0, but exactly to 0. Because if we shrink them exactly to 0, then we're knocking those features out of our model. And instead, the non-zero coefficients are going to indicate our selected features.
Screenshot taken from Coursera 3:00
Okay, well maybe we can just take our ridge regression solution, take all the little coefficients, and just say they're 0, just get rid of those. We're gonna call that thresholding them away. We're gonna choose some value where, if the magnitude of the coefficient is below the threshold that we choose, we're just going to say it's not in the model. So, let's explore this idea a little bit. Here I'm showing an illustration, a little cartoon, of what the weights might look like on a set of features in our housing application. And I'm choosing some threshold, which is this dashed black line. And if the magnitude exceeds that threshold, then I'm gonna say that feature's in my model.
Screenshot taken from Coursera 0:30
So here in fuchsia, I'm showing the features that have been selected to be in my model after doing this thresholding of my ridge regression coefficients. Might seem like a reasonable approach, but let's dig into this a little bit more.
Screenshot taken from Coursera 1:00
And in particular, let's look at two very related features. So if you look at this list of features, you see, in green, I've highlighted number of bathrooms and number of showers. These numbers tend to be very, very close to one another. Because lots of bathrooms have showers, and as the number of showers grows, clearly the number of bathrooms grows, because you're very unlikely to have a shower not in a bathroom.
But what's happened here? Well, our model has included nothing having to do with bathrooms, or showers, or anything of this concept. So that doesn't really make a lot of sense. To me it seems like something having to do with how many bathrooms are in the house should be a valuable feature to include when I'm assessing the value of the house. So what's going wrong?
Screenshot taken from Coursera 2:00
Well, let's just, for simplicity's sake, treat the number of showers as exactly equivalent to the number of bathrooms. It might not be exactly equivalent, but they're very strongly related; like I said, for simplicity, let's say they're exactly the same. So if I hadn't included number of showers in my model to begin with, in the full model, then when I did my ridge fit, it would've placed the weight that had been on number of showers on the number of bathrooms.
Because remember, it's a linear model, we're summing over weight times number of bathrooms plus weight times number of showers. So if number of bathrooms equals number of showers, it's equivalent to the sum of those two weights just times number of bathrooms, excluding number of showers from the model.
Screenshot taken from Coursera 3:00
Okay so, the point here is that if I hadn't included this redundant feature, number of showers, what I see now visually is that number of bathrooms would have been included in my selected model doing the thresholding.
So, the issue that I'm getting at here is not specific to the number of bathrooms and number of showers. It's an issue that arises if you have a whole collection, maybe not two but a whole set, of strongly related features. More formally, statistically I will call these strongly correlated features. Then ridge regression is gonna prefer a solution that places a bunch of smaller weights on all the features, rather than one large weight on one of the features.
Because remember, the cost under the ridge regression penalty is the size of each weight squared. And so if you have one really big weight, that's really gonna blow up that L2 penalty term, whereas the fit of the model is gonna be basically about the same whether I distribute the weights over redundant features or put a big weight on just one of them and zeros elsewhere. So what's gonna happen is I'm going to get a bunch of these small weights over the redundant features. And if I think about simply thresholding, I'm gonna discard all of these redundant features, whereas one of them, or potentially the whole set, really was relevant to my prediction task.
So hopefully, it's clear from this illustration that just taking ridge regression and thresholding out these small weights, is not a solution to our feature selection problem. So instead we're left with this question of, can we use regularization to directly optimize for sparsity?
Screenshot taken from Coursera 4:00
Okay, well, in place of our ridge regression objective, what if we took our measure of the magnitude of our coefficients to be what's called the L1 norm, where we sum over the absolute value of each one of our coefficients? We actually described this as a reasonable measure of the magnitude of the coefficients when we were discussing ridge regression last module. Well, the result of this is something that leads to sparse solutions, for reasons that we're gonna go through in the remainder of this module. And this objective is referred to as lasso regression, or L1 regularized regression.
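Written out for our setting, with residual sum of squares as the measure of fit, the lasso objective is:

$$\hat w^{lasso} = \arg\min_w \left[ RSS(w) + \lambda \|w\|_1 \right], \qquad \text{where } \|w\|_1 = \sum_{j=0}^D |w_j|$$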
Screenshot taken from Coursera 0:30
So, just like in ridge regression, lasso is governed by a tuning parameter, lambda, that controls how much we're favoring sparsity of our solutions relative to the fit on our training data.
And so, just to be clear, here we see that when we're doing our feature selection task, we're searching over a continuous space, this space of lambda values. Lambda's governing the sparsity of the solution, and that's in contrast to, for example, the all subsets or greedy approaches, which we described as searching over a discrete set of possible solutions. So, it's really a fundamentally different approach to doing feature selection. Okay, but let's talk about what happens to our solution as we vary lambda. And again, just to emphasize, this lambda is a tuning parameter that in this case is balancing fit and sparsity.
So, as of yet, it's not clear why this L1 norm is leading to sparsity, and we're going to get to that, but let's first just explore this visually.
Screenshot taken from Coursera 2:00
And one way we can see this is from the coefficient path. But first, let's just remember the coefficient path for ridge regression, where we saw that even for a large value of lambda, everything was in our model, just with small coefficients. So, every feature has $\hat w_j \neq 0$, but all the $\hat w_j$ are small, for large values of our tuning parameter lambda.
Screenshot taken from Coursera 4:00
In contrast, when we look at the coefficient path for lasso, we see a very different pattern.
What we see is that at certain critical values of this tuning parameter lambda, certain ones of our features jump out of our model. So, for example, here square feet of the lot disappears from the model; here number of bedrooms drops out, almost simultaneously with number of floors and number of bathrooms, followed by the year the house was built.
And let me just be clear: for, let's say, a value of lambda like this, we have a sparse set of features included in our model. The ones I've circled are the only features in our model, and all the other ones have dropped exactly to zero.
And one thing that we see is that when lambda is very large, like the large value I showed on the previous plot, the only thing in our model is square feet living. And note that square feet living still has a significantly large weight on it when everything else is out of the model, meaning not included in the model. So, square feet living is still very valuable to our predictions, and it would take quite a large lambda value to say that even square feet living is not relevant. Eventually, square feet living would be shrunk exactly to 0, but for a much larger value of lambda.
But if I go back to my ridge regression solution, I see that I had a much smaller weight on square feet living, because I was distributing weights across many other features in the model. So that individual impact of square feet living wasn't as clear.
Screenshot taken from Coursera 6:00
Now let's see why we get sparsity in our lasso solutions. And, to do this, let's interpret the solution geometrically. But first, to set the stage, let's interpret the ridge regression solution geometrically. And, then we'll get to the lasso.
Well, since visualizations are easier in 2D, let's just look at an example where we have two features, h0 and h1, two features for visualization's sake. And what I'm writing in this green box is my ridge objective simplified for having just two features, and in this pink box, I'm showing just the residual sum of squares term. To start with, we're gonna make a contour plot for our residual sum of squares in these two dimensions, w0 by w1. So let's look at this residual sum of squares term, where inside this sum over N observations we're gonna get terms that look like
$$y_i^2 + w_0^2 h_0(x_i)^2 + w_1^2 h_1(x_i)^2 + \text{cross terms} = \text{const}$$
when I finish expanding this square. And if I sum over all my observations, these sums pass through each of these terms. So, if I think of this as a function of w0 and w1, what is this defining? If this is equal to some constant, which is what a contour plot is doing, looking at the objective set equal to different values, well, this is the equation of an ellipse. Because I have my two parameters, w0 and w1, each squared, multiplied by some weighting, plus some other terms in w0 and w1 with no power greater than two, all set equal to a constant. That, by definition, is an ellipse.
Okay, so what I see is that, for my residual sum of squares contour plot, I'm gonna get a series of ellipses. So I'm highlighting here one ellipse.
And what this ellipse is, is $RSS(w_0, w_1) = \text{const}_1$.
This next ellipse is $RSS(w_0, w_1) = \text{const}_2$, where $\text{const}_2 > \text{const}_1$, and so on.
That's what all these curves are: increasing residual sum of squares. And when I walk around one of these curves, it's a level set, a set of points all with equal value. So, if I look at some $(w_0, w_1)$ pair and some other point $(w_0^{\prime}, w_1^{\prime})$, both of these points, here and here, have the same residual sum of squares, which is what I called const 1 before. So as I'm walking around this ellipse, every solution $(w_0, w_1)$ has exactly the same residual sum of squares. Okay, so hopefully what this plot is showing is now clear.
And now let's talk about what happens if I just minimize residual sum of squares. Well, every time I jump from one of these curves to the next curve, all the way in to the smallest curve, just this dot in the middle, I'm getting smaller and smaller residual sum of squares. So this x here marks the minimum over all possible $(w_0, w_1)$ of the residual sum of squares $RSS(w_0, w_1)$, and what is that? That's our least squares solution, so this is $\hat w^{LS}$.
Screenshot taken from Coursera 3:00
Okay, so that would be my solution if I were just minimizing residual sum of squares, but when I'm looking at my ridge objective, there's also this L2 penalty.
This sum of w0 squared + w1 squared. And what does that look like? So I have w0 squared + w1 squared, and when I'm looking at my contour plot, I'm setting this equal to some constant, and changing the value of that constant to get these different contours, the different colors that I'm seeing here. Well, what shape is w0 squared plus w1 squared equal to a constant? That is exactly the equation of a circle.
So we see that the circle is centered about zero, and each one of these curves here, just to be clear, is my two norm of w squared equal to, let's say, constant one. This next circle is the two norm of w squared equal to some constant two, which is greater than constant one, and so on. And again, if I look at any point $(w_0, w_1)$ and some other point $(w_0^{\prime}, w_1^{\prime})$ on the same circle, these things have the same norm of the w vector squared. And that's true for all points around the circle.
So let's say I'm just trying to minimize my two norm. What's the solution to that? Well, I'm going to jump down these contours to my minimum, and my minimum, let me do it directly in red this time, is setting w0 and w1 equal to zero. So this is $\min_{w_0, w_1} (w_0^2 + w_1^2)$, writing the two norm explicitly in this 2D case, and the solution is 0. Okay, so this would be our ridge regression solution if lambda, the weight on this 2 norm, were infinity. We talked about that before.
Screenshot taken from Coursera 6:00
Screenshot taken from Coursera 1:00
Screenshot taken from Coursera 3:00
Screenshot taken from Coursera 1:00
Discussion
When the L1 norm was first introduced I couldn't really understand the difference between it and the L2 norm. I was really focused on both norms having the same behavior as λ→∞. But once I saw the "Visualizing the lasso solution" animation and the diamond contour, I realized that two ellipses can't touch in a way that gives sparsity, but the diamond contour of the L1 penalty can touch the RSS contours at a corner and give you sparsity!
That's all, I really liked the animation and contour plots. Super effective teaching method.
Screenshot taken from Coursera 5:00
Now that we've described our lasso objective and given some intuition for why it leads to sparse solutions, let's turn to how we're going to optimize our objective and actually solve for our lasso solution, for a specific value of lambda, the tuning parameter that's weighing how much we're including this L1 term in our objective.
So, in our machine learning workflow for regression, we're talking about this gray box here, our machine learning algorithm.
Screenshot taken from Coursera 0:30
And let's first remember what we've done for past objectives when we talked about least squares and ridge regression. What we did was, we took the gradient of our total cost and then we either looked at a closed-form solution, setting that gradient equal to zero, or we used the gradient within an iterative procedure called gradient descent.
Screenshot taken from Coursera 0:50
Well here's our lasso objective. And let's think about taking the gradient.
Well, we know how to take the gradient to the residual sum of squares term, but then we get to this L1 objective. And we have some issues. In particular, what's the derivative of the absolute value? Because remember our L1 objective is $\sum_{j=0}^D |w_j|$. So we're gonna have to take derivatives of this absolute value of wj.
And when I think about this absolute value function, plotting $|w_j|$ versus $w_j$, well, what's the derivative here? In particular, at $w_j = 0$ the function has a corner, and the derivative simply isn't defined there.
So instead of thinking about gradients defined by these derivatives, we can talk about something called subgradients. And we're gonna discuss the concept of subgradients in a more advanced, optional video. But just know that they exist, and they're crucial to the derivation of the algorithms we're gonna talk about for the lasso objective. And if you're interested in learning more, stay tuned for our optional advanced video.
But even if you could compute this derivative that we're saying doesn't exist for this absolute value function, there's still no closed-form solution for our lasso objective. So we can't do the closed-form option, but we could do our gradient descent algorithm. But again not using gradients, using subgradients.
Screenshot taken from Coursera 3:00
But instead of gradient descent or subgradient descent, let's take this as an opportunity to teach a really important alternative optimization procedure called coordinate descent. So, let's have a little aside on the coordinate descent algorithm, and then we're gonna describe how to apply coordinate descent to solving our lasso objective.
So, our goal here is to minimize some function g. This is the same objective we have whether we're talking about our closed-form solution, gradient descent, or this coordinate descent algorithm. But let me just be very explicit: we're saying we wanna minimize over all possible w some g(w), where here we're assuming g(w) is a function of multiple variables, let's call it g(w0, w1, ..., wD). And often, minimizing over a large set of variables can be a very challenging problem. But in contrast, it's often possible to optimize just a single dimension, keeping all of the other dimensions fixed, so it's easy for each coordinate when keeping the others fixed, because that turns into just a 1D optimization problem.
And so, that's the motivation behind coordinate descent, where the coordinate descent algorithm is really intuitive. We're just gonna start by initializing our w vector, which we'll denote $\hat w = 0$, or you could use some smarter initialization if you have it. And then, while the algorithm's not converged, we're just gonna pick a coordinate j out of 0, 1, all the way to D. And we're gonna set $\hat w_j$ equal to the value that minimizes the objective along that coordinate. I'm gonna write this as an omega, so I'm gonna search over all possible values of omega that minimize $g(\hat w_0, \ldots, \hat w_{j-1}, \omega, \hat w_{j+1}, \ldots, \hat w_D)$.
Here, these values, everything with a hat on it, are values from previous iterations of this coordinate descent algorithm, and we're just minimizing over the jth coordinate. Okay, so whatever omega value minimizes this objective is gonna be what we set $\hat w_j$ equal to at this iteration, and then we're gonna keep cycling through until convergence.
So, let's look at a little illustration of this in two dimensions, so here it would just be $(w_0, w_1)$. And we're gonna pick a dimension; let's say we choose to optimize the $w_1$ dimension, keeping $w_0$ fixed. So I'm just gonna look at this slice, and ask, what's the minimum along this slice? Well, looking at the contours, the minimum along this dimension occurs here. Then, in this case, I'm just gonna cycle through my coordinates: I'm next gonna keep $w_1$ fixed and optimize over $w_0$, drop down to the minimum, which is here, and then optimize over $w_1$ holding $w_0$ fixed at this value. And I'm gonna keep taking these steps, and if I look at the path of coordinate descent, it's gonna look like the following, where I get these axis-aligned moves.
Screenshot taken from Coursera 3:00
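As a sketch of the generic algorithm with a round robin coordinate choice (the minimize_1d callback is a hypothetical stand-in for whatever 1D solver the problem admits):

```python
import numpy as np

def coordinate_descent(minimize_1d, d, tol=1e-6, max_cycles=1000):
    """Round robin coordinate descent: repeatedly set each coordinate
    to the minimizer of the objective along that coordinate."""
    w = np.zeros(d)  # or a smarter initialization if you have one
    for _ in range(max_cycles):
        max_step = 0.0
        for j in range(d):
            new_wj = minimize_1d(w, j)  # 1D minimizer along coordinate j
            max_step = max(max_step, abs(new_wj - w[j]))
            w[j] = new_wj
        if max_step < tol:  # no coordinate moved much over a full cycle
            break
    return w
```

The stopping rule here previews the convergence criterion discussed later in this module: look at the maximum step taken over a full sweep of the coordinates.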
So how are we gonna pick which coordinate to optimize next? Well there are a couple choices. One is to just choose coordinates randomly. This is called either random or stochastic coordinate descent. Another option is round robin, where I just cycle through, zero, one, two, three, all the way up to D and then cycle through again. And there are other choices you can make as well.
And I want to make a few comments about coordinate descent. The first is, it's really important to note that there is no stepsize that we're using in this algorithm. That's in contrast to gradient descent, where we have to choose our stepsize or stepsize schedule. And hopefully, in the practice that you've had, you've seen the importance of carefully choosing that stepsize. So it's nice not having to choose that parameter.
And this algorithm is really useful in many, many problems. And you can show that this coordinate descent algorithm converges for certain classes of objective functions. One example is if the objective is strongly convex, and we talked about strongly convex functions before. Another example is the lasso, which is very important, and that's why we're talking about using coordinate descent for lasso. We're not gonna go through a proof of convergence in this case, but know that it does indeed converge.
Screenshot taken from Coursera 5:00
Well, our presentation of coordinate descent for lasso is really simplified if we think about normalizing our features, and this is also a practically important concept. So let's take a few minutes to discuss this idea of normalizing features.
Well when we talk about normalizing our features what we're talking about is taking a column of our training data, so very important that we take a column not a row. Okay. So I'll say it once more. We take a column not a row because what does a column represent? Well, for example in our housing application it might represent, all instances of the number of square feet of houses in our training data set. So what we're gonna do is we're gonna do an operation on square feet. That would be one column.
In contrast, if I took a row, we'd be doing an operation for a specific observation, and that operation would be different between different observations, and things would be on different scales between our different observations, and that wouldn't make any sense. So please never do that. Please do this operation on columns of the training data.
What's this operation we're gonna do? Well, we're gonna take our column, our specific feature like square feet, over our N different observations in our training set, and we're going to normalize each of these entries, each of these feature values, by the following quantity. Doing this operation is gonna place each one of our features into the same numeric range, because the range of values we might see for square feet versus number of floors or number of bathrooms is dramatically different. After we do this transformation, though, everything will be in the same numeric range, so it's a practically important transformation.
But I wanna emphasize one thing: when you transform your features, when you normalize them in this way, or any way that you choose to transform your features on your training dataset, you have to do exactly the same operation on your test dataset. Otherwise you're gonna be mixing up apples and oranges. For example, think about transforming your training data from measuring square feet to measuring square meters for every house, but then your test data is still in square feet. When I learn a model on my training data, which is now in square meters, and I go to test it on my test data, which is in square feet, it's gonna give me bogus answers. Okay, so make sure you do the same operation on both training and test. And so, if we're gonna call this normalizer $z_j$, which we'll use this notation later, we're going to divide by exactly the same quantity for each column of our test set. So this same $z_j$ appears here, where, just to emphasize, we're summing over all of our training data points in computing the normalizer, and we're applying it to one of our test data points.
Screenshot taken from Coursera 2:00
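A minimal sketch of this, assuming NumPy feature matrices; the exact normalizer is on a slide not reproduced in these notes, but a choice consistent with the later derivation (where normalized columns satisfy $\sum_{i=1}^N \underline h_j(x_i)^2 = 1$) is each column's Euclidean norm:

```python
import numpy as np

def normalize_features(X_train, X_test):
    # z_j = sqrt(sum_i h_j(x_i)^2), computed on TRAINING columns only,
    # so each normalized training column has unit norm.
    # (Assumes no all-zero columns, which would give z_j = 0.)
    z = np.sqrt((X_train ** 2).sum(axis=0))
    # Apply the SAME z_j to the test columns: never renormalize the
    # test set with its own statistics.
    return X_train / z, X_test / z, z
```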
As a last aside before we get to our coordinate descent algorithm for lasso, let's first describe, or rather derive, our coordinate descent algorithm for just least squares regression, using these normalized features that we've just described.
So here's our least squares objective, which is just minimizing our residual sum of squares. And when we derive our coordinate descent algorithm, remember that we're going to look at a specific variable, $w_j$, hold all other variables fixed, and optimize just with respect to $w_j$.
So what we're going to do is take the partial with respect to $w_j$, and then think about setting that equal to zero to find the optimal value for $w_j$, fixing everything else. So this is our statement of just a 1D optimization, coordinate by coordinate. Okay, so let's write our partial with respect to $w_j$ of our residual sum of squares. We've derived this in various forms multiple times in this course, so I'm not gonna re-derive it; I'm just writing it here, and then we're gonna manipulate it into the form that we need in this specific case.
And before I move on, I wanna highlight this underbar notation ($\underline{h}$), which was introduced when we were talking about our normalized features. So, these are our normalized features; that's what the underbar represents. Everything I'm doing here is in the context of normalized features.
Okay, so we can rewrite this objective. What I wanna do is specifically pull out this $w_j$ term. That's the term I'm solving for; everything else is assumed fixed. So when I'm thinking about my coordinate descent algorithm, I know the values of all the other coordinates.
I guess I should mention what this notation means: $w_{-j}$ means all $w_k$ for $k \neq j$. So that's my shorthand for taking my w vector and removing the jth element $w_j$. Okay, so I'm assuming that $w_{-j}$, those numbers, are known at any given iteration of my algorithm, so I wanna separate out $w_j$, which is my variable, and solve for everything in terms of these other quantities.
\begin{eqnarray} \frac{\partial}{\partial w_j}RSS(w) &=& -2 \sum_{i=1}^N \underline h_j(x_i)(y_i - \sum_{k=0}^D w_k \underline h_k(x_i)) \nonumber \\ &=& -2 \sum_{i=1}^N \underline h_j(x_i)(y_i - \sum_{k \neq j} w_k \underline h_k(x_i) - w_j \underline h_j(x_i)) \nonumber \\ &=& -2 \sum_{i=1}^N \underline h_j(x_i)(y_i - \sum_{k \neq j} w_k \underline h_k(x_i)) + 2w_j \sum_{i=1}^N \underline h_j(x_i)^2 \nonumber \\ &=& -2 \rho_j + 2w_j \nonumber \\ \end{eqnarray}
where in the last step we define $\rho_j = \sum_{i=1}^N \underline h_j(x_i)(y_i - \sum_{k \neq j} w_k \underline h_k(x_i))$ and use the fact that, for normalized features, $\sum_{i=1}^N \underline h_j(x_i)^2 = 1$.
Screenshot taken from Coursera 3:00
So this is my partial with respect to $w_j$ of my residual sum of squares. Like I said, I'm gonna set this partial, $-2 \rho_j + 2w_j$, equal to zero and solve for $w_j$:
\begin{eqnarray} \frac{\partial}{\partial w_j}RSS(w) = -2 \rho_j + 2w_j &=& 0 \nonumber \\ \hat w_j &=& \rho_j \nonumber \\ \end{eqnarray}
So now I've derived my coordinate descent algorithm.
Screenshot taken from Coursera 6:00
And I'm showing this coordinate descent algorithm here. So what we see is we just initialize $\hat w = 0$, and while not converged, we cycle through each one of our coordinates, doing round robin here. We compute this $\rho_j$ (rho j) term for a given coordinate j, and then we simply set $\hat w_j = \rho_j$.
So let's discuss this $\rho_j$ term and give a little intuition behind why we're setting $\hat w_j = \rho_j$. Well, $\rho_j$ is a measure of the correlation between our feature j and the residual between the predicted value, where that prediction is formed excluding the jth feature from the model, and the true observation. So we're looking at our prediction's difference relative to the truth when j is not included in the model, and then we're seeing how correlated that is with the jth feature. And if that correlation is really high, if they tend to agree, then what that means is that the jth feature is important for the model: it can be added into the model and reduce the residuals, or equivalently improve our predictions. So if these things tend to agree, $\rho_j$ is going to be really large, high correlation, and we're gonna set $\hat w_j$ large; we're gonna put weight strongly on this feature being in the model.
But in contrast, if they're uncorrelated, if they don't tend to agree across observations, or if the residuals themselves are already very small because the predictions are already very good, then $\rho_j$ is gonna be small, and as a result we're not gonna put much weight on this feature in the model.
Screenshot taken from Coursera 7:00
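Putting the update together, here's a sketch of the least squares coordinate descent loop, assuming the columns of X are already normalized as above:

```python
import numpy as np

def lstsq_coordinate_descent(X, y, tol=1e-6, max_cycles=1000):
    """Coordinate descent for unregularized least squares with
    normalized features (each column of X has unit norm)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_cycles):
        max_step = 0.0
        for j in range(d):
            # Residuals of the prediction that leaves feature j out.
            resid_without_j = y - X @ w + X[:, j] * w[j]
            rho_j = X[:, j] @ resid_without_j
            max_step = max(max_step, abs(rho_j - w[j]))
            w[j] = rho_j  # for normalized features, w_j = rho_j
        if max_step < tol:
            break
    return w
```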
So, now we're gonna describe what the variant of this coordinate descent algorithm looks like in the case of lasso.
Again, we're gonna be looking at these normalized features. And just remember this is where we left off with our coordinate descent algorithm for just least squares un-regularized regression. And remember the key point was that we set $\hat w_j = \rho_j$. This correlation between our features and the residuals from a prediction, leaving j out of the model.
Screenshot taken from Coursera 0:30
Well, in the case of lasso, how we set $\hat w_j$ is gonna depend on the value of our tuning parameter lambda, and how that relates to this $\rho_j$ correlation term.
So, in particular, if $\rho_j$ is small, if it's in this $[-\frac{\lambda}{2},\frac{\lambda}{2}]$ range, where again what small means is determined by lambda, then we're gonna set $\hat w_j$ exactly equal to zero. And here we see the sparsity of our solutions coming out directly.
But in contrast, if $\rho_j$ is really large, or on the flip side very negative, what that means is that the correlation is either strongly positive or strongly negative. Then we're gonna include that feature in the model, just like we did in our least squares solution, but relative to our least squares solution, we're gonna decrease the weight.
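Putting the three cases together, the lasso coordinate update for normalized features is the following; this matches the $\rho_j + \frac{\lambda}{2}$, $\rho_j - \frac{\lambda}{2}$, or zero rule referenced at the end of this module:

$$\hat w_j = \begin{cases} \rho_j + \frac{\lambda}{2} & \text{if } \rho_j < -\frac{\lambda}{2} \\ 0 & \text{if } -\frac{\lambda}{2} \le \rho_j \le \frac{\lambda}{2} \\ \rho_j - \frac{\lambda}{2} & \text{if } \rho_j > \frac{\lambda}{2} \end{cases}$$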
So let's look at this function of how we're setting $\hat w_j$ visually.
Screenshot taken from Coursera 1:30
Okay, well this operation that we are performing here in these lasso updates is something called soft thresholding. And so, let's just visualize this.
And to do this we're gonna make a plot of $\rho_j$, that correlation we've been talking about, versus $\hat w_j$, the coefficient that we're setting. And remember, in the least squares solution, we set $\hat w_j = \rho_j$. We can see that as just setting lambda equal to zero; remember, lambda equals zero returns us to our least squares solution, so I'll specifically write least squares there. So that's why we get this line y = x, this green line appearing here. So this represents, as a function of $\rho_j$, how we would set $\hat w_j$ for least squares.
And in contrast, this fuchsia line we're showing here is for lasso. And what we see is that in the range $[-\frac{\lambda}{2},\frac{\lambda}{2}]$, if this correlation is within this range, meaning that there's not much of a relationship between our feature and the residuals from predictions without feature j in our model, we're just gonna completely eliminate that feature; we're gonna set its weight exactly equal to 0. But if we're outside that range, we're still gonna include the feature in the model, but we're gonna shrink the weight on that feature relative to the least squares solution by an amount lambda over 2. So this is why it's called soft thresholding: we're shrinking the solution everywhere, but we're strictly driving it to zero from minus lambda over 2 to lambda over 2.
And I just want to contrast with the ridge regression solution, where you can show, which we're not going to do here, that ridge regression shrinks the coefficients everywhere, but never strictly to zero. So this line is $\hat w^{ridge}$, and let me just write that this one is $\hat w^{lasso}$. Okay, so here we get a very clear visualization of the difference between least squares, ridge, and lasso.
Screenshot taken from Coursera 4:00
Well, in our coordinate descent algorithm for lasso, and actually in all of the coordinate descent algorithms that we've presented, we have this line that says while not converged. And the question is, how are we assessing convergence?
Screenshot taken from Coursera 0:10
Well, when should I stop in coordinate descent?
In gradient descent, remember we're looking at the magnitude of that gradient vector. And stopping when the magnitude of the vector was below some tolerance epsilon.
Well, here we don't have these gradients we're computing, so we have to do something else. One thing we know, though, is that for convex objectives, the steps that we're taking as we're going through this algorithm are gonna become smaller and smaller as we're moving towards our optimum; at least for strongly convex functions, we know that we're converging to our optimal solution. And so one thing that we can do is measure the size of the steps that we're taking through a full cycle of our different coordinates.
Because I wanna emphasize, we have to cycle through all of our coordinates, 0 to D, before judging whether to stop. It's possible that one coordinate or a few coordinates might have small steps, but then you get to another coordinate and still take a large step. But if, over an entire sweep of all coordinates, the maximum step that you take is less than your tolerance epsilon, then that's one way you can assess that your algorithm's converged.
Screenshot taken from Coursera 1:30
I also wanna mention that this coordinate descent algorithm is just one of many possible ways of solving the lasso objective. Classically, lasso was solved using what's called LARS (least angle regression and shrinkage), and that was popular up until roughly 2008, when an older algorithm was kinda rediscovered and popularized: this coordinate descent approach for lasso.
But more recently there's been a lot of activity in the area of coming up with efficient parallelized and distributed implementations of lasso solvers. These include a parallel version of coordinate descent, and other parallel learning approaches like parallel stochastic gradient descent, or this kind of distribute-and-average approach that's fairly popular as well. And one of the most popular approaches specifically for lasso is something called the alternating direction method of multipliers, or ADMM, which has been really popular within the community of people using lasso.
Screenshot taken from Coursera 2:30
So finally, I just wanted to present the coordinate descent algorithm for lasso if you don't normalize your features. So this is the most generic form of the algorithm, because of course it applies to normalized features as well.
But let's just remember our algorithm for our normalized features. So, here it is now.
Screenshot taken from Coursera 0:20
And relative to this, the only changes we need to make are what's highlighted in these green boxes. What we see is that we need to precompute, for each one of our features, this term $z_j$, and that's exactly equivalent to the normalizer we described when we normalized our features. So even if you don't normalize, you still have to compute this normalizer.
But we're gonna use it in a different way as we go through this algorithm. When we go to compute $\rho_j$, we're looking at our unnormalized features. And when we're forming our predictions $\hat y_i$, our prediction for the ith observation, again that prediction is using unnormalized features. So there are two places in the $\rho_j$ computation where you would need to change things for unnormalized features. And then finally, when we're setting $\hat w_j$ according to the soft thresholding rule, instead of just looking at $\rho_j + \frac{\lambda}{2}$, or $\rho_j - \frac{\lambda}{2}$, or zero, we're gonna divide each of these terms by $z_j$, this normalizer.
Okay, so you see that it's fairly straightforward to implement this for unnormalized features, but the intuition we provided was much clearer for the case of normalized features.
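Putting those green-box changes together, here's a hedged sketch of the full algorithm in Python. `H` is the feature matrix with one row per observation, and I'm assuming, as is common practice but not stated on this slide, that column 0 of `H` is the constant feature whose weight is left unpenalized:

```python
import numpy as np

def lasso_coordinate_descent(H, y, lam, tol=1e-6):
    """Sketch of coordinate descent for lasso with unnormalized features.
    H is the N x D feature matrix; assumption: column 0 is the constant
    feature and its weight (the intercept) is not penalized."""
    N, D = H.shape
    w = np.zeros(D)
    z = (H ** 2).sum(axis=0)       # precompute the normalizer z_j per feature
    while True:
        max_step = 0.0
        for j in range(D):
            # rho_j: correlation of feature j with the residual from a
            # model that excludes feature j (all features unnormalized).
            resid_without_j = y - H @ w + w[j] * H[:, j]
            rho_j = H[:, j] @ resid_without_j
            old_wj = w[j]
            if j == 0:                              # unpenalized intercept
                w[j] = rho_j / z[j]
            elif rho_j < -lam / 2.0:
                w[j] = (rho_j + lam / 2.0) / z[j]   # note the divide by z_j
            elif rho_j > lam / 2.0:
                w[j] = (rho_j - lam / 2.0) / z[j]
            else:
                w[j] = 0.0
            max_step = max(max_step, abs(w[j] - old_wj))
        if max_step < tol:                          # converged after full sweep
            return w
```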
Screenshot taken from Coursera 1:00
Up to this point, we've derived our coordinate descent algorithm for least squares, and we've discussed what the variant is for lasso. But we didn't explicitly derive why our coordinate descent algorithm for lasso has that soft thresholding form. And for completeness, we're going to include the derivation, but as an optional video. And I really want to emphasize that this video is optional. The material is quite advanced; it's actually material that's typically taught at the level of a Ph.D. or grad-level course. So watch it at your own risk. We think it's still very interesting if you want to see what's under the hood in the lasso solver. But that's our little word of warning before we delve into this content, because again, this is an example where what we're gonna present in this video goes much more technical than what we're assuming for the rest of this specialization.
So, let's get into optimizing our lasso objective, which we've written here. And we're doing it again one coordinate at a time, to derive our coordinate descent algorithm. In this derivation we're gonna use unnormalized features, so we're not doing the normalization; this is just our typical h that's appearing here. And we're doing that because that's the most general case we can derive, and the normalized case follows directly from this derivation.
Screenshot taken from Coursera 1:00
Okay, so to start with, let's look at the partial of our residual sum of squares term with respect to $w_j$. We did this derivation when we derived coordinate descent for least squares, but we did it there for normalized features, so let's do it now for unnormalized features.
When we were talking about normalized features, we said $\sum_{i=1}^N h_j(x_i)^2 = 1$. But now we're looking at unnormalized features, so now $\sum_{i=1}^N h_j(x_i)^2 = z_j$.
So the partial of our RSS term with respect to $w_j$ will be: $$-2 \rho_j + 2w_jz_j$$
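For reference, here's a reconstruction of the intermediate step in one display, assuming the same $\rho_j$ definition used earlier in the module; the notation $\hat y_i(w_{-j})$, for the prediction formed without feature $j$'s contribution, is mine:

$$\frac{\partial}{\partial w_j}\text{RSS}(w) = -2\sum_{i=1}^N h_j(x_i)\Big(y_i - \hat y_i(w_{-j}) - w_j\,h_j(x_i)\Big) = -2\underbrace{\sum_{i=1}^N h_j(x_i)\big(y_i - \hat y_i(w_{-j})\big)}_{\rho_j} + 2\,w_j \underbrace{\sum_{i=1}^N h_j(x_i)^2}_{z_j}$$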
Screenshot taken from Coursera 4:00
Now, let's turn to our L1 penalty, and this is where things become more complicated. In particular, what's the partial with respect to $w_j$ of the absolute value of $w_j$? Remember, all the other terms we're thinking of as held fixed; we're just looking at this $w_j$ component. Well, this is where these derivatives of absolute values become problematic. Remember, here we have the absolute value function, which has a kink at $w_j = 0$ where the derivative isn't defined.
And we mentioned that instead of thinking about gradients or these partials, we think about subgradients.
Screenshot taken from Coursera 4:30
Now let's talk about these subgradients and formalize this a bit.
But let's first mention that we can think of gradients as lower bounding convex functions. So, if I look at my convex function at some points a and b, and I take the gradient at my point a, that gives this tangent plane, the gradient of g at a, $\nabla g(a)$. Then we have that g(b) is greater than or equal to g(a) plus the gradient of g at a times (b minus a): $$g(b) \geq g(a) + \nabla g(a)(b - a)$$
And importantly, gradients are unique at x if the function is differentiable at x.
While subgradients generalize this idea of gradients to non-differentiable points. So a subgradient is really a set: it's the set of slopes of all the planes that lower bound the function at that point. And we're gonna say that v is in our set, which we denote by this curly d, $\partial g(x)$, of g at a point x.
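Stated formally (this display is my restatement, since the slide itself isn't reproduced here): for a convex function $g$,

$$v \in \partial g(x) \quad\Longleftrightarrow\quad g(y) \,\geq\, g(x) + v\,(y - x) \;\text{ for all } y$$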
So, this is the definition of a subgradient. And let's look at it in the context of this absolute value function.
What are all the planes that lower bound this absolute value function? Those are all the planes drawn as dashed lines. We know that the slopes of the absolute value function are -1 for $w_j < 0$ and +1 for $w_j > 0$.
And we see that any line through the origin with slope in the range -1 to 1 is gonna lower bound this absolute value function.
Screenshot taken from Coursera 8:30
So now that we've had our detour into the world of subgradients, we can go back: instead of the partial of this L1 term, we can talk about the subgradient of this L1 term, where here we get lambda times the subgradient of the absolute value.
We already know that the subgradient of the absolute value at zero is the range $[-1,1]$. So lambda times the subgradient of the absolute value is going to be $[-\lambda, \lambda]$ when $w_j = 0$. Remember, the subgradient is defined at that problem point $w_j = 0$; but in the case where the derivative exists, that partial is just going to be $-\lambda$ for $w_j < 0$ and $\lambda$ for $w_j > 0$.
Okay, so this is our complete lambda times the subgradient of our L1 term with respect to $w_j$: $$\lambda\, \partial_{w_j} |w_j| = \begin{cases} -\lambda & w_j < 0 \\ [-\lambda, \lambda] & w_j = 0 \\ \lambda & w_j > 0 \end{cases}$$
Screenshot taken from Coursera 10:00
And now that we have the partial of our residual sum of squares term and the subgradient of our L1 term, we can put them together and get the subgradient of our entire lasso cost with respect to $w_j$.
And when we put these things together, we get three different cases, because of the three cases for the L1 term: $$\partial_{w_j}\, \text{lasso cost} = \begin{cases} 2z_j w_j - 2\rho_j - \lambda & w_j < 0 \\ [-2\rho_j - \lambda,\; -2\rho_j + \lambda] & w_j = 0 \\ 2z_j w_j - 2\rho_j + \lambda & w_j > 0 \end{cases}$$
Screenshot taken from Coursera 11:00
Let's pop up a level. Remember, before, we would take the gradient and set it equal to 0, or, in the coordinate descent algorithm, we talked about taking the partial with respect to one coordinate and setting that equal to zero to get the update for that dimension.
Here, instead of this partial derivative, we're taking the subgradient, and we have to find where zero is contained in the subgradient to get the update for this specific jth coordinate. So let's do that: we take our subgradient and set it equal to 0, and again we have three different cases to consider.
Screenshot taken from Coursera 14:00
This slide just summarizes what I just said. So this is our optimal 1-D optimization for the lasso objective: $$\hat w_j = \begin{cases} \frac{\rho_j + \lambda/2}{z_j} & \text{if } \rho_j < -\frac{\lambda}{2} \\ 0 & \text{if } \rho_j \in [-\frac{\lambda}{2}, \frac{\lambda}{2}] \\ \frac{\rho_j - \lambda/2}{z_j} & \text{if } \rho_j > \frac{\lambda}{2} \end{cases}$$
Screenshot taken from Coursera 16:00
Let's talk about this more general form of the soft thresholding rule for lasso in the case of our unnormalized features.
So, remember, for our normalized features, there wasn't this $z_j$ here, and what we ended up with for our least squares solution when $\lambda = 0$ was just this line, $\hat w_j = \rho_j$.
But now what does our least squares line look like? Well again, we can just set $\lambda = 0$, and we see that this least squares line $\hat w_j^{LS} = \frac{\rho_j}{z_j}$.
And then, when I look at my lasso solution, $\hat w^{lasso}$: in the range $[-\frac{\lambda}{2}, \frac{\lambda}{2}]$, I get this thresholding of the coefficients exactly to zero, relative to my least squares solution. And outside that range, the difference between the least squares solution and my lasso solution is that my coefficients are each shrunk by an amount $\frac{\lambda}{2z_j}$.
But remember that $\rho_j$ here, as compared to when we talked about normalized features, was defined differently: it was defined in terms of our unnormalized features. So for the same value of lambda that you would use with normalized features, you're getting a different relationship here, a different range where things are set to zero.
Screenshot taken from Coursera 18:00
In summary, we've derived this coordinate descent algorithm for lasso in the case of unnormalized features. In particular, the key insight we had was that instead of just taking the partial of our objective with respect to $w_j$, we had to take the subgradient of our objective with respect to $w_j$, and that's what leads to these three different cases, because the gradient itself is defined for every value of $w_j$ except for this one critical point, $w_j = 0$. But we also got a lot of insight into how this soft thresholding gives us the sparsity in our lasso solutions.
Screenshot taken from Coursera 19:00
So we've gone through the coordinate descent algorithm for solving our lasso objective for a specific value of lambda, and that begs the question: how do we choose the lambda tuning parameter value?
Well, it's exactly the same as in ridge regression. If we have enough data, we can think about holding out a validation set and using that to choose amongst these different model complexities, lambda.
Screenshot taken from Coursera 0:30
Or if we don't have enough data, we talked about doing cross validation.
So these are two very reasonable options for choosing this tuning parameter lambda.
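In practice you rarely code this sweep yourself. As a hedged sketch, scikit-learn's `LassoCV` does a cross-validated search over penalty strengths (which it calls `alpha` rather than lambda); the data here is synthetic, purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic stand-in data: 200 observations, 10 features, only the
# first feature actually matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + rng.normal(size=200)

# 5-fold cross validation over a grid of penalty strengths.
model = LassoCV(cv=5).fit(X, y)
print("chosen penalty:", model.alpha_)
print("selected features:", np.flatnonzero(model.coef_))
```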
Screenshot taken from Coursera 0:40
But in the case of lasso, I just want to mention that using these types of procedures, assessing the error on a validation set or doing cross validation, chooses the lambda that provides the best predictive accuracy.
But what that ends up tending to do is choose a lambda value that's a bit smaller than might be optimal for doing model selection, because for predictive accuracy, having slightly less sparse solutions can actually lead to a little bit better predictions on any finite data set than the true model with the sparsest possible set of features.
So instead, there are other ways you can choose this tuning parameter lambda, and I'll just refer you to other texts, like the textbook by Kevin Murphy, Machine Learning: A Probabilistic Perspective, for further discussion of this issue.
Screenshot taken from Coursera 1:00
So let's just conclude by discussing a few practical issues with lasso.
The first is the fact that, as we've seen in multiple different ways throughout this module, lasso shrinks the coefficients relative to the least squares solution. So what it's doing is increasing the bias of the solution in exchange for lower variance. It's doing this automatic bias-variance tradeoff, but we might still want a low-bias solution, so we can think about reducing the bias of our solution in the following way.
This is called debiasing the lasso solution. We run our lasso solver and get out a set of selected features, those features whose weights were not set exactly to zero. Then we take that reduced model, the model with just these selected features, and run standard least squares regression on it. In this case, the features that were deemed relevant to our task will, after this debiasing procedure, have weights that are not shrunk relative to the weights of the least squares solution we would have gotten had we started with exactly that reduced model. But of course, that was the whole point: we didn't know which model, so the lasso is allowing us to choose the model, and then we just run least squares on that model.
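Here's one way that two-stage procedure might look, sketched with scikit-learn; the function name and penalty value are mine, for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def debiased_lasso(X, y, penalty):
    """Stage 1: use lasso only to *select* features.
    Stage 2: refit ordinary least squares on the reduced model,
    so the surviving weights are not shrunk."""
    lasso = Lasso(alpha=penalty).fit(X, y)
    selected = np.flatnonzero(lasso.coef_)   # features with nonzero weight
    if selected.size == 0:
        return selected, None                # lasso eliminated everything
    ols = LinearRegression().fit(X[:, selected], y)
    return selected, ols
```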
These plots show a little illustration of the benefits of debiasing. The top figure shows the true coefficients for the data, which were generated with 4,096 different coefficients, or different features, in the model, but only 160 of these had positive coefficients associated with them. So it's a very sparse setup. If you look at the L1 reconstruction, the second row of this plot, you see that it discovered 1,024 features with nonzero weights and has a mean squared error of 0.0072. But if you take those 1,024 nonzero-weight features and just run least squares regression on them, you get the third row, which has significantly lower mean squared error. In contrast, if you were to run least squares on the full model with all 4,096 features, you would get a really poor estimate of what's going on and a very large mean squared error. So this shows the importance of doing both lasso and possibly this debiasing on top of that.
Screenshot taken from Coursera 4:00
Another issue with lasso is that if you have a collection of strongly correlated features, lasso will tend to select amongst them pretty much arbitrarily. What I mean is that a small tweak in the data might lead to one variable being included, whereas a different tweak of the data would have a different one of these variables included. So, in our housing application, you could imagine that square feet and lot size are very correlated, and we might just arbitrarily choose between them, but in a lot of cases you actually wanna include the whole set of correlated variables.
And another issue is the fact that it's been shown empirically that, in many cases, ridge regression actually outperforms lasso in terms of predictive performance. So there are variants of lasso, something called elastic net, that try to address this set of issues. What elastic net does is fuse the objectives of ridge and lasso, including both an L1 and an L2 penalty. And you can see this paper for further discussion of these and other issues with the original lasso objective, and how elastic net addresses them.
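For completeness, a sketch of what that combined penalty looks like with scikit-learn's `ElasticNet`, where `l1_ratio` interpolates between pure ridge (0) and pure lasso (1); the parameter values here are placeholders, not recommendations:

```python
from sklearn.linear_model import ElasticNet

# alpha sets the overall penalty strength; l1_ratio mixes the
# L1 (lasso) and L2 (ridge) penalties.
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
# model.fit(X, y) would then tend to keep groups of correlated
# features together, rather than picking one arbitrarily.
```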
Screenshot taken from Coursera 5:00
In summary, we've really learned a lot in this module. We've thought about this task of feature selection, and we first described a way of searching over all possible sets of features that we might want to include to come up with the best model. We talked about the computational challenges of that, then turned to greedy algorithms, and then discussed this regularized regression approach, lasso, for addressing the feature selection task. So we've covered a lot of ground, and these are really important concepts in machine learning. And this lasso regularized regression approach, although really simple, has dramatically transformed the fields of machine learning, statistics, and engineering. It's shown its utility in a variety of different applied domains.
But I wanna mention a really important issue, which we kind of alluded to: for feature selection, not just with lasso but in general, you have to be really careful about interpreting the features selected. Some reasons for this include the fact that the features you selected are always just in the context of what you provided as the set of possible features to choose from to begin with. Likewise, the set of selected features is really sensitive to correlations between features, and in those cases, small changes in the data can lead to different features being included. So you have to be careful with statements like "this feature is important and that one isn't." And of course, the set of selected features also depends on which algorithm you use; we especially saw this when we talked about those greedy algorithms, like the forward stepwise procedure. But I did want to mention that there are some nice theoretical guarantees for lasso under very specific conditions.
Screenshot taken from Coursera 1:00