Linear models

Notation:

  • $(x,y)$ is one training example, and $(x^{(i)}, y^{(i)})$ refers to the ith training example.

  • Hypothesis function for univariate data $h_\mathbf{w}(x) = w_0 + w_1 x$ with parameters $\mathbf{w}=(w_0, w_1)$

  • Hypothesis function for the multivariate case $h_\mathbf{w}(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_d x_d$ with input features $\mathbf{x} = (x_1, x_2, ..., x_d)$ and parameters $\mathbf{w} = (w_0, w_1, w_2, ..., w_d)$ (a code sketch follows this list)
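
As a minimal sketch (the function name and array shapes are my own assumptions, not part of the notes), the multivariate hypothesis can be evaluated for a whole dataset at once:

    import numpy as np

    def hypothesis(w, X):
        """h_w(x) = w_0 + w_1 x_1 + ... + w_d x_d evaluated for every row of X.

        w : parameter vector of shape (d + 1,), intercept w_0 first.
        X : feature matrix of shape (n, d), one training example per row.
        """
        return w[0] + X @ w[1:]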

Solving for the parameters $\mathbf{w}$

Given $n$ data points, the goal is to minimize the cost given below:

$$ \begin{align} J(\mathbf{w}) & = \frac{1}{2 n} \displaystyle \sum_{i=1}^{n}\left(h_\mathbf{w}(x^{(i)}) - y^{(i)}\right)^2 \\ & = \frac{1}{2 n} \displaystyle \sum_{i=1}^{n}\left(w_0 + w_1 x^{(i)} - y^{(i)}\right)^2 \end{align} $$
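
Translated directly into code, the univariate cost might look like the following sketch (names are illustrative assumptions):

    import numpy as np

    def cost(w, x, y):
        """J(w) = 1/(2n) * sum_i (w_0 + w_1 * x_i - y_i)^2 (univariate case)."""
        n = len(y)
        residuals = w[0] + w[1] * x - y           # h_w(x^(i)) - y^(i) for all i
        return (residuals @ residuals) / (2 * n)  # sum of squares, scaled by 1/(2n)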

Solving with Gradient Descent

To minimize $J(\mathbf{w})$, move in the direction opposite to the gradient (the vector of first partial derivatives) of the cost function:

$$\nabla J(\mathbf{w}) = \left(\frac{\partial J(\mathbf{w})}{\partial w_0}, \frac{\partial J(\mathbf{w})}{\partial w_1} \right)$$

where the partial derivatives are given as follows:

$$\begin{align} \frac{\partial J(\mathbf{w})}{\partial w_0} & = \frac{1}{2 n} \displaystyle \sum_{i=1}^{n} 2\left(w_0 + w_1 x^{(i)} - y^{(i)}\right) = \frac{1}{n} \displaystyle \sum_{i=1}^{n} \left(w_0 + w_1 x^{(i)} - y^{(i)}\right) \\ \frac{\partial J(\mathbf{w})}{\partial w_1} & = \frac{1}{2 n} \displaystyle \sum_{i=1}^{n} 2\, x^{(i)} \left(w_0 + w_1 x^{(i)} - y^{(i)}\right) = \frac{1}{n} \displaystyle \sum_{i=1}^{n} x^{(i)} \left(w_0 + w_1 x^{(i)} - y^{(i)}\right) \end{align}$$
  • Randomly initialize the parameters $\mathbf{w}$

  • Repeat until convergence: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \, \nabla J(\mathbf{w})$, where $\alpha$ is the learning rate (see the sketch after this list)
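
A minimal univariate gradient-descent sketch based on the partial derivatives above (the learning rate, iteration count, and function name are my own assumptions):

    import numpy as np

    def gradient_descent(x, y, alpha=0.1, num_iters=5000):
        """Fit w_0, w_1 by repeatedly stepping against the gradient of J(w)."""
        n = len(y)
        w0, w1 = np.random.randn(2)            # random initialization
        for _ in range(num_iters):
            err = w0 + w1 * x - y              # h_w(x^(i)) - y^(i) for all i
            grad_w0 = err.sum() / n            # dJ/dw0
            grad_w1 = (x * err).sum() / n      # dJ/dw1
            w0 -= alpha * grad_w0              # step opposite to the gradient
            w1 -= alpha * grad_w1
        return w0, w1

For example, on noisy data drawn from $y \approx 2 + 3x$ the returned parameters should approach $w_0 \approx 2$, $w_1 \approx 3$.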

Regularization

  • Ridge: adds an L2 penalty $\lambda \sum_{j=1}^{d} w_j^2$ to the cost, shrinking the weights toward zero

  • Lasso: adds an L1 penalty $\lambda \sum_{j=1}^{d} |w_j|$ to the cost, which can drive some weights exactly to zero (see the sketch after this list)
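
As a hedged sketch, both penalties can be tried with scikit-learn's Ridge and Lasso estimators (the toy data and alpha values below are arbitrary choices, not from the notes):

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso

    # Toy data: y depends only on the first of five features.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=100)

    ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all weights
    lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: zeroes out irrelevant weights
    print(ridge.coef_)                   # all five weights small but non-zero
    print(lasso.coef_)                   # weights for the noise features near or at zero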

