The most common form of linear regression can be written as:
$$ y = \theta X $$which tells us that the value of $y$ is determined by a linear combination of the components of $X$. Keep in mind that the vector $X$ contains not only all the features ${x_i}$ but also the bias term $x_0$. In some cases this linear model fits and predicts $y$ well, but unfortunately it often does not behave as we wish. To describe the gap between the observed values and the predicted ones, we define the cost function $J$:
$$J = \frac{1}{m} \sum_{i=1}^{m} (\theta X^{(i)} - y^{(i)})^2 $$The best values for the parameters $\theta$ are those that minimize this cost function.
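As a quick sanity check of this formula, here is a minimal NumPy sketch, assuming the design matrix X already includes the bias column $x_0 = 1$; the helper name linear_cost is just illustrative.
In [ ]:
import numpy as np

def linear_cost(theta, X, y):
    # Mean squared error J = (1/m) * sum((theta . x_i - y_i)^2).
    # X is assumed to already include the bias column x_0 = 1.
    m = len(y)
    residuals = X @ theta - y          # predictions minus targets
    return np.sum(residuals ** 2) / m

# Toy data generated from y = 1 + 2*x, with a bias column prepended
X = np.column_stack([np.ones(5), np.arange(5)])
y = 1 + 2 * np.arange(5)
print(linear_cost(np.array([1.0, 2.0]), X, y))   # 0.0 at the true parameters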
Logistic regression has the common form: $$ y = h_{\theta}(X) $$ $$ h_{\theta}(X) = g(\theta X) $$ where $g$ is the sigmoid function: $$ g(z) = \frac {1} {1+e^{-z}}$$
The cost function $J$ for logistic regression is the cross-entropy: $$J = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)}\log(h_{\theta}(X^{(i)})) - (1-y^{(i)})\log(1-h_{\theta}(X^{(i)})) \right]$$
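Here is a minimal sketch of the sigmoid and this cross-entropy cost, again assuming X includes the bias column; the names sigmoid and logistic_cost are illustrative.
In [ ]:
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    # Cross-entropy cost for logistic regression (no regularization yet).
    h = sigmoid(X @ theta)             # h_theta(X) = g(theta X)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

# Toy binary labels with a bias column
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
print(logistic_cost(np.array([0.0, 1.0]), X, y))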
In [14]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import math

# Sample sin(x) at 100 evenly spaced points on [0, 4*pi].
x = np.linspace(0, 4 * math.pi, 100)
y = np.sin(x)        # vectorized; map(math.sin, x) returns an iterator in Python 3 and cannot be plotted directly
plt.plot(x, y)       # the curve
plt.plot(x, y, 'o')  # the sample points
plt.show()
The main idea of polynomial features is to use more complex polynomial combinations of the inputs in the regression hypothesis, so that higher-order terms are taken into account: $$ h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^2 x_2 + ... $$
Unfortunately, this extra flexibility also introduces problems, most notably overfitting.
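As an illustration, the expansion above can be built by hand in NumPy; the helper name poly_features_2d is hypothetical, and libraries such as scikit-learn (PolynomialFeatures) can generate similar expansions automatically.
In [ ]:
import numpy as np

def poly_features_2d(x1, x2):
    # Hand-rolled polynomial features for two inputs, matching the terms above:
    # bias, x1, x2, x1^2, x2^2, x1*x2, x1^2*x2.
    return np.column_stack([
        np.ones_like(x1),   # the theta_0 (bias) column
        x1, x2,
        x1 ** 2, x2 ** 2,
        x1 * x2,
        x1 ** 2 * x2,
    ])

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([0.5, 1.0, 1.5])
print(poly_features_2d(x1, x2).shape)   # (3, 7): three samples, seven features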
To mitigate overfitting, an L2 regularization term is added, and the cost function is written as: $$J = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)}\log(h_{\theta}(X^{(i)})) - (1-y^{(i)})\log(1-h_{\theta}(X^{(i)})) \right] + \frac{\lambda}{2} \sum_{j=1}^{n} \theta_{j}^2$$
Note that the regularization sum over $\theta$ does NOT include the bias term $\theta_0$.
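A minimal sketch of this regularized cost, assuming the bias column is X[:, 0] so that theta[0] is the bias; the function name is illustrative.
In [ ]:
import numpy as np

def regularized_logistic_cost(theta, X, y, lam):
    # Cross-entropy cost plus an L2 penalty that skips the bias theta_0.
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    cross_entropy = np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))
    penalty = (lam / 2.0) * np.sum(theta[1:] ** 2)   # theta_0 excluded
    return cross_entropy + penalty

X = np.array([[1.0, -1.0], [1.0, 2.0]])
y = np.array([0, 1])
print(regularized_logistic_cost(np.array([0.1, 0.5]), X, y, lam=1.0))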
For samples ${X^{(i)}}$, the mean of feature $j$ over the samples is: $$ \mu_j = \frac{1}{m} \sum_{i=1}^{m} X^{(i)}_j $$
The variance: $$ \sigma^2_j = \frac{1}{m} \sum_{i=1}^{m} (X^{(i)}_j-\mu_j)^2$$ where $\sigma_j$ is the standard deviation.
The mean scaling step then normalizes each feature: $$ X^{(i)}_j \leftarrow \frac {X^{(i)}_j-\mu_j} {\sigma_j} $$
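A minimal sketch of this scaling step, assuming X holds only the raw feature columns (a bias column of ones, if present, should not be scaled); mean_normalize is an illustrative name.
In [ ]:
import numpy as np

def mean_normalize(X):
    # Scale each feature column to zero mean and unit standard deviation.
    mu = X.mean(axis=0)       # per-feature mean mu_j
    sigma = X.std(axis=0)     # per-feature standard deviation sigma_j
    return (X - mu) / sigma, mu, sigma

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_scaled, mu, sigma = mean_normalize(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))   # roughly 0 and 1 per column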
For the cost function with regularization, we have $$J = \frac{1}{m} \sum_{i=1}^{m} (\theta X^{(i)} - y^{(i)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n} \theta_{j}^2 $$ The partial derivative of the cost function with respect to $\theta_j$ is: $$ \frac {\partial J} {\partial \theta_j} = \frac{2}{m} \sum_{i=1}^{m} (\theta X^{(i)} - y^{(i)})X_j^{(i)} + \lambda \theta_j$$ Note that $\theta_0$ is not regularized, so the $\lambda \theta_j$ term is dropped for $j = 0$.
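This partial derivative translates directly into vectorized NumPy; linear_gradient is a hypothetical name, and theta_0 is excluded from the penalty as noted above.
In [ ]:
import numpy as np

def linear_gradient(theta, X, y, lam):
    # Gradient of the regularized squared-error cost:
    # (2/m) * sum_i (theta . x^(i) - y^(i)) * x_j^(i) + lam * theta_j for j >= 1.
    m = len(y)
    residuals = X @ theta - y
    grad = (2.0 / m) * (X.T @ residuals)
    reg = lam * theta
    reg[0] = 0.0              # the bias theta_0 is not regularized
    return grad + reg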
For the gradient descent algorithm, the update rule is: $$\theta_j \leftarrow \theta_j - \alpha \frac {\partial J} {\partial \theta_j}$$ where $\alpha$ is the learning rate.
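Putting the update rule together with the gradient from the previous sketch gives a small batch gradient descent loop; the function name and the toy data are illustrative.
In [ ]:
import numpy as np

def gradient_descent(X, y, lam=0.0, alpha=0.1, n_iters=2000):
    # Repeatedly apply theta_j <- theta_j - alpha * dJ/dtheta_j.
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = (2.0 / m) * (X.T @ (X @ theta - y))   # squared-error part
        grad[1:] += lam * theta[1:]                  # penalty, skipping theta_0
        theta = theta - alpha * grad
    return theta

# Toy fit: recover y = 1 + 2*x from noiseless data (bias column included)
X = np.column_stack([np.ones(20), np.linspace(0, 1, 20)])
y = 1 + 2 * np.linspace(0, 1, 20)
print(gradient_descent(X, y))   # approximately [1, 2]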