The most common form of linear regression can be written as:
$$ y = \theta X $$which tells us that the value of $y$ is determined by a linear combination of the components of $X$. Keep in mind that the vector $X$ contains not only all the features ${x_i}$ but also the bias term $x_0$. In some cases this linear model fits and predicts $y$ well, but unfortunately it often does not behave as we wish. To describe the gap between the observed values and the predicted ones, we define the cost function $J$:
$$J = \frac{1}{m} \sum_{i=1}^{m} (\theta X^{(i)} - y^{(i)})^2 $$The best values for the parameters $\theta$ are those that minimize this cost function.
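As a quick sanity check of this formula, here is a minimal NumPy sketch, assuming the design matrix X already includes the bias column $x_0 = 1$; the helper name linear_cost is just illustrative.
In [ ]:
import numpy as np

def linear_cost(theta, X, y):
    # Mean squared error J = (1/m) * sum((theta . x_i - y_i)^2).
    # X is assumed to already include the bias column x_0 = 1.
    m = len(y)
    residuals = X @ theta - y          # predictions minus targets
    return np.sum(residuals ** 2) / m

# Toy data generated from y = 1 + 2*x, with a bias column prepended
X = np.column_stack([np.ones(5), np.arange(5)])
y = 1 + 2 * np.arange(5)
print(linear_cost(np.array([1.0, 2.0]), X, y))   # 0.0 at the true parameters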
Logistic regression has the common form: $$ y = h_{\theta}(X) $$ $$ h_{\theta}(X) = g(\theta X) $$ where $g$ is the sigmoid function: $$ g(z) = \frac {1} {1+e^{-z}}$$
The cost function $J$ for logistic regression is the cross-entropy: $$J = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)}\log(h_{\theta}(X^{(i)})) - (1-y^{(i)})\log(1-h_{\theta}(X^{(i)})) \right]$$
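Here is a minimal sketch of the sigmoid and this cross-entropy cost, again assuming X includes the bias column; the names sigmoid and logistic_cost are illustrative.
In [ ]:
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    # Cross-entropy cost for logistic regression (no regularization yet).
    h = sigmoid(X @ theta)             # h_theta(X) = g(theta X)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

# Toy binary labels with a bias column
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
print(logistic_cost(np.array([0.0, 1.0]), X, y))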
In [14]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import math

# Sample sin(x) at 100 evenly spaced points on [0, 4*pi].
x = np.linspace(0, 4 * math.pi, 100)
y = np.sin(x)        # vectorized; map(math.sin, x) returns an iterator in Python 3 and cannot be plotted directly
plt.plot(x, y)       # the curve
plt.plot(x, y, 'o')  # the sample points
plt.show()
The main idea of polynomial features is to use more complex polynomial combinations of the inputs in the regression hypothesis, so that higher-order terms are taken into account: $$ h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^2 x_2 + ... $$
Unfortunately, this extra flexibility also introduces problems, most notably overfitting.
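As an illustration, the expansion above can be built by hand in NumPy; the helper name poly_features_2d is hypothetical, and libraries such as scikit-learn (PolynomialFeatures) can generate similar expansions automatically.
In [ ]:
import numpy as np

def poly_features_2d(x1, x2):
    # Hand-rolled polynomial features for two inputs, matching the terms above:
    # bias, x1, x2, x1^2, x2^2, x1*x2, x1^2*x2.
    return np.column_stack([
        np.ones_like(x1),   # the theta_0 (bias) column
        x1, x2,
        x1 ** 2, x2 ** 2,
        x1 * x2,
        x1 ** 2 * x2,
    ])

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([0.5, 1.0, 1.5])
print(poly_features_2d(x1, x2).shape)   # (3, 7): three samples, seven features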
To mitigate overfitting, an L2 regularization term is added, and the cost function is written as: $$J = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)}\log(h_{\theta}(X^{(i)})) - (1-y^{(i)})\log(1-h_{\theta}(X^{(i)})) \right] + \frac{\lambda}{2} \sum_{j=1}^{n} \theta_{j}^2$$
Note that the regularization sum over $\theta$ does NOT include the bias term $\theta_0$.
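A minimal sketch of this regularized cost, assuming the bias column is X[:, 0] so that theta[0] is the bias; the function name is illustrative.
In [ ]:
import numpy as np

def regularized_logistic_cost(theta, X, y, lam):
    # Cross-entropy cost plus an L2 penalty that skips the bias theta_0.
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    cross_entropy = np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))
    penalty = (lam / 2.0) * np.sum(theta[1:] ** 2)   # theta_0 excluded
    return cross_entropy + penalty

X = np.array([[1.0, -1.0], [1.0, 2.0]])
y = np.array([0, 1])
print(regularized_logistic_cost(np.array([0.1, 0.5]), X, y, lam=1.0))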
For samples ${X^{(i)}}$, the mean of feature $j$ over the samples is: $$ \mu_j = \frac{1}{m} \sum_{i=1}^{m} X^{(i)}_j $$
The variance: $$ \sigma^2_j = \frac{1}{m} \sum_{i=1}^{m} (X^{(i)}_j-\mu_j)^2$$ where $\sigma_j$ is the standard deviation.
The mean scaling step then normalizes each feature: $$ X^{(i)}_j \leftarrow \frac {X^{(i)}_j-\mu_j} {\sigma_j} $$
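A minimal sketch of this scaling step, assuming X holds only the raw feature columns (a bias column of ones, if present, should not be scaled); mean_normalize is an illustrative name.
In [ ]:
import numpy as np

def mean_normalize(X):
    # Scale each feature column to zero mean and unit standard deviation.
    mu = X.mean(axis=0)       # per-feature mean mu_j
    sigma = X.std(axis=0)     # per-feature standard deviation sigma_j
    return (X - mu) / sigma, mu, sigma

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_scaled, mu, sigma = mean_normalize(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))   # roughly 0 and 1 per column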
For the cost function with regularization, we have $$J = \frac{1}{m} \sum_{i=1}^{m} (\theta X^{(i)} - y^{(i)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n} \theta_{j}^2 $$ The partial derivative of the cost function with respect to $\theta_j$ is: $$ \frac {\partial J} {\partial \theta_j} = \frac{2}{m} \sum_{i=1}^{m} (\theta X^{(i)} - y^{(i)})X_j^{(i)} + \lambda \theta_j$$ Note that $\theta_0$ is not regularized, so the $\lambda \theta_j$ term is dropped for $j = 0$.
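This partial derivative translates directly into vectorized NumPy; linear_gradient is a hypothetical name, and theta_0 is excluded from the penalty as noted above.
In [ ]:
import numpy as np

def linear_gradient(theta, X, y, lam):
    # Gradient of the regularized squared-error cost:
    # (2/m) * sum_i (theta . x^(i) - y^(i)) * x_j^(i) + lam * theta_j for j >= 1.
    m = len(y)
    residuals = X @ theta - y
    grad = (2.0 / m) * (X.T @ residuals)
    reg = lam * theta
    reg[0] = 0.0              # the bias theta_0 is not regularized
    return grad + reg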
For the gradient descent algorithm, the update rule is: $$\theta_j \leftarrow \theta_j - \alpha \frac {\partial J} {\partial \theta_j}$$ where $\alpha$ is the learning rate.
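Putting the update rule together with the gradient from the previous sketch gives a small batch gradient descent loop; the function name and the toy data are illustrative.
In [ ]:
import numpy as np

def gradient_descent(X, y, lam=0.0, alpha=0.1, n_iters=2000):
    # Repeatedly apply theta_j <- theta_j - alpha * dJ/dtheta_j.
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = (2.0 / m) * (X.T @ (X @ theta - y))   # squared-error part
        grad[1:] += lam * theta[1:]                  # penalty, skipping theta_0
        theta = theta - alpha * grad
    return theta

# Toy fit: recover y = 1 + 2*x from noiseless data (bias column included)
X = np.column_stack([np.ones(20), np.linspace(0, 1, 20)])
y = 1 + 2 * np.linspace(0, 1, 20)
print(gradient_descent(X, y))   # approximately [1, 2]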