Derivation, intuition, and a basic implementation of a logistic regression classifier in the Julia language.
Suppose we're interested in using the values of a set of predictor variables to decide between two categories.
For example, suppose we're amateur meteorologists and we have data with which to predict:
$(\text{cloudy}=0.5, \text{temperature}=90, \text{time_of_day}=5) \rightarrow \text{rain}$ or $\text{no rain}?$
Or suppose we're Wall Street amateurs and we want to predict:
$(\text{company_quarterly_profits}=\$0.2, \text{company_popularity}=0.01, \text{age_of_company}=10\text{yr}) \rightarrow \text{buy stock}$ or $\text{don't buy stock}?$
I.e. we have two discrete choices and some training data mapping feature values to categories.
More formally, suppose we have:
A set of $m$ training examples $T = \{(\vec{x}_{i}, y_i)\}$ for $1 \leq i \leq m$,
where $\vec{x}_{i} \in \mathbb{R}^{n+1}$ is the $i^\text{th}$ training example input, $n$ is the number of features, and $y_i \in \{0, 1\}$ is the $i^\text{th}$ training output,
and we let $x_{i,j}$ denote the $j^\text{th}$ feature of the $i^\text{th}$ training example. For all $\vec{x}_i$ we take $x_{i,0} = 1$.
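Concretely, we can store the training inputs as a matrix whose rows are the $\vec{x}_i$ (with the leading 1 prepended) and the outputs as a vector. A minimal sketch with made-up numbers (none of these values come from the examples above):

# m = 3 toy training examples with n = 2 features each;
# column 1 is the constant feature x_{i,0} = 1.
X = [1.0  0.5  90.0;
     1.0  0.1  65.0;
     1.0  0.9  40.0]
y = [1.0, 0.0, 1.0]        # y_i in {0, 1}
m, n_plus_1 = size(X)      # 3 examples, n + 1 = 3 columns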
In logistic regression, classification is approached by thresholding the hypothesis function $h_\vec{\theta} : \mathbb{R}^{n+1} \rightarrow \mathbb{R}$ defined by:
$$h_\vec{\theta}(\vec{x}) = \frac{1}{1 + e^{-\vec{\theta}^T \vec{x}}}$$
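In Julia this is a one-liner; here's a minimal sketch (the helper names `sigmoid` and `h` are my own, not part of the implementation later in this post):

using LinearAlgebra   # for dot

sigmoid(z) = 1.0 / (1.0 + exp(-z))      # the logistic function
h(theta, x) = sigmoid(dot(theta, x))    # hypothesis h_theta(x)

h([-1.0, 1.0, -1.0], [1.0, 2.0, 0.5])   # theta'x = 0.5, so this is ≈ 0.62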
Let's look at its plot:
In [1]:
using PyPlot
# Sigmoid curves for several scalings of t
t = range(-20, 20, length=100)
PyPlot.plot(t, 1 ./ (1 .+ exp.(-t/5)), "y.-", label="1/5 * t")
PyPlot.plot(t, 1 ./ (1 .+ exp.(-t/2)), "r.-", label="1/2 * t")
PyPlot.plot(t, 1 ./ (1 .+ exp.(-t)), "g.-", label="t")
PyPlot.plot(t, 1 ./ (1 .+ exp.(-2*t)), "b.-", label="2 * t")
PyPlot.plot(t, 1 ./ (1 .+ exp.(-5*t)), "m.-", label="5 * t")
PyPlot.hlines(0.5, -20, 0)
PyPlot.yticks([i/10 for i = 0:10])
PyPlot.legend()
PyPlot.gcf()
Out[1]:
Some important properties of the function to observe:
Its output always lies strictly between 0 and 1.
$h_\vec{\theta}(\vec{x}) = 0.5$ exactly when $\vec{\theta}^T \vec{x} = 0$, $h_\vec{\theta}(\vec{x}) \geq 0.5$ exactly when $\vec{\theta}^T \vec{x} \geq 0$, and $h_\vec{\theta}(\vec{x}) < 0.5$ exactly when $\vec{\theta}^T \vec{x} < 0$.
As $\vec{\theta}^T \vec{x} \rightarrow \infty$ the output approaches 1, and as $\vec{\theta}^T \vec{x} \rightarrow -\infty$ it approaches 0.
Scaling $\vec{\theta}$ changes how steep the transition is (compare the $5t$ and $\frac{1}{5}t$ curves above) but not where it crosses 0.5.
Now in logistic regression $h_\vec{\theta}(\vec{x}) = \frac{1}{1 + e^{-\vec{\theta}^T \vec{x}}}$ is a useful hypothesis function because we can use the properties above as follows: treat the output as our confidence that the input belongs to category 1, and predict category 1 when $h_\vec{\theta}(\vec{x}) \geq 0.5$ (equivalently, when $\vec{\theta}^T \vec{x} \geq 0$) and category 0 otherwise.
So now the game is to set up $\vec{\theta}^T \vec{x}$ so that we get a value $< 0$ when we want category 0, and $\vec{\theta}^T \vec{x} \ge 0$ when we want category 1; when we do that we're setting up a "decision boundary".
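As a sketch of the classification rule (reusing the hypothetical `sigmoid` helper from above):

# Predict category 1 when h_theta(x) >= 0.5, i.e. exactly when theta . x >= 0.
predict(theta, x) = sigmoid(dot(theta, x)) >= 0.5 ? 1 : 0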
Consider the inner product $\vec{\theta}^T \vec{x} \in \mathbb{R}$ that we can play with:
$$\vec{\theta}^T \vec{x} = \langle[\theta_0, \theta_1, \ldots, \theta_n], [1, x_1, \ldots, x_n]\rangle
\\ = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n
$$
Now in a given problem, we need not use the original input values as features; we can scale them, compose products of them--whatever we want.
For example, if we're given values $(x_1, x_2)$, we could transform the training set to instead use the features $(x_1^2, x_2^2)$, and then fit the parameters $\vec{\theta}$ of the inner product $\vec{\theta}^T \vec{x} = \theta_0 + \theta_1 x_1^2 + \theta_2 x_2^2$. The zero set of that expression is a hyperbola when $\theta_1$ and $\theta_2$ have opposite signs, and an ellipse/circle when $\theta_1$ and $\theta_2$ share a sign (with $\theta_0$ of the opposite sign).
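As an illustration (the feature-map name `phi` and these particular $\vec{\theta}$ values are my own, chosen to match the shapes just described):

phi(x1, x2) = [1.0, x1^2, x2^2]        # transformed input, with the leading 1
theta_hyperbola = [-1.0, 1.0, -1.0]    # -1 + x1^2 - x2^2 = 0, a hyperbola
theta_circle    = [-1.0, 1.0,  1.0]    # -1 + x1^2 + x2^2 = 0, the unit circle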
For example, consider the data below, where in our training set we know the red dots are category 0 and the green dots are category 1.
A decision boundary with $\vec{\theta}^T \vec{x} = -1 + 1 x_1^2 - 1 x_2^2$, i.e. the hyperbola $x_1^2 - x_2^2 = 1$, then separates the data pretty well:
In [2]:
# Plot the hyperbola x1^2 - x2^2 = 1
t1 = range(-3, -1, length=100)
t2 = range(1, 3, length=100)
t3 = range(-1, 1, length=100)
PyPlot.plot(t1, [sqrt(t^2 - 1) for t in t1], "b.-",
            t2, [sqrt(t^2 - 1) for t in t2], "b.-",
            t1, [-sqrt(t^2 - 1) for t in t1], "b.-",
            t2, [-sqrt(t^2 - 1) for t in t2], "b.-")
PyPlot.legend(["Hyperbola x_1^2 - x_2^2 = 1"])
# Plot some data inside (green) and outside (red) the branches
PyPlot.plot(t1, [sqrt(t^2 - 1) - 2*rand() for t in t1], "g.", # In-data
            t2, [sqrt(t^2 - 1) - 2*rand() for t in t2], "g.",
            t1, [-sqrt(t^2 - 1) + 2*rand() for t in t1], "g.",
            t2, [-sqrt(t^2 - 1) + 2*rand() for t in t2], "g.",
            t1, [sqrt(t^2 - 1) + 5*rand() for t in t1], "r.", # Out-data
            t2, [sqrt(t^2 - 1) + 5*rand() for t in t2], "r.",
            t1, [-sqrt(t^2 - 1) - 5*rand() for t in t1], "r.",
            t2, [-sqrt(t^2 - 1) - 5*rand() for t in t2], "r.",
            t3, [8*rand() for t in t3], "r.",
            t3, [-8*rand() for t in t3], "r.")
PyPlot.gcf()
Out[2]:
For example, at the point $\vec{x_e} = (x_1=-2, x_2=0)$, $\vec{\theta}^T \vec{x_e} = -1 + 1 x_1^2 - 1 x_2^2 = -1 + 4 - 0 = 3$
and since $\vec{\theta}^T \vec{x_e} = 3 > 0$, we have that: $h_\vec{\theta}(\vec{x_e}) = \frac{1}{1 + e^{-\vec{\theta}^T \vec{x_e}}} = \frac{1}{1+e^{-3}} \approx 0.952574127$.
and since 0.952574127 >= 0.5, our fitted hypothesis predicts that the category is 1--i.e. it's in the green as expected.
However, at the point $\vec{x_f} = (x_1=0.5, x_2=-1)$, we have $\vec{\theta}^T \vec{x_f} = -1 + 0.5^2 - (-1)^2 = -1.75$, and so $h_\vec{\theta}(\vec{x_f}) = \frac{1}{1+e^{1.75}} \approx 0.148047198$,
and since 0.148047198 < 0.5, we predict the category to be 0--i.e. in the red as expected.
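We can check both points numerically, reusing the hypothetical `sigmoid` and `phi` helpers sketched earlier:

theta = [-1.0, 1.0, -1.0]            # theta . phi(x) = -1 + x1^2 - x2^2
sigmoid(dot(theta, phi(-2.0, 0.0)))  # ≈ 0.9526, >= 0.5, predict category 1 (green)
sigmoid(dot(theta, phi(0.5, -1.0)))  # ≈ 0.1480, <  0.5, predict category 0 (red)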
Note that constructing appropriate transforms of the inputs to produce a feasible decision boundary takes some skill and finesse; in a low-dimensional example it can be eye-balled, but in higher dimensions other tactics need to be employed.
To fit the decision boundary, we can use gradient descent. To do that, we need an appropriate cost function, and we need to determine its gradient.
In logistic regression, the cost function $J : \mathbb{R}^{n+1} \rightarrow \mathbb{R}$ has the form:
$$J(\vec{\theta}) = -\frac{1}{m}\sum_{i=1}^{m}(y_i\log h_\vec{\theta}(\vec{x}_{i}) + (1-y_i)\log(1 - h_\vec{\theta}(\vec{x}_{i})) )$$
where as discussed above:
$$h_\vec{\theta}(\vec{x}) = \frac{1}{1 + e^{-\vec{\theta}^T \vec{x}}}$$
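A direct transcription of $J$ into Julia, as a sketch (the `cost` name is mine, and it reuses the hypothetical `sigmoid`/`dot` helpers from earlier):

# J(theta) for a design matrix X (rows are the x_i, with the leading 1)
# and a 0/1 response vector y.
function cost(theta, X, y)
    m = size(X, 1)
    hs = [sigmoid(dot(X[i, :], theta)) for i = 1:m]
    return -(1/m) * sum(y[i]*log(hs[i]) + (1 - y[i])*log(1 - hs[i]) for i = 1:m)
end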
We want to minimize the cost function $J$ by varying the parameter vector $\vec{\theta}$. The cost function being convex guarantees that gradient descent converges to the global optimum rather than getting stuck in some local one.
Let's see what the cost function looks like for a single sample, and that it's convex:
In [3]:
using PyPlot
PyPlot.clf()
t = range(-20, 20, length=100)
PyPlot.plot(t, -log.(1 ./ (1 .+ exp.(-t/5))), "b.-",
            t, -log.(1 .- 1 ./ (1 .+ exp.(-t/5))), "g.-")
PyPlot.legend(["y=1", "y=0"])
PyPlot.gcf()
Out[3]:
Suppose we instead used the single-sample cost function $(h_\vec{\theta}(\vec{x}_i) - y_i)^2$, as in linear regression; then we'd have:
$(h_\vec{\theta}(\vec{x}_i) - y_i)^2 = (\frac{1}{1 + e^{-\vec{\theta}^T \vec{x}_i}} - y_i)^2$
Which isn't convex:
In [4]:
using PyPlot
PyPlot.clf()
t = range(-20, 20, length=100)
PyPlot.plot(t, (1 ./ (1 .+ exp.(-0.5*t)) .- 1).^2, "b.-",
            t, (1 ./ (1 .+ exp.(-0.5*t))).^2, "g.-")
PyPlot.legend(["y=1", "y=0"])
PyPlot.gcf()
Out[4]:
So now we sort of believe the logistic cost is convex (if this were a math class and not just me goofing off, a proof would be nice here).
Now we need to determine what the actual update in gradient descent should be.
So we need, using the chain rule: $$ \frac{\partial}{\partial \theta_j} h_\vec{\theta}(\vec{x}_{i}) = \frac{\partial}{\partial \theta_j} \frac{1}{1 + e^{-\vec{\theta}^T \vec{x}_i}} \\ = \frac{\partial}{\partial \theta_j} \frac{1}{1 + e^{-(\theta_{0} + \theta_1 x_{i,1} + \theta_2 x_{i,2} + \cdots + \theta_n x_{i,n})}} \\ = x_{i,j} e^{-\vec{\theta}^T \vec{x}_i} h_\vec{\theta}(\vec{x}_{i})^2 $$
And substituting back into $J$:
$$
\frac{\partial}{\partial \theta_j} J(\vec{\theta})
= -\frac{1}{m}\sum_{i=1}^{m}\left(y_i\frac{x_{i,j} e^{-\vec{\theta}^T \vec{x}_i} h_\vec{\theta}(\vec{x}_{i})^2}{h_\vec{\theta}(\vec{x}_{i})} + (1-y_i)\frac{-x_{i,j} e^{-\vec{\theta}^T \vec{x}_i} h_\vec{\theta}(\vec{x}_{i})^2}{1 - h_\vec{\theta}(\vec{x}_{i})}\right)
\\ = -\frac{1}{m}\sum_{i=1}^{m}\left(y_i x_{i,j} e^{-\vec{\theta}^T \vec{x}_i} h_\vec{\theta}(\vec{x}_{i}) + (1-y_i)\frac{-x_{i,j} e^{-\vec{\theta}^T \vec{x}_i} h_\vec{\theta}(\vec{x}_{i})^2}{1 - h_\vec{\theta}(\vec{x}_{i})}\right)
\\ = -\frac{1}{m}\sum_{i=1}^{m}(x_{i,j} e^{-\vec{\theta}^T \vec{x}_i})\left(y_i h_\vec{\theta}(\vec{x}_{i}) - (1-y_i)\frac{h_\vec{\theta}(\vec{x}_{i})^2}{1 - h_\vec{\theta}(\vec{x}_{i})}\right)
\\ = -\frac{1}{m}\sum_{i=1}^{m}(x_{i,j} e^{-\vec{\theta}^T \vec{x}_i})\frac{y_i h_\vec{\theta}(\vec{x}_{i}) - y_i h_\vec{\theta}(\vec{x}_{i})^2 + y_i h_\vec{\theta}(\vec{x}_{i})^2 - h_\vec{\theta}(\vec{x}_{i})^2}{1 - h_\vec{\theta}(\vec{x}_{i})}
\\ = -\frac{1}{m}\sum_{i=1}^{m}(x_{i,j} e^{-\vec{\theta}^T \vec{x}_i})\frac{y_i h_\vec{\theta}(\vec{x}_{i}) - h_\vec{\theta}(\vec{x}_{i})^2}{1 - h_\vec{\theta}(\vec{x}_{i})}
\\ = -\frac{1}{m}\sum_{i=1}^{m}(x_{i,j} e^{-\vec{\theta}^T \vec{x}_i})\frac{y_i - h_\vec{\theta}(\vec{x}_{i})}{\frac{1}{h_\vec{\theta}(\vec{x}_{i})} - 1}
\\ = -\frac{1}{m}\sum_{i=1}^{m}(x_{i,j} e^{-\vec{\theta}^T \vec{x}_i})\frac{y_i - h_\vec{\theta}(\vec{x}_{i})}{e^{-\vec{\theta}^T \vec{x}_i} + 1 - 1}
\\ = \frac{1}{m} \sum_{i=1}^{m} x_{i,j} (h_\vec{\theta}(\vec{x}_{i}) - y_i)
$$
So that's our gradient component, and the corresponding gradient-descent update is $\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} x_{i,j} (h_\vec{\theta}(\vec{x}_{i}) - y_i)$ for all $j$ simultaneously, where $\alpha$ is the step size.
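Before wiring this into the full fitting routine in the next cell, here's a compact sketch of that gradient (again with hypothetical names, building on the `cost`/`sigmoid` sketches above) and the update it implies:

# Gradient of J, component j: (1/m) * sum_i (h_theta(x_i) - y_i) * x_{i,j};
# collected over all j this is (1/m) * X' * (h - y).
function cost_grad(theta, X, y)
    m = size(X, 1)
    hs = [sigmoid(dot(X[i, :], theta)) for i = 1:m]
    return (1/m) * X' * (hs - y)
end

# One gradient-descent step with step size alpha:
#   theta = theta - alpha * cost_grad(theta, X, y)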
In [5]:
using DataFrames
using StatsModels
using LinearAlgebra
using PyPlot
# Parameters:
# training_data is a DataFrames.DataFrame (https://github.com/JuliaData/DataFrames.jl)
#   containing the training data.
#
# reg_formula is a StatsModels.FormulaTerm (https://github.com/JuliaStats/StatsModels.jl),
#   built with @formula, specifying the linear combination of features to be fitted
#   from the training data.
#
# Note this is the same setup as in linear regression; just the hypothesis function changes!
function logreg(training_data::DataFrame, reg_formula::FormulaTerm,
                step_size::Float64, threshold::Float64, max_iter::Int)
    mf = ModelFrame(reg_formula, training_data)
    X = ModelMatrix(mf).m    # design matrix with the constant-1 column added
    y = response(mf)         # response vector
    m = size(X, 1)           # size of training set
    n = size(X, 2)           # number of parameters (features + intercept)
    theta = zeros(Float64, n)
    sigmoid(z) = 1.0 / (1.0 + exp(-z))
    iter_cnt = 0
    diff = 1.0
    while diff > threshold && iter_cnt < max_iter
        # Gradient component j: (1/m) * sum_i (sigmoid(theta . x_i) - y_i) * x_{i,j}
        grad = sum((sigmoid(dot(X[i, :], theta)) - y[i]) * X[i, :] for i = 1:m)
        theta1 = theta - (step_size / m) * grad
        diff = sum(abs.(theta1 - theta))
        theta = theta1
        iter_cnt += 1
        if iter_cnt % 300 == 0
            println("Num. iterations completed: ", iter_cnt)
            println(theta)
            # Report how many training examples the current theta classifies correctly.
            correct = 0
            incorrect = 0
            for i = 1:m
                pred = sigmoid(dot(X[i, :], theta)) >= 0.5 ? 1 : 0
                if pred == y[i]
                    correct += 1
                else
                    incorrect += 1
                end
            end
            println("Correct, incorrect: ", (correct, incorrect))
        end
    end
    println(("itercnt", iter_cnt))
    println(theta)
    return theta
end
Out[5]:
So let's try it out on that hyperbola illustration from earlier.
In [6]:
# Generate and plot some data inside and outside the hyperbola x1^2 - x2^2 = 1
t1 = range(-3, -1, length=100)
t2 = range(1, 3, length=100)
t3 = range(-1, 1, length=100)
x1_data_1 = Float64[]
x2_data_1 = Float64[]
y_data_1 = Float64[]
x1_data_0 = Float64[]
x2_data_0 = Float64[]
y_data_0 = Float64[]
o = ones(length(t1))
z = zeros(length(t1))
# Category y = 1 data (inside the branches):
x1_data_1 = vcat(x1_data_1, t1, t2, t1, t2)
x2_data_1 = vcat(x2_data_1,
                 [sqrt(t^2 - 1) - 2*rand() for t in t1],
                 [sqrt(t^2 - 1) - 2*rand() for t in t2],
                 [-sqrt(t^2 - 1) + 2*rand() for t in t1],
                 [-sqrt(t^2 - 1) + 2*rand() for t in t2])
y_data_1 = vcat(y_data_1, o, o, o, o)
PyPlot.plot(x1_data_1, x2_data_1, "g.")
# Category y = 0 data (outside the branches):
x1_data_0 = vcat(x1_data_0, t1, t2, t1, t2, t3, t3)
x2_data_0 = vcat(x2_data_0,
                 [sqrt(t^2 - 1) + 5*rand() for t in t1],
                 [sqrt(t^2 - 1) + 5*rand() for t in t2],
                 [-sqrt(t^2 - 1) - 5*rand() for t in t1],
                 [-sqrt(t^2 - 1) - 5*rand() for t in t2],
                 [8*rand() for t in t3],
                 [-8*rand() for t in t3])
y_data_0 = vcat(y_data_0, z, z, z, z, z, z)
PyPlot.plot(x1_data_0, x2_data_0, "r.")
# Raw features x1, x2 plus the squared features x1s, x2s used for the decision boundary
df = DataFrame(x1=vcat(x1_data_1, x1_data_0),
               x1s=[t*t for t in vcat(x1_data_1, x1_data_0)],
               x2=vcat(x2_data_1, x2_data_0),
               x2s=[t*t for t in vcat(x2_data_1, x2_data_0)],
               y=vcat(y_data_1, y_data_0))
first(df, 6)
Out[6]:
Let's see how well the parameters that we know a priori fit:
In [7]:
# The parameters we expect a priori: theta = (-1, 1, -1), i.e. -1 + x1^2 - x2^2
p = [-1.0, 1.0, -1.0]
preds = [1.0/(1.0 + exp(-(p[1] + p[2]*df.x1s[i] + p[3]*df.x2s[i]))) >= 0.5 ? 1 : 0
         for i = 1:nrow(df)]
correct = count(preds .== df.y)
incorrect = nrow(df) - correct
println("Correct, incorrect: ", (correct, incorrect))
In [8]:
# @formula(y ~ x1 + x2 + x1s + x2s) would also include the raw features.
p = logreg(df,
           @formula(y ~ x1s + x2s),
           0.2,
           1e-6,
           100_000)
Out[8]: