Overview

Derivation, intuition, and a basic implementation of a logistic regression classifier in the Julia language.

Background

Suppose we're interested in using the values of a set of predictor variables to decide between two categories.

For example, suppose we're amateur meteorologists and we have data to predict:
$(\text{cloudy}=0.5, \text{temperature}=90, \text{time_of_day}=5) \rightarrow \text{rain}$ or $\text{no rain}?$

Or suppose we're Wall Street amateurs and we want to predict:
$(\text{company_quarterly_profits}=\$0.2, \text{company_popularity}=0.01, \text{age_of_company}=10\text{yr}) \rightarrow \text{buy stock}$ or $\text{don't buy stock}?$


I.e. we have two discrete choices and some training data mapping feature values to categories.

Definitions and Discussion

More formally, suppose we have:
A set of $m$ training examples $T = \{(\vec{x_{i}}, y_i)\}$ for $1 \leq i \leq m$,
where $\vec{x_{i}} \in \mathbb{R}^{n+1}$ is the $i^\text{th}$ training example input, $n$ is the number of features, and $y_i \in \{0, 1\}$ is the $i^\text{th}$ training output,
and we let $x_{i,j}$ denote the $j^\text{th}$ feature of the $i^\text{th}$ training example. For all $\vec{x}_i$ we take $x_{i,0} = 1$.
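
To make the notation concrete, here's a toy training set laid out this way in Julia (made-up numbers, just for illustration): $m = 3$ examples and $n = 2$ features.

# Rows of X are the training inputs x_i, each with the constant feature x_{i,0} = 1 prepended.
X = [1.0 0.5 90.0;
     1.0 0.1 65.0;
     1.0 0.9 40.0]
y = [1, 0, 0]   # the corresponding labels y_i ∈ {0, 1}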

In logistic regression, classification is approached by thresholding the hypothesis function $h_\vec{\theta} : \mathbb{R}^{n+1} \rightarrow \mathbb{R}$ defined by:
$$h_\vec{\theta}(\vec{x}) = \frac{1}{1 + e^{-\vec{\theta}^T \vec{x}}}$$

Let's look at its plot:


In [1]:
using PyPlot
t = PyPlot.linspace(-20, 20, 100)
PyPlot.plot(t, 1/(1+exp(-1.0/5*t)), "y.-", label="1/5 * t")
PyPlot.plot(t, 1/(1+exp(-1.0/2*t)), "r.-", label="1/2 * t")
PyPlot.plot(t, 1/(1+exp(-t)), "g.-", label="t")
PyPlot.plot(t, 1/(1+exp(-2*t)), "b.-", label="2 * t")
PyPlot.plot(t, 1/(1+exp(-5*t)), "m.-", label="5 * t")
PyPlot.hlines(0.5, -20, 0)
PyPlot.yticks([i/10 for i=0:10])
PyPlot.legend()
plt.gcf()


Out[1]: [plot: the sigmoid curves $\frac{1}{1+e^{-ct}}$ for $c = 1/5, 1/2, 1, 2, 5$, with a horizontal guide at 0.5]

Some important properties of the function to observe:

  1. Its domain is $(-\infty, \infty)$
  2. Its range is $(0, 1)$ and it asymptotically approaches 0 as $t \rightarrow -\infty$ and 1 as $t \rightarrow \infty$
  3. The function always passes through $0.5$ at $t = 0$.
  4. When $t \le 0$, $\frac{1}{1 + e^{-t}} \le 0.5$, and when $t \ge 0$, $\frac{1}{1 + e^{-t}} \ge 0.5$
  5. Scaling $t$ by a constant inside the exponential expands or contracts the ascent from 0 to 1.
  6. For $\frac{1}{1 + e^{-c t}}$, as $c \rightarrow \infty$ the function approaches a step function from 0 to 1 at $t = 0$, and as $c \rightarrow 0$ it flattens toward the horizontal line $y = 0.5$ over the values of interest.

Now in logistic regression $h_\vec{\theta}(\vec{x}) = \frac{1}{1 + e^{-\vec{\theta}^T \vec{x}}}$ is a useful hypothesis function because we can use the properties above as follows:

  • We can use property 2. to regard the value of $h_\vec{\theta}(\vec{x})$ as a probability if we want as it's between 0 and 1. (Although $h_\vec{\theta}(\vec{x})$ is definitely not a probability distribution over the real number line.)
  • We can use properties 3. and 4. so that we can set threshold 0.5 for classifying between our two categories.
    E.g. when $h_\vec{\theta}(\vec{x}) \ge 0.5$, we can predict category 1
    and when $h_\vec{\theta}(\vec{x}) < 0.5$, we can predict category 0.
    (Assigning the boundary value 0.5 itself to category 1 is arbitrary; and since the function operates over the whole real number line, the choice is only infinitesimally unbalanced.)
  • When $\vec{\theta}^T \vec{x} \ge 0$, we have $h_\vec{\theta}(\vec{x}) \ge 0.5$, and when $\vec{\theta}^T \vec{x} \le 0$ we have $h_\vec{\theta}(\vec{x}) \le 0.5$ (by property 4. above)

So now the game is to set up $\vec{\theta}^T \vec{x}$ so that we get a value $< 0$ when we want category 0, and $\vec{\theta}^T \vec{x} \ge 0$ when we want category 1; when we do that we're setting up a "decision boundary".
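
In Julia, the hypothesis and the thresholding rule are only a couple of lines. Here's a minimal sketch (the names sigmoid, h, and predict are mine, and the syntax is current Julia rather than the older Julia the notebook cells were run on):

# The sigmoid/logistic function.
sigmoid(z) = 1 / (1 + exp(-z))

# Hypothesis h_θ(x); x is assumed to already carry the constant 1 as its first entry.
h(θ, x) = sigmoid(sum(θ .* x))

# Threshold at 0.5, i.e. predict category 1 exactly when θᵀx ≥ 0.
predict(θ, x) = h(θ, x) >= 0.5 ? 1 : 0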

Decision Boundaries

Consider the inner product $\vec{\theta}^T \vec{x} \in \mathbb{R}$ that we can play with: $$\vec{\theta}^T \vec{x} = \langle[\theta_0, \theta_1, \cdots, \theta_n], [1, x_1, \cdots, x_n]\rangle \\ = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n $$

Now in a given problem, we need not use the original input values as features; we can scale them, compose products of them--whatever we want.

For example, if we're given values $(x_1, x_2)$, we could transform the training set to instead use $(x_1^2, x_2^2)$, and then fit the parameters $\vec{\theta}$ of the inner product so that $\vec{\theta}^T \vec{x} = \theta_0 + \theta_1 x_1^2 + \theta_2 x_2^2$, which traces out a hyperbola when exactly one of $\theta_1, \theta_2$ is negative, or an ellipse/circle when $\theta_1$ and $\theta_2$ match in sign.

For example, consider the data below, for which we know from the training set that the red dots are category 0 and the green dots are category 1.
A decision boundary with $\vec{\theta}^T \vec{x} = -1 + 1 x_1^2 - 1 x_2^2$, a hyperbola, then separates the data pretty well:


In [2]:
# Plot hyperbola
t1 = PyPlot.linspace(-3, -1, 100)
t2 = PyPlot.linspace(1, 3, 100)
t3 = PyPlot.linspace(-1, 1, 100)
PyPlot.plot(t1, [sqrt(t^2 - 1) for t in t1], "b.-",
            t2, [sqrt(t^2 - 1) for t in t2], "b.-",
            t1, [-sqrt(t^2 - 1) for t in t1], "b.-",
            t2, [-sqrt(t^2 - 1) for t in t2], "b.-")
PyPlot.legend(["Hyperbola x_1^2 - y_2^2 - 1"])

# Plot some inner and outer data
PyPlot.plot(t1, [sqrt(t^2 - 1) - 2*rand() for t in t1], "g.", # In-data
            t2, [sqrt(t^2 - 1) - 2*rand() for t in t2], "g.",
            t1, [-sqrt(t^2 - 1) + 2*rand() for t in t1], "g.",
            t2, [-sqrt(t^2 - 1) + 2*rand() for t in t2], "g.",
            
            t1, [sqrt(t^2 - 1) + 5*rand() for t in t1], "r.", # Out-data
            t2, [sqrt(t^2 - 1) + 5*rand() for t in t2], "r.",
            t1, [-sqrt(t^2 - 1) - 5*rand() for t in t1], "r.",
            t2, [-sqrt(t^2 - 1) - 5*rand() for t in t2], "r.",
            t3, [8*rand() for t in t3], "r.",
            t3, [-8*rand() for t in t3], "r.")
plt.gcf()


Out[2]: [plot: the hyperbola decision boundary $x_1^2 - x_2^2 = 1$ with the category-1 training points in green and the category-0 points in red]

For example, at the point $\vec{x_e} = (x_1=-2, x_2=0)$: $\vec{\theta}^T \vec{x_e} = -1 + 1 x_1^2 - 1 x_2^2 = -1 + 4 - 0 = 3$,
and since $\vec{\theta}^T \vec{x_e} = 3 > 0$, we have that $h_\vec{\theta}(\vec{x_e}) = \frac{1}{1 + e^{-\vec{\theta}^T \vec{x_e}}} = \frac{1}{1+e^{-3}} \approx 0.952574127$,
and since $0.952574127 \ge 0.5$, our fitted hypothesis predicts that the category is 1--i.e. it's in the green, as expected.

However, at the point $\vec{x_f} = (x_1=0.5, x_2=-1)$, we have $\vec{\theta}^T \vec{x_f} = -1 + 0.5^2 - (-1)^2 = -1.75$, and so $h_\vec{\theta}(\vec{x_f}) = \frac{1}{1+e^{1.75}} \approx 0.148047198$,
and since $0.148047198 < 0.5$, we predict the category to be 0--i.e. in the red, as expected.
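
We can double-check those two points against the sketch of h from earlier (with $\vec{\theta} = (-1, 1, -1)$ acting on the transformed features $(1, x_1^2, x_2^2)$):

θ = [-1.0, 1.0, -1.0]
feats(x1, x2) = [1.0, x1^2, x2^2]   # prepend the constant and square the raw inputs
h(θ, feats(-2.0, 0.0))   # ≈ 0.9526 → predict category 1 (green)
h(θ, feats(0.5, -1.0))   # ≈ 0.1480 → predict category 0 (red)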

Note that constructing appropriate transforms of the inputs to produce a feasible decision boundary takes some skill and finesse; in a low-dimensional example it can be eye-balled, but in higher dimensions other tactics need to be employed.

Fitting parameters to determine the decision boundary

To determine the decision boundary, we can use gradient descent. To do that, we need an appropriate cost function and its gradient.

In logistic regression, the cost function $J : \mathbb{R}^{n+1} \rightarrow \mathbb{R}$ has the form:
$$J(\vec{\theta}) = -\frac{1}{m}\sum_{i=1}^{m}\left(y_i\log h_\vec{\theta}(\vec{x}_{i}) + (1-y_i)\log(1 - h_\vec{\theta}(\vec{x}_{i})) \right)$$ where, as discussed above: $$h_\vec{\theta}(\vec{x}) = \frac{1}{1 + e^{-\vec{\theta}^T \vec{x}}}$$
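
As a sketch, the cost can be written straight from that formula (current Julia syntax; X is an $m \times (n+1)$ matrix whose rows are the $\vec{x}_i$, y is the 0/1 label vector, and h is the hypothesis sketched earlier):

function cost(θ, X, y)
    m = size(X, 1)
    hs = [h(θ, X[i, :]) for i in 1:m]   # h_θ(x_i) for each training example
    return -(1/m) * sum(y .* log.(hs) .+ (1 .- y) .* log.(1 .- hs))
end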

We want to minimize the cost function $J$ by varying the parameter vector $\vec{\theta}$. The cost function being convex guarantees that gradient descent converges to the global optimum rather than getting stuck in some local one.

Let's see what the cost function looks like for a single sample, and that it's convex:


In [3]:
using PyPlot

PyPlot.plt.clf()
t = PyPlot.linspace(-20, 20, 100)
PyPlot.plot(t, -log(1/(1+exp(-1.0/5*t))),"b.-",
            t, -log(1 - 1/(1+exp(-1.0/5*t))),"g.-")
PyPlot.legend(["y=1", "y=0"])
PyPlot.gcf()


Out[3]: [plot: the single-sample costs $-\log h_\vec{\theta}$ (for $y=1$, blue) and $-\log(1 - h_\vec{\theta})$ (for $y=0$, green) against $\vec{\theta}^T \vec{x}$]

Suppose we instead used the single-sample cost function $(h_\vec{\theta}(\vec{x}_i) - y_i)^2$, as in linear regression; then we'd have: $(h_\vec{\theta}(\vec{x}_i) - y_i)^2 = \left(\frac{1}{1 + e^{-\vec{\theta}^T \vec{x}_i}} - y_i\right)^2$

Which isn't convex:


In [4]:
using PyPlot

PyPlot.plt.clf()
t = PyPlot.linspace(-20, 20, 100)
PyPlot.plot(t, [x*x for x in (1/(1+exp(-0.5*t)) - 1)],"b.-",
            t, [x*x for x in (1/(1+exp(-0.5*t)))],"g.-")
PyPlot.legend(["y=1", "y=0"])
PyPlot.gcf()


Out[4]: [plot: the single-sample squared-error costs for $y=1$ (blue) and $y=0$ (green) against $\vec{\theta}^T \vec{x}$; neither curve is convex]

So now we sort of believe the log-loss cost is convex (if this were a math class and not just me goofing off, a proof would be nice here).
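
One quick (non-proof) sanity check is to look at second differences of the single-sample $y = 1$ cost $-\log h_\vec{\theta}$ over a grid of values of $\vec{\theta}^T \vec{x}$; for a convex function they should all be nonnegative:

cost1(z) = -log(1 / (1 + exp(-z)))   # single-sample cost when y = 1, as a function of z = θᵀx
d2 = [cost1(z - 0.5) - 2*cost1(z) + cost1(z + 0.5) for z in -20:0.5:20]
all(d -> d >= -1e-12, d2)            # true: no concave kinks anywhere on this grid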

Gradient descent details

Now we need to determine what the actual update in gradient descent should be.

$$\frac{\partial}{\partial \theta_j} J(\vec{\theta}) = -\frac{\partial}{\partial \theta_j} \frac{1}{m}\sum_{i=1}^{m}\left(y_i\log h_\vec{\theta}(\vec{x}_{i}) + (1-y_i)\log(1 - h_\vec{\theta}(\vec{x}_{i})) \right) \\ = -\frac{1}{m}\sum_{i=1}^{m}\left(y_i\frac{\partial}{\partial \theta_j}\log h_\vec{\theta}(\vec{x}_{i}) + (1-y_i)\frac{\partial}{\partial \theta_j}\log(1 - h_\vec{\theta}(\vec{x}_{i})) \right) \\ = -\frac{1}{m}\sum_{i=1}^{m}\left(y_i\frac{\frac{\partial}{\partial \theta_j} h_\vec{\theta}(\vec{x}_{i})}{h_\vec{\theta}(\vec{x}_{i})} + (1-y_i)\frac{\frac{\partial}{\partial \theta_j}(1 - h_\vec{\theta}(\vec{x}_{i}))}{1 - h_\vec{\theta}(\vec{x}_{i})}\right) $$

So we need: $$ \frac{\partial}{\partial \theta_j} h_\vec{\theta}(\vec{x}_{i}) = \frac{\partial}{\partial \theta_j} \frac{1}{1 + e^{-\vec{\theta}^T \vec{x}_i}} \\ = \frac{\partial}{\partial \theta_j} \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_{i,1} + \theta_2 x_{i,2} + \cdots + \theta_n x_{i,n})}} \\ = \frac{x_{i,j}\, e^{-\vec{\theta}^T \vec{x}_i}}{\left(1 + e^{-\vec{\theta}^T \vec{x}_i}\right)^2} \\ = x_{i,j}\, e^{-\vec{\theta}^T \vec{x}_i}\, h_\vec{\theta}(\vec{x}_{i})^2 $$

And substituting back into $J$:
$$ \frac{\partial}{\partial \theta_j} J(\vec{\theta}) = -\frac{1}{m}\sum_{i=1}^{m}\left(y_i\frac{x_{i,j} e^{-\vec{\theta}^T \vec{x}_i} h_\vec{\theta}(\vec{x}_{i})^2}{h_\vec{\theta}(\vec{x}_{i})} + (1-y_i)\frac{-x_{i,j} e^{-\vec{\theta}^T \vec{x}_i} h_\vec{\theta}(\vec{x}_{i})^2}{1 - h_\vec{\theta}(\vec{x}_{i})}\right) \\ = -\frac{1}{m}\sum_{i=1}^{m}\left(y_i x_{i,j} e^{-\vec{\theta}^T \vec{x}_i} h_\vec{\theta}(\vec{x}_{i}) + (1-y_i)\frac{-x_{i,j} e^{-\vec{\theta}^T \vec{x}_i} h_\vec{\theta}(\vec{x}_{i})^2}{1 - h_\vec{\theta}(\vec{x}_{i})}\right) \\ = -\frac{1}{m}\sum_{i=1}^{m}\left(x_{i,j} e^{-\vec{\theta}^T \vec{x}_i}\right)\left(y_i h_\vec{\theta}(\vec{x}_{i}) - (1-y_i)\frac{h_\vec{\theta}(\vec{x}_{i})^2}{1 - h_\vec{\theta}(\vec{x}_{i})}\right) \\ = -\frac{1}{m}\sum_{i=1}^{m}\left(x_{i,j} e^{-\vec{\theta}^T \vec{x}_i}\right)\frac{y_i h_\vec{\theta}(\vec{x}_{i}) - y_i h_\vec{\theta}(\vec{x}_{i})^2 + y_i h_\vec{\theta}(\vec{x}_{i})^2 - h_\vec{\theta}(\vec{x}_{i})^2}{1 - h_\vec{\theta}(\vec{x}_{i})} \\ = -\frac{1}{m}\sum_{i=1}^{m}\left(x_{i,j} e^{-\vec{\theta}^T \vec{x}_i}\right)\frac{y_i h_\vec{\theta}(\vec{x}_{i}) - h_\vec{\theta}(\vec{x}_{i})^2}{1 - h_\vec{\theta}(\vec{x}_{i})} \\ = -\frac{1}{m}\sum_{i=1}^{m}\left(x_{i,j} e^{-\vec{\theta}^T \vec{x}_i}\right)\frac{y_i - h_\vec{\theta}(\vec{x}_{i})}{\frac{1}{h_\vec{\theta}(\vec{x}_{i})} - 1} \\ = -\frac{1}{m}\sum_{i=1}^{m}\left(x_{i,j} e^{-\vec{\theta}^T \vec{x}_i}\right)\frac{y_i - h_\vec{\theta}(\vec{x}_{i})}{\left(1 + e^{-\vec{\theta}^T \vec{x}_i}\right) - 1} \\ = \frac{1}{m} \sum_{i=1}^{m} x_{i,j} \left(h_\vec{\theta}(\vec{x}_{i}) - y_i\right) $$

So that's our gradient component.
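
As a sketch, the full gradient vector and a single gradient-descent update look like this (current Julia syntax; X, y, and h as in the earlier sketches, and α is the step size):

function gradient(θ, X, y)
    m = size(X, 1)
    errs = [h(θ, X[i, :]) - y[i] for i in 1:m]   # h_θ(x_i) - y_i
    return (1/m) * (X' * errs)                   # j-th entry is (1/m) Σ_i x_{i,j} (h_θ(x_i) - y_i)
end

descend(θ, X, y, α) = θ - α * gradient(θ, X, y)  # one gradient-descent step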

Implementation


In [5]:
using DataFrames
using RDatasets
using PyPlot

# Parameters:
#    training_data is a DataFrames.DataFrame (https://github.com/JuliaStats/DataFrames.jl/blob/master/src/dataframe.jl)
#    containing the training data.
#    
#    reg_formula is a DataFrames.Formula (https://github.com/JuliaStats/DataFrames.jl/blob/master/src/formula.jl)
#    specifying the linear equation to be fitted from training data.
#
# Note this is the same as in linear regression; just the hypothesis function changes!
function linreg(training_data::DataFrame, reg_formula::Formula,
                step_size::Float64, threshold::Float64, max_iter::Float64)
    mf = ModelFrame(reg_formula, training_data)
    mm = ModelMatrix(mf) # mm.m is the matrix data with constant-param 1 column added.
    
    # Size of training set
    m = size(training_data,1)
    # Number of columns in the model matrix (the intercept column plus the n features)
    n = size(mm.m, 2)
    
    # Column vector:
    theta = zeros(Float64, n, 1)
    
    # Hypothesis fcn: sigmoid(theta' * mm.m[i,:])
    # x_i vector: mm.m[i,:]
    # y vector: model_response(mf)
    iter_cnt = 0
    diff = 1
    while diff > threshold && iter_cnt < max_iter
        # grad = (sigmoid(theta dot x_i) - y_i)*x_i
        grad = [(1.0/(1+exp(-sum([theta[j]*(mm.m[i,j]) for j=1:n]))) - model_response(mf)[i])*mm.m[i,:] for i=1:m]
        theta1 = theta - (step_size/m)*sum(grad)'
        diff = sum([abs(x) for x in theta1 - theta])
        theta = theta1
        iter_cnt += 1
        if iter_cnt % 300 == 0
            println("Num. iterations completed:", iter_cnt)
            println(theta)
            
            # Progress check: classification accuracy of the current theta.
            # NOTE: this peeks at the global `df` (columns x1s, x2s, y) defined
            # in the example below, so it is specific to that demo data set.
            incorrect = 0
            correct = 0
            for i=1:size(df)[1]
                d = theta[1] + theta[2]*df["x1s"][i] + theta[3]*df["x2s"][i]
                # prediction matches the label exactly when the two entries sum to 0 or 2 (not 1)
                if sum([1.0/(1.0+exp(-d)) >= 0.5 ? 1 : 0, df["y"][i]]) != 1
                    correct += 1
                else
                    incorrect += 1
                    #println((1.0/(1.0+exp(-d)), df["y"][i]))
                end
            end
            println("Correct, incorrect:",(correct, incorrect))
        end
    end
    
    println(("itercnt", iter_cnt))
    println(theta)
    
    return theta
end


Out[5]:
# methods for generic function linreg
linreg(training_data::DataFrame,reg_formula::Formula,step_size::Float64,threshold::Float64,max_iter::Float64) at In[5]:15

So let's try it out on that hyperbola illustration from earlier.


In [6]:
# Plot some inner and outer data
t1 = PyPlot.linspace(-3, -1, 100)
t2 = PyPlot.linspace(1, 3, 100)
t3 = PyPlot.linspace(-1, 1, 100)

x1_data_1 = []
x2_data_1 = []
y_data_1 = []
x1_data_0 = []
x2_data_0 = []
y_data_0 = []
o = [1 for i in t1]
z = [0 for i in t1]

# Category y = 1 data:
x1_data_1 = vcat(x1_data_1, t1, t2, t1, t2)
x2_data_1 = vcat(x2_data_1,
              [sqrt(t^2 - 1) - 2*rand() for t in t1],
              [sqrt(t^2 - 1) - 2*rand() for t in t2],
              [-sqrt(t^2 - 1) + 2*rand() for t in t1],
              [-sqrt(t^2 - 1) + 2*rand() for t in t2])
y_data_1 = vcat(y_data_1, o, o, o, o)
#y_data_1 = vcat(y_data_1, z, z, z, z)
PyPlot.plot(x1_data_1, x2_data_1, "g.")

# Category y = 0 data:
x1_data_0 = vcat(x1_data_0, t1, t2, t1, t2, t3, t3)
x2_data_0 = vcat(x2_data_0,
              [sqrt(t^2 - 1) + 5*rand() for t in t1],
              [sqrt(t^2 - 1) + 5*rand() for t in t2],
              [-sqrt(t^2 - 1) - 5*rand() for t in t1],
              [-sqrt(t^2 - 1) - 5*rand() for t in t2],
              [8*rand() for t in t3],
              [-8*rand() for t in t3])
y_data_0 = vcat(y_data_0, z, z, z, z, z, z)
#y_data_0 = vcat(y_data_0, o, o, o, o, o, o)
PyPlot.plot(x1_data_0, x2_data_0, "r.")


#df = DataFrame(x1=[t*t for t in vcat(x1_data_1, x1_data_0)], x2=[t*t for t in vcat(x2_data_1, x2_data_0)], y=vcat(y_data_1, y_data_0))
df = DataFrame(x1=vcat(x1_data_1, x1_data_0),
               x1s=[t*t for t in vcat(x1_data_1, x1_data_0)],
               x2=vcat(x2_data_1, x2_data_0),
               x2s=[t*t for t in vcat(x2_data_1, x2_data_0)],
               y=vcat(y_data_1, y_data_0))
head(df)


Out[6]:
6x5 DataFrame:
              x1     x1s      x2     x2s y
[1,]        -3.0     9.0 1.14475 1.31044 1
[2,]     -2.9798  8.8792 2.42485 5.87989 1
[3,]     -2.9596 8.75921 2.44878 5.99652 1
[4,]    -2.93939 8.64004 1.42315 2.02536 1
[5,]    -2.91919 8.52168 2.71293    7.36 1
[6,]    -2.89899 8.40414 1.89328 3.58451 1

Let's see how well the parameters that we know a priori fit:


In [7]:
p = [-1,1,-1]
incorrect = 0
correct = 0
for i=1:size(df)[1]
    d = p[1] + p[2]*df["x1s"][i] + p[3]*df["x2s"][i]
    # prediction matches the label exactly when the two entries sum to 0 or 2 (not 1)
    if sum([1.0/(1.0+exp(-d)) >= 0.5 ? 1 : 0, df["y"][i]]) != 1
        correct += 1
    else
        incorrect += 1
        #println((1.0/(1.0+exp(-d)), df["y"][i]))
    end
end
println("Correct, incorrect:",(correct, incorrect))


Correct, incorrect:(965,35)

In [8]:
#Formula(:(y ~ x1 + x2 + x1s + x2s)),
p = linreg(df,
           Formula(:(y ~ x1s + x2s)),
           0.2,
           1e-6,
           1e5)


WARNING: contains(collection, item) is deprecated, use in(item, collection) instead
 in contains at reduce.jl:238
Num. iterations completed:300
0.21164818496353752
1.649599871280909
-1.7921767237879513

Correct, incorrect:(964,36)
Num. iterations completed:600
0.1332428311514905
1.9160648699471545
-2.08132663156161

Correct, incorrect:(964,36)
Num. iterations completed:900
0.06373771901587623
2.064061313440049
-2.234913683720989

Correct, incorrect:(965,35)
Num. iterations completed:1200
0.011508198031677684
2.156872178060326
-2.328906673755346

Correct, incorrect:(968,32)
Num. iterations completed:1500
-0.026195365984452785
2.2184481396079176
-2.3904383160144618

Correct, incorrect:(969,31)
Num. iterations completed:1800
-0.053082401932249994
2.2606052530927565
-2.432267759265166

Correct, incorrect:(968,32)
Num. iterations completed:2100
-0.07220086533026651
2.2900259623141213
-2.461358026511765

Correct, incorrect:(967,33)
Num. iterations completed:2400
-0.08580531008602572
2.310812228969391
-2.4818805117490164

Correct, incorrect:(967,33)
Num. iterations completed:2700
-0.09550659129400778
2.325618269726511
-2.496493099227116

Correct, incorrect:(967,33)
Num. iterations completed:3000
-0.10244207895229314
2.3362229276761197
-2.5069610933931292

Correct, incorrect:(967,33)
Num. iterations completed:3300
-0.10741268508641993
2.343847215119808
-2.514490419321776

Correct, incorrect:(967,33)
Num. iterations completed:3600
-0.1109831485509471
2.3493431679351375
-2.519920767136393

Correct, incorrect:(967,33)
Num. iterations completed:3900
-0.1135528880672938
2.353312196121452
-2.5238444638580937

Correct, incorrect:(967,33)
Num. iterations completed:4200
-0.11540541658942921
2.356182220529168
-2.526683057926545

Correct, incorrect:(967,33)
Num. iterations completed:4500
-0.11674269970817235
2.35825943852446
-2.528738373632415

Correct, incorrect:(967,33)
Num. iterations completed:4800
-0.11770908694835452
2.3597638200020885
-2.5302274061449634

Correct, incorrect:(967,33)
Num. iterations completed:5100
-0.11840804671276696
2.3608538336024445
-2.531306602961207

Correct, incorrect:(967,33)
Num. iterations completed:5400
-0.1189139257856299
2.3616438688127412
-2.532088975762935

Correct, incorrect:(967,33)
Num. iterations completed:5700
-0.1192802538220378
2.362216613295947
-2.5326562674716757

Correct, incorrect:(967,33)
Num. iterations completed:6000
-0.11954563541555063
2.362631898532803
-2.533067657202781

Correct, incorrect:(967,33)
Num. iterations completed:6300
-0.1197379480212347
2.3629330484246065
-2.533366015072821

Correct, incorrect:(967,33)
Num. iterations completed:6600
-0.11987734360513642
2.363151449624313
-2.533582409854584

Correct, incorrect:(967,33)
Num. iterations completed:6900
-0.11997840137907034
2.3633098488528668
-2.5337393641626176

Correct, incorrect:(967,33)
("itercnt",7053)
-0.12001870456327586
2.3633730364427774
-2.5338019778768057

Out[8]:
3x1 Array{Float64,2}:
 -0.120019
  2.36337 
 -2.5338
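
To close the loop, here's how the fitted parameters above would be used to classify a new raw point $(x_1, x_2)$ (a quick sketch in current Julia syntax; recall the model was fit on the squared features x1s and x2s):

θ_fit = [-0.120019, 2.36337, -2.5338]   # from Out[8]
classify(x1, x2) = 1 / (1 + exp(-(θ_fit[1] + θ_fit[2]*x1^2 + θ_fit[3]*x2^2))) >= 0.5 ? 1 : 0
classify(-2.0, 0.0)   # 1 -- i.e. green, matching the worked example earlier
classify(0.5, -1.0)   # 0 -- i.e. red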