Overview

Describing and implementing multivariate linear regression in Julia.

Definitions:
Let the set of $m$ training examples be $T = \{(\vec{x}_{i}, y_i)\}$ for $1 \leq i \leq m$,
where $\vec{x}_{i} \in \mathbb{R}^n$ is the $i^\text{th}$ training example input ($n$ is the number of features), and $y_i \in \mathbb{R}$ is the $i^\text{th}$ training output.
Let $x_{i,j}$ denote the $j^\text{th}$ feature of the $i^\text{th}$ training example.

The linear regression hypothesis function $h_\vec{\theta} : \mathbb{R}^n \rightarrow \mathbb{R}$ has the form:
$$h_\vec{\theta}(\vec{x}) = \sum_{i=0}^{n} \theta_{i} x_{i} = \vec{\theta}^{T} \vec{x},$$ where we augment each input with $x_0 = 1$, and let $\vec{\theta} \in \mathbb{R}^{n+1}$ be the parameter vector.

The cost function $J : \mathbb{R}^{n+1} \rightarrow \mathbb{R}$ has form:
$$J(\vec{\theta}) = \frac{1}{2 m}\sum_{i=1}^{m}(h_\vec{\theta}(\vec{x}_{i}) - y_i)^2 = \frac{1}{2 m}\sum_{i=1}^{m}(\vec{\theta}^{T} \vec{x}_{i} - y_i)^2$$
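To make these definitions concrete, here is a minimal Julia sketch (names of my own, not the implementation used later in this post), assuming a design matrix X of size m x (n+1) whose first column is all ones and a target vector y of length m:

    # Sketch only: hypothesis and cost for an augmented design matrix X and target vector y.
    h(theta, x) = sum(theta .* x)            # h_theta(x) = theta' * x, with x_0 = 1 included in x

    function cost(theta, X, y)
        m = length(y)
        residuals = X * theta - y            # h_theta(x_i) - y_i for each training example
        return sum(residuals .^ 2) / (2m)    # J(theta)
    end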

We want to minimize the cost function $J$ by varying the parameter vector $\vec{\theta}$.

In this case, we can analytically solve this problem by solving the system of equations for $(\theta_0, \theta_1, \dotsc, \theta_{n})^T$ given by $\frac{\partial}{\partial \theta_i}{J(\vec{\theta})} = 0$ for all $i \in \{0,1,...,n\}$.
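As a sketch of that analytic route (under the same assumed X and y as in the sketch above), setting the gradient to zero gives the usual normal equations, which Julia can solve directly:

    # Sketch only: closed-form solution of grad J(theta) = 0 (the normal equations).
    theta_exact = (X' * X) \ (X' * y)    # equivalently X \ y, a least-squares solve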

Alternatively, we can use an optimization algorithm like gradient descent, which can also be recycled for other minimization/maximization problems, so I'll go over that approach.

Gradient Descent

One method to find the parameter vector is the gradient descent algorithm.

The idea in gradient descent is similar to the idea in single variable calculus:

Here we know that $\frac{d}{dx} f(x) = 0$ at the extrema (points A and B), and that $\frac{d}{dx} f(x)$ can otherwise be used to direct us toward the nearest (local) minimum/maximum (e.g. the slope of the lines at points C, D, E, F).

For a multivariate function $f : \mathbb{R}^n \rightarrow \mathbb{R}$, we use the vector of partial derivatives, i.e. the gradient $\vec{\nabla} f(\vec{x})$, to guide us to a minimum.
(Mnemonic side note: the Latin root "gradi" means "to walk"--with the gradient we walk uphill at the steepest grade.)

For example, in the figure below, a gradient vector field is shown projected beneath a plot of the function $f(x,y) = -(\cos^2 x + \cos^2 y)^2$.

Note that the vectors point in the direction of maximum ascent--i.e. away from the minimum. Thus, if we want to minimize our cost function $J$ in the linear regression problem, we need to look in the opposite direction: $-\vec{\nabla} f(\vec{x})$.

To see how we actually follow that direction, consider the Taylor expansion of $f(\vec{x}_0 + h\vec{\epsilon})$, where $\vec{\epsilon} \in \mathbb{R}^{n}$ is a unit vector pointing in the direction we step from the starting point $\vec{x}_0$, and $h \in \mathbb{R}$ is a small constant "step size":
$$ f(\vec{x}_0 + h\vec{\epsilon}) = f(\vec{x}_0) + h\vec{\nabla} f(\vec{x}_0) \cdot \vec{\epsilon} + h^2\text{(const. error)} $$

If we take $h$ to be small enough that the $h^2$ term is negligible, we can regard $h^2\text{(const. error)} \approx 0$, so that: $$ f(\vec{x}_0 + h\vec{\epsilon}) \approx f(\vec{x}_0) + h\vec{\nabla} f(\vec{x}_0) \cdot \vec{\epsilon} $$

We want to choose a vector $\vec{\epsilon}$ that minimizes $f(\vec{x}_0 + h\vec{\epsilon})$.
i.e. that minimizes $f(\vec{x}_0) + h\vec{\nabla} f(\vec{x}_0) \cdot \vec{\epsilon}$.

Note that $\vec{\nabla} f(\vec{x}_0) \cdot \vec{\epsilon} = \lvert \vec{\nabla} f(\vec{x}_0) \rvert \lvert \vec{\epsilon} \rvert \cos \delta$ where $\delta$ is the angle between $\vec{\nabla} f(\vec{x}_0)$ and $\vec{\epsilon}$.
By choosing unit-vector $\vec{\epsilon} = \frac{-\vec{\nabla} f(\vec{x}_0)}{\lvert \vec{\nabla} f(\vec{x}_0) \rvert}$, we have $\cos(\delta) = -1$, its least value.

So, taking $\vec{x}_1 = \vec{x}_0 + h\vec{\epsilon} = \vec{x}_0 - h\vec{\nabla} f(\vec{x}_0)$ for some step size $h$ (absorbing the normalizing factor $\frac{1}{\lvert \vec{\nabla} f(\vec{x}_0) \rvert}$ into $h$),
iterating $\vec{x}_{i+1} = \vec{x}_{i} - h\vec{\nabla} f(\vec{x}_{i})$, and checking whether $f(\vec{x}_{i+1}) - f(\vec{x}_{i})$ falls within some convergence threshold, we have the basic gradient descent algorithm.
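A rough Julia sketch of that iteration (a generic helper of my own, separate from the linreg implementation later in this post; the names and arguments are placeholders):

    # Sketch only: generic gradient descent on a function f with gradient grad_f.
    function descend(f, grad_f, x0; h=0.01, threshold=1e-8, max_iter=100000)
        x = x0
        for k in 1:max_iter
            x_next = x - h * grad_f(x)          # step against the gradient
            if abs(f(x_next) - f(x)) < threshold
                return x_next                    # converged within the threshold
            end
            x = x_next
        end
        return x                                 # hit the iteration cap
    end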

To implement it with our cost function, we need the $i^\text{th}$ component of $\vec{\nabla}J(\vec{\theta})$.
This is given by: $\frac{\partial}{\partial \theta_i} J(\vec{\theta}) = \frac{1}{m} \sum_{j=1}^{m}x_{j,i}(\vec{\theta}^T \vec{x}_j - y_j)$
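In vectorized form (a sketch under the same X, y conventions as the earlier sketches), all of these components can be collected into a single matrix product:

    # Sketch only: grad J(theta); component i is (1/m) * sum_j x_{j,i} * (theta' * x_j - y_j).
    grad_J(theta, X, y) = (X' * (X * theta - y)) / length(y)

Plugging cost and grad_J into the descend helper above, e.g. descend(t -> cost(t, X, y), t -> grad_J(t, X, y), zeros(size(X, 2))), would be one way to fit theta; below I instead write the update out explicitly against the DataFrames structures.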

Metaprogramming and DataFrames Formulas

Julia has "metaprogramming" support: http://docs.julialang.org/en/latest/manual/metaprogramming/

The DataFrames package provides a "Formula" object for use in regression modeling: DataFrames.Formula

The syntax of formulae seems similar to that of R formulae, which is in turn derived from the "S" language:

Chambers and Hastie (1993), "Statistical Models in S", is the definitive reference for formulas in S and thus in R. Their conception of formulas has since been extended for other uses, and there is an R package, "Formula", that defines a new object class Formula inheriting from the base class formula. At the R command line, ?formula will give you some details.

The design notes describe the language of the formulas: Design Notes
And we can just look at the source: https://github.com/JuliaStats/DataFrames.jl/blob/master/src/formula.jl
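Concretely, a formula wraps a quoted Julia expression; for instance (the same pattern used in the cells below):

    # A response ~ predictors formula, built from a quoted expression:
    f = Formula(:(Petal_Width ~ Sepal_Length + Sepal_Width + Petal_Length))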


In [1]:
using DataFrames
using RDatasets

iris = data("datasets", "iris")
# The DataFrames formula parser seems to take issue with column names having periods
# and I don't know if there's some proper way to escape them--so I'll just rename them:
# Uses string.replace: https://github.com/JuliaLang/julia/blob/master/base/string.jl
# and dataframe.rename: https://github.com/JuliaStats/DataFrames.jl/blob/master/src/dataframe.jl
# 
# Also note that we need to use "rename!" to operate in place.
for col_name in DataFrames.colnames(iris)
    DataFrames.rename!(iris, col_name, replace(col_name, r"\.", "_"))
end

print(head(iris))


6x6 DataFrame:
          Sepal_Length Sepal_Width Petal_Length Petal_Width  Species
[1,]    1          5.1         3.5          1.4         0.2 "setosa"
[2,]    2          4.9         3.0          1.4         0.2 "setosa"
[3,]    3          4.7         3.2          1.3         0.2 "setosa"
[4,]    4          4.6         3.1          1.5         0.2 "setosa"
[5,]    5          5.0         3.6          1.4         0.2 "setosa"
[6,]    6          5.4         3.9          1.7         0.4 "setosa"

In [1]:
# Suppose we want to predict petal width as a function of sepal.length, sepal.width, and petal.length
# for the "setosa" species.
# data_set = iris[:(Species .== "setosa"), :][["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]]
mf = ModelFrame(Formula(:(Petal_Width ~ Sepal_Length + Sepal_Width + Petal_Length)),
                iris[:(Species .== "setosa"), :])
mm = ModelMatrix(mf);


WARNING: contains(collection, item) is deprecated, use in(item, collection) instead
 in contains at reduce.jl:238

Implementation

So now we can implement a linear regression model using the DataFrames structures explored above.


In [2]:
using DataFrames
using RDatasets
using PyPlot

In [3]:
# Parameters:
#    training_data is a DataFrames.DataFrame (https://github.com/JuliaStats/DataFrames.jl/blob/master/src/dataframe.jl)
#    containing the training data.
#    
#    reg_formula is a DataFrames.Formula (https://github.com/JuliaStats/DataFrames.jl/blob/master/src/formula.jl)
#    specifying the linear equation to be fitted from training data.
#
# Example:
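#     e.g. (matching the call in In [4] below):
#     linreg(iris[:(Species .== "setosa"), :], Formula(:(Petal_Width ~ Sepal_Length)), 0.075, 1e-6, 1e7)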
# 
function linreg(training_data::DataFrame, reg_formula::Formula,
                step_size::Float64, threshold::Float64, max_iter::Float64)
    mf = ModelFrame(reg_formula, training_data)
    mm = ModelMatrix(mf) # mm.m is the matrix data with constant-param 1 column added.
    
    # Size of training set
    m = size(training_data,1)
    # Number of columns in the model matrix (features + intercept term)
    n = size(mm, 2)
    
    # Column vector:
    theta = zeros(Float64, n, 1)
    
    # Hypothesis fcn: h_theta(x_i) = theta' * x_i, with x_i the i-th row of mm.m
    # x_i vector: mm.m[i,:]
    # y vector: model_response(mf)
    iter_cnt = 0
    diff = 1
    while diff > threshold && iter_cnt < max_iter
        # grad = (theta dot x_i - y_i)*x_i
        grad = [(sum([theta[j]*(mm.m[i,j]) for j=1:n]) - model_response(mf)[i])*mm.m[i,:] for i=1:m]
        theta1 = theta - (step_size/m)*sum(grad)'
        diff = sqrt(sum([x^2 for x in theta1 - theta]))
        theta = theta1
        iter_cnt += 1
        if iter_cnt % 3000 == 0
            println(("itercnt", iter_cnt))
            println(theta)
        end
    end
    
    println(("itercnt", iter_cnt))
    println(theta)
    
    return theta
end


Out[3]:
# methods for generic function linreg
linreg(training_data::DataFrame,reg_formula::Formula,step_size::Float64,threshold::Float64,max_iter::Float64) at In[3]:12

Let's try it out:


In [4]:
# Get some test data:
iris_test = data("datasets", "iris")
for col_name in DataFrames.colnames(iris_test)
    DataFrames.rename!(iris_test, col_name, replace(col_name, r"\.", "_"))
end

# 2D test on iris data:
p = linreg(iris[:(Species .== "setosa"), :],
           Formula(:(Petal_Width ~ Sepal_Length)),
           0.075,
           1e-6,
           1e7)


("itercnt",3000)
-.10715174451593251
.07060429763094006

("itercnt",6000)
-.1480795298978382
.0787420140181157

("itercnt",9000)
-.16244791818262228
.08159889639548676

("itercnt",11916)
-.16741101563239197
.08258571449647287

Out[4]:
2x1 Array{Float64,2}:
 -0.167411 
  0.0825857

Let's plot what we got:


In [5]:
n = length(iris[:(Species .== "setosa"), :]["Sepal_Length"])

PyPlot.plt.clf()
PyPlot.plot(iris[:(Species .== "setosa"), :]["Sepal_Length"],
            iris[:(Species .== "setosa"), :]["Petal_Width"],
            "k.")

t = PyPlot.linspace(3, 6, n)
PyPlot.plot(t, p[1] + p[2]*t, "g.-")
#PyPlot.plot(t, t1 + t0*t, "b.-")
plt.gcf()


Out[5]:

Let's compare it to an existing library:


In [6]:
using PyCall
@pyimport numpy

n = length(iris[:(Species .== "setosa"), :]["Sepal_Length"])

PyPlot.plt.clf()
PyPlot.plot(iris[:(Species .== "setosa"), :]["Sepal_Length"],
            iris[:(Species .== "setosa"), :]["Petal_Width"],
            "k.")
(t0,t1) = numpy.polyfit(iris[:(Species .== "setosa"), :]["Sepal_Length"],
                        iris[:(Species .== "setosa"), :]["Petal_Width"],
                        1)

t = PyPlot.linspace(3, 6, n)
PyPlot.plot(t, t1 + t0*t, "b.-")
plt.gcf()


Out[6]:

Let's try it in 3d.


In [7]:
# Get some test data:
iris_test = data("datasets", "iris")
for col_name in DataFrames.colnames(iris_test)
    DataFrames.rename!(iris_test, col_name, replace(col_name, r"\.", "_"))
end

# 3D test on iris data (two features plus the response):
p = linreg(iris[:(Species .== "setosa"), :],
           Formula(:(Petal_Width ~ Sepal_Length + Petal_Length)),
           0.035,
           1e-6,
           1e7)

# Not quite sure how best to 3d plot in Julia.
# Matplotlib has a mplot3d package, but I couldn't quickly get it going:
#@pyimport mpl_toolkits
#fig = plt.gcf()
#ax = fig.add_subplot(111, projection="3d")
# 
# There's also "gaston": https://github.com/mbaz/Gaston.jl
# which is a front-end for gnuplot, which is 3d capable.


Warning: imported binding for transpose overwritten in module __anon__

("itercnt",3000)
-.09648622553722243
.02965471557163424
.13330457112510494

("itercnt",6000)
-.17009784204457326
.03972411256015653
.14897457172630543

("itercnt",9000)
-.21793550079522567
.04720737308602839
.1559362434658068

("itercnt",12000)
-.24885154061053935
.05209105105883925
.16027257368663408

("itercnt",15000)
-.26882295573604464
.05524826430352748
.1630655231574656

("itercnt",18000)
-.2817238244036604
.05728784145979987
.16486925500535707

("itercnt",21000)
-.2900573332222357
.058605343022005554
.16603438116269206

("itercnt",24000)
-.2954404865817103
.05945640297230445
.16678701056175416

("itercnt",26610)
-.2985469303438554
.05994752216845585
.1672213286058956

Out[7]:
3x1 Array{Float64,2}:
 -0.298547 
  0.0599475
  0.167221