Consider a ridge regression problem as follows:

$$\text{minimize } \ \frac{1}{2} \sum_{i=1}^n (w^T x_i - y_i)^2 + \frac{\lambda}{2} \|w\|^2$$

This problem is a special case of a more general family of problems called regularized empirical risk minimization, where the objective function typically consists of two parts: a sum of loss terms and a regularization term.
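
In its general form, such an objective can be written as

$$\text{minimize } \ \sum_{i=1}^n \ell(w^T x_i, y_i) + \lambda \, r(w)$$

where $\ell$ is a loss function and $r$ is a regularizer (generic notation for this exposition, not identifiers from the package). Taking $\ell(u, y) = \frac{1}{2}(u - y)^2$ and $r(w) = \frac{1}{2}\|w\|^2$ recovers the ridge regression objective above.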

Now, we show how to use the SGDOptim package to solve such a problem. First, we prepare some simulated data:


In [2]:
w = [3.0, -4.0, 5.0];   # the underlying model coefficients


Out[2]:
3-element Array{Float64,1}:
  3.0
 -4.0
  5.0

In [3]:
n = 10000; X = randn(3, n);  # generate 10000 sample features


Out[3]:
3x10000 Array{Float64,2}:
  1.20754  -0.106163  1.2656   -0.195827  …   0.926012   1.33045    0.24129 
  1.36097  -0.86756   1.05529   0.609302      1.11368   -1.57341   -1.165   
 -1.05901   0.648764  0.67563   0.276843     -0.262188   0.425759  -0.052067

In [4]:
sig = 0.1; y = vec(w'X) + sig * randn(n); # generate the responses, adding some noise


Out[4]:
10000-element Array{Float64,1}:
  -7.16593
   6.56201
   2.89612
  -1.69577
   4.10703
  11.8532 
  14.9927 
   3.83893
 -22.1367 
 -10.1685 
  -6.24833
  -3.17315
   5.49029
   ⋮      
  10.8841 
  11.0071 
  -7.784  
  12.7698 
  -1.27293
  -2.37607
  -5.78557
  -8.26047
  -9.313  
  -3.02949
  12.331  
   5.12746

Now, let's try to estimate $w$ from the data using SGDOptim. We first load the package:


In [5]:
using SGDOptim

The first step is to construct a risk model, which comprises a prediction model and a loss function.


In [8]:
rmodel = riskmodel(LinearPred(3),  # use linear prediction x -> w'x, 3 is the input dimension
                   SqrLoss())      # use squared loss: loss(u, y) = (u - y)^2/2


Out[8]:
EmpiricalRisks.SupervisedRiskModel{EmpiricalRisks.LinearPred,EmpiricalRisks.SqrLoss}(EmpiricalRisks.LinearPred(3),EmpiricalRisks.SqrLoss())

Now, we are ready to solve the problem:


In [10]:
w_e = sgd(rmodel,
    zeros(3),      # the initial guess
    minibatch_seq(X, y, 10),    # supply the data in mini-batches, each with 10 samples
    reg = SqrL2Reg(1.0e-4),     # add a squared L2 regularizer with coefficient 1.0e-4
    lrate = t->1.0/(100.0 + t), # set the learning rate schedule
    cbinterval = 100,           # invoke the callback every 100 iterations
    callback = simple_trace)    # print the optimization trace via the callback


Iter 100: avg.loss = 6.5156e-03
Iter 200: avg.loss = 4.5992e-03
Iter 300: avg.loss = 4.8039e-03
Iter 400: avg.loss = 8.1380e-03
Iter 500: avg.loss = 7.6406e-03
Iter 600: avg.loss = 1.0267e-02
Iter 700: avg.loss = 1.6335e-02
Iter 800: avg.loss = 1.3647e-02
Iter 900: avg.loss = 1.1578e-02
Iter 1000: avg.loss = 1.0762e-02
Out[10]:
3-element Array{Float64,1}:
  3.00284
 -3.99607
  5.0021 
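
Under the hood, each iteration takes the current mini-batch $B_t$, evaluates the gradient of the regularized objective on it, and steps in the negative gradient direction with step size $\eta_t$ given by lrate. For this squared-loss model with a squared L2 regularizer, one standard form of the update is

$$w \leftarrow w - \eta_t \left( \frac{1}{|B_t|} \sum_{i \in B_t} (w^T x_i - y_i)\, x_i + \lambda w \right)$$

(written here for reference; the exact scaling of the mini-batch term and the way SGDOptim applies the regularizer may differ in its implementation).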

Note that the 10000 samples are partitioned into 1000 mini-batches of size 10, so the run above consists of 1000 iterations, each using a single mini-batch.
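
For intuition, the partitioning itself is just column slicing. A minimal sketch in plain Julia of how such mini-batches could be formed (an illustration only, not how minibatch_seq is implemented) looks like this:

batchsize = 10
nbatches = div(n, batchsize)            # 10000 / 10 = 1000 mini-batches
for k = 1:nbatches
    idx = (k-1)*batchsize + 1 : k*batchsize
    Xb = X[:, idx]                      # 3x10 block of features
    yb = y[idx]                         # the matching 10 responses
    # one SGD iteration would consume (Xb, yb)
end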

Now let's compare the estimated solution with the ground truth:


In [12]:
sumabs2(w_e - w) / sumabs2(w)   # relative squared error: ||w_e - w||^2 / ||w||^2


Out[12]:
5.582303599714307e-7

The relative squared error is on the order of $10^{-7}$, so the estimate is quite accurate. We are done!
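
As an optional sanity check, we can also compare with the closed-form solution of the same ridge objective, which satisfies $(X X^T + \lambda I)\, w = X y$. The snippet below is a minimal sketch assuming the same session and Julia version as above, with lam matching the SqrL2Reg coefficient; it does not use SGDOptim:

lam = 1.0e-4                               # same coefficient as SqrL2Reg above
w_cf = (X * X' + lam * eye(3)) \ (X * y)   # closed-form ridge solution
sumabs2(w_cf - w) / sumabs2(w)             # relative squared error vs. the ground truth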