Nonlinear optimization using JuMP (Julia for Mathematical Programming)

By Tyler Ransom, Duke University Social Science Research Institute

tyler.ransom@duke.edu

https://github.com/tyleransom

Let's examine how to use a Julia package called JuMP to illustrate some of Julia's powerful resources for optimizing objective functions of all kinds.

For an overview of the JuMP package, see its documentation. Essentially, JuMP makes use of Julia's macros to greatly simplify the construction and optimization of various problems.

Here are some features of JuMP:

  • interfaces seamlessly with many industry-grade solvers (e.g. Gurobi, CPLEX, Ipopt)
  • can be used to solve linear programming, nonlinear programming, and many other types of problems (including constrained optimization)
  • computes derivatives of the objective function via automatic differentiation (rather than numerical differentiation), resulting in speed gains
  • user-friendly model construction: the user simply writes the objective function and any constraints in mathematical notation; JuMP's macros then translate this into efficient derivative and solver code at compile time

In this illustration, we will use JuMP for nonlinear programming problems (specifically, maximum likelihood estimation). We will be using the open-source optimizer Ipopt (pronounced eye-PEE-opt).

So let's load the packages we'll need and get started!


In [ ]:
using JuMP
using Ipopt

Now let's create some data inside of a function:


In [1]:
function datagen(N,T)
    ## Generate data for a linear model to test optimization
    srand(1234)
    
    N = convert(Int64,N) #inputs to functions such as -ones- need to be integers!
    T = convert(Int64,T) #inputs to functions such as -ones- need to be integers!
    n = convert(Int64,N*T) # total number of observations
    
    # generate the Xs
    X = cat(2,ones(N*T,1),5+3*randn(N*T,1),rand(N*T,1),
               2.5+2*randn(N*T,1),15+3*randn(N*T,1),
               .7-.1*randn(N*T,1),5+3*randn(N*T,1),
               rand(N*T,1),2.5+2*randn(N*T,1),
               15+3*randn(N*T,1),.7-.1*randn(N*T,1),
               5+3*randn(N*T,1),rand(N*T,1),2.5+2*randn(N*T,1),
               15+3*randn(N*T,1),.7-.1*randn(N*T,1));
    
    # generate the betas (X coefficients)
    bAns = [ 2.15; 0.10;  0.50; 0.10; 0.75; 1.2; 0.10;  0.50; 0.10; 0.75; 1.2; 0.10;  0.50; 0.10; 0.75; 1.2 ]
    
    # generate the std dev of the errors
    sigAns = 0.3
    
    # generate the Ys from the Xs, betas, and error draws
    draw = sigAns*randn(N*T,1) # mean-zero normal errors
    y = X*bAns+draw
    
    # return generated data so that other functions (below) have access
    return X,y,bAns,sigAns,n
end


Out[1]:
datagen (generic function with 1 method)

Now let's evaluate the function so that the function outputs are in the current scope:


In [3]:
X,y,bAns,sigAns,n = datagen(1e4,5);

OLS estimation

We can estimate the bAns parameter vector using OLS:


In [4]:
bHat = (X'*X)\(X'*y);
sigHat = sqrt((y-X*bHat)'*(y-X*bHat)/(n-size(X,2)));

And now we can compare the estimates with the solution:


In [5]:
[round([bHat;sigHat],3) [bAns;sigAns]]


Out[5]:
17x2 Array{Float64,2}:
 2.101  2.15
 0.1    0.1 
 0.497  0.5 
 0.1    0.1 
 0.751  0.75
 1.215  1.2 
 0.1    0.1 
 0.505  0.5 
 0.1    0.1 
 0.75   0.75
 1.208  1.2 
 0.1    0.1 
 0.509  0.5 
 0.101  0.1 
 0.75   0.75
 1.221  1.2 
 0.3    0.3 

The estimator performs well.
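The closed-form OLS solution $(X'X)^{-1}(X'y)$ used above is language-agnostic. As an illustrative aside (in Python, standard library only), the one-regressor special case with an intercept recovers known coefficients exactly from noiseless data:

```python
# OLS for y = a + b*x via the simple-regression closed form,
# a special case of (X'X)^{-1} X'y
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 + 0.5 * x for x in xs]  # noiseless data with a = 2.0, b = 0.5

n_obs = len(xs)
xbar = sum(xs) / n_obs
ybar = sum(ys) / n_obs
# slope: sample covariance of (x, y) over sample variance of x
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
    sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar  # intercept recovered from the sample means
print(a, b)  # 2.0 0.5
```

With noisy data (as in datagen() above), the recovered coefficients are close to, but not exactly equal to, the true values.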

Estimating the parameters of a normal MLE using JuMP

Now let's estimate this using maximum likelihood under the (valid) assumption that the errors in our datagen() function are independently drawn from a normal distribution with mean 0 and standard deviation to be estimated.

Recall that the likelihood function for this scenario is: $$ L = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp \left( -\frac{\left(y_{i}-x_{i}\beta\right)^{2}}{2\sigma^{2}}\right) $$

or, more simply,

$$ L = \prod_{i=1}^{N} \frac{1}{\sigma} \phi \left(\frac{y_{i}-x_{i}\beta}{\sigma}\right) $$

where $\phi\left(\cdot\right)$ is the probability density function of the standard normal distribution.

In maximum likelihood estimation, we search for the parameters that maximize the log likelihood function. Our function above becomes:

$$ \ell = \sum_{i=1}^{N} \left[ -\frac{1}{2}\log(2\pi) - \log(\sigma) -\frac{\left(y_{i}-x_{i}\beta\right)^{2}}{2\sigma^{2}} \right] $$
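As a quick check on the algebra (in Python for illustration), each summand above should equal the log of the normal density it was derived from:

```python
import math

def normal_pdf(y, mu, sigma):
    # density of N(mu, sigma^2) evaluated at y
    return math.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def loglik_term(y, mu, sigma):
    # one summand of the log likelihood, as written above
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - (y - mu) ** 2 / (2 * sigma ** 2)

# the two agree for any y, mu, and sigma > 0
print(abs(loglik_term(1.3, 0.8, 0.5) - math.log(normal_pdf(1.3, 0.8, 0.5))))  # ~0
```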

JuMP syntax

JuMP requires the following syntax elements:

  • model name
  • solver
  • variable (i.e. a parameter to search over)
  • objective function
  • solve() call

We will construct these line-by-line below.

First, let's declare the model name and solver:


In [6]:
myMLE = Model(solver=IpoptSolver(tol=1e-6));

Now let's tell JuMP which variables to search over. For the example above, this is the vector $\beta$ and the scalar $\sigma$.


In [7]:
@defVar(myMLE, b[i=1:size(X,2)]);
@defVar(myMLE, s>=0.0);

Now we will write the objective function, which is the log likelihood function from above. We will also tell JuMP that we are maximizing.


In [8]:
@setNLObjective(myMLE, Max, (n/2)*log(1/(2*pi*s^2))-sum{(y[i]-sum{X[i,k]*b[k], 
                            k=1:size(X,2)})^2, i=1:size(X,1)}/(2s^2));

Note that we specify the summation term over $i$ using sum{ ,i=1:size(X,1)}. We also have to specify the linear algebra implied by X*b as another sum, this time over $k$. This is because JuMP doesn't understand matrix multiplication. This last fact adds a little bit of syntax to the objective function, but overall it is very readable.
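The equivalence between the scalar-sum form JuMP requires and ordinary matrix-vector multiplication is easy to verify; a small illustrative sketch (Python, standard library only):

```python
# X*b written as an explicit double sum (the form JuMP requires):
# for each row i, sum X[i,k]*b[k] over columns k
X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
b = [0.5, -1.0, 2.0]

Xb = [sum(X[i][k] * b[k] for k in range(len(b))) for i in range(len(X))]
print(Xb)  # [4.5, 9.0]
```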

Now that we have all of our elements set up, we can issue a solve() call and ask JuMP to perform the optimization:


In [9]:
status = solve(myMLE)


Out[9]:
:Optimal

We see that the solver finished at a solution that is a local maximum. To store the estimated values of the parameters, we can use the getValue() function, applied to each parameter separately:


In [10]:
bOpt = getValue(b[:]);
sOpt = getValue(s);

And if we compare them with the solution, we see that we have indeed obtained the correct parameter estimates:


In [11]:
[round([bOpt;sOpt],3) [bAns;sigAns]]


Out[11]:
17x2 Array{Float64,2}:
 2.101  2.15
 0.1    0.1 
 0.497  0.5 
 0.1    0.1 
 0.751  0.75
 1.215  1.2 
 0.1    0.1 
 0.505  0.5 
 0.1    0.1 
 0.75   0.75
 1.208  1.2 
 0.1    0.1 
 0.509  0.5 
 0.101  0.1 
 0.75   0.75
 1.221  1.2 
 0.3    0.3 

We can also return other helpful values such as the objective function value at the optimum, which is the log likelihood of the estimated model.


In [12]:
getObjectiveValue(myMLE)


Out[12]:
-10714.712841850938

Additionally, we can obtain the Hessian matrix of the model, which is used to compute the variance-covariance matrix for statistical inference. The syntax for this is a bit messy, but I will put it below for reference:


In [13]:
this_par = myMLE.colVal
m_eval = JuMP.JuMPNLPEvaluator(myMLE);
MathProgBase.initialize(m_eval, [:ExprGraph, :Grad, :Hess])
hess_struct = MathProgBase.hesslag_structure(m_eval)
hess_vec = zeros(length(hess_struct[1]))
numconstr = length(m_eval.m.linconstr) + length(m_eval.m.quadconstr) + length(m_eval.m.nlpdata.nlconstr)
dimension = length(myMLE.colVal)
MathProgBase.eval_hesslag(m_eval, hess_vec, this_par, 1.0, zeros(numconstr))
this_hess_ld = sparse(hess_struct[1], hess_struct[2], hess_vec, dimension, dimension)
hOpt = this_hess_ld + this_hess_ld' - sparse(diagm(diag(this_hess_ld)));
hOpt = -full(hOpt); #since we are maximizing
seOpt = sqrt(diag(full(hOpt)\eye(size(hOpt,1))));

The above code obtained the estimated Hessian matrix, took its negative (because we are maximizing), and then obtained standard errors of our parameter estimates from the diagonal elements of the inverse of the negative Hessian.
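For the normal linear model this procedure has a known closed form: the $\beta$-block of the inverse negative Hessian is $\sigma^{2}(X'X)^{-1}$, the usual OLS variance. A quick standard-library Python check for the one-regressor, known-$\sigma$ case (illustrative values, not the notebook's data):

```python
import math

# For y = b*x + e, e ~ N(0, sigma^2) with sigma known, the second derivative
# of the log likelihood with respect to b is -sum(x_i^2)/sigma^2,
# so se(b) = sigma / sqrt(sum(x_i^2))
xs = [1.0, 2.0, 2.5, 4.0]
sigma = 0.3
hessian = -sum(x ** 2 for x in xs) / sigma ** 2  # d^2 logL / db^2
se = math.sqrt(-1.0 / hessian)                   # sqrt of inverse negative Hessian
print(se)  # matches sigma / sqrt(sum(x_i^2))
```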


In [14]:
[round([bOpt;sOpt],3) round(seOpt,3)]


Out[14]:
17x2 Array{Float64,2}:
 2.101  0.021
 0.1    0.0  
 0.497  0.005
 0.1    0.001
 0.751  0.0  
 1.215  0.013
 0.1    0.0  
 0.505  0.005
 0.1    0.001
 0.75   0.0  
 1.208  0.013
 0.1    0.0  
 0.509  0.005
 0.101  0.001
 0.75   0.0  
 1.221  0.013
 0.3    0.001

We can compare these with the OLS standard errors and confirm that they are essentially the same:


In [15]:
seHat = sqrt((sigHat^2).*diag((X'*X)\eye(size(X,2))));
[round(bHat,3) round(seHat,3)]


Out[15]:
16x2 Array{Float64,2}:
 2.101  0.021
 0.1    0.0  
 0.497  0.005
 0.1    0.001
 0.751  0.0  
 1.215  0.013
 0.1    0.0  
 0.505  0.005
 0.1    0.001
 0.75   0.0  
 1.208  0.013
 0.1    0.0  
 0.509  0.005
 0.101  0.001
 0.75   0.0  
 1.221  0.013

Constrained optimization in JuMP

Constraints can easily be added to the objective function. There are two ways of doing this:

  1. Directly in the @defVar statement, e.g. @defVar(myMLE, s>=0.0); restricts s to be weakly positive.
  2. As additional lines after the @defVar statements but before the @setNLObjective statement, e.g. @addConstraint(myMLE, b[15] == 0), which restricts the 15th element of the b vector to equal 0.

JuMP can handle even thousands of constraints, and its simple syntax makes them easy to code.
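As an aside, an equality constraint such as b[15] == 0 is equivalent to estimating the model with that regressor's column dropped. A tiny illustrative check (Python, standard library only, hypothetical two-regressor model):

```python
# Fit y = b1*x1 + b2*x2 subject to b2 == 0: the constrained problem reduces
# to a one-regressor OLS of y on x1 alone (no intercept)
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [0.5, 0.1, 0.9, 0.4]   # plays no role once b2 is fixed at 0
y  = [2.0 * x for x in x1]  # true model: b1 = 2, b2 = 0

b1 = sum(a * c for a, c in zip(x1, y)) / sum(a ** 2 for a in x1)
print(b1)  # 2.0
```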

The Final Word

Pros of JuMP:

  • Versatile modeling language that takes advantage of Julia's features.
  • Interfaces to a variety of optimizers, meaning the user can simply call a different optimizer without adjusting any underlying code.
  • Very fast. Roughly 3+ times faster than Matlab's fminunc for a simple large-scale problem.
  • User can specify maximization or minimization on the fly without needing to re-express the objective function.

Cons of JuMP:

  • JuMP performs constrained optimization whenever the user specifies constraints on any of the parameters. In that case, the Hessian it returns is the Hessian of the corresponding Lagrangian, not the Hessian of the objective. This makes inference a bit trickier.
  • No matrix multiplication allowed, so users must specify matrix operations in scalar form.