Case 1: Implementing LS solver

by Ken Bastiaensen, Milan Van den Heuvel, Gonzalo Villa *Advanced Econometrics 2016-2017.

The system of linear equations is given in matrix notation $$ y = X \beta + \mu$$

with dependent variables $y$ and residuals $\mu \in \mathbb{R}^{N\times1}$, independent variables $X \in \mathbb{R}^{N\times K}$ and coefficients $\beta \in \mathbb{R}^{K\times 1}$.

We use the estimator for the coefficients $\hat\beta = (X'X)^{-1}Xy$ as given in course. In practice more advanced methods are used for stability, see the bottom part of this document.

With $\hat\mu = y - X \hat\beta$, the standard error $\sigma$ is estimated by $$\hat\sigma^2 = \frac{\hat\mu'\hat\mu}{N - K}$$ and variance-covariance matrix of $\hat\beta$ given by $$V[\hat\beta] = \hat\sigma^2 (X'X)^{-1}$$

The t distribution for this null hypothesis that $\beta_0=0$ is then $$ t = \frac{\hat\beta - 0}{s.e.} $$ with the standard errors $s.e.$ on the diagonal of $\hat\sigma$.



In [3]:

    
#Load the Distributions package. Use `Pkg.install("Distributions")` to install first time.
using Distributions: TDist, ccdf



In [4]:

    
type regress_results
    coefs
    yhat
    res
    vcv
    tstat
    pval
end



In [5]:

    
# Keyword arguments are placed after semicolon.
# Symbols start with colon, e.g. `:symbol`.
function ols(y, X; corr=:none, lags::Int=Int(floor(size(X,1)^(1/4))))

    # β̂ = X \ y is more stable than β̂ = inv(X'*X) * X' \ y 
    # see notes at bottom of case 1 notebook
    β̂ = X \ y
    ŷ = X * β̂
    μ̂ = y - ŷ

    T, K = size(X)
    σ̂² = dot(μ̂, μ̂) / (T - K)

    #use correction for variance covariance
    if corr == :none
        vcv = σ̂² * inv(X'*X)
    elseif corr == :white
        vcv = newey_west(X, μ̂, 0)
    elseif corr == :newey_west
        vcv = newey_west(X, μ̂, lags)
    else
        error("wrong argument for correction keyword")
    end
    
    # T statistics for H₀: βᵢ = 0
    tstat = β̂ ./ sqrt(diag(vcv))

    # absolute value and times two for double sided test
    pval  = 2 * ccdf(TDist(T-K), abs(tstat))

    regress_results(β̂, ŷ, μ̂, vcv, tstat, pval)
end









    Out[5]:





ols (generic function with 1 method)



In [6]:

    
function newey_west(X, μ̂, lags::Integer)

    XtXInv = inv(X'*X)
    T, K = size(X)
    
    if lags==0 # White estimator
        return XtXInv * X' * diagm(μ̂.^2) * X * XtXInv
    end

    vcv = zeros(K, K)
    for t = 1:T
        vcv += μ̂[t]^2 * (X[t,:] * X[t,:]')
    end
    for lag in 1:lags
        w = 1 - lag / (lags + 1)
        for t in (lag + 1):T            
            vcv += w * μ̂[t] * μ̂[t-lag] * (X[t,:]*X[t-lag,:]' + X[t-lag,:]*X[t,:]')            
        end
    end
    vcv = XtXInv * vcv * XtXInv
end









    Out[6]:





newey_west (generic function with 1 method)



In [7]:

    
#test if it runs
ols(randn(20), randn(20,3))









    Out[7]:





regress_results([0.024595,-0.295037,0.0980838],[-0.349802,-0.669619,-0.266682,-0.321911,0.506153,-0.308462,0.400647,0.518798,-0.000732764,-0.098182,0.251172,-0.0739667,0.169721,0.417368,0.0515518,-0.263067,-0.123205,-0.666188,-0.219392,0.105833],[1.79539,0.492269,-0.795891,-0.583049,0.153626,-0.11934,0.18611,-2.00379,0.189448,-0.690507,-0.257992,0.944688,-0.588816,2.74247,1.00391,-0.378464,-0.215375,-0.213175,0.530209,2.14441],[0.0970421 0.00210263 0.0178319; 0.00210263 0.074978 0.0134133; 0.0178319 0.0134133 0.0453945],[0.0789527,-1.07748,0.460358],[0.937992,0.296318,0.651093])



In [8]:

    
#simulation test
K = 3   # number of parameters
N = 100 # number of observations

# Create actual parameters and observations
β = randn(K)
X = randn(N, K)
X[:,1] = ones(N) # intercept
σ = 5
μ = σ * randn(N) # ~ N(0, σ)
y = X * β + μ;



In [9]:

    
res = ols(y, X)
@show β
@show res.coefs
res.vcv









    



β = [-0.246029,0.296951,1.26321]
res.coefs = [0.637702,-0.0886215,1.48288]






    Out[9]:





3×3 Array{Float64,2}:
  0.2156      -0.0140701   0.00271354
 -0.0140701    0.213752   -0.0104189 
  0.00271354  -0.0104189   0.201953



In [10]:

    
res = ols(y, X, corr=:newey_west)
@show res.coefs #should be the same as ols without correction
res.vcv









    



res.coefs = [0.637702,-0.0886215,1.48288]






    Out[10]:





3×3 Array{Float64,2}:
  0.17083     -0.00812258  -0.00107652
 -0.00812258   0.170932    -0.0308773 
 -0.00107652  -0.0308773    0.118859



In [11]:

    
# We test with the illustration of the slides 
y = [1673,1688,1666,1735,1749,1756,1815,1867,1948,2048,2128,2165,2257,2316,2324]
X2=[1839,1844,1831,1881,1883,1910,1969,2016,2126,2239,2336,2404,2487,2535,2595]
X = hcat(ones(length(X2)), X2, collect(1:15))
ols(y, X)









    Out[11]:





regress_results([300.286,0.741981,8.04356],[1672.83,1684.59,1682.98,1728.13,1737.65,1765.73,1817.55,1860.47,1950.13,2042.02,2122.03,2180.53,2250.16,2293.82,2346.38],[0.167431,3.41396,-16.9838,6.87355,11.346,-9.73102,-2.55145,6.53189,-2.12957,5.98303,5.96733,-15.5309,6.8411,22.1825,-22.38],[6133.65 -3.70794 220.206; -3.70794 0.00225946 -0.137052; 220.206 -0.137052 8.90154],[3.83421,15.6096,2.69597],[0.00237732,2.46242e-9,0.0194537])

Estimate Cigarette Demand

Read data



In [12]:

    
data, header = readcsv("Data_Baltagi.csv", header=true)
println.(header);









    



state
 year
 P/pack
 pop
 pop16+
 CPI
 ndi/capita
 C/capita
 Pn/pack
 ln C_it
 ln P_it
 ln Pn_it
 ln Y_it

We first model $$\ln C_{it} = \beta_0 + \beta_1\ln P_{it} + \beta_2\ln Pn_{it} + \beta_3\ln Y_{it} + \mu_{it}$$



In [13]:

    
y = data[:, 10]
X = hcat(ones(y), data[:, 11:13]);



In [34]:

    
using DataFrames

Results for ordinary Least Squares



In [14]:

    
res_ols = ols(y, X)
@show res.coefs
@show sqrt(diag(res.vcv))
@show res.tstat
@show res.pval;









    



res.coefs = [4.7478,-1.02434,0.259558,0.0655521]
sqrt(diag(res.vcv)) = [0.11508,0.0588673,0.0579217,0.0251092]
res.tstat = [41.2567,-17.4008,4.48119,2.61068]
res.pval = [7.77625e-243,1.88586e-61,8.03852e-6,0.00913432]

OLS with White correction



In [15]:

    
res = ols(y, X, corr= :white)
@show res.coefs #should be the same
@show sqrt(diag(res.vcv))
@show res.tstat
@show res.pval;









    



res.coefs = [4.7478,-1.02434,0.259558,0.0655521]
sqrt(diag(res.vcv)) = [0.0937596,0.074459,0.0665903,0.0207982]
res.tstat = [50.638,-13.7571,3.89783,3.15181]
res.pval = [1.19654e-316,1.92136e-40,0.000101729,0.00165759]

OLS with Newey-West correction, automatic lags selection



In [16]:

    
res = ols(y, X, corr = :newey_west)
@show res.coefs #should be the same
@show sqrt(diag(res.vcv))
@show res.tstat
@show res.pval;









    



res.coefs = [4.7478,-1.02434,0.259558,0.0655521]
sqrt(diag(res.vcv)) = [0.215621,0.16504,0.146265,0.0479885]
res.tstat = [22.0192,-6.20659,1.77457,1.366]
res.pval = [2.70923e-92,7.1522e-10,0.0761901,0.172163]

EGLS assuming AR(1) errors

We first define the Generalized Least Squares (GLS) method:



In [17]:

    
function gls(y, X, Ω)
    P = chol(inv(Ω))
    ols(P*y, P*X)
end









    Out[17]:





gls (generic function with 1 method)

We look at the residuals versus residuals with 1 time lag to inspect an AR(1) pattern in the error terms.



In [18]:

    
res = ols(y, X, corr = :newey_west)
μ̂ = res.res;

We must consider that the panel data consists of $n = 30$ states over $T = 46$ timesteps for a total of 1380 observations.



In [20]:

    
states = data[:,1]
@show T = sum(states .== 1.0) 
@show n = length(unique(states))
n*T == size(X, 1) || error("missing panel data");









    



T = sum(states .== 1.0) = 30
n = length(unique(states)) = 46

We create a time series for residuals $\mu_t$ and for lagged residuals $\mu_{t-1}$ by taking into account the panel data. Specifically we copy all residuals $\mu$ and remove the first time points for $\mu_t$ and the last ones for $\mu_{t-1}$ for each state.



In [21]:

    
μ̂_t   = deleteat!(copy(μ̂), 1:T:n*T)
μ̂_tm1 = deleteat!(copy(μ̂), T:T:n*T);



In [25]:

    
using Plots # Pkg.add("Plots") if not yet installed
gr(); #Pkg.add("GR") if not yet installed









    Out[25]:





Plots.GRBackend()



In [26]:

    
scatter(μ̂_t, μ̂_tm1, legend=false)
#visibly strong autocorrelation present









    Out[26]:

We formally test the significance of the autocorrelation using the Durbin-Watson statistic and published tables. We couldn't find the tables for panel data, so we used $T = 30$, $K=3$ and no intercept at 5% significance with a stated lower bound of 0.948. The value we find is much lower, however, so strong indication of autocorrelation present.



In [27]:

    
ρ̂ = (μ̂_tm1 \ μ̂_t )[1]









    Out[27]:





0.9501243934288294



In [29]:

    
@show d = norm(μ̂_t .- μ̂_tm1)^2 / norm(μ̂)^2
d < 0.948









    



d = norm(μ̂_t .- μ̂_tm1) ^ 2 / norm(μ̂) ^ 2 = 0.07668692300272952






    Out[29]:





true

For EGLS with AR1 the easiest way to proceed is to estimate $\rho$ and create the Projection matrix directly for use in OLS.



In [54]:

    
# function to create a projection matrix based on correlation and size
function ar1_proj(len, ρ)
    P_ar1 = zeros(len, len)
    
    # set diagonal
    for i = 1:len
        P_ar1[i,i] = 1
    end
    P_ar1[1,1] = sqrt(1 - ρ̂)
    
    # set off diagonals
    for i = 1:len-1
        P_ar1[i+1, i] = - ρ̂
    end
    
    P_ar1
end









    



WARNING: Method definition ar1_proj(Any, Any) in module Main at In[52]:3 overwritten at In[54]:3.






    Out[54]:





ar1_proj (generic function with 1 method)



In [55]:

    
P = ar1_proj(n*T, ρ̂)

# Run GLS
res_egls = ols(P * y, P * X)
@show res_egls.coefs
@show sqrt(diag(res_egls.vcv))
@show res_egls.pval;









    



res_egls.coefs = [5.8115,-0.551747,-0.0593205,-0.170876]
sqrt(diag(res_egls.vcv)) = [0.111038,0.0375419,0.0416841,0.022351]
res_egls.pval = [0.0,1.57581e-45,0.154936,3.89684e-14]

When taking into account this autocorrelation, we observe that the coefficients are smaller (except, then, the constant) and for $P_{it}$ even insignificant.

Dynamic CigDem Model

For the dynamic model, we include a lag for the sigaret sales: $$\ln C_{it} = \beta_0 + \alpha\ln C_{i,t-1} + \beta_1\ln P_{it} + \beta_2\ln Pn_{it} + \beta_3\ln Y_{it} + \mu_{it}$$ We will create a new model matrix with the lagged variable and without the first observation, for which no lag is available.



In [56]:

    
# keep only available dependent variables by removing first observation for each state
Yd = deleteat!(copy(y), 1:T:n*T)

# create model matrix with independent variables
Xd = ones((T-1)*n, 5)

C_lag = deleteat!(copy(y), T:T:n*T)
Xd[:, 2] = C_lag

for state = 1:n
    Xd[(state - 1)*(T-1) + 1:state*(T-1), 3:5] = X[(state - 1)*T + 2:state*T, 2:4]
end



In [57]:

    
res_dyn = ols(Yd, Xd)
res_dyn.coefs









    Out[57]:





5-element Array{Float64,1}:
  0.278311 
  0.969494 
 -0.0901512
  0.0240347
 -0.0306788



In [71]:

    
abs(res_dyn.tstat)









    Out[71]:





5-element Array{Float64,1}:
   7.10461
 157.669  
   6.18342
   1.82687
   5.08946

So same results as in Baltagi 2005, Table 8.1 for OLS.

TODO: add estimation for first differences

We also look at possible autocorrelation still present.



In [65]:

    
μ̂     = res_dyn.res
μ̂_t   = deleteat!(copy(μ̂), 1:(T-1):n*(T-1))
μ̂_tm1 = deleteat!(copy(μ̂), (T-1):(T-1):n*(T-1))
scatter(μ̂_t, μ̂_tm1, legend=false)
# by looking at it, it seems most autocorrelation gone









    Out[65]:



In [68]:

    
@show ρ̂ = (μ̂_tm1 \ μ̂_t )[1]
@show d = norm(μ̂_t .- μ̂_tm1)^2 / norm(μ̂)^2
d < 0.948









    



ρ̂ = (μ̂_tm1 \ μ̂_t)[1] = -0.02025847697480531
d = norm(μ̂_t .- μ̂_tm1) ^ 2 / norm(μ̂) ^ 2 = 1.966921826688738






    Out[68]:





false

No significant autocorrelation left

Addendum

Background on OLS

Problem

Ordinary Least Squares (OLS) is a solution for a system of linear equations by minimizing the sum of squared differences between the observed values ($y$) and the corresponding modelled values ($\hat y = X \hat\beta$).

A system of equations is given by $$ y_i = \beta_1 + \beta_2 X_{i2} + \beta_3 X_{i3} + ... + \beta_K X_{iK} + \mu_i, \quad i=1,...,N$$

or in matrix notation $$ y = X \beta + \mu$$

with $y, \mu \in \mathbb{R}^{N\times1}$, $X \in \mathbb{R}^{N\times K}$ and $\beta \in \mathbb{R}^{K\times 1}$.

The OLS solves $$\operatorname{\,min} \, \big\|y - X \hat\beta \big\|^2$$ that with some algebra leads to the normal equations $$ X' X\beta = X'y$$

Solution

Normal equations

In case of a rank complete matrix $X$ the normal equations can be solved by inverting the (Gram) matrix $X'X$ such that $\hat\beta = (X'X)^{-1}X'y$.

As X'X is a symmetric, positive definite matrix it's computationally more efficient to use the cholesky decomposition of $X'X = LL'$ with $L$ a lower triangular matrix. This gives $LL'\hat\beta = X'y$. First solve $Lz = X'y$ for $z$ by forward substitution and then $L'\hat\beta = z$ for $\beta$ by backward substitution.

However, solving the normal equations is numerically unstable, i.e. it is very sensitive to small pertubations in $X$. The condition number $\kappa$ of the system is worsened: $\kappa(X'X) = [\kappa(X)]^2$. We show the impact further below.

We first simulate some parameters and data using the same notation as in the slides.



In [13]:

    
K = 5   # number of parameters
N = 100 # number of observations

# Create actual parameters and observations
β = randn(K)
X = randn(N, K)
X[:,1] = ones(N) # intercept
σ = 0.1
μ = randn(N) * σ # ~ N(0,1)
y = X * β + μ;



In [14]:

    
#define function
OLS_normal(y, X) = inv(X'X) * X'y

# test and print result side by side for quick sanity check
β̂_normal = OLS_normal(y, X)
hcat(β̂_normal, β)









    Out[14]:





5×2 Array{Float64,2}:
  0.53793    0.544423
 -1.03249   -1.03373 
  0.287501   0.292956
  0.752885   0.74161 
 -1.347     -1.34078



In [15]:

    
function OLS_chol(y, X)
    X′X_chol = cholfact(X'X)
    X′X_chol\(X'*y)
end

# test and print result side by side for quick sanity check
β̂_chol = OLS_chol(y, X)
hcat(β̂_chol, β)









    Out[15]:





5×2 Array{Float64,2}:
  0.53793    0.544423
 -1.03249   -1.03373 
  0.287501   0.292956
  0.752885   0.74161 
 -1.347     -1.34078

QR Factorization

We decompose $X$ to its orthogonal QR decomposition: $$X = Q\begin{bmatrix} R \\ 0 \end{bmatrix} $$ with $Q\in\mathbb{R}^{N\times N} $ an orthogonal matrix and $R\in\mathbb{R}^{k\times k}$ an upper triangular matrix (with positive diagonal elements). The solution is then given by $$R \hat\beta =\left(Q' \mathbf y \right)_K$$

This is the default method in Julia for solving OLS (for non-square matrix $X$) with β̂ = X\y



In [16]:

    
function OLS_qr(y, X)
    X_qr = qrfact(X)
    X_qr \ y
end









    Out[16]:





OLS_qr (generic function with 1 method)



In [17]:

    
β̂_qr = OLS_qr(y, X)
hcat(β̂_qr, β)









    Out[17]:





5×2 Array{Float64,2}:
  0.53793    0.544423
 -1.03249   -1.03373 
  0.287501   0.292956
  0.752885   0.74161 
 -1.347     -1.34078

SVD

Another orthogonal decomposition uses the singular value decomposition (SVD) of $X$: $$X = U \Sigma V'$$ with $U \in\mathbb{R}^{N\times N}$ and $V \in\mathbb{R}^{K\times K}$ both an orthogonal matrix, and $\Sigma\in\mathbb{R}^{N\times K}$ a diagonal matrix. SVD can transform any matrix into a diagonal matrix with the right choice of orthogonal coordinate systems for its domain and range. This even works for rank deficient matrices and is numerically stable! In such an ill-conditioned system, SVD will return the solution $\hat\beta$ that has the smallest norm.



In [18]:

    
function OLS_svd(y, X)
    X_svd = svdfact(X)
    X_svd \ y
end









    Out[18]:





OLS_svd (generic function with 1 method)



In [19]:

    
β̂_svd = OLS_svd(y, X)
hcat(β̂_svd, β)









    Out[19]:





5×2 Array{Float64,2}:
  0.53793    0.544423
 -1.03249   -1.03373 
  0.287501   0.292956
  0.752885   0.74161 
 -1.347     -1.34078

Numerical Stability

We now test numerical stability (example 3.5 in [1]) with $$X = \begin{bmatrix} 1 & 1\\ 1 & 1+\epsilon \end{bmatrix}, y = \begin{bmatrix} 2 \\ 2 \end{bmatrix}$$ The solution is given by $$\hat\beta=\begin{bmatrix}2 \\ 0\end{bmatrix}$$

This system is ill-conditioned because a slight perturbation in $$y_e = \begin{bmatrix} 2 \\ 2 + \epsilon \end{bmatrix}$$ gives the compeltely different solution $$\hat\beta_e=\begin{bmatrix}1 \\ 1\end{bmatrix}$$

We show that methods using the normal equations do not find the second solution.

[1]: Numerically Efficient Methods for Solving Least Squares Problems, Do Q Lee.



In [20]:

    
ϵ = 1e-7
X = [1 1;1 1+ϵ]
y = [2; 2]
OLS_normal(y, X)









    Out[20]:





2-element Array{Float64,1}:
 2.0
 0.0



In [21]:

    
yₑ = [2; 2+ϵ]
@show OLS_normal(yₑ, X)
@show OLS_chol(yₑ, X);









    



OLS_normal(yₑ,X) = [0.75,1.125]
OLS_chol(yₑ,X) = [0.727273,1.27273]



In [22]:

    
@show OLS_qr(yₑ, X)
@show OLS_svd(yₑ, X);









    



OLS_qr(yₑ,X) = [1.0,1.0]
OLS_svd(yₑ,X) = [1.0,1.0]

Advanced methods

Several advanced methods exist such as bayesian OLS (with possible priors), sparse OLS solvers, James-Stein estimator and iterative solvers.



In [23]:

    
using IterativeSolvers



In [24]:

    
OLS_lsmr(y, X) = lsmr(X, y)









    Out[24]:





OLS_lsmr (generic function with 1 method)



In [25]:

    
β̂_lsmr = OLS_lsmr(y, X)
hcat(β̂_lsmr, β)









    



DimensionMismatch("vectors must have same lengths")

 in hcat(::Array{Float64,1}, ::Array{Float64,1}) at ./array.jl:738