by Ken Bastiaensen, Milan Van den Heuvel, Gonzalo Villa. Advanced Econometrics 2016-2017.
The system of linear equations is given in matrix notation $$ y = X \beta + \mu$$
with dependent variable $y$ and error terms $\mu \in \mathbb{R}^{N\times1}$, independent variables $X \in \mathbb{R}^{N\times K}$ and coefficients $\beta \in \mathbb{R}^{K\times 1}$.
We use the estimator for the coefficients $\hat\beta = (X'X)^{-1}X'y$ as given in the course. In practice more advanced methods are used for numerical stability; see the bottom part of this document.
With residuals $\hat\mu = y - X \hat\beta$, the error variance $\sigma^2$ is estimated by $$\hat\sigma^2 = \frac{\hat\mu'\hat\mu}{N - K}$$ and the variance-covariance matrix of $\hat\beta$ is given by $$V[\hat\beta] = \hat\sigma^2 (X'X)^{-1}$$
The t statistic for the null hypothesis that $\beta_i=0$ is then $$ t_i = \frac{\hat\beta_i - 0}{s.e.(\hat\beta_i)} $$ with the standard errors $s.e.(\hat\beta_i)$ the square roots of the diagonal elements of $V[\hat\beta]$.
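As a quick illustration, these formulas translate almost literally into Julia. The following is a minimal sketch on made-up simulated data; the `ols` function defined below adds the robust variance corrections.

using Distributions: TDist, ccdf        # also loaded in the next cell

# simulate a small example: N = 50 observations, K = 3 coefficients
X = hcat(ones(50), randn(50, 2))
y = X * [1.0, 2.0, -0.5] + randn(50)
N, K = size(X)

β̂  = (X'X) \ (X'y)                     # (X'X)⁻¹X'y via the normal equations
μ̂  = y - X * β̂                         # residuals
σ̂² = dot(μ̂, μ̂) / (N - K)               # estimated error variance
se = sqrt(σ̂² * diag(inv(X'X)))         # standard errors of β̂
t  = β̂ ./ se                           # t statistics for H₀: βᵢ = 0
p  = 2 * ccdf(TDist(N - K), abs(t))    # two-sided p-values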
In [3]:
# Load the Distributions package. Use `Pkg.add("Distributions")` to install it the first time.
using Distributions: TDist, ccdf
In [4]:
type regress_results
    coefs   # estimated coefficients β̂
    yhat    # fitted values ŷ = Xβ̂
    res     # residuals μ̂ = y - ŷ
    vcv     # variance-covariance matrix of β̂
    tstat   # t statistics for H₀: βᵢ = 0
    pval    # two-sided p-values
end
In [5]:
# Keyword arguments are placed after semicolon.
# Symbols start with colon, e.g. `:symbol`.
function ols(y, X; corr=:none, lags::Int=Int(floor(size(X,1)^(1/4))))
    # β̂ = X \ y is more stable than β̂ = inv(X'*X) * (X'*y),
    # see notes at bottom of case 1 notebook
    β̂ = X \ y
    ŷ = X * β̂
    μ̂ = y - ŷ
    T, K = size(X)
    σ̂² = dot(μ̂, μ̂) / (T - K)
    # apply the requested correction for the variance-covariance matrix
    if corr == :none
        vcv = σ̂² * inv(X'*X)
    elseif corr == :white
        vcv = newey_west(X, μ̂, 0)
    elseif corr == :newey_west
        vcv = newey_west(X, μ̂, lags)
    else
        error("wrong argument for correction keyword")
    end
    # t statistics for H₀: βᵢ = 0
    tstat = β̂ ./ sqrt(diag(vcv))
    # absolute value and times two for a two-sided test
    pval = 2 * ccdf(TDist(T-K), abs(tstat))
    regress_results(β̂, ŷ, μ̂, vcv, tstat, pval)
end
Out[5]:
In [6]:
function newey_west(X, μ̂, lags::Integer)
    XtXInv = inv(X'*X)
    T, K = size(X)
    if lags == 0 # White (heteroskedasticity-consistent) estimator
        return XtXInv * X' * diagm(μ̂.^2) * X * XtXInv
    end
    # lag-0 (heteroskedasticity) term
    vcv = zeros(K, K)
    for t = 1:T
        vcv += μ̂[t]^2 * (X[t,:] * X[t,:]')
    end
    # autocovariance terms with Bartlett kernel weights
    for lag in 1:lags
        w = 1 - lag / (lags + 1)
        for t in (lag + 1):T
            vcv += w * μ̂[t] * μ̂[t-lag] * (X[t,:]*X[t-lag,:]' + X[t-lag,:]*X[t,:]')
        end
    end
    # sandwich form
    vcv = XtXInv * vcv * XtXInv
end
Out[6]:
In [7]:
#test if it runs
ols(randn(20), randn(20,3))
Out[7]:
In [8]:
#simulation test
K = 3 # number of parameters
N = 100 # number of observations
# Create actual parameters and observations
β = randn(K)
X = randn(N, K)
X[:,1] = ones(N) # intercept
σ = 5
μ = σ * randn(N) # ~ N(0, σ)
y = X * β + μ;
In [9]:
res = ols(y, X)
@show β
@show res.coefs
res.vcv
Out[9]:
In [10]:
res = ols(y, X, corr=:newey_west)
@show res.coefs #should be the same as ols without correction
res.vcv
Out[10]:
In [11]:
# We test with the illustration of the slides
y = [1673,1688,1666,1735,1749,1756,1815,1867,1948,2048,2128,2165,2257,2316,2324]
X2=[1839,1844,1831,1881,1883,1910,1969,2016,2126,2239,2336,2404,2487,2535,2595]
X = hcat(ones(length(X2)), X2, collect(1:15))
ols(y, X)
Out[11]:
Read data
In [12]:
data, header = readcsv("Data_Baltagi.csv", header=true)
println.(header);
We first model $$\ln C_{it} = \beta_0 + \beta_1\ln P_{it} + \beta_2\ln Pn_{it} + \beta_3\ln Y_{it} + \mu_{it}$$
In [13]:
y = data[:, 10]
X = hcat(ones(y), data[:, 11:13]);
In [34]:
using DataFrames
In [14]:
res_ols = ols(y, X)
@show res_ols.coefs
@show sqrt(diag(res_ols.vcv))
@show res_ols.tstat
@show res_ols.pval;
In [15]:
res = ols(y, X, corr= :white)
@show res.coefs #should be the same
@show sqrt(diag(res.vcv))
@show res.tstat
@show res.pval;
In [16]:
res = ols(y, X, corr = :newey_west)
@show res.coefs #should be the same
@show sqrt(diag(res.vcv))
@show res.tstat
@show res.pval;
We first define the Generalized Least Squares (GLS) method: with $V[\mu] = \sigma^2\Omega$, premultiplying the model by a matrix $P$ satisfying $P'P = \Omega^{-1}$ makes the transformed errors spherical again, so we can simply run OLS on $Py$ and $PX$.
In [17]:
function gls(y, X, Ω)
    # chol returns the upper factor P with P'P = Ω⁻¹, so that P Ω P' = I
    P = chol(inv(Ω))
    ols(P*y, P*X)
end
Out[17]:
We plot the residuals against the residuals lagged by one period to inspect for an AR(1) pattern in the error terms.
In [18]:
res = ols(y, X, corr = :newey_west)
μ̂ = res.res;
We must take into account that the panel data consists of $n = 46$ states over $T = 30$ time periods, for a total of 1380 observations.
In [20]:
states = data[:,1]
@show T = sum(states .== 1.0)
@show n = length(unique(states))
n*T == size(X, 1) || error("missing panel data");
We create a time series for the residuals $\hat\mu_t$ and for the lagged residuals $\hat\mu_{t-1}$, taking the panel structure into account. Specifically, we copy all residuals $\hat\mu$ and, for each state, remove the first time point for $\hat\mu_t$ and the last one for $\hat\mu_{t-1}$.
In [21]:
μ̂_t = deleteat!(copy(μ̂), 1:T:n*T)
μ̂_tm1 = deleteat!(copy(μ̂), T:T:n*T);
In [25]:
using Plots # Pkg.add("Plots") if not yet installed
gr(); #Pkg.add("GR") if not yet installed
Out[25]:
In [26]:
scatter(μ̂_t, μ̂_tm1, legend=false)
#visibly strong autocorrelation present
Out[26]:
We formally test the significance of the autocorrelation with the Durbin-Watson statistic $$d = \frac{\sum_{t=2}^{T}(\hat\mu_t - \hat\mu_{t-1})^2}{\sum_{t=1}^{T}\hat\mu_t^2}$$ and published critical-value tables. We could not find tables for panel data, so we use the tabulated value for $T = 30$, $K = 3$ and no intercept at 5% significance, which states a lower bound of 0.948. The value we find is much lower, a strong indication that positive autocorrelation is present.
In [27]:
ρ̂ = (μ̂_tm1 \ μ̂_t)[1]  # AR(1) coefficient: regress μ̂ₜ on μ̂ₜ₋₁ without intercept
Out[27]:
In [29]:
@show d = norm(μ̂_t .- μ̂_tm1)^2 / norm(μ̂)^2
d < 0.948
Out[29]:
For EGLS with AR(1) errors, the easiest way to proceed is to estimate $\rho$ and construct the transformation (projection) matrix directly for use in OLS.
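For a single AR(1) series of length $T$, the standard (Prais-Winsten) form of this transformation matrix is $$P = \begin{bmatrix} \sqrt{1-\rho^2} & 0 & \cdots & 0 \\ -\rho & 1 & & \\ & \ddots & \ddots & \\ 0 & & -\rho & 1 \end{bmatrix}$$ so that the transformed errors $P\mu$ are again homoskedastic and serially uncorrelated. The function below constructs this matrix for a given size and $\rho$.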
In [54]:
# function to create a projection matrix based on correlation and size
function ar1_proj(len, ρ)
P_ar1 = zeros(len, len)
# set diagonal
for i = 1:len
P_ar1[i,i] = 1
end
P_ar1[1,1] = sqrt(1 - ρ̂)
# set off diagonals
for i = 1:len-1
P_ar1[i+1, i] = - ρ̂
end
P_ar1
end
Out[54]:
In [55]:
P = ar1_proj(n*T, ρ̂)
# Run GLS
res_egls = ols(P * y, P * X)
@show res_egls.coefs
@show sqrt(diag(res_egls.vcv))
@show res_egls.pval;
When taking this autocorrelation into account, we observe that the coefficients are smaller in magnitude (except for the constant) and the coefficient on $\ln P_{it}$ even becomes insignificant.
For the dynamic model, we include a lag of the cigarette sales: $$\ln C_{it} = \beta_0 + \alpha\ln C_{i,t-1} + \beta_1\ln P_{it} + \beta_2\ln Pn_{it} + \beta_3\ln Y_{it} + \mu_{it}$$ We create a new model matrix with the lagged variable and without the first observation of each state, for which no lag is available.
In [56]:
# keep only available dependent variables by removing first observation for each state
Yd = deleteat!(copy(y), 1:T:n*T)
# create model matrix with independent variables
Xd = ones((T-1)*n, 5)
C_lag = deleteat!(copy(y), T:T:n*T)
Xd[:, 2] = C_lag
for state = 1:n
Xd[(state - 1)*(T-1) + 1:state*(T-1), 3:5] = X[(state - 1)*T + 2:state*T, 2:4]
end
In [57]:
res_dyn = ols(Yd, Xd)
res_dyn.coefs
Out[57]:
In [71]:
abs(res_dyn.tstat)
Out[71]:
These are the same OLS results as in Baltagi (2005), Table 8.1.
We also check whether autocorrelation is still present.
In [65]:
μ̂ = res_dyn.res
μ̂_t = deleteat!(copy(μ̂), 1:(T-1):n*(T-1))
μ̂_tm1 = deleteat!(copy(μ̂), (T-1):(T-1):n*(T-1))
scatter(μ̂_t, μ̂_tm1, legend=false)
# visually, it seems most of the autocorrelation is gone
Out[65]:
In [68]:
@show ρ̂ = (μ̂_tm1 \ μ̂_t )[1]
@show d = norm(μ̂_t .- μ̂_tm1)^2 / norm(μ̂)^2
d < 0.948
Out[68]:
No significant autocorrelation is left.
Ordinary Least Squares (OLS) solves a system of linear equations by minimizing the sum of squared differences between the observed values ($y$) and the corresponding modelled values ($\hat y = X \hat\beta$).
A system of equations is given by $$ y_i = \beta_1 + \beta_2 X_{i2} + \beta_3 X_{i3} + ... + \beta_K X_{iK} + \mu_i, \quad i=1,...,N$$
or in matrix notation $$ y = X \beta + \mu$$
with $y, \mu \in \mathbb{R}^{N\times1}$, $X \in \mathbb{R}^{N\times K}$ and $\beta \in \mathbb{R}^{K\times 1}$.
OLS solves $$\operatorname{\,min}_{\hat\beta} \, \big\|y - X \hat\beta \big\|^2$$ Setting the derivative $-2X'(y - X\hat\beta)$ to zero leads to the normal equations $$ X' X\hat\beta = X'y$$
If $X$ has full column rank, the normal equations can be solved by inverting the (Gram) matrix $X'X$, such that $\hat\beta = (X'X)^{-1}X'y$.
As $X'X$ is a symmetric, positive definite matrix, it is computationally more efficient to use the Cholesky decomposition $X'X = LL'$ with $L$ a lower triangular matrix. This gives $LL'\hat\beta = X'y$: first solve $Lz = X'y$ for $z$ by forward substitution and then $L'\hat\beta = z$ for $\hat\beta$ by backward substitution.
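As a minimal sketch, the two triangular solves written out explicitly on some throwaway simulated data (`OLS_chol` below simply lets `cholfact` handle both steps at once):

X = randn(50, 3); y = randn(50)   # throwaway example data
U = chol(X'X)                     # upper factor: X'X = U'U, so L = U'
z = U' \ (X'y)                    # forward substitution:  L z = X'y
β̂ = U \ z                         # backward substitution: L'β̂ = U β̂ = z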
However, solving the normal equations is numerically unstable, i.e. very sensitive to small perturbations in $X$: the condition number $\kappa$ of the system is squared, $\kappa(X'X) = [\kappa(X)]^2$. We show the impact further below.
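A quick, illustrative check of this conditioning claim on a random matrix:

A = randn(100, 5)
@show cond(A)
@show cond(A'A)    # ≈ cond(A)^2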
We first simulate some parameters and data using the same notation as in the slides.
In [13]:
K = 5 # number of parameters
N = 100 # number of observations
# Create actual parameters and observations
β = randn(K)
X = randn(N, K)
X[:,1] = ones(N) # intercept
σ = 0.1
μ = randn(N) * σ # ~ N(0, σ²)
y = X * β + μ;
In [14]:
#define function
OLS_normal(y, X) = inv(X'X) * X'y
# test and print result side by side for quick sanity check
β̂_normal = OLS_normal(y, X)
hcat(β̂_normal, β)
Out[14]:
In [15]:
function OLS_chol(y, X)
X′X_chol = cholfact(X'X)
X′X_chol\(X'*y)
end
# test and print result side by side for quick sanity check
β̂_chol = OLS_chol(y, X)
hcat(β̂_chol, β)
Out[15]:
We decompose $X$ into its orthogonal QR decomposition: $$X = Q\begin{bmatrix} R \\ 0 \end{bmatrix} $$ with $Q\in\mathbb{R}^{N\times N}$ an orthogonal matrix and $R\in\mathbb{R}^{K\times K}$ an upper triangular matrix (with positive diagonal elements). The solution is then given by $$R \hat\beta =\left(Q' y \right)_K$$ where $(Q'y)_K$ denotes the first $K$ elements of $Q'y$.
This is the default method in Julia for solving OLS (for a non-square matrix $X$) with `β̂ = X \ y`.
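A minimal sketch that makes this explicit for the simulated $X$ and $y$ above (`OLS_qr` below obtains the same result by calling `\` on the factorization object):

F = qrfact(X)
Q′y = F[:Q]' * y                          # Q'y
β̂_qr_explicit = F[:R] \ Q′y[1:size(X, 2)] # solve R β̂ = first K elements of Q'y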
In [16]:
function OLS_qr(y, X)
X_qr = qrfact(X)
X_qr \ y
end
Out[16]:
In [17]:
β̂_qr = OLS_qr(y, X)
hcat(β̂_qr, β)
Out[17]:
Another orthogonal decomposition uses the singular value decomposition (SVD) of $X$: $$X = U \Sigma V'$$ with $U \in\mathbb{R}^{N\times N}$ and $V \in\mathbb{R}^{K\times K}$ both orthogonal matrices, and $\Sigma\in\mathbb{R}^{N\times K}$ a (rectangular) diagonal matrix. The SVD transforms any matrix into a diagonal one by the right choice of orthogonal coordinate systems for its domain and range. This even works for rank-deficient matrices and is numerically stable: for an ill-conditioned or rank-deficient system, the SVD returns the solution $\hat\beta$ with the smallest norm.
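A small sketch making the SVD solution explicit for the (full-rank) simulated $X$ above:

U, s, V = svd(X)                                # thin SVD: X = U * diagm(s) * V'
β̂_svd_explicit = V * (diagm(1 ./ s) * (U' * y)) # β̂ = V Σ⁻¹ U'y
# for a rank-deficient X the (near-)zero singular values would be dropped here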
In [18]:
function OLS_svd(y, X)
X_svd = svdfact(X)
X_svd \ y
end
Out[18]:
In [19]:
β̂_svd = OLS_svd(y, X)
hcat(β̂_svd, β)
Out[19]:
We now test numerical stability (example 3.5 in [1]) with $$X = \begin{bmatrix} 1 & 1\\ 1 & 1+\epsilon \end{bmatrix}, y = \begin{bmatrix} 2 \\ 2 \end{bmatrix}$$ The solution is given by $$\hat\beta=\begin{bmatrix}2 \\ 0\end{bmatrix}$$
This system is ill-conditioned because a slight perturbation of $y$, $$y_e = \begin{bmatrix} 2 \\ 2 + \epsilon \end{bmatrix},$$ gives the completely different solution $$\hat\beta_e=\begin{bmatrix}1 \\ 1\end{bmatrix}$$
We show that methods using the normal equations do not find the second solution.
[1]: Numerically Efficient Methods for Solving Least Squares Problems, Do Q Lee.
In [20]:
ϵ = 1e-7
X = [1 1;1 1+ϵ]
y = [2; 2]
OLS_normal(y, X)
Out[20]:
In [21]:
yₑ = [2; 2+ϵ]
@show OLS_normal(yₑ, X)
@show OLS_chol(yₑ, X);
In [22]:
@show OLS_qr(yₑ, X)
@show OLS_svd(yₑ, X);
Several more advanced methods exist, such as Bayesian OLS (with possible priors), sparse OLS solvers, the James-Stein estimator and iterative solvers.
In [23]:
using IterativeSolvers
In [24]:
OLS_lsmr(y, X) = lsmr(X, y)
Out[24]:
In [25]:
β̂_lsmr = OLS_lsmr(y, X)
hcat(β̂_lsmr, β)