Linear algebra functions in Julia are largely implemented by calling functions from LAPACK. Sparse factorizations call functions from SuiteSparse.
For factorization algorithms, Julia already implements the common routines lu, qr, and chol. In general these routines call LAPACK subroutines for the decomposition. For example, lufact! calls LAPACK.getrf! to update the matrix in place. The algorithm uses a blocked (partitioned) factorization, which rearranges the work into matrix-matrix operations (BLAS level 3).
function lufact!{T<:BlasFloat}(A::StridedMatrix{T}; pivot = true)
    !pivot && return generic_lufact!(A, pivot = pivot)
    lpt = LAPACK.getrf!(A)
    return LU{T, typeof(A)}(lpt[1], lpt[2], lpt[3])
end
LAPACK.getrf! in turn calls getrf from liblapack (provided by GotoBLAS2):
ccall(($(string(getrf)), liblapack), Void,
      (Ptr{BlasInt}, Ptr{BlasInt}, Ptr{$elty}, Ptr{BlasInt},
       Ptr{BlasInt}, Ptr{BlasInt}),
      &m, &n, A, &lda, ipiv, info)
The declaration of getrf:
Fortran77:
call sgetrf( m, n, a, lda, ipiv, info )
call dgetrf( m, n, a, lda, ipiv, info )
call cgetrf( m, n, a, lda, ipiv, info )
call zgetrf( m, n, a, lda, ipiv, info )
Fortran95:
call getrf(a [, ipiv] [, info])
C:
lapack_int LAPACKE_<?>getrf(int matrix_layout, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, lapack_int* ipiv);
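At the Julia level the factorization is reached through lufact; a minimal usage sketch (the 0.3-era F[:L] indexing is assumed here):
A = rand(4, 4)
F = lufact(A)                   # dispatches to LAPACK.getrf! for BLAS floats
L, U, p = F[:L], F[:U], F[:p]   # triangular factors and row permutation
norm(L*U - A[p, :])             # residual on the order of machine epsilon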
LU factorization normally takes about (1/3) min(m,n)^2 (3 max(m,n) - min(m,n)) flops. When m = n, partial pivoting (as used in getrf) adds only O(n^2) comparisons in total, so it is the usual default choice.
A much more expensive strategy is complete pivoting; it is usually avoided because it requires O(n^3) comparisons, while in practice its results rarely differ much from partial pivoting.
A strategy between these two is rook pivoting: it alternately searches the current column for a pivot candidate and then the row of that candidate, and so on, until an entry is found that is the largest in magnitude in both its row and its column. In practice this costs only a small multiple of partial pivoting and achieves stability close to complete pivoting, although in the worst case rook pivoting can still require O(n^3) comparisons.
The growth factor is defined as the ratio of the largest (in magnitude) entry appearing at any stage k of the elimination to the largest entry of the original matrix. It is easy to show that with partial pivoting the growth factor is bounded by 2^(n-1), and this exponential growth is actually attained for certain matrices. Complete pivoting has a far more appealing, subexponential bound (Wilkinson's bound, roughly n^(1/2) * n^((1/4) log n)).
The growth factor is the standard criterion for judging how good a pivoting strategy is. In practice, however, unusually large growth factors are rare.
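As an illustration, the classic Wilkinson matrix attains the 2^(n-1) bound under partial pivoting; the helper gfpp below is only a sketch (not a Base function) that estimates the growth factor from the computed U:
# Wilkinson's example: unit diagonal, -1 below the diagonal, ones in the last column
function gfpp(n)
    A = tril(-ones(n, n), -1) + eye(n)
    A[:, n] = 1.0
    F = lufact(A)        # with ties resolved to the diagonal, no row swaps occur
    return maximum(abs(F[:U])) / maximum(abs(A))
end
gfpp(10)    # ≈ 2^9 = 512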
Julia provides chol for the Cholesky factorization of positive definite matrices; it calls LAPACK's compiled routines (pstrf! for the pivoted variant), and pivoting is disabled by default. chol(A) and chol(A, :U) return the upper triangular factor, while chol(A, :L) returns the lower triangular factor.
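A minimal usage sketch (assuming the 0.3-era chol API described above):
A = [4.0 2.0; 2.0 3.0]    # symmetric positive definite
U = chol(A)               # upper triangular, U'*U ≈ A
L = chol(A, :L)           # lower triangular, L*L' ≈ A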
Ax = b is solved as x = A\b. Gaussian elimination with partial pivoting (GEPP) solves for x without forming the inverse; explicitly computing A^(-1) is almost never necessary in practice. Julia's built-in \(A, b) solves the linear system directly: it uses an LU-based solver for dense matrices and UMFPACK for sparse matrices.
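A small sketch contrasting the dense and sparse paths (the matrix sizes and diagonal shifts here are arbitrary):
A = rand(100, 100) + 100*eye(100); b = rand(100);
x = A \ b                                        # dense: LU-based solve (GEPP)
As = sprandn(100, 100, 0.05) + 100*speye(100);
xs = As \ b                                      # sparse: UMFPACK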
However, GEPP is hard to parallelize efficiently; broadly speaking, any algorithm based on Gaussian elimination is difficult to parallelize.
The iteration-based Schulz method uses
X_{k+1} = X_k (2I - A X_k)
to invert A. This method involves only matrix multiplications, which parallelize well, and the iterates converge quadratically.
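A minimal sketch of the iteration (schulz_inv and its starting guess are illustrative assumptions, not a library routine):
# Newton-Schulz: X <- X*(2I - A*X); converges quadratically once norm(I - A*X0) < 1
function schulz_inv(A::Matrix{Float64}; maxiter::Int=50, tol::Float64=1e-12)
    n = size(A, 1)
    X = A' / (norm(A, 1)*norm(A, Inf))   # classical starting guess
    for k = 1:maxiter
        R = eye(n) - A*X
        norm(R) <= tol && break
        X = X + X*R                      # same as X*(2I - A*X): only matrix multiplies
    end
    return X
end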
Methods for solving linear systems divide into direct methods and iterative methods.
The classical stationary iterative methods are Jacobi, Gauss-Seidel, SOR, and SSOR. Implementations are easy to find in many languages, but these methods must be used with care: rounding error plays an important role when A is far from normal. The following example is from Higham's book.
In [44]:
using Gadfly
In [72]:
_order = 100;
D = diagm(1.5*ones(_order));
L = diagm(ones(_order - 1),-1)
A = D + L;
b = 2.5*ones(_order);
# true solution
x_true = zeros(_order);
for i = 1:_order
    @inbounds x_true[i] = 1 - (-2/3)^i
end
# error and condition number of A
@printf("norm(Ax - b) = %4.16lf\n", norm(A*x_true - b))
@printf("cond(A) = %4.16lf\n", cond(A));
# SOR, w = 1.5
w = 1.5
M = 1/w*(D + w*L)
N = 1/w*(1-w)*D
# true solution as initial guess
x0 = copy(x_true)
# 250 iterations forward error
err = zeros(250)
for iter = 1:250
    @inbounds err[iter] = norm(x0 - x_true, Inf)/norm(x_true, Inf)
    # rounding error ruins everything
    x = M\(N*x0 + b)
    x0 = x
end
plot(x=collect(1:250), y = err)
Out[72]: [Gadfly plot of the forward error against the iteration number]
Krylov subspace methods work by minimizing the residual over the subspace spanned by successive matrix powers applied to the initial residual. The usual routines are pcg, gmres, and bicg.
In exact arithmetic the iteration terminates after at most N steps, where N is the size of the system. To accelerate convergence, a preconditioner is almost always used.
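To make the subspace idea concrete, here is a deliberately naive illustration (krylov_lsq is assumed for illustration only; it is numerically ill-conditioned, and real codes orthogonalize the basis via Arnoldi or Lanczos):
# minimize ||b - A*(x0 + K*y)|| over K_m = span{r0, A*r0, ..., A^(m-1)*r0}
function krylov_lsq(A::Matrix{Float64}, b::Vector{Float64}, m::Int)
    n = length(b)
    x0 = zeros(n)
    r0 = b - A*x0
    K = zeros(n, m)
    K[:, 1] = r0
    for j = 2:m
        K[:, j] = A*K[:, j-1]
    end
    y = (A*K) \ r0         # dense least squares gives the residual-minimizing coefficients
    return x0 + K*y
end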
Julia does not provide a Krylov subspace solver in Base. The routine below is a naive implementation of the (preconditioned) conjugate gradient method; it is also included in the src/linalg directory.
In [39]:
export cg
function cg(A::Union(AbstractMatrix, Function), b::Vector;
            tol::Real=1e-6, maxIter::Int64=1000,
            M::Union(AbstractMatrix, Function)=x->x, x0::Vector=[], out::Int64=0)
n = length(b)
nb = norm(b)
# allocate memory once
Ap = zeros(eltype(b),n)
Af = isa(A,Function) ? A : x->A_mul_B!(Ap, A, x)
Mf = isa(M,Function) ? M : x->M\x
if isempty(x0)
x0 = zeros(eltype(b),n)
r = copy(b)
else
r = b - Af(x0)
end
z = Mf(r)
p = copy(z)
resvec = zeros(maxIter)
iter = 1
flag = 1
@inbounds for iter=1:maxIter
Ap = Af(p)
gamma = dot(r,z)
alpha = gamma/dot(p,Ap)
# if A is ill-conditioned
if alpha==Inf || alpha<0
flag = 2; break
end
BLAS.axpy!(n,alpha,p,1,x0,1) # x = alpha*p+x
BLAS.axpy!(n,-alpha,Ap,1,r,1) # r -= alpha*Ap
resvec[iter] = norm(r)/nb
# converge
if resvec[iter] <= tol
flag = 0; break
end
z = Mf(r)
beta = dot(z,r)/gamma
# p = z + beta*p
p = BLAS.scal!(n,beta,p,1)
p = BLAS.axpy!(n,1.0,z,1,p,1)
end
# control output
if out>=0
if flag == 1
println("Does not converge within maximum iterations.")
elseif flag == 2
println("Matrix A in cg has to be positive definite.")
end
end
return x0,flag, resvec[iter], resvec, iter
end
Out[39]:
In [132]:
# the following example, run with MATLAB on my machine, takes 0.06 seconds.
# this routine keeps less bookkeeping (fewer flags) and takes only 0.025 seconds.
_siz = 512;
a = diagm(4*ones(_siz)) + diagm(-2*ones(_siz - 1), 1) + diagm(-2*ones(_siz - 1), -1); b = ones(_siz); x = zeros(_siz);
@time ret = cg(a, b,tol = 1e-6, maxIter = 100000, x0 = x);
Matrix multiplication of order n typically requires O(n^3) flops: a 2x2 product takes 8 multiplications and 4 additions, whereas Strassen's method (the Winograd variant implemented below) needs only 7 multiplications and 15 additions. Tests suggest that for large products (over 1024x1024) Strassen can be faster than gemm. The following function is a naive implementation, falling back to ordinary gemm below a threshold of 1024 (or 2048). It can compete with plain gemm but uses a lot of memory: about 1.5 GB for an 8192 x 8192 product, roughly the storage of three matrices of that size.
In [34]:
# Strassen's matrix multiplication algorithm.
# no peeling or padding, just for power of 2.
using Base.BLAS
mindim = 1024
function subm{T}(a::Matrix{T}, i::Int64, j::Int64, k::Int64, l::Int64)
return a[i:j, k:l]
end
function strassen!{T}(a::Matrix{T}, b::Matrix{T}, c::Matrix{T}, m::Int64, k::Int64, n::Int64, _mk::Matrix{T}, _kn::Matrix{T})
if m <= mindim || n <= mindim || k <= mindim
A_mul_B!(c, a, b);
return;
end
mt = int(m / 2)
kt = int(k / 2)
nt = int(n / 2)
mk = subm(_mk, 1, mt, 1, kt)
kn = subm(_kn, 1, kt, 1, nt)
# copying these submatrices of a, b, and c takes time and memory
@inbounds a11 = subm(a, 1, mt, 1, kt)
@inbounds a12 = subm(a, 1, mt, (kt + 1), k)
@inbounds a21 = subm(a, (mt + 1), m, 1 , kt)
@inbounds a22 = subm(a, (mt + 1), m, (kt + 1), k)
@inbounds b11 = subm(b, 1 ,kt, 1, nt)
@inbounds b12 = subm(b, 1, kt, (nt + 1), n)
@inbounds b21 = subm(b, (kt + 1), k, 1, nt)
@inbounds b22 = subm(b, (kt + 1), k, (nt + 1), n)
@inbounds c11 = subm(c, 1 ,mt, 1, nt)
@inbounds c12 = subm(c, 1, mt, (nt + 1), n)
@inbounds c21 = subm(c, (mt + 1), m , 1, nt)
@inbounds c22 = subm(c, (mt + 1), m, (nt + 1), n)
# s7, kn = b22 - b12
BLAS.blascopy!(kt*nt, b22, 1, kn, 1);
BLAS.axpy!(-1.0, b12, kn);
# s3, mk = a11 -a21
BLAS.blascopy!(mt*kt, a11, 1, mk, 1);
BLAS.axpy!(-1.0, a21, mk);
# m4, c21 = s3*s7 = mk*kn
strassen!(mk, kn, c21, mt, kt, nt, _mk, _kn);
# s1, mk = a21 + a22
BLAS.blascopy!(mt*kt, a21, 1, mk, 1)
BLAS.axpy!(1.0, a22, mk)
# s5, kn = b12 - b11
BLAS.blascopy!(kt*nt, b12,1, kn,1)
BLAS.axpy!(-1.0, b11, kn)
# m5,c22 = s1*s5
strassen!(mk, kn, c22, mt, kt, nt, _mk, _kn);
# s6, kn = b22 -s5 = b22 -kn
kn = b22 - kn
# s2, mk = s1 - a11 = mk -a11
BLAS.axpy!(-1.0, a11, mk)
# m1, c11 = s2*s6
strassen!(mk, kn, c11, mt, kt, nt, _mk, _kn);
# s4, mk = a12 - s2
mk = a12 - mk
# m6, c12 = s4*b22
strassen!(mk, b22, c12, mt, kt, nt, _mk, _kn);
# t3, c12 = m5 + m6 = c22 + c12
BLAS.axpy!(1.0, c22, c12)
# m2, mk = a11*b11
strassen!(a11, b11, mk, mt, kt, nt, _mk, _kn);
# t1, c11 = m1 + m2 = c11 + mk
BLAS.axpy!(1.0, mk, c11)
# c12 = t1+t3 = c11 + c12
BLAS.axpy!(1.0, c11, c12)
# t2, c11 = t1 + m4 = c11 + c21
BLAS.axpy!(1.0, c21, c11)
# s8, kn = s6 - b21 = kn - b21
BLAS.axpy!(-1.0, b21, kn)
# m7, c21 = a22*s8
strassen!(a22, kn, c21, mt, kt, nt, _mk, _kn)
# c21, t2 - m7 = c11 - c21
c21 = c11 - c21
# c22, t2 + m5 = c11 + c22
BLAS.axpy!(1.0, c11, c22)
# m3, c11 = a12*b21
strassen!(a12, b21, c11, mt, kt, nt, _mk, _kn)
# c11, m2 + m3 = mk + c11
BLAS.axpy!(1.0, mk, c11)
@inbounds c[1:mt, 1:nt] = c11
@inbounds c[1:mt, (nt+1):n] = c12
@inbounds c[(mt + 1):m , 1:nt] = c21
@inbounds c[(mt + 1):m, (nt + 1):n] = c22
return
end
Out[34]:
In [37]:
_siz = 4096;
# pre-allocate memory for objects
A = rand(_siz, _siz);
B = rand(_siz, _siz);
C = zeros(_siz,_siz);
_mk = zeros(int(_siz/2), int(_siz/2));
_kn = zeros(int(_siz/2), int(_siz/2));
In [36]:
@time A_mul_B!(C,A,B);
In [38]:
@time strassen!(A, B, C, _siz, _siz, _siz, _mk, _kn); gc();
The best asymptotic result now puts matrix multiplication at O(n^2.373), and this upper bound on the exponent improves very slowly. Coppersmith and Winograd obtained an exponent of 2.376, which has since been lowered to 2.3728639 through the work of Stothers, Vassilevska Williams, and Le Gall.