Linear algebra functions in Julia are largely implemented by calling functions from LAPACK. Sparse factorizations call functions from SuiteSparse.
For factorization algorithms, Julia already implements the common routines lu, qr, and chol. In general these routines call LAPACK subroutines for the decomposition. For example, lufact! calls LAPACK.getrf! to update the matrix in place. The algorithm uses a blocked (partitioned) factorization, which rearranges the work into matrix-matrix operations (BLAS level 3).
function lufact!{T<:BlasFloat}(A::StridedMatrix{T}; pivot = true)
    !pivot && return generic_lufact!(A, pivot = pivot)
    lpt = LAPACK.getrf!(A)
    return LU{T, typeof(A)}(lpt[1], lpt[2], lpt[3])
end
LAPACK.getrf! in turn calls getrf from liblapack (provided by GotoBLAS2):
ccall(($(string(getrf)), liblapack), Void,
      (Ptr{BlasInt}, Ptr{BlasInt}, Ptr{$elty}, Ptr{BlasInt},
       Ptr{BlasInt}, Ptr{BlasInt}),
      &m, &n, A, &lda, ipiv, info)
The declaration of getrf:
Fortran77:
call sgetrf( m, n, a, lda, ipiv, info )
call dgetrf( m, n, a, lda, ipiv, info )
call cgetrf( m, n, a, lda, ipiv, info )
call zgetrf( m, n, a, lda, ipiv, info )
Fortran95:
call getrf(a [, ipiv] [, info])
C:
lapack_int LAPACKE_<?>getrf(int matrix_layout, lapack_int m, lapack_int n, <datatype>* a, lapack_int lda, lapack_int* ipiv);
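At the Julia level the factorization is reached through lufact; a minimal usage sketch (the 0.3-era F[:L] indexing is assumed here):
A = rand(4, 4)
F = lufact(A)                   # dispatches to LAPACK.getrf! for BLAS floats
L, U, p = F[:L], F[:U], F[:p]   # triangular factors and row permutation
norm(L*U - A[p, :])             # residual on the order of machine epsilon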
LU factorization normally takes about (1/3) min(m,n)^2 (3 max(m,n) - min(m,n)) flops. When m = n, partial pivoting (as used in getrf) adds only O(n^2) comparisons in total, so it is the usual default choice.
A much more expensive strategy is complete pivoting; it is usually avoided because it requires O(n^3) comparisons, while in practice its results rarely differ much from partial pivoting.
A strategy between these two is rook pivoting: it alternately searches the current column for a pivot candidate and then the row of that candidate, and so on, until an entry is found that is the largest in magnitude in both its row and its column. In practice this costs only a small multiple of partial pivoting and achieves stability close to complete pivoting, although in the worst case rook pivoting can still require O(n^3) comparisons.
The growth factor is defined as the ratio of the largest (in magnitude) entry appearing at any stage k of the elimination to the largest entry of the original matrix. It is easy to show that with partial pivoting the growth factor is bounded by 2^(n-1), and this exponential growth is actually attained for certain matrices. Complete pivoting has a far more appealing, subexponential bound (Wilkinson's bound, roughly n^(1/2) * n^((1/4) log n)).
The growth factor is the standard criterion for judging how good a pivoting strategy is. In practice, however, unusually large growth factors are rare.
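As an illustration, the classic Wilkinson matrix attains the 2^(n-1) bound under partial pivoting; the helper gfpp below is only a sketch (not a Base function) that estimates the growth factor from the computed U:
# Wilkinson's example: unit diagonal, -1 below the diagonal, ones in the last column
function gfpp(n)
    A = tril(-ones(n, n), -1) + eye(n)
    A[:, n] = 1.0
    F = lufact(A)        # with ties resolved to the diagonal, no row swaps occur
    return maximum(abs(F[:U])) / maximum(abs(A))
end
gfpp(10)    # ≈ 2^9 = 512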
Julia provides chol for the Cholesky factorization of positive definite matrices; it calls LAPACK's compiled routines (pstrf! for the pivoted variant), and pivoting is disabled by default. chol(A) and chol(A, :U) return the upper triangular factor, while chol(A, :L) returns the lower triangular factor.
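A minimal usage sketch (assuming the 0.3-era chol API described above):
A = [4.0 2.0; 2.0 3.0]    # symmetric positive definite
U = chol(A)               # upper triangular, U'*U ≈ A
L = chol(A, :L)           # lower triangular, L*L' ≈ A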
Ax = b is solved as x = A\b. Gaussian elimination with partial pivoting (GEPP) solves for x without forming the inverse; explicitly computing A^(-1) is almost never necessary in practice. Julia's built-in \(A, b) solves the linear system directly: it uses an LU-based solver for dense matrices and UMFPACK for sparse matrices.
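A small sketch contrasting the dense and sparse paths (the matrix sizes and diagonal shifts here are arbitrary):
A = rand(100, 100) + 100*eye(100); b = rand(100);
x = A \ b                                        # dense: LU-based solve (GEPP)
As = sprandn(100, 100, 0.05) + 100*speye(100);
xs = As \ b                                      # sparse: UMFPACK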
However, GEPP is hard to parallelize efficiently; broadly speaking, any algorithm based on Gaussian elimination is difficult to parallelize.
The iteration-based Schulz method uses
X_{k+1} = X_k (2I - A X_k)
to invert A. This method involves only matrix multiplications, which parallelize well, and the iterates converge quadratically.
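A minimal sketch of the iteration (schulz_inv and its starting guess are illustrative assumptions, not a library routine):
# Newton-Schulz: X <- X*(2I - A*X); converges quadratically once norm(I - A*X0) < 1
function schulz_inv(A::Matrix{Float64}; maxiter::Int=50, tol::Float64=1e-12)
    n = size(A, 1)
    X = A' / (norm(A, 1)*norm(A, Inf))   # classical starting guess
    for k = 1:maxiter
        R = eye(n) - A*X
        norm(R) <= tol && break
        X = X + X*R                      # same as X*(2I - A*X): only matrix multiplies
    end
    return X
end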
Methods for solving linear systems divide into direct methods and iterative methods.
The classical stationary iterative methods are Jacobi, Gauss-Seidel, SOR, and SSOR. Implementations are easy to find in many languages, but these methods must be used with care: rounding error plays an important role when A is far from normal. The following example is from Higham's book.
In [44]:
using Gadfly
In [72]:
_order = 100;
D = diagm(1.5*ones(_order));
L = diagm(ones(_order - 1),-1)
A = D + L;
b = 2.5*ones(_order);
# true solution
x_true = zeros(_order);
for i = 1:_order
    @inbounds x_true[i] = 1 - (-2/3)^i
end
# error and condition number of A
@printf("norm(Ax - b) = %4.16lf\n", norm(A*x_true - b))
@printf("cond(A) = %4.16lf\n", cond(A));
# SOR, w = 1.5
w = 1.5
M = 1/w*(D + w*L)
N = 1/w*(1-w)*D
# true solution as initial guess
x0 = copy(x_true)
# 250 iterations forward error
err = zeros(250)
for iter = 1:250
    @inbounds err[iter] = norm(x0 - x_true, Inf)/norm(x_true, Inf)
    # rounding error ruins everything
    x = M\(N*x0 + b)
    x0 = x
end
plot(x=collect(1:250), y = err)
Out[72]: [Gadfly plot of the forward error against the iteration number]
Krylov subspace methods work by minimizing the residual over the subspace spanned by successive matrix powers applied to the initial residual. The usual routines are pcg, gmres, and bicg.
In exact arithmetic the iteration terminates after at most N steps, where N is the size of the system. To accelerate convergence, a preconditioner is almost always used.
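To make the subspace idea concrete, here is a deliberately naive illustration (krylov_lsq is assumed for illustration only; it is numerically ill-conditioned, and real codes orthogonalize the basis via Arnoldi or Lanczos):
# minimize ||b - A*(x0 + K*y)|| over K_m = span{r0, A*r0, ..., A^(m-1)*r0}
function krylov_lsq(A::Matrix{Float64}, b::Vector{Float64}, m::Int)
    n = length(b)
    x0 = zeros(n)
    r0 = b - A*x0
    K = zeros(n, m)
    K[:, 1] = r0
    for j = 2:m
        K[:, j] = A*K[:, j-1]
    end
    y = (A*K) \ r0         # dense least squares gives the residual-minimizing coefficients
    return x0 + K*y
end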
Julia does not provide a Krylov subspace solver in Base. The routine below is a naive implementation of the (preconditioned) conjugate gradient method; it is also included in the src/linalg directory.
In [39]:
export cg
function cg(A::Union(AbstractMatrix, Function), b::Vector;
            tol::Real=1e-6, maxIter::Int64=1000,
            M::Union(AbstractMatrix, Function)=x->x, x0::Vector=[], out::Int64=0)
n = length(b)
nb = norm(b)
# allocate memory once
Ap = zeros(eltype(b),n)
Af = isa(A,Function) ? A : x->A_mul_B!(Ap, A, x)
Mf = isa(M,Function) ? M : x->M\x
if isempty(x0)
x0 = zeros(eltype(b),n)
r = copy(b)
else
r = b - Af(x0)
end
z = Mf(r)
p = copy(z)
resvec = zeros(maxIter)
iter = 1
flag = 1
@inbounds for iter=1:maxIter
Ap = Af(p)
gamma = dot(r,z)
alpha = gamma/dot(p,Ap)
# if A is ill-conditioned
if alpha==Inf || alpha<0
flag = 2; break
end
BLAS.axpy!(n,alpha,p,1,x0,1) # x = alpha*p+x
BLAS.axpy!(n,-alpha,Ap,1,r,1) # r -= alpha*Ap
resvec[iter] = norm(r)/nb
# converge
if resvec[iter] <= tol
flag = 0; break
end
z = Mf(r)
beta = dot(z,r)/gamma
# p = z + beta*p
p = BLAS.scal!(n,beta,p,1)
p = BLAS.axpy!(n,1.0,z,1,p,1)
end
# control output
if out>=0
if flag == 1
println("Does not converge within maximum iterations.")
elseif flag == 2
println("Matrix A in cg has to be positive definite.")
end
end
return x0,flag, resvec[iter], resvec, iter
end
Out[39]:
In [132]:
# the following example, run with MATLAB on my machine, takes 0.06 seconds.
# this routine keeps less bookkeeping (fewer flags) and takes only 0.025 seconds.
_siz = 512;
a = diagm(4*ones(_siz)) + diagm(-2*ones(_siz - 1), 1) + diagm(-2*ones(_siz - 1), -1); b = ones(_siz); x = zeros(_siz);
@time ret = cg(a, b,tol = 1e-6, maxIter = 100000, x0 = x);
Matrix multiplication of order n typically requires O(n^3) flops: a 2x2 product takes 8 multiplications and 4 additions, whereas Strassen's method (the Winograd variant implemented below) needs only 7 multiplications and 15 additions. Tests suggest that for large products (over 1024x1024) Strassen can be faster than gemm. The following function is a naive implementation, falling back to ordinary gemm below a threshold of 1024 (or 2048). It can compete with plain gemm but uses a lot of memory: about 1.5 GB for an 8192 x 8192 product, roughly the storage of three matrices of that size.
In [34]:
# Strassen's matrix multiplication algorithm.
# no peeling or padding, just for power of 2.
using Base.BLAS
mindim = 1024
function subm{T}(a::Matrix{T}, i::Int64, j::Int64, k::Int64, l::Int64)
return a[i:j, k:l]
end
function strassen!{T}(a::Matrix{T}, b::Matrix{T}, c::Matrix{T}, m::Int64, k::Int64, n::Int64, _mk::Matrix{T}, _kn::Matrix{T})
if m <= mindim || n <= mindim || k <= mindim
A_mul_B!(c, a, b);
return;
end
mt = int(m / 2)
kt = int(k / 2)
nt = int(n / 2)
mk = subm(_mk, 1, mt, 1, kt)
kn = subm(_kn, 1, kt, 1, nt)
# copying these submatrices of a, b, and c takes time and memory
@inbounds a11 = subm(a, 1, mt, 1, kt)
@inbounds a12 = subm(a, 1, mt, (kt + 1), k)
@inbounds a21 = subm(a, (mt + 1), m, 1 , kt)
@inbounds a22 = subm(a, (mt + 1), m, (kt + 1), k)
@inbounds b11 = subm(b, 1 ,kt, 1, nt)
@inbounds b12 = subm(b, 1, kt, (nt + 1), n)
@inbounds b21 = subm(b, (kt + 1), k, 1, nt)
@inbounds b22 = subm(b, (kt + 1), k, (nt + 1), n)
@inbounds c11 = subm(c, 1 ,mt, 1, nt)
@inbounds c12 = subm(c, 1, mt, (nt + 1), n)
@inbounds c21 = subm(c, (mt + 1), m , 1, nt)
@inbounds c22 = subm(c, (mt + 1), m, (nt + 1), n)
# s7, kn = b22 - b12
BLAS.blascopy!(kt*nt, b22, 1, kn, 1);
BLAS.axpy!(-1.0, b12, kn);
# s3, mk = a11 -a21
BLAS.blascopy!(mt*kt, a11, 1, mk, 1);
BLAS.axpy!(-1.0, a21, mk);
# m4, c21 = s3*s7 = mk*kn
strassen!(mk, kn, c21, mt, kt, nt, _mk, _kn);
# s1, mk = a21 + a22
BLAS.blascopy!(mt*kt, a21, 1, mk, 1)
BLAS.axpy!(1.0, a22, mk)
# s5, kn = b12 - b11
BLAS.blascopy!(kt*nt, b12,1, kn,1)
BLAS.axpy!(-1.0, b11, kn)
# m5,c22 = s1*s5
strassen!(mk, kn, c22, mt, kt, nt, _mk, _kn);
# s6, kn = b22 -s5 = b22 -kn
kn = b22 - kn
# s2, mk = s1 - a11 = mk -a11
BLAS.axpy!(-1.0, a11, mk)
# m1, c11 = s2*s6
strassen!(mk, kn, c11, mt, kt, nt, _mk, _kn);
# s4, mk = a12 - s2
mk = a12 - mk
# m6, c12 = s4*b22
strassen!(mk, b22, c12, mt, kt, nt, _mk, _kn);
# t3, c12 = m5 + m6 = c22 + c12
BLAS.axpy!(1.0, c22, c12)
# m2, mk = a11*b11
strassen!(a11, b11, mk, mt, kt, nt, _mk, _kn);
# t1, c11 = m1 + m2 = c11 + mk
BLAS.axpy!(1.0, mk, c11)
# c12 = t1+t3 = c11 + c12
BLAS.axpy!(1.0, c11, c12)
# t2, c11 = t1 + m4 = c11 + c21
BLAS.axpy!(1.0, c21, c11)
# s8, kn = s6 - b21 = kn - b21
BLAS.axpy!(-1.0, b21, kn)
# m7, c21 = a22*s8
strassen!(a22, kn, c21, mt, kt, nt, _mk, _kn)
# c21, t2 - m7 = c11 - c21
c21 = c11 - c21
# c22, t2 + m5 = c11 + c22
BLAS.axpy!(1.0, c11, c22)
# m3, c11 = a12*b21
strassen!(a12, b21, c11, mt, kt, nt, _mk, _kn)
# c11, m2 + m3 = mk + c11
BLAS.axpy!(1.0, mk, c11)
@inbounds c[1:mt, 1:nt] = c11
@inbounds c[1:mt, (nt+1):n] = c12
@inbounds c[(mt + 1):m , 1:nt] = c21
@inbounds c[(mt + 1):m, (nt + 1):n] = c22
return
end
Out[34]:
In [37]:
_siz = 4096;
# pre-allocate memory for objects
A = rand(_siz, _siz);
B = rand(_siz, _siz);
C = zeros(_siz,_siz);
_mk = zeros(int(_siz/2), int(_siz/2));
_kn = zeros(int(_siz/2), int(_siz/2));
In [36]:
@time A_mul_B!(C,A,B);
In [38]:
@time strassen!(A, B, C, _siz, _siz, _siz, _mk, _kn); gc();
The best asymptotic result now puts matrix multiplication at O(n^2.373), and this upper bound on the exponent improves very slowly. Coppersmith and Winograd obtained an exponent of 2.376, which has since been lowered to 2.3728639 through the work of Stothers, Vassilevska Williams, and Le Gall.