Flexible matrix realignment

In this document, we will discuss a convenient (and hopefully efficient) way of realigning the matrix elements of the different tensors entering the CCD amplitude equation, so that all diagrams may be calculated by matrix multiplication.

In the current implementations (in python/numpy and C++/Armadillo), I utilize sparse matrices to store all elements. This has an advantage in memory consumption, since only nonzero elements are stored. The current drawback seems to be the efficiency of the matrix multiplications: this step consumes notably more time for sparse matrices than in the block diagonalization approach, due to the way the elements are stored.

The CCD amplitude equation

The $T_2$ amplitude equation for the CCD truncation is

$$ t^{ab}_{ij} \epsilon^{ab}_{ij} = \langle ab \vert \vert ij\rangle + \frac{1}{2} \sum_{cd} \langle ab \vert \vert cd \rangle t^{cd}_{ij} + \frac{1}{2} \sum_{kl} \langle kl \vert \vert ij \rangle t^{ab}_{kl} + \hat{P}(ij \vert ab) \sum_{kc} \langle kb \vert \vert cj \rangle t^{ac}_{ik} + \frac{1}{4} \sum_{klcd} \langle kl \vert \vert cd \rangle t^{cd}_{ij} t^{ab}_{kl} + \hat{P}(ij) \sum_{klcd} \langle kl \vert \vert cd \rangle t^{ac}_{ik} t^{bd}_{jl} - \frac{1}{2} \hat{P}(ij) \sum_{klcd} \langle kl \vert \vert cd \rangle t^{dc}_{ik} t^{ab}_{lj} - \frac{1}{2} \hat{P}(ab)\sum_{klcd}\langle kl \vert \vert cd \rangle t^{ac}_{lk} t^{db}_{ij} $$

This equation is commonly solved iteratively: one guesses some initial amplitudes, typically the ones corresponding to the MBPT(2) energy, and then evaluates the RHS to yield the new approximation. The iterative process is halted when some convergence criterion is fulfilled, for example when the change in amplitudes between iterations falls below some threshold, or when the resulting energy converges towards a fixed value.
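
As a schematic illustration (not the actual fermicc code), such an iteration loop could look like the sketch below, assuming a hypothetical function rhs(t) that evaluates the full right-hand side for the current amplitudes:

In [ ]:
import numpy as np

def solve_amplitudes(rhs, t_init, eps_abij, tol=1e-8, max_iter=100):
    # rhs(t) is a hypothetical callable returning the full RHS of the amplitude
    # equation evaluated at the current amplitudes t; eps_abij holds the energy
    # denominators epsilon^{ab}_{ij} (same shape as t).
    t = t_init
    for iteration in range(max_iter):
        t_new = rhs(t) / eps_abij
        # halt when the largest change in any amplitude falls below the threshold
        if np.max(np.abs(t_new - t)) < tol:
            return t_new
        t = t_new
    return t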

The structure of this equation motivates a subdivision into terms that are linear and quadratic in the unknown amplitudes $t^{ab}_{ij}$.

By denoting linear terms by $L_n$ and quadratic terms by $Q_n$, we may then express the equation as

$$ t^{ab}_{ij} \epsilon^{ab}_{ij} = L_1(t^{ab}_{ij}) + L_2(t^{ab}_{ij}) + L_3(t^{ab}_{ij}) + Q_1(t^{ab}_{ij}) + Q_2(t^{ab}_{ij}) + Q_3(t^{ab}_{ij}) + Q_4(t^{ab}_{ij}) $$

Each term on the RHS above corresponds to one of the diagrams formed by contracting the similarity-transformed Hamiltonian with the exponential ansatz.

In the following sections, we will discuss how these terms may be calculated as matrix multiplications.

Setting up the matrices

Every factor in the terms above is a tensor of rank 4. To utilize the advantages of matrix multiplications, we need to represent these tensors as matrices (tensors of rank 2). In effect, we need an unambiguous mapping from rank 4 to rank 2.

One very straightforward such mapping is derived by considering the indices and their associated lengths:

$$ p \in \{0, 1, 2, 3, \dots, N_p - 1\} $$

A matrix has only two indices, so we map the amplitudes and interactions onto matrices via, for example,

$$ \langle pq || rs \rangle = \langle p + q N_p || r + s N_r \rangle $$

The rows of the matrix are now contained in the bra part of the RHS above, and the columns in the ket.
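
As a small numpy illustration of this mapping (dense arrays and made-up dimensions, for clarity only), the explicit loops and a column-major reshape produce the same matrix:

In [ ]:
import numpy as np

# Hypothetical dimensions and a random rank 4 tensor standing in for <pq||rs>
Np, Nq, Nr, Ns = 3, 3, 5, 5
V4 = np.random.rand(Np, Nq, Nr, Ns)

# Explicit mapping: rows indexed by p + q*Np, columns by r + s*Nr
V2 = np.zeros((Np * Nq, Nr * Ns))
for p in range(Np):
    for q in range(Nq):
        for r in range(Nr):
            for s in range(Ns):
                V2[p + q * Np, r + s * Nr] = V4[p, q, r, s]

# The same mapping via a column-major (Fortran-order) reshape
assert np.allclose(V2, V4.reshape(Np * Nq, Nr * Ns, order='F'))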


Example usage

Since a high-accuracy calculation will involve a large number of particle states, the $L_1$ term is computationally expensive (it is a sum over particle states). A naive implementation will typically consist of four nested for-loops (over $a$, $b$, $c$ and $d$) for every pair $(i, j)$, and will require a large number of floating-point operations.

If we instead map it onto matrices, so that

$$\sum_{cd} \langle ab \vert \vert cd \rangle t^{cd}_{ij} = \sum_{cd} \langle a+ bN_a \vert \vert c+dN_c \rangle t^{c+dN_c}_{i+jN_i} = \sum_{\gamma} \langle \alpha \vert \vert \gamma \rangle t^{\gamma}_{\beta}$$

we easily see that the whole diagram may be calculated as a matrix multiplication

$$L_1 = V^{\alpha}_\gamma * T^\gamma_\beta$$
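
A minimal dense-numpy sketch of this, with made-up dimensions (not the sparse production code), comparing the explicit contraction with the reshaped matrix product:

In [ ]:
import numpy as np

# Hypothetical dimensions: Nh hole states, Na particle states
Nh, Na = 4, 10

# Dense stand-ins for <ab||cd> and t^{cd}_{ij}
V_abcd = np.random.rand(Na, Na, Na, Na)
t_cdij = np.random.rand(Na, Na, Nh, Nh)

# Reference evaluation: explicit sums over c and d for every (a, b, i, j)
L1_ref = np.einsum('abcd,cdij->abij', V_abcd, t_cdij)

# Matrix version: (a,b) -> a + b*Na, (c,d) -> c + d*Na, (i,j) -> i + j*Nh
V_mat = V_abcd.reshape(Na * Na, Na * Na, order='F')
T_mat = t_cdij.reshape(Na * Na, Nh * Nh, order='F')
L1_mat = V_mat @ T_mat

# Map the resulting matrix back to a rank 4 tensor and compare
assert np.allclose(L1_ref, L1_mat.reshape(Na, Na, Nh, Nh, order='F'))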

Flexible indexing

Because of the way the indices occur in the contributions $L_1$, $L_2$ and $Q_1$, these diagrams are especially straightforward to set up as matrix multiplications. For the remaining diagrams, we will need to realign the matrix elements before the multiplication is performed.

In the following we will utilize the following matrix notation:

$$t^{cd}_{ij} \equiv | cd \rangle T \langle ij |$$

The $L_1$ term is then calculated as

$$L_1 = (| a + bN_a\rangle V \langle c + dN_c | )( | c + dN_c \rangle T \langle i + jN_i|) $$

The alignment problem arises for diagrams where the indices do not occur in corresponding bras and kets as they do for $L_1$ above. For example, we have for the $L_3$ term (prefactors and permutations left out)

$$L_3 = \sum_{kc} \langle kb \vert \vert cj \rangle t^{ac}_{ik} $$

To set up a matrix for V and T, we need the indices $k$ and $c$ to coincide in the bra and ket of V and T respectively. In effect, we want

$$\tilde{L}_3 = (| b + jN_b\rangle V \langle k + cN_k | )( | k + cN_k \rangle T \langle a + iN_a|) $$

Here, the indices of $\tilde{L}_3$ do not align with the indices of $L_3$, but since the elements are the same we only need to perform a simple realignment back to the original row and column indexing:

$$\tilde{L}_3 \rightarrow L_3$$

The process is perhaps most easily understood as a generalized transpose for tensors of rank $> 2$.
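
The following dense-numpy sketch (made-up dimensions, prefactors and permutations omitted) illustrates the realignment and back-alignment for $L_3$ using transposes and column-major reshapes; the exact pairing of indices is one possible choice and should be adapted to the implementation at hand:

In [ ]:
import numpy as np

Nh, Na = 4, 10   # hypothetical numbers of hole and particle states

# Dense stand-ins for <kb||cj> (indices k,b,c,j) and t^{ac}_{ik} (indices a,c,i,k)
V_kbcj = np.random.rand(Nh, Na, Na, Nh)
t_acik = np.random.rand(Na, Na, Nh, Nh)

# Reference: the contraction over k and c
L3_ref = np.einsum('kbcj,acik->abij', V_kbcj, t_acik)

# Realign so that the summation pair (k, c) lines up:
# Vtilde rows (b, j), columns (k, c); Ttilde rows (k, c), columns (a, i)
V_tilde = V_kbcj.transpose(1, 3, 0, 2).reshape(Na * Nh, Nh * Na, order='F')
T_tilde = t_acik.transpose(3, 1, 0, 2).reshape(Nh * Na, Na * Nh, order='F')

# Multiply, map back to rank 4 (axes b, j, a, i) and realign to (a, b, i, j)
L3_tilde = (V_tilde @ T_tilde).reshape(Na, Nh, Na, Nh, order='F')
assert np.allclose(L3_ref, L3_tilde.transpose(2, 0, 3, 1))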

Subdivision of the interaction matrix

As may be seen from the CCD equation, only certain elements from the interaction matrix actually enter the calculation.

This makes it convenient to subdivide the interaction matrix into the blocks

$$ V^{pp}_{pp}, \quad V^{hh}_{hh}, \quad V^{hh}_{pp}, \quad V^{pp}_{hh}, \quad V^{hp}_{ph} $$

In these matrices, $h$ and $p$ denote hole and particle states. Their locations in the full interaction matrix are visualized below.

|      | hh            | hp  | ph            | pp            |
|------|---------------|-----|---------------|---------------|
| hh   | $V^{hh}_{hh}$ |     |               | $V^{hh}_{pp}$ |
| hp   |               |     | $V^{hp}_{ph}$ |               |
| ph   |               |     |               |               |
| pp   | $V^{pp}_{hh}$ |     |               | $V^{pp}_{pp}$ |

Many symmetries are also present among these elements, so we do not actually need to calculate all of these blocks prior to the CCD calculation.

The actual alignments

We will now derive the various realignments we need to perform to solve the full CCD amplitude equation by matrix multiplications. These expressions are meant to correspond with the actual implementation in fermicc, so instead of using the physical indices directly, we use $p, q, r, s$ for the first, second, third and fourth index, respectively (as they occur in the tensor's "native" representation).

The realignments/permutations are listed in the table below. Note that any prefactors or permutations are omitted from the expressions.

Table: Realignment of matrix elements in interaction and amplitudes

| Diagram | Expression | Aligned multiplication | Interaction | Amplitude (1) | Amplitude (2) | Final realignment |
|---------|------------|------------------------|-------------|---------------|---------------|-------------------|
| $L_1$ | $\sum_{cd} \langle ab \vert \vert cd \rangle t^{cd}_{ij}$ | $( ab \rangle V \langle cd )( cd \rangle T \langle ij )$ | | | | |
| $L_2$ | $\sum_{kl} \langle kl \vert \vert ij \rangle t^{ab}_{kl}$ | $( ab \rangle T \langle kl )( kl \rangle V \langle ij )$ | | | | |
| $L_3$ | $\sum_{kc} \langle kb \vert \vert cj \rangle t^{ac}_{ik}$ | $( jb \rangle \tilde{V} \langle ck )( ck \rangle \tilde{T} \langle ai )$ | $V^{pq}_{rs} = \tilde{V}^{sq}_{rp}$ | $T^{pq}_{rs} = \tilde{T}^{qs}_{pr}$ | | $(L_3)^{pq}_{rs} = \tilde{(L_3)}^{sq}_{pr}$ |
| $Q_1$ | $\sum_{klcd} \langle kl \vert \vert cd \rangle t^{cd}_{ij} t^{ab}_{kl}$ | $( ab \rangle T \langle kl )( kl \rangle V \langle cd )( cd \rangle T \langle ij )$ | | | | |
| $Q_2$ | $\sum_{klcd} \langle kl \vert \vert cd \rangle t^{ac}_{ik} t^{bd}_{jl}$ | $( ai \rangle T \langle kc )( kc \rangle V \langle ld )( ld \rangle T \langle bj )$ | $V^{pq}_{rs} = \tilde{V}^{pr}_{qs}$ | $T^{pq}_{rs} = \tilde{T}^{pr}_{sq}$ | $T^{pq}_{rs} = \tilde{T}^{rq}_{ps}$ | $(Q_2)^{pq}_{rs} = \tilde{(Q_2)}^{pr}_{qs}$ |
| $Q_3$ | $\sum_{klcd} \langle kl \vert \vert cd \rangle t^{dc}_{ik} t^{ab}_{lj}$ | $( abj \rangle T \langle l )( l \rangle V \langle kcd )( kcd \rangle T \langle i )$ | $V^{pq}_{rs} = \tilde{V}^{q}_{prs}$ | $T^{pq}_{rs} = \tilde{T}^{pqs}_{r}$ | $T^{pq}_{rs} = \tilde{T}^{sqp}_{r}$ | $(Q_3)^{pq}_{rs} = \tilde{(Q_3)}^{pqs}_{r}$ |
| $Q_4$ | $\sum_{klcd} \langle kl \vert \vert cd \rangle t^{ac}_{lk} t^{db}_{ij}$ | $( a \rangle T \langle klc )( klc \rangle V \langle d )( d \rangle T \langle bij )$ | $V^{pq}_{rs} = \tilde{V}^{pqr}_{s}$ | $T^{pq}_{rs} = \tilde{T}^{p}_{srq}$ | $T^{pq}_{rs} = \tilde{T}^{p}_{qrs}$ | $(Q_4)^{pq}_{rs} = \tilde{(Q_4)}^{p}_{qrs}$ |
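
All of the permutations in the table can be driven by a single small helper. The sketch below (dense numpy, assuming the first-index-fastest mapping used above, so the exact convention may need to be adapted to the implementation) takes the permutation as a string read against the native order $pqrs$:

In [ ]:
import numpy as np

def realign(tensor, perm, order='F'):
    # Realign a rank 4 tensor into a matrix. 'perm' is read against the native
    # index order 'pqrs'; the first two characters form the row index and the
    # last two the column index (first index varying fastest, as in p + q*Np).
    axes = ['pqrs'.index(c) for c in perm]
    t = tensor.transpose(axes)
    n0, n1, n2, n3 = t.shape
    return t.reshape(n0 * n1, n2 * n3, order=order)

# Example: the interaction realignment for L3, V^{pq}_{rs} -> Vtilde^{sq}_{rp}
V = np.random.rand(3, 4, 5, 6)     # hypothetical <pq||rs> block
V_tilde = realign(V, 'sqrp')       # rows indexed by (s, q), columns by (r, p)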

Implementation

A standard format for representing sparse matrices is the COOrdinate (COO) format. In this format the matrix elements are stored in three arrays: one for the actual values, and one for each of the two coordinates (row and column) in the matrix. All elements not stored in these arrays are by default zero.

For example, the matrix below

|     | 0 | 1 | 2 | 3 |
|-----|---|---|---|---|
| 0   |   | 1 |   | 3 |
| 1   |   |   |   |   |
| 2   |   |   |   |   |
| 3   | 2 |   |   |   |

may be represented in the following way:

| element | 0 | 1 | 2 |
|---------|---|---|---|
| values  | 1 | 3 | 2 |
| row     | 0 | 0 | 3 |
| column  | 1 | 3 | 0 |
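
For reference, this is exactly how such a matrix is assembled with scipy.sparse in COOrdinate format:

In [ ]:
import numpy as np
from scipy.sparse import coo_matrix

# The small 4x4 example above, assembled in COOrdinate format
values = np.array([1, 3, 2])
rows   = np.array([0, 0, 3])
cols   = np.array([1, 3, 0])

M = coo_matrix((values, (rows, cols)), shape=(4, 4))
print(M.toarray())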

This is, however, not the most space-efficient representation, so formats such as CSC (compressed sparse column) or CSR (compressed sparse row) are commonly utilized. Armadillo's sp_mat object is of the first type, meaning that the array containing the column indices is compressed: it is replaced by an array of pointers to the first stored element in each column. This way, the position in the column pointer array denotes the actual column in which the following elements occur (up until the pointer for the next column). To access the actual column index of any element in the matrix, we therefore have to unpack this array first.

http://netlib.org/linalg/html_templates/node92.html

One such unpacking procedure written for the python/numpy implementation of the CCD calculation is given below


In [3]:
from numpy import zeros

def unpack_indptr(indptr):
    # Unpack compressed indices:
    # indptr is the compressed pointer array; the returned array holds the
    # (column) index of every stored element.
    I = zeros(indptr[-1], dtype=int)
    for i in range(len(indptr)-1):
        I[indptr[i]:indptr[i+1]] = i
    return I
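
A small usage example on the matrix from above, stored as a scipy csc_matrix (whose indptr, indices and data attributes play the same roles as in Armadillo's CSC storage):

In [ ]:
import numpy as np
from scipy.sparse import csc_matrix

# The example matrix from above, stored in CSC format
M_csc = csc_matrix(np.array([[0, 1, 0, 3],
                             [0, 0, 0, 0],
                             [0, 0, 0, 0],
                             [2, 0, 0, 0]]))

col = unpack_indptr(M_csc.indptr)   # column index of every stored element: [0, 1, 3]
row = M_csc.indices                 # row indices are stored explicitly in CSC: [3, 0, 0]
val = M_csc.data                    # the stored values: [2, 1, 3]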

When the matrices are represented in this way, the realignment procedure becomes trivial. Given the elements together with their row and column indices, we may easily obtain the rank 4 indices, as is done in the following Python example:


In [4]:
from numpy import *

# Tensor dimensions
Np = 14
Nq = 14
Nr = 14
Ns = 14

# Row and column indices of the stored matrix elements
row = arange(Np*Nq)
col = arange(Nr*Ns)

# Recover the rank 4 indices from the mapping row = p + q*Np, col = r + s*Nr
p = row % Np
q = row // Np
r = col % Nr
s = col // Nr

Next up, an example of how to recast the elements to a "realigned" matrix


In [5]:
row = q + r*Nq
col = s + p*Ns

#New matrix dimensions: (Nq*Nr), (Ns*Np)
#Matrix elements "val" remain the same
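
Given these realigned coordinates, the new sparse matrix can be assembled directly in COOrdinate format. The sketch below assumes the arrays row, col and the dimensions from the previous cells are in scope, with hypothetical element values val:

In [ ]:
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical nonzero values, one per (row, col) pair defined above
val = np.random.rand(len(row))

# Assemble the realigned matrix directly in COOrdinate format;
# no elements are moved, only their coordinates are reinterpreted
M_realigned = coo_matrix((val, (row, col)), shape=(Nq * Nr, Ns * Np))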

Density of the interactions

Due to the many Kronecker deltas occurring in the HEG interaction, we find that the interaction matrix has a density below approximately 1%, meaning that this is a case where utilizing sparse storage is reasonable.
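
A small helper for measuring this (works on any scipy sparse matrix, e.g. the realigned matrix assembled above):

In [ ]:
def density(sparse_matrix):
    # Fraction of explicitly stored (nonzero) elements in a scipy sparse matrix
    n_rows, n_cols = sparse_matrix.shape
    return sparse_matrix.nnz / (n_rows * n_cols)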

Pros and cons of the sparse scheme

The main drawback of the sparse approach is, however, the efficiency of the matrix multiplications when compared to the block diagonalization scheme. When block diagonalized, the calculations may be performed in mere seconds.

The main advantage of this approach is, to my expectation, that the implementation of the CCSDT truncation will be very straightforward code-wise, low on memory consumption, and possible to perform also as a serial computation (unless the computational cost exceeds the expected progression).

There are a number of routes that may possibly speed up the sparse matrix multiplications: I could continue optimizing the COO scheme, find a replacement for Armadillo, or possibly try a CUDA implementation.

One route is to implement a subdivision of the matrix elements similar to the one used in the block diagonal scheme. By skipping the cast of the elements to a sparse matrix in the first place, I attempted to perform the multiplication directly on the elements stored in COOrdinate format, with a subdivision similar to the block diagonal scheme. This reduced the time spent on multiplication to a factor of about 0.007 of the naive implementation, but the native scipy.sparse multiplication operation still outperformed this optimized algorithm by a factor of about 0.001 (meaning a factor of about 0.00007 compared to the naive scheme).

On the other hand, Armadillo's native sparse multiplication did not compare well to the scipy algorithm, as it was only about a factor of 0.1 faster than the optimized algorithm.

Benchmarking sparse multiplication

| Library/Algorithm | Time (s)          | Fractional time        |
|-------------------|-------------------|------------------------|
| Naïve             | 2.07011318512     | 1                      |
| Optimized         | 0.042023196118    | 0.02029995094957284    |
| Scipy             | 0.000879618572071 | 0.00042491327449808525 |
| Armadillo         | 0.025932          | 0.012526851278664156   |

The table lists the time spent on performing a multiplication of the $V_{pp}^{pp}$ matrix with itself, with a dimension of 576x576 and a density (ratio of nonzero elements) of about 0.75%.

It should be noted that the results from Armadillo are not directly comparable, as they were calculated in a virtual Ubuntu environment (Oracle VM VirtualBox, no cap on CPU and plenty of RAM).

Considering alternative libraries

Mainly three libraries for the C++ implementation are up for consideration:

  1. Eigen (http://eigen.tuxfamily.org/index.php?title=Main_Page)
  2. CSparse (http://www.cise.ufl.edu/research/sparse/CSparse/)
  3. cuSPARSE (https://developer.nvidia.com/cuSPARSE)

The first two seem to have optimized support for sparse matrix multiplication (and a lot of other functionality). The third certainly meets this criterion, but it also depends on the GPU/CUDA. Scaling should also be taken into account, as GPU-based libraries such as cuSPARSE may or may not (https://www.cs.fsu.edu/research/projects/aseshpande.pdf, https://www.cs.fsu.edu/research/projects/rose_report.pdf) perform slower for small systems compared to libraries utilizing the CPU. Another consideration is that the upcoming CCSDT implementation should be expected to require parallelization (in MPI), so it will be a great advantage if the implementation is easily extended to parallel computation.

Notes, April 10th

Morten advised that I optimize the $L_1$ calculation, as this should naturally be the bottleneck in the algorithm.

Just to confirm that this was the case, I timed each step in the "advance" (amplitude) algorithm, and found to my surprise that the $Q_3$ calculation took more than 100 times longer than the other diagrams.

This actually makes sense, since the implementation uses sparse matrices with compressed rows. As $Q_4$ has equivalent complexity but the reversed ordering of matrix elements (making the columns dominate), we see, as expected, that this calculation is correspondingly faster than the others.

$L_1$ is quite slow, but not unreasonably so. Comparing $L_1$ with $Q_3$, we find a factor-of-10 difference.

These findings will become central in the optimization process, and they should also be considered an argument for changing from Armadillo to Eigen3, since Eigen supports both CSR and CSC formats, the latter of which would be superior for calculating this diagram.

We could, however, possibly just transpose the matrices differently before the calculation to save some time. This should be tried out soon. (Tried it out right away with great success: it gave an algorithmic speedup factor of 500.)
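
A scipy sketch of the underlying idea, using the identity $(AB)^T = B^T A^T$ to compute the same product in the opposite storage orientation (the factor-500 speedup quoted above was measured in the Armadillo implementation, not with this snippet; sizes and density here are made up):

In [ ]:
from scipy.sparse import random as sparse_random

# Hypothetical sparse operands in CSR format
A = sparse_random(576, 576, density=0.01, format='csr')
B = sparse_random(576, 196, density=0.01, format='csr')

# The same product computed in two orientations, using (A B)^T = B^T A^T;
# depending on the compressed storage direction (CSR vs CSC), one of these
# can be considerably faster than the other
C_direct = A @ B
C_via_transpose = (B.T @ A.T).T

assert abs(C_direct - C_via_transpose).max() < 1e-10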

What is the most efficient kind of multiplication of CSR matrices available in Armadillo? The above shows that orientation/transpose plays an important role, so we should identify the most efficient orientation to utilize this fact on all matrix products.

We need some kind of notation for a "general" transpose, that is, symmetry transformations of tensors with rank greater than 2. We also want this notation to be clear on how the final tensor is mapped to a matrix.

A transpose is normally denoted

$$ (\vert \alpha \rangle M \langle \beta \vert)^T = \vert \beta \rangle M \langle \alpha \vert $$

This is unambiguous in the rank 2 case, since we only have two indices to permute.

In the case of a cube/tensor of rank 3 we may however consider a similar operation for each "slice", but we would then also need to know the axis, or which indices to permute. (Note: Is this just a permutation, as performed in some of the diagrams?)

By comparing different ways of performing the multiplication $V_{pp}^{hh}$ ...

A comparison of convergence

The following table shows a comparison of two different implementations of the CCD (CCSD) for the HEG, using only 2 hole states and 12 particle states.

| Iteration | Sparse    | Gauss-Stanton (naive) |
|-----------|-----------|-----------------------|
| 0         | -0.015399 | -0.015399             |
| 1         |           |                       |
| 2         |           |                       |
| 3         |           |                       |
| 4         |           |                       |
| 5         |           |                       |