
A Time- and Memory-Efficient Solution to $\text{cholesky}\left[A\otimes B\right]\varepsilon$

This post discusses an efficient solution to $\text{cholesky}\left[A\otimes B\right]\varepsilon$, where $A$ and $B$ are real, symmetric, positive definite matrices and $\varepsilon$ is an appropriately-sized vector. Why would anyone ever want to compute this thing? Basically, an expression like this pops up if you want to sample from a multivariate normal distribution with covariance matrix $A\otimes B$: if $\varepsilon\sim N\left(0,I\right)$ and $L=\text{cholesky}\left[A\otimes B\right]$, then $L\varepsilon\sim N\left(0,A\otimes B\right)$.

Figure 1. We’ll quickly run into memory errors if we blindly compute $A\otimes B$. The projected matrix size in gigabytes assumes a 64-bit float.

For me, this expression arose as a bottleneck in a project that I’m working on involving a Bayesian VAR with a Normal-Inverse-Wishart prior. What’s wrong with using regular NumPy to compute np.linalg.cholesky(np.kron(A, B)).dot(e)? This matrix gets very big very fast. You can see this in Figure 1, which considers the BVAR case with $p$ lags and $n$ variables per period. In the Normal-Inverse-Wishart case the posterior covariance matrix has size $n(n\cdot p+1)\times n(n\cdot p+1)$. The right panel says that we need about 10,000 gigabytes to compute $A\otimes B$ when $n=200$ and $p=13$. I want to be able to accommodate specifications of this magnitude. To do so, though, I’m going to need an alternative to np.linalg.cholesky(np.kron(A, B)).dot(e).
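For a rough sense of scale, here’s the back-of-the-envelope calculation behind Figure 1 (the helper name below is mine, not from the project code):

def kron_size_gb(n: int, p: int) -> float:
  """Projected size in GB of the dense [n(np+1) x n(np+1)] posterior covariance
  matrix, assuming 64-bit (8-byte) floats."""
  m = n * (n * p + 1)
  return m * m * 8 / 1e9


# Even n = 100 variables with p = 13 lags already needs ~135 GB.
print(kron_size_gb(100, 13))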

A Solution

The key to solving this problem is to avoid explicitly computing $A \otimes B$. This can be done by observing that

$$\text{cholesky}\left[ A\otimes B\right] = \text{cholesky}[A]\otimes \text{cholesky}[B].$$

Let’s prove this. Recall that the Cholesky decomposition of a real, symmetric, positive definite matrix $X$ is a lower-triangular matrix $L$ with real, positive diagonal entries such that $$X = LL^{\prime}.$$ (Cholesky decompositions can exist in other cases, but this is the only case we are interested in. Note also that the Cholesky decomposition always exists, and is unique, when a matrix satisfies the properties placed on $X$.) Now define $L_A$ and $L_B$ so that $A = L_A L_A^{\prime}$ and $B = L_B L_B^{\prime}$, and define $L_C := L_A \otimes L_B$. Notice that $L_C$ is also lower-triangular (a Kronecker product of lower-triangular matrices is lower-triangular) and that, since $L_A$ and $L_B$ have real, positive diagonal entries, the diagonal entries of $L_C$ are real and positive too. Consider the following:

\begin{align} L_C L_C^{\prime} & = (L_A \otimes L_B)(L_A \otimes L_B)^{\prime} \newline & =(L_A \otimes L_B)(L_A^{\prime} \otimes L_B^{\prime}) \newline & =(L_A L_A^{\prime} \otimes L_B L_B^{\prime}) \newline & =A\otimes B \end{align}

From this (and uniqueness) we see that $L_C$ is the Cholesky decomposition of $A\otimes B$.
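If you’d rather check this numerically than take the algebra on faith, a quick sanity check with small random SPD matrices looks something like this (sizes are arbitrary):

import numpy as np

# Verify cholesky(kron(A, B)) == kron(cholesky(A), cholesky(B)) numerically.
rng = np.random.default_rng(42)
A = rng.standard_normal((4, 4))
A = A @ A.T + 4 * np.eye(4)  # make A symmetric positive definite
B = rng.standard_normal((3, 3))
B = B @ B.T + 3 * np.eye(3)  # make B symmetric positive definite

lhs = np.linalg.cholesky(np.kron(A, B))
rhs = np.kron(np.linalg.cholesky(A), np.linalg.cholesky(B))
assert np.allclose(lhs, rhs)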

What does this fact give us? Two things:

  1. We avoid explicitly computing the full Kronecker product.
    • Clearly, this saves on memory.
    • Moreover, since the matrices that we are multiplying are lower-triangular, we can skip multiplicative terms where an input is known to be zero.
  2. Computing two (relatively small) Cholesky factorizations is much cheaper than computing the Cholesky factorization of $A\otimes B$ directly (see the rough operation counts below).
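On the second point: the Cholesky factorization of an $m\times m$ matrix takes on the order of $m^{3}/3$ floating-point operations, so the saving is easy to quantify. A rough sketch with illustrative sizes:

# Rough operation counts: Cholesky of an m x m matrix costs ~ m**3 / 3 flops.
mA, mB = 300, 200  # illustrative sizes only
flops_direct = (mA * mB) ** 3 / 3      # factor kron(A, B) in one shot
flops_split = (mA ** 3 + mB ** 3) / 3  # factor A and B separately
print(flops_direct / flops_split)      # ~6.2 million times more work for the direct route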

Here is a straightforward Python implementation of the solution to $\text{cholesky}\left[A\otimes B\right]\varepsilon$:

import numpy as np
from numpy.linalg import cholesky

# Simple type aliases (both are float64 NumPy arrays).
FMATRIX = np.ndarray
FVECTOR = np.ndarray


def chol_kron_ab_e(A: FMATRIX, B: FMATRIX, e: FVECTOR) -> FVECTOR:
  """Solution to `cholesky[kron[A, B]] * e`.
  Args:
      A (FMATRIX): Real, symmetric, positive definite matrix of size [mA, mA]
      B (FMATRIX): Real, symmetric, positive definite matrix of size [mB, mB]
      e (FVECTOR): Vector with mA * mB entries
  Returns:
      FVECTOR: Vector with mA * mB entries, the solution to `cholesky[kron[A, B]] * e`
  """
  mA = A.shape[0]
  mB = B.shape[0]
  L_a = cholesky(A)
  L_b = cholesky(B)
  e = e.ravel()  # Accept either a flat vector or an [mA * mB, 1] column.

  out = np.zeros(mA * mB)

  # Entry [mB * ii + hh, mB * jj + kk] of kron(L_a, L_b) is L_a[ii, jj] * L_b[hh, kk].
  # Both factors are lower-triangular, so only terms with jj <= ii and kk <= hh
  # can be non-zero; everything else is skipped.
  for ii in range(mA):
    for jj in range(ii + 1):
      for hh in range(mB):
        for kk in range(hh + 1):
          out[mB * ii + hh] += L_a[ii, jj] * L_b[hh, kk] * e[mB * jj + kk]

  return out

Obviously, this would be pretty slow to run in standard Python. In the tested implementation I used Numba, which compiles explicit loops like these to fast machine code. For the sake of comparison, all of the test functions below are compiled with Numba.
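Concretely, the wrapper amounts to something like the sketch below (the jitted name is mine; the actual tested implementation may differ in its details):

import numba as nb

# Compile the pure-Python loop above to machine code. Numba handles both the
# explicit loops and the np.linalg.cholesky calls in nopython mode.
chol_kron_ab_e_jit = nb.njit(chol_kron_ab_e)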

Evaluation

Let’s see how well this thing works. Throughout, we’ll use a Numba-compiled version of np.linalg.cholesky(np.kron(A, B)).dot(e) as a benchmark.
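The comparison runs along the following lines. This is only a rough sketch with illustrative sizes and my own names; it assumes the jitted wrapper from above, and for simplicity the benchmark here is plain NumPy rather than the Numba-compiled version used in the actual tests.

import timeit

import numpy as np


def benchmark(A, B, e):
  """Benchmark: build kron(A, B) explicitly, then factor and multiply."""
  return np.linalg.cholesky(np.kron(A, B)).dot(e)


# Small random SPD test matrices (adding a multiple of the identity keeps them
# safely positive definite).
rng = np.random.default_rng(0)
mA, mB = 40, 20
A = rng.standard_normal((mA, mA))
A = A @ A.T + mA * np.eye(mA)
B = rng.standard_normal((mB, mB))
B = B @ B.T + mB * np.eye(mB)
e = rng.standard_normal(mA * mB)

# First call also triggers JIT compilation, so do it before timing.
assert np.allclose(chol_kron_ab_e_jit(A, B, e), benchmark(A, B, e))

t_loop = timeit.timeit(lambda: chol_kron_ab_e_jit(A, B, e), number=10) / 10
t_bench = timeit.timeit(lambda: benchmark(A, B, e), number=10) / 10
print(f"loop: {t_loop:.5f}s  benchmark: {t_bench:.5f}s")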

Figure 2 provides some information on time-to-compute. I’ve kept the size of the matrices relatively small so that the benchmark can be computed.

Figure 2. The proposed algorithm (“Loop”) provides a considerable speed-up over the benchmark and is pretty fast in absolute terms.

From the left panel we see that the proposed algorithm achieves a speed-up at all matrix sizes. As the size of the matrices increases, the speed-up grows at a more-or-less linear rate.

The right sub-figure plots the absolute time-to-compute for the proposed algorithm in seconds. Not surprisingly, this relationship is convex at all lags. However, in absolute terms it is still pretty fast.

Figure 3 compares the memory requirements of the two algorithms. Because measuring memory usage for the benchmark takes so long, I only did this at a few values. The memory usage statistic presented here is the maximum memory required by the process, as reported by memory_profiler.memory_usage.
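For reference, a measurement along these lines can be taken as in the sketch below (not the exact harness used for the figures; it reuses the jitted function and test matrices from the timing sketch above):

from memory_profiler import memory_usage

# Peak memory (in MiB) used by the process while running the jitted loop version.
peak = memory_usage((chol_kron_ab_e_jit, (A, B, e)), max_usage=True)
print("peak memory (MiB):", peak)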

Figure 3. Relative and Absolute Memory Usage.

The proposed algorithm is pretty memory-efficient. With 42 variables the NumPy method already requires about 8.2 gigs of memory. Most people don’t have that kind of memory to spare! On the other hand, memory consumption of the alternative method grows at a modest (though convex) rate.

Finally, I tested the proposed algorithm for large-dimensional cases. The benchmark is skipped here because I don’t have enough memory to do this sort of comparison. The results are displayed in Figure 4.

Figure 4. Time-to-compute and memory usage for proposed algorithm (“loop”) only.

First, the good: memory usage continues to climb very slowly despite a huge increase in the size of the matrices involved. Second, the okay: time-to-run is non-trivial at higher $n$. It also seems to jump around $n=125$ for some reason. If you want to do sampling with this thing, you’d better plan ahead.

Conclusion

I’m a little bit disappointed that I couldn’t keep the time down a bit more for the larger matrices. If I had a bit more time I’d investigate pure-C versions of this procedure to see how low it can go. If you have the time and want to try that, feel free to build on my results, which I’ve posted as a gist.