BLAS and LAPACK#

We’ve seen a bit of dense linear algebra using numpy and scipy. Now we’re going to look under the hood.

Regardless of what language you’re using, chances are that if you’re doing numerical linear algebra you can take advantage of libraries that implement the most common linear algebra routines and factorizations.

Basic Linear Algebra Subprograms (BLAS)

Linear Algebra PACKage (LAPACK)

These libraries are wrapped by scipy, which exposes a Python interface to them.

Why should you care?#

First, you probably shouldn’t be writing your own basic linear algebra routines if you can avoid it.

  1. It takes time to write them

  2. Even if you know what you’re doing, there’s a chance you have a bug

  3. Optimizing them for performance is a substantial effort in its own right

It is entirely possible to do linear algebra in Python without ever worrying about the libraries under the hood. However, if you are prototyping an algorithm in Python and then want to write compiled/optimized code in C or Fortran, it is good to be able to translate what you’re doing into BLAS/LAPACK routines.

View Your Configuration#

You can view which BLAS and LAPACK libraries NumPy is using:

%matplotlib inline
import numpy as np
import scipy as sp
import scipy.linalg as la

np.__config__.show()
blas_armpl_info:
  NOT AVAILABLE
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/goodwill/miniconda3/envs/pycourse_test/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/goodwill/miniconda3/envs/pycourse_test/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/goodwill/miniconda3/envs/pycourse_test/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/goodwill/miniconda3/envs/pycourse_test/include']
lapack_armpl_info:
  NOT AVAILABLE
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/goodwill/miniconda3/envs/pycourse_test/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/goodwill/miniconda3/envs/pycourse_test/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/goodwill/miniconda3/envs/pycourse_test/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/goodwill/miniconda3/envs/pycourse_test/include']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
    not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL

The output above shows the library mkl_rt, indicating that this system is using Intel’s Math Kernel Library (MKL). This is a library of mathematical functions (including BLAS and LAPACK) which is optimized for Intel CPUs, and is the default for Anaconda Python.

You can do the same for scipy:

sp.__config__.show()
Build Dependencies:
  blas:
    detection method: pkgconfig
    found: true
    include directory: /home/goodwill/miniconda3/envs/pycourse_test/include
    lib directory: /home/goodwill/miniconda3/envs/pycourse_test/lib
    name: mkl-sdl
    openblas configuration: unknown
    pc file directory: /home/goodwill/miniconda3/envs/pycourse_test/lib/pkgconfig
    version: '2023.1'
  lapack:
    detection method: pkgconfig
    found: true
    include directory: /home/goodwill/miniconda3/envs/pycourse_test/include
    lib directory: /home/goodwill/miniconda3/envs/pycourse_test/lib
    name: mkl-sdl
    openblas configuration: unknown
    pc file directory: /home/goodwill/miniconda3/envs/pycourse_test/lib/pkgconfig
    version: '2023.1'
  pybind11:
    detection method: pkgconfig
    include directory: /home/goodwill/miniconda3/envs/pycourse_test/include
    name: pybind11
    version: 2.10.4
Compilers:
  c:
    commands: /croot/scipy_1691606680890/_build_env/bin/x86_64-conda-linux-gnu-cc
    linker: ld.bfd
    name: gcc
    version: 11.2.0
  c++:
    commands: /croot/scipy_1691606680890/_build_env/bin/x86_64-conda-linux-gnu-c++
    linker: ld.bfd
    name: gcc
    version: 11.2.0
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 0.29.36
  fortran:
    commands: /croot/scipy_1691606680890/_build_env/bin/x86_64-conda-linux-gnu-gfortran
    linker: ld.bfd
    name: gcc
    version: 11.2.0
  pythran:
    include directory: ../../_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.11/site-packages/pythran
    version: 0.13.1
Machine Information:
  build:
    cpu: x86_64
    endian: little
    family: x86_64
    system: linux
  cross-compiled: false
  host:
    cpu: x86_64
    endian: little
    family: x86_64
    system: linux
Python Information:
  path: /home/goodwill/miniconda3/envs/pycourse_test/bin/python
  version: '3.11'

BLAS Routines#

scipy BLAS interface

BLAS implements basic linear algebra routines like dot product, matrix-vector product, and matrix-matrix product as well as triangular solves.

BLAS is written in Fortran, which stores arrays in column-major order, so it is easiest to use if you set the flag order='F' when constructing arrays.
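
If you already have a C-ordered (row-major) array, you can check its layout and make a column-major copy with np.asfortranarray. A minimal sketch:

A = np.random.randn(3, 3)          # numpy's default is C (row-major) order
print(A.flags['F_CONTIGUOUS'])     # False
A_f = np.asfortranarray(A)         # column-major copy, suitable for Fortran BLAS
print(A_f.flags['F_CONTIGUOUS'])   # True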

from scipy.linalg import blas

n = 4096
A = np.array(np.random.randn(n,n), order='F')
B = np.array(np.random.randn(n,n), order='F')


C_np = A @ B # numpy matrix-matrix product
C_blas = blas.dgemm(1.0, A, B) # BLAS C = 1.0 * A @ B
np.linalg.norm(C_np - C_blas)
6.549260815938909e-12
%timeit C_np = np.matmul(A,B) # numpy
1 s ± 42.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit C_blas = blas.dgemm(1.0, A, B) # blas
1.06 s ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
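
The timings are essentially the same here because, for double-precision arrays like these, numpy’s matmul is itself dispatched to an optimized BLAS gemm under the hood.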

BLAS Naming Conventions#

BLAS functions have short names (like dgemm) that look a little cryptic at first. There is a pattern to the names though. See here for reference.

blas.<character><name><mod>

The <character> field can be

  1. s: single precision float

  2. d: double precision float

  3. c: single precision complex float

  4. z: double precision complex float

The <name> field indicates either the operation type or the matrix type.

The <mod> field indicates additional details about the operation.

For example, dgemm = d + ge + mm has d as the <character>, ge for <name> (general matrix), and mm for <mod> (matrix multiplication).
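
You do not have to hard-code the <character> prefix yourself: scipy.linalg.blas.get_blas_funcs picks the routine matching your arrays’ dtype. A small sketch:

A32 = np.array(np.random.randn(100, 100), dtype=np.float32, order='F')
A64 = np.array(np.random.randn(100, 100), order='F')

gemm32 = blas.get_blas_funcs("gemm", (A32, A32)) # resolves to sgemm for float32 inputs
gemm64 = blas.get_blas_funcs("gemm", (A64, A64)) # resolves to dgemm for float64 inputs
print(gemm32(1.0, A32, A32).dtype) # float32
print(gemm64(1.0, A64, A64).dtype) # float64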

BLAS Levels#

BLAS is split up into 3 conceptual levels

  1. Level 1 - vector-vector operations

  2. Level 2 - matrix-vector operations

  3. Level 3 - matrix-matrix operations

Level 3 operations are the most efficient, since they do the most work per memory access and so can take the greatest advantage of memory performance optimizations, e.g. minimizing cache misses.
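
As a rough illustration (a sketch, not a careful benchmark), the same matrix product can be computed with one Level 3 gemm call or with one Level 2 gemv call per column of B; the single gemm call is typically much faster:

n = 1024
A = np.array(np.random.randn(n, n), order='F')
B = np.array(np.random.randn(n, n), order='F')

%timeit blas.dgemm(1.0, A, B) # Level 3: one matrix-matrix product
%timeit np.column_stack([blas.dgemv(1.0, A, B[:, j]) for j in range(n)]) # Level 2: n matrix-vector products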

FLOPS#

FLOPS (Floating-Point Operations Per Second) is one way to measure the performance of numerical code. Floating point operations are counted as the number of +, -, *, / applied to floating point numbers.

  • A human is capable of <1 flop

  • Your CPU is probably capable of several Giga-flops (\(10^9\)); a rough way to estimate the peak is sketched after this list

  • A high end GPU will be measured in Tera-flops (\(10^{12}\))

  • Supercomputers are mostly O(100) Peta-flops (\(10^{17}\)); some are close to an Exa-flop (\(10^{18}\))
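
A back-of-the-envelope estimate of a CPU’s theoretical peak is cores × clock rate × SIMD lanes × FLOPs per lane per cycle. The numbers below are made-up placeholder values, not measurements of any particular machine:

cores = 4            # physical cores (hypothetical)
clock_hz = 3.0e9     # 3 GHz clock (hypothetical)
simd_lanes = 8       # e.g. AVX2 holds 8 single-precision floats per register
flops_per_lane = 2   # a fused multiply-add counts as 2 FLOPs per cycle

peak = cores * clock_hz * simd_lanes * flops_per_lane
print("{:e} FLOPS theoretical peak".format(peak)) # ~1.9e+11 for these made-up numbers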

BLAS Level 1#

Some vector-vector operations (insert the appropriate <character>)

  1. axpy (\(a x + y\))

  2. dot (dot product \(x^H y\))

  3. nrm2 (\(\|x\|_2\))

See the SciPy BLAS reference

n = 1000
x = np.random.rand(n)
y = np.random.rand(n)

z = blas.daxpy(x, y, a=1.0) # z = 1.0 * x + y
xx = blas.ddot(x, x) # dot product x . x
x2 = blas.dnrm2(x) # Euclidean norm of x
np.sqrt(xx) - x2 # should be (close to) zero
0.0
import time
# measure FLOPS of sdot
n = 10_000_000
dtype = np.float32
x = np.array(np.random.randn(n), dtype=dtype)
y = np.array(np.random.randn(n), dtype=dtype)
print("measuring BLAS Level 1 flops using {}".format(dtype))
t0 = time.time()
xy = blas.sdot(x,y)
t1 = time.time()
print("time elapsed = {} sec.".format(t1 - t0))
flops = (2*n) / (t1 - t0) # 2N FLOP for dot - one multiplication and one addition per entry
print("{:e} FLOPS".format(flops))
measuring BLAS Level 1 flops using <class 'numpy.float32'>
time elapsed = 0.004943132400512695 sec.
4.046017e+09 FLOPS

This is consistent with a CPU running at ~3 GHz: roughly \(4 \times 10^9\) FLOPS works out to about one floating point operation per clock cycle.

BLAS Level 2#

Some matrix-vector operations (insert appropriate <character>)

  1. gemv \(\alpha A x\)

  2. trsv \(L^{-1} x\) (triangular solve; see the sketch after this list)

  3. trmv \(L x\) (triangular matrix-vector product)

See the SciPy BLAS reference
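
As a quick illustration of trsv before timing gemv below, the sketch here solves \(L y = b\) for a lower triangular \(L\) and checks the result against scipy’s solve_triangular (the lower=1 keyword follows scipy’s BLAS wrapper conventions; consult the reference above for the full signature):

n = 1000
L = np.tril(np.random.rand(n, n)) + n * np.eye(n) # well-conditioned lower triangular matrix
b = np.random.rand(n)

y_blas = blas.dtrsv(L, b, lower=1) # BLAS Level 2 triangular solve: y = L^{-1} b
y_ref = la.solve_triangular(L, b, lower=True) # scipy reference solution
print(np.linalg.norm(y_blas - y_ref))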

# measure FLOPs of gemv
n = 4096
dtype = np.float32
A = np.array(np.random.randn(n, n), dtype=dtype)
x = np.array(np.random.randn(n), dtype=dtype)
print("measuring BLAS Level 2 flops using {}".format(dtype))
t0 = time.time()
y = blas.sgemv(1.0, A, x)
t1 = time.time()
print("time elapsed = {} sec.".format(t1 - t0))
flops = (2 * n**2) / (t1 - t0) # 2 n**2 FLOP - one multiplication and one addition per entry of A
print("{:e} FLOPS".format(flops))
measuring BLAS Level 2 flops using <class 'numpy.float32'>
time elapsed = 0.2168872356414795 sec.
1.547091e+08 FLOPS

BLAS Level 3#

Some matrix-matrix operations (insert appropriate <character>)

  1. gemm \(\alpha AB\)

  2. syrk \(\alpha A A^T\) (see the sketch after this list)

  3. trmm \(\alpha L_1 L_2\) (triangular multiplication)

See the SciPy BLAS reference
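
For instance, syrk exploits the symmetry of \(A A^T\) and only writes one triangle of the result. A small sketch, assuming the scipy wrapper’s lower flag selects which triangle is filled (check the reference above):

k = 200
A = np.array(np.random.randn(300, k), order='F')

C_syrk = blas.dsyrk(1.0, A, lower=1) # lower triangle of A A^T
C_full = np.tril(C_syrk) + np.tril(C_syrk, -1).T # symmetrize for comparison
print(np.linalg.norm(C_full - A @ A.T))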

# measure FLOPs of gemm
n = 4096
dtype = np.float32
A = np.array(np.random.randn(n, n), dtype=dtype)
B = np.array(np.random.randn(n, n), dtype=dtype)
print("measuring BLAS Level 3 flops using {}".format(dtype))
t0 = time.time()
C = blas.sgemm(1.0, A, B)
t1 = time.time()
print("time elapsed = {} sec.".format(t1 - t0))
flops = (2 * n**3) / (t1 - t0) # 2 n**3 FLOP - n multiplications and additions per entry of C
print("{:e} FLOPS".format(flops))
measuring BLAS Level 3 flops using <class 'numpy.float32'>
time elapsed = 0.8897984027862549 sec.
1.544608e+11 FLOPS

From our experiments, we see that the BLAS Level 3 operation achieves by far the most FLOPS: matrix-matrix multiplication does \(O(n^3)\) work on \(O(n^2)\) data, so each entry loaded into cache can be reused many times.

LAPACK Routines#

scipy LAPACK interface

from scipy.linalg import lapack

LAPACK is a library of linear algebra routines that go beyond basic operations. These include routines for various factorizations and eigenvalue and singular value decompositions.

Again, the names are a bit cryptic, and it is worth searching online (and reading documentation) to figure out how to call the right functions.
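
As with BLAS, scipy can resolve the precision prefix for you: lapack.get_lapack_funcs returns the routines matching your array’s dtype. A minimal sketch using the QR routines that appear in the example below:

A_test = np.array(np.random.randn(50, 20), order='F')
geqrf, orgqr = lapack.get_lapack_funcs(('geqrf', 'orgqr'), (A_test,)) # dgeqrf, dorgqr for float64 input
qr_fact, tau, work, info = geqrf(A_test)
Q_test, work, info = orgqr(qr_fact, tau)
print(Q_test.shape)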

Example: Obtain an orthonormal basis for the columns of a matrix#

In a variety of situations it is convenient to be able to obtain an orthonormal basis for the columns of a matrix A (e.g. subspace iteration, randomized numerical linear algebra). One way to do this is to use the QR decomposition:

m = 1000
n = 100
A = np.array(np.random.randn(m, n), order='F')

def get_Q_qr(A):
    Q, R = la.qr(A, mode='economic')
    return Q

Q = get_Q_qr(A)

print(Q.shape)
print(np.linalg.norm(Q.T @ Q - np.eye(n))) # measure how close Q is to orthogonal
(1000, 100)
3.738826405129496e-15

We can get this using LAPACK routines as well:

def get_Q_lapack(A):
    qr, tau, work, info = lapack.dgeqrf(A) # Householder QR: R in the upper triangle, reflectors below
    Q, work, info = lapack.dorgqr(qr, tau) # assemble Q explicitly from the Householder reflectors
    return Q

A = np.array(np.random.randn(m, n), order='F')
Q = get_Q_lapack(A)

print(Q.shape)
print(np.linalg.norm(Q.T @ Q - np.eye(n))) # measure how close Q is to orthogonal
(1000, 100)
3.699301345247151e-15

You can also run many LAPACK (and BLAS) routines in-place, meaning you can overwrite the input matrix. This saves the time spent on memory allocation.

def get_Q_inplace(A):
    m, n = A.shape
    lwork = max(3*n, 1) # workspace size (dgeqrf and dorgqr require lwork >= n)
    qr, tau, work, info = lapack.dgeqrf(A, lwork, 1) # overwrite A = True
    Q, work, info = lapack.dorgqr(qr, tau, lwork, 1) # overwrite qr = True
    return Q

A = np.array(np.random.randn(m, n), order='F')
Q = get_Q_inplace(A)

print(Q.shape)
print(np.linalg.norm(Q.T @ Q - np.eye(n))) # measure how close Q is to orthogonal
print(np.linalg.norm(A - Q)) # A now contains Q
(1000, 100)
3.90739608378158e-15
0.0
m = 4096
n = 128

for get_Q in (get_Q_qr, get_Q_lapack, get_Q_inplace):
    
    print("timing {}".format(get_Q))
    A = np.array(np.random.randn(m, n), order='F')
    t0 = time.time()
    Q = get_Q(A)
    t1 = time.time()
    print("  time elapsed = {} sec.".format(t1 - t0))  
timing <function get_Q_qr at 0x7fe297c74540>
  time elapsed = 0.017192840576171875 sec.
timing <function get_Q_lapack at 0x7fe297c74a40>
  time elapsed = 0.01151728630065918 sec.
timing <function get_Q_inplace at 0x7fe297c74b80>
  time elapsed = 0.010828971862792969 sec.

As we see, the in-place version is fastest.

Exercise#

One way to get the top k-dimensional eigenspace (the eigenspace associated with the k largest eigenvalues) is to use a subspace version of the power method. Here is some pseudocode:


# get top k-dimensional eigenspace of symmetric matrix A
m, n = A.shape # m should equal n
Q = random n x k matrix
Q, R = qr(Q) # orthogonalize
for some number of iterations:
    Q = A @ Q
    Q, R = qr(Q) # re-orthogonalize
    
return Q

Implement a function that realizes this algorithm. Use BLAS to implement matrix-matrix multiplication, and use LAPACK to do the orthogonalization of Q.

Compare this to the span of the top-k eigenvectors of A obtained via eigh.

## Your Code here

Exercise#

Implement a version of solve for a square matrix A which uses LAPACK for the LU decomposition and BLAS for the triangular solves (e.g. trsv or trsm). You may want to look at dgetrf: the returned lu array stores L in its strictly lower triangular part (with an implicit unit diagonal) and U in its upper triangular part, and dgetrf also returns pivot information that you will need to account for.

## Your code here