BLAS and LAPACK#
We’ve seen a bit of dense linear algebra using numpy and scipy. Now we’re going to look under the hood.
Regardless of what language you're using for numerical linear algebra, chances are you can take advantage of libraries that implement the most common linear algebra routines and factorizations:
Basic Linear Algebra Subprograms (BLAS)
Linear Algebra PACKage (LAPACK)
These libraries are wrapped by scipy, which exposes a low-level interface in scipy.linalg.blas and scipy.linalg.lapack.
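For example, you can ask scipy to look up the BLAS or LAPACK routine that matches your array's precision. A minimal sketch using get_blas_funcs and get_lapack_funcs from scipy.linalg:

import numpy as np
import scipy.linalg as la

A = np.random.randn(100, 100)              # float64 array
gemm = la.get_blas_funcs('gemm', (A,))     # picks dgemm based on A's dtype
getrf = la.get_lapack_funcs('getrf', (A,)) # picks dgetrf (LU factorization)
C = gemm(1.0, A, A)                        # C = 1.0 * A @ A
np.allclose(C, A @ A)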
Why should you care?#
First, you probably shouldn’t be writing your own basic linear algebra routines if you can avoid it.
It takes time to write them
Even if you know what you’re doing, there’s a chance you have a bug
Performance optimization is involved
It is entirely possible to do linear algebra in Python without ever worrying about the libraries under the hood. However, maybe you are prototyping an algorithm in Python, and then want to write compiled/optimized code in C/Fortran. In this case, it is good to be able to translate what you're doing into BLAS/LAPACK routines.
View Your Configuration#
You can view which BLAS and LAPACK libraries NumPy is using:
%matplotlib inline
import numpy as np
import scipy as sp
import scipy.linalg as la
np.__config__.show()
blas_armpl_info:
NOT AVAILABLE
blas_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/goodwill/miniconda3/envs/pycourse_test/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/goodwill/miniconda3/envs/pycourse_test/include']
blas_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/goodwill/miniconda3/envs/pycourse_test/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/goodwill/miniconda3/envs/pycourse_test/include']
lapack_armpl_info:
NOT AVAILABLE
lapack_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/goodwill/miniconda3/envs/pycourse_test/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/goodwill/miniconda3/envs/pycourse_test/include']
lapack_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/goodwill/miniconda3/envs/pycourse_test/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/goodwill/miniconda3/envs/pycourse_test/include']
Supported SIMD extensions in this NumPy install:
baseline = SSE,SSE2,SSE3
found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
The above shows the libraries mkl_rt, indicating that the system is using Intel's Math Kernel Library (MKL) - a library of mathematical functions (including BLAS and LAPACK) which is optimized for Intel CPUs, and is the default for Anaconda Python.
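If you want to check at runtime which BLAS shared library is actually loaded (and how many threads it will use), the optional threadpoolctl package can report this. A sketch, assuming threadpoolctl is installed in your environment:

from threadpoolctl import threadpool_info   # optional package: pip install threadpoolctl
for pool in threadpool_info():               # one entry per loaded BLAS/OpenMP library
    print(pool["internal_api"], pool["num_threads"], pool["filepath"])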
You can do the same for scipy:
sp.__config__.show()
Build Dependencies:
blas:
detection method: pkgconfig
found: true
include directory: /home/goodwill/miniconda3/envs/pycourse_test/include
lib directory: /home/goodwill/miniconda3/envs/pycourse_test/lib
name: mkl-sdl
openblas configuration: unknown
pc file directory: /home/goodwill/miniconda3/envs/pycourse_test/lib/pkgconfig
version: '2023.1'
lapack:
detection method: pkgconfig
found: true
include directory: /home/goodwill/miniconda3/envs/pycourse_test/include
lib directory: /home/goodwill/miniconda3/envs/pycourse_test/lib
name: mkl-sdl
openblas configuration: unknown
pc file directory: /home/goodwill/miniconda3/envs/pycourse_test/lib/pkgconfig
version: '2023.1'
pybind11:
detection method: pkgconfig
include directory: /home/goodwill/miniconda3/envs/pycourse_test/include
name: pybind11
version: 2.10.4
Compilers:
c:
commands: /croot/scipy_1691606680890/_build_env/bin/x86_64-conda-linux-gnu-cc
linker: ld.bfd
name: gcc
version: 11.2.0
c++:
commands: /croot/scipy_1691606680890/_build_env/bin/x86_64-conda-linux-gnu-c++
linker: ld.bfd
name: gcc
version: 11.2.0
cython:
commands: cython
linker: cython
name: cython
version: 0.29.36
fortran:
commands: /croot/scipy_1691606680890/_build_env/bin/x86_64-conda-linux-gnu-gfortran
linker: ld.bfd
name: gcc
version: 11.2.0
pythran:
include directory: ../../_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.11/site-packages/pythran
version: 0.13.1
Machine Information:
build:
cpu: x86_64
endian: little
family: x86_64
system: linux
cross-compiled: false
host:
cpu: x86_64
endian: little
family: x86_64
system: linux
Python Information:
path: /home/goodwill/miniconda3/envs/pycourse_test/bin/python
version: '3.11'
BLAS Routines#
BLAS implements basic linear algebra routines like dot product, matrix-vector product, and matrix-matrix product as well as triangular solves.
It is written in Fortran, so it will be easiest to use if you set the flag order='F' when constructing arrays.
from scipy.linalg import blas
n = 4096
A = np.array(np.random.randn(n,n), order='F')
B = np.array(np.random.randn(n,n), order='F')
C_np = A @ B # numpy C = A * B
C_blas = blas.dgemm(1.0, A, B) # BLAS C = A * B
np.linalg.norm(C_np - C_blas)
6.549260815938909e-12
%timeit C_np = np.matmul(A,B) # numpy
1 s ± 42.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit C_blas = blas.dgemm(1.0, A, B) # blas
1.06 s ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
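Note that the BLAS wrappers will still accept C-ordered arrays, but they may have to make a Fortran-ordered copy internally. You can check an array's memory layout via its flags; a quick sketch:

M = np.random.randn(512, 512)                               # C-ordered by default
Mf = np.asfortranarray(M)                                   # Fortran-ordered copy
print(M.flags['F_CONTIGUOUS'], Mf.flags['F_CONTIGUOUS'])    # False True
C1 = blas.dgemm(1.0, M, M)                                  # works, but M may be copied first
C2 = blas.dgemm(1.0, Mf, Mf)                                # no copy needed
print(np.linalg.norm(C1 - C2))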
BLAS Naming Conventions#
BLAS functions have short names (like dgemm) that look a little cryptic at first. There is a pattern to the names though. See here for reference.
blas.<character><name><mod>
The <character> field can be
s: single precision float
d: double precision float
c: single precision complex float
z: double precision complex float
The <name> field indicates either the operation type or the matrix type.
The <mod> field indicates additional details about the operation.
For example, dgemm = d + ge + mm has d as the <character>, ge for the <name> (general matrix), and mm for the <mod> (matrix multiplication).
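To see the pattern in action, the same general matrix multiplication is available in each precision; a small sketch, casting the arrays to the matching dtype:

A32 = np.random.randn(3, 3).astype(np.float32)
A64 = np.random.randn(3, 3).astype(np.float64)
Ac = (np.random.randn(3, 3) + 1j * np.random.randn(3, 3)).astype(np.complex64)

C32 = blas.sgemm(1.0, A32, A32)   # s + ge + mm: single precision general matrix multiply
C64 = blas.dgemm(1.0, A64, A64)   # d + ge + mm: double precision
Cc  = blas.cgemm(1.0, Ac, Ac)     # c + ge + mm: single precision complex
print(C32.dtype, C64.dtype, Cc.dtype)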
BLAS Levels#
BLAS is split up into 3 conceptual levels
Level 1 - vector-vector operations
Level 2 - matrix-vector operations
Level 3 - matrix-matrix operations
Level 3 operations are the most efficient, since they can take the most advantage of memory performance optimizations, e.g. minimizing cache misses.
FLOPS#
FLOPS are Floating-Point Operations Per Second, and are one way to measure the speed of numerical code. Floating point operations are counted as the number of +, *, /, - applied to floating point numbers.
A human is capable of <1 flop
Your CPU is probably capable of several Giga-flops (\(10^9\))
A high end GPU will be measured in Tera-flops (\(10^{12}\))
Supercomputers are mostly O(100) Peta-flops (\(10^{17}\)). Some are close to an Exa-flop (\(10^{18}\))
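As a rough back-of-the-envelope check, a CPU's theoretical peak is approximately cores × clock rate × FLOPs per cycle. The numbers below are hypothetical, just to illustrate the arithmetic:

cores = 4                 # hypothetical core count
clock_hz = 3e9            # hypothetical 3 GHz clock
flops_per_cycle = 16      # hypothetical, e.g. wide SIMD units with fused multiply-add
peak_flops = cores * clock_hz * flops_per_cycle
print("theoretical peak ~ {:e} FLOPS".format(peak_flops))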
BLAS Level 1#
Some vector-vector operations (insert the appropriate <character>):
axpy: \(a x + y\)
dot: \(x^T y\)
nrm2: \(\|x\|_2\)
n = 1000
x = np.random.rand(n)
y = np.random.rand(n)
z = blas.daxpy(x, y, a=1.0)
xx = blas.ddot(x, x)
x2 = blas.dnrm2(x)
np.sqrt(xx) - x2
0.0
import time
# measure FLOPs of fdot
n = 10_000_000
dtype = np.float32
x = np.array(np.random.randn(n), dtype=dtype)
y = np.array(np.random.randn(n), dtype=dtype)
print("measuring BLAS Level 1 flops using {}".format(dtype))
t0 = time.time()
xy = blas.sdot(x,y)
t1 = time.time()
print("time elapsed = {} sec.".format(t1 - t0))
flops = (2*n) / (t1 - t0) # 2N FLOP for dot - one multiplication and one addition per entry
print("{:e} FLOPS".format(flops))
measuring BLAS Level 1 flops using <class 'numpy.float32'>
time elapsed = 0.004943132400512695 sec.
4.046017e+09 FLOPS
This is consistent with a CPU running at ~3 GHz
BLAS Level 2#
Some matrix-vector operations (insert appropriate <character>):
gemv: \(\alpha A x\)
trsv: \(L^{-1} x\) (triangular solve)
trmv: \(L x\) (triangular matrix-vector product)
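A quick sketch of the triangular solve trsv (this assumes dtrsv accepts the lower keyword as shown; check the scipy.linalg.blas docs for the exact signature):

n = 5
L = np.tril(np.random.randn(n, n)) + n * np.eye(n)   # well-conditioned lower-triangular matrix
b = np.random.randn(n)
y = blas.dtrsv(L, b, lower=1)        # solve L y = b (lower=1 -> use the lower triangle)
print(np.linalg.norm(L @ y - b))     # should be ~ machine precision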
# measure FLOPs of gemv
n = 4096
dtype = np.float32
A = np.array(np.random.randn(n, n), dtype=dtype)
x = np.array(np.random.randn(n), dtype=dtype)
print("measuring BLAS Level 2 flops using {}".format(dtype))
t0 = time.time()
y = blas.sgemv(1.0, A, x)
t1 = time.time()
print("time elapsed = {} sec.".format(t1 - t0))
flops = (2 * n**2) / (t1 - t0) # 2 n**2 FLOP - one multiplication and one addition per entry of A
print("{:e} FLOPS".format(flops))
measuring BLAS Level 2 flops using <class 'numpy.float32'>
time elapsed = 0.2168872356414795 sec.
1.547091e+08 FLOPS
BLAS Level 3#
Some matrix-matrix operations (insert appropriate <character>).
See the SciPy BLAS reference
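For instance, syrk computes a symmetric rank-k update \(\alpha A A^T\). A minimal sketch (dsyrk fills only one triangle of the result, by default the upper one, so we compare upper triangles):

n = 500
A = np.array(np.random.randn(n, n), order='F')
C = blas.dsyrk(1.0, A)                                  # upper triangle of A @ A.T
print(np.linalg.norm(np.triu(C) - np.triu(A @ A.T)))    # compare upper triangles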
# measure FLOPs of gemm
n = 4096
dtype = np.float32
A = np.array(np.random.randn(n, n), dtype=dtype)
B = np.array(np.random.randn(n, n), dtype=dtype)
print("measuring BLAS Level 3 flops using {}".format(dtype))
t0 = time.time()
C = blas.sgemm(1.0, A, B)
t1 = time.time()
print("time elapsed = {} sec.".format(t1 - t0))
flops = (2 * n**3) / (t1 - t0) # 2 n**3 FLOP - n multiplications and additions per entry of C
print("{:e} FLOPS".format(flops))
measuring BLAS Level 3 flops using <class 'numpy.float32'>
time elapsed = 0.8897984027862549 sec.
1.544608e+11 FLOPS
From our experiments, we see that the BLAS Level 3 operation gives us the most FLOPS.
LAPACK Routines#
from scipy.linalg import lapack
LAPACK is a library of linear algebra routines that go beyond basic operations. These include routines for various factorizations and eigenvalue and singular value decompositions.
Again, the names are a bit cryptic, and it is worth searching online (and reading documentation) to figure out how to call the right functions.
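For example, gesv solves a general linear system using an LU factorization with partial pivoting. A small sketch using the double precision version:

n = 100
A = np.array(np.random.randn(n, n), order='F')
b = np.random.randn(n)
lu, piv, x, info = lapack.dgesv(A, b)     # solves A x = b; info == 0 on success
print(info, np.linalg.norm(A @ x - b))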
Example: Obtain an orthonormal basis for the columns of a matrix#
In a variety of situations it is convenient to be able to obtain an orthonormal basis for the columns of a matrix A (e.g. subspace iteration, randomized numerical linear algebra). One way to do this is to use the QR decomposition:
m = 1000
n = 100
A = np.array(np.random.randn(m, n), order='F')
def get_Q_qr(A):
Q, R = la.qr(A, mode='economic')
return Q
Q = get_Q_qr(A)
print(Q.shape)
print(np.linalg.norm(Q.T @ Q - np.eye(n))) # measure how close Q is to orthogonal
(1000, 100)
3.738826405129496e-15
We can get this using LAPACK routines as well:
def get_Q_lapack(A):
qr, tau, work, info = lapack.dgeqrf(A)
Q, work, info = lapack.dorgqr(qr, tau)
return Q
A = np.array(np.random.randn(m, n), order='F')
Q = get_Q_lapack(A)
print(Q.shape)
print(np.linalg.norm(Q.T @ Q - np.eye(n))) # measure how close Q is to orthogonal
(1000, 100)
3.699301345247151e-15
You can also do many LAPACK (and BLAS) routines in-place, meaning you can overwrite the matrix. This saves the time spent on memory allocation.
def get_Q_inplace(A):
m, n = A.shape
lwork = max(3*n, 1)
qr, tau, work, info = lapack.dgeqrf(A, lwork, 1) # overwrite A = True
Q, work, info = lapack.dorgqr(qr, tau, lwork, 1) # overwrite qr = True
return Q
A = np.array(np.random.randn(m, n), order='F')
Q = get_Q_inplace(A)
print(Q.shape)
print(np.linalg.norm(Q.T @ Q - np.eye(n))) # measure how close Q is to orthogonal
print(np.linalg.norm(A - Q)) # A now contains Q
(1000, 100)
3.90739608378158e-15
0.0
m = 4096
n = 128
for get_Q in (get_Q_qr, get_Q_lapack, get_Q_inplace):
print("timing {}".format(get_Q))
A = np.array(np.random.randn(m, n), order='F')
t0 = time.time()
Q = get_Q(A)
t1 = time.time()
print(" time elapsed = {} sec.".format(t1 - t0))
timing <function get_Q_qr at 0x7fe297c74540>
time elapsed = 0.017192840576171875 sec.
timing <function get_Q_lapack at 0x7fe297c74a40>
time elapsed = 0.01151728630065918 sec.
timing <function get_Q_inplace at 0x7fe297c74b80>
time elapsed = 0.010828971862792969 sec.
As we see, the in-place version is fastest.
Exercise#
One way to get the top k-dimensional eigenspace (the eigenspace associated with the k largest eigenvalues) is to use a subspace version of the power method. Here is some pseudocode:
# get top k-dimensional eigenspace of symmetric matrix A
m, n = A.shape # m should equal n
Q = random n x k matrix
Q, R = qr(Q) # orthogonalize
for some number of iterations:
Q = A @ Q
Q, R = qr(Q) # re-orthogonalize
return Q
Implement a function that realizes this algorithm. Use BLAS to implement matrix-matrix multiplication, and use LAPACK to do the orthogonalization of Q.
Compare this to the span of the top-k eigenvectors of A obtained via eigh.
## Your Code here
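One possible sketch, using dgemm for the product and the dgeqrf/dorgqr helper defined above for the orthogonalization. The final comparison uses the fact that two k-dimensional subspaces coincide when the singular values of Q1.T @ Q2 are all (close to) 1:

def subspace_iteration(A, k, iters=300):
    # orthonormal basis for the (approximate) top k-dimensional eigenspace of symmetric A
    n = A.shape[0]
    Q = np.array(np.random.randn(n, k), order='F')
    Q = get_Q_lapack(Q)                 # orthogonalize via LAPACK (dgeqrf + dorgqr)
    for _ in range(iters):
        Q = blas.dgemm(1.0, A, Q)       # Q <- A @ Q via BLAS Level 3
        Q = get_Q_lapack(Q)             # re-orthogonalize via LAPACK
    return Q

# positive semidefinite test matrix, so the largest-magnitude eigenvalues
# (which the power method finds) are also the algebraically largest ones
n, k = 100, 5
B = np.random.randn(n, n)
A_sym = B @ B.T
Q = subspace_iteration(A_sym, k)

# compare with the span of the top-k eigenvectors from eigh
w, V = la.eigh(A_sym)
Vk = V[:, -k:]                          # eigh returns eigenvalues in ascending order
s = np.linalg.svd(Q.T @ Vk, compute_uv=False)
print(s)                                # all close to 1 if the subspaces coincide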
Exercise#
Implement a version of solve for a square matrix A which uses LAPACK for the LU decomposition, and BLAS for the triangular solves. You may want to look at dgetrf (the lu return object has L in the lower triangular part, and U in the upper triangular part).
## Your code here
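One possible sketch. It assumes dgetrf returns 0-based pivot indices (as scipy.linalg.lu_factor does) and that dtrsv accepts the lower/diag keywords shown (diag=1 marks the unit-diagonal L factor); check the docstrings to confirm these conventions:

def solve_lu(A, b):
    # solve A x = b using LAPACK getrf (LU with partial pivoting) + BLAS triangular solves
    lu, piv, info = lapack.dgetrf(A)         # lu holds L (unit lower) and U (upper)
    if info != 0:
        raise ValueError("dgetrf failed with info = {}".format(info))
    x = b.copy()
    for i, p in enumerate(piv):              # apply the row swaps: x <- P b (piv assumed 0-based)
        x[i], x[p] = x[p], x[i]
    x = blas.dtrsv(lu, x, lower=1, diag=1)   # solve L y = P b (unit lower triangle)
    x = blas.dtrsv(lu, x, lower=0, diag=0)   # solve U x = y
    return x

n = 200
A = np.array(np.random.randn(n, n), order='F')
b = np.random.randn(n)
x = solve_lu(A, b)
print(np.linalg.norm(A @ x - b))             # should be ~ machine precision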