Vectorization, Numpy Universal Functions#
In order to write performant code using numerical libraries, it is useful to keep the following rule in mind:
Code that is predictable can be made fast
The most common example of predictable code is a fixed-length loop which performs the same operation at each iteration (up to changes in index).
Examples:
Matrix-vector multiplication
Functions applied element-wise to an array
Non-examples:
Code with branch instructions (`if`, `else`, etc.)
Code with recursive function calls (at least in Python)
One reason why predictable code can be fast is that most CPUs have what is called a branch predictor in them, which pre-loads computation. If a branch is predicted incorrectly, then the CPU has to switch gears and go along the correct branch, which takes time. Code without branches will minimize the number of branch prediction errors, speeding up code.
Another reason why predictable code can be made fast is vectorization. If you are performing the same operation in a predictable way, code can employ special SIMD instructions such as AVX, which can greatly increase efficiency.
You don’t need to worry about the details in Python, but it is good to know how to write code that allows libraries like NumPy to take advantage of these techniques. Note that standard Python loops will not take advantage of these things - you typically need to use libraries.
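For example, a loop containing an `if` can often be replaced by a branch-free array expression that NumPy evaluates in bulk. Here is a minimal sketch (the function names are just for illustration):

import numpy as np

x = np.random.randn(1_000_000)

def relu_loop(x):
    """elementwise max(x, 0) with a branch inside a Python loop"""
    out = np.empty_like(x)
    for i in range(len(x)):
        if x[i] > 0:           # branch at every iteration
            out[i] = x[i]
        else:
            out[i] = 0.0
    return out

def relu_vectorized(x):
    """same result, computed as a single predictable array operation"""
    return np.maximum(x, 0.0)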
Universal Functions#
Numpy universal functions (or ufuncs) are functions that are applied element-wise to an array. Examples include most math operations and logical comparisons. You can find additional information in the ufunc documentation.
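For instance, each of the following applies elementwise to whole arrays:

import numpy as np

x = np.random.randn(5)
y = np.random.randn(5)

np.sin(x)         # elementwise sine
np.exp(x)         # elementwise exponential
np.maximum(x, y)  # elementwise maximum of two arrays
x > 0             # elementwise comparison (np.greater), returns a boolean array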
import numpy as np
import time
# set up two vectors
n = 1_000_000
x = np.random.randn(n)
y = np.random.randn(n)
def naive_add(x, y):
    """
    add two arrays using a Python for-loop
    """
    z = np.empty_like(x)
    for i in range(len(x)):
        z[i] = x[i] + y[i]
    return z
start = time.monotonic()
z = naive_add(x, y)
end = time.monotonic()
print("time for naive add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = np.add(x, y)
end = time.monotonic()
print("time for numpy add: {:0.3e} sec".format(end - start))
time for naive add: 1.924e-01 sec
time for numpy add: 4.033e-03 sec
If you prefer to write your code without the for-loop, you can use `np.vectorize`. See the documentation for details.
@np.vectorize
def numpy_add_vectorize(x, y):
    return x + y
start = time.monotonic()
z = naive_add(x, y)
end = time.monotonic()
print("time for naive add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = np.add(x, y)
end = time.monotonic()
print("time for numpy add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = numpy_add_vectorize(x, y)
end = time.monotonic()
print("time for numpy add (vectorize): {:0.3e} sec".format(end - start))
time for naive add: 1.984e-01 sec
time for numpy add: 2.147e-03 sec
time for numpy add (vectorize): 1.627e-01 sec
`np.vectorize` doesn’t really give a performance boost, but it can make defining functions simpler.
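For instance, `np.vectorize` is handy for scalar functions containing branches, which don’t translate directly into a single array expression. A small sketch (the function here is just for illustration):

@np.vectorize
def piecewise(x):
    # scalar logic with a branch; np.vectorize applies it elementwise
    if x > 0:
        return x ** 2
    else:
        return -x

z = piecewise(np.random.randn(10))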
Exercise#
Perform some timing tests that compare a naive Python loop implementation with a NumPy ufunc. Try computing `sin`, `exp`, multiplication, etc.
## Your code here
More complicated functions#
Some functions that can be sped up considerably are a bit more complicated than ufuncs. One example is matrix-vector multiplication. We’ll use the notation $Ax = y$, where
\begin{equation}
y_i = \sum_j A_{i,j} x_j
\end{equation}
# set up matrix and vector for multiplication
m, n = 500, 1000
A = np.random.randn(m, n)
x = np.random.randn(n)
def naive_matvec(A, x):
    """
    naive matrix-vector multiplication implementation
    """
    m, n = A.shape
    y = np.zeros(m)
    for i in range(m):
        for j in range(n):
            y[i] = y[i] + A[i,j] * x[j]
    return y
start = time.monotonic()
y1 = naive_matvec(A, x)
end = time.monotonic()
print("time for naive matvec: {:0.3e} sec".format(end - start))
start = time.monotonic()
# y2 = np.matmul(A, x)
y2 = A @ x
end = time.monotonic()
print("time for numpy matvec: {:0.3e} sec".format(end - start))
np.linalg.norm(y1 - y2)
time for naive matvec: 1.547e-01 sec
time for numpy matvec: 5.649e-03 sec
7.501200011387548e-13
Numba#
Numba is a just in time (JIT) compiler for Python code. It provides several decorators which make it very easy to get speedups for numerical code in many situations.
Just in time compilation is an increasingly popular solution that bridges the gap between interpreted and compiled languages. Generally:
Interpreted languages (such as Python) simply read code line-by-line and execute as they go along.
Compiled languages (such as C, C++, Fortran) compile code into a binary, which can be optimized to run quickly.
Compilation takes time initially but saves time when you use the binary. Python libraries such as NumPy and SciPy use compiled libraries under the hood for speed. Interpreted languages tend to be slower, but are easier to develop in.
Just in time compilation will produce a compiled version of a function the first time it is needed. Julia is a relatively new language which uses JIT to produce fast code with less development overhead.
One of the things you need to know to compile code is the types used - if you want to use different types (e.g. single and double precision versions of a function), you need different compiled versions. Python usually lets you avoid worrying too much about types, but this is one reason why you still need to know about them in scientific computing.
First:
$ conda install numba
Let’s look at how Numba can be used with our `naive_add` function.
from numba import jit, njit
@jit # this is the only thing we do different
def numba_add(x, y):
    """
    add two arrays using a Python for-loop
    """
    z = np.empty_like(x)
    for i in range(len(x)):
        z[i] = x[i] + y[i]
    return z
/tmp/ipykernel_1632/852113555.py:3: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
@jit # this is the only thing we do different
# set up two vectors
n = 1_000_000
x = np.random.randn(n)
y = np.random.randn(n)
start = time.monotonic()
z = naive_add(x, y)
end = time.monotonic()
print("time for naive add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = np.add(x, y)
end = time.monotonic()
print("time for numpy add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = numba_add(x, y)
end = time.monotonic()
print("time for numba add: {:0.3e} sec".format(end - start))
time for naive add: 2.079e-01 sec
time for numpy add: 2.242e-03 sec
time for numba add: 3.245e-01 sec
The `numba` JIT function runs in about the same time as the naive function. Let’s see what happens when we run the code again:
# set up two vectors
n = 1_000_000
x = np.random.randn(n)
y = np.random.randn(n)
start = time.monotonic()
z = naive_add(x, y)
end = time.monotonic()
print("time for naive add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = np.add(x, y)
end = time.monotonic()
print("time for numpy add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = numba_add(x, y)
end = time.monotonic()
print("time for numba add: {:0.3e} sec".format(end - start))
time for naive add: 2.060e-01 sec
time for numpy add: 2.482e-03 sec
time for numba add: 2.032e-03 sec
Now the `numba` function is much faster. This is because the first time the function is called, it must be compiled. Every subsequent time you call the function, it will run much faster.
The take-away is that it is advantageous to use JIT with functions you will use repeatedly, but not necessarily worth the time for functions you will only use once.
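Recall that the compiler generates a separate version of the function for each input type. You can see which specializations have been compiled by inspecting the jitted function’s `signatures` attribute; a quick sketch using the `numba_add` defined above:

print(numba_add.signatures)  # one entry per compiled type specialization

# calling with a new input type triggers a fresh compilation
x32 = x.astype(np.float32)
y32 = y.astype(np.float32)
z32 = numba_add(x32, y32)    # compiles a float32 version on first call
print(numba_add.signatures)  # now lists both specializations

If you want compiled versions to persist across sessions, `@jit(cache=True)` will cache them on disk.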
Advanced numba#
You can get a lot of mileage out of `numba` without too much trouble. It is always good to look at the documentation to learn more. Here are a few examples:
Parallelization (this is supported on a handful of known operations).
from numba import prange # parallel range
@jit(nopython=True, parallel=True)
def numba_add_parallel(x, y):
    """
    add two arrays using a parallel loop
    """
    z = np.empty_like(x)
    for i in prange(len(x)):
        z[i] = x[i] + y[i]
    return z
# set up two vectors
n = 100_000_000
x = np.random.randn(n)
y = np.random.randn(n)
z = numba_add_parallel(x, y) # precompile
start = time.monotonic()
z = numba_add(x, y)
end = time.monotonic()
print("time for numba add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = np.add(x, y)
end = time.monotonic()
print("time for numpy add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = numba_add_parallel(x, y)
end = time.monotonic()
print("time for numba parallel add: {:0.3e} sec".format(end - start))
time for numba add: 4.561e-01 sec
time for numpy add: 1.899e-01 sec
time for numba parallel add: 2.664e-01 sec
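By default, `numba` parallelizes over all available cores. The thread count can be inspected and adjusted at runtime; a quick sketch using numba’s threading API:

import numba

print(numba.get_num_threads())   # threads available to parallel regions
numba.set_num_threads(2)         # limit parallel regions to 2 threads
z = numba_add_parallel(x, y)     # now uses at most 2 threads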
Parallelization of `matvec`:
@jit(nopython=True, parallel=True)
def numba_sumvec(x):
    """
    parallel sum of the entries of a vector
    """
    m = x.shape[0]
    y = 0.0
    for i in prange(m):
        y = y + x[i]   # numba recognizes this as a parallel reduction
    return y
@jit(nopython=True, parallel=True)
def numba_matvec(A, x):
    """
    parallel matrix-vector multiplication implementation
    """
    m, n = A.shape
    y = np.zeros(m)
    for i in prange(m):
        for j in range(n):
            y[i] = y[i] + A[i,j] * x[j]
    return y
# set up matrix and vector for multiplication
m, n = 4000, 4000
A = np.random.randn(m, n)
x = np.random.randn(n)
y = numba_matvec(A, x) # precompile
start = time.monotonic()
y1 = numba_matvec(A, x)
end = time.monotonic()
print("time for numba parallel matvec: {:0.3e} sec".format(end - start))
start = time.monotonic()
y2 = np.matmul(A, x)
end = time.monotonic()
print("time for numpy matvec: {:0.3e} sec".format(end - start))
np.linalg.norm(y1 - y2)
time for numba parallel matvec: 8.725e-03 sec
time for numpy matvec: 8.683e-03 sec
9.032072117815461e-12
Declaring ufuncs without a loop. You can define a function on elements using `numba.vectorize`. Unlike `numpy.vectorize`, `numba` will give you a noticeable speedup. We need to provide a call signature - for example, `float32(float32, float32)` will take two single precision floating point numbers (`float32`) as input, and return a single precision floating point number as output. See the documentation for additional details.
from numba import vectorize
@vectorize(['float32(float32, float32)']) # call signature defined
def numba_add_vectorize(a, b):
    return a + b
# Initialize arrays
n = 100_000_000
x = np.ones(n, dtype=np.float32)
y = np.ones(x.shape, dtype=x.dtype)
z = np.empty_like(x, dtype=x.dtype)
# precompile
z = numba_add_vectorize(x, y)
start = time.monotonic()
z = numba_add(x, y)
end = time.monotonic()
print("time for numba add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = numba_add_vectorize(x, y)
end = time.monotonic()
print("time for numba add (vectorized): {:0.3e} sec".format(end - start))
time for numba add: 2.383e-01 sec
time for numba add (vectorized): 9.444e-02 sec
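Unlike `numpy.vectorize`, `numba.vectorize` with the default CPU target builds a true NumPy ufunc, so broadcasting and ufunc methods should work as usual. A small sketch:

a = np.ones((3, 1), dtype=np.float32)   # column
b = np.arange(4, dtype=np.float32)      # row
numba_add_vectorize(a, b)               # broadcasts to shape (3, 4)
numba_add_vectorize.reduce(b)           # ufunc reduction, like np.add.reduce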
For NVIDIA GPUs, you can use CUDA through `numba` - see the NVIDIA page for examples.
from numba import vectorize
@vectorize(['float32(float32, float32)'], target='cuda')
def numba_add_cuda(a, b):
    return a + b
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
File ~/miniconda3/envs/pycourse_test/lib/python3.11/site-packages/numba/cuda/cudadrv/nvvm.py:139, in NVVM.__new__(cls)
    138 try:
--> 139     inst.driver = open_cudalib('nvvm')
    140 except OSError as e:
...
OSError: libnvvm.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

NvvmSupportError                          Traceback (most recent call last)
Cell In[20], line 3
      1 from numba import vectorize
----> 3 @vectorize(['float32(float32, float32)'], target='cuda')
      4 def numba_add_cuda(a, b):
      5     return a + b
...
NvvmSupportError: libNVVM cannot be found. Do `conda install cudatoolkit`:
libnvvm.so: cannot open shared object file: No such file or directory
# precompile
z = numba_add_cuda(x, y)
start = time.monotonic()
z = numba_add(x, y)
end = time.monotonic()
print("time for numba add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = numba_add_vectorize(x, y)
end = time.monotonic()
print("time for numba add (vectorized): {:0.3e} sec".format(end - start))
start = time.monotonic()
z = numba_add_cuda(x, y)
end = time.monotonic()
print("time for numba add (cuda): {:0.3e} sec".format(end - start))
time for numba add: 3.125e-01 sec
time for numba add (vectorized): 2.114e-01 sec
time for numba add (cuda): 5.494e-01 sec
Whether the GPU gives a speedup depends on the type of operation being performed, and on whether the computation is better suited to the GPU than to vectorized instructions on a CPU. For a simple elementwise addition like this one, the run above is likely dominated by the cost of moving data between host and device.
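If you have a working CUDA setup, one standard trick to reduce that transfer cost is to copy arrays to the GPU once and reuse them across calls; a sketch using numba’s CUDA device arrays:

from numba import cuda

d_x = cuda.to_device(x)            # copy inputs to the GPU once
d_y = cuda.to_device(y)
d_z = numba_add_cuda(d_x, d_y)     # runs on the device; result stays on the GPU
z = d_z.copy_to_host()             # copy back only when needed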
For more information, see performance hints.