Vectorization, Numpy Universal Functions#
In order to write performant code using numerical libraries, it is useful to keep the following rule in mind:
Code that is predictable can be made fast
The most common example of predictable code is a fixed-length loop which performs the same operation at each iteration (up to changes in index).
Examples:
Matrix-vector multiplication
Functions applied element-wise to an array
Non-examples:
Code with branch instructions (`if`, `else`, etc.)
Code with recursive function calls (at least in Python)
One reason why predictable code can be fast is that most CPUs have what is called a branch predictor in them, which pre-loads computation. If a branch is predicted incorrectly, then the CPU has to switch gears and go along the correct branch, which takes time. Code without branches will minimize the number of branch prediction errors, speeding up code.
Another reason why predictable code can be made fast is vectorization. If you are performing the same operation in a predictable way, code can employ special SIMD instructions such as AVX, which can greatly increase efficiency.
You don’t need to worry about the details in Python, but it is good to know how to write code that allows libraries like NumPy to take advantage of these techniques. Note that standard Python loops will not take advantage of these things - you typically need to use libraries.
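For example, a loop containing an `if` can often be replaced by a branch-free array expression that NumPy evaluates in bulk. Here is a minimal sketch (the function names are just for illustration):

import numpy as np

x = np.random.randn(1_000_000)

def relu_loop(x):
    """elementwise max(x, 0) with a branch inside a Python loop"""
    out = np.empty_like(x)
    for i in range(len(x)):
        if x[i] > 0:           # branch at every iteration
            out[i] = x[i]
        else:
            out[i] = 0.0
    return out

def relu_vectorized(x):
    """same result, computed as a single predictable array operation"""
    return np.maximum(x, 0.0)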
Universal Functions#
Numpy universal functions (or ufuncs) are functions that are applied element-wise to an array. Examples include most math operations and logical comparisons. You can find additional information in the ufunc documentation.
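For instance, each of the following applies elementwise to whole arrays:

import numpy as np

x = np.random.randn(5)
y = np.random.randn(5)

np.sin(x)         # elementwise sine
np.exp(x)         # elementwise exponential
np.maximum(x, y)  # elementwise maximum of two arrays
x > 0             # elementwise comparison (np.greater), returns a boolean array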
import numpy as np
import time
# set up two vectors
n = 1_000_000
x = np.random.randn(n)
y = np.random.randn(n)
def naive_add(x, y):
    """
    add two arrays using a Python for-loop
    """
    z = np.empty_like(x)
    for i in range(len(x)):
        z[i] = x[i] + y[i]
    return z
start = time.monotonic()
z = naive_add(x, y)
end = time.monotonic()
print("time for naive add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = np.add(x, y)
end = time.monotonic()
print("time for numpy add: {:0.3e} sec".format(end - start))
time for naive add: 1.924e-01 sec
time for numpy add: 4.033e-03 sec
If you prefer to write your code without the for-loop, you can use `np.vectorize`. See the documentation for details.
@np.vectorize
def numpy_add_vectorize(x, y):
    return x + y
start = time.monotonic()
z = naive_add(x, y)
end = time.monotonic()
print("time for naive add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = np.add(x, y)
end = time.monotonic()
print("time for numpy add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = numpy_add_vectorize(x, y)
end = time.monotonic()
print("time for numpy add (vectorize): {:0.3e} sec".format(end - start))
time for naive add: 1.984e-01 sec
time for numpy add: 2.147e-03 sec
time for numpy add (vectorize): 1.627e-01 sec
`np.vectorize` doesn’t really give a performance boost, but it can make defining functions simpler.
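For instance, `np.vectorize` is handy for scalar functions containing branches, which don’t translate directly into a single array expression. A small sketch (the function here is just for illustration):

@np.vectorize
def piecewise(x):
    # scalar logic with a branch; np.vectorize applies it elementwise
    if x > 0:
        return x ** 2
    else:
        return -x

z = piecewise(np.random.randn(10))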
Exercise#
Perform some timing tests that compare a naive Python loop implementation with a NumPy ufunc. Try computing `sin`, `exp`, multiplication, etc.
## Your code here
More complicated functions#
Some functions that can be sped up considerably are a bit more complicated than ufuncs. One example is matrix-vector multiplication. We’ll use the notation $Ax = y$, where
\begin{equation}
y_i = \sum_j A_{i,j} x_j
\end{equation}
# set up matrix and vector for multiplication
m, n = 500, 1000
A = np.random.randn(m, n)
x = np.random.randn(n)
def naive_matvec(A, x):
    """
    naive matrix-vector multiplication implementation
    """
    m, n = A.shape
    y = np.zeros(m)
    for i in range(m):
        for j in range(n):
            y[i] = y[i] + A[i,j] * x[j]
    return y
start = time.monotonic()
y1 = naive_matvec(A, x)
end = time.monotonic()
print("time for naive matvec: {:0.3e} sec".format(end - start))
start = time.monotonic()
# y2 = np.matmul(A, x)
y2 = A @ x
end = time.monotonic()
print("time for numpy matvec: {:0.3e} sec".format(end - start))
np.linalg.norm(y1 - y2)
time for naive matvec: 1.547e-01 sec
time for numpy matvec: 5.649e-03 sec
7.501200011387548e-13
Numba#
Numba is a just in time (JIT) compiler for Python code. It provides several decorators which make it very easy to get speedups for numerical code in many situations.
Just in time compilation is an increasingly popular solution that bridges the gap between interpreted and compiled languages. Generally:
Interpreted languages (such as Python) simply read code line-by-line and execute as they go along.
Compiled languages (such as C, C++, Fortran) compile code into a binary, which can be optimized to run quickly.
Compilation takes time initially but saves time when you use the binary. Python libraries such as NumPy and SciPy use compiled libraries under the hood for speed. Interpreted languages tend to be slower, but are easier to develop in.
Just in time compilation will produce a compiled version of a function the first time it is needed. Julia is a relatively new language which uses JIT to produce fast code with less development overhead.
One of the things you need to know to compile code is the types used - if you want to use different types (e.g. single and double precision versions of a function), you need different compiled versions. Python usually lets you avoid worrying too much about types, but this is one reason why you still need to know about them in scientific computing.
First:
$ conda install numba
Let’s look at how Numba can be used with our `naive_add` function.
from numba import jit, njit
@jit # this is the only thing we do different
def numba_add(x, y):
    """
    add two arrays using a Python for-loop
    """
    z = np.empty_like(x)
    for i in range(len(x)):
        z[i] = x[i] + y[i]
    return z
/tmp/ipykernel_1632/852113555.py:3: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
@jit # this is the only thing we do different
# set up two vectors
n = 1_000_000
x = np.random.randn(n)
y = np.random.randn(n)
start = time.monotonic()
z = naive_add(x, y)
end = time.monotonic()
print("time for naive add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = np.add(x, y)
end = time.monotonic()
print("time for numpy add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = numba_add(x, y)
end = time.monotonic()
print("time for numba add: {:0.3e} sec".format(end - start))
time for naive add: 2.079e-01 sec
time for numpy add: 2.242e-03 sec
time for numba add: 3.245e-01 sec
The `numba` JIT function runs in about the same time as the naive function. Let’s see what happens when we run the code again:
# set up two vectors
n = 1_000_000
x = np.random.randn(n)
y = np.random.randn(n)
start = time.monotonic()
z = naive_add(x, y)
end = time.monotonic()
print("time for naive add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = np.add(x, y)
end = time.monotonic()
print("time for numpy add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = numba_add(x, y)
end = time.monotonic()
print("time for numba add: {:0.3e} sec".format(end - start))
time for naive add: 2.060e-01 sec
time for numpy add: 2.482e-03 sec
time for numba add: 2.032e-03 sec
Now the `numba` function is much faster. This is because the first time the function is called, it must be compiled. Every subsequent time you call the function, it will run much faster.
The take-away is that it is advantageous to use JIT with functions you will use repeatedly, but not necessarily worth the time for functions you will only use once.
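Recall that the compiler generates a separate version of the function for each input type. You can see which specializations have been compiled by inspecting the jitted function’s `signatures` attribute; a quick sketch using the `numba_add` defined above:

print(numba_add.signatures)  # one entry per compiled type specialization

# calling with a new input type triggers a fresh compilation
x32 = x.astype(np.float32)
y32 = y.astype(np.float32)
z32 = numba_add(x32, y32)    # compiles a float32 version on first call
print(numba_add.signatures)  # now lists both specializations

If you want compiled versions to persist across sessions, `@jit(cache=True)` will cache them on disk.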
Advanced numba#
You can get a lot of mileage out of `numba` without too much trouble. It is always good to look at the documentation to learn more. Here are a few examples:
Parallelization (this is supported on a handful of known operations).
from numba import prange # parallel range
@jit(nopython=True, parallel=True)
def numba_add_parallel(x, y):
    """
    add two arrays using a parallel loop
    """
    z = np.empty_like(x)
    for i in prange(len(x)):
        z[i] = x[i] + y[i]
    return z
# set up two vectors
n = 100_000_000
x = np.random.randn(n)
y = np.random.randn(n)
z = numba_add_parallel(x, y) # precompile
start = time.monotonic()
z = numba_add(x, y)
end = time.monotonic()
print("time for numba add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = np.add(x, y)
end = time.monotonic()
print("time for numpy add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = numba_add_parallel(x, y)
end = time.monotonic()
print("time for numba parallel add: {:0.3e} sec".format(end - start))
time for numba add: 4.561e-01 sec
time for numpy add: 1.899e-01 sec
time for numba parallel add: 2.664e-01 sec
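By default, `numba` parallelizes over all available cores. The thread count can be inspected and adjusted at runtime; a quick sketch using numba’s threading API:

import numba

print(numba.get_num_threads())   # threads available to parallel regions
numba.set_num_threads(2)         # limit parallel regions to 2 threads
z = numba_add_parallel(x, y)     # now uses at most 2 threads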
Parallelization of `matvec`:
@jit(nopython=True, parallel=True)
def numba_sumvec(x):
    """
    parallel sum of the entries of a vector
    """
    m = x.shape[0]
    y = 0.0
    for i in prange(m):
        y = y + x[i]   # numba recognizes this as a parallel reduction
    return y
@jit(nopython=True, parallel=True)
def numba_matvec(A, x):
    """
    parallel matrix-vector multiplication implementation
    """
    m, n = A.shape
    y = np.zeros(m)
    for i in prange(m):
        for j in range(n):
            y[i] = y[i] + A[i,j] * x[j]
    return y
# set up matrix and vector for multiplication
m, n = 4000, 4000
A = np.random.randn(m, n)
x = np.random.randn(n)
y = numba_matvec(A, x) # precompile
start = time.monotonic()
y1 = numba_matvec(A, x)
end = time.monotonic()
print("time for numba parallel matvec: {:0.3e} sec".format(end - start))
start = time.monotonic()
y2 = np.matmul(A, x)
end = time.monotonic()
print("time for numpy matvec: {:0.3e} sec".format(end - start))
np.linalg.norm(y1 - y2)
time for numba parallel matvec: 8.725e-03 sec
time for numpy matvec: 8.683e-03 sec
9.032072117815461e-12
Declaring ufuncs without a loop. You can define a function on elements using `numba.vectorize`. Unlike `numpy.vectorize`, `numba` will give you a noticeable speedup. We need to provide a call signature - for example, `float32(float32, float32)` will take two single precision floating point numbers (`float32`) as input, and return a single precision floating point number as output. See the documentation for additional details.
from numba import vectorize
@vectorize(['float32(float32, float32)']) # call signature defined
def numba_add_vectorize(a, b):
    return a + b
# Initialize arrays
n = 100_000_000
x = np.ones(n, dtype=np.float32)
y = np.ones(x.shape, dtype=x.dtype)
z = np.empty_like(x, dtype=x.dtype)
# precompile
z = numba_add_vectorize(x, y)
start = time.monotonic()
z = numba_add(x, y)
end = time.monotonic()
print("time for numba add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = numba_add_vectorize(x, y)
end = time.monotonic()
print("time for numba add (vectorized): {:0.3e} sec".format(end - start))
time for numba add: 2.383e-01 sec
time for numba add (vectorized): 9.444e-02 sec
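Unlike `numpy.vectorize`, `numba.vectorize` with the default CPU target builds a true NumPy ufunc, so broadcasting and ufunc methods should work as usual. A small sketch:

a = np.ones((3, 1), dtype=np.float32)   # column
b = np.arange(4, dtype=np.float32)      # row
numba_add_vectorize(a, b)               # broadcasts to shape (3, 4)
numba_add_vectorize.reduce(b)           # ufunc reduction, like np.add.reduce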
For NVIDIA GPUs, you can use CUDA through `numba` - see the NVIDIA page for examples.
from numba import vectorize
@vectorize(['float32(float32, float32)'], target='cuda')
def numba_add_cuda(a, b):
    return a + b
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
File ~/miniconda3/envs/pycourse_test/lib/python3.11/site-packages/numba/cuda/cudadrv/nvvm.py:139, in NVVM.__new__(cls)
    138 try:
--> 139     inst.driver = open_cudalib('nvvm')
    140 except OSError as e:
...
OSError: libnvvm.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

NvvmSupportError                          Traceback (most recent call last)
Cell In[20], line 3
      1 from numba import vectorize
----> 3 @vectorize(['float32(float32, float32)'], target='cuda')
      4 def numba_add_cuda(a, b):
      5     return a + b
...
NvvmSupportError: libNVVM cannot be found. Do `conda install cudatoolkit`:
libnvvm.so: cannot open shared object file: No such file or directory
# precompile
z = numba_add_cuda(x, y)
start = time.monotonic()
z = numba_add(x, y)
end = time.monotonic()
print("time for numba add: {:0.3e} sec".format(end - start))
start = time.monotonic()
z = numba_add_vectorize(x, y)
end = time.monotonic()
print("time for numba add (vectorized): {:0.3e} sec".format(end - start))
start = time.monotonic()
z = numba_add_cuda(x, y)
end = time.monotonic()
print("time for numba add (cuda): {:0.3e} sec".format(end - start))
time for numba add: 3.125e-01 sec
time for numba add (vectorized): 2.114e-01 sec
time for numba add (cuda): 5.494e-01 sec
Whether the GPU gives a speedup depends on the type of operation being performed, and on whether the computation is better suited to the GPU than to vectorized instructions on a CPU. For a simple elementwise addition like this one, the run above is likely dominated by the cost of moving data between host and device.
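If you have a working CUDA setup, one standard trick to reduce that transfer cost is to copy arrays to the GPU once and reuse them across calls; a sketch using numba’s CUDA device arrays:

from numba import cuda

d_x = cuda.to_device(x)            # copy inputs to the GPU once
d_y = cuda.to_device(y)
d_z = numba_add_cuda(d_x, d_y)     # runs on the device; result stays on the GPU
z = d_z.copy_to_host()             # copy back only when needed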
For more information, see performance hints.