Optimization in SciPy#

Optimization seeks to find the best (optimal) value of some function subject to constraints

\begin{equation} \mathop{\mathsf{minimize}}_x f(x) \quad \text{subject to } c(x) \le b \end{equation}

import numpy as np
import scipy.linalg as la
import matplotlib.pyplot as plt
import scipy.optimize as opt

Functions of One Variable#

An easy example is to minimize a quadratic

f = lambda x: x**2 + 1 # a x^T x + 0*x + 1

x = np.linspace(-2,2,500)
plt.plot(x, f(x))
plt.show()
[figure: plot of f(x) = x**2 + 1 on [-2, 2]]

There is a clear minimum at x = 0. You can search for minima using opt.minimize_scalar

sol = opt.minimize_scalar(f)
sol
 message: 
          Optimization terminated successfully;
          The returned value satisfies the termination criteria
          (using xtol = 1.48e-08 )
 success: True
     fun: 1.0
       x: 9.803862664247969e-09
     nit: 36
    nfev: 39

The return type gives you the function value sol.fun, as well as the value of x which achieves the minimum, sol.x

x = np.linspace(-2,2,500)
plt.plot(x, f(x))
plt.scatter([sol.x], [sol.fun], label='solution')
plt.legend()
plt.show()
[figure: f(x) with the computed minimum marked]

If you want to add constraints such as \(x \ge 1\), you can use bounds, along with specifying the bounded method:

sol = opt.minimize_scalar(f, bounds=(1, 1000), method='bounded')
x = np.linspace(-2,2,500)
plt.plot(x, f(x))
plt.scatter([sol.x], [sol.fun], label='solution')
plt.legend()
plt.show()
[figure: f(x) with the bounded minimum at x = 1 marked]

For more complicated functions, there may be multiple local minima

f = lambda x : (x - 2)**2 * (x + 2)**2 - x

x = np.linspace(-3,3,500)
plt.plot(x, f(x))
plt.show()
[figure: plot of f(x) = (x-2)^2 (x+2)^2 - x]
sol = opt.minimize_scalar(f)

plt.plot(x, f(x))
plt.scatter([sol.x], [sol.fun], label='solution')
plt.legend()
plt.show()
[figure: f(x) with the local minimum found by minimize_scalar marked]

Note that you will only find one minimum, and this minimum will generally only be a local minimum, not a global minimum.
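One way to locate the other minimum is to restrict the search to different intervals with the bounded method from above; here is a quick sketch (the interval endpoints are simply read off the plot):

# search the left and right halves of the plotted range separately
sol_left = opt.minimize_scalar(f, bounds=(-3, 0), method='bounded')
sol_right = opt.minimize_scalar(f, bounds=(0, 3), method='bounded')
print(sol_left.x, sol_left.fun)      # local minimum near x = -2
print(sol_right.x, sol_right.fun)    # local minimum near x = 2 (the lower of the two)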

Functions of Multiple Variables#

You might also wish to minimize functions of multiple variables. In this case, you use opt.minimize

A multivariate quadratic generally has the form \(x^T A x + b^T x + c\), where \(x\) is an n-dimensional vector, \(A\) is an \(n \times n\) matrix, \(b\) is an n-dimensional vector, and \(c\) is a scalar. When \(A\) is positive definite (PD), there is a unique minimum.
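Setting the gradient \(2 A x + b\) to zero shows that the minimizer is \(x^* = -\tfrac{1}{2} A^{-1} b\). Here is a small sketch verifying this numerically (the names nq, Aq, bq, cq are ours and not used elsewhere):

# sketch: minimize a generic quadratic and compare with the closed-form minimizer
nq = 3
Mq = np.random.randn(nq + 1, nq)
Aq = Mq.T @ Mq                       # positive definite (almost surely)
bq = np.random.randn(nq)
cq = 1.0

q = lambda x: x @ Aq @ x + bq @ x + cq
x_star = -0.5 * la.solve(Aq, bq)     # closed-form minimizer -A^{-1} b / 2

sol = opt.minimize(q, np.zeros(nq))
print(la.norm(sol.x - x_star))       # should be close to 0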

n = 2
A = np.random.randn(n+1,n)
A = A.T @ A # generates positive-definite matrix

b = np.random.rand(n)

f = lambda x : np.dot(x - b, A @ (x - b)) # (x - b)^T A (x - b)

sol = opt.minimize(f, np.zeros(n))
sol
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 3.6140409350212163e-16
        x: [ 7.301e-01  2.606e-01]
      nit: 5
      jac: [-2.377e-15  2.254e-15]
 hess_inv: [[ 1.448e-01 -5.021e-03]
            [-5.021e-03  1.530e-01]]
     nfev: 21
     njev: 7

The solution for the above problem is x = b

print(sol.x)
print(b)
print(np.linalg.norm(sol.x - b))
[0.73008137 0.26056669]
[0.73008138 0.26056669]
1.0192650915964426e-08

We can make the dimension of the problem much larger:

n = 100
A = np.random.randn(n+1,n)
A = A.T @ A

b = np.random.rand(n)

f = lambda x : np.dot(x - b, A @ (x - b)) # (x - b)^T A (x - b)

sol = opt.minimize(f, np.zeros(n))

print(la.norm(sol.x - b))
3.5774711959220054e-05

Example: Solving a Linear System#

Recall that solving a linear system means finding \(x\) so that \(A x = b\). We can turn this into an optimization problem by seeking to minimize the residual \(\|b - Ax\|_2\): \begin{equation} \mathop{\mathsf{minimize}}_x \|b - Ax\|_2 \end{equation}

n = 5
A = np.random.randn(n,n)
x0 = np.random.randn(n) # ground truth x
b = A @ x0

sol = opt.minimize(lambda x : la.norm(b - A @ x), np.zeros(n))
la.norm(sol.x - x0)
1.6292037522676566e-08

Exercises#

  1. Modify the above example to minimize \(\|b-Ax\|_1\) instead of \(\| b - Ax\|_2\) (look at documentation for la.norm). Is there any difference in the output of opt.minimize?

  2. How fast is la.solve compared to the above example?

## Your code here
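Here is one possible sketch for these exercises, reusing A, b, x0, and n from the linear-system example above; your approach may differ.

## Exercise 1 sketch: minimize the 1-norm of the residual
sol1 = opt.minimize(lambda x: la.norm(b - A @ x, ord=1), np.zeros(n))
print(la.norm(sol1.x - x0))

## Exercise 2 sketch: time the direct solver against the optimization approach
%time xs = la.solve(A, b)
%time sol2 = opt.minimize(lambda x: la.norm(b - A @ x), np.zeros(n))
print(la.norm(xs - x0))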

Constraints#

Passing in a function to be optimized is fairly straightforward. Constraints take slightly more setup. They are specified using the classes LinearConstraint and NonlinearConstraint

  • Linear constraints take the form lb <= A @ x <= ub

  • Nonlinear constraints take the form lb <= fun(x) <= ub

For the special case of a linear constraint with the form lb <= x <= ub, you can use Bounds

Linear Constraints#

Let’s look at how you might implement the linear constraints on a 2-vector x \begin{equation} \begin{cases} -1 \le x_0 \le 1 \\ c \le A x \le d \end{cases} \end{equation} where A = np.array([[1,1],[0,1]]), c = -2 * np.ones(2), and d = 2 * np.ones(2)

from scipy.optimize import Bounds, LinearConstraint

# constraint 1
C1 = Bounds(np.array([-1, -np.inf]), np.array([1,np.inf]))

# constraint 2
A = np.array([[1,1],[0,1]])
c = -2 * np.ones(2)
d = 2 * np.ones(2)
C2 = LinearConstraint(A, c, d)

Here’s a visualization of the constrained region:

n = 100
xx, yy = np.meshgrid(np.linspace(-3,3,n), np.linspace(-3,3,n))

C = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        x = np.array([xx[i,j], yy[i,j]])
        # constraint 1 
        c1 = -1 <= xx[i,j] <= 1
        # constraint 2
        c2 = np.all((c <= A@x, A@x <= d))
        C[i,j] = c1 and c2
        
plt.imshow(C, extent=(-3,3,-3,3), origin='lower')
plt.show()
[figure: indicator plot of the constrained region]

Now, let’s say we want to minimize a function f(x) subject to x obeying the constraints given above. We can simply pass in the constraints

f = lambda x : x[0]*x[1]
sol = opt.minimize(f, np.random.rand(2), bounds=C1, constraints=(C2,))
sol
 message: Optimization terminated successfully
 success: True
  status: 0
     fun: -2.000000000005538
       x: [ 1.000e+00 -2.000e+00]
     nit: 4
     jac: [-2.000e+00  1.000e+00]
    nfev: 12
    njev: 4

Let’s visualize the function in the constrained region:

F = np.zeros((n, n))

for i in range(n):
    for j in range(n):
        if C[i,j]:
            F[i,j] = f([xx[i,j], yy[i,j]])
        else:
            F[i,j] = np.nan
            
plt.imshow(F, extent=(-3,3,-3,3), origin='lower')
plt.colorbar()
plt.show()
[figure: f(x) = x_0 x_1 over the constrained region]

We see that the point found by the solver, [1, -2], does appear to be a minimum (by symmetry, [-1, 2] attains the same value)

Nonlinear Constraints#

Nonlinear constraints can be used to define more complicated domains. For instance, let’s look at the constraint \begin{equation} 1 \le \| x \|_2 \le 2 \end{equation}

from scipy.optimize import NonlinearConstraint

C3 = NonlinearConstraint(lambda x : np.linalg.norm(x), 1, 2)

We can again visualize this constraint, and we see that it defines an annulus:

n = 100
xx, yy = np.meshgrid(np.linspace(-3,3,n), np.linspace(-3,3,n))

C = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        x = np.array([xx[i,j], yy[i,j]])
        C[i,j] = 1 <= np.linalg.norm(x) <= 2
        
plt.imshow(C, extent=(-3,3,-3,3), origin='lower')
plt.show()
[figure: indicator plot of the annulus 1 <= ||x||_2 <= 2]

Let’s try minimizing the same function as before: \(f(x) = x_0 x_1\)

f = lambda x : x[0]*x[1]

F = np.zeros((n, n))

for i in range(n):
    for j in range(n):
        if C[i,j]:
            F[i,j] = f([xx[i,j], yy[i,j]])
        else:
            F[i,j] = np.nan
            
plt.imshow(F, extent=(-3,3,-3,3), origin='lower')
plt.colorbar()
plt.show()
[figure: f(x) = x_0 x_1 over the annulus]
sol = opt.minimize(f, np.random.rand(2), constraints=(C3,))
sol
 message: Optimization terminated successfully
 success: True
  status: 0
     fun: -2.0000000005711054
       x: [-1.414e+00  1.414e+00]
     nit: 12
     jac: [ 1.414e+00 -1.414e+00]
    nfev: 43
    njev: 11

We see that the solution is close to a true minimizer, \(x = [-\sqrt{2}, \sqrt{2}]\)

Example: Eigenvalues and Eigenvectors#

One example of an optimization problem is to find the largest eigenvalue/eigenvector pair of a matrix \(A\). If \(A\) is symmetric, and \((\lambda, x)\) is the largest eigenvalue pair, then \(x\) solves \begin{equation} \mathop{\mathsf{minimize}}_x -x^T A x \quad \text{subject to } \|x\|_2 = 1 \end{equation}

And we have \(x^T A x = \lambda\). Let’s try encoding this as an optimization problem.

n = 5
A = np.random.randn(n,n)
A = A + A.T # make symmetric

NormConst = NonlinearConstraint(lambda x : np.linalg.norm(x), 1, 1) # norm constraint
sol = opt.minimize(lambda x : -np.dot(x, A @ x), np.random.randn(5), constraints=(NormConst, ))
sol
 message: Optimization terminated successfully
 success: True
  status: 0
     fun: -5.298355703915112
       x: [ 2.686e-01  2.067e-01 -3.055e-01  4.153e-01 -7.870e-01]
     nit: 57
     jac: [-2.848e+00 -2.191e+00  3.235e+00 -4.407e+00  8.336e+00]
    nfev: 371
    njev: 57

Let’s take a look at a solution obtained using eigh:

import scipy.linalg as la
lam, V = la.eigh(A)
lam[-1], V[:,-1] # largest eigenpair
(5.2983608517562075,
 array([-0.26906152, -0.20651925,  0.30493873, -0.41645572,  0.7864685 ]))
print("error in eigenvalue: {}".format(lam[-1] + sol.fun))
print("error in eigenvector: {}".format(la.norm(-sol.x - V[:,-1])))
error in eigenvalue: 5.14784109562072e-06
error in eigenvector: 0.0014240320131961124

Optimization Options#

So far, we haven’t worried too much about the internals of opt.minimize. However, for large problems, you’re often going to want to choose an appropriate solver, and provide Jacobian and Hessian information if you are able to.

A list of available solvers can be found in the minimize documentation.

Jacobian#

The Jacobian of a vector-valued function \(f: \mathbb{R}^n \to \mathbb{R}^m\) is the \(m\times n\) matrix \(J_f\), where \begin{equation} J_f[i,j] = \frac{\partial f_i}{\partial x_j} \end{equation}

If \(f\) is scalar-valued (i.e. \(m = 1\)), then the Jacobian is the transpose of the gradient \(J_f = (\nabla f)^T\)

In our examples so far, we only provided the function. If you are able to provide the Jacobian as well, then you can typically solve problems with fewer function evaluations because you have better information about how to decrease the function value. If you don’t provide the Jacobian, many solvers will try to approximate the Jacobian using finite difference approximations, but these are typically less accurate (and slower) than if you can give an explicit formula.

Note that you can also provide a Jacobian to NonlinearConstraint
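For instance, here is a sketch that attaches an explicit Jacobian to the norm constraint from the annulus example above (away from the origin, the gradient of \(\|x\|_2\) is \(x/\|x\|_2\)):

from scipy.optimize import NonlinearConstraint

# the annulus constraint 1 <= ||x||_2 <= 2, now with an explicit Jacobian
norm_con = NonlinearConstraint(lambda x: np.linalg.norm(x), 1, 2,
                               jac=lambda x: x / np.linalg.norm(x))

sol = opt.minimize(lambda x: x[0]*x[1], np.array([1.0, 0.5]), constraints=(norm_con,))
print(sol.x, sol.fun)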

# function definition
def f(x):
    return np.cos(x[0]) + np.sin(x[1])

# Jacobian of function
def Jf(x):
    return np.array([-np.sin(x[0]), np.cos(x[1])])

x0 = np.random.rand(2)

%time sol1 = opt.minimize(f, x0)
sol1
CPU times: user 3.9 ms, sys: 316 µs, total: 4.22 ms
Wall time: 3.92 ms
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: -1.9999999999930531
        x: [ 3.142e+00 -1.571e+00]
      nit: 9
      jac: [-3.010e-06  2.176e-06]
 hess_inv: [[ 1.022e+00 -3.197e-05]
            [-3.197e-05  1.001e+00]]
     nfev: 45
     njev: 15
%time sol2 = opt.minimize(f, x0, jac=Jf)
sol2
CPU times: user 0 ns, sys: 3.34 ms, total: 3.34 ms
Wall time: 3.03 ms
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: -1.9999999999931064
        x: [ 3.142e+00 -1.571e+00]
      nit: 9
      jac: [-3.007e-06  2.179e-06]
 hess_inv: [[ 1.022e+00 -3.203e-05]
            [-3.203e-05  1.001e+00]]
     nfev: 15
     njev: 15

We see that when we provide the Jacobian, the number of function evaluations nfev decreases.

Hessian#

The Hessian of a multivariate function \(f: \mathbb{R}^n \to \mathbb{R}\) is an \(n\times n\) matrix \(H_f\) of second derivatives: \begin{equation} H_f[i,j] = \frac{\partial^2 f}{\partial x_i \partial x_j} \end{equation}

The Hessian provides information about the curvature of the function \(f\), which can be used to accelerate convergence of the optimization algorithm. If you don’t provide the Hessian, many solvers will numerically approximate it, which will typically not work as well as an explicit Hessian.

Again, you can also provide a Hessian to NonlinearConstraint
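Here is a sketch of what that looks like, using the norm constraint again: NonlinearConstraint takes a callable hess(x, v) that returns the Hessian of \(\sum_i v_i c_i(x)\), and the trust-constr method (see the solver guide below) is one that makes use of it.

from scipy.optimize import NonlinearConstraint

def c_fun(x):
    return np.linalg.norm(x)

def c_jac(x):
    return (x / np.linalg.norm(x)).reshape(1, -1)

def c_hess(x, v):
    # Hessian of v[0] * ||x||_2 is v[0] * (I - x x^T / ||x||^2) / ||x||
    nx = np.linalg.norm(x)
    return v[0] * (np.eye(len(x)) - np.outer(x, x) / nx**2) / nx

C_norm = NonlinearConstraint(c_fun, 1, 2, jac=c_jac, hess=c_hess)

# minimize x0 * x1 over the annulus, now with constraint derivatives supplied
sol = opt.minimize(lambda x: x[0]*x[1], np.array([1.0, 0.5]),
                   jac=lambda x: np.array([x[1], x[0]]),
                   hess=lambda x: np.array([[0.0, 1.0], [1.0, 0.0]]),
                   method='trust-constr', constraints=(C_norm,))
print(sol.x, sol.fun)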

def Hf(x):
    return np.array([[-np.cos(x[0]), 0], [0, -np.sin(x[1])]])

%time sol3 = opt.minimize(f, x0, jac=Jf, hess=Hf)
sol3
CPU times: user 0 ns, sys: 2.99 ms, total: 2.99 ms
Wall time: 4.19 ms
/home/goodwill/miniconda3/envs/pycourse_test/lib/python3.11/site-packages/scipy/optimize/_minimize.py:565: RuntimeWarning: Method BFGS does not use Hessian information (hess).
  warn('Method %s does not use Hessian information (hess).' % method,
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: -1.9999999999931064
        x: [ 3.142e+00 -1.571e+00]
      nit: 9
      jac: [-3.007e-06  2.179e-06]
 hess_inv: [[ 1.022e+00 -3.203e-05]
            [-3.203e-05  1.001e+00]]
     nfev: 15
     njev: 15

We see that the default method BFGS does not use the Hessian. Let’s choose one that does, like Newton-CG.

def Hf(x):
    return np.array([[-np.cos(x[0]), 0], [0, -np.sin(x[1])]])

%time sol3 = opt.minimize(f, x0, jac=Jf, hess=Hf, method='Newton-CG')
sol3
CPU times: user 3.35 ms, sys: 0 ns, total: 3.35 ms
Wall time: 3.05 ms
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: -2.0
       x: [ 3.142e+00 -1.571e+00]
     nit: 6
     jac: [ 8.095e-10  8.386e-10]
    nfev: 11
    njev: 11
    nhev: 6
(sol3.x - sol2.x)/np.pi
array([ 9.57261136e-07, -6.93316609e-07])

As we see above, Newton-CG converges in 6 iterations, evaluating the function and Jacobian only 11 times and the Hessian 6 times. For this tiny problem the difference in wall time is modest, but providing derivative information matters much more for larger problems.

Solvers#

Here’s a guide to a few of the available solvers. See the notes section of minimize for additional details.

  • Unconstrained minimization: BFGS - uses Jacobian evaluations to build up an approximation to the (inverse) Hessian.

  • Bound constrained minimization: L-BFGS-B - Variation of BFGS which uses limited memory (the L). Use with very large problems. Only uses Jacobian (not Hessian).

  • Unconstrained minimization with Jacobian/Hessian: Newton-CG - uses the Jacobian and Hessian to approximately minimize quadratic models of the objective via conjugate gradient iterations.

  • General constrained minimization: trust-constr - a trust region method for constrained optimization problems. Can use the Hessian of both the objective and constraints.

You can find a lot of information and examples about these different options in the scipy.optimize tutorial
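To get a feel for how these behave, here is a rough comparison sketch that runs the small objective from the Jacobian/Hessian section (f, Jf, Hf) through several methods; which one is fastest depends heavily on the problem.

# rough comparison sketch, reusing f, Jf, Hf defined in the sections above
x0 = np.array([1.0, 1.0])
for method in ['BFGS', 'L-BFGS-B', 'Newton-CG', 'trust-constr']:
    needs_hess = method in ('Newton-CG', 'trust-constr')   # these methods accept hess
    sol = opt.minimize(f, x0, jac=Jf, hess=Hf if needs_hess else None, method=method)
    print(method, sol.fun, sol.nit, sol.nfev)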

Global Optimization#

opt.minimize is good for finding local minima of functions. This often works well when you have a single minimum, or if you don’t care too much about finding the global minimum.

What if your function has many local minima, but you want to find the global minimum? You can use one of the global optimization functions.

Note that finding a global minimum is generally a much more difficult problem than finding a local minimum, and these functions are not guaranteed to find the true global minimum, and may not be very fast.

Here’s an example of a function which has a global minimum, but many local minima:

# global minimum at (pi, -pi/2)
def f(x, a=0.1):
    return np.cos(x[0]) + np.sin(x[1]) + a*(x[0] - np.pi)**2 + a*(x[1] + np.pi/2)**2
f([np.pi, -np.pi/2])
-2.0

We can use the basinhopping solver to look for this minimum:

sol = opt.basinhopping(f, 10*np.random.rand(2))
# sol = opt.basinhopping(f, np.array([np.pi, -np.pi/2]) + 0.2*np.random.randn(2))
sol
                    message: ['requested number of basinhopping iterations completed successfully']
                    success: True
                        fun: -1.9999999999999987
                          x: [ 3.142e+00 -1.571e+00]
                        nit: 100
      minimization_failures: 0
                       nfev: 1254
                       njev: 418
 lowest_optimization_result:  message: Optimization terminated successfully.
                              success: True
                               status: 0
                                  fun: -1.9999999999999987
                                    x: [ 3.142e+00 -1.571e+00]
                                  nit: 3
                                  jac: [ 5.960e-08  4.470e-08]
                             hess_inv: [[ 1.000e+00 -2.729e-03]
                                        [-2.729e-03  8.336e-01]]
                                 nfev: 12
                                 njev: 4

Depending on the random starting point, basin-hopping may return only a local minimum rather than the global one (the run shown above does land on the global minimum). The main knobs are the number of basin-hopping iterations niter, the stepsize of the hops, and the “temperature” T used in the accept/reject step, which should be comparable to the separation in function values between local minima.

%time sol = opt.basinhopping(f, 10*np.random.rand(2), T=1e-6, niter=1)
sol
CPU times: user 0 ns, sys: 7.12 ms, total: 7.12 ms
Wall time: 6.41 ms
                    message: ['requested number of basinhopping iterations completed successfully']
                    success: True
                        fun: -1.99999999999956
                          x: [ 3.142e+00 -1.571e+00]
                        nit: 1
      minimization_failures: 0
                       nfev: 42
                       njev: 14
 lowest_optimization_result:  message: Optimization terminated successfully.
                              success: True
                               status: 0
                                  fun: -1.99999999999956
                                    x: [ 3.142e+00 -1.571e+00]
                                  nit: 3
                                  jac: [ 5.066e-07  9.090e-07]
                             hess_inv: [[ 8.944e-01  8.015e-02]
                                        [ 8.015e-02  9.392e-01]]
                                 nfev: 12
                                 njev: 4

This run also finds the true minimum value, with far fewer function evaluations.

Here’s the same problem using opt.dual_annealing:

%time sol = opt.dual_annealing(f, bounds=[[-100,100], [-100,100]])
sol
CPU times: user 132 ms, sys: 5.94 ms, total: 138 ms
Wall time: 138 ms
 message: ['Maximum number of iteration reached']
 success: True
  status: 0
     fun: -1.9999999999998104
       x: [ 3.142e+00 -1.571e+00]
     nit: 1000
    nfev: 4037
    njev: 12
    nhev: 0

This needs less tweaking of the arguments; however, we had to specify bounds in which to look for the solution.
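scipy.optimize includes other global optimizers that also work within a bounding box. As a brief sketch, here is differential_evolution on the same problem:

# sketch: differential evolution on the same objective and bounds
%time sol = opt.differential_evolution(f, bounds=[[-100, 100], [-100, 100]])
print(sol.x, sol.fun)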

Finding Roots#

A root of a function \(f: \mathbb{R}^m \to \mathbb{R}^n\) is a point \(x\) so that \(f(x) = 0\). The go-to function for this is opt.root

def f(x):
    return x + 2*np.cos(x)

sol = opt.root(f, np.random.rand())
print(sol.x, sol.fun) # root and value
[-1.02986653] [0.]

For scalar functions, you can also use root_scalar

sol = opt.root_scalar(f, x0=-1, x1=1)
sol.root
-1.0298665293222544

You can also find roots of vector-valued functions. E.g. if we want to solve the nonlinear system \begin{equation} x_0 \cos(x_1) = 4, \quad x_0 x_1 - x_1 = 5 \end{equation}

we can seek to find a root of \begin{equation} x_0 \cos(x_1) - 4 = 0, \quad x_0 x_1 - x_1 - 5 = 0 \end{equation}

def f(x):
    return [x[0] * np.cos(x[1]) - 4, x[1]*x[0] - x[1] - 5]

%time sol = opt.root(f, np.ones(2))
sol
CPU times: user 139 µs, sys: 0 ns, total: 139 µs
Wall time: 134 µs
 message: The solution converged.
 success: True
  status: 1
     fun: [ 3.732e-12  1.617e-11]
       x: [ 6.504e+00  9.084e-01]
    nfev: 17
    fjac: [[-5.625e-01 -8.268e-01]
           [ 8.268e-01 -5.625e-01]]
       r: [-1.091e+00 -1.762e+00 -7.374e+00]
     qtf: [ 6.257e-08  2.401e-08]
print(sol.x, sol.fun) # root and value
[6.50409711 0.90841421] [3.73212572e-12 1.61710645e-11]

You can also specify the Jacobian for faster convergence (in this case, the Jacobian is \(2 \times 2\))

# returns the function and Jacobian as tuple
def f2(x):
    f = [x[0] * np.cos(x[1]) - 4,
          x[1]*x[0] - x[1] - 5]
    df = np.array([[np.cos(x[1]), -x[0] * np.sin(x[1])],
                   [x[1], x[0] - 1]])
    return f, df
    
%time sol = opt.root(f2, np.ones(2), jac=True, method='lm')
sol
CPU times: user 898 µs, sys: 78 µs, total: 976 µs
Wall time: 2.4 ms
 message: The relative error between two consecutive iterates is at most 0.000000
 success: True
  status: 2
     fun: [ 0.000e+00  0.000e+00]
       x: [ 6.504e+00  9.084e-01]
   cov_x: [[ 8.747e-01 -2.853e-02]
           [-2.853e-02  1.860e-02]]
    nfev: 8
    njev: 7
    fjac: [[ 7.523e+00 -7.316e-01]
           [ 2.454e-01 -1.069e+00]]
    ipvt: [2 1]
     qtf: [ 9.538e-13  1.207e-13]
print(sol.x, sol.fun) # root and value
[6.50409711 0.90841421] [0. 0.]

We see that the computed root in this case satisfies the equations more precisely.

Exercise#

Compute a root of the above example using opt.minimize instead of opt.root. One way to do this is to minimize \(\|f(x)\|\). Do the results differ if you use a 1-norm vs a 2-norm?

Recall the nonlinear system from the root-finding example above: \begin{equation} x_0 \cos(x_1) = 4, \quad x_0 x_1 - x_1 = 5 \end{equation}

## 2-norm:
def f(x):
    return la.norm([x[0] * np.cos(x[1]) - 4, x[1]*x[0] - x[1] - 5], ord = 2)

sol1 = opt.minimize(f, np.ones(2))
print(sol1.x)

## 1-norm:
def g(x):
    return la.norm([x[0] * np.cos(x[1]) - 4, x[1]*x[0] - x[1] - 5], ord =1)

sol2 = opt.minimize(g, np.ones(2))
print(sol2.x)
[6.50409711 0.9084142 ]
[6.38657705 0.89398092]

As we can see from the results, the 1-norm and 2-norm solutions are slightly different, but both are close to the root computed by opt.root. The 2-norm solution matches it almost exactly, while the 1-norm solution is a bit less accurate, likely because the 1-norm is not differentiable at the root, which makes it harder for the default solver.
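To compare the two solutions more directly, we can check how well each one actually satisfies the original equations (a quick sketch reusing sol1 and sol2 from above):

# residuals of the original nonlinear system at each computed point
resid = lambda x: np.array([x[0] * np.cos(x[1]) - 4, x[1] * x[0] - x[1] - 5])
print("2-norm solution residual:", resid(sol1.x))
print("1-norm solution residual:", resid(sol2.x))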