Distances#

A common task when dealing with data is computing the distance between two points.

We can use scipy.spatial.distance to compute a variety of distances.

import numpy as np
import scipy.linalg as la
import matplotlib.pyplot as plt
import scipy.spatial.distance as distance

A data set is a collection of observations, each of which may have several features. We’ll consider the situation where the data set is a matrix X, where each row X[i] is an observation. We’ll use n to denote the number of observations and p to denote the number of features, so X is a \(n \times p\) matrix.

For example, we might sample from a circle (with some gaussian noise)

def sample_circle(n, r=1, sigma=0.1):
    """
    sample n points from a circle of radius r
    add Gaussian noise with variance sigma^2
    """
    X = np.random.randn(n,2)
    X = r * X / np.linalg.norm(X,axis=1).reshape(-1,1)
    return X + sigma * np.random.randn(n,2)

X = sample_circle(100)
plt.scatter(X[:,0], X[:,1])
plt.show()
../_images/fd0ffabaa8566041e31807fa14821de524d638f1923e3b699e0bcb6852adbc78.png

We might also call a data set a “point cloud”, or a collection of points in some space.

Similarity, Dissimilarity, and Metric#

A similarity between two points \(x, y\in X\) is a function \(s: X \times X \to \mathbb{R}\), where \(s\) is larger if \(x,y\) are more similar.

Cosine similarity is an an example of similarity for points in a real vector space. We can define it as \begin{equation} c(x,y) = \frac{x^T y}{|x|_2 |y|_2} \end{equation}

Note \(c(x,x) = 1\), and if \(x,y\) are orthogonal, then \(c(x,y) = 0\).

A dissimilarity between two points \(x,y \in X\) is a function \(d: X \times X \to \mathbb{R}_+\), where \(d\) is smaller if \(x,y\) are more similar. Sometimes, disimilarity functions will be called distances.

Cosine distance is an example of a dissimilarity for points in a real vector space. It is defined as \begin{equation} d(x,y) = 1 - c(x,y) \end{equation} Note \(d(x,x) = 0\), and \(d(x,y) = 1\) if \(x,y\) are orthogonal.

A metric is a disimilarity \(d\) that satisfies the metric axioms

  1. \(d(x,y) = 0\) iff \(x=y\) (identity)

  2. \(d(x,y) = d(y,x)\) (symmetry)

  3. \(d(x,y) \le d(x,z) + d(z,y)\) (triangle inequality)

You are probably most familiar with the Euclidean metric \begin{equation} d(x,y) = |x - y|_2 \end{equation}

Exercise#

Write functions for the cosine similarity, cosine distance, and euclidean distance between two numpy arrays treated as vectors.

## Your code here

Solution:#

Computing Distances#

SciPy provides a variety of functionality for computing distances in scipy.spatial.distance.

x = [1.0, 0.0]
y = [0.0, 1.0]

distance.euclidean(x, y) # sqrt(2)
1.4142135623730951
distance.cosine(x, y)
1.0

the \(L_1\) / Manhattan / City block distance \begin{equation} d(x,y) = |x - y|_1 = \sum_i |x_i - y_i| \end{equation}

distance.cityblock(x, y)
2.0

The \(L_\infty\) / Chebyshev distance \begin{equation} d(x,y) = |x - y|_\infty = \max_i |x_i - y_i| \end{equation}

distance.chebyshev(x, y)
1.0

More generally, the Minkowski distance \begin{equation} d(x,y) = |x - y|_p = \big( \sum_i (x_i - y_i)^p \big)^{1/p} \end{equation}

distance.minkowski(x, y, 3)
1.2599210498948732
np.cbrt(2) # cube root of 2
1.2599210498948734

The Mahalanobis distance. Note that this is defined in terms of an inverse covariance matrix. In fact, you can use it to compute distances based on the \(A\)-norm where \(A\) is any symmetric positive definite (SPD) matrix.

\begin{equation} d(x,y) = | x- y |_A = \sqrt{(x - y)^T A (x - y)} \end{equation}

A = np.random.randn(3,2)
A = A.T @ A
distance.mahalanobis(x, y, A)
3.791565791818502

You can also weight many distances with a vector \(w\). The entry \(w_i\) will weight the comparison between the \(i\)th entries of \(x\) and \(y\). E.g. weighted euclidean distance \begin{equation} d(x, y; w) = \sqrt{\sum_i w_i(x_i - y_i)^2} \end{equation}

w = [1, 2]
for d in (distance.euclidean, distance.cityblock, lambda x, y, w : distance.minkowski(x, y, 3, w)):
    print(d(x, y, w))
1.7320508075688774
3.0
1.4422495703074083

Exercise#

Compute a weighted euclidean distance using the Mahalanobis distance.

## Your code here

Pairwise Distances#

In general, we often want to compute distances between many points all at once. This typically boils down to computing a matrix \(D\), where \(D_{i,j} = d(x_i, x_j)\).

Here’s a simple function that would do just that

def pairwise(X, dist=distance.euclidean):
    """
    compute pairwise distances in n x p numpy array X
    """
    n, p = X.shape
    D = np.empty((n,n), dtype=np.float64)
    for i in range(n):
        for j in range(n):
            D[i,j] = dist(X[i], X[j])
            
    return D
X = sample_circle(5)
pairwise(X)
array([[0.        , 0.30319957, 0.34105369, 1.83236742, 1.88826178],
       [0.30319957, 0.        , 0.11891106, 1.87895696, 1.99089783],
       [0.34105369, 0.11891106, 0.        , 1.76894693, 1.89124471],
       [1.83236742, 1.87895696, 1.76894693, 0.        , 0.37151272],
       [1.88826178, 1.99089783, 1.89124471, 0.37151272, 0.        ]])
pairwise(X, dist=lambda x, y : distance.minkowski(x, y, 3, [1.0, 2.0]))
array([[0.        , 0.32255029, 0.41071762, 1.78252772, 1.822932  ],
       [0.32255029, 0.        , 0.12235965, 1.81110967, 1.94994064],
       [0.41071762, 0.12235965, 0.        , 1.70670453, 1.85986066],
       [1.78252772, 1.81110967, 1.70670453, 0.        , 0.39851746],
       [1.822932  , 1.94994064, 1.85986066, 0.39851746, 0.        ]])

There are a couple of inneficiencies above:

  1. D is symmetric (we do redundant computation)

  2. For some metrics, it can be more efficient to compute many distances at once

As an example of point 2, pairwise euclidean distances are \begin{equation} d(x_i, x_j) = \sqrt{(x_i - x_j)^T (x_i -x_j)}\ = \sqrt{x_i^T x_i - 2 x_i^T x_j + x_j^T x_j} \end{equation}

Let \(x^2\) denote the vector of length \(n\) where \(x_i^2 = x_i^T x_i\), and let \(1\) denote the vector of ones (of length \(n\)). Then \begin{equation} D = \sqrt{1 x^{2T} + x^2 1^T - 2 X X^T} \end{equation}

Thus, we can take advantage of BLAS level 3 operations to compute the distance matrix

def pairwise_euclidean(X):
    """
    Compute pairwise euclidean distances in X
    """
    XXT = X @ X.T
    x2 = np.diag(XXT)
    one = np.ones(X.shape[0])
    D = np.outer(x2, one) + np.outer(one, x2) - 2*XXT
    return np.sqrt(D)

la.norm(pairwise_euclidean(X) - pairwise(X))
1.1571112726691442e-15

We can also use BLAS syr2 for the rank-2 update:

print(la.blas.dsyr2.__doc__)
a = dsyr2(alpha,x,y,[lower,incx,offx,incy,offy,n,a,overwrite_a])

Wrapper for ``dsyr2``.

Parameters
----------
alpha : input float
x : input rank-1 array('d') with bounds (*)
y : input rank-1 array('d') with bounds (*)

Other Parameters
----------------
lower : input int, optional
    Default: 0
incx : input int, optional
    Default: 1
offx : input int, optional
    Default: 0
incy : input int, optional
    Default: 1
offy : input int, optional
    Default: 0
n : input int, optional
    Default: ((len(x)-1-offx)/abs(incx)+1 <=(len(y)-1-offy)/abs(incy)+1 ?(len(x)-1-offx)/abs(incx)+1 :(len(y)-1-offy)/abs(incy)+1)
a : input rank-2 array('d') with bounds (n,n)
overwrite_a : input int, optional
    Default: 0

Returns
-------
a : rank-2 array('d') with bounds (n,n)
def pairwise_euclidean_blas(X):
    """
    Compute pairwise euclidean distances in X 
    use syrk2 for rank-2 update
    """
    XXT = X @ X.T
    x2 = np.diag(XXT)
    one = np.ones(X.shape[0])
    D = la.blas.dsyr2(1.0, x2, one, a=-2*XXT) # this only updates upper triangular part
    D = np.triu(D) # extract upper triangular part
    return np.sqrt(D + D.T)

la.norm(pairwise_euclidean_blas(X) - pairwise(X))
1.0569008315563939e-15
import time

# generate data set
n = 100
d = 2

X = np.random.randn(n, d)

for d in (pairwise, pairwise_euclidean, pairwise_euclidean_blas):
    t0 = time.time()
    D = d(X)
    t1 = time.time()
    print("{} : {} sec.".format(d.__name__, t1 - t0))
pairwise : 0.041265249252319336 sec.
pairwise_euclidean : 0.004239559173583984 sec.
pairwise_euclidean_blas : 0.00018477439880371094 sec.

Exercise#

Write a version of pairwise_euclidean which uses Numba. How does this compare to the BLAS implementation?

## Your code here

pdist#

You don’t need to implement these faster methods yourself. scipy.spatial.distance.pdist has built-in optimizations for a variety of pairwise distance computations. You can use scipy.spatial.distance.cdist if you are computing pairwise distances between two data sets \(X, Y\).

from scipy.spatial.distance import pdist, cdist
D = pdist(X)

The output of pdist is not a matrix, but a condensed form which stores the lower-triangular entries in a vector.

D.shape
(4950,)

to get a square matrix, you can use squareform. You can also use squareform to go back to the condensed form.

from scipy.spatial.distance import squareform

D = squareform(D)
D.shape
(100, 100)
squareform(D).shape
(4950,)

pdist takes in a keyword argument metric, which determines the distance computed

pdist(X, metric='cityblock')
array([2.37331677, 2.06840507, 1.71775182, ..., 5.00535973, 1.82927861,
       3.38074781])

Let’s compare to our implementations

import time

# generate data set
n = 100
d = 2

X = np.random.randn(n, d)

for d in (pairwise, pairwise_euclidean, pairwise_euclidean_blas, pdist):
    t0 = time.time()
    D = d(X)
    t1 = time.time()
    print("{} : {} sec.".format(d.__name__, t1 - t0))
pairwise : 0.04546785354614258 sec.
pairwise_euclidean : 0.0004761219024658203 sec.
pairwise_euclidean_blas : 0.00013256072998046875 sec.
pdist : 4.220008850097656e-05 sec.

If we form the squareform:

import time

# generate data set
n = 100
d = 2

X = np.random.randn(n, d)

def sq_pdist(X, **kwargs):
    return squareform(pdist(X, **kwargs))

for d in (pairwise, pairwise_euclidean, pairwise_euclidean_blas, sq_pdist):
    t0 = time.time()
    D = d(X)
    t1 = time.time()
    print("{} : {} sec.".format(d.__name__, t1 - t0))
pairwise : 0.04576921463012695 sec.
pairwise_euclidean : 0.0004417896270751953 sec.
pairwise_euclidean_blas : 0.00012636184692382812 sec.
sq_pdist : 6.4849853515625e-05 sec.
D = cdist(X[:30],X[30:])
D.shape
(30, 70)

Exercise#

Use pdist to compute ’braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, and ‘euclidean’, distances on a random X. Are there noticeable differences in the time it takes to compute each distance?

## Your code here