Distances

Distances#

A common task when dealing with data is computing the distance between two points.

We can use scipy.spatial.distance to compute a variety of distances.

import numpy as np
import scipy.linalg as la
import matplotlib.pyplot as plt
import scipy.spatial.distance as distance

A data set is a collection of observations, each of which may have several features. We’ll consider the situation where the data set is a matrix X, where each row X[i] is an observation. We’ll use n to denote the number of observations and p to denote the number of features, so X is a \(n \times p\) matrix.

For example, we might sample from a circle (with some gaussian noise)

def sample_circle(n, r=1, sigma=0.1):
    """
    sample n points from a circle of radius r
    add Gaussian noise with variance sigma^2
    """
    X = np.random.randn(n,2)
    X = r * X / np.linalg.norm(X,axis=1).reshape(-1,1)
    return X + sigma * np.random.randn(n,2)

X = sample_circle(100)
plt.scatter(X[:,0], X[:,1])
plt.show()

../_images/fd0ffabaa8566041e31807fa14821de524d638f1923e3b699e0bcb6852adbc78.png

We might also call a data set a “point cloud”, or a collection of points in some space.

Similarity, Dissimilarity, and Metric#

A similarity between two points \(x, y\in X\) is a function \(s: X \times X \to \mathbb{R}\), where \(s\) is larger if \(x,y\) are more similar.

Cosine similarity is an an example of similarity for points in a real vector space. We can define it as \begin{equation} c(x,y) = \frac{x^T y}{|x|_2 |y|_2} \end{equation}

Note \(c(x,x) = 1\), and if \(x,y\) are orthogonal, then \(c(x,y) = 0\).

A dissimilarity between two points \(x,y \in X\) is a function \(d: X \times X \to \mathbb{R}_+\), where \(d\) is smaller if \(x,y\) are more similar. Sometimes, disimilarity functions will be called distances.

Cosine distance is an example of a dissimilarity for points in a real vector space. It is defined as \begin{equation} d(x,y) = 1 - c(x,y) \end{equation} Note \(d(x,x) = 0\), and \(d(x,y) = 1\) if \(x,y\) are orthogonal.

A metric is a disimilarity \(d\) that satisfies the metric axioms

\(d(x,y) = 0\) iff \(x=y\) (identity)
\(d(x,y) = d(y,x)\) (symmetry)
\(d(x,y) \le d(x,z) + d(z,y)\) (triangle inequality)

You are probably most familiar with the Euclidean metric \begin{equation} d(x,y) = |x - y|_2 \end{equation}

Exercise#

Write functions for the cosine similarity, cosine distance, and euclidean distance between two numpy arrays treated as vectors.

## Your code here

Solution:#

Computing Distances#

SciPy provides a variety of functionality for computing distances in scipy.spatial.distance.

x = [1.0, 0.0]
y = [0.0, 1.0]

distance.euclidean(x, y) # sqrt(2)

1.4142135623730951

distance.cosine(x, y)

1.0

the \(L_1\) / Manhattan / City block distance \begin{equation} d(x,y) = |x - y|_1 = \sum_i |x_i - y_i| \end{equation}

distance.cityblock(x, y)

2.0

The \(L_\infty\) / Chebyshev distance \begin{equation} d(x,y) = |x - y|_\infty = \max_i |x_i - y_i| \end{equation}

distance.chebyshev(x, y)

1.0

More generally, the Minkowski distance \begin{equation} d(x,y) = |x - y|_p = \big( \sum_i (x_i - y_i)^p \big)^{1/p} \end{equation}

distance.minkowski(x, y, 3)

1.2599210498948732

np.cbrt(2) # cube root of 2

1.2599210498948734

The Mahalanobis distance. Note that this is defined in terms of an inverse covariance matrix. In fact, you can use it to compute distances based on the \(A\)-norm where \(A\) is any symmetric positive definite (SPD) matrix.

\begin{equation} d(x,y) = | x- y |_A = \sqrt{(x - y)^T A (x - y)} \end{equation}

A = np.random.randn(3,2)
A = A.T @ A
distance.mahalanobis(x, y, A)

3.791565791818502

You can also weight many distances with a vector \(w\). The entry \(w_i\) will weight the comparison between the \(i\)th entries of \(x\) and \(y\). E.g. weighted euclidean distance \begin{equation} d(x, y; w) = \sqrt{\sum_i w_i(x_i - y_i)^2} \end{equation}

w = [1, 2]
for d in (distance.euclidean, distance.cityblock, lambda x, y, w : distance.minkowski(x, y, 3, w)):
    print(d(x, y, w))

7320508075688774
0
4422495703074083

Exercise#

Compute a weighted euclidean distance using the Mahalanobis distance.

## Your code here

Pairwise Distances#

In general, we often want to compute distances between many points all at once. This typically boils down to computing a matrix \(D\), where \(D_{i,j} = d(x_i, x_j)\).

Here’s a simple function that would do just that

def pairwise(X, dist=distance.euclidean):
    """
    compute pairwise distances in n x p numpy array X
    """
    n, p = X.shape
    D = np.empty((n,n), dtype=np.float64)
    for i in range(n):
        for j in range(n):
            D[i,j] = dist(X[i], X[j])
            
    return D

X = sample_circle(5)
pairwise(X)

array([[0.        , 0.30319957, 0.34105369, 1.83236742, 1.88826178],
       [0.30319957, 0.        , 0.11891106, 1.87895696, 1.99089783],
       [0.34105369, 0.11891106, 0.        , 1.76894693, 1.89124471],
       [1.83236742, 1.87895696, 1.76894693, 0.        , 0.37151272],
       [1.88826178, 1.99089783, 1.89124471, 0.37151272, 0.        ]])

pairwise(X, dist=lambda x, y : distance.minkowski(x, y, 3, [1.0, 2.0]))

array([[0.        , 0.32255029, 0.41071762, 1.78252772, 1.822932  ],
       [0.32255029, 0.        , 0.12235965, 1.81110967, 1.94994064],
       [0.41071762, 0.12235965, 0.        , 1.70670453, 1.85986066],
       [1.78252772, 1.81110967, 1.70670453, 0.        , 0.39851746],
       [1.822932  , 1.94994064, 1.85986066, 0.39851746, 0.        ]])

There are a couple of inneficiencies above:

D is symmetric (we do redundant computation)
For some metrics, it can be more efficient to compute many distances at once

As an example of point 2, pairwise euclidean distances are \begin{equation} d(x_i, x_j) = \sqrt{(x_i - x_j)^T (x_i -x_j)}\ = \sqrt{x_i^T x_i - 2 x_i^T x_j + x_j^T x_j} \end{equation}

Let \(x^2\) denote the vector of length \(n\) where \(x_i^2 = x_i^T x_i\), and let \(1\) denote the vector of ones (of length \(n\)). Then \begin{equation} D = \sqrt{1 x^{2T} + x^2 1^T - 2 X X^T} \end{equation}

Thus, we can take advantage of BLAS level 3 operations to compute the distance matrix

def pairwise_euclidean(X):
    """
    Compute pairwise euclidean distances in X
    """
    XXT = X @ X.T
    x2 = np.diag(XXT)
    one = np.ones(X.shape[0])
    D = np.outer(x2, one) + np.outer(one, x2) - 2*XXT
    return np.sqrt(D)

la.norm(pairwise_euclidean(X) - pairwise(X))

1.1571112726691442e-15

We can also use BLAS syr2 for the rank-2 update:

print(la.blas.dsyr2.__doc__)

a = dsyr2(alpha,x,y,[lower,incx,offx,incy,offy,n,a,overwrite_a])

Wrapper for ``dsyr2``.

Parameters
----------
alpha : input float
x : input rank-1 array('d') with bounds (*)
y : input rank-1 array('d') with bounds (*)

Other Parameters
----------------
lower : input int, optional
    Default: 0
incx : input int, optional
    Default: 1
offx : input int, optional
    Default: 0
incy : input int, optional
    Default: 1
offy : input int, optional
    Default: 0
n : input int, optional
    Default: ((len(x)-1-offx)/abs(incx)+1 <=(len(y)-1-offy)/abs(incy)+1 ?(len(x)-1-offx)/abs(incx)+1 :(len(y)-1-offy)/abs(incy)+1)
a : input rank-2 array('d') with bounds (n,n)
overwrite_a : input int, optional
    Default: 0

Returns
-------
a : rank-2 array('d') with bounds (n,n)

def pairwise_euclidean_blas(X):
    """
    Compute pairwise euclidean distances in X 
    use syrk2 for rank-2 update
    """
    XXT = X @ X.T
    x2 = np.diag(XXT)
    one = np.ones(X.shape[0])
    D = la.blas.dsyr2(1.0, x2, one, a=-2*XXT) # this only updates upper triangular part
    D = np.triu(D) # extract upper triangular part
    return np.sqrt(D + D.T)

la.norm(pairwise_euclidean_blas(X) - pairwise(X))

1.0569008315563939e-15

import time

# generate data set
n = 100
d = 2

X = np.random.randn(n, d)

for d in (pairwise, pairwise_euclidean, pairwise_euclidean_blas):
    t0 = time.time()
    D = d(X)
    t1 = time.time()
    print("{} : {} sec.".format(d.__name__, t1 - t0))

pairwise : 0.041265249252319336 sec.
pairwise_euclidean : 0.004239559173583984 sec.
pairwise_euclidean_blas : 0.00018477439880371094 sec.

Exercise#

Write a version of pairwise_euclidean which uses Numba. How does this compare to the BLAS implementation?

## Your code here

`pdist`#

You don’t need to implement these faster methods yourself. scipy.spatial.distance.pdist has built-in optimizations for a variety of pairwise distance computations. You can use scipy.spatial.distance.cdist if you are computing pairwise distances between two data sets \(X, Y\).

from scipy.spatial.distance import pdist, cdist

D = pdist(X)

The output of pdist is not a matrix, but a condensed form which stores the lower-triangular entries in a vector.

D.shape

(4950,)

to get a square matrix, you can use squareform. You can also use squareform to go back to the condensed form.

from scipy.spatial.distance import squareform

D = squareform(D)
D.shape

(100, 100)

squareform(D).shape

(4950,)

pdist takes in a keyword argument metric, which determines the distance computed

pdist(X, metric='cityblock')

array([2.37331677, 2.06840507, 1.71775182, ..., 5.00535973, 1.82927861,
       3.38074781])

Let’s compare to our implementations

import time

# generate data set
n = 100
d = 2

X = np.random.randn(n, d)

for d in (pairwise, pairwise_euclidean, pairwise_euclidean_blas, pdist):
    t0 = time.time()
    D = d(X)
    t1 = time.time()
    print("{} : {} sec.".format(d.__name__, t1 - t0))

pairwise : 0.04546785354614258 sec.
pairwise_euclidean : 0.0004761219024658203 sec.
pairwise_euclidean_blas : 0.00013256072998046875 sec.
pdist : 4.220008850097656e-05 sec.

If we form the squareform:

import time

# generate data set
n = 100
d = 2

X = np.random.randn(n, d)

def sq_pdist(X, **kwargs):
    return squareform(pdist(X, **kwargs))

for d in (pairwise, pairwise_euclidean, pairwise_euclidean_blas, sq_pdist):
    t0 = time.time()
    D = d(X)
    t1 = time.time()
    print("{} : {} sec.".format(d.__name__, t1 - t0))

pairwise : 0.04576921463012695 sec.
pairwise_euclidean : 0.0004417896270751953 sec.
pairwise_euclidean_blas : 0.00012636184692382812 sec.
sq_pdist : 6.4849853515625e-05 sec.

D = cdist(X[:30],X[30:])
D.shape

(30, 70)

Exercise#

Use pdist to compute ’braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, and ‘euclidean’, distances on a random X. Are there noticeable differences in the time it takes to compute each distance?

## Your code here

Distances

Contents

Distances#

Similarity, Dissimilarity, and Metric#

Exercise#

Solution:#

Computing Distances#

Exercise#

Pairwise Distances#

Exercise#

pdist#

Exercise#

`pdist`#