Linear Algebra in PyTorch#

PyTorch is a popular package for developing models for deep learning. In this section, we’ll look at its linear algebra capabilities.

Even if you are not doing deep learning, you can use PyTorch for linear algebra. One of the nice things about PyTorch is that it makes it easy to take advantage of GPU hardware, which is very efficient at large, regular operations such as dense matrix multiplication.

You can find tutorials as well as documentation on the official PyTorch website. Note that those tutorials are focused on deep learning; this section will focus on the linear algebra capabilities.

To install PyTorch with conda, use the pytorch channel:

conda install pytorch -c pytorch
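PyTorch can also be installed with pip (for a GPU build, check pytorch.org for the command matching your CUDA version):

pip install torch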
import numpy as np
import matplotlib.pyplot as plt
import torch
torch.__version__

PyTorch Tensors#

PyTorch Tensors are just multi-dimensional arrays. You can convert back and forth between these and numpy ndarrays.

A = np.random.rand(2,2)
A
array([[0.07249615, 0.07733924],
       [0.92554353, 0.75705366]])
B = torch.Tensor(A)
B
tensor([[0.0725, 0.0773],
        [0.9255, 0.7571]])
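The reverse conversion is just as easy. As a small sketch (note that torch.Tensor copies the data and converts to float32, while torch.from_numpy shares memory with the numpy array and keeps its dtype):

Bnp = B.numpy()          # back to a numpy ndarray (shares memory with B)
C = torch.from_numpy(A)  # no copy; keeps the float64 dtype of A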

To put a tensor on the GPU, use cuda(). Note that you must have an NVIDIA GPU in your computer to be able to do this successfully.

torch.cuda.is_available()
True
Bcuda = B.cuda()
print(Bcuda)
Bcpu = B.cpu()
print(Bcpu)
tensor([[0.0725, 0.0773],
        [0.9255, 0.7571]], device='cuda:0')
tensor([[0.0725, 0.0773],
        [0.9255, 0.7571]])
B.to(device='cuda')
tensor([[0.0725, 0.0773],
        [0.9255, 0.7571]], device='cuda:0')

To move a tensor back to the CPU, you can use to() with device='cpu' (or the cpu() method shown above).
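For example, using the tensor B from above:

B.to(device='cpu') # same effect as B.cpu()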

Linear Algebra Functions#

PyTorch provides access to a variety of BLAS- and LAPACK-type routines - see the PyTorch documentation for the full list. Note that these do not follow the BLAS/LAPACK naming conventions.

torch.addmv is roughly equivalent to the BLAS routine gemv: it performs \(Ax + y\)

m = 100
n = 100
device = torch.device('cuda') # 'cuda' or 'cpu'

Anp = np.random.randn(m,n)
xnp = np.random.randn(n)
ynp = np.random.randn(m)

A = torch.Tensor(Anp).to(device=device)
x = torch.Tensor(xnp).to(device=device)
y = torch.Tensor(ynp).to(device=device)

z = torch.addmv(y, A, x)
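torch.addmv also accepts optional scaling factors: torch.addmv(input, mat, vec, beta=b, alpha=a) computes b * input + a * (mat @ vec), with both defaulting to 1. For example:

z2 = torch.addmv(y, A, x, beta=2.0, alpha=0.5) # computes 2*y + 0.5*(A @ x)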

Let’s look at the timing difference between CPU and GPU

m = 1000
n = 1000

Anp = np.random.randn(m,n)
xnp = np.random.randn(n)
ynp = np.random.randn(m)
print("numpy")
%time z = ynp + Anp @ xnp


for device in ('cpu', 'cuda'):
    print(f"\ndevice = {device}")
    A = torch.Tensor(Anp).to(device=device)
    x = torch.Tensor(xnp).to(device=device)
    y = torch.Tensor(ynp).to(device=device)
    z = torch.addmv(y, A, x) # run once to warm up

    %time z = torch.addmv(y, A, x)
    
    
numpy
CPU times: user 3.51 ms, sys: 0 ns, total: 3.51 ms
Wall time: 2.52 ms

device = cpu
CPU times: user 316 µs, sys: 112 µs, total: 428 µs
Wall time: 145 µs

device = cuda
CPU times: user 223 µs, sys: 0 ns, total: 223 µs
Wall time: 65.1 µs
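One caveat when timing GPU code: CUDA kernels are launched asynchronously, so a wall time measured this way may largely reflect launch overhead rather than the computation itself. A more careful measurement (a sketch) synchronizes before reading the clock:

import time

torch.cuda.synchronize() # wait for any pending GPU work
t0 = time.time()
z = torch.addmv(y, A, x)
torch.cuda.synchronize() # wait for the kernel to finish
print(f"elapsed: {time.time() - t0:.2e} s")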

torch.mv performs matrix-vector products

Anp = np.random.randn(m,n)
xnp = np.random.randn(n)

print("numpy")
%time z = Anp @ xnp


for device in ['cpu', 'cuda']:
    print(f"\ndevice = {device}")
    A = torch.Tensor(Anp).to(device=device)
    x = torch.Tensor(xnp).to(device=device)
    y = torch.mv(A, x) # run once to warm up

    %time y = torch.mv(A, x)
numpy
CPU times: user 2.76 ms, sys: 0 ns, total: 2.76 ms
Wall time: 647 µs

device = cpu
CPU times: user 757 µs, sys: 0 ns, total: 757 µs
Wall time: 253 µs

device = cuda
CPU times: user 146 µs, sys: 50 µs, total: 196 µs
Wall time: 51 µs

torch.mm performs matrix-matrix multiplications

m = 1000
n = 1000

Anp = np.random.randn(m,n)
Bnp = np.random.randn(n, n)

print("numpy")
%time C = Anp @ Bnp

for device in ['cpu', 'cuda']:
    print(f"\ndevice = {device}")
    A = torch.Tensor(Anp).to(device=device)
    B = torch.Tensor(Bnp).to(device=device)
    C = torch.mm(A, B) # run once to warm up

    %time C = torch.mm(A, B)
numpy
CPU times: user 92.4 ms, sys: 8.05 ms, total: 100 ms
Wall time: 32.5 ms

device = cpu
CPU times: user 45.5 ms, sys: 382 µs, total: 45.8 ms
Wall time: 11.5 ms

device = cuda
CPU times: user 137 µs, sys: 0 ns, total: 137 µs
Wall time: 39.3 µs
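As an aside, the same product can be written with torch.matmul or the @ operator, which dispatch based on the dimensions of the operands:

C1 = torch.mm(A, B)
C2 = torch.matmul(A, B) # same as torch.mm for 2-dimensional inputs
C3 = A @ B              # operator form of torch.matmul
print(torch.allclose(C1, C2), torch.allclose(C1, C3))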

Batch operations#

Where PyTorch (and GPUs in general) really shines is in batch operations. We get extra efficiency if we do a bunch of multiplications with matrices of the same size.

For batched matrix-matrix multiplication, the function is torch.bmm

Because tensors are row-major, we want the batch index to be the first index. In the code below, the batch multiplication is equivalent to

for i in range(k):
    C[i] = A[i] @ B[i]
n = 512 # matrix size
k = 32 # batch size

Anp = np.random.randn(k, n, n)
Bnp = np.random.randn(k, n, n)
# see numpy matmul documentation for how this performs batch multiplication
print("numpy")
%time C = np.matmul(Anp, Bnp)

for device in ['cpu', 'cuda']:
    print(f"\ndevice = {device}")
    A = torch.Tensor(Anp).to(device=device)
    B = torch.Tensor(Bnp).to(device=device)
    C = torch.bmm(A, B) # run once to warm up

    %time C = torch.bmm(A, B)
numpy
CPU times: user 268 ms, sys: 55 ms, total: 323 ms
Wall time: 81.6 ms

device = cpu
CPU times: user 121 ms, sys: 16.3 ms, total: 137 ms
Wall time: 34.6 ms

device = cuda
CPU times: user 0 ns, sys: 254 µs, total: 254 µs
Wall time: 263 µs
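torch.matmul (and the @ operator) also covers the batched case: on 3-dimensional inputs it behaves like torch.bmm, and it broadcasts batch dimensions, so a single matrix can be multiplied against every matrix in a batch. A short sketch (device, k, and n as above):

Abatch = torch.randn(k, n, n, device=device)
Bone = torch.randn(n, n, device=device)
C = Abatch @ Bone # Bone is broadcast across the batch dimension
print(C.shape)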

Sparse Linear Algebra#

PyTorch also supports sparse tensors in torch.sparse. Tensors are stored in COOrdinate format.

i = torch.LongTensor([[0, 1, 1],
                      [2, 0, 2]])
v = torch.FloatTensor([3, 4, 5])
torch.sparse.FloatTensor(i, v, torch.Size([2,3])).to_dense()
tensor([[0., 0., 3.],
        [4., 0., 5.]])

Indices are stored in a 2 x nnz tensor of type Long (a datatype that stores integers); values are stored as floats.
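In more recent PyTorch versions, the preferred constructor is torch.sparse_coo_tensor, and sparse-dense matrix products can be formed with torch.sparse.mm. A small sketch (same matrix as above, with a hypothetical dense right-hand side x):

i = torch.tensor([[0, 1, 1],
                  [2, 0, 2]])
v = torch.tensor([3.0, 4.0, 5.0])
S = torch.sparse_coo_tensor(i, v, size=(2, 3))
x = torch.randn(3, 1)
torch.sparse.mm(S, x) # sparse-dense matrix product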

Exercise#

Write a function that returns a sparse identity matrix of size n in PyTorch.