Scipy stats package#

A variety of functionality for working with probability distributions, random sampling, and statistical tests can be found in the scipy.stats package.

from scipy import stats
import matplotlib.pyplot as plt
import numpy as np

Distributions#

There are over 100 continuous (univariate) distributions and about 15 discrete distributions provided by scipy

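We can count them programmatically (a quick sketch; the exact numbers depend on your scipy version):

n_continuous = len([name for name in dir(stats)
                    if isinstance(getattr(stats, name), stats.rv_continuous)])
n_discrete = len([name for name in dir(stats)
                  if isinstance(getattr(stats, name), stats.rv_discrete)])
n_continuous, n_discrete
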
To avoid confusion with a norm in linear algebra, we'll import stats.norm as normal

from scipy.stats import norm as normal

A normal distribution with mean \(\mu\) and variance \(\sigma^2\) has the probability density function \begin{equation} \frac{1}{\sigma \sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)} \end{equation}

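As a quick sanity check (not in the original notebook), the formula agrees with normal.pdf at an arbitrary point:

mu, sigma = 0.5, 2.0  # example parameter values
x0 = 1.3
manual = np.exp(-(x0 - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
np.isclose(manual, normal.pdf(x0, loc=mu, scale=sigma))  # True
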
# bounds of the distribution accessed using fields a and b
normal.a, normal.b
(-inf, inf)
# mean and variance
normal.mean(), normal.var()
(0.0, 1.0)
# Cumulative distribution function (CDF)
x = np.linspace(-5,5,100)
y = normal.cdf(x)
plt.plot(x, y)
plt.title('Normal CDF');
# Probability density function (PDF)
x = np.linspace(-5,5,100)
y = normal.pdf(x)
plt.plot(x, y)
plt.title('Normal PDF');

You can sample from a distribution using the rvs method (random variates)

X = normal.rvs(size=1000)
plt.hist(X);

You can make sampling reproducible by setting the global random state with np.random.seed or by passing a random_state argument to rvs

X = normal.rvs(size=1000, random_state=0)
plt.hist(X);
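For example, seeding the global NumPy random state also gives reproducible draws (a sketch of the np.random.seed alternative mentioned above):

np.random.seed(0)          # global seed; used by rvs when no random_state is given
X = normal.rvs(size=1000)
plt.hist(X);
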

To shift and scale any distribution, you can use the loc and scale keyword arguments

X = normal.rvs(size=1000, random_state=0, loc=1.0, scale=0.5)
plt.hist(X);
# Probability density function (PDF)
x = np.linspace(-5,5,100)
y = normal.pdf(x, loc=1.0, scale=0.5)
plt.plot(x, y)
plt.title('Normal PDF');

You can also “freeze” a distribution so you don’t need to keep passing in the parameters

dist = normal(loc=1.0, scale=0.5)
X = dist.rvs(size=1000, random_state=0)
plt.hist(X);
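The frozen distribution's methods then use the stored parameters; for the normal, loc and scale are the mean and standard deviation (a small check):

dist.mean(), dist.std()  # (1.0, 0.5), the loc and scale we froze
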

Example: The Laplace Distribution#

We’ll review some of the methods we saw above on the Laplace distribution, which has a PDF \begin{equation} \frac{1}{2}e^{-|x|} \end{equation}

from scipy.stats import laplace
x = np.linspace(-5,5,100)
y = laplace.pdf(x)
plt.plot(x, y)
plt.title('Laplace PDF');
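As with the normal above, we can check this formula against laplace.pdf at an arbitrary point (a quick sketch):

x0 = 1.7
np.isclose(0.5 * np.exp(-np.abs(x0)), laplace.pdf(x0))  # True
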
x = np.linspace(-5,5,100)
y = laplace.cdf(x)
plt.plot(x, y)
plt.title('Laplace CDF');
X = laplace.rvs(size=1000)
plt.hist(X);
laplace.a, laplace.b
(-inf, inf)
laplace.mean(), laplace.var()
(0.0, 2.0)

Discrete Distributions#

Discrete distributions have many of the same methods as continuous distributions. One notable difference is that they have a probability mass function (PMF) instead of a PDF.

We’ll look at the Poisson distribution as an example

from scipy.stats import poisson
dist = poisson(1)  # rate parameter mu = 1
dist.a, dist.b
(0, inf)
dist.mean(), dist.var()
(1.0, 1.0)
x = np.arange(10)
y = dist.pmf(x)
plt.scatter(x, y);
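These values agree with the Poisson PMF formula \(\lambda^k e^{-\lambda} / k!\) (a quick check; here \(\lambda = 1\)):

from math import exp, factorial
lam, k = 1, 3
np.isclose(dist.pmf(k), lam**k * exp(-lam) / factorial(k))  # True
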
x = np.linspace(0,10, 100)
y = dist.cdf(x)
plt.plot(x, y);
X = dist.rvs(size=1000)
plt.hist(X);

Fitting Parameters#

You can fit distribution parameters to data. fit_loc_scale estimates the loc and scale parameters from the sample mean and variance; for the normal distribution these coincide with the maximum likelihood estimates

X = normal.rvs(size=100, random_state=0)
normal.fit_loc_scale(X)
(0.059808015534485, 1.0078822447165796)
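The generic fit method computes full maximum likelihood estimates of loc and scale; for the normal distribution it should give essentially the same numbers as fit_loc_scale above (a sketch, output not shown):

normal.fit(X)  # MLE of (loc, scale)
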

Statistical Tests#

You can also perform a variety of tests. A list of the tests available in scipy can be found in the scipy.stats documentation.

The one-sample t-test checks whether the mean of a sample differs significantly from a hypothesized mean (here 0)

stats.ttest_1samp(X, 0)
TtestResult(statistic=0.5904283402851698, pvalue=0.5562489158694675, df=99)

The p-value is about 0.56: if the true mean really were 0, we would expect to see a sample mean at least this far from 0 in about 56% of experiments. In a hypothesis test we would not call this significant unless the p-value fell below a chosen significance level (e.g. 0.05)

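By contrast, testing the same sample against a mean it clearly does not have should give a very small p-value (a sketch; the exact value depends on the sample):

stats.ttest_1samp(X, 1.0)  # true mean is 0, so expect a tiny p-value
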
If we want to test whether two samples could have come from the same distribution, we can use a Kolmogorov-Smirnov (KS) 2-sample test

X0 = normal.rvs(size=1000, random_state=0, loc=5)
X1 = normal.rvs(size=1000, random_state=2, loc=5)

stats.ks_2samp(X0, X1)
KstestResult(statistic=0.023, pvalue=0.9542189106778983, statistic_location=5.848821448664333, statistic_sign=-1)

The large p-value indicates that the test finds no evidence that the two samples were drawn from different distributions

X0 = normal.rvs(size=1000, random_state=0)
X1 = laplace.rvs(size=1000, random_state=2)

stats.ks_2samp(X0, X1)
KstestResult(statistic=0.056, pvalue=0.08689937254547132, statistic_location=1.5363770542457977, statistic_sign=1)

When we draw from a different distribution, the p-value is much smaller, indicating that the samples are less likely to have been drawn from the same distribution.

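There is also a one-sample KS test, kstest, which compares a sample against a reference CDF (a sketch, output not shown):

stats.kstest(X1, normal.cdf)  # one-sample KS test of X1 against the standard normal CDF
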
Kernel Density Estimation#

We might also want to estimate a continuous PDF from samples, which can be accomplished using a Gaussian kernel density estimator (KDE)

# a mixture: 100 Laplace samples and 100 normal samples centered at 5
X = np.hstack((laplace.rvs(size=100), normal.rvs(size=100, loc=5)))
plt.hist(X);

A KDE works by creating a function that is the sum of Gaussians centered at each data point. In Scipy this is implemented as an object which can be called like a function

kde = stats.gaussian_kde(X)
x = np.linspace(-5,10,500)
y = kde(x)
plt.plot(x, y)
plt.title("KDE");
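To make the idea concrete, here is a rough by-hand version that averages a Gaussian PDF centered at each sample point (the bandwidth h is picked by hand, so it will not match gaussian_kde's automatically chosen bandwidth exactly):

h = 0.5  # bandwidth, chosen by hand for illustration
y_manual = np.mean([normal.pdf(x, loc=xi, scale=h) for xi in X], axis=0)
plt.plot(x, y_manual)
plt.title("KDE by hand");
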

We can change the bandwidth of the Gaussians used in the KDE with the bw_method parameter, which accepts a scalar factor, the strings 'scott' or 'silverman', or a callable that computes the factor. In the plot above, the peak on the left is a bit smoother than we would expect for the Laplace distribution; we can decrease the bandwidth to make it sharper:

kde = stats.gaussian_kde(X, bw_method=0.2)
x = np.linspace(-5,10,500)
y = kde(x)
plt.plot(x, y)
plt.title("KDE");

If we set the bandwidth too small, then the KDE will look a bit rough.

kde = stats.gaussian_kde(X, bw_method=0.1)
x = np.linspace(-5,10,500)
y = kde(x)
plt.plot(x, y)
plt.title("KDE");