Most of the commentary below is taken from the NumPy documentation.
In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
In [ ]:
def plot(data, bins=30):
    plt.hist(data, bins)
    plt.show()
In [ ]:
def bernoulli(p=None, size=1):
    return np.random.binomial(n=1, p=p, size=size)
In [ ]:
bernoulli(p=0.5, size=100)
Samples are drawn from a binomial distribution with specified parameters, $n$ trials and $p$ probability of success where $n \in \mathbb{N}$ and $p$ is in the interval $[0,1]$.
The probability density for the binomial distribution is
$$P(N) = \binom{n}{N}p^N(1-p)^{n-N}$$where $n$ is the number of trials, $p$ is the probability of success, and $N$ is the number of successes.
When estimating the standard error of a proportion in a population by using a random sample, the normal distribution works well unless the product $p \cdot n \le 5$, where $p$ is the population proportion estimate and $n$ is the number of samples, in which case the binomial distribution is used instead. For example, a sample of 15 people shows 4 who are left-handed and 11 who are right-handed. Then $p = 4/15 \approx 0.27$ and $0.27 \times 15 = 4 \le 5$, so the binomial distribution should be used in this case.
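As a small illustrative check (added here, not from the NumPy docs), the rule of thumb above can be evaluated directly for the left-handedness example, and binomial samples show how much the observed count can vary.
In [ ]:
# Sketch of the rule of thumb above (illustrative values only).
n = 15            # sample size
p_hat = 4 / 15    # estimated proportion of left-handed people
print("p*n =", p_hat * n)                # 4.0 <= 5, so prefer the binomial model
counts = np.random.binomial(n=n, p=p_hat, size=10000)
print("simulated std of the count:", counts.std())
print("normal-approximation std:  ", np.sqrt(n * p_hat * (1 - p_hat)))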
In [ ]:
np.random.binomial(n=10, p=0.5, size=100)
In [ ]:
data = np.random.binomial(n=10, p=0.25, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.binomial(n=10, p=0.5, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.binomial(n=10, p=0.75, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.binomial(n=25, p=0.5, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
Samples are drawn from a hypergeometric distribution with specified parameters, ngood (ways to make a good selection), nbad (ways to make a bad selection), and nsample (number of items sampled), which is less than or equal to the sum ngood + nbad.
The probability density for the hypergeometric distribution is
$$P(x) = \frac{\binom{g}{x}\binom{b}{n-x}}{\binom{g+b}{n}},$$where $0 \le x \le n$, $x \le g$, and $n - x \le b$,
for $P(x)$ the probability of $x$ good results in the drawn sample, $g$ = ngood, $b$ = nbad, and $n$ = nsample.
Consider an urn with black and white marbles in it, ngood of them are black and nbad are white. If you draw nsample marbles without replacement, then the hypergeometric distribution describes the distribution of black marbles in the drawn sample.
Note that this distribution is very similar to the binomial distribution, except that in this case, samples are drawn without replacement, whereas in the Binomial case samples are drawn with replacement (or the sample space is infinite). As the sample space becomes large, this distribution approaches the binomial.
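A quick sketch of that last point (an illustrative addition): when ngood + nbad is much larger than nsample, hypergeometric samples look very much like binomial samples with $p = \mathrm{ngood}/(\mathrm{ngood}+\mathrm{nbad})$.
In [ ]:
# Sketch: with a large population, drawing without replacement barely differs
# from drawing with replacement, so the hypergeometric approaches the binomial.
hyper = np.random.hypergeometric(ngood=5000, nbad=5000, nsample=10, size=10000)
binom = np.random.binomial(n=10, p=0.5, size=10000)
print("hypergeometric mean/std:", hyper.mean(), hyper.std())
print("binomial mean/std:      ", binom.mean(), binom.std())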
Suppose you have an urn with 15 white and 15 black marbles. If you pull 15 marbles at random, how likely is it that 12 or more of them are one color?
>>> s = np.random.hypergeometric(15, 15, 15, 100000)
>>> sum(s>=12)/100000. + sum(s<=3)/100000.
# answer = 0.003 ... pretty unlikely!
In [ ]:
np.random.hypergeometric(ngood=15, nbad=15, nsample=15, size=100)
In [ ]:
data = np.random.hypergeometric(ngood=15, nbad=15, nsample=15, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
The Poisson distribution is the limit of the binomial distribution for large N:
$$f(k; \lambda)=\frac{\lambda^k e^{-\lambda}}{k!}$$For events with an expected separation $\lambda$ the Poisson distribution $f(k; \lambda)$ describes the probability of $k$ events occurring within the observed interval $\lambda$.
Because the output is limited to the range of the C long type, a ValueError is raised when lam is within 10 sigma of the maximum representable value.
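As a small illustration (added here, not from the NumPy docs), the limiting behaviour can be checked empirically: a binomial with many trials and a small success probability is close to a Poisson with $\lambda = np$.
In [ ]:
# Sketch: binomial(n, p) with large n and small p is approximately Poisson(n*p).
binom = np.random.binomial(n=1000, p=0.003, size=100000)
poisson = np.random.poisson(lam=3, size=100000)
print("binomial mean/std:", binom.mean(), binom.std())
print("poisson  mean/std:", poisson.mean(), poisson.std())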
In [ ]:
np.random.poisson(lam=1, size=100)
In [ ]:
data = np.random.poisson(lam=1, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.poisson(lam=2, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.poisson(lam=3, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.poisson(lam=4, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.poisson(lam=5, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
Bernoulli trials are experiments with one of two outcomes: success or failure (an example of such an experiment is flipping a coin). The geometric distribution models the number of trials that must be run in order to achieve success. It is therefore supported on the positive integers, $k = 1, 2, \dots$
The probability mass function of the geometric distribution is
$$f(k) = (1 - p)^{k - 1} p$$where $p$ is the probability of success of an individual trial.
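A small sketch (not in the original notebook) that simulates the definition directly: run Bernoulli trials until the first success and count how many trials it took, which should match np.random.geometric.
In [ ]:
# Sketch: count the number of Bernoulli(p) trials needed for the first success.
def trials_until_success(p=0.5):
    k = 1
    while np.random.binomial(n=1, p=p) == 0:
        k += 1
    return k

sim = np.array([trials_until_success(p=0.5) for _ in range(10000)])
print("simulated mean:", sim.mean())   # should be close to 1/p = 2
print("np.random.geometric mean:", np.random.geometric(p=0.5, size=10000).mean())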
In [ ]:
np.random.geometric(p=0.5, size=100)
In [ ]:
data = np.random.geometric(p=0.5, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
Samples are drawn from a negative binomial distribution with specified parameters, $n$ trials and $p$ probability of success where $n$ is an integer > 0 and $p$ is in the interval $[0, 1]$.
The probability density for the negative binomial distribution is
$$P(N;n,p) = \binom{N+n-1}{n-1}p^{n}(1-p)^{N},$$where $n$ is the number of successes, $p$ is the probability of success, and $N$ is the number of failures. The negative binomial distribution gives the probability of $n-1$ successes and $N$ failures in the first $N+n-1$ trials, followed by a success on the $(N+n)$th trial.
If one throws a die repeatedly until the third time a "1" appears, then the probability distribution of the number of non-"1"s that appear before the third "1" is a negative binomial distribution.
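A direct simulation of the die example (an illustrative addition): count the non-"1" rolls that occur before the third "1" and compare with np.random.negative_binomial(n=3, p=1/6).
In [ ]:
# Sketch: number of failures (non-"1" rolls) before the third "1" appears.
def failures_before_third_one():
    successes, failures = 0, 0
    while successes < 3:
        if np.random.randint(1, 7) == 1:
            successes += 1
        else:
            failures += 1
    return failures

sim = np.array([failures_before_third_one() for _ in range(10000)])
nb = np.random.negative_binomial(n=3, p=1/6, size=10000)
print("simulated mean:", sim.mean())
print("negative_binomial mean:", nb.mean())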
In [ ]:
np.random.negative_binomial(n=1, p=0.1, size=100)
In [ ]:
data = np.random.negative_binomial(n=1, p=0.1, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
np.random.choice(range(10), size=100, replace=True, p=None)
In [ ]:
data = np.random.choice(range(25), size=100000, replace=True, p=None)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
np.random.choice(range(10), size=10, replace=False, p=None)
In [ ]:
np.random.choice([1, 2, 3], size=100, replace=True, p=[0.8, 0.1, 0.1])
In [ ]:
np.random.uniform(low=0.0, high=1.0, size=50)
The probability density function of the normal distribution, first derived by De Moivre and 200 years later by both Gauss and Laplace independently, is often called the bell curve because of its characteristic shape (see the example below).
The normal distribution occurs often in nature. For example, it describes the commonly occurring distribution of samples influenced by a large number of tiny, random disturbances, each with its own unique distribution.
The probability density for the Gaussian distribution is
$$p(x) = \frac{1}{\sqrt{ 2 \pi \sigma^2 }} e^{ - \frac{ (x - \mu)^2 } {2 \sigma^2} },$$where $\mu$ is the mean and $\sigma$ the standard deviation. The square of the standard deviation, $\sigma^2$, is called the variance.
The function has its peak at the mean, and its "spread" increases with the standard deviation (the function reaches 0.607 times its maximum at $\mu + \sigma$ and $\mu - \sigma$). This implies that numpy.random.normal is more likely to return samples lying close to the mean, rather than those far away.
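To make the 0.607 figure concrete (an added check, not from the docs): the Gaussian density at $\mu \pm \sigma$ is $e^{-1/2} \approx 0.607$ of its peak value.
In [ ]:
# Sketch: evaluate the Gaussian density at the mean and one standard deviation away.
def gaussian_pdf(x, mu=0.0, sigma=1.0):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

mu, sigma = 0.0, 1.0
peak = gaussian_pdf(mu, mu, sigma)
print("ratio at mu + sigma:", gaussian_pdf(mu + sigma, mu, sigma) / peak)  # ~0.607
print("exp(-1/2) =", np.exp(-0.5))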
In [ ]:
np.random.normal(loc=0.0, scale=1.0, size=50)
In [ ]:
data = np.random.normal(loc=0.0, scale=1.0, size=100000)
plot(data, bins=np.arange(-5, 6, 0.2))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.normal(loc=2.0, scale=1.0, size=100000)
plot(data, bins=np.arange(-5, 6, 0.2))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.normal(loc=0.0, scale=1.5, size=100000)
plot(data, bins=np.arange(-5, 6, 0.2))
print("mean:", data.mean())
print("std:", data.std())
Draw samples from a log-normal distribution with specified mean, standard deviation, and array shape. Note that the mean and standard deviation are not the values for the distribution itself, but of the underlying normal distribution it is derived from.
A variable $x$ has a log-normal distribution if $\log(x)$ is normally distributed. The probability density function for the log-normal distribution is:
$$p(x) = \frac{1}{\sigma x \sqrt{2\pi}} e^{-\frac{(\ln(x)-\mu)^2}{2\sigma^2}}$$where $\mu$ is the mean and $\sigma$ is the standard deviation of the normally distributed logarithm of the variable.
A log-normal distribution results if a random variable is the product of a large number of independent, identically-distributed variables in the same way that a normal distribution results if the variable is the sum of a large number of independent, identically-distributed variables.
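A quick consistency check (an added sketch): taking the log of lognormal samples should recover the mean and sigma of the underlying normal distribution.
In [ ]:
# Sketch: log of lognormal samples should be normal with the given mean and sigma.
samples = np.random.lognormal(mean=2.0, sigma=0.5, size=100000)
logs = np.log(samples)
print("mean of log(x):", logs.mean())   # close to 2.0
print("std of log(x): ", logs.std())    # close to 0.5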
In [ ]:
np.random.lognormal(mean=0.0, sigma=1.0, size=50)
In [ ]:
data = np.random.lognormal(mean=0.0, sigma=1.0, size=100000)
plot(data, bins=np.arange(0, 10, 0.2))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.lognormal(mean=2.0, sigma=1.0, size=100000)
plot(data, bins=np.arange(0, 10, 0.2))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.lognormal(mean=0.0, sigma=1.5, size=100000)
plot(data, bins=np.arange(0, 10, 0.2))
print("mean:", data.mean())
print("std:", data.std())
Draws samples in $[0, 1]$ from a power distribution with positive exponent $a - 1$ (with $a > 0$).
Also known as the power function distribution.
The probability density function is
$$P(x; a) = ax^{a-1}, 0 \le x \le 1, a>0.$$The power function distribution is just the inverse of the Pareto distribution. It may also be seen as a special case of the Beta distribution.
It is used, for example, in modeling the over-reporting of insurance claims.
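Since the CDF of the power distribution is $x^a$ on $[0, 1]$, inverse-transform sampling gives an easy cross-check (an added sketch, not from the docs): $U^{1/a}$ with $U$ uniform on $[0, 1)$ should match np.random.power(a).
In [ ]:
# Sketch: inverse-transform sampling for the power distribution (CDF = x**a).
a = 2.0
u = np.random.uniform(low=0.0, high=1.0, size=100000)
manual = u ** (1 / a)
builtin = np.random.power(a=a, size=100000)
print("manual  mean/std:", manual.mean(), manual.std())
print("builtin mean/std:", builtin.mean(), builtin.std())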
In [ ]:
np.random.power(a=1.0, size=50)
In [ ]:
data = np.random.power(a=0.25, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.power(a=0.5, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.power(a=1.0, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.power(a=2.0, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.power(a=5.0, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())
The probability density function of the exponential distribution is
$$f\left( x; \frac{1}{\beta} \right) = \frac{1}{\beta} \exp \left( \frac{-x}{\beta} \right)$$for $x > 0$ and 0 elsewhere.
$\beta$ is the scale parameter, which is the inverse of the rate parameter $\lambda = 1/\beta$. The rate parameter is an alternative, widely used parameterization of the exponential distribution.
The exponential distribution is a continuous analogue of the geometric distribution. It describes many common situations, such as the size of raindrops measured over many rainstorms, or the time between page requests to Wikipedia.
The scale parameter is $\beta = 1/\lambda$.
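As an added sketch of the scale parameter in action: inverse-transform sampling, $-\beta \ln(U)$ with $U$ uniform on $[0, 1)$, reproduces np.random.exponential(scale=$\beta$).
In [ ]:
# Sketch: inverse-transform sampling for the exponential distribution.
beta = 2.0
u = np.random.uniform(low=0.0, high=1.0, size=100000)
manual = -beta * np.log(1 - u)          # 1 - u stays positive since u is in [0, 1)
builtin = np.random.exponential(scale=beta, size=100000)
print("manual  mean/std:", manual.mean(), manual.std())
print("builtin mean/std:", builtin.mean(), builtin.std())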
In [ ]:
np.random.exponential(scale=1.0, size=50)
In [ ]:
data = np.random.exponential(scale=1.0, size=100000)
plot(data, bins=np.arange(0, 30, 0.5))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.exponential(scale=2.0, size=100000)
plot(data, bins=np.arange(0, 30, 0.5))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.exponential(scale=5.0, size=100000)
plot(data, bins=np.arange(0, 30, 0.5))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.exponential(scale=0.5, size=100000)
plot(data, bins=np.arange(0, 30, 0.5))
print("mean:", data.mean())
print("std:", data.std())
When df independent random variables, each with standard normal distributions (mean=0, variance=1), are squared and summed, the resulting distribution is chi-square.
This distribution is often used in hypothesis testing.
The variable obtained by summing the squares of df independent, standard normally distributed random variables,
$$Q = \sum_{i=1}^{k} X_i^2,$$is chi-square distributed with $k$ = df degrees of freedom, denoted
$$Q \sim \chi^2_k.$$The probability density function of the chi-squared distribution is
$$p(x) = \frac{(1/2)^{k/2}}{\Gamma(k/2)} x^{k/2 - 1} e^{-x/2},$$where $\Gamma$ is the gamma function,
$$\Gamma(x) = \int_0^{\infty} t^{x - 1} e^{-t} dt.$$
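A direct check of the definition above (an added sketch): summing the squares of df standard normal variables produces samples that match np.random.chisquare(df).
In [ ]:
# Sketch: build chi-square samples by summing squares of df standard normals.
df = 3
normals = np.random.normal(loc=0.0, scale=1.0, size=(10000, df))
manual = (normals ** 2).sum(axis=1)
builtin = np.random.chisquare(df=df, size=10000)
print("manual  mean/std:", manual.mean(), manual.std())
print("builtin mean/std:", builtin.mean(), builtin.std())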
In [ ]:
np.random.chisquare(df=1.0, size=50)
In [ ]:
data = np.random.chisquare(df=1.0, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.chisquare(df=2.0, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.chisquare(df=5.0, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())