Most of the commentary below is taken from the NumPy documentation.
In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
In [ ]:
def plot(data, bins=30):
    plt.hist(data, bins)
    plt.show()
In [ ]:
def bernoulli(p=None, size=1):
    return np.random.binomial(n=1, p=p, size=size)
In [ ]:
bernoulli(p=0.5, size=100)
Samples are drawn from a binomial distribution with specified parameters, $n$ trials and $p$ probability of success where $n \in \mathbb{N}$ and $p$ is in the interval $[0,1]$.
The probability density for the binomial distribution is
$$P(N) = \binom{n}{N}p^N(1-p)^{n-N}$$where $n$ is the number of trials, $p$ is the probability of success, and $N$ is the number of successes.
When estimating the standard error of a proportion in a population by using a random sample, the normal distribution works well unless the product $p \cdot n \le 5$, where $p$ is the population proportion estimate and $n$ is the number of samples, in which case the binomial distribution is used instead. For example, a sample of 15 people shows 4 who are left-handed and 11 who are right-handed. Then $p = 4/15 \approx 0.27$ and $0.27 \times 15 = 4 \le 5$, so the binomial distribution should be used in this case.
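As a small illustrative check (added here, not from the NumPy docs), the rule of thumb above can be evaluated directly for the left-handedness example, and binomial samples show how much the observed count can vary.
In [ ]:
# Sketch of the rule of thumb above (illustrative values only).
n = 15            # sample size
p_hat = 4 / 15    # estimated proportion of left-handed people
print("p*n =", p_hat * n)                # 4.0 <= 5, so prefer the binomial model
counts = np.random.binomial(n=n, p=p_hat, size=10000)
print("simulated std of the count:", counts.std())
print("normal-approximation std:  ", np.sqrt(n * p_hat * (1 - p_hat)))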
In [ ]:
np.random.binomial(n=10, p=0.5, size=100)
In [ ]:
data = np.random.binomial(n=10, p=0.25, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.binomial(n=10, p=0.5, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.binomial(n=10, p=0.75, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.binomial(n=25, p=0.5, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
Samples are drawn from a hypergeometric distribution with specified parameters, ngood (ways to make a good selection), nbad (ways to make a bad selection), and nsample (number of items sampled), which is less than or equal to the sum ngood + nbad.
The probability density for the hypergeometric distribution is
$$P(x) = \frac{\binom{g}{x}\binom{b}{n-x}}{\binom{g+b}{n}},$$where $0 \le x \le n$, $x \le g$, and $n - x \le b$,
for $P(x)$ the probability of $x$ good results in the drawn sample, $g$ = ngood, $b$ = nbad, and $n$ = nsample.
Consider an urn with black and white marbles in it, ngood of them are black and nbad are white. If you draw nsample marbles without replacement, then the hypergeometric distribution describes the distribution of black marbles in the drawn sample.
Note that this distribution is very similar to the binomial distribution, except that in this case, samples are drawn without replacement, whereas in the Binomial case samples are drawn with replacement (or the sample space is infinite). As the sample space becomes large, this distribution approaches the binomial.
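A quick sketch of that last point (an illustrative addition): when ngood + nbad is much larger than nsample, hypergeometric samples look very much like binomial samples with $p = \mathrm{ngood}/(\mathrm{ngood}+\mathrm{nbad})$.
In [ ]:
# Sketch: with a large population, drawing without replacement barely differs
# from drawing with replacement, so the hypergeometric approaches the binomial.
hyper = np.random.hypergeometric(ngood=5000, nbad=5000, nsample=10, size=10000)
binom = np.random.binomial(n=10, p=0.5, size=10000)
print("hypergeometric mean/std:", hyper.mean(), hyper.std())
print("binomial mean/std:      ", binom.mean(), binom.std())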
Suppose you have an urn with 15 white and 15 black marbles. If you pull 15 marbles at random, how likely is it that 12 or more of them are one color?
>>> s = np.random.hypergeometric(15, 15, 15, 100000)
>>> sum(s>=12)/100000. + sum(s<=3)/100000.
# answer = 0.003 ... pretty unlikely!
In [ ]:
np.random.hypergeometric(ngood=15, nbad=15, nsample=15, size=100)
In [ ]:
data = np.random.hypergeometric(ngood=15, nbad=15, nsample=15, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
The Poisson distribution is the limit of the binomial distribution for large N:
$$f(k; \lambda)=\frac{\lambda^k e^{-\lambda}}{k!}$$For events with an expected separation $\lambda$ the Poisson distribution $f(k; \lambda)$ describes the probability of $k$ events occurring within the observed interval $\lambda$.
Because the output is limited to the range of the C long type, a ValueError is raised when lam is within 10 sigma of the maximum representable value.
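As a small illustration (added here, not from the NumPy docs), the limiting behaviour can be checked empirically: a binomial with many trials and a small success probability is close to a Poisson with $\lambda = np$.
In [ ]:
# Sketch: binomial(n, p) with large n and small p is approximately Poisson(n*p).
binom = np.random.binomial(n=1000, p=0.003, size=100000)
poisson = np.random.poisson(lam=3, size=100000)
print("binomial mean/std:", binom.mean(), binom.std())
print("poisson  mean/std:", poisson.mean(), poisson.std())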
In [ ]:
np.random.poisson(lam=1, size=100)
In [ ]:
data = np.random.poisson(lam=1, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.poisson(lam=2, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.poisson(lam=3, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.poisson(lam=4, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.poisson(lam=5, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
Bernoulli trials are experiments with one of two outcomes: success or failure (an example of such an experiment is flipping a coin). The geometric distribution models the number of trials that must be run in order to achieve success. It is therefore supported on the positive integers, $k = 1, 2, \dots$
The probability mass function of the geometric distribution is
$$f(k) = (1 - p)^{k - 1} p$$where $p$ is the probability of success of an individual trial.
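A small sketch (not in the original notebook) that simulates the definition directly: run Bernoulli trials until the first success and count how many trials it took, which should match np.random.geometric.
In [ ]:
# Sketch: count the number of Bernoulli(p) trials needed for the first success.
def trials_until_success(p=0.5):
    k = 1
    while np.random.binomial(n=1, p=p) == 0:
        k += 1
    return k

sim = np.array([trials_until_success(p=0.5) for _ in range(10000)])
print("simulated mean:", sim.mean())   # should be close to 1/p = 2
print("np.random.geometric mean:", np.random.geometric(p=0.5, size=10000).mean())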
In [ ]:
np.random.geometric(p=0.5, size=100)
In [ ]:
data = np.random.geometric(p=0.5, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
Samples are drawn from a negative binomial distribution with specified parameters, $n$ trials and $p$ probability of success where $n$ is an integer > 0 and $p$ is in the interval $[0, 1]$.
The probability density for the negative binomial distribution is
$$P(N;n,p) = \binom{N+n-1}{n-1}p^{n}(1-p)^{N},$$where $n$ is the number of successes, $p$ is the probability of success, and $N$ is the number of failures. The negative binomial distribution gives the probability of $n-1$ successes and $N$ failures in the first $N+n-1$ trials, followed by a success on the $(N+n)$th trial.
If one throws a die repeatedly until the third time a "1" appears, then the probability distribution of the number of non-"1"s that appear before the third "1" is a negative binomial distribution.
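A direct simulation of the die example (an illustrative addition): count the non-"1" rolls that occur before the third "1" and compare with np.random.negative_binomial(n=3, p=1/6).
In [ ]:
# Sketch: number of failures (non-"1" rolls) before the third "1" appears.
def failures_before_third_one():
    successes, failures = 0, 0
    while successes < 3:
        if np.random.randint(1, 7) == 1:
            successes += 1
        else:
            failures += 1
    return failures

sim = np.array([failures_before_third_one() for _ in range(10000)])
nb = np.random.negative_binomial(n=3, p=1/6, size=10000)
print("simulated mean:", sim.mean())
print("negative_binomial mean:", nb.mean())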
In [ ]:
np.random.negative_binomial(n=1, p=0.1, size=100)
In [ ]:
data = np.random.negative_binomial(n=1, p=0.1, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
np.random.choice(range(10), size=100, replace=True, p=None)
In [ ]:
data = np.random.choice(range(25), size=100000, replace=True, p=None)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
np.random.choice(range(10), size=10, replace=False, p=None)
In [ ]:
np.random.choice([1, 2, 3], size=100, replace=True, p=[0.8, 0.1, 0.1])
In [ ]:
np.random.uniform(low=0.0, high=1.0, size=50)
The probability density function of the normal distribution, first derived by De Moivre and 200 years later by both Gauss and Laplace independently, is often called the bell curve because of its characteristic shape (see the example below).
The normal distribution occurs often in nature. For example, it describes the commonly occurring distribution of samples influenced by a large number of tiny, random disturbances, each with its own unique distribution.
The probability density for the Gaussian distribution is
$$p(x) = \frac{1}{\sqrt{ 2 \pi \sigma^2 }} e^{ - \frac{ (x - \mu)^2 } {2 \sigma^2} },$$where $\mu$ is the mean and $\sigma$ the standard deviation. The square of the standard deviation, $\sigma^2$, is called the variance.
The function has its peak at the mean, and its "spread" increases with the standard deviation (the function reaches 0.607 times its maximum at $\mu + \sigma$ and $\mu - \sigma$). This implies that numpy.random.normal is more likely to return samples lying close to the mean, rather than those far away.
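To make the 0.607 figure concrete (an added check, not from the docs): the Gaussian density at $\mu \pm \sigma$ is $e^{-1/2} \approx 0.607$ of its peak value.
In [ ]:
# Sketch: evaluate the Gaussian density at the mean and one standard deviation away.
def gaussian_pdf(x, mu=0.0, sigma=1.0):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

mu, sigma = 0.0, 1.0
peak = gaussian_pdf(mu, mu, sigma)
print("ratio at mu + sigma:", gaussian_pdf(mu + sigma, mu, sigma) / peak)  # ~0.607
print("exp(-1/2) =", np.exp(-0.5))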
In [ ]:
np.random.normal(loc=0.0, scale=1.0, size=50)
In [ ]:
data = np.random.normal(loc=0.0, scale=1.0, size=100000)
plot(data, bins=np.arange(-5, 6, 0.2))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.normal(loc=2.0, scale=1.0, size=100000)
plot(data, bins=np.arange(-5, 6, 0.2))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.normal(loc=0.0, scale=1.5, size=100000)
plot(data, bins=np.arange(-5, 6, 0.2))
print("mean:", data.mean())
print("std:", data.std())
Draw samples from a log-normal distribution with specified mean, standard deviation, and array shape. Note that the mean and standard deviation are not the values for the distribution itself, but of the underlying normal distribution it is derived from.
A variable $x$ has a log-normal distribution if $\log(x)$ is normally distributed. The probability density function for the log-normal distribution is:
$$p(x) = \frac{1}{\sigma x \sqrt{2\pi}} e^{-\frac{(\ln(x)-\mu)^2}{2\sigma^2}}$$where $\mu$ is the mean and $\sigma$ is the standard deviation of the normally distributed logarithm of the variable.
A log-normal distribution results if a random variable is the product of a large number of independent, identically-distributed variables in the same way that a normal distribution results if the variable is the sum of a large number of independent, identically-distributed variables.
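A quick consistency check (an added sketch): taking the log of lognormal samples should recover the mean and sigma of the underlying normal distribution.
In [ ]:
# Sketch: log of lognormal samples should be normal with the given mean and sigma.
samples = np.random.lognormal(mean=2.0, sigma=0.5, size=100000)
logs = np.log(samples)
print("mean of log(x):", logs.mean())   # close to 2.0
print("std of log(x): ", logs.std())    # close to 0.5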
In [ ]:
np.random.lognormal(mean=0.0, sigma=1.0, size=50)
In [ ]:
data = np.random.lognormal(mean=0.0, sigma=1.0, size=100000)
plot(data, bins=np.arange(0, 10, 0.2))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.lognormal(mean=2.0, sigma=1.0, size=100000)
plot(data, bins=np.arange(0, 10, 0.2))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.lognormal(mean=0.0, sigma=1.5, size=100000)
plot(data, bins=np.arange(0, 10, 0.2))
print("mean:", data.mean())
print("std:", data.std())
Draws samples in $[0, 1]$ from a power distribution with positive exponent $a - 1$ (with $a > 0$).
Also known as the power function distribution.
The probability density function is
$$P(x; a) = ax^{a-1}, 0 \le x \le 1, a>0.$$The power function distribution is just the inverse of the Pareto distribution. It may also be seen as a special case of the Beta distribution.
It is used, for example, in modeling the over-reporting of insurance claims.
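Since the CDF of the power distribution is $x^a$ on $[0, 1]$, inverse-transform sampling gives an easy cross-check (an added sketch, not from the docs): $U^{1/a}$ with $U$ uniform on $[0, 1)$ should match np.random.power(a).
In [ ]:
# Sketch: inverse-transform sampling for the power distribution (CDF = x**a).
a = 2.0
u = np.random.uniform(low=0.0, high=1.0, size=100000)
manual = u ** (1 / a)
builtin = np.random.power(a=a, size=100000)
print("manual  mean/std:", manual.mean(), manual.std())
print("builtin mean/std:", builtin.mean(), builtin.std())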
In [ ]:
np.random.power(a=1.0, size=50)
In [ ]:
data = np.random.power(a=0.25, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.power(a=0.5, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.power(a=1.0, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.power(a=2.0, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.power(a=5.0, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())
The probability density function of the exponential distribution is
$$f\left( x; \frac{1}{\beta} \right) = \frac{1}{\beta} \exp \left( \frac{-x}{\beta} \right)$$for $x > 0$ and 0 elsewhere.
$\beta$ is the scale parameter, which is the inverse of the rate parameter $\lambda = 1/\beta$. The rate parameter is an alternative, widely used parameterization of the exponential distribution.
The exponential distribution is a continuous analogue of the geometric distribution. It describes many common situations, such as the size of raindrops measured over many rainstorms, or the time between page requests to Wikipedia.
The scale parameter is $\beta = 1/\lambda$.
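As an added sketch of the scale parameter in action: inverse-transform sampling, $-\beta \ln(U)$ with $U$ uniform on $[0, 1)$, reproduces np.random.exponential(scale=$\beta$).
In [ ]:
# Sketch: inverse-transform sampling for the exponential distribution.
beta = 2.0
u = np.random.uniform(low=0.0, high=1.0, size=100000)
manual = -beta * np.log(1 - u)          # 1 - u stays positive since u is in [0, 1)
builtin = np.random.exponential(scale=beta, size=100000)
print("manual  mean/std:", manual.mean(), manual.std())
print("builtin mean/std:", builtin.mean(), builtin.std())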
In [ ]:
np.random.exponential(scale=1.0, size=50)
In [ ]:
data = np.random.exponential(scale=1.0, size=100000)
plot(data, bins=np.arange(0, 30, 0.5))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.exponential(scale=2.0, size=100000)
plot(data, bins=np.arange(0, 30, 0.5))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.exponential(scale=5.0, size=100000)
plot(data, bins=np.arange(0, 30, 0.5))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.exponential(scale=0.5, size=100000)
plot(data, bins=np.arange(0, 30, 0.5))
print("mean:", data.mean())
print("std:", data.std())
When df independent random variables, each with standard normal distributions (mean=0, variance=1), are squared and summed, the resulting distribution is chi-square.
This distribution is often used in hypothesis testing.
The variable obtained by summing the squares of df independent, standard normally distributed random variables,
$$Q = \sum_{i=1}^{k} X_i^2,$$is chi-square distributed with $k$ = df degrees of freedom, denoted
$$Q \sim \chi^2_k.$$The probability density function of the chi-squared distribution is
$$p(x) = \frac{(1/2)^{k/2}}{\Gamma(k/2)} x^{k/2 - 1} e^{-x/2},$$where $\Gamma$ is the gamma function,
$$\Gamma(x) = \int_0^{\infty} t^{x - 1} e^{-t} dt.$$
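A direct check of the definition above (an added sketch): summing the squares of df standard normal variables produces samples that match np.random.chisquare(df).
In [ ]:
# Sketch: build chi-square samples by summing squares of df standard normals.
df = 3
normals = np.random.normal(loc=0.0, scale=1.0, size=(10000, df))
manual = (normals ** 2).sum(axis=1)
builtin = np.random.chisquare(df=df, size=10000)
print("manual  mean/std:", manual.mean(), manual.std())
print("builtin mean/std:", builtin.mean(), builtin.std())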
In [ ]:
np.random.chisquare(df=1.0, size=50)
In [ ]:
data = np.random.chisquare(df=1.0, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.chisquare(df=2.0, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
In [ ]:
data = np.random.chisquare(df=5.0, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())