Author : Ronojoy Adhikari
Email : rjoy@imsc.res.in | Web : www.imsc.res.in/~rjoy
Github : www.github.com/ronojoy | Twitter: @phyrjoy
In [ ]:
%matplotlib inline
import numpy as np
import scipy.stats as stats
import matplotlib.pylab as pylab
from matplotlib import pyplot as plt
The probability that a random variable $X$ takes on a value $X=x$ is given by $P(x)$.
In [ ]:
# the scipy.stats package has many discrete and continuous random variables
# the coin toss random variable
X = stats.bernoulli(p=0.5)
# bernoulli mean = p, variance = p(1-p)
#X.mean(), X.var()
# entropy = -\sum_x p(x) * ln p(x)
# quantifies the uncertainty in the random variable
#X.entropy()
# lots of throws of the coin
x = X.rvs(1000)
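Frozen distributions also expose the pmf, which evaluates $P(x)$ directly. As a quick check (a minimal sketch, reusing the fair coin $X$ above), both outcomes should have probability 0.5:
In [ ]:
# evaluate P(X=0) and P(X=1) for the fair coin; both should be 0.5
X.pmf(0), X.pmf(1)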
In [ ]:
# the empirical frequency distribution - aka the histogram
pylab.rcParams['figure.figsize'] = 6, 4 # medium size figures
plt.hist(x, density = True, color="#348ABD", alpha = 0.4);
In [ ]:
# the 'central limit theorem' random variable
X = stats.norm()
# the normal deviate has zero mean and unit variance
#X.mean(), X.var()
# the normal deviate has the greatest entropy amongst all continuous, unbounded
# random variables with given mean and variance
# see Jaynes - Probability theory, the logic of science - for proof and discussion
#X.entropy()
# a million draws from the normal distribution
x = X.rvs(1000000)
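The maximum-entropy claim can be checked numerically: the differential entropy of a normal with variance $\sigma^2$ is $\frac{1}{2}\ln(2\pi e\sigma^2)$. A minimal sketch, comparing scipy's value for the standard normal against the closed form:
In [ ]:
# entropy of the standard normal vs. the closed form 0.5*ln(2*pi*e) ~ 1.4189
print(X.entropy(), 0.5 * np.log(2 * np.pi * np.e))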
In [ ]:
# the empirical frequency distribution - aka the histogram
pylab.rcParams['figure.figsize'] = 6, 4 # medium size figure
plt.hist(x, bins = 50, density = True, color="#348ABD", alpha = 0.4 );
Intuitive idea : add a lot of independent, identically distributed random variables and the result will be normally distributed.
More precisely : consider i.i.d. random variables $X_i$ with mean $\mu$ and variance $\sigma^2$.
The sample mean $S_n = \frac{X_1 + X_2 + \ldots + X_n}{n}$ approaches a normal distribution with mean $\mu$ and variance $\frac{\sigma^2}{n}$ as $n\rightarrow \infty$.
Let's see this in Python...
In [ ]:
# let's add lots of independent, identically distributed numbers together
X = stats.uniform()          # uniform deviates
# X = stats.bernoulli(0.5)   # coin tosses
# X = stats.poisson(1)       # telephone calls
print('The mean and variance of the uniform distribution are', X.mean(), 'and', X.var(), 'respectively.')
N = [1, 2, 4, 6, 8, 16, 32, 64]
for k, n in enumerate(N):
# draw the n random variables
x = X.rvs(size=(n, 100000))
# get their sample mean
s = x.mean(axis=0)
# plot the distribution of the sum
sx = plt.subplot(len(N) // 2, 2, k + 1)
plt.setp(sx.get_yticklabels(), visible=False)
plt.hist(s,
bins = 50,
label="$\mu = $ %f \n $n\sigma^2 = $ %f " % (s.mean(), n*s.var()),
color="#348ABD",
alpha=0.4)
leg = plt.legend()
leg.get_frame().set_alpha(0.4)
plt.autoscale(tight=True)
The normal distribution $P(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ is important!
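To tie the histograms back to this formula, we can overlay the analytic normal density predicted by the CLT on the empirical distribution of the sample mean. A minimal sketch, assuming uniform deviates and $n = 64$ (any of the distributions above would do):
In [ ]:
# empirical distribution of the sample mean of n uniform deviates
X, n = stats.uniform(), 64
s = X.rvs(size=(n, 100000)).mean(axis=0)
plt.hist(s, bins=50, density=True, color="#348ABD", alpha=0.4)
# CLT prediction : normal with mean mu and variance sigma^2 / n
grid = np.linspace(s.min(), s.max(), 200)
plt.plot(grid, stats.norm.pdf(grid, loc=X.mean(), scale=np.sqrt(X.var() / n)), 'k--');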
In [ ]:
# N throws of a coin with parameter theta0
N, theta0 = 500, 0.5
# data contains the outcome of the trial
data = stats.bernoulli.rvs(p = theta0, size = N)
# theta is distributed in the unit interval
theta = np.linspace(0, 1, 128)
In [ ]:
# inspiration : Devinder Sivia and Cameron Davidson-Pilon
# compute posterior after the nthrow-th trial
nthrow = [0, 1, 2, 3, 4, 5, 8, 16, 32, N]
for k, n in enumerate(nthrow):
# number of heads
heads = data[:n].sum()
# posterior probability of theta with conjugate prior (see Sivia for derivation)
p = stats.beta.pdf(theta, 1 + heads, 1 + n - heads)
# plot the posterior
sx = plt.subplot(len(nthrow) // 2, 2, k + 1)
plt.setp(sx.get_yticklabels(), visible=False)
plt.plot(theta, p, label="tosses %d \n heads %d " % (n, heads))
plt.fill_between(theta, 0, p, color="#348ABD", alpha=0.4)
plt.vlines(theta0, 0, 4, color="k", linestyles="--", lw=1)
leg = plt.legend()
leg.get_frame().set_alpha(0.4)
plt.autoscale(tight=True)
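Because the posterior is a Beta distribution in closed form, point and interval estimates come for free. A minimal sketch (reusing data and N from above), showing the posterior mean and a central 95% credible interval for $\theta$:
In [ ]:
# posterior over theta after all N tosses
heads = data.sum()
posterior = stats.beta(1 + heads, 1 + N - heads)
# posterior mean is (1 + heads) / (2 + N); interval() gives equal-tailed bounds
print(posterior.mean(), posterior.interval(0.95))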
In [ ]:
X = stats.bernoulli(0.5) # one fair coin
Y = stats.bernoulli(0.5) # another fair coin
# we want a new random variable which is their sum, but this fails :
# scipy's frozen distributions carry no probability semantics for addition, so Python raises a TypeError
Z = Y + X
Amazing machine learning and inference applications can be built if programs have probability semantics!
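What Z = X + Y should mean is that the pmf of Z is the convolution of the pmfs of X and Y; for two fair coins the sum is Binomial(2, 0.5). A minimal sketch, doing that convolution by hand with numpy:
In [ ]:
# pmfs of the two coins on their support {0, 1}
px, py = X.pmf([0, 1]), Y.pmf([0, 1])
# pmf of the sum Z = X + Y on {0, 1, 2}, via convolution
pz = np.convolve(px, py)
print(pz)   # [0.25 0.5 0.25], i.e. Binomial(2, 0.5)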
What's happening in the Python world?