"Random" Numbers

This lesson was developed using materials from a variety of sources, including Wikipedia, Chapter 10 of "Beautiful Testing" (written by John D. Cook), and Numerical Recipes Online.


Instructions: Create a new directory called RandomNumbers with a notebook called RandomNumbersTour. Give it a heading 1 cell title "Random" Numbers. Read this page, typing in the code in the code cells and executing them as you go.

Do not copy/paste.

Type the commands yourself to get the practice doing it. This will also slow you down so you can think about the commands and what they are doing as you type them.

Save your notebook when you are done, then try the accompanying exercises.



In [ ]:
%pylab inline
import numpy as np
import matplotlib.pyplot as plt

Introduction

We have been using random numbers in this class for a while now, without really understanding where they come from. Let's explore their provenance and see how we can use them in applications.

Pseudo-Random Number Generation (PRNG)

There are many different ways to generate uniform random numbers over a specified range (such as U[0,1]). Physically, we can spin a roulette wheel, draw balls from a lottery, throw darts at a board, etc. However, when we wish to use the numbers in a computer, we obviously need a way to generate them algorithmically. Numerically/arithmetically, we can use a sequential method where each new number is a deterministic function of the previous numbers. This destroys their true randomness and makes them, at best, "pseudo-random". However, in most cases it is sufficient if the numbers "look" uniformly distributed and have no correlation between them; that is, they pass statistical tests and obey the central limit theorem.
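
As a concrete (if toy) illustration of such a sequential method, here is a minimal sketch of a linear congruential generator (LCG), one of the simplest PRNG recipes. The constants below are commonly quoted example values; real library generators are considerably more sophisticated.

In [ ]:
#A minimal linear congruential generator (LCG):
#each new state is a deterministic function of the previous one,
#x_{i+1} = (a*x_i + c) mod m, rescaled into [0, 1)
def lcg(seed, n, a=1664525, c=1013904223, m=2**32):
    x = seed
    samples = []
    for _ in range(n):
        x = (a * x + c) % m           #deterministic update of the state
        samples.append(x / float(m))  #rescale the integer state into [0, 1)
    return samples

print(lcg(seed=4242, n=5))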

All pseudo-random number generators (PRNG) should possess a few key properties. Namely, they should

  1. be fast and not memory intensive

  2. be able to reproduce a given stream of random numbers (for debugging/verification of computer programs or so we can use identical numbers to compare different systems)

  3. be able to produce several different independent “streams” of random numbers

  4. have a long periodicity, so that they do not wrap around and produce the same numbers again within a reasonably long window.

To obtain a sequence of pseudo-random numbers, one initializes the generator with a truly random "seed" value; the generator uses that seed to create an initial internal "state" and then produces a pseudo-random sequence of numbers from that state. The sequence will eventually repeat when the generator's state returns to that initial one. The length of the sequence of non-repeating numbers is the period of the PRNG. A more technical definition from Wikipedia states: "The period of a PRNG is defined as the maximum over all starting states of the length of the repetition-free prefix of the sequence. The period is bounded by the size of the seed state, measured in bits." If a PRNG's internal state contains $n$ bits, its period can be no longer than $2^n$ results, and may be much shorter. It is relatively easy to build PRNGs with periods long enough for many practical applications, but one must be cautious in applying PRNGs to problems that require very large quantities of random numbers.
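
To make the idea of a period concrete, here is a small sketch using a toy generator with only 8 bits of state (the update constants are arbitrary illustrative choices). It steps the generator until the state repeats and reports the cycle length, which can be at most $2^8 = 256$.

In [ ]:
#A toy generator with an 8-bit state: its period can be at most 2**8 = 256
def toy_period(seed, a=37, c=1, m=2**8):
    seen = {}      #maps each state to the step at which it first appeared
    x = seed
    step = 0
    while x not in seen:
        seen[x] = step
        x = (a * x + c) % m
        step += 1
    return step - seen[x]   #length of the repeating cycle

print(toy_period(seed=7))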

Almost all languages and simulation packages have good built-in generators. We won't dwell on the details of how they work, but let us assume that we have an algorithm (program) that generates independent samples of uniform[0,1] random variables. In Python, we can use the NumPy random library, which is based on the Mersenne-Twister algorithm developed in 1997.


In [ ]:
#Review the documentation for NumPy's random module:
np.random?
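
If you are curious which algorithm the legacy np.random interface is using, the first element of its internal state tuple names the underlying generator; it should report 'MT19937', the Mersenne Twister.

In [ ]:
#The first element of the legacy generator's state names the algorithm in use
print(np.random.get_state()[0])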

When you want to be able to reproduce a particular set of random numbers, you can set the "seed" value of the generator to a particular value. Every time you set the seed to this value, it returns the generator to the same initial state and will always give you the same sequence of numbers. The default seed value is None, which NumPy interprets to mean that it should initialize itself from a source of randomness provided by the operating system (or, failing that, from the current clock time). Let's see what setting and resetting the seed does:


In [ ]:
#print 5 uniformly distributed numbers between 0 and 1
print(np.random.random(5))
#print another 5 - should be different
print(np.random.random(5))

In [ ]:
#now set the seed to something:
np.random.seed(4242)

#print 5 random numbers from the generator with this seed
print "Using seed = 4242:"
print np.random.random(5)

In [ ]:
#Restore the seed to its original `None`
np.random.seed(None)

#print 5 more numbers using the new random seed
print "Using seed = None:"
print np.random.random(5)

In [ ]:
#reset the seed to 4242 and print again
np.random.seed(4242)
print "Using seed = 4242, we should get the same numbers as before"
print np.random.random(5)
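
As an aside, recent versions of NumPy (1.17 and later) also provide a Generator-based interface, np.random.default_rng, which takes the seed directly and makes it easy to maintain several independent streams (property 3 in the list above). A minimal sketch:

In [ ]:
#Two generators with the same seed produce identical streams;
#a different seed gives an independent stream
rng1 = np.random.default_rng(4242)
rng2 = np.random.default_rng(4242)
rng3 = np.random.default_rng(1234)
print(rng1.random(5))
print(rng2.random(5))   #same as rng1
print(rng3.random(5))   #a different, independent stream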

If we plot the normalized distribution of uniformly sampled random numbers in the range U[0,1], we should get a flat line at 1.0, subject to random fluctuations. The more samples we take, the smaller the fluctuations about the average value.


In [ ]:
#reset the seed
np.random.seed(None)
z = np.random.random(100000)
plt.hist(z, 100, density=True);
plt.plot([0,1.],[1.,1.],'r-',lw=3)
plt.xlabel("x",fontsize=20)
plt.ylabel("sample probability",fontsize=20)
plt.show()

This shows us that every value of $x$ is (roughly) as likely to be chosen as any other value.

Notice here we have used the plotting function "hist". This counts the frequency of values within bins of finite size and plots the result. It also returns the array of bin contents (along with the bin edges and the drawn patches), in case you want to use them for further calculations. The number of bins to use is the second argument of the function. Passing density=True normalizes the histogram by dividing each bin's count by the total number of samples times the bin width, so that the full histogram integrates to 1. To learn more about hist, try the help.


In [ ]:
help(plt.hist)
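
For example, you can capture what hist returns and check that the normalized histogram integrates to (approximately) 1. The variable names below are just illustrative.

In [ ]:
#hist returns the bin contents, the bin edges, and the drawn patches
counts, edges, patches = plt.hist(z, 100, density=True)
widths = np.diff(edges)            #widths of the individual bins
print(np.sum(counts * widths))     #the normalized histogram should integrate to ~1.0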

In a subsequent lesson we will explore how to create random numbers that follow the shape of a particular function by using the probability density to weight the sampling toward or away from a particular region. For example, we might wish to sample from a decaying exponential function, where the probability of obtaining a particular value should be very high at low $x$ and very low at high $x$. For now, let's learn how we can test the randomness of our PRNG.

How random is "random"?

Before we look at useful applications of random numbers, let's pause to consider them a bit more. How do we know that the "random" numbers we are getting are random enough? We already know that they can't ever be considered truly random, but surely there is some way to say with confidence that the numbers we obtain are good enough. To do that, we can apply a series of statistical tests on a large sample of random numbers to see if they are statistically sound.

A good overview of the problem from which this discussion is drawn is provided in Chapter 10 of the book "Beautiful Testing", written by John D. Cook. He makes the chapter available online for free through his webpage.

The most straightforward statistical tests we can apply are

  • Mean test
  • Variance test
  • Chi-square or "Bucket" test
  • Kolmogorov-Smirnov Test

Mean test

The mean test is as you would expect. Generate a large sample of random numbers and compute their mean.

$$m = \frac{1}{n}\sum_{i=1}^n x_i$$

Then compare to the expected mean value for the distribution you are sampling from. For example, for uniform numbers between 0 and 1, you'd expect a mean of 0.5.
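
Here is a minimal sketch of the mean test for our uniform generator. Expressing the deviation in units of the standard error of the mean ($\sigma/\sqrt{n}$, with $\sigma^2 = 1/12$ for U[0,1]) is just an illustrative sanity check, not a formal hypothesis test.

In [ ]:
#Mean test: the sample mean of n U[0,1] draws should be close to 0.5
n = 100000
x = np.random.random(n)
m = np.mean(x)
print(m)
#express the deviation from 0.5 in units of the standard error of the mean
print(abs(m - 0.5) / (np.sqrt(1.0/12.0) / np.sqrt(n)))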


Variance test

For the variance test, compute the variance for a large sample of data with

$$\sigma^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - m)^2$$

$\sigma$ is the standard deviation of the sample. Compare $\sigma^2$ to the theoretical variance of the distribution you are sampling from; for uniform numbers between 0 and 1 the expected variance is 1/12. When the underlying distribution is normal, roughly 68% of all values should fall within one standard deviation of the mean, 95% within two standard deviations, and so on, so you can also check a distribution of samples by computing the percentage of values that fall within the appropriate ranges.
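
A corresponding sketch of the variance test, comparing the sample variance to the theoretical variance of U[0,1], which is 1/12:

In [ ]:
#Variance test: the sample variance of U[0,1] draws should be close to 1/12
n = 100000
x = np.random.random(n)
print(np.var(x, ddof=1))   #ddof=1 gives the 1/(n-1) normalization used above
print(1.0/12.0)            #theoretical variance of U[0,1]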


Chi-square or "Bucket" test

There are some distributions for which the mean and variance tests don't work because they don't have well-defined measures to compare against. The chi-square or "bucket" test is an alternative. Divide the distribution of samples up into $k$ "buckets" or bins, in which the expected number of samples can be determined from the theoretical distribution. For example, if you are sampling numbers so that they follow an exponential distribution, you know that there should be more values close to 0 than at high $x$. The chi-square statistic can be computed with

$$\chi^2 = \sum_{i=1}^k\frac{(O_i-E_i)^2}{E_i}$$

where $O_i$ is the observed number of occurrences in bin $i$ and $E_i$ is the expected number, obtained by integrating the theoretical probability density over the bin and multiplying by the sample size. The rule of thumb for choosing $k$ is that each bucket should have an expected count of at least 5 samples.

If $\chi^2$ is too large, the numbers we sampled do not follow the distribution we desire and should not be used. If $\chi^2$ is too low, our numbers are not random enough - they are too predictable.

How do we define the acceptable range of $\chi^2$ values? If we use a large number of buckets $k$ and generate a large number of samples $n$, a large number of times $N$, then $\chi^2$ itself follows a chi-square distribution with $k-1$ degrees of freedom, which for large $k$ is approximately normal with mean $k-1$ and variance $2k-2$. We can then use the mean and variance tests on $\chi^2$ to determine the reliability of the randomness of our generated samples.
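
As an illustration, here is a minimal sketch of the bucket test applied to uniform samples, where every bucket has the same expected count $n/k$:

In [ ]:
#Chi-square ("bucket") test for U[0,1] samples
n, k = 100000, 100
x = np.random.random(n)
observed, _ = np.histogram(x, bins=k, range=(0.0, 1.0))  #counts in each bucket
expected = float(n) / k      #uniform samples: same expectation in every bucket
chi2 = np.sum((observed - expected)**2 / expected)
print(chi2)                  #should be roughly k-1 = 99, with variance 2k-2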

Kolmogorov-Smirnov test

Finally, the Kolmogorov-Smirnov test allows us to look at the distribution of samples in finer detail than the chi-square test. For a large number of samples $n$, compute the empirical distribution function (the proportion of samples at or below each point) and compare it to the theoretical cumulative distribution function, which is the running integral of the probability density function. Let $F_n(x)$ represent the empirical distribution function of the samples,

$$F_n(x) = \frac{\textrm{number of } x_i \textrm{ values} \leq x}{n}$$

and $F(x)$ is the theoretical cumulative distribution function. To compare the two numbers, calculate

$$K^+ = \sqrt{n} \underset{x}\max{(F_n(x) - F(x))} \\ K^- = \sqrt{n} \underset{x}\max{(F(x) - F_n(x))}$$

which returns quantities (modulo the $\sqrt{n}$) that represent the largest deviation between $F_n(x)$ and $F(x)$. If our samples perfectly matched the expected distribution, $K^+$ and $K^-$ would both be zero. But we know that in practice this can never happen. If they are too close to zero, that also raises suspicions. So we need some way of defining the range of reasonable values. The central limit theorem doesn't work in this case but ranges can be computed. For our purposes, we will use the definition that for large $n$, we expect $K^+$ to be between 0.07089 and 1.5174 around 98% of the time.
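
Here is a minimal sketch of computing $K^+$ and $K^-$ for uniform samples, where the theoretical CDF is simply $F(x) = x$ on [0,1]:

In [ ]:
#Kolmogorov-Smirnov statistics for U[0,1] samples, where F(x) = x
n = 100000
x = np.sort(np.random.random(n))
i = np.arange(1, n + 1)
#the empirical CDF jumps from (i-1)/n to i/n at the i-th sorted sample
K_plus = np.sqrt(n) * np.max(i / float(n) - x)
K_minus = np.sqrt(n) * np.max(x - (i - 1) / float(n))
print(K_plus)    #expected to fall between 0.07089 and 1.5174 about 98% of the time
print(K_minus)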

If you are curious about where those numbers come from, consult for example "The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, Third Edition" by Donald E. Knuth, as suggested by John D. Cook in the "Beautiful Testing" book.


All content is under a modified MIT License, and can be freely used and adapted. See the full license text here.