Title: Modeling Randomness Date: 2013-10-05 17:05 Author: cfarmer Email: carson.farmer@gmail.com Category: Statistical Modeling for Python Tags: Helpful tips, Python, Statistical Modeling, Teaching Slug: statistical-modeling-python-modeling-randomness Latex: yes Status: draft

Modeling Randomness

One of the most surprising outcomes of the revolution in computing technology has been the discovery of diverse uses for randomness in the analysis of data and in science generally. Most young people have little trouble with the idea of a computer generating random data; they see it in computer games and simulations. Older people, raised with the idea that computers do mathematical operations and that such operations are completely deterministic, sometimes find computer-generated random numbers suspect. Indeed, conventional algebraic notation ($+$, $-$, $\sqrt{}$, $\cos{}$, and so on) has no symbol for "generated at random".

One of the simplest operators for generating random events is numpy.random.choice. It takes several arguments: the first is the set of items (a) to choose from, and the second is the number of events (size) to generate. Each item is equally likely to be chosen. For example, here is a simulation of a coin flip:


In [4]:
from numpy.random import choice
coin = ["H", "T"]
choice(coin, 5)


Out[4]:
array(['H', 'T', 'T', 'H', 'H'], 
      dtype='|S1')

In [5]:
choice(coin, 5)


Out[5]:
array(['H', 'H', 'T', 'H', 'H'], 
      dtype='|S1')

The first command imports the function choice from the numpy.random module, and the second command creates an object (a list) called coin holding the possible outcomes of each event. The next command generates five events, each event being a random choice of the outcomes in coin.

Another example is rolling dice. First, construct a set of the possible outcomes:


In [8]:
die = range(1,7)
die


Out[8]:
[1, 2, 3, 4, 5, 6]

Then generate random events. Here is a roll of two dice:


In [9]:
choice(die, 2)


Out[9]:
array([1, 2])

The choice function is also useful for selecting cases at random from a data frame. This use will be the basis for statistical methods introduced in later chapters.
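To give a feel for this, here is a minimal sketch using a small, made-up data frame (the column names and values are hypothetical, not from the original text):

import pandas as pd
from numpy.random import choice

# A small, made-up data frame of cases
df = pd.DataFrame({"height": [72, 64, 68, 70, 66],
                   "sex": ["M", "F", "M", "M", "F"]})

# Draw 3 row labels at random without replacement, then select those cases
rows = choice(df.index, 3, replace=False)
df.loc[rows]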

Random Draws for Probability Models

Although choice is useful for random sampling, it can work only with finite sets of possible outcomes, such as H/T or 1/2/3/4/5/6 or the cases in a data frame. By default in choice, the underlying probability model is equiprobability - each possible outcome is equally likely. You can specify another probability model by using the p= argument to choice, which controls probabilities associated with each entry in a. For instance, to flip coins that are very likely to come up heads:


In [13]:
choice(coin, 10, p=[0.9, 0.1])


Out[13]:
array(['T', 'H', 'H', 'H', 'H', 'H', 'T', 'T', 'H', 'H'], 
      dtype='|S1')

Python and Scipy provide other operators that allow draws to be made from outcome sets that are infinite.

For example, norm from the scipy.stats module returns a normal continuous random variable from which random draws from a normal probability distribution can be made (using norm.rvs()). The arguments to norm.rvs let you specify the mean (loc=) and standard deviation (scale=) of the particular normal distribution that you want, and the size argument specifies how many draws to make. To illustrate, here is a set of 15 random numbers from a normal distribution with a mean of 1000 and a standard deviation of 75:


In [20]:
from scipy.stats import norm
samps = norm.rvs(loc=1000, scale=75, size=15)
samps


Out[20]:
array([ 1019.79796434,   993.84890674,   880.38305582,   913.01814583,
         941.81574239,  1082.1391777 ,  1028.0528797 ,   938.16469329,
        1086.1622266 ,   961.97899718,  1013.02887874,   987.75245943,
         904.10582853,  1136.39536813,   952.81042028])

In this example, the output was assigned to an object samps to facilitate some additional computations on the values. For instance, here are the mean and standard deviation of the sample:


In [22]:
samps.mean()


Out[22]:
989.29698298120923

In [23]:
samps.std()


Out[23]:
70.49533293053193

Don't be surprised that the mean and standard deviation of the sample don't exactly match the parameters that were set with the arguments loc=1000, scale=75. The sample was drawn at random, and so the sample statistics will vary from one sample to the next. Part of the statistical methodology to be studied in later tutorials has to do with determining how close the statistics calculated from a sample are likely to be to the parameters of the underlying population.
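To see this sampling variation directly, you can draw several samples of the same size and compare their means; each printed value will differ a little from run to run (a quick sketch, reusing norm from above):

from scipy.stats import norm

# Five separate samples of size 15; each has a somewhat different mean
for i in range(5):
    print(norm.rvs(loc=1000, scale=75, size=15).mean())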

Often you will generate very large samples. In these situations you usually don't want to display all of the samples, just do calculations with them. The practical limits of "large" depend on the computer you are using and how much time you are willing to wait for the computer to perform the calculations. For a function like norm.rvs and the others to be introduced in this tutorial, it's feasible to generate samples of size 10,000 or 100,000 on an ordinary laptop computer.


In [24]:
samps = norm.rvs(1000, 75, size=100000)
samps.mean()


Out[24]:
1000.0807598989297

In [25]:
samps.std()


Out[25]:
74.986186438984689

Notice that the sample mean and standard deviation are quite close to the population parameters in this large sample (remember not to put commas in large numbers in code: it's 100000, not 100,000).

The simulations that you will do in later tutorials will be much more elaborate than the simple draws here. Even with today's computers, you will want to use only a few hundred trials.

Standard Probability Models

Python and Scipy provide a set of distributions like norm for the different probability models in the scipy.stats module. All of these probability models work in a similar way:

  • Each has a set of parameters that specify the particular probability distribution you want
  • Each has an rvs function that allows random draws to be made

All probability models can be found in the scipy.stats module, and each one has a different set of parameters that need to be specified.
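As a quick illustration of this common pattern, here is a short sketch; note that scipy.stats also lets you "freeze" a distribution with its parameters fixed and reuse it:

from scipy.stats import norm, binom

# The same pattern for every model: fix the parameters, then call rvs
norm.rvs(loc=1000, scale=75, size=5)   # 5 draws from Normal(1000, 75)
binom.rvs(3, 0.63, size=5)             # 5 draws from Binomial(n=3, p=0.63)

# Equivalently, "freeze" the distribution and call rvs on the frozen object
hiring = binom(3, 0.63)
hiring.rvs(size=5)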

A useful reference for comparison between Matlab, R, and Python probability models can be [found here][reference]. It is also useful for determining which parameters are required for generating random variates etc.

Family | Name | Parameters | Type
--- | --- | --- | ---
Normal | norm | loc (mean), scale (std dev) | continuous
Uniform | uniform | loc (min), scale (width, i.e., max minus min) | continuous
Binomial | binom | n (size), p (probability) | discrete
Poisson | poisson | mu (mean rate [lambda]) | discrete
Exponential | expon | scale (1.0/lambda) | continuous
Lognormal | lognorm | s (std dev of log values), scale (exp of mean of log values) | continuous
$\chi^2$ (Chi-squared) | chi2 | df (deg. of freedom) | continuous
Student's t | t | df (deg. of freedom) | continuous
Snedecor's F | f | dfn (df numerator), dfd (df denominator) | continuous
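One subtlety worth verifying: for uniform, the scale parameter is the width of the interval, not its maximum. A quick sketch to check this:

from scipy.stats import uniform

# loc is the minimum and scale is the width, so values fall in [2, 7)
samps = uniform.rvs(loc=2, scale=5, size=10000)
samps.min(), samps.max()   # both should lie inside [2, 7)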

To use the random number generators effectively, you must first choose a particular probability model based on the setting that applies in your situation. This setting will usually indicate what the population parameters should be. Some examples:


In [31]:
# Load several useful probability models
from scipy.stats import uniform, norm, binom, poisson, expon, lognorm, chi2, t, f

1) You are in charge of a hiring committee that is going to interview three candidates selected from a population of job applicants that is 63\% female. How many of the interviewees will be female? Modeling this as a random selection from the applicant pool, a binomial model is appropriate. The size of each trial is 3 (i.e., n=3), and the probability of being female is 63\% (i.e., p=0.63):


In [44]:
from scipy.stats import binom
samps = binom.rvs(3, 0.63, size=40)
samps


Out[44]:
array([1, 3, 2, 0, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 1, 3, 0, 1, 2, 1, 3, 1, 2,
       2, 1, 2, 2, 2, 2, 2, 1, 1, 0, 3, 3, 1, 0, 2, 2, 1])

There are 40 trials here, since the size argument was set to 40. Remember, each trial is a simulation of one hiring event. In the first simulated event, one of the interviewees was female; in the second, all three were. Typically, you will be summarizing all the simulations, for example, to see how likely each possible outcome is.


In [43]:
import pandas as pd
pd.Series(samps).value_counts()


Out[43]:
2    23
1    10
3     6
0     1
dtype: int64

2) You want to simulate the number of customers who come into a store over the course of an hour. The average rate is 15 per hour. To simulate a situation where customers arrive randomly, the Poisson model is appropriate:


In [46]:
from scipy.stats import poisson
poisson.rvs(15, size=25)


Out[46]:
array([12, 17,  9, 12, 21, 15, 16, 12,  8, 21, 15, 15, 12,  8, 13, 16, 21,
       16, 18, 13, 13, 19, 15, 19, 18])

3) You want to generate a simulation of the interval between earthquakes as in Example 11.5.4 from the book. To simulate the random intervals with a typical rate of 0.03 earthquakes per year, you would use:


In [50]:
from scipy.stats import expon
expon.rvs(scale=1.0/0.03, size=15)


Out[50]:
array([ 15.86444614,  16.85881696,  35.01553743,  24.7432013 ,
        39.02067078,  29.99499166,  31.59646041,   0.44506101,
        29.60265513,  22.45041553,   8.08170478,  49.81936582,
        27.54049465,  54.51531863,  23.3166474 ])

Notice the huge variation in the intervals!

Coverage Intervals

You will often need to compute intervals in order to describe the range of likely outcomes from a random process. Python and Scipy provide a ppf method, standing for percent point function (percentiles), on each named probability model for this purpose. In all cases, the first argument is the set of percentiles (q) you want to calculate for the particular probability model; the other named arguments are the parameters.

Remember that to find a 95% coverage interval you need the 0.025 and 0.975 quantiles. For a 99% interval, you need the 0.005 and 0.995 quantiles. To illustrate, here are the 95% coverage intervals for a few probability models:

  • A normal distribution with mean 0 and standard deviation 1:

In [55]:
norm.ppf([0.025, 0.975], 0, 1)


Out[55]:
array([-1.95996398,  1.95996398])
  • The hiring committee situation modelled by a binomial distribution with n=3 and p=0.63:

In [56]:
binom.ppf([0.025, 0.975], 3, 0.63)


Out[56]:
array([ 0.,  3.])

Perhaps you are surprised to see that the coverage interval includes all the possible outcomes. That's because the number of cases in each trial $(n=3)$ is quite small.
  • The number of customers entering a store during an hour as modelled by a Poisson distribution with an average rate of 15 customers per hour:

In [65]:
poisson.ppf([0.025, 0.975], 15)


Out[65]:
array([  8.,  23.])
  • The interval between earthquakes modelled by an exponential distribution with a typical rate of 0.03 earthquakes per year:

In [63]:
expon.ppf([0.025, 0.975], scale=1.0/0.03)


Out[63]:
array([   0.84392693,  122.96264847])

You can also use the ppf methods to find the value that would be at a particular percentile. For example, the exponential model with a rate of 0.03 gives the 25th percentile of the interval between earthquakes as:


In [66]:
expon.ppf(0.25, scale=1.0/0.03)


Out[66]:
9.5894024150593644

A quarter of the time, the interval between earthquakes will be 9.59 years or less.

It's entirely feasible to calculate percentiles and coverage intervals by combining the random number generators with the methods to compute quantiles presented in previous tutorials. For example, here is the 95\% coverage interval from a normal distribution with mean 0 and standard deviation 1:


In [67]:
samps = norm.rvs(loc=0, scale=1, size=10000)
[pd.Series(samps).quantile(q) for q in [.025, .975]]


Out[67]:
[-1.9637144782958194, 1.9671117119775101]

The disadvantage of this approach is that it is a simulation and the results will vary randomly. By making the sample size large enough - here it is $n = 10000$ - you can reduce the random variation. The ppf methods, in contrast, use mathematical analysis to give you what is effectively an infinite sample size, so it's advisable to use them when you can. However, for many of the techniques to be introduced in later tutorials you will have to generate a random sample and then apply Series.quantile or similar to approximate the coverage intervals.
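To see the trade-off concretely, here is a small sketch comparing simulated 95% coverage intervals at two sample sizes with the exact ppf answer:

import pandas as pd
from scipy.stats import norm

# Simulated intervals wobble from run to run, but less so for larger n
for n in [100, 10000]:
    samps = pd.Series(norm.rvs(loc=0, scale=1, size=n))
    print([samps.quantile(q) for q in [0.025, 0.975]])

# The ppf answer is exact: no simulation variability at all
print(norm.ppf([0.025, 0.975], loc=0, scale=1))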

Cumulative Probability

A cumulative probability computation applies to situations where you have a measured value and you want to know where that value ranks relative to the entire set of possible outcomes. You have already seen cumulative probabilities computed from samples; they also apply to probability models. The word "cumulative" is used because the probability reflects the chance of seeing a particular value or smaller. Python and Scipy provide a cumulative distribution function (cdf) to compute cumulative probabilities on each named probability model.

It's easy to confuse cumulative probabilities with percentiles because they are so closely related. Mathematically, the cumulative probability functions are the inverse of the percentile operators. To help you remember which is which, it's helpful to distinguish them based on the type of argument that you give to the function:

Cumulative Probability: The input argument is a measured value, something that could be the output of a single draw from the probability distribution. The output is always a number between 0 and 1 - a probability.

Percentile: The input is a cumulative probability, a number between 0 and 1. The output is on the scale of the measured variable.
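Since the two are inverses of one another, composing them takes you full circle; a quick sketch with a standard normal:

from scipy.stats import norm

# cdf maps a measured value to a probability; ppf maps it back
p = norm.cdf(1.5, loc=0, scale=1)   # about 0.933, a probability
norm.ppf(p, loc=0, scale=1)         # recovers 1.5, on the measurement scale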

Example: You have just gotten your score, 670, on a professional school admissions test. According to the information published by the testing company, the scores are normally distributed with a mean of 600 and a standard deviation of 100. So, your ranking on the test, as indicated by a cumulative probability, is:


In [68]:
norm.cdf(670, loc=600, scale=100)


Out[68]:
0.75803634777692697

Your score is at about the 75th percentile.

Example: Unfortunately, the professional school that you want to go to accepts only students with scores in the top 15 percent. Your score, at the 75.8th percentile, isn't good enough. So, you study some more and take practice tests until your score is good enough. How well will you need to score to reach the 85th percentile?


In [69]:
norm.ppf(0.85, loc=600, scale=100)


Out[69]:
703.64333894937897

Next time on Statistical Modeling: A Fresh Approach for Python...

  • Confidence in Models

Reference

As with all 'Statistical Modeling: A Fresh Approach for Python' tutorials, this tutorial is based directly on material from 'Statistical Modeling: A Fresh Approach (2nd Edition)' by Daniel Kaplan.

I have made an effort to keep the text and explanations consistent between the original (R-based) version and the Python tutorials, in order to keep things comparable. With that in mind, any errors, omissions, and/or differences between the two versions are mine, and any questions, comments, and/or concerns should be directed to me.