Discrete Random Variables and Sampling

George Tzanetakis, University of Victoria

In this notebook we will explore discrete random variables and sampling. After defining a helper class and its associated functions, we will be able to create both symbolic and numeric random variables and generate samples from them.

We first define a helper random variable class, built on the scipy discrete random variable functionality, that supports both numeric and symbolic random variables (RVs).


In [1]:
%matplotlib inline 
import matplotlib.pyplot as plt
from scipy import stats
import numpy as np

In [2]:
class Random_Variable: 
    
    def __init__(self, name, values, probability_distribution): 
        self.name = name 
        self.values = values 
        self.probability_distribution = probability_distribution 
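        # accept Python ints and any NumPy integer type (np.arange yields int32 on some platforms)
        if all(isinstance(item, (int, np.integer)) for item in values): 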
            self.type = 'numeric'
            self.rv = stats.rv_discrete(name = name, values = (values, probability_distribution))
        elif all(type(item) is str for item in values): 
            self.type = 'symbolic'
            self.rv = stats.rv_discrete(name = name, values = (np.arange(len(values)), probability_distribution))
            self.symbolic_values = values 
        else: 
            self.type = 'undefined'
            
    def sample(self,size): 
        if (self.type =='numeric'): 
            return self.rv.rvs(size=size)
        elif (self.type == 'symbolic'): 
            numeric_samples = self.rv.rvs(size=size)
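            # map sampled indices back to this variable's own symbols,
            # not to whatever the global name `values` currently holds
            mapped_samples = [self.symbolic_values[x] for x in numeric_samples]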
            return mapped_samples
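
The symbolic case relies on a small trick: scipy's rv_discrete only works with integer values, so the class samples integer indices and then maps them back to symbols. The following is a minimal standalone sketch of that idea, not part of the helper class; the names symbols, index_rv and mapped, and the seed passed as random_state, are illustrative choices made only to keep the run repeatable.

```python
import numpy as np
from scipy import stats

symbols = ['H', 'T']     # symbolic outcomes
probs = [0.9, 0.1]       # their probabilities (must sum to 1)

# rv_discrete needs integer values, so sample the indices 0..len(symbols)-1
index_rv = stats.rv_discrete(name='coin_index',
                             values=(np.arange(len(symbols)), probs))
indices = index_rv.rvs(size=20, random_state=0)

# translate every sampled index back to its symbol
mapped = [symbols[i] for i in indices]
print(mapped)
```

Because the mapping goes through the list stored alongside the distribution, each random variable keeps its own symbols regardless of what other variables are defined later.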

Let's first create some random samples of symbolic random variables corresponding to a coin and a die.


In [5]:
values = ['H', 'T']
probabilities = [0.9, 0.1]
coin = Random_Variable('coin', values, probabilities)
samples = coin.sample(20)
print(samples)


['H', 'H', 'H', 'H', 'H', 'T', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'T', 'H', 'H']

In [4]:
values = ['1', '2', '3', '4', '5', '6']
probabilities = [1/6.] * 6
dice = Random_Variable('dice', values, probabilities)
samples = dice.sample(10)
print(samples)
# list repetition, as used above to build the uniform probabilities
[1 / 6.] * 3


['6', '3', '3', '3', '6', '6', '4', '3', '3', '6']
Out[4]:
[0.16666666666666666, 0.16666666666666666, 0.16666666666666666]

Now let's look at a numeric random variable corresponding to a die, so that we can more easily make plots and histograms.


In [6]:
values = np.arange(1,7)
probabilities = [1/6.] * 6
dice = Random_Variable('dice', values, probabilities)
samples = dice.sample(100)
plt.stem(samples, markerfmt= ' ')


Out[6]:
<Container object of 3 artists>

Let's now look at a histogram of these generated samples. Notice that even with 100 samples the bars are not of equal height, so the observed frequencies only approximate the probabilities used to generate them.


In [7]:
plt.figure()
# density=True replaces the normed argument, which was removed from recent matplotlib releases
plt.hist(samples,bins=[1,2,3,4,5,6,7],density=True, rwidth=0.5,align='left');
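
To see how the approximation improves with more samples, the sketch below compares the relative frequencies of a fair die with the true value 1/6 for a few sample sizes. It uses NumPy's random generator directly rather than the helper class, and the seed and the sample sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen only for repeatability

max_errors = {}
for n in (100, 1000, 100000):
    rolls = rng.integers(1, 7, size=n)              # fair die rolls in 1..6
    counts = np.bincount(rolls, minlength=7)[1:]    # counts for faces 1..6
    freqs = counts / n                              # relative frequencies
    max_errors[n] = float(np.max(np.abs(freqs - 1 / 6)))
    print(n, np.round(freqs, 3), 'max error =', round(max_errors[n], 4))
```

The largest deviation from 1/6 typically shrinks as n grows, although for any single run it is not guaranteed to decrease at every step.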


Let's plot the cumulative histogram of the samples


In [7]:
# density=True replaces the normed argument, which was removed from recent matplotlib releases
plt.hist(samples,bins=[1,2,3,4,5,6,7],density=True, rwidth=0.5,align='left', cumulative=True);
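
The cumulative histogram is an estimate of the cumulative distribution function (CDF). As a rough numerical counterpart to the plot, the sketch below computes the empirical CDF of simulated rolls and prints it next to the theoretical CDF of a fair die, k/6. The rolls are generated with NumPy directly, and the seed is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(1)            # fixed seed, chosen only for repeatability
rolls = rng.integers(1, 7, size=100)      # fair die rolls in 1..6
faces = np.arange(1, 7)

# empirical CDF: fraction of rolls less than or equal to each face
empirical_cdf = np.array([np.mean(rolls <= k) for k in faces])
# theoretical CDF of a fair die: P(X <= k) = k / 6
theoretical_cdf = faces / 6

for k, e, t in zip(faces, empirical_cdf, theoretical_cdf):
    print(k, round(float(e), 3), round(float(t), 3))
```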


Let's now estimate the probability of the event "roll an even number" in two different ways. First, let's count the even numbers in the generated samples directly. Then, let's sum the individually estimated probabilities of rolling a 2, a 4, and a 6.


In [10]:
# estimate each probability as a relative frequency: matching samples / total samples
est_even = len([x for x in samples if x%2==0]) / len(samples)
est_2 = len([x for x in samples if x==2]) / len(samples)
est_4 = len([x for x in samples if x==4]) / len(samples)
est_6 = len([x for x in samples if x==6]) / len(samples)
print(est_even)
# Let's print some estimates 
print('Estimates of 2,4,6 = ', (est_2, est_4, est_6))
print('Direct estimate = ', est_even) 
print('Sum of estimates = ', est_2 + est_4 + est_6)
print('Theoretical value = ', 0.5)


0.495
Estimates of 2,4,6 =  (0.156, 0.169, 0.17)
Direct estimate =  0.495
Sum of estimates =  0.495
Theoretical value =  0.5
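
The direct estimate and the sum of estimates agree because the events "roll a 2", "roll a 4", and "roll a 6" are disjoint and together make up "roll an even number", so their counts add up exactly. The sketch below reproduces the comparison with NumPy boolean masks; the rolls are simulated independently of the cells above, and the seed and sample size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)             # fixed seed, chosen only for repeatability
rolls = rng.integers(1, 7, size=1000)      # fair die rolls in 1..6

# the mean of a boolean mask is the fraction of samples where it is True
est_even = np.mean(rolls % 2 == 0)
est_by_face = {k: np.mean(rolls == k) for k in (2, 4, 6)}
sum_of_estimates = sum(est_by_face.values())

print('Direct estimate  =', est_even)
print('Sum of estimates =', sum_of_estimates)
```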

Notice that we can always estimate the probability of an event by simply counting how many times it occurs in the samples of an experiment. However, if we are interested in multiple events, it can be easier to calculate the probabilities of the values of individual random variables and then use the rules of probability to estimate the probabilities of more complex events.
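
As a small illustration of using the rules of probability, the sketch below starts from the probabilities of the individual faces of a fair die and derives the probabilities of compound events. The helper prob and the event "roll above 4" are hypothetical, introduced only for this example; exact fractions are used so that the results are not affected by rounding.

```python
from fractions import Fraction

# probability of each individual face of a fair die
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def prob(event):
    """Sum the probabilities of all faces for which the predicate holds."""
    return sum(p for face, p in pmf.items() if event(face))

def is_even(x):
    return x % 2 == 0

def above_4(x):
    return x > 4

p_even = prob(is_even)        # P(2) + P(4) + P(6), since the faces are disjoint
p_above = prob(above_4)       # P(5) + P(6)
p_both = prob(lambda x: is_even(x) and above_4(x))   # only the face 6

# addition rule for events that overlap: P(A or B) = P(A) + P(B) - P(A and B)
p_either = p_even + p_above - p_both

print(p_even)     # 1/2
print(p_either)   # 2/3
```

Subtracting p_both matters here because the face 6 belongs to both events and would otherwise be counted twice.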