In this notebook we will explore discrete random variables and sampling. After defining a helper class and associated functions we will be able to create both symbolic and numeric random variables and generate samples from them.
Define a helper random variable class based on the scipy discrete random variable functionality providing both numeric and symbolic RVs
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
from scipy import stats
import numpy as np
In [2]:
class Random_Variable:
def __init__(self, name, values, probability_distribution):
self.name = name
self.values = values
self.probability_distribution = probability_distribution
if all(type(item) is np.int64 for item in values):
self.type = 'numeric'
self.rv = stats.rv_discrete(name = name, values = (values, probability_distribution))
elif all(type(item) is str for item in values):
self.type = 'symbolic'
self.rv = stats.rv_discrete(name = name, values = (np.arange(len(values)), probability_distribution))
self.symbolic_values = values
else:
self.type = 'undefined'
def sample(self,size):
if (self.type =='numeric'):
return self.rv.rvs(size=size)
elif (self.type == 'symbolic'):
numeric_samples = self.rv.rvs(size=size)
mapped_samples = [values[x] for x in numeric_samples]
return mapped_samples
Let's first create some random samples of symbolic random variables corresponding to a coin and a dice
In [5]:
values = ['H', 'T']
probabilities = [0.9, 0.1]
coin = Random_Variable('coin', values, probabilities)
samples = coin.sample(20)
print(samples)
In [4]:
values = ['1', '2', '3', '4', '5', '6']
probabilities = [1/6.] * 6
dice = Random_Variable('dice', values, probabilities)
samples = dice.sample(10)
print(samples);
[100] * 10
[1 / 6.] * 3
Out[4]:
Now let's look at a numeric random variable corresponding to a dice so that we can more easily make plots and histograms
In [6]:
values = np.arange(1,7)
probabilities = [1/6.] * 6
dice = Random_Variable('dice', values, probabilities)
samples = dice.sample(100)
plt.stem(samples, markerfmt= ' ')
Out[6]:
Let's now look at a histogram of these generated samples. Notice that even with 500 samples the bars are not equal length so the calculated frequencies are only approximating the probabilities used to generate them
In [7]:
plt.figure()
plt.hist(samples,bins=[1,2,3,4,5,6,7],normed=1, rwidth=0.5,align='left');
Let's plot the cumulative histogram of the samples
In [7]:
plt.hist(samples,bins=[1,2,3,4,5,6,7],normed=1, rwidth=0.5,align='left', cumulative=True);
Let's now estimate the frequency of the event roll even number in different ways. First let's count the number of even numbers in the generated samples. Then let's take the sum of the counts of the individual estimated probabilities.
In [10]:
# we can also write the predicates directly using lambda notation
est_even = len([x for x in samples if x%2==0]) / len(samples)
est_2 = len([x for x in samples if x==2]) / len(samples)
est_4 = len([x for x in samples if x==4]) / len(samples)
est_6 = len([x for x in samples if x==6]) / len(samples)
print(est_even)
# Let's print some estimates
print('Estimates of 2,4,6 = ', (est_2, est_4, est_6))
print('Direct estimate = ', est_even)
print('Sum of estimates = ', est_2 + est_4 + est_6)
print('Theoretical value = ', 0.5)
Notice that we can always estimate the probability of an event by simply counting how many times it occurs in the samples of an experiment. However if we have multiple events we are interested in then it can be easier to calculate the probabilities of the values of invdividual random variables and then use the rules of probability to estimate the probabilities of more complex events.