Random variables

When the objective is to predict a category (a qualitative outcome, such as political party affiliation), the quantity being predicted is a qualitative random variable. If we are instead predicting a numeric value (such as the number of cars sold), it is a quantitative random variable.

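When the observations of a quantitative random variable can take only a countable set of values (such as the number of cars sold), it is called a discrete random variable. When they can take any value in a continuous interval (such as temperature), it is called a continuous random variable.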

Properties of a discrete random variable

Say y is the number of heads in two coin tosses and P(y) is the probability of observing that value. Then

the probability of each value of y lies between 0 and 1
the probabilities of all values of y sum to 1
the probabilities of distinct values of a discrete random variable are additive. Thus the probability that y = 1 or 2 is P(1) + P(2)
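These properties can be checked numerically for the two-toss example. The sketch below assumes a fair coin and simply enumerates the four equally likely outcomes; the names `outcomes`, `counts` and `p` are illustrative only.

```python
from itertools import product
from collections import Counter

# Enumerate the 4 equally likely outcomes of two tosses (coin assumed fair).
outcomes = list(product("HT", repeat=2))

# y = number of heads in an outcome; P(y) = (outcomes with y heads) / 4
counts = Counter(toss.count("H") for toss in outcomes)
p = {y: counts[y] / len(outcomes) for y in sorted(counts)}

print(p)                # {0: 0.25, 1: 0.5, 2: 0.25} -> each value lies in [0, 1]
print(sum(p.values()))  # 1.0 -> the probabilities sum to 1
print(p[1] + p[2])      # 0.75 -> P(y = 1 or y = 2) by additivity
```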

Binomial and Poisson discrete random variables

Binomial probability distribution

A binomial experiment is one in which each trial has one of two possible outcomes. Coin tosses, accept / reject, pass / fail, and infected / uninfected are the kinds of studies that involve a binomial experiment. An experiment is binomial in nature if

experiment has n identical trials
each trial results in 1 of 2 outcomes ( success and failure )
the probability of one of the outcomes, say success, remains the same for all trials
trials are independent of each other
the random variable y is the number of successes observed in n trials.
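The conditions above can be illustrated with a small simulation. This is only a sketch: the helper name `simulate_binomial`, the seed, and the values n = 100 and pi = 0.85 are assumptions chosen for illustration.

```python
import random

def simulate_binomial(n, pi, rng):
    """Run n independent, identical trials, each succeeding with
    probability pi, and return y, the number of successes."""
    return sum(1 for _ in range(n) if rng.random() < pi)

rng = random.Random(42)  # fixed seed so the run is repeatable
runs = [simulate_binomial(100, 0.85, rng) for _ in range(2000)]

# Each entry of runs is one observed value of the random variable y.
print(min(runs), max(runs))
print(sum(runs) / len(runs))  # average number of successes over all runs
```

Every run satisfies the listed conditions by construction: the trials are identical, each has two outcomes, the success probability is fixed, and the draws are independent.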

The probability of observing y successes in n trials of a binomial experiment is $$ P(y) = \frac{n!}{y!(n-y)!}\pi^y (1-\pi)^{n-y} $$

where

n = number of trials
$\pi$ = probability of success in a single trial
$1-\pi$ = probability of failure in a single trial
y = number of successes in n trials
$n!$ (n factorial) = $n(n-1)(n-2)\cdots 1$, with $0! = 1$

Mean and Standard Deviation of Binomial probability distribution

$$ \mu = n\pi $$
$$ \sigma = \sqrt{n\pi(1-\pi)} $$

where

$\mu$ is mean
$\sigma$ is standard deviation
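As a quick check of these two formulas, the sketch below evaluates them for n = 100 and $\pi$ = 0.85 and compares the mean with the definition $\sum y\,P(y)$. The helper name `binomial_mean_sd` is hypothetical, and `math.comb` assumes Python 3.8 or later.

```python
import math

def binomial_mean_sd(n, pi):
    """Return (mu, sigma) for a binomial distribution with n trials
    and success probability pi."""
    mu = n * pi
    sigma = math.sqrt(n * pi * (1 - pi))
    return mu, sigma

n, pi = 100, 0.85  # illustrative values
mu, sigma = binomial_mean_sd(n, pi)

# Cross-check the mean against its definition: the sum of y * P(y).
mean_from_pmf = sum(
    y * math.comb(n, y) * pi**y * (1 - pi)**(n - y) for y in range(n + 1)
)

print(mu, sigma, mean_from_pmf)
```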

We can build a simple Python function to calculate the binomial probability as shown below:



In [1]:

    
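import math

def bin_prob(n,y,pi):
    """Return the probability of exactly y successes in n trials,
    where pi is the probability of success in a single trial."""
    # number of ways to choose which y of the n trials succeed (Python 3.8+)
    a = math.comb(n, y)
    # probability of any one particular sequence with y successes
    b = math.pow(pi, y) * math.pow((1-pi), (n-y))
    p_y = a*b
    return p_y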

Binomial probability of germination

Let us consider a problem where 100 seeds are drawn at random. The germination rate is 85%. In other words, the probability that any one seed will germinate is 0.85, a figure derived from nursery experiments in which 85 out of 100 seeds germinated. Since bin_prob returns the probability of one exact count, we can calculate the probability

that exactly 80 seeds will germinate
that exactly 50 seeds will germinate
that exactly 10 seeds will germinate
that exactly 95 seeds will germinate

In [5]:

exactly_80 = bin_prob(100,80,0.85)
print("exactly 80: " + str(exactly_80))

exactly_50 = bin_prob(100,50,0.85)
print("exactly 50: " + str(exactly_50))

exactly_10 = bin_prob(100,10,0.85)
print("exactly 10: " + str(exactly_10))

exactly_95 = bin_prob(100, 95, 0.85)
print("exactly 95: " + str(exactly_95))

exactly 80: 0.04022449066141771
exactly 50: 1.9026685879668748e-16
exactly 10: 2.4027434608795305e-62
exactly 95: 0.0011271383580980794
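Note that the binomial formula gives the probability of exactly y successes. An "at most k" probability is the sum of those terms for y = 0 through k. A minimal sketch, assuming two hypothetical helpers `bin_prob_exact` and `bin_prob_at_most` (neither is part of the notebook above):

```python
import math

def bin_prob_exact(n, y, pi):
    """P(exactly y successes in n trials)."""
    return math.comb(n, y) * pi**y * (1 - pi)**(n - y)

def bin_prob_at_most(n, k, pi):
    """P(at most k successes) = P(0) + P(1) + ... + P(k)."""
    return sum(bin_prob_exact(n, y, pi) for y in range(k + 1))

# Cumulative probabilities for the same four thresholds
for k in (10, 50, 80, 95):
    print(k, bin_prob_at_most(100, k, 0.85))
```

The cumulative value for a threshold is never smaller than the single-point probability at that threshold, because it includes that term plus everything below it.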

We could calculate the probability for every value of the discrete random variable in a loop and plot the probabilities as shown below:



In [20]:

    
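x = []         # number of successes
y = []         # probability of exactly that many successes
cum_prob = []  # running total of the probabilities so far
# y = 0 is skipped: its probability, 0.15**100, is negligible
for i in range(1,101):
    x.append(i)
    p_y = bin_prob(100,i,0.85)
    y.append(p_y)

    # cum_prob[i-1] holds P(1) + P(2) + ... + P(i)
    if i==1:
        cum_prob.append(p_y)
    else:
        cum_prob.append(cum_prob[-1] + p_y)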



In [7]:

    
import matplotlib.pyplot as plt
%matplotlib inline

fig,ax = plt.subplots(1,2, figsize=(13,5))
ax[0].plot(x,y)
ax[0].set_title('Probability of y successes')
ax[0].set_xlabel('num of successes in 100 trials')
ax[0].set_ylabel('probability of successes')

ax[1].plot(x,cum_prob)
ax[1].set_title('Cumulative Probability of y successes')
ax[1].set_xlabel('num of successes in 100 trials')
ax[1].set_ylabel('cumulative probability of successes')









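[Figure: left panel, probability of y successes; right panel, cumulative probability of y successes; x-axis in both panels, number of successes in 100 trials.]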

As we can see in the graph above, the probability that y seeds will germinate peaks at y = 85, which equals the mean $\mu = n\pi = 100 \times 0.85$.



In [24]:

    
#find x corresponding to the max probability value
y.index(max(y)) + 1









    Out[24]:





85

The probability falls steeply on either side of 85. Using the cumulative probability, we can answer "more than" questions. Because cum_prob[k-1] holds the total up to and including k successes, subtracting it from the final total leaves only the terms after k. Find the probability that

more than 20 seeds will germinate = P(21) + P(22) + P(23) + ... + P(100)

In [30]:

more_than_20 = cum_prob[99] - cum_prob[19]
print("more than 20 = " + str(more_than_20))

more_than_85 = cum_prob[99] - cum_prob[84]
print("more than 85 = " + str(more_than_85))

more_than_95 = cum_prob[99] - cum_prob[94]
print("more than 95 = " + str(more_than_95))

more than 20 = 1.0
more than 85 = 0.45722420577595013
more than 95 = 0.00042551381703914704
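Whether the boundary value is included matters here. Differencing the cumulative list as cum_prob[99] - cum_prob[k-1] drops the y = k term, so it yields P(y > k), which is the same as "at least k + 1". The sketch below sums the terms directly so the boundary is explicit; `bin_prob_exact` and `bin_prob_at_least` are hypothetical helper names, not part of the notebook.

```python
import math

def bin_prob_exact(n, y, pi):
    """P(exactly y successes in n trials)."""
    return math.comb(n, y) * pi**y * (1 - pi)**(n - y)

def bin_prob_at_least(n, k, pi):
    """P(at least k successes) = P(k) + P(k+1) + ... + P(n)."""
    return sum(bin_prob_exact(n, y, pi) for y in range(k, n + 1))

p_ge_85 = bin_prob_at_least(100, 85, 0.85)  # includes the y = 85 term
p_ge_86 = bin_prob_at_least(100, 86, 0.85)  # strictly more than 85
print(p_ge_85, p_ge_86)
```

The second value should agree, up to floating-point rounding, with the 0.457 printed above for threshold 85; the first is larger by exactly P(85).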

We can repeat the calculation with a sample size of 20 seeds (n = 20) and plot the results.



In [31]:

    
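x = []         # number of successes
y = []         # probability of exactly that many successes
cum_prob = []  # running total of the probabilities so far
# y = 0 is skipped: its probability, 0.15**20, is negligible
for i in range(1,21):
    x.append(i)
    p_y = bin_prob(20,i,0.85)
    y.append(p_y)

    # cum_prob[i-1] holds P(1) + P(2) + ... + P(i)
    if i==1:
        cum_prob.append(p_y)
    else:
        cum_prob.append(cum_prob[-1] + p_y)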



In [32]:

    
#find x corresponding to the max probability value
y.index(max(y)) + 1









    Out[32]:





17



In [33]:

    
import matplotlib.pyplot as plt
%matplotlib inline

fig,ax = plt.subplots(1,2, figsize=(13,5))
ax[0].plot(x,y)
ax[0].set_title('Probability of y successes')
ax[0].set_xlabel('num of successes in 20 trials')
ax[0].set_ylabel('probability of successes')

ax[1].plot(x,cum_prob)
ax[1].set_title('Cumulative Probability of y successes')
ax[1].set_xlabel('num of successes in 20 trials')
ax[1].set_ylabel('cumulative probability of successes')









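[Figure: left panel, probability of y successes; right panel, cumulative probability of y successes; x-axis in both panels, number of successes in 20 trials.]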

Poisson probability distribution

The Poisson distribution is used for modeling the number of events of a particular type that occur over a period of time or in a region of space. An example is the number of vehicles passing through a security checkpoint in a 5-minute interval.

Conditions

The probability distribution of a discrete random variable y is Poisson, if:

Events occur one at a time. Two or more events do not occur at precisely the same time or place
Events are independent: the occurrence of an event in one period of time or region of space is independent of the occurrence of any event in a non-overlapping period or region
The expected number of events during one period or region $\mu$ is the same as the expected number of events in any other period or region

Thus the probability of observing y events in a unit of time or space is given by

$$ P(y) = \frac{\mu^{y}e^{-\mu}}{y!} $$

where

$\mu$ is the average value of y
e is Euler's number, the base of the natural logarithm, e ≈ 2.71828
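Two consequences of this formula are easy to verify numerically: the probabilities over all y sum to 1, and their mean is $\mu$. The sketch below assumes a hypothetical helper `poisson_pmf`, uses `math.exp` for $e^{-\mu}$, and picks $\mu$ = 2.3 purely for illustration. The infinite sum is truncated at y = 49, where the remaining tail is negligible for this $\mu$.

```python
import math

def poisson_pmf(y, mu):
    """P(exactly y events) for a Poisson distribution with mean mu."""
    return mu**y * math.exp(-mu) / math.factorial(y)

mu = 2.3        # illustrative mean
ys = range(50)  # truncation of the infinite range 0, 1, 2, ...

total = sum(poisson_pmf(y, mu) for y in ys)     # should be close to 1
mean = sum(y * poisson_pmf(y, mu) for y in ys)  # should be close to mu
print(total, mean)
```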

Example: Let y denote the number of field mice captured in a trap in a 24-hour period. The average value of y is 2.3. What is the probability of capturing exactly 4 mice in a randomly selected trap?

Ans: Here $\mu = 2.3$ and $y = 4$, so $$ P(y=4) = \frac{2.3^{4}e^{-2.3}}{4!} $$



In [1]:

    
import math
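def poisson_prob(y,mu):
    """Return the Poisson probability of observing exactly y events
    when the average number of events is mu."""
    # e is hard-coded to the rounded value 2.71828; math.exp(-mu) is more precise
    e = 2.71828
    numerator = math.pow(mu, y) * math.pow(e, 0-mu)
    denominator = math.factorial(y)

    return numerator/denominator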



In [2]:

    
#calculate p(4)
p_4 = poisson_prob(4, 2.3)
p_4









    Out[2]:





0.1169024103856968

Let's plot the distribution of y for values 0 to 10.



In [11]:

    
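y=list(range(0,11))  # number of mice: 0 to 10
p_y = []             # probability of exactly y mice
cum_y = []           # probability of at most y mice
mu = 2.3

for yi in y:
    prob = poisson_prob(yi, mu)
    p_y.append(prob)

    # running total: P(0) + P(1) + ... + P(yi)
    if yi==0:
        cum_y.append(prob)
    else:
        cum_y.append(cum_y[-1] + prob)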



In [13]:

    
#plot this
import matplotlib.pyplot as plt
%matplotlib inline

fig,ax = plt.subplots(1,2, figsize=(13,5))
ax[0].plot(y, p_y)
ax[0].set_title('Probability of finding y mice in 24 hours')
ax[0].set_xlabel('Number of mice found in 24 hours (y)')
ax[0].set_ylabel('Probability of exactly y mice')

ax[1].plot(y,cum_y)
ax[1].set_title('Cumulative probability of finding y mice in 24 hours')
ax[1].set_xlabel('Number of mice found in 24 hours (y)')
ax[1].set_ylabel('Cumulative probability (at most y mice)')









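[Figure: left panel, probability of finding exactly y mice in 24 hours; right panel, cumulative probability; x-axis in both panels, y from 0 to 10.]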
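The cumulative curve accumulates P(0) + P(1) + ... + P(y), so it answers "at most y" questions. An "at least" question is most easily handled with the complement rule. A short sketch for the mouse-trap example, assuming a hypothetical helper `poisson_pmf` that uses `math.exp` rather than a rounded constant:

```python
import math

def poisson_pmf(y, mu):
    """P(exactly y events) for a Poisson distribution with mean mu."""
    return mu**y * math.exp(-mu) / math.factorial(y)

mu = 2.3  # average number of mice per trap in 24 hours, from the example

p_none = poisson_pmf(0, mu)  # P(y = 0) = e^(-mu)
p_at_least_one = 1 - p_none  # complement rule: P(y >= 1) = 1 - P(0)
p_at_most_two = sum(poisson_pmf(y, mu) for y in range(3))  # P(0) + P(1) + P(2)

print(p_at_least_one, p_at_most_two)
```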