Random variables

When the objective is to predict a category (for example, political party affiliation), we call the quantity being predicted a qualitative random variable. When we are predicting a numeric value (for example, the number of cars sold), we call it a quantitative random variable.

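When the observations of a quantitative random variable can assume any value in a continuous interval (such as temperature), it is called a continuous random variable. When the observations can only take distinct, countable values (such as the number of cars sold), it is called a discrete random variable.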

Properties of discrete random variable

Say y is the number of heads in two coin tosses, and P(y) is the probability of observing y heads. Then

  • the probability of each value of y lies between 0 and 1
  • the probabilities of all values of y sum to 1
  • the probabilities of the outcomes of a discrete random variable are additive. Thus the probability that y = 1 or y = 2 is P(1) + P(2)
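These three properties can be checked by enumerating every outcome of two coin tosses. The sketch below assumes a fair coin, which the text does not state, and uses exact fractions so the sums are not affected by floating-point rounding. The names `outcomes`, `p` and `p_1_or_2` are illustrative only.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two tosses of a fair coin (assumption):
# ('H','H'), ('H','T'), ('T','H'), ('T','T')
outcomes = list(product("HT", repeat=2))

# p[y] = probability of observing y heads
p = {}
for outcome in outcomes:
    heads = outcome.count("H")
    p[heads] = p.get(heads, 0) + Fraction(1, len(outcomes))

# Property 1: every probability lies between 0 and 1
print(all(0 <= prob <= 1 for prob in p.values()))

# Property 2: the probabilities of all values of y sum to 1
print(sum(p.values()))

# Property 3: P(y = 1 or y = 2) = P(1) + P(2)
p_1_or_2 = p[1] + p[2]
print(p_1_or_2)
```

With a fair coin, P(0) = 1/4, P(1) = 1/2 and P(2) = 1/4, so the probability of one or two heads is 3/4.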

Binomial and Poisson discrete random variables

Binomial probability distribution

A binomial experiment is one in which each trial has one of two possible outcomes. Coin tosses, accept / reject, pass / fail, infected / uninfected: these are the kinds of studies that involve a binomial experiment. An experiment is binomial in nature if

  • experiment has n identical trials
  • each trial results in 1 of 2 outcomes ( success and failure )
  • the probability of one outcome, say success, remains the same for all trials
  • trials are independent of each other
  • the random variable y is the number of successes observed in n trials.

The probability of observing y successes in n trials of a binomial experiment is

$$ P(y) = \frac{n!}{y!(n-y)!}\pi^y (1-\pi)^{n-y} $$

where

  • n = number of trials
  • $\pi$ = probability of success in a single trial
  • $1-\pi$ = probability of failure in a single trial
  • y = number of successes in n trials
  • $n!$ (n factorial) = $n(n-1)(n-2)\cdots(2)(1)$
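As a small worked example, take two tosses of a coin and ask for the probability of exactly one head. Assuming a fair coin (an assumption made here for illustration), n = 2, y = 1 and $\pi = 0.5$:

$$ P(1) = \frac{2!}{1!\,(2-1)!}\,(0.5)^{1}(1-0.5)^{2-1} = 2 \times 0.5 \times 0.5 = 0.5 $$

The factor 2 counts the two orderings that give one head (heads then tails, or tails then heads), each with probability 0.25.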

Mean and Standard Deviation of Binomial probability distribution

$$ \mu = n\pi $$

$$ \sigma = \sqrt{n\pi(1-\pi)} $$

where

  • $\mu$ is mean
  • $\sigma$ is standard deviation
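One way to build confidence in these closed-form expressions is to compare them with the mean and standard deviation computed directly from the probabilities P(y). The sketch below does this for illustrative values n = 100 and $\pi$ = 0.85, which are assumptions chosen for the example. The helper `binomial_pmf` is a hypothetical name that restates the formula for P(y) and relies on `math.comb`, available in Python 3.8 and later.

```python
import math

def binomial_pmf(n, y, pi):
    # P(y) = n! / (y! (n-y)!) * pi^y * (1-pi)^(n-y)
    return math.comb(n, y) * pi**y * (1 - pi)**(n - y)

n, pi = 100, 0.85  # illustrative values (assumption)

# Closed-form mean and standard deviation
mu = n * pi
sigma = math.sqrt(n * pi * (1 - pi))

# The same quantities computed directly from the distribution
pmf = [binomial_pmf(n, y, pi) for y in range(n + 1)]
mu_check = sum(y * p for y, p in enumerate(pmf))
var_check = sum((y - mu_check) ** 2 * p for y, p in enumerate(pmf))
sigma_check = math.sqrt(var_check)

print("closed form:", mu, sigma)
print("from P(y):  ", mu_check, sigma_check)
```

Both routes should agree up to floating-point rounding, with a mean of 85 and a standard deviation of about 3.57.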

We can build a simple Python function to calculate the binomial probability as shown below:


In [1]:
import math

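def bin_prob(n,y,pi):
    # number of ways to choose which y of the n trials succeed: n! / (y! (n-y)!)
    a = math.factorial(n)/(math.factorial(y)*math.factorial(n-y))
    # probability of any one such arrangement: pi^y * (1-pi)^(n-y)
    b = math.pow(pi, y) * math.pow((1-pi), (n-y))
    p_y = a*b
    return p_y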

Binomial probability of germination

Let us consider a problem where 100 seeds are drawn at random. The probability that any one seed will germinate is 0.85, based on the observation that about 85 out of 100 seeds germinate in a nursery. Since bin_prob returns the probability of exactly y successes, we can calculate the probability

  • that exactly 80 seeds will germinate
  • that exactly 50 seeds will germinate
  • that exactly 10 seeds will germinate
  • that exactly 95 seeds will germinate

In [5]:
exactly_80 = bin_prob(100,80,0.85)
print("exactly 80: " + str(exactly_80))

exactly_50 = bin_prob(100,50,0.85)
print("exactly 50: " + str(exactly_50))

exactly_10 = bin_prob(100,10,0.85)
print("exactly 10: " + str(exactly_10))

exactly_95 = bin_prob(100, 95, 0.85)
print("exactly 95: " + str(exactly_95))


exactly 80: 0.04022449066141771
exactly 50: 1.9026685879668748e-16
exactly 10: 2.4027434608795305e-62
exactly 95: 0.0011271383580980794
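The values printed above are probabilities of exactly y successes, because the binomial formula gives P(y) for a single value of y. The probability of at most k successes is the sum P(0) + P(1) + ... + P(k). A minimal sketch follows. The helpers `binomial_pmf` and `prob_at_most` are hypothetical names introduced here, and `math.comb` assumes Python 3.8 or later.

```python
import math

def binomial_pmf(n, y, pi):
    # probability of exactly y successes in n trials
    return math.comb(n, y) * pi**y * (1 - pi)**(n - y)

def prob_at_most(n, k, pi):
    # P(y <= k) = P(0) + P(1) + ... + P(k)
    return sum(binomial_pmf(n, y, pi) for y in range(k + 1))

for k in (10, 50, 80, 95):
    print("at most " + str(k) + ": " + str(prob_at_most(100, k, 0.85)))
```

The sum starts at y = 0, so the case where no seed germinates is included.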

We can calculate the probability for every possible value of the discrete random variable in a loop and plot the probabilities as shown below:


In [20]:
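x =[]
y =[]
cum_prob = []
# y runs from 1 to 100; y = 0 is left out because P(0) = 0.15**100 is negligible
for i in range(1,101):
    x.append(i)
    p_y = bin_prob(100,i,0.85)
#     print(str(i) + "  " + str(p_y))
    y.append(p_y)
    
    # running total: cum_prob[i-2] holds P(1) + ... + P(i-1)
    if i==1:
        cum_prob.append(p_y)
    else:
        cum_prob.append(cum_prob[i-2] + p_y)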

In [7]:
import matplotlib.pyplot as plt
%matplotlib inline

fig,ax = plt.subplots(1,2, figsize=(13,5))
ax[0].plot(x,y)
ax[0].set_title('Probability of y successes')
ax[0].set_xlabel('num of successes in 100 trials')
ax[0].set_ylabel('probability of successes')

ax[1].plot(x,cum_prob)
ax[1].set_title('Cumulative Probability of y successes')
ax[1].set_xlabel('num of successes in 100 trials')
ax[1].set_ylabel('cumulative probability of successes')


Out[7]:
[Figure: two line plots. Left: probability of y successes in 100 trials. Right: cumulative probability of y successes in 100 trials.]

As the graph above shows, the probability that y seeds will germinate peaks near y = 85, which matches the mean $\mu = n\pi = 100 \times 0.85 = 85$.


In [24]:
#find x corresponding to the max probability value
y.index(max(y)) + 1


Out[24]:
85

The probability falls steeply before and after 85. Using the cumulative probability, we can answer "more than" questions: subtracting the cumulative probability through y from the total leaves the probability of more than y successes. Find the probability that

  • more than 20 seeds will germinate = prob(that 21 or 22 or 23 ... or 100 seeds will germinate)

In [30]:
# cum_prob[k-1] holds P(1) + ... + P(k), so each difference is P(k+1) + ... + P(100)
more_than_20 = cum_prob[99] - cum_prob[19]
print("more than 20 = " + str(more_than_20))

more_than_85 = cum_prob[99] - cum_prob[84]
print("more than 85 = " + str(more_than_85))

more_than_95 = cum_prob[99] - cum_prob[94]
print("more than 95 = " + str(more_than_95))


more than 20 = 1.0
more than 85 = 0.45722420577595013
more than 95 = 0.00042551381703914704
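"At least k" includes k itself, whereas subtracting the cumulative probability through k leaves the probability of more than k successes. The two differ by exactly P(k). The sketch below computes both directly from the binomial formula. `binomial_pmf`, `prob_at_least` and `prob_more_than` are hypothetical helper names, and `math.comb` assumes Python 3.8 or later.

```python
import math

def binomial_pmf(n, y, pi):
    # probability of exactly y successes in n trials
    return math.comb(n, y) * pi**y * (1 - pi)**(n - y)

def prob_at_least(n, k, pi):
    # P(y >= k) = P(k) + P(k+1) + ... + P(n)
    return sum(binomial_pmf(n, y, pi) for y in range(k, n + 1))

def prob_more_than(n, k, pi):
    # P(y > k) = P(k+1) + ... + P(n), the same as P(y >= k+1)
    return prob_at_least(n, k + 1, pi)

for k in (20, 85, 95):
    print(k, prob_at_least(100, k, 0.85), prob_more_than(100, k, 0.85))
```

For k = 85 the two values differ noticeably, because P(85) is the largest single probability in the distribution.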

We can repeat the experiment with a sample size of 20 and plot the results.


In [31]:
x =[]
y =[]
cum_prob = []
for i in range(1,21):
    x.append(i)
    p_y = bin_prob(20,i,0.85)
#     print(str(i) + "  " + str(p_y))
    y.append(p_y)
    
    if i==1:
        cum_prob.append(p_y)
    else:
        cum_prob.append(cum_prob[i-2] + p_y)

In [32]:
#find x corresponding to the max probability value
y.index(max(y)) + 1


Out[32]:
17

In [33]:
import matplotlib.pyplot as plt
%matplotlib inline

fig,ax = plt.subplots(1,2, figsize=(13,5))
ax[0].plot(x,y)
ax[0].set_title('Probability of y successes')
ax[0].set_xlabel('num of successes in 20 trials')
ax[0].set_ylabel('probability of successes')

ax[1].plot(x,cum_prob)
ax[1].set_title('Cumulative Probability of y successes')
ax[1].set_xlabel('num of successes in 20 trials')
ax[1].set_ylabel('cumulative probability of successes')


Out[33]:
[Figure: two line plots. Left: probability of y successes in 20 trials. Right: cumulative probability of y successes in 20 trials.]

Poisson probability distribution

The Poisson distribution is used to model the number of events of a particular type that occur over a period of time or a region of space. An example is the number of vehicles passing through a security checkpoint in a 5-minute interval.

Conditions

The probability distribution of a discrete random variable y is Poisson, if:

  • Events occur one at a time. Two or more events do not occur at precisely the same time or place
  • Events are independent: the occurrence of an event in one period of time or region of space is independent of the occurrence of any other event in a non-overlapping period or region
  • The expected number of events during one period or region, $\mu$, is the same as the expected number of events in any other period or region

Thus the probability of observing y events in a unit of time or space is given by

$$ P(y) = \frac{\mu^{y}e^{-\mu}}{y!} $$

where

  • $\mu$ is the average value of y
  • e is the base of the natural logarithm, e ≈ 2.71828

Example: Let y denote the number of field mice captured in a trap in a 24-hour period. The average value of y is 2.3. What is the probability of capturing exactly 4 mice in a randomly selected trap?

Ans: $$ \mu=2.3 $$ $$ P(y=4)=? $$
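Substituting these values into the Poisson formula gives the answer by hand. The intermediate values are rounded:

$$ P(y=4) = \frac{2.3^{4}\,e^{-2.3}}{4!} \approx \frac{27.9841 \times 0.10026}{24} \approx 0.1169 $$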


In [1]:
import math
def poisson_prob(y,mu):
    # e is rounded to 5 decimal places here; math.exp(-mu) would avoid this rounding
    e = 2.71828
    numerator = math.pow(mu, y) * math.pow(e, 0-mu)
    denominator = math.factorial(y)
    
    return numerator/denominator

In [2]:
#calculate p(4)
p_4 = poisson_prob(4, 2.3)
p_4


Out[2]:
0.1169024103856968

Let's plot the distribution of y for values 0 to 10.


In [11]:
y=list(range(0,11))
p_y = []
cum_y = []
mu = 2.3

for yi in y:
    prob = poisson_prob(yi, mu)
    p_y.append(prob)

    if yi==0:
        cum_y.append(prob)
    else:
        cum_y.append(cum_y[yi-1] + prob)
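The same running total can be written more compactly with `itertools.accumulate`, and using `math.exp` avoids a rounded constant for e. The sketch below is an alternative to the loop above, not a replacement for it. `poisson_pmf` is a hypothetical helper name. It also reports how much probability lies beyond y = 10, which shows why stopping at 10 loses very little for $\mu$ = 2.3.

```python
import math
from itertools import accumulate

def poisson_pmf(y, mu):
    # P(y) = mu^y * e^(-mu) / y!, with e^(-mu) taken from math.exp
    return mu**y * math.exp(-mu) / math.factorial(y)

mu = 2.3
ys = list(range(0, 11))
p_y = [poisson_pmf(y, mu) for y in ys]

# running sums: cum_y[k] = P(y <= k)
cum_y = list(accumulate(p_y))

print("P(y = 4)   = " + str(p_y[4]))
print("P(y <= 10) = " + str(cum_y[-1]))
print("P(y > 10)  = " + str(1 - cum_y[-1]))
```

Each entry of `cum_y` is the probability of capturing at most that many mice, so the last entry is P(y ≤ 10) and its complement is the tail that the plot does not cover.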

In [13]:
#plot this
import matplotlib.pyplot as plt
%matplotlib inline

fig,ax = plt.subplots(1,2, figsize=(13,5))
ax[0].plot(y, p_y)
ax[0].set_title('Probability of finding y mice in 24 hours')
ax[0].set_xlabel('number of mice captured in 24 hours (y)')
ax[0].set_ylabel('Probability')

ax[1].plot(y,cum_y)
ax[1].set_title('Cumulative probability of finding at most y mice')
ax[1].set_xlabel('number of mice captured in 24 hours (y)')
ax[1].set_ylabel('Cumulative probability')


Out[13]:
[Figure: two line plots. Left: probability of capturing y mice in 24 hours. Right: cumulative probability.]