When the objective is to predict a category (a qualitative outcome, such as political party affiliation), we say we are predicting a qualitative random variable. On the other hand, if we are predicting a quantitative value (such as the number of cars sold), we say we are predicting a quantitative random variable.
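When the observations of a quantitative random variable can assume values in a continuous interval (such as temperature), it is called a continuous random variable. When they can assume only a countable number of values (such as the number of cars sold), it is called a discrete random variable.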
Say we are predicting the probability P(y) of getting y heads in two coin tosses. Then, for a fair coin, y can be 0, 1, or 2, with P(0) = 0.25, P(1) = 0.5, and P(2) = 0.25.
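A binomial experiment is one in which each trial has one of two possible outcomes. Coin tosses, accept / reject, pass / fail, infected / uninfected: these are the kinds of studies that involve a binomial experiment. An experiment is binomial in nature if

- it consists of n identical trials,
- each trial results in one of two outcomes, labeled success or failure,
- the probability of success on a single trial, π, is the same for every trial,
- the trials are independent, and
- y is the number of successes observed in the n trials.

The probability of observing y successes in n trials of a binomial experiment is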
$$
P(y) = \frac{n!}{y!(n-y)!}\pi^y (1-\pi)^{n-y}
$$
where

- n = number of trials
- π = probability of success on a single trial
- y = number of successes in n trials

We can build a simple Python function to calculate the binomial probability as shown below:
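In [1]:
import math

def bin_prob(n, y, pi):
    # number of ways to choose which y of the n trials succeed
    a = math.factorial(n) / (math.factorial(y) * math.factorial(n - y))
    # probability of one specific sequence with y successes and n-y failures
    b = math.pow(pi, y) * math.pow((1 - pi), (n - y))
    p_y = a * b
    return p_y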
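As a quick check of the formula, the two-coin-toss case from earlier can be run through it with n = 2 and π = 0.5 (a fair coin is assumed here). The sketch below is a minimal, self-contained variant that uses `math.comb` (available from Python 3.8) for the n!/(y!(n-y)!) term; the name `bin_prob_comb` is illustrative and not part of the original notebook.

```python
import math

def bin_prob_comb(n, y, pi):
    """Binomial probability of exactly y successes in n trials (illustrative helper)."""
    # math.comb(n, y) is the integer n! / (y! * (n - y)!)
    return math.comb(n, y) * pi**y * (1 - pi)**(n - y)

# Two tosses of a fair coin: y = number of heads
two_toss = [bin_prob_comb(2, y, 0.5) for y in range(3)]
print(two_toss)  # [0.25, 0.5, 0.25]
```

The three probabilities add up to 1, as they should for a complete list of outcomes.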
Let us consider a problem where 100 seeds are drawn at random. The germination rate of each seed is 85%; in other words, the probability that a seed will germinate is 0.85, derived from an experiment in which 85 out of 100 seeds germinated in a nursery. Now we want to calculate the probability that exactly 80, 50, 10, or 95 of the 100 seeds germinate. Note that bin_prob returns the probability of exactly y successes, not of at most y:
In [5]:
exactly_80 = bin_prob(100, 80, 0.85)
print("exactly 80: " + str(exactly_80))
exactly_50 = bin_prob(100, 50, 0.85)
print("exactly 50: " + str(exactly_50))
exactly_10 = bin_prob(100, 10, 0.85)
print("exactly 10: " + str(exactly_10))
exactly_95 = bin_prob(100, 95, 0.85)
print("exactly 95: " + str(exactly_95))
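If the question is instead about at most k germinations, the exact-count probabilities have to be summed from 0 up to k. A minimal sketch, assuming Python 3.8 or later for `math.comb`; the helper names `binom_pmf` and `binom_at_most` are hypothetical and not from the original notebook:

```python
import math

def binom_pmf(n, y, pi):
    # probability of exactly y successes in n trials
    return math.comb(n, y) * pi**y * (1 - pi)**(n - y)

def binom_at_most(n, k, pi):
    # P(y <= k): add up the exact-count probabilities for y = 0, 1, ..., k
    return sum(binom_pmf(n, y, pi) for y in range(k + 1))

for k in (10, 50, 80, 95):
    print("at most", k, ":", binom_at_most(100, k, 0.85))
```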
We can calculate the probability for every possible value of the discrete random variable in a loop and plot the probabilities as shown below:
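In [20]:
x = []
y = []
cum_prob = []
# y = 0 is skipped; P(0) = 0.15**100 is negligible here
for i in range(1, 101):
    x.append(i)
    p_y = bin_prob(100, i, 0.85)
    # print(str(i) + " " + str(p_y))
    y.append(p_y)
    if i == 1:
        cum_prob.append(p_y)
    else:
        # cum_prob[i-2] holds the running total up to i-1 successes
        cum_prob.append(cum_prob[i-2] + p_y)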
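The same running total can be sketched with `itertools.accumulate` from the standard library. This variant starts at y = 0, so that position k of each list corresponds to exactly k successes and no index shift is needed. The names `pmf` and `cdf` are illustrative, and `math.comb` assumes Python 3.8 or later.

```python
import math
from itertools import accumulate

n, pi = 100, 0.85

# pmf[k] = P(y = k) for k = 0..n
pmf = [math.comb(n, k) * pi**k * (1 - pi)**(n - k) for k in range(n + 1)]
# cdf[k] = P(y <= k), the running sum of pmf
cdf = list(accumulate(pmf))

print(cdf[-1])  # close to 1, up to floating-point rounding
```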
In [7]:
import matplotlib.pyplot as plt
%matplotlib inline
fig,ax = plt.subplots(1,2, figsize=(13,5))
ax[0].plot(x,y)
ax[0].set_title('Probability of y successes')
ax[0].set_xlabel('num of successes in 100 trials')
ax[0].set_ylabel('probability of successes')
ax[1].plot(x,cum_prob)
ax[1].set_title('Cumulative Probability of y successes')
ax[1].set_xlabel('num of successes in 100 trials')
ax[1].set_ylabel('cumulative probability of successes')
Out[7]:
As we can see in the graph above, the probability that x seeds will germinate peaks around 85, matching the germination rate of 0.85.
In [24]:
#find x corresponding to the max probability value
y.index(max(y)) + 1
Out[24]:
The probability falls steeply before and after 85. Using the cumulative probability, we can answer "at least" questions. Find the probability that at least 20, at least 85, and at least 95 seeds germinate:
In [30]:
# cum_prob[k] holds P(1 <= y <= k+1), so P(y >= m) = cum_prob[99] - cum_prob[m-2]
atleast_20 = cum_prob[99] - cum_prob[18]
print("at least 20 = " + str(atleast_20))
atleast_85 = cum_prob[99] - cum_prob[83]
print("at least 85 = " + str(atleast_85))
atleast_95 = cum_prob[99] - cum_prob[93]
print("at least 95 = " + str(atleast_95))
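An alternative sketch for the "at least" questions sums the upper tail directly instead of subtracting two cumulative values. For small tails, such as 95 or more germinations, this avoids taking the difference of two numbers that are both close to 1. The helper name `binom_at_least` is hypothetical, and `math.comb` assumes Python 3.8 or later.

```python
import math

def binom_at_least(n, k, pi):
    # P(y >= k): add up P(y = j) for j = k, k+1, ..., n
    return sum(math.comb(n, j) * pi**j * (1 - pi)**(n - j) for j in range(k, n + 1))

for k in (20, 85, 95):
    print("at least", k, "=", binom_at_least(100, k, 0.85))
```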
We can repeat the experiment with a sample size of 20 and plot the results:
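In [31]:
x = []
y = []
cum_prob = []
# y = 0 is skipped; P(0) = 0.15**20 is negligible here
for i in range(1, 21):
    x.append(i)
    p_y = bin_prob(20, i, 0.85)
    # print(str(i) + " " + str(p_y))
    y.append(p_y)
    if i == 1:
        cum_prob.append(p_y)
    else:
        # cum_prob[i-2] holds the running total up to i-1 successes
        cum_prob.append(cum_prob[i-2] + p_y)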
In [32]:
#find x corresponding to the max probability value
y.index(max(y)) + 1
Out[32]:
In [33]:
import matplotlib.pyplot as plt
%matplotlib inline
fig,ax = plt.subplots(1,2, figsize=(13,5))
ax[0].plot(x,y)
ax[0].set_title('Probability of y successes')
ax[0].set_xlabel('num of successes in 20 trials')
ax[0].set_ylabel('probability of successes')
ax[1].plot(x,cum_prob)
ax[1].set_title('Cumulative Probability of y successes')
ax[1].set_xlabel('num of successes in 20 trials')
ax[1].set_ylabel('cumulative probability of successes')
Out[33]:
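The location of the peak can also be checked without scanning the list. For a binomial distribution the most likely count is floor((n + 1)π) whenever (n + 1)π is not a whole number; this is a standard result rather than something stated in the notebook. A short sketch, with the hypothetical helpers `binom_mode` and `argmax_pmf`, compares that formula against a brute-force search for both sample sizes:

```python
import math

def binom_mode(n, pi):
    # most likely number of successes, valid when (n + 1) * pi is not an integer
    return math.floor((n + 1) * pi)

def argmax_pmf(n, pi):
    # brute force: evaluate P(y) for y = 0..n and return the y with the largest value
    pmf = [math.comb(n, y) * pi**y * (1 - pi)**(n - y) for y in range(n + 1)]
    return pmf.index(max(pmf))

for n in (100, 20):
    print(n, binom_mode(n, 0.85), argmax_pmf(n, 0.85))
# prints "100 85 85" then "20 17 17"
```

In both cases the peak sits at 85% of the sample size.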
The Poisson distribution is used for modeling the number of events of a particular type over a period of time or in a region of space. An example is the number of vehicles passing through a security checkpoint in a 5 minute interval.
Conditions
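The probability distribution of a discrete random variable y is Poisson if:

- events occur one at a time; two or more events do not occur at precisely the same time or in the same space,
- the occurrence of an event in a given period of time or region of space is independent of its occurrence in any nonoverlapping period or region, and
- the expected number of events in one period or region, μ, is the same as in any other period or region.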
Thus the probability of observing y events in a unit of time or space is given by
$$ P(y) = \frac{\mu^{y}e^{-\mu}}{y!} $$

where

- μ = average number of events per unit of time or space
- e ≈ 2.71828, the base of the natural logarithm
Example
Let y denote the number of field mice captured in a trap in a 24 hour period. The average value of y is 2.3. What is the probability of capturing exactly 4 mice in a randomly selected trap?
Ans: $$ \mu=2.3 $$ $$ P(y=4)=? $$
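Substituting μ = 2.3 and y = 4 into the formula gives the answer by hand (intermediate values are rounded), which the function defined next should reproduce:

$$
P(y=4) = \frac{2.3^{4}\,e^{-2.3}}{4!} \approx \frac{27.9841 \times 0.100259}{24} \approx 0.1169
$$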
In [1]:
import math

def poisson_prob(y, mu):
    # P(y) = mu^y * e^(-mu) / y!
    numerator = math.pow(mu, y) * math.exp(-mu)
    denominator = math.factorial(y)
    return numerator / denominator
In [2]:
#calculate p(4)
p_4 = poisson_prob(4, 2.3)
p_4
Out[2]:
Let's plot the distribution of y for values 0 to 10:
In [11]:
y = list(range(0, 11))
p_y = []
cum_y = []
mu = 2.3
for yi in y:
    prob = poisson_prob(yi, mu)
    p_y.append(prob)
    if yi == 0:
        cum_y.append(prob)
    else:
        cum_y.append(cum_y[yi-1] + prob)
In [13]:
#plot this
import matplotlib.pyplot as plt
%matplotlib inline
fig,ax = plt.subplots(1,2, figsize=(13,5))
ax[0].plot(y, p_y)
ax[0].set_title('Probability of capturing exactly y mice in 24 hours')
ax[0].set_xlabel('Number of mice captured in 24 hours (y)')
ax[0].set_ylabel('Probability')
ax[1].plot(y, cum_y)
ax[1].set_title('Cumulative probability of capturing at most y mice in 24 hours')
ax[1].set_xlabel('Number of mice captured in 24 hours (y)')
ax[1].set_ylabel('Cumulative probability')
Out[13]:
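For longer ranges of y, the factorial and power terms can be avoided with the recurrence P(y) = P(y-1) · μ / y, starting from P(0) = e^(-μ), which follows directly from the Poisson formula. A minimal sketch under that assumption; the function name `poisson_pmf_list` is illustrative and not part of the original notebook:

```python
import math
from itertools import accumulate

def poisson_pmf_list(mu, y_max):
    # P(0) = e^(-mu); each later term is the previous one times mu / y
    probs = [math.exp(-mu)]
    for y in range(1, y_max + 1):
        probs.append(probs[-1] * mu / y)
    return probs

p = poisson_pmf_list(2.3, 10)
cum = list(accumulate(p))      # cum[k] = P(y <= k)
at_least_1 = 1 - p[0]          # P(y >= 1) = 1 - P(0)
print(p[4], cum[-1], at_least_1)
```

The last cumulative value stays slightly below 1 because counts above 10 are possible, although very unlikely at μ = 2.3.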