Many statistical tools and techniques used in data analysis are based on probability, which measures on a scale from 0 to 1 how likely an event is to occur (0 means the event never occurs, 1 means the event always occurs).
Variables in the columns of a data set can be thought of as random variables: their values vary due to chance.
The distribution of the outcomes (values) of a random variable is described using a probability distribution (function). In statistics, there are several common probability distributions, corresponding to various "shapes". The ones most commonly used to model real-life random events are the Uniform, Normal, Binomial, Exponential, Poisson, and Lognormal distributions.
This notebook presents several common probability distributions and how to work with them in Python.
The uniform distribution is a probability distribution where each value within a certain range is equally likely to occur and values outside of the range never occur.
In :
# Uncomment the next command if you need to install a missing module
#!pip install statsmodels
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
%matplotlib inline
In :
uniform_data = stats.uniform.rvs(size=10000,  # Generate 10000 numbers
                                 loc=0,       # From 0
                                 scale=20)    # To 20
pd.DataFrame(uniform_data).plot(kind="density",  # Plot the distribution
                                figsize=(9, 9),
                                xlim=(-5, 25))
Out:<matplotlib.axes._subplots.AxesSubplot at 0x10cea4240>
The plot above is an approximation of the theoretical uniform distribution because it is based on a sample of observations: we generated 10,000 data points from a uniform distribution spanning the range 0 to 20.
In the density plot, we see that the density of sample data is almost level, i.e., all values have the same probability of occurring.
Note: the area under a probability density curve is always equal to 1.
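As a quick check of this fact, we can integrate the uniform density numerically and confirm the area is 1 (a sketch using scipy.integrate.quad, which is not used elsewhere in this notebook):

```python
import scipy.stats as stats
from scipy.integrate import quad

# Integrate the uniform(0, 20) density over its support; the density is a
# constant 0.05 on [0, 20], so the area should come out to 1
area, _ = quad(lambda x: stats.uniform.pdf(x, loc=0, scale=20), 0, 20)
print(round(area, 6))  # 1.0
```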
More useful scipy functions:
stats.distribution.rvs() generates random numbers from the specified distribution. The arguments to rvs() vary depending on the type of distribution considered, e.g., the starting and ending points and the sample size for a uniform distribution.
stats.distribution.cdf() returns the probability that an observation drawn from a distribution is below a specified value (a.k.a., the cumulative distribution function) calculated as the area under the distribution's density curve to the left of the specified value (on the x axis). For example, in the uniform distribution above, there is a 25% chance that an observation will be in the range 0 to 5 and a 75% chance it will fall in the range 5 to 20. We can confirm this with cdf():
In :
stats.uniform.cdf(x=5,       # Quantile to check
                  loc=0,     # Start value of the distribution
                  scale=20)  # End value of the distribution
stats.distribution.ppf() is the inverse of cdf(): given a probability cutoff q, it returns the x-axis value (quantile) below which that proportion of observations falls.
In :
stats.uniform.ppf(q=0.2,     # Probability cutoff
                  loc=0,     # Start value of the distribution
                  scale=20)  # End value of the distribution
Probability distribution functions in scipy also support median(), mean(), var() and std().
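For instance, a quick sketch for the uniform(0, 20) distribution, whose mean and median are both 10 and whose variance is 20²/12:

```python
import scipy.stats as stats

# Summary statistics of a uniform distribution on [0, 20]
print(stats.uniform.mean(loc=0, scale=20))    # 10.0
print(stats.uniform.median(loc=0, scale=20))  # 10.0
print(stats.uniform.var(loc=0, scale=20))     # 400/12, about 33.33
print(stats.uniform.std(loc=0, scale=20))     # sqrt of the variance
```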
In :
for x in range(-2, 24, 4):
    print("Density at x value " + str(x))
    print(stats.uniform.pdf(x, loc=0, scale=20))
Density at x value -2
0.0
Density at x value 2
0.05
Density at x value 6
0.05
Density at x value 10
0.05
Density at x value 14
0.05
Density at x value 18
0.05
Density at x value 22
0.0
To generate random real numbers in a range with equal probability one can draw numbers from a uniform distribution using the stats.distribution.rvs() described above.
Python also has a library called "random" which comes equipped with various operations that involve randomization.
In :
import random
print(random.randint(0, 10))        # Get a random integer in the specified range
print(random.choice([2, 4, 6, 9]))  # Get a random element from a sequence
In :print(random.random()) # Get a real number between 0 and 1
In :print(random.uniform(0,10)) # Get a real in the specified range
Regardless of the method used to generate random numbers, the result of a random process can differ from one run to the next. If different results on each run are not desired, e.g., if results need to be reproduced exactly, one can ensure the results are the same each time by setting the random number generator's seed to a specific value. random.seed() allows us to set the random number generator's seed.
Notice that we generate the exact same numbers with both calls to random.uniform() because the same seed is set before each call. This reproducibility illustrates the fact that these random numbers aren't truly random, but rather "pseudorandom".
Many Python library functions that use randomness have an optional random seed argument built in, e.g., the rvs() function has an argument random_state, which allows us to set the seed.
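For example, a quick sketch showing that two rvs() calls with the same random_state produce identical draws:

```python
import numpy as np
import scipy.stats as stats

# Two calls with the same random_state yield the same sample
a = stats.uniform.rvs(size=3, loc=0, scale=20, random_state=42)
b = stats.uniform.rvs(size=3, loc=0, scale=20, random_state=42)
print(np.array_equal(a, b))  # True
```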
Note: there is a separate internal seed from the numpy library, which can be set with numpy.random.seed() when using functions from numpy and libraries built on top of it such as pandas and scipy.
In :
random.seed(10)  # Set the seed to 10
print([random.uniform(0, 20) for x in range(4)])
random.seed(10)  # Set the seed to the same value
print([random.uniform(0, 20) for x in range(4)])
[11.42805189379827, 8.577781093502292, 11.561826022689408, 4.121964642790035]
[11.42805189379827, 8.577781093502292, 11.561826022689408, 4.121964642790035]
The normal (or Gaussian) distribution is a continuous probability distribution with a bell-shaped curve and is characterized by its center point (mean) and spread (standard deviation). Most observations from a normal distribution lie close to the mean: about 68% of the data lies within 1 standard deviation of the mean, 95% lies within 2 standard deviations and 99.7% within 3 standard deviations.
Note that many common statistical tests assume distributions are normal. (In the scipy module the normal distribution is referred to as norm.)
In :
prob_under_minus1 = stats.norm.cdf(x=-1, loc=0, scale=1)
prob_over_1 = 1 - stats.norm.cdf(x=1, loc=0, scale=1)
between_prob = 1 - (prob_under_minus1 + prob_over_1)
print(prob_under_minus1, prob_over_1, between_prob)
0.158655253931 0.158655253931 0.682689492137
The figures above show that approximately 16% of the data in a normal distribution with mean 0 and standard deviation 1 is below -1, 16% is above 1 and 68% is between -1 and 1. Let's plot the normal distribution:
In :
# Plot normal distribution areas
#plt.rcParams["figure.figsize"] = (9,9)
plt.fill_between(x=np.arange(-4, -1, 0.01),
                 y1=stats.norm.pdf(np.arange(-4, -1, 0.01)),
                 facecolor='red', alpha=0.35)
plt.fill_between(x=np.arange(1, 4, 0.01),
                 y1=stats.norm.pdf(np.arange(1, 4, 0.01)),
                 facecolor='red', alpha=0.35)
plt.fill_between(x=np.arange(-1, 1, 0.01),
                 y1=stats.norm.pdf(np.arange(-1, 1, 0.01)),
                 facecolor='blue', alpha=0.35)
plt.text(x=-1.8, y=0.03, s=round(prob_under_minus1, 3))
plt.text(x=-0.2, y=0.1, s=round(between_prob, 3))
plt.text(x=1.4, y=0.03, s=round(prob_over_1, 3))
Out:<matplotlib.text.Text at 0x10a44c860>
The plot above shows the bell shape of the normal distribution as well as the area below, above and within one standard deviation of the mean. One can also check the quantiles of a normal distribution with stats.norm.ppf():
In :
print(stats.norm.ppf(q=0.025))  # Find the quantile for the 2.5% cutoff
print(stats.norm.ppf(q=0.975))  # Find the quantile for the 97.5% cutoff
The figures above show that approximately 5% of the data is further than 2 standard deviations from the mean.
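The 68-95-99.7 rule stated earlier can also be verified directly with cdf() (a quick sketch for the standard normal):

```python
import scipy.stats as stats

# Probability of falling within k standard deviations of the mean
for k in (1, 2, 3):
    prob = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(k, round(prob, 4))  # roughly 0.6827, 0.9545, 0.9973
```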
The binomial distribution is a discrete probability distribution that models the number of successes achieved in a given number of random trials of some experiment or event. It has two parameters: the probability of success in a trial and the number of trials. We could model flipping a fair coin 10 times with a binomial distribution where the number of trials is set to 10 and the probability of success is set to 0.5. The scipy name for the binomial distribution is binom.
In :
fair_coin_flips = stats.binom.rvs(n=10,        # Number of flips per trial
                                  p=0.5,       # Success probability
                                  size=10000)  # Number of trials
print(pd.crosstab(index="counts", columns=fair_coin_flips))
pd.DataFrame(fair_coin_flips).hist(range=(-0.5, 10.5), bins=11)
col_0    0   1    2     3     4     5     6     7    8   9  10
row_0
counts  15  99  441  1113  2062  2469  1988  1225  485  97   6
Out:array([[<matplotlib.axes._subplots.AxesSubplot object at 0x10a41a198>]], dtype=object)
The binomial distribution is discrete and can be summarized with a frequency table (and a histogram). The histogram shows us that a binomial distribution with a 50% probability of success is roughly symmetric, but this is not the case in general.
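scipy can also report the theoretical moments, which for a binomial are mean n·p and standard deviation sqrt(n·p·(1−p)) (a quick sketch, not part of the original analysis):

```python
import math
import scipy.stats as stats

# Theoretical mean and standard deviation of a binomial with n=10, p=0.5
print(stats.binom.mean(n=10, p=0.5))     # 5.0, i.e. n*p
print(stats.binom.std(n=10, p=0.5))      # sqrt(2.5), about 1.5811
print(math.sqrt(10 * 0.5 * 0.5))         # same value from the formula
```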
In :
biased_coin_flips = stats.binom.rvs(n=10,        # Number of flips per trial
                                    p=0.8,       # Success probability
                                    size=10000)  # Number of trials
# Print table of counts
print(pd.crosstab(index="counts", columns=biased_coin_flips))
# Plot histogram
pd.DataFrame(biased_coin_flips).hist(range=(-0.5, 10.5), bins=11)
col_0   3   4    5    6     7     8     9    10
row_0
counts  2  43  272  860  2060  2957  2698  1108
Out:array([[<matplotlib.axes._subplots.AxesSubplot object at 0x10562f550>]], dtype=object)
In :
# cdf() checks the probability of achieving a number of successes within a certain range
print(stats.binom.cdf(k=5,       # Probability of k = 5 successes or fewer
                      n=10,      # With 10 flips
                      p=0.8))    # And success probability 0.8
print(1 - stats.binom.cdf(k=8,   # Probability of k = 9 successes or more
                          n=10,  # With 10 flips
                          p=0.8))  # And success probability 0.8
For continuous distributions, we can use pdf() to check the probability density at a given x value. For discrete distributions like the binomial, we can use stats.distribution.pmf() (the probability mass function) to check the probability of observing a given number of successes k.
In :
print(stats.binom.pmf(k=5,     # Probability of k = 5 successes
                      n=10,    # With 10 flips
                      p=0.5))  # And success probability 0.5
print(stats.binom.pmf(k=8,     # Probability of k = 8 successes
                      n=10,    # With 10 flips
                      p=0.8))  # And success probability 0.8
The geometric distribution is discrete and models the number of trials it takes to achieve a success in repeated experiments with a given probability of success. The exponential distribution is continuous and models the amount of time before an event occurs given a certain occurrence rate. The scipy nickname for the geometric distribution is "geom" and for the exponential it is "expon". Below we model the number of trials it takes to get a success (heads) when flipping a fair coin:
In :
np.random.seed(12)  # scipy draws from numpy's generator, so seed numpy
flips_till_heads = stats.geom.rvs(size=10000,  # Generate geometric data
                                  p=0.5)       # With success prob 0.5
# Print table of counts
print(pd.crosstab(index="counts", columns=flips_till_heads))
# Plot histogram
pd.DataFrame(flips_till_heads).hist(range=(-0.5, max(flips_till_heads) + 0.5),
                                    bins=max(flips_till_heads) + 1)
col_0      1     2     3    4    5    6   7   8   9  10  11  12  13
row_0
counts  4947  2548  1214  640  327  172  86  33  15   9   3   4   2
Out:array([[<matplotlib.axes._subplots.AxesSubplot object at 0x10debba58>]], dtype=object)
In the 10,000 trials generated, the longest it took to get a heads was 13 flips. We can use cdf() to check the probability of needing more than 3 flips to get the first success:
In :
first_three = stats.geom.cdf(k=3,   # Prob of success within the first 3 flips
                             p=0.5)
1 - first_three
In :
# Use pmf() to check the probability of needing a specific number of flips for the first success
stats.geom.pmf(k=2,   # Prob of needing exactly 2 flips to get the first success
               p=0.5)
In :
# Get the probability of waiting more than 1 time unit before a success
prob_1 = stats.expon.cdf(x=1,
                         scale=1)  # scale = 1/arrival_rate; here arrival rate 1
1 - prob_1
Note: The average arrival time for the exponential distribution is equal to 1/arrival_rate. Let's plot this exponential distribution:
In :
plt.fill_between(x=np.arange(0, 1, 0.01),
                 y1=stats.expon.pdf(np.arange(0, 1, 0.01)),
                 facecolor='blue', alpha=0.35)
plt.fill_between(x=np.arange(1, 7, 0.01),
                 y1=stats.expon.pdf(np.arange(1, 7, 0.01)),
                 facecolor='red', alpha=0.35)
plt.text(x=0.3, y=0.2, s=round(prob_1, 3))
plt.text(x=1.5, y=0.08, s=round(1 - prob_1, 3))
Out:<matplotlib.text.Text at 0x10e16e2e8>
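As the note above says, the average arrival time equals 1/arrival_rate, and in scipy expon is parameterized by scale = 1/arrival_rate; a quick sketch confirming the relationship:

```python
import scipy.stats as stats

# With arrival rate 2 per time unit, scale = 1/2 and the mean wait is 0.5
print(stats.expon.mean(scale=0.5))  # 0.5
# With arrival rate 1, scale = 1 and the mean wait is 1
print(stats.expon.mean(scale=1))    # 1.0
```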
The Poisson distribution models the probability of seeing a certain number of successes within a time interval, where the time it takes for the next success is modeled by an exponential distribution. The Poisson distribution can be used to model the number of arrivals a hospital can expect in an hour's time, the number of emails one can expect to receive in a day, etc. The scipy name for the Poisson distribution is "poisson". Below is a plot for data from a Poisson distribution with an arrival rate of 1 per time unit:
In :
np.random.seed(12)  # scipy draws from numpy's generator, so seed numpy
arrival_rate_1 = stats.poisson.rvs(size=10000,  # Generate Poisson data
                                   mu=1)        # Average arrival rate 1
# Print table of counts
print(pd.crosstab(index="counts", columns=arrival_rate_1))
# Plot histogram
pd.DataFrame(arrival_rate_1).hist(range=(-0.5, max(arrival_rate_1) + 0.5),
                                  bins=max(arrival_rate_1) + 1)
col_0      0     1     2    3    4   5  6
row_0
counts  3586  3715  1902  628  141  23  5
Out:array([[<matplotlib.axes._subplots.AxesSubplot object at 0x10ddc0b38>]], dtype=object)
The histogram shows that when arrivals are relatively infrequent, it is rare to see more than a couple of arrivals in each time period.
In :
np.random.seed(12)  # scipy draws from numpy's generator, so seed numpy
arrival_rate_10 = stats.poisson.rvs(size=10000,  # Generate Poisson data
                                    mu=10)       # Average arrival rate 10
# Print table of counts
print(pd.crosstab(index="counts", columns=arrival_rate_10))
# Plot histogram
pd.DataFrame(arrival_rate_10).hist(range=(-0.5, max(arrival_rate_10) + 0.5),
                                   bins=max(arrival_rate_10) + 1)
col_0    1   2   3    4    5    6    7     8     9    10  ...   14   15   16
row_0                                                     ...
counts   3  26  85  213  406  640  894  1083  1253  1234  ...  530  335  228

col_0    17  18  19  20  21  22  23
row_0
counts  139  70  29  24   9   7   3

[1 rows x 23 columns]
Out:array([[<matplotlib.axes._subplots.AxesSubplot object at 0x10ddc05f8>]], dtype=object)
We can use cdf() to check the probability of achieving more (or less) than a certain number of successes and pmf() to check the probability of obtaining a specific number of successes:
In :
print(stats.poisson.cdf(k=5,     # Check the probability of 5 arrivals or fewer
                        mu=10))  # With arrival rate 10
print(stats.poisson.pmf(k=10,    # Check the prob of exactly 10 arrivals
                        mu=10))  # With arrival rate 10
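For a discrete distribution the cdf is just the running sum of the pmf; a quick sketch verifying this for the Poisson case above:

```python
import scipy.stats as stats

# cdf(5) should equal the sum of pmf(0), pmf(1), ..., pmf(5)
total = sum(stats.poisson.pmf(k, mu=10) for k in range(6))
print(round(total, 6), round(stats.poisson.cdf(k=5, mu=10), 6))
```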