Distribution

the way in which something is shared out among a group or spread over an area

Random Variable

a variable whose value is subject to variations due to chance (i.e. randomness, in a mathematical sense). A random variable can take on a set of possible different values (similarly to other mathematical variables), each with an associated probability wiki

Types

Discrete Random Variables
Eg: Genders of the buyers buying shoe
Continuous Random Variables
Eg: Shoe Sales in a quarter

Probability Distribution

Assigns a probability to each measurable subset of the possible outcomes of a random experiment, survey, or procedure of statistical inference. wiki

Probability Mass Function

probability mass function (pmf) is a function that gives the probability that a discrete random variable is exactly equal to some value

Discrete probability distribution(Cumulative Mass Function)

probability distribution characterized by a probability mass function

Probability Density Function

function that describes the relative likelihood for this random variable to take on a given value

Continuous probability distribution(Cumulative Density function)

probability that the variable takes a value less than or equal to x

Central Limit Theorem

Given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution. wiki

Normal Distribution

A bell shaped distribution. It is also called Gaussian distribution

PDF

CDF

Skewness

Measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. wiki

Kurtosis

Measure of the "peakedness" of the probability distribution of a real-valued random variable wiki

Binomial Distribution

Binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p. A success/failure experiment is also called a Bernoulli experiment or Bernoulli trial; when n = 1, the binomial distribution is a Bernoulli distribution wiki

Exponential Distribution

Probability distribution that describes the time between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate. It has the key property of being memoryless. wiki

Uniform distribution

All values have the same frequency wiki)

6-sigma philosophy

Histograms

Most commonly used representation of a distribution.

Let's plot distribution of weed prices for 2014



In [1]:

    
import pandas as pd
import seaborn as sns
sns.set(color_codes=True)
%matplotlib inline



In [3]:

    
#Import the data
cars = pd.read_csv("cars_v1.csv", encoding="ISO-8859-1")



In [6]:

    
#Replace missing values in Mileage with mean
cars.Mileage.fillna(cars.Mileage.mean(), inplace=True)



In [7]:

    
sns.distplot(cars.Mileage, kde=False)









    Out[7]:





<matplotlib.axes._subplots.AxesSubplot at 0x11c699438>

Question If you randomly select a car, what is the probability, with equal chances of selecting any of the make available in our dataset, that the mileage will be greater than 25?



In [ ]:

    
sns.distplot(cars.Mileage, bins=range(0,50,1))

Using `scipy` to use distribution



In [12]:

    
from scipy import stats
import scipy as sp
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
%matplotlib inline



In [13]:

    
#Generate random numbers that are normally distributed
random_normal = sp.randn(100)
plt.scatter(range(100), random_normal)









    Out[13]:





<matplotlib.collections.PathCollection at 0x11d6a95f8>



In [ ]:

    
print "mean:", random_normal.mean(), " variance:", random_normal.var()



In [15]:

    
#Create a normal distribution with mean 2.5 and standard deviation 1.7

n = stats.norm(loc=2.5, scale=1.7)



In [16]:

    
#Generate random number from that distribution
n.rvs()









    Out[16]:





0.542441366031919



In [17]:

    
#for the above normal distribution, what is the pdf at 0.3?
n.pdf(0.3)









    Out[17]:





0.10157711386142985



In [18]:

    
#Binomial distribution with `p` = 0.4 and number of trials as 15



In [19]:

    
stats.binom.pmf(range(15), 10, 0.4)









    Out[19]:





array([  6.04661760e-03,   4.03107840e-02,   1.20932352e-01,
         2.14990848e-01,   2.50822656e-01,   2.00658125e-01,
         1.11476736e-01,   4.24673280e-02,   1.06168320e-02,
         1.57286400e-03,   1.04857600e-04,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00])

Standard Error

It is a measure of how far the estimate to be off, on average. More technically, it is the standard deviation of the sampling distribution of a statistic(mostly the mean). Please do not confuse it with standard deviation. Standard deviation is a measure of the variability of the observed quantity. Standard error, on the other hand, describes variability of the estimate.

To illustrate this, let's do the following.

Not all the make and models are available in our dataset. Also, we had to impute some of the values.

Let's say that a leading automobile magazine did an extensive survey and printed that the mean mileage is 22.7.

Compute standard deviation and standard error for the mean for our dataset



In [20]:

    
cars.head()









    Out[20]:






  
    
      
      Make
      Model
      Price
      Type
      ABS
      BootSpace
      GearType
      AirBag
      Engine
      FuelCapacity
      Mileage
    
  
  
    
      0
      Ashok Leyland Stile
      Ashok Leyland Stile LE 8-STR (Diesel)
      750
      MPV
      No
      500.0
      Manual
      No
      1461.0
      50.0
      20.7
    
    
      1
      Ashok Leyland Stile
      Ashok Leyland Stile LS 8-STR (Diesel)
      800
      MPV
      No
      500.0
      Manual
      No
      1461.0
      50.0
      20.7
    
    
      2
      Ashok Leyland Stile
      Ashok Leyland Stile LX 8-STR (Diesel)
      830
      MPV
      No
      500.0
      Manual
      No
      1461.0
      50.0
      20.7
    
    
      3
      Ashok Leyland Stile
      Ashok Leyland Stile LS 7-STR (Diesel)
      850
      MPV
      No
      500.0
      Manual
      No
      1461.0
      50.0
      20.7
    
    
      4
      Ashok Leyland Stile
      Ashok Leyland Stile LS 7-STR Alloy (Diesel)
      880
      MPV
      No
      500.0
      Manual
      No
      1461.0
      50.0
      20.7



In [23]:

    
#Mean and standard deviation of car's mileage
print(" Sample Mean:", cars.Mileage.mean(), "\n", "Sample Standard Deviation:", cars.Mileage.std())









    



 Sample Mean: 17.480407854984836 
 Sample Standard Deviation: 4.086421315837099



In [28]:

    
print(" Max Mileage:", cars.Mileage.max(), "\n", "Min Mileage:", cars.Mileage.min())









    



 Max Mileage: 30.0 
 Min Mileage: 5.0

We'll follow the same procedures we did in the resampling.ipynb. We will bootstrap samples from actual observed data 10,000 times and compute difference between sample mean and actual mean. Find root mean squared error to get standard error



In [29]:

    
def squared_error(bootstrap_sample, actual_mean):
    return np.square(bootstrap_sample.mean() - actual_mean)

def experiment_for_computing_standard_error(observed_mileage, number_of_times, actual_mean):
    bootstrap_mean = np.empty([number_of_times, 1], dtype=np.int32)
    bootstrap_sample = np.random.choice(observed_mileage, size=[observed_mileage.size, number_of_times], replace=True)
    bootstrap_squared_error = np.apply_along_axis(squared_error, 1, bootstrap_sample, actual_mean)
    return np.sqrt(bootstrap_squared_error.mean())



In [30]:

    
#Standard error of the estimate for mean
experiment_for_computing_standard_error(np.array(cars.Mileage), 10, 22.7)









    Out[30]:





5.3759376209847147



In [ ]:

	Make	Model	Price	Type	ABS	BootSpace	GearType	AirBag	Engine	FuelCapacity	Mileage
0	Ashok Leyland Stile	Ashok Leyland Stile LE 8-STR (Diesel)	750	MPV	No	500.0	Manual	No	1461.0	50.0	20.7
1	Ashok Leyland Stile	Ashok Leyland Stile LS 8-STR (Diesel)	800	MPV	No	500.0	Manual	No	1461.0	50.0	20.7
2	Ashok Leyland Stile	Ashok Leyland Stile LX 8-STR (Diesel)	830	MPV	No	500.0	Manual	No	1461.0	50.0	20.7
3	Ashok Leyland Stile	Ashok Leyland Stile LS 7-STR (Diesel)	850	MPV	No	500.0	Manual	No	1461.0	50.0	20.7
4	Ashok Leyland Stile	Ashok Leyland Stile LS 7-STR Alloy (Diesel)	880	MPV	No	500.0	Manual	No	1461.0	50.0	20.7