Distribution: the way in which something is shared out among a group or spread over an area.
Random variable: a variable whose value is subject to variation due to chance (i.e. randomness, in a mathematical sense). Like other mathematical variables, a random variable can take on a set of possible values, but each value has an associated probability.
Types
Probability distribution: assigns a probability to each measurable subset of the possible outcomes of a random experiment, survey, or procedure of statistical inference.
Probability mass function (pmf): a function that gives the probability that a discrete random variable is exactly equal to some value.
Discrete probability distribution: a probability distribution characterized by a probability mass function.
Probability density function (pdf): a function that describes the relative likelihood for a continuous random variable to take on a given value.
Cumulative distribution function (cdf): the probability that the variable takes a value less than or equal to x.
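The three kinds of function described above (probability mass, probability density, cumulative distribution) can be evaluated directly with `scipy.stats`. The sketch below is illustrative only: the fair die and the standard normal are assumed examples, not part of the cars dataset.

```python
import numpy as np
from scipy import stats

# Discrete example: a fair six-sided die (an illustrative choice, not from the dataset)
die = stats.randint(low=1, high=7)        # values 1..6, the upper bound is exclusive
faces = np.arange(1, 7)
die_pmf = die.pmf(faces)                  # P(X = k) for each face
die_cdf = die.cdf(faces)                  # P(X <= k) for each face

print("pmf:", die_pmf)                    # each entry is 1/6
print("cdf:", die_cdf)                    # running total of the pmf

# Continuous example: the standard normal distribution
z = stats.norm(loc=0, scale=1)
density_at_zero = z.pdf(0.0)              # a density, not a probability
prob_below_zero = z.cdf(0.0)              # P(X <= 0)
print("pdf(0):", density_at_zero, " cdf(0):", prob_below_zero)
```

Note that a pmf value is a probability, while a pdf value is a density: only areas under the pdf (differences of the cdf) are probabilities.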
Central limit theorem: given certain conditions, the arithmetic mean of a sufficiently large number of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution.
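A small simulation can make this statement concrete. The sketch below averages draws from an exponential distribution, which is strongly skewed, and checks that the averages look far more symmetric. The sample size, number of experiments, and seed are assumed values chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)       # fixed seed so the sketch is repeatable

sample_size = 50          # observations averaged in each experiment (assumed value)
n_experiments = 10_000    # number of sample means collected (assumed value)

# Draw from an exponential distribution (mean 1, variance 1), which is strongly skewed
draws = rng.exponential(scale=1.0, size=(n_experiments, sample_size))
sample_means = draws.mean(axis=1)

print("mean of sample means:", sample_means.mean())          # close to 1
print("std of sample means: ", sample_means.std(ddof=1))     # close to 1/sqrt(50)
print("skewness of raw draws:   ", stats.skew(draws.ravel()))
print("skewness of sample means:", stats.skew(sample_means))
```

The spread of the sample means shrinks roughly like 1/sqrt(sample_size), which is the same quantity the standard error measures.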
Normal distribution: a bell-shaped distribution, also called the Gaussian distribution.
PDF
CDF
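For reference, the PDF and CDF of the normal distribution have the following standard closed forms. The symbols are not used elsewhere in these notes: here \mu is the mean, \sigma the standard deviation, and erf the error function.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad
F(x) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right]
```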
Skewness: a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
Kurtosis: a measure of the "peakedness" (more precisely, the weight of the tails) of the probability distribution of a real-valued random variable.
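Both measures can be estimated from data with `scipy.stats.skew` and `scipy.stats.kurtosis`. The sketch below compares a symmetric sample with a right-skewed one; the two simulated samples and the seed are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

symmetric = rng.normal(loc=0.0, scale=1.0, size=100_000)    # bell shaped
right_skewed = rng.exponential(scale=1.0, size=100_000)     # long right tail

for name, data in [("normal", symmetric), ("exponential", right_skewed)]:
    # fisher=True reports excess kurtosis, so a normal distribution scores about 0
    print(name, "skewness:", stats.skew(data),
          " excess kurtosis:", stats.kurtosis(data, fisher=True))
```

The normal sample should score near zero on both measures, while the exponential sample shows clearly positive skewness and excess kurtosis.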
Binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p. A success/failure experiment is also called a Bernoulli experiment or Bernoulli trial; when n = 1, the binomial distribution is a Bernoulli distribution.
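As a quick check of the definition, the sketch below evaluates the binomial pmf over every possible success count and confirms that a single trial reduces to a Bernoulli distribution. The values n = 10 and p = 0.4 are illustrative assumptions.

```python
import numpy as np
from scipy import stats

n_trials, p_success = 10, 0.4             # illustrative values
k = np.arange(n_trials + 1)               # possible success counts: 0..n
pmf = stats.binom.pmf(k, n_trials, p_success)

print("pmf sums to:", pmf.sum())
print("mean:", (k * pmf).sum(), " expected n*p:", n_trials * p_success)

# With a single trial the binomial reduces to a Bernoulli distribution
single_trial = stats.binom.pmf([0, 1], 1, p_success)
bernoulli = stats.bernoulli.pmf([0, 1], p_success)
print("binomial n=1:", single_trial, " bernoulli:", bernoulli)
```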
Exponential distribution: the probability distribution that describes the time between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate. It has the key property of being memoryless.
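Memorylessness means that having already waited s units of time does not change the distribution of the remaining wait: P(X > s + t | X > s) = P(X > t). The sketch below checks this numerically; the rate and the values of s and t are assumed for illustration.

```python
from scipy import stats

rate = 0.5                                  # events per unit time (assumed value)
waiting_time = stats.expon(scale=1 / rate)  # SciPy parameterizes by scale = 1 / rate

s, t = 2.0, 3.0
# sf is the survival function, P(X > x) = 1 - cdf(x)
p_conditional = waiting_time.sf(s + t) / waiting_time.sf(s)   # P(X > s + t | X > s)
p_fresh = waiting_time.sf(t)                                  # P(X > t)

print("P(X > s+t | X > s):", p_conditional)
print("P(X > t):          ", p_fresh)
```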
Uniform distribution: all values have the same frequency.
In [1]:
import pandas as pd
import seaborn as sns
sns.set(color_codes=True)
%matplotlib inline
In [3]:
#Import the data
cars = pd.read_csv("cars_v1.csv", encoding="ISO-8859-1")
In [6]:
#Replace missing values in Mileage with mean
cars["Mileage"] = cars["Mileage"].fillna(cars["Mileage"].mean())
In [7]:
#Histogram of mileage counts
sns.histplot(cars.Mileage)
Out[7]:
Question: If you randomly select a car, with an equal chance of selecting any make available in our dataset, what is the probability that its mileage will be greater than 25?
In [ ]:
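#Density histogram with one-unit bins and a kernel density estimate overlaid
sns.histplot(cars.Mileage, bins=list(range(0, 50, 1)), stat="density", kde=True)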
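One way to turn the question into a number is to count. The sketch below is a hedged illustration on a tiny made-up table, not the real data: only the `Mileage` column name comes from this notebook, while the `Make` column name and all values are hypothetical. It shows that "every car equally likely" and "every make equally likely" can give different answers.

```python
import pandas as pd

# Hypothetical stand-in for the cars data; only the Mileage column name comes from the notebook
toy_cars = pd.DataFrame({
    "Make": ["A", "A", "A", "B", "B", "C"],
    "Mileage": [18.0, 22.0, 27.0, 30.0, 24.0, 26.0],
})

# Every car equally likely: fraction of rows with mileage above 25
p_per_car = (toy_cars["Mileage"] > 25).mean()

# Every make equally likely: fraction within each make, then averaged across makes
p_per_make = (toy_cars["Mileage"] > 25).groupby(toy_cars["Make"]).mean().mean()

print("per car:", p_per_car, " per make:", p_per_make)
```

On the real dataset the same two expressions would apply to `cars`, assuming it has a column identifying the make.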
In [12]:
from scipy import stats
import scipy as sp
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
%matplotlib inline
In [13]:
#Generate random numbers that are normally distributed
random_normal = np.random.randn(100)
plt.scatter(range(100), random_normal)
Out[13]:
In [ ]:
print("mean:", random_normal.mean(), " variance:", random_normal.var())
In [15]:
#Create a normal distribution with mean 2.5 and standard deviation 1.7
n = stats.norm(loc=2.5, scale=1.7)
In [16]:
#Generate random number from that distribution
n.rvs()
Out[16]:
In [17]:
#for the above normal distribution, what is the pdf at 0.3?
n.pdf(0.3)
Out[17]:
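To see what `pdf` returns, the sketch below recomputes the density at 0.3 from the closed-form normal density and compares it with SciPy's value. It also shows `cdf` and `ppf`, which are not used in the cells above and are included only as related illustrations.

```python
import numpy as np
from scipy import stats

mu, sigma = 2.5, 1.7                      # same parameters as the notebook's distribution
dist = stats.norm(loc=mu, scale=sigma)

x = 0.3
# Density from the closed-form expression for the normal pdf
manual_pdf = np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

print("scipy pdf: ", dist.pdf(x))
print("manual pdf:", manual_pdf)
print("P(X <= 0.3):", dist.cdf(x))        # an actual probability, unlike the density
print("median:", dist.ppf(0.5))           # ppf inverts the cdf
```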
In [18]:
#Binomial distribution with `p` = 0.4 and number of trials as 15
In [19]:
#pmf(k, n, p): probability of k successes, for k = 0..15, in n = 15 trials
stats.binom.pmf(range(16), 15, 0.4)
Out[19]:
The standard error is a measure of how far the estimate is likely to be off, on average. More technically, it is the standard deviation of the sampling distribution of a statistic (most often the mean). Do not confuse it with the standard deviation: the standard deviation measures the variability of the observed quantity, whereas the standard error describes the variability of the estimate.
To illustrate this, let's do the following.
Not all makes and models are available in our dataset. Also, we had to impute some of the values.
Let's say that a leading automobile magazine did an extensive survey and printed that the mean mileage is 22.7.
Compute the standard deviation and the standard error of the mean for our dataset.
In [20]:
cars.head()
Out[20]:
In [23]:
#Mean and standard deviation of car's mileage
print(" Sample Mean:", cars.Mileage.mean(), "\n", "Sample Standard Deviation:", cars.Mileage.std())
In [28]:
print(" Max Mileage:", cars.Mileage.max(), "\n", "Min Mileage:", cars.Mileage.min())
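For the sample mean there is also a textbook shortcut: the standard error is the sample standard deviation divided by the square root of the sample size. The sketch below applies it to a small hypothetical array standing in for `cars.Mileage` and cross-checks the result against `scipy.stats.sem`.

```python
import numpy as np
from scipy import stats

# Hypothetical mileage values standing in for cars.Mileage
mileage = np.array([18.0, 22.5, 27.0, 30.0, 24.0, 26.5, 19.0, 21.0])

sample_std = mileage.std(ddof=1)                      # spread of individual cars
standard_error = sample_std / np.sqrt(mileage.size)   # spread of the sample mean

print("standard deviation:", sample_std)
print("standard error of the mean:", standard_error)
print("scipy.stats.sem:", stats.sem(mileage))         # same formula, ddof=1 by default
```

Note that NumPy's `std` defaults to `ddof=0`, while the pandas `std` used above defaults to `ddof=1`, so `ddof=1` is passed explicitly here.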
We'll follow the same procedure we used in resampling.ipynb. We will bootstrap samples from the actual observed data 10,000 times and compute the difference between each sample mean and the actual mean. The root mean squared error of those differences gives the standard error.
In [29]:
def squared_error(bootstrap_sample, actual_mean):
    #Squared difference between one bootstrap sample's mean and the reference mean
    return np.square(bootstrap_sample.mean() - actual_mean)

def experiment_for_computing_standard_error(observed_mileage, number_of_times, actual_mean):
    #One bootstrap sample per row, each the same size as the observed data
    bootstrap_sample = np.random.choice(observed_mileage, size=[number_of_times, observed_mileage.size], replace=True)
    #Apply squared_error to every row (axis 1 passes each row as a 1-D sample)
    bootstrap_squared_error = np.apply_along_axis(squared_error, 1, bootstrap_sample, actual_mean)
    #Root mean squared error across all bootstrap samples
    return np.sqrt(bootstrap_squared_error.mean())
In [30]:
#Standard error of the estimate for mean
experiment_for_computing_standard_error(np.array(cars.Mileage), 10000, 22.7)
Out[30]:
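A common alternative, different from the root mean squared error against an external reference value used above, is to take the standard deviation of the bootstrap means themselves. The sketch below does that on simulated data standing in for `cars.Mileage` (the simulated values, sample size, and seed are assumptions) and compares the result with the s / sqrt(n) formula.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical mileage values standing in for cars.Mileage
mileage = rng.normal(loc=22.0, scale=5.0, size=200)

n_bootstrap = 10_000
# Each row is one resample (with replacement) of the same size as the data
resamples = rng.choice(mileage, size=(n_bootstrap, mileage.size), replace=True)
bootstrap_means = resamples.mean(axis=1)

bootstrap_se = bootstrap_means.std(ddof=1)                   # spread of the resampled means
analytic_se = mileage.std(ddof=1) / np.sqrt(mileage.size)    # s / sqrt(n)

print("bootstrap standard error:", bootstrap_se)
print("analytic standard error: ", analytic_se)
```

The two numbers should agree closely. The root mean squared error around a reference such as 22.7 will be larger than either whenever the sample mean differs from that reference, because it also includes the squared gap between the two means.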