Copyright 2016 Allen B. Downey

MIT License: https://opensource.org/licenses/MIT

```
In [1]:
```from __future__ import print_function, division
%matplotlib inline
import numpy as np
import brfss
import thinkstats2
import thinkplot

```
In [2]:
```def RMSE(estimates, actual):
"""Computes the root mean squared error of a sequence of estimates.
estimate: sequence of numbers
actual: actual value
returns: float RMSE
"""
e2 = [(estimate-actual)**2 for estimate in estimates]
mse = np.mean(e2)
return np.sqrt(mse)

`n=7`

. We run `iters=1000`

experiments and collect the mean and median of each sample.

```
In [3]:
```import random
def Estimate1(n=7, iters=1000):
"""Evaluates RMSE of sample mean and median as estimators.
n: sample size
iters: number of iterations
"""
mu = 0
sigma = 1
means = []
medians = []
for _ in range(iters):
xs = [random.gauss(mu, sigma) for _ in range(n)]
xbar = np.mean(xs)
median = np.median(xs)
means.append(xbar)
medians.append(median)
print('Experiment 1')
print('rmse xbar', RMSE(means, mu))
print('rmse median', RMSE(medians, mu))
Estimate1()

```
```

Using $\bar{x}$ to estimate the mean works a little better than using the median; in the long run, it minimizes RMSE. But using the median is more robust in the presence of outliers or large errors.

The obvious way to estimate the variance of a population is to compute the variance of the sample, $S^2$, but that turns out to be a biased estimator; that is, in the long run, the average error doesn't converge to 0.

The following function computes the mean error for a collection of estimates.

```
In [4]:
```def MeanError(estimates, actual):
"""Computes the mean error of a sequence of estimates.
estimate: sequence of numbers
actual: actual value
returns: float mean error
"""
errors = [estimate-actual for estimate in estimates]
return np.mean(errors)

`n=7`

. We run `iters=1000`

experiments and two estimates for each sample, $S^2$ and $S_{n-1}^2$.

```
In [5]:
```def Estimate2(n=7, iters=1000):
mu = 0
sigma = 1
estimates1 = []
estimates2 = []
for _ in range(iters):
xs = [random.gauss(mu, sigma) for i in range(n)]
biased = np.var(xs)
unbiased = np.var(xs, ddof=1)
estimates1.append(biased)
estimates2.append(unbiased)
print('mean error biased', MeanError(estimates1, sigma**2))
print('mean error unbiased', MeanError(estimates2, sigma**2))
Estimate2()

```
```

`iters`

.

```
In [6]:
```def SimulateSample(mu=90, sigma=7.5, n=9, iters=1000):
xbars = []
for j in range(iters):
xs = np.random.normal(mu, sigma, n)
xbar = np.mean(xs)
xbars.append(xbar)
return xbars
xbars = SimulateSample()

```
In [7]:
```cdf = thinkstats2.Cdf(xbars)
thinkplot.Cdf(cdf)
thinkplot.Config(xlabel='Sample mean',
ylabel='CDF')

```
```

The mean of the sample means is close to the actual value of $\mu$.

```
In [8]:
```np.mean(xbars)

```
Out[8]:
```

```
In [9]:
```ci = cdf.Percentile(5), cdf.Percentile(95)
ci

```
Out[9]:
```

And the RMSE of the sample means is called the standard error.

```
In [10]:
```stderr = RMSE(xbars, 90)
stderr

```
Out[10]:
```

```
In [11]:
```def Estimate3(n=7, iters=1000):
lam = 2
means = []
medians = []
for _ in range(iters):
xs = np.random.exponential(1.0/lam, n)
L = 1 / np.mean(xs)
Lm = np.log(2) / thinkstats2.Median(xs)
means.append(L)
medians.append(Lm)
print('rmse L', RMSE(means, lam))
print('rmse Lm', RMSE(medians, lam))
print('mean error L', MeanError(means, lam))
print('mean error Lm', MeanError(medians, lam))
Estimate3()

```
```

The RMSE is smaller for the sample mean than for the sample median.

But neither estimator is unbiased.

**Exercise:** In this chapter we used $\bar{x}$ and median to estimate µ, and found that $\bar{x}$ yields lower MSE. Also, we used $S^2$ and $S_{n-1}^2$ to estimate σ, and found that $S^2$ is biased and $S_{n-1}^2$ unbiased.
Run similar experiments to see if $\bar{x}$ and median are biased estimates of µ. Also check whether $S^2$ or $S_{n-1}^2$ yields a lower MSE.

```
In [12]:
```# Solution
def Estimate4(n=7, iters=100000):
"""Mean error for xbar and median as estimators of population mean.
n: sample size
iters: number of iterations
"""
mu = 0
sigma = 1
means = []
medians = []
for _ in range(iters):
xs = [random.gauss(mu, sigma) for i in range(n)]
xbar = np.mean(xs)
median = np.median(xs)
means.append(xbar)
medians.append(median)
print('Experiment 1')
print('mean error xbar', MeanError(means, mu))
print('mean error median', MeanError(medians, mu))
Estimate4()

```
```

```
In [13]:
```# Solution
def Estimate5(n=7, iters=100000):
"""RMSE for biased and unbiased estimators of population variance.
n: sample size
iters: number of iterations
"""
mu = 0
sigma = 1
estimates1 = []
estimates2 = []
for _ in range(iters):
xs = [random.gauss(mu, sigma) for i in range(n)]
biased = np.var(xs)
unbiased = np.var(xs, ddof=1)
estimates1.append(biased)
estimates2.append(unbiased)
print('Experiment 2')
print('RMSE biased', RMSE(estimates1, sigma**2))
print('RMSE unbiased', RMSE(estimates2, sigma**2))
Estimate5()

```
```

```
In [14]:
```# Solution
# My conclusions:
# 1) xbar and median yield lower mean error as m increases, so neither
# one is obviously biased, as far as we can tell from the experiment.
# 2) The biased estimator of variance yields lower RMSE than the unbiased
# estimator, by about 10%. And the difference holds up as m increases.

**Exercise:** Suppose you draw a sample with size n=10 from an exponential distribution with λ=2. Simulate this experiment 1000 times and plot the sampling distribution of the estimate L. Compute the standard error of the estimate and the 90% confidence interval.

Repeat the experiment with a few different values of `n`

and make a plot of standard error versus `n`

.

```
In [15]:
```# Solution
def SimulateSample(lam=2, n=10, iters=1000):
"""Sampling distribution of L as an estimator of exponential parameter.
lam: parameter of an exponential distribution
n: sample size
iters: number of iterations
"""
def VertLine(x, y=1):
thinkplot.Plot([x, x], [0, y], color='0.8', linewidth=3)
estimates = []
for _ in range(iters):
xs = np.random.exponential(1.0/lam, n)
lamhat = 1.0 / np.mean(xs)
estimates.append(lamhat)
stderr = RMSE(estimates, lam)
print('standard error', stderr)
cdf = thinkstats2.Cdf(estimates)
ci = cdf.Percentile(5), cdf.Percentile(95)
print('confidence interval', ci)
VertLine(ci[0])
VertLine(ci[1])
# plot the CDF
thinkplot.Cdf(cdf)
thinkplot.Config(xlabel='estimate',
ylabel='CDF',
title='Sampling distribution')
return stderr
SimulateSample()

```
Out[15]:
```

```
In [16]:
```# Solution
# My conclusions:
# 1) With sample size 10:
# standard error 0.762510819389
# confidence interval (1.2674054394352277, 3.5377353792673705)
# 2) As sample size increases, standard error and the width of
# the CI decrease:
# 10 0.90 (1.3, 3.9)
# 100 0.21 (1.7, 2.4)
# 1000 0.06 (1.9, 2.1)
# All three confidence intervals contain the actual value, 2.

**Exercise:** In games like hockey and soccer, the time between goals is roughly exponential. So you could estimate a team’s goal-scoring rate by observing the number of goals they score in a game. This estimation process is a little different from sampling the time between goals, so let’s see how it works.

Write a function that takes a goal-scoring rate, `lam`

, in goals per game, and simulates a game by generating the time between goals until the total time exceeds 1 game, then returns the number of goals scored.

Write another function that simulates many games, stores the estimates of `lam`

, then computes their mean error and RMSE.

Is this way of making an estimate biased?

```
In [17]:
```def SimulateGame(lam):
"""Simulates a game and returns the estimated goal-scoring rate.
lam: actual goal scoring rate in goals per game
"""
goals = 0
t = 0
while True:
time_between_goals = random.expovariate(lam)
t += time_between_goals
if t > 1:
break
goals += 1
# estimated goal-scoring rate is the actual number of goals scored
L = goals
return L

```
In [18]:
```# Solution
# The following function simulates many games, then uses the
# number of goals scored as an estimate of the true long-term
# goal-scoring rate.
def Estimate6(lam=2, m=1000000):
estimates = []
for i in range(m):
L = SimulateGame(lam)
estimates.append(L)
print('Experiment 4')
print('rmse L', RMSE(estimates, lam))
print('mean error L', MeanError(estimates, lam))
pmf = thinkstats2.Pmf(estimates)
thinkplot.Hist(pmf)
thinkplot.Config(xlabel='Goals scored', ylabel='PMF')
Estimate6()

```
```

```
In [19]:
```# Solution
# My conclusions:
# 1) RMSE for this way of estimating lambda is 1.4
# 2) The mean error is small and decreases with m, so this estimator
# appears to be unbiased.
# One note: If the time between goals is exponential, the distribution
# of goals scored in a game is Poisson.
# See https://en.wikipedia.org/wiki/Poisson_distribution

```
In [ ]:
```