Q3

In this question, you'll go over some of the core terms and concepts in statistics.

Part A

Write a function, variance, which computes the variance of a list of numbers.

The function takes one argument: a list or 1D NumPy array of numbers. It returns one floating-point number: the variance of all the numbers.

Recall the formula for variance:

$$ \text{variance} = \frac{1}{N - 1} \sum_{i = 1}^{N} (x_i - \mu_x)^2 $$

where $N$ is the number of elements in your list, $x_i$ is the $i$-th number in the list, and $\mu_x$ is the mean of all the $x$ values. Note the $N - 1$ denominator: this is the sample variance, which is what the tests below check via ddof = 1.
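
As a quick check of the formula, here is a worked example on a small made-up list, $x = [1, 2, 3, 4, 5]$ (these numbers are chosen only for illustration and are not part of the assignment):

$$ \mu_x = \frac{1 + 2 + 3 + 4 + 5}{5} = 3 $$

$$ \text{variance} = \frac{(1 - 3)^2 + (2 - 3)^2 + (3 - 3)^2 + (4 - 3)^2 + (5 - 3)^2}{5 - 1} = \frac{4 + 1 + 0 + 1 + 4}{4} = 2.5 $$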

You can use numpy.array and numpy.mean, but no other NumPy functions and no built-in Python functions other than range().
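
One way to stay within these restrictions is to read the number of elements from the array's shape attribute instead of calling len(). The sketch below is only one possible approach, not the reference solution; the name variance_sketch is hypothetical, chosen so that it does not clash with the variance function you are asked to write.

```python
import numpy as np

def variance_sketch(numbers):
    """Sample variance (N - 1 denominator) of a list or 1D array."""
    x = np.array(numbers)      # accept plain lists as well as arrays
    n = x.shape[0]             # element count, read without calling len()
    mu = np.mean(x)            # average of all the values
    total = 0.0
    for i in range(n):         # accumulate squared deviations from the mean
        total += (x[i] - mu) ** 2
    return total / (n - 1)

print(variance_sketch([1, 2, 3, 4, 5]))  # prints 2.5
```

The explicit loop is slower than a vectorized expression, but it uses only the pieces the question allows.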


In [ ]:


In [ ]:
import numpy as np

np.random.seed(5987968)
x = np.random.random(8491)
v = x.var(ddof = 1)
np.testing.assert_allclose(v, variance(x))

In [ ]:
np.random.seed(4159)
y = np.random.random(25)
w = y.var(ddof = 1)
np.testing.assert_allclose(w, variance(y))

Part B

The lecture on statistics mentions latent variables, specifically how you cannot know what the underlying process is that's generating your data; all you have is the data, on which you have to impose certain assumptions in order to derive hypotheses about what generated the data in the first place.

To illustrate this, the code provided below generates sample data from distributions whose mean and variance would typically not be known to you. Put another way, pretend you cannot see the mean (loc) and standard deviation (scale, the square root of the variance) in the code that generates these samples; usually all you can see are the data samples themselves.
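
To see how loc and scale relate to the statistics of the generated data, it can help to draw one large sample and compare. In numpy.random.normal, scale is the standard deviation, so the variance of the generating distribution is scale squared. The minimal sketch below uses made-up parameters (loc = 0, scale = 2, an arbitrary seed, and the name demo), none of which come from the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
demo = rng.normal(loc=0.0, scale=2.0, size=100_000)

# scale is the standard deviation, so the true variance is scale ** 2 = 4
print("sample mean:    ", demo.mean())        # should land close to loc = 0
print("sample variance:", demo.var(ddof=1))   # should land close to 4
```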

You'll use numpy.mean and the variance function you wrote in Part A to compute these statistics on the sample data itself and observe how they change with sample size.

In the space provided, compute and print the mean and variance of each of the three samples:

  • sample1
  • sample2
  • sample3

You can just print() them out in the space provided. Don't modify anything above the "DON'T MODIFY" block.


In [ ]:
import numpy as np
np.random.seed(5735636)

sample1 = np.random.normal(loc = 10, scale = 5, size = 10)
sample2 = np.random.normal(loc = 10, scale = 5, size = 1000)
sample3 = np.random.normal(loc = 10, scale = 5, size = 1000000)

#########################
# DON'T MODIFY ANYTHING #
#   ABOVE THIS BLOCK    #
#########################

### BEGIN SOLUTION

### END SOLUTION

Part C

Since you don't usually know the true mean and variance of the process that presumably generated your data, the mean and variance you compute yourself are estimates of the true values. Explain what you saw in the estimates you computed above as they relate to the number of samples. What implications does this have for computing statistics as part of real-world analyses?
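
If you want to go beyond a single draw per sample size, one option is to repeat the experiment many times and measure how much the estimated mean moves from one sample to the next. The sketch below does this under assumed settings; the seed, the sizes, the number of repeats, and the distribution parameters are all hypothetical choices for illustration, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed
sizes = [10, 100, 10_000]        # hypothetical sample sizes
repeats = 200                    # independent samples drawn per size

spread = {}
for n in sizes:
    # Draw `repeats` samples of size n and record the mean of each one.
    means = rng.normal(loc=0.0, scale=2.0, size=(repeats, n)).mean(axis=1)
    # Standard deviation of those means: how much the estimate varies
    # from one sample to the next at this sample size.
    spread[n] = means.std(ddof=1)
    print(n, spread[n])
```

For independent draws like these, the spread of the estimated mean is expected to shrink roughly in proportion to $1 / \sqrt{N}$, so the printed values should decrease as the sample size grows.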