Q1

In this question, you'll go over some of the core terms and concepts in statistics.

A

Write a function, mean, which computes the mean of a list of numbers.

The function takes one argument: a list or 1D NumPy array of numbers. It returns one floating-point number: the average value.

You can use numpy.array but no other NumPy functions or built-in Python functions.
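As a rough illustration of how the restriction above can be satisfied, here is a minimal sketch that avoids built-ins such as `sum` and `len` by keeping a running total and a counter in a single loop. The variable names are illustrative, and this is only one possible approach, not necessarily the intended solution.

```python
def mean(numbers):
    """Return the average of a list or 1D array of numbers.

    Sketch only: assumes `numbers` contains at least one element;
    an empty input would raise ZeroDivisionError.
    """
    total = 0.0   # running sum, started as a float so the result is a float
    count = 0     # number of elements seen so far
    for value in numbers:
        total += value
        count += 1
    return total / count
```

For example, `mean([1, 2, 3, 4])` adds up to 10 over 4 elements and returns 2.5.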


In [ ]:


In [ ]:
try:
    mean
except NameError:
    # Fail with a readable message if the function was never defined.
    assert False, "mean is not defined"
else:
    assert True

In [ ]:
import numpy as np

np.random.seed(2342348)
x = np.random.random(100)
np.testing.assert_allclose(0.51465810266723755, mean(x))

np.random.seed(5825)
y = np.random.random(1000)
np.testing.assert_allclose(0.50133630983357202, mean(y))

B

Write a function, variance, which computes the variance of a list of numbers.

The function takes one argument: a list or 1D NumPy array of numbers. It returns one floating-point number: the variance of all the numbers.

Recall the formula for the sample variance:

$$ \text{variance} = \frac{1}{N - 1} \sum_{i = 1}^{N} (x_i - \mu_x)^2 $$

where $N$ is the number of numbers in your list, $x_i$ is the number at index $i$ in the list, and $\mu_x$ is the average value of all the $x$ values.

You can use numpy.array and your mean function from Part A, but no other NumPy functions or built-in Python functions.
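To make the formula concrete, here is a small worked example on a made-up list (the values 2, 4, and 9 are purely illustrative and do not come from the assignment). With $N = 3$:

$$ \mu_x = \frac{2 + 4 + 9}{3} = 5 $$

$$ \frac{1}{N - 1} \sum_{i = 1}^{N} (x_i - \mu_x)^2 = \frac{(2 - 5)^2 + (4 - 5)^2 + (9 - 5)^2}{3 - 1} = \frac{9 + 1 + 16}{2} = 13 $$

Note that the sum of squared deviations is divided by $N - 1$, not $N$.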


In [ ]:


In [ ]:
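try:
    variance
except NameError:
    # Fail with a readable message if the function was never defined.
    assert False, "variance is not defined"
else:
    assert True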

In [ ]:
import numpy as np

np.random.seed(5987968)
x = np.random.random(8491)
v = x.var(ddof = 1)
np.testing.assert_allclose(v, variance(x))

np.random.seed(4159)
y = np.random.random(25)
w = y.var(ddof = 1)
np.testing.assert_allclose(w, variance(y))

C

The lecture on statistics mentions latent variables: you cannot directly observe the underlying process that generates your data. All you have is the data itself, on which you must impose certain assumptions in order to derive hypotheses about what generated it in the first place.

To illustrate this, the code provided below generates sample data from distributions whose mean and variance would normally be unknown to you. You'll use the functions you wrote in Parts A and B to compute the statistics on the sample data itself and observe how these statistics change as the sample size grows.

In the space provided, compute and print the mean and variance of each of the three samples:

  • sample1
  • sample2
  • sample3

Use the functions you wrote in Parts A and B.
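The general pattern of computing both statistics for samples of increasing size might look like the sketch below. It is self-contained, so it uses the standard library's `statistics.mean` and `statistics.variance` (which divides by $N - 1$) as stand-ins for your own functions, and `random.gauss` instead of NumPy. The seed, the helper `draw_sample`, and the sample sizes are assumptions chosen for illustration, so the printed values will not match those from the cell provided in the assignment.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, used only to make the sketch reproducible

TRUE_MEAN = 10.0  # same location as in the assignment's distributions
TRUE_STD = 5.0    # same scale as in the assignment's distributions


def draw_sample(size):
    """Hypothetical helper: draw `size` values from a normal distribution."""
    return [random.gauss(TRUE_MEAN, TRUE_STD) for _ in range(size)]


# Illustrative sizes; the largest is smaller than the assignment's to keep this quick.
estimates = {}
for size in (10, 1000, 100000):
    sample = draw_sample(size)
    estimates[size] = (statistics.mean(sample), statistics.variance(sample))
    print(size, estimates[size])
```

In your own solution, replace the stand-ins with calls to `mean` and `variance` on `sample1`, `sample2`, and `sample3`.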


In [ ]:
import numpy as np
np.random.seed(5735636)

sample1 = np.random.normal(loc = 10, scale = 5, size = 10)
sample2 = np.random.normal(loc = 10, scale = 5, size = 1000)
sample3 = np.random.normal(loc = 10, scale = 5, size = 1000000)

#########################
# DON'T MODIFY ANYTHING #
#   ABOVE THIS BLOCK    #
#########################

### BEGIN SOLUTION

### END SOLUTION

D

Since you don't usually know the true mean and variance of the process that presumably generated your data, the mean and variance you compute yourself are estimates of the true mean and variance. Explain what you saw in the estimates you computed above as they relate to the size of each sample. What implications does this have for computing statistics as part of real-world analyses?