Unit 2: Programming Design

Lesson 8: Statistics - Pre-activity

Scientific Context: Describing Data (statistics)

In any scientific investigation, it all starts with the data. You may call them a set of results, or a sample, or events, but whatever the name, they consist of a set of basic measurements from which you are trying to extract meaningful information. Statistics is a tool used to describe data in a meaningful way. In this lesson, various descriptive statistics and calculations are introduced that are used to analyze and interpret the results of an experiment.

For convenience, we will use the equations for variance that apply to what we refer to as a "finite population". This means that we have a small number of observations and they represent our whole population, rather than a sample from a much larger population that we then use to estimate the variance of that larger population. For samples from a larger population, the leading factor in Equation 2 below would have to be $\frac{1}{N-1}$ instead of $\frac{1}{N}$, and Equations 3 and 4 would change accordingly. If you are in biology or neuroscience, you will be more familiar with the $\frac{1}{N-1}$ versions, since we are nearly always using a random(ish) sample to estimate population parameters and the $\frac{1}{N-1}$ versions give a more conservative (hopefully less biased) estimate of the variance. Two contexts where the $\frac{1}{N}$ versions are used are when the precision of a set of measurements is being estimated, as in the Pre-Activity Questions below, and when you want the variance of a set of class grades but are not going to make any inference about them or compare them to any other class. If you have never taken statistics, you don't need to worry about this now; just use the equations given.
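To make the $\frac{1}{N}$ versus $\frac{1}{N-1}$ distinction concrete, here is a minimal sketch using Python's standard `statistics` module. The data values are hypothetical, chosen only so the arithmetic is easy to check by hand; they are not the temperature readings from the questions below.

```python
import statistics

# Hypothetical measurements (an assumption for illustration only).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Finite-population variance: summed squared deviations divided by N.
population_variance = statistics.pvariance(data)

# Sample variance: the same sum divided by N - 1 instead.
sample_variance = statistics.variance(data)

print(f"1/N version:     {population_variance:.3f}")  # 4.000
print(f"1/(N-1) version: {sample_variance:.3f}")      # 4.571
```

The $\frac{1}{N-1}$ version is always the larger of the two, and the gap shrinks as $N$ grows.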

Descriptive Statistics: Mean (Average), Variance, and Standard Deviation

If you want to describe your data with just a single number, the best and most meaningful summary statistic for many data sets is the arithmetic mean. The mean ($\bar{x}$, pronounced "x bar") of a set of $N$ values is defined as:

$$\bar{x} = \frac{1}{N}\sum_{i=1}^N x_i = \frac{x_1+x_2+\cdots +x_N}{N} \hspace{25 mm} (1)$$
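Equation 1 translates almost directly into code. The sketch below is one possible way to write it; the function name `mean` and the list `readings` are illustrative choices, and the values are hypothetical.

```python
def mean(values):
    """Arithmetic mean (Equation 1): the sum of the values divided by N."""
    n = len(values)
    if n == 0:
        raise ValueError("mean requires at least one value")
    return sum(values) / n


# Hypothetical readings used only for illustration.
readings = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(mean(readings))  # 5.0
```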

In order to assess the precision of your data, however, we need a number to express the spread (or dispersion) of the data about the mean. For a finite population (see note above), this is given by the variance of x:

$$V(x)=\frac{1}{N}\sum_{i=1}^N(x_i-\overline{x})^2 \hspace{35 mm}(2)$$

which measures how much x varies about its mean value: it is the average of the squared deviations from the mean. The disadvantage of the above equation is that we must know the mean before we can begin summing the squared deviations, so the data must be passed over twice. It can be shown, however, that Equation 2 is equivalent to the mean of the squares minus the square of the mean:

$$V(x)=\frac{1}{N}\sum_{i=1}^Nx_i^2 -\left(\frac{1}{N}\sum_{i=1}^Nx_i\right)^2 \hspace{25 mm}(3)$$

or: $$V(x) = \overline{x^2} - \overline{x}^2 \hspace{50 mm}(4)$$
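For readers who want to see why Equation 2 reduces to this form, the steps can be sketched as follows. Expand the square inside the sum, then use the fact that $\frac{1}{N}\sum_{i=1}^N x_i = \overline{x}$ and that $\overline{x}$ is a constant with respect to the sum:

$$\begin{aligned}
V(x) &= \frac{1}{N}\sum_{i=1}^N (x_i-\overline{x})^2 \\
&= \frac{1}{N}\sum_{i=1}^N x_i^2 - 2\,\overline{x}\,\frac{1}{N}\sum_{i=1}^N x_i + \frac{1}{N}\sum_{i=1}^N \overline{x}^2 \\
&= \overline{x^2} - 2\,\overline{x}^2 + \overline{x}^2 \\
&= \overline{x^2} - \overline{x}^2
\end{aligned}$$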

This relationship allows one to calculate the variance and the mean simultaneously. The root mean squared deviation is called the standard deviation and is given the symbol σ. It is simply the square root of the variance: $$\sigma = \sqrt{V(x)} = \sqrt{\overline{x^2} - \overline{x}^2} \hspace{35 mm}(5)$$
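As a rough illustration of that idea, the sketch below computes the variance both ways: a two-pass version following Equation 2, and a single-loop version following Equations 3 to 5 that accumulates the sum and the sum of squares together. The function names and the data are hypothetical and are not the temperature readings from the questions below.

```python
import math


def two_pass_variance(values):
    """Equation 2: find the mean first, then average the squared deviations."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def one_pass_stats(values):
    """Equations 3-5: accumulate the sum and sum of squares in a single loop."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    if n == 0:
        raise ValueError("need at least one value")
    mean = total / n
    # Mean of the squares minus the square of the mean (Equation 4).
    # Rounding can push this slightly below zero, so clamp it at 0.
    variance = max(total_sq / n - mean ** 2, 0.0)
    return mean, variance, math.sqrt(variance)


# Hypothetical measurements (an assumption for illustration only).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean, variance, sigma = one_pass_stats(data)
print(mean, variance, sigma)    # 5.0 4.0 2.0
print(two_pass_variance(data))  # 4.0
```

One caution: when the spread is tiny compared with the mean, Equation 4 subtracts two nearly equal large numbers, so in floating-point arithmetic the one-pass result can lose precision relative to the two-pass result.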

Pre-Activity Questions

The temperature of a solution was recorded at 1-minute intervals for 10 minutes, resulting in the following values (in Kelvin): [305.158, 308.644, 304.479, 303.238, 316.673, 297.346, 310.036, 302.258, 305.534, 311.674]. Using this temperature data:

1. illustrate that Equation 2 and Equation 3 give the same value for the variance.

2. identify the mean and standard deviation corresponding to these temperature readings.