Importance of valid sampling

By Evgenia "Jenny" Nitishinskaya and Delaney Granizo-Mackenzie

Notebook released under the Creative Commons Attribution 4.0 License.


In order to evaluate a population based on sample observations, the sample must be unbiased. Otherwise, it is not representative of the population, and conclusions drawn from it will be invalid. For example, we always take into account the number of samples we have, and expect to have more accurate results if we have more observations. Here we will discuss four other types of sampling bias.

Data-mining bias

Data mining refers to testing a set of data for the presense of different patterns, and can lead to bias if used excessively. Because our analyses are always probabilistic, we can always try enough things that one will appear to work. For instance, if we test 100 different variables for correlation with a dataset using a 5% significance level, we expect to find 5 that are significantly correlated with the data just by random chance. Below we test this for random variables and a random dataset. The result will be different each time, so try rerunning the cell!


In [11]:
import numpy as np
from scipy.stats import pearsonr

# Generate completely random numbers
randos = [np.random.rand(100) for i in range(100)]
y = np.random.rand(100)

# Compute correlation coefficients (Pearson r) and record their p-values (2nd value returned by pearsonr)
ps = [pearsonr(x,y)[1] for x in randos]

# Print the p-values of the significant correlations, i.e. those that are less than .05
print [p for p in ps if p < .05]


[0.021568677927077139, 0.014660970914072665, 0.019467620379224219, 0.021908016364894763, 0.03012168996281734, 0.014491673783850792]

Above we data-mined by hand. There is also intergeneratinal data mining, which is using previous results about the same dataset you are investigating or a highly related one, which can also lead to bias.

The problem here is that there is no reason to believe that the pattern we found will continue; for instance, if we continue to generate random numbers above, they will not continue to be correlated. Meaningless correlations can arise between datasets by coincidence. This is similar to the problem of overfitting, where a model is contorted to fit historical data perfectly but then fails out of sample. It is important to perform such an out-of-sample test (that is, using data not overlapping with that which was examined when creating the model) in order to check for data-mining bias.

Sample selection bias

Bias resulting from data availability is called sample selection bias. Sometimes it is impossible to avoid, but awareness of the phenomenon can help avoid incorrect conclusions. Survivorship bias occurs when securities dropped from databases are not taken into account. This causes a bias because future analyses then do not take into account, for example, stocks of businesses that went bankrupt. However, businesses whose stock you buy now may very well go bankrupt in the future, so it is important to incorporate the behavior of such stocks into your model.

Look-ahead bias

Look-ahead bias occurs when attempting to analyze from the perspective of a day in the past and using information that was not available on that day. For instance, fundamentals data may not be reported in real time. Models subject to look-ahead bias cannot be used in practice since they would require information from the future.

Time-period bias

The choice of sample period affects results, and a model or analysis may not generalize to future periods. This is known as time-period bias. If we use only a short time period, we risk putting a lot of weight on a local phenomenon. However, if we use a long time period, we may include data from a prior regime that is no longer relevant.