Overview

Experiments provide a way to assess hypotheses. They are a set of methods for collecting data that enable us to draw valid conclusions. Because they allow specific factors to be deliberately altered, they can provide insight into causal relationships between those factors and particular outcomes.

To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of (Fisher, 1938).

The statistical analyses that are appropriate for a particular set of data, or that are needed to answer a specific question, affect the design decisions of an experiment.

Causality

In many cases, albeit for different reasons, it may be important to make inferences about the effects of particular interventions. For example, we may be interested in understanding the efficacy of medical treatments, the effects of education programs on student outcomes, or the impacts of web products or features on particular business metrics. There are several ways in which we could make inferences.

One approach is to rely on intuition or anecdotes. However, this is problematic: examples supporting almost any view are often readily available and, worse yet, because anecdotes are open to interpretation, they can be made to favor a particular argument.

Another approach is to gather available data on factors that are believed to influence particular outcomes and to analyze their relationships. However, these relationships may not (and often cannot) imply anything about causality. This could be due, for example, to factors that are correlated with both the factors under study and the outcome but are unobserved, unmeasured, or measured incorrectly.

A common example in labor economics is the study of "returns to education," which tries to measure the impact that additional schooling has on wages. Common unobserved factors in this context are ability and motivation. It's not controversial to think that the reasons why particular individuals choose (or have the opportunity) to get additional schooling could also relate to or affect future wages. The problem with unobserved factors is that we may not be able to enumerate them all, let alone obtain data on them.
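The returns-to-education problem can be illustrated with a small simulation. This is only a sketch under invented assumptions: the "true" return (`TRUE_RETURN`), the way ability drives schooling, and the wage equation are all hypothetical numbers chosen for illustration, not estimates from any study. It compares a naive regression on observational data with the same regression when schooling is assigned at random.

```python
import random
import statistics

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    return cov / var

rng = random.Random(0)
TRUE_RETURN = 0.08  # hypothetical effect of one extra year of schooling on log wage
n = 20_000

# Ability is the unobserved factor: it is never used in the regressions below.
ability = [rng.gauss(0, 1) for _ in range(n)]

# Observational world (assumed): more able people tend to get more schooling.
school_obs = [12 + 2 * a + rng.gauss(0, 1) for a in ability]
# Experimental world: schooling is assigned at random, independent of ability.
school_exp = [rng.choice([10, 12, 14, 16]) for _ in range(n)]

def log_wage(schooling, ability_i):
    # Hypothetical wage equation: both schooling and ability raise wages.
    return 1.5 + TRUE_RETURN * schooling + 0.3 * ability_i + rng.gauss(0, 0.2)

wage_obs = [log_wage(s, a) for s, a in zip(school_obs, ability)]
wage_exp = [log_wage(s, a) for s, a in zip(school_exp, ability)]

naive = ols_slope(school_obs, wage_obs)       # absorbs part of the ability effect
randomized = ols_slope(school_exp, wage_exp)  # close to TRUE_RETURN

print(f"true return:        {TRUE_RETURN:.3f}")
print(f"observational data: {naive:.3f}")
print(f"randomized data:    {randomized:.3f}")
```

Under these assumptions the observational slope overstates the return, because schooling partly stands in for ability, while the randomized slope lands near the value built into the simulation.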

The approach that gives us the best ability to assess causal claims and, thus, to be able to make better inferences, is experimentation.

Experiments

Among other things, experiments allow us to make direct comparisons between treatments of interest.

An experiment is characterized by the treatments and experimental units to be used, the way treatments are assigned to units, and the responses that are measured (Oehlert, 2010).

A treatment is a condition (or conditions) we are interested in assessing. The experimental units are the entities to which the treatments are assigned. The responses are the observed outcomes on the experimental units after applying the treatments.

Treatments

A treatment is a condition or intervention that is thought to have some effect on the units to which it is applied. The goal of many research projects is to identify, quantify, and make inferences about these effects.

In experiments, treatments are assigned to units. In observational studies, on the other hand, which also have treatments, units, and responses, treatment assignments are not controlled by the researcher. Rather, researchers choose the units to observe. This is a subtle, yet important distinction.

Imagine a study of a medical treatment being conducted at a hospital, where the treatment is only administered to patients with the greatest need. In this case, researchers are choosing the populations of units to compare; they are not assigning a treatment to individual units. Therefore, this is an example of an observational study.
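A rough simulation of the hospital example shows why the distinction matters. Everything here is an assumption made for illustration: the health-score scale, the size of the benefit (`TRUE_EFFECT`), and the rule that only patients above a need threshold are treated. The sketch compares the difference in mean outcomes when treatment follows need with the difference when treatment is assigned by a coin flip.

```python
import random
import statistics

rng = random.Random(1)
TRUE_EFFECT = 5.0  # hypothetical benefit of the treatment on a health score
n = 10_000
need = [rng.random() for _ in range(n)]  # 0 = low need, 1 = high need

def outcome(need_i, treated):
    # Assumed model: higher need lowers the baseline score;
    # treatment adds TRUE_EFFECT on top of that baseline.
    return 70 - 30 * need_i + TRUE_EFFECT * treated + rng.gauss(0, 5)

def diff_in_means(treated_flags):
    """Mean outcome of treated patients minus mean outcome of untreated ones."""
    y = [outcome(nd, t) for nd, t in zip(need, treated_flags)]
    treated = [yi for yi, t in zip(y, treated_flags) if t]
    control = [yi for yi, t in zip(y, treated_flags) if not t]
    return statistics.fmean(treated) - statistics.fmean(control)

# Observational study: only the highest-need patients receive the treatment.
by_need = [nd > 0.8 for nd in need]
# Experiment: each patient is assigned to treatment with probability 1/2.
by_coin = [rng.random() < 0.5 for _ in range(n)]

observational_estimate = diff_in_means(by_need)
experimental_estimate = diff_in_means(by_coin)

print(f"need-based assignment: {observational_estimate:+.2f}")
print(f"random assignment:     {experimental_estimate:+.2f}")
```

In this sketch the need-based comparison makes a beneficial treatment look harmful, because the treated group started out sicker. Random assignment balances need across the two groups on average, so the comparison recovers something close to the built-in effect.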

Units

The experimental units are the objects that are assigned a treatment. These can, for example, be individuals or groups of individuals. The experimental units should be representative of the population of interest that is being studied. That is, they should "resemble the actors who ordinarily encounter [the] interventions" (Gerber and Green, 2012).

The experimental units aren't always the entities on which the outcomes of interest are measured, though. Because of this, we introduce the term measurement unit. Consider an experiment where a teaching method is applied to classrooms. Because the classrooms are the objects to which treatments are assigned, they are the experimental units. The outcomes of interest (e.g., test scores), on the other hand, are measured for individual students. The students are the measurement units.
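The classroom example can be laid out as data to make the two kinds of units concrete. The numbers below (8 classrooms, 25 students each, a 4-point effect, the noise levels) are invented for illustration. The sketch assigns treatments to classrooms, records scores for students, and then summarizes to one value per classroom before comparing arms, which is one common way (not the only one) to keep the analysis aligned with the unit of assignment.

```python
import random
import statistics

rng = random.Random(2)

# Experimental units: classrooms. Treatments are assigned at this level.
classrooms = [f"class_{i}" for i in range(8)]
shuffled = rng.sample(classrooms, k=len(classrooms))
assignment = {c: ("new" if i < 4 else "standard") for i, c in enumerate(shuffled)}

# Measurement units: students. Each classroom has 25 hypothetical test scores.
scores = {}
for c in classrooms:
    class_effect = rng.gauss(0, 3)             # shared by everyone in the room
    bump = 4 if assignment[c] == "new" else 0  # assumed treatment effect
    scores[c] = [70 + bump + class_effect + rng.gauss(0, 8) for _ in range(25)]

# Summarize to one number per experimental unit, then compare the two arms.
class_means = {c: statistics.fmean(s) for c, s in scores.items()}
new = [m for c, m in class_means.items() if assignment[c] == "new"]
standard = [m for c, m in class_means.items() if assignment[c] == "standard"]
estimate = statistics.fmean(new) - statistics.fmean(standard)

n_experimental_units = len(class_means)
n_measurement_units = sum(len(s) for s in scores.values())

print(f"experimental units (classrooms): {n_experimental_units}")
print(f"measurement units (students):    {n_measurement_units}")
print(f"difference in classroom means:   {estimate:+.2f}")
```

Although 200 scores are recorded, only 8 units were independently assigned a treatment. Students in the same room share a classroom-level effect, so treating them as 200 independent observations would tend to overstate the precision of the comparison.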

Responses

The response is the observed outcome that is used to assess the effect of a treatment (or treatments). There are two concepts here—the idea of which outcomes to measure and the idea of how to represent the data associated with those outcomes. These choices will determine both the types of analyses that are possible and the efficacy of those analyses.

Responses can be measured—that is, represented—in several ways.

Nominal
Ordinal
Interval
Ratio

Measurement choices can determine the methods that are appropriate for a given experiment. In addition, they dictate the types of transformations that can be made on the outcomes data.
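A short sketch can show how the scale limits which transformations and summaries are meaningful. The data values are made up, and the pound conversion factor is approximate. The checks cover three cases: an order-preserving recoding of ordinal codes, a change of units on an interval scale (Celsius to Fahrenheit), and a change of units on a ratio scale (kilograms to pounds).

```python
import statistics

# Ordinal: only the ordering carries meaning, so any order-preserving
# recoding is allowed. The median follows the recoding; the mean does not.
ratings = [1, 2, 2, 3, 5]            # hypothetical satisfaction codes
recoded = [x ** 2 for x in ratings]  # monotone recoding of the same categories
median_survives = statistics.median(recoded) == statistics.median(ratings) ** 2
mean_survives = statistics.fmean(recoded) == statistics.fmean(ratings) ** 2

# Interval: differences are meaningful, but the zero point is arbitrary,
# so ratios change when the units change.
def to_fahrenheit(c):
    return c * 9 / 5 + 32

celsius = (10.0, 20.0)
fahrenheit = tuple(to_fahrenheit(c) for c in celsius)
ratio_c = celsius[1] / celsius[0]
ratio_f = fahrenheit[1] / fahrenheit[0]

# Ratio: there is a true zero, so ratios are preserved under a change of units.
kilograms = (40.0, 80.0)
pounds = tuple(k * 2.20462 for k in kilograms)  # approximate conversion factor
ratio_kg = kilograms[1] / kilograms[0]
ratio_lb = pounds[1] / pounds[0]

print("ordinal: median survives recoding:", median_survives)
print("ordinal: mean survives recoding:  ", mean_survives)
print(f"interval: ratio in C = {ratio_c:.2f}, in F = {ratio_f:.2f}")
print(f"ratio:    ratio in kg = {ratio_kg:.2f}, in lb = {ratio_lb:.2f}")
```

Saying that 20 degrees Celsius is "twice as warm" as 10 degrees does not survive a change of units, whereas saying that 80 kg is twice as heavy as 40 kg does. This is the sense in which the measurement scale constrains both the transformations and the summaries that make sense for a response.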

A response's quality can be judged in two ways. The reliability of a response is the degree to which similar measurements would result if the experiment were repeated under exactly the same conditions. Validity, on the other hand, refers to the degree to which the response is relevant for the purposes of the particular study.

Responses should "resemble the actual outcomes of theoretical or practical interest" (Gerber and Green, 2012).

Experimental design involves making decisions about all of these components (treatments, units, and responses), including how to assign treatments to units. Notice that while analysis methods are not explicitly mentioned, analyses should be considered when designing and planning an experiment. In general, though not always, the experimental design and the statistical model assumptions determine the proper analysis. Good designs avoid systematic error, are precise, allow for the estimation of error, and have broad validity (Oehlert, 2010).


