Design of experiments: (Wikipedia)
The design of experiments is the design of any task that aims to describe or explain the variation of information under conditions that are hypothesized to reflect the variation.
Let's see an example of a design of experiments table:
TC | $X_1$ | $X_2$ | $Y$ |
---|---|---|---|
1 | +1 | +1 | $y_1$ |
2 | +1 | -1 | $y_2$ |
3 | -1 | +1 | $y_3$ |
4 | -1 | -1 | $y_4$ |
This table shows a 2 factor, 2 level design of experiments that has 4 treatment conditions.
What is the meaning of the +1, -1? We do design of experiments to see if factors affect something. For example, our response might be the concentration of a chemical species and factors could be temperature and pressure. Because there are many temperatures to test, we might just only consider two temperature: hot and cold. This can be coded as levels: -1, +1. This is often done because there are standard analysis equations that would with integer levels, especially with two levels.
If we regress against these integer levels, the regression coefficients aren't really meaningful. Instead, we care about p-values. That is, we care about discovering if certain factors affect our response. This will allow to say "temperature affects the concentration" or "pressure does not affect concentration".
Note that our experimental design doesn't include replicates. The design of experiments is meant to be as efficient as possible. Note that here we're trying to see what matters, and not trying to get an accurate regression model. If you want to do regression for accuracy, then you should include replicates and work with actual factor values instead of levels.
We saw in unit 12, lecture 3 how to treat discrete data like this. Let's try regressing it! The data is 2 dimensional, so we will use 2 dimensional least squares. Should we include an intercept? Yes! One way to include is it to compute the grand mean from all responses so that they are centered at 0. Then the intercept will be 0. You should know this is commonly done, but we won't do this for our analysis. We'll just use a regular intercept as we saw in our regression unit.
In [26]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy.stats as ss
import numpy.linalg as linalg
x1 = [1, 1, -1, -1]
x2 = [1, -1, 1, -1]
y = [1.2, 3.2, 4.1, 3.6]
We'll use multidimensional ordinary least squares with an intercept:
In [36]:
x_mat = np.column_stack((np.ones(4), x1, x2))
x_mat
Out[36]:
We'll compute our coefficients and their standard error
In [44]:
beta, *_ = linalg.lstsq(x_mat, y)
y_hat = x_mat @ beta
resids = (y - y_hat)
SSR = np.sum(resids**2)
se2_epsilon = SSR / (len(x) - len(beta))
se2_beta = se2_epsilon * linalg.inv(x_mat.transpose() @ x_mat)
print(np.sqrt(se2_beta), np.sqrt(se2_epsilon))
Now we can compute p-values and confidence intervals:
In [47]:
df = len(x) - len(beta)
print('df = ', df)
for i in range(len(beta)):
#get our T-value for the confidence interval
T = ss.t.ppf(0.975, df)
# Get the width of the confidence interval using our previously computed standard error
cwidth = T * np.sqrt(se2_beta[i,i])
# print the result, using 2 - i to match our numbering above
hypothesis_T = -abs(beta[i]) / np.sqrt(se2_beta[i,i])
p = 2 * ss.t.cdf(hypothesis_T, df + 1) # +1 because null hypothesis doesn't include coefficient
print(f'beta_{i} is {beta[i]:.2f} +/- {cwidth:.2f} with 95% confidence. p-value: {p:.2f} (T = {hypothesis_T:.2f})')
So we found that our intercept is likely necessary (p < 0.05), but the two factors do not have a significant effect. We also found that factor 1 is more important than factor 2 as judged from the p-value
We're going to be using a new library to do regression on this unit because of its ability to do an ANOVA analysis. We'll learn about ANOVA below, but let's first repeat the above regression with this tool. Creating a statsmodel requires two ingredients: data and a formula. The formula is a string that matches your regression model. In this case we use y ~ x1 + x2
. The ~
means equal to here. The data should be a dictionary whose keys match the variables you used in your formula. Thus doing data[y]
should give the y
vector. The statsmodels regression is created by calling ols
and then we must call fit()
to do the regression and summary
to get a report on the results.
In [42]:
from statsmodels.formula.api import ols
x1 = [1, 1, -1, -1]
x2 = [1, -1, 1, -1]
y = [1.2, 3.2, 4.1, 3.6]
data = {'x1': x1, 'x2': x2, 'y': y}
model = ols('y ~ x1 + x2', data=data).fit()
model.summary()
Out[42]:
This regression summary has a huge amount of information. The top table includes information about the goodness of fit and regression model, like degrees of freedom and what the independent variable is. The middle table contains information about the regression coefficients including confidence intervals and p-values. The final table contains information about the residuals. The Jarque-Bera test is a normality test, like the Shapiro-Wilks test we learned previously. The p-values are slightly different because they use dof as 1, instead of 2, for their hypothesis test.
One of the most common analysis techniques of a design of experiments is the use of an Analysis of Variance (ANOVA). An ANOVA breaks up the response variance into factor variances. It explains where the variance in the response comes from. We aren't going to go deeply into the theory of ANOVA, but it's important that you know how it's used and how to intepret it. An ANOVA is based on a linear regression like above, but it's a different way of computing p-values. The p-values are the most relevant output of an ANOVA.
Here's an ANOVA of the above example:
In [51]:
sm.stats.anova_lm(model)
Out[51]:
The ANOVA test gives information about each factor. The df is the degrees of freedom used to model each factor, the sum_sq is difference between the grand mean response and mean response of the treatment, the mean_sq is the sum_sq divided by the degrees of freedom, the F is an F-test statistic (like a T statistic from a t-test), and the final column contains p-value for the existence of each treatment.
The F-test is an alternative to the t-tests we do for regression coefficients being non-zero. The F-test is a little bit different than a t-test. One important idea of an F-test is that when we consider regression coefficents, we imagine our null model as being nested within the model we're considering. That means that the null hypothesis, the regression coefficient is zero, is a special case of the model we're considering where the regression coefficient is non-zero. An example of models that are not nested would be comparing using a $\beta \sin x$ vs $\beta x^2$. There is no obvious way to nest one of these models in the other to create a null hypothesis. Notice that if you imagine the F-test exactly the same as the t-test (null is regression coefficient being 0), then you'll always have nested models.
TC | Water | Sun | Music | Plant Growth |
---|---|---|---|---|
1 | 0 | 0 | 0 | $y_1$ |
2 | 1 | 0 | 0 | $y_2$ |
3 | 0 | 1 | 0 | $y_3$ |
4 | 0 | 0 | 1 | $y_4$ |
Notice that we have switched our level coding to $0$ and $1$. The choice is arbitrary, but it demonstrates better the idea of one at a time design. What is wrong with this design?
Water and sun will never be active at the same time, meaning that all of our experiments will not actually have plant growth. As we discussed in unit 12, lecture 3, this means we're missing interactions. Look at the model equation we assume with one at a time:
This is missing those interactions, like how the system changes when both water and sun are given to the plant. The correct model equation is:
$$ y = \beta_w x_w + \beta_s x_s + \beta_m x_m + \beta_{ws} x_{ws} + \beta_{wm} x_{wm} + \beta_{sm} x_{sm} + \beta_{wsm} x_{wsm} + \epsilon $$To solve for all these regression coefficients, we need to have at least as many experiments. This leads to..
With a factorial design, we have one treatment condition for all permutations of the factor levels. For our plant growth example, the experiments would look like:
TC | Water | Sun | Music | Plant Growth |
---|---|---|---|---|
1 | 0 | 0 | 0 | $y_1$ |
2 | 1 | 0 | 0 | $y_2$ |
3 | 0 | 1 | 0 | $y_3$ |
4 | 0 | 0 | 1 | $y_4$ |
5 | 1 | 1 | 0 | $y_5$ |
6 | 0 | 1 | 1 | $y_6$ |
7 | 1 | 0 | 1 | $y_7$ |
8 | 1 | 1 | 1 | $y_8$ |
The factorial design will have $L^K$ treatment conditions, where $L$ is the number of levels and $K$ is the number of factors. $2^3 = 8$ in this case. One at a time is $1 + K$ treatment conditions for comparison.
In [57]:
xw = [0, 1, 0, 0, 1, 0, 1, 1]
xs = [0, 0, 1, 0, 1, 1, 0, 1]
xm = [0, 0, 0, 1, 0, 1, 1, 1]
y = [0.4, 0.3, 0.3, 0.2, 4.6, 0.3, 0.2, 5.2, 0.3, 0.2, 0.4, 0.3, 5.0, 0.3, 0.3, 5.0]
# we do xw + xw because we have 2 replicates at each condition
data = {'xw': xw + xw, 'xs': xs + xs, 'xm': xm + xm, 'y': y}
model = ols('y~xw + xs + xm + xw * xs + xw * xm + xs * xm + xw * xm * xs', data=data).fit()
sm.stats.anova_lm(model, typ=2)
Out[57]:
Due to how many treatment conditions are required for factorial, often people will neglect some of the interaction effects and leave them as confounded. For example, if you have 3 levels and 5 factors, there will be 11 treatment conditions measuring the treatment effects without interaction and 232 for measuring interactions. If we study only a fraction of these, we can greatly reduce the number of experiments. The choice of how to reduce the number of experiments is a complex topic, but essentially any design between one at a time and factorial is fractional factorial
Often we have factors that are known but not interesting. For example, we might be conducting experiments on Monday and Wednesday. That is a factor, but not one we're interested in. A common example is gender in drug trails. We know gender may effect response to a drug, but we don't want to study it. There are a variety of ways to deal with these.
To remove unknown nuisance factors, or nuisance factors that are hard to quantify, you can randomize your order of experiments. This gives some robustness to the possibility of unkown nuisance factors affecting your experiments. For example, if you get better at an experiment so that you conduct it more accurately and precise as time goes on, this is a hard to quantify nuisance factor. If you randomize your order though, this will mean that you will not have your accuracy indirectly affecting your conclusion about other treatment conditions.
For nuisance factors which you can measure and sort into levels, you can use blocking to remove their effect. Blocking means arranging your factors so that in each block you have the same nuisance factors.
For example, if I want to do a drug trial I need to two blocks: the control and the block given the drug. If I want to remove gender, I will make each block have equal numbers of the two genders. Let's saw you have 12 participants and 8 are women. Block I would be 4 women and 2 men. Block II would be 4 women and 2 men. My experiment would look like:
TC | Block | $X$ | $Y$ |
---|---|---|---|
1 | I | 0 | $y_1$ |
2 | II | 1 | $y_2$ |
This blocking means that the nuisance factor of gender does not vary when other factors vary.