In [ ]:
dat <- read.csv("full_cohort_data.csv")
\doublespacing
In this subchapter, the reader will learn the fundamentals of linear regression, and how to present and interpret such an analysis.
\doublespacing
Linear regression provides the foundation for many types of analyses we perform on health data. In the simplest scenario, we try to relate one continuous outcome, $y$, to a single continuous covariate, $x$, by trying to find values for $\beta_0$ and $\beta_1$ so that the following equation:
\centering
$y=\beta_0 + \beta_1 \times x$
\raggedright
fits the data 'optimally'$^1$. We call these optimal values $\hat{\beta}_0$ and $\hat{\beta}_1$ to distinguish them from the true values of $\beta_0$ and $\beta_1$, which are often unknowable. In Figure 1, we see a scatter plot of TCO2 (y: outcome) levels versus PCO2 (x: covariate) levels. We can clearly see that as PCO2 levels increase, the TCO2 levels also increase. This would suggest that we may be able to fit a linear regression model which predicts TCO2 from PCO2.
\singlespacing
$^1$ Exactly what optimally means is beyond the scope of this chapter, but for those who are interested, we are trying to find values of $\beta_0$ and $\beta_1$ which minimize the squared distance between the fitted line and the observed data points, summed over all data points. This quantity is known as the sum of squares error or, when divided by the number of observations, the mean squared error.
In [ ]:
postscript("FigB1.eps")
plot(dat$pco2_first,dat$tco2_first,xlab="PCO2",ylab="TCO2",pch=19,xlim=c(0,175))
co2.lm <- lm(tco2_first ~ pco2_first,data=dat)
abline(co2.lm,col='red',lwd=2)
co2.quad.lm <- lm(tco2_first ~ pco2_first + I(pco2_first^2),data=dat)
abline(co2.quad.lm,col='blue',lwd=2)
dev.off()
In [ ]:
plot(dat$pco2_first,dat$tco2_first,xlab="PCO2",ylab="TCO2",pch=19,xlim=c(0,175))
co2.lm <- lm(tco2_first ~ pco2_first,data=dat)
abline(co2.lm,col='red',lwd=2)
co2.quad.lm <- lm(tco2_first ~ pco2_first + I(pco2_first^2),data=dat)
abline(co2.quad.lm,col='blue',lwd=2)
#grid.pred <- data.frame(pco2_first=seq.int(from=min(dat$pco2_first,na.rm=T),to=max(dat$pco2_first,na.rm=T)));
#preds <- predict(co2.lm,newdata=grid.pred,interval = "prediction")
#lines(grid.pred$pco2_first,preds[,2],lty=3)
#lines(grid.pred$pco2_first,preds[,3],lty=3)
\doublespacing
It is always a good idea to visualize the data when you can: this allows you to assess whether the subsequent analysis corresponds to what you can see with your own eyes. In this case, a scatter plot can be produced using the plot function:
\singlespacing
In [ ]:
plot(dat$pco2_first,dat$tco2_first,xlab="PCO2",ylab="TCO2",pch=19,xlim=c(0,175))
which produces the scattered points in Figure 1.
\doublespacing
Finding the best fit line for the scatter plot in Figure 1 is relatively straightforward in R:
\singlespacing
In [ ]:
co2.lm <- lm(tco2_first ~ pco2_first,data=dat)
\doublespacing
Let us dissect this command from left to right. The co2.lm <- part assigns the output of the right-hand side of the command to a new variable called co2.lm. The right-hand side runs the lm function, a powerful function in R that fits linear models. As with any command in R, you can find additional help information by running ?lm from the R command prompt. The basic lm command has two parts. The first is the formula, which has the general syntax outcome ~ covariates. Here, our outcome variable is called tco2_first and we are fitting just one covariate, pco2_first, so our formula is tco2_first ~ pco2_first. The second argument, separated by a comma, specifies the data frame to use. In our case, the data frame is called dat, so we pass data=dat, noting that both tco2_first and pco2_first are columns in the data frame dat. This overall procedure of specifying a model formula (tco2_first ~ pco2_first) and a data frame (data=dat), and passing them to an appropriate R function (lm), will be used throughout this chapter, and is the foundation for many types of statistical modeling in R.
We would like to see some information about the model we just fit, and often a good way of doing this is to run the summary command on the object we created:
\singlespacing
In [ ]:
summary(co2.lm)
\doublespacing
This outputs information about the lm object we created in the previous step. The first part recalls the model we fit, which is useful when we have fit many models and are trying to compare them. The second part lists some summary information about what are called residuals -- a topic we will cover later on in this book (Chapter 2.6). Next come the coefficient estimates -- these are the $\hat{\beta}_0$ ((Intercept)) and $\hat{\beta}_1$ (pco2_first) parameters in the best fit line we are trying to estimate. This output tells us that the best fit equation for the data is:
\singlespacing
\centering
tco2_first = 16.21 + 0.189 $\times$ pco2_first.
\raggedright
\doublespacing
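As an aside, the 'optimal' criterion described in the footnote above can be computed directly from the fitted object. The following sketch (using the co2.lm fit from above; sse and mse are illustrative names) calculates the sum of squares error and the mean squared error from the model residuals.
\singlespacing
In [ ]:
# Residuals are the vertical distances between the observed TCO2 values and
# the fitted line; lm() drops rows with missing values before fitting
res <- residuals(co2.lm)
sse <- sum(res^2)         # sum of squares error
mse <- sse / length(res)  # mean squared error
c(SSE = sse, MSE = mse)
\doublespacing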
The estimated intercept and slope have important interpretations. The estimated intercept ($\hat{\beta}_0$) tells us what TCO2 level we would predict for an individual with a PCO2 level of 0. This is the mathematical interpretation, and often this quantity has limited practical use. The estimated slope ($\hat{\beta}_1$), on the other hand, can be interpreted as how quickly the predicted value of TCO2 goes up for every unit increase in PCO2. In this case, we estimate that TCO2 goes up about 0.189 mmol/L for every 1 mm Hg increase in PCO2. Each coefficient estimate has a corresponding Std. Error (standard error). This is a measure of how certain we are about the estimate: if the standard error is large relative to the coefficient, then we are less certain about our estimate. Many things can affect the standard error, including the study sample size. The next column in this table is the t value, which is simply the coefficient estimate divided by its standard error. This is followed by Pr(>|t|), also known as the p-value. These last two quantities are relevant to an area of statistics called hypothesis testing, which we will cover briefly now.
Hypothesis testing in statistics is fundamentally about evaluating two competing hypotheses. One hypothesis, called the null hypothesis, is set up as a straw man (a sham argument set up to be defeated), and is the hypothesis you would like to provide evidence against. In the analysis methods we will discuss in this chapter, this is almost always $\beta_k=0$, and it is often written as $H_0: \beta_k=0$. The alternative (second) hypothesis is commonly assumed to be $\beta_k \neq 0$, and will often be written as $H_A: \beta_k \neq 0$. A statistical significance level, $\alpha$, should be established before any analysis is performed. This value is known as the Type I error, and is the probability of rejecting the null hypothesis when the null hypothesis is true, i.e. of incorrectly concluding that the null hypothesis is false. In our case, it is the probability that we falsely conclude that the coefficient is non-zero when the coefficient is actually zero. It is common to set the Type I error at 0.05.
After specifying the null and alternative hypotheses, along with the significance level, hypotheses can be tested by computing a p-value. The actual computation of p-values is beyond the scope of this chapter, but we will cover the interpretation and provide some intuition. P-values are the probability of observing data as extreme or more extreme than what was seen, assuming the null hypothesis is true. The null hypothesis is $\beta_k=0$, so when would this be unlikely? It is probably unlikely when we estimate $\beta_k$ to be rather large. However, how large is large enough? This likely depends on how certain we are about the estimate of $\beta_k$. If we were very certain, $\hat{\beta}_k$ would not have to be very large, but if we are less certain, then even quite large values of $\hat{\beta}_k$ might not seem unlikely. A p-value balances both of these aspects and computes a single number. We reject the null hypothesis when the p-value is smaller than the significance level, $\alpha$.
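To give some intuition for how the quantities in the coefficient table fit together, the following sketch recomputes the t value and the two-sided p-value for the pco2_first coefficient from its estimate and standard error; it simply reproduces what summary already reports (t.val and p.val are illustrative names).
\singlespacing
In [ ]:
coefs <- summary(co2.lm)$coef   # estimates, standard errors, t values, p-values
est <- coefs["pco2_first", "Estimate"]
se  <- coefs["pco2_first", "Std. Error"]
t.val <- est / se                                        # t value = estimate / standard error
p.val <- 2 * pt(-abs(t.val), df = df.residual(co2.lm))   # two-sided p-value
c(t.value = t.val, p.value = p.val)
\doublespacing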
Returning to our fitted model, we see that the p-values for both coefficients are tiny (<2e-16), and we would reject both null hypotheses, concluding that neither coefficient is likely to be zero. What do these two hypotheses mean at a practical level? The intercept being zero, $\beta_0=0$, would imply that the best fit line goes through the origin [the (x,y) point (0,0)], and we reject this hypothesis. The slope being zero would mean that the best fit line is a flat horizontal line that does not increase as PCO2 increases. Clearly there is a relationship between TCO2 and PCO2, so we would also reject this hypothesis. In summary, we conclude that we need both an intercept and a slope in the model. An obvious next question is: could the relationship be more complicated than a straight line? We will examine this next.
Model selection refers to techniques for selecting the best model from a (perhaps rather large) list of candidate models. We will cover some basics here; more complicated techniques will be covered in a later chapter. In the simplest case, we have two models and we want to know which one we should use.
We will begin by examining whether the relationship between TCO2 and PCO2 is more complicated than the model we fit in the previous section. If you recall, we fit a model with a single linear pco2_first term: tco2_first = $\beta_0 + \beta_1 \times$ pco2_first. One may wonder whether including a quadratic term would fit the data better, i.e. whether:
\centering
tco2_first = $\beta_0 + \beta_1 \times$ pco2_first + $\beta_2 \times$ pco2_first$^2$,
\raggedright
is a better model. One way to evaluate this is by testing the null hypothesis $\beta_2=0$. We do this by fitting the above model and looking at the output. Adding a quadratic term (or any other function of a covariate) is quite easy using the lm function. It is best practice to enclose any such function in the I() function to make sure it gets evaluated as you intended: I() forces the formula to evaluate what is passed to it as is, since the ^ operator has a different use in formulas in R (see ?formula for further details). We fit this model and run the summary function on the result:
\singlespacing
In [ ]:
co2.quad.lm <- lm(tco2_first ~ pco2_first + I(pco2_first^2),data=dat)
summary(co2.quad.lm)$coef
\doublespacing
You will note that we have abbreviated the output from the summary function by appending $coef to it: this tells R that we would like information about the coefficients only. Looking first at the estimates, we see the best fit line is estimated as:
\centering
tco2_first = 16.09 + 0.19 $\times$ pco2_first + 0.00004 $\times$ pco2_first$^2$.
\raggedright
We can add both best fit lines to Figure 1 using the abline function:
\singlespacing
In [ ]:
abline(co2.lm,col='red')
abline(co2.quad.lm,col='blue')
\doublespacing
and one can see that the red (linear term only) and blue (linear and quadratic terms) fits are nearly identical. This corresponds with the relatively small coefficient estimate for the I(pco2_first^2) term. The p-value for this coefficient is about 0.86, so at the 0.05 significance level we would conclude that a quadratic term is not necessary in our model: the model with only a linear term fits the data nearly as well.
We have concluded that a linear (straight line) model fits the data quite well, but thus far we have restricted our exploration to one variable at a time. When we include other variables, we may wonder whether the same straight line holds for all patients. For example, could the relationship between PCO2 and TCO2 be different among men and women? We could subset the data into a data frame for men and a data frame for women, and then fit separate regressions for each gender. Another, more efficient, way to accomplish this is to fit both genders in a single model by including gender as a covariate. For example, we may fit:
\centering
tco2_first = $\beta_0$ + $\beta_1$ $\times$ pco2_first + $\beta_2$ $\times$ gender_num.
\raggedright
The variable gender_num takes the value 0 for women and 1 for men, so for men the model is:
\centering
tco2_first = $\underbrace{(\beta_0 + \beta_2)}_{\text{intercept}}$ + $\beta_1$ $\times$ pco2_first,
\raggedright
and in women:
\centering
tco2_first = $\beta_0$ + $\beta_1$ $\times$ pco2_first.
\raggedright
As one can see, these models have the same slope but different intercepts (the vertical distance between the lines is $\beta_2$). In other words, the lines fit for men and women will be parallel, separated by a distance of $\beta_2$ for all values of pco2_first. This isn't exactly what we would like, as the slopes may also be different. To allow for this, we need the idea of an interaction between two variables. An interaction is essentially the product of two covariates. In this case, which we will call the interaction model, we would be fitting:
\centering
tco2_first = $\beta_0$ + $\beta_1$ $\times$ pco2_first + $\beta_2$ $\times$ gender_num + $\beta_3$ $\times$ $\underbrace{\texttt{gender\_num} \times \texttt{pco2\_first}}_{\text{interaction term}}$.
\raggedright
Again, separating the cases for men:
\centering
tco2_first = $\underbrace{(\beta_0 + \beta_2)}_{\text{intercept}}$ + $\underbrace{(\beta_1 + \beta_3)}_{\text{slope}}$ $\times$ pco2_first,
\raggedright
and women:
\centering
tco2_first = $\underbrace{\beta_0}_{\text{intercept}}$ + $\underbrace{\beta_1}_{\text{slope}}$ $\times$ pco2_first.
\raggedright
Now men and women have different intercepts and slopes.
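As an aside, the 'subset the data' approach mentioned above could be carried out as in the following sketch, which fits a separate regression for each gender (co2.lm.f and co2.lm.m are illustrative names). The single-model approach we pursue below is generally preferable because it uses all of the data at once and makes formal comparisons between models easier.
\singlespacing
In [ ]:
# Fit separate regressions for women (gender_num == 0) and men (gender_num == 1)
co2.lm.f <- lm(tco2_first ~ pco2_first, data = subset(dat, gender_num == 0))
co2.lm.m <- lm(tco2_first ~ pco2_first, data = subset(dat, gender_num == 1))
coef(co2.lm.f)
coef(co2.lm.m)
\doublespacing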
Fitting these models in R is relatively straightforward. Although not absolutely required in this particular circumstance, it is wise to make sure that R handles the data types correctly by ensuring our variables are of the right class. In this particular case, men are coded as 1 and women as 0 (a discrete binary covariate), but R thinks this is numeric (continuous) data:
In [ ]:
class(dat$gender_num)
Leaving this unaltered will not affect the analysis in this instance, but it can be problematic when dealing with other types of data, such as categorical data with several categories (e.g., ethnicity). Also, by setting the data to the right type, the output R generates can be more informative. We can set the gender_num variable to the class factor by using the as.factor function.
In [ ]:
dat$gender_num <- as.factor(dat$gender_num)
Here we have just overwritten the old variable in the dat data frame with a new copy which is of class factor:
In [ ]:
class(dat$gender_num)
Now that we have the gender variable correctly encoded, we can fit the models we discussed above. First, the model with gender as a covariate but no interaction. We can do this by simply adding the variable gender_num to the formula from our earlier co2.lm model fit.
\singlespacing
In [ ]:
co2.gender.lm <- lm(tco2_first ~ pco2_first + gender_num,data=dat)
summary(co2.gender.lm)$coef
\doublespacing
This output is very similar to what we had before, but now there is a gender_num term as well. The 1 appended to gender_num in the first column tells us which group this coefficient applies to (subjects with gender_num equal to 1 -- men). This is always relative to the baseline group, which in this case is women. The estimate is negative, meaning that the line fit for men will be below the line for women. Plotting this fit curve in Figure 2:
\singlespacing
In [ ]:
plot(dat$pco2_first,dat$tco2_first, col=dat$gender_num, xlab="PCO2",ylab="TCO2", xlim=c(0,40), type="n", ylim=c(15,25))
abline(a=c(coef(co2.gender.lm)[1]),b=coef(co2.gender.lm)[2])
abline(a=coef(co2.gender.lm)[1]+coef(co2.gender.lm)[3],b=coef(co2.gender.lm)[2],col='red')
\doublespacing
we see that the lines are parallel but almost indistinguishable. In fact, this plot has been cropped in order to see any difference at all. From the estimate in the summary output above, the difference between the two lines is -0.182 mmol/L, which is quite small, so perhaps this isn't too surprising. We can also see in the summary output that the p-value is about 0.42, so we would not reject the null hypothesis that the true value of the gender_num coefficient is zero.
In [ ]:
postscript("FigB2.eps")
plot(dat$pco2_first,dat$tco2_first,col=dat$gender_num,xlab="PCO2",ylab="TCO2",xlim=c(0,40),type="n",ylim=c(15,25))
abline(a=c(coef(co2.gender.lm)[1]),b=coef(co2.gender.lm)[2])
abline(a=coef(co2.gender.lm)[1]+coef(co2.gender.lm)[3],b=coef(co2.gender.lm)[2],col='red')
co2.gender.interaction.lm <- lm(tco2_first ~ pco2_first*gender_num,data=dat)
abline(a=coef(co2.gender.interaction.lm)[1], b=coef(co2.gender.interaction.lm)[2],lty=3,lwd=2)
abline(a=coef(co2.gender.interaction.lm)[1] + coef(co2.gender.interaction.lm)[3], b=coef(co2.gender.interaction.lm)[2] + coef(co2.gender.interaction.lm)[4],col='red',lty=3,lwd=2)
legend(24,20,lty=c(1,1,3,3),lwd=c(1,1,2,2),col=c("black","red","black","red"),c("Female","Male","Female (Interaction Model)","Male (Interaction Model)"),cex=0.75)
dev.off()
In [ ]:
plot(dat$pco2_first,dat$tco2_first,col=dat$gender_num,xlab="PCO2",ylab="TCO2",xlim=c(0,40),type="n",ylim=c(15,25))
abline(a=c(coef(co2.gender.lm)[1]),b=coef(co2.gender.lm)[2])
abline(a=coef(co2.gender.lm)[1]+coef(co2.gender.lm)[3],b=coef(co2.gender.lm)[2],col='red')
co2.gender.interaction.lm <- lm(tco2_first ~ pco2_first*gender_num,data=dat)
abline(a=coef(co2.gender.interaction.lm)[1], b=coef(co2.gender.interaction.lm)[2],lty=3,lwd=2)
abline(a=coef(co2.gender.interaction.lm)[1] + coef(co2.gender.interaction.lm)[3], b=coef(co2.gender.interaction.lm)[2] + coef(co2.gender.interaction.lm)[4],col='red',lty=3,lwd=2)
legend(24,20,lty=c(1,1,3,3),lwd=c(1,1,2,2),col=c("black","red","black","red"),c("Female","Male","Female (Interaction Model)","Male (Interaction Model)"),cex=0.75)
We now move on to the model with an interaction between pco2_first and gender_num. To add an interaction between two variables, use the * operator within a model formula. By default, R will also add all of the main effects (the variables contained in the interaction) to the model, so simply adding pco2_first*gender_num will add effects for pco2_first and gender_num, in addition to the interaction between them, to the model fit.
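As a point of reference, the * shorthand is equivalent to writing out the main effects and the interaction explicitly with the : operator, as in this sketch:
\singlespacing
In [ ]:
# Equivalent to pco2_first*gender_num: both main effects plus the interaction term
lm(tco2_first ~ pco2_first + gender_num + pco2_first:gender_num, data = dat)
\doublespacing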
\singlespacing
In [ ]:
co2.gender.interaction.lm <- lm(tco2_first ~ pco2_first*gender_num,data=dat)
summary(co2.gender.interaction.lm)$coef
\doublespacing
The estimated coefficients are $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2$ and $\hat{\beta}_3$, respectively, and we can determine the best fit lines for men:
\centering
tco2_first = $(15.85 + 0.81)$ + $(0.20 - 0.023)$ $\times$ pco2_first = $16.67$ + $0.18$ $\times$ pco2_first,
\raggedright
and for women:
\centering
tco2_first = $15.85$ + $0.20$ $\times$ pco2_first.
\raggedright
Based on this, the men's intercept should be higher, but their slope should not be as steep, relative to the women. Let's check this by adding the new model fits as dotted lines, along with a legend, to Figure 2.
\singlespacing
In [ ]:
abline(a=coef(co2.gender.interaction.lm)[1], b=coef(co2.gender.interaction.lm)[2],lty=3,lwd=2)
abline(a=coef(co2.gender.interaction.lm)[1] + coef(co2.gender.interaction.lm)[3], b=coef(co2.gender.interaction.lm)[2] + coef(co2.gender.interaction.lm)[4],col='red',lty=3,lwd=2)
legend(24,20, lty=c(1,1,3,3), lwd=c(1,1,2,2), col=c("black","red","black","red"), c("Female","Male","Female (Interaction Model)","Male (Interaction Model)") )
\doublespacing
We can see that the fits generated from this plot are a little different from those generated by the model without the interaction. The biggest difference is that the dotted lines are no longer parallel. This has some serious implications, particularly when it comes to interpreting our results. First, note that the estimated coefficient for the gender_num variable is now positive. This means that at pco2_first=0, men (red) have higher tco2_first levels than women (black). If you recall, in the previous model fit women had higher levels of tco2_first at all levels of pco2_first. At some point around pco2_first=35 this changes, and women (black) have higher tco2_first levels than men (red). This means that the effect of gender_num may vary as you change the level of pco2_first, which is why interactions are often referred to as effect modification in the epidemiological literature. The effect need not change sign (i.e., the lines need not cross) over the observed range of values for an interaction to be present.
The question remains: is the variable gender_num important? We looked at this briefly when we examined the t value column in the no-interaction model which included gender_num. What if we wanted to test the null hypothesis that $\beta_2$ and $\beta_3$ are simultaneously zero? There is a useful test known as the F-test which can help us in this exact scenario, where we want to decide between a larger model (more covariates) and a smaller model (fewer covariates). The F-test applies only to nested models -- the larger model must contain each covariate that is used in the smaller model, and the smaller model cannot contain covariates which are not in the larger model. The interaction model and the model with gender are nested models, since all the covariates in the model with gender are also in the larger interaction model. An example of non-nested models would be the quadratic model and the interaction model: the smaller (quadratic) model has a term (pco2_first$^2$) which is not in the larger (interaction) model. An F-test would not be appropriate in that case.
To perform an F-test, first fit the two models you wish to compare, and then pass the two model objects to the anova command.
\singlespacing
In [ ]:
anova(co2.lm,co2.gender.interaction.lm)
\doublespacing
As you can see, the anova command first lists the models it is considering. Much of the rest of the information is beyond the scope of this chapter, but we will highlight the reported F-test p-value (Pr(>F)), which in this case is 0.2515. For nested models, the null hypothesis is that all coefficients in the larger model that are not in the smaller model are zero. In the case we are testing, our null hypothesis is that $\beta_2$ and $\beta_3$ are both zero. Since the p-value exceeds the typically used significance level ($\alpha=0.05$), we would not reject the null hypothesis, and would say that the smaller model explains the data nearly as well as the larger model. If these were the only models we were considering, we would use the smaller model as our final model and report it in our results. We will now discuss what exactly you should report and how you can interpret the results.
We will briefly discuss how to communicate a linear regression analysis. In general, before you present the results, you should describe how you obtained them. It is a good idea to report: whether you transformed the outcome or any covariates in any way (e.g., by taking the logarithm), which covariates you considered, and how you chose the covariates that appear in the model you report. In our example above, we did not transform the outcome (TCO2); we considered PCO2 both as a linear and as a quadratic term; and we considered gender on its own and as an interaction term with PCO2. We first evaluated whether a quadratic term should be included in the model using a t-test, after which we considered a model with gender and a gender-PCO2 interaction, and performed model selection with an F-test. Our final model involved only an intercept and a linear PCO2 term.
When reporting your results, it is a good idea to report three aspects for each covariate. First, you should always report the coefficient estimate, which allows the reader to assess the magnitude of the effect. There are many circumstances where a result may be statistically significant but practically meaningless. Second, alongside your estimate you should always report some measure of uncertainty or precision. For linear regression, the standard error (the Std. Error column in the R output) can be reported. We will cover another method, the confidence interval, later on in this section. Lastly, reporting a p-value for each of the coefficients is also a good idea. An example of an appropriate presentation of our final model would be something similar to: TCO2 increased 0.18 (SE: 0.008, p-value <0.001) units per unit increase of PCO2. You will note we reported p-value <0.001, when in fact it is smaller than this: it is common to report very small p-values as <0.001 or <0.0001 instead of using a large number of decimal places. Sometimes it is simply reported whether p<0.05 or not (i.e., whether the result is statistically significant or not), but this practice should be avoided.
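As a rough sketch of how these three pieces of information can be pulled out of the fitted object for reporting (report is an illustrative name; the column names are those returned by summary), one could do:
\singlespacing
In [ ]:
# Estimate, standard error and p-value for each coefficient of the final model
report <- summary(co2.lm)$coef[, c("Estimate", "Std. Error", "Pr(>|t|)")]
signif(report, 3)
\doublespacing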
It is often a good idea to also discuss how well the overall model fits. There are several ways to do this, but reporting a unitless quantity known as $R^2$ (pronounced r-squared) is common. Looking back at the output R provided for our chosen final model, we can find the value of $R^2$ under Multiple R-squared: 0.2647. This quantity is a proportion (a number between 0 and 1) and describes how much of the total variability in the data is explained by the model. An $R^2$ of 1 indicates a perfect fit, while 0 means the model explains none of the variability in the data. What exactly constitutes a 'good' $R^2$ depends on the subject matter and how the model will be used. Another way to describe the fit of your model is through the residual standard error, which is also in the lm output produced by the summary function. This is roughly the square root of the average squared distance between the model fit and the data. While it is in the same units as the outcome, it is in general more difficult to interpret than $R^2$. It should be noted that, for evaluating prediction error, the values reported by R are likely to be optimistic when the model is applied to new data, and a better estimate of the error should be obtained by other methods (e.g., cross-validation), which are covered in another chapter and elsewhere [@friedman2009elements; @james2013introduction].
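Both quantities can also be extracted directly from the summary object rather than read off the printed output; a minimal sketch (fit.summary is an illustrative name):
\singlespacing
In [ ]:
fit.summary <- summary(co2.lm)
fit.summary$r.squared  # proportion of variability explained (Multiple R-squared)
fit.summary$sigma      # residual standard error, in the units of the outcome
\doublespacing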
Interpreting the results is an important component of any data analysis. We have already covered interpreting the intercept, which is the predicted outcome when all covariates are set to zero. This quantity is not of direct interest in most studies. If one does want to interpret it, subtracting the mean from each of the model's covariates before fitting will make it more interpretable: the intercept then becomes the expected value of the outcome when all covariates are set to the study's averages.
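For example, centering PCO2 at its mean before fitting, as in the sketch below (co2.centered.lm is an illustrative name; this model is not used elsewhere in the chapter), leaves the slope unchanged while making the intercept the predicted TCO2 for a patient with an average PCO2 level.
\singlespacing
In [ ]:
# Centering the covariate changes only the intercept's interpretation
co2.centered.lm <- lm(tco2_first ~ I(pco2_first - mean(pco2_first, na.rm = TRUE)),
                      data = dat)
coef(co2.centered.lm)
\doublespacing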
The coefficient estimates for the covariates are in general the quantities of most scientific interest. When the covariate is binary (e.g., gender_num), the coefficient represents the difference between one level of the covariate (1) and the other level (0), while holding any other covariates in the model constant. Although we won't cover it until the next section, extending discrete covariates to the case where they have more than two levels (e.g., ethnicity or service_unit) is quite similar, with the noted exception that it is important to reference the baseline group (i.e., what the effect is relative to). We will return to this topic later in the chapter. Lastly, when the covariate is continuous, the interpretation is the expected change in the outcome for a one unit increase in the covariate in question, while holding all other covariates fixed. This interpretation is actually universal for any non-intercept coefficient, including binary and other discrete covariates, but relies more heavily on understanding how R codes these covariates with dummy variables.
We examined statistical interactions briefly, and this topic can be very difficult to interpret. It is often advisable, when possible, to represent the interaction graphically, as we did in Figure 2.
As mentioned above, one method to quantify the uncertainty around coefficient estimates is by reporting the standard error. Another commonly used method is to report a confidence interval, most commonly a 95\% confidence interval. A 95\% confidence interval for $\beta$ is an interval for which if the data were collected repeatedly, about 95\% of the intervals would contain the true value of the parameter, $\beta$, assuming the modeling assumptions are correct.
To get 95\% confidence intervals for the coefficients, R has a confint function, to which you pass an lm object. It then outputs the 2.5\% and 97.5\% confidence limits for each coefficient.
\singlespacing
In [ ]:
confint(co2.lm)
\doublespacing
The 95\% confidence interval for pco2_first is about 0.17-0.20, which may be slightly more informative than reporting the standard error. People often check whether the confidence interval includes zero (no effect). Since it does not, and in fact the interval is quite narrow and not very close to zero, this provides additional evidence of the covariate's importance. There is a well known link between hypothesis testing and confidence intervals, which we will not discuss in detail here.
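To illustrate that link briefly, the interval reported by confint for an lm fit can be reproduced from the coefficient estimate, its standard error and a t quantile, as in this sketch (t.crit is an illustrative name):
\singlespacing
In [ ]:
coefs <- summary(co2.lm)$coef
est <- coefs["pco2_first", "Estimate"]
se  <- coefs["pco2_first", "Std. Error"]
t.crit <- qt(0.975, df = df.residual(co2.lm))  # critical value for a 95% interval
c(lower = est - t.crit * se, upper = est + t.crit * se)
\doublespacing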
When plotting the data with the model fit, as in Figure 1, it is a good idea to include some assessment of uncertainty as well. To do this in R, we first create a data frame containing the PCO2 levels at which we would like to predict. In this case, we would like to predict the outcome (TCO2) over the range of observed covariate (PCO2) values. We do this by creating a data frame whose variable names match the covariates used in the model. In our case, we have only one covariate (pco2_first), and we predict the outcome over the range of covariate values we observed, determined by the min and max functions.
\singlespacing
In [ ]:
grid.pred <- data.frame(pco2_first=seq.int(from=min(dat$pco2_first,na.rm=T),
to=max(dat$pco2_first,na.rm=T)));
\doublespacing
Then, using the predict function, we can predict TCO2 levels at these PCO2 values. The predict function takes three arguments here: the model we have constructed (in this case with lm), newdata, and interval. The newdata argument allows you to pass any data frame with the same covariates as the model fit, which is why we created grid.pred above. Lastly, the interval argument is optional and allows for the inclusion of confidence or prediction intervals. We want to illustrate a prediction interval, which incorporates both the uncertainty about the model coefficients and the uncertainty generated by the data generating process, so we pass interval="prediction".
\singlespacing
In [ ]:
preds <- predict(co2.lm,newdata=grid.pred,interval = "prediction")
preds[1:2,]
\doublespacing
We have printed out the first two rows of our predictions, preds, which are the model's predictions at the two smallest PCO2 values in the prediction grid. We can see that the predictions (fit) are about 0.18 apart, which makes sense given our estimate of the slope (0.18). We also see that our 95\% prediction intervals are very wide, spanning about 9 (lwr) to 26 (upr). This indicates that, despite coming up with a model which is very statistically significant, we still have a lot of uncertainty about the predictions it generates. It is a good idea to capture this when plotting how well your model fits, by adding the interval limits as dotted lines. Let's plot our final model fit, co2.lm, along with the scatterplot and prediction interval, in Figure 3.
\singlespacing
In [ ]:
postscript("FigB3.eps")
plot(dat$pco2_first,dat$tco2_first,xlab="PCO2",ylab="TCO2",pch=19,xlim=c(0,175))
co2.lm <- lm(tco2_first ~ pco2_first,data=dat)
abline(co2.lm,col='red',lwd=2)
lines(grid.pred$pco2_first,preds[,2],lty=3)
lines(grid.pred$pco2_first,preds[,3],lty=3)
dev.off()
In [ ]:
plot(dat$pco2_first,dat$tco2_first,xlab="PCO2",ylab="TCO2",pch=19,xlim=c(0,175))
co2.lm <- lm(tco2_first ~ pco2_first,data=dat)
abline(co2.lm,col='red',lwd=2)
lines(grid.pred$pco2_first,preds[,2],lty=3)
lines(grid.pred$pco2_first,preds[,3],lty=3)
\doublespacing
Linear regression is an extremely powerful tool for doing data analysis on continuous outcomes. Despite this, there are several aspects to be aware of when performing this type of analysis.
1) Hypothesis testing and interval generation rely on modelling assumptions, so producing diagnostic plots is a critical component of any data analysis (see the sketch after this list). There is further discussion of this elsewhere in the book, and we refer you to [@harrell2015regression;@venables2013modern;@weisberg2005applied] for more information about this important topic.
2) Outliers can be problematic when fitting models. When there are outliers in the covariates, it is often easiest to turn a numeric variable into a categorical one (two or more groups cut along values of the covariate). Removing outliers should be avoided when possible, as they often tell you a lot about the data generating process. In other cases, they may identify problems with the data extraction process; for instance, a subset of the data may use different units for the same covariate (e.g., inches and centimeters for height), and the data then needs to be converted to common units. Methods robust to outliers are available in R; a brief introduction to getting started with some of these functions is available in [@venables2013modern].
3) Be aware of missing data. R reports information about missing data in the summary output. For our model fit co2.lm, we had 186 observations with missing pco2_first values. R leaves these observations out of the analysis and fits the model on the remaining non-missing observations (see the sketch after this list). Always check the output to ensure you have as many observations as you think you should: when many observations have missing data and you try to build a model with a large number of coefficients, you may end up fitting the model on only a handful of observations.
4) Assess potential multicollinearity. Collinearity can occur when two or more covariates are highly correlated, for instance if blood pressure on the left and right arms were simultaneously measured and both were used as covariates in the model. In such a case, consider taking the sum, average or difference (whichever is most useful in the particular case) to craft a single covariate. Collinearity can also occur when a categorical variable has been improperly generated. For instance, defining groups along the PCO2 covariate of 0-25, 5-26, 26-50, >50 may cause linear regression to encounter difficulties, as the first and second groups are nearly identical (usually such situations are programming errors). Identifying covariates which may be collinear is a key part of the exploratory analysis stage, where they can often (but not always) be seen by plotting the data (a simple check is sketched after this list).
5) Check to see if outcomes are dependent. This most commonly occurs when one patient contributes multiple observations (outcomes). There are alternative methods for dealing with this situation [@diggle2013analysis], but it is beyond the scope of this chapter.
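The sketches below illustrate quick checks corresponding to points 1, 3 and 4 above: base R diagnostic plots for the fitted model, a count of missing covariate values alongside the number of observations actually used, and a simple pairwise correlation check (here between the linear and quadratic PCO2 terms from the earlier quadratic model).
\singlespacing
In [ ]:
# 1) Standard diagnostic plots: residuals vs fitted, normal Q-Q,
#    scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(co2.lm)
par(mfrow = c(1, 1))

# 3) How much missing data is there, and how many observations were used in the fit?
sum(is.na(dat$pco2_first))  # missing covariate values
nobs(co2.lm)                # observations actually used by lm()

# 4) A simple collinearity check: the linear and quadratic PCO2 terms are highly correlated
cor(dat$pco2_first, dat$pco2_first^2, use = "complete.obs")
\doublespacing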
These concerns should not discourage you from using linear regression. It is extremely powerful and reasonably robust to some of the problems discussed above, depending on the situation. Frequently a continuous outcome is converted to a binary outcome, often without a compelling reason. By discretizing the outcome you may be losing information about which patients may benefit or be harmed most by a therapy, since a binary outcome may treat patients who had very different outcomes on the continuous scale as the same. The overall framework we took for linear regression closely mirrors the way in which we approach the other analysis techniques discussed later in this chapter.