In [ ]:
dat <- read.csv("full_cohort_data.csv")
\doublespacing
In this subchapter, the reader will learn the fundamentals of logistic regression, and how to present and interpret such an analysis.
\doublespacing
In subchapter 5b we covered a very useful methodology for modeling quantitative or continuous outcomes. Health outcomes, of course, come in many different data types. In fact, the outcomes we often care about most -- cured/not cured, alive/dead -- are discrete binary outcomes. It would be ideal if we could extend the same general framework used for continuous outcomes to these binary outcomes. Logistic regression allows us to incorporate much of what we learned in the previous subchapter and apply the same principles to binary outcomes.
When dealing with binary data, we would like to be able to model the probability of a type of outcome given one or more covariates. One might ask, why not simply use linear regression? There are several reasons why this is generally a bad idea. Probabilities need to lie between zero and one, and there is nothing in linear regression to constrain the estimated probabilities to this interval. This would mean that you could have an estimated probability of 2, or even a negative probability! This is one unattractive property of such a method (there are others), and although it is sometimes used, the availability of good software such as R
allows us to perform better analyses easily and efficiently. Before introducing such software, we should introduce the analysis of small contingency tables.
\doublespacing
Contingency tables are the best way to start to think about binary data. A contingency table cross-tabulates the outcome across two or more levels of a covariate. Let's begin by creating a new variable (age.cat
) which dichotomizes age
into two age categories: $\le55$ and $>55$. Note, because we are making age a discrete variable, we also change the data type to a factor. This is similar to what we did for the gender_num
variable when discussing linear regression in the previous subchapter. We can get a breakdown of the new variable using the table
function.
\singlespacing
In [ ]:
dat$age.cat <- as.factor(ifelse(dat$age<=55, "<=55",">55"))
table(dat$age.cat)
\doublespacing
We would like to see how 28 day mortality is distributed among the age categories. We can do so by constructing a contingency table, or in this case what is commonly referred to as a 2x2 table.
\singlespacing
In [ ]:
table(dat$age.cat,dat$day_28_flg)
\doublespacing
From the above table, you can see that 40 patients in the young group ($\le55$) died within 28 days, while 243 in the older group died. These correspond to $P(\text{die} | \text{age}\le55) = 0.043$ or 4.3\% and $P(\text{die} | \text{age}>55) = 0.284$ or 28.4\%, where the "|" can be interpreted as "given" or "for those who have." This difference is quite marked, and we know that age is an important factor in mortality, so this is not surprising.
The odds of an event happening is a positive number and can be calculated from the probability of an event, $p$, by the following formula
\centering
$\text{Odds} = \frac{p}{1-p}$.
\raggedright
An event with an odds of zero never happens, and an event with a very large odds (>100) is very likely to happen. Here, the odds of dying within 28 days in the young group is 0.043/(1-0.043)=0.045, and in the older group is 0.284/(1-0.284)=0.40. It is convenient to represent these two figures as a ratio. The choice of which group's odds goes in the numerator and which in the denominator is somewhat arbitrary, but it should always be made clear. Here we put the older group's odds in the numerator and the younger group's in the denominator, giving an odds ratio of 0.40/0.045 = 8.79. This indicates a very strong association between age and death, and means that the odds of dying in the older group are nearly 9-fold higher than in the younger group. There is a convenient shortcut for computing an odds ratio from a 2x2 table: make an "X" by multiplying the top-left cell by the bottom-right cell, and divide by the product of the bottom-left and top-right cells. In this case $\frac{883 \times 243}{610 \times 40}= 8.79$.
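As a quick check, the same calculation can be done in R directly from the 2x2 table (a small sketch, assuming day_28_flg is coded 0/1 as in the table above):
\singlespacing
In [ ]:
# ratio of the odds of death in the two age groups; equal to the cross-product shortcut
tab <- table(dat$age.cat, dat$day_28_flg)
(tab[">55","1"]/tab[">55","0"]) / (tab["<=55","1"]/tab["<=55","0"])
\doublespacing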
Now let us look at a slightly different case -- when the covariate takes on more than two values. Such a variable is the service_unit
. Let's see how the deaths are distributed among the different units:
\singlespacing
In [ ]:
deathbyservice <- table(dat$service_unit,dat$day_28_flg)
deathbyservice
\doublespacing
We can get the proportion of deaths within each service unit by applying the prop.table
function to our cross-tabulated table.
\singlespacing
In [ ]:
dbys.proptable <- prop.table(deathbyservice,1)
dbys.proptable
\doublespacing
It appears as though the FICU
may have a lower rate of death than either the MICU
or SICU
. To compute odds ratios, first compute the odds:
\singlespacing
In [ ]:
dbys.proptable[,"1"]/dbys.proptable[,"0"]
\doublespacing
and then we need to pick which of FICU
, MICU
or SICU
will serve as the reference or baseline group. This is the group to which the other two groups will be compared. Again the choice is somewhat arbitrary, but it should be dictated by the study objective. If this were a clinical trial with two drug arms and a placebo arm, it would be foolish to use one of the treatments as the reference group, particularly if you wanted to compare the efficacy of the treatments. In this particular case there is no clear reference group, but since the FICU is so much smaller than the other two units, we will use it as the reference. Computing the odds ratios for the MICU and SICU, we get 4.13 and 3.63, respectively. These are also very strong associations, meaning that the odds of dying in the MICU and SICU are around four times higher than in the FICU, but relatively similar to each other.
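A small sketch of that calculation, reusing the odds computed from dbys.proptable above (the helper name unit.odds is ours):
\singlespacing
In [ ]:
unit.odds <- dbys.proptable[,"1"]/dbys.proptable[,"0"]   # odds of 28 day death in each unit
unit.odds[c("MICU","SICU")]/unit.odds["FICU"]            # odds ratios relative to the FICU
\doublespacing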
Contingency tables and 2x2 tables in particular are the building blocks of working with binary data, and it's often a good way to begin looking at the data.
While contingency tables are a fundamental way of looking at binary data, they are somewhat limited. What happens when the covariate of interest is continuous? We could of course create categories from the covariate by establishing cut points, but we may still miss some important aspect of the relationship between the covariate and the outcome by not choosing the right cut points. Also, what happens when we know that a nuisance covariate is related to both the outcome and the covariate of interest? This type of nuisance variable is called a confounder and occurs frequently in observational data, and although there are ways of accounting for confounding in contingency tables, they become more difficult to use when more than one confounder is present.
Logistic regression is a way of addressing both of these issues, among many others. If you recall, using linear regression is problematic because it is prone to estimating probabilities outside of the [0,1] range. Logistic regression has no such problem per se, because it uses a link function known as the logit, which maps probabilities in the interval $(0,1)$ to the real line $(-\infty,\infty)$. This is important for many practical and technical reasons. The logit of $p$ and how it is related to the covariates is defined as
\centering
$logit(p_x) = log(Odds_x) = log(\frac{p_x}{1-p_x}) = \beta_0 + \beta_1 \times x$.
\raggedright
It is worth pointing out that log here, and in most places in statistics, refers to the natural logarithm, sometimes denoted $\ln$.
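In R, the logit and its inverse are available as the qlogis and plogis functions; as a quick illustration using the probabilities from the age example above:
\singlespacing
In [ ]:
qlogis(c(0.043, 0.284))   # logit: maps probabilities to log odds
plogis(qlogis(0.284))     # the inverse logit maps the log odds back to the probability
\doublespacing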
The first covariate we considered, age.cat, is also a binary variable: it takes the value 1 when age $>55$ and 0 when age $\le55$. So plugging these values in, first for the young group $(x=0)$:
\centering
$logit(p_{x=0}) = log(Odds_{x=0}) = log(\frac{p_{x=0}}{1-p_{x=0}}) = \beta_0 + \beta_1 \times 0 = \beta_0$,
\raggedright
and then for the older group $(x=1)$:
\centering
$logit(p_{x=1}) = log(Odds_{x=1}) = log(\frac{p_{x=1}}{1-p_{x=1}}) = \beta_0 + \beta_1 \times 1 = \beta_0 + \beta_1$.
\raggedright
Subtracting the two cases, $logit(p_{x=1}) - logit(p_{x=0}) = log(Odds_{x=1}) - log(Odds_{x=0})$, we see that this quantity is equal to $\beta_1$. If you recall the properties of logarithms, the difference of two logs is the log of their ratio, so $log(Odds_{x=1}) - log(Odds_{x=0}) = log(Odds_{x=1}/Odds_{x=0})$, which may look familiar. This is the log of the ratio of the odds, or the log odds ratio, in the $x=1$ group relative to the $x=0$ group. Hence, we can estimate odds ratios using logistic regression by exponentiating the coefficients of the model (the intercept notwithstanding, which we will get to in a moment).
Let's fit this model, and see how this works using a real example. We fit logistic regression very similarly to how we fit linear regression models, with a few exceptions. First, we will use a new function called glm
, which is a very powerful function in R
that allows one to fit a class of models known as generalized linear models or GLMs [@mccullagh1989generalized]. The glm
function works in much the same way the lm
function does. We need to specify a formula of the form: outcome ~ covariates
, specify what dataset to use (in our case the dat
data frame), and then specify the family. For logistic regression family='binomial'
will be our choice. You can run the summary
function, just like you did for lm
and it produces output very similar to what lm
did.
\singlespacing
In [ ]:
age.glm <- glm(day_28_flg ~ age.cat,data=dat,family="binomial")
summary(age.glm)
\doublespacing
As you can see, we get a coefficients table that is similar to the lm
table we used earlier. Instead of a t value
, we get a z value
, but it can be interpreted similarly. The rightmost column is a p-value for testing the null hypothesis $\beta=0$. If you recall, the non-intercept coefficients are log odds ratios, so testing whether they are zero is equivalent to testing whether the odds ratios are one. If an odds ratio is one, the odds are equal in the numerator and denominator groups, indicating that the probabilities of the outcome are equal in each group. So, assessing whether the coefficients are zero will be an important aspect of this type of analysis.
Looking more closely at the coefficients, the intercept is `r round(age.glm$coef[1],2)` and the age.cat coefficient is `r round(age.glm$coef[2],2)`. The coefficient for age.cat is the log odds ratio for the 2x2 table we analyzed previously. When we exponentiate it, we get exp(`r round(age.glm$coef[2],2)`) = `r round(exp(age.glm$coef[2]),2)`, which corresponds with the estimate from the 2x2 table. For completeness, let's look at the other coefficient, the intercept. If you recall, $log(Odds_{x=0}) = \beta_0$, so $\beta_0$ is the log odds of the outcome in the younger group. Exponentiating again, exp(`r round(age.glm$coef[1],2)`) = `r round(exp(age.glm$coef[1]),3)`, and this corresponds with the previous analysis we did. Similarly, $log(Odds_{x=1}) = \beta_0 + \beta_1$, and the estimated odds of 28 day death in the older group is exp(`r round(age.glm$coef[1],2)` + `r round(age.glm$coef[2],2)`) = `r round(exp(sum(age.glm$coef[1:2])),2)`, as was found above. Converting estimated odds into a probability can be done directly using the plogis function, but we will cover a more powerful and easier way of doing this later on in the section.
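As a brief illustration of these calculations (a sketch, assuming the age.glm fit from above), the exponentiated coefficients and the implied probability can be obtained directly:
\singlespacing
In [ ]:
exp(coef(age.glm))           # odds in the younger group (intercept) and the odds ratio (age.cat)
plogis(sum(coef(age.glm)))   # estimated P(28 day death | age > 55), i.e., inverse logit of beta0 + beta1
\doublespacing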
While the above analysis is useful for illustration, it does not readily demonstrate anything we could not do with the 2x2 table example above. Logistic regression allows us to extend the basic idea to at least two very relevant areas. The first is the case where we have more than one covariate of interest. Perhaps we have a confounder we are concerned about and want to adjust for it. Alternatively, maybe there are two covariates of interest. Second, it allows us to use covariates as continuous quantities instead of discretizing them into categories. For example, instead of dividing age up into exhaustive strata (as we did very simply by just dividing the patients into two groups, $\le55$ and $>55$), we could instead use age as a continuous covariate.
First, having more than one covariate is simple. For example, if we wanted to add service_unit
to our previous model, we could just add it as we did when using the lm
function for linear regression. Here we specify day_28_flg ~ age.cat + service_unit
and run the summary
function.
\singlespacing
In [ ]:
ageunit.glm <- glm(day_28_flg ~ age.cat + service_unit,data=dat,family="binomial")
summary(ageunit.glm)$coef
\doublespacing
A coefficient table is produced, and now we have four estimated coefficients: the same two, (Intercept) and age.cat, which were estimated in the unadjusted model, plus service_unitMICU and service_unitSICU, which correspond to the log odds ratios for the MICU and SICU relative to the FICU. Taking the exponential of each will give an odds ratio for that variable, adjusted for the other variables in the model. In this case the adjusted odds ratios for Age>55, MICU and SICU are `r round(exp(ageunit.glm$coef[2]),2)`, `r round(exp(ageunit.glm$coef[3]),2)`, and `r round(exp(ageunit.glm$coef[4]),2)`, respectively. We would conclude that there is an almost 9-fold increase in the odds of 28 day mortality for those in the $>55$ year age group relative to the younger $\le55$ group while holding service unit constant. This adjustment becomes important in many scenarios where groups of patients may be more or less likely to receive a treatment, but also more or less likely to have better outcomes, so that one effect is confounded by possibly many others. Such is almost always the case with observational data, and this is why logistic regression is such a powerful data analysis tool in this setting.
Another case we would like to handle is a continuous covariate that we want to include in the model. One can always break a continuous covariate into mutually exclusive categories by selecting break or cut points, but selecting the number and location of these points can be arbitrary, and in many cases unnecessary or inefficient. Recall that in logistic regression we are fitting a model:
\centering
$logit(p_x) = log(Odds_x) = log(\frac{p_x}{1-p_x}) = \beta_0 + \beta_1 \times x$,
\raggedright
but now assume $x$ is continuous. Imagine a hypothetical scenario where you know $\beta_0$ and $\beta_1$ and have a group of 50 year olds, and a group of 51 year olds. The difference in the log Odds between the two groups is:
\centering
$log(Odds_{51}) -log(Odds_{50}) = (\beta_0 + \beta_1 \times 51) - (\beta_0 + \beta_1 \times 50) = \beta_1(51-50) = \beta_1$.
\raggedright
Hence, the odds ratio for 51 year olds versus 50 year olds is $\exp{(\beta_1)}$. This is true for any two groups of patients whose ages are one year apart, and it gives a useful way to interpret and use estimated coefficients for continuous covariates. Let's work with an example, again fitting the 28 day mortality outcome as a function of age, but treating age as it was originally recorded in the dataset: a continuous variable called age.
\singlespacing
In [ ]:
agects.glm <- glm(day_28_flg ~ age,data=dat,family="binomial")
summary(agects.glm)$coef
\doublespacing
We see the estimated coefficient is `r round(agects.glm$coef[2],2)` and still very statistically significant. Exponentiating the log odds ratio for age, we get an estimated odds ratio of `r round(exp(agects.glm$coef[2]),2)`, which is per one year increase in age. What if the age difference of interest is ten years instead of one year? There are at least two ways of doing this. One is to replace age with I(age/10), a new covariate that is age divided by ten. The second is to take the agects.glm estimated log odds ratio and multiply it by ten prior to exponentiating. They yield equivalent estimates of `r round(exp(agects.glm$coef[2]*10),2)`, but now per 10 year increase in age. This is useful when the estimated odds ratios (or log odds ratios) are close to one (or zero). When this is done, one unit of the covariate is 10 years, so the generic interpretation of the coefficients remains the same, but the unit (per 10 years instead of per 1 year) changes.
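Both approaches can be sketched as follows (the model name agects10.glm is ours; the two estimates agree up to rounding):
\singlespacing
In [ ]:
agects10.glm <- glm(day_28_flg ~ I(age/10), data=dat, family="binomial")
exp(coef(agects10.glm)[2])    # odds ratio per 10 year increase in age
exp(10*coef(agects.glm)[2])   # equivalently, scale the per-year log odds ratio by ten before exponentiating
\doublespacing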
This of course assumes that the form of our equation relating the log odds of the outcome to the covariate is correct. In cases where the odds of the outcome decrease and then increase as a function of the covariate, it is possible to estimate a relatively small effect for the linear covariate even when the outcome is strongly affected by the covariate, just not in the way the model is specified. Assessing the linearity between the log odds of the outcome and a discretized form of the covariate can be done graphically. For instance, we can break age into five groups and estimate the log odds of 28 day mortality in each group. Plotting these quantities in Figure 1, we can see that in this particular case age is indeed strongly related to the odds of the outcome, and expressing age linearly appears to be a good approximation. If, on the other hand, the relationship has more of a "U" shape, we may falsely conclude that no relationship between the covariate and mortality exists, when the relationship may in fact be rather strong. Such may be the case when looking at the log odds of mortality by the first temperature (temp_1st) in Figure 1 (right).
In [ ]:
library(Hmisc); library(ggplot2); library(grid); library(gridExtra)
postscript("FigC1.eps")
# Log odds of 28 day mortality within five age groups (cut2 creates quintile groups)
tmp.glm <- glm(day_28_flg ~ cut2(age,g=5),data=dat,family="binomial")
tmp <- tmp.glm$coef
tmp <- tmp[1] + c(0,tmp[2:5])    # group-specific log odds: beta0, then beta0 + beta_j
names(tmp) <- levels(cut2(dat$age,g=5))
vcv <- summary(tmp.glm)$cov.unscaled
se <- sqrt(vcv[1,1] + c(0,diag(vcv)[-1]) + 2*c(0,vcv[1,2:5]))    # SE of beta0 and of beta0 + beta_j
limits <- aes(ymax = tmp + se, ymin = tmp - se)
plotage <- qplot(names(tmp),tmp) + xlab("Age Group") + ylab("Log Odds of 28 Day Mortality") + geom_errorbar(limits, width=0.12) + theme(axis.text.x = element_text(colour="grey20",size=6,angle=0,hjust=.5,vjust=.5,face="plain"))
# Same plot for five groups of the first measured temperature
tmp2.glm <- glm(day_28_flg ~ cut2(temp_1st,g=5),data=dat,family="binomial")
tmp2 <- tmp2.glm$coef
tmp2 <- tmp2[1] + c(0,tmp2[2:5])
names(tmp2) <- levels(cut2(dat$temp_1st,g=5))
vcv2 <- summary(tmp2.glm)$cov.unscaled
se2 <- sqrt(vcv2[1,1] + c(0,diag(vcv2)[-1]) + 2*c(0,vcv2[1,2:5]))
limits2 <- aes(ymax = tmp2 + se2, ymin = tmp2 - se2)
plottemp <- qplot(names(tmp2),tmp2) + xlab("Temperature Group") + ylab("Log Odds of 28 Day Mortality") + geom_errorbar(limits2, width=0.12) + theme(axis.text.x = element_text(colour="grey20",size=6,angle=0,hjust=.5,vjust=.5,face="plain"))
grid.arrange(plotage, plottemp, nrow=1, ncol=2)
dev.off()
Figure 1: Plot of log-odds of mortality for each of the five age and temperature groups. Error bars represent 95% confidence intervals for the log odds.
Just as in the case for linear regression, there is a way to test hypotheses for logistic regression. It follows much of the same framework, with the null hypothesis being $\beta=0$. If you recall, this is the log odds ratio, and testing if it is zero is equivalent to a test for the odds ratio being equal to one. Particularly when dealing with a single categorical covariate, there are techniques taught in introductory statistics courses which can be applied here (see ?fisher.test
and ?chisq.test
). In this chapter, we focus on how to conduct such a test in R
.
As was the case when using lm
, we first fit the two competing models, a larger (alternative model), and a smaller (null model). Provided that the models are nested, we can again use the anova
function, passing the smaller model, then the larger model. Here our larger model is the one which contained service_unit
and age.cat
, and the smaller only contains age.cat
, so they are nested. We are then testing if the log odds ratios for the two coefficients associated with service_unit
are zero. Let's call these coefficients $\beta_{MICU}$ and $\beta_{SICU}$. To test whether $\beta_{MICU} = \beta_{SICU} = 0$, we can use the anova function, this time specifying the type of test by setting the test parameter to "Chisq".
\singlespacing
In [ ]:
anova(age.glm,ageunit.glm,test="Chisq")
\doublespacing
Here the output of the anova function when applied to glm objects looks similar to the output generated when it is used on lm objects. A couple of good habits are, first, to make sure the two competing models are correctly specified; here we are testing ~ age.cat versus ~ age.cat + service_unit. Next, the difference between the residual degrees of freedom (Resid. Df) in the two models tells us how many more parameters the larger model has compared to the smaller model. Here we see 1774 - 1772 = 2, which means that two more coefficients are estimated in the larger model than in the smaller one, corresponding with the summary output above. Finally, looking at the p-value (Pr(>Chi)), we see that the test of $\beta_{MICU} = \beta_{SICU} = 0$ has a p-value of around 0.08. At the typical 0.05 significance level, we would not reject the null, and would use the simpler model without the service unit. In logistic regression, this is a common way of testing whether a categorical covariate should be retained in the model, as this can be difficult to assess using the z value in the summary table, particularly when one level is very statistically significant and another is not.
Generating confidence intervals for either the log odds ratios or the odds ratios is relatively straightforward. To get the log odds ratios and their confidence intervals for the ageunit.glm model, which includes both age and service unit:
\singlespacing
In [ ]:
ageunit.glm$coef
confint(ageunit.glm)
\doublespacing
Here the coefficient estimates and confidence intervals are presented in much the same way as for a linear regression. In logistic regression, it is often convenient to exponentiate these quantities to get them on a more interpretable scale.
\singlespacing
In [ ]:
exp(ageunit.glm$coef[-1])
exp(confint(ageunit.glm)[-1,])
\doublespacing
Similar to linear regression, we check whether the confidence intervals for the log odds ratios include zero, which is equivalent to checking whether the intervals for the odds ratios include one. Since the odds ratios are more directly interpretable, it is often more convenient to report them instead of the coefficients on the log odds ratio scale.
Once you have decided on your final model, you may want to generate predictions from it. Such a task may occur when doing a propensity score analysis (Chapter 3.9) or creating tools for clinical decision support. In the logistic regression setting this involves estimating the probability of the outcome given the characteristics (covariates) of a patient, a quantity often denoted $P(outcome | X)$. This is relatively easy to accomplish in R
using the predict
function. One must pass a dataset with all the variables contained in the model. Let's assume that we decided to include the service_unit
in our final model, and want to generate predictions from this based on a new set of patients. Let's first create a new data frame called newdat
using the expand.grid
function which computes all combinations of the values of variables passed to it.
\singlespacing
In [ ]:
newdat <- expand.grid(age.cat=c("<=55",">55"),service_unit=c("FICU","MICU","SICU"))
newdat$pred <- predict(ageunit.glm,newdata=newdat,type="response")
newdat
\doublespacing
We then add a pred column to our new data frame using the predict function. The predict function for logistic regression works similarly to how we used it for linear regression, but this time we also specify type="response", which ensures the quantities computed are what we need, $P(outcome | X)$. Outputting this new object shows our predicted probability of 28 day mortality for six hypothetical patients: two in each of the service units, one in the younger group and one in the older group. We see that our lowest prediction is for the younger patients in the FICU, while the patients with the highest risk of 28 day mortality are the older group in the MICU, though their predicted probability is not all that much higher than for patients of the same age group in the SICU.
To do predictions on a different dataset, just replace the newdata
argument with the other dataset. We could, for instance, pass newdata=dat
and receive predictions for the dataset we built the model on. As was the case with linear regression, evaluating the predictive performance of our model on data used to build the model will generally be too optimistic as to how well it would perform in the real world. How to get a better sense of the accuracy of such models is covered in Chapter 3.2.
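For reference, the in-sample predictions mentioned above can be generated as follows (a minimal sketch; insample.pred is a name we introduce here):
\singlespacing
In [ ]:
# in-sample predicted probabilities of 28 day mortality; expect optimistic estimates of performance
insample.pred <- predict(ageunit.glm, newdata=dat, type="response")
head(insample.pred)
\doublespacing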
In general, presenting the results from a logistic regression model follows quite closely what was done in the linear regression setting. Results should always be put in context, including what variables were considered and which variables were in the final model. Reporting the results should always include some form of the coefficient estimate, a measure of uncertainty, and likely a p-value. In medical and epidemiological journals, coefficients are usually exponentiated so that they are no longer on the log scale, and reported as odds ratios. Frequently, multivariable analyses (analyses with more than one covariate) are distinguished from univariate analyses (one covariate) by denoting the estimated odds ratios as adjusted odds ratios (AOR).
For the age.glm
model, an example of what could be reported is:
Mortality at 28 days was much higher in the older ($>55$ years) group than the younger group ($\le55$ years), with rates of 28.5% and 4.3%, respectively (OR=8.79, 95% CI: 6.27-12.64, p<0.001).
For when treating age as a continuous covariate in the agects.glm
model we could report:
Mortality at 28 days was associated with older age (OR=1.07 per year increase, 95% CI: 1.06-1.08, p<0.001).
And for the case with more than one covariate, (ageunit.glm
) an example of what could be reported:
Older age ($>55$ vs $\le55$ years) was independently associated with 28 day mortality (AOR=8.68, 95% CI: 6.18-12.49, p<0.001) after adjusting for service unit.
As was the case with linear regression, logistic regression is an extremely powerful tool for the analysis of health data. Although the study outcomes in each approach are different, the framework and way of thinking about the problem have similarities. Likewise, many of the problems encountered in linear regression are also of concern in logistic regression. Outliers, missing data, colinearity and dependent/correlated outcomes are all problems for logistic regression as well, and can be dealt with in a similar fashion. Modelling assumptions are also a concern, and we briefly touched on this when discussing whether it was appropriate to use age as a continuous covariate in our models. Although continuous covariates are frequently modeled in this way, it is important to check whether the relationship between the log odds of the outcome and the covariate is indeed linear. In cases where the data have been divided into too many subgroups (or the study is simply too small), you may encounter a level of a discrete variable in which none (or very few) of one of the outcomes occurred -- for example, if we had an additional service_unit with 50 patients, all of whom lived. In such a case, the estimated odds ratios and subsequent confidence intervals or hypothesis tests may not be appropriate to use. Collapsing the discrete covariate into fewer categories will often return the analysis to a manageable form; for our hypothetical new service unit, combining it with the FICU into a single group would be a possible solution (a small sketch follows below). Sometimes a covariate is so strongly related to the outcome that this is not possible, and the only solution may be to report this finding and remove these patients from the analysis.
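A hedged sketch of what such collapsing might look like (the unit name "NewICU" and the variable service_unit2 are purely hypothetical):
\singlespacing
In [ ]:
# combine a hypothetical sparse unit with the FICU so the collapsed category has events of both types
dat$service_unit2 <- as.factor(ifelse(dat$service_unit == "NewICU", "FICU",
                                      as.character(dat$service_unit)))
table(dat$service_unit2)
\doublespacing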
Overall, logistic regression is a very valuable tool in modelling binary and categorical data. Although we did not cover this latter case, a similar framework is available for discrete data which is ordered or has more than one category (see ?multinom
in the nnet
package in R
for details about multinomial logistic regression). This and other topics such as assessing model fit, and using logistic regression in more complicated study designs are discussed in [@hosmer2004applied].