In [ ]:
dat <- read.csv("full_cohort_data.csv")
\doublespacing
In this subchapter, we will work through a case study and discuss the data analysis components which should be included in an original research article suitable for a clinical journal. We will also discuss some approaches for model and feature selection.
\doublespacing
We will now use what we learned in the previous sections to examine whether indwelling arterial catheters (IAC) have any effect on patient mortality [@hsu2015association]. As reiterated throughout, clearly identifying a study objective is important for a smooth data analysis. In our case, we'd like to estimate the effect of IAC on mortality, but we acknowledge a few potential problem areas. First, the group who received IAC and the group who did not are likely different in many respects, and many of these differences likely also have some effect on mortality. Second, we would like to limit ourselves to mortality events which occur in close proximity to the ICU admission. The dataset includes 28 day mortality, which satisfies this requirement. As for the first issue, we also have many covariates which capture some of the features we may be concerned with, including severity of illness (sapsi_first and sofa_first), age (age), patient gender (gender_num) and co-morbidities (chf_flg, afib_flg, renal_flg, etc.).
With all these in mind, we should have a good start on determining our study objective. In our case, it might be:
To estimate the effect that administration of IAC during an ICU admission has on 28 day mortality in patients within the MIMIC II study who received mechanical ventilation, while adjusting for age, gender, severity of illness and comorbidities.
For now, this describes our outcome and covariates quite well. One of the first things that is often done is to describe our population by computing summary statistics of all or a subset of the variables collected in the study. This description allows the reader to understand how well the study would generalize to other populations. We have made available an R package on GitHub that will allow one to construct preliminary forms of such a table quite quickly. To install the R package, first install and load the devtools package:
\singlespacing
In [ ]:
install.packages("devtools")
library(devtools)
\doublespacing
and then install and load our package by using the install_github function.
\singlespacing
In [ ]:
install_github("jraffa/MIMICbook")
library(MIMICbook);
\doublespacing
Before we do any in-depth analysis, let's make sure we are using the original dataset, first by removing and then reloading the dat data frame. In order to ensure our research is reproducible, it's a good idea to make sure the entire process of doing the analysis is documented. By starting from the original copy of the dataset, we are able to present precisely what methods we used in an analysis.
\singlespacing
In [ ]:
rm(dat);
dat <- read.csv("full_cohort_data.csv")
\doublespacing
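It can also help to record the computing environment as part of the analysis record, since results can depend on the R version and on package versions. For example:
\singlespacing
In [ ]:
# Record the R version and loaded package versions for reproducibility
sessionInfo()
\doublespacing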
As mentioned before, recoding binary encoded variables (ones which contain only 0s and 1s) to the R data class factor can sometimes make interpreting the R output easier. The following piece of code cycles through all the columns in dat and converts any binary variables to a factor.
\singlespacing
In [ ]:
# Identify which columns are binary coded (contain only 0s and 1s, ignoring NAs)
bincols <- colMeans((dat == 1 | dat == 0), na.rm = TRUE) == 1
for (i in seq_along(bincols)) { # Convert each binary column to a factor
  if (bincols[i]) {
    dat[[i]] <- as.factor(dat[[i]])
  }
}
\doublespacing
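As a quick check, we can list which columns are now stored as factors:
\singlespacing
In [ ]:
# List the columns now stored as factors
names(dat)[sapply(dat, is.factor)]
\doublespacing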
We are now ready to generate a summary of the patient characteristics in our study. The MIMICbook package has a produce.table1 function. This generates a summary table of the data frame you pass to it, using an appropriate summary for each variable: average and standard deviation for continuous variables, and counts and percentages for categorical variables. In its simplest form, produce.table1 can be passed a data frame as its only argument, which we do (passing it the dat data frame). The raw output is not very readable, so we make it look nicer by using a powerful R package called knitr, which provides many tools to assist in performing reproducible research. You can find out more about knitr by running ?knitr on the R console after loading it. We will be using the kable command, which takes our tab1 variable -- a summary table generated using the produce.table1 function -- and formats it more cleanly.
In [ ]:
produce.table1 <- function(x, labels = NULL) {
  out <- matrix(NA, nrow = ncol(x))
  rrn <- NULL
  for (i in 1:ncol(x)) {
    if (is.factor(x[, i])) {
      # Categorical: report N (%) for one level -- the most prevalent level,
      # unless a reference label was supplied via the labels argument
      if (is.null(labels[i]) || is.na(labels[i])) {
        tmp <- table(x[, i])
        most.prev.name <- names(which.max(tmp))
      } else {
        most.prev.name <- labels[i]
      }
      out[i, ] <- paste0(sum(x[, i] == most.prev.name, na.rm = TRUE), " (",
                         round(100 * mean(x[, i] == most.prev.name, na.rm = TRUE), 1), "%)")
      if (sum(is.na(x[, i])) > 0) {
        out[i, ] <- paste0(out[i, ], " [Missing: ", sum(is.na(x[, i])), "]")
      }
      rrn[i] <- paste0(names(x)[i], "==", most.prev.name)
      labels[i] <- most.prev.name
    } else {
      # Continuous: report mean (SD)
      out[i, ] <- paste0(round(mean(x[, i], na.rm = TRUE), 1), " (",
                         round(sd(x[, i], na.rm = TRUE), 1), ")")
      if (sum(is.na(x[, i])) > 0) {
        out[i, ] <- paste0(out[i, ], " [Missing: ", sum(is.na(x[, i])), "]")
      }
      rrn[i] <- names(x)[i]
    }
  }
  rownames(out) <- rrn
  colnames(out) <- "Average (SD), or N (%)"
  attr(out, "labels") <- labels  # record reference levels for reuse
  return(out)
}
\singlespacing
In [ ]:
tab1 <- produce.table1(dat);
library(knitr);
kable(tab1,caption = "Overall patient characteristics")
In [ ]:
tab1 <- produce.table1(dat);
library(knitr);
kable(tab1,caption = "Overall patient characteristics",format="latex")
\doublespacing
The row descriptors are not very informative, and what we have produced would not be usable for final publication, but it suits our purposes for now. knitr allows one to output such tables in HTML, \LaTeX\ or even a Word document, which you can edit to make the table more informative. The results are contained in Table 1.
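For instance, the same table can be rendered as plain markdown (which converts readily to a Word document with tools such as pandoc) via kable's format argument:
\singlespacing
In [ ]:
# A markdown rendering of Table 1 ("markdown" is one of kable's built-in formats)
kable(tab1, caption = "Overall patient characteristics", format = "markdown")
\doublespacing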
A couple of things we may notice from the baseline characteristics are: (1) several variables have an appreciable number of missing values (e.g., bmi, po2_first and iv_day_1); and (2) none of the patients in this cohort had sepsis (sepsis_flg). Both of these points are important, and they illustrate why it is always a good idea to perform basic descriptive analyses before beginning any modeling. The missing data is primarily related to weight/BMI or lab values. For the purpose of this chapter, we are going to ignore both of these classes of variables. While we would likely want to adjust for some of these covariates in a final version of the paper, and Chapter 2.3 gives some useful techniques for dealing with such a situation, we are going to focus on the set of covariates we identified in our study objective, which does not include these variables. The issue related to sepsis is also of note. Sepsis certainly would contribute to higher rates of mortality when compared to patients without sepsis, but since we have no patients with sepsis, we cannot and do not need to adjust for this covariate per se. What we do need to do is acknowledge this fact by revising our study objective. We originally identified our population as patients within MIMIC, but because this is a subset of MIMIC -- those without sepsis -- we should revise the study objective to:
To estimate the effect that administration of IAC during an ICU admission has on 28 day mortality in patients without sepsis who received mechanical ventilation within MIMIC II, while adjusting for age, gender, severity of illness and comorbidities.
We will also not want to include the sepsis_flg variable as a covariate in any of our models, as there are no patients with sepsis within this study with which to estimate the effect of sepsis; this is easy to confirm directly with a quick tabulation, shown below. Now that we have examined the basic overall characteristics of the patients, we can begin the next steps in the analysis.
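\singlespacing
In [ ]:
# Quick tabulation confirming there are no sepsis patients in this cohort
table(dat$sepsis_flg)
\doublespacing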
The next steps will vary slightly from study to study, but it is often useful to put yourself in the shoes of a peer reviewer. What problems will a reviewer likely find with your study, and how can you address them? Usually, the reviewer will want to see how the population differs across the values of the covariate of interest. In our case study, if the treated group (IAC) differed substantially from the untreated group (no IAC), then this may account for any effect we demonstrate. We can examine this by summarizing the two groups in a similar fashion as was done for Table 1. We can reuse the produce.table1 function, but we pass it the two groups separately by splitting the dat data frame into two using the split function (by the aline_flg variable), later combining them into one table using cbind to yield Table 2. It's important to ensure that the same reference groups are used across the two study groups, and that's what the labels argument is used for (see ?produce.table1 for more details).
\singlespacing
In [ ]:
datby.aline <- split(dat,dat$aline_flg)
reftable <- produce.table1(datby.aline[[1]]);
tab2 <- cbind(produce.table1(datby.aline[[1]],labels=attr(reftable,"labels")),
produce.table1(datby.aline[[2]],labels=attr(reftable,"labels")))
colnames(tab2) <- paste0("Average (SD), or N (%)",c(", No-IAC", ", IAC"))
kable(tab2, caption="Patient characteristics stratified by IAC administration")
In [ ]:
datby.aline <- split(dat,dat$aline_flg)
reftable <- produce.table1(datby.aline[[1]]);
tab2 <- cbind(produce.table1(datby.aline[[1]],labels=attr(reftable,"labels")),
produce.table1(datby.aline[[2]],labels=attr(reftable,"labels")))
colnames(tab2) <- paste0("Average (SD), or N (%)",c(", No-IAC", ", IAC"))
kable(tab2, caption="Patient characteristics stratified by IAC administration",format="latex")
\doublespacing
As you can see in Table 2, the IAC group differs in many respects from the non-IAC group. Patients who were given IAC tended to have a higher severity of illness at baseline (sapsi_first and sofa_first), were slightly older, were less likely to be from the MICU, and had slightly different co-morbidity profiles when compared to the non-IAC group.
Next, we can see how the covariates are distributed across the different outcomes (death within 28 days versus alive at 28 days). This will give us an idea of which covariates may be important in affecting the outcome. The code to generate this is nearly identical to that used to produce Table 2, but instead we replace aline_flg with day_28_flg (the outcome) to get Table 3.
\singlespacing
In [ ]:
datby.28daymort <- split(dat,dat$day_28_flg)
reftablemort <- produce.table1(datby.28daymort[[1]]);
tab3 <- cbind(produce.table1(datby.28daymort[[1]],labels=attr(reftablemort,"labels")),
produce.table1(datby.28daymort[[2]],labels=attr(reftablemort,"labels")))
colnames(tab3) <- paste0("Average (SD), or N (%)",c(", Alive", ", Dead"))
kable(tab3,caption="Patient characteristics stratified by 28 day mortality")
In [ ]:
datby.28daymort <- split(dat,dat$day_28_flg)
reftablemort <- produce.table1(datby.28daymort[[1]]);
tab3 <- cbind(produce.table1(datby.28daymort[[1]],labels=attr(reftablemort,"labels")),
produce.table1(datby.28daymort[[2]],labels=attr(reftablemort,"labels")))
colnames(tab3) <- paste0("Average (SD), or N (%)",c(", Alive", ", Dead"))
kable(tab3,caption="Patient characteristics stratified by 28 day mortality",format="latex")
\doublespacing
As can be seen in Table 3, those patients who died within 28 days differ in many ways from those who did not. Those who died had higher SAPS and SOFA scores, were on average older, and had different co-morbidity profiles.
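It is also worth cross-tabulating the exposure directly against the outcome; a quick sketch:
\singlespacing
In [ ]:
# Counts and row percentages of 28 day mortality by IAC status
with(dat, table(aline_flg, day_28_flg))
round(100 * with(dat, prop.table(table(aline_flg, day_28_flg), margin = 1)), 1)
\doublespacing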
We see that of the 984 subjects receiving IAC, 170 (17.2%) died within 28 days, whereas 113 of 792 (14.2%) died in the no-IAC group. In a univariate analysis, we can assess whether this difference in mortality is statistically significant by fitting a logistic regression with aline_flg as the single covariate.
\singlespacing
In [ ]:
uvr.glm <- glm(day_28_flg ~ aline_flg, data = dat, family = "binomial")
exp(uvr.glm$coef[-1])        # odds ratio for IAC
exp(confint(uvr.glm)[-1, ])  # 95% confidence interval for the odds ratio
\doublespacing
Those who received IAC had over a 25\% increase in the odds of 28 day mortality when compared to those who did not receive IAC. The confidence interval includes one, so we would expect the p-value to be >0.05. Running the summary function, we see that this is the case.
\singlespacing
In [ ]:
summary(uvr.glm)$coef
\doublespacing
Indeed, the p-value for aline_flg is about 0.09. As we saw in Table 2, there are likely several important covariates that differed between those who received IAC and those who did not. These may serve as confounders, and the possible association we observed in the univariate analysis may be stronger, non-existent or in the opposite direction (i.e., IAC being associated with lower rates of mortality) depending on the situation. Our next step is to adjust for these confounders. This is an exercise in what is known as model building, and there are several ways it is approached in the literature. A common approach is to fit all univariate models (one covariate at a time, as we did with aline_flg, but separately for each covariate and without aline_flg), and perform a hypothesis test on each model; any variables which are statistically significant under the univariate models are then included in a multivariable model. Another approach begins with the model we just fit (uvr.glm, which only has aline_flg as a covariate) and sequentially adds variables one at a time; this approach is often called step-wise forward selection. We will choose to do step-wise backwards selection, which is, as it sounds, the opposite of step-wise forward selection. Model selection is a challenging task in data analysis, and there are many other methods [see @dash1997feature; @harrell2015regression; @friedman2009elements; @james2013introduction] we couldn't possibly describe in full detail here. As an overall philosophy, it is important to outline and describe the process by which you will do model selection before you actually do it, and then to stick with that process.
In our stepwise backwards elimination procedure, we are going to fit a model containing IAC (aline_flg), age (age), gender (gender_num), disease severity (sapsi_first and sofa_first), service type (service_unit), and comorbidities (chf_flg, afib_flg, renal_flg, liver_flg, copd_flg, cad_flg, stroke_flg, mal_flg and resp_flg). This is often called the full model, and it is fit below. From the full model, we will proceed by eliminating one variable at a time, until we arrive at a model containing only statistically significant covariates. Because aline_flg is the covariate of interest, it will remain in the model regardless of its statistical significance. At each step we need a criterion for choosing which variable to eliminate. There are several ways of doing this, but one way is to perform a hypothesis test for each covariate and eliminate the covariate with the largest p-value, unless all p-values are <0.05 (in which case we stop) or the largest p-value belongs to aline_flg (in which case we eliminate the covariate with the next largest p-value).
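As an aside, R can automate a closely related backwards search with the built-in step function, which eliminates terms by AIC rather than by individual p-values. A minimal sketch, using the full model mva.full.glm that we fit below (the lower scope keeps aline_flg in the model throughout):
\singlespacing
In [ ]:
# Automated backwards elimination by AIC, always retaining aline_flg.
# This is an alternative to, not a substitute for, the manual drop1-based
# procedure followed in the text, and it may select a different model.
step(mva.full.glm, scope = list(lower = ~ aline_flg), direction = "backward", trace = 0)
\doublespacing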
Most of the covariates are binary or categorical in nature, and we've already converted them to factors. The disease severity scores (SAPS and SOFA) are continuous. We could add them as we did age, but this assumes a linear trend in the log odds of death as these scores change, which may or may not be appropriate. Indeed, when we plot the log odds of 28 day death by SOFA score (Figure 1), we note that while the log odds of death generally increase as the SOFA score increases, the relationship may not be linear.
In [ ]:
library(Hmisc)
postscript("FigE1.eps")
# Discretize SOFA into unit-wide bins and estimate the log odds of death in each
tmp.glm <- glm(day_28_flg ~ cut2(sofa_first, c(1:14)), data = dat, family = "binomial")
tmp <- tmp.glm$coef
tmp <- tmp[1] + c(0, tmp[2:15])  # per-bin log odds: intercept plus bin coefficient
names(tmp) <- levels(cut2(dat$sofa_first, c(1:14)))
names(tmp)[15] <- "14-17"
library(ggplot2)
# SE of each bin's log odds: Var(b0 + bj) = Var(b0) + Var(bj) + 2*Cov(b0, bj)
vc <- summary(tmp.glm)$cov.unscaled
se <- sqrt(vc[1, 1] + c(0, diag(vc)[-1]) + 2 * c(0, vc[1, 2:15]))
limits <- aes(ymax = tmp + se, ymin = tmp - se)
qplot(names(tmp), tmp) + xlab("SOFA Group") + ylab("Log Odds of 28 Day Mortality") + geom_errorbar(limits, width = 0.12) + scale_x_discrete(limits = names(tmp))
dev.off()
In [ ]:
# Discretize SOFA into unit-wide bins and estimate the log odds of death in each
tmp.glm <- glm(day_28_flg ~ cut2(sofa_first, c(1:14)), data = dat, family = "binomial")
tmp <- tmp.glm$coef
tmp <- tmp[1] + c(0, tmp[2:15])  # per-bin log odds: intercept plus bin coefficient
names(tmp) <- levels(cut2(dat$sofa_first, c(1:14)))
names(tmp)[15] <- "14-17"
library(ggplot2)
# SE of each bin's log odds: Var(b0 + bj) = Var(b0) + Var(bj) + 2*Cov(b0, bj)
vc <- summary(tmp.glm)$cov.unscaled
se <- sqrt(vc[1, 1] + c(0, diag(vc)[-1]) + 2 * c(0, vc[1, 2:15]))
limits <- aes(ymax = tmp + se, ymin = tmp - se)
qplot(names(tmp), tmp) + xlab("SOFA Group") + ylab("Log Odds of 28 Day Mortality") + geom_errorbar(limits, width = 0.12) + scale_x_discrete(limits = names(tmp))
\doublespacing
What can be done in this situation is to turn a continuous covariate into a discrete one. A quick way of doing this is with the cut2 function in the Hmisc package[^Hmisc]. Applying cut2(sofa_first, g=5) turns the sofa_first variable into five approximately equal sized groups by SOFA score. For illustration, the cohort breaks down into the following sized groups by SOFA score:
\singlespacing
In [ ]:
library(Hmisc)
table(cut2(dat$sofa_first,g=5))
\doublespacing
with not quite equal groups, due to the already discrete nature of the SOFA score to begin with. We will treat both SAPS and SOFA in this way in order to avoid any model misspecification that may occur as a result of assuming a linear relationship.
Returning to the full model, we fit it using these discretized disease severity scores, along with the other covariates we identified for inclusion.
\singlespacing
In [ ]:
mva.full.glm <- glm(day_28_flg ~ aline_flg + age + gender_num + cut2(sapsi_first,g=5) + cut2(sofa_first,g=5) + service_unit + chf_flg + afib_flg + renal_flg + liver_flg + copd_flg + cad_flg + stroke_flg + mal_flg + resp_flg,data=dat,family="binomial")
summary(mva.full.glm)
\doublespacing
The summary output shows that some of the covariates are highly statistically significant, while others may be expendable. Ideally, we would like as simple a model as possible that explains as much of the variation in the outcome as possible. We will attempt to remove our first covariate by the procedure we outlined above. For each of the variables we consider removing, we could fit a logistic regression model without that covariate, and then test it against the current model. R has a useful function that automates this process for us, called drop1. We pass drop1 our logistic regression object (mva.full.glm) and the type of test we would like to do. If you recall from the logistic regression section, we used test="Chisq", and this is what we will pass to the drop1 function as well.
\singlespacing
In [ ]:
drop1(mva.full.glm,test="Chisq")
\doublespacing
As you see from the output, each covariate is listed, along with a p-value (Pr(>Chi)). Each row represents a hypothesis test, with the bigger (alternative) model being the full model (mva.full.glm), and each null model being the full model without that row's covariate. The p-values here should match those you would get by carrying out each of these tests with anova. As we can see from the listed p-values, aline_flg has the largest p-value, but we stipulated in our model selection plan that we would retain this covariate, as it is our covariate of interest. We therefore go to the next largest p-value, which belongs to the cad_flg variable (coronary artery disease). We will update our model, and repeat the backwards elimination step on the updated model. We could cut and paste the mva.full.glm command and remove + cad_flg, but an easier way, less prone to error, is to use the update command. update can take a glm or lm object and alter its covariates. To do a backwards elimination, the second argument is .~. - variable. The .~. part indicates that the outcome and the rest of the covariates should be kept the same, and the - variable part indicates that the model should be fit without the covariate called variable. Hence, to fit a new model from the full model, but without the cad_flg variable, we would run:
\singlespacing
In [ ]:
mva.tmp.glm <- update(mva.full.glm, .~. - cad_flg)
\doublespacing
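As a check on the equivalence noted above, the cad_flg row of the drop1 output can be reproduced by testing the reduced model against the full model with anova; a minimal sketch:
\singlespacing
In [ ]:
# Likelihood ratio test of the model without cad_flg against the full model;
# the p-value should match the cad_flg row of drop1(mva.full.glm, test="Chisq")
anova(mva.tmp.glm, mva.full.glm, test = "Chisq")
\doublespacing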
We then repeat the drop1 step:
\singlespacing
In [ ]:
drop1(mva.tmp.glm,test="Chisq")
\doublespacing
and see that aline_flg still has the largest p-value, but chf_flg has the second largest, so we'll choose to remove it next. To update the model and run another elimination step, we would run:
\singlespacing
In [ ]:
mva.tmp.glm2 <- update(mva.tmp.glm, .~. - chf_flg)
drop1(mva.tmp.glm2,test="Chisq")
\doublespacing
where again aline_flg has the largest p-value, and gender_num has the second largest. We continue, eliminating gender_num, copd_flg, liver_flg, cut2(sofa_first, g = 5), renal_flg, and service_unit, in that order (results omitted). The table produced by drop1 from our final model is as follows:
In [ ]:
mva.tmp.glm3 <- update(mva.tmp.glm2, .~. - gender_num)
drop1(mva.tmp.glm3,test="Chisq")
mva.tmp.glm4 <- update(mva.tmp.glm3, .~. - copd_flg)
drop1(mva.tmp.glm4,test="Chisq")
mva.tmp.glm5 <- update(mva.tmp.glm4, .~. - liver_flg)
drop1(mva.tmp.glm5,test="Chisq")
mva.tmp.glm6 <- update(mva.tmp.glm5, .~. - cut2(sofa_first, g = 5))
drop1(mva.tmp.glm6,test="Chisq")
mva.tmp.glm7 <- update(mva.tmp.glm6, .~. - renal_flg)
drop1(mva.tmp.glm7,test="Chisq")
mva.tmp.glm8 <- update(mva.tmp.glm7, .~. - service_unit)
drop1(mva.tmp.glm8,test="Chisq")
\singlespacing
In [ ]:
drop1(mva.tmp.glm8,test="Chisq")
\doublespacing
All variables other than aline_flg, which we retained by design, are statistically significant at the 0.05 significance level. Looking at the summary output, we see that aline_flg is not statistically significant (p=0.98), but all other terms are, with the exception of the cut2(sapsi_first, g = 5)[12,14) term, which suggests that the second lowest SAPS group may not be statistically significantly different from the baseline (lowest) SAPS group.
\singlespacing
In [ ]:
mva.final.glm <- mva.tmp.glm8;
summary(mva.final.glm)
\doublespacing
We would call this model our final model, and would present it in a table similar to Table 4. Since the effect of IAC was of particular focus, we will highlight it by saying that it is not associated with 28 day mortality, with an estimated adjusted odds ratio of 1.01 (95% CI: 0.71-1.43, p=0.98). We may conclude that, after adjusting for the other potential confounders found in Table 4, we do not find any effect of IAC on 28 day mortality.
\singlespacing
In [ ]:
library(sjPlot);
out <- sjt.glm(mva.final.glm,no.output=TRUE,emph.p=FALSE)
final.table <- out$data[-1,-6];
colnames(final.table) <- c("Covariate", "AOR", "Lower 95% CI", "Upper 95% CI", "p-value")
final.table[,1] <- c("IAC", "Age (per year increase)", "SAPSI [12-14)* (relative to SAPSI<12)", "SAPSI [14-16)*", "SAPSI [16-19)*", "SAPSI [19-32]*", "Atrial Fibrillation", "Stroke", "Malignancy", "Non-COPD Respiratory disease")
final.table[,5] <- gsub("&lt;", "<", final.table[,5])  # un-escape "<" in sjt.glm's HTML-formatted p-values
kable(final.table,caption="Multivariable logistic regression analysis for mortality at 28 days outcome (Final Model)",format="latex");
\doublespacing
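If the sjPlot package is unavailable, a similar adjusted odds ratio table can be assembled with base R alone; a minimal sketch:
\singlespacing
In [ ]:
# Adjusted odds ratios and profile-likelihood 95% CIs from the final model
round(exp(cbind(AOR = coef(mva.final.glm), confint(mva.final.glm))), 2)[-1, ]
\doublespacing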
This brief overview of modeling techniques for health data has provided you with the foundation to perform the most common types of analyses in health studies. We have emphasized how important it is to have a clear study objective before conducting data analysis, as it identifies all the important aspects you need to plan and execute your analysis. In particular, by identifying the outcome, you should be able to determine which analysis methodology would be most appropriate. Often you will find that you will be using multiple analysis techniques for different study objectives within the same study. Table 5 summarizes some of the important aspects of each analysis approach.
\singlespacing
\begin{table}[]
\centering
\caption{Summary of different methods}
\label{tab:method-summary}
\begin{tabular}{|l|l|l|l|}
\hline
 & Linear Regression & Logistic Regression & \begin{tabular}[c]{@{}l@{}}Cox Proportional\\ Hazards Model\end{tabular} \\ \hline
\begin{tabular}[c]{@{}l@{}}Outcome\\ Data Type\end{tabular} & Continuous & Binary & \begin{tabular}[c]{@{}l@{}}Time to an Event\\ (possibly censored)\end{tabular} \\ \hline
\begin{tabular}[c]{@{}l@{}}Useful\\ Preliminary\\ Analysis\end{tabular} & Scatterplot & Contingency \& 2x2 Tables & \begin{tabular}[c]{@{}l@{}}Kaplan-Meier Survivor\\ Function Estimate\end{tabular} \\ \hline
\begin{tabular}[c]{@{}l@{}}Presentation\\ Output\end{tabular} & Coefficient & Odds Ratio & Hazard Ratio \\ \hline
R Output & Coefficient & Log Odds Ratio & Log Hazard Ratio \\ \hline
\begin{tabular}[c]{@{}l@{}}Presentation\\ Interpretation\end{tabular} & \begin{tabular}[c]{@{}l@{}}An estimate of the expected\\ change in the outcome per one\\ unit increase in the covariate,\\ while keeping all other\\ covariates constant\end{tabular} & \begin{tabular}[c]{@{}l@{}}An estimate of the fold\\ change in the odds of\\ the outcome per unit\\ increase in the covariate,\\ while keeping all other\\ covariates constant\end{tabular} & \begin{tabular}[c]{@{}l@{}}An estimate of the fold\\ change in the hazards of\\ the outcome per unit\\ increase in the covariate,\\ while keeping all other\\ covariates constant\end{tabular} \\ \hline
\end{tabular}
\end{table}
\doublespacing
Fortunately, R's framework for conducting these analyses is very similar across the different types of techniques, and this framework often extends more generally to other, more complex models (including machine learning algorithms) and data structures (including dependent/correlated data such as longitudinal data). We have highlighted some areas of concern to which careful attention should be paid, including missing data, collinearity, model misspecification, and outliers. These items will be looked at more closely in Chapter 2.6.