Load functions
In [1]:
# Loads libraries and automatically installs them if needed.
# Source code: https://stackoverflow.com/questions/19596359/install-package-library-if-not-installed
usePackage <- function(p) {
  if (!is.element(p, installed.packages()[, 1]))
    install.packages(p, dependencies = TRUE)
  require(p, character.only = TRUE)
}
In [2]:
usePackage('GGally')
usePackage('ggplot2')
usePackage('ppcor')
usePackage('openintro')
usePackage('stargazer')
usePackage('mctest')
usePackage('yhat')
# usePackage('lmtest') # This package provides the Breusch-Pagan test used further down. It did not work in a Jupyter notebook on Mac but it does work on Windows.
In [3]:
# Look at the data dictionary
help(ncbirths)
In [4]:
# From here onwards we will assign the data set to a new variable so the original dataframe remains untouched and in its original condition.
df <- ncbirths
In [5]:
# Look at the structure of the data
# Notice how there are continuous and categorical variables.
# Categorical variables are known as factor type data in R. The sample below has factors with different levels.
# These factors can be given an order, which makes them ordinal variables.
str(df)
In [6]:
# Look at a sample of the data
head(df)
In [7]:
# Summary statistics
summary(df)
Creating ordinal variables
In [8]:
# Creating levels for factor type data
df$lowbirthweight <- factor(df$lowbirthweight,
                            levels = c('low', 'not low'),
                            labels = c('low', 'not low'),
                            ordered = TRUE) # If ordered is left as FALSE then there won't be any order beyond the alphabetic order.
df$habit <- factor(df$habit,
                   levels = c('nonsmoker', 'smoker'),
                   labels = c('nonsmoker', 'smoker'),
                   ordered = TRUE) # If ordered is left as FALSE then there won't be any order beyond the alphabetic order.
In [9]:
# Notice how the 'habit' column now has an order to its values.
# This is useful for understanding regression. We'll use the variable 'habit' later on.
str(df)
In [10]:
# Finally, don't be afraid of looking at the data in Excel.
# Notice that the ordinal nature of the variables is not apparent there.
write.csv(df, "df.csv", row.names=FALSE)
Cleaning up the dataset
In [11]:
# Drop the missing data from the dataframe for the purpose of this exercise
# How to report and treat missing data is important in your analysis. We won't discuss this in detail this session.
# To keep things simple we will drop all rows with missing data.
# Source code: https://stackoverflow.com/questions/4862178/remove-rows-with-all-or-some-nas-missing-values-in-data-frame#4862502
df <- na.omit(df) # We are overwriting the df.
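As a minimal sketch of how you might report missingness before dropping rows (using base R only, on the untouched ncbirths dataframe since df has just had na.omit applied):
In [ ]:
# Hedged sketch: count missing values per column of the original data.
colSums(is.na(ncbirths))
# Proportion of rows with no missing values at all
mean(complete.cases(ncbirths))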
In [12]:
# Notice how we have 200 rows removed from the total of 1000.
str(df)
Plot your data
In [13]:
# The ggpairs function (from the GGally package) allows for seeing the relationship of all continuous variables in two by two comparisons and the distribution of each variable.
# Drop the non-continuous columns from the data frame.
drops <- c('mature', 'premie', 'marital', 'lowbirthweight', 'gender', 'habit', 'whitemom')
df_cor <- df[ , !(names(df) %in% drops)]
# Print the pairs plot (scatterplots, correlations and distributions)
print(ggpairs(df_cor))
In [14]:
# First plot the relationship you want to use in your regression
p1 <- ggplot(df, aes(y = weight, x = weeks)) +
  geom_point() +
  geom_smooth(method = "lm")
p1
ggsave('p1.pdf', plot = p1)
In [15]:
m1 <- lm(weight ~ weeks, data = df)
In [16]:
summary(m1)
# Adjusted R squared is the proportion of the variance in the outcome that's explained by the model,
# ranging from 0 to 1 (often quoted as 0 to 100%). Above 0.85 is usually considered good, though values that high are rarely achieved in clinical work.
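If you want the adjusted R squared as a single number (for example, to report it programmatically), a quick base R sketch:
In [ ]:
# Pull the adjusted R squared directly out of the model summary.
summary(m1)$adj.r.squared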
In [17]:
# stargazer is a useful package for creating regression tables for presentation.
# You can customize it using this link: https://www.jakeruss.com/cheatsheets/stargazer/
# Simply copy/paste the content of the HTML file into Word. Or you can use R Markdown.
invisible(capture.output(stargazer(m1, out = 'result1.html', model.names = FALSE))) # Change model.names to TRUE to display 'OLS' above the column.
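If you just want to inspect the table in the notebook rather than export HTML, stargazer can also print plain text (a minimal sketch):
In [ ]:
# Print the same regression table as plain text in the notebook output.
stargazer(m1, type = "text", model.names = FALSE)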
Model quality
Akaike Information Criterion
In [18]:
# A measure of model quality and simplicity. The lower the value the better.
# You want to select a model with the least complexity that explains the most variance.
AIC(m1)
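AIC is mainly useful for comparing candidate models fitted to the same data. As a hedged sketch, here m1 is compared against a hypothetical intercept-only (null) model:
In [ ]:
# Sketch: compare the AIC of m1 with an intercept-only model; the lower AIC is preferred.
m_null <- lm(weight ~ 1, data = df) # hypothetical null model, for illustration only
AIC(m_null, m1)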
Assumptions of the linear regression
In [19]:
par(mfrow=c(2,2)) # to put the plots 2 by 2.
# The first plot (residuals vs fitted) is used to identify heteroscedasticity; you want the residuals scattered randomly with no pattern.
# The second plot (normal Q-Q) checks whether the residuals are normally distributed.
# You can plot these in ggplot for a better visualization as well.
plot(m1)
# Formal check for heteroscedasticity with the Breusch-Pagan test. A non-significant p value suggests constant variance.
# bptest(m1) # Currently doesn't work in Jupyter on Mac, but worked in the Windows version of Anaconda's Jupyter Lab.
# Cook's distance measures the influence of a given data point.
# These influential points can be omitted if beyond a certain threshold of influence.
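As a hedged sketch, here is an alternative variance test from the 'car' package (an assumption: that car installs cleanly in your environment) plus a way to flag influential points by Cook's distance; the 4/n cut-off is just one common rule of thumb:
In [ ]:
# Non-constant variance test as an alternative when lmtest::bptest is unavailable.
usePackage('car')
ncvTest(m1)
# Cook's distance for each observation, flagging points above a common 4/n threshold.
cooks <- cooks.distance(m1)
which(cooks > 4 / nrow(df))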
In [20]:
# You can also include categorical variables such as smoking habit (yes/no).
# We made 'habit' an ordinal variable for this purpose. Non-smokers became the reference.
# If a variable has more than two categories you'd need dummy variables (see the sketch after this cell).
m2 <- lm(weight ~ weeks + habit, data = df)
invisible(capture.output(stargazer(m2, out = 'result2.html')))
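To see how R expands a factor with more than two levels into dummy variables, a small sketch using a hypothetical three-level factor (not part of ncbirths):
In [ ]:
# Hypothetical example: a three-level factor is expanded into two dummy columns,
# with the first level acting as the reference category.
toy <- data.frame(grp = factor(c('a', 'b', 'c', 'a')))
model.matrix(~ grp, data = toy)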
In [21]:
AIC(m2)
Tools to assess existence of multicollinearity
In [22]:
# Notice how mother's age and father's age are highly correlated in the graph above. This is called collinearity, which can lead to distorted estimates.
# We will discuss this again further down.
m3 <- lm(weight ~ weeks + mage + fage, data = df)
invisible(capture.output(stargazer(m3, out = 'result3.html')))
In [23]:
# Drop the non-continuous columns from the data frame.
# We are about to assess all predictor variables that are continuous, so we will remove the categorical ones and
# 'weight', which is the outcome.
drops <- c('mature', 'premie', 'marital', 'lowbirthweight', 'gender', 'habit', 'whitemom', 'weight')
df_cor <- df[ , !(names(df) %in% drops)]
In [24]:
print(omcdiag(x = df_cor, y = df$weight))
# If you see an error here, it is most likely because the outcome variable was left in the dataframe,
# in which case the outcome variable is also listed as a predictor.
# The error tells us one of the variables has exactly the same characteristics as the outcome variable.
In [25]:
# VIF and TOL measure the same thing: TOL is simply 1/VIF. You can set the thresholds or leave the default values.
print(imcdiag(x = df_cor, y = df$weight, vif = 2.5, tol = 0.40, all = TRUE))
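As a sketch of the VIF/TOL relationship, you can compute the VIF for one predictor by hand: regress it on the other predictors and take 1 / (1 - R squared); TOL is then just the reciprocal. The choice of 'mage' here is only for illustration.
In [ ]:
# Manual VIF for 'mage' in model m3: regress mage on the other predictors.
r2_mage <- summary(lm(mage ~ weeks + fage, data = df))$r.squared
vif_mage <- 1 / (1 - r2_mage)
tol_mage <- 1 / vif_mage
c(VIF = vif_mage, TOL = tol_mage)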
In [26]:
print(mc.plot(x = df_cor, y = df$weight, vif = 2.5))
Tools to support interpreting multiple regression in the face of multicollinearity
In [27]:
# Understand package
help('yhat')
In [28]:
# Use sample code of the package in our data set
help('plotCI.yhat')
In [29]:
## Regression
lm.out <- m3 # You can also just type in the model here.
## Calculate regression metrics
regrOut <- calc.yhat(lm.out)
## Bootstrap results
usePackage("boot")
boot.out <- boot(df, boot.yhat, 100, lmOut = lm.out, regrout0 = regrOut)
## Evaluate bootstrap results
result <- booteval.yhat(regrOut, boot.out, bty = "perc")
## Plot results
plotCI.yhat(regrOut$PredictorMetrics[-nrow(regrOut$PredictorMetrics), ],
            result$upperCIpm, result$lowerCIpm,
            pid = which(colnames(regrOut$PredictorMetrics)
                        %in% c("Beta", "rs", "CD:0", "CD:1", "CD:2", "GenDom", "Pratt", "RLW") == TRUE),
            nr = 3, nc = 3)
In [30]:
regrOut
k-fold validation of models
In [31]:
# Choosing the best model with the lowest AIC. AIC balances how well the model fits against its complexity; the lower the better.
# Source code: http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/154-stepwise-regression-essentials-in-r/
model_kfold <- function(df){
  usePackage("caret")
  # Set up repeated k-fold cross-validation
  train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
  # Train the model
  model <- train(weight ~ weeks, data = df, # Instead of 'weight ~ .', list the specific predictors of interest.
                 method = "lm", # Replace "lm" with "lmStepAIC" and see what happens.
                 trControl = train.control)
  detach(package:caret)
  return(model)
}
# When you replace the 'lm' method with 'lmStepAIC' in the above function, the model will be chosen by stepwise selection based on AIC (see the sketch after this cell).
# We won't be discussing how to review the different candidate models. Our focus is on the final chosen model.
# Capture modeling
m4 <- model_kfold(df)
# Capture final model
fit4 <- m4$finalModel
# Capture RMSE
RMSE4 <- m4$results[, 2]
# Capture MAE
MAE4 <- m4$results[, 4]
# You can just click the blue bar on the left of the output to close the output of the modeling.
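As a hedged sketch of the stepwise variant mentioned above: method = "lmStepAIC" in caret requires the MASS package and, by default, prints each step of the AIC-based selection. The predictors listed here are only an illustrative starting set.
In [ ]:
# Sketch: the same cross-validation set-up, but letting caret/MASS pick predictors by stepwise AIC.
usePackage("caret")
train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
m_step <- train(weight ~ weeks + mage + fage + habit, data = df,
                method = "lmStepAIC",
                trControl = train.control,
                trace = FALSE) # trace = FALSE suppresses the step-by-step printout
m_step$finalModel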
In [32]:
invisible(capture.output(stargazer(fit4, out = 'result4.html')))
In [33]:
# Model accuracy
m4$results
# Final model coefficients
m4$finalModel
# Summary of the model
summary(m4$finalModel)