Cars

Model


In [ ]:
library(ggplot2)
library(dplyr)
library(tidyr)

In [ ]:
options(repr.plot.width = 10, repr.plot.height = 6)

In [ ]:
df <- read.csv("data/cars.tidy.csv", stringsAsFactors = FALSE)
df$price_in_1000 <- df$price / 1000

Correlation of features

Mileage in city vs price


In [ ]:
cor(df$mileage_city, df$price_in_1000)

In [ ]:
ggplot(df) + aes(mileage_city, price_in_1000) + geom_point()

Mileage on highway vs Price


In [ ]:
cor(df$mileage_highway, df$price_in_1000)

Engine capacity vs Mileage in city


In [ ]:
cor(df$engine, df$mileage_city)

Price vs Mileage in city


In [ ]:
cor(df$price_in_1000, df$mileage_city)

Gears & Engine capacity vs Mileage on Highway


In [ ]:
cor(df$engine * df$gears, df$mileage_highway)

In [ ]:
# Ensure mileage_city is numeric (read.csv may have loaded it as character)
df$mileage_city <- as.numeric(df$mileage_city)
head(df$mileage_city)

Linear Regression

Simple Linear Regression

Simple linear regression is an approach for predicting a quantitative response using a single feature (or "predictor" or "input variable"). It takes the following form:

y = β0 + β1x

What does each term represent?

  • y is the response
  • x is the feature
  • β0 is the intercept
  • β1 is the coefficient for x

Together, β0 and β1 are called the model coefficients. To create the model, we must "learn" the values of these coefficients. Once we've learned them, we can use the model to predict price!

Estimating ("Learning") Model Coefficients

Generally speaking, coefficients are estimated using the least squares criterion, which means we find the line (mathematically) that minimizes the sum of squared residuals (or "sum of squared errors").

What elements are present in the diagram?

  • The black dots are the observed values of x and y.
  • The blue line is our least squares line.
  • The red lines are the residuals, which are the distances between the observed values and the least squares line.

How do the model coefficients relate to the least squares line?

  • β0 is the intercept (the value of y when x=0)
  • β1 is the slope (the change in y divided by the change in x); a by-hand version of these calculations is sketched below.
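
As a quick numerical check, the least squares coefficients for a single predictor can be computed directly from the sample covariance and variance; the result should match the lm() fit in the next cell (a small sketch, assuming mileage_city and price_in_1000 contain no missing values).

In [ ]:
# By-hand least squares estimates:
#   beta1 = cov(x, y) / var(x)
#   beta0 = mean(y) - beta1 * mean(x)
x <- df$mileage_city
y <- df$price_in_1000
beta1 <- cov(x, y) / var(x)
beta0 <- mean(y) - beta1 * mean(x)
c(intercept = beta0, slope = beta1)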


In [ ]:
model <- lm(price_in_1000 ~ mileage_city, data = df)
model

Interpreting Model Coefficients

How do we interpret the mileage coefficient (β1)? A one-unit increase in city mileage is associated with a decrease of about 1153 in price.

Note that if an increase in mileage were associated with an increase in price, β1 would be positive.
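
The fitted coefficients can be read off directly with coef(). Note that the model's response is price_in_1000, so the effect on raw price is 1000 times the printed slope.

In [ ]:
# Extract the fitted coefficients by name
coef(model)                         # intercept and slope on the price_in_1000 scale
coef(model)["mileage_city"] * 1000  # slope expressed in original price units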

Using the Model for Prediction

y = β0 + β1x
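
With the coefficients learned, predict() applies the equation above to new data. The mileage values below are hypothetical, just to illustrate the call.

In [ ]:
# Predict price (in thousands) for some hypothetical city-mileage values
new_cars <- data.frame(mileage_city = c(10, 15, 20))
predict(model, newdata = new_cars)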

In [ ]:
summary(model)

Understanding the Output

1. Residuals: The residuals are the difference between the actual values of the variable you're predicting and the values predicted by your regression (y − ŷ). For most regressions you want your residuals to look like a normal distribution when plotted. If the residuals are normally distributed, the mean of the difference between the predictions and the actual values is close to 0 (good), and when we miss, we miss both short and long of the actual value, with the likelihood of a miss getting smaller as the distance from the actual value gets larger.

   Think of it like a dartboard. A good model hits the bullseye some of the time (but not every time). When it doesn't hit the bullseye, it misses in all of the other segments evenly (i.e. not just in the 16 segment), and it misses closer to the bullseye rather than out on the edges of the board.

2. Significance Stars: The stars are shorthand for significance levels, with the number of asterisks displayed according to the computed p-value: *** for high significance down to * for low significance.

3. Estimated Coefficient: The estimated coefficient is the value of the slope calculated by the regression. It might seem a little confusing that the Intercept also has a value, but just think of it as a slope that is always multiplied by 1. This number will obviously vary with the magnitude of the variable you feed into the regression, but it's always good to spot-check it to make sure it seems reasonable.

4. Standard Error of the Coefficient Estimate: A measure of the variability in the estimate of the coefficient. Lower is better, but the number is relative to the value of the coefficient. As a rule of thumb, you'd like this value to be at least an order of magnitude less than the coefficient estimate.

5. t-value of the Coefficient Estimate: A score that measures whether or not the coefficient for this variable is meaningful for the model. You probably won't use this value itself, but know that it is used to calculate the p-value and the significance levels.

6. Variable p-value: The probability that the variable is NOT relevant. You want this number to be as small as possible. If it is really small, R will display it in scientific notation.

7. Significance Legend: The more punctuation there is next to your variables, the better. Blank = bad, dots = pretty good, stars = good, more stars = very good.

8. Residual Std Error / Degrees of Freedom: The residual standard error is just the standard deviation of your residuals. You'd like this number to be proportional to the residual quantiles in #1; for normally distributed residuals, the 1st and 3rd quartiles should fall within roughly ±1.5 times the residual standard error. The degrees of freedom is the difference between the number of observations in your training sample and the number of variables used in your model (the intercept counts as a variable).

9. R-squared: A metric for evaluating the goodness of fit of your model. Higher is better, with 1 being the best. It corresponds to the amount of variability in what you're predicting that is explained by the model. WARNING: while a high R-squared indicates good correlation, correlation does not always imply causation.

10. F-statistic & resulting p-value: Performs an F-test on the model. This takes the parameters of our model (in our case we have only one) and compares it to a model with fewer parameters. In theory the model with more parameters should fit better; if it doesn't perform better than the model with fewer parameters, the F-test will have a high p-value (the extra parameters are probably not a significant improvement). If the model with more parameters is better, you will have a lower p-value. The DF, or degrees of freedom, pertains to how many variables are in the model.
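
The pieces of that printout can also be pulled out of the summary object programmatically; a quick sketch using the standard components of an lm summary:

In [ ]:
s <- summary(model)
s$coefficients              # estimates, std errors, t-values and p-values (#3-#6)
s$sigma                     # residual standard error (#8)
s$r.squared                 # R-squared (#9)
s$fstatistic                # F-statistic and its degrees of freedom (#10)
quantile(residuals(model))  # residual quantiles (#1)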

How Well Does the Model Fit the Data?

The most common way to evaluate the overall fit of a linear model is by the R-squared value. R-squared is the proportion of variance explained, meaning the proportion of variance in the observed data that is explained by the model, or the reduction in error over the null model. (The null model just predicts the mean of the observed response, and thus it has an intercept and no slope.)

R-squared is between 0 and 1, and higher is better because it means more of the variance is explained by the model.
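
That definition can be checked by hand: R-squared is one minus the model's squared error divided by the squared error of the null model that always predicts the mean. A small sketch using the model fitted above; it should agree with summary(model)$r.squared.

In [ ]:
# R-squared as the reduction in squared error over the null (mean-only) model
# (assumes lm() did not drop any rows for missing values)
ss_res <- sum(residuals(model)^2)
ss_tot <- sum((df$price_in_1000 - mean(df$price_in_1000))^2)
1 - ss_res / ss_tot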

Goodness of fit - R2 score


In [ ]:
model$fitted.values

In [ ]:
resid <- data.frame(model$residuals, model$fitted.values)

In [ ]:
head(resid)

In [ ]:
ggplot(resid) + aes(y = model.residuals, x = model.fitted.values) + geom_point() + stat_smooth()
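
Since the residuals are expected to look roughly normal (see #1 above), a normal Q-Q plot is a useful companion to the scatter above; points close to the reference line suggest approximately normal errors.

In [ ]:
# Normal Q-Q plot of the residuals
qqnorm(model$residuals)
qqline(model$residuals)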

Exercise: plot log(price) vs engine capacity and build a model with it


In [ ]:

Using multiple variables for regression

Gears & Engine capacity vs Mileage on highway


In [ ]:
ggplot(df) + aes(y = mileage_highway, x = engine, color = gears) + geom_point()

Gears & Engine vs Mileage on highway


In [ ]:
model <- lm(mileage_highway ~ gears + engine, data = df)

In [ ]:
summary(model)

In [ ]:
# The "- 1" in the formula drops the intercept (regression through the origin)
model <- lm(mileage_highway ~ gears + engine - 1, data = df)
summary(model)
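
A caution when comparing these two fits: with the intercept dropped, R computes R-squared against a zero baseline rather than the mean, so the printed R-squared is inflated and not comparable with the intercept model's. A likelihood-based criterion such as AIC (lower is better) is a safer side-by-side comparison; a sketch, with model names of my own choosing:

In [ ]:
# Refit both versions under distinct names and compare them with AIC
model_with_intercept <- lm(mileage_highway ~ gears + engine, data = df)
model_no_intercept   <- lm(mileage_highway ~ gears + engine - 1, data = df)
AIC(model_with_intercept, model_no_intercept)  # lower AIC = better trade-off of fit vs complexity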

In [ ]: