Regression


In [22]:
source("https://raw.githubusercontent.com/eogasawara/mylibrary/master/myPreprocessing.R")

loadlibrary("MASS")
loadlibrary("plotly")
loadlibrary("reshape2")

plot_size(4, 3)

Dataset

Both independent and dependent variables are numeric.


In [23]:
exp_table(t(sapply(Boston, class)))
exp_table(Boston)
?MASS::Boston


crimzninduschasnoxrmagedisradtaxptratioblacklstatmedv
numericnumericnumericintegernumericnumericnumericnumericintegernumericnumericnumericnumericnumeric
crimzninduschasnoxrmagedisradtaxptratioblacklstatmedv
0.0063218 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7

Fitting a first model

Explaining house price using lower status population variable.

$lm$ builds the model.

$summary$ describes the significance of the built model.


In [24]:
lm.fit = lm(medv ~ lstat, data = Boston)

summary(lm.fit)


Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.168  -3.990  -1.318   2.034  24.500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,	Adjusted R-squared:  0.5432 
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

prediction

The $predict$ function makes predictions from the adjusted model.

The predictions can be presented with either $confidence$ and $prediction$ intervals.

These intervals can be analyzed at https://statisticsbyjim.com/hypothesis-testing/confidence-prediction-tolerance-intervals/


In [25]:
predict(lm.fit, data.frame(lstat =(c(5, 10, 15))), interval = "confidence")
predict(lm.fit, data.frame(lstat =(c(5, 10, 15))), interval = "prediction")


fitlwrupr
29.8035929.0074130.59978
25.0533524.4741325.63256
20.3031019.7315920.87461
fitlwrupr
29.80359 17.56567542.04151
25.05335 12.82762637.27907
20.30310 8.07774232.52846

Plotting the regression model

It is a good practice to plot the regression model. It enables us to have a feeling of its quality.


In [26]:
axis_x <- seq(min(Boston$lstat), max(Boston$lstat), by = 0.5)
axis_y <- predict(lm.fit, data.frame(lstat=axis_x))

data_adj = data.frame(lstat=axis_x, medv=axis_y)

ggplot(Boston) + geom_point(aes(x = lstat, y = medv)) + geom_line(data=data_adj,aes(x=lstat,y=medv), color="Blue") + theme_bw(base_size = 10)


Polynomial regression

It is possible to introduce polynomial dimensions of independent data.


In [27]:
lm.fit_p =lm(medv~lstat+I(lstat^2), data=Boston)
summary (lm.fit_p)


Call:
lm(formula = medv ~ lstat + I(lstat^2), data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2834  -3.8313  -0.5295   2.3095  25.4148 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 42.862007   0.872084   49.15   <2e-16 ***
lstat       -2.332821   0.123803  -18.84   <2e-16 ***
I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.524 on 503 degrees of freedom
Multiple R-squared:  0.6407,	Adjusted R-squared:  0.6393 
F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16

Plotting the polynomial regression model

It seems to be better adjusted with the data. Is it significant?


In [28]:
axis_x <- seq(min(Boston$lstat), max(Boston$lstat), by = 0.5)
axis_x2 <- axis_x^2
axis_y <- predict(lm.fit_p, data.frame(lstat=axis_x, `I(lstat^2)`=axis_x2))


data_adj = data.frame(lstat=axis_x, medv=axis_y)
ggplot(Boston) + geom_point(aes(x = lstat, y = medv)) + geom_line(data=data_adj,aes(x=lstat,y=medv), color="Blue") + theme_bw(base_size = 10)


ANOVA test

It is possible to check if a built model is significantly better than another model using the ANOVA test.

The null hypothesis is that both models are not different ($\operatorname{p-value} > 5\%$). The alternative hypothesis says that they are different ($\operatorname{p-value} < 5\%$).


In [29]:
anova(lm.fit, lm.fit_p)


Res.DfRSSDfSum of SqFPr(>F)
504 19472.38 NA NA NA NA
503 15347.24 1 4125.138 135.1998 7.630116e-28

Multiple regression

It is possible to use more than one dimension as independent data for the regression model.


In [30]:
lm.fit2 =lm(medv~lstat+age, data=Boston)
summary (lm.fit2)


Call:
lm(formula = medv ~ lstat + age, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.981  -3.978  -1.283   1.968  23.158 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
age          0.03454    0.01223   2.826  0.00491 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.173 on 503 degrees of freedom
Multiple R-squared:  0.5513,	Adjusted R-squared:  0.5495 
F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16

Checking the significance of the model


In [31]:
anova(lm.fit ,lm.fit2)


Res.DfRSSDfSum of SqFPr(>F)
504 19472.38 NA NA NA NA
503 19168.13 1 304.2528 7.984043 0.004906776

Plotting the surface of regression


In [32]:
axis_x <- seq(min(Boston$lstat), max(Boston$lstat), by = 0.5)
axis_y <- seq(min(Boston$age), max(Boston$age), by = 0.5)

lm_surface <- expand.grid(lstat = axis_x, age = axis_y, KEEP.OUT.ATTRS = F)
lm_surface$medv <- predict.lm(lm.fit2, newdata = lm_surface)
lm_surface <- acast(lm_surface, age ~ lstat, value.var = "medv") #y ~ x

b3d_plot <- plot_ly(Boston, 
                     x = ~Boston$lstat, 
                     y = ~Boston$age, 
                     z = ~Boston$medv,
                     text = Boston$medv, 
                     type = "scatter3d",
                     mode = "markers"
)


b3d_plot <- add_trace(p = b3d_plot,
                       z = lm_surface,
                       x = axis_x,
                       y = axis_y,
                       type = "surface")

b3d_plot


Warning message:
"'surface' objects don't have these attributes: 'mode'
Valid attributes include:
'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'selectedpoints', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'z', 'x', 'y', 'text', 'surfacecolor', 'cauto', 'cmin', 'cmax', 'colorscale', 'autocolorscale', 'reversescale', 'showscale', 'colorbar', 'contours', 'hidesurface', 'lightposition', 'lighting', '_deprecated', 'xcalendar', 'ycalendar', 'zcalendar', 'scene', 'idssrc', 'customdatasrc', 'hoverinfosrc', 'zsrc', 'xsrc', 'ysrc', 'textsrc', 'surfacecolorsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
"Warning message:
"'surface' objects don't have these attributes: 'mode'
Valid attributes include:
'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'selectedpoints', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'z', 'x', 'y', 'text', 'surfacecolor', 'cauto', 'cmin', 'cmax', 'colorscale', 'autocolorscale', 'reversescale', 'showscale', 'colorbar', 'contours', 'hidesurface', 'lightposition', 'lighting', '_deprecated', 'xcalendar', 'ycalendar', 'zcalendar', 'scene', 'idssrc', 'customdatasrc', 'hoverinfosrc', 'zsrc', 'xsrc', 'ysrc', 'textsrc', 'surfacecolorsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
"

Logistic Regression

In this example the predicted dependent variable is categorical.


In [33]:
set.seed(1)
exp_table(t(sapply(iris, class)))
exp_table(iris)

??datasets::iris


Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
numericnumericnumericnumericfactor
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa

To make the problem simpler, let us assume that we intend to predict if a species is $versicolor$ or if it is $other$ species.


In [34]:
data <- iris
data$versicolor <- as.integer(data$Species=="versicolor")
data$Species <- c('other', 'versicolor')[data$versicolor+1]

Using preprocessing functions, we separate both training and test data.


In [35]:
sampler <- sample.random(data)
train <- sampler$sample
test <- sampler$residual
head(train)


Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpeciesversicolor
405.1 3.4 1.5 0.2 other 0
565.7 2.8 4.5 1.3 versicolor1
855.4 3.0 4.5 1.5 versicolor1
1346.3 2.8 5.1 1.5 other 0
304.7 3.2 1.6 0.2 other 0
1317.4 2.8 6.1 1.9 other 0

This dataset is unbalanced using this perspective. If the prediction for $versicolor$ is higher than its probability, it can be classified as $versicolor$.


In [36]:
t <- mean(train$versicolor)
print(t)


[1] 0.35

The creation of the logistic regression model using all independent variables uses $glm$ function.


In [37]:
pred <- glm(versicolor ~ .-Species, data=train, family = binomial)

The quality of adjustment using training data is measured using the confusion table.


In [38]:
res <- predict(pred, train, type="response")
res <- as.integer(res >= t)
table(res, train$versicolor)


   
res  0  1
  0 60 10
  1 18 32

The quality of prediction using the test data is measured using the confusion table.


In [39]:
res <- predict(pred, test, type="response")
res <- res >= t
table(res, test$versicolor)


       
res      0  1
  FALSE 13  2
  TRUE   9  6

Creation of the logistic regression model using the independent variables with lower entropy during binning transformation.


In [40]:
pred <- glm(versicolor ~ Petal.Length + Petal.Width, data=train, family = binomial)

The quality of adjustment using training data is measured using the confusion table.


In [41]:
res <- predict(pred, train, type="response")
res <- as.integer(res >= t)
table(res, train$versicolor)


   
res  0  1
  0 60  8
  1 18 34

The quality of prediction using the test data is measured using the confusion table.


In [42]:
res <- predict(pred, test, type="response")
res <- as.integer(res >= t)
table(res, test$versicolor)


   
res  0  1
  0 16  0
  1  6  8

In [ ]: