In [22]:
    
source("https://raw.githubusercontent.com/eogasawara/mylibrary/master/myPreprocessing.R")
loadlibrary("MASS")
loadlibrary("plotly")
loadlibrary("reshape2")
plot_size(4, 3)
    
In [23]:
    
exp_table(t(sapply(Boston, class)))
exp_table(Boston)
?MASS::Boston
    
    
    
In [24]:
    
lm.fit = lm(medv ~ lstat, data = Boston)
summary(lm.fit)
    
    
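The $predict$ function makes predictions from the fitted model.
The predictions can be reported with either $confidence$ or $prediction$ intervals: a confidence interval bounds the mean response at a given $lstat$, whereas a prediction interval bounds a single new observation and is therefore wider.
Both kinds of interval are explained at https://statisticsbyjim.com/hypothesis-testing/confidence-prediction-tolerance-intervals/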
In [25]:
    
predict(lm.fit, data.frame(lstat =(c(5, 10, 15))), interval = "confidence")
predict(lm.fit, data.frame(lstat =(c(5, 10, 15))), interval = "prediction")
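To make the difference between the two interval types concrete, the following minimal sketch refits the same model and compares the interval widths at the three $lstat$ values used above. The names `fit`, `new_data`, `conf_width`, and `pred_width` are illustrative and not part of the original notebook.

```r
library(MASS)  # provides the Boston data set

# Same simple linear model as in the cell above
fit <- lm(medv ~ lstat, data = Boston)
new_data <- data.frame(lstat = c(5, 10, 15))

conf_int <- predict(fit, new_data, interval = "confidence")
pred_int <- predict(fit, new_data, interval = "prediction")

# Width of each interval (upper bound minus lower bound)
conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]

print(cbind(lstat = new_data$lstat, conf_width, pred_width))
```

Both calls return the same point estimates in the `fit` column; only the bounds differ, and the prediction interval is the wider one because it also accounts for the residual variance of an individual observation.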
    
    
    
In [26]:
    
axis_x <- seq(min(Boston$lstat), max(Boston$lstat), by = 0.5)
axis_y <- predict(lm.fit, data.frame(lstat=axis_x))
data_adj = data.frame(lstat=axis_x, medv=axis_y)
ggplot(Boston) + geom_point(aes(x = lstat, y = medv)) + geom_line(data=data_adj,aes(x=lstat,y=medv), color="Blue") + theme_bw(base_size = 10)
    
    
In [27]:
    
lm.fit_p <- lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(lm.fit_p)
    
    
In [28]:
    
axis_x <- seq(min(Boston$lstat), max(Boston$lstat), by = 0.5)
# predict() evaluates I(lstat^2) from lstat itself, so only lstat has to be supplied
axis_y <- predict(lm.fit_p, data.frame(lstat=axis_x))
data_adj = data.frame(lstat=axis_x, medv=axis_y)
ggplot(Boston) + geom_point(aes(x = lstat, y = medv)) + geom_line(data=data_adj,aes(x=lstat,y=medv), color="Blue") + theme_bw(base_size = 10)
    
    
In [29]:
    
anova(lm.fit, lm.fit_p)
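The comparison above is an F-test between two nested models. As a hedged illustration of what $anova$ computes, the sketch below rebuilds the F statistic and its p-value from the residual sums of squares. The variable names (`fit_linear`, `fit_quad`, `f_manual`, and so on) are introduced here for illustration only.

```r
library(MASS)  # provides the Boston data set

fit_linear <- lm(medv ~ lstat, data = Boston)
fit_quad   <- lm(medv ~ lstat + I(lstat^2), data = Boston)

# Residual sum of squares of each model
rss_linear <- sum(residuals(fit_linear)^2)
rss_quad   <- sum(residuals(fit_quad)^2)

# Number of extra parameters in the larger model
df_diff <- df.residual(fit_linear) - df.residual(fit_quad)

# F = (drop in RSS per extra parameter) / (residual variance of larger model)
f_manual <- ((rss_linear - rss_quad) / df_diff) /
  (rss_quad / df.residual(fit_quad))
p_manual <- pf(f_manual, df_diff, df.residual(fit_quad), lower.tail = FALSE)

# Values reported by anova() for the same comparison
tab <- anova(fit_linear, fit_quad)
f_anova <- tab[2, "F"]
p_anova <- tab[2, "Pr(>F)"]

print(c(f_manual = f_manual, f_anova = f_anova))
```

A large F value with a small p-value indicates that the additional term reduces the residual error by more than would be expected by chance.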
    
    
In [30]:
    
lm.fit2 <- lm(medv ~ lstat + age, data = Boston)
summary(lm.fit2)
    
    
In [31]:
    
anova(lm.fit ,lm.fit2)
    
    
In [32]:
    
axis_x <- seq(min(Boston$lstat), max(Boston$lstat), by = 0.5)
axis_y <- seq(min(Boston$age), max(Boston$age), by = 0.5)
lm_surface <- expand.grid(lstat = axis_x, age = axis_y, KEEP.OUT.ATTRS = FALSE)
lm_surface$medv <- predict(lm.fit2, newdata = lm_surface)
lm_surface <- acast(lm_surface, age ~ lstat, value.var = "medv") # rows = y (age), columns = x (lstat)
# Formulas are evaluated inside Boston, so columns are referenced by name
b3d_plot <- plot_ly(Boston, 
                     x = ~lstat, 
                     y = ~age, 
                     z = ~medv,
                     text = ~medv, 
                     type = "scatter3d",
                     mode = "markers"
)
b3d_plot <- add_trace(p = b3d_plot,
                       z = lm_surface,
                       x = axis_x,
                       y = axis_y,
                       type = "surface")
b3d_plot
    
    
    
In [33]:
    
set.seed(1)
exp_table(t(sapply(iris, class)))
exp_table(iris)
?datasets::iris
    
    
    
To simplify the problem, we only predict whether a flower's species is $versicolor$ or some $other$ species.
In [34]:
    
data <- iris
data$versicolor <- as.integer(data$Species=="versicolor")
data$Species <- c('other', 'versicolor')[data$versicolor+1]
    
Using the preprocessing functions, we split the data into training and test sets.
In [35]:
    
sampler <- sample.random(data)
train <- sampler$sample
test <- sampler$residual
head(train)
    
    
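Viewed as a binary problem, the dataset is unbalanced: only about one third of the observations are $versicolor$. We therefore use the proportion of $versicolor$ in the training set as the classification threshold: an observation is classified as $versicolor$ when its predicted probability is at least that proportion.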
In [36]:
    
t <- mean(train$versicolor)
print(t)
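The thresholding rule described above can be written compactly. The notation is an assumption added for illustration and does not appear in the original: $y_i$ is the 0/1 $versicolor$ indicator of training observation $i$, $n$ is the training set size, and $\hat{p}(x)$ is the probability predicted for an observation $x$.

```latex
t = \frac{1}{n}\sum_{i=1}^{n} y_i,
\qquad
\hat{y}(x) =
\begin{cases}
1 & \text{if } \hat{p}(x) \ge t \\
0 & \text{otherwise}
\end{cases}
```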
    
    
The logistic regression model with all independent variables is built with the $glm$ function.
In [37]:
    
pred <- glm(versicolor ~ .-Species, data=train, family = binomial)
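As a small illustration of what a binomial $glm$ returns, the sketch below checks that the probabilities given by `type = "response"` are the logistic transform of the linear predictor given by `type = "link"`. To stay self-contained it fits a two-variable model on the full $iris$ data rather than on the training split, and the names `model`, `link`, `prob`, and `manual` are illustrative.

```r
data <- iris
data$versicolor <- as.integer(data$Species == "versicolor")

# Assumption: a small model on the full data set, for illustration only
model <- glm(versicolor ~ Petal.Length + Petal.Width,
             data = data, family = binomial)

link <- predict(model, data, type = "link")      # linear predictor (log-odds)
prob <- predict(model, data, type = "response")  # predicted probability

# Logistic (inverse logit) transform applied by hand
manual <- 1 / (1 + exp(-link))

print(head(cbind(link, prob, manual)))
```

The same transform is available in base R as `plogis(link)`.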
    
The quality of the fit on the training data is measured with a confusion matrix.
In [38]:
    
res <- predict(pred, train, type="response")
res <- as.integer(res >= t)
table(res, train$versicolor)
    
    
The quality of the predictions on the test data is measured with a confusion matrix.
In [39]:
    
res <- predict(pred, test, type="response")
res <- as.integer(res >= t)
table(res, test$versicolor)
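A confusion matrix can be summarized with a few standard rates. The following sketch is self-contained and therefore makes some assumptions that differ from the notebook: it uses a base R 80/20 random split in place of `sample.random`, and all names (`idx`, `threshold`, `cm`, and so on) are illustrative.

```r
set.seed(1)
data <- iris
data$versicolor <- as.integer(data$Species == "versicolor")
data$Species <- NULL

# Assumption: simple 80/20 random split with base R
idx <- sample(nrow(data), size = round(0.8 * nrow(data)))
train <- data[idx, ]
test <- data[-idx, ]

# Threshold = proportion of versicolor in the training set
threshold <- mean(train$versicolor)
model <- glm(versicolor ~ ., data = train, family = binomial)
prob <- predict(model, test, type = "response")

# Fixed factor levels keep the table 2 x 2 even if a class is never predicted
predicted <- factor(as.integer(prob >= threshold), levels = c(0, 1))
actual <- factor(test$versicolor, levels = c(0, 1))
cm <- table(predicted = predicted, actual = actual)

accuracy <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # true positive rate
specificity <- cm["0", "0"] / sum(cm[, "0"])  # true negative rate

print(cm)
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

Reporting sensitivity and specificity alongside accuracy is useful here because the classes are unbalanced, so accuracy alone can hide poor performance on $versicolor$.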
    
    
Next, we build a logistic regression model using only the independent variables that showed lower entropy in the binning transformation.
In [40]:
    
pred <- glm(versicolor ~ Petal.Length + Petal.Width, data=train, family = binomial)
    
The quality of the fit on the training data is measured with a confusion matrix.
In [41]:
    
res <- predict(pred, train, type="response")
res <- as.integer(res >= t)
table(res, train$versicolor)
    
    
The quality of the predictions on the test data is measured with a confusion matrix.
In [42]:
    
res <- predict(pred, test, type="response")
res <- as.integer(res >= t)
table(res, test$versicolor)
    
    