Regression Analyses in R

Vahid Mirjalili, Data Mining Researcher

  1. Linear Regression
  2. Ridge Regression
  3. Logistic Regression
  4. Kernel Regression
There are dozens of regression models, and each may fit a particular dataset best. In this article, a handful of regression models are described with examples in **R**.

A 2D dataset is provided and plotted below.


In [2]:
## sessionInfo()

sessionInfo()


Out[2]:
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RCurl_1.95-4.1 bitops_1.0-6  

loaded via a namespace (and not attached):
[1] base64_1.1     digest_0.6.4   evaluate_0.5.5 IRdisplay_0.1  IRkernel_0.1  
[6] jsonlite_0.9.8 rzmq_0.7.0     stringr_0.6.2  uuid_0.1-1    

In [1]:
require(RCurl)
url <- getURL("https://raw.githubusercontent.com/mirjalil/DataScience/master/data/input_4regression-2.csv")
df <- read.csv(text=url, header=F)

par(mar=c(4.7, 4.7, 0.7, 0.7))
plot(df, col=rgb(0.2, 0.4, 0.8, 0.7), pch=19, cex=2, cex.axis=1.7, cex.lab=1.7)   # scatter plot of the raw 2D data


Loading required package: RCurl
Loading required package: bitops

1. Linear Regression


Linear regression assumes that the relationship between the dependent variable (y) and the predictors (features) is linear. The parameters of such a linear model can be estimated via the closed-form ordinary least squares (OLS) solution:

$ \hat{\beta} = (X^T X)^{-1} X^T y $
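
As a minimal sketch, this closed form can be applied directly to the dataset loaded above (assuming the first column of df is the predictor and the second the response); lm() in the next cell should return the same two numbers:

In [ ]:
X <- cbind(1, df[[1]])                           # design matrix: intercept column + predictor
beta.hat <- solve(t(X) %*% X, t(X) %*% df[[2]])  # (X^T X)^{-1} X^T y
beta.hat                                         # intercept and slope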

Linear Regression in R:

  • lm() to generate a linear model, given a formula such as formula=y ~ x
  • coef() returns the coefficients of the linear model built by lm()
  • resid() returns the residual of each data point (demonstrated after the fit below)
  • predict() to get the predicted values

Here, we demonstrate the application of linear regression on a sample dataset with one feature; then we extend the dataset by adding higher-order polynomial terms and build a second regression model.


In [33]:
colnames(df) = c("x", "y")

linear.model <- lm(formula=y ~ x, data=df)
coef(linear.model)

y.pred <- df$x*coef(linear.model)[2] + coef(linear.model)[1]   # manual prediction from the fitted coefficients
head(y.pred)

head(predict(linear.model))

library(lattice)

xyplot(y ~ x, df, pch=19, col=rgb(0.2, 0.4, 0.8, 0.7), cex=2,
       scales=list(cex=1.7),
       xlab=list("", cex=1.7), ylab=list("", cex=1.7),
       main=list("Linear Regression w. 1 Attribute", cex=1.6),
       panel = function(x,y, ...) {
            panel.xyplot(x, y, ...)
            llines(x, predict(linear.model), col="black", lwd=6)
            #llines(x, y.pred, col="black", lwd=6)
       })

#with(df, {
#     plot(x, y, pch=19, col=rgb(0.2, 0.4, 0.8, 0.7), cex=2, xlab="x", ylab="y", cex.axis=1.7, cex.lab=1.7);
#     #lines(x, predict(linear.model), col="black", lwd=6)
#     abline(linear.model, lwd=6)
#    })


Out[33]:
(Intercept)           x 
  1.7008612   0.5616422 
Out[33]:
[1] -0.5457078 -0.4895436 -0.4333794 -0.3772151 -0.3210509 -0.2648867
Out[33]:
         1          2          3          4          5          6 
-0.5457078 -0.4895436 -0.4333794 -0.3772151 -0.3210509 -0.2648867 
Out[33]:
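
The residuals of this fit can be inspected with resid(), as noted in the function list above; a quick sketch:

In [ ]:
res <- resid(linear.model)   # observed y minus fitted y, one value per data point
head(res)
sum(res^2)                   # residual sum of squares of the simple linear fit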

Adding Higher Order Terms

The linear model built above is too simplistic; in fact, the model is under-fitted. To build a more accurate model, we add higher-order polynomial terms such as $ x^2, x^3, x^4, \dots, x^{10} $.


In [3]:
##=============================#
## Adding higher order terms: ##

df$x2 <- df$x^2
df$x3 <- df$x^3
df$x4 <- df$x2^2
df$x5 <- df$x2 * df$x3
df$x6 <- df$x3^2
df$x7 <- df$x3 * df$x4
df$x8 <- df$x4^2
df$x9 <- df$x4 * df$x5
df$x10 <- df$x5^2

## Alternative way ==> poly(df$x, 10)

lm.xtend <- lm(formula=y ~ ., df)

coef(lm.xtend)


Out[3]:
  (Intercept)             x            x2            x3            x4 
 2.496688e-01  8.976677e-01  1.162846e+00 -2.673306e-01 -3.172852e-01 
           x5            x6            x7            x8            x9 
 6.663939e-02  3.919988e-02 -6.092869e-03 -2.128504e-03  1.810536e-04 
          x10 
 4.236224e-05 
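
As the comment in the cell above suggests, poly() can build the same polynomial design more compactly; a minimal sketch using raw (non-orthogonal) polynomials so that the fit matches lm.xtend:

In [ ]:
lm.poly <- lm(y ~ poly(x, 10, raw=TRUE), data=df)
coef(lm.poly)   # same fitted model as lm.xtend; only the coefficient names differ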

In [4]:
##================================##
## Plotting the regression models ##

xyplot(y ~ x, df, pch=19, col=rgb(0.2, 0.4, 0.8, 0.7), cex=2,
       scales=list(cex=1.7),
       xlab=list("x", cex=1.7), ylab=list("y", cex=1.7),
       main=list("Linear Regression w. Polynomial Attributes", cex=1.6),
       panel = function(x,y, ...) {
            panel.xyplot(x, y, ...)
            llines(x, predict(linear.model), col="black", lwd=6, lty=2)
            llines(x, predict(lm.xtend), col="purple", lwd=6, lty=2)
       })


Out[4]:

2. Ridge Regression


Ridge regression is a regularized regression model that penalizes model complexity via the sum of squared model coefficients.

$ \hat{\beta} = (X^T X + \lambda I)^{-1} X^T y $

where $ \lambda $ is the ridge constant and controls how strongly the model coefficients are penalized; $ \lambda=0 $ yields the same result as ordinary linear regression.
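
As a rough sketch, the closed form above can be applied to the extended data frame (illustration only: linearRidge(), used below, does additional preprocessing such as scaling, so its coefficients will not match this naive computation exactly):

In [ ]:
X <- as.matrix(cbind(1, df[, setdiff(names(df), "y")]))   # intercept column + x, x2, ..., x10
lambda <- 0.1
beta.ridge <- solve(t(X) %*% X + lambda*diag(ncol(X)), t(X) %*% df$y)
beta.ridge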

Ridge Regression Practice in R:

  • Package ridge
  • Input data must contain at least two predictor columns (ncol(X) >= 2)
  • linearRidge() to build a linear ridge regression model
  • predict() to get the predicted values, given a ridge model and data

In [5]:
library(ridge)

ridge.lin <- linearRidge(formula=y ~ ., df, lambda=0.1)
coef(ridge.lin)

with(df, {
    plot(df$x, df$y, pch=19, col=rgb(0.2, 0.4, 0.8, 0.7), cex=2, xlab="", ylab="", cex.axis=1.7)
    lines(x, predict(lm.xtend), col="purple", lwd=6, lty=3)
    lines(x, predict(ridge.lin), col="darkgreen", lwd=6, lty=2)
    legend(x=-4, y=5.5, lwd=6, col=c("purple", "darkgreen"), lty=c(3, 2), 
           legend=c("Polynomial Linear Regression", "Ridge Regression"),
           cex=1.7, border="white", box.lwd=0)
})

## Alternative plotting w. lattice package:
#xyplot(y ~ x, df, pch=19, col=rgb(0.2, 0.4, 0.8, 0.7), cex=2,
#       scales=list(cex=1.7),
#       xlab=list("x", cex=1.7), ylab=list("y", cex=1.7),
#       main=list("Linear Regression w. Polynomial Attributes", cex=1.6),
#       auto.key=T,
#       panel = function(x, y, ...) {
#            panel.xyplot(x, y, ...)
#            llines(x, predict(lm.xtend), col="purple", lwd=6, lty=3)
#            llines(x, predict(ridge.lin), col="darkgreen", lwd=6, lty=2)
#       })


Out[5]:
  (Intercept)             x            x2            x3            x4 
 9.658285e-01  4.241134e-01  1.207052e-01  1.339750e-02  1.730759e-03 
           x5            x6            x7            x8            x9 
 1.522917e-04  1.062585e-05 -1.540744e-05 -1.271680e-06 -1.605201e-06 
          x10 
-1.336680e-07 

3. Logistic Regression


Logistic regression is used for binary classification (binomial logistic regression; the multinomial case handles multiple classes). It models the class probability with the logistic function (see below), which varies continuously between 0 and 1; for classification, a threshold on this probability (e.g. 0.5) determines the predicted class label, "0" or "1".

Logistic Function:

$ f(x) = \frac{1}{1+\exp(-x)} $


In [6]:
t <- seq(-5, 5, by=0.2)
par(mar=c(4, 4, 1.5, 0.5))
plot(t, 1/(1+exp(-t)), type="l", lwd=4, col="blue",
     xlab="", ylab="", main="Logistic Function",
     cex.axis=1.7)


Logistic Regression in R:

  • glm() trains the logistic regression model
  • coef() returns the coefficients of the model, which define the decision boundary

For logistic regression, a new 2D dataset with class labels is provided here.


In [6]:
url <- getURL("https://raw.githubusercontent.com/mirjalil/DataScience/master/data/input_4regression-4.csv")
df.labeled <- read.csv(text=url, header=F, colClasses=c('numeric', 'numeric', 'factor'))
colnames(df.labeled) = c("x1", "x2", "y")

summary(df.labeled)

par(mar=c(4, 4, 0.4, 0.4))
plot(df.labeled$x1, df.labeled$x2, 
     main="",
     pch=19, col=as.numeric(df.labeled$y)+10, cex=2,
     xlab="", ylab="", cex.axis=1.7)


Out[6]:
       x1                x2           y      
 Min.   :-4.9698   Min.   :-4.99219   0:194  
 1st Qu.:-2.8755   1st Qu.:-1.69488   1:206  
 Median :-0.6838   Median : 0.09237          
 Mean   :-0.3113   Mean   : 0.27510          
 3rd Qu.: 2.4070   3rd Qu.: 2.68705          
 Max.   : 4.8855   Max.   : 4.99041          

In [7]:
logit.reg <- glm(formula=y ~ ., data=df.labeled, family="binomial")


cf.logit <- coef(logit.reg)
cf.logit


Out[7]:
(Intercept)          x1          x2 
0.007221144 0.251025660 0.597812737 
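
As described above, class predictions come from thresholding the fitted probabilities; a minimal sketch with a 0.5 cutoff:

In [ ]:
prob <- predict(logit.reg, type="response")         # fitted P(y = "1" | x1, x2)
pred.label <- ifelse(prob > 0.5, "1", "0")
table(predicted=pred.label, actual=df.labeled$y)    # confusion matrix at the 0.5 threshold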

In [14]:
library(lattice)
t <- seq(-6, 6, by=0.2)
xyplot(x2~x1,type='p', data=df.labeled,
       pch=19, cex=2, xlab="", ylab="", col=as.numeric(df.labeled$y)+10,
       scales=list(cex=1.7),
       panel=function(x,y,...){
         panel.xyplot(x,y,...)
         # decision boundary: b0 + b1*x1 + b2*x2 = 0  =>  x2 = -(b0 + b1*x1)/b2
         y2 <- (-cf.logit[1] - cf.logit[2]*t)/cf.logit[3]
         x3 <- c(-6, t, 6)
         y3 <- c(-6, y2, -6)
         panel.polygon(x3, y3,col=rgb(0.2, 0, 0,0.1), border=NA)
       })


Out[14]:

Building a more complex decision boundary


In [17]:
df.labeled$x12 <- df.labeled$x1^2
df.labeled$x13 <- df.labeled$x1^3
df.labeled$x14 <- df.labeled$x1^4

logit.reg.poly <- glm(formula=y ~ ., data=df.labeled, family="binomial")


cf.logit.poly <- coef(logit.reg.poly)
cf.logit.poly

xyplot(x2~x1,type='p', data=df.labeled,
       pch=19, cex=2, xlab="", ylab="", col=as.numeric(df.labeled$y)+10,
       scales=list(cex=1.7),
       panel=function(x,y,...){
         panel.xyplot(x,y,...)
         # boundary: set the linear predictor (quartic in x1, linear in x2) to zero and solve for x2
         y2 <- (-cf.logit.poly[1] - cf.logit.poly[2]*t - cf.logit.poly[4]*t^2 
                -cf.logit.poly[5]*t^3 -cf.logit.poly[6]*t^4)/cf.logit.poly[3]
         x3 <- c(-6, t)
         y3 <- c(-6, y2)
         panel.polygon(x3, y3,col=rgb(0.2, 0, 0,0.1), border=NA)
       })


Out[17]:
  (Intercept)            x1            x2           x12           x13 
-0.2164771089 -1.7360739382  1.3708726927 -0.0027069186  0.1457319484 
          x14 
 0.0002446804 
Out[17]:
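
To check whether the extra polynomial terms actually improve the fit, the same 0.5-cutoff confusion matrix can be computed for the extended model and compared with the one for logit.reg above; a quick sketch:

In [ ]:
prob.poly <- predict(logit.reg.poly, type="response")
table(predicted=ifelse(prob.poly > 0.5, "1", "0"), actual=df.labeled$y)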

