The second dimension of a data analysis problem is the variable dimension. The main algorithms for that dimension are regression and classification algorithms ("supervised learning", in machine learning terms). As we saw, the variables we choose to consider for our target population allow us to carry out a "population analysis": their values give us criteria for grouping together individuals with similar values of the variable at hand. An orthogonal question is the grouping of the variables themselves into groups exhibiting a certain dependence (covariance) in the values that variables of the same group can take.

Here we will mostly focus on linear regression techniques to analyse (linear) relationships between variables.

Simulating and visualizing relations


In [1]:
%load_ext rmagic
%R rm(list=ls())  # clear the R workspace

In [2]:
%%R

# small helper: print a blank line between printed outputs
NL = function() cat('\n\n')

Simulating explanatory variables: X


In [3]:
%%R

# clip the values of X to the interval [var.range[1], var.range[2]]
clean.range = function(X, var.range){
    X[X < var.range[1]] = var.range[1]
    X[X > var.range[2]] = var.range[2]
    return(X)
}

In [4]:
%%R

# draw var.size values from a normal distribution, optionally clipped to var.range
simulate.X = function(var.size=1, var.mean=0, var.sd=1, var.range=NULL)
{
    X = rnorm(var.size, mean=var.mean, sd=var.sd)
    if (!is.null(var.range)) X = clean.range(X, var.range)
    return(X)
}

In [5]:
%%R -r 86 -w 300 -h 300
X = simulate.X(var.size=200, var.mean=40, var.sd=20, var.range=c(0,100))
hist(X)


Simulating response variables: Y


In [6]:
%%R

simulate.Y = function(X.variables,      # matrix (or vector) of explanatory variables
                      alpha,            # intercept
                      beta,             # slope coefficient(s), one per explanatory variable
                      residual.sd=1,    # standard deviation of the normal residuals
                      residual.mean=0,  # mean of the residuals (usually 0)
                      var.range=NULL    # optional range to clip Y to
                     )
    {
        X = as.matrix(X.variables)
        residuals = rnorm(dim(X)[1], mean=residual.mean, sd=residual.sd)
        e = matrix(residuals, ncol=1)
        v = matrix(beta, ncol=1)
    
        Y = alpha + X %*% v + e
    
        if(!is.null(var.range)) Y = clean.range(Y, var.range)
        return(as.numeric(Y))
    }

In [7]:
%%R -r 86 -w 300 -h 300

Xsize=1000; Xmean = 0; Xsd = 100;

alpha = 2
beta  = 1
error = 20

X = simulate.X(Xsize, Xmean, Xsd)
Y = simulate.Y(X, alpha, beta, error)

In [8]:
%%R -r 86 -w 400 -h 400

plot(X,Y)


Simulating clusters


In [9]:
%%R
# default cluster parameters
sizes.default   = c(30, 40, 50)
means.default   = c(-50, 100, 50)
sd.default      = c(10, 10, 10)
alphas.default  = c(20, 10, 30)
betas.default   = c(0.9, 1.2, -0.9)
errors.default  = c(10, 0, 20)

In [10]:
%%R

clusters.info = function(cl.sizes=sizes.default,     # observations per cluster
                         cl.means=means.default,     # mean of X in each cluster
                         cl.sd=sd.default,           # standard deviation of X in each cluster
                         cl.alphas=alphas.default,   # intercept of the X-Y relation per cluster
                         cl.betas=betas.default,     # slope of the X-Y relation per cluster
                         cl.errors=errors.default    # residual sd of the X-Y relation per cluster
                        )
    {
        cl.list     = list()
        cl.number   = length(cl.sizes)
        sample.size = sum(cl.sizes)
        
        for (i in 1:cl.number){
            cl.name = paste('cluster', i, sep='')
            cl.list[[cl.name]] = c(size=cl.sizes[i],    #
                                   mean=cl.means[i],    #
                                   sd=cl.sd[i],         #
                                   alpha=cl.alphas[i],  #
                                   beta=cl.betas[i],    #
                                   error=cl.errors[i])  #
        }
        return(t(as.data.frame(cl.list)))
    }

In [11]:
%%R

clusters.default = clusters.info()
print(clusters.default)


         size mean sd alpha beta error
cluster1   30  -50 10    20  0.9    10
cluster2   40  100 10    10  1.2     0
cluster3   50   50 10    30 -0.9    20

In [12]:
%%R

simulate.bivariateCluster = function(cl.info){
    X = simulate.X(cl.info['size'], cl.info['mean'], cl.info['sd'])
    Y = simulate.Y(X, cl.info['alpha'], cl.info['beta'], cl.info['error'])
    return(data.frame(X=X,Y=Y))
}

In [13]:
%%R

simulate.bivariateData = function(clusters.info=clusters.default)
{
    m = nrow(clusters.info)
    data = data.frame()
    for(i in 1:m){
        df = simulate.bivariateCluster(clusters.info[i,])
        df$Cluster = factor(rep(i, nrow(df)))
        data = rbind(data, df)
    }
    shuffle = sample(nrow(data))
    data = data[shuffle, ]
    return(data)
}

In [14]:
%%R -r 86 -w 600 -h 600

data = simulate.bivariateData(clusters.default)

print(clusters.default)
NL()
print(head(data))


         size mean sd alpha beta error
cluster1   30  -50 10    20  0.9    10
cluster2   40  100 10    10  1.2     0
cluster3   50   50 10    30 -0.9    20


            X         Y Cluster
118  53.91916 -12.41614       3
23  -50.79793 -14.69194       1
21  -54.74428 -20.46490       1
19  -49.83074 -21.65337       1
61   88.67184 116.40621       2
83   69.47249 -42.17950       3

True clusters and true relations


In [15]:
%%R -r 86 -w 400 -h 400

mycolors = c('Blue', 'Green', 'Red')
colors   = mycolors[data$Cluster]

plot(data$X, data$Y, col=colors)

plot.trueLines = function()
{
    for (cluster in 1:3){
        intercept   = clusters.default[cluster, 'alpha']
        coefficient = clusters.default[cluster, 'beta' ]
        color       = mycolors[cluster]
        abline(intercept, coefficient, col=color)
    }
}

plot.trueLines()


Computing and visualizing a regression


In [19]:
%%R

model = lm(data$Y ~ data$X)
summary(model)


Call:
lm(formula = data$Y ~ data$X)

Residuals:
    Min      1Q  Median      3Q     Max 
-109.41  -47.77   23.39   44.71   55.39 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6.98383    5.82742  -1.198    0.233    
data$X       0.89211    0.08196  10.884   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 51.48 on 118 degrees of freedom
Multiple R-squared:  0.501,	Adjusted R-squared:  0.4968 
F-statistic: 118.5 on 1 and 118 DF,  p-value: < 2.2e-16

Extracting information from the model

The linear regression function lm returns an object of class lm (in the same way that kmeans returned an object of class kmeans, and hclust an object of class hclust). These R objects are referred to as models: the class constructor fits a statistical model to the data and returns a model object (i.e. a list whose class name is the model name: lm for linear model, gam for generalized additive model, etc.) whose attributes contain

  1. the fitted model parameters, retrieved using the generic function coef(model)
  2. the predicted responses, retrieved using predict(model)
  3. the residual errors (i.e. the differences between the predicted and observed responses), retrieved using resid(model)

The generic function predict can also be called with the additional argument newdata to predict responses for new values of the explanatory variables. When passed a model object, summary gives a general diagnosis of the quality of the fit, while the generic function plot displays several diagnostic plots to the same effect. A short sketch of these accessors follows below.

For a single explanatory variable, the regression line itself can be plotted by passing the model to abline.
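
A minimal sketch of those accessors, using the model fitted above. Because that model was fitted with the formula data$Y ~ data$X, we refit it as lm(Y ~ X, data=data) so that predict can match newdata by column name; the new X values are made up for illustration.

In [ ]:
%%R

print(coef(model))            # fitted intercept and slope
print(head(predict(model)))   # predicted responses on the sample
print(head(resid(model)))     # residuals: observed minus predicted responses

# refit with the data= argument so that newdata columns can be matched by name
model.df = lm(Y ~ X, data=data)
new.obs  = data.frame(X=c(-50, 0, 50))   # hypothetical new explanatory values
print(predict(model.df, newdata=new.obs))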


In [20]:
%%R -r 86 -w 400 -h 400

plot(data$X, data$Y, col=colors)

abline(model, col='Orange')



In [21]:
%%R -r 86 -w 400 -h 400

plot(data$X, data$Y, col=colors)

points(data$X, fitted(model), col='Orange')
segments(data$X, fitted(model), data$X, data$Y)



In [22]:
%%R

error = sum(resid(model)^2)   # residual sum of squares
print(error)


[1] 312745.9

Retrieving the clusters without knowing them


In [23]:
%%R -r 86 -w 700 -h 400


distance  = dist(data[c(1,2)])
htree = hclust(distance)
plot(htree)



In [24]:
%%R

clusters = cutree(htree, 3)
print(clusters)


118  23  21  19  61  83  70  32  72 104  36  11 100  60  13  16  67  55  84  43 
  1   2   2   2   3   1   3   3   1   1   3   2   1   3   2   2   3   3   1   3 
 89  29  57   3 119  48  31  91 117  26  22  50  88  64  42  97  37  75  56 101 
  1   2   3   2   1   3   3   1   1   2   2   3   1   3   3   1   3   1   3   1 
 58 114  87  28 108  82  53  99  90  20  17  68 120  71  92  74  41  63  27  52 
  3   1   1   2   1   1   3   1   1   2   2   3   1   1   1   1   3   3   2   3 
115   6  47  66  51   8  94 106  78  65 107  54  35  10  85 111  15 102  98  77 
  1   2   3   3   3   2   1   1   1   3   1   3   3   2   1   1   2   1   1   1 
109   4 103   9  81  45  49 113  33  12 110 105  46  40  14  25  73   1  44   2 
  1   2   1   2   1   3   3   1   3   2   1   1   3   3   2   2   1   2   3   2 
 69  30   5  34 116  62  93  95  39  80  18  76  79  86  59 112  24  38  96   7 
  3   2   2   3   1   3   1   1   3   1   2   1   1   1   3   1   2   3   1   2 

In [25]:
%%R

A = clusters[clusters == 1]
B = clusters[clusters == 2]
C = clusters[clusters == 3]

print(A)
print(names(A))


118  83  72 104 100  84  89 119  91 117  88  97  75 101 114  87 108  82  99  90 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
120  71  92  74 115  94 106  78 107  85 111 102  98  77 109 103  81 113 110 105 
  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
 73 116  93  95  80  76  79  86 112  96 
  1   1   1   1   1   1   1   1   1   1 
 [1] "118" "83"  "72"  "104" "100" "84"  "89"  "119" "91"  "117" "88"  "97" 
[13] "75"  "101" "114" "87"  "108" "82"  "99"  "90"  "120" "71"  "92"  "74" 
[25] "115" "94"  "106" "78"  "107" "85"  "111" "102" "98"  "77"  "109" "103"
[37] "81"  "113" "110" "105" "73"  "116" "93"  "95"  "80"  "76"  "79"  "86" 
[49] "112" "96" 

In [27]:
%%R

data.A = data[names(A),]
data.B = data[names(B),]
data.C = data[names(C),]

model.A = lm(data.A$Y ~ data.A$X)
model.B = lm(data.B$Y ~ data.B$X)
model.C = lm(data.C$Y ~ data.C$X)

In [28]:
%%R -r 86 -w 400 -h 400

plot(data$X, data$Y, type='n')

plot.cluster = function(cluster, model, color)
{
    points(cluster$X, cluster$Y, col=color)
    segments(cluster$X, fitted(model), cluster$X, cluster$Y)
    lines(cluster$X, fitted(model), col=color)
}

plot.cluster(data.A, model.A, 'Red')
plot.cluster(data.B, model.B, 'Blue')
plot.cluster(data.C, model.C, 'Green')
plot.trueLines()



In [29]:
%%R -r 86 -w 700 -h 300

par(mfrow=c(1,3))

plot(data.B$X, data.B$Y, type='n')
plot.cluster(data.B, model.B, 'Blue')

plot(data.C$X, data.C$Y, type='n')
plot.cluster(data.C, model.C, 'Green')

plot(data.A$X, data.A$Y, type='n')
plot.cluster(data.A, model.A, 'Red')


Interpreting and predicting with a regression (multivariate)

As before, summary(model) summarizes the fit, plot(model) displays diagnostic plots, and predict(model) returns the predicted responses; we now apply them to a model with several explanatory variables.

In [30]:
%%R -r 86 -w 300 -h 300

variable.number = 10
sample.size     = 200
variable.mean   = 100
variable.sd     = 200

X = replicate(variable.number, rnorm(sample.size, variable.mean, variable.sd))

colnames(X) = paste('X', 1:variable.number, sep='')

print(head(X))


             X1        X2        X3        X4         X5        X6        X7
[1,]  -13.03124 -29.37015  -83.8253 286.02426 -170.48349  233.0547 -93.08642
[2,]  202.22622 117.82400  254.1381 -51.54781   87.26346  273.8988 511.03090
[3,]  216.11134  59.09920 -248.8070 228.71888   62.55337  325.0835 111.12092
[4,]   16.99285 -99.08473 -110.5513  95.04547   58.99262 -191.1746 371.90974
[5,] -366.24251 124.87328 -147.6788 316.90991 -144.13019  567.8763 -95.51793
[6,]  256.50637 609.64830  162.0094 130.20049   46.56707  500.9617 270.94912
             X8         X9        X10
[1,]  219.67294  -1.896002  -25.96269
[2,]  436.13833 -46.414640   32.28056
[3,] -115.11859  20.934875  371.39316
[4,]   75.77795 388.258698  357.63153
[5,] -186.27842 358.509404 -119.65809
[6,]   96.42219   7.016323 -124.08163

In [31]:
%%R -r 86 -w 600 -h 300

alpha = 10
beta  = rnorm(variable.number)
error = 0   # residual standard deviation: 0 gives a noiseless linear relation

Y = simulate.Y(X, alpha, beta, error)

par(mfrow=c(1,2))
hist(Y)
boxplot(Y)



In [32]:
%%R

df = data.frame(X, Y=Y)
#print(head(df))

ind = df$Y > 2000

df.outliers = df[ind,]
print(df.outliers)


          X1       X2       X3       X4        X5      X6        X7        X8
100 97.87127 341.5445 237.1888 13.39588 -500.3197 285.008 -12.44716 -408.0794
           X9       X10        Y
100 -497.2484 -141.4824 2344.742

In [33]:
%%R -r 86 -w 1000 -h 1000

plot(df)


We want to uncover a linear relation of the form:

$$Y = \alpha + \beta_1 X_1 +\cdots + \beta_n X_n$$

That is, we want to find

$\alpha$ and $\beta = (\beta_1,\dots,\beta_n)$

such that the error committed on our sample (here df)

$$E_s = \|Y - (\alpha + X\beta)\|^2$$

is the smallest possible. The residual is the following vector computed on the sample:

$$\epsilon = Y - (\alpha + X\beta) = (\epsilon_1, \dots, \epsilon_m),$$

where the $i^{th}$ component is the error committed on the $i^{th}$ observation, i.e. the difference between its observed (sample) value and the value predicted by the model:

$$\epsilon_i = y_i - (\alpha + \beta_1 x_{i1}+\cdots + \beta_n x_{in}).$$

An assumption of our model is that the residuals are normally distributed around zero. They can be interpreted as random error, or noise, in our data.
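
As a quick visual check of this assumption, here is a sketch using the univariate model fitted earlier (the same calls apply to any lm fit): a normal Q-Q plot of the residuals should lie close to a straight line.

In [ ]:
%%R -r 86 -w 300 -h 300

# points close to the reference line support the normality assumption
qqnorm(resid(model))
qqline(resid(model))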

Using the linear model function lm, we would like to fit the relation and assess its strength:

$$Y = \alpha + X\beta + \epsilon.$$

This is the generalization to several variables of what we just saw.
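
As a concrete check of these formulas, here is a minimal sketch evaluating $E_s$ by hand for the true alpha and beta used in the simulation above; it is essentially zero here because the simulated noise was set to zero, and any other choice of coefficients would give a larger error.

In [ ]:
%%R

X.mat = as.matrix(df[, paste('X', 1:variable.number, sep='')])   # the X1..X10 columns
eps   = df$Y - (alpha + X.mat %*% matrix(beta, ncol=1))          # residuals for (alpha, beta)
E_s   = sum(eps^2)                                               # squared error on the sample
print(E_s)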


In [34]:
%%R

model = lm(Y ~ ., data=df)
stats = summary(model)

print(stats)


Call:
lm(formula = Y ~ ., data = df)

Residuals:
       Min         1Q     Median         3Q        Max 
-1.076e-12 -1.003e-13  7.370e-15  8.636e-14  6.522e-13 

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept)  1.000e+01  2.760e-14  3.623e+14   <2e-16 ***
X1          -1.673e-01  6.876e-17 -2.433e+15   <2e-16 ***
X2          -1.545e-01  7.861e-17 -1.966e+15   <2e-16 ***
X3           2.721e-01  7.078e-17  3.845e+15   <2e-16 ***
X4           8.246e-01  7.771e-17  1.061e+16   <2e-16 ***
X5          -1.152e+00  7.366e-17 -1.564e+16   <2e-16 ***
X6           4.143e-01  7.132e-17  5.809e+15   <2e-16 ***
X7           1.932e+00  6.822e-17  2.831e+16   <2e-16 ***
X8          -2.251e+00  7.531e-17 -2.989e+16   <2e-16 ***
X9          -1.121e+00  7.417e-17 -1.512e+16   <2e-16 ***
X10         -1.286e+00  7.269e-17 -1.770e+16   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.023e-13 on 189 degrees of freedom
Multiple R-squared:      1,	Adjusted R-squared:      1 
F-statistic: 2.379e+32 on 10 and 189 DF,  p-value: < 2.2e-16


In [35]:
%%R


a = model$coefficients[1]
b = model$coefficients['X1']

print(c(a,b))


(Intercept)          X1 
 10.0000000  -0.1673287 

In [36]:
%%R

plot(df$X1, df$Y)
abline(a,b)
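
Finally, a minimal sketch of prediction for new explanatory values with the multivariate fit; the new observations below are made up for illustration, and since the model was fitted with lm(Y ~ ., data=df), predict matches the columns of newdata by name.

In [ ]:
%%R

# three hypothetical new observations with the same X1..X10 columns as df
new.X = as.data.frame(replicate(variable.number, rnorm(3, variable.mean, variable.sd)))
colnames(new.X) = paste('X', 1:variable.number, sep='')
print(predict(model, newdata=new.X))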



In [ ]: