The second dimension of a data analysis problem is the variable dimension. The main algorithms along that dimension are regression and classification algorithms ("supervised learning," in machine learning terms). As we saw, the variables we choose to observe on our target population allow us to carry out a "population analysis": their values give us criteria for grouping together individuals with similar values of the variable at hand. An orthogonal question is the grouping of the variables themselves into groups that exhibit a certain dependence (covariance) among the values that variables of the same group can take.

Here we will mostly focus on linear regression techniques to analyse (linear) relationships between the variables.

Simulating and visualizing relations



In [1]:

%R rm(list=ls())  # clear the R workspace




In [2]:

%%R

NL = function() cat('\n\n')  # helper: print a blank separator line between outputs



Simulating explanatory variables: X



In [3]:

%%R

clean.range = function(X, var.range){
    # clamp the values of X to the interval [var.range[1], var.range[2]]
    X[X < var.range[1]] = var.range[1]
    X[X > var.range[2]] = var.range[2]
    return(X)
}
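
The same clamping can be written with base R's vectorized pmin and pmax; an equivalent one-liner, shown here only as an alternative (the name clean.range2 is ours):

In [ ]:

%%R
# equivalent to clean.range above, using pmin/pmax
clean.range2 = function(X, var.range) pmax(var.range[1], pmin(var.range[2], X))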




In [4]:

%%R

simulate.X = function(var.size=1, var.mean=0, var.sd=1, var.range=NULL)
{
    # draw var.size values from a normal law N(var.mean, var.sd^2),
    # optionally clamped to var.range
    X = rnorm(var.size, mean=var.mean, sd=var.sd)
    if (!is.null(var.range)) X = clean.range(X, var.range)
    return(X)
}




In [5]:

%%R -r 86 -w 300 -h 300
X = simulate.X(var.size=200, var.mean=40, var.sd=20, var.range=c(0,100))
hist(X)






Simulating response variables: Y



In [6]:

%%R

simulate.Y = function(X.variables,      # matrix (or vector) of explanatory variables
                      alpha,            # intercept of the linear relation
                      beta,             # vector of regression coefficients
                      residual.sd=1,    # standard deviation of the residuals
                      residual.mean=0,  # mean of the residuals (0 for pure noise)
                      var.range=NULL    # optional range to clamp Y to
                      )
{
    X = as.matrix(X.variables)
    residuals = rnorm(dim(X)[1], mean=residual.mean, sd=residual.sd)
    e = matrix(residuals, ncol=1)
    v = matrix(beta, ncol=1)

    Y = alpha + X %*% v + e

    if (!is.null(var.range)) Y = clean.range(Y, var.range)
    return(as.numeric(Y))
}




In [7]:

%%R -r 86 -w 300 -h 300

Xsize=1000; Xmean = 0; Xsd = 100;

alpha = 2
beta  = 1
error = 20

X = simulate.X(Xsize, Xmean, Xsd)
Y = simulate.Y(X, alpha, beta, error)




In [8]:

%%R -r 86 -w 400 -h 400

plot(X,Y)






Simulating clusters



In [9]:

%%R
# cluster information
sizes.default   = c(30, 40, 50)
means.default   = c(-50, 100, 50)
sd.default      = c(10, 10, 10)
alphas.default  = c(20, 10, 30)
betas.default   = c(0.9, 1.2, -0.9)
errors.default  = c(10, 0, 20)




In [10]:

%%R

clusters.info = function(cl.sizes=sizes.default,     # number of observations per cluster
                         cl.means=means.default,     # mean of X in each cluster
                         cl.sd=sd.default,           # standard deviation of X in each cluster
                         cl.alphas=alphas.default,   # intercepts of the true lines
                         cl.betas=betas.default,     # slopes of the true lines
                         cl.errors=errors.default    # residual standard deviations
                         )
{
    cl.list     = list()
    cl.number   = length(cl.sizes)
    sample.size = sum(cl.sizes)

    for (i in 1:cl.number){
        cl.name = paste('cluster', i, sep='')
        cl.list[[cl.name]] = c(size=cl.sizes[i],    # cluster size
                               mean=cl.means[i],    # mean of X
                               sd=cl.sd[i],         # sd of X
                               alpha=cl.alphas[i],  # true intercept
                               beta=cl.betas[i],    # true slope
                               error=cl.errors[i])  # residual sd
    }
    return(t(as.data.frame(cl.list)))
}




In [11]:

%%R

clusters.default = clusters.info()
print(clusters.default)




         size mean sd alpha beta error
cluster1   30  -50 10    20  0.9    10
cluster2   40  100 10    10  1.2     0
cluster3   50   50 10    30 -0.9    20




In [12]:

%%R

simulate.bivariateCluster = function(cl.info){
    # simulate one cluster: X from the cluster's normal law,
    # Y from the cluster's true line plus noise
    X = simulate.X(cl.info['size'], cl.info['mean'], cl.info['sd'])
    Y = simulate.Y(X, cl.info['alpha'], cl.info['beta'], cl.info['error'])
    return(data.frame(X=X, Y=Y))
}




In [13]:

%%R

simulate.bivariateData = function(clusters.info=clusters.default)
{
    m = nrow(clusters.info)
    data = data.frame()
    for (i in 1:m){
        df = simulate.bivariateCluster(clusters.info[i,])
        df$Cluster = factor(rep(i, nrow(df)))
        data = rbind(data, df)
    }
    shuffle = sample(nrow(data))
    data = data[shuffle, ]
    return(data)
}




In [14]:

%%R -r 86 -w 600 -h 600
data = simulate.bivariateData(clusters.default)
print(clusters.default)
NL()
print(head(data))




         size mean sd alpha beta error
cluster1   30  -50 10    20  0.9    10
cluster2   40  100 10    10  1.2     0
cluster3   50   50 10    30 -0.9    20


            X          Y Cluster
118  53.91916  -12.41614       3
23  -50.79793  -14.69194       1
21  -54.74428  -20.46490       1
19  -49.83074  -21.65337       1
61   88.67184  116.40621       2
83   69.47249  -42.17950       3




True clusters and true relations



In [15]:

%%R -r 86 -w 400 -h 400
mycolors = c('Blue', 'Green', 'Red')
colors = mycolors[data$Cluster]

plot(data$X, data$Y, col=colors)

plot.trueLines = function()
{
    for (cluster in 1:3){
        intercept   = clusters.default[cluster, 'alpha']
        coefficient = clusters.default[cluster, 'beta' ]
        color       = mycolors[cluster]
        abline(intercept, coefficient, col=color)
    }
}

plot.trueLines()






Computing and visualizing a regression



In [19]:

%%R

model = lm(data$Y ~ data$X)
summary(model)




Call:
lm(formula = data$Y ~ data$X)

Residuals:
Min      1Q  Median      3Q     Max
-109.41  -47.77   23.39   44.71   55.39

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.98383    5.82742  -1.198    0.233
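
The cells that follow rely on per-cluster subsets (data.A, data.B, data.C) and per-cluster fits (model.A, model.B, model.C) whose defining cells do not appear above. A minimal sketch of what they presumably contain, assuming one simple regression per cluster (the mapping of the letters A, B, C to the cluster labels 1, 2, 3 is a guess):

In [ ]:

%%R
# hypothetical reconstruction: split the simulated data by cluster label
data.A = data[data$Cluster == 1, ]
data.B = data[data$Cluster == 2, ]
data.C = data[data$Cluster == 3, ]

# and fit one simple regression per cluster
model.A = lm(Y ~ X, data=data.A)
model.B = lm(Y ~ X, data=data.B)
model.C = lm(Y ~ X, data=data.C)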
plot.cluster = function(cluster, model, color){
    # signature and points() call inferred from the calls below
    points(cluster$X, cluster$Y, col=color)
    lines(cluster$X, fitted(model), col=color)
}

plot.cluster(data.A, model.A, 'Red')
plot.cluster(data.B, model.B, 'Blue')
plot.cluster(data.C, model.C, 'Green')
plot.trueLines()




In [29]:

%%R -r 86 -w 700 -h 300
par(mfrow=c(1,3))
plot(data.B$X, data.B$Y, type='n')
plot.cluster(data.B, model.B, 'Blue')
plot(data.C$X, data.C$Y, type='n')
plot.cluster(data.C, model.C, 'Green')
plot(data.A$X, data.A$Y, type='n')
plot.cluster(data.A, model.A, 'Red')




Interpreting and predicting with a regression (multivariate)

summary(model)
plot(model)
predict(model)



In [30]:

%%R -r 86 -w 300 -h 300
variable.number = 10
sample.size     = 200
variable.mean   = 100
variable.sd     = 200

X = replicate(variable.number, rnorm(sample.size, variable.mean, variable.sd))
colnames(X) = paste('X', 1:variable.number, sep='')
print(head(X))




             X1        X2        X3        X4         X5        X6        X7
[1,]  -13.03124 -29.37015  -83.8253 286.02426 -170.48349  233.0547 -93.08642
[2,]  202.22622 117.82400  254.1381 -51.54781   87.26346  273.8988 511.03090
[3,]  216.11134  59.09920 -248.8070 228.71888   62.55337  325.0835 111.12092
[4,]   16.99285 -99.08473 -110.5513  95.04547   58.99262 -191.1746 371.90974
[5,] -366.24251 124.87328 -147.6788 316.90991 -144.13019  567.8763 -95.51793
[6,]  256.50637 609.64830  162.0094 130.20049   46.56707  500.9617 270.94912
             X8         X9       X10
[1,]  219.67294  -1.896002 -25.96269
[2,]  436.13833 -46.414640  32.28056
[3,] -115.11859  20.934875 371.39316
[4,]   75.77795 388.258698 357.63153
[5,] -186.27842 358.509404 -119.65809
[6,]   96.42219   7.016323 -124.08163




In [31]:

%%R -r 86 -w 600 -h 300
alpha = 10
beta  = rnorm(variable.number)
error = rep(0, variable.number)  # noiseless; use rnorm(variable.number) for noisy data

Y = simulate.Y(X, alpha, beta, error)

par(mfrow=c(1,2))
hist(Y)
boxplot(Y)




In [32]:

%%R
df = data.frame(X, Y=Y)
#print(head(df))
ind = df$Y > 2000

df.outliers = df[ind,]
print(df.outliers)




          X1       X2       X3       X4        X5      X6        X7        X8
100 97.87127 341.5445 237.1888 13.39588 -500.3197 285.008 -12.44716 -408.0794
           X9       X10        Y
100 -497.2484 -141.4824 2344.742




In [33]:

%%R -r 86 -w 1000 -h 1000

plot(df)






We want to uncover a linear relation of the form:

$$Y = \alpha + \beta_1 X_1 +\cdots + \beta_n X_n$$

That is: We want to find

$\alpha$ and $\beta = (\beta_1,\dots,\beta_n)$

such that the error committed on our sample (here df)

$$E_s = \|Y - (\alpha + X\beta)\|^2$$

is the smallest possible. The residual is the following vector computed on the sample:

$$\epsilon = Y - (\alpha + X\beta) = (\epsilon_1, \dots, \epsilon_m),$$

where the $i^{th}$ component is the error committed on the $i^{th}$ observation, i.e., the difference between the sample value of that observation and the value the model predicts for it:

$$\epsilon_i = y_i - (\alpha + \beta_1 x_{i1}+\cdots + \beta_n x_{in}).$$

An assumption of our model is that the residuals are normally distributed around zero. They can be interpreted as random error, or noise, in our data.
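
This assumption can be checked visually on a fitted model; a minimal sketch, assuming a fitted lm object named model (like the one from cell In [19]):

In [ ]:

%%R -r 86 -w 600 -h 300
# visual check of the normality of the residuals
par(mfrow=c(1,2))
hist(residuals(model))     # should look bell-shaped, centred at 0
qqnorm(residuals(model))   # points close to the line suggest near-normality
qqline(residuals(model))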

Using the linear model function lm, we can estimate the coefficients and assess the strength of the linear relation:

$$Y = \alpha + X\beta + \epsilon.$$

This is the generalization to several variables of what we just saw.
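
For intuition, the least-squares solution can also be computed directly from the normal equations, $\hat\beta = (X^\top X)^{-1}X^\top Y$, where a column of ones is appended to $X$ to account for the intercept. A minimal sketch, to be compared with the coefficients lm reports below:

In [ ]:

%%R
# least-squares coefficients from the normal equations
X.design = cbind(1, X)                # prepend a column of ones for the intercept
beta.hat = solve(t(X.design) %*% X.design, t(X.design) %*% Y)
print(cbind(normal.eq=as.numeric(beta.hat), lm=coef(lm(Y ~ ., data=df))))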



In [34]:

%%R

model = lm(Y ~ ., data=df)
stats = summary(model)

print(stats)




Call:
lm(formula = Y ~ ., data = df)

Residuals:
Min         1Q     Median         3Q        Max
-1.076e-12 -1.003e-13  7.370e-15  8.636e-14  6.522e-13

Coefficients:
Estimate Std. Error    t value Pr(>|t|)
(Intercept)  1.000e+01  2.760e-14  3.623e+14   <2e-16 ***
X1          -1.673e-01  6.876e-17 -2.433e+15   <2e-16 ***
X2          -1.545e-01  7.861e-17 -1.966e+15   <2e-16 ***
X3           2.721e-01  7.078e-17  3.845e+15   <2e-16 ***
X4           8.246e-01  7.771e-17  1.061e+16   <2e-16 ***
X5          -1.152e+00  7.366e-17 -1.564e+16   <2e-16 ***
X6           4.143e-01  7.132e-17  5.809e+15   <2e-16 ***
X7           1.932e+00  6.822e-17  2.831e+16   <2e-16 ***
X8          -2.251e+00  7.531e-17 -2.989e+16   <2e-16 ***
X9          -1.121e+00  7.417e-17 -1.512e+16   <2e-16 ***
X10         -1.286e+00  7.269e-17 -1.770e+16   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.023e-13 on 189 degrees of freedom
Multiple R-squared:      1,	Adjusted R-squared:      1
F-statistic: 2.379e+32 on 10 and 189 DF,  p-value: < 2.2e-16
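
The fit is exact (residuals at machine precision, R-squared of 1) because error was set to zero in cell In [31], making Y a deterministic linear function of the X's. Refitting with genuinely noisy responses gives a more realistic picture; a minimal sketch:

In [ ]:

%%R
# the same regression with noise added to the responses
Y.noisy = simulate.Y(X, alpha, beta, residual.sd=50)
print(summary(lm(Y.noisy ~ X))$r.squared)   # now strictly below 1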




In [35]:

%%R

a = model$coefficients[1]
b = model$coefficients['X1']

print(c(a,b))




(Intercept)          X1
10.0000000  -0.1673287




In [36]:

%%R

plot(df$X1, df$Y)
abline(a,b)
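
Having the fitted multivariate model, predict applies the estimated coefficients to new observations; a minimal sketch with one hypothetical new data point:

In [ ]:

%%R
# predict the response for a hypothetical new observation
new.obs = as.data.frame(matrix(rnorm(10, mean=100, sd=200), nrow=1))
colnames(new.obs) = paste('X', 1:10, sep='')
print(predict(model, newdata=new.obs))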






