The second dimension of a data analysis problem is the variable dimension. The main algorithms for this dimension are regression and classification algorithms ("supervised learning," in machine learning terms). As we saw, the variables we choose to consider for our target population allow us to carry out a "population analysis": their values give us criteria for grouping together individuals with similar values of the variable at hand. An orthogonal question to that grouping is the grouping of the variables themselves into groups whose members exhibit a certain dependence (covariance) in the values they can take.
Here we will mostly focus on linear regression techniques to analyse (linear) relationships between variables.
In [1]:
%load_ext rmagic
%R rm(list=ls())
In [2]:
%%R
NL = function() cat('\n\n')  # print a blank separator between outputs
In [3]:
%%R
# Clip the values of X so they stay within var.range = c(min, max).
clean.range = function(X, var.range){
    X[X < var.range[1]] = var.range[1]
    X[X > var.range[2]] = var.range[2]
    return(X)
}
In [4]:
%%R
# Draw var.size values from a normal distribution, optionally clipped to var.range.
simulate.X = function(var.size=1, var.mean=0, var.sd=1, var.range=NULL)
{
    X = rnorm(var.size, mean=var.mean, sd=var.sd)
    if (!is.null(var.range)) X = clean.range(X, var.range)
    return(X)
}
In [5]:
%%R -r 86 -w 300 -h 300
X = simulate.X(var.size=200, var.mean=40, var.sd=20, var.range=c(0,100))
hist(X)
In [6]:
%%R
# Simulate a response Y = alpha + X %*% beta + residuals for the given explanatory variables.
simulate.Y = function(X.variables,     # vector or matrix of explanatory variables
                      alpha,           # intercept
                      beta,            # vector of regression coefficients
                      residual.sd=1,   # standard deviation of the residuals
                      residual.mean=0, # mean of the residuals
                      var.range=NULL   # optional range to clip Y to
                      )
{
    X = as.matrix(X.variables)
    residuals = rnorm(dim(X)[1], mean=residual.mean, sd=residual.sd)
    e = matrix(residuals, ncol=1)
    v = matrix(beta, ncol=1)
    Y = alpha + X %*% v + e
    if(!is.null(var.range)) Y = clean.range(Y, var.range)
    return(as.numeric(Y))
}
In [7]:
%%R -r 86 -w 300 -h 300
Xsize=1000; Xmean = 0; Xsd = 100;
alpha = 2
beta = 1
error = 20
X = simulate.X(Xsize, Xmean, Xsd)
Y = simulate.Y(X, alpha, beta, error)
In [8]:
%%R -r 86 -w 400 -h 400
plot(X,Y)
In [9]:
%%R
# cluster information (one entry per cluster)
sizes.default = c(30, 40, 50)
means.default = c(-50, 100, 50)
sd.default = c(10, 10, 10)
alphas.default = c(20, 10, 30)
betas.default = c(0.9, 1.2, -0.9)
errors.default = c(10, 0, 20)
In [10]:
%%R
# Bundle the per-cluster simulation parameters into a table with one row per cluster.
clusters.info = function(cl.sizes=sizes.default,
                         cl.means=means.default,
                         cl.sd=sd.default,
                         cl.alphas=alphas.default,
                         cl.betas=betas.default,
                         cl.errors=errors.default
                         )
{
    cl.list = list()
    cl.number = length(cl.sizes)
    sample.size = sum(cl.sizes)
    for (i in 1:cl.number){
        cl.name = paste('cluster', i, sep='')
        cl.list[[cl.name]] = c(size=cl.sizes[i],
                               mean=cl.means[i],
                               sd=cl.sd[i],
                               alpha=cl.alphas[i],
                               beta=cl.betas[i],
                               error=cl.errors[i])
    }
    return(t(as.data.frame(cl.list)))
}
In [11]:
%%R
clusters.default = clusters.info()
print(clusters.default)
In [12]:
%%R
# Simulate one cluster of (X, Y) points from a row of the clusters.info table.
simulate.bivariateCluster = function(cl.info){
    X = simulate.X(cl.info['size'], cl.info['mean'], cl.info['sd'])
    Y = simulate.Y(X, cl.info['alpha'], cl.info['beta'], cl.info['error'])
    return(data.frame(X=X, Y=Y))
}
In [13]:
%%R
# Simulate every cluster, label each point with its cluster number, and shuffle the rows.
simulate.bivariateData = function(clusters.info=clusters.default)
{
    m = nrow(clusters.info)
    data = data.frame()
    for(i in 1:m){
        df = simulate.bivariateCluster(clusters.info[i,])
        df$Cluster = factor(rep(i, nrow(df)))
        data = rbind(data, df)
    }
    shuffle = sample(nrow(data))
    data = data[shuffle, ]
    return(data)
}
In [14]:
%%R -r 86 -w 600 -h 600
data = simulate.bivariateData(clusters.default)
print(clusters.default)
NL()
print(head(data))
In [15]:
%%R -r 86 -w 400 -h 400
mycolors = c('Blue', 'Green', 'Red')
colors = mycolors[data$Cluster]
plot(data$X, data$Y, col=colors)
# Overlay the true regression line of each cluster, in that cluster's colour.
plot.trueLines = function()
{
    for (cluster in 1:3){
        intercept = clusters.default[cluster, 'alpha']
        coefficient = clusters.default[cluster, 'beta']
        color = mycolors[cluster]
        abline(intercept, coefficient, col=color)
    }
}
plot.trueLines()
In [19]:
%%R
model = lm(data$Y ~ data$X)
summary(model)
The linear regression function lm returns an object of class lm (in the same way that kmeans returned an object of class kmeans, and hclust an object of class hclust). These R objects are referred to as models: the class constructor does the job of fitting a statistical model to the data and returns a model object (i.e. a list whose class name is the model name, lm for linear model, gam for generalized additive model, etc.) whose components are accessed with generic functions such as:
coef(model)
predict(model)
resid(model)
The generic function predict can be called with the additional argument newdata to predict values for new explanatory variable values. When passed a model object, the function summary gives a general diagnosis of the quality of the fit, while the generic function plot displays several diagnostic graphs to the same effect.
The actual regression line for one explanatory variable can be plotted by passing the model to abline.
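For illustration, here is a minimal sketch of these accessors. Note the assumption made here: the model is refitted with an explicit data= argument so that predict can match the column name in newdata to the formula, and the new X values chosen below are arbitrary illustration points.
In [ ]:
%%R
# Refit with an explicit data argument so that newdata can be used with predict().
model.df = lm(Y ~ X, data=data)
print(coef(model.df))                         # estimated intercept and slope
print(head(resid(model.df)))                  # residuals on the training sample
new.points = data.frame(X=c(-50, 0, 50))      # hypothetical new explanatory values
print(predict(model.df, newdata=new.points))  # predicted Y for the new X values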
In [20]:
%%R -r 86 -w 400 -h 400
plot(data$X, data$Y, col=colors)
abline(model, col='Orange')
In [21]:
%%R -r 86 -w 400 -h 400
plot(data$X, data$Y, col=colors)
points(data$X, fitted(model), col='Orange')
segments(data$X, fitted(model), data$X, data$Y)
In [22]:
%%R
error = sum(resid(model)^2)
print(error)
In [23]:
%%R -r 86 -w 700 -h 400
distance = dist(data[c(1,2)])
htree = hclust(distance)
plot(htree)
In [24]:
%%R
clusters = cutree(htree, 3)
print(clusters)
In [25]:
%%R
A = clusters[clusters == 1]
B = clusters[clusters == 2]
C = clusters[clusters == 3]
print(A)
print(names(A))
In [27]:
%%R
data.A = data[names(A),]
data.B = data[names(B),]
data.C = data[names(C),]
model.A = lm(data.A$Y ~ data.A$X)
model.B = lm(data.B$Y ~ data.B$X)
model.C = lm(data.C$Y ~ data.C$X)
In [28]:
%%R -r 86 -w 400 -h 400
plot(data$X, data$Y, type='n')
# Plot a cluster's points, the residual segments, and the fitted regression line.
plot.cluster = function(cluster, model, color)
{
    points(cluster$X, cluster$Y, col=color)
    segments(cluster$X, fitted(model), cluster$X, cluster$Y)
    lines(cluster$X, fitted(model), col=color)
}
plot.cluster(data.A, model.A, 'Red')
plot.cluster(data.B, model.B, 'Blue')
plot.cluster(data.C, model.C, 'Green')
plot.trueLines()
In [29]:
%%R -r 86 -w 700 -h 300
par(mfrow=c(1,3))
plot(data.B$X, data.B$Y, type='n')
plot.cluster(data.B, model.B, 'Blue')
plot(data.C$X, data.C$Y, type='n')
plot.cluster(data.C, model.C, 'Green')
plot(data.A$X, data.A$Y, type='n')
plot.cluster(data.A, model.A, 'Red')
summary(model)
plot(model)
predict(model)
In [30]:
%%R -r 86 -w 300 -h 300
variable.number = 10
sample.size = 200
variable.mean = 100
variable.sd = 200
X = replicate(variable.number, rnorm(sample.size, variable.mean, variable.sd))
colnames(X) = paste('X', 1:variable.number, sep='')
print(head(X))
In [31]:
%%R -r 86 -w 600 -h 300
alpha = 10
beta = rnorm(variable.number)
error = 0  # residual standard deviation: zero here, so Y is a noiseless linear function of the X variables
Y = simulate.Y(X, alpha, beta, error)
par(mfrow=c(1,2))
hist(Y)
boxplot(Y)
In [32]:
%%R
df = data.frame(X, Y=Y)
#print(head(df))
ind = df$Y > 2000  # flag unusually large responses as potential outliers
df.outliers = df[ind,]
print(df.outliers)
In [33]:
%%R -r 86 -w 1000 -h 1000
plot(df)
We want to uncover a linear relation of the form:
$$Y = \alpha + \beta_1 X_1 + \cdots + \beta_n X_n.$$
That is, we want to find $\alpha$ and $\beta = (\beta_1,\dots,\beta_n)$ such that the error committed on our sample (here df) is as small as possible. The residual is the following vector computed on the sample:
$$\epsilon = Y - (\alpha + X\beta) = (\epsilon_1, \dots, \epsilon_m),$$
where the $i^{th}$ component measures the error committed between the sample value of the $i^{th}$ observation and the value predicted by the model for that observation, i.e.,
$$\epsilon_i = y_i - (\alpha + \beta_1 x_{i1} + \cdots + \beta_n x_{in}).$$
An assumption of our model is that the residuals are normally distributed around zero; they can be interpreted as random error, or noise, in our data.
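As a quick sanity check of this definition, we can compute the residual vector by hand against the true alpha and beta used in the simulation above; since the residual standard deviation was set to zero there, these residuals should all be numerically zero.
In [ ]:
%%R
# Residuals computed by hand against the true simulation parameters (alpha, beta).
# The noise was set to zero above, so the result should be (numerically) zero.
epsilon = Y - (alpha + X %*% matrix(beta, ncol=1))
print(range(as.numeric(epsilon)))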
Using the linear model function lm, we would now like to estimate these coefficients and see the strength of the (linear) relationship: this is the generalization to several explanatory variables of what we just saw with a single variable.
In [34]:
%%R
model = lm(Y ~ ., data=df)
stats = summary(model)
print(stats)
In [35]:
%%R
a = model$coefficients[1]
b = model$coefficients['X1']
print(c(a,b))
In [36]:
%%R
plot(df$X1, df$Y)
abline(a,b)
In [ ]: