Mathematically, a linear relationship between $X$ and $Y$ can be written as
$$Y \approx \beta_{0} + \beta_{1}X$$
$\beta_{0}$ and $\beta_{1}$ represent the intercept and slope. They are the model coefficients or parameters. Through regression we estimate these parameters. The estimates are represented as $\hat\beta_{0}$ and $\hat\beta_{1}$. Thus,
$$\hat y = \hat\beta_{0} + \hat\beta_{1}x$$
We estimate $\hat\beta_{0}$ and $\hat\beta_{1}$ using least squares regression. This technique finds the line that minimizes the sum of squared errors over all data points, with each point weighted equally. If $\hat y_{i}$ is the prediction for the $i$th value pair $(x_{i}, y_{i})$, then the error is calculated as
$$e_{i} = y_{i} - \hat y_{i}$$
This error is also called a residual.
Thus the residual sum of squares (RSS) is calculated as
$$RSS = e_{1}^{2} + e_{2}^{2} + ... + e_{n}^{2}$$
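To make the least squares fit and the residuals concrete, here is a minimal sketch (not from the original text) using NumPy on made-up data:

```python
# A minimal sketch: fit a least squares line with NumPy on synthetic data
# and compute the residual sum of squares (RSS).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, 50)   # synthetic Y = b0 + b1*X + noise

# np.polyfit with degree 1 returns [slope, intercept] of the least squares line
b1_hat, b0_hat = np.polyfit(x, y, 1)

y_hat = b0_hat + b1_hat * x        # predictions
residuals = y - y_hat              # e_i = y_i - y_hat_i
rss = np.sum(residuals ** 2)       # RSS = sum of squared residuals
print(b0_hat, b1_hat, rss)
```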
Thus if the relationship between $X$ and $Y$ is approximately linear, then we can write: $$Y = \beta_{0} + \beta_{1}X + \epsilon$$
where $\epsilon$ is the catch-all error term introduced by forcing a linear fit on the model. The above equation is the population regression line. In reality, this is not known (unless you synthesize data using this model). In practice, you estimate the population regression line from a sample of the data.
Using the Central Limit Theorem, we know that the average of a number of sample regression coefficients approximates the population coefficients closely. A proof is available here.
By averaging a number of estimates of $\hat\beta_{0}$ and $\hat\beta_{1}$, we are able to estimate the population coefficients in an unbiased manner: averaging greatly reduces any systematic over- or under-estimation that can come from choosing a small sample.
Now, how far will a single estimate of $\hat\beta_{0}$ be from the actual $\beta_{0}$? We can calculate it using standard error.
To understand standard error, let us consider the simple case of estimating the population mean using a number of smaller samples. The standard error of a statistic (the mean in this case) can be written as:
$$ SE(\hat\mu) = \frac{\sigma}{\sqrt{n}}$$
where $\hat\mu$ is the estimate for which we calculate the standard error (the sample mean in this case), $\sigma$ is the standard deviation of the population, and $n$ is the size of the sample you draw each time.
The SE of $\hat\mu$ is the same as the standard deviation of the sampling distribution of a number of sample means. Thus, the above equation tells us, for a given sample size, how far off the sample mean will be from the population mean on average. Thus:
$$ SD(\hat\mu) = SE(\hat\mu) = \frac{\sigma}{\sqrt n}$$
Note that SE is smaller when you have a larger sample. The value of SE is the average amount by which your $\hat\mu$ deviates from $\mu$.
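As a quick illustration of this relationship, the following sketch (made-up numbers, not from the original text) draws many samples and compares the spread of their means to $\sigma/\sqrt{n}$:

```python
# A minimal sketch: check SE(mu_hat) = sigma / sqrt(n) empirically by drawing
# many samples of size n and looking at the spread of their means.
import numpy as np

rng = np.random.default_rng(1)
sigma, n, trials = 4.0, 25, 10_000

# each row is one sample of size n; take the mean of each sample
sample_means = rng.normal(loc=100, scale=sigma, size=(trials, n)).mean(axis=1)

print("SD of sample means:", sample_means.std())   # empirical SE
print("sigma / sqrt(n):   ", sigma / np.sqrt(n))   # theoretical SE = 0.8
```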
In reality, you don't have $\sigma$ or $\mu$. Thus, rearranging the SE formula, we can estimate the SD of the population as:
$$\sigma = \sigma_{\hat\mu}\sqrt n $$
Confidence Intervals: SE is used to compute confidence intervals (CI). A 95% CI is defined as the range of values such that, with 95% probability, the range will contain the true unknown value of the parameter. A 99% CI is defined similarly. Thus, the approximate 95% CI for $\beta_{1}$ is written as
$$\hat\beta_{1} \pm 2 \cdot SE(\hat\beta_{1})$$
and the same holds for $\beta_{0}$.
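A rough sketch of the CI computation on hypothetical data (the variable names and data are made up for illustration):

```python
# A minimal sketch: approximate 95% CI for beta_1 as beta_1_hat +/- 2 * SE(beta_1_hat),
# with SE(beta_1_hat) = RSE / sqrt(sum((x - mean(x))^2)).
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 60)
y = 1.0 + 0.5 * x + rng.normal(0, 1.0, 60)

b1_hat, b0_hat = np.polyfit(x, y, 1)
residuals = y - (b0_hat + b1_hat * x)

n = len(x)
rse = np.sqrt(np.sum(residuals**2) / (n - 2))        # residual standard error
se_b1 = rse / np.sqrt(np.sum((x - x.mean())**2))     # SE(beta_1_hat)

ci = (b1_hat - 2 * se_b1, b1_hat + 2 * se_b1)        # approximate 95% CI
print(ci)
```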
Hypothesis tests: SE is also used to perform hypothesis tests on the coefficients. Commonly, the null hypothesis $H_{0}$ is that there is no relationship between $X$ and $Y$ (i.e., $\beta_{1} = 0$), and the alternative $H_{a}$ is that there is some relationship ($\beta_{1} \neq 0$). To disprove $H_{0}$, we need $\hat\beta_{1}$ to be sufficiently far from zero relative to its standard error. We compute a t-statistic to evaluate the significance of $\beta_{1}$, which is similar to computing a z-score:
$$
t \ = \ \frac{\hat \beta_{1} - 0}{SE(\hat \beta_{1})}
$$
This statistic follows a t distribution with $n-2$ degrees of freedom; for $n > 30$ the t distribution is close to the standard normal. The p-value we get is used to evaluate the significance of the estimate. A small p-value indicates that the observed relationship between predictor and response is unlikely to be due to chance. As a common rule of thumb, to reject the null hypothesis we look for a p-value < 0.05 and $|t| > 2$.
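As an illustration, the sketch below (synthetic data; the use of `scipy.stats.linregress` is my choice, not something the text prescribes) computes the t-statistic and p-value for the slope:

```python
# A minimal sketch: t-statistic and two-sided p-value for H0: beta_1 = 0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 40)
y = 2.0 + 0.7 * x + rng.normal(0, 2.0, 40)

res = stats.linregress(x, y)             # slope, intercept, p-value, SE of slope
t_stat = res.slope / res.stderr          # t = beta_1_hat / SE(beta_1_hat)
print("t:", t_stat, "p-value:", res.pvalue)
```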
To fit the effects of multiple predictors, we extend the simple linear model by providing a coefficient for each predictor. Thus:
$$ Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + ... + \beta_{p}X_{p} + \epsilon $$
We interpret $\beta_{j}$ as the average effect on $Y$ of a one-unit change in $X_{j}$, holding all other predictors constant.
We again run hypothesis tests, except that now $H_{0}$ is that all coefficients are 0 ($\beta_{1} = \beta_{2} = ... = \beta_{p} = 0$) and $H_{a}$ is that at least one $\beta_{j}$ is non-zero.
In simple linear regression, the hypothesis tests were conducted against a t distribution, whereas in multiple linear regression we compute an F-statistic, and the p-values are computed against the F distribution.
Source: Regression tutorial from Minitab
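For a concrete illustration of the F-test in multiple regression, here is a minimal sketch on synthetic data (the use of statsmodels here is my own choice, not prescribed by the text):

```python
# A minimal sketch: multiple regression with statsmodels. The fitted model
# reports the F-statistic and its p-value for H0: all slope coefficients are 0,
# plus per-coefficient t-test p-values.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))                     # three predictors
y = 1.0 + X @ np.array([2.0, 0.0, -1.5]) + rng.normal(0, 1.0, 100)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.fvalue, model.f_pvalue)               # overall F-statistic and p-value
print(model.pvalues)                              # per-coefficient p-values
```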
Applications of regression analysis: when you examine the output of a regression, look at the following.
- The p-value of each coefficient. A low p-value (lower than 0.05) indicates confidence in the estimated coefficient.
- The 95% CI of each coefficient estimate.
- The residuals, which should be randomly distributed around a mean of 0. If the residuals follow a pattern, then the model is missing or leaking some phenomenon into the error term; you may be missing an important predictor variable.

Example from Minitab
$R^2$ is calculated as the ratio of the variance of the predicted $Y$ to the variance of the actual $Y$. In other words, it compares the distances of the predicted values from the mean to the distances of the actual values from the mean.
$$
R^2 = \frac{\sum_{i}(\hat y_{i} - \bar y)^2}{\sum_{i}(y_{i}-\bar y)^2}
$$
$R^2$ measures the strength of the relationship between the predictors and the response. It ranges from 0-100% (or 0-1).
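A small sketch (synthetic data, my own example) computing $R^2$ directly from the formula above:

```python
# A minimal sketch: compute R^2 as the ratio of explained variation
# to total variation about the mean of y.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 80)
y = 3.0 + 1.2 * x + rng.normal(0, 2.0, 80)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

r_squared = np.sum((y_hat - y.mean())**2) / np.sum((y - y.mean())**2)
print(r_squared)
```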
$R^2$ is limited: it cannot tell you whether the model is systematically under- or over-predicting values. Since it only compares deviations from the mean, predictions can fall far from the regression line and still yield a high $R^2$ as long as their variation about the mean matches that of the actual values. A high $R^2$ therefore does not by itself mean the model is a good fit. Since it is inherently biased in this way, some researchers don't use it at all.
S - Standard error in regression. A better measure of the fit is the standard error of the regression, S (also called RSE, the residual standard error), which measures the average distance each actual value falls from the regression line. It is calculated as below:
$$ S = \sqrt{\frac{\sum (\hat y -y)^2}{n-2}} $$
S represents, in the units of the predicted variable, how far on average the actual values fall from the predictions.
Approximately 95% of the observations fall within $\pm 2$ standard errors of the regression line.
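A small sketch (synthetic data, my own example) that computes S and checks the roughly-95%-within-$\pm 2S$ rule of thumb:

```python
# A minimal sketch: the standard error of the regression, S (RSE), in the same
# units as y, plus the fraction of points falling within +/- 2*S of the line.
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 200)
y = 5.0 + 2.0 * x + rng.normal(0, 3.0, 200)

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

s = np.sqrt(np.sum(residuals**2) / (len(x) - 2))          # RSE
within = np.mean(np.abs(residuals) <= 2 * s)              # fraction within +/- 2*S
print("S:", s, "fraction within 2S:", within)
```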
Multicollinearity is when the predictors are correlated with each other. In the ideal case, each predictor is independent of the others, which lets us properly assess its individual influence on the phenomenon.
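One simple (and by no means the only) way to screen for multicollinearity is to inspect the correlation matrix of the predictors; the sketch below uses made-up predictors where two columns are deliberately correlated:

```python
# A minimal sketch: a quick multicollinearity check via the correlation matrix
# of the predictors. Off-diagonal values with |r| close to 1 signal trouble.
import numpy as np

rng = np.random.default_rng(7)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)    # deliberately correlated with x1
x3 = rng.normal(size=100)                          # independent predictor

X = np.column_stack([x1, x2, x3])
print(np.corrcoef(X, rowvar=False).round(2))       # inspect off-diagonal entries
```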
The p-value of a coefficient is the probability of obtaining an estimate at least this far from zero by chance when the true coefficient is 0. A common choice is a confidence level of 0.95 (i.e., $\alpha = 0.05$) for the p-values.
Thus, the coefficients whose p-values are less than $\alpha$ are significant. The size of the coefficient, not the p-value, determines its influence on the model itself. When you simplify your model (step-wise forward or hierarchical), you start by removing the predictors whose p-values are greater than $\alpha$.
Regression coefficients represent the mean amount by which the predicted variable will change for a 1-unit change in that particular predictor variable while holding all other variables constant. Thus, regression provides a high level of explainability for each variable and its influence on the phenomenon.
As said before, $R^{2}$ compares the variability of the predicted data to that of the actual data about the mean. It can be interpreted as the proportion of variability in $Y$ that can be explained using $X$. A regression with a low p-value but a low $R^{2}$ can be problematic: the trend (slope) is significant, yet the data has a lot of inherent variability that the model does not explain well.
In the case of simple linear regression, $R^{2}$ equals the square of the correlation coefficient, $r^{2}$. However, in multiple linear regression, this relationship does not extend.
The standard error of the regression (also called the RSE, or residual standard error) quantifies how far the points are spread from the regression line. It also helps build prediction intervals and tells you, on average, how large the error will be when you predict on unseen data. The RSE takes the units of the predicted variable, and whether a given RSE is good depends on the problem context.
Resource: Follow the axioms below when you try to explain a regression analysis.
- Do not use the p-value to explain the influence of a predictor; the p-value only suggests the significance of the coefficient not being 0.
- The response is on a continuous scale.
Contrary to popular belief, linear regression can produce curved fits! An equation is considered non-linear when the predictor variables are multiplicative rather than additive. Using log, quadratic, or cubic functions of the predictors, you can produce a curved fit that is still a linear regression.
In general, you choose the order of the equation based on the number of curves you need in the fit: a quadratic produces 1 curve (bend) while a cubic produces 2.
If, as X increases, your data (Y) descends to a floor or ascends to a ceiling and flatlines, then the effect of X flattens out as it increases. In these cases, you can take the reciprocal of X and fit against it. Sometimes a quadratic fit on the reciprocal of X works well.
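A minimal sketch (synthetic data, my own example) of a curved fit that is still linear in the coefficients, first with a quadratic term and then against the reciprocal of X:

```python
# A minimal sketch: curved fits via polynomial and reciprocal terms in x,
# both still linear in the coefficients.
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(1, 10, 100)
y = 1.0 + 0.5 * x - 0.2 * x**2 + rng.normal(0, 1.0, 100)

b2, b1, b0 = np.polyfit(x, y, 2)       # quadratic fit: one bend
print(b0, b1, b2)

# Hypothetical reciprocal transform for data that flattens out as x increases:
# regress y on 1/x instead of x.
slope, intercept = np.polyfit(1.0 / x, y, 1)
print(intercept, slope)
```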
A log transform can rescue a model from non-linear to linear territory. When you take the log of both sides of the equation, multiplications become additions and exponents become multiplications.
For instance, $$ Y = e^{B_{0}}X_{1}^{B_{1}}X_{2}^{B_{2}} $$ transforms to the following when you take the log of both sides (the double-log form): $$ \ln Y = B_{0} + B_{1}\ln X_{1} + B_{2}\ln X_{2} $$
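A small sketch (synthetic data with made-up coefficient values) showing that an ordinary least squares fit on the log-transformed variables recovers the exponents in the double-log form:

```python
# A minimal sketch: fit the double-log form. Taking logs turns the
# multiplicative model into one that is linear in the coefficients, so an
# ordinary least squares fit on ln(X1), ln(X2) recovers B0, B1, B2.
import numpy as np

rng = np.random.default_rng(9)
x1 = rng.uniform(1, 10, 200)
x2 = rng.uniform(1, 10, 200)
y = np.exp(0.5) * x1**1.5 * x2**-0.8 * np.exp(rng.normal(0, 0.05, 200))

A = np.column_stack([np.ones_like(x1), np.log(x1), np.log(x2)])
coeffs, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
print(coeffs)        # approximately [0.5, 1.5, -0.8]
```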
Non-linear models have predictors that appear in products with each other. In general, non-linear models can take any number of forms; the trick is finding one that fits the data. Some limits of non-linear models concern the availability of p-values and CIs for the coefficients.