Let's say you have a dataset with two variables: the age of a person, and the fee they pay yearly for auto insurance. Starting from this data, we want to see whether we can predict how much a person will need to pay based only on their age.
In statistical terms, this is called performing regression analysis in order to estimate the relationship between the independent variable "age" and the dependent variable "fee".
If we create a scatter plot for all our age & fee data it might look something like this:
Unsurprisingly, you can observe from the chart that the fee does seem to go down with age. This suggests that there is, indeed, a correlation between age and fee. It's always a good idea to analyze your data first with some charts before attempting to build a regression model.
We can also see that the chart contains a line that goes through the middle of the data points. That is what Excel calls the trendline. For now, let's just say it's the line that best fits our data, or the one that's closest to most points.
Mathematically, the line can be expressed with a simple linear equation:
fee = a*age + b
Linear regression simply means that the predictions follow a line on the plane, or that the dependency between the variables is described with such a linear equation.
Once we know the values of a and b, we can start predicting fee from age. For example, for a = -12 and b = 1100 we get the following predictions:
| age | predicted_fee |
|---|---|
| 10 | 980 |
| 20 | 860 |
| 30 | 740 |
| 40 | 620 |
| 50 | 500 |
| 60 | 380 |
| 70 | 260 |
| 80 | 140 |
| 90 | 20 |
| 100 | -100 |
The predictions seem fine for the most part, but the model breaks down for ages above 90: at age 100 it predicts a negative fee, which makes no sense for an insurance premium.
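To make the example concrete, here is a minimal Python sketch that reproduces the table above; the names a, b, and predict_fee are just illustrative choices:

```python
# Coefficients from the example above: fee = a*age + b
a = -12
b = 1100

def predict_fee(age):
    """Predict the yearly insurance fee for a given age."""
    return a * age + b

# Reproduce the prediction table for ages 10 through 100.
for age in range(10, 101, 10):
    print(age, predict_fee(age))
```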
The values a and b are called regression coefficients, or sometimes weights.
There are many efficient methods and algorithms for determining the regression coefficients that describe the best-fitting line for a given dataset, which is why linear regression is usually the first thing to try when training a model.
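As a sketch of one such method, NumPy's polyfit computes an ordinary least squares fit for a line; the (age, fee) sample data below is invented purely for illustration:

```python
import numpy as np

# Hypothetical (age, fee) observations, invented for illustration.
ages = np.array([18, 25, 33, 41, 52, 60, 75])
fees = np.array([890, 815, 695, 615, 470, 390, 205])

# Fit a degree-1 polynomial (a straight line) by ordinary least squares.
# polyfit returns the coefficients [a, b] of fee = a*age + b.
a, b = np.polyfit(ages, fees, deg=1)
print(f"fee = {a:.1f}*age + {b:.1f}")
```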
Let's say we somehow managed to gather some more information about each person in the dataset. The following new variables are now available:

- previous_accidents
- car_horse_power
- gender
- years_driven

To improve our predictions, we also want to take the values of these variables into consideration when predicting the fee.
Linear regression can be applied just as well to any number of variables. Even though we can no longer represent all 6 dimensions graphically with a chart, the same principle holds: instead of looking for a line on a plane, we consider each observation as a point in a 6-dimensional space (since we have 6 variables) and look for the 5-dimensional hyperplane that best fits those points.
Mathematically, the equation is still linear, it just has more coefficients:
fee = a*age + b*previous_accidents + c*car_horse_power + d*gender + e*years_driven + f
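The best-fitting hyperplane can be found with the same least squares machinery; here is a sketch using NumPy's linear least squares solver, with a small invented dataset (one row per person, one column per input variable):

```python
import numpy as np

# Hypothetical rows: (age, previous_accidents, car_horse_power,
# gender, years_driven), invented for illustration.
X = np.array([
    [25, 1, 120, 0,  7],
    [40, 0,  90, 1, 20],
    [33, 2, 150, 0, 10],
    [58, 0, 110, 1, 35],
    [22, 0, 100, 1,  3],
    [47, 1, 130, 0, 25],
    [65, 0,  95, 1, 40],
])
fees = np.array([820, 560, 750, 330, 880, 490, 260])

# Append a column of ones so the intercept f is fitted along with a..e.
X1 = np.column_stack([X, np.ones(len(X))])
coeffs, *_ = np.linalg.lstsq(X1, fees, rcond=None)
print(coeffs)  # [a, b, c, d, e, f]
```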
Let's try to formalize the notation a bit.
Most models will describe the inputs (usually named x) as an n-dimensional vector. This just means that there are n variables (x1, x2, ...):
x = (x1, x2, x3, x4, ...)
Their outputs (usually named y) can be simple values as in our insurance fee example, in which case the equation becomes:
y = a*x1 + b*x2 + c*x3 + d*x4 + e*x5 + f
Or expressed using weights instead of coefficients:
y = w1*x1 + w2*x2 + w3*x3 + w4*x4 + w5*x5 + w0
Using weights instead of coefficients has both mathematical and computational advantages: the list of weights is itself a vector, and the sum w1*x1 + w2*x2 + ... is called the dot product of the w and x vectors. This can be computed very efficiently.
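For example, with NumPy the entire prediction collapses into a single dot product; the weight and input values here are placeholders, not fitted coefficients:

```python
import numpy as np

# Placeholder weights w1..w5 and bias w0, purely illustrative.
w = np.array([-12.0, 45.0, 1.5, -30.0, -8.0])
w0 = 1100.0

# One person's inputs: (age, previous_accidents, car_horse_power,
# gender, years_driven).
x = np.array([35.0, 1.0, 120.0, 0.0, 12.0])

# y = w1*x1 + w2*x2 + ... + w5*x5 + w0, computed as a dot product.
y = np.dot(w, x) + w0
print(y)
```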