Supervised learning is the machine learning task of using a set of labeled examples, known as training data, with each example comprising two things: a set of input features and an output label. The goal is to determine the mapping function that generalizes the relationship between the features and the label for each example. For example:
- Real estate websites like Zillow use houses that were recently sold to find out how features like square footage, number of rooms, and neighborhood play a role in determining house prices.
- Spam filters learn how to determine which emails are likely spam by analyzing the contents of emails that users have previously tagged as spam.
In mathematical terms, given input $x$ and output $y$, supervised learning attempts to find a hypothesis function $h_\theta$ such that: $$ h_\theta(x) = y $$
When $y$ is a continuous output variable, such as the price of a house, it is referred to as a regression problem. When $y$ is a categorical variable, such as the type of animal, it is referred to as a classification problem.
Today we will go over linear regression, which is a linear approach to modeling the relationship between input variables and an output variable.
Let's start off with a linear regression problem with one input variable. Suppose we have the following training data:
In [7]:
import pandas as pd
import matplotlib
matplotlib.style.use('ggplot')
%matplotlib inline

# Four (x, y) training examples: x is the input feature, y is the label
training_data = {
    'x': [0, 1, 2, 3],
    'y': [4, 7, 7, 8]
}
train_df = pd.DataFrame.from_dict(training_data)
train_df
Out[7]:
   x  y
0  0  4
1  1  7
2  2  7
3  3  8
Let's plot the data to see what it looks like:
In [9]:
train_df.plot(kind='scatter', x='x', y='y')
Out[9]: (scatter plot of the four training points, y against x)
In linear regression, we are attempting to fit the output onto a continuous function. For the plot above, this is the question we are trying to answer:

Is there a line that fits these four points and best explains the relationship between x and y?
To do this, we start with a hypothesis function of the form:
\begin{align} \hat{y} &= h_{\theta}(x) \\ \hat{y} &= \theta_0 + \theta_1x \end{align}

Basically, we are trying to create a function $h_{\theta}(x)$ that maps our input $x$ to our output $y$. For univariate linear regression, $h_{\theta}(x)$ is defined as the equation of a line. We need to determine the values of $\theta_0$ and $\theta_1$ that best approximate $y$, given $x$.
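Before moving on, here is a minimal sketch of this hypothesis in code. The parameter values $\theta_0 = 4$ and $\theta_1 = 1.5$ are arbitrary illustrative guesses, not fitted values:

In [ ]:
def h(x, theta0, theta1):
    """Hypothesis for univariate linear regression: y-hat = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Predictions of an arbitrary candidate line on our training inputs
h(train_df['x'], 4, 1.5)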
We can measure the accuracy of our hypothesis function by using a cost function, which measures how far the hypothesis's predictions fall from the correct results.
$$J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}_{i}- y_{i} \right)^2 = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2$$

The cost function defined above is also known as the mean squared error function, where $m$ is the number of training examples (the extra factor of $\frac{1}{2}$ is a convention that simplifies later derivative calculations). The objective of linear regression is to find $h_{\theta}(x)$ such that the cost function gives the least value, which signifies the least amount of error.
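To make the formula concrete, here is a minimal sketch of the cost function in code, reusing the illustrative `h` from the cell above and evaluating the same arbitrary guess of $\theta_0 = 4$, $\theta_1 = 1.5$ against our training data:

In [ ]:
def cost(theta0, theta1, df):
    """Mean squared error cost J(theta0, theta1), with the 1/(2m) convention."""
    m = len(df)
    squared_errors = (h(df['x'], theta0, theta1) - df['y']) ** 2
    return squared_errors.sum() / (2 * m)

# Cost of the arbitrary candidate line; a better-fitting line yields a lower J
cost(4, 1.5, train_df)

A candidate line that passes closer to all four points would yield a smaller value of $J$, which is exactly what fitting the regression will aim for.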