Supervised learning is the machine learning task of using a set of labeled examples, known as training data, with each example comprising two things: a set of input features and an output label. The goal is to determine the mapping function that generalizes the relationship between the features and the label for each example. For example:
- Real estate websites like Zillow use houses that were recently sold to find out how features like square footage, number of rooms, and neighborhood play a role in determining house prices.
- Spam filters learn how to determine which emails are likely spam by analyzing the contents of emails that users have previously tagged as spam.
In mathematical terms, given input $x$ and output $y$, supervised learning attempts to find a hypothesis function $h_\theta$ such that: $$ h_\theta(x) = y $$
When $y$ is a continuous output variable, such as the price of a house, it is referred to as a regression problem. When $y$ is a categorical variable, such as the type of animal, it is referred to as a classification problem.
Today we will go over linear regression, which is a linear approach to modeling the relationship between input variables and an output variable.
Let's start off with a linear regression problem with one input variable. Suppose we have the following training data:
In [7]:
import pandas as pd
import matplotlib
matplotlib.style.use('ggplot')
%matplotlib inline

# Four (x, y) training examples: x is the input feature, y is the label
training_data = {
    'x': [0, 1, 2, 3],
    'y': [4, 7, 7, 8]
}
train_df = pd.DataFrame.from_dict(training_data)
train_df
Out[7]:
   x  y
0  0  4
1  1  7
2  2  7
3  3  8
Let's plot the data to see what it looks like:
In [9]:
train_df.plot(kind='scatter', x='x', y='y')
Out[9]: (scatter plot of the four training points, y against x)
In linear regression, we are attempting to fit the output onto a continuous function. For the plot above, this is the question we are trying to answer:

Is there a line that fits these four points and best explains the relationship between x and y?
To do this, we start with a hypothesis function of the form:
\begin{align} \hat{y} &= h_{\theta}(x) \\ \hat{y} &= \theta_0 + \theta_1x \end{align}

Basically, we are trying to create a function $h_{\theta}(x)$ that maps our input $x$ to our output $y$. For univariate linear regression, $h_{\theta}(x)$ is defined as the equation of a line. We need to determine the values of $\theta_0$ and $\theta_1$ that best approximate $y$, given $x$.
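Before moving on, here is a minimal sketch of this hypothesis in code. The parameter values $\theta_0 = 4$ and $\theta_1 = 1.5$ are arbitrary illustrative guesses, not fitted values:

In [ ]:
def h(x, theta0, theta1):
    """Hypothesis for univariate linear regression: y-hat = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Predictions of an arbitrary candidate line on our training inputs
h(train_df['x'], 4, 1.5)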
We can measure the accuracy of our hypothesis function by using a cost function, which measures how far the hypothesis's predictions fall from the correct results.
$$J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}_{i}- y_{i} \right)^2 = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2$$

The cost function defined above is also known as the mean squared error function, where $m$ is the number of training examples (the extra factor of $\frac{1}{2}$ is a convention that simplifies later derivative calculations). The objective of linear regression is to find $h_{\theta}(x)$ such that the cost function gives the least value, which signifies the least amount of error.
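To make the formula concrete, here is a minimal sketch of the cost function in code, reusing the illustrative `h` from the cell above and evaluating the same arbitrary guess of $\theta_0 = 4$, $\theta_1 = 1.5$ against our training data:

In [ ]:
def cost(theta0, theta1, df):
    """Mean squared error cost J(theta0, theta1), with the 1/(2m) convention."""
    m = len(df)
    squared_errors = (h(df['x'], theta0, theta1) - df['y']) ** 2
    return squared_errors.sum() / (2 * m)

# Cost of the arbitrary candidate line; a better-fitting line yields a lower J
cost(4, 1.5, train_df)

A candidate line that passes closer to all four points would yield a smaller value of $J$, which is exactly what fitting the regression will aim for.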