CSAL4243: Introduction to Machine Learning

Muhammad Mudassir Khan (mudasssir.khan@ucp.edu.pk)

Lecture 4: Linear Regression and Gradient Descent Example

Overview



Machine Learning pipeline

  • x is called the input variable or input feature.

  • y is called the output or target variable, also sometimes known as the label.

  • h is called the hypothesis or model.

  • A pair $(x^i, y^i)$ is called a sample or training example.

  • The dataset of all training examples is called the training set.

  • m is the number of samples in a dataset.

  • n is the number of features in a dataset, excluding the label (see the short sketch below the figure for these quantities on a toy dataset).

<img style="float: left;" src="images/02_02.png" width=400>
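As a concrete illustration of this notation, here is a minimal sketch using the same four points as the example later in this lecture; x is the single input feature, y the target, m the number of samples and n the number of features:

import numpy as np

# training set: each row of X holds one training example's features
X = np.array([[1.0], [2.0], [3.0], [4.0]])   # input feature x
y = np.array([0.8, 1.6, 2.4, 3.2])           # target variable y (the label)

m = X.shape[0]   # number of training examples -> 4
n = X.shape[1]   # number of features, excluding the label -> 1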



Linear Regression with one variable

Model Representation

  • The model is represented by $h_\theta(x)$ or simply $h(x)$.

  • For linear regression with one input variable, $h(x) = \theta_0 + \theta_1 x$.

  • $\theta_0$ and $\theta_1$ are called weights or parameters.

  • We need to find the $\theta_0$ and $\theta_1$ that maximize the performance of the model, i.e. minimize its prediction error (see the sketch below).
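A minimal sketch of this hypothesis as a small Python function (the parameter values passed in below are illustrative, not fitted):

def h(x, theta0, theta1):
    """Hypothesis for linear regression with one variable: h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# illustrative parameters: theta0 = 0, theta1 = 0.8
print(h(3.0, 0.0, 0.8))   # 2.4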




Cost Function

Let $\hat{y} = h(x) = \theta_0 + \theta_1 x$

Error in a single sample (x, y) = $\hat{y}$ - y = h(x) - y

Cumulative squared error over all m samples = $\sum_{i=1}^{m} (h(x^i) - y^i)^2$

Finally, the mean squared error or cost function is J($\theta$) = $\frac{1}{2m}\sum_{i=1}^{m} (h(x^i) - y^i)^2$

<img style="float: left;" src="images/03_01.png" width=300> <img style="float: right;" src="images/03_02.png" width=300>
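As a preview of the step-by-step computation later in this lecture, here is a minimal vectorized sketch of $J(\theta)$ in numpy (array and function names are illustrative):

import numpy as np

def cost(X, y, theta0, theta1):
    """Mean squared error cost J(theta) = 1/(2m) * sum((h(x) - y)^2)."""
    m = y.size
    predictions = theta0 + theta1 * X          # h(x) for every sample
    return np.sum((predictions - y) ** 2) / (2 * m)

# on the four example points used later, theta0 = theta1 = 0 gives J = 2.4
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.8, 1.6, 2.4, 3.2])
print(cost(X, y, 0.0, 0.0))   # 2.4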



Gradient Descent

Gradient descent equation:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$

Linear regression Cost function:

J($\theta$) = $\frac{1}{2m}\sum_{i=1}^{m} (h(x^i) - y^i)^2$


Substituting this cost function into the gradient descent equation and differentiating with respect to each parameter gives:

\begin{align*} \text{repeat until convergence: } \lbrace & \newline \theta_0 := & \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x_{i}) - y_{i}) \newline \theta_1 := & \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x_{i}) - y_{i}) x_{i}\right) \newline \rbrace& \end{align*}
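A minimal sketch of these simultaneous updates in numpy; the learning rate and iteration count are illustrative, and unlike the worked example later (which keeps $\theta_0$ fixed at 0) both parameters are updated:

import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = y.size
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = (theta0 + theta1 * X) - y          # h(x^i) - y^i for every sample
        grad0 = np.sum(error) / m                  # dJ/dtheta0
        grad1 = np.sum(error * X) / m              # dJ/dtheta1
        # update both parameters simultaneously, as in the rule above
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# on the example data below, this should approach theta0 ≈ 0 and theta1 ≈ 0.8
print(gradient_descent(np.array([1.0, 2.0, 3.0, 4.0]), np.array([0.8, 1.6, 2.4, 3.2])))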




Linear Regression Example

x    y
1    0.8
2    1.6
3    2.4
4    3.2

Read data


In [31]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt

# read data in pandas frame
dataframe = pd.read_csv('datasets/example1.csv', encoding='utf-8')

# assign x and y
X = np.array(dataframe[['x']])
y = np.array(dataframe[['y']])

m = y.size # number of training examples
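
If datasets/example1.csv is not at hand, an equivalent frame can be built directly from the data table above (a sketch; the column names and values are the ones shown):

import pandas as pd
import numpy as np

# same four points as in the table above
dataframe = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [0.8, 1.6, 2.4, 3.2]})
X = np.array(dataframe[['x']])
y = np.array(dataframe[['y']])
m = y.size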

In [32]:
# check data by printing first few rows
dataframe.head()


Out[32]:
x y
0 1 0.8
1 2 1.6
2 3 2.4
3 4 3.2

Plot data


In [33]:
# visualize the dataset
plt.scatter(X, y)
plt.title("Dataset")
plt.xlabel("x")
plt.ylabel("y")
plt.show()


Find a line that best fits the data


In [34]:
# candidate lines for the best fit

tmpx = np.array([0, 1, 2, 3, 4])
y1 = 0.2*tmpx
y2 = 0.7*tmpx
y3 = 1.5*tmpx


plt.scatter(X, y)
plt.plot(tmpx,y1)
plt.plot(tmpx,y2)
plt.plot(tmpx,y3)
plt.title("Best fit line")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
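
To quantify which of these three candidate slopes fits best, we can evaluate the cost function for each (a sketch reusing X, y and m from above; the slope values are the ones plotted):

# cost J(theta1) with theta0 = 0 for each candidate slope
for slope in [0.2, 0.7, 1.5]:
    cost = np.sum((slope * X[:, 0] - y[:, 0]) ** 2) / (2 * m)
    print(slope, cost)   # the slope closest to the data should give the lowest cost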


Let's assume $\theta_0 = 0$ and $\theta_1=0$

Model h(x) = $\theta_0$ + $\theta_1$x = 0

Cost function J($\theta$) = $\frac{1}{2m}\sum_{i=1}^{m} (h(x^i) - y^i)^2$ = $\frac{1}{2m}\sum_{i=1}^{m} (0 - y^i)^2$


In [35]:
theta0 = 0
theta1 = 0

cost = 0
for i in range(m):
        hx = theta1*X[i,0] + theta0
        cost += pow((hx - y[i,0]),2)

cost = cost/(2*m)             
print (cost)


2.4

Plot it


In [36]:
# predict using model
y_pred = theta1*X + theta0

# plot
plt.scatter(X, y)
plt.plot(X, y_pred)
plt.title("Line for theta1 = 0")
plt.xlabel("x")
plt.ylabel("y")
plt.show()


Plot $\theta_1$ vs Cost


In [37]:
# save theta1 and cost in a vector
cost_log = []
theta1_log = []

cost_log.append(cost)
theta1_log.append(theta1)

# plot
plt.scatter(theta1_log, cost_log)
plt.title("Theta1 vs Cost")
plt.xlabel("Theta1")
plt.ylabel("Cost")
plt.show()


Let's assume $\theta_0 = 0$ and $\theta_1=1$

Model h(x) = $\theta_0$ + $\theta_1$x = x

Cost function J($\theta$) = $\frac{1}{2m}\sum_{i=1}^{m} (h(x^i) - y^i)^2$ = $\frac{1}{2m}\sum_{i=1}^{m} (x^i - y^i)^2$


In [38]:
theta0 = 0
theta1 = 1

cost = 0
for i in range(m):
        hx = theta1*X[i,0] + theta0
        cost += pow((hx - y[i,0]),2)

cost = cost/(2*m)             
print (cost)


0.15

Plot it


In [39]:
# predict using model
y_pred = theta1*X + theta0

# plot
plt.scatter(X, y)
plt.plot(X, y_pred)
plt.title("Line for theta1 = 1")
plt.xlabel("x")
plt.ylabel("y")
plt.show()


Plot $\theta_1$ vs Cost again


In [40]:
# save theta1 and cost in a vector
cost_log.append(cost)
theta1_log.append(theta1)

# plot
plt.scatter(theta1_log, cost_log)
plt.title("Theta1 vs Cost")
plt.xlabel("Theta1")
plt.ylabel("Cost")
plt.show()


Let's assume $\theta_0 = 0$ and $\theta_1=2$

Model h(x) = $\theta_0$ + $\theta_1$x = 2x

Cost function J($\theta$) = $\frac{1}{2m}\sum_{i=1}^{m} (h(x^i) - y^i)^2$ = $\frac{1}{2m}\sum_{i=1}^{m} (2x^i - y^i)^2$


In [41]:
theta0 = 0
theta1 = 2

cost = 0
for i in range(m):
        hx = theta1*X[i,0] + theta0
        cost += pow((hx - y[i,0]),2)

cost = cost/(2*m)             
print (cost)


# predict using model
y_pred = theta1*X + theta0

# plot
plt.scatter(X, y)
plt.plot(X, y_pred)
plt.title("Line for theta1 = 2")
plt.xlabel("x")
plt.ylabel("y")
plt.show()


5.4

In [42]:
# save theta1 and cost in a vector
cost_log.append(cost)
theta1_log.append(theta1)

# plot
plt.scatter(theta1_log, cost_log)
plt.title("theta1 vs Cost")
plt.xlabel("Theta1")
plt.ylabel("Cost")
plt.show()


Sweep $\theta_1$ over a range of values


In [43]:
# sweep theta1 from -3.0 to 3.0 in steps of 0.1 and record the cost for each value
theta0 = 0
theta1 = -3.1

cost_log = []
theta1_log = []

inc = 0.1
for j in range(61):
    theta1 = theta1 + inc

    cost = 0
    for i in range(m):
        hx = theta1*X[i,0] + theta0
        cost += pow((hx - y[i,0]),2)

    cost = cost/(2*m)

    cost_log.append(cost)
    theta1_log.append(theta1)
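
The best value in this sweep can be read off directly (a sketch using the lists just built; the result should be close to 0.8, the slope of the data):

# index of the smallest recorded cost and the corresponding theta1
best = int(np.argmin(cost_log))
print(theta1_log[best], cost_log[best])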

Plot $\theta_1$ vs Cost


In [44]:
plt.scatter(theta1_log, cost_log)
plt.title("theta1 vs Cost")
plt.xlabel("Theta1")
plt.ylabel("Cost")
plt.show()




Let's do it with Gradient Descent now


In [45]:
# gradient descent on theta1 only (theta0 is kept fixed at 0)
theta0 = 0
theta1 = -3

alpha = 0.1
iterations = 100

cost_log = []
iter_log = []

for j in range(iterations):

    cost = 0
    grad = 0
    for i in range(m):
        hx = theta1*X[i,0] + theta0
        cost += pow((hx - y[i,0]),2)
        grad += ((hx - y[i,0]))*X[i,0]

    cost = cost/(2*m)
    # dividing the gradient by 2m instead of m (as in the update rule above)
    # just halves the effective learning rate; it still converges to the same theta1
    grad = grad/(2*m)
    theta1 = theta1 - alpha*grad

    cost_log.append(cost)

In [46]:
theta1


Out[46]:
0.79999999999999993

Plot Convergence


In [47]:
plt.plot(cost_log)
plt.title("Convergence of Cost Function")
plt.xlabel("Iteration number")
plt.ylabel("Cost function")
plt.show()


Predict output using the trained model ($\theta_0 = 0$, $\theta_1 \approx 0.8$)


In [48]:
# predict using model
y_pred = theta1*X + theta0

# plot
plt.scatter(X, y)
plt.plot(X, y_pred)
plt.title("Line for Theta1 from Gradient Descent")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
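
For comparison, the sklearn linear_model module imported at the start (but not used above) can fit the same data; this is a minimal sketch, and the fitted slope should come out close to the theta1 found by gradient descent:

from sklearn import linear_model

# fit ordinary least squares on the same X and y
reg = linear_model.LinearRegression()
reg.fit(X, y)

print(reg.coef_, reg.intercept_)   # slope should be close to 0.8, intercept close to 0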


Credits

Raschka, Sebastian. Python Machine Learning. Birmingham, UK: Packt Publishing, 2015. Print.

Andrew Ng, Machine Learning, Coursera

Lucas Shen

David Kaleko