CSAL4243: Introduction to Machine Learning

Muhammad Mudassir Khan (mudasssir.khan@ucp.edu.pk)

Lecture 4: Multinomial Regression

Overview



Machine Learning pipeline

  • x denotes the input variables, also called the input features.

  • y denotes the output or target variable, sometimes also known as the label.

  • h denotes the hypothesis or model.

  • A pair $(x^{(i)}, y^{(i)})$ is called a sample or training example.

  • The dataset of all training examples is called the training set.

  • m is the number of samples in the dataset.

  • n is the number of features in the dataset, excluding the label.

A short sketch mapping this notation onto NumPy arrays follows the figure below.

<img style="float: left;" src="images/02_02.png" width=400>
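As a quick illustration of this notation, here is a minimal sketch (not part of the original lecture code) that builds the small dataset used later in this lecture as NumPy arrays and reads m and n off their shapes:

import numpy as np

# each row of X is one training example x(i); y holds the matching labels y(i)
X = np.array([[1.0], [2.0], [3.0], [4.0]])   # input features, shape (m, n)
y = np.array([0.8, 1.6, 2.4, 3.2])           # targets / labels, shape (m,)

m, n = X.shape
print(m, n)   # 4 training examples, 1 feature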



Linear Regression with one variable

Model Representation

  • The model is represented by $h_\theta(x)$, or simply $h(x)$.

  • For linear regression with one input variable, $h(x) = \theta_0 + \theta_1 x$.

  • $\theta_0$ and $\theta_1$ are called the weights or parameters.

  • We need to find the values of $\theta_0$ and $\theta_1$ that maximize the performance of the model, i.e. minimize the cost function defined in the next section (see the sketch below).
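As a minimal sketch (illustrative only, with placeholder parameter values), the hypothesis for one input variable can be written as a plain Python function:

def h(x, theta0, theta1):
    # hypothesis for linear regression with one input variable
    return theta0 + theta1 * x

print(h(2, 0, 0.8))   # with theta0 = 0 and theta1 = 0.8, h(2) = 1.6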




Cost Function

Let $\hat{y} = h(x) = \theta_0 + \theta_1 x$

Error in a single sample (x, y): $\hat{y} - y = h(x) - y$

Cumulative squared error over all m samples: $\sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2$

Finally, the mean error, or cost function: J($\theta$) = $\frac{1}{2m}\sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2$

<img style="float: left;" src="images/03_01.png" width=300> <img style="float: right;" src="images/03_02.png" width=300>
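The same cost can also be computed without an explicit loop; below is a minimal vectorized sketch (an alternative to the looped version used in the example later on, assuming X and y are NumPy arrays of length m):

import numpy as np

def compute_cost(X, y, theta0, theta1):
    # J(theta) = 1/(2m) * sum((h(x) - y)^2)
    m = y.size
    errors = (theta0 + theta1 * X) - y   # h(x^(i)) - y^(i) for every sample
    return np.sum(errors ** 2) / (2 * m)

# with the example dataset used below, the cost at theta0 = theta1 = 0 is about 2.4
print(compute_cost(np.array([1, 2, 3, 4]), np.array([0.8, 1.6, 2.4, 3.2]), 0, 0))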



Gradient Descent

Cost function:

J($\theta$) = $\frac{1}{2m}\sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2$

Gradient descent equation:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$


Substituting J($\theta$) and evaluating the partial derivative for each j gives:

\begin{align*} \text{repeat until convergence: } \lbrace & \newline \theta_0 := & \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) \newline \theta_1 := & \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left(\left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}\right) \newline \rbrace& \end{align*}
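The worked example below keeps $\theta_0$ fixed at 0 and only updates $\theta_1$; for completeness, here is a minimal sketch (with illustrative variable names such as X_demo and y_demo) of one simultaneous update of both parameters following the rule above:

import numpy as np

def gradient_descent_step(X, y, theta0, theta1, alpha):
    # one simultaneous gradient-descent update of theta0 and theta1
    m = y.size
    errors = (theta0 + theta1 * X) - y            # h(x^(i)) - y^(i)
    new_theta0 = theta0 - alpha * np.sum(errors) / m
    new_theta1 = theta1 - alpha * np.sum(errors * X) / m
    return new_theta0, new_theta1

# one step from (0, 0) with alpha = 0.1 on the example dataset
X_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = np.array([0.8, 1.6, 2.4, 3.2])
print(gradient_descent_step(X_demo, y_demo, 0.0, 0.0, 0.1))   # roughly (0.2, 0.6)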




Linear Regression Example

| x | y   |
|---|-----|
| 1 | 0.8 |
| 2 | 1.6 |
| 3 | 2.4 |
| 4 | 3.2 |

Read data


In [123]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt

# read data in pandas frame
dataframe = pd.read_csv('datasets/example1.csv', encoding='utf-8')

# assign x and y
X = np.array(dataframe[['x']])
y = np.array(dataframe[['y']])

m = y.size # number of training examples

In [105]:
# check data by printing first few rows
dataframe.head()


Out[105]:
   x    y
0  1  0.8
1  2  1.6
2  3  2.4
3  4  3.2

Plot data


In [128]:
#visualize results
plt.scatter(X, y)
plt.title("Dataset")
plt.xlabel("x")
plt.ylabel("y")
plt.show()


Let's assume $\theta_0 = 0$ and $\theta_1 = 0$


In [125]:
theta0 = 0
theta1 = 0

cost = 0
for i in range(m):
        hx = theta1*X[i,0] + theta0
        cost += pow((hx - y[i,0]),2)

cost = cost/(2*m)             
print (cost)


2.4

Plot it


In [129]:
# predict using model
y_pred = theta1*X + theta0

# plot
plt.scatter(X, y)
plt.plot(X, y_pred)
plt.title("Line for theta1 = 0")
plt.xlabel("x")
plt.ylabel("y")
plt.show()


Plot $\theta_1$ vs. cost


In [133]:
# save theta1 and cost in a vector
cost_log = []
theta1_log = []

cost_log.append(cost)
theta1_log.append(theta1)

# plot
plt.scatter(theta1_log, cost_log)
plt.title("Theta1 vs Cost")
plt.xlabel("Theta1")
plt.ylabel("Cost")
plt.show()


Let's assume $\theta_0 = 0$ and $\theta_1 = 1$


In [135]:
theta0 = 0
theta1 = 1

cost = 0
for i in range(m):
        hx = theta1*X[i,0] + theta0
        cost += pow((hx - y[i,0]),2)

cost = cost/(2*m)             
print (cost)


0.15

Plot it


In [136]:
# predict using model
y_pred = theta1*X + theta0

# plot
plt.scatter(X, y)
plt.plot(X, y_pred)
plt.title("Line for theta1 = 1")
plt.xlabel("x")
plt.ylabel("y")
plt.show()


Plot $\theta_1$ vs. cost again


In [137]:
# save theta1 and cost in a vector
cost_log.append(cost)
theta1_log.append(theta1)

# plot
plt.scatter(theta1_log, cost_log)
plt.title("Theta1 vs Cost")
plt.xlabel("Theta1")
plt.ylabel("Cost")
plt.show()


Let's assume $\theta_0 = 0$ and $\theta_1 = 2$


In [138]:
theta0 = 0
theta1 = 2

cost = 0
for i in range(m):
        hx = theta1*X[i,0] + theta0
        cost += pow((hx - y[i,0]),2)

cost = cost/(2*m)             
print (cost)


# predict using model
y_pred = theta1*X + theta0

# plot
plt.scatter(X, y)
plt.plot(X, y_pred)
plt.title("Line for theta1 = 2")
plt.xlabel("x")
plt.ylabel("y")
plt.show()


5.4

In [139]:
# save theta1 and cost in a vector
cost_log.append(cost)
theta1_log.append(theta1)

# plot
plt.scatter(theta1_log, cost_log)
plt.title("theta1 vs Cost")
plt.xlabel("Theta1")
plt.ylabel("Cost")
plt.show()


Sweep $\theta_1$ over a range of values


In [140]:
# sweep theta1 from -3.0 to 3.0 in steps of 0.1 and record the cost at each value
theta0 = 0
theta1 = -3.1

cost_log = []
theta1_log = []

inc = 0.1
for j in range(61):
    theta1 = theta1 + inc

    cost = 0
    for i in range(m):
        hx = theta1*X[i,0] + theta0
        cost += pow((hx - y[i,0]), 2)

    cost = cost/(2*m)

    cost_log.append(cost)
    theta1_log.append(theta1)

Plot $\theta_1$ vs. cost


In [141]:
plt.scatter(theta1_log, cost_log)
plt.title("theta1 vs Cost")
plt.xlabel("Theta1")
plt.ylabel("Cost")
plt.show()




Let's do it with Gradient Descent now


In [142]:
theta0 = 0
theta1 = -3

alpha = 0.1
iterations = 100

cost_log = []

for j in range(iterations):

    cost = 0
    grad = 0
    for i in range(m):
        hx = theta1*X[i,0] + theta0
        cost += pow((hx - y[i,0]), 2)
        grad += (hx - y[i,0])*X[i,0]

    cost = cost/(2*m)
    # note: dividing the gradient by 2m instead of m simply halves the
    # effective step size; the iteration still converges to the same theta1
    grad = grad/(2*m)
    theta1 = theta1 - alpha*grad   # theta0 is held fixed at 0 in this example

    cost_log.append(cost)

In [143]:
theta1


Out[143]:
0.79999999999999993

The learned $\theta_1 \approx 0.8$ matches the data exactly (every y in the example set is 0.8 times its x), so gradient descent has converged to the minimum of the cost function.

Plot Convergence


In [144]:
plt.plot(cost_log)
plt.title("Convergence of Cost Function")
plt.xlabel("Iteration number")
plt.ylabel("Cost function")
plt.show()


Predict the output using the trained model


In [146]:
# predict using model
y_pred = theta1*X + theta0

# plot
plt.scatter(X, y)
plt.plot(X, y_pred)
plt.title("Line for Theta1 from Gradient Descent")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
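
As an optional sanity check (not part of the original lecture code), scikit-learn's LinearRegression, already imported at the top of this notebook, can be fit on the same X and y; it should recover an intercept near 0 and a slope near 0.8, matching the gradient descent result.

from sklearn import linear_model   # already imported above

reg = linear_model.LinearRegression()
reg.fit(X, y)                       # X has shape (m, 1), y has shape (m, 1)
print(reg.intercept_, reg.coef_)    # expect intercept near 0 and slope near 0.8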


Credits

Raschka, Sebastian. Python Machine Learning. Birmingham, UK: Packt Publishing, 2015. Print.

Andrew Ng, Machine Learning, Coursera

Lucas Shen

David Kaleko