CSAL4243: Introduction to Machine Learning

Muhammad Mudassir Khan (mudasssir.khan@ucp.edu.pk)

Lecture 5: Linear Regression with Multiple Variables

Overview



Classification vs Regression


Machine Learning pipeline

  • x is called the input variable or input features.

  • y is called the output or target variable, also sometimes known as the label.

  • h is called the hypothesis or model.

  • The pair $(x^i, y^i)$ is called a sample or training example.

  • The dataset of all training examples is called the training set.

  • m is the number of samples in a dataset.

  • n is the number of features in a dataset, excluding the label.

<img style="float: left;" src="images/02_02.png" width=400>



Linear Regression with one variable

Model Representation

  • The model is represented by $h_\theta(x)$ or simply $h(x)$.

  • For linear regression with one input variable, $h(x) = \theta_0 + \theta_1 x$.

  • $\theta_0$ and $\theta_1$ are called weights or parameters.
  • We need to find the $\theta_0$ and $\theta_1$ that maximize the performance of the model.


Vectorize Model

  • Write the model in the form of a matrix multiplication
  • $h(x)$ = $X \times \theta$
    • $X$ and $\theta$ are both matrices
    • $X = \left[ \begin{array}{cc} x_1 \\ x_2 \\ x_3 \\ ... \\ x_{m} \end{array} \right]$
  • For a single sample $x_i$: $h(x_i) = \theta_0 + \theta_1 x_i = \left[ \begin{array}{cc} 1 & x_i \end{array} \right] \times \left[ \begin{array}{cc} \theta_0 \\ \theta_1 \end{array} \right]$
  • $h(x)$ = $\left[ \begin{array}{cc} \theta_0 + \theta_1 x_1 \\ \theta_0 + \theta_1 x_2 \\ \theta_0 + \theta_1 x_3 \\ ... \\ \theta_0 + \theta_1 x_{m} \end{array} \right] = \left[ \begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ ... \\ 1 & x_{m} \end{array} \right] \times \left[ \begin{array}{cc} \theta_0 \\ \theta_1 \end{array} \right]$
  • In the given dataset, $X$ has dimensions $m \times 1$ because there is only 1 variable
  • $\theta$ has dimension $2\times 1$
  • Append a column vector of all 1's to $X$
  • The new $X$ has dimensions $m\times 2$
  • $h(x) = X \times \theta$ has dimensions $m\times 1$
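
As a quick sanity check, here is a minimal NumPy sketch of this vectorized hypothesis (the data and parameter values are made up for illustration):

import numpy as np

# toy single-feature data: m = 4 samples (made-up values)
x = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack((np.ones(x.shape[0]), x))   # append column of 1's -> shape (m, 2)
theta = np.array([[0.5], [2.0]])                # shape (2, 1), made-up values

hx = X.dot(theta)                               # h(x) = X x theta -> shape (m, 1)
print(hx.shape)                                 # (4, 1)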


Linear Regression with multiple variables

  • Model: $h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
  • Dimensions of $X$ are $m\times n$

    $X = \left[ \begin{array}{cc} x_1^1 & x_1^2 & .. & x_1^{n} \\ x_2^1 & x_2^2 & .. & x_2^{n} \\ x_3^1 & x_3^2 & .. & x_3^{n} \\ ... \\ x_{m}^1 & x_{m}^2 & .. & x_{m}^{n} \end{array} \right]$

  • $\theta$ has dimension $(n+1)\times 1$

    $\theta = \left[ \begin{array}{cc} \theta_0 \\ \theta_1 \\ \theta_2 \\ ... \\ \theta_{n} \\ \end{array} \right]$


  • Append a column vector of all 1's to $X$
  • Now $X$ has dimensions $m\times (n+1)$

    $X = \left[ \begin{array}{cc} 1 & x_1^1 & x_1^2 & .. & x_1^{n} \\ 1 & x_2^1 & x_2^2 & .. & x_2^{n} \\ 1 & x_3^1 & x_3^2 & .. & x_3^{n} \\ ... \\ 1 & x_{m}^1 & x_{m}^2 & .. & x_{m}^{n} \end{array} \right]$

  • where $x_i$ is the $i^{th}$ sample, e.g. $x_2 = [ \begin{array}{cc} 4.9 & 3.0 & 1.4 & 0.2 \end{array}]$

  • and $x_i^{j}$ is the value of feature $j$ in the $i^{th}$ training example, e.g. $x_2^3=1.4$

  • $h(x) = X \times \theta$ has dimensions $m\times 1$
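
The same construction with multiple features, as a minimal NumPy sketch (the feature values are taken from the first rows of the housing data used in the worked example below; the theta values are made up):

import numpy as np

# m = 3 samples, n = 2 features (size, bedrooms)
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 3.0]])
theta = np.array([[1.0], [0.1], [0.2]])          # (n+1) x 1, made-up values

X = np.insert(X, 0, 1, axis=1)                   # prepend column of 1's -> shape (m, n+1)
hx = X.dot(theta)                                # shape (m, 1)
print(X.shape, theta.shape, hx.shape)            # (3, 3) (3, 1) (3, 1)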




Cost Function

Cost function: $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m} (h(x^i) - y^i)^2$

where $h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$

<img style="float: center;" src="images/03_02.png" width=300>
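
In vectorized form the same cost can be written as below; this is exactly what the computeCost function in the worked example computes.

$J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y)$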



Gradient Descent

Cost function:

$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m} (h(x^i) - y^i)^2$

Gradient descent equation:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$


Substituting the partial derivative of $J(\theta)$ for each $j$:

$\begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x_{i}) - y_{i}) \cdot x^0_{i}\newline \; & \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x_{i}) - y_{i}) \cdot x^1_{i} \newline \; & \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x_{i}) - y_{i}) \cdot x^2_{i} \newline & \cdots \newline \rbrace \end{align*}$


or more generally

$\begin{align*}& \text{repeat until convergence:} \; \lbrace \newline \; & \theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x_{i}) - y_{i}) \cdot x^j_{i} \; & \text{for j := 0...n}\newline \rbrace\end{align*}$
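
The updates for all $\theta_j$ can also be written as a single vectorized step, which is the form used in the vectorized sketch after the gradientDescent function below:

$\theta := \theta - \frac{\alpha}{m} X^T (X\theta - y)$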



Speed up gradient descent

  • Tricks to make gradient descent converge faster to the optimal value:
  • Keep each of our input values in roughly the same range.
  • $\theta$ will descend quickly on small ranges and slowly on large ranges.
  • $\theta$ will oscillate inefficiently down to the optimum when the variables are very uneven.

Aim is to have:

$-1 \le x^i \le 1$

or

$-0.5 \le x^i \le 0.5$


Feature Scaling

  • Divide the values of a feature by its range

$x^i = \frac{x^i}{\max(x^i) - \min(x^i)}$

Mean Normalization

  • Bring mean of each feature to zero

$x^i = x^i - \mu^i$

where $\mu^i$ is the mean of feature $i$

Combining both

$x^i = \frac{x^i - \mu^i}{\max(x^i) - \min(x^i)}$

or

$x^i = \frac{x^i - \mu^i}{\rho^i}$

where $\rho^i$ is the standard deviation of feature $i$
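
A minimal NumPy sketch of combined scaling and mean normalization (the worked example below does essentially the same thing with pandas; the array values are the first few sizes from the housing dataset):

import numpy as np

x = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])    # one feature column

x_range_scaled = (x - x.mean()) / (x.max() - x.min())     # mean normalization + range scaling
x_standardized = (x - x.mean()) / x.std()                 # mean normalization + std-dev scaling
print(x_standardized.mean(), x_standardized.std())        # approximately 0 and 1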


Learning Rate $\alpha$

  • Appropriate $\alpha$ value will speed up gradient descent.
  • If $\alpha$ is too small: slow convergence.
  • If $\alpha$ is too large: may not decrease on every iteration and thus may not converge.

  • For implementation purposes, try out different values of $\alpha$, e.g. 0.001, 0.003, 0.01, 0.03, 0.1, and plot $J(\theta)$ with respect to iterations.
  • Choose the one that makes gradient descent converge quickly.


Automatic Convergence Test

  • Plot $J(\theta)$ vs iterations.
  • $J(\theta)$ should decrease on each iteration.
  • If $J(\theta)$ decreases by only a very small amount in an iteration, you may have reached the optimal value.
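
A minimal sketch of such a test (the threshold value is an arbitrary choice, not from the lecture); j_history is a list of per-iteration cost values like the one returned by the gradientDescent function in the worked example below:

def has_converged(j_history, epsilon=1e-9):
    """Return True if the last decrease in J(theta) is below epsilon."""
    if len(j_history) < 2:
        return False
    return abs(j_history[-2] - j_history[-1]) < epsilon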



Linear Regression with Multiple Variables Example

Read data


In [141]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
import matplotlib as mpl

# read data in pandas frame
dataframe = pd.read_csv('datasets/house_dataset2.csv', encoding='utf-8')

In [142]:
# check data by printing first few rows
dataframe.head()


Out[142]:
size bedrooms price
0 2104 3 399900
1 1600 3 329900
2 2400 3 369000
3 1416 2 232000
4 3000 4 539900

In [143]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
fig.set_size_inches(12.5, 7.5)
ax = fig.add_subplot(111, projection='3d')

ax.scatter(xs=dataframe['size'], ys=dataframe['bedrooms'], zs=dataframe['price'])

ax.set_ylabel('bedrooms'); ax.set_xlabel('size'); ax.set_zlabel('price')
# ax.view_init(10, -45)
plt.show()


Feature Scaling and Mean Normalization


In [144]:
dataframe.describe()


Out[144]:
size bedrooms price
count 47.000000 47.000000 47.000000
mean 2000.680851 3.170213 340412.659574
std 794.702354 0.760982 125039.899586
min 852.000000 1.000000 169900.000000
25% 1432.000000 3.000000 249900.000000
50% 1888.000000 3.000000 299900.000000
75% 2269.000000 4.000000 384450.000000
max 4478.000000 5.000000 699900.000000

In [145]:
#Quick visualize data
plt.grid(True)
plt.xlim([-1,5000])
dummy = plt.hist(dataframe["size"],label = 'Size')
dummy = plt.hist(dataframe["bedrooms"],label = 'Bedrooms')
plt.title('Clearly we need feature normalization.')
plt.xlabel('Column Value')
plt.ylabel('Counts')
dummy = plt.legend()



In [146]:
mean_size = dataframe["size"].mean()
std_size  = dataframe["size"].std()
mean_bed  = dataframe["bedrooms"].mean()
std_bed   = dataframe["bedrooms"].std()

In [147]:
dataframe["size"] = (dataframe["size"] - mean_size)/std_size

In [148]:
dataframe["bedrooms"] = (dataframe["bedrooms"] - mean_bed)/std_bed

In [149]:
dataframe.describe()


Out[149]:
size bedrooms price
count 4.700000e+01 4.700000e+01 47.000000
mean 3.779483e-17 2.746030e-16 340412.659574
std 1.000000e+00 1.000000e+00 125039.899586
min -1.445423e+00 -2.851859e+00 169900.000000
25% -7.155897e-01 -2.236752e-01 249900.000000
50% -1.417900e-01 -2.236752e-01 299900.000000
75% 3.376348e-01 1.090417e+00 384450.000000
max 3.117292e+00 2.404508e+00 699900.000000

In [150]:
# assign X and append a column of 1's for theta_0
X = np.array(dataframe[['size','bedrooms']])
X = np.insert(X,0,1,axis=1)

#Quick visualize data
plt.grid(True)
plt.xlim([-5,5])
dummy = plt.hist(dataframe["size"],label = 'Size')
dummy = plt.hist(dataframe["bedrooms"],label = 'Bedrooms')
plt.title('Features scaled and mean-normalized.')
plt.xlabel('Column Value')
plt.ylabel('Counts')
dummy = plt.legend()



In [151]:
# assign X and y
X = np.array(dataframe[['size','bedrooms']])
y = np.array(dataframe[['price']])

m = y.size # number of training examples

# insert all 1's column for theta_0
X = np.insert(X,0,1,axis=1)

# initialize theta
# initial_theta = np.zeros((X.shape[1],1))
initial_theta = np.random.rand(X.shape[1],1)

In [152]:
initial_theta


Out[152]:
array([[ 0.71242907],
       [ 0.87930663],
       [ 0.73918403]])

In [153]:
X.shape


Out[153]:
(47, 3)

In [154]:
initial_theta.shape


Out[154]:
(3, 1)

Initialize Hyper Parameters


In [155]:
iterations = 1500
alpha = 0.1

Model/Hypothesis Function


In [156]:
def h(X, theta): #Linear hypothesis function
    hx = np.dot(X,theta)
    return hx

Cost Function


In [157]:
def computeCost(theta,X,y): #Cost function
    """
    theta is an (n+1)-dimensional column vector of parameters
    X is a matrix with (n+1) columns and m rows
    y is a matrix with m rows and 1 column
    """
    #note to self: *.shape is (rows, columns)
    return float((1./(2*m)) * np.dot((h(X,theta)-y).T,(h(X,theta)-y)))

#Test that running computeCost with 0's as theta returns 65591548106.45744:
initial_theta = np.zeros((X.shape[1],1)) #(theta is a vector with n rows and 1 columns (if X has n features) )
print (computeCost(initial_theta,X,y))


65591548106.45744

Gradient Descent Function


In [158]:
#Actual gradient descent minimizing routine
def gradientDescent(X, theta_start = np.zeros(2)):
    """
    theta_start is an (n+1)-dimensional column vector of initial theta guesses
    X is a matrix with (n+1) columns and m rows
    """
    theta = theta_start
    j_history = [] #Used to plot cost as function of iteration
    theta_history = [] #Used to visualize the minimization path later on
    for meaninglessvariable in range(iterations):
        # copy theta so that all theta_j are updated simultaneously from the old values
        tmptheta = theta.copy()
        # append for plotting
        j_history.append(computeCost(theta,X,y))
        theta_history.append(list(theta[:,0]))
        #Simultaneously updating theta values
        for j in range(len(tmptheta)):
            tmptheta[j] = theta[j] - (alpha/m)*np.sum((h(X,theta) - y)*np.array(X[:,j]).reshape(m,1))
        theta = tmptheta
    return theta, theta_history, j_history
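
The inner loop over j can also be collapsed into the single vectorized update shown in the Gradient Descent section above. A sketch, assuming the same global m and the h and computeCost functions defined earlier (this is not the version used to produce the outputs below):

def gradientDescentVectorized(X, y, theta_start, alpha, iterations):
    theta = theta_start.copy()
    j_history = []
    for _ in range(iterations):
        j_history.append(computeCost(theta, X, y))
        # theta := theta - (alpha/m) * X^T (X*theta - y)
        theta = theta - (alpha/m) * X.T.dot(h(X, theta) - y)
    return theta, j_history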

Run Gradient Descent


In [159]:
#Actually run gradient descent to get the best-fit theta values
theta, thetahistory, j_history = gradientDescent(X,initial_theta)

In [160]:
theta


Out[160]:
array([[ 340412.65957447],
       [ 110631.05027885],
       [  -6649.47427082]])

Plot Convergence


In [161]:
plt.plot(j_history)
plt.title("Convergence of Cost Function")
plt.xlabel("Iteration number")
plt.ylabel("Cost function")
plt.show()


Predict output using trained model


In [162]:
dataframe.head()


Out[162]:
size bedrooms price
0 0.130010 -0.223675 399900
1 -0.504190 -0.223675 329900
2 0.502476 -0.223675 369000
3 -0.735723 -1.537767 232000
4 1.257476 1.090417 539900

In [166]:
x_test = np.array([1,0.130010,-0.22367])

print("$%0.2f" % float(h(x_test,theta)))


$356283.09
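
The x_test above uses the already-normalized feature values of the first training row. To predict the price of a house given in raw units, the same means and standard deviations used for normalization must be applied first; a sketch with a made-up query (1650 sq ft, 3 bedrooms):

# normalize a raw query with the training-set statistics computed earlier
size_raw, bed_raw = 1650.0, 3.0
x_query = np.array([1.0,
                    (size_raw - mean_size) / std_size,
                    (bed_raw - mean_bed) / std_bed])
print("$%0.2f" % float(h(x_query, theta)))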

In [168]:
hx = h(X, theta)

In [169]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
fig.set_size_inches(12.5, 7.5)
ax = fig.add_subplot(111, projection='3d')

ax.scatter(xs=dataframe['size'], ys=dataframe['bedrooms'], zs=dataframe['price'])

ax.set_ylabel('bedrooms'); ax.set_xlabel('size'); ax.set_zlabel('price')
# ax.plot(xs=np.array(X[:,0],dtype=object).reshape(-1,1), ys=np.array(X[:,1],dtype=object).reshape(-1,1), zs=hx, color='green')
ax.plot(X[:,1], X[:,2], np.array(hx[:,0]), label='fitted line', color='green')
# ax.view_init(20, -165)
plt.show()
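
Since sklearn's linear_model was already imported at the top of the notebook, the gradient descent solution can be cross-checked against scikit-learn's least-squares fit; a sketch (fit_intercept=False because the column of 1's is already part of X):

# cross-check theta against scikit-learn's closed-form solution
regr = linear_model.LinearRegression(fit_intercept=False)
regr.fit(X, y)
print(regr.coef_)   # should be close to the theta found by gradient descent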


Credits

Raschka, Sebastian. Python machine learning. Birmingham, UK: Packt Publishing, 2015. Print.

Andrew Ng, Machine Learning, Coursera

Lucas Shen

David Kaleko