CSAL4243: Introduction to Machine Learning

Muhammad Mudassir Khan (mudasssir.khan@ucp.edu.pk)

Lecture 2: Linear Regression

Overview



What is Machine Learning?

  • Machine Learning is about making computers/machines learn from data
  • Learning improves over time with more data

Definition

Mitchell (1997) defines Machine Learning as: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

Example: playing checkers.

T = the task of playing checkers.

E = the experience of playing many games of checkers.

P = the probability that the program will win the next game.



The three different types of machine learning

<img style="float: left;" src="images/01_01.png", width=500>



Supervised Learning

<img style="float: left;" src="images/01_02.png", width=500>



Regression for predicting continuous outcomes

<img style="float: left;" src="images/01_04.png", width=300> <img style="float: right;" src="images/01_11.png", width=500>



Classification for predicting class labels

<img style="float: left;" src="images/01_03.png", width=300> <img style="float: right;" src="images/01_12.png", width=500>



Unsupervised Learning

<img style="float: left;" src="images/01_06.png", width=300>



Reinforcement Learning

<img style="float: left;" src="images/01_05.png", width=300>



Machine Learning pipeline

<img style="float: left;" src="images/model.png", width=500>

  • x denotes the input variables, also called input features.

  • y denotes the output or target variable, sometimes also called the label.

  • h is called the hypothesis or model.

  • A pair ($x^{(i)}$, $y^{(i)}$) is called a sample or training example.

  • The dataset of all training examples is called the training set.

  • m is the number of samples in a dataset.

  • n is the number of features in a dataset, excluding the label.
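
The notation above can be made concrete with a small sketch. The three-row dataset below is hypothetical (it is not the table in the figures), and the helper name `sample` is made up for illustration. Note that the notation counts samples from 1, while Python lists count from 0.

```python
# Hypothetical toy training set: each row is (size in square feet, price).
# The last entry of each row is the label y; everything before it is a feature.
training_set = [
    (1000.0, 200000.0),
    (1500.0, 290000.0),
    (2000.0, 410000.0),
]

m = len(training_set)         # number of samples
n = len(training_set[0]) - 1  # number of features, excluding the label


def sample(i):
    """Return the pair (x^(i), y^(i)) using the 1-based index of the notation."""
    row = training_set[i - 1]
    return row[0], row[1]


print(m, n)       # 3 1
print(sample(1))  # (1000.0, 200000.0)
```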

<img style="float: left;" src="images/02_02.png", width=400> <img style="float: right;" src="images/02_03.png", width=400>

Question?

  • What are $x^{(2)}$ and $y^{(2)}$?



Goal of Machine Learning algorithm

  • The goal is for the algorithm to perform well on unseen data, not only on the training set.
  • This ability is called generalization.
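
One common way to estimate performance on unseen data is to hold back part of the dataset and never use it for training. The sketch below is a minimal pure-Python version of that idea. The function name `train_test_split_simple`, the 25% test fraction, and the toy samples are assumptions made for illustration. scikit-learn offers a more complete `train_test_split` in `sklearn.model_selection`.

```python
import random


def train_test_split_simple(samples, test_fraction=0.25, seed=0):
    """Shuffle the samples and hold back a fraction of them as unseen data."""
    shuffled = list(samples)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical samples (x, y) with y = 2x
samples = [(float(i), 2.0 * i) for i in range(1, 9)]
train, test = train_test_split_simple(samples)
print(len(train), len(test))  # 6 2
```

The model is fitted on `train` only, and its error on `test` serves as the estimate of how well it generalizes.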



Linear Regression with one variable

Model Representation

  • The model is represented by $h_\theta(x)$, or simply $h(x)$.

  • For linear regression with one input variable, $h(x) = \theta_0 + \theta_1 x$.

  • $\theta_0$ and $\theta_1$ are called weights or parameters.
  • We need to find the $\theta_0$ and $\theta_1$ that maximize the performance of the model.
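
As a minimal sketch, the hypothesis for one input variable is a one-line function. The names `h`, `theta0` and `theta1` mirror the symbols above, and the parameter values in the example call are arbitrary.

```python
def h(x, theta0, theta1):
    """Hypothesis for linear regression with one variable: theta0 + theta1 * x."""
    return theta0 + theta1 * x


# With theta0 = 1.0 and theta1 = 0.5, the input x = 2.0 gives 1.0 + 0.5 * 2.0
print(h(2.0, 1.0, 0.5))  # 2.0
```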



Cost Function

<img style="float: left;" src="images/02_04.png", width=500>

Let $\hat{y} = h(x) = \theta_0 + \theta_1 x$

Error in a single sample $(x, y)$ = $\hat{y} - y = h(x) - y$

Cumulative squared error over all m samples = $\sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2$

Finally, the cost function (the mean squared error, halved for convenience) is $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2$
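
The cost function can be sketched directly from the formula. The function name `cost` and the three-point dataset are assumptions for illustration. The data were chosen to lie exactly on the line y = 1 + 2x so that the correct parameters give zero cost.

```python
def cost(xs, ys, theta0, theta1):
    """J(theta) = 1/(2m) * sum over all samples of (h(x) - y)^2."""
    m = len(xs)
    total = 0.0
    for x_i, y_i in zip(xs, ys):
        error = (theta0 + theta1 * x_i) - y_i  # h(x) - y for one sample
        total += error ** 2
    return total / (2 * m)


# Hypothetical data lying exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]

print(cost(xs, ys, 1.0, 2.0))  # 0.0, the line passes through every point
print(cost(xs, ys, 0.0, 0.0))  # (1 + 9 + 25) / 6, about 5.83
```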



Simple case when $\theta_0$ = 0

<img style="float: center;" src="images/02_06.png", width=700>

<img style="float: center;" src="images/02_07.png", width=700>

<img style="float: center;" src="images/02_08.png", width=700>

<img style="float: center;" src="images/02_09.png", width=700>
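
With $\theta_0$ fixed at 0, the cost depends on $\theta_1$ alone, so it can be tabulated for a few values. The sketch below assumes a hypothetical three-point dataset on the line y = x. It is not necessarily the data shown in the figures.

```python
# Hypothetical data lying exactly on y = x
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
m = len(xs)


def J(theta1):
    """Cost as a function of theta1 only, with theta0 fixed at 0."""
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


for theta1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(theta1, round(J(theta1), 3))
# 0.0 2.333
# 0.5 0.583
# 1.0 0.0
# 1.5 0.583
# 2.0 2.333
```

The values fall to zero at $\theta_1 = 1$ and rise symmetrically on either side, which is the bowl shape of the cost curve.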

When both $\theta_0$ and $\theta_1$ can vary

<img style="float: center;" src="images/02_10.png", width=700>

<img style="float: center;" src="images/02_11.png", width=700>

<img style="float: center;" src="images/02_12.png", width=700>

<img style="float: center;" src="images/02_13.png", width=700>
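
When both parameters vary, the cost becomes a surface over ($\theta_0$, $\theta_1$). A rough way to explore it is to evaluate the cost on a grid of candidate pairs and keep the lowest one. The sketch below assumes a hypothetical dataset on the line y = 1 + 2x and an arbitrary grid from -3 to 3 in steps of 0.5. It is an exhaustive search for illustration, not an efficient optimizer.

```python
# Hypothetical data lying exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m = len(xs)


def J(theta0, theta1):
    """Cost as a function of both parameters."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Candidate values -3.0, -2.5, ..., 3.0 for each parameter
candidates = [i * 0.5 for i in range(-6, 7)]

# Evaluate every (theta0, theta1) pair and keep the one with the lowest cost
best_cost, best_theta0, best_theta1 = min(
    (J(t0, t1), t0, t1) for t0 in candidates for t1 in candidates
)
print(best_cost, best_theta0, best_theta1)  # 0.0 1.0 2.0
```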



So what is the price of the house?

Read data


In [1]:
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt

# read data in pandas frame
dataframe = pd.read_csv('datasets/house_dataset1.csv')

# assign x and y
x_feature = dataframe[['Size']]
y_labels = dataframe[['Price']]

In [90]:
# check data by printing first few rows
dataframe.head()


Out[90]:
Size Price
0 2104 399900
1 1600 329900
2 2400 369000
3 1416 232000
4 3000 539900

Plot data


In [91]:
#visualize results
plt.scatter(x_feature, y_labels)
plt.show()



In [92]:
y_labels.shape


Out[92]:
(47, 1)

Train model


In [93]:
#train model on data
body_reg = linear_model.LinearRegression()
body_reg.fit(x_feature, y_labels)


Out[93]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Predict output using trained model


In [94]:
hx = body_reg.predict(x_feature)

Plot results


In [95]:
plt.scatter(x_feature, y_labels)
plt.plot(x_feature, hx)
plt.show()


Parameters


In [99]:
# coef_ and intercept_ are attributes, not methods, so they take no parentheses
print(body_reg.intercept_)  # theta0
print(body_reg.coef_)       # theta1

Do it yourself


In [86]:
theta0 = 0.0
theta1 = 0.0
inc = 1.0

# Brute-force search: with theta0 fixed at 0, try every theta1 from 0 up to
# (but not including) 1000 in steps of inc and compute the cost of each.
# The theta1 with the minimum cost is the answer.
m = x_feature.shape[0]  # number of samples

# optimal values to be determined
minCost = float('inf')
optimal_theta = 0

while theta1 < 1000:
    cost = 0
    for indx in range(m):
        hx = theta1*x_feature.values[indx,0] + theta0
        cost += pow((hx - y_labels.values[indx,0]),2)

    cost = cost/(2*m)

    if cost < minCost:
        minCost = cost
        optimal_theta = theta1
    theta1 += inc

print(optimal_theta)


165.0

In [88]:
pred = optimal_theta*x_feature

In [84]:
pred.shape


Out[84]:
(47, 1)

In [89]:
plt.scatter(x_feature, y_labels)
plt.plot(x_feature, pred)
plt.show()
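
The search above fixes theta0 at 0 and steps theta1 by 1.0, so its line will generally differ from the one `LinearRegression` fits with an intercept. For a single feature, the least-squares line that minimizes the cost also has a well-known closed form, sketched below. The function name `fit_line` is made up for illustration, and the data are only the five rows shown by `dataframe.head()`, so the resulting numbers will not match a fit on all 47 samples.

```python
# The five rows printed by dataframe.head() earlier in these notes
sizes = [2104.0, 1600.0, 2400.0, 1416.0, 3000.0]
prices = [399900.0, 329900.0, 369000.0, 232000.0, 539900.0]


def fit_line(xs, ys):
    """Closed-form least-squares fit of y = theta0 + theta1 * x."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    # Slope: how x and y vary together, divided by how much x varies
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx
    # Intercept: makes the line pass through the point (mean_x, mean_y)
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


theta0, theta1 = fit_line(sizes, prices)
print(theta0, theta1)
```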


Credits

Raschka, Sebastian. Python machine learning. Birmingham, UK: Packt Publishing, 2015. Print.

Andrew Ng, Machine Learning, Coursera.org.