We will implement linear regression with multiple variables to predict the prices of houses. Suppose you are selling your house and you want to know what a good market price would be. One way to do this is to first collect information on recent houses sold and make a model of housing prices.
The file ex1data2.txt
contains a training set of housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
df = pd.read_csv('ex1data2.txt', header=None)
print(df.head())
# Let's visualize the data: house size and bedrooms vs. price
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(df[0], df[1], df[2])
ax.set_xlabel('size of the house (in square feet)')
ax.set_ylabel('number of bedrooms')
ax.set_zlabel('price')
plt.show()
print(f'We have data for {len(df)} houses')
In [2]:
import numpy as np
#Data preparation
# We do not add the column of ones yet, because the features need to be normalized first
X = df.drop([2], axis=1).values
y = df[2].values
print(X[:1], y[:1])
Now we normalize the features, because the size of a house spans a very different range of values than the number of bedrooms.
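Concretely, each feature $x_j$ is rescaled by subtracting its mean and dividing by its standard deviation:

$$x_j^{\text{norm}} = \frac{x_j - \mu_j}{\sigma_j}$$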
In [3]:
def featureNormalize(X):
    # Rescale each feature to zero mean and unit standard deviation
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    X_norm = (X - mu) / sigma
    return X_norm, mu, sigma
Data Preparation
In [4]:
X_norm, mu, sigma = featureNormalize(X)
# Now add a column of ones to the input features for the intercept term theta0
ones = np.ones((X_norm.shape[0], 1), float)
X = np.concatenate((ones, X_norm), axis=1)
print(X[:1])
In [5]:
# Cost function
def computeCostMulti(X, y, theta):
    m = X.shape[0]
    # h_theta(x) = theta0*x0 + theta1*x1 + ... + thetan*xn, vectorized as X.dot(theta)
    hypothesis = X.dot(theta)
    J = (1 / (2 * m)) * np.sum(np.square(hypothesis - y))
    return J
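The function above is a vectorized form of the squared-error cost over the $m$ training examples:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2, \qquad h_\theta(x) = \theta^T x$$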
In [6]:
theta = np.zeros(X.shape[1])
J_cost = computeCostMulti(X, y, theta)
print('J_Cost', J_cost)
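Since theta is initialized to zeros, the hypothesis is identically zero, so this initial cost is simply $\frac{1}{2m}\sum_{i=1}^{m} \left(y^{(i)}\right)^2$; gradient descent should drive it down from there.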
In [7]:
def gradientDescentMulti(X, y, theta, alpha, num_iters):
    m = X.shape[0]
    J_history = np.zeros(num_iters)
    for i in range(num_iters):
        # Vectorized batch update: all parameters are updated simultaneously
        h = X.dot(theta)
        theta = theta - alpha * (1 / m) * X.T.dot(h - y)
        J_history[i] = computeCostMulti(X, y, theta)
    return theta, J_history

alpha = 0.01
num_iters = 1000
theta, J_history = gradientDescentMulti(X, y, theta, alpha, num_iters)
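Each iteration applies the gradient descent update, which in vectorized form is:

$$\theta := \theta - \frac{\alpha}{m} X^T (X\theta - y)$$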
In [8]:
# Plot the cost over iterations to check that gradient descent is converging
plt.xlim(0,num_iters)
plt.plot(J_history)
plt.ylabel('Cost J')
plt.xlabel('Iterations')
plt.show()
print(theta)
Now let's predict the prices of some houses and compare the results with scikit-learn's predictions. Since the model was trained on normalized features, any new house has to be normalized with the same mu and sigma before applying theta.
In [9]:
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf.fit(X, y)
# Normalize the new examples with the training mu and sigma, then prepend the intercept term
houses = np.array([[100, 3], [200, 3]])
inputXs = np.concatenate((np.ones((houses.shape[0], 1)), (houses - mu) / sigma), axis=1)
sklearnPrediction = clf.predict(inputXs)
gradientDescentPrediction = inputXs.dot(theta)
print(sklearnPrediction, gradientDescentPrediction)
print("Looks Good :D")