This tutorial uses multivariate regression to predict house prices. The high-level goal is to use multiple features (size, number of bedrooms, bathrooms, etc.) to predict the price of a house. The tutorial is self-paced.
The language used throughout is Python, together with the Python libraries commonly used for scientific computing and machine learning.
One of those tools is the IPython notebook, interactive Python rendered as HTML, which is what you are reading right now. We'll go over other practical tools, widely used in the data science industry, below.
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import csv
import pandas
import sklearn
import pickle
import numpy as np
from sklearn import preprocessing
from sklearn import cross_validation as cv
from sklearn.metrics import explained_variance_score, mean_squared_error,r2_score
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score, train_test_split
from sklearn.learning_curve import learning_curve
from sklearn.externals import joblib
from sklearn.linear_model import LinearRegression
Labeled data is crucial for supervised learning. Courtesy of redfin.com, we are going to focus on the last 3 years of data for zipcode 94555. The houses in this data set have a lot size of at least 4,500 sqft, 3+ beds, 2+ baths, and 1,500-2,000 sqft of living space, and were not listed for more than 30 days on www.redfin.com.
$ mkdir data && cd data
$ ls -ltrh
-rw-r--r--@ 1 user staff 41K Feb 17 23:19 redfin_2015-02-17_94555_results.csv
Pandas is an amazing library for loading and manipulating data. We will use pandas to load the data.
In [14]:
labels=["LIST PRICE","BEDS","BATHS","SQFT","LOT SIZE","YEAR BUILT",
"PARKING SPOTS","PARKING TYPE","ORIGINAL LIST PRICE","LAST SALE PRICE"]
houses=pandas.read_csv('./data/redfin_2015-02-17_94555_results_massaged.csv', quoting=csv.QUOTE_NONE,names=labels)
print(houses.head())
The matrix size of the input data is:
In [16]:
print(houses.shape)
It's always advisable to visualize the data. Let's create a scatter plot with X-axis as the SQFT of the house and Y-axis as the LAST SALE PRICE of the house.
In [12]:
houses.plot(x='SQFT',y='LAST SALE PRICE', kind='scatter')
Out[12]:
pandas also helps with aggregate statistics.
In [20]:
houses.groupby('SQFT').describe()
Out[20]:
In this section, we will convert the data to a format that can be normalized and consumed by our learning algorithm. We have 9 columns of input data from which we need to predict the house price, and around 116 rows of data (training samples).
We will extract the LAST SALE PRICE column and save it as our target values, and normalize the rest of the columns. By normalizing, we make sure the features are scaled to a uniform range so that no particular feature carries more weight than the others.
We need to extract LAST SALE PRICE into a 'y' array and save the remaining columns into a 'houses_features' array.
In [15]:
y= houses['LAST SALE PRICE'].copy()
houses_features =houses.iloc[:,[0,1,2,3,4,5,6,7,8]].copy()
print ("features: \n",houses_features.head())
print("target: \n", y.head())
The first step is to give the PARKING TYPE column a numerical value: we assign 1 for the presence of a garage and 0 for its absence.
In [16]:
import math
def convert_parking_type(features):
    # map 'Garage' (case-insensitive) to 1, anything else to 0
    features['PARKING TYPE'] \
        = features['PARKING TYPE'].map(lambda x: 1 if x.strip().lower() == 'garage' else 0)
    return features
houses_features = convert_parking_type(houses_features);
print(houses_features['PARKING TYPE'].head())
There are many ways to normalize features. We will use a simple approach: given a feature column X, we normalize each value $X_i$ as:
\begin{equation*} X_i = \frac{X_i - \mu(X)}{\sigma(X)} \end{equation*} where $\mu(X)$ is the mean and $\sigma(X)$ is the standard deviation of that feature column.
In our case, we will normalize each column of features.
In [17]:
def feature_normalize(matrix, mean=None, std=None):
    # use the supplied mean/std (e.g. computed from the training set) or derive them from the data
    mean = matrix.mean() if mean is None else mean
    std = matrix.std() if std is None else std
    matrix = (matrix - mean)/std
    # a zero-variance column would produce NaNs; replace those with 0
    return {'matrix_norm': matrix.fillna(0), 'mean': mean, 'std': std}
Sanity Check !!!
In [19]:
results = feature_normalize(houses_features)
houses_norm = results['matrix_norm']
X_mean =results['mean']
X_std = results['std']
results = feature_normalize(y);
y_norm = results['matrix_norm']
y_mean = results['mean']
y_std = results['std']
print ('{0}\n'.format(houses_norm.head()))
print ('{0}\n'.format(y_norm.head()))
We can use the preprocessing module of sklearn to apply the same zero-mean, unit-variance scaling. Let's use that and check the results.
In [22]:
scaler = preprocessing.StandardScaler().fit(houses_features.values)
houses_norm = scaler.transform(houses_features)
print("Mean: {0}, STD: {1}".format(scaler.mean_,scaler.std_))
print(houses_norm)
Let's use a simple linear model that minimizes the least-squared error function.
Optional !! A quick mathematical background.
$\begin{equation*} y' = h_\theta(X)\end{equation*}$
Where
$\begin{equation*}h_\theta(X) = \theta_0 + \theta_1*x_1 + ... + \theta_n*x_n\end{equation*}$
Here y' is the predicted value and y is the actual value. Our goal is to minimize the delta between these two values. So, if we can choose the best values for $\begin{equation*}\Theta\end{equation*}$, we can accurately predict the house prices.
To achieve that, we use the least-squared error cost function and look for the values of $\begin{equation*}\Theta = [\theta_0, \theta_1, \dots, \theta_n]\end{equation*}$ that minimize it:
$\begin{equation*} J(\Theta) = \sum_{i=1}^{m}\left(h_\theta(X^{(i)})-y^{(i)}\right)^2\end{equation*}$ where $m$ is the number of training samples.
We choose some arbitrary number of iterations and in each iteration we simultaneously calculate
$\begin{equation*} \theta_0 := \theta_0 -\alpha*\frac{\partial}{\partial\theta_0}J(\Theta)\end{equation*}$
$\begin{equation*} \theta_1 := \theta_1 - \alpha*\frac{\partial}{\partial\theta_1}J(\Theta)\end{equation*}$
. . .
$\begin{equation*} \theta_n := \theta_n - \alpha*\frac{\partial}{\partial\theta_n}J(\Theta)\end{equation*}$
Where $\begin{equation*}\alpha\end{equation*}$ is a scalar learning rate value.
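For intuition only, here is a minimal NumPy sketch of batch gradient descent for the cost above. Everything in it (the gradient_descent helper, the demo data, the learning rate, the iteration count) is made up for illustration; it is not the model we train below.
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    # X is assumed to already include a leading column of ones for theta_0
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        error = X.dot(theta) - y            # h_theta(X) - y for every sample
        gradient = X.T.dot(error) / m       # dividing by m only rescales the step size
        theta = theta - alpha * gradient    # simultaneous update of all theta_j
    return theta

# tiny made-up example: y = 1 + 2*x, so theta should approach [1, 2]
X_demo = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
print(gradient_descent(X_demo, y_demo))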
Phew !! Thankfully, we can use different linear models available in sklearn.
Let's define a train_model function that fits a LinearRegression model to our data:
In [23]:
def train_model(X,y):
regressor = LinearRegression(fit_intercept=True)
regressor.fit(X,y)
return regressor
In [24]:
regr = train_model(houses_norm, y_norm)
print(regr.coef_)
Let's run a prediction and see if we can predict the price of a house. We are going to use the 3rd sample from our training set. The input features (house_test) for this sample are: {"LIST PRICE": 749888, "BEDS": 3, "BATHS": 2.5, "SQFT": 1642, "LOT SIZE": 9876, "YEAR BUILT": 1968, "PARKING SPOTS": 2, "PARKING TYPE": "Garage", "ORIGINAL LIST PRICE": 74988}
and the actual output, LAST SALE PRICE, is 713,550.
In [25]:
house_test = pandas.DataFrame({"LIST PRICE" : 749888,
"BEDS" : 3,
"BATHS": 2.5,
"SQFT": 1642,
"LOT SIZE" : 9876,
"YEAR BUILT" : 1968,
"PARKING SPOTS" : 2,
"PARKING TYPE" : "Garage",
"ORIGINAL LIST PRICE": 74988},index=[0])
house_test_converted = convert_parking_type(house_test)
house_test_norm = feature_normalize(house_test_converted, X_mean, X_std)
y_predicted_norm = regr.predict(house_test_norm['matrix_norm']);
y_predicted = y_predicted_norm*y_std + y_mean
print('Predicted Price: {0}'.format(y_predicted[0].round(0)))
print('Normalized Input: \n {0} \n'.format(house_test_norm['matrix_norm']))
Let's do some score analysis to find out the performance of our prediction algorithm.
In [178]:
print("Mean square error: {0} \n".format(mean_squared_error([y_norm[2]],y_predicted_norm)))
print("R2 Score: {0} \n".format(regr.score(houses_norm, y_norm)))
In the above "evaluation" we didn't break the data into training and test set. That's a big no no in preparing your machine learning models. Let's fix that.
A proper way is to split the data into a training/test set, where the model only ever sees the training data during its model fitting and parameter tuning. The test data is used for final evaluation of our model to find out it's prediction performance.
In [29]:
houses_norm_train,houses_norm_test, y_norm_train, y_norm_test = \
train_test_split(houses_norm,y_norm,test_size=0.2) #<=== 20% of the samples are used for testing.
print(len(houses_norm_train),len(houses_norm_test),len(houses_norm_train)+len(houses_norm_test))
So, as requested, the test size is 20% of the entire dataset (24 houses out of 116), and the training set is the rest (92 out of 116). Let me introduce the scikit-learn Pipeline concept here. A pipeline helps in breaking our learning into sub-steps and assembling them together. This way, we can run the pipeline on different sets of parameters and cross-validation splits without changing the code of the individual steps.
In [27]:
linear_regressor_pipeline = Pipeline([
('normalize', preprocessing.StandardScaler()),
('regressor', LinearRegression())
])
A common practice is to split the training data into further smaller subsets; for example, 5 equally sized subsets. We then train the model on 4 of the subsets and validate the accuracy on the remaining one (called the "validation set"). If we repeat this training with a different subset as the validation set each time, we can test the stability of the model. If the model gives very different scores for different subsets, then we need to go back and check the model and/or the data.
In [30]:
scores = cross_val_score(linear_regressor_pipeline,
houses_norm_train, #<== training samples
y_norm_train,#<== training output labels
cv=5,#<=== split data randomly into 5 parts, 4 for training and 1 for scoring
scoring='r2', #<== use R^2 score
n_jobs=-1)
print(scores,scores.mean(),scores.std())
A score of 1 implies we have predicted perfectly. As you can see, we are a little off in our prediction (about 16% off), although the scores are quite consistent. We can definitely do better.
How can we fix this ?
There are a few things we can consider: getting more data (and dealing with missing values), using a better-suited model such as a regularized one, and tuning the model's hyperparameters.
All of the above advice is good. But where should we start?
Let's start with more data, then a regularized model and hyperparameter tuning, and see if we can make our predictions better.
We will get more data and use cubic interpolation to fill in missing values. You can read more about interpolation here.
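As a quick illustration of what interpolation does (the tiny Series below is made up and is not part of the Redfin data), pandas fits a curve through the known values and fills the NaNs from it; the cubic method requires scipy and at least four known points:
s = pandas.Series([1.0, 4.0, np.nan, 16.0, 25.0, 36.0])
print(s.interpolate(method='cubic'))   # the NaN is filled with a value on the cubic fit (approximately 9.0)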
In [181]:
labels=["LIST PRICE","BEDS","BATHS","SQFT","LOT SIZE","YEAR BUILT",
"PARKING SPOTS","PARKING TYPE","ORIGINAL LIST PRICE","LAST SALE PRICE"]
data=pandas.read_csv('./data/redfin_more_data1.csv', quoting=csv.QUOTE_NONE,names=labels)
print(data.head())
data.plot(x='SQFT', y='LAST SALE PRICE', kind='scatter')
data = convert_parking_type(data)
data.interpolate(method='cubic',inplace=True)
data = data.fillna(0)  # fillna returns a new frame, so assign it back; anything interpolation could not fill becomes 0
y= data['LAST SALE PRICE'].copy()
X =data.iloc[:,[0,1,2,3,4,5,6,7]].copy()
print ("features {0}: \n{1}".format(X.shape, X.head()))
print("target:{0}: \n{1}".format(y.shape, y.head()))
X_train,X_test, y_train, y_test = \
train_test_split(X,y,test_size=0.2,random_state=42) #<=== 20% of the samples are used for testing.
print(len(X_train),len(X_test),len(X_train)+len(X_test))
print(len(y_train) + len(y_test))
print (X_train.shape)
print (y_train.shape)
In [86]:
from sklearn.linear_model import Ridge
from sklearn.metrics import explained_variance_score, r2_score, mean_squared_error, mean_absolute_error
ridge_regressor_pipeline = Pipeline([
('normalize', preprocessing.StandardScaler()),
('regressor',Ridge())
])
scores = cross_val_score(ridge_regressor_pipeline,
X_train, #<== training samples
y_train,#<== training output labels
cv=5,#<=== split data randomly into 5 parts, 4 for training and 1 for scoring
scoring='r2', #<== use R^2 score
n_jobs=-1)
print(scores,scores.mean(),scores.std())
print("Accuracy: %0.5f (+/- %0.5f)" % (scores.mean(), scores.std() * 2))
Let's test with our test set and compare the predicted result with our actual results.
In [87]:
ridge_regressor_pipeline.fit(X_train,y_train)
y_predicted = ridge_regressor_pipeline.predict(X_test)
print("Variance Score: %0.05f and R2 Score: %0.5f \n" % (explained_variance_score(y_test,y_predicted), r2_score(y_test,y_predicted)))
In [177]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
"""
Generate a simple plot of the test and training learning curve.
Parameters
----------
estimator : object type that implements the "fit" and "predict" methods
An object of that type which is cloned for each validation.
title : string
Title for the chart.
X : array-like, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and
n_features is the number of features.
y : array-like, shape (n_samples) or (n_samples, n_features), optional
Target relative to X for classification or regression;
None for unsupervised learning.
ylim : tuple, shape (ymin, ymax), optional
Defines minimum and maximum yvalues plotted.
cv : integer, cross-validation generator, optional
If an integer is passed, it is the number of folds (defaults to 3).
Specific cross-validation objects can be passed, see
sklearn.cross_validation module for the list of possible objects
n_jobs : integer, optional
Number of jobs to run in parallel (default 1).
"""
plt.figure()
plt.title(title)
if ylim is not None:
plt.ylim(*ylim)
plt.xlabel("Training examples")
plt.ylabel("Score")
train_sizes, train_scores, test_scores = learning_curve(
estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.grid()
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score")
plt.legend(loc="best")
return plt
In [184]:
%time plot_learning_curve(ridge_regressor_pipeline, "accuracy vs. training set size", X_train, y_train, cv=5)
Out[184]:
As you can see, with more data our prediction confidence climbs to roughly 97%. We have a limited number of samples, and by around 150 training examples the score is already close to its best.
Let's use GridSearch to find the best params for our model.
In [267]:
from sklearn.linear_model import Ridge
from sklearn.cross_validation import StratifiedKFold
#load the data
houses=pandas.read_csv('./data/redfin_more_data1.csv', quoting=csv.QUOTE_NONE,names=["LIST PRICE","BEDS","BATHS","SQFT","LOT SIZE","YEAR BUILT",
"PARKING SPOTS","PARKING TYPE","ORIGINAL LIST PRICE","LAST SALE PRICE"])
houses.interpolate(inplace=True) #replace missing values by interpolating the data in a particular column
y= houses['LAST SALE PRICE'].copy()
houses_features =houses.iloc[:,[0,1,2,3,4,5]].copy()
#split 80% of data to training set and 20% to test set
houses_train,houses_test, y_train, y_test = \
train_test_split(houses_features,y,test_size=0.2) #<=== 20% of the samples are used for testing.
#build the pipeline and use the StandardScalar to normalize the data
ridge_regressor_pipeline = Pipeline([
('normalize', preprocessing.StandardScaler()),
('regressor', Ridge())
])
#set some params for the regressor
param_ridge = [
{'regressor__alpha':[0.01, 0.1, 1], 'regressor__solver':['lsqr'],'regressor__tol': [.99], 'regressor__max_iter': [500] },
{'regressor__alpha':[0.01, 0.1, 1], 'regressor__solver':['cholesky'],'regressor__tol': [.99], 'regressor__max_iter': [500] },
]
#search the best fitting params for our regressor
grid_ridge = GridSearchCV(
ridge_regressor_pipeline, #pipeline from above
param_grid=param_ridge, #parameters to tune via cross validation
refit=True, #fit using all data for the best regressor
n_jobs=-1,
scoring='r2',
cv=StratifiedKFold(y_train,n_folds=5)
)
In [188]:
%time price_predictor = grid_ridge.fit(houses_train,y_train)
print (price_predictor.grid_scores_)
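The fitted GridSearchCV object also exposes the winning combination directly, which is an easier way to read it off than scanning grid_scores_:
print(price_predictor.best_params_)   # the parameter combination with the best mean CV score
print(price_predictor.best_score_)    # that combination's mean cross-validated R^2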
So apparently, alpha=0.01, tol=.99 and solver='cholesky' provide the best fit. Let's run our regressor on our test set for a sanity check.
In [268]:
yp = price_predictor.predict(houses_test)
print("Variance: %0.02f and R2 Score: %0.02f \n" % (explained_variance_score(y_test,yp), r2_score(y_test,yp)))
We get about 95% accuracy when we run the prediction on the 20% of samples that we split off as the test set, a set our predictor has never seen before. Our R2 score has gone down a little bit, but this represents a real-world scenario. Let's predict the price of a house.
In [269]:
import math
data = {"LIST PRICE":879000,
"BEDS": 3,
"BATHS": 2.5,
"SQFT": 1900,
"LOT SIZE" : 6800,
"YEAR BUILT":2015}
house_test = pandas.DataFrame(data,index=[0],columns=["LIST PRICE","BEDS","BATHS","SQFT","LOT SIZE","YEAR BUILT"])
%time yp=price_predictor.predict(house_test)
print("Predicted Price: {0}".format(round(yp[0])))
For production, we may need to save the model, its parameters, the cross-validation scores and the training scores, and then run the model against a production database. We would then compare the model's training, CV and test scores with the prediction scores on production data and figure out whether the model needs further tuning.
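As a minimal sketch of that first step, we can persist the best fitted pipeline with joblib (imported at the top of the notebook); the file name used here is just an example:
from sklearn.externals import joblib

joblib.dump(price_predictor.best_estimator_, 'price_predictor.pkl')   # save the refit pipeline to disk
restored_predictor = joblib.load('price_predictor.pkl')               # reload it later, e.g. in a production service
print(restored_predictor.predict(house_test))                         # should match the prediction above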
This is a basic example of linear multivariate regression to predict a continuous output. What's next?
I will try to explore the following topics in future posts.