This tutorial uses multivariate regression to predict house prices. The high-level goal is to use multiple features (size, number of bedrooms, number of bathrooms, etc.) to predict the price of a house. It is a self-paced tutorial.

The language used throughout is Python, along with the Python libraries available for scientific computing and machine learning.

One of those tools is the IPython notebook: interactive Python rendered as HTML, which is what you are reading right now. We'll go over other practical tools, widely used in the data science industry, below.

**You can run this notebook interactively**
  1. Install the (free) [Anaconda](https://store.continuum.io/cshop/anaconda/) Python distribution, including Python itself.
  2. Download the source for this notebook to your computer: [https://github.com/nmishra/mvregression/blob/master/predict_house_price_python.ipynb](https://github.com/nmishra/mvregression/blob/master/predict_house_price_python.ipynb) and run it with:
    `$ ipython notebook predict_house_price_python.ipynb`
  3. Watch the [IPython tutorial video](https://www.youtube.com/watch?v=H6dLGQw9yFQ) for notebook navigation basics.
  4. Run the first code cell below; if it executes without errors, you're good to go!

Python tutorial: Multivariate linear regression to predict house prices


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import csv
import pandas
import sklearn
import pickle
import numpy as np
from sklearn import preprocessing
from sklearn import cross_validation as cv
from sklearn.metrics import explained_variance_score, mean_squared_error,r2_score 
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score, train_test_split 
from sklearn.learning_curve import learning_curve
from sklearn.externals import joblib
from sklearn.linear_model import LinearRegression

Step 1: Load data, plot data

Labeled data is crucial for supervised learning. Courtesy of redfin.com, we are going to focus on the last 3 years of data for zipcode 94555. The houses in this dataset have a lot size of at least 4,500 sq ft, 3+ beds, 2+ baths, 1,500-2,000 sq ft of living space, and were not listed on www.redfin.com for more than 30 days.

$ mkdir data && cd data 
$ ls -ltrh
-rw-r--r--@ 1 user  staff    41K Feb 17 23:19 redfin_2015-02-17_94555_results.csv

pandas is an amazing library for loading and manipulating data. We will use it to load the data.


In [14]:
labels=["LIST PRICE","BEDS","BATHS","SQFT","LOT SIZE","YEAR BUILT",
        "PARKING SPOTS","PARKING TYPE","ORIGINAL LIST PRICE","LAST SALE PRICE"]
houses=pandas.read_csv('./data/redfin_2015-02-17_94555_results_massaged.csv', quoting=csv.QUOTE_NONE,names=labels)
print(houses.head())


   LIST PRICE  BEDS  BATHS  SQFT  LOT SIZE  YEAR BUILT  PARKING SPOTS  \
0      739000     3    2.5  1988      5595        1991              2   
1      749888     3    2.5  1642      9876        1986              2   
2      713500     4    2.0  1504      5800        1969              2   
3      749000     3    3.0  1781      5800        1971              2   
4      835000     4    2.5  1857      5061        1990              2   

  PARKING TYPE  ORIGINAL LIST PRICE  LAST SALE PRICE  
0       Garage               739000           755000  
1       Garage               749888           682500  
2       Garage               599000           713500  
3       Garage               749000           750000  
4       Garage               835000           835000  

The shape of the input data (rows, columns) is:


In [16]:
print(houses.shape)


(116, 10)

It's always advisable to visualize the data. Let's create a scatter plot with X-axis as the SQFT of the house and Y-axis as the LAST SALE PRICE of the house.


In [12]:
houses.plot(x='SQFT',y='LAST SALE PRICE', kind='scatter')


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x107423cf8>

pandas also helps with aggregate statistics.


In [20]:
houses.groupby('SQFT').describe()


Out[20]:
BATHS BEDS LAST SALE PRICE LIST PRICE LOT SIZE ORIGINAL LIST PRICE PARKING SPOTS YEAR BUILT
SQFT
1500 count 1.0 1.000000 1.000000 1.000000 1.000000 1.000000 1 1.000000
mean 2.0 3.000000 583000.000000 583000.000000 6139.000000 557300.000000 2 1978.000000
std NaN NaN NaN NaN NaN NaN NaN NaN
min 2.0 3.000000 583000.000000 583000.000000 6139.000000 557300.000000 2 1978.000000
25% 2.0 3.000000 583000.000000 583000.000000 6139.000000 557300.000000 2 1978.000000
50% 2.0 3.000000 583000.000000 583000.000000 6139.000000 557300.000000 2 1978.000000
75% 2.0 3.000000 583000.000000 583000.000000 6139.000000 557300.000000 2 1978.000000
max 2.0 3.000000 583000.000000 583000.000000 6139.000000 557300.000000 2 1978.000000
1504 count 3.0 3.000000 3.000000 3.000000 3.000000 3.000000 3 3.000000
mean 2.0 3.666667 731166.666667 687462.666667 5837.666667 649296.000000 2 1974.333333
std 0.0 0.577350 42826.199146 34081.757310 467.639106 50501.236104 0 10.115994
min 2.0 3.000000 700000.000000 648888.000000 5390.000000 599000.000000 2 1968.000000
25% 2.0 3.500000 706750.000000 674444.000000 5595.000000 623944.000000 2 1968.500000
50% 2.0 4.000000 713500.000000 700000.000000 5800.000000 648888.000000 2 1969.000000
75% 2.0 4.000000 746750.000000 706750.000000 6061.500000 674444.000000 2 1977.500000
max 2.0 4.000000 780000.000000 713500.000000 6323.000000 700000.000000 2 1986.000000
1507 count 1.0 1.000000 1.000000 1.000000 1.000000 1.000000 1 1.000000
mean 2.0 4.000000 498000.000000 450000.000000 6323.000000 450000.000000 2 1969.000000
std NaN NaN NaN NaN NaN NaN NaN NaN
min 2.0 4.000000 498000.000000 450000.000000 6323.000000 450000.000000 2 1969.000000
25% 2.0 4.000000 498000.000000 450000.000000 6323.000000 450000.000000 2 1969.000000
50% 2.0 4.000000 498000.000000 450000.000000 6323.000000 450000.000000 2 1969.000000
75% 2.0 4.000000 498000.000000 450000.000000 6323.000000 450000.000000 2 1969.000000
max 2.0 4.000000 498000.000000 450000.000000 6323.000000 450000.000000 2 1969.000000
1523 count 1.0 1.000000 1.000000 1.000000 1.000000 1.000000 1 1.000000
mean 2.0 3.000000 626000.000000 620000.000000 9241.000000 650000.000000 2 1974.000000
std NaN NaN NaN NaN NaN NaN NaN NaN
min 2.0 3.000000 626000.000000 620000.000000 9241.000000 650000.000000 2 1974.000000
25% 2.0 3.000000 626000.000000 620000.000000 9241.000000 650000.000000 2 1974.000000
50% 2.0 3.000000 626000.000000 620000.000000 9241.000000 650000.000000 2 1974.000000
... ... ... ... ... ... ... ... ... ...
1936 std 0.0 0.000000 255265.548008 248265.190875 1675.135965 218849.548777 0 1.414214
min 3.0 4.000000 485000.000000 484900.000000 6400.000000 526500.000000 2 1973.000000
25% 3.0 4.000000 575250.000000 572675.000000 6992.250000 603875.000000 2 1973.500000
50% 3.0 4.000000 665500.000000 660450.000000 7584.500000 681250.000000 2 1974.000000
75% 3.0 4.000000 755750.000000 748225.000000 8176.750000 758625.000000 2 1974.500000
max 3.0 4.000000 846000.000000 836000.000000 8769.000000 836000.000000 2 1975.000000
1942 count 1.0 1.000000 1.000000 1.000000 1.000000 1.000000 1 1.000000
mean 2.0 4.000000 595000.000000 564900.000000 6205.000000 564900.000000 2 1973.000000
std NaN NaN NaN NaN NaN NaN NaN NaN
min 2.0 4.000000 595000.000000 564900.000000 6205.000000 564900.000000 2 1973.000000
25% 2.0 4.000000 595000.000000 564900.000000 6205.000000 564900.000000 2 1973.000000
50% 2.0 4.000000 595000.000000 564900.000000 6205.000000 564900.000000 2 1973.000000
75% 2.0 4.000000 595000.000000 564900.000000 6205.000000 564900.000000 2 1973.000000
max 2.0 4.000000 595000.000000 564900.000000 6205.000000 564900.000000 2 1973.000000
1958 count 1.0 1.000000 1.000000 1.000000 1.000000 1.000000 1 1.000000
mean 2.0 4.000000 690000.000000 649000.000000 6828.000000 649000.000000 2 1985.000000
std NaN NaN NaN NaN NaN NaN NaN NaN
min 2.0 4.000000 690000.000000 649000.000000 6828.000000 649000.000000 2 1985.000000
25% 2.0 4.000000 690000.000000 649000.000000 6828.000000 649000.000000 2 1985.000000
50% 2.0 4.000000 690000.000000 649000.000000 6828.000000 649000.000000 2 1985.000000
75% 2.0 4.000000 690000.000000 649000.000000 6828.000000 649000.000000 2 1985.000000
max 2.0 4.000000 690000.000000 649000.000000 6828.000000 649000.000000 2 1985.000000
1988 count 4.0 4.000000 4.000000 4.000000 4.000000 4.000000 4 4.000000
mean 2.5 3.750000 683750.000000 678997.000000 5538.500000 683747.000000 2 1992.000000
std 0.0 0.500000 113164.702978 102146.130793 884.457838 92866.140417 0 1.154701
min 2.5 3.000000 515000.000000 529000.000000 4576.000000 548000.000000 2 1991.000000
25% 2.5 3.750000 672500.000000 657241.000000 5105.500000 661991.000000 2 1991.000000
50% 2.5 4.000000 732500.000000 719494.000000 5438.500000 719494.000000 2 1992.000000
75% 2.5 4.000000 743750.000000 741250.000000 5871.500000 741250.000000 2 1993.000000
max 2.5 4.000000 755000.000000 748000.000000 6701.000000 748000.000000 2 1993.000000

504 rows × 8 columns

Step 2: Data preprocessing

In this section, we will convert the data into a format that can be normalized and consumed by our learning algorithm. We have 9 columns of input features from which we need to predict the house price, and 116 rows of data (training samples).

We will extract the LAST SALE PRICE column and save it as our target values, then normalize the remaining columns. Normalization ensures that the features are on a uniform scale, so that no single feature carries more weight than the others simply because of its magnitude.

Step 2.1 Extract target values

We extract LAST SALE PRICE into a 'y' array and save the remaining columns into a 'houses_features' array.


In [15]:
y= houses['LAST SALE PRICE'].copy()
houses_features =houses.iloc[:,[0,1,2,3,4,5,6,7,8]].copy()

print ("features: \n",houses_features.head())
print("target: \n", y.head())


features: 
    LIST PRICE  BEDS  BATHS  SQFT  LOT SIZE  YEAR BUILT  PARKING SPOTS  \
0      739000     3    2.5  1988      5595        1991              2   
1      749888     3    2.5  1642      9876        1986              2   
2      713500     4    2.0  1504      5800        1969              2   
3      749000     3    3.0  1781      5800        1971              2   
4      835000     4    2.5  1857      5061        1990              2   

  PARKING TYPE  ORIGINAL LIST PRICE  
0       Garage               739000  
1       Garage               749888  
2       Garage               599000  
3       Garage               749000  
4       Garage               835000  
target: 
 0    755000
1    682500
2    713500
3    750000
4    835000
Name: LAST SALE PRICE, dtype: int64

Step 2.2 Normalize data

The first step is to give the PARKING TYPE feature a numerical value: we assign 1 for the presence of a garage and 0 for its absence.


In [16]:
def convert_parking_type(features):
    # Map the PARKING TYPE string to a numeric flag: 1 for 'Garage', 0 otherwise.
    features['PARKING TYPE'] \
    = features['PARKING TYPE'].map(lambda x : 1 if x.strip().lower() == 'garage' else 0)
    return features
        
houses_features = convert_parking_type(houses_features);
print(houses_features['PARKING TYPE'].head())


0    1
1    1
2    1
3    1
4    1
Name: PARKING TYPE, dtype: int64

There are many ways to normalize features. We will use a simple one: given a list of values X = [x_1, x_2, ..., x_m], each value is normalized as

\begin{equation*} x_i = \frac{x_i - mean(X)}{\sigma(X)} \end{equation*}

where mean(X) and σ(X) are the mean and standard deviation of the values in X, and m is the number of values.

In our case, we will normalize each column of features separately.


In [17]:
def feature_normalize(matrix, mean=None, std=None):
    mean = matrix.mean() if mean is None else mean
    std = matrix.std() if std is None else std
    matrix = (matrix - mean)/std
    # Columns with zero variance (e.g. PARKING SPOTS) divide by zero and become NaN; set those to 0.
    return {'matrix_norm': matrix.fillna(0), 'mean': mean, 'std': std}

Sanity Check !!!


In [19]:
results = feature_normalize(houses_features)
houses_norm = results['matrix_norm']
X_mean =results['mean']
X_std = results['std']
                      
results = feature_normalize(y);
y_norm = results['matrix_norm']
y_mean = results['mean']
y_std = results['std']

print ('{0}\n'.format(houses_norm.head()))
print ('{0}\n'.format(y_norm.head()))


   LIST PRICE      BEDS     BATHS      SQFT  LOT SIZE  YEAR BUILT  \
0    0.783326 -1.417852  0.444568  2.093853 -0.446035    1.547770   
1    0.899198 -1.417852  0.444568 -0.591009  1.947155    0.851334   
2    0.511950  0.517099 -0.949213 -1.661850 -0.331435   -1.516551   
3    0.889748 -1.417852  1.838348  0.487591 -0.331435   -1.237976   
4    1.804976  0.517099  0.444568  1.077330 -0.744555    1.408483   

   PARKING SPOTS  PARKING TYPE  ORIGINAL LIST PRICE  
0              0             0             0.777064  
1              0             0             0.894669  
2              0             0            -0.735122  
3              0             0             0.885077  
4              0             0             1.813991  

0    0.679040
1   -0.086546
2    0.240808
3    0.626241
4    1.523824
Name: LAST SALE PRICE, dtype: float64

We can also use the preprocessing module of sklearn to apply the same zero-mean, unit-variance scaling. Let's use that and check the results.


In [22]:
scaler = preprocessing.StandardScaler().fit(houses_features.values)
houses_norm = scaler.transform(houses_features)
print("Mean: {0}, STD: {1}".format(scaler.mean_,scaler.std_))
print(houses_norm)


Mean: [  6.65394267e+05   3.73275862e+00   2.34051724e+00   1.71816379e+03
   6.39287931e+03   1.97988793e+03   2.00000000e+00   1.00000000e+00
   6.67058466e+05], STD: [  9.35597222e+04   5.14576468e-01   3.57186913e-01   1.28313997e+02
   1.78109820e+03   7.14839209e+00   1.00000000e+00   1.00000000e+00
   9.21813058e+04]
[[ 0.78672458 -1.42400336  0.44649665 ...,  0.          0.          0.78043519]
 [ 0.90309944 -1.42400336  0.44649665 ...,  0.          0.          0.89855024]
 [ 0.51417139  0.5193424  -0.95333068 ...,  0.          0.         -0.73831093]
 ..., 
 [-1.24406384  0.5193424  -0.95333068 ...,  0.          0.         -1.28072025]
 [-0.39019213  0.5193424  -0.95333068 ...,  0.          0.         -0.41408033]
 [-2.83769833 -1.42400336 -0.95333068 ...,  0.          0.         -2.89818487]]

Step 3: Training our model

Let's use a simple linear model that minimizes the squared-error cost function.

Optional !!: A quick mathematical background.

$\begin{equation*} y' = h_\theta(X)\end{equation*}$

Where

$\begin{equation*}h_\theta(X) = \theta_0 + \theta_1*x_1 + ... + \theta_n*x_n\end{equation*}$

Here y' is the predicted value and y is the actual value. Our goal is to minimize the difference between these two values. So, if we can choose the best values for $\begin{equation*}\Theta\end{equation*}$, we can accurately predict the house prices.

To achieve that, we define the squared-error cost function and look for the values of $\begin{equation*}\Theta = [\theta_0, \theta_1, \ldots, \theta_n]\end{equation*}$ that minimize it:

$\begin{equation*} J(\Theta) = \sum_{i=1}^{m}\big(h_\theta(x^{(i)})-y^{(i)}\big)^2\end{equation*}$

where m is the number of training samples. Gradient descent does this minimization: we choose some number of iterations, and in each iteration we simultaneously calculate

$\begin{equation*} \theta_0 := \theta_0 -\alpha*\frac{\partial}{\partial\theta_0}J(\Theta)\end{equation*}$

$\begin{equation*} \theta_1 := \theta_1 - \alpha*\frac{\partial}{\partial\theta_1}J(\Theta)\end{equation*}$

. . .

$\begin{equation*} \theta_n := \theta_n - \alpha*\frac{\partial}{\partial\theta_n}J(\Theta)\end{equation*}$

Where $\begin{equation*}\alpha\end{equation*}$ is a scalar learning rate value.
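Before reaching for a library, here is a minimal NumPy sketch of that update rule, purely for illustration. The learning rate, the iteration count, and the 1/m averaging of the gradient are arbitrary choices made for this sketch, not values used later in this tutorial.

import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1000):
    m, n = X.shape
    X = np.hstack([np.ones((m, 1)), X])    # prepend x_0 = 1 so theta_0 acts as the intercept
    theta = np.zeros(n + 1)
    for _ in range(iterations):
        error = X.dot(theta) - y           # h_theta(x) - y for every training sample
        gradient = X.T.dot(error) / m      # gradient of J(Theta) up to a constant factor (folded into alpha)
        theta = theta - alpha * gradient   # simultaneous update of every theta_j
    return theta

# For example (assuming the normalized features and target computed earlier):
# theta = gradient_descent(np.asarray(houses_norm), np.asarray(y_norm))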

Phew!! Thankfully, we don't have to hand-roll any of this: we can use the linear models available in sklearn.

Let's define a train_model function in which:

  1. We create a regressor with the intercept enabled. The intercept is the value of y when all feature values are 0; we can either add $\begin{equation*}x_0 = 1\end{equation*}$ as an extra column of our feature matrix ourselves, or simply set the fit_intercept parameter to True. Intuitively, if we are given no features for a house at all, the intercept is our baseline estimate of its price.
  2. We fit (train) the model with our house features and prices.

In [23]:
def train_model(X,y):
    regressor = LinearRegression(fit_intercept=True)
    regressor.fit(X,y)
    return regressor

In [24]:
regr = train_model(houses_norm, y_norm)
print(regr.coef_)


[ 0.95294642  0.02315098 -0.09140542  0.00173405  0.00790588 -0.03303855
  0.          0.          0.01064799]

Let's run a prediction and see if we can predict the price of a house. We are going to use the 3rd sample from our training set. The input features (house_test) for this sample are: {"LIST PRICE" : 749888, "BEDS" : 3, "BATHS": 2.5, "SQFT": 1642, "LOT SIZE" : 9876, "YEAR BUILT" : 1968, "PARKING SPOTS" : 2, "PARKING TYPE" : "Garage", "ORIGINAL LIST PRICE": 74988}

and the actual output, LAST SALE PRICE, is 713,500.


In [25]:
house_test = pandas.DataFrame({"LIST PRICE" : 749888,
 "BEDS" : 3,
 "BATHS": 2.5,
 "SQFT": 1642,
 "LOT SIZE" : 9876,
 "YEAR BUILT" : 1968,
 "PARKING SPOTS" : 2, 
 "PARKING TYPE" : "Garage",
 "ORIGINAL LIST PRICE": 74988},index=[0])
house_test_converted = convert_parking_type(house_test)
house_test_norm = feature_normalize(house_test_converted, X_mean, X_std)

y_predicted_norm = regr.predict(house_test_norm['matrix_norm']);
y_predicted = y_predicted_norm*y_std + y_mean
print('Predicted Price: {0}'.format(y_predicted[0].round(0)))


print('Normalized Input: \n {0} \n'.format(house_test_norm['matrix_norm']))


Predicted Price: 713785.0
Normalized Input: 
       BATHS      BEDS  LIST PRICE  LOT SIZE  ORIGINAL LIST PRICE  \
0  0.444568 -1.417852    0.899198  1.947155            -6.395146   

   PARKING SPOTS  PARKING TYPE      SQFT  YEAR BUILT  
0              0             0 -0.591009   -1.655838   

Let's do some score analysis to find out the performance of our prediction algorithm.


In [178]:
print("Mean square error: {0} \n".format(mean_squared_error([y_norm[2]],y_predicted_norm)))
print("R2 Score: {0} \n".format(regr.score(houses_norm, y_norm)))


Mean square error: 9.067023349686855e-06 

R2 Score: 0.8903593961815572 

Step 5. Now what? Optimizations and explorations - the fun part!!!

In the above "evaluation" we didn't break the data into training and test set. That's a big no no in preparing your machine learning models. Let's fix that.

A proper way is to split the data into a training/test set, where the model only ever sees the training data during its model fitting and parameter tuning. The test data is used for final evaluation of our model to find out it's prediction performance.


In [29]:
houses_norm_train,houses_norm_test, y_norm_train, y_norm_test = \
train_test_split(houses_norm,y_norm,test_size=0.2) #<=== 20% of the samples are used for testing.
print(len(houses_norm_train),len(houses_norm_test),len(houses_norm_train)+len(houses_norm_test))


92 24 116

As requested, the test set is 20% of the entire dataset (24 houses out of 116), and the training set is the rest (92 out of 116). Let me introduce the scikit-learn Pipeline concept here. A pipeline breaks our learning workflow into sub-steps and assembles them together, so we can run it with different parameters and cross-validation sets without changing the code of the individual steps.


In [27]:
linear_regressor_pipeline = Pipeline([
('normalize', preprocessing.StandardScaler()),
('regressor', LinearRegression())
])

A common practice is to split the training data into smaller subsets, for example 5 equally sized subsets (folds). We then train the model on 4 of them and validate its accuracy on the remaining one (the "validation set"). Repeating this with a different subset as the validation set each time lets us test the stability of the model; if the model gives very different scores for different subsets, we need to go back and check the model and/or the data.


In [30]:
scores = cross_val_score(linear_regressor_pipeline,
                         houses_norm_train, #<== training samples
                         y_norm_train,#<== training output labels
                         cv=5,#<=== split data randomly into 5 parts, 4 for training and 1 for scoring
                         scoring='r2', #<== use R^2 score
                         n_jobs=-1)
print(scores,scores.mean(),scores.std())


[ 0.89656312  0.91393417  0.887133    0.90887704  0.63537356] 0.848376178916 0.106913414232

A score of 1 implies a perfect prediction. As you can see, we are a little off (the mean R² score is about 0.85, roughly 15% short of perfect), and the last fold scores noticeably lower than the others. We can definitely do better.

How can we fix this?

There are a few design principles we should consider:

  1. Get more data
    • More training data doesn't always help; if we are overfitting our model to the training data, extra data alone may not fix it.
    • But most of the time this is a good place to start.
  2. Try smaller sets of features
    • We can do this by hand, or use dimensionality reduction techniques such as PCA (see the sketch after this list).
  3. Polynomial features
    • If we think some features have a heavier influence than others, we may want to add higher-degree terms for those features (also sketched below).
  4. Combining features
    • If we have better knowledge of the domain, we can combine existing features to build better ones.
  5. New models
    • We can try other regression models and see if they improve our prediction score.
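As a rough illustration of points 2 and 3, here is how PCA-based dimensionality reduction and polynomial features could be slotted into the same Pipeline pattern we used above; the n_components and degree values are arbitrary starting points for this sketch, not tuned choices.

from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Point 2: project the features onto a smaller number of principal components.
pca_pipeline = Pipeline([
('normalize', preprocessing.StandardScaler()),
('reduce', PCA(n_components=4)),
('regressor', LinearRegression())
])

# Point 3: add squared and interaction terms for every feature.
poly_pipeline = Pipeline([
('normalize', preprocessing.StandardScaler()),
('poly', PolynomialFeatures(degree=2)),
('regressor', LinearRegression())
])

# Either pipeline can be passed to cross_val_score exactly like linear_regressor_pipeline above.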

All of the above advice is good. But where should we start?

Let's start with points 1, 3, and 4 above and see if we can make our predictions better.

We will get more data and use cubic interpolation to fill in missing values. You can read more about interpolation here.


In [181]:
labels=["LIST PRICE","BEDS","BATHS","SQFT","LOT SIZE","YEAR BUILT",
        "PARKING SPOTS","PARKING TYPE","ORIGINAL LIST PRICE","LAST SALE PRICE"]
data=pandas.read_csv('./data/redfin_more_data1.csv', quoting=csv.QUOTE_NONE,names=labels)
print(data.head())
data.plot(x='SQFT', y='LAST SALE PRICE', kind='scatter')
data = convert_parking_type(data)
data.interpolate(method='cubic',inplace=True)
data = data.fillna(0) # fillna returns a new frame, so assign the result back


y= data['LAST SALE PRICE'].copy()
X =data.iloc[:,[0,1,2,3,4,5,6,7]].copy()

print ("features {0}: \n{1}".format(X.shape, X.head()))
print("target:{0}: \n{1}".format(y.shape, y.head()))

X_train,X_test, y_train, y_test = \
train_test_split(X,y,test_size=0.2,random_state=42) #<=== 20% of the samples are used for testing.
print(len(X_train),len(X_test),len(X_train)+len(X_test))
print(len(y_train) + len(y_test))
print (X_train.shape)
print (y_train.shape)


   LIST PRICE  BEDS  BATHS  SQFT  LOT SIZE  YEAR BUILT  PARKING SPOTS  \
0      538000     3    2.5  1252      1932        1993              2   
1      538888     3    2.0  1253      2100        1973              2   
2      539500     3    2.5  1298      2437        1984              2   
3      545000     3    2.5  1298      2482        1988              2   
4      548888     4    2.0  1298      2700        1973              2   

  PARKING TYPE  ORIGINAL LIST PRICE  LAST SALE PRICE  
0       Garage               538000           393500  
1       Garage               538888           400000  
2       Garage               539500           401000  
3       Garage               545000           404500  
4       Garage               548888           415000  
features (318, 8): 
   LIST PRICE  BEDS  BATHS  SQFT  LOT SIZE  YEAR BUILT  PARKING SPOTS  \
0      538000     3    2.5  1252      1932        1993              2   
1      538888     3    2.0  1253      2100        1973              2   
2      539500     3    2.5  1298      2437        1984              2   
3      545000     3    2.5  1298      2482        1988              2   
4      548888     4    2.0  1298      2700        1973              2   

   PARKING TYPE  
0             1  
1             1  
2             1  
3             1  
4             1  
target:(318,): 
0    393500
1    400000
2    401000
3    404500
4    415000
Name: LAST SALE PRICE, dtype: int64
254 64 318
318
(254, 8)
(254,)

In [86]:
from sklearn.linear_model import Ridge
from sklearn.metrics import explained_variance_score, r2_score, mean_squared_error, mean_absolute_error
ridge_regressor_pipeline = Pipeline([
('normalize', preprocessing.StandardScaler()),
('regressor',Ridge())
])

scores = cross_val_score(ridge_regressor_pipeline,
                         X_train, #<== training samples
                         y_train,#<== training output labels
                         cv=5,#<=== split data randomly into 5 parts, 4 for training and 1 for scoring
                         scoring='r2', #<== use R^2 score
                         n_jobs=-1)

print(scores,scores.mean(),scores.std())
print("Accuracy: %0.5f (+/- %0.5f)" % (scores.mean(), scores.std() * 2))


[ 0.9822377   0.97144495  0.97506792  0.98181215  0.96860433] 0.975833408526 0.0054564552039
Accuracy: 0.97583 (+/- 0.01091)

Let's test with our test set and compare the predicted result with our actual results.


In [87]:
ridge_regressor_pipeline.fit(X_train,y_train)
y_predicted = ridge_regressor_pipeline.predict(X_test)
print("Variance Score: %0.05f and R2 Score: %0.5f \n" % (explained_variance_score(y_test,y_predicted), r2_score(y_test,y_predicted)))


Variance: 0.97419 and R2 Score: 0.97418 


In [177]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

In [184]:
%time plot_learning_curve(ridge_regressor_pipeline, "accuracy vs. training set size", X_train, y_train, cv=5)


CPU times: user 265 ms, sys: 48.7 ms, total: 314 ms
Wall time: 520 ms
Out[184]:
<module 'matplotlib.pyplot' from '/Users/nmishra/.virtualenvs/mvregression@python3.4.2/lib/python3.4/site-packages/matplotlib/pyplot.py'>

As you can see, with more data the cross-validation score is close to 0.97. We still have a limited number of samples, but by around 150 training examples the score has nearly plateaued close to the training score.

Step 6. Find the best parameters for our regressor

Let's use GridSearchCV to find the best parameters for our model.


In [267]:
from sklearn.linear_model import Ridge
from sklearn.cross_validation import StratifiedKFold

#load the data
houses=pandas.read_csv('./data/redfin_more_data1.csv', quoting=csv.QUOTE_NONE,names=["LIST PRICE","BEDS","BATHS","SQFT","LOT SIZE","YEAR BUILT",
        "PARKING SPOTS","PARKING TYPE","ORIGINAL LIST PRICE","LAST SALE PRICE"])
houses.interpolate(inplace=True) #replace missing values by interpolating the data in a particular column
y= houses['LAST SALE PRICE'].copy()
houses_features =houses.iloc[:,[0,1,2,3,4,5]].copy()

#split 80% of data to training set and 20% to test set
houses_train,houses_test, y_train, y_test = \
train_test_split(houses_features,y,test_size=0.2) #<=== 20% of the samples are used for testing.

#build the pipeline and use the StandardScaler to normalize the data
ridge_regressor_pipeline = Pipeline([
('normalize', preprocessing.StandardScaler()),
('regressor', Ridge())
])

#set some params for the regressor
param_ridge = [
{'regressor__alpha':[0.01, 0.1, 1], 'regressor__solver':['lsqr'],'regressor__tol': [.99], 'regressor__max_iter': [500] },
{'regressor__alpha':[0.01, 0.1, 1], 'regressor__solver':['cholesky'],'regressor__tol': [.99], 'regressor__max_iter': [500] },
]

#search the best fitting params for our regressor
grid_ridge = GridSearchCV(
    ridge_regressor_pipeline, #pipeline from above
    param_grid=param_ridge, #parameters to tune via cross validation
    refit=True, #fit using all data for the best regressor
    n_jobs=-1,
    scoring='r2',
    cv=StratifiedKFold(y_train,n_folds=5)
)

In [188]:
%time price_predictor = grid_ridge.fit(houses_train,y_train)
print (price_predictor.grid_scores_)


CPU times: user 196 ms, sys: 40.6 ms, total: 236 ms
Wall time: 522 ms
[mean: 0.96209, std: 0.07023, params: {'regressor__tol': 0.99, 'regressor__alpha': 0.01, 'regressor__solver': 'lsqr', 'regressor__max_iter': 500}, mean: 0.96208, std: 0.07026, params: {'regressor__tol': 0.99, 'regressor__alpha': 0.1, 'regressor__solver': 'lsqr', 'regressor__max_iter': 500}, mean: 0.96190, std: 0.07056, params: {'regressor__tol': 0.99, 'regressor__alpha': 1, 'regressor__solver': 'lsqr', 'regressor__max_iter': 500}, mean: 0.98043, std: 0.01240, params: {'regressor__tol': 0.99, 'regressor__alpha': 0.01, 'regressor__solver': 'cholesky', 'regressor__max_iter': 500}, mean: 0.98045, std: 0.01332, params: {'regressor__tol': 0.99, 'regressor__alpha': 0.1, 'regressor__solver': 'cholesky', 'regressor__max_iter': 500}, mean: 0.97933, std: 0.01996, params: {'regressor__tol': 0.99, 'regressor__alpha': 1, 'regressor__solver': 'cholesky', 'regressor__max_iter': 500}]

So apparently the 'cholesky' solver with tol=0.99 and a small alpha provides the best fit (alpha=0.1 edges out alpha=0.01 by a hair in the grid scores). Let's run our regressor on our test set for a sanity check.


In [268]:
yp = price_predictor.predict(houses_test)
print("Variance: %0.02f and R2 Score: %0.02f \n" % (explained_variance_score(y_test,yp), r2_score(y_test,yp)))


Variance: 0.97 and R2 Score: 0.95 

We get an R² score of 0.95 when we run the prediction on the 20% of samples we had split off as the test set, which the predictor has never seen before. The R² score has gone down a little compared to cross-validation, but this better represents a real-world scenario. Now let's predict the price of a house.


In [269]:
data = {"LIST PRICE":879000,
"BEDS": 3,
"BATHS": 2.5,
"SQFT": 1900,
"LOT SIZE" : 6800,
"YEAR BUILT":2015} 
house_test = pandas.DataFrame(data,index=[0],columns=["LIST PRICE","BEDS","BATHS","SQFT","LOT SIZE","YEAR BUILT"])
%time yp=price_predictor.predict(house_test)
print("Predicted Price: {0}".format(round(yp[0])))


CPU times: user 409 µs, sys: 1 µs, total: 410 µs
Wall time: 419 µs
Predicted Price: 864631.0

For production, we may want to save the model, its parameters, and its cross-validation and training scores, and then run the model against a production database. We would then compare the training, cross-validation, and test scores with the prediction scores on production data to figure out whether our model needs further tuning.
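As a minimal sketch of that persistence step (the filename here is an arbitrary choice for illustration), we can serialize the fitted grid-search object with joblib, which we imported at the top of the notebook and which handles scikit-learn estimators built around large NumPy arrays more efficiently than plain pickle:

from sklearn.externals import joblib

# Persist the fitted grid-search object (best estimator + chosen parameters) to disk.
joblib.dump(price_predictor, 'house_price_model.pkl')

# Later, e.g. in a production scoring service:
price_predictor = joblib.load('house_price_model.pkl')
yp = price_predictor.predict(house_test)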

This is a basic example of multivariate linear regression used to predict a continuous output. What's next?

I will try to explore the following topics in future posts.

  1. Noise detection using expectation-maximization models.
  2. Feature exploration using novelty detection.
  3. Feature weighting using polynomial degrees of features.
  4. Developing your own multivariate linear regression model, optimizing its parameters, and predicting a continuous output.