We'll be going over how to use scikit-learn's regression model, how to train the regressor using the fit() method, and how to predict new labels using the predict() method. We'll be analyzing a data set consisting of house prices in Boston. We'll start off with a single variable linear regression using numpy and then move on to scikit learn. We'll do an overview of the mathematics behind the method we're using, but mostly we'll focus on practical, "hands-on" coding lessons.
In this section we will be working through linear regression with the following steps:
Step 1: Getting and setting up the data.
Step 2: Visualizing current data.
Step 3: The mathematics behind the Least Squares Method.
Step 4: Using Numpy for a Univariate Linear Regression.
Step 5: Getting the error.
Step 6: Using scikit learn to implement a multivariate regression.
Step 7: Using Training and Validation.
Step 8: Predicting Prices
Step 9 : Residual Plots
In [2]:
# Standard imports
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
# scikit learn
import sklearn
from sklearn.datasets import load_boston
In [3]:
# Load the housing data sets
boston = load_boston()
print(boston.DESCR)
In [29]:
# histogram of prices
plt.hist(boston.target, bins=50)
plt.xlabel('Price in $1000s')
plt.ylabel('Number of houses')
plt.title('Prices vs. Houses')
plt.savefig('house_vs_price.png')
In [7]:
plt.scatter(boston.data[:,5], boston.target)
#label
plt.ylabel('Price in $1000s')
plt.xlabel('Number of rooms')
Out[7]:
Now we can make out a slight trend: price increases along with the number of rooms in the house, which intuitively makes sense! Now let's use scikit learn to see if we can fit the data linearly.
Let's try to do the following:
1.) Use pandas to transform the boston dataset into a DataFrame:
2.) Then use seaborn to perform an lmplot on that DataFrame to reproduce the scatter plot with a linear fit line.
In [8]:
# converting into dataFrame
boston_df = DataFrame(boston.data)
boston_df.columns= boston.feature_names
boston_df.head()
Out[8]:
In [9]:
# Creating a price column in dataFrame
boston_df['PRICE'] = boston.target
boston_df.head()
Out[9]:
Now, you might be reminded of the seaborn lmplot function we used during the visualization lectures. You could use it here to do a linear fit automatically!
In [10]:
# linear regression plot
sns.lmplot(x='RM', y='PRICE', data=boston_df)
Out[10]:
Here we'll use the least squares method as the way to estimate the coefficients. Here's a quick breakdown of how this method works mathematically:
Take a quick look at the plot we created above using seaborn. Now consider each point, and note that each one has a coordinate in the form (X, Y). Now draw an imaginary vertical line between each point and our current "best-fit" line; we'll call the distance between each point and the best-fit line D. To get a quick image of what we're currently trying to visualize, take a look at the picture below:
In [11]:
# Quick display of image from wikipedia
from IPython.display import Image
url = 'http://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Linear_least_squares_example2.svg/220px-Linear_least_squares_example2.svg.png'
Image(url)
Out[11]:
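Before writing any code, here is a minimal sketch (not part of the lecture itself) of the quantity least squares minimizes, using four made-up points similar to those in the figure above; the slope and intercept below are arbitrary placeholders:
import numpy as np

# Example points and a candidate line y = m*x + b (placeholder values)
x_pts = np.array([1.0, 2.0, 3.0, 4.0])
y_pts = np.array([6.0, 5.0, 7.0, 10.0])
m, b = 1.4, 3.5

# D: vertical distance from each point to the candidate line
D = y_pts - (m * x_pts + b)

# Least squares chooses the m and b that make this sum as small as possible
print("Sum of squared distances:", np.sum(D**2))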
Numpy has a built-in least squares method in its linear algebra library. We'll use this first for our univariate regression and then move on to scikit learn for our multivariate regression.
We will start by setting up the X and Y arrays for numpy to take in. An important note for the X array: numpy expects a two-dimensional array, where the first dimension indexes the different examples and the second dimension indexes the attributes. In this case our value is the average number of rooms per house, and since this is a single attribute, the second dimension of the array is just 1. So we'll need to create an array of shape (506, 1). There are a few ways to do this, but an easy one is to use numpy's built-in vertical stack tool, vstack.
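As a quick sketch of what that shape looks like (boston_df is the DataFrame created earlier; reshape is just an alternative to vstack, shown for comparison):
import numpy as np

rooms = boston_df.RM.values        # 1-D array of room counts, shape (506,)
x_col = np.vstack(rooms)           # stack into a column: shape (506, 1)
x_col_alt = rooms.reshape(-1, 1)   # reshape gives the same (506, 1) result
print(x_col.shape, x_col_alt.shape)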
In [12]:
# Set up the X (data) array in the shape numpy's linear algebra routines expect
x = boston_df.RM
# print(x.shape)  # (506,)
x = np.vstack(boston_df.RM)
# print(x.shape)  # (506, 1)
y = boston_df.PRICE
Now that we have our X and Y, let's go ahead and use numpy to create the single variable linear regression.
We know that a line has the equation:
y=mx+b
which we can rewrite using matrices:
y=Ap
where: A=[x 1] and p=[m b]
This is the same as the first equation if you carry out the linear algebra. So we'll start by creating the A matrix using numpy. We'll do this by building a matrix in the form [X 1]: we'll grab every value of our original X (the RM column) using a list comprehension and then set up each row in the form [X 1].
In [13]:
# using list comprehension
x = np.array([[value, 1] for value in x])
In [14]:
# Now get our m and b values for the best-fit line
m, b = np.linalg.lstsq(x, y, rcond=None)[0]
In [28]:
# Plot the same scatter we made earlier with seaborn's lmplot
plt.plot(boston_df.RM, boston_df.PRICE, 'o')
# Plot the best-fit line
X = boston_df.RM
plt.plot(X, m*X + b, 'red', label='Best Fit')
plt.legend(loc='best')
plt.savefig('bestfit.png')
We've just completed a single variable regression using the least squares method with Python! Let's see if we can find the error in our fitted line. Checking out the documentation here, we see that the resulting array has the total squared error. For each element, it takes the difference between the line and the true value (our original D value), squares it, and returns the sum of all of these. This was the summed D^2 value we discussed earlier.
It's probably easier to understand the root mean squared error, which is similar to the standard deviation. In this case, to find the root mean square error we divide by the number of elements and then take the square root. There is also an issue of bias and an unbiased regression, but we'll delve into those topics later.
For now let's see how we can get the root mean squared error of the line we just fitted.
In [16]:
"""
Dependent variable always on y axis and independent variable on x axis while plotting.
"""
result = np.linalg.lstsq(x,y)[1]
# Total error
total_error = np.sqrt(result/len(x))
print "The root mean square error is: {}" .format(float(total_error))
Since the root mean square error (RMSE) corresponds approximately to the standard deviation, we can say that the price of a house won't deviate from our fitted line by more than 2 times the RMSE about 95% of the time. Note: review the Normal Distribution Appendix lecture if this doesn't make sense to you, or check out this link.
Thus we can reasonably expect a house price to be within $13,200 of our line fit.
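As a tiny arithmetic check of that claim (reusing the total_error value computed in the cell above; the 95% figure assumes roughly normal errors):
# About 95% of observations should fall within two RMSEs of the fitted line
rmse = float(total_error)
print("2 * RMSE (in $1000s): {:.1f}".format(2 * rmse))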
Now we'll keep moving along and use scikit learn to do a multivariate regression. This will be a similar approach to the example above, but scikit learn will be able to take into account more than just a single data variable affecting the target!
We'll start by importing the linear regression library from the sklearn module.
The sklearn.linear_model.LinearRegression class is an estimator. Estimators predict a value based on the observed data. In scikit-learn, all estimators implement the fit() and predict() methods. The former method is used to learn the parameters of a model, and the latter method is used to predict the value of a response variable for an explanatory variable using the learned parameters. It is easy to experiment with different models using scikit-learn because all estimators implement the fit and predict methods.
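Here is a minimal sketch of that fit/predict pattern on a tiny made-up data set (not the Boston data), just to show the shape of the API before we apply it below:
from sklearn.linear_model import LinearRegression
import numpy as np

X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])   # explanatory variable
y_toy = np.array([2.0, 4.1, 6.0, 8.1])           # response variable

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)                      # learn the coefficients
print(toy_model.predict([[5.0]]))                # predict for a new observation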
In [17]:
# sklearn imports
from sklearn.linear_model import LinearRegression
In [18]:
# Create a LinearRegression Object
lreg = LinearRegression()
The functions we will be using are:
lreg.fit() which fits a linear model
lreg.predict() which is used to predict Y using the linear model with estimated coefficients
lreg.score() which returns the coefficient of determination (R^2), a measure of how well observed outcomes are replicated by the model; learn more about it here. (A quick usage example appears after we fit the model below.)
In [19]:
# To drop a column we pass axis=1
x_multi = boston_df.drop('PRICE', axis=1)
y_target = boston_df.PRICE
In [20]:
# Implement Linear Regression
lreg.fit(x_multi, y_target)
print("The estimated intercept is {}".format(lreg.intercept_))
print("The number of coefficients used is {}.".format(len(lreg.coef_)))
In [21]:
# Use the feature columns (PRICE excluded) so they line up with the 13 coefficients
coeff_df = DataFrame(x_multi.columns)
coeff_df.columns = ['Features']
coeff_df['Coefficient'] = Series(lreg.coef_)
In [22]:
"""
These 13 coefficients are used to bild the line that is used as best fit line by
scikit learn
"""
coeff_df
Out[22]:
In a dataset, a training set is used to build up a model, while a validation set is used to validate the model that was built. Data points in the training set are excluded from the validation set. The correct way to pick samples from your dataset for either the training or the validation (also called test) set is at random.
Fortunately, scikit learn has a built-in function specifically for this called train_test_split.
The parameters passed are your X and Y, then optionally a test_size parameter, representing the proportion of the dataset to include in the test split, as well as a train_size parameter. You can learn more about these parameters here.
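For illustration only, here's what passing those optional parameters might look like (the values below are arbitrary; the split in the next cell just uses the defaults):
from sklearn.model_selection import train_test_split

# Hold out 25% of the rows for testing; random_state makes the split repeatable
X_tr, X_te, Y_tr, Y_te = train_test_split(x, boston_df.PRICE,
                                           test_size=0.25, random_state=42)
print(X_tr.shape, X_te.shape)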
In [23]:
# Getting the training and testing data sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, boston_df.PRICE)
In [24]:
# The outputs
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
In [25]:
legr = LinearRegression()
legr.fit(X_train, Y_train)
Out[25]:
In [26]:
pred_train = legr.predict(X_train)
pred_test = legr.predict(X_test)
print "Fit a model X_train, and calculate MSE with Y_train: {}" .format(np.mean((Y_train - pred_train)**2))
print "Fit a model X_train, and calculate MSE with X_test and Y_test: {}" .format(np.mean((Y_test - pred_test)**2))
It looks like the mean squared errors for our training and testing sets came out pretty close to each other.
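As an aside, scikit-learn's metrics module can compute the same mean squared errors for us:
from sklearn.metrics import mean_squared_error

print("Train MSE: {}".format(mean_squared_error(Y_train, pred_train)))
print("Test MSE: {}".format(mean_squared_error(Y_test, pred_test)))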
In regression analysis, the difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). Each data point has one residual, so that:
Residual = Observed value − Predicted value
You can think of these residuals in the same way as the D value we discussed earlier; in this case, however, there were multiple data points considered.
A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.
Residual plots are a good way to visualize the errors in your data. If you have done a good job, your residuals should be randomly scattered around the horizontal line at zero. If there is some structure or pattern, that means your model is not capturing something: there could be an interaction between two variables that you're not considering, or maybe you are measuring time-dependent data. If this is the case, go back to your model and check your data set closely.
So now let's go ahead and create the residual plot. For more info on the residual plots check out this great link.
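As an aside before we build the plot by hand, seaborn also ships a residplot function; a minimal sketch for the single-variable RM vs PRICE fit (separate from the multivariate residuals plotted below) could look like this:
# Scatter the residuals of a simple RM -> PRICE regression
sns.residplot(x='RM', y='PRICE', data=boston_df)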
In [27]:
# Scatter plot the training data
train = plt.scatter(pred_train, (pred_train - Y_train), c='b', alpha=0.8)
# Scatter plot the testing data
test = plt.scatter(pred_test, (pred_test - Y_test), c='r', alpha=0.6)
# Horizontal line
plt.hlines(y=0, xmin=-10, xmax=50)
#Labels
plt.legend((train,test),('Training','Test'),loc='lower left')
plt.title('Residual Plots')
plt.savefig('residualplot.png')
From the plot we can observe that the points are scattered around the horizontal axis and do not follow any pattern, so a linear regression model is appropriate for this data set.
In [ ]: