This notebook covers univariate and multivariate linear regression. We'll go over how to use scikit-learn's linear regression model, how to train the regressor using the fit() method, and how to predict new labels using the predict() method. We'll be analyzing a dataset of house prices in Boston.
If you're interested in the deeper mathematics of linear regression methods, check out the Wikipedia page, and also check out Andrew Ng's wonderful lectures, available for free on YouTube.
In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
In [4]:
sns.set_style('whitegrid')
%matplotlib inline
We'll start by looking at an example dataset from scikit-learn. First we'll handle our usual data analysis imports, then load sklearn's built-in Boston dataset.
You should always try to do a quick visualization of the data you have. Let's go ahead and make a histogram of the prices.
In [6]:
from sklearn.datasets import load_boston
In [7]:
boston = load_boston()
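Note: load_boston was removed from scikit-learn in version 1.2 because of ethical concerns around the dataset. If the import above fails, the sketch below (adapted from scikit-learn's deprecation notice, with the feature names filled in by hand, and assuming the original CMU URL is still reachable) loads the same data directly from the source.
# Fallback loader for scikit-learn >= 1.2 (assumes the CMU mirror is still online)
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Each record is split across two physical rows in the raw file; recombine them
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])   # like boston.data
target = raw_df.values[1::2, 2]                                      # like boston.target
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']       # like boston.feature_names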
In [11]:
print(boston.DESCR)
In [13]:
plt.hist(boston.target, bins=50)
plt.xlabel("Prices in $1000s")
plt.ylabel("Number of Houses")
Out[13]:
In [15]:
# RM (average number of rooms per dwelling) is at column index 5 of the boston data
plt.scatter(boston.data[:,5], boston.target)
plt.ylabel("Prices in $1000s")
plt.xlabel("# rooms")
Out[15]:
In [16]:
boston_df = DataFrame(boston.data)
boston_df.columns = boston.feature_names
boston_df.head(5)
Out[16]:
In [17]:
boston_df = DataFrame(boston.data)
boston_df.columns = boston.feature_names
boston_df['Price'] = boston.target
boston_df.head(5)
Out[17]:
In [18]:
sns.lmplot(x='RM', y='Price', data=boston_df)
Out[18]:
We will start by setting up the X and Y arrays for scikit-learn to take in.
An important note for the X array: scikit-learn expects a two-dimensional array, where the first dimension indexes the example values and the second dimension indexes the attributes. In this case our value is the average number of rooms per dwelling, and since this is a single attribute the second dimension of the array is just 1. So we'll need to create a (506, 1) shaped array. There are a few ways to do this, but an easy way is to use numpy's built-in vertical stack tool, vstack.
In [29]:
# Set up X as the average number of rooms per dwelling
X = boston_df.RM
# Use vstack to make X two-dimensional
X = np.vstack(boston_df.RM)
# Set up Y as the target price of the houses.
Y = boston_df.Price
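As an aside, vstack isn't the only way to get the (506, 1) shape; reshaping the values or selecting the column with a list both work too. A quick sketch, assuming boston_df is defined as above:
# Two equivalent ways to build a (506, 1) feature array
X_alt1 = boston_df.RM.values.reshape(-1, 1)   # reshape the 1-D values into a column
X_alt2 = boston_df[['RM']].values             # select with a list of columns to stay 2-D
print(X_alt1.shape, X_alt2.shape)             # both print (506, 1)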
In [86]:
type(X)
Out[86]:
In [87]:
type(Y)
Out[87]:
Let's import the linear regression library from the sklearn module.
The sklearn.linear_model.LinearRegression class is an estimator. Estimators predict a value based on the observed data. In scikit-learn, all estimators implement the fit() and predict() methods. The former method is used to learn the parameters of a model, and the latter method is used to predict the value of a response variable for an explanatory variable using the learned parameters. It is easy to experiment with different models using scikit-learn because all estimators implement the fit and predict methods.
In [112]:
import sklearn
from sklearn.linear_model import LinearRegression
Next, we create a LinearRegression object. Afterwards, type lreg. and then press tab to see the list of methods available on this object.
In [113]:
# Create a LinearRegression Object
lreg = LinearRegression()
The functions we will be using are:
lreg.fit() which fits a linear model
lreg.predict() which is used to predict Y using the linear model with estimated coefficients
lreg.score() which returns the coefficient of determination (R^2), a measure of how well observed outcomes are replicated by the model. You can learn more about it here
In [114]:
# Implement Linear Regression
lreg.fit(X,Y)
Out[114]:
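Since lreg.score() is listed above but not used later in this notebook, here is a quick example of how it could be called on the fitted model (the exact R^2 value will depend on the data):
# Coefficient of determination (R^2) of the fit, evaluated on the data we trained on
print('R^2 on the full data: %.3f' % lreg.score(X, Y))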
Let's go ahead and check the intercept and the number of coefficients.
In [115]:
print(' The estimated intercept coefficient is %.2f ' %lreg.intercept_)
In [116]:
print(' The number of coefficients used was %d ' % len(lreg.coef_))
In [117]:
type(lreg.coef_)
Out[117]:
In [118]:
# Set a DataFrame from the Features
coeff_df = DataFrame(["Intercept", "Rooms"])
coeff_df.columns = ['Feature']
# Set a new column lining up the coefficients from the linear regression
coeff_df["Coefficient Estimate"] = pd.Series(np.append(lreg.intercept_, lreg.coef_))
# Show
coeff_df
Out[118]:
With our X and Y, we now have the solution to the linear regression.
$$y = mx + b$$ where b is the intercept and m is the coefficient estimate for the feature "Rooms".
In [130]:
rooms = pd.Series([4], name="rooms")
X_test = pd.DataFrame(rooms)
X_test
Out[130]:
In [131]:
lreg.predict(X_test)
Out[131]:
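As a sanity check, the prediction for a 4-room house should match plugging x = 4 into y = mx + b by hand; a minimal sketch:
# Manually apply y = m*x + b using the learned slope and intercept
manual_prediction = lreg.intercept_ + lreg.coef_[0] * 4
print(manual_prediction)   # should agree with the lreg.predict() output above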
Let's add more features to our prediction model.
In [91]:
# Data Columns
X_multi = boston_df.drop('Price', axis=1)
# Targets
Y_target = boston_df.Price
Finally, we're ready to fit the linear regression object using X_multi and Y_target.
In [92]:
# Implement Linear Regression
lreg.fit(X_multi,Y_target)
Out[92]:
Let's go ahead and check the intercept and the number of coefficients.
In [93]:
print(' The estimated intercept coefficient is %.2f ' %lreg.intercept_)
In [94]:
print(' The number of coefficients used was %d ' % len(lreg.coef_))
Great! So we have basically made an equation for a line, but instead of just one coefficient m and an intercept b, we now have 13 coefficients. To get an idea of what this looks like, check out the documentation for this equation: $$ y(w,x) = w_0 + w_1 x_1 + ... + w_p x_p $$
where $$w = (w_1, ..., w_p)$$ are the coefficients and $$ w_0 $$ is the intercept
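To make the connection to this equation concrete, the prediction for a single house is just the intercept plus the dot product of that house's feature values with the coefficient vector; a small sketch, assuming lreg is the model just fit on X_multi:
# Reproduce the model's prediction for the first house by hand
first_house = X_multi.iloc[0]
manual = lreg.intercept_ + np.dot(first_house, lreg.coef_)
print(manual, lreg.predict(X_multi.iloc[[0]])[0])   # the two values should agree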
What we'll do next is set up a DataFrame showing all the features and their estimated coefficients obtained from the linear regression.
In [95]:
# Set a DataFrame from the features (dropping the Price target so it lines up with the coefficients)
coeff_df = DataFrame(boston_df.columns.drop('Price'))
coeff_df.columns = ['Features']
# Set a new column lining up the coefficients from the linear regression
coeff_df["Coefficient Estimate"] = pd.Series(lreg.coef_)
# Show
coeff_df
Out[95]:
Just like our initial plot suggested, the number of rooms stands out with a large positive coefficient estimate, pointing to a strong relationship between rooms and house price.
Now let's move on to Predicting prices!
A training set is used to build up a model, while a validation (also called test) set is used to validate the model that was built. Data points in the training set are excluded from the validation set, and samples should be assigned to each set at random.
Fortunately, scikit-learn has a built-in function specifically for this called train_test_split.
The parameters passed are your X and Y, then optionally a test_size parameter, representing the proportion of the dataset to include in the test split, as well as a train_size parameter. The default split is 75% for the training set and 25% for the testing set. You can learn more about these parameters here
In [102]:
# Grab the output and set as X and Y test and train data sets!
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_multi, Y_target)
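By default this is a 75/25 split with a fresh random shuffle every run. If you want a reproducible split, you can pass test_size and random_state explicitly; a quick sketch (the specific values here are arbitrary):
# Explicit 25% test split with a fixed seed so results are reproducible
X_train, X_test, Y_train, Y_test = train_test_split(
    X_multi, Y_target, test_size=0.25, random_state=42)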
Let's go ahead and see what the output of the train_test_split was:
In [103]:
# Print shapes of the training and testing data sets
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
In [105]:
X_train.head(5)
Out[105]:
Great! Now that we have our training and testing sets, we can continue on to predicting prices based on multiple variables. We'll fit on the training set and then use the testing set for validation.
In [106]:
# Create our regression object
lreg = LinearRegression()
# Once again do a linear regression, except only on the training sets this time
lreg.fit(X_train,Y_train)
Out[106]:
Now run a prediction on both the X training set and the testing set.
In [107]:
# Predictions on training and testing sets
pred_train = lreg.predict(X_train)
pred_test = lreg.predict(X_test)
Let's see how much error there is in our fitted line. A common error measure is the root mean squared error (RMSE), which is similar to the standard deviation. It is calculated by taking the square root of the mean of the squared errors, where a squared error is the square of the difference between a prediction and the true value.
The root mean squared error corresponds roughly to a standard deviation: assuming approximately normal errors, a prediction won't be off by more than 2 times the RMSE about 95% of the time. Note: review the Normal Distribution Appendix lecture if this doesn't make sense to you, or check out this link.
Now we will get the mean squared error
In [108]:
print("Fit a model X_train, and calculate MSE with Y_train: %.2f" \
% np.mean((Y_train - pred_train) ** 2))
print("Fit a model X_train, and calculate MSE with X_test and Y_test: %.2f" \
% np.mean((Y_test - pred_test) ** 2))
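Since the discussion above was framed in terms of RMSE, taking the square root of each of these values puts the error back in the original units (thousands of dollars); a quick sketch:
# RMSE is just the square root of the mean squared error
rmse_train = np.sqrt(np.mean((Y_train - pred_train) ** 2))
rmse_test = np.sqrt(np.mean((Y_test - pred_test) ** 2))
print("RMSE on the training set: %.2f" % rmse_train)
print("RMSE on the test set: %.2f" % rmse_test)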
It looks like the mean squared errors for our training and testing sets were pretty close. But how do we actually visualize this?
In regression analysis, the difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). Each data point has one residual, so that:
$$Residual = Observed\:value - Predicted\:value$$ You can think of these residuals in the same way as the D value we discussed earlier; in this case, however, multiple data points were considered.
A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.
Residual plots are a good way to visualize the errors in your data. If you have done a good job, your residuals should be randomly scattered around the zero line. If there is some structure or pattern, that means your model is not capturing something. There could be an interaction between two variables that you're not considering, or maybe you are measuring time-dependent data. If this is the case, go back to your model and check your data set closely.
So now let's go ahead and create the residual plot. For more info on the residual plots check out this great link.
In [109]:
# Scatter plot the training data
train = plt.scatter(pred_train,(Y_train-pred_train),c='b',alpha=0.5)
# Scatter plot the testing data
test = plt.scatter(pred_test,(Y_test-pred_test),c='r',alpha=0.5)
# Plot a horizontal axis line at 0
plt.hlines(y=0,xmin=-10,xmax=50)
#Labels
plt.legend((train,test),('Training','Test'),loc='lower left')
plt.title('Residual Plots')
Out[109]:
Great! It looks like there aren't any major patterns to be concerned about. It may be interesting to check out the line pattern at the top of the graph, but overall the majority of the residuals seem to be randomly scattered above and below the horizontal line. We could also use seaborn to create these plots:
In [110]:
# Residual plot of all the dataset using seaborn
sns.residplot(x='RM', y='Price', data=boston_df)
Out[110]:
Linear regression is a very broad topic, and there's a ton of great information in the scikit-learn documentation. I encourage you to check it out here: http://scikit-learn.org/stable/modules/linear_model.html#linear-model
In [ ]: