Using the Ames Housing Data:
Dean De Cock, Truman State University, Journal of Statistics Education, Volume 19, Number 3 (2011), www.amstat.org/publications/jse/v19n3/decock.pdf
We construct two models (one with Gradient Boosting, one with Random Forest) to predict housing prices from features of the property such as overall square footage, lot size, condition of the property, etc.
We then use the ml_insights model x-ray and related functionality to gain insight into the models: which features are important, how the model predictions are affected by each feature, and why a particular house received a particular predicted price.
In [1]:
# "pip install ml_insights" in terminal if needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ml_insights as mli
# To plot matplotlib figures inline in the notebook
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
In [2]:
datafile = "data/Ames_Housing_Data.tsv"
In [3]:
df=pd.read_csv(datafile, sep='\t')
In [4]:
df.info()
In [5]:
# This is recommended by the data set author to remove a few outliers
df = df.loc[df['Gr Liv Area']<=4000,:]
df.shape
Out[5]:
In [6]:
# Choose a small subset of variables
X=df.loc[:,['Lot Area','Overall Qual',
'Overall Cond', 'Year Built', 'Year Remod/Add',
'Gr Liv Area',
'Full Bath', 'Bedroom AbvGr',
'Fireplaces', 'Garage Cars']]
y=df['SalePrice']
In [7]:
X.info()
In [8]:
# There appears to be one NA in Garage Cars - fill with 0
X = X.fillna(0)
In [9]:
X.info()
In [10]:
#Split the data 70-30 train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
In [11]:
# Train a Gradient Boosting Model
gbmodel1 = GradientBoostingRegressor(n_estimators = 1000,
learning_rate = .005,
max_depth = 4)
gbmodel1.fit(X_train,y_train)
Out[11]:
In [12]:
#Compute the RMSE on the test set
test_res_gb = gbmodel1.predict(X_test)
np.sqrt(np.sum((test_res_gb - y_test)**2)/len(y_test))
Out[12]:
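Equivalently, the test RMSE can be computed with scikit-learn's metrics module. A minimal sketch, reusing the predictions above (mean_squared_error lives in sklearn.metrics):

from sklearn.metrics import mean_squared_error
# RMSE is the square root of the mean squared error on the test set
np.sqrt(mean_squared_error(y_test, test_res_gb))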
In [13]:
# Plot the "feature importances" given by the gb model - not super useful
fig, ax = plt.subplots()
ind = np.arange(len(X.columns))
ax.barh(ind, gbmodel1.feature_importances_);
ax.set_yticks(ind);
ax.set_yticklabels(X_test.columns);
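The chart is easier to read if the features are sorted by importance. A minimal sketch of that variant, reusing the fitted gbmodel1 (the sorting is our addition, not something ml_insights provides):

# Sort the features so the most important appear at the top of the chart
order = np.argsort(gbmodel1.feature_importances_)
fig, ax = plt.subplots()
ax.barh(np.arange(len(order)), gbmodel1.feature_importances_[order]);
ax.set_yticks(np.arange(len(order)));
ax.set_yticklabels(X_test.columns[order]);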
In [14]:
mxr = mli.ModelXRay(gbmodel1,X_test)
Choose a random sample of 5 points and plot the "Individual Conditional Expectation" curves (ICE plots).
Save the indices of the randomly selected points to use later for comparison.
In [15]:
indices = mxr.feature_dependence_plots(num_pts=5)
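For intuition, an ICE curve for one observation is built by sweeping a single feature over a grid of values while holding all other features fixed and recording the model's prediction at each grid point. A rough sketch of that idea for Gr Liv Area (the grid size and choice of row are illustrative; this is not how ml_insights implements it internally):

# Illustrative ICE curve: vary 'Gr Liv Area' for one test-set row,
# keep the other features fixed, and record the model's predictions
row = X_test.iloc[[0]].copy()
grid = np.linspace(X_test['Gr Liv Area'].min(), X_test['Gr Liv Area'].max(), 50)
preds = []
for v in grid:
    row_v = row.copy()
    row_v['Gr Liv Area'] = v
    preds.append(gbmodel1.predict(row_v)[0])
plt.plot(grid, preds);
plt.xlabel('Gr Liv Area');
plt.ylabel('Predicted SalePrice');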
In [16]:
indices
Out[16]:
In [17]:
mxr.feature_effect_summary()
In [18]:
diff_path_obj = mxr.explain_prediction_difference(193,300, tol=.05)
In [19]:
diff_path_obj
Out[19]:
In [20]:
rfmodel1 = RandomForestRegressor(n_estimators = 500)
rfmodel1.fit(X_train,y_train)
Out[20]:
In [21]:
mxr_rf = mli.ModelXRay(rfmodel1,X_test)
Note below that the RF model is not very monotonic in the Gr Liv Area variable, in contrast to the GB model.
In [22]:
mxr_rf.feature_dependence_plots(pts_selected=indices)
Out[22]:
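The contrast can also be seen by reusing the grid and Gradient Boosting predictions from the earlier manual ICE sketch and overlaying the Random Forest's predictions for the same row (again illustrative, and dependent on the variables defined in that sketch):

# Add the Random Forest curve to the manual ICE sketch for 'Gr Liv Area'
rf_curve = []
for v in grid:
    row_v = row.copy()
    row_v['Gr Liv Area'] = v
    rf_curve.append(rfmodel1.predict(row_v)[0])
plt.plot(grid, preds, label='Gradient Boosting');
plt.plot(grid, rf_curve, label='Random Forest');
plt.xlabel('Gr Liv Area');
plt.ylabel('Predicted SalePrice');
plt.legend();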
In [23]:
mxr_rf.feature_effect_summary()
In [24]:
mxr_rf.explain_prediction_difference(193,300, tol=.05)
Out[24]:
In [25]:
#Compute the RMSE
test_res_rf = rfmodel1.predict(X_test)
np.sqrt(np.sum((test_res_rf - y_test)**2)/len(y_test))
Out[25]:
Note that the test RMSE is lower for the Gradient Boosting model than for the Random Forest model.
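For a direct comparison, print the two test-set RMSEs side by side (reusing the predictions computed above):

# Side-by-side test-set RMSE for the two models
rmse_gb = np.sqrt(np.mean((test_res_gb - y_test)**2))
rmse_rf = np.sqrt(np.mean((test_res_rf - y_test)**2))
print("Gradient Boosting test RMSE: {:.0f}".format(rmse_gb))
print("Random Forest test RMSE:     {:.0f}".format(rmse_rf))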