In this competition we have 79 explanatory variables describing aspects of residential homes in Ames, Iowa. The goal is to predict the sale price of each house.
The evaluation metric is the Root Mean Squared Logarithmic Error (RMSLE); it penalizes an under-predicted estimate more heavily than an over-predicted one of the same absolute size.
The RMSLE is calculated as
$$ \epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^n \left(\log(p_i + 1) - \log(a_i+1)\right)^2 } $$
where $\epsilon$ is the RMSLE value (score), $n$ is the number of observations, $p_i$ is the prediction for observation $i$, $a_i$ is the actual response for observation $i$, and $\log(x)$ is the natural logarithm of $x$.
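As an illustration, here is a minimal NumPy sketch of the metric (the function name rmsle and the toy numbers are mine, not part of the competition code); it also shows why an under-prediction is penalized more than an over-prediction of the same absolute size.
import numpy as np

def rmsle(predictions, actuals):
    #RMSLE exactly as in the formula above: mean of squared differences of log(x + 1), then a square root.
    predictions = np.asarray(predictions, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    return np.sqrt(np.mean((np.log1p(predictions) - np.log1p(actuals)) ** 2))

#Toy example: the same absolute error of 50,000 hurts more when we under-predict.
print(rmsle([150000], [200000]))  #under-prediction, ~0.288
print(rmsle([250000], [200000]))  #over-prediction,  ~0.223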
First I explore the data, fill in the missing values and visualize some features; then I try several models for prediction.
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from matplotlib import rcParams
import xgboost as xgb
%matplotlib inline
sns.set_style('whitegrid')
from scipy import stats
from scipy.stats import spearmanr, skew, pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.linear_model import Ridge, RidgeCV, LassoCV
from sklearn import linear_model
In [2]:
#The training set is called 'data' throughout the notebook.
data = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
In [3]:
data.info()
In [4]:
data.describe(include='all')
Out[4]:
In [5]:
data.head()
Out[5]:
Quite a lot of variables, many of them categorical, which makes the analysis more complex. And a lot of missing values. Or are they really missing? For many features a NaN value simply means the absence of that feature (for example, no garage).
First, let's see which columns have missing values.
In [6]:
data[data.columns[data.isnull().sum() > 0].tolist()].info()
Now I create lists of the columns with missing values in train and in test. Then I use a comprehension to get the columns that have missing values in test but not in train, and then the columns that have them in train but not in test.
In [7]:
list_data = data.columns[data.isnull().sum() > 0].tolist()
list_test = test.columns[test.isnull().sum() > 0].tolist()
test[list(i for i in list_test if i not in list_data)].info()
In [8]:
data[list(i for i in list_data if i not in list_test)].info()
For the following columns a missing value most likely means that the feature is absent, so it is filled with 'None' (categorical) or 0 (numerical):
For the other variables, missing values can be replaced with the most common value:
In [9]:
#Create a list of columns to fill NA with 'None' or 0.
to_null = ['Alley', 'MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
           'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'GarageYrBlt', 'BsmtFullBath', 'BsmtHalfBath',
           'PoolQC', 'Fence', 'MiscFeature']
for col in to_null:
    if data[col].dtype == 'object':
        #Categorical: NaN means the feature is absent.
        data[col].fillna('None', inplace=True)
        test[col].fillna('None', inplace=True)
    else:
        #Numerical: NaN means zero of the feature.
        data[col].fillna(0, inplace=True)
        test[col].fillna(0, inplace=True)
In [10]:
#Fill NA with common values.
test.loc[test.KitchenQual.isnull(), 'KitchenQual'] = 'TA'
test.loc[test.MSZoning.isnull(), 'MSZoning'] = 'RL'
test.loc[test.Utilities.isnull(), 'Utilities'] = 'AllPub'
test.loc[test.Exterior1st.isnull(), 'Exterior1st'] = 'VinylSd'
test.loc[test.Exterior2nd.isnull(), 'Exterior2nd'] = 'VinylSd'
test.loc[test.Functional.isnull(), 'Functional'] = 'Typ'
test.loc[test.SaleType.isnull(), 'SaleType'] = 'WD'
data.loc[data['Electrical'].isnull(), 'Electrical'] = 'SBrkr'
data.loc[data['LotFrontage'].isnull(), 'LotFrontage'] = data['LotFrontage'].mean()
test.loc[test['LotFrontage'].isnull(), 'LotFrontage'] = test['LotFrontage'].mean()
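The constants above correspond to the most frequent category in each of these columns; as a quick sanity check (my own snippet, not in the original notebook), the same values could be derived programmatically:
#Print the most frequent value (mode) of each column that was filled with a constant above.
for col in ['KitchenQual', 'MSZoning', 'Utilities', 'Exterior1st', 'Exterior2nd', 'Functional', 'SaleType']:
    print(col, test[col].mode()[0])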
There are several additional cases: when a categorical variable is 'None', the corresponding numerical variable should be 0. For example, if there is no veneer (MasVnrType is 'None'), MasVnrArea should be 0.
In [11]:
data.loc[data.MasVnrType == 'None', 'MasVnrArea'] = 0
test.loc[test.MasVnrType == 'None', 'MasVnrArea'] = 0
test.loc[test.BsmtFinType1=='None', 'BsmtFinSF1'] = 0
test.loc[test.BsmtFinType2=='None', 'BsmtFinSF2'] = 0
test.loc[test.BsmtQual=='None', 'BsmtUnfSF'] = 0
test.loc[test.BsmtQual=='None', 'TotalBsmtSF'] = 0
In [12]:
#And there is only one row where GarageCars and GarageArea are null, but it seems that there simply is no garage.
test.loc[test.GarageCars.isnull() == True]
Out[12]:
In [13]:
test.loc[test.GarageCars.isnull(), 'GarageCars'] = 0
test.loc[test.GarageArea.isnull(), 'GarageArea'] = 0
First I'll look at the correlations in the data, then I'll visualize some of it to see the impact of certain features.
In [14]:
corr = data.corr()
plt.figure(figsize=(12, 12))
sns.heatmap(corr, vmax=1)
Out[14]:
It seems that only a few pairs of variables are highly correlated. But this chart shows only pairs of numerical variables, so I'll calculate correlations for all variables.
In [15]:
threshold = 0.8 # Threshold value.
def correlation():
    for i in data.columns:
        for j in data.columns[list(data.columns).index(i) + 1:]: #Ugly, but works. This way there won't be repetitions.
            if data[i].dtype != 'object' and data[j].dtype != 'object':
                #pearson is used by default for numerical.
                if abs(pearsonr(data[i], data[j])[0]) >= threshold:
                    yield (pearsonr(data[i], data[j])[0], i, j)
            else:
                #spearman works for categorical.
                if abs(spearmanr(data[i], data[j])[0]) >= threshold:
                    yield (spearmanr(data[i], data[j])[0], i, j)
In [16]:
corr_list = list(correlation())
corr_list
Out[16]:
This is the list of highly correlated feature pairs. None of them is surprising, and none of them needs to be removed.
In [17]:
#It seems that SalePrice is skewed, so it needs to be transformed.
sns.distplot(data['SalePrice'], kde=False, color='c', hist_kws={'alpha': 0.9})
Out[17]:
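To quantify this, one can compare the skewness of the raw and the log-transformed target (a small snippet of mine; the actual transformation is applied later, together with the other skewed columns):
#skew() from scipy.stats is already imported; np.log1p is the transform used later in the notebook.
print(skew(data['SalePrice']))           #strongly right-skewed
print(skew(np.log1p(data['SalePrice']))) #much closer to symmetric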
In [18]:
#As expected, price rises with quality.
sns.regplot(x='OverallQual', y='SalePrice', data=data, color='Orange')
Out[18]:
In [19]:
#Price also varies depending on neighborhood.
plt.figure(figsize = (12, 6))
sns.boxplot(x='Neighborhood', y='SalePrice', data=data)
xt = plt.xticks(rotation=30)
In [20]:
#There are many little houses.
plt.figure(figsize = (12, 6))
sns.countplot(x='HouseStyle', data=data)
xt = plt.xticks(rotation=30)
In [21]:
#And most of the houses are single-family, so it isn't surprising that most of them aren't large.
sns.countplot(x='BldgType', data=data)
xt = plt.xticks(rotation=30)
In [22]:
#Most fireplaces are of good or average quality, and nearly half of the houses don't have a fireplace at all.
pd.crosstab(data.Fireplaces, data.FireplaceQu)
Out[22]:
In [23]:
sns.factorplot('HeatingQC', 'SalePrice', hue='CentralAir', data=data)
sns.factorplot('Heating', 'SalePrice', hue='CentralAir', data=data)
Out[23]:
Houses with central air conditioning cost more. Interestingly, houses with poor and with good heating quality cost nearly the same if they have central air conditioning. Also, only houses with gas heating have central air conditioning.
In [24]:
#One more interesting point: while paved street access is valued more, for alleys the surface type isn't that important.
fig, ax = plt.subplots(1, 2, figsize = (12, 5))
sns.boxplot(x='Street', y='SalePrice', data=data, ax=ax[0])
sns.boxplot(x='Alley', y='SalePrice', data=data, ax=ax[1])
Out[24]:
In [25]:
#We can say that while overall quality is roughly normally distributed, the overall condition of the houses is mostly average.
fig, ax = plt.subplots(1, 2, figsize = (12, 5))
sns.countplot(x='OverallCond', data=data, ax=ax[0])
sns.countplot(x='OverallQual', data=data, ax=ax[1])
Out[25]:
In [26]:
fig, ax = plt.subplots(2, 3, figsize = (16, 12))
ax[0,0].set_title('Gable')
ax[0,1].set_title('Hip')
ax[0,2].set_title('Gambrel')
ax[1,0].set_title('Mansard')
ax[1,1].set_title('Flat')
ax[1,2].set_title('Shed')
sns.stripplot(x="RoofMatl", y="SalePrice", data=data[data.RoofStyle == 'Gable'], jitter=True, ax=ax[0,0])
sns.stripplot(x="RoofMatl", y="SalePrice", data=data[data.RoofStyle == 'Hip'], jitter=True, ax=ax[0,1])
sns.stripplot(x="RoofMatl", y="SalePrice", data=data[data.RoofStyle == 'Gambrel'], jitter=True, ax=ax[0,2])
sns.stripplot(x="RoofMatl", y="SalePrice", data=data[data.RoofStyle == 'Mansard'], jitter=True, ax=ax[1,0])
sns.stripplot(x="RoofMatl", y="SalePrice", data=data[data.RoofStyle == 'Flat'], jitter=True, ax=ax[1,1])
sns.stripplot(x="RoofMatl", y="SalePrice", data=data[data.RoofStyle == 'Shed'], jitter=True, ax=ax[1,2])
Out[26]:
These plots show roof material against sale price for each roof style. Most houses have a Gable or Hip roof, and most roofs use the standard material.
In [27]:
sns.stripplot(x="GarageQual", y="SalePrice", data=data, hue='GarageFinish', jitter=True)
Out[27]:
Most finished garages have average quality.
In [28]:
sns.pointplot(x="PoolArea", y="SalePrice", hue="PoolQC", data=data)
Out[28]:
It is worth noting that there are only 7 distinct pool areas. And while for most of them the mean price is roughly $200,000-300,000, the pool with an area of 555 sq ft sold for far more. Let's take a look.
In [29]:
#There is only one such pool, and its sale condition is 'Abnorml'.
data.loc[data.PoolArea == 555]
Out[29]:
In [30]:
fig, ax = plt.subplots(1, 2, figsize = (12, 5))
sns.stripplot(x="SaleType", y="SalePrice", data=data, jitter=True, ax=ax[0])
sns.stripplot(x="SaleCondition", y="SalePrice", data=data, jitter=True, ax=ax[1])
Out[30]:
Most of the sold houses are either new or sold with a conventional Warranty Deed. Only a small number of sales are between family members, adjoining land purchases, or allocations.
In [31]:
#MSSubClass shows codes for the type of dwelling, it is clearly a categorical variable.
data['MSSubClass'].unique()
Out[31]:
In [32]:
data['MSSubClass'] = data['MSSubClass'].astype(str)
test['MSSubClass'] = test['MSSubClass'].astype(str)
Transforming skewed numerical data and creating dummy variables for the categorical features.
In [33]:
for col in data.columns:
    if data[col].dtype != 'object':
        if skew(data[col]) > 0.75:
            data[col] = np.log1p(data[col])
    else:
        dummies = pd.get_dummies(data[col], drop_first=False)
        dummies = dummies.add_prefix("{}_".format(col))
        data.drop(col, axis=1, inplace=True)
        data = data.join(dummies)
for col in test.columns:
    if test[col].dtype != 'object':
        if skew(test[col]) > 0.75:
            test[col] = np.log1p(test[col])
    else:
        dummies = pd.get_dummies(test[col], drop_first=False)
        dummies = dummies.add_prefix("{}_".format(col))
        test.drop(col, axis=1, inplace=True)
        test = test.join(dummies)
Creating some new features might be a good idea, but I decided to do without it: it is time-consuming, the model is good enough without it, and the number of features is quite high already.
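Purely for illustration (a sketch of mine, not used anywhere below), a new feature could be as simple as combining the living-area columns; it would have to be added to both data and test before the log transform and dummification step above:
#Hypothetical example of feature engineering (commented out, not part of the pipeline):
#data['TotalSF'] = data['TotalBsmtSF'] + data['1stFlrSF'] + data['2ndFlrSF']
#test['TotalSF'] = test['TotalBsmtSF'] + test['1stFlrSF'] + test['2ndFlrSF']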
In [34]:
#This is what the data looks like now.
data.head()
Out[34]:
In [35]:
X_train = data.drop('SalePrice',axis=1)
Y_train = data['SalePrice']
X_test = test
In [36]:
#Function to measure accuracy (RMSLE): the target is on the log1p scale, so it is converted back with expm1 before both values are compared on the log1p scale.
def rmlse(val, target):
    return np.sqrt(np.sum(((np.log1p(val) - np.log1p(np.expm1(target)))**2) / len(target)))
In [37]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X_train, Y_train, test_size=0.33)
I'll try several models.
Ridge is a linear least squares model with L2 regularization (the penalty is the sum of squared coefficients).
RidgeCV is Ridge regression with built-in cross-validation.
Lasso is a linear model trained with L1 regularization (the penalty is the sum of absolute coefficient values).
LassoCV is a Lasso linear model with iterative fitting along a regularization path.
Random Forest is usually good in cases with many features.
And XGBoost is a library which has become very popular lately and usually gives good results.
In [38]:
ridge = Ridge(alpha=10, solver='auto').fit(Xtrain, ytrain)
val_ridge = np.expm1(ridge.predict(Xtest))
rmlse(val_ridge, ytest)
Out[38]:
In [39]:
ridge_cv = RidgeCV(alphas=(0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10))
ridge_cv.fit(Xtrain, ytrain)
val_ridge_cv = np.expm1(ridge_cv.predict(Xtest))
rmlse(val_ridge_cv, ytest)
Out[39]:
In [40]:
las = linear_model.Lasso(alpha=0.0005).fit(Xtrain, ytrain)
las_ridge = np.expm1(las.predict(Xtest))
rmlse(las_ridge, ytest)
Out[40]:
In [41]:
las_cv = LassoCV(alphas=(0.0001, 0.0005, 0.001, 0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10))
las_cv.fit(Xtrain, ytrain)
val_las_cv = np.expm1(las_cv.predict(Xtest))
rmlse(val_las_cv, ytest)
Out[41]:
In [42]:
model_xgb = xgb.XGBRegressor(n_estimators=340, max_depth=2, learning_rate=0.2) #the params were tuned using xgb.cv
model_xgb.fit(Xtrain, ytrain)
xgb_preds = np.expm1(model_xgb.predict(Xtest))
rmlse(xgb_preds, ytest)
Out[42]:
In [43]:
forest = RandomForestRegressor(min_samples_split=5,
                               min_weight_fraction_leaf=0.0,
                               max_leaf_nodes=None,
                               max_depth=None,
                               n_estimators=300,
                               max_features='auto')
forest.fit(Xtrain, ytrain)
Y_pred_RF = np.expm1(forest.predict(Xtest))
rmlse(Y_pred_RF, ytest)
Out[43]:
So the linear models perform better than the others, and Lasso is the best.
The Lasso model has one nice property: it performs feature selection, as it assigns zero weights to the least important variables.
In [44]:
coef = pd.Series(las_cv.coef_, index = X_train.columns)
v = coef.loc[las_cv.coef_ != 0].count()
print('So we have ' + str(v) + ' variables')
In [45]:
#Sort the features by the absolute value of their Lasso weights and take the top v (i.e. the ones with non-zero weights).
indices = np.argsort(abs(las_cv.coef_))[::-1][0:v]
In [46]:
#Features to be used. I do this because I want to see how well the other models will perform with these features.
features = X_train.columns[indices]
for i in features:
if i not in X_test.columns:
print(i)
There is only one selected feature which isn't in test data. I'll simply add this column with zero values.
In [47]:
X_test['RoofMatl_ClyTile'] = 0
In [48]:
X = X_train[features]
Xt = X_test[features]
Let's see whether something changed.
In [49]:
Xtrain1, Xtest1, ytrain1, ytest1 = train_test_split(X, Y_train, test_size=0.33)
In [50]:
ridge = Ridge(alpha=5, solver='svd').fit(Xtrain1, ytrain1)
val_ridge = np.expm1(ridge.predict(Xtest1))
rmlse(val_ridge, ytest1)
Out[50]:
In [51]:
las_cv = LassoCV(alphas=(0.0001, 0.0005, 0.001, 0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10)).fit(Xtrain1, ytrain1)
val_las = np.expm1(las_cv.predict(Xtest1))
rmlse(val_las, ytest1)
Out[51]:
In [52]:
model_xgb = xgb.XGBRegressor(n_estimators=340, max_depth=2, learning_rate=0.2) #the params were tuned using xgb.cv
model_xgb.fit(Xtrain1, ytrain1)
xgb_preds = np.expm1(model_xgb.predict(Xtest1))
rmlse(xgb_preds, ytest1)
Out[52]:
In [53]:
forest = RandomForestRegressor(min_samples_split=5,
                               min_weight_fraction_leaf=0.0,
                               max_leaf_nodes=None,
                               max_depth=100,
                               n_estimators=300,
                               max_features=None)
forest.fit(Xtrain1, ytrain1)
Y_pred_RF = np.expm1(forest.predict(Xtest1))
rmlse(Y_pred_RF, ytest1)
Out[53]:
The accuracy got slightly worse, but this is most likely just due to the random seed used when splitting the data. It's time for prediction!
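A single random split can be noisy; a more stable estimate could be obtained with cross-validation (a sketch of mine using the already imported cross_val_score, not part of the original run):
#5-fold cross-validated RMSE on the log-scale target (equivalent to RMSLE here); lower is better.
lasso_for_cv = LassoCV(alphas=(0.0001, 0.0005, 0.001, 0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10))
scores = np.sqrt(-cross_val_score(lasso_for_cv, X, Y_train, scoring='neg_mean_squared_error', cv=5))
print(scores.mean(), scores.std())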
In [54]:
las_cv1 = LassoCV(alphas=(0.0001, 0.0005, 0.001, 0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10))
las_cv1.fit(X, Y_train)
lasso_preds = np.expm1(las_cv1.predict(Xt))
In [55]:
#I added XGBoost as it usually improves the predictions.
model_xgb = xgb.XGBRegressor(n_estimators=340, max_depth=2, learning_rate=0.1)
model_xgb.fit(X, Y_train)
xgb_preds = np.expm1(model_xgb.predict(Xt))
In [56]:
preds = 0.7 * lasso_preds + 0.3 * xgb_preds
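The 0.7/0.3 split is a manual choice; as a sketch (mine, with hypothetical names las_val_preds and xgb_val_preds standing for the two models' price-scale predictions on a labelled validation split), the weight could be tuned instead of fixed:
#Hypothetical: las_val_preds and xgb_val_preds are predictions on Xtest1; ytest1 holds the true (log-scale) values.
#best_score, best_w = min((rmlse(w * las_val_preds + (1 - w) * xgb_val_preds, ytest1), w)
#                         for w in np.arange(0.0, 1.01, 0.05))
#print(best_w, best_score)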
In [57]:
submission = pd.DataFrame({
'Id': test['Id'].astype(int),
'SalePrice': preds
})
submission.to_csv('home.csv', index=False)
But the result wasn't very good. After some thought I decided that the problem could lie in the feature selection: maybe I selected bad features, or maybe the random seed gave poor results. So I decided to try selecting features based on the full training dataset instead of just a part of it.
In [58]:
model_lasso = LassoCV(alphas=(0.0001, 0.0005, 0.001, 0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10, 100))
model_lasso.fit(X_train, Y_train)
coef = pd.Series(model_lasso.coef_, index = X_train.columns)
v1 = coef.loc[model_lasso.coef_ != 0].count()
print('So we have ' + str(v1) + ' variables')
In [59]:
indices = np.argsort(abs(model_lasso.coef_))[::-1][0:v1]
features_f=X_train.columns[indices]
In [60]:
print('Features in full, but not in val:')
for i in features_f:
if i not in features:
print(i)
print('\n' + 'Features in val, but not in full:')
for i in features:
if i not in features_f:
print(i)
There is a lot of difference between the two sets of selected features. I suppose the reason is that in the first case there was too little data relative to the number of features. So I'll use the features obtained from the analysis of the whole training dataset.
In [61]:
for i in features_f:
if i not in X_test.columns:
X_test[i] = 0
print(i)
X = X_train[features_f]
Xt = X_test[features_f]
Now all necessary features are present in both train and test.
In [62]:
model_lasso = LassoCV(alphas=(0.0001, 0.0005, 0.001, 0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10))
model_lasso.fit(X, Y_train)
lasso_preds = np.expm1(model_lasso.predict(Xt))
In [63]:
model_xgb = xgb.XGBRegressor(n_estimators=340, max_depth=2, learning_rate=0.1) #the params were tuned using xgb.cv
model_xgb.fit(X, Y_train)
xgb_preds = np.expm1(model_xgb.predict(Xt))
In [64]:
solution = pd.DataFrame({"Id": test.Id, "SalePrice": 0.7*lasso_preds + 0.3*xgb_preds})
solution.to_csv("House_price.csv", index = False)
The best result I got with this model was 0.12922; the current top results are around 0.10-0.11.