House Prices Estimator

Note: This is a Kaggle.com competition, and the input data was downloaded from there.

Details

Goal

It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable.

Metric

Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
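
In other words, the score is the RMSE computed on log-transformed prices. A minimal sketch of the metric, assuming numpy arrays of positive prices:

import numpy as np

def rmsle(y_true, y_pred):
    # RMSE between the log of the predicted and the log of the observed sale price
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))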

Submission File Format

The file should contain a header and have the following format:

Id,SalePrice
1461,169000.1
1462,187724.1233
1463,175221
etc.

TODO

  • Use another algorithm to predict the house price
  • More feature engineering
  • Add more comments, thoughts, conclusions, ...
  • Come up with new ideas.

Data Analysis


In [1]:
import numpy as np
import pandas as pd

#load the files
train = pd.read_csv('input/train.csv')
test = pd.read_csv('input/test.csv')
data = pd.concat([train, test])

#size of training dataset
train_samples = train.shape[0]

#print some of them
data.head()


Out[1]:
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... SaleType ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold
0 856 854 0 NaN 3 1Fam TA No 706.0 0.0 ... WD 0 Pave 8 856.0 AllPub 0 2003 2003 2008
1 1262 0 0 NaN 3 1Fam TA Gd 978.0 0.0 ... WD 0 Pave 6 1262.0 AllPub 298 1976 1976 2007
2 920 866 0 NaN 3 1Fam TA Mn 486.0 0.0 ... WD 0 Pave 6 920.0 AllPub 0 2001 2002 2008
3 961 756 0 NaN 3 1Fam Gd No 216.0 0.0 ... WD 0 Pave 7 756.0 AllPub 0 1915 1970 2006
4 1145 1053 0 NaN 4 1Fam TA Av 655.0 0.0 ... WD 0 Pave 9 1145.0 AllPub 192 2000 2000 2008

5 rows × 81 columns


In [2]:
# remove the Id feature
data.drop(['Id'], axis=1, inplace=True)

In [3]:
data.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2919 entries, 0 to 1458
Data columns (total 80 columns):
1stFlrSF         2919 non-null int64
2ndFlrSF         2919 non-null int64
3SsnPorch        2919 non-null int64
Alley            198 non-null object
BedroomAbvGr     2919 non-null int64
BldgType         2919 non-null object
BsmtCond         2837 non-null object
BsmtExposure     2837 non-null object
BsmtFinSF1       2918 non-null float64
BsmtFinSF2       2918 non-null float64
BsmtFinType1     2840 non-null object
BsmtFinType2     2839 non-null object
BsmtFullBath     2917 non-null float64
BsmtHalfBath     2917 non-null float64
BsmtQual         2838 non-null object
BsmtUnfSF        2918 non-null float64
CentralAir       2919 non-null object
Condition1       2919 non-null object
Condition2       2919 non-null object
Electrical       2918 non-null object
EnclosedPorch    2919 non-null int64
ExterCond        2919 non-null object
ExterQual        2919 non-null object
Exterior1st      2918 non-null object
Exterior2nd      2918 non-null object
Fence            571 non-null object
FireplaceQu      1499 non-null object
Fireplaces       2919 non-null int64
Foundation       2919 non-null object
FullBath         2919 non-null int64
Functional       2917 non-null object
GarageArea       2918 non-null float64
GarageCars       2918 non-null float64
GarageCond       2760 non-null object
GarageFinish     2760 non-null object
GarageQual       2760 non-null object
GarageType       2762 non-null object
GarageYrBlt      2760 non-null float64
GrLivArea        2919 non-null int64
HalfBath         2919 non-null int64
Heating          2919 non-null object
HeatingQC        2919 non-null object
HouseStyle       2919 non-null object
KitchenAbvGr     2919 non-null int64
KitchenQual      2918 non-null object
LandContour      2919 non-null object
LandSlope        2919 non-null object
LotArea          2919 non-null int64
LotConfig        2919 non-null object
LotFrontage      2433 non-null float64
LotShape         2919 non-null object
LowQualFinSF     2919 non-null int64
MSSubClass       2919 non-null int64
MSZoning         2915 non-null object
MasVnrArea       2896 non-null float64
MasVnrType       2895 non-null object
MiscFeature      105 non-null object
MiscVal          2919 non-null int64
MoSold           2919 non-null int64
Neighborhood     2919 non-null object
OpenPorchSF      2919 non-null int64
OverallCond      2919 non-null int64
OverallQual      2919 non-null int64
PavedDrive       2919 non-null object
PoolArea         2919 non-null int64
PoolQC           10 non-null object
RoofMatl         2919 non-null object
RoofStyle        2919 non-null object
SaleCondition    2919 non-null object
SalePrice        1460 non-null float64
SaleType         2918 non-null object
ScreenPorch      2919 non-null int64
Street           2919 non-null object
TotRmsAbvGrd     2919 non-null int64
TotalBsmtSF      2918 non-null float64
Utilities        2917 non-null object
WoodDeckSF       2919 non-null int64
YearBuilt        2919 non-null int64
YearRemodAdd     2919 non-null int64
YrSold           2919 non-null int64
dtypes: float64(12), int64(25), object(43)
memory usage: 1.8+ MB

First problem

  • The training and test datasets are almost the same size, so only about half of the rows have a known SalePrice to learn from.

In [4]:
print("Size training: {}".format(train.shape[0]))
print("Size testing: {}".format(test.shape[0]))


Size training: 1460
Size testing: 1459

Selecting only the numeric columns (for now)


In [5]:
datanum = data.select_dtypes([np.number])

datanum.describe()


Out[5]:
1stFlrSF 2ndFlrSF 3SsnPorch BedroomAbvGr BsmtFinSF1 BsmtFinSF2 BsmtFullBath BsmtHalfBath BsmtUnfSF EnclosedPorch ... OverallQual PoolArea SalePrice ScreenPorch TotRmsAbvGrd TotalBsmtSF WoodDeckSF YearBuilt YearRemodAdd YrSold
count 2919.000000 2919.000000 2919.000000 2919.000000 2918.000000 2918.000000 2917.000000 2917.000000 2918.000000 2919.000000 ... 2919.000000 2919.000000 1460.000000 2919.000000 2919.000000 2918.000000 2919.000000 2919.000000 2919.000000 2919.000000
mean 1159.581706 336.483727 2.602261 2.860226 441.423235 49.582248 0.429894 0.061364 560.772104 23.098321 ... 6.089072 2.251799 180921.195890 16.062350 6.451524 1051.777587 93.709832 1971.312778 1984.264474 2007.792737
std 392.362079 428.701456 25.188169 0.822693 455.610826 169.205611 0.524736 0.245687 439.543659 64.244246 ... 1.409947 35.663946 79442.502883 56.184365 1.569379 440.766258 126.526589 30.291442 20.894344 1.314964
min 334.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 1.000000 0.000000 34900.000000 0.000000 2.000000 0.000000 0.000000 1872.000000 1950.000000 2006.000000
25% 876.000000 0.000000 0.000000 2.000000 0.000000 0.000000 0.000000 0.000000 220.000000 0.000000 ... 5.000000 0.000000 129975.000000 0.000000 5.000000 793.000000 0.000000 1953.500000 1965.000000 2007.000000
50% 1082.000000 0.000000 0.000000 3.000000 368.500000 0.000000 0.000000 0.000000 467.000000 0.000000 ... 6.000000 0.000000 163000.000000 0.000000 6.000000 989.500000 0.000000 1973.000000 1993.000000 2008.000000
75% 1387.500000 704.000000 0.000000 3.000000 733.000000 0.000000 1.000000 0.000000 805.500000 0.000000 ... 7.000000 0.000000 214000.000000 0.000000 7.000000 1302.000000 168.000000 2001.000000 2004.000000 2009.000000
max 5095.000000 2065.000000 508.000000 8.000000 5644.000000 1526.000000 3.000000 2.000000 2336.000000 1012.000000 ... 10.000000 800.000000 755000.000000 576.000000 15.000000 6110.000000 1424.000000 2010.000000 2010.000000 2010.000000

8 rows × 37 columns


In [6]:
data.select_dtypes(exclude=[np.number]).head()


Out[6]:
Alley BldgType BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2 BsmtQual CentralAir Condition1 Condition2 ... MiscFeature Neighborhood PavedDrive PoolQC RoofMatl RoofStyle SaleCondition SaleType Street Utilities
0 NaN 1Fam TA No GLQ Unf Gd Y Norm Norm ... NaN CollgCr Y NaN CompShg Gable Normal WD Pave AllPub
1 NaN 1Fam TA Gd ALQ Unf Gd Y Feedr Norm ... NaN Veenker Y NaN CompShg Gable Normal WD Pave AllPub
2 NaN 1Fam TA Mn GLQ Unf Gd Y Norm Norm ... NaN CollgCr Y NaN CompShg Gable Normal WD Pave AllPub
3 NaN 1Fam Gd No ALQ Unf TA Y Norm Norm ... NaN Crawfor Y NaN CompShg Gable Abnorml WD Pave AllPub
4 NaN 1Fam TA Av GLQ Unf Gd Y Norm Norm ... NaN NoRidge Y NaN CompShg Gable Normal WD Pave AllPub

5 rows × 43 columns

Check whether there are null values


In [7]:
datanum.columns[datanum.isnull().any()].tolist()


Out[7]:
['BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtFullBath',
 'BsmtHalfBath',
 'BsmtUnfSF',
 'GarageArea',
 'GarageCars',
 'GarageYrBlt',
 'LotFrontage',
 'MasVnrArea',
 'SalePrice',
 'TotalBsmtSF']

In [8]:
#number of rows containing at least one NaN
print(datanum.shape[0] - datanum.dropna().shape[0])


1798

In [10]:
#Filling with the mean
datanum_no_nan = datanum.fillna(datanum.mean())

#check
datanum_no_nan.columns[datanum_no_nan.isnull().any()].tolist()


Out[10]:
[]
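
Filling every numeric column with its mean is the simplest option. A per-column median is a bit more robust to the skewed area features; a sketch of that alternative (not what is used in the rest of the notebook):

#hypothetical alternative: impute with per-column medians instead of means
datanum_median_imputed = datanum.fillna(datanum.median())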

Normalizing


In [11]:
import matplotlib.pyplot as plt

#plot a few rows to show the very different scales of the raw features
datanum_no_nan.drop(['SalePrice'], axis=1).head(15).plot()
plt.show()



In [12]:
#squeeze the features to [0,1]
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()
columns = datanum_no_nan.columns
columns = columns.drop('SalePrice')
print("Features: {}".format(columns))

#work on a copy so the unscaled frame is left untouched
data_norm = datanum_no_nan.copy()


Features: Index(['1stFlrSF', '2ndFlrSF', '3SsnPorch', 'BedroomAbvGr', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtUnfSF',
       'EnclosedPorch', 'Fireplaces', 'FullBath', 'GarageArea', 'GarageCars',
       'GarageYrBlt', 'GrLivArea', 'HalfBath', 'KitchenAbvGr', 'LotArea',
       'LotFrontage', 'LowQualFinSF', 'MSSubClass', 'MasVnrArea', 'MiscVal',
       'MoSold', 'OpenPorchSF', 'OverallCond', 'OverallQual', 'PoolArea',
       'ScreenPorch', 'TotRmsAbvGrd', 'TotalBsmtSF', 'WoodDeckSF', 'YearBuilt',
       'YearRemodAdd', 'YrSold'],
      dtype='object')
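
MinMaxScaler rescales each feature independently to [0, 1]: x' = (x - min(x)) / (max(x) - min(x)), with the minimum and maximum taken column-wise over the fitted data. SalePrice is excluded so the target keeps its original scale.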

In [13]:
data_norm[columns] = scaler.fit_transform(datanum_no_nan[columns])
print("Train shape: {}".format(data_norm.shape))

data_norm.drop(['SalePrice'], axis=1).head(15).plot()
plt.show()


Data shape: (2919, 37)

In [14]:
data_norm.describe().T


Out[14]:
count mean std min 25% 50% 75% max
1stFlrSF 2919.0 0.173405 0.082412 0.0 0.113842 0.157110 0.221277 1.0
2ndFlrSF 2919.0 0.162946 0.207604 0.0 0.000000 0.000000 0.340920 1.0
3SsnPorch 2919.0 0.005123 0.049583 0.0 0.000000 0.000000 0.000000 1.0
BedroomAbvGr 2919.0 0.357528 0.102837 0.0 0.250000 0.375000 0.375000 1.0
BsmtFinSF1 2919.0 0.078211 0.080711 0.0 0.000000 0.065379 0.129872 1.0
BsmtFinSF2 2919.0 0.032492 0.110863 0.0 0.000000 0.000000 0.000000 1.0
BsmtFullBath 2919.0 0.143298 0.174852 0.0 0.000000 0.000000 0.333333 1.0
BsmtHalfBath 2919.0 0.030682 0.122801 0.0 0.000000 0.000000 0.000000 1.0
BsmtUnfSF 2919.0 0.240057 0.188129 0.0 0.094178 0.199914 0.344606 1.0
EnclosedPorch 2919.0 0.022824 0.063482 0.0 0.000000 0.000000 0.000000 1.0
Fireplaces 2919.0 0.149281 0.161532 0.0 0.000000 0.250000 0.250000 1.0
FullBath 2919.0 0.392001 0.138242 0.0 0.250000 0.500000 0.500000 1.0
GarageArea 2919.0 0.317792 0.144730 0.0 0.215054 0.322581 0.387097 1.0
GarageCars 2919.0 0.353324 0.152299 0.0 0.200000 0.400000 0.400000 1.0
GarageYrBlt 2919.0 0.266389 0.079704 0.0 0.213141 0.266389 0.339744 1.0
GrLivArea 2919.0 0.219812 0.095337 0.0 0.149209 0.209118 0.265543 1.0
HalfBath 2919.0 0.190134 0.251436 0.0 0.000000 0.000000 0.500000 1.0
KitchenAbvGr 2919.0 0.348179 0.071487 0.0 0.333333 0.333333 0.333333 1.0
LotArea 2919.0 0.041450 0.036865 0.0 0.028877 0.038108 0.048003 1.0
LotFrontage 2919.0 0.165431 0.072987 0.0 0.133562 0.165431 0.195205 1.0
LowQualFinSF 2919.0 0.004412 0.043606 0.0 0.000000 0.000000 0.000000 1.0
MSSubClass 2919.0 0.218457 0.250104 0.0 0.000000 0.176471 0.294118 1.0
MasVnrArea 2919.0 0.063876 0.111641 0.0 0.000000 0.000000 0.102188 1.0
MiscVal 2919.0 0.002990 0.033377 0.0 0.000000 0.000000 0.000000 1.0
MoSold 2919.0 0.473917 0.246797 0.0 0.272727 0.454545 0.636364 1.0
OpenPorchSF 2919.0 0.063998 0.091072 0.0 0.000000 0.035040 0.094340 1.0
OverallCond 2919.0 0.570572 0.139141 0.0 0.500000 0.500000 0.625000 1.0
OverallQual 2919.0 0.565452 0.156661 0.0 0.444444 0.555556 0.666667 1.0
PoolArea 2919.0 0.002815 0.044580 0.0 0.000000 0.000000 0.000000 1.0
SalePrice 2919.0 180921.195890 56174.332503 34900.0 163000.000000 180921.195890 180921.195890 755000.0
ScreenPorch 2919.0 0.027886 0.097542 0.0 0.000000 0.000000 0.000000 1.0
TotRmsAbvGrd 2919.0 0.342425 0.120721 0.0 0.230769 0.307692 0.384615 1.0
TotalBsmtSF 2919.0 0.172140 0.072126 0.0 0.129787 0.162029 0.213093 1.0
WoodDeckSF 2919.0 0.065807 0.088853 0.0 0.000000 0.000000 0.117978 1.0
YearBuilt 2919.0 0.719658 0.219503 0.0 0.590580 0.731884 0.934783 1.0
YearRemodAdd 2919.0 0.571075 0.348239 0.0 0.250000 0.716667 0.900000 1.0
YrSold 2919.0 0.448184 0.328741 0.0 0.250000 0.500000 0.750000 1.0

In [15]:
#plotting distributions of numeric features
data_norm.hist(bins=50, figsize=(22,16))
plt.show()


Using Box-Cox
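
Box-Cox applies a per-feature power transform, with the exponent lambda fitted by maximum likelihood: y = (x^lambda - 1) / lambda for lambda != 0, and y = log(x) for lambda = 0. The input has to be strictly positive, which is why a small offset is added in the cell below.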


In [16]:
data_norm['1stFlrSF'].hist()
plt.show()



In [17]:
#transform each numeric feature so its distribution is closer to normal
from scipy import stats

data_gauss = data_norm.copy()

#note: this loop also transforms SalePrice; the +0.01 keeps every value strictly positive for Box-Cox
for f in datanum.columns.tolist():
    data_gauss[f], _ = stats.boxcox(data_gauss[f]+0.01)

#rescale again
std_scaler = preprocessing.StandardScaler()
data_gauss[columns] = std_scaler.fit_transform(data_gauss[columns])
    
data_gauss['1stFlrSF'].hist()
plt.show()



In [18]:
#plotting distributions of numeric features
data_gauss.hist(bins=50, figsize=(22,16))
plt.show()


Splitting the dataset into train and test (getting batches)


In [19]:
#one-hot encode the non-numeric columns and join them to the scaled numeric features
data_categorical = pd.get_dummies(data.select_dtypes(exclude=[np.number]))
data_all = pd.concat([data_norm, data_categorical], axis=1)
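
As a small illustration of what get_dummies does, a toy example using the Street column's two values:

#a categorical column becomes one 0/1 indicator column per value
pd.get_dummies(pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']}))
#-> columns Street_Grvl and Street_Pave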

Selecting good features...


In [20]:
#data_norm.columns.tolist()

feat_list = ['1stFlrSF',
 #'2ndFlrSF',
 #'3SsnPorch',
 'BedroomAbvGr',
 'BsmtFinSF1',
 #'BsmtFinSF2',
 #'BsmtFullBath',
 #'BsmtHalfBath',
 'BsmtUnfSF',
 #'EnclosedPorch',
 #'Fireplaces',
 #'FullBath',
 'GarageArea',
 'GarageCars',
 'GarageYrBlt',
 #'GrLivArea',
 #'HalfBath',
 #'KitchenAbvGr',
 'LotArea',
 'LotFrontage',
 #'LowQualFinSF',
 'MSSubClass',
 'MasVnrArea',
 #'MiscVal',
 'MoSold',
 'OpenPorchSF',
 'OverallCond',
 'OverallQual',
 'PoolArea',
 #'SalePrice',
 #'ScreenPorch',
 'TotRmsAbvGrd',
 'TotalBsmtSF',
 'WoodDeckSF',
 'YearBuilt',
 'YearRemodAdd']
 #'YrSold']

In [21]:
%matplotlib inline
import seaborn as sns
fig = plt.figure(figsize=(14, 10))
sns.heatmap(data_norm[feat_list+['SalePrice']].corr())


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x114ae64e0>

In [22]:
#heatmap
fig = plt.figure(figsize=(14, 10))
sns.heatmap(data_norm.corr())


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x10eafc6a0>

In [23]:
# the 13 features most correlated with SalePrice (including SalePrice itself)
data_norm.corr()['SalePrice'].sort_values().tail(13)


Out[23]:
Fireplaces      0.329421
MasVnrArea      0.339679
YearRemodAdd    0.354302
YearBuilt       0.368664
TotRmsAbvGrd    0.390869
FullBath        0.394977
1stFlrSF        0.422097
TotalBsmtSF     0.431912
GarageArea      0.437654
GarageCars      0.444406
GrLivArea       0.520311
OverallQual     0.548617
SalePrice       1.000000
Name: SalePrice, dtype: float64

In [24]:
feat_low_corr = ['KitchenAbvGr',
                 'EnclosedPorch',
                 'MSSubClass',
                 'OverallCond',
                 'YrSold',
                 'LowQualFinSF',
                 'MiscVal',
                 'BsmtHalfBath',
                 'BsmtFinSF2',
                 'MoSold',
                 '3SsnPorch',
                 'PoolArea',
                 'ScreenPorch']

feat_high_corr = ['Fireplaces',
                  'MasVnrArea',
                  'YearRemodAdd',
                  'YearBuilt',
                  'TotRmsAbvGrd',
                  'FullBath',
                  '1stFlrSF',
                  'TotalBsmtSF',
                  'GarageArea',
                  'GarageCars',
                  'GrLivArea',
                  'OverallQual']

data_norm_low_corr = data_norm[feat_low_corr]
data_norm_high_corr = data_norm[feat_high_corr]
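
The two lists above were picked by hand from the correlation output; a sketch of deriving a similar split automatically (the 13/12 cut is an assumption chosen to mirror the hand-made lists):

#hypothetical: rank features by absolute correlation with SalePrice
corr = data_norm.corr()['SalePrice'].drop('SalePrice').abs().sort_values()
auto_low_corr = corr.head(13).index.tolist()   #weakest 13
auto_high_corr = corr.tail(12).index.tolist()  #strongest 12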

KFold


In [152]:
from sklearn.model_selection import KFold

y = np.array(data_all['SalePrice'])
X = np.array(data_norm_high_corr)

#split back into the original train/test rows by index
#(SalePrice for the test rows was filled with the mean, so y_test is not meaningful)
idx = train_samples
X_train, X_test = X[:idx], X[idx:]
y_train, y_test = y[:idx], y[idx:]

print("Shape X train: {}".format(X_train.shape))
print("Shape y train: {}".format(y_train.shape))
print("Shape X test: {}".format(X_test.shape))
print("Shape y test: {}".format(y_test.shape))

kf = KFold(n_splits=3, random_state=9, shuffle=True)
print(kf)


Shape X train: (1460, 12)
Shape y train: (1460,)
Shape X test: (1459, 12)
Shape y test: (1459,)
KFold(n_splits=3, random_state=9, shuffle=True)

Anomaly Detection


In [153]:
#project the features onto their first principal component and plot them against the target
from sklearn.decomposition import PCA

def plotPCA(X, y):
    pca = PCA(n_components=1)
    X_r = pca.fit(X).transform(X)
    plt.plot(X_r, y, 'x')

In [154]:
from sklearn.covariance import EllipticEnvelope

# fit a robust covariance estimate and flag ~5% of the training rows as outliers
ee = EllipticEnvelope(contamination=0.05,
                      assume_centered=True,
                      random_state=9)
ee.fit(X_train)
pred = ee.predict(X_train)

#keep only the inliers (predict returns 1 for inliers, -1 for outliers)
X_train = X_train[pred == 1]
y_train = y_train[pred == 1]
print(X_train.shape)
print(y_train.shape)

#after removing anomalies
plotPCA(X_train, y_train)


(1387, 12)
(1387,)

Models

Multilayer Perceptron


In [155]:
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rf = MLPRegressor(activation='relu',
                  solver='lbfgs',
                  #learning_rate_init=1e-2,
                  #learning_rate='adaptive',
                  #alpha=0.0001,
                  max_iter=400,
                  #shuffle=True,
                  hidden_layer_sizes=(64,64),
                  warm_start=True,
                  random_state=9,
                  verbose=False)

for e in range(1):
    batch = 1
    for train_idx, val_idx in kf.split(X_train, y_train):
        X_t, X_v = X_train[train_idx], X_train[val_idx]
        y_t, y_v = y_train[train_idx], y_train[val_idx]

        #training
        rf.fit(X_t, y_t)

        #calculate costs
        t_error = mean_squared_error(y_t, rf.predict(X_t))**0.5
        v_error = mean_squared_error(y_v, rf.predict(X_v))**0.5
        print("{}-{}) Training error: {:.2f}  Validation error: {:.2f}".format(e, batch, t_error, v_error))
        batch += 1

#Scores
print("Training score: {:.4f}".format(rf.score(X_train, y_train)))


0-1) Training error: 28183.71  Validation error: 30444.11
0-2) Training error: 20297.65  Validation error: 62717.58
0-3) Training error: 22571.34  Validation error: 26189.24
Training score: 0.9093

In [181]:
# Gradient boosting
from sklearn import ensemble

params = {'n_estimators': 100, 'max_depth': 50, 'min_samples_split': 5,
          'learning_rate': 0.1, 'loss': 'ls', 'random_state':9, 'warm_start':True}

gbr = ensemble.GradientBoostingRegressor(**params)

batch = 0
for train_idx, val_idx in kf.split(X_train, y_train):
    X_t, X_v = X_train[train_idx], X_train[val_idx]
    y_t, y_v = y_train[train_idx], y_train[val_idx]

    #training
    gbr.fit(X_t, y_t)

    #calculate costs
    t_error = mean_squared_error(y_t, gbr.predict(X_t))**0.5
    v_error = mean_squared_error(y_v, gbr.predict(X_v))**0.5
    print("{}) Training error: {:.2f}  Validation error: {:.2f}".format(batch, t_error, v_error))
    batch += 1

#Scores
print("Training score: {:.4f}".format(gbr.score(X_train, y_train)))


0) Training error: 625.88  Validation error: 32930.52
1) Training error: 11004.55  Validation error: 805.72
2) Training error: 11007.83  Validation error: 12.10
3) Training error: 11007.60  Validation error: 212.82
4) Training error: 11003.23  Validation error: 954.25
5) Training error: 11007.83  Validation error: 7.78
6) Training error: 11007.06  Validation error: 391.90
7) Training error: 10995.31  Validation error: 1271.39
8) Training error: 11002.65  Validation error: 393.04
9) Training error: 11003.43  Validation error: 7.58
Training score: 0.9826

In [157]:
# AdaBoost
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

abr = AdaBoostRegressor(DecisionTreeRegressor(max_depth=50),
                        n_estimators=100, random_state=9)

batch = 0
for train_idx, val_idx in kf.split(X_train, y_train):
    X_t, X_v = X_train[train_idx], X_train[val_idx]
    y_t, y_v = y_train[train_idx], y_train[val_idx]

    #training
    abr.fit(X_t, y_t)

    #calculate costs
    t_error = mean_squared_error(y_t, abr.predict(X_t))**0.5
    v_error = mean_squared_error(y_v, abr.predict(X_v))**0.5
    print("{}) Training error: {:.2f}  Validation error: {:.2f}".format(batch, t_error, v_error))
    batch += 1

#Scores
print("Training score: {:.4f}".format(abr.score(X_train, y_train)))


0) Training error: 2360.56  Validation error: 31521.29
1) Training error: 1798.02  Validation error: 40457.58
2) Training error: 2063.63  Validation error: 26834.71
Training score: 0.9612

In [158]:
# Lasso
from sklearn.linear_model import Lasso

lr = Lasso()

batch = 0
for train_idx, val_idx in kf.split(X_train, y_train):
    X_t, X_v = X_train[train_idx], X_train[val_idx]
    y_t, y_v = y_train[train_idx], y_train[val_idx]

    #training
    lr.fit(X_t, y_t)

    #calculate costs
    t_error = mean_squared_error(y_t, lr.predict(X_t))**0.5
    v_error = mean_squared_error(y_v, lr.predict(X_v))**0.5
    print("{}) Training error: {:.2f}  Validation error: {:.2f}".format(batch, t_error, v_error))
    batch += 1

#Scores
print("Training score: {:.4f}".format(lr.score(X_train, y_train)))


0) Training error: 34984.31  Validation error: 32420.25
1) Training error: 30681.71  Validation error: 40792.49
2) Training error: 35557.70  Validation error: 31194.92
Training score: 0.8136

Stacked model


In [178]:
### Testing
### Ada + mlp + gradient boosting -> level 1 predictions
### level 1 -> mlp -> level 2 predictions (final)

# Training
#mlp1 = MLPRegressor(activation='logistic',
#                  solver='sgd',
#                  hidden_layer_sizes=(5,5),
#                  learning_rate='adaptive',
#                  random_state=9,
#                  warm_start=True,
#                  verbose=False)

from sklearn.linear_model import LogisticRegression
#note: LogisticRegression is a classifier, so it treats each SalePrice value as a discrete class
mlp = LogisticRegression(random_state=9)

sclr = preprocessing.StandardScaler()

def stack_training(X, y):
    X0 = rf.predict(X)
    X1 = gbr.predict(X)
    X2 = abr.predict(X)
    X3 = lr.predict(X)
    Xt = np.array([X0, X1, X2, X3]).T
    #Xt = np.array([X0, X1, X2, X3, X1+X3, X2*X3, X0*X2*X3, X0/X2, X1/X3, X0/X3, (X0+X1+X2+X3)/4]).T
    Xt = sclr.fit_transform(Xt)
    mlp.fit(Xt, y)

def stack_predict(X, verbose=False):
    X0 = rf.predict(X)
    X1 = gbr.predict(X)
    X2 = abr.predict(X)
    X3 = lr.predict(X)
    Xt = np.array([X0, X1, X2, X3]).T
    #Xt = np.array([X0, X1, X2, X3, X1+X3, X2*X3, X0*X2*X3, X0/X2, X1/X3, X0/X3, (X0+X1+X2+X3)/4]).T
    Xt = sclr.transform(Xt)
    if verbose:
        print("Training score: {:.4f}".format(mlp.score(Xt, y_train)))
        plotPCA(Xt, y_train)
    return mlp.predict(Xt)

#
batch = 0
kf = KFold(n_splits=10, random_state=9, shuffle=True)
for train_idx, val_idx in kf.split(X_train, y_train):
    X_t, X_v = X_train[train_idx], X_train[val_idx]
    y_t, y_v = y_train[train_idx], y_train[val_idx]

    #training
    stack_training(X_t, y_t)

    #calculate costs (note: these errors are from the AdaBoost model, not the stacked predictions)
    t_error = mean_squared_error(y_t, abr.predict(X_t))**0.5
    v_error = mean_squared_error(y_v, abr.predict(X_v))**0.5
    print("{}) Training error: {:.2f}  Validation error: {:.2f}".format(batch, t_error, v_error))
    batch += 1

rmse = mean_squared_error(y_train, stack_predict(X_train, True))**0.5
print("RMSE: {:.4f}".format(rmse))


0) Training error: 16408.80  Validation error: 2083.72
1) Training error: 16412.72  Validation error: 1785.17
2) Training error: 16405.73  Validation error: 2291.03
3) Training error: 16411.19  Validation error: 1907.92
4) Training error: 16406.64  Validation error: 2231.50
5) Training error: 16418.28  Validation error: 1244.06
6) Training error: 15627.28  Validation error: 15137.17
7) Training error: 13758.32  Validation error: 26946.16
8) Training error: 13930.28  Validation error: 26134.30
9) Training error: 13555.17  Validation error: 27862.46
Training score: 0.0310
RMSE: 41833.6092
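
The low stacked score is consistent with using a classifier as the meta-model. A sketch of the same idea with a regression meta-learner, using scikit-learn's StackingRegressor (the estimator choices are an assumption, not what was run above):

from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression

#hypothetical regression stack: the four base models feed a linear meta-learner
stack = StackingRegressor(
    estimators=[('mlp', rf), ('gbr', gbr), ('abr', abr), ('lasso', lr)],
    final_estimator=LinearRegression())
stack.fit(X_train, y_train)
print("Stacked training score: {:.4f}".format(stack.score(X_train, y_train)))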

Evaluation

The models are compared with the root-mean-squared error (RMSE) on the raw prices; note that the competition itself scores RMSE on the log of the prices.


In [177]:
from sklearn.metrics import mean_squared_error
import random

RMSE_rf = mean_squared_error(y_train, rf.predict(X_train))**0.5
RMSE_gbr = mean_squared_error(y_train, gbr.predict(X_train))**0.5
RMSE_abr = mean_squared_error(y_train, abr.predict(X_train))**0.5
RMSE_lr = mean_squared_error(y_train, lr.predict(X_train))**0.5
RMSE_stack = mean_squared_error(y_train, stack_predict(X_train))**0.5

def avg_predict(X):
    return (rf.predict(X) + gbr.predict(X) + abr.predict(X) + lr.predict(X))/4

predictions = avg_predict(X_train)
RMSE_total = mean_squared_error(y_train, predictions)**0.5

print("RMSE mlp: {:.3f}".format(RMSE_rf))
print("RMSE gbr: {:.3f}".format(RMSE_gbr))
print("RMSE abr: {:.3f}".format(RMSE_abr))
print("RMSE lr:  {:.3f}".format(RMSE_lr))
print("====")
print("RMSE average: {:.3f}".format(RMSE_total))
print("RMSE stacked: {:.3f}".format(RMSE_stack))


RMSE mlp: 23837.500
RMSE gbr: 22294.129
RMSE abr: 15578.859
RMSE lr:  34166.418
====
RMSE average: 17570.962
RMSE stacked: 41833.609

Get Predictions

  • Good results were obtained without using data_gauss (the Box-Cox-transformed features).

In [33]:
import os

#predict = avg_predict(X_test)
predict = stack_predict(X_test)
file = "Id,SalePrice" + os.linesep

#test Ids start at 1461 (the training set ends at Id 1460)
startId = 1461
for i in range(len(X_test)):
    file += "{},{}".format(startId, int(predict[i])) + os.linesep
    startId += 1

#print(file)

In [34]:
# Save to file
with open('attempt.txt', 'w') as f:
    f.write(file)
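
Equivalently, the submission could be written with pandas; a sketch assuming the same predict array and Id range used above:

#hypothetical alternative using a DataFrame and to_csv
submission = pd.DataFrame({'Id': range(1461, 1461 + len(X_test)),
                           'SalePrice': predict})
submission.to_csv('attempt.csv', index=False)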