Mei-Cheng Shih, 2016

This kernel is inspired by the post of JMT5802. Its aim is to use XGBoost in place of the random forest (RF) that serves as the core of the Boruta package. Since XGBoost generates better-quality predictions than RF in this case, the output of this kernel is expected to be more representative. The code also includes the data cleaning process I used to build my model.

First, import packages for data cleaning and read the data



In [1]:

    
from scipy.stats.mstats import mode
import pandas as pd
import numpy as np
import time
from sklearn.preprocessing import LabelEncoder

"""
Read Data
"""
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
target = train['SalePrice']
train = train.drop(['SalePrice'],axis=1)
trainlen = train.shape[0]

Combine the train and test sets for cleaning



In [2]:

    
df1 = train.head()
df2 = test.head()
pd.concat([df1, df2], axis=0, ignore_index=True)









    Out[2]:






  
    
      
     Id  MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  LandContour  Utilities  ...  ScreenPorch  PoolArea  PoolQC  Fence  MiscFeature  MiscVal  MoSold  YrSold  SaleType  SaleCondition
0     1          60        RL         65.0     8450    Pave    NaN       Reg          Lvl     AllPub  ...            0         0     NaN    NaN          NaN        0       2    2008        WD         Normal
1     2          20        RL         80.0     9600    Pave    NaN       Reg          Lvl     AllPub  ...            0         0     NaN    NaN          NaN        0       5    2007        WD         Normal
2     3          60        RL         68.0    11250    Pave    NaN       IR1          Lvl     AllPub  ...            0         0     NaN    NaN          NaN        0       9    2008        WD         Normal
3     4          70        RL         60.0     9550    Pave    NaN       IR1          Lvl     AllPub  ...            0         0     NaN    NaN          NaN        0       2    2006        WD        Abnorml
4     5          60        RL         84.0    14260    Pave    NaN       IR1          Lvl     AllPub  ...            0         0     NaN    NaN          NaN        0      12    2008        WD         Normal
5  1461          20        RH         80.0    11622    Pave    NaN       Reg          Lvl     AllPub  ...          120         0     NaN  MnPrv          NaN        0       6    2010        WD         Normal
6  1462          20        RL         81.0    14267    Pave    NaN       IR1          Lvl     AllPub  ...            0         0     NaN    NaN         Gar2    12500       6    2010        WD         Normal
7  1463          60        RL         74.0    13830    Pave    NaN       IR1          Lvl     AllPub  ...            0         0     NaN  MnPrv          NaN        0       3    2010        WD         Normal
8  1464          60        RL         78.0     9978    Pave    NaN       IR1          Lvl     AllPub  ...            0         0     NaN    NaN          NaN        0       6    2010        WD         Normal
9  1465         120        RL         43.0     5005    Pave    NaN       IR1          HLS     AllPub  ...          144         0     NaN    NaN          NaN        0       1    2010        WD         Normal
    
  

10 rows × 80 columns



In [3]:

    
alldata = pd.concat([train, test], axis=0, join='outer', ignore_index=True)
alldata = alldata.drop(['Id','Utilities'], axis=1)
alldata.dtypes









    Out[3]:





MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
Street            object
Alley             object
LotShape          object
LandContour       object
LotConfig         object
LandSlope         object
Neighborhood      object
Condition1        object
Condition2        object
BldgType          object
HouseStyle        object
OverallQual        int64
OverallCond        int64
YearBuilt          int64
YearRemodAdd       int64
RoofStyle         object
RoofMatl          object
Exterior1st       object
Exterior2nd       object
MasVnrType        object
MasVnrArea       float64
ExterQual         object
ExterCond         object
Foundation        object
BsmtQual          object
BsmtCond          object
                  ...   
HalfBath           int64
BedroomAbvGr       int64
KitchenAbvGr       int64
KitchenQual       object
TotRmsAbvGrd       int64
Functional        object
Fireplaces         int64
FireplaceQu       object
GarageType        object
GarageYrBlt      float64
GarageFinish      object
GarageCars       float64
GarageArea       float64
GarageQual        object
GarageCond        object
PavedDrive        object
WoodDeckSF         int64
OpenPorchSF        int64
EnclosedPorch      int64
3SsnPorch          int64
ScreenPorch        int64
PoolArea           int64
PoolQC            object
Fence             object
MiscFeature       object
MiscVal            int64
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
dtype: object

Deal with the NA values in the numeric variables: some are filled with 0 and some with the median, based on the data description text file



In [6]:

    
fMedlist = ['LotFrontage']
# Numeric columns where a missing value means the feature is absent
fArealist = ['MasVnrArea', 'TotalBsmtSF', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'BsmtFullBath',
             'BsmtHalfBath', 'Fireplaces', 'GarageArea', 'GarageYrBlt', 'GarageCars']

# Absent feature: fill with 0
for i in fArealist:
    alldata[i] = alldata[i].fillna(0)

# Otherwise: fill with the median of the observed values
for i in fMedlist:
    alldata[i] = alldata[i].fillna(alldata[i].median())
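As a small illustration of the two fill rules above, the sketch below applies them to a toy frame. The frame and its values are made up for the example; only the column names come from the data set.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values); NaN marks a missing entry.
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "GarageArea": [548.0, np.nan, 460.0, 0.0],
})

# Area-like column: a missing value is read as "feature absent", so use 0.
toy["GarageArea"] = toy["GarageArea"].fillna(0)

# LotFrontage: fill with the median of the observed values (NaN is skipped).
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].median())

print(toy)
```

The median is used for LotFrontage because a missing frontage does not mean the lot has none, whereas a missing garage or basement area plausibly means the house has no garage or basement.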

Transforming the data: use integers to encode the categorical variables.

Convert all int columns to floats for XGBoost (MSSubClass is left as an integer because it is a category code)



In [4]:

    
int_cols = alldata.columns[(alldata.dtypes == 'int64') & (alldata.columns != 'MSSubClass')]
alldata[int_cols] = alldata[int_cols].astype('float64')  # MSSubClass stays int: it is a category code



In [5]:

    
alldata['MSSubClass']









    Out[5]:





0        60
1        20
2        60
3        70
4        60
5        50
6        20
7        60
8        50
9       190
10       20
11       60
12       20
13       20
14       20
15       45
16       20
17       90
18       20
19       20
20       60
21       45
22       20
23      120
24       20
25       20
26       20
27       20
28       20
29       30
       ... 
2889     30
2890     50
2891     30
2892    190
2893     50
2894    120
2895    120
2896     20
2897     90
2898     20
2899     80
2900     20
2901     20
2902     20
2903     20
2904     20
2905     90
2906    160
2907     20
2908     90
2909    180
2910    160
2911     20
2912    160
2913    160
2914    160
2915    160
2916     20
2917     85
2918     60
Name: MSSubClass, dtype: int64



In [10]:

    
alldata.head(20)









    Out[10]:






  
    
      
    MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  LandContour  LotConfig  LandSlope  ...  ScreenPorch  PoolArea  PoolQC  Fence  MiscFeature  MiscVal  MoSold  YrSold  SaleType  SaleCondition
 0           5         3         65.0   8450.0       1      0         3            3          4          0  ...          0.0       0.0       0      0            0      0.0     2.0  2008.0         8              4
 1           0         3         80.0   9600.0       1      0         3            3          2          0  ...          0.0       0.0       0      0            0      0.0     5.0  2007.0         8              4
 2           5         3         68.0  11250.0       1      0         0            3          4          0  ...          0.0       0.0       0      0            0      0.0     9.0  2008.0         8              4
 3           6         3         60.0   9550.0       1      0         0            3          0          0  ...          0.0       0.0       0      0            0      0.0     2.0  2006.0         8              0
 4           5         3         84.0  14260.0       1      0         0            3          2          0  ...          0.0       0.0       0      0            0      0.0    12.0  2008.0         8              4
 5           4         3         85.0  14115.0       1      0         0            3          4          0  ...          0.0       0.0       0      3            3    700.0    10.0  2009.0         8              4
 6           0         3         75.0  10084.0       1      0         3            3          4          0  ...          0.0       0.0       0      0            0      0.0     8.0  2007.0         8              4
 7           5         3         68.0  10382.0       1      0         0            3          0          0  ...          0.0       0.0       0      0            3    350.0    11.0  2009.0         8              4
 8           4         4         51.0   6120.0       1      0         3            3          4          0  ...          0.0       0.0       0      0            0      0.0     4.0  2008.0         8              0
 9          15         3         50.0   7420.0       1      0         3            3          0          0  ...          0.0       0.0       0      0            0      0.0     1.0  2008.0         8              4
10           0         3         70.0  11200.0       1      0         3            3          4          0  ...          0.0       0.0       0      0            0      0.0     2.0  2008.0         8              4
11           5         3         85.0  11924.0       1      0         0            3          4          0  ...          0.0       0.0       0      0            0      0.0     7.0  2006.0         6              5
12           0         3         68.0  12968.0       1      0         1            3          4          0  ...        176.0       0.0       0      0            0      0.0     9.0  2008.0         8              4
13           0         3         91.0  10652.0       1      0         0            3          4          0  ...          0.0       0.0       0      0            0      0.0     8.0  2007.0         6              5
14           0         3         68.0  10920.0       1      0         0            3          0          0  ...          0.0       0.0       0      2            0      0.0     5.0  2008.0         8              4
15           3         4         51.0   6120.0       1      0         3            3          0          0  ...          0.0       0.0       0      1            0      0.0     7.0  2007.0         8              4
16           0         3         68.0  11241.0       1      0         0            3          1          0  ...          0.0       0.0       0      0            3    700.0     3.0  2010.0         8              4
17          10         3         72.0  10791.0       1      0         3            3          4          0  ...          0.0       0.0       0      0            3    500.0    10.0  2006.0         8              4
18           0         3         66.0  13695.0       1      0         3            3          4          0  ...          0.0       0.0       0      0            0      0.0     6.0  2008.0         8              4
19           0         3         70.0   7560.0       1      0         3            3          4          0  ...          0.0       0.0       0      3            0      0.0     5.0  2009.0         0              0
    
  

20 rows × 78 columns



In [8]:

    
le = LabelEncoder()
# Categorical columns: object columns plus MSSubClass (the only int64 column left)
is_cat = (alldata.dtypes == 'int64') | (alldata.dtypes == 'object')
nacount_category = alldata.columns[is_cat & (alldata.isnull().sum() > 0)]
category = alldata.columns[is_cat]
Bsmtset = {'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'}
MasVnrset = {'MasVnrType'}
Garageset = {'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond'}
Fireplaceset = {'FireplaceQu'}
Poolset = {'PoolQC'}
NAset = {'Fence', 'MiscFeature', 'Alley'}

# For each group, the numeric column whose value 0 means the feature is absent
groups = [(Bsmtset, 'TotalBsmtSF'), (MasVnrset, 'MasVnrArea'), (Garageset, 'GarageArea'),
          (Fireplaceset, 'Fireplaces'), (Poolset, 'PoolArea')]

# Put 0 and null values in the same category
for i in nacount_category:
    if i in NAset:
        # Here NA itself means "none", so it becomes its own level
        alldata[i] = alldata[i].fillna('Empty')
        continue
    for members, size_col in groups:
        if i in members:
            # Missing because the feature is absent (size 0): mark as 'Empty'
            alldata.loc[alldata[i].isnull() & (alldata[size_col] == 0), i] = 'Empty'
    # Any value still missing gets the most frequent level
    alldata[i] = alldata[i].fillna(alldata[i].value_counts().index[0])

# Integer-encode every categorical column
for i in category:
    alldata[i] = le.fit_transform(alldata[i])

# Split the cleaned data back into the original train and test rows
train = alldata.iloc[:trainlen, :].copy()
test = alldata.iloc[trainlen:, :].copy()
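The fill-then-encode logic above can be seen on a toy column. This is a minimal sketch with made-up values; it uses `pd.factorize(..., sort=True)` as a stand-in for `LabelEncoder` so the example needs only pandas (both assign codes in the sorted order of the level names).

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): GarageArea == 0 means "no garage".
toy = pd.DataFrame({
    "GarageArea": [548.0, 0.0, 460.0, 300.0, 250.0],
    "GarageType": ["Attchd", np.nan, "Attchd", np.nan, "Detchd"],
})

col = "GarageType"
missing = toy[col].isnull()

# Missing because the house has no garage: give it its own 'Empty' level.
toy.loc[missing & (toy["GarageArea"] == 0), col] = "Empty"

# Still missing although a garage exists: fall back to the most frequent level.
most_frequent = toy[col].value_counts().index[0]
toy.loc[toy[col].isnull(), col] = most_frequent

# Integer-encode the levels; codes follow the sorted order of the level names.
codes, levels = pd.factorize(toy[col], sort=True)
toy[col] = codes

print(list(levels))
print(toy)
```

Keeping 'Empty' as a separate level lets the model distinguish "no garage" from "garage of unknown type", which a plain most-frequent fill would blur.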



In [9]:

    
alldata.head()









    Out[9]:






  
    
      
   MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  LandContour  LotConfig  LandSlope  ...  ScreenPorch  PoolArea  PoolQC  Fence  MiscFeature  MiscVal  MoSold  YrSold  SaleType  SaleCondition
0           5         3         65.0   8450.0       1      0         3            3          4          0  ...          0.0       0.0       0      0            0      0.0     2.0  2008.0         8              4
1           0         3         80.0   9600.0       1      0         3            3          2          0  ...          0.0       0.0       0      0            0      0.0     5.0  2007.0         8              4
2           5         3         68.0  11250.0       1      0         0            3          4          0  ...          0.0       0.0       0      0            0      0.0     9.0  2008.0         8              4
3           6         3         60.0   9550.0       1      0         0            3          0          0  ...          0.0       0.0       0      0            0      0.0     2.0  2006.0         8              0
4           5         3         84.0  14260.0       1      0         0            3          2          0  ...          0.0       0.0       0      0            0      0.0    12.0  2008.0         8              4
    
  

5 rows × 78 columns

Import required packages for Feature Selection Process



In [9]:

    
import xgboost as xgb
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import mean_squared_error
from sklearn.utils import shuffle

Start the code by dropping some outliers. The outliers were detected with the Python package statsmodels; the details are skipped here.

Learn how to do this!



In [6]:

    
o = [30, 462, 523, 632, 968, 970, 1298, 1324]

train = train.drop(o, axis=0)
target = target.drop(o, axis=0)

# Renumber the remaining rows from 0
train.index = range(train.shape[0])
target.index = range(train.shape[0])

Set up the XGBoost model. The parameters were obtained from cross-validation driven by a Bayesian optimization process.



In [7]:

    
est = xgb.XGBRegressor(colsample_bytree=0.4,
                       gamma=0.045,
                       learning_rate=0.07,
                       max_depth=20,
                       min_child_weight=1.5,
                       n_estimators=300,
                       reg_alpha=0.65,
                       reg_lambda=0.45,
                       subsample=0.95)

Start the test process. The basic idea is to randomly permute the order of the elements in one column at a time and measure the impact of the permutation on the prediction error.

As the evaluation metric for feature importance, I used ((MSE of original data) - (MSE of permuted data)) / (MSE of original data), so the more negative the score, the more important the feature.



In [8]:

    
n = 200

# One row per random split, one column per feature
scores = pd.DataFrame(np.zeros([n, train.shape[1]]), columns=train.columns)

# n random splits, each holding out 25% of the rows for testing
splitter = ShuffleSplit(n_splits=n, test_size=.25)
for ct, (train_idx, test_idx) in enumerate(splitter.split(train)):
    X_train, X_test = train.iloc[train_idx, :], train.iloc[test_idx, :]
    Y_train, Y_test = target.iloc[train_idx], target.iloc[test_idx]
    est.fit(X_train, Y_train)
    acc = mean_squared_error(Y_test, est.predict(X_test))
    for i in range(train.shape[1]):
        # Shuffle one column of the held-out data and re-score the fitted model
        X_t = X_test.copy()
        X_t.iloc[:, i] = shuffle(X_t.iloc[:, i].to_numpy())
        shuff_acc = mean_squared_error(Y_test, est.predict(X_t))
        scores.iloc[ct, i] = (acc - shuff_acc) / acc
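To make the scoring rule concrete, here is a minimal, self-contained sketch of the same permutation idea on synthetic data. It is not the kernel's pipeline: the data are made up, a single split is used, and an ordinary least-squares fit stands in for XGBoost so that the example needs only NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up for the example): y depends on column 0 only.
X = rng.normal(size=(400, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=400)

# Hold out the last quarter of the rows, mirroring the 25% test split.
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Stand-in model: ordinary least squares instead of XGBoost.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def mse(features, targets):
    """Mean squared error of the fitted linear model on the given data."""
    return float(np.mean((targets - features @ coef) ** 2))


base = mse(X_test, y_test)
scores = []
for j in range(X.shape[1]):
    X_perm = X_test.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to y
    # Same sign convention as the loop above: important features go negative.
    scores.append((base - mse(X_perm, y_test)) / base)

print(scores)
```

Permuting the informative column inflates the error and yields a strongly negative score, while permuting the irrelevant column leaves the score close to zero. Repeating this over many random splits, as the kernel does, averages out the noise of any single split.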

Generate the output: the mean, median, min and max of each feature's scores across the splits



In [10]:

    
# One row per feature, one column per summary statistic over the n splits
fin_score = pd.DataFrame(np.zeros([train.shape[1], 4]),
                         index=train.columns,
                         columns=['Mean', 'Median', 'Min', 'Max'])
fin_score['Mean'] = scores.mean()
fin_score['Median'] = scores.median()
fin_score['Min'] = scores.min()
fin_score['Max'] = scores.max()

Inspect the feature importances, sorted by mean score in ascending order. The higher the value, the less important the feature: strongly negative scores mark important features, while scores near zero or above mark unimportant ones.



In [11]:

    
pd.set_option('display.max_rows', None)
fin_score.sort_values('Mean', axis=0)



    Out[11]:

                    Mean         Median        Min        Max
OverallQual    -1.745853  -1.744054e+00  -3.160713  -0.908557
GrLivArea      -0.718646  -7.081993e-01  -1.407366  -0.269067
TotalBsmtSF    -0.691805  -6.770063e-01  -1.258033  -0.232703
GarageCars     -0.229122  -2.223006e-01  -0.499151   0.068078
2ndFlrSF       -0.220490  -2.165775e-01  -0.417533  -0.079036
ExterQual      -0.126791  -1.214447e-01  -0.376415   0.028661
TotRmsAbvGrd   -0.126063  -1.173449e-01  -0.345731   0.032275
1stFlrSF       -0.117124  -1.102345e-01  -0.354299   0.048251
BsmtFinSF1     -0.111459  -1.075463e-01  -0.228526  -0.002136
LotArea        -0.098098  -9.367906e-02  -0.199207   0.022418
YearRemodAdd   -0.096721  -9.266874e-02  -0.231705   0.015454
YearBuilt      -0.072773  -7.156515e-02  -0.200384  -0.009698
KitchenQual    -0.067574  -5.813617e-02  -0.328904   0.086712
GarageArea     -0.058919  -5.642314e-02  -0.209697   0.077256
OverallCond    -0.055925  -5.513424e-02  -0.114862  -0.005446
BsmtQual       -0.040257  -3.491340e-02  -0.202992   0.071910
Neighborhood   -0.036025  -3.616916e-02  -0.100370   0.029343
Fireplaces     -0.032378  -3.076624e-02  -0.102836   0.019869
FullBath       -0.032094  -3.027362e-02  -0.142153   0.027313
BsmtExposure   -0.029104  -2.673424e-02  -0.098127   0.065690
FireplaceQu    -0.023655  -2.142477e-02  -0.132021   0.020082
GarageType     -0.018152  -1.606927e-02  -0.082256   0.034612
BsmtFullBath   -0.018065  -1.674809e-02  -0.069231   0.029977
HalfBath       -0.016640  -1.309231e-02  -0.068500   0.018128
GarageYrBlt    -0.015293  -1.342915e-02  -0.072729   0.026033
SaleCondition  -0.014690  -1.459492e-02  -0.069980   0.036967
BsmtUnfSF      -0.014099  -1.388500e-02  -0.065160   0.046270
MSZoning       -0.013157  -1.172020e-02  -0.046477   0.004154
LotFrontage    -0.012638  -1.319702e-02  -0.085937   0.102318
CentralAir     -0.009905  -9.648136e-03  -0.022149   0.002633
HouseStyle     -0.009775  -8.684391e-03  -0.063406   0.015083
BedroomAbvGr   -0.009728  -9.430763e-03  -0.046725   0.022914
OpenPorchSF    -0.009356  -1.003044e-02  -0.061482   0.063952
MSSubClass     -0.007799  -7.085857e-03  -0.041251   0.035381
KitchenAbvGr   -0.007257  -5.858517e-03  -0.040476   0.010593
MasVnrArea     -0.006865  -6.051450e-03  -0.069828   0.036731
WoodDeckSF     -0.006050  -5.415741e-03  -0.037125   0.044904
SaleType       -0.005838  -5.164100e-03  -0.044775   0.017638
Functional     -0.005400  -4.920237e-03  -0.025779   0.008431
BsmtFinType1   -0.004997  -4.775785e-03  -0.028020   0.013469
Exterior1st    -0.004246  -3.490362e-03  -0.024252   0.016022
Condition1     -0.003756  -3.380778e-03  -0.013468   0.005329
LandSlope      -0.003734  -3.399198e-03  -0.019064   0.006658
LotShape       -0.003417  -2.838803e-03  -0.045976   0.017040
BldgType       -0.003192  -2.631465e-03  -0.024310   0.006900
HeatingQC      -0.003113  -2.850851e-03  -0.023039   0.023475
PavedDrive     -0.002682  -2.239351e-03  -0.011865   0.003973
GarageFinish   -0.002556  -2.307206e-03  -0.024870   0.016588
LandContour    -0.002485  -1.815430e-03  -0.026012   0.011874
GarageQual     -0.001723  -2.114806e-03  -0.011777   0.012070
ScreenPorch    -0.001680  -1.331372e-03  -0.011340   0.004841
BsmtCond       -0.001627  -1.799775e-03  -0.013417   0.009557
Foundation     -0.001278  -7.971762e-04  -0.085218   0.012931
Alley          -0.000999  -4.454887e-04  -0.013659   0.003832
Electrical     -0.000762  -9.008381e-04  -0.008299   0.008689
MoSold         -0.000729  -7.599729e-04  -0.025653   0.030117
PoolArea       -0.000669  -4.194024e-05  -0.023620   0.005802
GarageCond     -0.000469  -3.865457e-04  -0.006497   0.003488
MasVnrType     -0.000368  -8.056413e-04  -0.027697   0.048125
MiscFeature    -0.000314  -1.813119e-04  -0.003070   0.002185
BsmtFinType2   -0.000244  -9.938363e-05  -0.007453   0.004107
BsmtFinSF2     -0.000142  -2.426682e-05  -0.013033   0.012890
RoofMatl       -0.000100  -4.735000e-05  -0.018069   0.021387
LotConfig      -0.000096  -3.057170e-04  -0.011418   0.023260
MiscVal        -0.000086  -2.442930e-05  -0.003269   0.001007
PoolQC         -0.000049  -8.743486e-06  -0.007560   0.006302
RoofStyle      -0.000041   1.702540e-04  -0.015179   0.021780
Condition2     -0.000031   0.000000e+00  -0.002181   0.000480
Street         -0.000017   0.000000e+00  -0.000582   0.000207
ExterCond      -0.000009  -8.864087e-05  -0.018773   0.015733
Heating         0.000015  -2.381500e-07  -0.001178   0.002461
LowQualFinSF    0.000033  -7.800121e-05  -0.006745   0.015485
3SsnPorch       0.000044   5.104960e-06  -0.003069   0.002630
Fence           0.000060   2.732116e-05  -0.008569   0.006448
BsmtHalfBath    0.000152   1.935323e-05  -0.007376   0.008231
Exterior2nd     0.000411   9.841010e-04  -0.032692   0.025933
YrSold          0.000741   1.080704e-03  -0.018658   0.018778
EnclosedPorch   0.000804   5.695463e-04  -0.012578   0.021105

The result is a little different from what JMT5802 got, but in general the two are similar. For example, OverallQual and GrLivArea are important in both cases, and PoolArea and PoolQC are unimportant in both cases. Also, based on the test conducted in the link below, it is reasonable to say that the differences between the two results are not pronounced.

The main code was modified from the example in the link below; special thanks to the author of the blog.

http://blog.datadive.net/selecting-good-features-part-iii-random-forests/

Updates:

After several tests, I removed the variables in the list below, and this did improve my score a little:

['Exterior2nd', 'EnclosedPorch', 'RoofMatl', 'PoolQC', 'BsmtHalfBath', 'RoofStyle', 'PoolArea', 'MoSold', 'Alley', 'Fence', 'LandContour', 'MasVnrType', '3SsnPorch', 'LandSlope']
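Dropping the listed variables could look like the sketch below. The list is the one reported above; the toy frame and its values are hypothetical and only stand in for the cleaned training data.

```python
import pandas as pd

# Columns reported as removed after testing.
to_drop = ['Exterior2nd', 'EnclosedPorch', 'RoofMatl', 'PoolQC', 'BsmtHalfBath',
           'RoofStyle', 'PoolArea', 'MoSold', 'Alley', 'Fence', 'LandContour',
           'MasVnrType', '3SsnPorch', 'LandSlope']

# Toy frame (hypothetical values) standing in for the cleaned training data.
toy = pd.DataFrame({
    "OverallQual": [7.0, 6.0],
    "PoolQC": [0, 0],
    "Fence": [0, 2],
    "GrLivArea": [1710.0, 1262.0],
})

# errors="ignore" skips names that are not present in this small frame.
reduced = toy.drop(columns=to_drop, errors="ignore")

print(list(reduced.columns))
```

On the real data the same call would be applied to both the training and the test frame, so that the model is fitted and used for prediction on identical columns.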



In [12]:

    
est









    Out[12]:





XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.4,
       gamma=0.045, learning_rate=0.07, max_delta_step=0, max_depth=20,
       min_child_weight=1.5, missing=None, n_estimators=300, nthread=-1,
       objective='reg:linear', reg_alpha=0.65, reg_lambda=0.45,
       scale_pos_weight=1, seed=0, silent=True, subsample=0.95)



In [21]:

    
test.shape[0]









    Out[21]:





1459



In [22]:

    
result = pd.Series(est.predict(test))



In [23]:

    
result.index









    Out[23]:





RangeIndex(start=0, stop=1459, step=1)



In [28]:

    
submission = pd.DataFrame({
        "Id": result.index + 1461,
        "SalePrice": result.values
    })



In [29]:

    
submission.to_csv('submission-xgboost.csv', index=False)


