lecture 00:01:33 Random forests allow us to understand our data more deeply and more quickly than traditional approaches.
Q: "when should we use Random Forests?"
A: "the question is more: *when should I use other things as well?"
for Unstructured Data (sound waveform, words in text, pixels in images) I'll almost certainly want to try Deep Learning. For Collaborative Filtering, neither DL nor RF are exactly what you want (tweaks are needed)
In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
In [2]:
from fastai.imports import *
from fastai.structured import *
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display
from sklearn import metrics
In [3]:
set_plot_sizes(12,14,16)
We already read in our CSV and processed it into categories in the Lesson 1 notebook. proc_df turns the categories into integers, deals w/ missing values, and pulls out the dependent variable (y_trn).
Lecture 3 – 52:59 Note on proc_df, nas, and na_dict:
proc_df returns a dictionary telling you which columns had missing values, and for each of those columns what the median was. When you call proc_df on the larger non-subset dataset you don't pass in na_dict as an argument – you just want to get the result back. Later on, when you call it on a subset, you want the same missing-value columns and the same medians, so you pass na_dict in. If the subset (for example a different dataset) had different missing columns, proc_df would update the na_dict argument with additional entries.
If you don't pass it in, proc_df just gives you the information about what was missing & the medians. If you do pass it in, it uses that information for any missing columns that are there – and if there are new missing columns it'll update the dictionary with that additional information. It keeps track of every missing column you came across in anything you passed to proc_df.
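In code, that calling pattern looks roughly like this (a sketch assuming fastai 0.7's proc_df signature; df_subset here is a hypothetical subset):
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')                    # nas maps each filled column to the median used
df_sub, y_sub, nas = proc_df(df_subset, 'SalePrice', na_dict=nas)    # reuse the same medians; nas gains entries for any new missing columns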
In [4]:
PATH = 'data/bulldozers/'
df_raw = pd.read_feather('tmp/bulldozers-raw.feather')
df_trn, y_trn , nas = proc_df(df_raw, 'SalePrice')
In [5]:
def split_vals(a,n): return a[:n], a[n:]
n_valid = 12000
n_trn = len(df_trn) - n_valid
X_train, X_valid = split_vals(df_trn, n_trn)
y_train, y_valid = split_vals(y_trn, n_trn)
raw_train, raw_valid = split_vals(df_raw, n_trn)
In [6]:
def rmse(x,y): return math.sqrt(((x-y)**2).mean())
def print_score(m):
res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
m.score(X_train, y_train), m.score(X_valid, y_valid)]
if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
print(res)
In [7]:
df_raw # after proc_df. shows log(SalePrice)
Out[7]:
We'd be less confident of a prediction if we hadn't seen many rows like a particular row. And if we hadn't, we wouldn't expect any of the trees to have a path through them that's designed to help us predict that row.
So conceptually you'd expect that as you pass this unusual row through different trees it's going to end up in very different places.
IoW: instead of just taking the mean of the predictions of the trees and saying 'that's our prediction' – what if we also took the Standard Deviation of the predictions of the trees?
If the Standard Deviation of the preds of the trees is high, each tree is giving us a very different estimate of this row's prediction. If this was a very common kind of row, the trees would've learnt to make good predictions for it bc they'd've seen lots of opportunities to split on those kinds of rows.
The Standard Deviation of the predictions across the trees gives us some kind of relative understanding of how confident we are of this prediction.
This doesn't exist in scikit-learn or other libraries so we'll have to create it.
For model interpretation, there's no need to use the full dataset on each tree – using a subset will be both faster, and also provide better interpretability (since an overfit model won't provide much variance across trees).
We don't need a massively accurate RF, we just need one that indicates the nature of the relationships involved.
So you want n_samples to be high enough such that if you call the same interpretation commands multiple times you don't get different results back each time.
In [8]:
set_rf_samples(50000)
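set_rf_samples comes from the old fastai library; if you don't have it, recent scikit-learn releases (0.22+, an assumption about your install) expose the same idea directly via the max_samples argument:
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5,
                          n_jobs=-1, oob_score=True, max_samples=50000)   # each tree trains on ~50k sampled rows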
In [9]:
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)
preds: rows are the results of each tree, cols are the result for each row in the original dataset (ie: trees x dataset_rows).
We can check out the mean and standard deviation of the first column in preds – which is the μ & σ for our dataset's 1st row.
In [10]:
%time preds = np.stack([t.predict(X_valid) for t in m.estimators_])
np.mean(preds[:,0]), np.std(preds[:,0])
Out[10]:
When we use Python to loop through trees like this, we're calculating each in series, which is slow! We can use parallel processing to speed things up. fastai provides the function parallel_trees, which takes the random-forest model you trained and some function to call – and calls that function on every tree in parallel.
In [11]:
def get_preds(t): return t.predict(X_valid)
%time preds = np.stack(parallel_trees(m, get_preds))
np.mean(preds[:,0]), np.std(preds[:,0])
Out[11]:
(When CPU time roughly equals Wall time, that's 1 CPU working on its own.)
We can see that different trees are giving different estimates in this auction. In order to see how prediction confidence varies, we can add this to our dataset. "Enclosure" is one of the categories in our dataset, and has several subcategories:
In [12]:
x = raw_valid.copy() # create copy of data
x['pred_std'] = np.std(preds, axis=0) # add col for stdv of preds
x['pred'] = np.mean(preds, axis=0) # add col for mean of preds
x.Enclosure.value_counts().plot.barh();
We don't care about the 1st 3 subcategories because they never show up in the data.
We take our dataset and group it by 'Enclosure' and take the average of the 3 fields 'SalePrice', 'pred', & 'pred_std':
In [13]:
flds = ['Enclosure', 'SalePrice', 'pred', 'pred_std']
enc_summ = x[flds].groupby('Enclosure', as_index=False).mean()
enc_summ
Out[13]:
We can already get some insight. As you'd expect the sale price & prediction are close to each other on average – that's a good sign.
The Standard Deviation varies a little bit – to see it better we can plot it:
In [14]:
enc_summ = enc_summ[~pd.isnull(enc_summ.SalePrice)]
enc_summ.plot('Enclosure', 'SalePrice', 'barh', xlim=(0,11));
In [15]:
enc_summ.plot('Enclosure', 'pred', 'barh', xerr='pred_std', alpha=0.6, xlim=(0,11));
The error bars are the standard deviation of the prediction.
Question: Why are the predictions nearly exactly right, but the error bars are quite wide?
Doing the same thing for Product Size:
why is matplotlib all of a sudden tasting the rainbow?..
In [16]:
raw_valid.ProductSize.value_counts().plot.barh();
In [17]:
flds = ['ProductSize', 'SalePrice', 'pred', 'pred_std']
summ = x[flds].groupby(flds[0]).mean()
summ
Out[17]:
We could ask "what's the ratio of the stdev of the predictions to the predictions themselves?" – you'd kind of expect, on average, that when you're predicting something that's a bigger number, your standard deviation would be higher. So you can sort by that ratio.
What this tells us is that our "Large" and "Compact" Product Sizes are less accurate (relatively speaking, as a ratio of the total price):
In [18]:
(summ.pred_std/summ.pred).sort_values(ascending=False)
Out[18]:
If you go back to the histogram above you'll see 'Large' and 'Compact' are the smallest groups – so of course their predictions are less reliable. You always want to see the confidence intervals for deployed models.
It's not normally enough to just know that a model can make accurate predictions – we also want to know how it's making predictions. The most important way to see this is with feature importance.
We showed Confidence Intervals first because the intuition for how to calculate them is easier and similar to what we've already done. But when it comes to which one I look at first in practice: it's always Feature Importance.
When I'm working on a Kaggle competition or a real-world project, I build a Random Forest as fast as I can – trying to get to the point where it's significantly better than random (it doesn't need to be much better than that) – and the next thing I do is plot the feature importance.
Feature Importance tells us: "in this Random Forest, which columns matter"
In [19]:
fi = rf_feat_importance(m, df_trn); fi[:10] # view top-10
Out[19]:
In [20]:
fi.plot('cols', 'imp', figsize=(10,6), legend=False);
(this should be displaying the column names on x-axis..)
In nearly every dataset you use in real life, this is what your feature importance is going to look like. There's a handful of columns you care about.
We can plot this as a barplot:
In [21]:
def plot_fi(fi): return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)
In [22]:
plot_fi(fi[:30]);
Before we worry about how to calculate it – much more important is to know what to do with it. The most important thing to do with it is to now sit down with your client or data dictionary (whatever your information source is) and ask them about the data. Ie: "what is YearMade?"
Find out everything you can about YearMade, Coupler_System, and ProductSize, and their relationships to price, because they are what matter.
What'll often happen in real-world projects is you'll sit with the client and you'll say "turns out the Coupler System is the 2nd most important thing", and they might say "that makes no sense". That doesn't mean there's a problem w/ your model. It means there's a problem w/ their understanding of the data they gave you.
Data Leakage: there's information in the dataset you're modeling with which the client wouldn't've had in real-life at the point in time they're making a decision.
A lot of the key insights you'll find with feature importance are Data Leakage problems. You'll also often find signs of collinearity (ie: Coupler_System is a proxy for a type of industrial equipment that uses that feature).
Now let's keep the features with an importance greater than 0.005. Our RF looking at all features got a validation R2 score of 0.89.
In [23]:
to_keep = fi[fi.imp > 0.005].cols; len(to_keep)
Out[23]:
In [24]:
df_keep = df_trn[to_keep].copy()
X_train, X_valid = split_vals(df_keep, n_trn)
In [25]:
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5,
n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)
The R2 score basically didn't change – in fact removing the redundant columns actually made it a little better.
In [26]:
fi = rf_feat_importance(m, df_keep)
plot_fi(fi);
When you remove redundant columns, you're also removing sources of collinearity – ie: 2 cols that might be related to each other. Collinearity won't make your RF less predictive – but if 2 cols are related, and one of them is a strong driver of the dependent variable, then the importance gets split between them. By removing cols w/ v.little impact, the feature importance becomes a lot clearer.
Lecture 3 – 1:17:19 Feature Importance is a technique that works for any kind of Machine Learning model.
Feature Importances are calculated by shuffling each column in turn and measuring the effect on the model's score (running predictions again).
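A minimal sketch of that shuffle-and-rescore idea (perm_importance is a hypothetical helper for illustration, not the fastai implementation):
import numpy as np

def perm_importance(model, X, y, col):
    baseline = model.score(X, y)                                # R^2 with the column intact
    X_shuf = X.copy()
    X_shuf[col] = np.random.permutation(X_shuf[col].values)     # destroy the column's information
    return baseline - model.score(X_shuf, y)                    # drop in score = importance

# e.g. perm_importance(m, X_valid, y_valid, 'YearMade')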
Lecture 4 RF Interpretation start.
More on RF Feature Importance -vs- Logistic Regression Coefficients.
You're much more often going to see people talk about Logistic Regression coefficients than you are to see them talk about Random Forest variable importance – and every time you see that happen you should be very, very skeptical.
With max_n_cat, any categorical variable with fewer than max_n_cat levels will be turned into a bunch of 1-hot encoded columns.
In [27]:
df_trn2, y_trn, nas = proc_df(df_raw, 'SalePrice', max_n_cat=7)
X_train, X_valid = split_vals(df_trn2, n_trn)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.6, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)
In this case the validation set RMSE & R2 got a little worse. This isn't always the case – it depends on the dataset, ie: whether it has categories where individual levels tend to be important on their own.
In this case 1-hot encoding didn't make our model more predictive. But it did create different features. proc_df names each new column "variablename_levelname".
Whereas before "Enclosure" was only somewhat important, when we 1-hot encode it, "Enclosure_EROPS w AC" becomes the most important feature.
For at least the purpose of interpreting your model, you should always try 1-hot encoding quite a few of your variables – 6 or 7 are often good.
In [28]:
fi = rf_feat_importance(m, df_trn2)
plot_fi(fi[:25]);
The number of levels in a category is its cardinality. proc_df checks the cardinality against max_n_cat and 1-hot encodes the variable if its cardinality is lower.
The original variable is not kept when it's 1-hot encoded.
If you have an ordinal variable (0,1,2 for Hi, Med, Low), proc_df will destroy its ordering if it's affected by max_n_cat. Convert it to numerical ahead of time via df.CatName = df.CatName.cat.codes. This will cause the numericalize step within proc_df to skip over it.
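For example (a sketch assuming an ordered categorical like UsageBand from the Lesson 1 notebook):
df_raw.UsageBand = df_raw.UsageBand.cat.set_categories(['Low', 'Medium', 'High'], ordered=True)
df_raw.UsageBand = df_raw.UsageBand.cat.codes   # Low=0, Medium=1, High=2 – now numeric, so proc_df leaves it as-is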
One thing that makes this harder to interpret is that there seem to be some variables with very similar meanings. Let's try to remove redundant features.
One way to do this is to use a dendrogram – a type of hierarchical clustering.
In hierarchical (aka agglomerative) clustering, we look at every pair of objects and ask which 2 are the closest – then replace them with the midpoint (average) of the two.
In [29]:
from scipy.cluster import hierarchy as hc
In [30]:
# create dendrogram
corr = np.round(scipy.stats.spearmanr(df_keep).correlation, 4)
corr_condensed = hc.distance.squareform(1 - corr)
z = hc.linkage(corr_condensed, method='average')
fig = plt.figure(figsize=(16,10))
dendrogram = hc.dendrogram(z, labels=df_keep.columns, orientation='left', leaf_font_size=16)
plt.show()
Lecture 4 – 57:40 The horizontal axis is similarity. saleYear and saleElapsed were combined early by the hierarchical clustering algorithm, so they're very similar.
Correlation is almost the same as R2, except it's between 2 variables, rather than between a variable and its prediction. Here Spearman's R was used as the correlation metric (the most common rank correlation).
Rank correlation: replace each point with its rank on the x & y axes (which turns any monotonic relationship into a straight line).
I want to find the columns that are similar in a way that the Random Forest would find them similar – RFs don't care about linearity, only ordering, so a rank correlation is the 'right way' to think about it.
The relationship does have to be monotonic though (parabolas won't work).
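A quick illustration of why ranks matter here (an assumed toy example, not from the lecture):
import numpy as np
import scipy.stats as st
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = x**3                                 # monotonic but non-linear
st.pearsonr(x, y)[0]                     # < 1: Pearson penalises the non-linearity
st.spearmanr(x, y).correlation           # == 1.0: the ranks match exactly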
Let's try removing some of these related features to see if the model can be simplified without impacting accuracy.
In [31]:
def get_oob(df):
m = RandomForestRegressor(n_estimators=30, min_samples_leaf=5, max_features=0.6, n_jobs=-1, oob_score=True)
x, _ = split_vals(df, n_trn)
m.fit(x, y_train)
return m.oob_score_
Here's our baseline:
In [32]:
get_oob(df_keep)
Out[32]:
We're going to try removing each of the most correlated variables one at a time and see which we can remove without worsening the OOB score.
In [33]:
for c in ('fiModelDesc', 'fiBaseModel', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow', 'saleElapsed', 'saleYear', ):
print(c, get_oob(df_keep.drop(c, axis=1)))
What if we try one from each group for removal:
In [36]:
to_drop = ['fiBaseModel', 'Grouser_Tracks', 'saleYear']
get_oob(df_keep.drop(to_drop, axis=1))
Out[36]:
Looking good – let's use this dataframe from here on. We'll save the list of columns so we can reuse it later.
In [37]:
df_keep.drop(to_drop, axis=1, inplace=True)
X_train, X_valid = split_vals(df_keep, n_trn)
In [38]:
np.save('tmp/keep_cols.npy', np.array(df_keep.columns))
In [39]:
keep_cols = np.load('tmp/keep_cols.npy')
df_keep = df_trn[keep_cols]
And let's see how this model looks on the full dataset.
In [40]:
reset_rf_samples() # use full dataset & bootstrapped samples
In [41]:
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)
We're getting a good score (better than before) with a smaller and simpler model now.
Now we're at the point where we want to really try to understand the data better by taking advantage of the model. We'll use something called Partial Dependence. V.few people know about this powerful technique.
What we're going to do is find out for the features that are important, how they relate to the dependent variable.
In [42]:
from pdpbox import pdp
from plotnine import *
In [43]:
set_rf_samples(50000)
This next analysis will be a little easier if we use the 1-hot encoded categorical variables, so let's load them up again.
In [47]:
df_trn2, y_trn, nas = proc_df(df_raw, 'SalePrice', max_n_cat=7)
X_train, X_valid = split_vals(df_trn2, n_trn)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.6, n_jobs=-1)
m.fit(X_train, y_train);
In [48]:
plot_fi(rf_feat_importance(m, df_trn2)[:10]);
YearMade is the 2nd most important variable. We'd expect it, together with saleElapsed, to tell us something about how old the product was when it was sold.
We can plot them to see how they relate to each other:
In [50]:
df_raw.plot('YearMade', 'saleElapsed', 'scatter', alpha=0.01, figsize=(10,8));
It's highly unlikely that tractors were made in the year 1000 ... so we can remove those.
In [56]:
x_all = get_sample(df_raw[df_raw.YearMade > 1930], 500) # grab 500 pts
Lecture 4 – 1:09:58 the ggplot library (here provided by plotnine) uses R's ggplot2 syntax.
In [57]:
ggplot(x_all, aes('YearMade', 'SalePrice')) + stat_smooth(se=True, method='loess')
Out[57]:
loess – locally-weighted regression (like doing lots of little linear regressions) – (JHoward uses this for univariate plots)
What we want to do is say what's the relationship between salePrice and YearMade – all other things being equal.
We do this with a Partial Dependence Plot:
In [58]:
x = get_sample(X_train[X_train.YearMade>1930], 500)
In [59]:
def plot_pdp(feat, clusters=None, feat_name=None):
feat_name = feat_name or feat
p = pdp.pdp_isolate(m, x, feat)
return pdp.pdp_plot(p, feat_name, plot_lines=True,
cluster=clusters is not None, n_cluster_centers=clusters)
In [60]:
plot_pdp('YearMade')
We've got our sample of 500 datapoints. We're going to take each one of those 500 randomly chosen auctions and make a dataset out of it.
Instead of shuffling a chosen column (as for feature importance), we're going to set the entire column to a single value, run that dataset through the pre-fit RF model to see what the predictions become, and do so for each value of the chosen column in our dataset.
Randomly shuffling the column tells us how accurate the model is when we can't use that column anymore. Replacing the whole column with a constant (say, 1961 for YearMade) instead estimates how much we would've sold each product for – in that auction, on that day, in that place, etc – if that product had been made in 1961.
So we then take the average of all of the sale prices that we calculate from that Random Forest.
Each light-blue line is a row in our dataset. The y-axis is the dependent variable (zero-indexed to the 1st value of our chosen column, so the prediction at 1900 becomes 0), and the x-axis is the values of our chosen independent variable (YearMade).
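That procedure is simple enough to sketch by hand (manual_pdp is a hypothetical helper for illustration, not the pdpbox implementation):
def manual_pdp(model, X_sample, col, values):
    means = []
    for v in values:
        X_tmp = X_sample.copy()
        X_tmp[col] = v                              # set the whole column to a single value
        means.append(model.predict(X_tmp).mean())   # average prediction across the sample
    return means

# e.g. manual_pdp(m, x, 'YearMade', range(1960, 2010))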
This Partial Dependence Plot concept is something which uses a Random Forest to get us a clearer interpretation of what's going on in our data.
Steps:
- look at feature importance to tell us what to care about
- use the partial dependence plot to tell us what's going on, on average.
We can also use clusters with pdp. clusters uses Cluster Analysis to look at each of the 500 rows and ask "do some of these 500 rows kind of move in the same way?"
In [62]:
plot_pdp('YearMade', clusters=5)
Here's the result of the Cluster Analysis. We still see the same average, but we also see the 5 most common patterns/shapes in the data.
You can do the same thing in a PDP-Interaction Plot. In this example: how do saleElapsed & YearMade together impact price?
y-axis: zero-indexed Log(price); the last plot shows saleElapsed vs YearMade, with a Log(price) colour bar.
In [63]:
feats = ['saleElapsed', 'YearMade']
p = pdp.pdp_interact(m, x, feats)
pdp.pdp_interact_plot(p, feats)
If you have 1-hot encoded variables, you can pass an array of them to plot_pdp and it'll treat them as categories.
In [73]:
plot_pdp(['Enclosure_EROPS w AC', 'Enclosure_EROPS', 'Enclosure_OROPS'], 5, 'Enclosure')
Machines with Enclosure_EROPS w AC are on average more expensive than Enclosure_EROPS, which are on avg slightly more expensive than Enclosure_OROPS.
Based on all these interactions, let's try setting everything made before 1950 to 1950 (there seem to be missing dates coded as 1000 or 1900), and create an 'age' feature. Then run an RF on that and see how important age becomes in the sale of construction equipment.
In [74]:
df_raw.loc[df_raw.YearMade<1950, 'YearMade'] = 1950   # .loc avoids chained assignment
df_keep['age'] = df_raw['age'] = df_raw.saleYear-df_raw.YearMade
In [75]:
X_train, X_valid = split_vals(df_keep, n_trn)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.6, n_jobs=-1)
m.fit(X_train, y_train)
plot_fi(rf_feat_importance(m, df_keep));
Tree Interpretation may be unimportant for Kaggle competitions but it is vitally important for real-world applications.
Tree Interpreter allows us to take a particular row (in this case row 0): it lets you call ti.predict, passing in your RF and the row (a particular auction or customer's information), and it'll display 3 things:
- The RF's prediction
- The Bias (avg. sale price across whole original dataset)
- The Contributions
To find out why/how we arrived at a specific prediction in a Random Forest, we can simply trace the leaf node's prediction back through to the root of the tree. contributions shows each split point and value from root to leaf that led the RF to make this specific prediction (see the sanity check after the cells below).
In [77]:
from treeinterpreter import treeinterpreter as ti
In [78]:
df_train, df_valid = split_vals(df_raw[df_keep.columns], n_trn)
In [79]:
row = X_valid.values[None, 0]; row
Out[79]:
In [80]:
prediction, bias, contributions = ti.predict(m, row)
In [81]:
prediction[0], bias[0]
Out[81]:
In [82]:
idxs = np.argsort(contributions[0])
In [83]:
[o for o in zip(df_keep.columns[idxs], df_valid.iloc[0][idxs], contributions[0][idxs])]
Out[83]:
In [84]:
contributions[0].sum()
Out[84]:
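The three pieces add up: bias plus the summed contributions reconstructs the prediction for this row. A quick sanity check (an assumed addition, not part of the lecture notebook):
np.allclose(prediction[0], bias[0] + contributions[0].sum())   # True: prediction = bias + sum(contributions)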
One last check is extrapolation: our validation set is later in time than our training set, so some features' distributions may have drifted. To find them, we add an is_valid column (1 for validation rows, 0 for training rows) and train a classifier to predict it – whatever it relies on to tell the two apart is time-dependent.
In [85]:
df_ext = df_keep.copy()
df_ext['is_valid'] = 1
df_ext.is_valid[:n_trn] = 0
x, y, nas = proc_df(df_ext, 'is_valid')
In [86]:
m = RandomForestClassifier(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(x, y);
m.oob_score_
Out[86]:
In [87]:
fi = rf_feat_importance(m, x); fi[:10]
Out[87]:
The features it finds most useful – SalesID, saleElapsed, MachineID – are identifiers and dates, which naturally drift over time. Compare their distributions (scaled by 1000) in the training vs validation sets:
In [88]:
feats=['SalesID', 'saleElapsed', 'MachineID']
In [89]:
(X_train[feats]/1000).describe()
Out[89]:
In [90]:
(X_valid[feats]/1000).describe()
Out[90]:
In [91]:
x.drop(feats, axis=1, inplace=True)
In [92]:
m = RandomForestClassifier(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(x, y);
m.oob_score_
Out[92]:
In [93]:
fi = rf_feat_importance(m, x); fi[:10]
Out[93]:
In [94]:
set_rf_samples(50000)
In [98]:
feats=['SalesID', 'saleElapsed', 'MachineID', 'age', 'YearMade', 'saleDayofyear']
In [99]:
X_train, X_valid = split_vals(df_keep, n_trn)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)
Now drop each of these time-dependent features one at a time and see how the validation score changes:
In [100]:
for f in feats:
df_subs = df_keep.drop(f, axis=1)
X_train, X_valid = split_vals(df_subs, n_trn)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print(f)
print_score(m)
In [101]:
reset_rf_samples()
Based on those results, drop SalesID, MachineID and saleDayofyear and retrain on the full dataset:
In [102]:
df_subs = df_keep.drop(['SalesID', 'MachineID', 'saleDayofyear'], axis=1)
X_train, X_valid = split_vals(df_subs, n_trn)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)
In [103]:
plot_fi(rf_feat_importance(m, X_train));
In [104]:
np.save('tmp/subs_cols.npy', np.array(df_subs.columns))
Finally, a bigger forest (160 trees) on this final set of columns:
In [105]:
m = RandomForestRegressor(n_estimators=160, max_features=0.5, n_jobs=-1, oob_score=True)
%time m.fit(X_train, y_train)
print_score(m)