lecture 00:01:33 Random forests allow us to understand our data more deeply and more quickly than traditional approaches.
Q: "when should we use Random Forests?"
A: "the question is more: *when should I use other things as well?"
for Unstructured Data (sound waveform, words in text, pixels in images) I'll almost certainly want to try Deep Learning. For Collaborative Filtering, neither DL nor RF are exactly what you want (tweaks are needed)
In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
In [2]:
from fastai.imports import *
from fastai.structured import *
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display
from sklearn import metrics
In [3]:
set_plot_sizes(12,14,16)
We already read in our CSV and processed it into categories in the Lesson 1 notebook. proc_df turns the categories into integers, deals w/ missing values, and pulls out the dependent variable (y_trn).
Lecture 3 – 52:59 Note on proc_df, nas, and na_dict:
proc_df returns a dictionary telling you which columns had missing values, and for each of those columns what the median was. When you call proc_df on the larger non-subset dataset you don't pass in na_dict as an argument – you just want to get the result back. Later on, when you call it on a subset, you want the same missing-value columns and the same medians, so you pass na_dict in. If the subset (for example a different dataset) had different missing columns, proc_df would update the na_dict argument with additional entries.
If you don't pass it in, proc_df just gives you the information about what was missing & the medians. If you do pass it in, it uses that information for any missing columns that are there – and if there are new missing columns it'll update the dictionary with that additional information. It keeps track of every missing column you came across in anything you passed to proc_df.
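In code, that calling pattern looks roughly like this (a sketch assuming fastai 0.7's proc_df signature; df_subset here is a hypothetical subset):
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')                    # nas maps each filled column to the median used
df_sub, y_sub, nas = proc_df(df_subset, 'SalePrice', na_dict=nas)    # reuse the same medians; nas gains entries for any new missing columns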
In [4]:
PATH = 'data/bulldozers/'
df_raw = pd.read_feather('tmp/bulldozers-raw.feather')
df_trn, y_trn , nas = proc_df(df_raw, 'SalePrice')
In [5]:
def split_vals(a,n): return a[:n], a[n:]
n_valid = 12000
n_trn = len(df_trn) - n_valid
X_train, X_valid = split_vals(df_trn, n_trn)
y_train, y_valid = split_vals(y_trn, n_trn)
raw_train, raw_valid = split_vals(df_raw, n_trn)
In [6]:
def rmse(x,y): return math.sqrt(((x-y)**2).mean())
def print_score(m):
res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
m.score(X_train, y_train), m.score(X_valid, y_valid)]
if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
print(res)
In [7]:
df_raw # after proc_df. shows log(SalePrice)
Out[7]:
We'd be less confident of a prediction if we hadn't seen many rows like a particular row. And if we hadn't, we wouldn't expect any of the trees to have a path through them that's designed to help us predict that row.
So conceptually you'd expect that as you pass this unusual row through different trees it's going to end up in very different places.
IoW: instead of just taking the mean of the predictions of the trees and saying 'that's our prediction' – what if we also took the Standard Deviation of the predictions of the trees?
If the Standard Deviation of the preds of the trees is high, each tree is giving us a very different estimate of this row's prediction. If this was a very common kind of row, the trees would've learnt to make good predictions for it bc they'd've seen lots of opportunities to split on those kinds of rows.
The Standard Deviation of the predictions across the trees gives us some kind of relative understanding of how confident we are of this prediction.
This doesn't exist in scikit-learn or other libraries so we'll have to create it.
For model interpretation, there's no need to use the full dataset on each tree – using a subset will be both faster, and also provide better interpretability (since an overfit model won't provide much variance across trees).
We don't need a massively accurate RF, we just need one that indicates the nature of the relationships involved.
So you want n_samples to be high enough such that if you call the same interpretation commands multiple times you don't get different results back each time.
In [8]:
set_rf_samples(50000)
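set_rf_samples comes from the old fastai library; if you don't have it, recent scikit-learn releases (0.22+, an assumption about your install) expose the same idea directly via the max_samples argument:
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5,
                          n_jobs=-1, oob_score=True, max_samples=50000)   # each tree trains on ~50k sampled rows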
In [9]:
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)
preds: rows are the results of each tree, cols are the result for each row in the original dataset (ie: trees x dataset_rows).
We can check out the mean and standard deviation of the first column in preds – which is the μ & σ for our dataset's 1st row.
In [10]:
%time preds = np.stack([t.predict(X_valid) for t in m.estimators_])
np.mean(preds[:,0]), np.std(preds[:,0])
Out[10]:
When we use Python to loop through trees like this, we're calculating each in series, which is slow! We can use parallel processing to speed things up. fastai provides the function parallel_trees, which takes the random-forest model you trained and some function to call – and calls that function on every tree in parallel.
In [11]:
def get_preds(t): return t.predict(X_valid)
%time preds = np.stack(parallel_trees(m, get_preds))
np.mean(preds[:,0]), np.std(preds[:,0])
Out[11]:
(When CPU time roughly equals Wall time, that's 1 CPU working on its own.)
We can see that different trees are giving different estimates in this auction. In order to see how prediction confidence varies, we can add this to our dataset. "Enclosure" is one of the categories in our dataset, and has several subcategories:
In [12]:
x = raw_valid.copy() # create copy of data
x['pred_std'] = np.std(preds, axis=0) # add col for stdv of preds
x['pred'] = np.mean(preds, axis=0) # add col for mean of preds
x.Enclosure.value_counts().plot.barh();
We don't care about the 1st 3 subcategories because they never show up in the data.
We take our dataset and group it by 'Enclosure' and take the average of the 3 fields 'SalePrice', 'pred', & 'pred_std':
In [13]:
flds = ['Enclosure', 'SalePrice', 'pred', 'pred_std']
enc_summ = x[flds].groupby('Enclosure', as_index=False).mean()
enc_summ
Out[13]:
We can already get some insight. As you'd expect the sale price & prediction are close to each other on average – that's a good sign.
The Standard Deviation varies a little bit – to see it better we can plot it:
In [14]:
enc_summ = enc_summ[~pd.isnull(enc_summ.SalePrice)]
enc_summ.plot('Enclosure', 'SalePrice', 'barh', xlim=(0,11));
In [15]:
enc_summ.plot('Enclosure', 'pred', 'barh', xerr='pred_std', alpha=0.6, xlim=(0,11));
The error bars are the standard deviation of the prediction.
Question: Why are the predictions nearly exactly right, but the error bars are quite wide?
Doing the same thing for Product Size:
why is matplotlib all of a sudden tasting the rainbow?..
In [16]:
raw_valid.ProductSize.value_counts().plot.barh();
In [17]:
flds = ['ProductSize', 'SalePrice', 'pred', 'pred_std']
summ = x[flds].groupby(flds[0]).mean()
summ
Out[17]:
We could ask "what's the ratio of the stdev of the predictions to the predictions themselves?" – you'd kind of expect, on average, that when you're predicting something that's a bigger number, your standard deviation would be higher. So you can sort by that ratio.
What this tells us is that our "Large" and "Compact" Product Sizes are less accurate (relatively speaking, as a ratio of the total price):
In [18]:
(summ.pred_std/summ.pred).sort_values(ascending=False)
Out[18]:
If you go back to the histogram above you'll see 'Large' and 'Compact' are the smallest groups – so of course their predictions are less reliable. You always want to see the confidence intervals for deployed models.
It's not normally enough to just know that a model can make accurate predictions – we also want to know how it's making predictions. The most important way to see this is with feature importance.
We showed Confidence Intervals first because the intuition for how to calculate them is easier and similar to what we've already done. But when it comes to which one I look at first in practice: it's always Feature Importance.
When I'm working on a Kaggle competition or a real-world project, I build a Random Forest as fast as I can – trying to get to the point where it's significantly better than random (it doesn't need to be much better than that) – and the next thing I do is plot the feature importance.
Feature Importance tells us: "in this Random Forest, which columns matter"
In [19]:
fi = rf_feat_importance(m, df_trn); fi[:10] # view top-10
Out[19]:
In [20]:
fi.plot('cols', 'imp', figsize=(10,6), legend=False);
(this should be displaying the column names on x-axis..)
In nearly every dataset you use in real life, this is what your feature importance is going to look like. There's a handful of columns you care about.
We can plot this as a barplot:
In [21]:
def plot_fi(fi): return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)
In [22]:
plot_fi(fi[:30]);
Before we worry about how to calculate it – much more important is to know what to do with it. The most important thing to do with it is to now sit down with your client or data dictionary (whatever your information source is) and ask them about the data. Ie: "what is YearMade?"
Find out everything you can about YearMade, Coupler_System, and ProductSize, and their relationships to price, because they are what matter.
What'll often happen in real-world projects is you'll sit with the client and you'll say "turns out the Coupler System is the 2nd most important thing", and they might say "that makes no sense". That doesn't mean there's a problem w/ your model. It means there's a problem w/ their understanding of the data they gave you.
Data Leakage: there's information in the dataset you're modeling with which the client wouldn't've had in real-life at the point in time they're making a decision.
A lot of the key insights you'll find with feature importance are Data Leakage problems. You'll also often find signs of collinearity (ie: Coupler_System is a proxy for a type of industrial equipment that uses that feature).
Now let's keep the features with an importance greater than 0.005. Our RF looking at all features got a validation R2 score of 0.89.
In [23]:
to_keep = fi[fi.imp > 0.005].cols; len(to_keep)
Out[23]:
In [24]:
df_keep = df_trn[to_keep].copy()
X_train, X_valid = split_vals(df_keep, n_trn)
In [25]:
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5,
n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)
The R2 score basically didn't change – in fact removing the redundant columns actually made it a little better.
In [26]:
fi = rf_feat_importance(m, df_keep)
plot_fi(fi);
When you remove redundant columns, you're also removing sources of collinearity – ie: 2 cols that might be related to each other. Collinearity won't make your RF less predictive – but if 2 cols are related, and one of them is a strong driver of the dependent variable, then the importance gets split between them. By removing cols w/ v.little impact, the feature importance becomes a lot clearer.
Lecture 3 – 1:17:19 Feature Importance is a technique that works for any kind of Machine Learning model.
Feature Importances are calculated by shuffling each column in turn and measuring the effect on the model's score (running predictions again).
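A minimal sketch of that shuffle-and-rescore idea (perm_importance is a hypothetical helper for illustration, not the fastai implementation):
import numpy as np

def perm_importance(model, X, y, col):
    baseline = model.score(X, y)                                # R^2 with the column intact
    X_shuf = X.copy()
    X_shuf[col] = np.random.permutation(X_shuf[col].values)     # destroy the column's information
    return baseline - model.score(X_shuf, y)                    # drop in score = importance

# e.g. perm_importance(m, X_valid, y_valid, 'YearMade')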
Lecture 4 RF Interpretation start.
More on RF Feature Importance -vs- Logistic Regression Coefficients.
You're much more often going to see people talk about Logistic Regression coefficients than you are to see them talk about Random Forest variable importance – and every time you see that happen you should be very, very skeptical.
With max_n_cat, any categorical variable with fewer than max_n_cat levels will be turned into a bunch of 1-hot encoded columns.
In [27]:
df_trn2, y_trn, nas = proc_df(df_raw, 'SalePrice', max_n_cat=7)
X_train, X_valid = split_vals(df_trn2, n_trn)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.6, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)
In this case the validation set RMSE & R2 got a little worse. This isn't always the case – it depends on the dataset, ie: whether it has categories where individual levels tend to be important on their own.
In this case 1-hot encoding didn't make our model more predictive. But it did create different features. proc_df names each new column "variablename_levelname".
Whereas before "Enclosure" was only somewhat important, when we 1-hot encode it, "Enclosure_EROPS w AC" becomes the most important feature.
For at least the purpose of interpreting your model, you should always try 1-hot encoding quite a few of your variables – 6 or 7 are often good.
In [28]:
fi = rf_feat_importance(m, df_trn2)
plot_fi(fi[:25]);
The number of levels in a category is its cardinality. proc_df checks the cardinality against max_n_cat and 1-hot encodes the variable if its cardinality is lower.
The original variable is not kept when it's 1-hot encoded.
If you have an ordinal variable (0,1,2 for Hi, Med, Low), proc_df will destroy its ordering if it's affected by max_n_cat. Convert it to numerical ahead of time via df.CatName = df.CatName.cat.codes. This will cause the numericalize step within proc_df to skip over it.
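For example (a sketch assuming an ordered categorical like UsageBand from the Lesson 1 notebook):
df_raw.UsageBand = df_raw.UsageBand.cat.set_categories(['Low', 'Medium', 'High'], ordered=True)
df_raw.UsageBand = df_raw.UsageBand.cat.codes   # Low=0, Medium=1, High=2 – now numeric, so proc_df leaves it as-is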
One thing that makes this harder to interpret is that there seem to be some variables with very similar meanings. Let's try to remove redundant features.
One way to do this is to use a dendrogram – a type of hierarchical clustering.
In hierarchical (aka agglomerative) clustering, we look at every pair of objects and ask which 2 are the closest – then replace them with the midpoint (average) of the two.
In [29]:
from scipy.cluster import hierarchy as hc
In [30]:
# create dendrogram
corr = np.round(scipy.stats.spearmanr(df_keep).correlation, 4)
corr_condensed = hc.distance.squareform(1 - corr)
z = hc.linkage(corr_condensed, method='average')
fig = plt.figure(figsize=(16,10))
dendrogram = hc.dendrogram(z, labels=df_keep.columns, orientation='left', leaf_font_size=16)
plt.show()
Lecture 4 – 57:40 The horizontal axis is similarity. saleYear and saleElapsed were combined early by the hierarchical clustering algorithm, so they're very similar.
Correlation is almost the same as R2, except it's between 2 variables, rather than between a variable and its prediction. Here Spearman's R was used as the correlation metric (the most common rank correlation).
Rank correlation: replace each point with its rank on the x & y axes (which turns any monotonic relationship into a straight line).
I want to find the columns that are similar in a way that the Random Forest would find them similar – RFs don't care about linearity, only ordering, so a rank correlation is the 'right way' to think about it.
The relationship does have to be monotonic though (parabolas won't work).
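A quick illustration of why ranks matter here (an assumed toy example, not from the lecture):
import numpy as np
import scipy.stats as st
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = x**3                                 # monotonic but non-linear
st.pearsonr(x, y)[0]                     # < 1: Pearson penalises the non-linearity
st.spearmanr(x, y).correlation           # == 1.0: the ranks match exactly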
Let's try removing some of these related features to see if the model can be simplified without impacting accuracy.
In [31]:
def get_oob(df):
m = RandomForestRegressor(n_estimators=30, min_samples_leaf=5, max_features=0.6, n_jobs=-1, oob_score=True)
x, _ = split_vals(df, n_trn)
m.fit(x, y_train)
return m.oob_score_
Here's our baseline:
In [32]:
get_oob(df_keep)
Out[32]:
We're going to try removing each of the most correlated variables one at a time and see which we can remove without worsening the OOB score.
In [33]:
for c in ('fiModelDesc', 'fiBaseModel', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow', 'saleElapsed', 'saleYear', ):
print(c, get_oob(df_keep.drop(c, axis=1)))
What if we try one from each group for removal:
In [36]:
to_drop = ['fiBaseModel', 'Grouser_Tracks', 'saleYear']
get_oob(df_keep.drop(to_drop, axis=1))
Out[36]:
Looking good – let's use this dataframe from here on. We'll save the list of columns so we can reuse it later.
In [37]:
df_keep.drop(to_drop, axis=1, inplace=True)
X_train, X_valid = split_vals(df_keep, n_trn)
In [38]:
np.save('tmp/keep_cols.npy', np.array(df_keep.columns))
In [39]:
keep_cols = np.load('tmp/keep_cols.npy')
df_keep = df_trn[keep_cols]
And let's see how this model looks on the full dataset.
In [40]:
reset_rf_samples() # use full dataset & bootstrapped samples
In [41]:
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)
We're getting a good score (better than before) with a smaller and simpler model now.
Now we're at the point where we want to really try to understand the data better by taking advantage of the model. We'll use something called Partial Dependence. V.few people know about this powerful technique.
What we're going to do is find out for the features that are important, how they relate to the dependent variable.
In [42]:
from pdpbox import pdp
from plotnine import *
In [43]:
set_rf_samples(50000)
This next analysis will be a little easier if we use the 1-hot encoded categorical variables, so let's load them up again.
In [47]:
df_trn2, y_trn, nas = proc_df(df_raw, 'SalePrice', max_n_cat=7)
X_train, X_valid = split_vals(df_trn2, n_trn)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.6, n_jobs=-1)
m.fit(X_train, y_train);
In [48]:
plot_fi(rf_feat_importance(m, df_trn2)[:10]);
YearMade is the 2nd most important variable. We'd expect it, together with saleElapsed, to tell us something about how old the product was when it was sold.
We can plot them to see how they relate to each other:
In [50]:
df_raw.plot('YearMade', 'saleElapsed', 'scatter', alpha=0.01, figsize=(10,8));
It's highly unlikely that tractors were made in the year 1000 ... so we can remove those.
In [56]:
x_all = get_sample(df_raw[df_raw.YearMade > 1930], 500) # grab 500 pts
Lecture 4 – 1:09:58 the ggplot library (here provided by plotnine) uses R's ggplot2 syntax.
In [57]:
ggplot(x_all, aes('YearMade', 'SalePrice')) + stat_smooth(se=True, method='loess')
Out[57]:
loess – locally-weighted regression (like doing lots of little linear regressions) – (JHoward uses this for univariate plots)
What we want to do is say what's the relationship between salePrice and YearMade – all other things being equal.
We do this with a Partial Dependence Plot:
In [58]:
x = get_sample(X_train[X_train.YearMade>1930], 500)
In [59]:
def plot_pdp(feat, clusters=None, feat_name=None):
feat_name = feat_name or feat
p = pdp.pdp_isolate(m, x, feat)
return pdp.pdp_plot(p, feat_name, plot_lines=True,
cluster=clusters is not None, n_cluster_centers=clusters)
In [60]:
plot_pdp('YearMade')
We've got our sample of 500 datapoints. We're going to take each one of those 500 randomly chosen auctions and make a dataset out of it.
Instead of shuffling a chosen column (as for feature importance), we're going to set the entire column to a single value, run that dataset through the pre-fit RF model to see what the predictions become, and do so for each value of the chosen column in our dataset.
Randomly shuffling the column tells us how accurate the model is when we can't use that column anymore. Replacing the whole column with a constant (say, 1961 for YearMade) instead estimates how much we would've sold each product for – in that auction, on that day, in that place, etc – if that product had been made in 1961.
So we then take the average of all of the sale prices that we calculate from that Random Forest.
Each light-blue line is a row in our dataset. The y-axis is the dependent variable (zero-indexed to the 1st value of our chosen column, so the prediction at 1900 becomes 0), and the x-axis is the values of our chosen independent variable (YearMade).
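That procedure is simple enough to sketch by hand (manual_pdp is a hypothetical helper for illustration, not the pdpbox implementation):
def manual_pdp(model, X_sample, col, values):
    means = []
    for v in values:
        X_tmp = X_sample.copy()
        X_tmp[col] = v                              # set the whole column to a single value
        means.append(model.predict(X_tmp).mean())   # average prediction across the sample
    return means

# e.g. manual_pdp(m, x, 'YearMade', range(1960, 2010))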
This Partial Dependence Plot concept is something which uses a Random Forest to get us a clearer interpretation of what's going on in our data.
Steps:
- look at feature importance to tell us what to care about
- use the partial dependence plot to tell us what's going on, on average.
We can also use clusters with pdp. clusters uses Cluster Analysis to look at each of the 500 rows and ask "do some of these 500 rows kind of move in the same way?"
In [62]:
plot_pdp('YearMade', clusters=5)
Here's the result of the Cluster Analysis. We still see the same average, but we also see the 5 most common patterns/shapes in the data.
You can do the same thing in a PDP-Interaction Plot. In this example: how do saleElapsed & YearMade together impact price?
y-axis: zero-indexed Log(price); the last plot shows saleElapsed vs YearMade, with a Log(price) colour bar.
In [63]:
feats = ['saleElapsed', 'YearMade']
p = pdp.pdp_interact(m, x, feats)
pdp.pdp_interact_plot(p, feats)
If you have 1-hot encoded variables, you can pass an array of them to plot_pdp and it'll treat them as categories.
In [73]:
plot_pdp(['Enclosure_EROPS w AC', 'Enclosure_EROPS', 'Enclosure_OROPS'], 5, 'Enclosure')
Machines with Enclosure_EROPS w AC are on average more expensive than Enclosure_EROPS, which are on avg slightly more expensive than Enclosure_OROPS.
Based on all these interactions, let's try setting everything made before 1950 to 1950 (there seem to be missing dates coded as 1000 or 1900), and create an 'age' feature. Then run an RF on that and see how important age becomes in the sale of construction equipment.
In [74]:
df_raw.loc[df_raw.YearMade<1950, 'YearMade'] = 1950   # .loc avoids chained assignment
df_keep['age'] = df_raw['age'] = df_raw.saleYear-df_raw.YearMade
In [75]:
X_train, X_valid = split_vals(df_keep, n_trn)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.6, n_jobs=-1)
m.fit(X_train, y_train)
plot_fi(rf_feat_importance(m, df_keep));
Tree Interpretation may be unimportant for Kaggle competitions but it is vitally important for real-world applications.
Tree Interpreter allows us to take a particular row (in this case row 0): it lets you call ti.predict, passing in your RF and the row (a particular auction or customer's information), and it'll display 3 things:
- The RF's prediction
- The Bias (avg. sale price across whole original dataset)
- The Contributions
To find out why/how we arrived at a specific prediction in a Random Forest, we can simply trace the leaf node's prediction back through to the root of the tree. contributions shows each split point and value from root to leaf that led the RF to make this specific prediction (see the sanity check after the cells below).
In [77]:
from treeinterpreter import treeinterpreter as ti
In [78]:
df_train, df_valid = split_vals(df_raw[df_keep.columns], n_trn)
In [79]:
row = X_valid.values[None, 0]; row
Out[79]:
In [80]:
prediction, bias, contributions = ti.predict(m, row)
In [81]:
prediction[0], bias[0]
Out[81]:
In [82]:
idxs = np.argsort(contributions[0])
In [83]:
[o for o in zip(df_keep.columns[idxs], df_valid.iloc[0][idxs], contributions[0][idxs])]
Out[83]:
In [84]:
contributions[0].sum()
Out[84]:
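The three pieces add up: bias plus the summed contributions reconstructs the prediction for this row. A quick sanity check (an assumed addition, not part of the lecture notebook):
np.allclose(prediction[0], bias[0] + contributions[0].sum())   # True: prediction = bias + sum(contributions)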
One last check is extrapolation: our validation set is later in time than our training set, so some features' distributions may have drifted. To find them, we add an is_valid column (1 for validation rows, 0 for training rows) and train a classifier to predict it – whatever it relies on to tell the two apart is time-dependent.
In [85]:
df_ext = df_keep.copy()
df_ext['is_valid'] = 1
df_ext.is_valid[:n_trn] = 0
x, y, nas = proc_df(df_ext, 'is_valid')
In [86]:
m = RandomForestClassifier(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(x, y);
m.oob_score_
Out[86]:
In [87]:
fi = rf_feat_importance(m, x); fi[:10]
Out[87]:
The features it finds most useful – SalesID, saleElapsed, MachineID – are identifiers and dates, which naturally drift over time. Compare their distributions (scaled by 1000) in the training vs validation sets:
In [88]:
feats=['SalesID', 'saleElapsed', 'MachineID']
In [89]:
(X_train[feats]/1000).describe()
Out[89]:
In [90]:
(X_valid[feats]/1000).describe()
Out[90]:
In [91]:
x.drop(feats, axis=1, inplace=True)
In [92]:
m = RandomForestClassifier(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(x, y);
m.oob_score_
Out[92]:
In [93]:
fi = rf_feat_importance(m, x); fi[:10]
Out[93]:
In [94]:
set_rf_samples(50000)
In [98]:
feats=['SalesID', 'saleElapsed', 'MachineID', 'age', 'YearMade', 'saleDayofyear']
In [99]:
X_train, X_valid = split_vals(df_keep, n_trn)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)
Now drop each of these time-dependent features one at a time and see how the validation score changes:
In [100]:
for f in feats:
df_subs = df_keep.drop(f, axis=1)
X_train, X_valid = split_vals(df_subs, n_trn)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print(f)
print_score(m)
In [101]:
reset_rf_samples()
Based on those results, drop SalesID, MachineID and saleDayofyear and retrain on the full dataset:
In [102]:
df_subs = df_keep.drop(['SalesID', 'MachineID', 'saleDayofyear'], axis=1)
X_train, X_valid = split_vals(df_subs, n_trn)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)
In [103]:
plot_fi(rf_feat_importance(m, X_train));
In [104]:
np.save('tmp/subs_cols.npy', np.array(df_subs.columns))
Finally, a bigger forest (160 trees) on this final set of columns:
In [105]:
m = RandomForestRegressor(n_estimators=160, max_features=0.5, n_jobs=-1, oob_score=True)
%time m.fit(X_train, y_train)
print_score(m)