As an experiment, we'll generate a synthetic dataset that should be particularly well suited to modeling with random forest regression.
In [195]:
import numpy as np
import pandas as pd
In [196]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
In [197]:
import matplotlib.pyplot as plt
In [198]:
n = 2000
df = pd.DataFrame({
    'a': np.random.normal(size=n),   # three standard-normal features
    'b': np.random.normal(size=n),
    'c': np.random.normal(size=n),
    'd': np.random.uniform(size=n),  # two uniform features
    'e': np.random.uniform(size=n),
    'f': np.random.choice(list('abc'), size=n, replace=True),     # two categorical features,
    'g': np.random.choice(list('efghij'), size=n, replace=True),  # one-hot encoded below
})
df = pd.get_dummies(df)
In [199]:
df.head()
Out[199]:
The target variable is a little bit complicated: the dummy columns derived from f select which of the continuous variables a, b, or c comes into play.
In [200]:
y = df.a*df.f_a + df.b*df.f_b + df.c*df.f_c + 3*df.d + np.random.normal(scale=1/3, size=n)
To calibrate our expectations: even with perfect insight into how y was generated, there is still irreducible noise. How well could we do in the best case?
In [205]:
y_pred = df.a*df.f_a + df.b*df.f_b + df.c*df.f_c + 3*df.d
r2_score(y, y_pred)
Out[205]:
In [206]:
plt.scatter(y, y_pred)
Out[206]:
In [207]:
# one way to assign train/validation/test labels: sample them with the desired probabilities
i = np.random.choice((1,2,3), size=n, replace=True, p=(3/5,1/5,1/5))
In [208]:
# instead, build the labels with exact 60/20/20 counts, then shuffle
i = np.array([1]*int(3/5*n) + [2]*int(1/5*n) + [3]*int(1/5*n))
np.random.shuffle(i)
len(i) == n
Out[208]:
In [209]:
df_train = df[i==1]
df_val = df[i==2]
df_test = df[i==3]
y_train = y[i==1]
y_val = y[i==2]
y_test = y[i==3]
len(df_train), len(df_val), len(df_test)
Out[209]:
In [210]:
# 200 trees; oob_score=True reports R^2 estimated from the out-of-bag samples
regr = RandomForestRegressor(n_estimators=200, oob_score=True)
regr.fit(df_train, y_train)
print(regr.oob_score_)
We could also cross-validate to choose parameters like max_depth and min_samples_leaf; a rough sketch using the validation set follows.
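As a sketch only (this wasn't run here, and the parameter grids are arbitrary choices rather than tuned values), one way to do this is a small grid search scored on the held-out validation set:

In [ ]:
# sketch: score a few (max_depth, min_samples_leaf) combinations on the validation set
best = None
for max_depth in (None, 5, 10, 20):
    for min_samples_leaf in (1, 5, 20):
        candidate = RandomForestRegressor(
            n_estimators=200,
            max_depth=max_depth,
            min_samples_leaf=min_samples_leaf,
        )
        candidate.fit(df_train, y_train)
        score = r2_score(y_val, candidate.predict(df_val))
        if best is None or score > best[0]:
            best = (score, max_depth, min_samples_leaf)
print(best)  # (validation R^2, max_depth, min_samples_leaf)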
In [211]:
y_pred = regr.predict(df_test)
In [212]:
r2_score(y_test, y_pred)
Out[212]:
In [213]:
plt.scatter(y_test, y_pred)
Out[213]:
The idea was that the random forest could capture the interaction with the categorical variable. If that's true, it should outperform standard linear regression, which can't represent that interaction unless we add explicit interaction columns (sketched after the results below). Let's fit a linear regression to the same data and see how it does.
In [223]:
from sklearn.linear_model import LinearRegression
In [224]:
lm = LinearRegression()
lm.fit(df_train, y_train)
Out[224]:
In [225]:
y_pred = lm.predict(df_test)
In [226]:
r2_score(y_test, y_pred)
Out[226]:
In [227]:
plt.scatter(y_test, y_pred)
Out[227]:
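For reference, here is a sketch (not part of the original comparison; the new column names are made up for illustration) of how we could hand the linear model the a·f_a-style interaction columns explicitly. With those added, a linear model should also be able to represent the true relationship.

In [ ]:
# sketch: add explicit interaction columns so the linear model can express a*f_a etc.
df_int = df.copy()
df_int['a_f_a'] = df.a * df.f_a
df_int['b_f_b'] = df.b * df.f_b
df_int['c_f_c'] = df.c * df.f_c

lm_int = LinearRegression()
lm_int.fit(df_int[i == 1], y_train)
print(r2_score(y_test, lm_int.predict(df_int[i == 3])))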
As a contrast, let's repeat the comparison on a dataset with no such interaction, where the target is a plain linear function of the (one-hot encoded) features. Here a linear model should do at least as well as the random forest.
In [228]:
n = 2000
df = pd.DataFrame({
    'a': np.random.normal(size=n),
    'b': np.random.normal(size=n),
    'c': np.random.choice(list('xyz'), size=n, replace=True),
})
df = pd.get_dummies(df)
df.head()
Out[228]:
In [181]:
y = 0.234*df.a - 0.678*df.b - 0.456*df.c_x + 0.123*df.c_y + 0.811*df.c_z + np.random.normal(scale=1/3, size=n)
In [182]:
i = np.random.choice((1,2,3), size=n, replace=True, p=(3/5,1/5,1/5))
In [183]:
i = np.array([1]*int(3/5*n) + [2]*int(1/5*n) + [3]*int(1/5*n))
np.random.shuffle(i)
len(i) == n
Out[183]:
In [184]:
df_train = df[i==1]
df_val = df[i==2]
df_test = df[i==3]
y_train = y[i==1]
y_val = y[i==2]
y_test = y[i==3]
len(df_train), len(df_val), len(df_test)
Out[184]:
In [185]:
# fewer trees this time; the out-of-bag estimate will be noisier with only 10 trees
regr = RandomForestRegressor(n_estimators=10, oob_score=True)
regr.fit(df_train, y_train)
print(regr.oob_score_)
In [186]:
y_pred = regr.predict(df_test)
In [187]:
r2_score(y_test, y_pred)
Out[187]:
In [188]:
plt.scatter(y_test, y_pred)
Out[188]:
In [189]:
lm = LinearRegression()
lm.fit(df_train, y_train)
Out[189]:
In [190]:
y_pred = lm.predict(df_test)
In [191]:
r2_score(y_test, y_pred)
Out[191]:
In [192]:
plt.scatter(y_test, y_pred)
Out[192]: