Classic machine learning algorithms map multiple inputs to a single output. For example, you might have five photometric observations of a galaxy, and predict a single attribute or label (like the redshift, metallicity, etc.) When multiple ouputs are desired, standard practice is to essentially run two independent classifications: first predict one variable, then the next. The problem with this approach is that it completely ignores correlations in the outputs.
This is my Thursday hack, which was to explore ideas to improve on this within Random Forests.
First we need some toy data: some inputs X which are associated with multiple outputs Y. I'll do this by generating some random Y values and creating some noisy X values which map to them:
In :%matplotlib inline import matplotlib.pyplot as plt import numpy as np from sklearn.ensemble import RandomForestRegressor import seaborn as sns sns.set()
In :def create_data(N=1000): Y = np.vstack([np.random.normal([3, 8], [1, 3], (N // 2, 2)), np.random.normal([8, 3], [3, 1], (N - N // 2, 2))]) X = np.vstack([0.1 * Y[:, 0] * Y[:, 1], np.sin(Y[:, 0]) * np.cos(Y[:, 1]), np.cos(Y[:, 0]) * np.sin(Y[:, 1])]).T X = np.random.normal(X, 1.0) return X, Y
Let's create a thousand points and look at the correlations in the outputs:
In :np.random.seed(0) X, Y = create_data(1000)
In :fig, ax = plt.subplots() ax.plot(Y[:, 0], Y[:, 1], 'o', alpha=0.5) ax.add_patch(plt.Rectangle((6, 6), 14, 14, color='yellow', alpha=0.2)) ax.set_xlim(0, 20); ax.set_ylim(0, 20) ax.set_xlabel('$y_1$'); ax.set_ylabel('$y_2$');
This shows that the true output values obey an "exclusion zone", which we can imagine would be due to some astrophysical constraint in the system.
Let's see how the classic random forest multi-output solution works on this. We'll start by doing a simple split of our data into a training and testing set:
In :from sklearn.cross_validation import train_test_split Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, train_size=0.5)
In :from sklearn.ensemble import RandomForestRegressor clf1 = RandomForestRegressor(100).fit(Xtrain, Ytrain) Ypred1 = clf1.predict(Xtest) fig, ax = plt.subplots() ax.plot(Ypred1[:, 0], Ypred1[:, 1], 'o', alpha=0.5) ax.add_patch(plt.Rectangle((6, 6), 14, 14, color='yellow', alpha=0.2)) ax.set_xlim(2, 12); ax.set_ylim(2, 12) ax.set_xlabel('$y_1$'); ax.set_ylabel('$y_2$');
No good! The independent estimates don't take into account the correlation, and many of our estimates bleed into the exclusion zone!
How might we hack around this? My quick-and-dirty idea was to first fit for $y_1$, and then add this information into the second fit. That is, instead of doing
we instead do
By daisy-chaining our estimates, we should get an output which "knows" about this exclusion zone.
Here is the multi output random forest regressor:
In :class MultiOutputRF(object): def __init__(self, *args, **kwargs): self.args = args self.kwargs = kwargs def fit(self, X, Y): X, Y = map(np.atleast_2d, (X, Y)) assert X.shape == Y.shape Ny = Y.shape self.clfs =  for i in range(Ny): clf = RandomForestRegressor(*self.args, **self.kwargs) Xi = np.hstack([X, Y[:, :i]]) yi = Y[:, i] self.clfs.append(clf.fit(Xi, yi)) return self def predict(self, X): Y = np.empty([X.shape, len(self.clfs)]) for i, clf in enumerate(self.clfs): Y[:, i] = clf.predict(np.hstack([X, Y[:, :i]])) return Y
Let's try this on our result, comparing to the previous plot
In :clf2 = MultiOutputRF(100).fit(Xtrain, Ytrain) Ypred2 = clf2.predict(Xtest)
In :fig, ax = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True) ax.plot(Ypred1[:, 0], Ypred1[:, 1], 'o', alpha=0.7) ax.plot(Ypred2[:, 0], Ypred2[:, 1], 'o', alpha=0.7) ax.set_title("Standard method") ax.set_title("Daisy-chain method") for axi in ax: axi.add_patch(plt.Rectangle((6, 6), 14, 14, color='yellow', alpha=0.2)) axi.set_xlim(2, 12) axi.set_ylim(2, 12)
The results are what we'd hope! Now, there are probably more principled ways to do this sort of thing, but this demonstrates one way to start thinking about multi-output classifiers and regressors. My intuition is that you could make some interesting headway by instead using a generative model: for example a Gaussian Process Regression which takes into account correlations in the output space as well as the input space. Perhaps I'll think about this more at some point...