This notebook walks through testing different regression techniques and demonstrates the Diagnostics class, which can be used to find the optimal parameters for COBRA.
For now, we will generate a random data-set and try some popular regression techniques on it after it has been loaded into COBRA.
In [1]:
from pycobra.cobra import Cobra
from pycobra.diagnostics import Diagnostics
import numpy as np
%matplotlib inline
In [2]:
# setting up our random data-set
rng = np.random.RandomState(1)
# D1 = train machines; D2 = create COBRA; D3 = calibrate epsilon, alpha; D4 = testing
n_features = 20
D1, D2, D3, D4 = 200, 200, 200, 200
D = D1 + D2 + D3 + D4
X = rng.uniform(-1, 1, D * n_features).reshape(D, n_features)
Y = np.power(X[:,1], 2) + np.power(X[:,3], 3) + np.exp(X[:,10])
# Y = np.power(X[:,0], 2) + np.power(X[:,1], 3)
# training data-set
X_train = X[:D1 + D2]
X_test = X[D1 + D2 + D3:D1 + D2 + D3 + D4]
X_eps = X[D1 + D2:D1 + D2 + D3]
# for testing
Y_train = Y[:D1 + D2]
Y_test = Y[D1 + D2 + D3:D1 + D2 + D3 + D4]
Y_eps = Y[D1 + D2:D1 + D2 + D3]
In [3]:
cobra = Cobra(random_state=0, epsilon=0.5)
cobra.fit(X_train, Y_train, default=False)
Out[3]:
When fitting, we initialise COBRA with an epsilon value of $0.5$. Since we are aware of the distribution and the data varies from $-1$ to $1$, $0.5$ is a fair guess of what a "good" epsilon value would be.
If we do not pass the $\epsilon$ parameter, a cross-validation is performed on the training data to find an optimised epsilon.
Notice that the default parameter is set to False: this is so we can walk you through what happens when COBRA is set up, instead of the default settings being used.
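As an aside, here is a minimal sketch of the fully automatic path, assuming the defaults behave as described above: with no epsilon passed, it is found by cross-validation, and with default left at True the data split and machine loading happen internally.

# a minimal sketch, assuming the default path handles the split and
# machine loading itself, and finds epsilon by cross-validation
cobra_auto = Cobra(random_state=0)
cobra_auto.fit(X_train, Y_train)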
We're now going to split our data-set into two parts and shuffle the data points.
In [4]:
cobra.split_data(D1, D1 + D2, shuffle_data=True)
Out[4]:
Let's load the default machines to COBRA.
In [5]:
cobra.load_default()
Out[5]:
We note here that further machines can be loaded using either the loadMachine() or loadSKMachine() methods. The only prerequisite is that the machine has a valid predict() method.
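For illustration, a hedged sketch of loading a custom scikit-learn machine; the exact signature of loadSKMachine() is not shown in this notebook, so the (name, fitted estimator) argument order is an assumption.

# hypothetical usage sketch: loadSKMachine's signature is assumed to be
# (name, fitted_estimator); any object with a valid predict() should work
from sklearn.svm import SVR
svr = SVR()
svr.fit(X_train[:D1], Y_train[:D1])  # train on the D1 part, like the default machines
cobra.loadSKMachine('svr', svr)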
In [6]:
query = X_test[9].reshape(1, -1)
In [7]:
cobra.machines_
Out[7]:
In [8]:
cobra.machines_['lasso'].predict(query)
Out[8]:
In [9]:
cobra.machines_['tree'].predict(query)
Out[9]:
In [10]:
cobra.machines_['ridge'].predict(query)
Out[10]:
In [11]:
cobra.machines_['random_forest'].predict(query)
Out[11]:
In [12]:
cobra.load_machine_predictions()
Out[12]:
In [13]:
cobra.predict(query)
Out[13]:
In [14]:
Y_test[9]
Out[14]:
Let's check the MSE for each of COBRA's machines:
In [15]:
cobra_diagnostics = Diagnostics(cobra, X_test, Y_test, load_MSE=True)
In [16]:
cobra_diagnostics.machine_MSE
Out[16]:
This error is bounded, up to a constant $C$ which is problem-dependent, by $C\,\ell^{-\frac{2}{M+2}}$. For more details, we refer the user to the original paper.
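To make the rate concrete, here is an illustrative computation of the $\ell^{-\frac{2}{M+2}}$ term for our set-up; since the constant $C$ is problem-dependent and unknown, it is left out (equivalently, set to $1$).

# illustrative only: C is unknown, so we compute just the rate term
M = len(cobra.machines_)  # number of machines, 4 here
l = D2                    # size of the D_l part used for aggregation
rate = l ** (-2.0 / (M + 2))
print(rate)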
In [17]:
cobra_diagnostics.error_bound
Out[17]:
When we initially set up COBRA, we split our training data into two further parts, $D_k$ and $D_l$. This split was done 50-50, but it is up to us how we wish to do it. The following section compares 20-80, 60-40, 50-50, 40-60 and 80-20 splits to check which gives the best MSE values, for a fixed epsilon (or over a grid).
In [18]:
cobra_diagnostics.optimal_split(X_eps, Y_eps)
Out[18]:
What we saw was the default result, with the optimal split ratio and the corresponding MSE. We can take the analysis further by enabling the info and graph options, and by using more split values.
In [19]:
split = [(0.05, 0.95), (0.10, 0.90), (0.20, 0.80), (0.40, 0.60), (0.50, 0.50), (0.60, 0.40), (0.80, 0.20), (0.90, 0.10), (0.95, 0.05)]
In [20]:
cobra_diagnostics.optimal_split(X_eps, Y_eps, split=split, info=True, graph=True)
Out[20]:
The following methods identify the optimal epsilon values, alpha values, and combinations of machines. The grid methods let us find, for a single point, the optimal alpha/machines and epsilon combination.
The epsilon parameter controls how "strict" COBRA should be while choosing the points to aggregate.
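To build intuition for the role of epsilon, here is a conceptual sketch of the consensus rule (illustrative only, not pycobra's internal implementation): a point from $D_l$ is retained only if enough machines place it within epsilon of their prediction for the query.

# conceptual sketch of the COBRA consensus rule, not pycobra internals:
# keep a D_l point if at least `alpha` machines predict it within
# `epsilon` of their prediction for the query, then average the labels
def cobra_consensus(preds_dl, preds_query, y_dl, epsilon, alpha):
    # preds_dl: (n_machines, n_points) machine predictions on D_l
    # preds_query: (n_machines,) machine predictions for the query point
    close = np.abs(preds_dl - preds_query[:, None]) <= epsilon
    keep = close.sum(axis=0) >= alpha
    return y_dl[keep].mean() if keep.any() else np.nan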
In [21]:
cobra_diagnostics.optimal_epsilon(X_eps, Y_eps, line_points=100)
Out[21]:
In [22]:
cobra_diagnostics.optimal_alpha(X_eps, Y_eps, info=True)
Out[22]:
In this particular case, the best performance is obtained by seeking consensus over all 4 machines.
In [23]:
cobra_diagnostics.optimal_machines(X_eps, Y_eps, info=True)
Out[23]:
In [24]:
cobra_diagnostics.optimal_alpha_grid(X_eps[0], Y_eps[0], line_points=100)
Out[24]:
In [25]:
cobra_diagnostics.optimal_machines_grid(X_eps[0], Y_eps[0], line_points=100)
Out[25]:
Increasing the number of line points helps in finding a better optimal value. These are the results for the same point; the MSE is the second value of each outer tuple.
With 10: ((('ridge', 'random_forest', 'lasso'), 1.1063905961135443), 0.96254542159345469)
With 20: ((('tree', 'random_forest'), 0.87346626008964035), 0.53850941611803993)
With 50: ((('ridge', 'tree'), 0.94833479666875231), 0.48256303899450931)
With 100: ((('ridge', 'tree', 'random_forest'), 0.10058096328304948), 0.30285776885759158)
With 200: ((('ridge', 'tree', 'lasso'), 0.10007553130675276), 0.30285776885759158)
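A hedged way to reproduce this comparison yourself, assuming you are willing to wait (larger line_points values take longer to run):

# sweep line_points to see how the optimal machine set and MSE change
for lp in (10, 20, 50, 100, 200):
    print(lp, cobra_diagnostics.optimal_machines_grid(X_eps[0], Y_eps[0], line_points=lp))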
pycobra is licensed under the MIT License - https://opensource.org/licenses/MIT