gcForest Algorithm

The gcForest algorithm was suggested in Zhou and Feng 2017 ( https://arxiv.org/abs/1702.08835 , refer for this paper for technical details) and I provide here a python3 implementation of this algorithm.
I chose to adopt the scikit-learn syntax for ease of use and hereafter I present how it can be used.



In [1]:

    
from GCForest import gcForest
from sklearn.datasets import load_iris, load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Before starting, a word about sizes.

*Note* : I recommend the reader to look at this section with the original paper next to the computer to see what I am talking about.

The main technical problem in the present gcForest implementation so far is the memory usage when slicing the input data. A naive calculation can actually give you an idea of the number and sizes of objects the algorithm will be dealing with.

Starting with a dataset of $N$ samples of size $[l,L]$ and with $C$ classes, the initial "size" is:

$S_{D} = N.l.L$

**Slicing Step**
If my window is of size $[w_l,w_L]$ and the chosen stride are $[s_l,s_L]$ then the number of slices per sample is :

$n_{slices} = \left(\frac{l-w_l}{s_l}+1\right)\left(\frac{L-w_L}{s_L}+1\right)$

Obviously the size of slice is $w_l.w_L$ hence the total size of the sliced data set is :

$S_{sliced} = N.w_l.w_L.\left(\frac{l-w_l}{s_l}+1\right)\left(\frac{L-w_L}{s_L}+1\right)$
This is when the memory consumption is its peak maximum.

**Class Vector after Multi-Grain Scanning**
Now all slices are fed to the random forest to generate *class vectors*. The number of class vector per random forest per window per sample is simply equal to the number of slices given to the random forest $n_{cv}(w) = n_{slices}(w)$. Hence, if we have $N_{RF}$ random forest per window the size of a class vector is (recall we have $N$ samples and $C$ classes):

$S_{cv}(w) = N.n_{cv}(w).N_{RF}.C$

And finally the total size of the Multi-Grain Scanning output will be:

$S_{mgs} = N.\sum_{w} N_{RF}.C.n_{cv}(w)$

This short calculation is just meant to give you an idea of the data processing during the Multi-Grain Scanning phase. The actual memory consumption depends on the format given (aka float, int, double, etc.) and it might be worth looking at it carefully when dealing with large datasets.

Iris example

The iris data set is actually not a very good example as the gcForest algorithm is better suited for time series and images where informations can be found at different scales in one sample.
Nonetheless it is still an easy way to test the method.



In [2]:

    
# loading the data
iris = load_iris()
X = iris.data
y = iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33)

First calling and training the algorithm. A specificity here is the presence of the 'shape_1X' keyword to specify the shape of a single sample. I have added it as pictures fed to the machinery might not be square.
Obviously it is not very relevant for the iris data set but still, it has to be defined.

**New in version 0.1.3** : possibility to directly use an int as shape_1X for sequence data.



In [3]:

    
gcf = gcForest(shape_1X=4, window=2, tolerance=0.0)
gcf.fit(X_tr, y_tr)









    



Slicing Sequence...
Training MGS Random Forests...
Adding/Training Layer, n_layer=1
Layer validation accuracy = 1.0
Adding/Training Layer, n_layer=2
Layer validation accuracy = 1.0

Now checking the prediction for the test set:



In [4]:

    
pred_X = gcf.predict(X_te)
print(pred_X)









    



Slicing Sequence...
[0 1 0 2 1 2 1 0 0 1 0 2 2 2 2 2 1 0 1 0 2 2 2 0 0 0 0 2 0 2 0 0 2 0 1 0 0
 1 1 2 2 1 2 1 0 0 2 1 0 2]



In [5]:

    
# evaluating accuracy
accuracy = accuracy_score(y_true=y_te, y_pred=pred_X)
print('gcForest accuracy : {}'.format(accuracy))









    



gcForest accuracy : 0.96

Digits Example

A much better example is the digits data set containing images of hand written digits. The scikit data set can be viewed as a mini-MNIST for training purpose.



In [6]:

    
# loading the data
digits = load_digits()
X = digits.data
y = digits.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4)

... taining gcForest ... (can take some time...)



In [7]:

    
gcf = gcForest(shape_1X=[8,8], window=[4,6], tolerance=0.0, min_samples_mgs=10, min_samples_cascade=7)
gcf.fit(X_tr, y_tr)









    



Slicing Images...
Training MGS Random Forests...
Slicing Images...
Training MGS Random Forests...
Adding/Training Layer, n_layer=1
Layer validation accuracy = 0.9861111111111112
Adding/Training Layer, n_layer=2
Layer validation accuracy = 0.9861111111111112

... and predicting classes ...



In [8]:

    
pred_X = gcf.predict(X_te)
print(pred_X)









    



Slicing Images...
Slicing Images...
[0 1 0 2 4 3 7 1 4 6 9 5 5 2 6 4 7 7 5 2 1 6 8 6 4 3 3 3 0 9 3 7 3 4 5 2 5
 4 2 1 3 6 9 8 5 8 6 2 2 7 8 6 0 3 0 9 5 0 4 6 7 0 3 8 0 6 9 1 8 3 9 1 1 3
 7 3 5 6 6 6 9 4 0 8 7 3 0 0 4 2 7 0 4 7 9 7 9 6 4 4 7 0 2 2 5 9 4 5 7 7 7
 7 0 6 0 1 6 3 9 4 7 3 1 1 3 6 3 0 3 8 8 1 9 7 6 4 7 5 5 3 5 4 8 3 4 4 6 4
 0 5 9 6 8 5 1 3 1 9 9 5 4 0 4 1 1 7 9 2 3 5 5 6 5 4 6 3 7 5 1 9 6 8 1 7 4
 9 7 3 0 7 3 6 5 9 7 7 9 7 2 3 8 8 7 7 9 4 5 9 0 7 6 1 4 5 7 6 6 0 4 5 9 3
 6 4 0 6 4 6 4 9 1 3 7 3 4 8 8 0 7 4 1 5 5 8 2 0 7 7 6 4 5 8 1 2 0 4 5 7 4
 8 5 4 5 8 3 5 9 6 2 7 6 7 1 3 4 1 0 6 9 4 0 7 6 0 0 9 2 7 4 3 9 3 9 0 4 1
 0 2 9 0 9 0 6 2 4 0 1 6 9 0 2 0 9 1 1 5 3 4 2 9 6 1 6 1 3 9 1 1 8 6 4 9 3
 5 5 2 1 4 8 8 2 2 9 0 9 4 4 2 0 6 2 6 1 7 4 6 7 7 1 5 6 9 7 2 6 0 5 8 7 5
 8 4 2 6 9 2 5 6 5 7 4 3 1 2 7 9 5 3 0 2 6 9 9 1 6 0 9 4 5 7 8 4 8 9 6 5 6
 6 2 1 4 3 3 5 8 2 0 6 5 9 8 7 4 8 6 1 0 1 8 4 3 9 4 5 5 7 9 8 6 1 7 0 8 2
 5 2 5 9 8 4 0 8 7 3 9 7 4 5 5 3 7 4 7 6 0 8 0 9 5 6 0 3 1 6 7 6 4 6 4 7 6
 0 0 6 7 7 2 2 9 8 9 8 8 5 5 4 1 4 6 3 8 8 0 0 1 4 4 7 3 3 4 7 1 3 3 9 2 4
 0 9 4 9 1 5 1 4 4 5 8 5 7 6 8 8 0 8 1 4 6 0 2 0 1 6 0 3 0 8 1 2 3 2 5 0 4
 0 2 1 9 2 3 2 6 3 9 9 2 6 8 3 0 8 0 5 8 4 2 2 4 6 5 7 1 2 6 9 1 2 8 8 7 5
 9 1 5 0 2 2 5 6 7 0 7 3 5 5 2 2 7 1 1 7 0 3 7 4 2 1 3 2 9 1 7 4 6 3 1 3 2
 7 6 9 2 0 4 6 3 0 7 1 8 4 6 0 7 3 6 3 0 6 3 1 1 5 0 8 9 2 3 0 5 5 3 9 0 9
 9 8 6 8 7 8 8 3 1 6 1 4 2 6 0 4 1 4 3 2 1 4 8 8 2 3 3 2 3 9 4 5 3 7 4 2 2
 9 2 8 0 9 3 9 3 2 7 1 6 0 7 0 8]



In [9]:

    
# evaluating accuracy
accuracy = accuracy_score(y_true=y_te, y_pred=pred_X)
print('gcForest accuracy : {}'.format(accuracy))









    



gcForest accuracy : 0.9847009735744089

Saving Models to Disk

You probably don't want to re-train your classifier every day especially if you're using it on large data sets. Fortunately there is a very easy way to save and load models to disk using ```sklearn.externals.joblib```

__Saving model:__



In [10]:

    
from sklearn.externals import joblib
joblib.dump(gcf, 'gcf_model.sav')









    Out[10]:





['gcf_model.sav']

__Loading model__:



In [11]:

    
gcf = joblib.load('gcf_model.sav')

Using mg-scanning and cascade_forest Sperately

As the Multi-Grain scanning and the cascade forest modules are quite independent it is possible to use them seperately.
If a target `y` is given the code automaticcaly use it for training otherwise it recalls the last trained Random Forests to slice the data.



In [12]:

    
gcf = gcForest(shape_1X=[8,8], window=5, min_samples_mgs=10, min_samples_cascade=7)
X_tr_mgs = gcf.mg_scanning(X_tr, y_tr)









    



Slicing Images...
Training MGS Random Forests...



In [13]:

    
X_te_mgs = gcf.mg_scanning(X_te)









    



Slicing Images...

It is now possible to use the mg_scanning output as input for cascade forests using different parameters. Note that the cascade forest module does not directly return predictions but probability predictions from each Random Forest in the last layer of the cascade. Hence the need to first take the mean of the output and then find the max.



In [14]:

    
gcf = gcForest(tolerance=0.0, min_samples_mgs=10, min_samples_cascade=7)
_ = gcf.cascade_forest(X_tr_mgs, y_tr)









    



Adding/Training Layer, n_layer=1
Layer validation accuracy = 0.9814814814814815
Adding/Training Layer, n_layer=2
Layer validation accuracy = 0.9814814814814815



In [15]:

    
pred_proba = gcf.cascade_forest(X_te_mgs)
tmp = np.mean(pred_proba, axis=0)
preds = np.argmax(tmp, axis=1)
accuracy_score(y_true=y_te, y_pred=preds)









    Out[15]:





0.97635605006954107



In [16]:

    
gcf = gcForest(tolerance=0.0, min_samples_mgs=20, min_samples_cascade=10)
_ = gcf.cascade_forest(X_tr_mgs, y_tr)









    



Adding/Training Layer, n_layer=1
Layer validation accuracy = 0.9675925925925926
Adding/Training Layer, n_layer=2
Layer validation accuracy = 0.9722222222222222
Adding/Training Layer, n_layer=3
Layer validation accuracy = 0.9722222222222222



In [17]:

    
pred_proba = gcf.cascade_forest(X_te_mgs)
tmp = np.mean(pred_proba, axis=0)
preds = np.argmax(tmp, axis=1)
accuracy_score(y_true=y_te, y_pred=preds)









    Out[17]:





0.97635605006954107

Skipping mg_scanning

It is also possible to directly use the cascade forest and skip the multi grain scanning step.



In [18]:

    
gcf = gcForest(tolerance=0.0, min_samples_cascade=20)
_ = gcf.cascade_forest(X_tr, y_tr)









    



Adding/Training Layer, n_layer=1
Layer validation accuracy = 0.9583333333333334
Adding/Training Layer, n_layer=2
Layer validation accuracy = 0.9583333333333334



In [19]:

    
pred_proba = gcf.cascade_forest(X_te)
tmp = np.mean(pred_proba, axis=0)
preds = np.argmax(tmp, axis=1)
accuracy_score(y_true=y_te, y_pred=preds)









    Out[19]:





0.95827538247566069



In [ ]: