Transferring data to and from the JVM can be remarkably costly. In this notebook, we quantify how large the transfer cost is relative to the training cost as a function of training set size. We will use a standard ML problem: predicting the formation energy of inorganic compounds from their composition, using a dataset from the Materials Project.
In [1]:
    
%matplotlib inline
from matplotlib import pyplot as plt
from tqdm import tqdm_notebook as tqdm
from matminer.datasets.dataset_retrieval import load_dataset
from matminer.featurizers.composition import ElementProperty
from lolopy.learners import RandomForestRegressor
from lolopy.loloserver import find_lolo_jar
from sklearn.ensemble import RandomForestRegressor as SKRFRegressor
from scipy.interpolate import interp1d
from subprocess import PIPE, Popen
from pymatgen import Composition
from time import perf_counter
import pandas as pd
import numpy as np
    
Make a timing function.
In [2]:
    
def time_function(fun, n, *args, **kwargs):
    """Run a function repeatedly and return the timing of each call
    
    Args:
        fun (function): Function to be evaluated
        n (int): Number of times to run the function
        args: Positional arguments passed to the function
        kwargs: Keyword arguments passed to the function
    Returns:
        ([float]) Function run times in seconds
    """
    
    times = []
    for i in range(n):
        st = perf_counter()
        fun(*args, **kwargs)
        times.append(perf_counter() - st)
    return times
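    
As a quick, illustrative check (not part of the benchmark), this helper can time any callable; for example, timing NumPy's sort on a small random array:
    
sort_times = time_function(np.sort, 3, np.random.random(10000))
print('Median sort time: {:.2e} s'.format(np.median(sort_times)))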
    
Pull down the dataset
In [3]:
    
data = load_dataset('mp_nostruct')
    
Eliminate entries without formulas
In [4]:
    
data = data[~ data['formula'].isnull()]
    
Downselect to $10^3$ entries
In [5]:
    
data = data.sample(1000)
    
Generate some features
In [6]:
    
X = np.array(ElementProperty.from_preset('magpie').featurize_many(data['formula'].apply(Composition), pbar=False))
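    
As an optional sanity check (an illustrative aside; the exact number of columns depends on the featurizer preset), the featurizer produces one fixed-length descriptor vector per entry:
    
print('Feature matrix shape:', X.shape)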
    
In [7]:
    
y = data['e_form'].values
    
Make a function to run the Scala benchmark
In [8]:
    
lolojar = find_lolo_jar()  # Locate the Lolo jar bundled with lolopy, so the Scala benchmark uses the same library version
    
In [9]:
    
def get_scala_timings(X, y, X_run):
    """Train a RF with standard settings using Lolo, generate uncertainties for the whole dataset, and report timings
    
    Args:
        X, y (ndarray): Training dataset
        X_run (ndarray): Dataset to evaluate
    Returns:
        train, expected, uncertainty (float): Training time, and the times to evaluate
            the expected values and the uncertainties
    """
    # Write the training and evaluation sets to disk for the Scala benchmark script
    np.savetxt('train.csv', np.hstack((X, y[:, None])), delimiter=',')
    np.savetxt('run.csv', np.hstack((X_run, np.zeros((len(X_run), 1)))), delimiter=',')
    p = Popen('scala -J-Xmx8g -cp {} scala-benchmark.scala train.csv run.csv'.format(lolojar),
              stdout=PIPE, stderr=PIPE, shell=True)
    
    # Reading stdout blocks until the Scala process finishes and closes the pipe
    result = p.stdout.read().decode()
    return map(float, result.split(','))
    
In [10]:
    
scala_train, scala_expect, scala_uncert = get_scala_timings(X, y, X)
    
In [11]:
    
print('Lolo train time:', scala_train)
    
    
In [12]:
    
print('Lolo apply time:', scala_expect + scala_uncert)
    
    
Make the lolopy random forest model, using one tree per training entry
In [13]:
    
model = RandomForestRegressor(num_trees=len(X))
    
Fit the model 16 times and measure the fit times
In [14]:
    
rf_fit = time_function(model.fit, 16, X, y)
print('Average fit time:', np.mean(rf_fit))
    
    
Run only the step of transferring the data to Java and record the time
In [15]:
    
x_java, _ = model._convert_train_data(X, y, None)  # Convert the training arrays into the Java objects lolopy sends to the JVM
    
In [16]:
    
rf_transfer = time_function(model._convert_train_data, 16, X, y)
print('Average transfer time:', np.mean(rf_transfer))
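    
To put these two averages side by side (an illustrative aside; the exact fraction depends on the machine), we can report what share of the lolopy fit time is spent just moving the training data into the JVM:
    
# Share of the average fit time accounted for by data transfer alone
print('Transfer fraction of fit time: {:.1%}'.format(np.mean(rf_transfer) / np.mean(rf_fit)))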
    
    
Compute uncertainties
In [17]:
    
rf_apply = time_function(model.predict, 16, X, return_std=True)
print('Average predict time:', np.mean(rf_apply))
    
    
In [18]:
    
rf_apply_transfer = time_function(model._convert_run_data, 16, X)
print('Average transfer time for prediction:', np.mean(rf_apply_transfer))
    
    
Benchmark scikit-learn's random forest with the same number of trees for comparison
In [19]:
    
sk_model = SKRFRegressor(n_estimators=len(X), n_jobs=-1)
    
In [20]:
    
sk_train = time_function(sk_model.fit, 16, X, y)
    
In [21]:
    
print('Sklearn fitting time:', np.mean(sk_train))
    
    
Repeat the timing measurements over a range of training set sizes
In [22]:
    
results = []
for n in tqdm(np.logspace(1, np.log10(len(X)), 8, dtype=int)):
    # Initialize output
    r = {'n': n}
    
    # Get the training and test set sizes
    X_n = X[:n, :]
    y_n = y[:n]
    
    # Time using lolo via Scala
    scala_train, scala_expect, scala_uncert = get_scala_timings(X_n, y_n, X)
    r['scala_train'] = scala_train
    r['scala_apply'] = scala_expect
    r['scala_apply_wuncert'] = scala_expect + scala_uncert
    
    # Time using lolo via lolopy
    model.set_params(num_trees=len(X_n))
    
    r['lolopy_train'] = np.mean(time_function(model.fit, 16, X_n, y_n))
    r['lolopy_train_transfer'] = np.mean(time_function(model._convert_train_data, 16, X_n, y_n))
    
    r['lolopy_apply'] = np.mean(time_function(model.predict, 16, X, return_std=False))
    r['lolopy_apply_wuncert'] = np.mean(time_function(model.predict, 16, X, return_std=True))
    r['lolopy_apply_transfer'] = np.mean(time_function(model._convert_run_data, 16, X))
    
    model.clear_model()  # To save memory
    
    # Time using RF
    sk_model = SKRFRegressor(n_estimators=n)
    
    r['sklearn_fit'] = np.mean(time_function(sk_model.fit, 16, X_n, y_n))
    r['sklearn_apply'] = np.mean(time_function(sk_model.predict, 16, X))
    
    # Append results and continue
    results.append(r)
    
    
 
 
    
In [23]:
    
results = pd.DataFrame(results)
    
In [24]:
    
results
    
    Out[24]:
Plot the training results. The blue shading indicates the data transfer time.
In [25]:
    
fig, ax = plt.subplots()
ax.fill_between(results['n'], results['lolopy_train_transfer'], 0.001, alpha=0.1)
ax.loglog(results['n'], results['lolopy_train'], 'r', label='lolopy')
ax.loglog(results['n'], results['scala_train'], 'b--', label='lolo')
ax.loglog(results['n'], results['sklearn_fit'], 'g:', label='sklearn')
ax.set_ylim(0.005, max(ax.get_ylim()))
ax.set_xlabel('Training Set Size')
ax.set_ylabel('Train Time (s)')
ax.legend()
fig.set_size_inches(3.5, 2.5)
fig.tight_layout()
fig.savefig('training-performance.png')
    
    
Plot the evaluation speed. Note that the number of trees scales with the training set size, hence evaluation slows as the training set grows.
In [26]:
    
fig, axs = plt.subplots(1, 2)
# Plot results without uncertainties
axs[0].loglog(results['n'], len(X)  / results['lolopy_apply'], 'r', label='lolopy')
axs[0].loglog(results['n'], len(X) / results['scala_apply'], 'b--', label='lolo')
axs[0].loglog(results['n'], len(X) / results['sklearn_apply'], 'g:', label='sklearn')
axs[0].set_title('Without Uncertainties')
# Plot results with uncertainities
axs[1].loglog(results['n'], len(X)  / results['lolopy_apply_wuncert'], 'r', label='lolopy')
axs[1].loglog(results['n'], len(X) / results['scala_apply_wuncert'], 'b--', label='lolo')
axs[1].set_title('With Uncertainties')
for ax in axs:
    ax.set_xlabel('Training Set Size')
    ax.set_ylabel('Evaluation Speed (entry/s)')
    ax.legend()
fig.set_size_inches(6.5, 2.5)
fig.tight_layout()
fig.savefig('evaluation-performance.png')
    
    
Verify that performance is within acceptable bounds: less than a 2x slowdown for model training or evaluation at a training set size of 100 entries.
In [27]:
    
lolopy_timing = interp1d(results['n'], results['lolopy_train'])
lolo_timing = interp1d(results['n'], results['scala_train'])
slowdown = lolopy_timing(100) / lolo_timing(100)
print('Training slowdown: {:.2f}'.format(slowdown))
assert slowdown < 2
    
    
In [28]:
    
lolopy_timing = interp1d(results['n'], results['lolopy_apply'])
lolo_timing = interp1d(results['n'], results['scala_apply'])
slowdown = lolopy_timing(100) / lolo_timing(100)
print('Evaluation without uncertainties slowdown: {:.2f}'.format(slowdown))
assert slowdown < 2
    
    
In [29]:
    
lolopy_timing = interp1d(results['n'], results['lolopy_apply_wuncert'])
lolo_timing = interp1d(results['n'], results['scala_apply_wuncert'])
slowdown = lolopy_timing(100) / lolo_timing(100)
print('Evaluation with uncertainties slowdown: {:.2f}'.format(slowdown))
    
    