
This notebook illustrates InferenceWrapper.df_to_emb, a utility for performing inference in bulk on large amounts of data. A benchmark compares it with the previous method of running inference serially, one issue at a time, and shows a roughly 10x reduction in inference time.

Location of Model Artifacts

Google Cloud Storage

  • model for inference (965 MB):
  • encoder (for fine-tuning w/a classifier) (965 MB):
  • fastai.databunch (27.1 GB):
  • checkpointed model (2.29 GB):

Load Minimal Model For Inference

from inference import InferenceWrapper, pass_through
from IPython.display import display, Markdown
import pandas as pd
from torch.nn.utils.rnn import pad_sequence
from torch import Tensor, device
from torch.cuda import empty_cache
from typing import List
from tqdm import tqdm
from numpy import concatenate as cat
import torch
import numpy as np

Create an InferenceWrapper object

wrapper = InferenceWrapper(model_path='/ds/Issue-Embeddings/notebooks',

Download a test dataset

The test dataset has 8,000 GitHub Issues in the below format:

testdf = pd.read_csv(f'').head(8000)


   url  repo                       title                                               title_length  body                                               body_length
0       egingric/2016-Racing-Game  Got stuck near shortcut                             25            After being blown up by the barrel, I got stuc...  314
1       Microsoft/nodejstools      Guidance for unit test execution - How to prop...  95            What is the appropriate way to set NODE_ENV fo...  507
2       raphapari/dummy            Génération du catalogue                             25            ## User story xxxlnbrk - En tant que : **gest...   480

Perform Batch Inference

Why Batch Inference? When you need to retrieve document embeddings for a large number of issues, batch inference on a GPU should be significantly faster than on a CPU.

Generate Embeddings From Pre-Trained Language Model

See help for wrapper.df_to_emb:

Help on method df_to_emb in module inference:

df_to_emb(dataframe:pandas.core.frame.DataFrame, bs=100) -> numpy.ndarray method of inference.InferenceWrapper instance
    Retrieve document embeddings for a dataframe with the columns `title` and `body`.
    Uses batching for efficient computation, which is useful when you have many documents
    to retrieve embeddings for.

    Parameters
    ----------
    dataframe: pandas.DataFrame
        Dataframe with columns `title` and `body`, which represent the Title and Body of a
        GitHub Issue.
    bs: int
        Batch size for doing inference. Set this variable according to your available GPU memory.
        The signature default is 100; a batch size of 200 was stable on an NVIDIA Tesla V100.

    Returns
    -------
    numpy.ndarray
        An array of shape (number of dataframe rows, 2400).
        This numpy array represents the latent features of the GitHub issues.

    Example
    -------
    >>> import pandas as pd
    >>> wrapper = InferenceWrapper(model_path='/path/to/model',
    # load 200 sample GitHub issues
    >>> testdf = pd.read_csv(f'').head(200)
    >>> embeddings = wrapper.df_to_emb(testdf)
    >>> embeddings.shape
    (200, 2400)

Benchmarking inference time on 8,000 Issues

Below, inference is done using batching (New Method):

embeddings = wrapper.df_to_emb(testdf)

CPU times: user 1min 6s, sys: 21.5 s, total: 1min 27s
Wall time: 1min 28s

Below, inference is done one issue at a time (Old Method):

# prepare data
test_data = [wrapper.process_dict(x)['text'] for x in testdf.to_dict(orient='records')]

emb_single = []
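for d in tqdm(test_data):
    # Assumption: the loop body is missing from this export. get_pooled_features is assumed
    # to return the embedding tensor for a single document.
    emb_single.append(wrapper.get_pooled_features(d).detach().cpu())

emb_single_combined = cat(emb_single)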

100%|██████████| 8000/8000 [14:48<00:00,  9.92it/s]
CPU times: user 12min 42s, sys: 2min 54s, total: 15min 37s
Wall time: 15min 34s
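
The two wall times reported above can be compared directly. The following is a small sketch of that arithmetic; the helper name to_seconds is only for illustration, and the figures are the ones printed by the two runs in this notebook.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'Xmin Ys' wall time into seconds."""
    return 60 * minutes + seconds


n_issues = 8000
batched_s = to_seconds(1, 28)    # wall time of the batched run
serial_s = to_seconds(15, 34)    # wall time of the one-at-a-time run

speedup = serial_s / batched_s
batched_rate = n_issues / batched_s   # issues per second, batched
serial_rate = n_issues / serial_s     # issues per second, serial

print(f"speedup: {speedup:.1f}x")
print(f"batched: {batched_rate:.1f} issues/s, serial: {serial_rate:.1f} issues/s")
```

On these numbers the ratio comes out at about 10.6x, or roughly 91 issues per second versus roughly 9.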


Chunking the data into batches of similar length (to minimize padding) and passing each batch through the GPU yields a roughly 10x speedup for inference.
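
To make the "batches of similar length" idea concrete, here is a minimal sketch in plain Python. It is not the implementation inside df_to_emb; the function names and the token counts are hypothetical, chosen only to show how sorting by length before chunking reduces the number of pad tokens, and how the original row order can be restored afterwards.

```python
from typing import List, Sequence


def length_sorted_batches(lengths: Sequence[int], bs: int) -> List[List[int]]:
    """Group document indices into batches of size `bs`, ordered by length."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + bs] for i in range(0, len(order), bs)]


def padding_tokens(lengths: Sequence[int], batches: List[List[int]]) -> int:
    """Count pad tokens needed when each batch is padded to its longest document."""
    total = 0
    for batch in batches:
        longest = max(lengths[i] for i in batch)
        total += sum(longest - lengths[i] for i in batch)
    return total


# Hypothetical token counts for eight issues (not taken from the test dataset).
lengths = [12, 300, 15, 280, 9, 310, 14, 295]
bs = 4

# Batches taken in the original row order versus batches of similar length.
naive = [list(range(i, min(i + bs, len(lengths)))) for i in range(0, len(lengths), bs)]
bucketed = length_sorted_batches(lengths, bs)

print("pad tokens, original order:", padding_tokens(lengths, naive))
print("pad tokens, length-sorted: ", padding_tokens(lengths, bucketed))

# Results come back in sorted order, so keep a permutation that maps them
# back to the original row order of the dataframe.
flat = [i for batch in bucketed for i in batch]
restore = sorted(range(len(flat)), key=lambda pos: flat[pos])
assert [flat[pos] for pos in restore] == list(range(len(lengths)))
```

The restore step matters because the returned embeddings need to line up row for row with the input dataframe.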

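A further speed improvement could likely come from PyTorch's packed-sequence utilities (pack_padded_sequence and pad_packed_sequence), which let a recurrent model skip padded positions. We leave this as a future exercise in optimizing batching.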
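
As a starting point for that exercise, the sketch below shows the packed-sequence round trip on a toy LSTM. It does not use the language model from this notebook: the layer sizes, the random inputs, and the sequence lengths are all assumptions made for illustration.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Three hypothetical documents of different lengths, each token already a 4-dim vector.
seqs = [torch.randn(n, 4) for n in (5, 3, 2)]
lengths = torch.tensor([s.shape[0] for s in seqs])

# Pad to a rectangular batch, then pack so the LSTM only processes real tokens.
padded = pad_sequence(seqs, batch_first=True)
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

# Toy recurrent layer standing in for the real encoder.
lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
packed_out, _ = lstm(packed)

# Unpack back to a padded tensor; positions past each document's length are zeros.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape, out_lengths.tolist())
```

With enforce_sorted=False the batch does not need to be pre-sorted by length, and the unpacked output comes back in the original batch order.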


This section tests that the embeddings retrieved with the one-at-a-time approach are sufficiently close to the embeddings from the batching approach.

assert np.allclose(emb_single_combined, embeddings, atol=1e-5)
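
For reference, np.allclose(a, b, rtol, atol) passes when every element satisfies |a - b| <= atol + rtol * |b|, with rtol defaulting to 1e-5. The sketch below re-implements that criterion in plain Python on made-up values, to show what size of difference the check above tolerates; the numbers are illustrative and are not real embeddings.

```python
from typing import List

Matrix = List[List[float]]


def max_abs_diff(a: Matrix, b: Matrix) -> float:
    """Largest element-wise absolute difference between two equally shaped matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def all_close(a: Matrix, b: Matrix, rtol: float = 1e-5, atol: float = 1e-5) -> bool:
    """Element-wise |a - b| <= atol + rtol * |b|, the criterion np.allclose documents."""
    return all(
        abs(x - y) <= atol + rtol * abs(y)
        for ra, rb in zip(a, b)
        for x, y in zip(ra, rb)
    )


# Made-up values standing in for two rows of embeddings.
serial = [[0.120000, -1.500000], [2.000000, 0.000000]]
batched = [[0.120004, -1.500003], [1.999998, 0.000002]]   # differs by a few 1e-6
drifted = [[0.120100, -1.500000], [2.000000, 0.000000]]   # one entry differs by 1e-4

print(all_close(serial, batched))   # within tolerance
print(all_close(serial, drifted))   # outside tolerance
print(max_abs_diff(serial, batched))
```

Printing the maximum absolute difference alongside the boolean check is a cheap way to see how much headroom the tolerance leaves.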