by Thomas Viehmann tv@lernapparat.de
Today, I would like to discuss in detail some aspects of optimizing code in models, and in particular how you can let the PyTorch JIT optimize things for you.
We will use the Intersection over Union loss commonly used in training detection models as an example and explore various ways to implement it in PyTorch.
The intersection over union (or IoU) loss arises in training detection networks. Given two axis-parallel rectangles (blue and red), we wish to compute the quotient of the area of the intersection (which is again a rectangle) and the area of the union. In colors:
As the intersection is always contained in the union, we know that $0 \leq IoU \leq 1$ (with the optimum being $1$, so strictly speaking $-IoU$ would be a loss).
Note that if we have the area of the intersection and of the two rectangles, we can also express the area of the union as the sum of the areas of the two rectangles minus the area of the intersection (which is contained twice in the sum).
Let $(x_1, y_1, w_1, h_1)$ be the top left coordinates and the width and height of the first rectangle and $(x_2, y_2, w_2, h_2)$ those of the second.
The intersection is easily calculated: If we have the top left and bottom right coordinates (and our coordinate system has increasing $y$ from top to bottom), we can take the maximum of the top left coordinates and the minimum of the bottom right coordinates.
So we have¹
$$
x_I = \max(x_1, x_2), \qquad y_I = \max(y_1, y_2)
$$
and - we need to calculate the bottom right corners, take the minimum and transform back to width and height -
$$
w_I = \min(x_1 + w_1, x_2 + w_2)-x_I.
$$
But there is a slight complication when the rectangles don't intersect: then our formulas do not work but instead give us the rectangle "between" the two, with the corner points exchanged. This means that $w_I$ as calculated above is actually negative, so we can fix this by enforcing a minimum of $0$:
$$
w_I = \max \left( \min(x_1 + w_1, x_2 + w_2)-x_I,0\right), \qquad h_I = \max \left( \min(y_1 + h_1, y_2 + h_2)-y_I,0\right).
$$
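As a quick worked example (just arithmetic with the formulas above): for $(x_1, y_1, w_1, h_1) = (0, 0, 2, 2)$ and $(x_2, y_2, w_2, h_2) = (1, 1, 2, 2)$ we get $x_I = y_I = 1$ and $w_I = h_I = \min(2, 3) - 1 = 1$, so the intersection has area $1$, the union has area $4 + 4 - 1 = 7$, and $IoU = 1/7$. For disjoint rectangles such as $(0, 0, 1, 1)$ and $(3, 3, 1, 1)$, the inner minimum gives $1 - 3 = -2$, the outer maximum clamps this to $w_I = 0$, and the IoU is $0$, as it should be.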
Note that these last maximizations with a constant are performed in PyTorch using the torch.clamp function, while the elementwise maximum and minimum of two tensors are computed using torch.max and torch.min.
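As a minimal illustration of that distinction (two arbitrary tensors, just for show):

a = torch.randn(5)
b = torch.randn(5)
torch.max(a, b)        # elementwise maximum of two tensors
torch.clamp(a, min=0)  # elementwise maximum with the constant 0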
Speaking of PyTorch, enough of the theory, let's move to practical things!
I use $I$ here to mean Intersection, it's not an index.↩
In [1]:
import torch
import torch.utils.cpp_extension
The formulas above readily translate into a PyTorch function. Just to be safe, we clamp the denominator to be at least $10^{-5}$.
In [2]:
def ratio_iou(x1, y1, w1, h1, x2, y2, w2, h2, eps=1e-5):
    xi = torch.max(x1, x2)  # Intersection
    yi = torch.max(y1, y2)
    wi = torch.clamp(torch.min(x1+w1, x2+w2) - xi, min=0)
    hi = torch.clamp(torch.min(y1+h1, y2+h2) - yi, min=0)
    area_i = wi * hi  # Area Intersection
    area_u = w1 * h1 + w2 * h2 - wi * hi  # Area Union
    return area_i / torch.clamp(area_u, min=eps)
The function is ready for vectorized use; we just pass in multi-dimensional tensors. Let's try it out with some dummy data:
In [3]:
x1, y1, w1, h1, x2, y2, w2, h2 = torch.randn(8, 100, 1000, device='cuda').exp()
ratio_iou(x1, y1, w1, h1, x2, y2, w2, h2)
Out[3]:
Without looking too much at the results, it seems to work.
Let us take a short digression here. As you may know, PyTorch provides functional interfaces in torch.nn.functional (often imported as F) as well as modules (in torch.nn, commonly imported as nn)¹. It does so for typical neural network components as well as for the loss functions. We might wonder which is preferable for our own modelling. It is, in the end, a question of style, but I would suggest the following as a good rule of thumb: if it has (significant) parameters or even state, use the module interface (so subclass nn.Module). If it does not, define a function like the one above. I also do this when using PyTorch's own functions - I usually spell out my forward and prefer the function F.relu over the module nn.ReLU.
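To make the rule of thumb concrete, here is a minimal sketch (my own illustration): a parameter-free loss like ours stays a plain function, and a module wrapper would only add boilerplate.

def iou_loss(x1, y1, w1, h1, x2, y2, w2, h2):
    # no parameters, no state: a plain function is all we need
    return -ratio_iou(x1, y1, w1, h1, x2, y2, w2, h2).mean()

class IoULoss(torch.nn.Module):
    # the module version has nothing to configure, so it only adds ceremony
    def forward(self, x1, y1, w1, h1, x2, y2, w2, h2):
        return -ratio_iou(x1, y1, w1, h1, x2, y2, w2, h2).mean()

As soon as learnable parameters or buffers enter the picture, the balance tips and subclassing nn.Module becomes the natural choice.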
But enough of the digression. Can our ratio_iou calculation be made more efficient?
One common thought when trying to make Python code more efficient is moving to C++. Fortunately, PyTorch makes it very straightforward to do so, by way of C++ extensions or custom operators. Both work the same except for the actual bindings. The difference between them is that functions in PyTorch extensions can take arbitrary parameters (by using the library PyBind11), while custom operators are restricted to the types that PyTorch supports (e.g. Tensors, int64_t, double, std::string, IntList and TensorList). The advantage of custom operators is that they can be used with the JIT and from C++, too.
Happily, we can just type our C++ code into a cell and have PyTorch compile it for us. Let's do a custom operator that follows exactly the Python function above:
I might add that I usually just type out the full module names instead of importing them under short names.↩
In [4]:
csrc = """
#include <torch/script.h>
using namespace torch;
Tensor iou_native(const Tensor& x1, const Tensor& y1, const Tensor& w1, const Tensor& h1,
const Tensor& x2, const Tensor& y2, const Tensor& w2, const Tensor& h2) {
auto xi = torch::max(x1, x2);
auto yi = torch::max(y1, y2);
auto wi = torch::clamp(torch::min(x1+w1, x2+w2) - xi, 0);
auto hi = torch::clamp(torch::min(y1+h1, y2+h2) - yi, 0);
auto area_i = wi * hi;
auto area_u = w1 * h1 + w2 * h2 - wi * hi;
return area_i / torch::clamp(area_u, 1e-5);
}
static auto registry =
torch::jit::RegisterOperators("super_iou::iou_native", &iou_native);
"""
torch.utils.cpp_extension.load_inline("libsuperiou", csrc, is_python_module=False, verbose=True)
That was easy enough! Now we have a custom op under torch.ops; the name it is available under is determined by the string argument to RegisterOperators, here torch.ops.super_iou.iou_native. (Note: if you get an error about "multiple overloads", you'll have to reload your kernel and start again. While PyTorch extensions support re-building and re-loading, custom operators run into trouble with that.) Let's see if it gives the same result as the Python version:
In [5]:
(ratio_iou(x1, y1, w1, h1, x2, y2, w2, h2)==torch.ops.super_iou.iou_native(x1, y1, w1, h1, x2, y2, w2, h2)).all().item()
Out[5]:
It works. Note that in general it is safer to use torch.allclose or to print (a-b).abs().max() to deal with numerical precision. But here, == works well, too.
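For reference, such a closeness check could look like this (just a sketch; the tolerances are torch.allclose's defaults):

torch.allclose(ratio_iou(x1, y1, w1, h1, x2, y2, w2, h2),
               torch.ops.super_iou.iou_native(x1, y1, w1, h1, x2, y2, w2, h2),
               rtol=1e-5, atol=1e-8)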
So how about timings? Note that we need to call torch.cuda.synchronize() to get valid timings on the GPU.
In [6]:
def taketime(fn):
    _ = fn(x1, y1, w1, h1, x2, y2, w2, h2)
    torch.cuda.synchronize()

torch.cuda.synchronize()
%timeit taketime(ratio_iou)
%timeit taketime(torch.ops.super_iou.iou_native)
We see that there is a difference of about 5% on CUDA; if we did this with CPU tensors, there would be no significant difference. Depending on the nature of the calculation, this is a typical result. For the lltm model in the PyTorch C++ extension tutorial, you get a speedup of about 10% by moving to C++. But that model involves a loop over the input sequence, so it calls quite a few tensor operations. Here we only have a handful of operations, so moving to C++ offers little performance gain by itself.
What is relatively slow about our code is that each operation stores intermediate results in tensors and the next operation reads those to continue. If we write our own kernel, that can be helped. I consider this the "classic way" of optimizing models.
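One quick way to make those intermediate operations visible (a small sketch using the autograd profiler; the exact output format depends on your PyTorch version) is:

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    ratio_iou(x1, y1, w1, h1, x2, y2, w2, h2)
print(prof)  # roughly one row per op, each reading and writing its own intermediate tensor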
The TensorAccessor (for the CPU) and PackedTensorAccessor (which also transfers sizes and strides to the GPU) classes provide a convenient interface for element access. As you would in production code, we dispatch over the floating-point types with templates, so the element type is available as scalar_t.
For simplicity, we only deal with 1-d tensors (this is the second template argument to the accessors).
In [7]:
csrc = """
#include <torch/script.h>
#include <ATen/Parallel.h>
using namespace torch;
// The cuda kernel is easy enough
template<typename scalar_t>
__global__ void iou_kernel_gpu(PackedTensorAccessor<scalar_t, 1> result,
PackedTensorAccessor<scalar_t, 1> x1,
PackedTensorAccessor<scalar_t, 1> y1,
PackedTensorAccessor<scalar_t, 1> w1,
PackedTensorAccessor<scalar_t, 1> h1,
PackedTensorAccessor<scalar_t, 1> x2,
PackedTensorAccessor<scalar_t, 1> y2,
PackedTensorAccessor<scalar_t, 1> w2,
PackedTensorAccessor<scalar_t, 1> h2
) {
int i = threadIdx.x + blockDim.x * blockIdx.x;
if (i >= x1.size(0)) // we might have more threads than work to do in the last block
return;
// This should look very familiar. We could try reading each element only once, but let's keep it simple.
scalar_t xi = max(x1[i], x2[i]);
scalar_t yi = max(y1[i], y2[i]);
scalar_t wi = max(min(x1[i]+w1[i], x2[i]+w2[i]) - xi, static_cast<scalar_t>(0));
scalar_t hi = max(min(y1[i]+h1[i], y2[i]+h2[i]) - yi, static_cast<scalar_t>(0));
scalar_t area_i = wi * hi;
scalar_t area_u = w1[i] * h1[i] + w2[i] * h2[i] - area_i;
result[i] = area_i / max(area_u, static_cast<scalar_t>(0.00001f));
}
// The CPU kernel looks similar; we could also just put it in the main function...
template<typename scalar_t>
void iou_kernel_cpu(TensorAccessor<scalar_t, 1> result,
                    TensorAccessor<scalar_t, 1> x1,
                    TensorAccessor<scalar_t, 1> y1,
                    TensorAccessor<scalar_t, 1> w1,
                    TensorAccessor<scalar_t, 1> h1,
                    TensorAccessor<scalar_t, 1> x2,
                    TensorAccessor<scalar_t, 1> y2,
                    TensorAccessor<scalar_t, 1> w2,
                    TensorAccessor<scalar_t, 1> h2) {
  // we use CPU parallelization
  constexpr int64_t GRAIN_SIZE = 8192; // minimum grain size for parallel execution
  at::parallel_for(0, x1.size(0), GRAIN_SIZE, [&](int64_t i_begin, int64_t i_end) {
    for (int64_t i = i_begin; i < i_end; ++i) {
      scalar_t xi = max(x1[i], x2[i]);
      scalar_t yi = max(y1[i], y2[i]);
      scalar_t wi = max(min(x1[i]+w1[i], x2[i]+w2[i]) - xi, static_cast<scalar_t>(0));
      scalar_t hi = max(min(y1[i]+h1[i], y2[i]+h2[i]) - yi, static_cast<scalar_t>(0));
      scalar_t area_i = wi * hi;
      scalar_t area_u = w1[i] * h1[i] + w2[i] * h2[i] - area_i;
      result[i] = area_i / max(area_u, static_cast<scalar_t>(0.00001f));
    }
  });
}
torch::Tensor iou_forward(const Tensor& x1, const Tensor& y1, const Tensor& w1, const Tensor& h1,
                          const Tensor& x2, const Tensor& y2, const Tensor& w2, const Tensor& h2) {
  auto res = torch::empty_like(x1);
  for (auto& t : {x1, y1, w1, h1, x2, y2, w2, h2}) {
    AT_ASSERTM(t.dim()==1 && t.size(0)==x1.size(0) && t.device()==x1.device() && t.dtype()==x1.dtype(),
               "tensors are not of same shape and kind");
  }
  if (x1.is_cuda()) {
    dim3 block(512);
    dim3 grid((x1.size(0)+511)/512);
    AT_DISPATCH_FLOATING_TYPES(x1.type(), "iou", [&] {
      iou_kernel_gpu<scalar_t><<<grid,block>>>(res.packed_accessor<scalar_t, 1>(),
                                               x1.packed_accessor<scalar_t, 1>(),
                                               y1.packed_accessor<scalar_t, 1>(),
                                               w1.packed_accessor<scalar_t, 1>(),
                                               h1.packed_accessor<scalar_t, 1>(),
                                               x2.packed_accessor<scalar_t, 1>(),
                                               y2.packed_accessor<scalar_t, 1>(),
                                               w2.packed_accessor<scalar_t, 1>(),
                                               h2.packed_accessor<scalar_t, 1>());
    });
  } else {
    AT_DISPATCH_FLOATING_TYPES(x1.type(), "iou", [&] {
      iou_kernel_cpu<scalar_t>(res.accessor<scalar_t, 1>(),
                               x1.accessor<scalar_t, 1>(),
                               y1.accessor<scalar_t, 1>(),
                               w1.accessor<scalar_t, 1>(),
                               h1.accessor<scalar_t, 1>(),
                               x2.accessor<scalar_t, 1>(),
                               y2.accessor<scalar_t, 1>(),
                               w2.accessor<scalar_t, 1>(),
                               h2.accessor<scalar_t, 1>());
    });
  }
  return res;
}
torch::Tensor iou_native(const Tensor& x1, const Tensor& y1, const Tensor& w1, const Tensor& h1,
                         const Tensor& x2, const Tensor& y2, const Tensor& w2, const Tensor& h2) {
  auto xi = torch::max(x1, x2);
  auto yi = torch::max(y1, y2);
  auto wi = torch::clamp(torch::min(x1+w1, x2+w2) - xi, 0);
  auto hi = torch::clamp(torch::min(y1+h1, y2+h2) - yi, 0);
  auto area_i = wi * hi;
  auto area_u = w1 * h1 + w2 * h2 - wi * hi;
  return area_i / torch::clamp(area_u, 1e-5);
}

static auto registry =
  torch::jit::RegisterOperators("super_iou2::iou_forward", &iou_forward)
  .op("super_iou2::iou_native", &iou_native);
"""
torch.utils.cpp_extension.load_inline("iou_op", "", csrc, is_python_module=False, verbose=True)
Phew. That was a bit tedious, but let's see if it works!
In [8]:
x1, y1, w1, h1, x2, y2, w2, h2 = [t.view(-1) for t in [x1, y1, w1, h1, x2, y2, w2, h2]]
print("check gpu", (ratio_iou(x1, y1, w1, h1, x2, y2, w2, h2)
                    == torch.ops.super_iou2.iou_forward(x1, y1, w1, h1, x2, y2, w2, h2)).all().item())
print("check cpu", (torch.ops.super_iou2.iou_forward(x1.cpu(), y1.cpu(), w1.cpu(), h1.cpu(), x2.cpu(), y2.cpu(), w2.cpu(), h2.cpu())
                    == torch.ops.super_iou2.iou_native(x1.cpu(), y1.cpu(), w1.cpu(), h1.cpu(), x2.cpu(), y2.cpu(), w2.cpu(), h2.cpu())).all().item())
So it seems to work; let's time things again.
In [9]:
torch.cuda.synchronize()
%timeit taketime(torch.ops.super_iou2.iou_forward)
%timeit taketime(ratio_iou)
Now that is a lot faster!
However, it is not usable as is: we do not have a backward. So do we need two more kernels? Can we get something that is fast and doesn't need us to write all that infrastructure?
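To make concrete what that infrastructure would involve, here is a hypothetical scaffold (not something we implement here): using the handwritten kernel in training would mean wrapping it in a torch.autograd.Function and supplying the backward kernels ourselves.

class IoUFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x1, y1, w1, h1, x2, y2, w2, h2):
        ctx.save_for_backward(x1, y1, w1, h1, x2, y2, w2, h2)
        return torch.ops.super_iou2.iou_forward(x1, y1, w1, h1, x2, y2, w2, h2)

    @staticmethod
    def backward(ctx, grad_out):
        # this is where the handwritten backward kernel(s) would have to go,
        # producing gradients for all eight inputs from grad_out
        raise NotImplementedError("the backward kernel is exactly what we don't want to write")

One would then call IoUFunction.apply(x1, ...) instead of the op directly.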
It turns out we can. The PyTorch JIT has two awesome components, the fuser and autodiff, that will automatically create kernels for us. (There is a limitation here: we need to specify the max argument to clamp in order for this to work.)
In [10]:
import math

@torch.jit.script
def ratio_iou_scripted(x1, y1, w1, h1, x2, y2, w2, h2):
    xi = torch.max(x1, x2)  # Intersection (yi similarly)
    yi = torch.max(y1, y2)
    wi = torch.clamp(torch.min(x1+w1, x2+w2) - xi, min=0, max=math.inf)
    hi = torch.clamp(torch.min(y1+h1, y2+h2) - yi, min=0, max=math.inf)
    area_i = wi * hi  # Area Intersection
    area_u = w1 * h1 + w2 * h2 - wi * hi  # Area Union
    return area_i / torch.clamp(area_u, min=1e-5, max=math.inf)
In [11]:
print("check", (ratio_iou_scripted(x1, y1, w1, h1, x2, y2, w2, h2)-ratio_iou(x1, y1, w1, h1, x2, y2, w2, h2)).abs().max().item())
Let's time it again:
In [12]:
torch.cuda.synchronize()
%timeit taketime(torch.ops.super_iou2.iou_forward)
%timeit taketime(ratio_iou_scripted)
%timeit taketime(ratio_iou)
Not bad! We got a more than 6x speedup just by putting @torch.jit.script above our function. While the apparent factor of two compared to the hand-crafted kernel still isn't ideal, part of the reason is that the tensors aren't all that large. Going to 10 million elements, we are only about 25% slower than the handwritten kernel:
In [13]:
x1, y1, w1, h1, x2, y2, w2, h2 = torch.randn(8, 10_000_000, device='cuda').exp()
torch.cuda.synchronize()
%timeit taketime(torch.ops.super_iou2.iou_forward)
%timeit taketime(ratio_iou_scripted)
How did that work? We can look at the graph the JIT has built for our calculation: you see that the main graph defers to a FusionGroup. The fusion group represents the subgraph that will be compiled into our custom kernel. (Note: I assume here that you run this with parameters not requiring gradients; we'll repeat the same with gradients below.)
In [14]:
ratio_iou_scripted.graph_for(x1, y1, w1, h1, x2, y2, w2, h2)
Out[14]:
Note that even if operations are shown in a fusion group, it can sometimes happen that the fuser decides it cannot create a kernel. You can observe kernel creation by setting the environment variable PYTORCH_FUSION_DEBUG=1 (this works best on the console; the generated source code is written to the terminal).
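If you prefer to set it from Python instead of the shell, something like the following should work, assuming the variable is read when the fused kernel is first compiled:

import os
os.environ['PYTORCH_FUSION_DEBUG'] = '1'  # set before the fuser compiles the kernel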
But we really wanted to get forward and backward, so let's do that.
Here is a bit of a digression again, but I'll keep it very short: note that I use requires_grad_() below instead of passing a requires_grad=True argument to randn. This is because now x1 and friends are the leaf variables of the autograd graph; otherwise the random tensor (which is not assigned to a Python variable) would be the leaf variable and accumulate the grads! This is something you can easily fool yourself with (I can't say it never happened to me, and it's a not-so-infrequent cause of questions on the forums, too). I prefer .requires_grad_() over setting the attribute .requires_grad = True because the former is not only shorter but will also fail if I misspell it for any reason.
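A tiny illustration of the difference (my own example, not from the original notebook):

a = torch.randn(5, requires_grad=True).exp()  # a is the result of exp and NOT a leaf;
                                              # gradients accumulate in the randn output we never kept
b = torch.randn(5).exp().requires_grad_()     # b itself is the leaf and will get its .grad populated
print(a.is_leaf, b.is_leaf)                   # prints: False True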
So here is the timing with backward (I always evaluate the scripted function first so that the one-off compilation time does not end up in the timing):
In [15]:
x1, y1, w1, h1, x2, y2, w2, h2 = [t.requires_grad_() for t in torch.randn(8, 100_000, device='cuda').exp()]
l1 = ratio_iou(x1, y1, w1, h1, x2, y2, w2, h2)
l2 = ratio_iou_scripted(x1, y1, w1, h1, x2, y2, w2, h2)
grad_out = torch.randn_like(l1)
grads1 = torch.autograd.grad(l1, [x1, y1, w1, h1, x2, y2, w2, h2], grad_out)
grads2 = torch.autograd.grad(l2, [x1, y1, w1, h1, x2, y2, w2, h2], grad_out)
print ("check:", (l1-l2).abs().max().item(), max([(g1-g2).abs().max().item() for g1, g2 in zip(grads1, grads2)]))
def time_loss_and_backward(fn):
    l = fn(x1, y1, w1, h1, x2, y2, w2, h2)
    grads = torch.autograd.grad(l, [x1, y1, w1, h1, x2, y2, w2, h2], grad_out)
    torch.cuda.synchronize()

torch.cuda.synchronize()
%timeit time_loss_and_backward(ratio_iou)
%timeit time_loss_and_backward(ratio_iou_scripted)
I get a 4.5x speedup. Not bad for just adding @torch.jit.script!
My measurements were done on my PR #14957 branch. The backward optimization had a bit of a bumpy ride in PyTorch in November 2018, as a late fix for correct gradients of broadcasted tensors inserted summations into the backward that cannot be fused. I hope that this will be fixed soon.
Let's look at the graph again. You see that it is now wrapped in a DifferentiableGraph. This means that the JIT autodiff has identified a block that it knows how to differentiate. Inside, you have the FusionGroup we already saw and a bit of broadcasting.
In [16]:
ratio_iou_scripted.graph_for(x1, y1, w1, h1, x2, y2, w2, h2)
Out[16]:
Let's look at the backward graph, too. I extracted the code to get the backward graph from PyTorch's test suite. I re-define the function so that only a single backward is defined. It tries to extract the backward graph from the latest(?) forward run, so it might be a bit fragile; if you run into trouble, rerun the definition of ratio_iou_scripted and the timing with backward before calling backward_graph.
Note that in the output here, the bulk of the calculation (except for a few GradSumToSize operations) is done in one large fusion group again. On 1.0 this would have been split into piecemeal fusion groups with SumToSize in between.
In [17]:
def backward_graph(script_module):
    # magic debugging stuff I learned about in the PyTorch JIT test suite
    graph_executor_state = script_module.get_debug_state()
    fwd_plan = list(graph_executor_state.execution_plans.values())[-1]
    grad_executor = list(fwd_plan.code.grad_executors())[-1]
    bwd_plan = list(grad_executor.get_debug_state().execution_plans.values())[-1]
    return bwd_plan.graph.copy()  # in order to own the graph, we need to make a copy
In [18]:
backward_graph(ratio_iou_scripted)
Out[18]:
That's all for now. I hope you enjoyed this little demo, and I appreciate your feedback and comments at tv@lernapparat.de.
On my blog https://lernapparat.de/ you'll find the slides from the talk that this demonstration accompanies.