An Introduction To Tripy

Tripy is a debuggable, Pythonic frontend for TensorRT, a deep learning inference compiler.

API Semantics

Unlike TensorRT's graph-based API, Tripy uses a functional style:

import nvtripy as tp

a = tp.ones((2, 3))
b = tp.ones((2, 3))
c = a + b
print(c)

Output:

tensor(
    [[2.0000, 2.0000, 2.0000],
     [2.0000, 2.0000, 2.0000]],
    dtype=float32, loc=gpu:0, shape=(2, 3))
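
Because every operation returns an ordinary tensor, results compose like regular Python expressions. As a small illustration (a sketch; the tp.Tensor constructor from Python lists is an assumption here, not shown above):

# Chain operations; each call returns a new tensor:
d = tp.gelu(c * 2.0)

# Assumption: tp.Tensor builds a tensor directly from Python data:
e = d + tp.Tensor([[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]])
print(e)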

Organizing Code With Modules

nvtripy.Module objects are composable, reusable blocks of code:

class MLP(tp.Module):
    def __init__(self, embd_size, dtype=tp.float32):
        super().__init__()
        self.c_fc = tp.Linear(embd_size, 4 * embd_size, bias=True, dtype=dtype)
        self.c_proj = tp.Linear(4 * embd_size, embd_size, bias=True, dtype=dtype)

    def __call__(self, x):
        x = self.c_fc(x)
        x = tp.gelu(x)
        x = self.c_proj(x)
        return x

Usage:

mlp = MLP(embd_size=2)

inp = tp.iota(shape=(1, 2), dim=1, dtype=tp.float32)
out = mlp(inp)

Local variables:

>>> inp
tensor(
    [[0.0000, 1.0000]],
    dtype=float32, loc=gpu:0, shape=(1, 2))

>>> out
tensor(
    [[447.9999, 1183.7290]],
    dtype=float32, loc=gpu:0, shape=(1, 2))
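
Since modules are composable, one module can own others, which makes it easy to build larger networks. A minimal sketch reusing the MLP defined above (the Block class here is hypothetical, not part of nvtripy):

class Block(tp.Module):
    def __init__(self, embd_size):
        super().__init__()
        # Child modules are attributes, just like the Linear layers in MLP:
        self.mlp0 = MLP(embd_size=embd_size)
        self.mlp1 = MLP(embd_size=embd_size)

    def __call__(self, x):
        return self.mlp1(self.mlp0(x))

block = Block(embd_size=2)
out = block(inp)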

Compiling For Better Performance

Modules and functions can be compiled:

fast_mlp = tp.compile(
    mlp,
    # We must indicate which parameters are runtime inputs.
    # MLP takes 1 input tensor for which we specify shape and datatype:
    args=[tp.InputInfo(shape=(1, 2), dtype=tp.float32)],
)

Usage:

out = fast_mlp(inp)

Local variables:

>>> out
tensor(
    [[447.9999, 1183.7290]],
    dtype=float32, loc=gpu:0, shape=(1, 2))
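
Plain functions can be compiled the same way as modules. For example, a sketch compiling tp.gelu directly, using the same args-based signature shown above:

fast_gelu = tp.compile(
    # The function to compile:
    tp.gelu,
    # Its single runtime input is a (2, 8) float32 tensor:
    args=[tp.InputInfo(shape=(2, 8), dtype=tp.float32)],
)

out = fast_gelu(tp.ones((2, 8)))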

Important

There are restrictions on what can be compiled; see nvtripy.compile().

See also

The compiler guide contains more information, including how to enable dynamic shapes.
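
As a rough sketch of what dynamic shapes might look like (the (min, opt, max) triple for a dynamic dimension is an assumption here; consult the compiler guide for the authoritative syntax):

fast_mlp = tp.compile(
    mlp,
    # Assumption: a dimension given as (min, opt, max) becomes dynamic,
    # so this compiled MLP would accept batch sizes from 1 to 32:
    args=[tp.InputInfo(shape=((1, 8, 32), 2), dtype=tp.float32)],
)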

Pitfalls And Best Practices

  • Best Practice: Use eager mode only for debugging; compile for deployment.

    Why: Eager mode internally compiles the graph (slow!) because TensorRT does not support eager execution.

  • Pitfall: Be careful when timing code in eager mode.

    Why: Tensors are evaluated only when they are used, so naive timing will be inaccurate (a more reliable approach is sketched after the example below):

    import time

    start = time.time()
    a = tp.gelu(tp.ones((2, 8)))
    end = time.time()

    # `a` has not been evaluated yet - this time is not what we want!
    print(f"Defined `a` in: {(end - start) * 1000:.3f} ms.")

    start = time.time()
    # `a` is used (and thus evaluated) for the first time:
    print(a)
    end = time.time()

    # This includes compilation time, not just execution time!
    print(f"Compiled and evaluated `a` in: {(end - start) * 1000:.3f} ms.")
    

    Output:

    Defined `a` in: 6.750 ms.
    tensor(
        [[0.8412, 0.8412, 0.8412, 0.8412, 0.8412, 0.8412, 0.8412, 0.8412],
         [0.8412, 0.8412, 0.8412, 0.8412, 0.8412, 0.8412, 0.8412, 0.8412]],
        dtype=float32, loc=gpu:0, shape=(2, 8))
    Compiled and evaluated `a` in: 105.025 ms.
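
    For more reliable measurements, compile first and force evaluation inside the timed region. A minimal sketch, assuming Tensor.eval() triggers execution immediately (and noting that rigorous GPU timing generally also requires device synchronization):

    import time

    fast_gelu = tp.compile(
        tp.gelu,
        args=[tp.InputInfo(shape=(2, 8), dtype=tp.float32)],
    )

    inp = tp.ones((2, 8))
    # Warm up once so one-time costs don't pollute the measurement:
    fast_gelu(inp).eval()

    start = time.time()
    out = fast_gelu(inp)
    out.eval()  # Assumption: eval() forces execution now rather than at first use.
    end = time.time()

    print(f"Executed in: {(end - start) * 1000:.3f} ms.")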