An Introduction To Tripy¶
Tripy is a debuggable, Pythonic frontend for TensorRT, a deep learning inference compiler.
API Semantics¶
Unlike TensorRT’s graph-based semantics, Tripy uses a functional style:
a = tp.ones((2, 3))
b = tp.ones((2, 3))
c = a + b
print(c)
Output:
tensor(
[[2.0000, 2.0000, 2.0000],
[2.0000, 2.0000, 2.0000]],
dtype=float32, loc=gpu:0, shape=(2, 3))
Organizing Code With Modules¶
nvtripy.Module objects are composable, reusable blocks of code:
class MLP(tp.Module):
    def __init__(self, embd_size, dtype=tp.float32):
        super().__init__()
        self.c_fc = tp.Linear(embd_size, 4 * embd_size, bias=True, dtype=dtype)
        self.c_proj = tp.Linear(
            4 * embd_size, embd_size, bias=True, dtype=dtype
        )

    def __call__(self, x):
        x = self.c_fc(x)
        x = tp.gelu(x)
        x = self.c_proj(x)
        return x
Usage:
mlp = MLP(embd_size=2)

inp = tp.iota(shape=(1, 2), dim=1, dtype=tp.float32)
out = mlp(inp)
Local Variables
>>> inp
tensor(
[[0.0000, 1.0000]],
dtype=float32, loc=gpu:0, shape=(1, 2))
>>> out
tensor(
[[447.9999, 1183.7290]],
dtype=float32, loc=gpu:0, shape=(1, 2))
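As a point of reference, tp.iota(shape=(1, 2), dim=1) fills the tensor with the index along dimension 1, which is why inp above is [[0, 1]]. The fill pattern can be sketched in plain Python for the 2D case (the helper iota_2d below is ours for illustration, not part of Tripy):

```python
def iota_2d(shape, dim=0):
    """Fill a rows x cols nested list with the index along `dim`,
    mimicking the fill pattern of tp.iota for 2D shapes."""
    rows, cols = shape
    return [
        [float(i if dim == 0 else j) for j in range(cols)]
        for i in range(rows)
    ]

print(iota_2d((1, 2), dim=1))  # [[0.0, 1.0]], matching `inp` above
```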
Compiling For Better Performance¶
Modules and functions can be compiled:
fast_mlp = tp.compile(
    mlp,
    # We must indicate which parameters are runtime inputs.
    # MLP takes 1 input tensor for which we specify shape and datatype:
    args=[tp.InputInfo(shape=(1, 2), dtype=tp.float32)],
)
Usage:
out = fast_mlp(inp)
Local Variables
>>> out
tensor(
[[447.9999, 1183.7290]],
dtype=float32, loc=gpu:0, shape=(1, 2))
Important
There are restrictions on what can be compiled; see nvtripy.compile().
See also
The compiler guide contains more information, including how to enable dynamic shapes.
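One way to see why tp.InputInfo is required: compilation specializes the function for a declared input signature, and calls outside that signature cannot be served. A toy, Tripy-independent analogy of that contract (toy_compile and its shape check are illustrative only, not Tripy API):

```python
def toy_compile(fn, shape):
    """Specialize `fn` for 2D nested-list inputs of exactly `shape`,
    rejecting anything else at call time."""
    def specialized(x):
        got = (len(x), len(x[0]))
        if got != shape:
            raise ValueError(f"expected shape {shape}, got {got}")
        return fn(x)
    return specialized

# Specialize an elementwise doubling function for inputs of shape (1, 2):
double = toy_compile(lambda m: [[2 * v for v in row] for row in m], shape=(1, 2))
print(double([[1.0, 2.0]]))  # [[2.0, 4.0]]
```

Tripy's real compiler goes much further (it builds a TensorRT engine for the declared shapes, with optional dynamic-shape ranges described in the compiler guide), but the fixed-signature idea is the same.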
Pitfalls And Best Practices¶
Best Practice: Use eager mode only for debugging; compile for deployment.
Why: Eager mode internally compiles the graph (slow!) as TensorRT lacks eager execution.
Pitfall: Be careful when timing code in eager mode.
Why: Tensors are evaluated only when they are used, so naive timing will be inaccurate:
import time

start = time.time()
a = tp.gelu(tp.ones((2, 8)))
end = time.time()

# `a` has not been evaluated yet - this time is not what we want!
print(f"Defined `a` in: {(end - start) * 1000:.3f} ms.")

start = time.time()
# `a` is used (and thus evaluated) for the first time:
print(a)
end = time.time()

# This includes compilation time, not just execution time!
print(f"Compiled and evaluated `a` in: {(end - start) * 1000:.3f} ms.")
Output:
Defined `a` in: 6.750 ms.
tensor(
[[0.8412, 0.8412, 0.8412, 0.8412, 0.8412, 0.8412, 0.8412, 0.8412],
[0.8412, 0.8412, 0.8412, 0.8412, 0.8412, 0.8412, 0.8412, 0.8412]],
dtype=float32, loc=gpu:0, shape=(2, 8))
Compiled and evaluated `a` in: 105.025 ms.
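The lazy-evaluation pitfall is easy to reproduce outside Tripy: building a deferred computation is cheap, and the real cost is paid the first time the result is forced. A Tripy-independent sketch in plain Python:

```python
import time

def deferred(fn, *args):
    """Return a zero-argument thunk; the work happens only when it is called."""
    return lambda: fn(*args)

start = time.time()
# Cheap: nothing is computed here, just like defining `a` above.
thunk = deferred(sum, range(10_000_000))
define_ms = (time.time() - start) * 1000

start = time.time()
result = thunk()  # The work actually happens here.
eval_ms = (time.time() - start) * 1000

print(f"Defined in {define_ms:.3f} ms, evaluated in {eval_ms:.3f} ms.")
```

Timing the definition alone would miss essentially all of the cost, which is why the eager-mode measurements above must bracket the first use of the tensor, not its construction.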