An Introduction To Tripy¶
Tripy is a debuggable, Pythonic frontend for TensorRT, a deep learning inference compiler.
API Semantics¶
Unlike TensorRT’s graph-based semantics, Tripy uses a functional style:
a = tp.ones((2, 3))
b = tp.ones((2, 3))
c = a + b
print(c)
Output:
tensor(
[[2, 2, 2],
[2, 2, 2]],
dtype=float32, loc=gpu:0, shape=(2, 3))
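Because there is no explicit graph to build, operations chain like ordinary Python calls. A minimal sketch using only operations shown on this page:

# Chain functional ops directly; no graph construction step is needed:
x = tp.iota(shape=(2, 3), dim=0, dtype=tp.float32)
y = tp.gelu(x + tp.ones((2, 3)))
print(y)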
Organizing Code With Modules¶
nvtripy.Module objects are composable, reusable blocks of code:
class MLP(tp.Module):
    def __init__(self, embd_size, dtype=tp.float32):
        super().__init__()
        self.c_fc = tp.Linear(embd_size, 4 * embd_size, bias=True, dtype=dtype)
        self.c_proj = tp.Linear(
            4 * embd_size, embd_size, bias=True, dtype=dtype
        )

    def forward(self, x):
        x = self.c_fc(x)
        x = tp.gelu(x)
        x = self.c_proj(x)
        return x
Usage:
mlp = MLP(embd_size=2)

# Set parameters:
mlp.load_state_dict(
    {
        "c_fc.weight": tp.ones((8, 2)),
        "c_fc.bias": tp.ones((8,)),
        "c_proj.weight": tp.ones((2, 8)),
        "c_proj.bias": tp.ones((2,)),
    }
)

# Execute:
inp = tp.iota(shape=(1, 2), dim=1, dtype=tp.float32)
out = mlp(inp)
Local Variables
>>> inp
tensor(
[[0, 1]],
dtype=float32, loc=gpu:0, shape=(1, 2))
>>> out
tensor(
[[16.636, 16.636]],
dtype=float32, loc=gpu:0, shape=(1, 2))
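Because modules are plain Python objects, they compose by nesting. A minimal sketch of a hypothetical Stack module (the name and structure here are illustrative, not part of the library) that chains two of the MLPs defined above:

class Stack(tp.Module):
    def __init__(self, embd_size, dtype=tp.float32):
        super().__init__()
        # Child modules are registered as attributes, just like the
        # tp.Linear layers inside MLP:
        self.block1 = MLP(embd_size, dtype=dtype)
        self.block2 = MLP(embd_size, dtype=dtype)

    def forward(self, x):
        return self.block2(self.block1(x))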
Compiling For Better Performance¶
Modules and functions can be compiled:
fast_mlp = tp.compile(
    mlp,
    # We must indicate which parameters are runtime inputs.
    # MLP takes 1 input tensor for which we specify shape and datatype:
    args=[tp.InputInfo(shape=(1, 2), dtype=tp.float32)],
)
Usage:
out = fast_mlp(inp)
Local Variables
>>> out
tensor(
[[16.636, 16.636]],
dtype=float32, loc=gpu:0, shape=(1, 2))
Important
There are restrictions on what can be compiled; see nvtripy.compile().
See also
The compiler guide contains more information, including how to enable dynamic shapes.
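As a rough sketch of what the guide covers: a dimension in tp.InputInfo can be given as a (min, opt, max) range instead of a fixed size to make it dynamic. The exact syntax below is an assumption; consult the compiler guide for the authoritative form:

fast_mlp_dyn = tp.compile(
    mlp,
    # Assumed syntax: the batch dimension ranges from 1 to 8, with 4 as
    # the size to optimize for; the embedding dimension stays fixed at 2:
    args=[tp.InputInfo(shape=((1, 4, 8), 2), dtype=tp.float32)],
)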
Pitfalls And Best Practices¶
Best Practice: Use eager mode only for debugging; compile for deployment.
Why: Eager mode internally compiles the graph (slow!) because TensorRT does not support eager execution.
Pitfall: Be careful when timing code in eager mode.
Why: Tensors are evaluated only when they are used, so naive timing will be inaccurate:
import time

start = time.time()
a = tp.gelu(tp.ones((2, 8)))
end = time.time()

# `a` has not been evaluated yet - this time is not what we want!
print(f"Defined `a` in: {(end - start) * 1000:.3f} ms.")

start = time.time()
# `a` is used (and thus evaluated) for the first time:
print(a)
end = time.time()

# This includes compilation time, not just execution time!
print(f"Compiled and evaluated `a` in: {(end - start) * 1000:.3f} ms.")
Output:
Defined `a` in: 3.295 ms.
tensor(
[[0.841345, 0.841345, 0.841345, 0.841345, 0.841345, 0.841345, 0.841345, 0.841345],
[0.841345, 0.841345, 0.841345, 0.841345, 0.841345, 0.841345, 0.841345, 0.841345]],
dtype=float32, loc=gpu:0, shape=(2, 8))
Compiled and evaluated `a` in: 98.144 ms.
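To measure execution time rather than compilation time, one approach (a sketch, not the only option) is to compile up front, run once to warm up, and then time only the compiled call:

import time

# tp.gelu is an ordinary function, so it can be compiled directly,
# just like the module above:
fast_gelu = tp.compile(
    tp.gelu, args=[tp.InputInfo(shape=(2, 8), dtype=tp.float32)]
)

inp = tp.ones((2, 8))
fast_gelu(inp)  # Warm-up run

start = time.time()
out = fast_gelu(inp)
# Assumption: evaluating the output forces execution to complete; if
# execution is asynchronous, a stream synchronization may be needed
# for exact numbers.
out.eval()
end = time.time()
print(f"Executed `fast_gelu` in: {(end - start) * 1000:.3f} ms.")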