An Introduction To Tripy¶
What Is Tripy?¶
Tripy is a compiler for deep learning inference that uses TensorRT as its backend. It aims to be fast and easy to debug, and to provide an easy-to-use, Pythonic interface.
Your First Tripy Program¶
import tripy as tp  # conventional alias for Tripy, used throughout this guide

a = tp.arange(5)
c = a + 1.5
print(c)
Output:
tensor([1.5000, 2.5000, 3.5000, 4.5000, 5.5000], dtype=float32, loc=gpu:0, shape=(5,))
This should look familiar if you’ve used linear algebra or deep learning libraries like NumPy and PyTorch. Hopefully, the code above is self-explanatory, so we won’t go into details.
Organizing Code Using Modules¶
The tripy.Module API allows you to create reusable blocks that can be composed together to create models. Modules may be composed of other modules, including modules predefined by Tripy, like tripy.Linear and tripy.LayerNorm.
For example, we can define a Transformer MLP block like so:
class MLP(tp.Module):
    def __init__(self, embd_size, dtype=tp.float32):
        super().__init__()
        self.c_fc = tp.Linear(embd_size, 4 * embd_size, bias=True, dtype=dtype)
        self.c_proj = tp.Linear(4 * embd_size, embd_size, bias=True, dtype=dtype)

    def __call__(self, x):
        x = self.c_fc(x)
        x = tp.gelu(x)
        x = self.c_proj(x)
        return x
To use it, we just need to construct and call it:
mlp = MLP(embd_size=2)

inp = tp.iota(shape=(1, 2), dim=1, dtype=tp.float32)
out = mlp(inp)
>>> inp
tensor(
[[0.0000, 1.0000]],
dtype=float32, loc=gpu:0, shape=(1, 2))
>>> out
tensor(
[[447.9999, 1183.7290]],
dtype=float32, loc=gpu:0, shape=(1, 2))
Compiling Code¶
All the code we’ve seen so far has been using Tripy’s eager mode. It is also possible to compile functions or modules ahead of time, which can result in significantly better performance.
Note that the compiler imposes some requirements on the functions/modules it can compile; see tripy.compile() for details.
Let’s compile the MLP module we defined above as an example:
# When we compile, we need to indicate which parameters of the function
# should be runtime inputs. In this case, MLP takes a single input tensor
# for which we can specify our desired shape and data type.
fast_mlp = tp.compile(mlp, args=[tp.InputInfo(shape=(1, 2), dtype=tp.float32)])
Now let’s benchmark the compiled version against eager mode:
import time

start = time.time()
out = mlp(inp)
# We need to evaluate in order to actually materialize `out`.
# See the section on lazy evaluation below for details.
out.eval()
end = time.time()

eager_time = (end - start) * 1000
print(f"Eager mode time: {eager_time:.4f} ms")

start = time.time()
out = fast_mlp(inp)
out.eval()
end = time.time()

compiled_time = (end - start) * 1000
print(f"Compiled mode time: {compiled_time:.4f} ms")
Output:
Eager mode time: 102.1855 ms
Compiled mode time: 0.2320 ms
For more information on the compiler, compiled functions/modules, and dynamic shapes, see the compiler guide.
Things To Note¶
Eager Mode: How Does It Work?¶
If you’ve used TensorRT before, you may know that it does not support an eager mode. In order to provide eager mode support in Tripy, we actually need to compile the graph under the hood.
Although we employ several tricks to reduce compile times in eager mode, we do still need to compile, so eager mode will likely be slower than other comparable frameworks.
Consequently, we suggest that you use eager mode primarily for debugging and compiled mode for deployments.
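To make the idea concrete, here is a minimal pure-Python sketch of how an eager API can be layered on top of a compile-only backend. This is not Tripy's actual implementation, and the names (`toy_compile`, `eager_op`) are invented for illustration; caching compiled executables is shown as one plausible trick for keeping such a design fast, not necessarily the one Tripy uses.

```python
# Toy illustration: eager execution on top of a compile-only backend.
# NOT Tripy's implementation; all names here are hypothetical.

compile_cache = {}

def toy_compile(op_name):
    """Stand-in for an expensive compilation step that returns an executable."""
    ops = {
        "add": lambda a, b: [x + y for x, y in zip(a, b)],
        "mul": lambda a, b: [x * y for x, y in zip(a, b)],
    }
    return ops[op_name]

def eager_op(op_name, a, b):
    # Compile on first use, then reuse the cached executable, so that
    # repeated eager calls do not pay the compilation cost again.
    if op_name not in compile_cache:
        compile_cache[op_name] = toy_compile(op_name)
    return compile_cache[op_name](a, b)

print(eager_op("add", [1, 2], [3, 4]))  # [4, 6]
print(eager_op("mul", [1, 2], [3, 4]))  # [3, 8]
```

The first call to each op pays the "compile" cost; subsequent calls hit the cache, which is why a real eager-over-compiler design can be usable for debugging even though each new graph shape is expensive.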
Lazy Evaluation: Putting Off Work¶
One important point is that Tripy uses a lazy evaluation model; that is, no computation is performed until a value is actually needed.
In most cases, this is simply an implementation detail that you will not notice. One exception to this is when attempting to time code. Consider the following code:
import time

start = time.time()
a = tp.arange(5)
b = tp.arange(5)
c = a + b + tp.tanh(a)
end = time.time()

print(f"Time to create 'c': {(end - start) * 1000:.3f} ms.")
Output:
Time to create 'c': 10.499 ms.
Given what we said above about eager mode, it seems like Tripy is very fast!
Of course, this is because we haven’t actually done anything yet.
The actual compilation and execution only happen when we evaluate c:
start = time.time()
print(c)
end = time.time()

print(f"Time to print 'c': {(end - start) * 1000:.3f} ms.")
Output:
tensor([0.0000, 2.7616, 4.9640, 6.9951, 8.9993], dtype=float32, loc=gpu:0, shape=(5,))
Time to print 'c': 89.350 ms.
That is why the time to print c is so much higher than the time to create it.
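The deferral pattern itself can be sketched in plain Python. The following is a conceptual illustration, not Tripy's internals: each `Lazy` object (a hypothetical class) records what to compute and which values it depends on, and nothing runs until `.eval()` is called, mirroring the `c = a + b + tp.tanh(a)` example above.

```python
# Minimal sketch of lazy evaluation: building an expression is cheap;
# all work is deferred until .eval() is called. Not Tripy's machinery.
import math

class Lazy:
    def __init__(self, fn, *deps):
        self.fn = fn        # computation to run later
        self.deps = deps    # upstream Lazy values this node depends on
        self._result = None
        self._done = False

    def eval(self):
        # Recursively evaluate dependencies, then run (and cache) this node.
        if not self._done:
            self._result = self.fn(*(d.eval() for d in self.deps))
            self._done = True
        return self._result

# Mirror the example above: c = a + b + tanh(a) over 5-element ranges.
a = Lazy(lambda: [float(i) for i in range(5)])
b = Lazy(lambda: [float(i) for i in range(5)])
a_plus_b = Lazy(lambda x, y: [p + q for p, q in zip(x, y)], a, b)
c = Lazy(lambda s, t: [p + math.tanh(q) for p, q in zip(s, t)], a_plus_b, a)

# Up to this point nothing has executed; this line triggers the whole chain,
# producing roughly [0.0, 2.7616, 4.9640, 6.9951, 8.9993] as in the output above.
print(c.eval())
```

Creating `a`, `b`, and `c` only builds the dependency graph, which is why the "create" timing above is tiny while the "print" timing includes all of the real work.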