Architecture¶
Overview¶
Tripy builds an MLIR program by tracing functional-style Python APIs.
The program is compiled and executed by MLIR-TRT.
- Frontend: Exposes functional-style operations for nvtripy.Tensor objects.
- Trace: Computation graph of TraceTensor and TraceOp objects that lowers to tensorrt-dialect MLIR.
- Backend: Interfaces with MLIR-TRT:
  - Compiler: Compiles tensorrt-dialect MLIR to an MLIR-TRT executable.
  - Executable: Wraps an MLIR-TRT executable in a Pythonic API.
Note
Frontend/Backend refer to the flow of execution, not what the user does or doesn't see.
Public APIs are exposed by both the frontend (e.g. nvtripy.resize()) and the backend (e.g. nvtripy.compile()).
The Stack By Example¶
Consider a simple example:
def scale_up(inp):
    out = tp.resize(inp, scales=(2, 2), mode="linear")
    out.name = "out"  # Setting name for IR readability
    return out


compiled_func = tp.compile(
    scale_up, args=[tp.InputInfo((2, 2), dtype=tp.float32)]
)

inp = tp.iota((2, 2), dtype=tp.float32)
out = compiled_func(inp)
Local Variables
>>> compiled_func
Executable(inp: nvtripy.Tensor) -> nvtripy.Tensor
>>> inp
tensor(
[[0, 0],
[1, 1]],
dtype=float32, loc=gpu:0, shape=(2, 2))
>>> out
tensor(
[[0, 0, 0, 0],
[0.25, 0.25, 0.25, 0.25],
[0.75, 0.75, 0.75, 0.75],
[1, 1, 1, 1]],
dtype=float32, loc=gpu:0, shape=(4, 4))
Frontend¶
The frontend exposes nvtripy.Tensor (which wraps TraceTensor) and various operations, e.g. nvtripy.resize.
Info
Most operations are decorated with:
- @export.public_api: Enables documentation, type checking, and overloading.
- @wrappers.interface: Enforces (and generates tests for) data type constraints.
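For illustration, a decorated operation might look roughly like the sketch below. The decorator arguments, import locations, and constraint syntax are assumptions for illustration, not the exact internal signatures.

from typing import Sequence
from nvtripy import export, wrappers  # assumed import locations

@export.public_api(document_under="operations")  # docs, type checking, overloading
@wrappers.interface(  # enforces (and generates tests for) dtype constraints
    dtype_constraints={"input": "T1", wrappers.RETURN_VALUE: "T1"},
    dtype_variables={"T1": ["float32", "float16"]},
)
def resize(input: "nvtripy.Tensor", scales: Sequence[float], mode: str = "linear") -> "nvtripy.Tensor":
    # Builds a TraceOp and returns a Tensor wrapping its output TraceTensor.
    ...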
Operations are lazily evaluated; calling them just builds up an implicit graph of TraceOps.
Note
To evaluate outputs, the graph must first be compiled:
- In eager mode, this happens implicitly when a frontend tensor is used (printed, .eval()'d, or exported via DLPack); see the sketch below.
- In compiled mode, the user explicitly compiles a function or nvtripy.Module.
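For example, in eager mode (using only the public APIs shown earlier):

import nvtripy as tp

a = tp.iota((2, 2), dtype=tp.float32)
b = a + 1                                       # No execution yet; adds a TraceOp.
c = tp.resize(b, scales=(2, 2), mode="linear")  # Still just graph building.

print(c)  # Evaluation point: the graph is compiled and executed here.
c.eval()  # Explicitly forces evaluation.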
Trace¶
To build the Trace, we walk backwards from the output(s) and accumulate operations; a conceptual sketch of this walk follows the IR dump below:
==== Trace IR ====
def scale_up(
inp : tensor<2x2xf32:gpu:0> : ShapeBounds(min=[2, 2], opt=[2, 2], max=[2, 2])
) -> (
out : tensor<?x?xf32:gpu:0>
):
out = resize_linear(inp : tensor<2x2xf32:gpu:0>, scales=(2, 2), align_corners=False) : tensor<?x?xf32:gpu:0>
return out
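Conceptually, the backward walk is a post-order traversal over producers. A minimal sketch, assuming hypothetical producer and inputs attributes (not Tripy's actual internals):

# Hypothetical sketch of the backward walk that builds the Trace.
def collect_trace_ops(outputs):
    ops, seen = [], set()

    def visit(tensor):
        op = tensor.producer            # TraceOp that produced this TraceTensor
        if op is None or id(op) in seen:
            return                      # Graph input, or already visited.
        seen.add(id(op))
        for inp in op.inputs:           # Visit operands first...
            visit(inp)
        ops.append(op)                  # ...so ops end up in topological order.

    for out in outputs:
        visit(out)
    return ops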
Each trace operation corresponds one-to-one with an MLIR operation in the tensorrt dialect and has two responsibilities:
1. Implement the MLIR conversion logic.
2. Compute operation metadata, e.g. the number of outputs, rank inference, etc.
Info
The extra indirection of a “Trace” is required so we can infer ranks, data types, and devices for the frontend.
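A trace operation fulfilling both responsibilities might look roughly like this sketch; the class shape, method names, and tensorrt builder call are illustrative assumptions:

# Hypothetical trace operation; names are illustrative, not Tripy's internals.
class ResizeLinear(TraceOp):
    def infer_rank(self):
        # Responsibility 2: compute metadata (output rank matches input rank).
        self.outputs[0].rank = self.inputs[0].rank

    def to_mlir(self, operands):
        # Responsibility 1: emit the matching tensorrt-dialect operation.
        return [tensorrt.resize_linear(operands[0], scales=self.scales)]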
Backend¶
The backend uses Trace.to_mlir() to generate an MLIR program using the tensorrt dialect:
==== MLIR ====
module @ins_inp_outs_out_2 {
func.func @main(%arg0: tensor<2x2xf32> {tensorrt.shape_profile = #tensorrt.shape_profile<min = [2, 2], opt = [2, 2], max = [2, 2]>}) -> tensor<?x?xf32> {
%0 = tensorrt.resize_linear {coordinateTransformation = #tensorrt.resize_coordinate_transformation<kHALF_PIXEL>, scales = array<f32: 2.000000e+00, 2.000000e+00>, selectorForSinglePixel = #tensorrt.resize_selector<kFORMULA>} %arg0 : (tensor<2x2xf32>) -> tensor<?x?xf32>
return %0 : tensor<?x?xf32>
}
}
The program is compiled by MLIR-TRT to an MLIR-TRT executable, which is wrapped in an nvtripy.Executable.
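In pseudocode, the backend flow is roughly as follows. Only Trace.to_mlir() is named in this document; the compile call and wrapper construction are stand-ins for the real MLIR-TRT interface:

mlir_module = trace.to_mlir()                  # tensorrt-dialect MLIR, as shown above
mlir_trt_exe = compiler.compile(mlir_module)   # hypothetical MLIR-TRT compile call
executable = nvtripy.Executable(mlir_trt_exe)  # Pythonic wrapper returned to the user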
The MLIR-TRT executable interfaces with memrefs; the data in frontend tensors is stored as memrefs in the Constant operation.
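Because the data already lives in device memrefs, frontend tensors can be exchanged with other frameworks via DLPack, typically without copying. A sketch, assuming PyTorch with CUDA is available and reusing compiled_func from the example above:

import torch
import nvtripy as tp

inp = tp.Tensor(torch.full((2, 2), 1.0, device="cuda"))  # wraps existing GPU data
out = compiled_func(inp)
torch_out = torch.from_dlpack(out)  # exports the result's buffer to torch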
Building Better Errors¶
Frontend tensors store stack information in their corresponding trace tensors upon creation.
When generating MLIR operations, we encode each trace tensor's name in the operation's location attribute.
If the compiler reports an error, we map the error's location back to the user's code via the stack information stored in the trace tensor.
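A minimal sketch of this mechanism; the class shape and the lookup are illustrative, not the real implementation:

import traceback

class TraceTensor:  # simplified
    def __init__(self, name):
        self.name = name
        # Captured at tensor creation so errors can later point at user code.
        self.stack_info = traceback.extract_stack()

# When emitting MLIR, each op's location attribute encodes its trace tensor's
# name, e.g. loc("out"). On a compiler error at loc("out"), the name is looked
# up to recover stack_info, and the error is re-rendered against user source.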