Architecture

Overview

Tripy builds an MLIR program by tracing functional-style Python APIs. The program is then compiled and executed by MLIR-TRT.

%%{init: {'theme':'neutral'}}%%
graph LR
    subgraph "Tripy (Python)"
        subgraph "Frontend"
            A("Operations"):::frontend
        end
        subgraph "Trace"
            A --> B("Trace"):::trace
        end
        subgraph "Backend"
            B --> C("MLIR"):::backend
        end
    end
    subgraph "MLIR-TRT (C++)"
        C --> D("Compiler/Runtime"):::mlirtrt
    end
    classDef frontend fill:#87CEFA,stroke:#000,stroke-width:1px;
    classDef trace fill:#D8BFD8,stroke:#000,stroke-width:1px;
    classDef backend fill:#9ACD32,stroke:#000,stroke-width:1px;
    classDef mlirtrt fill:#CC4040,stroke:#000,stroke-width:1px;
  1. Frontend: Exposes functional-style operations for nvtripy.Tensors.

  2. Trace: A computation graph of TraceTensors and TraceOps that lowers to tensorrt-dialect MLIR.

  3. Backend: Interfaces with MLIR-TRT:

    • Compiler compiles tensorrt-dialect MLIR to an MLIR-TRT executable.

    • Executable wraps an MLIR-TRT executable in a Pythonic API.

Note

Frontend/Backend refer to the flow of execution, not what the user does/doesn’t see.

Public APIs are exposed by both the frontend (e.g. nvtripy.resize()) and backend (e.g. nvtripy.compile()).

The Stack By Example

Consider a simple example:

import nvtripy as tp


def scale_up(inp):
    out = tp.resize(inp, scales=(2, 2), mode="linear")
    out.name = "out"  # Setting name for IR readability
    return out


compiled_func = tp.compile(
    scale_up, args=[tp.InputInfo((2, 2), dtype=tp.float32)]
)

inp = tp.iota((2, 2), dtype=tp.float32)
out = compiled_func(inp)
Local Variables
>>> compiled_func
Executable(inp: nvtripy.Tensor) -> nvtripy.Tensor

>>> inp
tensor(
    [[0, 0],
     [1, 1]],
    dtype=float32, loc=gpu:0, shape=(2, 2))

>>> out
tensor(
    [[0, 0, 0, 0],
     [0.25, 0.25, 0.25, 0.25],
     [0.75, 0.75, 0.75, 0.75],
     [1, 1, 1, 1]],
    dtype=float32, loc=gpu:0, shape=(4, 4))

Frontend

The frontend exposes nvtripy.Tensor (wraps TraceTensor) and various operations, e.g. nvtripy.resize.

Info

Most operations are decorated with the following (see the sketch after this list):

  1. @export.public_api: Enables documentation, type checking, and overloading.

  2. @wrappers.interface: Enforces (and generates tests for) data type constraints.
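
For illustration, here is a rough sketch of how an operation definition might be decorated. The import paths and decorator parameters shown (dtype_constraints, dtype_variables, wrappers.RETURN_VALUE) are assumptions for illustration, not verified signatures:

# Hypothetical sketch; import paths and decorator arguments are assumed.
from nvtripy import export, wrappers

@export.public_api()  # documentation, type checking, and overloading
@wrappers.interface(  # enforces and generates tests for dtype constraints
    dtype_constraints={"input": "T1", wrappers.RETURN_VALUE: "T1"},
    dtype_variables={"T1": ["float32", "float16"]},
)
def resize(input, scales, mode="linear"):
    ...  # builds a Resize trace operation instead of executing immediately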

Operations are lazily evaluated. Calling them just builds up an implicit graph of TraceOps:

%%{init: {'theme':'neutral'}}%%
graph LR
    subgraph "'inp' Tensor"
        A(trace_tensor0)
    end
    subgraph "Operation"
        A --> B[Resize]
    end
    subgraph "'out' Tensor"
        B --> C(trace_tensor1)
    end

Note

To evaluate outputs, the graph must first be compiled:

  • In eager mode, this happens when a frontend tensor is used (printed, .eval()’d, or exported w/ DLPack).

  • In compiled mode, the user explicitly compiles a function or nvtripy.Module.
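
For example, in eager mode, evaluation is deferred until the tensor's contents are needed:

import nvtripy as tp

inp = tp.iota((2, 2), dtype=tp.float32)
out = tp.resize(inp, scales=(2, 2), mode="linear")  # only records a Resize trace op
print(out)  # using the tensor triggers compilation and execution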

Trace

To build the Trace, we walk backwards from the output(s) and accumulate operations:

==== Trace IR ====
def scale_up(
    inp : tensor<2x2xf32:gpu:0> : ShapeBounds(min=[2, 2], opt=[2, 2], max=[2, 2])
) -> (
    out : tensor<?x?xf32:gpu:0>
):
    out = resize_linear(inp : tensor<2x2xf32:gpu:0>, scales=(2, 2), align_corners=False) : tensor<?x?xf32:gpu:0>
    return out
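
Conceptually, the walk above is a post-order traversal starting from the outputs. A minimal, self-contained sketch, assuming each trace tensor records its producer op and each op records its input tensors (the attribute names here are hypothetical):

# Hypothetical sketch; `producer` and `inputs` are assumed attribute names.
def topological_ops(outputs):
    seen, ordered = set(), []

    def visit(tensor):
        op = tensor.producer  # assumed: None for graph inputs/constants
        if op is None or id(op) in seen:
            return
        seen.add(id(op))
        for inp in op.inputs:
            visit(inp)
        ordered.append(op)  # post-order: producers come before consumers

    for out in outputs:
        visit(out)
    return ordered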

Each trace operation corresponds one-to-one with an operation in the tensorrt MLIR dialect and has two responsibilities (sketched after this list):

  1. Implement MLIR conversion logic.

  2. Compute operation metadata, e.g. number of outputs, rank inference, etc.
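
A hypothetical trace operation might therefore look like the following; the base class, method names, and helper calls are assumptions for illustration, not the real internal API:

# Hypothetical sketch; TraceOp, infer_rank, and to_mlir are assumed names.
class ResizeOp(TraceOp):
    def infer_rank(self):
        # Metadata: resizing preserves the rank of the input.
        self.outputs[0].rank = self.inputs[0].rank

    def to_mlir(self, operands):
        # Conversion: emit the corresponding tensorrt-dialect operation.
        return [tensorrt.resize_linear(operands[0], scales=self.scales)]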

Info

The extra indirection of a “Trace” is required so we can infer ranks, data types, and devices for the frontend.
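
For instance, this inference is what lets a frontend tensor report its metadata before anything is executed (assuming, as the note above implies, that dtype and rank are populated at trace time):

out = tp.resize(tp.iota((2, 2), dtype=tp.float32), scales=(2, 2), mode="linear")
print(out.dtype)  # float32 -- known without evaluating
print(out.rank)   # 2 -- inferred through the trace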

Backend

The backend uses Trace.to_mlir() to generate an MLIR program using the tensorrt dialect:

==== MLIR ====
module @ins_inp_outs_out_2 {
  func.func @main(%arg0: tensor<2x2xf32> {tensorrt.shape_profile = #tensorrt.shape_profile<min = [2, 2], opt = [2, 2], max = [2, 2]>}) -> tensor<?x?xf32> {
    %0 = tensorrt.resize_linear {coordinateTransformation = #tensorrt.resize_coordinate_transformation<kHALF_PIXEL>, scales = array<f32: 2.000000e+00, 2.000000e+00>, selectorForSinglePixel = #tensorrt.resize_selector<kFORMULA>} %arg0 : (tensor<2x2xf32>) -> tensor<?x?xf32>
    return %0 : tensor<?x?xf32>
  }
}

The program is compiled by MLIR-TRT to an MLIR-TRT executable, which is wrapped in an nvtripy.Executable.

The MLIR-TRT executable interfaces with memrefs; data in frontend tensors is stored as memrefs in the Constant operation.
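
Because this storage is exposed via DLPack (as noted in the eager-mode discussion above), the output memory can be viewed zero-copy by other frameworks; a sketch assuming PyTorch is installed:

import torch  # assumed available; any DLPack-aware framework works

out = compiled_func(inp)            # output data lives in a GPU memref
torch_out = torch.from_dlpack(out)  # zero-copy view of the same memory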

Building Better Errors

  • Frontend tensors store stack information in their corresponding trace tensors upon creation.

  • When generating MLIR operations, we encode the trace tensor names in their location attributes.

  • If the compiler emits an error, we map its location back to the user’s code via the stack information in the trace tensor (sketched below).
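
A self-contained sketch of the idea; the attribute and function names are hypothetical, not Tripy's real internals:

# Hypothetical sketch of stack capture and error mapping.
import traceback

class TraceTensor:
    def __init__(self, name):
        self.name = name
        # Record where in user code this tensor was created.
        self.stack_info = traceback.extract_stack()

def map_error_to_user_code(error_location, trace_tensors):
    # error_location names a trace tensor (recovered from the MLIR
    # location attribute); map it back to the captured user frame.
    tensor = trace_tensors[error_location]
    frame = tensor.stack_info[-1]
    return f"{frame.filename}:{frame.lineno}"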