Architecture

Overview

Tripy builds an MLIR program by tracing functional-style Python APIs. The program is then compiled and executed by MLIR-TRT.

%%{init: {'theme':'neutral'}}%%
graph LR
    subgraph "Tripy (Python)"
        subgraph "Frontend"
            A("Operations"):::frontend
        end
        subgraph "Trace"
            A --> B("Trace"):::trace
        end
        subgraph "Backend"
            B --> C("MLIR"):::backend
        end
    end
    subgraph "MLIR-TRT (C++)"
        C --> D("Compiler/Runtime"):::mlirtrt
    end
    classDef frontend fill:#87CEFA,stroke:#000,stroke-width:1px;
    classDef trace fill:#D8BFD8,stroke:#000,stroke-width:1px;
    classDef backend fill:#9ACD32,stroke:#000,stroke-width:1px;
    classDef mlirtrt fill:#CC4040,stroke:#000,stroke-width:1px;
  1. Frontend: Exposes functional-style operations for nvtripy.Tensors.

  2. Trace: A computation graph of TraceTensors and TraceOps that lowers to tensorrt-dialect MLIR.

  3. Backend: Interfaces with MLIR-TRT:

    • Compiler compiles tensorrt-dialect MLIR to an MLIR-TRT executable.

    • Executable wraps an MLIR-TRT executable in a Pythonic API.

Note

Frontend/Backend refer to the flow of execution, not what the user does/doesn’t see.

Public APIs are exposed by both the frontend (e.g. nvtripy.resize()) and backend (e.g. nvtripy.compile()).

The Stack By Example

Consider a simple example:

import nvtripy as tp


def scale_up(inp):
    out = tp.resize(inp, scales=(2, 2), mode="linear")
    out.name = "out"  # Setting name for IR readability
    return out


compiled_func = tp.compile(
    scale_up, args=[tp.InputInfo((2, 2), dtype=tp.float32)]
)

inp = tp.iota((2, 2), dtype=tp.float32)
out = compiled_func(inp)
Local Variables
>>> compiled_func
Executable(inp: nvtripy.Tensor) -> nvtripy.Tensor

>>> inp
tensor(
    [[0, 0],
     [1, 1]],
    dtype=float32, loc=gpu:0, shape=(2, 2))

>>> out
tensor(
    [[0, 0, 0, 0],
     [0.25, 0.25, 0.25, 0.25],
     [0.75, 0.75, 0.75, 0.75],
     [1, 1, 1, 1]],
    dtype=float32, loc=gpu:0, shape=(4, 4))

Frontend

The frontend exposes nvtripy.Tensor (wraps TraceTensor) and various operations, e.g. nvtripy.resize.

Info

Most operations are decorated with the following (see the sketch after this list):

  1. @export.public_api: Enables documentation, type checking, and overloading.

  2. @wrappers.interface: Enforces (and generates tests for) data type constraints.
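
For illustration, here is a rough sketch of how an operation definition might be decorated. The import paths and decorator parameters shown (dtype_constraints, dtype_variables, wrappers.RETURN_VALUE) are assumptions for illustration, not verified signatures:

# Hypothetical sketch; import paths and decorator arguments are assumed.
from nvtripy import export, wrappers

@export.public_api()  # documentation, type checking, and overloading
@wrappers.interface(  # enforces and generates tests for dtype constraints
    dtype_constraints={"input": "T1", wrappers.RETURN_VALUE: "T1"},
    dtype_variables={"T1": ["float32", "float16"]},
)
def resize(input, scales, mode="linear"):
    ...  # builds a Resize trace operation instead of executing immediately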

Operations are lazily evaluated. Calling them just builds up an implicit graph of TraceOps:

%%{init: {'theme':'neutral'}}%%
graph LR
    subgraph "'inp' Tensor"
        A(trace_tensor0)
    end
    subgraph "Operation"
        A --> B[Resize]
    end
    subgraph "'out' Tensor"
        B --> C(trace_tensor1)
    end

Note

To evaluate outputs, the graph must first be compiled:

  • In eager mode, this happens when a frontend tensor is used (printed, .eval()’d, or exported w/ DLPack).

  • In compiled mode, the user explicitly compiles a function or nvtripy.Module.
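
For example, in eager mode, evaluation is deferred until the tensor's contents are needed:

import nvtripy as tp

inp = tp.iota((2, 2), dtype=tp.float32)
out = tp.resize(inp, scales=(2, 2), mode="linear")  # only records a Resize trace op
print(out)  # using the tensor triggers compilation and execution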

Trace

To build the Trace, we walk backwards from the output(s) and accumulate operations:

==== Trace IR ====
def scale_up(
    inp : tensor<2x2xf32:gpu:0> : ShapeBounds(min=[2, 2], opt=[2, 2], max=[2, 2])
) -> (
    out : tensor<?x?xf32:gpu:0>
):
    out = resize_linear(inp : tensor<2x2xf32:gpu:0>, scales=(2, 2), align_corners=False) : tensor<?x?xf32:gpu:0>
    return out
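
Conceptually, the walk above is a post-order traversal starting from the outputs. A minimal, self-contained sketch, assuming each trace tensor records its producer op and each op records its input tensors (the attribute names here are hypothetical):

# Hypothetical sketch; `producer` and `inputs` are assumed attribute names.
def topological_ops(outputs):
    seen, ordered = set(), []

    def visit(tensor):
        op = tensor.producer  # assumed: None for graph inputs/constants
        if op is None or id(op) in seen:
            return
        seen.add(id(op))
        for inp in op.inputs:
            visit(inp)
        ordered.append(op)  # post-order: producers come before consumers

    for out in outputs:
        visit(out)
    return ordered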

Each trace operation corresponds one-to-one with an operation in the tensorrt MLIR dialect and has two responsibilities (sketched after this list):

  1. Implement MLIR conversion logic.

  2. Compute operation metadata, e.g. number of outputs, rank inference, etc.
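
A hypothetical trace operation might therefore look like the following; the base class, method names, and helper calls are assumptions for illustration, not the real internal API:

# Hypothetical sketch; TraceOp, infer_rank, and to_mlir are assumed names.
class ResizeOp(TraceOp):
    def infer_rank(self):
        # Metadata: resizing preserves the rank of the input.
        self.outputs[0].rank = self.inputs[0].rank

    def to_mlir(self, operands):
        # Conversion: emit the corresponding tensorrt-dialect operation.
        return [tensorrt.resize_linear(operands[0], scales=self.scales)]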

Info

The extra indirection of a “Trace” is required so we can infer ranks, data types, and devices for the frontend.
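
For instance, this inference is what lets a frontend tensor report its metadata before anything is executed (assuming, as the note above implies, that dtype and rank are populated at trace time):

out = tp.resize(tp.iota((2, 2), dtype=tp.float32), scales=(2, 2), mode="linear")
print(out.dtype)  # float32 -- known without evaluating
print(out.rank)   # 2 -- inferred through the trace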

Backend

The backend uses Trace.to_mlir() to generate an MLIR program using the tensorrt dialect:

==== MLIR ====
module @ins_inp_outs_out_2 {
  func.func @main(%arg0: tensor<2x2xf32> {tensorrt.shape_profile = #tensorrt.shape_profile<min = [2, 2], opt = [2, 2], max = [2, 2]>}) -> tensor<?x?xf32> {
    %0 = tensorrt.resize_linear {coordinateTransformation = #tensorrt.resize_coordinate_transformation<kHALF_PIXEL>, scales = array<f32: 2.000000e+00, 2.000000e+00>, selectorForSinglePixel = #tensorrt.resize_selector<kFORMULA>} %arg0 : (tensor<2x2xf32>) -> tensor<?x?xf32>
    return %0 : tensor<?x?xf32>
  }
}

The program is compiled by MLIR-TRT to an MLIR-TRT executable, which is wrapped in an nvtripy.Executable.

The MLIR-TRT executable interfaces with memrefs; data in frontend tensors is stored as memrefs in the Constant operation.
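
Because this storage is exposed via DLPack (as noted in the eager-mode discussion above), the output memory can be viewed zero-copy by other frameworks; a sketch assuming PyTorch is installed:

import torch  # assumed available; any DLPack-aware framework works

out = compiled_func(inp)            # output data lives in a GPU memref
torch_out = torch.from_dlpack(out)  # zero-copy view of the same memory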

Building Better Errors

  • Frontend tensors store stack information in their corresponding trace tensors upon creation.

  • When generating MLIR operations, we encode the trace tensor names in their location attributes.

  • If the compiler emits an error, we map its location back to the user’s code via the stack information in the trace tensor (sketched below).
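
A self-contained sketch of the idea; the attribute and function names are hypothetical, not Tripy's real internals:

# Hypothetical sketch of stack capture and error mapping.
import traceback

class TraceTensor:
    def __init__(self, name):
        self.name = name
        # Record where in user code this tensor was created.
        self.stack_info = traceback.extract_stack()

def map_error_to_user_code(error_location, trace_tensors):
    # error_location names a trace tensor (recovered from the MLIR
    # location attribute); map it back to the captured user frame.
    tensor = trace_tensors[error_location]
    frame = tensor.stack_info[-1]
    return f"{frame.filename}:{frame.lineno}"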