Counters

Counter annotations describe quantities that change over time, such as memory usage, queue depth, bytes processed, or model training metrics.

Create each counter once with nvtx.Domain.get_counter() and keep a reference to the returned object. In performance-sensitive code, call only nvtx.Counter.sample() on that existing object.

Scalar Counters

Use int for signed 64-bit integer samples and float for double-precision floating point samples:

import nvtx
import torch


# Create the counter once and keep a reference to it.
domain = nvtx.get_domain("Example")
gpu_utilization_counter = domain.get_counter(
    "gpu utilization",
    int,
    description="Percent of time kernels were executing on the GPU",
)

# torch.cuda.utilization() returns an integer percentage.
gpu_utilization_counter.sample(torch.cuda.utilization())
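Python's int is arbitrary-precision, while an int counter stores signed 64-bit samples, so a value must lie between -2**63 and 2**63 - 1 to be representable. The following is a minimal range check under that assumption. The helper name fits_int64 is hypothetical and not part of nvtx, and how the library itself treats out-of-range values is not covered here.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def fits_int64(value: int) -> bool:
    """Return True if value is representable as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


# A utilization percentage easily fits; 2**63 is one past the largest value.
assert fits_int64(87)
assert not fits_int64(2**63)
```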

Counter Groups

To expose multiple fields as one counter sample, use a structured NumPy dtype. Each structured dtype represents one counter group:

import nvtx
import torch


domain = nvtx.get_domain("CUDA")
device = torch.cuda.current_device()

gpu_metrics_dtype = nvtx.numpy_dtype([
    ("gpu_utilization", int),
    ("memory_allocated", int),
    ("memory_reserved", int),
])

gpu_metrics_counter = domain.get_counter(
    "gpu metrics",
    gpu_metrics_dtype,
    description="GPU utilization and PyTorch CUDA memory usage",
)

gpu_metrics_counter.sample((
    torch.cuda.utilization(device),
    torch.cuda.memory_allocated(device),
    torch.cuda.memory_reserved(device),
))

Counter groups are flat: fields must be scalar dtypes. Fixed-size array fields and structured or nested fields are not supported.
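The flatness rule can be checked with plain NumPy before a counter is created. The sketch below assumes the group dtype is, or converts to, a standard numpy.dtype. The helper is_flat_group is hypothetical and not part of nvtx, and it inspects only field shape, not whether each scalar type is one a counter accepts.

```python
import numpy as np


def is_flat_group(dtype) -> bool:
    """Return True if dtype is structured and every field is a plain scalar."""
    dtype = np.dtype(dtype)
    if dtype.names is None:
        return False  # not a structured dtype at all
    for name in dtype.names:
        field_dtype = dtype.fields[name][0]
        if field_dtype.names is not None:
            return False  # nested structured field
        if field_dtype.subdtype is not None:
            return False  # fixed-size array field
    return True


flat = np.dtype([("gpu_utilization", np.int64), ("memory_allocated", np.int64)])
with_array = np.dtype([("per_device", np.int64, (4,))])
nested = np.dtype([("memory", [("allocated", np.int64), ("reserved", np.int64)])])

assert is_flat_group(flat)
assert not is_flat_group(with_array)
assert not is_flat_group(nested)
```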

Avoiding Copies

Counter samples and groups follow the general guidance in "Pass data in its native form": pass native Python values as-is. When you assemble a batch yourself for nvtx.Counter.batch_submit(), prefer a C-contiguous NumPy array that matches the counter’s dtype.
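As a rough illustration of what "C-contiguous and matching the dtype" means in NumPy terms, the sketch below builds a batch of records and checks both properties. It uses a plain numpy.dtype as a stand-in for the counter's dtype, and the helper matches_counter_layout is hypothetical, not an nvtx function.

```python
import numpy as np

# Stand-in for the counter's dtype; with nvtx this would be the dtype
# passed to get_counter().
metrics_dtype = np.dtype([
    ("gpu_utilization", np.int64),
    ("memory_allocated", np.int64),
    ("memory_reserved", np.int64),
])


def matches_counter_layout(batch, dtype) -> bool:
    """Return True if batch is a C-contiguous ndarray of exactly this dtype."""
    return (
        isinstance(batch, np.ndarray)
        and batch.dtype == dtype
        and batch.flags.c_contiguous
    )


rows = [(40, 1024, 2048), (55, 1536, 2048), (60, 1536, 4096), (35, 512, 4096)]
batch = np.array(rows, dtype=metrics_dtype)  # one record per sample
strided = batch[::2]                         # view that skips every other record

assert matches_counter_layout(batch, metrics_dtype)
assert not matches_counter_layout(strided, metrics_dtype)
# ascontiguousarray makes one explicit copy into a compact layout.
assert matches_counter_layout(np.ascontiguousarray(strided), metrics_dtype)
```

A strided view such as batch[::2] has the right dtype but leaves gaps between records, so compacting it once up front keeps that copy explicit and under your control.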

Counter Semantics

Use nvtx.CounterSemantics to describe how counter values should be interpreted, such as their units, bounds, value type, and interpolation. For top-level scalar counters, pass nvtx.CounterSemantics to nvtx.Domain.get_counter():

gpu_utilization_counter = domain.get_counter(
    "gpu utilization",
    int,
    semantics=nvtx.CounterSemantics(unit="percent", min=0, max=100),
)

For per-field semantics in a counter group, build the field dtype with nvtx.numpy_dtype():

percent_dtype = nvtx.numpy_dtype(
    int,
    counter_semantics=nvtx.CounterSemantics(unit="percent", min=0, max=100),
)
bytes_dtype = nvtx.numpy_dtype(
    int,
    counter_semantics=nvtx.CounterSemantics(unit="bytes", min=0),
)
gpu_metrics_dtype = nvtx.numpy_dtype([
    ("gpu_utilization", percent_dtype),
    ("memory_allocated", bytes_dtype),
    ("memory_reserved", bytes_dtype),
])

Batched Samples

Batched samples can reduce overhead when metrics are produced in a hot path but do not need to be submitted immediately. Store the metric values and timestamps with minimal work in the hot path, then use nvtx.Counter.batch_submit() to submit them together. Use nvtx.Domain.get_timestamp() to get NVTX timestamps:

loss_counter = domain.get_counter("training loss", float)
losses = []
timestamps = []

for inputs, targets in dataloader:
    optimizer.zero_grad(set_to_none=True)
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

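    # Keep the detached tensor; calling .item() here would block on a
    # device-to-host copy every step when training on a GPU.
    losses.append(loss.detach())
    timestamps.append(domain.get_timestamp())

# Outside the hot path: convert to Python floats and submit everything at once.
loss_samples = [loss.item() for loss in losses]
loss_counter.batch_submit(loss_samples, timestamps)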

The default time_domain matches timestamps returned by nvtx.Domain.get_timestamp(). If timestamps come from another clock, create the counter with the matching nvtx.TimestampType so the batch declares the clock domain used by its timestamps.
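The deferred pattern above can be wrapped in a small helper so the hot path stays limited to appends and a timestamp read. This is a minimal sketch, not part of nvtx: the class DeferredSamples and its methods are hypothetical, and it assumes only that the counter exposes batch_submit(values, timestamps) as used in the example above.

```python
class DeferredSamples:
    """Collect (value, timestamp) pairs cheaply and submit them in one batch."""

    def __init__(self, counter, get_timestamp):
        self._counter = counter              # object exposing batch_submit()
        self._get_timestamp = get_timestamp  # e.g. domain.get_timestamp
        self._values = []
        self._timestamps = []

    def record(self, value):
        # Hot path: two list appends and one timestamp read, nothing else.
        self._values.append(value)
        self._timestamps.append(self._get_timestamp())

    def flush(self, convert=lambda v: v):
        """Convert pending values, submit them together, and return the count."""
        count = len(self._values)
        if count == 0:
            return 0
        values = [convert(v) for v in self._values]
        self._counter.batch_submit(values, self._timestamps)
        # Start fresh lists so the submitted ones are never mutated afterwards.
        self._values, self._timestamps = [], []
        return count
```

In the training loop this would look like pending = DeferredSamples(loss_counter, domain.get_timestamp), then pending.record(loss.detach()) on each step, and pending.flush(convert=lambda t: t.item()) once the loop finishes.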

No-Value Samples

Use nvtx.Counter.sample_no_value() when a sample is known to be zero, unchanged, or unavailable, but no explicit value should be submitted:

gpu_utilization_counter.sample_no_value(nvtx.CounterNoValueReason.UNAVAILABLE)
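One way to combine regular and no-value samples is to fall back to a no-value sample whenever a reading cannot be obtained. The helper below is a hypothetical sketch, not part of nvtx. It assumes the metric source signals failure by raising RuntimeError (adjust the exception type to your source), and it takes the no-value reason from the caller.

```python
def sample_or_no_value(counter, read_value, unavailable_reason):
    """Sample the counter, or record a no-value sample if the read fails.

    Returns True when a real value was sampled.
    """
    try:
        value = read_value()
    except RuntimeError:
        # The metric source failed: mark the gap rather than submit a guess.
        counter.sample_no_value(unavailable_reason)
        return False
    counter.sample(value)
    return True
```

With the counter defined earlier, a call might look like sample_or_no_value(gpu_utilization_counter, torch.cuda.utilization, nvtx.CounterNoValueReason.UNAVAILABLE).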