Interactive-drive latency tuning#
interactive-drive latency has two different components:
Model / chunk latency is the time spent preparing HDMap conditioning, running the OmniDreams DiT, decoding the generated chunk, and updating model state. This is usually the dominant input-to-visual delay.
Video-transport latency is the time spent delivering already-generated frames to a local window or browser. Switching from MJPEG to WebRTC can reduce this delivery cost, but it does not make the model generate a chunk faster.
Tune the model path first when the profiler shows per-chunk work dominating. Tune transport when generated frames are ready quickly but arrive late or unevenly in the viewer.
Model and backend choice#
Use the OmniDreams world-model backend for latency work. The raster backend is useful for scene, control, and presenter debugging, but it does not exercise the model path and should not be used as a model-latency reference.
The packaged interactive-drive manifests are the supported starting points:
example_world_model.yamlis the default single-view configuration at1280 x 704(width x height), 30 FPS, 8 generated frames per steady-state block, LightVAE enabled, and native DiT acceleration disabled.example_world_model_perf.yamlis the perf-tuned manifest. It lowers the default resolution to1168 x 640, keeps 30 FPS and 8-frame steady-state blocks, enables the performance recipe, and requires the native DiT path.
Run the perf manifest only on hosts that can build and load the native extension:
uv run --package flashdreams-omnidreams omnidreams-prepare --perf
uv run --package flashdreams-omnidreams interactive-drive \
--manifest example_world_model_perf.yaml
native_dit_acceleration: required is intentional. If the native extension is
not available, startup fails instead of silently falling back to the slower
PyTorch path.
Resolution#
Resolution is one of the highest-impact latency knobs because it changes the amount of HDMap, DiT, and VAE work per chunk. Set it in the world-model manifest:
resolution_wh: [1168, 640]
Both values are [width, height] and must be positive multiples of 16. The
bundled manifests list tested aspect-compatible examples:
[1280, 704][1168, 640][1024, 560][896, 496][640, 352]
Lowering resolution reduces per-chunk compute and transport payload size, with the expected image-quality tradeoff. The raster presenter automatically adopts the manifest resolution for the world-model backend.
Chunk size constraints#
Do not treat chunk size as an arbitrary latency knob. The interactive-drive adapter validates the FlashDreams pipeline at startup:
The initial conditioning chunk is fixed at 5 frames.
The public LightVAE single-view recipe used by the bundled manifests supports 8-frame steady-state chunks.
Full-VAE single-view recipes support 8- or 12-frame chunks only when the matching checkpoint is available.
The pixel-shuffle single-view branch requires 16-frame chunks and
local_attn_size: 8. It is not the published interactive-drive tuning path.
At 30 FPS, an 8-frame steady-state chunk covers about 267 ms of generated video. Reducing video-transport latency cannot remove this model-side chunk granularity.
FP8 and native acceleration#
The perf manifest uses the OmniDreams single-view native CUDA extension for the DiT path:
native_dit_acceleration: required
native_dit_backend: fp8_kvcache_cudnn
native_dit_attention_backend: cudnn
Supported manifest values are native_dit_acceleration: disabled | auto |
required and native_dit_backend: fp8_kvcache_cudnn | bf16. The attention
backend accepts auto, cudnn, sparge, sage3, and sage3_fp8;
the bundled perf manifest pins cudnn.
The native extension requires a source checkout, git, a CUDA toolchain
(nvcc) matching the PyTorch build, synced third-party sources from
omnidreams-prepare --perf, and a Blackwell-class GPU (SM 12.0) or newer.
The extension builds for 12.0a by default. Use this path on Blackwell and
GB300 systems.
H100 / Hopper systems should use the standard PyTorch CUDA path with native DiT disabled unless you are deliberately maintaining a compatible native build. That path is supported, but it is not the same perf path as the published GB300 numbers.
The manifest also exposes an optional native LightVAE FP8 encoder:
native_vae_encoder: fp8
It is disabled in the bundled perf manifest. Enabling it requires
OMNIDREAMS_LIGHTVAE_FP8_STATE_PATH or native_vae_fp8_state_path pointing
to a calibrated LightVAE FP8 state.
Transport choice#
Pick transport based on where the viewer runs:
Local Vulkan window: lowest-overhead local presentation when the host has a graphics-capable GPU and display stack.
--stream-mjpeg [HOST:]PORT: simple browser delivery from the same process. Use it on compute-only hosts such as GB300 systems without a graphics queue, or when a laptop browser views a remote model host.omnidreams.webrtc.server: richer browser frontend with WebRTC’s lower video-delivery latency and streaming gRPC service support. Prefer this for product-style remote viewing or multi-client integration.
MJPEG and WebRTC affect video delivery after a frame exists. If the model is still spending most of the time inside each chunk, use the perf manifest, resolution, and native-acceleration knobs first.
Profiling and validated reference#
Use --profile-world-model to enable FlashDreams CUDA-event profiling for the
world-model runtime. Use --sync-gpu-timing only when you need raster compute
timings; it synchronizes GPU work and is not a throughput setting.
The validated published reference for interactive-drive latency is the
single-view GB300 table from NVIDIA OmniDreams, measured at
1280 x 704 resolution:
Stage |
1x GPU |
2x GPU |
4x GPU |
8x GPU |
|---|---|---|---|---|
HDMap Encoder |
28 ms |
26 ms |
26 ms |
26 ms |
Diffusion DiT |
84 ms |
71 ms |
49 ms |
47 ms |
VAE Decoder |
6 ms |
5 ms |
5 ms |
5 ms |
KV-cache Update |
42 ms |
34 ms |
23 ms |
22 ms |
Total |
118 ms |
102 ms |
80 ms |
78 ms |
Effective FPS |
68 |
78 |
100 |
103 |
KV-cache update is off the hot path and excluded from the total. This guide consolidates the supported latency controls and existing published measurement; it does not add new end-to-end hardware benchmarking for Hopper, H100, Blackwell, or GB300 systems.