Benchmarks#
Benchmarks
Per-step latency at steady state for the recipes we profile, measured on H100, GB200, and GB300, and compared against an existing implementation of the same model on the same hardware and checkpoint.
Results — autoregressive#
Per-AR-step latency at steady state, across the three devices we profile against. Each cell is the median of the AR-2-onward window (warm-up steps dropped, steady-state CUDA graph captured) of a single run on that device. Lower is better.
Recipe (runner slug) |
H100 (ms) |
GB200 (ms) |
GB300 (ms) |
Source |
|---|---|---|---|---|
|
344 |
203 |
171 |
|
|
629 |
449 |
394 |
|
Results — bidirectional#
Bidirectional recipes are treated as a single large-windowed causal rollout in FlashDreams (see each integration’s README for the full statement). Reporting per-step ms lines up the bidirectional reference with the streaming variants above. Measured on a single GPU.
Recipe (runner slug) |
H100 (ms) |
GB200 (ms) |
GB300 (ms) |
|---|---|---|---|
|
1040 |
441 |
382 |
Per-device profile charts
The interactive per-device profiling charts on each model page
(e.g. Self-Forcing, LingBot-World,
Wan2.1) read from the same per-recipe markdown files
under _static/performance/ as the tables above.
Results — super-resolution#
FlashVSR is a streaming video super-resolution recipe, so its latency
scales with output resolution rather than autoregressive step count and
is reported per scene. Measured on GB200, FlashDreams against the
official FlashVSR runner; source
_static/performance/flashvsr/perf-0527.md. Lower is better.
Scene (resolution) |
FlashDreams (ms) |
Official (ms) |
|---|---|---|
384×384 |
74.7 |
130.5 |
672×384 |
114.5 |
235.7 |
384×672 |
111.2 |
181.3 |
640×480 |
132.5 |
255.7 |
768×416 |
135.5 |
215.9 |
1280×704 |
372.5 |
528.2 |
Versus upstream#
The FlashDreams runner calls the same model code paths as the upstream
library it integrates with, but in a different inference environment:
KV caches managed by flashdreams.infra, ring attention provided by
flashdreams.core, and a CUDA graph captured per recipe. For an
apples-to-apples comparison, both sides are forced to the cuDNN
attention backend under matched runtime settings. Lower is better; a
ratio greater than 1 means FlashDreams is faster.
Recipe (runner slug) |
GPU |
Upstream (ms) |
FlashDreams (ms) |
Ratio |
Baseline |
|---|---|---|---|---|---|
|
H100 |
432 |
344 |
1.26× |
Official Self-Forcing runner |
|
GB200 |
350 |
203 |
1.72× |
Official Self-Forcing runner |
|
GB300 |
251 |
171 |
1.47× |
Official Self-Forcing runner |
|
GB300 |
362 |
171 |
2.12× |
FastVideo runner (landing-page hero) |
|
H100 |
1950 |
629 |
3.10× |
Official LingBot-World runner |
|
GB200 |
1113 |
449 |
2.48× |
Official LingBot-World runner |
|
GB300 |
534 |
382 |
1.40× |
FastVideo runner (landing-page hero) |
|
H100 |
1290 |
1040 |
1.24× |
FastVideo runner |
Supported models#
The benchmarks above cover the recipes currently in the profiled corpus. Streaming and autoregressive recipes emit per-AR-step output and target sub-second steady-state step latency once the CUDA graph is captured; bidirectional recipes emit one end-to-end output per invocation and serve as the parity reference for the streaming variants. Each tile links to the recipe’s page, which carries the canonical invocation, the checkpoint source, and the per-recipe knobs.
Streaming and autoregressive
Streaming Wan 2.1 T2V via the Self-Forcing plugin. AR steps after warmup are sub-second on H100 / GB200.
Causal-forcing framewise T2V and I2V variants of Wan 2.1 via the Causal-Forcing plugin.
FastVideo Wan 2.2 14B causal T2V recipe.
Camera-controlled I2V with bundled prompt, first-frame, and camera arrays.
Single-view and multi-view streaming recipes against the OmniDreams checkpoints, including a diffusion-forcing AR variant.
Streaming video super-resolution for the FlashVSR checkpoint family.
Bidirectional reference
Bidirectional Wan 2.1 — T2V 1.3B / 480p and I2V 14B / 480p.
The parity baseline for self-forcing and
causal-forcing recipes.
Bidirectional Cosmos-Predict2 recipes (T2V / I2V, 2B).
What we measure#
The reported metric is steady-state per-step latency: the
wall-clock of one autoregressive step once past AR step 2 and the
steady-state CUDA graph is captured. This is the total(w/o
finalize) value the inference pipeline logs each step. Time to first
frame is not reported, and quality metrics (FVD, CLIP-T) are out of
scope here — they are tracked by each recipe’s training pipeline. The
benchmark only verifies that the inference path holds tolerance-bounded
parity with the upstream reference, which is enforced per recipe (see
A note on parity below).
Methodology#
Each row is produced by driving a recipe end-to-end and parsing the
per-step log lines: drop the first two AR steps as warm-up, then take
the median of the remaining total(w/o finalize) values.
uv run flashdreams-run \
self-forcing-wan2.1-t2v-1.3b-taehv \
--total-blocks 7 \
2>&1 | tee /tmp/bench-self-forcing.log
--total-blocks is defined on the streaming-runner subclasses
(self_forcing, causal_forcing, fastvideo_causal_wan22,
lingbot, omnidreams); bidirectional runners (wan21-*,
cosmos2-*) drop the flag and emit a single end-to-end output.
For multi-GPU recipes, launch the same command under torchrun; the
recipe transformer auto-detects its context-parallel size from the
launcher’s world group (LingBot-World uses Ulysses sequence
parallelism across 4 GPUs):
uv run torchrun --nproc_per_node=4 --no-python \
flashdreams-run lingbot-world-fast --total-blocks 21
The upstream baseline in Versus upstream runs the same checkpoint
under the upstream library’s own runner; per-integration instructions
live under integrations/<name>/tests/parity_check/.
A note on parity
Six integrations ship a parity check against their upstream
reference under integrations/<name>/tests/parity_check/run.sh:
self_forcing, lingbot, wan21, cosmos_predict2,
flashvsr, and hy_worldplay (intended for manual execution
in the upstream environment, not in CI). The numbers on this page
assume parity holds where it is enforced; see
Community for how to escalate a regression rather
than averaging away a discrepancy.