Benchmarks

Per-step latency at steady state for the recipes we profile, measured on H100, GB200, and GB300, and compared against an existing implementation of the same model on the same hardware and checkpoint.

Results: autoregressive

Per-AR-step latency at steady state, across the three devices we profile against. Each cell is the median of the AR-2-onward window (warm-up steps dropped, steady-state CUDA graph captured) of a single run on that device. Lower is better.

Recipe (runner slug)	H100 (ms)	GB200 (ms)	GB300 (ms)	Source
`self-forcing-wan2.1-t2v-1.3b-taehv`	344	203	171	`_static/performance/self_forcing/perf-0521.md`
`lingbot-world-fast` (4×GPU, Ulysses)	629	449	394	`_static/performance/lingbot_world/perf-0521.md`

Results: bidirectional

Bidirectional recipes are treated as a single large-windowed causal rollout in FlashDreams (see each integration’s README for the full statement). Reporting latency in milliseconds per step makes the bidirectional reference directly comparable with the streaming variants above. Measured on a single GPU.

Recipe (runner slug)	H100 (ms)	GB200 (ms)	GB300 (ms)
`wan21-t2v-1.3b-480p`	1040	441	382

Per-device profile charts

The interactive per-device profiling charts on each model page (e.g. Self-Forcing, LingBot-World, Wan2.1) read from the same per-recipe markdown files under _static/performance/ as the tables above.

Results: super-resolution

FlashVSR is a streaming video super-resolution recipe, so its latency scales with output resolution rather than autoregressive step count and is reported per scene. Measured on GB200, with FlashDreams compared against the official FlashVSR runner; the source data is `_static/performance/flashvsr/perf-0527.md`. Lower is better.

Scene (resolution)	FlashDreams (ms)	Official (ms)
384×384	74.7	130.5
672×384	114.5	235.7
384×672	111.2	181.3
640×480	132.5	255.7
768×416	135.5	215.9
1280×704	372.5	528.2
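
To make the resolution scaling easier to see, the rows above can be normalized by pixel count. This is a minimal sketch that assumes the listed scene resolution is the pixel count that matters. The helper names `megapixels` and `ms_per_megapixel` are illustrative and not part of FlashDreams.

```python
# (width, height, FlashDreams ms, official ms), copied from the table above.
SCENES = [
    (384, 384, 74.7, 130.5),
    (672, 384, 114.5, 235.7),
    (384, 672, 111.2, 181.3),
    (640, 480, 132.5, 255.7),
    (768, 416, 135.5, 215.9),
    (1280, 704, 372.5, 528.2),
]


def megapixels(width: int, height: int) -> float:
    """Pixel count of one frame, in millions of pixels."""
    return width * height / 1_000_000


def ms_per_megapixel(latency_ms: float, width: int, height: int) -> float:
    """Latency normalized by frame size (hypothetical helper, for comparison only)."""
    return latency_ms / megapixels(width, height)


for width, height, flashdreams_ms, official_ms in SCENES:
    print(
        f"{width}x{height}: {megapixels(width, height):.3f} MP, "
        f"FlashDreams {ms_per_megapixel(flashdreams_ms, width, height):.0f} ms/MP, "
        f"official {ms_per_megapixel(official_ms, width, height):.0f} ms/MP"
    )
```

Normalizing this way only compares rows against each other; it says nothing about what drives the cost inside either runner.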

Versus upstream

The FlashDreams runner calls the same model code paths as the upstream library it integrates with, but in a different inference environment: KV caches managed by `flashdreams.infra`, ring attention provided by `flashdreams.core`, and a CUDA graph captured per recipe. For an apples-to-apples comparison, both sides are forced to the cuDNN attention backend under matched runtime settings. Lower latency is better. The ratio is upstream latency divided by FlashDreams latency, so a value greater than 1 means FlashDreams is faster.

Recipe (runner slug)	GPU	Upstream (ms)	FlashDreams (ms)	Ratio	Baseline
`self-forcing-wan2.1-t2v-1.3b-taehv`	H100	432	344	1.26×	Official Self-Forcing runner
`self-forcing-wan2.1-t2v-1.3b-taehv`	GB200	350	203	1.72×	Official Self-Forcing runner
`self-forcing-wan2.1-t2v-1.3b-taehv`	GB300	251	171	1.47×	Official Self-Forcing runner
`self-forcing-wan2.1-t2v-1.3b-taehv`	GB300	362	171	2.12×	FastVideo runner (landing-page hero)
`lingbot-world-fast` (4×GPU)	H100	1950	629	3.10×	Official LingBot-World runner
`lingbot-world-fast` (4×GPU)	GB200	1113	449	2.48×	Official LingBot-World runner
`wan21-t2v-1.3b-480p`	GB300	534	382	1.40×	FastVideo runner (landing-page hero)
`wan21-t2v-1.3b-480p`	H100	1290	1040	1.24×	FastVideo runner
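
The Ratio column can be recomputed from the two latency columns. The sketch below assumes the ratio is upstream latency divided by FlashDreams latency, rounded to two decimals, which matches the published values. The `speedup` helper is illustrative only.

```python
# (recipe, GPU, upstream ms, FlashDreams ms, baseline), copied from the table above.
ROWS = [
    ("self-forcing-wan2.1-t2v-1.3b-taehv", "H100", 432, 344, "Official Self-Forcing runner"),
    ("self-forcing-wan2.1-t2v-1.3b-taehv", "GB200", 350, 203, "Official Self-Forcing runner"),
    ("self-forcing-wan2.1-t2v-1.3b-taehv", "GB300", 251, 171, "Official Self-Forcing runner"),
    ("self-forcing-wan2.1-t2v-1.3b-taehv", "GB300", 362, 171, "FastVideo runner"),
    ("lingbot-world-fast", "H100", 1950, 629, "Official LingBot-World runner"),
    ("lingbot-world-fast", "GB200", 1113, 449, "Official LingBot-World runner"),
    ("wan21-t2v-1.3b-480p", "GB300", 534, 382, "FastVideo runner"),
    ("wan21-t2v-1.3b-480p", "H100", 1290, 1040, "FastVideo runner"),
]


def speedup(upstream_ms: float, flashdreams_ms: float) -> float:
    """Upstream latency over FlashDreams latency; above 1 means FlashDreams is faster."""
    return upstream_ms / flashdreams_ms


for recipe, gpu, upstream_ms, flashdreams_ms, baseline in ROWS:
    ratio = speedup(upstream_ms, flashdreams_ms)
    print(f"{recipe:<36} {gpu:<6} {ratio:.2f}x  vs {baseline}")
```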

Supported models

The benchmarks above cover the recipes currently in the profiled corpus. Streaming and autoregressive recipes emit per-AR-step output and target sub-second steady-state step latency once the CUDA graph is captured; bidirectional recipes emit one end-to-end output per invocation and serve as the parity reference for the streaming variants. Each tile links to the recipe’s page, which carries the canonical invocation, the checkpoint source, and the per-recipe knobs.

Streaming and autoregressive

Self-Forcing

Streaming Wan 2.1 T2V via the Self-Forcing plugin. AR steps after warm-up are sub-second on H100 / GB200.

Causal-Forcing

Causal-forcing framewise T2V and I2V variants of Wan 2.1 via the Causal-Forcing plugin.

Causal Wan 2.2

FastVideo Wan 2.2 14B causal T2V recipe.

LingBot-World

Camera-controlled I2V with bundled prompt, first-frame, and camera arrays.

OmniDreams

Single-view and multi-view streaming recipes against the OmniDreams checkpoints, including a diffusion-forcing AR variant.

FlashVSR

Streaming video super-resolution for the FlashVSR checkpoint family.

Bidirectional reference

Wan 2.1

Bidirectional Wan 2.1 — T2V 1.3B / 480p and I2V 14B / 480p. The parity baseline for self-forcing and causal-forcing recipes.

Cosmos-Predict2.5

Bidirectional Cosmos-Predict2.5 recipes (T2V / I2V, 2B).

What we measure

The reported metric is steady-state per-step latency: the wall-clock time of one autoregressive step, measured after the first two warm-up AR steps and once the steady-state CUDA graph has been captured. This is the `total(w/o finalize)` value the inference pipeline logs each step. Time to first frame is not reported, and quality metrics (FVD, CLIP-T) are out of scope here; they are tracked by each recipe’s training pipeline. The benchmark only verifies that the inference path holds tolerance-bounded parity with the upstream reference, which is enforced per recipe (see A note on parity below).

Methodology

Each row is produced by driving a recipe end-to-end and parsing the per-step log lines: drop the first two AR steps as warm-up, then take the median of the remaining total(w/o finalize) values.

uv run flashdreams-run \
    self-forcing-wan2.1-t2v-1.3b-taehv \
    --total-blocks 7 \
    2>&1 | tee /tmp/bench-self-forcing.log

--total-blocks is defined on the streaming-runner subclasses (self_forcing, causal_forcing, fastvideo_causal_wan22, lingbot, omnidreams); bidirectional runners (wan21-*, cosmos2-*) drop the flag and emit a single end-to-end output.
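
The reduction from a captured log to a single table cell can be sketched as follows. Only the `total(w/o finalize)` label comes from this page. The log-line layout, the regular expression, and the sample values are assumptions made for illustration, so adjust the pattern to whatever the pipeline actually prints.

```python
import re
from statistics import median

# ASSUMPTION: this log-line layout is hypothetical; only the
# "total(w/o finalize)" label is taken from the documentation.
STEP_PATTERN = re.compile(r"total\(w/o finalize\)\D*?(\d+(?:\.\d+)?)")

WARMUP_STEPS = 2  # the first two AR steps are dropped as warm-up


def steady_state_median_ms(log_text: str, warmup: int = WARMUP_STEPS) -> float:
    """Median of the per-step latencies that remain after the warm-up steps."""
    values = [float(m.group(1)) for m in STEP_PATTERN.finditer(log_text)]
    steady = values[warmup:]
    if not steady:
        raise ValueError("no steady-state steps found in the log")
    return median(steady)


# Made-up sample log, for illustration only.
sample_log = """\
step 0 total(w/o finalize): 2150.0 ms
step 1 total(w/o finalize): 610.5 ms
step 2 total(w/o finalize): 345.0 ms
step 3 total(w/o finalize): 343.0 ms
step 4 total(w/o finalize): 344.0 ms
"""
print(steady_state_median_ms(sample_log))  # prints 344.0
```

The median is used rather than the mean so that an occasional slow step in the steady-state window does not shift the reported value.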

For multi-GPU recipes, launch the same command under torchrun; the recipe transformer auto-detects its context-parallel size from the launcher’s world group (LingBot-World uses Ulysses sequence parallelism across 4 GPUs):

uv run torchrun --nproc_per_node=4 --no-python \
    flashdreams-run lingbot-world-fast --total-blocks 21
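
As a rough illustration of what deriving a context-parallel size from the launcher can look like, the sketch below reads the `WORLD_SIZE` environment variable that torchrun exports to each worker. This is not the FlashDreams implementation: the function names and the head count are hypothetical, and the real transformer may query the process group directly instead.

```python
import os
from typing import Mapping, Optional


def context_parallel_size(env: Optional[Mapping[str, str]] = None) -> int:
    """World size exported by torchrun; falls back to 1 for a single process."""
    env = os.environ if env is None else env
    return int(env.get("WORLD_SIZE", "1"))


def heads_per_rank(num_heads: int, cp_size: int) -> int:
    """Ulysses-style sequence parallelism shards attention heads across ranks,
    so the head count has to divide evenly by the parallel size."""
    if cp_size < 1:
        raise ValueError("context-parallel size must be at least 1")
    if num_heads % cp_size != 0:
        raise ValueError(
            f"{num_heads} heads cannot be split evenly across {cp_size} ranks"
        )
    return num_heads // cp_size


# Hypothetical example: 12 attention heads under a 4-process launch.
cp_size = context_parallel_size({"WORLD_SIZE": "4"})
print(cp_size, heads_per_rank(12, cp_size))  # prints: 4 3
```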

The upstream baseline in Versus upstream runs the same checkpoint under the upstream library’s own runner; per-integration instructions live under integrations/<name>/tests/parity_check/.

A note on parity

Six integrations ship a parity check against their upstream reference under integrations/<name>/tests/parity_check/run.sh: self_forcing, lingbot, wan21, cosmos_predict2, flashvsr, and hy_worldplay (intended for manual execution in the upstream environment, not in CI). The numbers on this page assume parity holds where it is enforced; see Community for how to escalate a regression rather than averaging away a discrepancy.
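
For readers unfamiliar with tolerance-bounded parity, the idea can be sketched as an element-wise comparison of the FlashDreams output against the upstream reference. The tolerances and function name below are hypothetical; each integration's `run.sh` defines its own check, which this sketch does not reproduce.

```python
from typing import Sequence


def within_tolerance(
    reference: Sequence[float],
    candidate: Sequence[float],
    atol: float = 1e-3,  # hypothetical absolute tolerance
    rtol: float = 1e-3,  # hypothetical relative tolerance
) -> bool:
    """True when every element satisfies |candidate - reference| <= atol + rtol * |reference|."""
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")
    return all(
        abs(c - r) <= atol + rtol * abs(r)
        for r, c in zip(reference, candidate)
    )


# Made-up values, for illustration only.
upstream_output = [1.0, 2.0, -0.5]
flashdreams_output = [1.0005, 2.001, -0.5002]
print(within_tolerance(upstream_output, flashdreams_output))  # prints True
```

A check of this shape fails loudly when a single element drifts beyond the bound, which is the behavior wanted when the alternative is averaging a discrepancy away.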