Models

FlashDreams runs a growing family of world and video models (text-to-video, image-to-video, camera-controlled, and super-resolution), all through one consistent command-line and Python interface. Browse the models below, pick the one that fits what you want to make, and follow its card through to the full recipe.

Available models

The models come in three flavors. Streaming and autoregressive generation recipes build a video step by step and stay fast once warmed up, aiming for sub-second latency per step; bidirectional recipes produce a clip in a single pass and serve as the quality reference for their streaming counterparts; and super-resolution recipes upscale existing frames in chunks, so their latency scales with output resolution rather than step count. Each card links to that recipe’s page, where you’ll find the exact command to run it, the checkpoint it uses, and the settings you can tune.

Streaming and autoregressive generation

OmniDreams

Single-view and multi-view streaming recipes against the OmniDreams checkpoints, including a diffusion-forcing AR variant.

NVIDIA OmniDreams

Self-Forcing

Streaming Wan 2.1 T2V via the Self-Forcing plugin. AR steps after warmup are sub-second on H100 / GB200.

Self-Forcing

Causal-Forcing

Causal-forcing framewise T2V and I2V variants of Wan 2.1 via the Causal-Forcing plugin.

Causal-Forcing

Causal Wan 2.2

FastVideo Wan 2.2 14B causal T2V recipe.

Causal Wan2.2

LingBot-World

Camera-controlled I2V with bundled prompt, first-frame, and camera arrays.

LingBot-World

Bidirectional video generation

Wan 2.1

Bidirectional Wan 2.1: T2V 1.3B / 480p and I2V 14B / 480p. The parity baseline for self-forcing and causal-forcing recipes.

Wan2.1

Cosmos-Predict2.5

Bidirectional Cosmos-Predict2.5 recipes (T2V / I2V, 2B).

Cosmos-Predict2.5

Super-resolution

FlashVSR

Streaming video super-resolution for the FlashVSR checkpoint family. Latency scales with output resolution and is reported per scene.

FlashVSR

Benchmarks

Every latency figure on this page is in milliseconds, and lower is better: a FlashDreams runner timed against the upstream runner for the same checkpoint under matched runtime settings. The figures measure speed only. They assume that the inference path holds tolerance-bounded parity with the upstream reference, which is enforced per recipe (see How the benchmarks were gathered). The two tables below differ in the cadence that latency is measured against: per autoregressive step for the generation recipes (Per-step latency) and per scene for super-resolution (Per-scene latency).

Several model pages also carry per-device profiling charts (e.g. Self-Forcing, LingBot-World, Wan2.1), rendered from the same sources as the tables below.

Per-step latency

Steady-state per-step latency is the wall-clock time of one autoregressive (AR) step once the run is past its first two warm-up AR steps and the steady-state CUDA graph has been captured. It is the total(w/o finalize) value the inference pipeline logs each step. We profile on three device types (H100, GB200, and GB300), and each number is the median of the AR-2-onward window of a single run on that device. Bidirectional recipes (wan21-*) are treated as a single large-windowed causal rollout, so their per-step number lines up with the streaming variants in the same table.

The FlashDreams runner calls the same model code paths as the upstream library but in a different inference environment: KV caches are managed by flashdreams.infra, ring attention is provided by flashdreams.core, and a CUDA graph is captured per recipe. For an apples-to-apples comparison, both sides are forced to the cuDNN attention backend under matched runtime settings.

Speedup is (upstream − FlashDreams) ÷ FlashDreams, expressed as a percentage, so a positive value means FlashDreams is faster (e.g. a 432 ms → 344 ms step is +26%).

Recipe (runner slug)	GPU	Upstream (ms)	FlashDreams (ms)	Speedup	Baseline
`self-forcing-wan2.1-t2v-1.3b-taehv`	H100	432	344	+26%	Official Self-Forcing runner
`self-forcing-wan2.1-t2v-1.3b-taehv`	GB200	350	203	+72%	Official Self-Forcing runner
`self-forcing-wan2.1-t2v-1.3b-taehv`	GB300	251	171	+47%	Official Self-Forcing runner
`self-forcing-wan2.1-t2v-1.3b-taehv`	H100	511	344	+49%	FastVideo runner
`self-forcing-wan2.1-t2v-1.3b-taehv`	GB200	374	203	+84%	FastVideo runner
`self-forcing-wan2.1-t2v-1.3b-taehv`	GB300	362	171	+112%	FastVideo runner
`lingbot-world-fast`	4×H100	1950	629	+210%	Official LingBot-World runner
`lingbot-world-fast`	4×GB200	1113	449	+148%	Official LingBot-World runner
`lingbot-world-fast`	4×GB300	1032	394	+162%	Official LingBot-World runner
`lingbot-world-fast`	4×H100	740	629	+18%	LightX2V runner
`lingbot-world-fast`	4×GB200	717	449	+60%	LightX2V runner
`lingbot-world-fast`	4×GB300	602	394	+53%	LightX2V runner
`wan21-t2v-1.3b-480p`	H100	1140	1040	+10%	Official Wan2.1 runner
`wan21-t2v-1.3b-480p`	GB200	481	441	+9%	Official Wan2.1 runner
`wan21-t2v-1.3b-480p`	GB300	429	382	+12%	Official Wan2.1 runner
`wan21-t2v-1.3b-480p`	H100	1290	1040	+24%	FastVideo runner
`wan21-t2v-1.3b-480p`	GB200	578	441	+31%	FastVideo runner
`wan21-t2v-1.3b-480p`	GB300	534	382	+40%	FastVideo runner
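The Speedup column can be reproduced from the two latency columns. The following is a minimal sketch of the formula stated above; the function names are illustrative and are not part of FlashDreams.

```python
def speedup_percent(upstream_ms: float, flashdreams_ms: float) -> float:
    """Speedup as defined on this page: (upstream - FlashDreams) / FlashDreams, in percent."""
    if flashdreams_ms <= 0:
        raise ValueError("FlashDreams latency must be positive")
    return (upstream_ms - flashdreams_ms) / flashdreams_ms * 100.0


def format_speedup(upstream_ms: float, flashdreams_ms: float) -> str:
    """Render the value the way the Speedup column does, with an explicit sign."""
    return f"{speedup_percent(upstream_ms, flashdreams_ms):+.0f}%"


# Rows copied from the Self-Forcing entries measured against the official runner.
rows = [("H100", 432, 344), ("GB200", 350, 203), ("GB300", 251, 171)]
for gpu, upstream_ms, flashdreams_ms in rows:
    print(gpu, format_speedup(upstream_ms, flashdreams_ms))
```

Because the denominator is the FlashDreams latency, the value reads as "how much longer the upstream step takes", which is why it can exceed 100% when the upstream step takes more than twice as long.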

Per-scene latency

FlashVSR super-resolution has no autoregressive steps, so its latency scales with output resolution rather than step count: each row is the per-chunk 2× upsampling time (8-frame chunks) for one scene at the listed output resolution, measured on GB200 against the official FlashVSR runner.

Scene (resolution)	FlashDreams (ms)	Upstream (ms)
384×384	74.7	130.5
672×384	114.5	235.7
384×672	111.2	181.3
640×480	132.5	255.7
768×416	135.5	215.9
1280×704	372.5	528.2
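A per-chunk time can be easier to judge as a frame rate. The sketch below converts the rows above into frames per second, assuming chunks run back to back with no other overhead; that assumption is ours, so treat the result as an optimistic estimate and not as a measured throughput. The helper names are illustrative.

```python
FRAMES_PER_CHUNK = 8  # chunk size stated for the table above

# ((dim_a, dim_b), FlashDreams ms, upstream ms), copied from the table above.
ROWS = [
    ((384, 384), 74.7, 130.5),
    ((672, 384), 114.5, 235.7),
    ((384, 672), 111.2, 181.3),
    ((640, 480), 132.5, 255.7),
    ((768, 416), 135.5, 215.9),
    ((1280, 704), 372.5, 528.2),
]


def pixel_count(resolution):
    """Pixels per output frame; the product does not depend on dimension order."""
    a, b = resolution
    return a * b


def frames_per_second(chunk_ms, frames_per_chunk=FRAMES_PER_CHUNK):
    """Frames per second if chunks are processed back to back (assumption)."""
    if chunk_ms <= 0:
        raise ValueError("chunk latency must be positive")
    return frames_per_chunk / (chunk_ms / 1000.0)


for resolution, flashdreams_ms, upstream_ms in ROWS:
    print(
        f"{resolution[0]}x{resolution[1]}: "
        f"{pixel_count(resolution)} px/frame, "
        f"FlashDreams {frames_per_second(flashdreams_ms):.1f} fps, "
        f"upstream {frames_per_second(upstream_ms):.1f} fps"
    )
```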

How the benchmarks were gathered

Each Per-step latency row is produced by driving a generation recipe end-to-end and parsing the per-step log lines: drop the first two AR steps as warm-up, then take the median of the remaining total(w/o finalize) values. The Per-scene latency rows are gathered from the FlashVSR runner instead, timed per output scene rather than per AR step.

uv run flashdreams-run \
    self-forcing-wan2.1-t2v-1.3b-taehv \
    --total-blocks 7 \
    2>&1 | tee /tmp/bench-self-forcing.log

--total-blocks is defined on the streaming-runner subclasses (omnidreams, self_forcing, causal_forcing, fastvideo_causal_wan22, lingbot); bidirectional runners (wan21-*, cosmos2-*) drop the flag and emit a single end-to-end output.

For multi-GPU recipes, launch the same command under torchrun; the recipe transformer auto-detects its context-parallel size from the launcher’s world group (LingBot-World uses Ulysses sequence parallelism across 4 GPUs):

uv run torchrun --nproc_per_node=4 --no-python \
    flashdreams-run lingbot-world-fast --total-blocks 21
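Once a run has been captured to a log file, the reduction described above (drop the first two AR steps, take the median of the remaining total(w/o finalize) values) can be sketched as follows. The exact layout of the log lines is not shown on this page, so the regular expression is a guess that will likely need adjusting, and the helper names are illustrative and not part of FlashDreams.

```python
import re
from statistics import median

# Assumption: each per-step log line contains the label "total(w/o finalize)"
# followed on the same line by its numeric value. Adjust to the real log format.
STEP_PATTERN = re.compile(r"total\(w/o finalize\)\D*?(\d+(?:\.\d+)?)")


def extract_step_values(log_text, pattern=STEP_PATTERN):
    """Collect one value per matching log line, in the order the lines appear."""
    values = []
    for line in log_text.splitlines():
        match = pattern.search(line)
        if match:
            values.append(float(match.group(1)))
    return values


def steady_state_median(step_values, warmup_steps=2):
    """Drop the warm-up steps, then take the median of what remains."""
    steady = list(step_values)[warmup_steps:]
    if not steady:
        raise ValueError("no steps left after dropping warm-up")
    return median(steady)


if __name__ == "__main__":
    # Path taken from the tee command above.
    with open("/tmp/bench-self-forcing.log", encoding="utf-8") as handle:
        values = extract_step_values(handle.read())
    print("steady-state median:", steady_state_median(values))
```

Taking a median instead of a mean keeps a single slow step in the window from shifting the reported number.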

The upstream baseline in Per-step latency runs the same checkpoint under the upstream library’s own runner; per-integration instructions live in the repo under integrations/<name>/tests/parity_check/ (see the integrations/ tree).

Six integrations ship a parity check against their upstream reference under integrations/<name>/tests/parity_check/run.sh in the repo: self_forcing, lingbot, wan21, cosmos_predict2, flashvsr, and hy_worldplay. These are intended for manual execution in the upstream environment, not in CI. The numbers on this page assume parity holds where it is enforced; see Contributing to FlashDreams for how to escalate a regression rather than averaging away a discrepancy.

Running a model yourself

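Every recipe is launched through flashdreams-run. Pass --help after the model slug to list the options that recipe accepts:

uv run flashdreams-run <MODEL_SLUG> --help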

Examples:

uv run flashdreams-run self-forcing-wan2.1-t2v-1.3b-taehv --total-blocks 7
uv run flashdreams-run lingbot-world-fast --example-data True --total-blocks 21
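When scripting several runs, it can help to assemble these command lines programmatically. The sketch below only composes the argument list from pieces that appear on this page (uv run, torchrun with --nproc_per_node and --no-python, flashdreams-run, --total-blocks); build_command is a hypothetical helper, not a FlashDreams API.

```python
import shlex


def build_command(model_slug, total_blocks=None, nproc_per_node=None, extra_args=()):
    """Compose a flashdreams-run invocation as an argument list.

    total_blocks=None leaves --total-blocks out entirely.
    nproc_per_node wraps the command in torchrun for multi-GPU recipes.
    extra_args are passed through unchanged, before --total-blocks.
    """
    cmd = ["uv", "run"]
    if nproc_per_node is not None:
        cmd += ["torchrun", f"--nproc_per_node={nproc_per_node}", "--no-python"]
    cmd += ["flashdreams-run", model_slug]
    cmd += list(extra_args)
    if total_blocks is not None:
        cmd += ["--total-blocks", str(total_blocks)]
    return cmd


single_gpu = build_command("self-forcing-wan2.1-t2v-1.3b-taehv", total_blocks=7)
multi_gpu = build_command("lingbot-world-fast", total_blocks=21, nproc_per_node=4)

# Print the commands; to execute one, pass the list to subprocess.run(cmd, check=True).
print(shlex.join(single_gpu))
print(shlex.join(multi_gpu))
```

Keeping the command as a list avoids shell-quoting mistakes when it is handed to subprocess.run.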

Adding your own model

See Add a new method for model integration and registration guidance.