Models

FlashDreams runs a growing family of world and video models (text-to-video, image-to-video, camera-controlled, and super-resolution), all through one consistent command line and Python interface. Browse the models below, pick the one that fits what you want to make, and follow its card through to the full recipe.

Available models

The models come in three flavors. Streaming and autoregressive generation recipes build a video step by step and stay fast once warmed up, aiming for sub-second latency per step; bidirectional recipes produce a clip in a single pass and serve as the quality reference for their streaming counterparts; and super-resolution recipes upscale existing frames in chunks, so their latency scales with output resolution rather than step count. Each card links to that recipe’s page, where you’ll find the exact command to run it, the checkpoint it uses, and the settings you can tune.

Streaming and autoregressive generation

OmniDreams

Single-view and multi-view streaming recipes against the OmniDreams checkpoints, including a diffusion-forcing AR variant.

NVIDIA OmniDreams

Self-Forcing

Streaming Wan 2.1 T2V via the Self-Forcing plugin. AR steps after warmup are sub-second on H100 / GB200.

Self-Forcing

Causal-Forcing

Causal-forcing framewise T2V and I2V variants of Wan 2.1 via the Causal-Forcing plugin.

Causal-Forcing

Causal Wan 2.2

FastVideo Wan 2.2 14B causal T2V recipe.

Causal Wan2.2

LingBot-World

Camera-controlled I2V with bundled prompt, first-frame, and camera arrays.

LingBot-World

Bidirectional video generation

Wan 2.1

Bidirectional Wan 2.1: T2V 1.3B / 480p and I2V 14B / 480p. The parity baseline for self-forcing and causal-forcing recipes.

Wan2.1

Cosmos-Predict2.5

Bidirectional Cosmos-Predict2.5 recipes (T2V / I2V, 2B).

Cosmos-Predict2.5

Super-resolution

FlashVSR

Streaming video super-resolution for the FlashVSR checkpoint family. Latency scales with output resolution and is reported per scene.

FlashVSR

Benchmarks

Every figure on this page is a latency in milliseconds (lower is better): a FlashDreams runner timed against the upstream runner for the same checkpoint under matched runtime settings. The comparison presupposes that the inference path holds tolerance-bounded parity with the upstream reference, which is enforced per recipe (see How the benchmarks were gathered). The two tables below differ in the cadence latency is measured against: per autoregressive step for the generation recipes (Per-step latency) and per scene for super-resolution (Per-scene latency).

Several model pages also carry per-device profiling charts (e.g. Self-Forcing, LingBot-World, Wan2.1), rendered from the same sources as the tables below.

Per-step latency

Steady-state per-step latency is the wall-clock time of one autoregressive (AR) step from AR step 2 onward, once the steady-state CUDA graph has been captured. It is the total(w/o finalize) value the inference pipeline logs each step. For each of the three devices we profile, the reported number is the median over the AR-2-onward window of a single run on that device. Bidirectional recipes (wan21-*) are treated as a single large-windowed causal rollout, so their per-step number lines up with the streaming variants in the same table.

The FlashDreams runner calls the same model code paths as the upstream library but in a different inference environment (KV caches managed by flashdreams.infra, ring attention provided by flashdreams.core, and a CUDA graph captured per recipe). For an apples-to-apples comparison, both sides are forced to the cuDNN attention backend under matched runtime settings.

Speedup is (upstream − FlashDreams) ÷ FlashDreams, expressed as a percentage, so a positive value means FlashDreams is faster (e.g. a 432 ms → 344 ms step is +26%).

| Recipe (runner slug) | GPU | Upstream (ms) | FlashDreams (ms) | Speedup | Baseline |
|---|---|---|---|---|---|
| self-forcing-wan2.1-t2v-1.3b-taehv | H100 | 432 | 344 | +26% | Official Self-Forcing runner |
| self-forcing-wan2.1-t2v-1.3b-taehv | GB200 | 350 | 203 | +72% | Official Self-Forcing runner |
| self-forcing-wan2.1-t2v-1.3b-taehv | GB300 | 251 | 171 | +47% | Official Self-Forcing runner |
| self-forcing-wan2.1-t2v-1.3b-taehv | H100 | 511 | 344 | +49% | FastVideo runner |
| self-forcing-wan2.1-t2v-1.3b-taehv | GB200 | 374 | 203 | +84% | FastVideo runner |
| self-forcing-wan2.1-t2v-1.3b-taehv | GB300 | 362 | 171 | +112% | FastVideo runner |
| lingbot-world-fast | 4×H100 | 1950 | 629 | +210% | Official LingBot-World runner |
| lingbot-world-fast | 4×GB200 | 1113 | 449 | +148% | Official LingBot-World runner |
| lingbot-world-fast | 4×GB300 | 1032 | 394 | +162% | Official LingBot-World runner |
| lingbot-world-fast | 4×H100 | 740 | 629 | +18% | LightX2V runner |
| lingbot-world-fast | 4×GB200 | 717 | 449 | +60% | LightX2V runner |
| lingbot-world-fast | 4×GB300 | 602 | 394 | +53% | LightX2V runner |
| wan21-t2v-1.3b-480p | H100 | 1140 | 1040 | +10% | Official Wan2.1 runner |
| wan21-t2v-1.3b-480p | GB200 | 481 | 441 | +9% | Official Wan2.1 runner |
| wan21-t2v-1.3b-480p | GB300 | 429 | 382 | +12% | Official Wan2.1 runner |
| wan21-t2v-1.3b-480p | H100 | 1290 | 1040 | +24% | FastVideo runner |
| wan21-t2v-1.3b-480p | GB200 | 578 | 441 | +31% | FastVideo runner |
| wan21-t2v-1.3b-480p | GB300 | 534 | 382 | +40% | FastVideo runner |
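The Speedup column can be recomputed from the two latency columns as (upstream − FlashDreams) ÷ FlashDreams. The sketch below is illustrative arithmetic only, not FlashDreams code; the function name `speedup_percent` is an assumption made for this example, and the rows are copied from the per-step table.

```python
def speedup_percent(upstream_ms: float, flashdreams_ms: float) -> float:
    """Relative speedup of FlashDreams over upstream, in percent.

    Positive means FlashDreams is faster; 0 means both runners are equal.
    """
    if flashdreams_ms <= 0:
        raise ValueError("flashdreams_ms must be positive")
    return (upstream_ms - flashdreams_ms) / flashdreams_ms * 100.0


# (upstream ms, FlashDreams ms) pairs taken from the per-step table.
rows = {
    "self-forcing, H100, official runner": (432, 344),
    "lingbot-world-fast, 4xH100, official runner": (1950, 629),
    "wan21-t2v-1.3b-480p, GB200, official runner": (481, 441),
}

for name, (upstream, flash) in rows.items():
    print(f"{name}: {speedup_percent(upstream, flash):+.0f}%")
# self-forcing, H100, official runner: +26%
# lingbot-world-fast, 4xH100, official runner: +210%
# wan21-t2v-1.3b-480p, GB200, official runner: +9%
```

Because the denominator is the FlashDreams latency, values above +100% are possible: they mean the upstream step takes more than twice as long.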

Per-scene latency

FlashVSR super-resolution has no autoregressive steps, so its latency is reported per scene and scales with output resolution: each row is the per-chunk 2× upsampling time (8-frame chunks) for one scene resolution, measured on GB200 against the official FlashVSR runner.

| Scene (resolution) | FlashDreams (ms) | Upstream (ms) |
|---|---|---|
| 384×384 | 74.7 | 130.5 |
| 672×384 | 114.5 | 235.7 |
| 384×672 | 111.2 | 181.3 |
| 640×480 | 132.5 | 255.7 |
| 768×416 | 135.5 | 215.9 |
| 1280×704 | 372.5 | 528.2 |
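Per-chunk latency can be turned into a rough throughput figure. The sketch below assumes each timed chunk covers exactly the 8 frames stated above and that chunks are processed back to back with no other overhead; `frames_per_second` is a name introduced for this example, and the derived numbers are not measurements published on this page.

```python
CHUNK_FRAMES = 8  # chunk size stated in the text


def frames_per_second(chunk_ms: float, chunk_frames: int = CHUNK_FRAMES) -> float:
    """Frames processed per second if every chunk takes chunk_ms milliseconds."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return chunk_frames / (chunk_ms / 1000.0)


# (FlashDreams ms, upstream ms) per scene, copied from the per-scene table.
scenes = {
    "384x384": (74.7, 130.5),
    "672x384": (114.5, 235.7),
    "384x672": (111.2, 181.3),
    "640x480": (132.5, 255.7),
    "768x416": (135.5, 215.9),
    "1280x704": (372.5, 528.2),
}

for scene, (flash_ms, upstream_ms) in scenes.items():
    print(
        f"{scene}: FlashDreams {frames_per_second(flash_ms):.1f} fps, "
        f"upstream {frames_per_second(upstream_ms):.1f} fps"
    )
```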

How the benchmarks were gathered

Each Per-step latency row is produced by driving a generation recipe end-to-end and parsing the per-step log lines: drop the first two AR steps as warm-up, then take the median of the remaining total(w/o finalize) values. The Per-scene latency rows are gathered from the FlashVSR runner instead, timed per output scene rather than per AR step.

uv run flashdreams-run \
    self-forcing-wan2.1-t2v-1.3b-taehv \
    --total-blocks 7 \
    2>&1 | tee /tmp/bench-self-forcing.log

--total-blocks is defined on the streaming-runner subclasses (omnidreams, self_forcing, causal_forcing, fastvideo_causal_wan22, lingbot); bidirectional runners (wan21-*, cosmos2-*) drop the flag and emit a single end-to-end output.
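The warm-up drop and median described above can be sketched as a small log parser. The exact log-line format is not shown on this page, so the regular expression below is an assumption: it expects each per-step line to contain the literal text total(w/o finalize) followed by a number in milliseconds. The sample lines are synthetic, written only to exercise the parser.

```python
import re
import statistics
from typing import Iterable

# ASSUMPTION: a per-step line contains "total(w/o finalize)" followed, after
# any separator, by the latency in milliseconds. Adjust to the real format.
STEP_RE = re.compile(r"total\(w/o finalize\)\D*?(\d+(?:\.\d+)?)")
WARMUP_STEPS = 2  # the first two AR steps are dropped as warm-up


def steady_state_median_ms(lines: Iterable[str], warmup: int = WARMUP_STEPS) -> float:
    """Median of the per-step latencies after dropping the warm-up steps."""
    values = []
    for line in lines:
        match = STEP_RE.search(line)
        if match:
            values.append(float(match.group(1)))
    steady = values[warmup:]
    if not steady:
        raise ValueError("no steady-state steps found after warm-up")
    return statistics.median(steady)


# Synthetic sample in the assumed format (not real FlashDreams output).
sample_log = [
    "AR step 0 total(w/o finalize): 2100.0 ms",
    "AR step 1 total(w/o finalize): 900.0 ms",
    "AR step 2 total(w/o finalize): 346.0 ms",
    "unrelated line without a timing",
    "AR step 3 total(w/o finalize): 344.0 ms",
    "AR step 4 total(w/o finalize): 343.0 ms",
]

print(steady_state_median_ms(sample_log))  # 344.0
```

If the real log matches the assumed pattern, the same function can be fed the file written by the tee command, for example by passing an open handle on /tmp/bench-self-forcing.log.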

For multi-GPU recipes, launch the same command under torchrun; the recipe transformer auto-detects its context-parallel size from the launcher’s world group (LingBot-World uses Ulysses sequence parallelism across 4 GPUs):

uv run torchrun --nproc_per_node=4 --no-python \
    flashdreams-run lingbot-world-fast --total-blocks 21
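For orientation, torchrun exposes the process count to each worker through the WORLD_SIZE environment variable. The sketch below only illustrates reading that value; how FlashDreams itself derives the context-parallel size from the world group is not shown on this page, and `launcher_world_size` is a hypothetical helper, not part of its API.

```python
import os
from typing import Mapping


def launcher_world_size(env: Mapping[str, str] = os.environ) -> int:
    """Number of processes the launcher started; 1 when not run under torchrun."""
    size = int(env.get("WORLD_SIZE", "1"))
    if size < 1:
        raise ValueError(f"invalid WORLD_SIZE: {size}")
    return size


# Under `torchrun --nproc_per_node=4` on one node, WORLD_SIZE is "4".
print(launcher_world_size({"WORLD_SIZE": "4"}))  # 4
# Launched directly, without torchrun, the variable is typically unset.
print(launcher_world_size({}))  # 1
```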

The upstream baseline in Per-step latency runs the same checkpoint under the upstream library’s own runner; per-integration instructions live in the repo under integrations/<name>/tests/parity_check/ (see the integrations/ tree).

Six integrations ship a parity check against their upstream reference under integrations/<name>/tests/parity_check/run.sh in the repo: self_forcing, lingbot, wan21, cosmos_predict2, flashvsr, and hy_worldplay. These are intended for manual execution in the upstream environment, not in CI. The numbers on this page assume parity holds where it is enforced; see Contributing to FlashDreams for how to escalate a regression rather than averaging away a discrepancy.

Running a model yourself

uv run flashdreams-run <MODEL_SLUG> --help

Examples:

uv run flashdreams-run self-forcing-wan2.1-t2v-1.3b-taehv --total-blocks 7
uv run flashdreams-run lingbot-world-fast --example-data True --total-blocks 21

Adding your own model

See Add a new method for model integration and registration guidance.