Models#
FlashDreams runs a growing family of world and video models (text-to-video, image-to-video, camera-controlled, and super-resolution), all through one consistent command line and Python interface. Browse the models below, pick the one that fits what you want to make, and follow its card through to the full recipe.
Available models#
The models come in three flavors. Streaming and autoregressive generation recipes build a video step by step and stay fast once warmed up, aiming for sub-second latency per step; bidirectional recipes produce a clip in a single pass and serve as the quality reference for their streaming counterparts; and super-resolution recipes upscale existing frames in chunks, so their latency scales with output resolution rather than step count. Each card links to that recipe’s page, where you’ll find the exact command to run it, the checkpoint it uses, and the settings you can tune.
Streaming and autoregressive generation
Single-view and multi-view streaming recipes against the OmniDreams checkpoints, including a diffusion-forcing AR variant.
Streaming Wan 2.1 T2V via the Self-Forcing plugin. AR steps after warmup are sub-second on H100 / GB200.
Causal-forcing framewise T2V and I2V variants of Wan 2.1 via the Causal-Forcing plugin.
FastVideo Wan 2.2 14B causal T2V recipe.
Camera-controlled I2V with bundled prompt, first-frame, and camera arrays.
Bidirectional video generation
Bidirectional Wan 2.1: T2V 1.3B / 480p and I2V 14B / 480p. The parity baseline for the self-forcing and causal-forcing recipes.
Bidirectional Cosmos-Predict2.5 recipes (T2V / I2V, 2B).
Super-resolution
Streaming video super-resolution for the FlashVSR checkpoint family. Latency scales with output resolution and is reported per scene.
Benchmarks#
Every figure on this page is a latency in milliseconds (lower is better), comparing a FlashDreams runner against the upstream runner for the same checkpoint under matched runtime settings. The comparison rests on the inference path holding tolerance-bounded parity with the upstream reference, which is enforced per recipe (see How the benchmarks were gathered). The two tables below differ in the cadence that latency is measured against: per autoregressive step for the generation recipes (Per-step latency) and per scene for super-resolution (Per-scene latency).
Several model pages also carry per-device profiling charts (e.g. Self-Forcing, LingBot-World, Wan2.1), rendered from the same sources as the tables below.
Per-step latency#
Steady-state per-step latency is the wall-clock time of one
autoregressive step after the first two warm-up AR steps, once the
steady-state CUDA graph is captured. It is the total(w/o finalize) value
the inference pipeline logs each step. On each of the three devices we
profile, the reported number is the median of the AR-2-onward window of a
single run on that device.
Bidirectional recipes (wan21-*) are treated as a single
large-windowed causal rollout, so their per-step number lines up with
the streaming variants in the same table. The FlashDreams runner calls
the same model code paths as the upstream library but in a different
inference environment (KV caches managed by flashdreams.infra, ring
attention provided by flashdreams.core, and a CUDA graph captured
per recipe); for an apples-to-apples comparison both sides are forced to
the cuDNN attention backend under matched runtime settings. Speedup is
(upstream − FlashDreams) ÷ FlashDreams expressed as a percentage, so
a positive value means FlashDreams is faster (e.g. a 432 ms → 344 ms
step is +26%).
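As a quick check of the speedup definition above, the following minimal sketch recomputes a Speedup value from the two latency columns. The function name `speedup_percent` is illustrative and is not part of FlashDreams.

```python
def speedup_percent(upstream_ms: float, flashdreams_ms: float) -> float:
    """Relative speedup in percent: (upstream - FlashDreams) / FlashDreams.

    A positive result means the FlashDreams runner is faster.
    """
    if flashdreams_ms <= 0:
        raise ValueError("flashdreams_ms must be positive")
    return (upstream_ms - flashdreams_ms) / flashdreams_ms * 100.0


# Worked example from the text: a 432 ms -> 344 ms step.
print(f"{speedup_percent(432, 344):+.0f}%")  # +26%
```

Because the denominator is the FlashDreams latency, the value reads as "how much longer upstream takes", so it can exceed 100% when upstream is more than twice as slow.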
| Recipe (runner slug) | GPU | Upstream (ms) | FlashDreams (ms) | Speedup | Baseline |
|---|---|---|---|---|---|
| | H100 | 432 | 344 | +26% | Official Self-Forcing runner |
| | GB200 | 350 | 203 | +72% | Official Self-Forcing runner |
| | GB300 | 251 | 171 | +47% | Official Self-Forcing runner |
| | H100 | 511 | 344 | +49% | FastVideo runner |
| | GB200 | 374 | 203 | +84% | FastVideo runner |
| | GB300 | 362 | 171 | +112% | FastVideo runner |
| | 4×H100 | 1950 | 629 | +210% | Official LingBot-World runner |
| | 4×GB200 | 1113 | 449 | +148% | Official LingBot-World runner |
| | 4×GB300 | 1032 | 394 | +162% | Official LingBot-World runner |
| | 4×H100 | 740 | 629 | +18% | LightX2V runner |
| | 4×GB200 | 717 | 449 | +60% | LightX2V runner |
| | 4×GB300 | 602 | 394 | +53% | LightX2V runner |
| | H100 | 1140 | 1040 | +10% | Official Wan2.1 runner |
| | GB200 | 481 | 441 | +9% | Official Wan2.1 runner |
| | GB300 | 429 | 382 | +12% | Official Wan2.1 runner |
| | H100 | 1290 | 1040 | +24% | FastVideo runner |
| | GB200 | 578 | 441 | +31% | FastVideo runner |
| | GB300 | 534 | 382 | +40% | FastVideo runner |
Per-scene latency#
FlashVSR super-resolution has no autoregressive steps, so its latency is reported per scene and scales with output resolution. Each row is the per-chunk 2× upsampling time (8-frame chunks) for one scene resolution, measured on GB200 against the official FlashVSR runner.
| Scene (resolution) | FlashDreams (ms) | Upstream (ms) |
|---|---|---|
| 384×384 | 74.7 | 130.5 |
| 672×384 | 114.5 | 235.7 |
| 384×672 | 111.2 | 181.3 |
| 640×480 | 132.5 | 255.7 |
| 768×416 | 135.5 | 215.9 |
| 1280×704 | 372.5 | 528.2 |
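To relate the per-chunk figures to a frame rate, the sketch below converts a chunk latency into frames per second. It assumes chunks are processed back to back with no overlap or extra per-chunk overhead, which the page does not state, so treat the result as a rough estimate. The helper name `frames_per_second` is illustrative only.

```python
FRAMES_PER_CHUNK = 8  # chunk size stated for the FlashVSR rows


def frames_per_second(chunk_ms: float, frames_per_chunk: int = FRAMES_PER_CHUNK) -> float:
    """Estimate throughput from a per-chunk latency in milliseconds.

    ASSUMPTION: chunks run back to back, so throughput is simply
    frames per chunk divided by seconds per chunk.
    """
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return frames_per_chunk / (chunk_ms / 1000.0)


# 384x384 row, FlashDreams column: 74.7 ms per 8-frame chunk.
print(f"{frames_per_second(74.7):.1f} fps")  # 107.1 fps
```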
How the benchmarks were gathered#
Each Per-step latency row is produced by driving a generation recipe
end-to-end and parsing the per-step log lines: drop the first two AR
steps as warm-up, then take the median of the remaining total(w/o
finalize) values. The Per-scene latency rows are gathered from the
FlashVSR runner instead, timed per output scene rather than per AR step.
uv run flashdreams-run \
self-forcing-wan2.1-t2v-1.3b-taehv \
--total-blocks 7 \
2>&1 | tee /tmp/bench-self-forcing.log
--total-blocks is defined on the streaming-runner subclasses
(omnidreams, self_forcing, causal_forcing,
fastvideo_causal_wan22, lingbot); bidirectional runners (wan21-*,
cosmos2-*) drop the flag and emit a single end-to-end output.
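The reduction described above (drop the first two AR steps, take the median of the rest) could be sketched as follows. The exact layout of the per-step log lines is not shown on this page, so the regular expression is an assumption: it expects the literal label `total(w/o finalize)` followed by a number in milliseconds. The sample lines are synthetic, not real runner output.

```python
import re
from statistics import median

# ASSUMPTION: each per-step log line contains "total(w/o finalize)" followed
# by the latency in milliseconds. Adjust the pattern to the real log layout.
STEP_PATTERN = re.compile(r"total\(w/o finalize\)\D*?(\d+(?:\.\d+)?)")

WARMUP_STEPS = 2  # the first two AR steps are treated as warm-up


def steady_state_median_ms(log_text: str, warmup_steps: int = WARMUP_STEPS) -> float:
    """Median per-step latency after discarding the warm-up steps."""
    values = [float(m.group(1)) for m in STEP_PATTERN.finditer(log_text)]
    steady = values[warmup_steps:]
    if not steady:
        raise ValueError("no steady-state steps found after warm-up")
    return median(steady)


# Synthetic log lines for illustration only.
sample_log = """\
[step 0] total(w/o finalize): 2150.0 ms
[step 1] total(w/o finalize): 910.5 ms
[step 2] total(w/o finalize): 346.0 ms
[step 3] total(w/o finalize): 344.0 ms
[step 4] total(w/o finalize): 343.0 ms
"""

print(steady_state_median_ms(sample_log))  # 344.0
```

For a real run, the same function could be applied to the text of the file written by `tee`, for example `/tmp/bench-self-forcing.log`.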
For multi-GPU recipes, launch the same command under torchrun; the
recipe transformer auto-detects its context-parallel size from the
launcher’s world group (LingBot-World uses Ulysses sequence
parallelism across 4 GPUs):
uv run torchrun --nproc_per_node=4 --no-python \
flashdreams-run lingbot-world-fast --total-blocks 21
The upstream baseline in Per-step latency runs the same checkpoint
under the upstream library’s own runner; per-integration instructions
live in the repo under each integration’s
integrations/<name>/tests/parity_check/ directory (see the integrations/ tree).
Six integrations ship a parity check against their upstream reference
under integrations/<name>/tests/parity_check/run.sh in the repo:
self_forcing, lingbot, wan21, cosmos_predict2,
flashvsr, and hy_worldplay. These are intended for manual
execution in the upstream environment, not in CI. The numbers on this
page assume parity holds where it is enforced; see
Contributing to FlashDreams for how to escalate a regression rather than
averaging away a discrepancy.
Running a model yourself#
To list the options a recipe accepts, pass --help:
uv run flashdreams-run <MODEL_SLUG> --help
Examples:
uv run flashdreams-run self-forcing-wan2.1-t2v-1.3b-taehv --total-blocks 7
uv run flashdreams-run lingbot-world-fast --example-data True --total-blocks 21
Adding your own model#
See Add a new method for model integration and registration guidance.