Self-Forcing#

Self-Forcing is a text-to-video (T2V) model based on Wan2.1. It is trained with a paradigm for autoregressive video diffusion that simulates the inference-time rollout, including KV caching, during training. This reduces the train-test gap and enables efficient streaming generation.

Teaser image source: Self-Forcing project page.

Requirements#

Minimum VRAM: ~24 GB.
PyTorch: >= 2.9.
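Once the environment is installed, the two requirements above can be checked with a short preflight script. This is a minimal sketch, not something shipped with the repo: `version_ge` and `preflight` are hypothetical helper names, the version comparison relies on GNU `sort -V`, and `preflight` assumes `uv` and `nvidia-smi` are on `PATH`.

```shell
# Succeeds when version $1 is at least version $2. Relies on GNU `sort -V`.
version_ge() (
    have="${1%%+*}"    # drop a local build suffix such as "+cu128"
    min="$2"
    # If the minimum sorts first (or ties), the installed version is new enough.
    [ "$(printf '%s\n%s\n' "$min" "$have" | sort -V | head -n 1)" = "$min" ]
)

# Reports the PyTorch version seen by the project environment and the
# total memory of each visible GPU. Hypothetical helper, not part of the repo.
preflight() (
    torch_version="$(uv run --project integrations/self_forcing \
        python -c 'import torch; print(torch.__version__)')" || exit 1
    if version_ge "$torch_version" "2.9"; then
        echo "PyTorch $torch_version: OK"
    else
        echo "PyTorch $torch_version: older than 2.9"
    fi
    # Compare the reported totals against the ~24 GB guideline by eye.
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
)
```

Source the file and call `preflight` from the repo root. The block only defines functions, so nothing runs until one of them is called.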

Installation#

# from the repo root
uv sync --project integrations/self_forcing

Running the method#

To run Self-Forcing, launch one of the registered runners by its slug. For example:

uv run --project integrations/self_forcing \
    flashdreams-run \
    self-forcing-wan2.1-t2v-1.3b \
    --prompt "A stylish woman strolls down a bustling Tokyo street, the warm glow of neon lights and animated city signs casting vibrant reflections. She wears a sleek black leather jacket paired with a flowing red dress and black boots, her black purse slung over her shoulder. Sunglasses perched on her nose and a bold red lipstick add to her confident, casual demeanor. The street is damp and reflective, creating a mirror-like effect that enhances the colorful lights and shadows. Pedestrians move about, adding to the lively atmosphere. The scene is captured in a dynamic medium shot with the woman walking slightly to one side, highlighting her graceful strides." \
    --pixel-height 480 --pixel-width 832 \
    --total-blocks 7

We provide the following variants:

Method	Description
`self-forcing-wan2.1-t2v-1.3b`	Official checkpoint.
`self-forcing-wan2.1-t2v-1.3b-taehv`	Official checkpoint, with the Wan VAE decoder swapped for the faster TAEHV decoder.
`self-forcing-wan2.1-t2v-1.3b-sink5-window7-rerope`	Steady long-rollout preset: static sink=5 + rolling window=7, with KVCache-relative RoPE.
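To compare the variants side by side, the same command can be generated for each slug in the table. The sketch below is a dry run under stated assumptions: `variant_commands` is a hypothetical helper name, it only prints commands rather than executing them, and it uses only the slugs and flags that appear on this page. `--prompt` is omitted so that every run uses the built-in default prompt.

```shell
# Prints one launch command per registered variant.
# Optional first argument: number of autoregressive blocks (default 7,
# matching the quick demo setting used on this page).
variant_commands() (
    total_blocks="${1:-7}"
    for slug in \
        self-forcing-wan2.1-t2v-1.3b \
        self-forcing-wan2.1-t2v-1.3b-taehv \
        self-forcing-wan2.1-t2v-1.3b-sink5-window7-rerope
    do
        printf 'uv run --project integrations/self_forcing flashdreams-run %s --pixel-height 480 --pixel-width 832 --total-blocks %s\n' \
            "$slug" "$total_blocks"
    done
)
```

Review the printed commands first, then run `variant_commands | sh` from the repo root to execute them one after another.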

For multi-GPU inference, use:

uv run --project integrations/self_forcing \
    torchrun --nproc_per_node=4 --no-python flashdreams-run \
    self-forcing-wan2.1-t2v-1.3b \
    --prompt "A stylish woman strolls down a bustling Tokyo street, the warm glow of neon lights and animated city signs casting vibrant reflections. She wears a sleek black leather jacket paired with a flowing red dress and black boots, her black purse slung over her shoulder. Sunglasses perched on her nose and a bold red lipstick add to her confident, casual demeanor. The street is damp and reflective, creating a mirror-like effect that enhances the colorful lights and shadows. Pedestrians move about, adding to the lively atmosphere. The scene is captured in a dynamic medium shot with the woman walking slightly to one side, highlighting her graceful strides." \
    --pixel-height 480 --pixel-width 832 \
    --total-blocks 7

To inspect all supported CLI arguments and their default values, run:

uv run --project integrations/self_forcing \
    flashdreams-run \
    self-forcing-wan2.1-t2v-1.3b \
    --help

What to expect#

Default prompt: omitting --prompt uses a Tokyo street-scene default. Override with an inline string or a path to a .txt file.
Total blocks: --total-blocks N runs N autoregressive chunks. Commands here use 7 for a fast demo; the config default is 60 for full rollouts. See Inference pipeline overview for what one chunk does end-to-end.
Outputs: outputs/<runner-slug>.mp4 (16 FPS, 480×832 by default) and outputs/stats_<runner-slug>.json. Override with --output-dir / --pixel-height / --pixel-width / --fps.
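The output naming described above makes it easy to check that a run produced both of its files. The following is a small sketch with hypothetical helper names (`expected_outputs`, `check_outputs`). It assumes the file names keep the same pattern when the directory is changed with `--output-dir`, which this page implies but does not state outright.

```shell
# Prints the two files a run is expected to write for a runner slug.
# Arguments: SLUG [OUTPUT_DIR]   (OUTPUT_DIR defaults to "outputs")
expected_outputs() (
    slug="$1"
    out_dir="${2:-outputs}"
    printf '%s/%s.mp4\n' "$out_dir" "$slug"
    printf '%s/stats_%s.json\n' "$out_dir" "$slug"
)

# Succeeds only if both expected files exist; lists missing ones on stderr.
check_outputs() (
    expected_outputs "$@" | {
        status=0
        while IFS= read -r path; do
            if [ ! -f "$path" ]; then
                echo "missing: $path" >&2
                status=1
            fi
        done
        exit "$status"
    }
)
```

For example, `check_outputs self-forcing-wan2.1-t2v-1.3b` looks for the video and the stats file under `outputs/`.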

Measured runtimes on H100 80GB with --total-blocks 7:

Setup	First run (cold)	Subsequent runs
1× H100 PCIe	~6.9 min	~42 s
4× H100 HBM3 (`torchrun --nproc_per_node=4`)	~8.6 min	~73 s

Cold runs are dominated by the first two AR blocks (Triton autotuning + CUDA-graph warmup); steady-state blocks are sub-second.

Per-block steady-state time on 4 GPUs is ~2× faster (~251 ms vs ~500 ms), but per-rank autotuning and NCCL overhead make the 4-GPU run slower end-to-end than the 1-GPU run at --total-blocks 7. Multi-GPU pays off once steady-state time dominates warmup, so use it for --total-blocks 60+.
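One rough way to reason about this trade-off is a linear cost model: a run costs a fixed overhead W (loading plus warmup) and then N blocks at a steady-state time t per block. Setting the single-GPU and multi-GPU costs equal gives a break-even block count of (W_multi - W_single) / (t_single - t_multi). The sketch below only illustrates that arithmetic. `break_even_blocks` is a hypothetical helper, the model ignores the slower first blocks, and the overheads should come from your own measurements, since this page does not report them separately.

```shell
# break_even_blocks W_SINGLE T_SINGLE W_MULTI T_MULTI
#   W_* : fixed overhead per run, in seconds (load + warmup)
#   T_* : steady-state time per block, in seconds
# Prints the smallest whole block count at which the multi-GPU run is no
# slower than the single-GPU run, or "never" if it is not faster per block.
break_even_blocks() {
    awk -v w1="$1" -v t1="$2" -v wm="$3" -v tm="$4" 'BEGIN {
        if (t1 <= tm) { print "never"; exit }
        n = (wm - w1) / (t1 - tm)
        if (n < 0) n = 0
        c = int(n)
        if (c < n) c++          # round up to a whole block
        print c
    }'
}
```

With placeholder values, `break_even_blocks 10 0.5 20 0.25` treats the multi-GPU run as having 10 s more overhead and half the per-block time. Substitute measured values before drawing conclusions for a particular machine.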

Some generated samples from the above commands:

A close-up shot of a ceramic teacup slowly pouring water into a glass mug. The water flows smoothly from the spout of the teacup into the mug, creating gentle ripples as it fills up. Both cups have detailed textures, with the teacup having a matte finish and the glass mug showcasing clear transparency. The background is a blurred kitchen countertop, adding context without distracting from the central action. The pouring motion is fluid and natural, emphasizing the interaction between the two cups.

A dramatic and dynamic scene in the style of a disaster movie, depicting a powerful tsunami rushing through a narrow alley in Bulgaria. The water is turbulent and chaotic, with waves crashing violently against the walls and buildings on either side. The alley is lined with old, weathered houses, their facades partially submerged and splintered. The camera angle is low, capturing the full force of the tsunami as it surges forward, creating a sense of urgency and danger. People can be seen running frantically, adding to the chaos. The background features a distant horizon, hinting at the larger scale of the tsunami. A dynamic, sweeping shot from a low-angle perspective, emphasizing the movement and intensity of the event.

Profiling benchmark#

Here is the profiling benchmark on total DiT runtime for FlashDreams Self-Forcing compared to the official Self-Forcing implementation and the FastVideo implementation under matched settings.

This chart shows the total DiT runtime (4 denoising steps, in milliseconds) at the 6th autoregressive rollout on a single GPU. For an apples-to-apples comparison, all implementations are forced to use the cuDNN attention backend and torch.compile for the DiT network. For profiling the official implementation, see this instruction. For profiling the FastVideo implementation, see this instruction.

Citation#

If you use Self-Forcing, please cite the original work:

@article{huang2026self,
  title={Self forcing: Bridging the train-test gap in autoregressive video diffusion},
  author={Huang, Xun and Li, Zhengqi and He, Guande and Zhou, Mingyuan and Shechtman, Eli},
  journal={Advances in Neural Information Processing Systems},
  volume={38},
  pages={167283--167308},
  year={2026}
}