FlashDreams
Overview
FlashDreams is a high-performance inference and serving library for interactive autoregressive video and world models. It is a general platform for real-time world-model applications across gaming, autonomous vehicles, robotics, simulated or virtual environments, and more. It is also the runtime backbone of the NVIDIA OmniDreams closed-loop demo at GTC 2026.
Interactive world models
A world model learns to generate and evolve an environment over time. In practice, this often means video, but the same concept can include actions, state, audio, sensor input, and control signals.
World-model serving is the runtime pattern for putting that model inside a live application. Instead of producing one static video, the system keeps a session alive while input, model state, GPU inference, and output evolve together. This is useful for interactive simulation, robotics, autonomy, healthcare workflows, creative tools, virtual worlds, and game-like experiences.
In an online world-model application, the key requirement is not only generating high-quality videos. The runtime must keep an interactive session responsive while the model continues to advance the world.
Comparison with offline video generation
Compared with offline video generation, the target is different. One-shot systems prepare a conditioning input, run the model, then return a finished video. Libraries such as FastVideo and LightX2V are strong references for high-throughput offline inference, but their core pattern is not a persistent interactive loop with low-latency control and streaming output.
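The difference in control flow can be sketched in a few lines of Python. This is a toy illustration only: `toy_step`, `generate_offline`, and `generate_interactive` are hypothetical names, not part of FlashDreams, FastVideo, or LightX2V, and the "model" is just an integer counter standing in for a frame generator.

```python
from typing import Callable, Iterator, List


def toy_step(state: int, action: int) -> int:
    """Hypothetical stand-in for one model step (not a real model)."""
    return state + action


def generate_offline(initial_state: int, actions: List[int]) -> List[int]:
    """One-shot pattern: conditioning is fixed up front, output is returned at the end."""
    state = initial_state
    frames: List[int] = []
    for action in actions:
        state = toy_step(state, action)
        frames.append(state)
    return frames  # the caller sees nothing until the whole clip is finished


def generate_interactive(
    initial_state: int, read_action: Callable[[int], int]
) -> Iterator[int]:
    """Interactive pattern: each frame is emitted as soon as it exists, and the
    next action is read only after the previous frame has been produced."""
    state = initial_state
    while True:
        action = read_action(state)  # control may depend on the latest output
        state = toy_step(state, action)
        yield state


offline_frames = generate_offline(0, [1, 1, 1])  # [1, 2, 3]

# Steer the toy world based on what was just generated: go up until 2, then back down.
stream = generate_interactive(0, lambda last: 1 if last < 2 else -1)
interactive_frames = [next(stream) for _ in range(4)]  # [1, 2, 1, 2]
```

In the offline function the action list must exist before generation starts. In the generator version the session has no predetermined end, and every action can react to the most recent frame, which is the property a persistent interactive loop needs.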
Connection to LLM serving
There is also a useful connection to LLM serving engines such as vLLM and SGLang: both LLMs and many world models are autoregressive. The difference is the interaction pattern. LLM chat usually runs prefill -> decode -> prefill -> decode across user turns. Interactive video/world-model serving is closer to initialize -> decode -> decode -> decode -> ...: initialize the session once, then advance the world continuously at a fixed pace.
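As a minimal sketch of the two schedules and of what "a fixed pace" implies for the runtime loop, consider the following Python. All names here (`llm_chat_schedule`, `world_model_schedule`, `run_fixed_pace`) are hypothetical and do not reflect the FlashDreams, vLLM, or SGLang APIs. The pacing loop assumes a simple deadline-per-step policy, which is one possible choice among several.

```python
import time
from typing import Callable, List


def llm_chat_schedule(num_turns: int) -> List[str]:
    """Turn-based chat: every user turn triggers a new prefill, then a decode."""
    phases: List[str] = []
    for _ in range(num_turns):
        phases += ["prefill", "decode"]
    return phases


def world_model_schedule(num_steps: int) -> List[str]:
    """Interactive world model: initialize once, then decode repeatedly."""
    return ["initialize"] + ["decode"] * num_steps


def run_fixed_pace(
    num_steps: int,
    step_fn: Callable[[int], None],
    fps: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call step_fn once per 1/fps seconds and return how many steps were late.

    Deadlines are computed from the session start rather than from the end of
    the previous step, so small timing errors do not accumulate into drift.
    """
    period = 1.0 / fps
    start = clock()
    late_steps = 0
    for i in range(num_steps):
        step_fn(i)  # one decode step of the (hypothetical) model
        remaining = start + (i + 1) * period - clock()
        if remaining > 0:
            sleep(remaining)  # finished early: wait for the next tick
        else:
            late_steps += 1  # missed the deadline: continue immediately
    return late_steps
```

The `clock` and `sleep` parameters are injected so the loop can be exercised with a simulated clock. A step that takes longer than `1 / fps` shows up as a late step, which is the condition an interactive runtime has to avoid to stay responsive.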
Best-in-class inference speed
FlashDreams is engineered with efficiency in mind. With a bottom-up system design tailored to autoregressive world-model inference patterns, it delivers best-in-class speed across many popular open-source models and GPU architectures:
Although FlashDreams is designed for autoregressive inference, the same optimization stack applies naturally to bidirectional inference (e.g., Wan2.1) by treating it as a single-rollout autoregressive pass.
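One way to read "single-rollout autoregressive pass" is that a bidirectional model is the special case where a single chunk covers the whole clip, so the rollout loop runs exactly once with an empty history. The sketch below illustrates that reading under stated assumptions: `rollout` and `toy_denoise` are hypothetical, the toy "model" simply emits frame indices, and nothing here describes how FlashDreams or Wan2.1 is actually implemented.

```python
from typing import Callable, List, Tuple

# (history, start_index, chunk_size) -> frames for this chunk
ChunkFn = Callable[[List[int], int, int], List[int]]


def rollout(
    denoise_chunk: ChunkFn, total_frames: int, chunk_size: int
) -> Tuple[List[int], int]:
    """Generate total_frames in chunks, each conditioned on all earlier frames.

    Returns the generated frames and the number of model passes used.
    """
    if chunk_size <= 0 or total_frames % chunk_size != 0:
        raise ValueError("total_frames must be a positive multiple of chunk_size")
    history: List[int] = []
    passes = 0
    for start in range(0, total_frames, chunk_size):
        chunk = denoise_chunk(history, start, chunk_size)
        history.extend(chunk)  # later chunks can attend to everything so far
        passes += 1
    return history, passes


def toy_denoise(history: List[int], start: int, size: int) -> List[int]:
    """Hypothetical model: emits frame indices so the output is easy to check."""
    return list(range(start, start + size))


# Autoregressive use: 8 frames in chunks of 2 takes 4 passes.
ar_frames, ar_passes = rollout(toy_denoise, total_frames=8, chunk_size=2)

# Bidirectional use: one chunk spans the clip, so the same loop runs once.
bi_frames, bi_passes = rollout(toy_denoise, total_frames=8, chunk_size=8)
```

Because both cases go through the same `rollout` function, any optimization applied to a single chunk pass in this sketch would benefit both. A real bidirectional model would of course produce different frames than a chunked one; the toy outputs match only because `toy_denoise` ignores its history.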
Production-oriented interactive serving backend
FlashDreams also includes a production-oriented serving backend for persistent and low-latency world-model sessions, with efficient inference execution, multi-GPU support, and streaming input/output. Explore the interactive demos powered by FlashDreams: