LingBot-World#

Introduced by Robbyant, LingBot-World is a camera-controllable image-to-video (I2V) world model with streaming inference and context-parallel runtime support.

Teaser video source: LingBot-World project page.

Requirements#

Minimum VRAM: ~120 GB.
PyTorch: >= 2.9.
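
As a quick way to check these requirements before installing anything heavy, the following is a minimal sketch. The helper names are hypothetical and not part of FlashDreams. It compares versions numerically, because a plain string comparison would rank `2.10` below `2.9`. The source does not say whether the ~120 GB figure is per GPU or summed across GPUs, so the script only prints per-device memory and leaves the judgment to the reader.

```python
import re

MIN_TORCH = (2, 9)  # from the requirements listed above


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.9.0+cu128'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def torch_is_new_enough(version: str, minimum: tuple[int, int] = MIN_TORCH) -> bool:
    """Numeric comparison, so '2.10' correctly counts as newer than '2.9'."""
    return parse_major_minor(version) >= minimum


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        ok = torch_is_new_enough(torch.__version__)
        print(f"torch {torch.__version__}: {'OK' if ok else 'older than 2.9'}")
        if torch.cuda.is_available():
            for i in range(torch.cuda.device_count()):
                total = torch.cuda.get_device_properties(i).total_memory
                print(f"GPU {i}: {total / 1e9:.0f} GB")
```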

Installation#

# from the repo root
uv sync --project integrations/lingbot

Running the method#

To run LingBot-World, launch one of the registered runner slugs. For example:

uv run --project integrations/lingbot \
    flashdreams-run \
    lingbot-world-fast \
    --example-data True \
    --example-idx 0 \
    --pixel-height 464 --pixel-width 832 \
    --total-blocks 21

Sample data is downloaded from the LingBot-World repository. Valid --example-idx values are 0, 1, 2, and 5. Note that the single-GPU command may run out of memory for large --total-blocks values.

For multi-GPU inference, run the same command under torchrun (taking 4 GPUs as an example):

uv run --project integrations/lingbot \
    torchrun --nproc_per_node=4 --no-python flashdreams-run \
    lingbot-world-fast \
    --example-data True \
    --example-idx 0 \
    --pixel-height 464 --pixel-width 832 \
    --total-blocks 21
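
If you drive these runs from a script, the two commands above can be assembled programmatically. The sketch below is illustrative only: `build_command` is a hypothetical helper, not a FlashDreams API, and it uses only the flags and values shown in this section. It rejects `--example-idx` values outside the documented set before anything is launched.

```python
import shlex

VALID_EXAMPLE_IDX = (0, 1, 2, 5)  # values listed in this section


def build_command(slug="lingbot-world-fast", example_idx=0, height=464,
                  width=832, total_blocks=21, num_gpus=1):
    """Return the argv list for a single-GPU or torchrun launch."""
    if example_idx not in VALID_EXAMPLE_IDX:
        raise ValueError(f"example_idx must be one of {VALID_EXAMPLE_IDX}")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    cmd = ["uv", "run", "--project", "integrations/lingbot"]
    if num_gpus > 1:
        # Same command, wrapped in torchrun as in the multi-GPU example.
        cmd += ["torchrun", f"--nproc_per_node={num_gpus}", "--no-python"]
    cmd += [
        "flashdreams-run", slug,
        "--example-data", "True",
        "--example-idx", str(example_idx),
        "--pixel-height", str(height),
        "--pixel-width", str(width),
        "--total-blocks", str(total_blocks),
    ]
    return cmd


if __name__ == "__main__":
    # Print the 4-GPU command; pass the list to subprocess.run to execute it.
    print(shlex.join(build_command(num_gpus=4)))
```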

We provide the following variants:

| Method | Description |
| --- | --- |
| `lingbot-world-fast` | Official camera-control I2V (Wan VAE decoder, full KV-cache). |
| `lingbot-world-fast-taehv-window15-sink3` | Efficient streaming configuration: TAEHV decoder, `window_size_t=15` + `sink_size_t=3` streaming KV-cache. |
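
To give some intuition for the `window_size_t` and `sink_size_t` settings, here is a minimal sketch of the generic "attention sink plus sliding window" cache policy. It is an assumption that LingBot-World's streaming KV-cache follows this pattern, in particular that the sink entries are kept in addition to the window and that both sizes count temporal units along the time axis. The function `kept_indices` is hypothetical and not taken from the codebase.

```python
def kept_indices(num_cached: int, window_size_t: int = 15,
                 sink_size_t: int = 3) -> list[int]:
    """Temporal indices retained by a sink + sliding-window cache.

    Assumed policy: always keep the first `sink_size_t` entries (the sink)
    plus the most recent `window_size_t` entries; evict everything between.
    """
    if num_cached <= sink_size_t + window_size_t:
        # Nothing needs to be evicted yet.
        return list(range(num_cached))
    sink = list(range(sink_size_t))
    window = list(range(num_cached - window_size_t, num_cached))
    return sink + window


if __name__ == "__main__":
    for n in (10, 18, 30):
        kept = kept_indices(n)
        print(f"{n} cached -> {len(kept)} kept: {kept}")
```

Under this assumed policy the cache size is bounded by `sink_size_t + window_size_t` entries regardless of rollout length, whereas a full KV-cache grows with every generated block.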

To inspect all supported CLI arguments and their default values, run:

uv run --project integrations/lingbot \
    flashdreams-run \
    lingbot-world-fast \
    --help

What to expect#

Example data: --example-data True downloads image.jpg, intrinsics.npy, poses.npy, prompt.txt from the upstream examples folder into assets/example_data/lingbot_world/<NN>/ (<NN> matches --example-idx). Cached after first run; no credentials needed.
Model checkpoint: ~70 GB pulled from huggingface.co/robbyant/lingbot-world-fast on first run, cached under $HF_HOME. Export HF_TOKEN first.
Disk: keep ~200 GB free for the model + HF cache. Hosts with less than ~100 GB free have been observed to run out of disk space mid-load.
First launch: a few minutes (download + Triton autotuning + CUDA-graph warmup). Subsequent launches reuse the caches.
Outputs: outputs/<runner-slug>.mp4 (16 FPS, 464×832 by default) and outputs/stats_<runner-slug>.json. Override with --output-dir / --pixel-height / --pixel-width / --fps.
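
The expectations above can be turned into a small preflight check. The sketch below uses only the standard library and hypothetical helper names. It assumes that `<NN>` is the example index zero-padded to two digits (for example `05`), which the source does not state explicitly.

```python
import shutil
from pathlib import Path

# File names listed under "Example data" above.
EXAMPLE_FILES = ("image.jpg", "intrinsics.npy", "poses.npy", "prompt.txt")


def example_dir(example_idx: int,
                root: str = "assets/example_data/lingbot_world") -> Path:
    # Assumption: <NN> is the index zero-padded to two digits.
    return Path(root) / f"{example_idx:02d}"


def missing_example_files(directory: Path) -> list[str]:
    """Names from EXAMPLE_FILES that are not present in `directory`."""
    return [name for name in EXAMPLE_FILES if not (directory / name).is_file()]


def free_gb(path: str = ".") -> float:
    """Free disk space, in GB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    print(f"free disk: {free_gb():.0f} GB (the text suggests ~200 GB)")
    missing = missing_example_files(example_dir(0))
    print("example 0 cached" if not missing else f"not yet cached: {missing}")
```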

See Inference pipeline overview for what one autoregressive chunk does end-to-end.

Some generated samples from the above commands:

example_idx: 01

example_idx: 02

Launch the interactive server#

Start the interactive LingBot-World server, which streams to the browser over WebRTC:

# from the repo root
uv run --package flashdreams-lingbot torchrun --nproc_per_node 4 \
    -m lingbot.webrtc.server \
    --host 0.0.0.0 --port 8089 \
    --config_name lingbot-world-fast-taehv-window15-sink3 \
    --example-idx 0

--example-idx selects which example to download (0, 1, 2, or 5); assets auto-download on first launch. The HTTP port opens only after model load and warmup, which take a few minutes on first launch and much less afterwards. When ready, the server prints Connect via http://<server-ip>:8089/request_session (use localhost when running locally).
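
Because the port stays closed until loading and warmup finish, a script that opens the UI automatically has to wait for it. The following is a minimal sketch using only the standard library. `wait_for_port` is a hypothetical helper, and the 10-minute default timeout is an arbitrary choice, not a figure from the project.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 8089,
                  timeout_s: float = 600.0, interval_s: float = 2.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True once the port accepts connections and False if
    `timeout_s` elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: the server is not up yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


if __name__ == "__main__":
    if wait_for_port():
        print("Server is up: http://localhost:8089/request_session")
    else:
        print("Timed out waiting for port 8089")
```

A successful TCP connection only shows that the port is open. It does not confirm that a session can actually be established.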

Note

On a remote or cloud GPU instance (e.g. Brev), the server port is usually not reachable at the host IP directly. Forward it to your local machine first, then open http://localhost:8089/request_session:

# Brev
brev port-forward <instance> -p 8089:8089
# or plain SSH
ssh -L 8089:localhost:8089 <user>@<host>

When successfully connected, the browser-based UI looks like this:

Profiling benchmark#

This benchmark compares the total DiT runtime of FlashDreams LingBot-World against the official LingBot-World implementation and LightX2V under matched settings.

This chart shows total DiT runtime (4 diffusion steps) in milliseconds at the 6th autoregressive rollout on 4 GPUs. For an apples-to-apples comparison, all implementations are forced to use the cuDNN attention backend under matched runtime settings, and all runs use Ulysses sequence parallelism for multi-GPU inference. For the official LingBot-World implementation, see these instructions. For the LightX2V baseline, see these instructions.

Citation#

If you use LingBot-World, please cite the original work:

@article{lingbot-world,
      title={Advancing Open-source World Models},
      author={Robbyant Team and Zelin Gao and Qiuyu Wang and Yanhong Zeng and Jiapeng Zhu and Ka Leong Cheng and Yixuan Li and Hanlin Wang and Yinghao Xu and Shuailei Ma and Yihang Chen and Jie Liu and Yansong Cheng and Yao Yao and Jiayi Zhu and Yihao Meng and Kecheng Zheng and Qingyan Bai and Jingye Chen and Zehong Shen and Yue Yu and Xing Zhu and Yujun Shen and Hao Ouyang},
      journal={arXiv preprint arXiv:2601.20540},
      year={2026}
}