Infra¶
The flashdreams.infra package defines the swappable abstractions that
every integration plugs into: a config system, an encoder / diffusion-model /
decoder triple, and the streaming inference pipeline that drives them.
Config¶
Every component is built from a frozen InstantiateConfig
dataclass via config.setup(). This makes the full configuration
tree printable, hashable, and trivially serialisable.
- class PrintableConfig[source]¶
Bases:
objectConfig base class providing a multi-line
__str__for human-readable dumps.
- class InstantiateConfig(_target: type)[source]¶
Bases:
PrintableConfigConfig carrying a
_targetclass plus its kwargs, instantiable viasetup.
- derive_config(base_config: T, **changes: Any) T[source]¶
Deep-copy a base config and apply nested keyword overrides.
Nested
dictvalues walk into both dataclass attributes and nested dicts; leaf values overwrite directly. RaisesKeyErroron unknown paths.Example:
new_config = derive_config( base_config, tokenizer=WanVAEInterfaceConfig(checkpoint_path=...), dit=dict(len_t=3, checkpoint_path=...), )
Pipeline¶
The pipeline is the top-level streaming inference loop. It autoregressively generates one chunk of latent video at a time by running the encoder, the diffusion model, and the decoder back-to-back, threading per-chunk caches through every component.
- class StreamInferencePipelineConfig(*, _target: type[StreamInferencePipeline] = <factory>, name: str, diffusion_model: DiffusionModelConfig, decoder: DecoderConfig | None = None, encoder: EncoderConfig | None = None, enable_sync_and_profile: bool = False)[source]¶
Bases:
InstantiateConfigConfig for the streaming inference pipeline.
Set
encoder=Nonewhen the pipeline has no per-AR-step control input (pure T2V). Setdecoder=Noneto return the clean latent directly (training, latent-space evaluation, or pipelines that own decoding).- name: str¶
Stable slug for this pipeline variant; the primary key of
<NAME>_CONFIGS. Runners mirror it asrunner_namesoflashdreams-run <slug>resolves to this pipeline.
- diffusion_model: DiffusionModelConfig¶
Transformer + scheduler config.
- decoder: DecoderConfig | None = None¶
Optional output
StreamingDecoderwith a per-rollout cache, called asdecoder(input, autoregressive_index, cache). UseNoneto return the clean latent unchanged.
- class StreamInferencePipeline(config: StreamInferencePipelineConfig)[source]¶
Bases:
Module,Generic[StreamingEncoderCacheT,TransformerCacheT,StreamingDecoderCacheT]End-to-end streaming inference pipeline.
Generic over the encoder, transformer, and decoder cache types. The encoder’s input/output types are forwarded as
Anyso the transformer’spredict_flow/postprocess_clean_latentoverrides own the typing on theinputargument they receive.Examples
cache = pipeline.initialize_cache(transformer_context={…}) output = pipeline.generate(0, cache, input=…) pipeline.finalize(0, cache) output = pipeline.generate(1, cache, input=…) pipeline.finalize(1, cache) # optional for the last rollout
- initialize_cache(transformer_context: dict[str, Any] | None = None, encoder_context: dict[str, Any] | None = None, decoder_context: dict[str, Any] | None = None) StreamInferencePipelineCache[StreamingEncoderCacheT, TransformerCacheT, StreamingDecoderCacheT][source]¶
Build a fresh per-rollout cache.
Each
*_contextdict is forwarded as keyword arguments to the corresponding component’sinitialize_autoregressive_cache.- Parameters:
transformer_context – Per-rollout state for the transformer (e.g.
{"text_embeddings": ..., "image_embeddings": ...}).encoder_context – Per-rollout state for the encoder. Ignored when there is no encoder.
decoder_context – Per-rollout state for the decoder. Ignored when there is no decoder.
- Returns:
A fresh cache to thread through
generate/finalize.
- generate(autoregressive_index: int, cache: StreamInferencePipelineCache[StreamingEncoderCacheT, TransformerCacheT, StreamingDecoderCacheT], input: Any = None) Tensor[source]¶
Generate one chunk for this AR step.
- Parameters:
autoregressive_index – Must be
cache.autoregressive_index + 1, or0for the first call afterinitialize_cache.cache – Per-rollout cache from
initialize_cache.input – Raw input fed to the encoder. Required when an encoder is configured, must be
Noneotherwise. UseNullEncoderConfigto pass an already-encoded tensor straight through.
- Returns:
Decoded tensor (e.g. RGB video) when a decoder is configured; otherwise the unpatchified clean latent from the diffusion model.
- finalize(autoregressive_index: int, cache: StreamInferencePipelineCache[StreamingEncoderCacheT, TransformerCacheT, StreamingDecoderCacheT]) dict[str, float] | None[source]¶
Advance the diffusion AR cache for the next AR step.
- Parameters:
autoregressive_index – Must match the index passed to the most recent
generate(asserted).cache – Same cache used by
generate. Consumescache.final_state.
- Returns:
Nonewhen profiling is disabled. Otherwise a snapshot of this AR step’s per-stage timings (ms) and GPU memory (GiB):{<stage>_ms, total_ms, total_ms_wo_finalize, mem_alloc_gib, mem_reserved_gib, mem_peak_gib}. The same numbers are also logged vialogger.info.
- class StreamInferencePipelineCache(*, transformer_cache: TransformerCacheT, encoder_cache: StreamingEncoderCacheT | None = None, decoder_cache: StreamingDecoderCacheT | None = None, final_state: FinalState[TransformerCacheT] | None = None, autoregressive_index: int | None = None, event_profiler: EventProfiler | None = None)[source]¶
Bases:
Generic[StreamingEncoderCacheT,TransformerCacheT,StreamingDecoderCacheT]Per-rollout cache held by the pipeline.
- transformer_cache: TransformerCacheT¶
Long-lived transformer AR cache (always present).
- encoder_cache: StreamingEncoderCacheT | None = None¶
Encoder AR cache;
Noneiff the pipeline has no encoder.
- decoder_cache: StreamingDecoderCacheT | None = None¶
Decoder AR cache;
Noneiff the pipeline has no decoder.
Diffusion model¶
Wraps a transformer backbone with a denoising scheduler. Callers see
only noise → clean_latent; the per-step flow prediction and the
iteration loop are hidden inside
generate().
- class DiffusionModelConfig(*, _target: type[DiffusionModel] = <factory>, transformer: TransformerConfig, scheduler: SchedulerConfig, seed: int | None = None, context_noise: int = 0, noise_in_unpatchified_shape: bool = False)[source]¶
Bases:
InstantiateConfigConfig for the autoregressive diffusion model.
- transformer: TransformerConfig¶
Flow-prediction network config.
- scheduler: SchedulerConfig¶
Denoising-loop config.
- seed: int | None = None¶
RNG seed for initial-noise draws and scheduler sampling.
Noneuses the global RNG.
- class DiffusionModel(config: DiffusionModelConfig)[source]¶
Bases:
Module,Generic[TransformerCacheT]Autoregressive diffusion model (scheduler + transformer).
Generic over the transformer’s AR cache type so user-facing typing on
cacheis preserved end-to-end.Examples
model = config.setup().to(“cuda”) cache = model.transformer.initialize_autoregressive_cache(…) clean, final_state = model.generate(autoregressive_index=0, cache=cache) model.finalize(final_state)
- class FinalState(*, clean_latent: Tensor, autoregressive_index: int, cache: _FinalStateCacheT, input: Any = None)[source]¶
Bases:
Generic[_FinalStateCacheT]State passed from
generatetofinalize.- cache: _FinalStateCacheT¶
Long-lived AR cache used during generation.
- property rng: Generator | None¶
Per-model generator, lazily built on the current device.
Returns
Nonewhenconfig.seedisNone. Rebuilt the first time the model’s device changes after a.to(...). A device move resets the RNG stream — fine for “construct on CPU,.to(gpu)once” but mid-rollout device hops lose RNG state.
- generate(autoregressive_index: int, cache: TransformerCacheT, input: Any = None) tuple[Tensor, FinalState[TransformerCacheT]][source]¶
Run the denoising loop for one AR step.
- Parameters:
autoregressive_index – AR step index.
cache – Long-lived AR cache, mutated in place.
input – Optional per-AR-step encoder output. Patchified here and forwarded to
predict_flow/postprocess_clean_latent, then stashed on the returnedFinalStateforfinalize.
- Returns:
(clean_latent, final_state).clean_latentis unpatchified;final_stateshould be passed tofinalize.
- finalize(final_state: FinalState[TransformerCacheT]) None[source]¶
Advance the AR cache using the clean latent from
generate.Re-noises the clean latent to
config.context_noiseand runs the transformer’sfinalize_kv_cache(one forward for vanilla transformers, multiple for dual-network DiTs).context_noise == 0skipsadd_noise(sigma=0 is identity) and feeds the clean latent directly. This also dodges the requirement for schedulers to support at=0lookup (UniPC’s inference schedule has not=0entry).
Transformer¶
- class Transformer(config: TransformerConfig)[source]¶
Bases:
Module,ABC,Generic[TransformerCacheT]Flow-prediction transformer, generic over its AR cache subclass.
Subclasses implement
predict_flowand the patchify hooks. AR transformers also subclassTransformerAutoregressiveCacheand overrideinitialize_autoregressive_cache.Example:
class MyTransformer(Transformer[MyCache]): def predict_flow(self, noisy_latent, timestep, cache, input=None): ... def initialize_autoregressive_cache(self, **context) -> MyCache: ...
- abstract property latent_shape: tuple[int, ...]¶
Shape of the input/output latent tensor for this rank.
Includes batch dims. May depend on hierarchical context-parallel group sizes (V/T/HW), so subclasses typically derive this from
self.cp_groupsrather than from static config alone.
- abstractmethod predict_flow(noisy_latent: Tensor, timestep: Tensor, cache: TransformerCacheT, input: Any = None) Tensor[source]¶
Predict the flow at
timestep.- Parameters:
noisy_latent – Patchified noisy latent for this denoising step.
timestep – Scalar timestep tensor.
cache – Per-rollout AR cache.
input – Patchified encoder output for this AR step, or
Nonewhen the pipeline has no encoder. Subclasses should narrow the type to their encoder’s output type.
- Returns:
Predicted flow tensor with the same shape as
noisy_latent.
- finalize_kv_cache(noisy_latent: Tensor, timestep: Tensor, cache: TransformerCacheT, input: Any = None) None[source]¶
Advance the AR cache so it is ready for the next AR step.
Called by
DiffusionModel.finalizeafter the denoising loop; the flow is discarded — only the cache side effect matters. Default runs a singlepredict_flowforward. Override for transformers with multiple parallel networks that must stay in lock-step.- Parameters:
noisy_latent – Patchified latent at the AR-step’s context noise (or the clean latent when
context_noise == 0).timestep – 0-d context-noise timestep tensor.
cache – Per-rollout AR cache.
input – Same patchified encoder output passed to
predict_flow.
- initialize_autoregressive_cache(**context: Any) TransformerCacheT[source]¶
Build a fresh AR cache for a new rollout.
Default returns an empty cache, correct for non-AR transformers. Subclasses with custom cache types must override this and may declare typed per-rollout context (e.g.
text_embeddings).- Parameters:
context – Per-rollout state forwarded as keyword arguments.
- Returns:
Fresh AR cache.
- postprocess_clean_latent(clean_latent: Tensor, cache: TransformerCacheT, input: Any = None) Tensor[source]¶
Optional postprocessing hook for the predicted clean latent.
Default is identity. Override to clamp or re-inject regions whose clean value is known a priori (e.g. I2V first-frame pinning). Called at the end of
DiffusionModel.generate.- Parameters:
clean_latent – Patchified
x0from the denoising loop.cache – Per-rollout AR cache.
input – Same patchified encoder output passed to
predict_flow.
- Returns:
Postprocessed clean latent with the same shape.
- abstractmethod patchify_and_maybe_split_cp(x: Any) Any[source]¶
Patchify and (optionally) CP-split a noisy latent or encoder payload.
Tensors patchify and split. Structured payloads (e.g. an image-control struct with
latent+mask) patchify each tensor field and return the same struct type. Implement as identity when neither token packing nor CP sharding applies. Output preserves the input Python type — only shapes change.
- class TransformerAutoregressiveCache[source]¶
Bases:
objectCache that persists across an AR rollout.
Empty by default; safe to instantiate directly for non-AR transformers (default
start/finalizeare no-ops). Subclass and add fields plus AR bookkeeping for real per-rollout state.Example:
cache.start(autoregressive_index) # one or more denoising steps... cache.finalize(autoregressive_index)
Schedulers¶
A scheduler owns the entire denoising loop. It is shape-agnostic: every internal op is a broadcast against per-step scalar sigmas, so the same scheduler works for any latent layout.
- class Scheduler(config: SchedulerConfig)[source]¶
-
Denoising scheduler.
Owns the entire denoising loop. Callers see only
noise → clean; the loop shape (renoise / multistep / plain ODE) is private.Concrete configs inherit
SchedulerConfigand declare their ownnum_inference_steps/shiftfields (the base holds no shared dataclass fields).Examples
scheduler = config.setup() clean = scheduler.sample(initial_noise=noise, predict_flow=predictor) noisy = scheduler.add_noise(clean_input=clean, timestep=t)
- abstractmethod sample(initial_noise: Tensor, predict_flow: FlowPredictor, rng: Generator | None = None) Tensor[source]¶
Run the full denoising loop and return the clean latent.
Schedulers are shape-agnostic: every internal op broadcasts against per-step scalar sigmas. In practice
initial_noiseis a video latent[B, C, T, H, W], conventionally treated as a sample atsigma=1.- Parameters:
initial_noise – Gaussian noise on the caller’s device/dtype.
predict_flow – Per-step closure invoked
num_inference_stepstimes.rng – Generator on the same device. Used by self-forcing renoise loops; pure ODE solvers ignore it.
- Returns:
Clean latent with the same shape, device, and dtype as
initial_noise.
- abstractmethod add_noise(clean_input: Tensor, timestep: Tensor, rng: Generator | None = None) Tensor[source]¶
Apply the forward corruption
x_t = (1 - sigma(t)) * x_0 + sigma(t) * eps.Timestep value semantics are scheduler-specific: all schedulers snap to the nearest entry of their inference schedule.
- class FlowPredictor(*args, **kwargs)[source]¶
Bases:
ProtocolClosure
(noisy_latent, timestep) -> predicted_flow.Built by
DiffusionModel.generateby binding the per-AR-stepcache/inputto the transformer’spredict_flow. A scheduler invokes it once per denoising iteration. The scheduler decides the timestep dtype (UniPC uses int64, flow-match uses float).
- class FlowMatchSchedulerConfig(*, _target: type[FlowMatchScheduler] = <factory>, num_inference_steps: int = 4, shift: float = 8.0, denoising_timesteps: list[int] = <factory>, warp_denoising_step: bool = True, num_train_timesteps: int = 1000, sigma_max: float = 1.0, sigma_min: float = 0.0, extra_one_step: bool = True, timestep_dtype: dtype = torch.float32, enable_tqdm: bool = False)[source]¶
Bases:
SchedulerConfigConfig for the flow-matching scheduler.
- sigma_max: float = 1.0¶
Top of the linspace before warping;
1.0matches DiffSynth, upstream Wan / Lingbot ships0.999.
- sigma_min: float = 0.0¶
Bottom of the linspace before warping. Reserved for upstream parity; only
0.0is exercised.
- extra_one_step: bool = True¶
If
True, build the schedule fromlinspace(sigma_max, sigma_min, N+1)[:-1](matches DiffSynth / upstream Wan);FalseusesNpoints and is kept for non-Wan recipes.
- timestep_dtype: dtype = torch.float32¶
Dtype of
denoising_step_list. Set to an integer dtype (e.g.torch.int64) when the network’s time embedding is sensitive to the fractional part of the warped timestep — upstream Wan storesscheduler.timestepsasint64and lets the embedding upcast tofloat64internally.
- class FlowMatchScheduler(config: FlowMatchSchedulerConfig)[source]¶
Bases:
SchedulerFlow-matching scheduler with self-forcing renoise (DiffSynth-style).
Each iteration converts the predicted flow to an
x0estimate, then re-noises at the same sigma to feed the next iteration. The finalx0is returned:x_t = initial_noise for t in denoising_step_list: v = predict_flow(x_t, t) x0 = x_t - sigma(t) * v x_t = (1 - sigma(t)) * x0 + sigma(t) * eps return x0
Example:
scheduler = FlowMatchSchedulerConfig( num_inference_steps=4, shift=8.0, denoising_timesteps=[1000, 750, 500, 250], ).setup().to("cuda") clean = scheduler.sample(initial_noise=noise, predict_flow=fn)
Schedule buffers are pinned to fp32 even after
module.to(bf16); integer timesteps like 1000 would otherwise round to 1024.- sample(initial_noise: Tensor, predict_flow: FlowPredictor, rng: Generator | None = None) Tensor[source]¶
Run the self-forcing flow-match denoising loop.
Iteration 0 trusts
initial_noiseas thesigma=1sample; later iterations re-noise the previousx0estimate to the new sigma before the network forward. Schedule arithmetic auto-promotes to fp32; the result is cast back toinitial_noise.dtype.
- class FlowMatchUniPCSchedulerConfig(*, _target: type[FlowMatchUniPCScheduler] = <factory>, num_inference_steps: int = 50, shift: float = 5.0, num_train_timesteps: int = 1000, solver_order: int = 2, use_kerras_sigma: bool = False, enable_tqdm: bool = False)[source]¶
Bases:
SchedulerConfigConfig for the flow-matching UniPC scheduler.
Defaults match the official Wan 2.1 inference integration (UniPC, BH2, order 2, shift 5.0). Override
shiftper checkpoint as recommended upstream (e.g. 3.0 for Wan 2.1 14B I2V 480P).
- class FlowMatchUniPCScheduler(config: FlowMatchUniPCSchedulerConfig)[source]¶
Bases:
SchedulerOrder-2 UniPC predictor-corrector for flow-matching.
Specialized + pre-baked variant of the upstream Wan 2.1 UniPC solver. Schedule buffers (sigmas + per-step coefficients) stay fp32 regardless of
module.to(dtype).Example:
scheduler = FlowMatchUniPCSchedulerConfig( num_inference_steps=50, shift=5.0, ).setup().to("cuda") clean = scheduler.sample(initial_noise=noise, predict_flow=fn)
- sample(initial_noise: Tensor, predict_flow: FlowPredictor, rng: Generator | None = None) Tensor[source]¶
Run the order-2 UniPC predictor-corrector denoising loop.
Each iteration: network → flow →
x0→ corrector (skipped at step 0) → predictor. All per-step coefficients are pre-baked at construction; the loop is pure tensor ops. Internal arithmetic is fp32; the result is cast back toinitial_noise.dtype.rngis unused (deterministic ODE) but accepted for interface conformance.
Encoder¶
Encoders turn raw conditioning (text prompts, reference images, per-AR-step control inputs, …) into latent tensors. Two flavours:
Encoderis stateless and one-shot.forward(self, input). Used astransformer.context_encoderfor text / CLIP-image / identity.StreamingEncoderis stateful and per-AR-step.forward(self, input, autoregressive_index, cache)with anStreamingEncoderCache. Used aspipeline.encoderfor per-step control (HDMap, camera trajectory, I2V first-frame VAE).StreamingVideoEncoderextendsStreamingEncoderwith the contracts a streaming pixel-video encoder always needs: spatial / temporal compression ratios plus AR-step-aware temporal size mappers between pixel and latent space.
- class Encoder(config: EncoderConfig)[source]¶
-
Stateless encoder.
forwardis not pinned by the base. Encoders used as acontext_encoder(one-shot, called once insideTransformer.initialize_autoregressive_cache()) must match the slim call shapeforward(self, input).For per-AR-step encoders that need a per-rollout cache, inherit from
StreamingEncoderinstead.
- class StreamingEncoder(config: EncoderConfig)[source]¶
Bases:
ABC,Module,Generic[StreamingEncoderCacheT]Streaming encoder, generic over the per-rollout cache type.
forwardis not pinned by the base. Streaming encoders called byStreamInferencePipelinemust match its call shape:forward(self, input, autoregressive_index=0, cache=None).
- class StreamingVideoEncoder(config: EncoderConfig)[source]¶
Bases:
StreamingEncoder[StreamingEncoderCacheT]Streaming pixel-video encoder.
Pins down the contracts that every streaming pixel→latent video encoder satisfies in addition to
StreamingEncoder:Spatial and temporal compression ratios between the pixel and latent grids (constants of the architecture).
AR-step-aware temporal size mappers, so a pipeline can size its inputs and outputs without knowing the encoder’s concrete temporal cache topology (causal first-frame padding, sliding windows, etc.).
Spatial scaling is trivially
side // spatial_compression_ratioin either direction; the AR-step-asymmetric piece is the temporal size, which gets its own mapper. Typically AR 0 takes1 + (T_lat - 1) * rpixel frames to produceT_latlatent frames because of causal first-frame padding, while AR ≥ 1 takesT_lat * rpixel frames.- abstract property spatial_compression_ratio: int¶
Pixel side ÷ latent side. Constant across AR steps.
- abstract property temporal_compression_ratio: int¶
Pixel frames ÷ latent frames in steady state (AR ≥ 1).
AR 0 typically takes one extra (un-grouped) pixel frame to produce its first latent frame because of causal first-frame padding; that asymmetry lives inside
get_output_temporal_size()/get_input_temporal_size().
- abstractmethod get_output_temporal_size(autoregressive_index: int, input_temporal_size: int) int[source]¶
Latent frame count produced from
input_temporal_sizepixel frames.- Parameters:
autoregressive_index – AR step index (0-based).
input_temporal_size – Number of pixel frames fed at this step.
- Returns:
Number of latent frames emitted at this step.
- abstractmethod get_input_temporal_size(autoregressive_index: int, output_temporal_size: int) int[source]¶
Pixel frame count needed to produce
output_temporal_sizelatents.Inverse of
get_output_temporal_size(). Implementations should assertoutput_temporal_sizeis achievable at this AR step (i.e. the corresponding pixel count comes out as a positive integer).- Parameters:
autoregressive_index – AR step index (0-based).
output_temporal_size – Desired number of latent frames.
- Returns:
Number of pixel frames needed at this step.
- class StreamingEncoderCache[source]¶
Bases:
objectPer-rollout cache for
StreamingEncoder.Empty by default; subclass to add fields (e.g. last-frame latent, cross-step accumulators).
- class NullEncoderConfig(*, _target: type[NullEncoder] = <factory>)[source]¶
Bases:
EncoderConfigConfig for the identity encoder.
- class NullEncoder(config: EncoderConfig)[source]¶
Bases:
EncoderIdentity encoder: returns its input unchanged.
Wire as the transformer’s
context_encoderslot to pass already- encoded tensors straight to the diffusion model.Example:
config = TransformerConfig( context_encoder=NullEncoderConfig(), ..., )
Decoder¶
Decoders turn the latents emitted by the diffusion model back into pixel frames. Single base class with two specialisations:
StreamingDecoderis stateful.forward(self, input, autoregressive_index, cache)with aStreamingDecoderCache. Use for chunk-by-chunk streaming decoders (e.g. WAN VAE that maintains a temporal cache across AR steps); stateless decoders just return an emptyStreamingDecoderCachefromStreamingDecoder.initialize_autoregressive_cache()and ignoreautoregressive_index/cacheinforward.StreamingVideoDecoderextendsStreamingDecoderwith the contracts a streaming pixel-video decoder always needs: spatial / temporal compression ratios plus AR-step-aware temporal size mappers between latent and pixel space.
- class StreamingDecoder(config: DecoderConfig)[source]¶
Bases:
ABC,Module,Generic[StreamingDecoderCacheT]Streaming decoder, generic over the per-rollout cache type.
forwardis not pinned by the base. Streaming decoders called byStreamInferencePipelinemust match its call shape:forward(self, input, autoregressive_index=0, cache=None).
- class StreamingVideoDecoder(config: DecoderConfig)[source]¶
Bases:
StreamingDecoder[StreamingDecoderCacheT]Streaming pixel-video decoder.
Pins down the contracts that every streaming latent→pixel video decoder satisfies in addition to
StreamingDecoder:Spatial and temporal compression ratios between the latent and pixel grids (constants of the architecture).
AR-step-aware temporal size mappers, so a pipeline can size its inputs and outputs without knowing the decoder’s concrete temporal cache topology (causal first-frame padding, sliding windows, etc.).
Spatial scaling is trivially
side * spatial_compression_ratioin either direction; the AR-step-asymmetric piece is the temporal size, which gets its own mapper. Typically AR 0 produces fewer pixel frames per latent frame than AR ≥ 1 because of causal first-frame padding.- abstract property spatial_compression_ratio: int¶
Pixel side ÷ latent side. Constant across AR steps.
- abstract property temporal_compression_ratio: int¶
Pixel frames ÷ latent frames in steady state (AR ≥ 1).
AR 0 typically yields fewer pixel frames per latent frame because of causal first-frame padding; that asymmetry lives inside
get_output_temporal_size()/get_input_temporal_size().
- abstractmethod get_output_temporal_size(autoregressive_index: int, input_temporal_size: int) int[source]¶
Pixel frame count produced by
input_temporal_sizelatent frames.- Parameters:
autoregressive_index – AR step index (0-based).
input_temporal_size – Number of latent frames fed at this step.
- Returns:
Number of pixel frames emitted at this step.
- abstractmethod get_input_temporal_size(autoregressive_index: int, output_temporal_size: int) int[source]¶
Latent frame count needed to produce
output_temporal_sizepixels.Inverse of
get_output_temporal_size(). Implementations should assertoutput_temporal_sizeis achievable at this AR step (i.e. divisible by the right ratio after subtracting any causal padding).- Parameters:
autoregressive_index – AR step index (0-based).
output_temporal_size – Desired number of pixel frames.
- Returns:
Number of latent frames needed at this step.
- class StreamingDecoderCache[source]¶
Bases:
objectPer-rollout cache for
StreamingDecoder.Empty by default; subclass to add fields (e.g. temporal feature buffers carried across AR steps).