Core¶
The flashdreams.core package collects the low-level kernels and
process-group utilities that integrations share.
Attention¶
The attention package provides the kernels used by the transformer and the block-structured KV cache that backs streaming inference.
- class NativeAttention(qkv_format: Literal['bhsd', 'bshd'] = 'bhsd', backend: Literal['math', 'efficient', 'cudnn', 'flash'] = 'cudnn')[source]¶
Bases:
ModuleNative attention module with configurable QKV layout and SDPA backend.
- set_context_parallel_group(cp_group: ProcessGroup | None) None[source]¶
Enable or disable context parallelism for ring attention.
- Parameters:
cp_group – Process group for context parallel; use None to disable.
- forward(query: Tensor, key: Tensor, value: Tensor) Tensor[source]¶
Run context-parallel SDPA (or single-rank SDPA when CP is disabled).
- Parameters:
query – Query tensor in configured
qkv_format.key – Key tensor in configured
qkv_format.value – Value tensor in configured
qkv_format.
- Returns:
Attention output in the same format as inputs.
- class ContextParallelAttention(qkv_format: Literal['bhsd', 'bshd'] = 'bhsd', backend: Literal['cudnn', 'flash'] = 'cudnn', method: Literal['ring', 'ulysses'] = 'ring', convert_to_fp32: bool = True)[source]¶
Bases:
NativeAttentionContext-parallel attention with selectable method and SDPA backend.
- class BlockKVCache(k_shape: tuple[int, ...], v_shape: tuple[int, ...], seq_dim: int, chunk_size: int, window_size: int, sink_size: int = 0, device: device | str = device(type='cuda'), dtype: dtype = torch.float16, _prev_chunk_idx: int = -1, _curr_chunk_idx: int | None = None, _n_cached: int = 0)[source]¶
Bases:
objectKV cache for causal attention with a fixed-size local window, CUDA-graph compatible.
Keys and values can have arbitrary shape
[..., total_size, ...]; the sequence (rolling) dimension is given byseq_dim(dimension index, can be negative). Layout along that dimension: [sink tokens | local window tokens]. Sink tokens are never evicted; the local window rolls left as new chunks are added if full. Chunks are non-overlapping: each update adds one chunk ofchunk_sizetokens at the next logical position in the full sequence.Note: Currently only supports
total_size(sink_size + window_size) divisible bychunk_size.- Phases:
- Filling – cache not yet full; tokens are written contiguously;
cached_k()/cached_v()return only the valid prefix.- Steady-state – cache full; each new chunk triggers a left-roll of the local window and overwrites the rightmost positions;
cached_k()/cached_v()return the full buffer.
The argument
chunk_idx(0, 1, 2, …) is the index of the new chunk in the full sequence (not an index into the cache). Ifchunk_idxis greater than the previous one, the chunk is appended (or, in steady-state, written after the roll). Ifchunk_idxequals the previous one, the same cache positions are overwritten.- Per-step usage:
before_update(chunk_idx) — prepare (roll local window if steady-state).
update(k, v) — write the new chunk’s keys/values into the cache.
cached_k() / cached_v() — get cached keys/values for attention.
after_update(chunk_idx) — update internal bookkeeping.
- k_shape: tuple[int, ...]¶
Shape of the keys. Must be the same as the values shape except for the last dimension.
- v_shape: tuple[int, ...]¶
Shape of the values. Must be the same as the keys shape except for the last dimension.
- sink_size: int = 0¶
Number of sink tokens at the start of the cache that are never evicted. Defaults to 0.
- classmethod from_tensor(k: Tensor, v: Tensor, seq_dim: int) Self[source]¶
Build a single-chunk cache pre-filled with the given key and value tensors.
- before_update(chunk_idx: int) None[source]¶
Prepare the cache before writing new tokens.
If
chunk_idxequals the previous chunk index, this is a no-op. Otherwise, we expect thechunk_idxto be +1 from the previous chunk index. In this case, we will roll the local window left if the cache is in steady-state, or no op if the cache is in filling phase.- Parameters:
chunk_idx – Chunk index of the new chunk in the full sequence.
- update(k: Tensor, v: Tensor) None[source]¶
Write the new chunk’s keys and values into the cache.
Must be called after
before_update()and beforeafter_update().- Parameters:
k – Keys; shape must match cached keys except at seq_dim, where length must be chunk_size.
v – Values; shape must match cached values except at seq_dim, where length must be chunk_size.
- after_update(chunk_idx: int) None[source]¶
Finalize bookkeeping after writing new tokens.
Updates
_prev_chunk_idxand, in filling phase,_n_cached.- Parameters:
chunk_idx – The index of the new chunk in the full sequence.
- cached_k() Tensor[source]¶
Return cached keys for attention (valid prefix in filling phase, full buffer in steady-state).
Distributed¶
Helpers for multi-GPU / multi-node inference. init boots the NCCL
process group with sensible defaults (NVML-derived CPU affinity,
heartbeat timeout, larger L2 fetch granularity) and is a drop-in for
the boilerplate at the top of the example launchers.