When you pass MXFP8 or NVFP4 tensors as inputs to cuDNN operations on Blackwell, the accompanying block scaling factors must be stored in a specific tiled memory layout. This post explains what that layout is, why it exists, and how to work with it.

Background: MX Block Scaling

MX (Microscaling) is an industry specification established by AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm to standardize low-precision block-scaled data types. The core idea: a tensor is divided into fixed-size blocks, and each block shares a single scale factor. The higher-precision reconstruction is simply element × scale.

  Format   Element type   Element bits   Block size   Scale type   Scale bits
  MXFP8    E5M2 / E4M3    8              32           E8M0         8
  MXFP6    E3M2 / E2M3    6              32           E8M0         8
  MXFP4    E2M1           4              32           E8M0         8
  MXINT8   INT8           8              32           E8M0         8

ExMy denotes x exponent bits and y mantissa bits. E8M0 is special — it has no sign bit and represents pure powers of 2.
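
Because an E8M0 scale is just a biased exponent byte, decoding it is a one-liner. A minimal sketch (exponent bias 127; in the OCP MX spec the all-ones byte is reserved for NaN):

```python
def decode_e8m0(byte: int) -> float:
    """Decode an E8M0 scale byte: an unsigned biased exponent, value = 2**(byte - 127)."""
    assert 0 <= byte <= 255
    if byte == 0xFF:                 # reserved encoding for NaN
        return float("nan")
    return 2.0 ** (byte - 127)

# decode_e8m0(127) → 1.0, decode_e8m0(128) → 2.0, decode_e8m0(126) → 0.5
```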

NVFP4 is an inference-oriented format that uses double quantization: MX-style block scaling plus a per-tensor FP32 scale.

  Format   Element type   Block size   Block scale type   Tensor scale type
  NVFP4    E2M1           16           E4M3               FP32
  NVFP4    E2M1           16           E8M0               FP32

On Blackwell, MXFP8 is the primary format for LLM training and NVFP4 for inference. cuDNN operations that consume these formats require the block scaling factors to be in the 128×4 tiled layout described below.

Why a Special Layout?

Consider a matrix multiply where the hardware processes data in 128-row tiles. Each row has its own set of block scale factors (one per block_size elements along the contraction dimension). In a naive row-major layout, scales for 128 different rows are scattered across 128 separate memory regions — loading them requires 128 cache-line accesses.

The 128×4 tiled layout solves this by co-locating the scale factors for 128 consecutive rows into a single contiguous memory chunk. The hardware can then load all scales for one tile in a single coalesced memory transaction and feed them directly to the tensor core without any gather/scatter overhead.

Row-major (scattered):    row 0 scales | row 1 scales | ... | row 127 scales
                          ↑ 128 separate memory regions for one tile's worth of scales

128×4 tiled (coalesced):  [ row 0..127, cols 0..3 interleaved ]
                          ↑ one contiguous 512-byte chunk

The Logical Scale Tensor

Before getting into the tiled layout, let’s define the logical scale tensor shape. For a 2D data matrix of shape [M, K]:

MXFP8 (block_size = 32, E8M0 scales):

Scale shape: [M, ceil(K / 32)]
Each scale covers 32 consecutive elements along K.
One 128×4 tile of scales → 128 rows × 4 scale cols × 32 elements/scale = 128×128 data block.

NVFP4 (block_size = 16, E4M3 scales):

Scale shape: [M, ceil(K / 16)]
Each scale covers 16 consecutive elements along K.
One 128×4 tile of scales → 128 rows × 4 scale cols × 16 elements/scale = 128×64 data block.

Both formats share the same tile geometry: 128 rows × 4 scale columns. The only difference is how many data elements each scale factor covers.
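
The two geometries can be checked in a couple of lines (a sketch; `logical_scale_shape` is an illustrative helper, with `block_size` per format as in the tables above):

```python
from math import ceil

def logical_scale_shape(M: int, K: int, block_size: int) -> tuple[int, int]:
    """Logical (unpadded) scale tensor shape for an [M, K] data matrix."""
    return M, ceil(K / block_size)

# One 128×4 tile of scales covers 128 rows × (4 * block_size) data columns:
assert 4 * 32 == 128                               # MXFP8 → 128×128 data block
assert 4 * 16 == 64                                # NVFP4 → 128×64 data block
assert logical_scale_shape(256, 256, 32) == (256, 8)   # MXFP8 example
assert logical_scale_shape(256, 256, 16) == (256, 16)  # NVFP4 example
```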

Tile-Internal Layout

Each 128×4 tile contains 512 scale values. They are not stored in simple row-major order within the tile. Instead, the 128 rows are interleaved in groups of 32, matching how the four warps of a warp group divide a tile's rows on the Blackwell tensor core.

The coordinate mapping between a logical (outer, inner) position within a tile and the linear memory offset is:

// (outer, inner) → linear offset within a 128×4 tile
// outer: row index within tile, 0–127
// inner: column index within tile, 0–3
offset = (outer % 32) * 16 + (outer / 32) * 4 + inner

// linear offset → (outer, inner)
outer = ((offset % 16) / 4) * 32 + (offset / 16)
inner = offset % 4
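
These two mappings should be exact inverses; a brute-force round trip over all 512 tile positions is a cheap sanity check (a Python sketch of the integer arithmetic above):

```python
def to_offset(outer: int, inner: int) -> int:
    """(outer, inner) → linear offset within a 128×4 tile."""
    return (outer % 32) * 16 + (outer // 32) * 4 + inner

def from_offset(offset: int) -> tuple[int, int]:
    """Linear offset → (outer, inner) within a 128×4 tile."""
    return ((offset % 16) // 4) * 32 + offset // 16, offset % 4

# Every (outer, inner) maps to a unique offset in [0, 512), and back.
offsets = {to_offset(o, i) for o in range(128) for i in range(4)}
assert offsets == set(range(512))
assert all(from_offset(to_offset(o, i)) == (o, i)
           for o in range(128) for i in range(4))
```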

Here’s what the interleaving looks like concretely. Within the tile, memory offset 0–15 contains:

offset 0:  (outer=0,  inner=0)     ←─ row 0, col 0
offset 1:  (outer=0,  inner=1)     ←─ row 0, col 1
offset 2:  (outer=0,  inner=2)     ←─ row 0, col 2
offset 3:  (outer=0,  inner=3)     ←─ row 0, col 3
offset 4:  (outer=32, inner=0)     ←─ row 32, col 0
offset 5:  (outer=32, inner=1)     ←─ row 32, col 1
offset 6:  (outer=32, inner=2)     ←─ row 32, col 2
offset 7:  (outer=32, inner=3)     ←─ row 32, col 3
offset 8:  (outer=64, inner=0)     ←─ row 64, col 0
offset 9:  (outer=64, inner=1)     ←─ row 64, col 1
offset 10: (outer=64, inner=2)     ←─ row 64, col 2
offset 11: (outer=64, inner=3)     ←─ row 64, col 3
offset 12: (outer=96, inner=0)     ←─ row 96, col 0
offset 13: (outer=96, inner=1)     ←─ row 96, col 1
offset 14: (outer=96, inner=2)     ←─ row 96, col 2
offset 15: (outer=96, inner=3)     ←─ row 96, col 3

Then offset 16–31 covers rows 1, 33, 65, 97 (each with 4 columns), and so on. The pattern: each consecutive 16-element group packs 4 rows (spaced 32 apart) × 4 columns. There are 32 such groups per tile, yielding 32 × 16 = 512 total elements.

The reason for the 32-row stride: it matches the warp size on Blackwell. A tile's 128 rows are split across the four warps of a warp group; each warp processes 32 rows of data and reads its scale factors from fixed 4-element slices of the tile, with zero gather overhead.

Warp 0 (rows 0–31):   offsets [0..3], [16..19], [32..35], ..., [496..499]
                       ↕ stride-16 within tile → one contiguous 4-element slice per row

Warp 1 (rows 32–63):  offsets [4..7], [20..23], [36..39], ..., [500..503]
Warp 2 (rows 64–95):  offsets [8..11], [24..27], [40..43], ..., [504..507]
Warp 3 (rows 96–127): offsets [12..15], [28..31], [44..47], ..., [508..511]

Multi-Tile Layout

Real scale tensors are larger than a single 128×4 tile. Tiles are arranged row-major across the full scale tensor. Given sf_inner_dim scale columns (padded to a multiple of 4), the starting offset of the tile at logical scale position (sf_outer, sf_inner) is:

// sf_outer must be a multiple of 128 and sf_inner a multiple of 4 (tile-aligned)
tile_start = sf_inner * 128 + sf_outer * sf_inner_dim

Worked example — a data matrix of shape [256, 256] with MXFP8 (block_size=32):

Logical scale shape:   [256, ceil(256/32)] = [256, 8]
Tiles needed:          ceil(256/128) × ceil(8/4) = 2 × 2 = 4 tiles

Memory layout (row-major tiles, each tile = 512 elements):

  ┌───────────────────┬───────────────────┐
  │ Tile A            │ Tile B            │
  │ rows 0–127        │ rows 0–127        │
  │ scale cols 0–3    │ scale cols 4–7    │
  ├───────────────────┼───────────────────┤
  │ Tile C            │ Tile D            │
  │ rows 128–255      │ rows 128–255      │
  │ scale cols 0–3    │ scale cols 4–7    │
  └───────────────────┴───────────────────┘

  Memory: [Tile A (512)] [Tile B (512)] [Tile C (512)] [Tile D (512)]
  Total:  2048 scale elements
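
The four start offsets fall out directly from the tile indices (a sketch; `tile_start` is an illustrative helper mirroring the reference implementation's arithmetic, not a cuDNN API):

```python
def tile_start(row: int, scale_col: int, sf_inner_dim: int) -> int:
    """Start offset of the 512-element tile covering (row, scale_col)."""
    tile_r, tile_c = row // 128, scale_col // 4
    num_tile_cols = sf_inner_dim // 4
    return (tile_c + tile_r * num_tile_cols) * 512

# Tiles A, B, C, D of the [256, 8] scale tensor:
starts = [tile_start(r, c, 8) for r, c in [(0, 0), (0, 4), (128, 0), (128, 4)]]
# → [0, 512, 1024, 1536]
```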

Padding Rules

When dimensions don’t align to tile boundaries, pad to full tiles and zero-fill the out-of-bounds entries:

  • outer dimension (rows): pad to a multiple of 128
  • inner dimension (scale columns): pad to a multiple of 4

from math import ceil

def padded_scale_shape(M, K, block_size):
    """Compute the padded scale tensor shape for 128×4 tiling."""
    sf_cols = ceil(K / block_size)
    sf_cols_padded = ceil(sf_cols / 4) * 4     # inner: multiple of 4
    sf_rows_padded = ceil(M / 128) * 128       # outer: multiple of 128
    return sf_rows_padded, sf_cols_padded

# MXFP8 example: M=500, K=192, block_size=32
# → sf_cols = 6, sf_cols_padded = 8, sf_rows_padded = 512
# → allocate [512, 8] = 4096 elements, zero-fill OOB

# NVFP4 example: M=500, K=192, block_size=16
# → sf_cols = 12, sf_cols_padded = 12, sf_rows_padded = 512
# → allocate [512, 12] = 6144 elements, zero-fill OOB

Additional constraints:

  • Scale tensor starting addresses must be 16-byte aligned
  • The tiled layout is not transposition-invariant — even when data is transposed, the scale layout stays the same; you must recompute scales in the correct orientation rather than transposing the scale tensor
  • Kernels may overwrite out-of-bounds slots with zeros — don’t assume OOB values persist

Putting It Together: cuDNN MXFP8 Attention

In MXFP8 attention, each of Q, K, V has shape [B, H, S, D] with block scaling along a specific dimension. The scale tensors adopt the 128×4 tiled layout where:

  • outer = the non-scaling dimension (e.g., sequence positions for row-wise scales)
  • inner = the scaling dimension divided by block_size (e.g., ceil(D/32) for head-dimension scales)

Q [B, H, S_q, D]  — row-wise scales (block along D):
  Logical:   SF_Q [B, H, S_q,          ceil(D/32)]
  Physical:  SF_Q [B, H, ceil(S_q/128)×128, ceil(ceil(D/32)/4)×4]  in 128×4 tiled layout

V [B, H, S_kv, D] — column-wise scales (block along S_kv):
  Logical:   SF_V [B, H, ceil(S_kv/32), D]
  Physical:  SF_V [B, H, ceil(ceil(S_kv/32)/128)×128, ceil(D/4)×4]  in 128×4 tiled layout
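
Concretely, per (B, H) slice the padding works out as follows. A sketch with illustrative sizes (S_q = S_kv = 1024, D = 128, MXFP8 block_size = 32; the sizes are hypothetical, not from cuDNN docs):

```python
from math import ceil

def physical_sf_shape(outer: int, inner: int) -> tuple[int, int]:
    """Pad a logical [outer, inner] scale shape to full 128×4 tiles."""
    return ceil(outer / 128) * 128, ceil(inner / 4) * 4

S, D, block = 1024, 128, 32

# Q: row-wise scales, block along D → logical [S, ceil(D/32)] = [1024, 4]
assert physical_sf_shape(S, ceil(D / block)) == (1024, 4)

# V: column-wise scales, block along S_kv → logical [ceil(S/32), D] = [32, 128]
assert physical_sf_shape(ceil(S / block), D) == (128, 128)
```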

Reference Implementation

Complete Python conversion between row-major and 128×4 tiled layout:

import numpy as np
from math import ceil

def to_128x4_tiled(scales_2d: np.ndarray) -> np.ndarray:
    """Convert a padded [outer, inner] row-major scale tensor to 128×4 tiled layout.

    Requires: outer % 128 == 0 and inner % 4 == 0.
    """
    outer_dim, inner_dim = scales_2d.shape
    assert outer_dim % 128 == 0 and inner_dim % 4 == 0

    num_tile_cols = inner_dim // 4
    out = np.empty(outer_dim * inner_dim, dtype=scales_2d.dtype)

    for r in range(outer_dim):
        for c in range(inner_dim):
            tile_r, tile_c = r // 128, c // 4
            tile_base = (tile_c + tile_r * num_tile_cols) * 512

            lr, lc = r % 128, c % 4  # local coords within tile
            local_off = (lr % 32) * 16 + (lr // 32) * 4 + lc

            out[tile_base + local_off] = scales_2d[r, c]

    return out


def from_128x4_tiled(flat: np.ndarray, outer_dim: int, inner_dim: int) -> np.ndarray:
    """Convert a flat 128×4 tiled buffer back to [outer, inner] row-major."""
    assert outer_dim % 128 == 0 and inner_dim % 4 == 0

    num_tile_cols = inner_dim // 4
    scales_2d = np.empty((outer_dim, inner_dim), dtype=flat.dtype)

    for idx in range(len(flat)):
        tile_idx = idx // 512
        local_off = idx % 512

        tile_r = tile_idx // num_tile_cols
        tile_c = tile_idx % num_tile_cols

        lr = ((local_off % 16) // 4) * 32 + (local_off // 16)
        lc = local_off % 4

        r = tile_r * 128 + lr
        c = tile_c * 4 + lc
        scales_2d[r, c] = flat[idx]

    return scales_2d


def quantize_scales_for_cudnn(scales_2d: np.ndarray, block_size: int) -> np.ndarray:
    """Pad and convert a raw [M, ceil(K/block_size)] scale tensor for cuDNN."""
    M, sf_cols = scales_2d.shape
    padded_rows = ceil(M / 128) * 128
    padded_cols = ceil(sf_cols / 4) * 4

    padded = np.zeros((padded_rows, padded_cols), dtype=scales_2d.dtype)
    padded[:M, :sf_cols] = scales_2d

    return to_128x4_tiled(padded)
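
The per-element loop in to_128x4_tiled can also be written as a single NumPy reshape/transpose. The sketch below derives the axis order from the offset formula and spot-checks a few positions against it (an equivalence claim about the same layout, not a different one):

```python
import numpy as np

def to_128x4_tiled_vec(scales_2d: np.ndarray) -> np.ndarray:
    """Vectorized 128×4 tiling: split rows into (tile_r, outer//32, outer%32),
    cols into (tile_c, inner), then order axes to match the offset formula."""
    tr, tc = scales_2d.shape[0] // 128, scales_2d.shape[1] // 4
    return (scales_2d.reshape(tr, 4, 32, tc, 4)   # (tile_r, outer//32, outer%32, tile_c, inner)
                     .transpose(0, 3, 2, 1, 4)    # (tile_r, tile_c, outer%32, outer//32, inner)
                     .ravel())

# Spot-check one tile against the scalar formula.
a = np.arange(128 * 4, dtype=np.int32).reshape(128, 4)   # value at (r, c) is r*4 + c
flat = to_128x4_tiled_vec(a)
for r, c in [(0, 0), (1, 0), (32, 0), (97, 3)]:
    assert flat[(r % 32) * 16 + (r // 32) * 4 + c] == r * 4 + c
```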

Worked Example: Tracing the Mapping

Let’s trace a small case to build intuition. Scale tensor shape [4, 2], after padding to [128, 4]:

Original (4×2):         After padding (128×4):
 row 0: [a, b]          row 0:   [a, b, 0, 0]
 row 1: [c, d]          row 1:   [c, d, 0, 0]
 row 2: [e, f]          row 2:   [e, f, 0, 0]
 row 3: [g, h]          row 3:   [g, h, 0, 0]
                         rows 4–127: all zeros

After 128×4 tiling (single tile, 512 elements):

offset  0– 3:  row 0   → [a, b, 0, 0]
offset  4– 7:  row 32  → [0, 0, 0, 0]
offset  8–11:  row 64  → [0, 0, 0, 0]
offset 12–15:  row 96  → [0, 0, 0, 0]
offset 16–19:  row 1   → [c, d, 0, 0]
offset 20–23:  row 33  → [0, 0, 0, 0]
...
offset 32–35:  row 2   → [e, f, 0, 0]
...
offset 48–51:  row 3   → [g, h, 0, 0]
...
offsets 64–511: all zeros (rows 4–31 and their 32-stride partners)

Rows 0, 32, 64, 96 are grouped first, then 1, 33, 65, 97, and so on. Within each 4-row group, the 4 inner-dimension values are stored contiguously.

MXFP8 vs NVFP4: Layout Comparison

While both formats share the same 128×4 tile structure, the key differences are in block size and how much data each tile covers:

                          MXFP8                         NVFP4
  Block size              32                            16
  Scale type              E8M0 (8-bit)                  E4M3 (8-bit) or E8M0 (8-bit)
  Data covered per tile   128 rows × 128 cols           128 rows × 64 cols
  Scale cols per tile     4 (each covers 32 elements)   4 (each covers 16 elements)
  Tensor-level scale      No                            Yes (FP32, double quantization)
  Use case                Training (Blackwell)          Inference (Blackwell)

For NVFP4 with double quantization, the block scales in 128×4 tiled layout are multiplied with the per-tensor FP32 scale during dequantization:

dequantized_value = element × block_scale × tensor_scale
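
A sketch of that dequantization over a row-major block (assumes elements and block scales have already been decoded to float; the function and array names are illustrative):

```python
import numpy as np

def dequantize_nvfp4_rows(elems: np.ndarray, block_scales: np.ndarray,
                          tensor_scale: float, block_size: int = 16) -> np.ndarray:
    """elems: [M, K] decoded FP4 values; block_scales: [M, K // block_size]."""
    return elems * np.repeat(block_scales, block_size, axis=1) * tensor_scale

elems = np.ones((1, 32), dtype=np.float32)
scales = np.array([[2.0, 4.0]], dtype=np.float32)   # one scale per 16 elements
out = dequantize_nvfp4_rows(elems, scales, tensor_scale=0.5)
# first 16 values → 1 × 2 × 0.5 = 1.0; last 16 → 1 × 4 × 0.5 = 2.0
```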

Common Pitfalls

Forgetting to pad. If the scale dimension isn’t a multiple of 4 or the spatial dimension isn’t a multiple of 128, cuDNN will reject the input or produce incorrect results. Always pad first, zero-fill, then tile.

Confusing logical and physical shapes. The logical scale shape [M, ceil(K/block_size)] tells you the semantic meaning; the physical allocation [ceil(M/128)×128, ceil(ceil(K/block_size)/4)×4] is what you actually allocate and tile.

Transposing scale tensors. The 128×4 layout does not support transposition. If your operation needs the data in a transposed orientation, you must requantize and produce new scale factors in the correct layout — you cannot simply transpose the existing scale tensor.

Misaligned allocation. Scale factor pointers must be 16-byte aligned. Standard CUDA allocation (cudaMalloc) guarantees this, but custom allocators or sub-allocation from pools may not.

Learn More