postprocess
Utils to load and process model_config.
Functions
| Function | Description |
|---|---|
| check_weight_shape_valid | Check if weight shapes are valid with the target inference TP. |
| pad_embedding_lm_head | Pad lm_head and embedding weights to multiples of 64 for AWQ quantization. |
| postprocess_model_config | Postprocess the model configs from the training tensor parallelism to the target inference tensor parallelism. |
| postprocess_tensors | Ensure all tensors in the model_config are on CPU, contiguous, and own their memory. |
| update_lm_head_quantization | Update the lm_head quantization config for TRT-LLM export. |
- check_weight_shape_valid(config, inference_tensor_parallel=1, training_tensor_parallel=1)
Check if weight shapes are valid with the target inference TP.
This function is recursive.
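A minimal call sketch; the import path is inferred from this page's module name, and `config` is assumed to be a ModelConfig produced by the export step:
```python
# Hypothetical usage sketch -- the import path is an assumption based on
# this page's module name and may differ in your install.
from modelopt.torch.export.postprocess import check_weight_shape_valid

# Fails if any sharded weight cannot be split evenly across the target
# inference tensor parallelism (config is a prebuilt ModelConfig).
check_weight_shape_valid(
    config,
    inference_tensor_parallel=2,
    training_tensor_parallel=1,
)
```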
- pad_embedding_lm_head(model_config, padding_factor=64)
Pad lm_head and embedding weights to multiples of 64 for AWQ quantization (a padding sketch follows this entry).
- Parameters:
model_config (ModelConfig) –
padding_factor (int) –
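The padding itself is standard tensor manipulation. An illustrative, self-contained sketch of rounding a vocab dimension up to a multiple of `padding_factor` (not the library's implementation):
```python
import torch
import torch.nn.functional as F

def pad_vocab_to_multiple(weight: torch.Tensor, padding_factor: int = 64) -> torch.Tensor:
    # Round the vocab (row) dimension up to the next multiple of
    # padding_factor and zero-pad; AWQ kernels expect dimensions
    # divisible by the quantization group size.
    vocab_size = weight.shape[0]
    padded = -(-vocab_size // padding_factor) * padding_factor  # ceil division
    # F.pad pads dims from last to first: (left, right, top, bottom).
    return F.pad(weight, (0, 0, 0, padded - vocab_size))

w = torch.randn(50257, 4096)           # vocab x hidden, GPT-2-style vocab
print(pad_vocab_to_multiple(w).shape)  # torch.Size([50304, 4096])
```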
- postprocess_model_config(model_config, inference_tensor_parallel=1, inference_pipeline_parallel=1, training_pipeline_parallel=1, workspace_path=None)
Postprocess the model configs from the training tensor parallelism to the target inference tensor parallelism (see the usage sketch after this entry).
If training_pipeline_parallel > 1, the model configs across PP ranks are merged into one.
- Returns:
- The processed model configs as a list.
- For the merging case:
The merged rank returns the merged model_config as a single-item list. The other ranks return an empty list, as they are no longer exported.
- For the split case:
The split model config list is returned.
- Parameters:
model_config (ModelConfig) –
inference_tensor_parallel (int) –
inference_pipeline_parallel (int) –
training_pipeline_parallel (int) –
workspace_path (Path | str | None) –
- Return type:
List[ModelConfig]
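A hedged usage sketch; the surrounding variable names are assumptions, not from this page:
```python
# Hypothetical usage after a per-rank export. With
# training_pipeline_parallel == 1 there is no merge step, so the call
# returns one config per target inference TP rank.
configs = postprocess_model_config(
    model_config,
    inference_tensor_parallel=2,
    inference_pipeline_parallel=1,
    training_pipeline_parallel=1,
)
for rank, cfg in enumerate(configs):  # len(configs) == 2 here
    print(f"inference TP rank {rank}: {type(cfg).__name__}")
```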
- postprocess_tensors(model_config, force_cpu=True, force_contiguous=True, force_non_view=True)
Ensure all tensors in the model_config are on CPU, contiguous, and own their memory (see the sketch after this entry).
- Parameters:
model_config (ModelConfig) –
force_cpu (bool) –
force_contiguous (bool) –
force_non_view (bool) –
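An illustrative sketch of what the three flags imply for a single tensor; this is not the library's implementation:
```python
import torch

def normalize_tensor(t: torch.Tensor) -> torch.Tensor:
    # force_cpu: move the tensor off the accelerator.
    t = t.detach().cpu()
    # force_contiguous: materialize a dense, row-major layout.
    t = t.contiguous()
    # force_non_view: clone if the tensor aliases a larger buffer, so it
    # owns exactly its own storage (important before serialization).
    if t.untyped_storage().nbytes() != t.numel() * t.element_size():
        t = t.clone()
    return t
```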
- update_lm_head_quantization(config, lm_head, inference_tensor_parallel=1)
Update the lm_head quantization config for TRT-LLM export (see the call sketch after this entry).
- Parameters:
config (ModelConfig) –
lm_head (QuantLinear) –
inference_tensor_parallel (int) –
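A minimal call sketch; the variable names are assumptions:
```python
# Hypothetical usage before TRT-LLM export: config is the ModelConfig being
# exported and lm_head is the model's QuantLinear output projection.
update_lm_head_quantization(
    config,
    lm_head,
    inference_tensor_parallel=2,
)
```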