Expert Parallelism in TensorRT-LLM
Mixture of Experts (MoE)
Mixture of Experts (MoE) architectures have been used widely recently, such as Mistral Mixtral 8x7B. Specifically, MOE’s structure supports multiple parallel Feedforward Neural Network (FFN) layers (called experts) to replace the single FFN layer in the dense model. When tokens arrive, the router layer selects the TopK experts for each token. The corresponding hidden state of the token is then dispatched to the selected TopK experts, respectively. As a result, there are multiple tokens’ hidden states that are dispatched to each expert.
the MOE structure in Switch Transformer: https://arxiv.org/pdf/2101.03961.pdf
Tensor Parallel vs Expert Parallel
Parallelism on multi-GPUs is necessary if the MoE model can not be accommodated by a single GPU’s memory. We have supported two kinds of parallel patterns for MoE structure, Tensor Parallel (default pattern), Expert Parallel, and a hybrid of the two.
Tensor Parallel evenly splits each expert’s weight and distributes them to different GPUs, which means each GPU holds partial weight of all experts, While Expert Parallel evenly distributes some of the experts’ full weight to different GPUs, which means each GPU holds part of the experts’ full weight. As a result, each GPU rank in the Tensor Parallel group receives all tokens’ hidden states for all experts, then computes using the partial weights, while for Expert Parallel, each GPU rank only receives part of tokens’ hidden states for experts on this rank, then computes using the full weights.
When both Tensor Parallel and Expert Parallel are enabled, each GPU handles a portion of the expert weights matrices (as in EP mode) and these weights are further sliced across multiple GPUs (as in TP mode). This hybrid approach aims to balance the workload more evenly across GPUs, enhancing efficiency and reducing the likelihood of bottlenecks associated with EP mode alone.
How to Enable
The default parallel pattern is Tensor Parallel. You can enable Expert Parallel or hybrid parallel by setting --moe_tp_size
and --moe_ep_size
when calling convert_coneckpoint.py
. If only --moe_tp_size
is provided, TRT-LLM will use Tensor Parallel for the MoE model; if only --moe_ep_size
is provided, TRT-LLM will use Expert Parallel; if both are provided, the hybrid parallel will be used.
Ensure the product of moe_tp_size
and moe_ep_size
is equal to tp_size
, since the total number of MoE parallelism across all GPUs must match the total number of parallelism in other parts of the model.
The other parameters related to the MoE structure, such as num_experts_per_tok
(TopK in previous context) and num_local_experts,
can be found in the model’s configuration file, such as the one for Mixtral 8x7B model.
)