Pruning
Tip
Check out the ResNet20 on CIFAR-10 Notebook and HF BERT Prune, Distill & Quantize for an end-to-end example of pruning.
ModelOpt provides three main pruning methods (aka modes) - Minitron, FastNAS and GradNAS - via a unified API, mtp.prune. Given a model, these methods find a subnet that meets the given deployment constraints (e.g. FLOPs, parameters) from your provided base model with little to no accuracy degradation (depending on how aggressive the pruning is). These pruning methods support pruning the convolutional and linear layers, and attention heads of the model. More details on these pruning modes are as follows:

fastnas: A pruning method recommended for Computer Vision models. Given a pretrained model, FastNAS finds the subnet which maximizes the score function while meeting the given constraints.

mcore_gpt_minitron: A pruning method developed by NVIDIA Research for pruning GPT-style models (e.g. Llama 3) in the NVIDIA NeMo or Megatron-LM framework that use Pipeline Parallelism. It uses the activation magnitudes to prune the MLP, attention heads, and GQA query groups. Check out more details of the algorithm in the paper.

gradnas: A lightweight pruning method recommended for language models like Hugging Face BERT and GPT-J. It uses the gradient information to prune the model's linear layers and attention heads to meet the given constraints.
Follow the steps described below to obtain the optimal model satisfying your requirements using mtp:

Training: Simply train your model using your existing training pipeline or load a pre-trained checkpoint for your model.

Pruning: Prune the model using the provided mtp.prune API and get an optimal subnet describing the pruned network architecture.

Fine-tuning: Fine-tune the resulting subnet to recover the accuracy.
To find out more about the concepts behind NAS and pruning, please refer to NAS concepts.
Training
To perform pruning, you can either use a model initialized from a pre-trained checkpoint or train the model from scratch. Simply initialize your model and load the checkpoint before you start using ModelOpt. You can use your existing training pipeline to train the model without further modifications.
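For example, the pre-pruning setup can be as simple as the sketch below; the torchvision ResNet-50 and the checkpoint path are placeholders for your own model and weights.

import torch
from torchvision.models import resnet50

# Build the model architecture and load your trained weights
# (replace the path with your own checkpoint file)
model = resnet50()
model.load_state_dict(torch.load("pretrained_weights.pth"))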
Pruning and subnet search
The next step in pruning is to perform a search over potential subnet architectures, i.e., prune the network, to find the best subnet satisfying your deployment constraints.
Prerequisites
To perform pruning (mtp.prune()) on a trained model, you need to set up data loaders and provide search constraints and a dummy_input (to measure your deployment constraints).

You can provide one search constraint for either flops or params by specifying an upper bound in terms of an absolute number (3e-6) or a percentage ("60%"), as sketched below.
You should also specify the pruning algorithm (mode) you would like to use. Depending on the mode, you will need to provide additional config parameters like score_func (fastnas mode) or loss_func (gradnas mode), data_loader, checkpoint, etc. The most common score function is the validation accuracy of the model and is used to rank the subnets sampled from the search space. The loss function is used to run some forward and backward passes on the train dataloader to get the gradients.

Please see the API reference of mtp.prune() for more details.
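As a short sketch, the two constraint formats look as follows; the absolute number here is illustrative only.

# Upper bound as a percentage of the original model
constraints = {"flops": "60%"}

# ... or as an absolute upper bound on the number of parameters (illustrative value)
constraints = {"params": 20e6}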
Below we show an example using "fastnas".
Perform pruning
import torch
import modelopt.torch.prune as mtp
from torchvision.models import resnet50

# User-defined model
model = resnet50()

# Load pretrained weights here
model.load_state_dict(pretrained_weights)

# Wrap your original validation function to only take the model as input.
# This function acts as the score function to rank models.
def score_func(model):
    return validate(model, val_loader, ...)

# Define a dummy input with similar shape as that of your input data
dummy_input = torch.randn(1, 3, 224, 224)

# Prune to a model with less than or equal to 60% of original FLOPs
prune_constraints = {"flops": "60%"}

# prune_res (dict) contains state_dict / stats of the pruner/searcher.
pruned_model, prune_res = mtp.prune(
    model=model,
    mode="fastnas",
    constraints=prune_constraints,
    dummy_input=dummy_input,
    config={
        "data_loader": train_loader,  # training data is used for calibrating BN layers
        "score_func": score_func,  # validation score is used to rank the subnets
        # checkpoint to store the search state and resume or re-run the search with a different constraint
        "checkpoint": "modelopt_fastnas_search_checkpoint.pth",
    },
)
Note that during pruning, we first convert the model into a search space containing different possible network configurations; an optimal configuration is then searched for within this space.
Tip
If the runtime of the score function is longer than a few minutes, consider subsampling the dataset used in the score function. A PyTorch dataset can be subsampled using torch.utils.data.Subset as follows:
subset_dataset = torch.utils.data.Subset(dataset, indices)
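For example, a random subset could be drawn as follows; the subset size of 1000 is an arbitrary assumption that you should tune for your dataset.

import torch

# Draw a fixed-size random subset of the validation dataset (size is illustrative)
indices = torch.randperm(len(dataset))[:1000].tolist()
subset_dataset = torch.utils.data.Subset(dataset, indices)
# ... then build the val_loader used by the score function from subset_dataset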
Note
Pruning will modify the model in-place.
Note
mtp.prune() supports distributed data parallelism via DistributedDataParallel in PyTorch.

Currently, the API does not support pruning PyTorch Fully Sharded Data Parallel (FSDP) models, so you would need to run pruning on a CPU and then fine-tune using FSDP. Note that GradNAS is much faster than FastNAS (and hence feasible on a CPU as well) and is recommended for language models like BERT, GPT-J 6B, and other LLMs.
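For a language model, a GradNAS call could look like the following sketch. This is not a verbatim recipe: the Hugging Face BERT model, train_loader, dummy_input, and the exact loss_func signature are assumptions here; check the mtp.prune() API reference and the HF BERT example for the exact contract.

import modelopt.torch.prune as mtp

# Assumed contract: GradNAS runs forward/backward passes on batches from
# data_loader and uses loss_func to obtain the loss to backpropagate.
def loss_func(output, batch):
    return output.loss

pruned_model, prune_res = mtp.prune(
    model=bert_model,  # e.g. a Hugging Face BERT model (assumption)
    mode="gradnas",
    constraints={"flops": "60%"},
    dummy_input=dummy_input,  # inputs matching your model's forward signature
    config={
        "data_loader": train_loader,  # training batches used to compute gradients
        "loss_func": loss_func,
        "checkpoint": "modelopt_gradnas_search_checkpoint.pth",
    },
)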
Storing the prune results
To store the pruned model for future use, you can use mto.save():
import modelopt.torch.opt as mto
mto.save(pruned_model, "modelopt_pruned_model.pth")
Note
Please see saving and restoring of ModelOpt-modified models to learn about all the available options for saving and restoring.
Customizing pruning config
In the above example, we have used the default mode config for mtp.prune(). You can see it using mtp.config.FastNASConfig().
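For instance, the default config can be printed for inspection (a minimal sketch):

import modelopt.torch.prune as mtp

# Print the default FastNAS search-space configuration
print(mtp.config.FastNASConfig())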
You can also specify custom mode configs to have a different search space. See the mtp.prune() documentation for more information. An example config is shown below:
import modelopt.torch.prune as mtp
# config to restrict the search space to have Conv2d out channels as a multiple of 64
ss_config = mtp.config.FastNASConfig()
ss_config["nn.Conv2d"]["*"]["channel_divisor"] = 64
# run pruning as shown above
mtp.prune(model, mode=[("fastnas", ss_config)], ...)
Profiling the search space and choosing constraints
The search space describes the candidates of potential pruned subnets. You can obtain information about the overall statistics of the search space from the mtp.prune() API.
The following info will be printed before the pruning process starts:
Profiling the following subnets from the given model: ('min', 'centroid', 'max').
--------------------------------------------------------------------------------
Profiling Results
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Constraint ┃ min ┃ centroid ┃ max ┃ max/min ratio ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ flops │ 274.34M │ 1.28G │ 4.59G │ 16.73 │
│ params │ 2.70M │ 9.75M │ 25.50M │ 9.43 │
└──────────────┴──────────────┴──────────────┴──────────────┴───────────────┘
Constraints Evaluation
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ ┃ ┃ Satisfiable ┃
┃ Constraint ┃ Upper Bound ┃ Upper Bound ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ flops │ 2.75G │ True │
└──────────────┴──────────────┴──────────────┘
Search Space Summary:
----------------------------------------------------------------------------------------------------
* conv1.out_channels [32, 64]
conv1.in_channels [3]
bn1.num_features [32, 64]
* layer1.0.conv1.out_channels [32, 64]
layer1.0.conv1.in_channels [32, 64]
layer1.0.bn1.num_features [32, 64]
* layer1.0.conv2.out_channels [32, 64]
...
...
...
* layer4.2.conv1.out_channels [32, 64, 96, 128, ..., 416, 448, 480, 512]
layer4.2.conv1.in_channels [2048]
layer4.2.bn1.num_features [32, 64, 96, 128, ..., 416, 448, 480, 512]
* layer4.2.conv2.out_channels [32, 64, 96, 128, ..., 416, 448, 480, 512]
layer4.2.conv2.in_channels [32, 64, 96, 128, ..., 416, 448, 480, 512]
layer4.2.bn2.num_features [32, 64, 96, 128, ..., 416, 448, 480, 512]
layer4.2.conv3.out_channels [2048]
layer4.2.conv3.in_channels [32, 64, 96, 128, ..., 416, 448, 480, 512]
----------------------------------------------------------------------------------------------------
Number of configurable hparams: 36
Total size of the search space: 2.48e+28
Note: all constraints can be satisfied within the search space!
The profiling results will help you understand the search space and come up with a potential search constraint that you can iterate on.
Tip
Generally, a search space with a max/min ratio above 3 is a good search space with many prunable layers. The higher the max/min ratio, the more prunable the model is (potentially making FastNAS slower but better).
A good starting point for your search constraints is the centroid of the search space. If you are using flops/params constraints, we highly recommend you first obtain a pruned model and measure its latency on your target deployment before you finetune the pruned model. Depending on the latency, you can adjust the constraints accordingly. Once you have a model that is within your latency constraints, you can fine-tune it to recover the accuracy. If you are unable to recover the accuracy (perhaps because of too aggressive pruning), you can try increasing the constraints and repeat the process.
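Continuing the "fastnas" example above, re-running the search with an adjusted constraint could look like the sketch below. The search state is stored in the checkpoint passed via config, so the same checkpoint path is reused; the 50% value is illustrative. Also note that pruning modifies the model in place, so start each run from a freshly built copy of the original model.

# Rebuild the original model since the previous run pruned it in place
model = resnet50()
model.load_state_dict(pretrained_weights)

pruned_model, prune_res = mtp.prune(
    model=model,
    mode="fastnas",
    constraints={"flops": "50%"},  # tighter constraint than the first run (illustrative value)
    dummy_input=dummy_input,
    config={
        "data_loader": train_loader,
        "score_func": score_func,
        # Same checkpoint as before so the stored search state can be reused
        "checkpoint": "modelopt_fastnas_search_checkpoint.pth",
    },
)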
Note
If the constraint cannot be satisfied within the search space, the pruning will be interrupted and an error will be raised.
Fine-tuning
The final step of architecture search is to fine-tune the pruned model on your dataset. This way, you can ensure that you obtain the best possible performance for your pruned model.
Prerequisites
To perform fine-tuning you need a pruned subnet as explained in the previous section.
You can reuse your existing training pipeline. We recommend running fine-tuning with your original training schedule:
1x training epochs (or 1x downstream task fine-tuning),
same or smaller (0.5x-1x) learning rate.
Load the pruned model
You can simply restore your pruned model (weights and architecture) using
mto.restore()
:
import modelopt.torch.opt as mto
from torchvision.models import resnet50
# Build original model
model = resnet50()
# Restore the pruned architecture and weights
pruned_model = mto.restore(model, "modelopt_pruned_model.pth")
Run fine-tuning
Now, please go ahead and fine-tune the pruned subnet using your standard training pipeline with the pre-configured hyperparameters. A good fine-tuning schedule is usually to repeat the pre-training schedule with 0.5x-1x the initial learning rate.
Do not forget to save the model using mto.save()
.
train(pruned_model)
mto.save(pruned_model, "modelopt_pruned_finetuned_model.pth")
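The train function above stands in for your own training loop. A minimal sketch of what it could look like for an image classifier is shown below; the optimizer, schedule, and 0.5x starting learning rate are assumptions following the recommendation above, not a prescribed recipe.

import torch

def train(model, epochs=90, lr=0.05):  # e.g. 0.5x of an original 0.1 starting LR (illustrative values)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        model.train()
        for images, labels in train_loader:  # your existing training dataloader
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()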
Deploy
The pruned and finetuned model is now ready for downstream tasks like deployment. The model you have in hand now should be the best neural network meeting your deployment-aware search constraint.
import modelopt.torch.opt as mto
from torchvision.models import resnet50
# Build original model
model = resnet50()
model = mto.restore(model, "modelopt_pruned_finetuned_model.pth")
# Continue with downstream tasks like deployment (e.g. TensorRT or TensorRT-LLM)
...
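For example, one common path is to export the restored model to ONNX and then build a TensorRT engine from it. The sketch below only shows the ONNX export step; the file name, input shape, and opset version are assumptions.

import torch

model.eval()
dummy_input = torch.randn(1, 3, 224, 224)
# Export to ONNX as an intermediate format for TensorRT engine building
torch.onnx.export(model, dummy_input, "pruned_finetuned_model.onnx", opset_version=17)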
Pruning Concepts
Pruning is the process of removing redundant components from a neural network for a given task. Conceptually, pruning is similar to NAS, but it has less computational overhead at the cost of potentially finding a less optimal architecture. Most APIs are based on the corresponding NAS APIs but are adapted to reflect the simpler workflow.
Specifically, for pruning we do not train the search space and all its subnets. Instead, a pre-trained checkpoint is used to approximate the search space. Therefore, we can skip the (potentially expensive) search space training step and directly search for a subnet architecture before fine-tuning the resulting subnet.
Note
If you want to learn more about the concept behind NAS and pruning, take a look at NAS Concepts including a more detailed comparison between NAS and pruning.