Quantized LLM Inference Model

Example Features

This example deploys a developer RAG pipeline for chat Q&A and serves inference with the NeMo Framework Inference container across multiple local GPUs, using a quantized version of the Llama 2 7B chat model.

This example uses a local host with an NVIDIA A100, H100, or L40S GPU.

Model: llama-2-7b-chat
Embedding: e5-large-v2
Framework: LlamaIndex
Description: QA chatbot
Multi-GPU: YES
TRT-LLM: YES
Model Location: Local Model
Triton: YES
Vector Database: Milvus

Prerequisites

  • Clone the Generative AI examples Git repository using Git LFS:

    $ sudo apt -y install git-lfs
    $ git clone git@github.com:NVIDIA/GenerativeAIExamples.git
    $ cd GenerativeAIExamples/
    $ git lfs pull
    
  • A host with one or more NVIDIA A100, H100, or L40S GPUs.

  • Verify that NVIDIA GPU driver version 535 or later is installed and check the compute mode of the GPUs:

    $ nvidia-smi -q -d compute
    

    Example Output

    ==============NVSMI LOG==============
    
    Timestamp                                 : Sun Nov 26 21:17:25 2023
    Driver Version                            : 535.129.03
    CUDA Version                              : 12.2
    
    Attached GPUs                             : 2
    GPU 00000000:CA:00.0
        Compute Mode                          : Default
    
    GPU 00000000:FA:00.0
        Compute Mode                          : Default
    

    If the driver is not installed or below version 535, refer to the NVIDIA Driver Installation Quickstart Guide.

  • Install Docker Engine and Docker Compose. Refer to the instructions for Ubuntu.

  • Install the NVIDIA Container Toolkit.

    1. Refer to the installation documentation.

    2. When you configure the runtime, set the NVIDIA runtime as the default:

      $ sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
      

      If you did not set the runtime as the default, you can reconfigure the runtime by running the preceding command.

    3. Verify the NVIDIA container toolkit is installed and configured as the default container runtime:

      $ cat /etc/docker/daemon.json
      

      Example Output

      {
          "default-runtime": "nvidia",
          "runtimes": {
              "nvidia": {
                  "args": [],
                  "path": "nvidia-container-runtime"
              }
          }
      }
      
    4. Run the nvidia-smi command in a container to verify the configuration:

      $ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi -L
      

      Example Output

      GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-d8ce95c1-12f7-3174-6395-e573163a2ace)
      GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-1d37ef30-0861-de64-a06d-73257e247a0d)
      
  • Optional: Enable NVIDIA Riva automatic speech recognition (ASR) and text-to-speech (TTS).

    • To launch a Riva server locally, refer to the Riva Quick Start Guide.

      • In the provided config.sh script, set service_enabled_asr=true and service_enabled_tts=true, and select the desired ASR and TTS languages by adding the appropriate language codes to asr_language_code and tts_language_code.
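
        For example, the relevant lines in config.sh might look like the following sketch; the en-US language codes are examples only, so substitute the codes for the languages that you selected:

        service_enabled_asr=true
        service_enabled_tts=true
        asr_language_code=("en-US")
        tts_language_code=("en-US")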

      • After the server is running, assign its IP address (or hostname) and port (50051 by default) to RIVA_API_URI in deploy/compose/compose.env.

    • Alternatively, you can use a hosted Riva API endpoint. You might need to obtain an API key and/or Function ID for access.

      In deploy/compose/compose.env, make the following assignments as necessary:

      export RIVA_API_URI="<riva-api-address/hostname>:<port>"
      export RIVA_API_KEY="<riva-api-key>"
      export RIVA_FUNCTION_ID="<riva-function-id>"
      

Download the Llama 2 Model and Weights

  1. Go to https://huggingface.co/models.

    • Locate the model to download, such as Llama 2 7B chat HF.

    • Follow the instructions to accept the license terms from Meta.

    • Log in or sign up for an account with Hugging Face.

  2. After you are granted access, clone the repository by clicking the vertical ellipsis button and selecting Clone repository.

    During the clone, you might be prompted for your username and password multiple times. Provide them each time until the clone completes.
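
    For reference, a clone over HTTPS looks similar to the following; the repository path assumes the Llama 2 7B chat HF model:

    $ git lfs install
    $ git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf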

Download TensorRT-LLM and Quantize the Model

The following steps summarize downloading the TensorRT-LLM repository, building a container image, and quantizing the model.

  1. Clone the NVIDIA TensorRT-LLM repository:

    $ git clone https://github.com/NVIDIA/TensorRT-LLM.git
    $ cd TensorRT-LLM
    $ git checkout release/0.5.0
    $ git submodule update --init --recursive
    $ git lfs install
    $ git lfs pull
    
  2. Build the TensorRT-LLM Docker image:

    $ make -C docker release_build
    

    Building the image can require more than 30 minutes and approximately 30 GB of disk space. The image is named tensorrt_llm/release:latest.
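
    When the build completes, you can confirm that the image is available:

    $ docker images tensorrt_llm/release:latest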

  3. Start the container. Ensure that the container has one volume mount to the model directory and one volume mount to the TensorRT-LLM repository:

    $ docker run --rm -it --gpus all --ipc=host \
        -v <path-to-llama-2-7b-chat-model>:/model-store \
        -v $(pwd):/repo -w /repo \
        --ulimit memlock=-1 --shm-size=20g \
        tensorrt_llm/release:latest bash
    
  4. Install NVIDIA AMMO Toolkit in the container:

    # Obtain the CUDA version from the system (assumes nvcc is on the PATH).
    $ cuda_version=$(nvcc --version | grep 'release' | awk '{print $6}' | awk -F'[V.]' '{print $2$3}')
    # Obtain the python version from the system.
    $ python_version=$(python3 --version 2>&1 | awk '{print $2}' | awk -F. '{print $1$2}')
    # Download and install the AMMO package from the DevZone.
    $ wget https://developer.nvidia.com/downloads/assets/cuda/files/nvidia-ammo/nvidia_ammo-0.3.0.tar.gz
    $ tar -xzf nvidia_ammo-0.3.0.tar.gz
    $ pip install nvidia_ammo-0.3.0/nvidia_ammo-0.3.0+cu$cuda_version-cp$python_version-cp$python_version-linux_x86_64.whl
    # Install the additional requirements
    $ pip install -r examples/quantization/requirements.txt
    
  5. Install version 0.25.0 of the accelerate Python package:

    $ pip install accelerate==0.25.0
    
  6. Run the quantization inside the container:

    $ python3 examples/llama/quantize.py --model_dir /model-store \
        --dtype float16 --qformat int4_awq \
        --export_path ./llama-2-7b-4bit-gs128-awq.pt --calib_size 32
    

    Quantization can require more than 15 minutes to complete. The sample command creates a llama-2-7b-4bit-gs128-awq.pt quantized checkpoint.
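
    Before you exit the container, you can confirm that the checkpoint file was created:

    $ ls -lh ./llama-2-7b-4bit-gs128-awq.pt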

  7. Copy the quantized checkpoint to the model directory:

    $ cp <quantized-checkpoint>.pt <model-dir>
    

The preceding steps summarize several documents from the NVIDIA TensorRT-LLM GitHub repository. Refer to the repository for more detail about the following topics:

  • For building the TensorRT-LLM image, refer to the installation.md file in the release/0.5.0 branch.

  • For installing the NVIDIA AMMO Toolkit, refer to the README file in the examples/quantization directory.

  • For running the quantize.py command, refer to AWQ in the examples/llama directory.

Build and Start the Containers

  1. In the Generative AI Examples repository, edit the deploy/compose/compose.env file.

    • Update the MODEL_DIRECTORY variable to identify the Llama 2 model directory that contains the quantized checkpoint.
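
      For example, with the placeholder path replaced by the directory that contains your model and quantized checkpoint:

      export MODEL_DIRECTORY="<path-to-llama-2-7b-chat-model>"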

    • Uncomment the QUANTIZATION variable:

      export QUANTIZATION="int4_awq"
      
  2. From the root of the repository, build the containers:

    $ docker compose --env-file deploy/compose/compose.env -f deploy/compose/rag-app-text-chatbot.yaml build
    
  3. Start the containers:

    $ docker compose --env-file deploy/compose/compose.env -f deploy/compose/rag-app-text-chatbot.yaml up -d
    

    NVIDIA Triton Inference Server can require 5 minutes to start. The -d flag starts the services in the background.
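
    While you wait, you can follow the logs of the llm-inference-server container to watch the model load:

    $ docker logs -f llm-inference-server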

    Example Output

    ✔ Network nvidia-rag              Created
    ✔ Container llm-inference-server  Started
    ✔ Container notebook-server       Started
    ✔ Container chain-server          Started
    ✔ Container rag-playground        Started
    
  4. Start the Milvus vector database:

    $ docker compose --env-file deploy/compose/compose.env -f deploy/compose/docker-compose-vectordb.yaml up -d milvus
    

    Example Output

    ✔ Container milvus-minio       Started
    ✔ Container milvus-etcd        Started
    ✔ Container milvus-standalone  Started
    
  5. Confirm the containers are running:

    $ docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"
    

    Example Output

    CONTAINER ID   NAMES                  STATUS
    256da0ecdb7b   rag-playground         Up 48 minutes
    2974aa4fb2ce   chain-server           Up 48 minutes
    4a8c4aebe4ad   notebook-server        Up 48 minutes
    5be2b57bb5c1   milvus-standalone      Up 48 minutes (healthy)
    ecf674c8139c   llm-inference-server   Up 48 minutes (healthy)
    a6609c22c171   milvus-minio           Up 48 minutes (healthy)
    b23c0858c4d4   milvus-etcd            Up 48 minutes (healthy)
    

Stopping the Containers

  • To uninstall, stop and remove the running containers from the root of the Generative AI Examples repository:

    $ docker compose --env-file deploy/compose/compose.env -f deploy/compose/rag-app-text-chatbot.yaml down
    $ docker compose -f deploy/compose/docker-compose-vectordb.yaml down
    

Next Steps