NeMo Framework Inference Server

About the Inference Server

The generative AI examples use the NeMo Framework Inference Server container. NeMo can create optimized LLMs using TensorRT-LLM and can deploy them with NVIDIA Triton Inference Server for high-performance, cost-effective, and low-latency inference. Many examples use Llama 2 models, and the LLM Inference Server container contains the modules and scripts required for TRT-LLM conversion of the Llama 2 models and for deployment with NVIDIA Triton Inference Server.

The inference server is used with examples that deploy a model on-premises. The examples that use NVIDIA AI foundation models or NVIDIA AI Endpoints do not use this component.

Running the Inference Server Individually

The following steps describe how to deploy a Llama 2 model.

  • Download the Llama 2 Chat model weights from Meta or HuggingFace (see the download sketch after these steps). You can check the support matrix for the GPU requirements of the deployment.

  • Update the deploy/compose/compose.env file, setting MODEL_DIRECTORY to the path of the downloaded Llama 2 model and adjusting other model parameters as needed (a sketch of the file follows these steps).

  • Build the LLM inference server container from source:

    $ source deploy/compose/compose.env
    $ docker compose -f deploy/compose/rag-app-text-chatbot.yaml build llm
    
  • Run the container. The container starts Triton Inference Server with the TRT-LLM-optimized Llama 2 model:

    $ source deploy/compose/compose.env
    $ docker compose -f deploy/compose/rag-app-text-chatbot.yaml up llm
    
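
The weights can be pulled from HuggingFace with the huggingface_hub CLI. The commands below are a minimal sketch: they assume an account with access to the gated Llama 2 repository, and the 13B chat variant and the target directory are only examples.

    # Example repository and target directory; substitute the model you downloaded.
    $ huggingface-cli login
    $ huggingface-cli download meta-llama/Llama-2-13b-chat-hf \
        --local-dir /home/user/models/Llama-2-13b-chat-hf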

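The deploy/compose/compose.env file is a shell file of exported variables that the compose commands source. The excerpt below is a rough sketch: MODEL_DIRECTORY is the variable named in the steps above, while the remaining variable names and all values are illustrative assumptions, so check your copy of the file for the exact entries.

    # deploy/compose/compose.env (excerpt) -- values are illustrative
    export MODEL_DIRECTORY="/home/user/models/Llama-2-13b-chat-hf"
    # Assumed names for additional parameters; verify them against the file.
    export MODEL_NAME="Llama-2-13b-chat-hf"
    export INFERENCE_GPU_COUNT="1"
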
After the optimized Llama 2 model is deployed in Triton Inference Server, clients can send HTTP/REST or gRPC requests directly to the server. A sample client implementation is available in the triton_trt_llm.py file of the GitHub repository, under integrations/langchain/llms/.
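
For a quick check without a client library, you can also call Triton's standard HTTP API directly. The sketch below assumes the default Triton HTTP port (8000) is published by the compose file and that the model is served through the usual TensorRT-LLM ensemble; the model name (ensemble) and the tensor names (text_input, max_tokens) are assumptions that depend on the model repository packaged in the container.

    # Readiness check (standard Triton endpoint).
    $ curl -s localhost:8000/v2/health/ready

    # Inference request in the KServe v2 JSON format; model and tensor names are assumptions.
    $ curl -s -X POST localhost:8000/v2/models/ensemble/infer \
        -H "Content-Type: application/json" \
        -d '{
              "inputs": [
                {"name": "text_input", "shape": [1, 1], "datatype": "BYTES", "data": ["What is Triton Inference Server?"]},
                {"name": "max_tokens", "shape": [1, 1], "datatype": "INT32", "data": [64]}
              ]
            }'

Triton's gRPC endpoint (port 8001 by default) accepts the same request through the Triton client libraries, provided the compose file publishes that port.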