NeMo Framework Inference Server
About the Inference Server
The generative AI examples use the NeMo Framework Inference Server container. NeMo Framework can build optimized LLM engines with TensorRT-LLM and deploy them with NVIDIA Triton Inference Server for high-performance, cost-effective, and low-latency inference. Many examples use Llama 2 models, and the LLM Inference Server container includes the modules and scripts required to convert the Llama 2 models to TRT-LLM engines and deploy them with NVIDIA Triton Inference Server.
The inference server is used with examples that deploy a model on-premises. The examples that use NVIDIA AI foundation models or NVIDIA AI Endpoints do not use this component.
Running the Inference Server Individually
The following steps describe how to deploy a Llama 2 model.
Download the Llama 2 Chat model weights from Meta or Hugging Face. Check the support matrix for the GPU requirements of the deployment.
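For example, the weights can be fetched from Hugging Face with the huggingface_hub Python package. This is a minimal sketch, not part of the documented workflow: the repository ID, target directory, and token below are illustrative placeholders, and the gated meta-llama repositories require an approved Hugging Face access token.

# Minimal sketch: download Llama 2 Chat weights from Hugging Face.
# The repo_id, local_dir, and token values are placeholders (assumptions).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-2-13b-chat-hf",  # gated repository; requires approved access
    local_dir="/path/to/llama2-13b-chat",      # point MODEL_DIRECTORY at this path later
    token="hf_...",                            # your Hugging Face access token
)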
Update the deploy/compose/compose.env file with MODEL_DIRECTORY as the downloaded Llama 2 model path and other model parameters as needed.
Build the LLM inference server container from source:
$ source deploy/compose/compose.env
$ docker compose -f deploy/compose/rag-app-text-chatbot.yaml build llm
Run the container. The container starts Triton Inference Server with the TRT-LLM-optimized Llama 2 model:
$ source deploy/compose/compose.env
$ docker compose -f deploy/compose/rag-app-text-chatbot.yaml up llm
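Before sending inference requests, you can confirm that Triton is serving. The sketch below uses the tritonclient Python package and assumes the default HTTP port 8000 is published by the container and that the model repository exposes a model named ensemble; both are assumptions that depend on your deployment.

# Minimal sketch: check that Triton Inference Server is live and the model is ready.
# Port 8000 and the model name "ensemble" are assumptions; adjust to your deployment.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("server live:", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready:", client.is_model_ready("ensemble"))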
After the optimized Llama 2 model is deployed in Triton Inference Server, clients can send HTTP/REST or gRPC requests directly to the server.
A sample client implementation is available in the triton_trt_llm.py file of the GitHub repository, under integrations/langchain/llms/.
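As an additional illustration, the sketch below sends a text generation request to Triton's HTTP generate endpoint with the requests library. The model name ensemble and the text_input / max_tokens / text_output field names follow a common TensorRT-LLM ensemble configuration, but they are assumptions and may differ in your model repository.

# Minimal sketch: send a generation request to Triton's HTTP generate endpoint.
# The model name "ensemble" and the field names below are assumptions that depend
# on how the TRT-LLM model repository is configured.
import requests

response = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json={"text_input": "What is TensorRT-LLM?", "max_tokens": 64},
    timeout=60,
)
response.raise_for_status()
print(response.json()["text_output"])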