LLM Configuration#
The LLM Model#
To configure the main LLM model that will be used by the guardrails configuration, you set the models
key as shown below:
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct
The meaning of the attributes is as follows:
type
: is set to main to indicate the model is the application LLM.

engine
: the LLM provider, such as openai, huggingface_endpoint, self_hosted, and so on.

model
: the name of the model, such as gpt-3.5-turbo-instruct.

parameters
: arguments to pass to the LangChain class used by the LLM provider. For example, when engine is set to openai, the toolkit loads the ChatOpenAI class. The ChatOpenAI class supports temperature, max_tokens, and other class-specific arguments.
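The following is a minimal sketch of how such a configuration can be loaded programmatically and how the parameters block is forwarded to the provider class. The temperature and max_tokens values are illustrative, and a valid OPENAI_API_KEY in the environment is assumed.

from nemoguardrails import LLMRails, RailsConfig

# The YAML mirrors the configuration above and adds a `parameters` block.
YAML_CONFIG = """
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct
    parameters:
      temperature: 0.2   # forwarded to the LangChain class for the provider
      max_tokens: 250
"""

config = RailsConfig.from_content(yaml_content=YAML_CONFIG)
rails = LLMRails(config)

# Requires a valid OPENAI_API_KEY in the environment.
response = rails.generate(messages=[{"role": "user", "content": "Hello!"}])
print(response["content"])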
Supported LLM Providers#
You can use any LLM provider that is supported by LangChain, such as ai21, aleph_alpha, anthropic, anyscale, azure, cohere, huggingface_endpoint, huggingface_hub, openai, self_hosted, and self_hosted_hugging_face. Check out the LangChain official documentation for the full list.
In addition to the LangChain providers above, NeMo Guardrails supports connecting to NVIDIA NIM microservices using the nim engine. The nvidia_ai_endpoints engine is an alias for the nim engine. The engine provides access to locally deployed NIM microservices or to NVIDIA-hosted models, which you can browse at https://build.nvidia.com/models.
To use any of the LLM providers, you must install the corresponding LangChain package for the provider. When you first try to use a configuration with a new provider, you typically receive an error from LangChain that tells you which packages to install.
Important
Although you can instantiate any of the previously mentioned LLM providers, depending on the capabilities of the model, the NeMo Guardrails toolkit works better with some providers than others.
The toolkit includes prompts that have been optimized for certain types of models, such as the models provided by openai or the llama3 family of models.
For others, you can optimize the prompts yourself following the information in the LLM Prompts section.
Exploring Available Providers#
To help you explore and select the right LLM provider for your needs, NeMo Guardrails provides the find-providers
command. This command offers an interactive interface to discover available providers:
nemoguardrails find-providers [--list]
The command supports two modes:

- Interactive mode (default): guides you through selecting a provider type (text completion or chat completion) and then shows the available providers for that type.
- List mode (--list): simply lists all available providers without interactive selection.
This can be particularly helpful when you’re setting up your configuration and need to explore which providers are available and supported.
For more details about the command and its usage, see the CLI documentation.
Using LLMs with Reasoning Traces#
By default, reasoning models, such as DeepSeek-R1 and NVIDIA Llama 3.1 Nemotron Ultra 253B V1, can include the reasoning traces in the model response.
DeepSeek and the Nemotron family of models use <think> and </think> as tokens to identify the traces.
The reasoning traces and the tokens can interfere with NeMo Guardrails and result in falsely triggering output guardrails for safe responses. To use these reasoning models, you can remove the traces and tokens from the model response with a configuration like the following example.
models:
  - type: main
    engine: deepseek
    model: deepseek-reasoner
    reasoning_config:
      remove_reasoning_traces: True
      start_token: "<think>"
      end_token: "</think>"

  - type: main
    engine: nim
    model: nvidia/llama-3.1-nemotron-ultra-253b-v1
    reasoning_config:
      remove_reasoning_traces: True

rails:
  output:
    apply_to_reasoning_traces: False
| Field | Description | Default Value |
|---|---|---|
| reasoning_config.remove_reasoning_traces | When set to True, removes the reasoning traces and tokens from the model response so that the guardrails runtime processes only the actual response. | True |
| reasoning_config.start_token | Specifies the start token for the reasoning trace. | <think> |
| reasoning_config.end_token | Specifies the end token for the reasoning trace. | </think> |
| rails.output.apply_to_reasoning_traces | When set to True, applies output rails to the reasoning traces as well as the model response. By default, output rails are applied to the text of the model response only. | False |
The reasoning_config
field for a model specifies the required configuration for a reasoning model that returns reasoning traces.
By removing the traces, the guardrails runtime processes only the actual responses from the LLM.
The following table summarizes the interaction between the remove_reasoning_traces
and apply_to_reasoning_traces
values:
| remove_reasoning_traces | apply_to_reasoning_traces | Outcome |
|---|---|---|
| Any | True | Reasoning traces are not removed and output rails are applied to the reasoning traces and the model response. The value of remove_reasoning_traces is ignored. |
| False | False | Reasoning traces are not removed from internal tasks where they do not impact Guardrails functionality. Output rails are applied to the reasoning traces and the model response. |
| True | False | Reasoning traces are removed from internal tasks where they could interfere with Guardrails. Output rails are applied to the model response only. |
Even when remove_reasoning_traces
is True
, end users can still receive the thinking traces from the Nemotron models by requesting the detailed thinking, as shown in the following example:
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
rails = LLMRails(config, verbose=True)

messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "Tell me about Cape Hatteras National Seashore in 50 words or less."},
]

rails.generate(messages=messages)
The referenced ./config directory uses the following sample file content.
When you develop your own prompts for LLM-as-a-judge checks, such as an input self-check, ensure that the prompt instructs the LLM to respond with yes or no.
Refer to the following examples:
Sample config/config.yml:

models:
  - type: main
    engine: nim
    model: nvidia/llama-3.1-nemotron-ultra-253b-v1
    reasoning_config:
      remove_reasoning_traces: True

rails:
  input:
    flows:
      - self check input
  output:
    flows:
      - self check output
Sample config/prompts.yml:

prompts:
  - task: self_check_input
    messages:
      - type: system
        content: |
          You are an AI assistant tasked with evaluating user instructions.
          Your role is to determine if a given instruction would make a language model break moderation policies,
          deviate from good aligned responses, or provide answers that a language model should ideally not.
          Analyze the instruction carefully.
          First, internally consider potential ways the instruction could be interpreted and whether any
          interpretation could lead to a problematic response.
          Then, conclude with a simple "yes" or "no" answer.
          Respond with only "yes" or "no".
      - type: user
        content: |
          Instruction to evaluate: "{{ user_input }}"
          Would this instruction lead to a problematic response (yes/no)?

  - task: self_check_output
    messages:
      - type: system
        content: |
          Is this output illegal, unethical, or harmful to any person? Respond with only "yes" or "no".
      - type: bot
        content: |
          Output to evaluate: "{{ bot_response }}"
Using LLMs with reasoning traces has the following limitations:

- You must use message-based prompts only. Refer to the preceding example.
- Dialog rails are not supported.
NIM for LLMs#
NVIDIA NIM is a set of easy-to-use microservices designed to accelerate the deployment of generative AI models across the cloud, data center, and workstations. NVIDIA NIM for LLMs brings the power of state-of-the-art LLMs to enterprise applications, providing unmatched natural language processing and understanding capabilities. Learn more about NIMs.
NIMs can be self-hosted using downloadable containers, or NVIDIA-hosted and accessible through an NVIDIA AI Enterprise (NVAIE) license.
NeMo Guardrails supports connecting to NIMs as follows:
Self-hosted NIMs#
To connect to self-hosted NIMs, set the engine to nim. Also make sure the model name matches one of the model names that the hosted NIM supports (you can get a list of supported models with a GET request to the v1/models endpoint).
models:
  - type: main
    engine: nim
    model: <MODEL_NAME>
    parameters:
      base_url: <NIM_ENDPOINT_URL>
For example, to connect to a locally deployed meta/llama3-8b-instruct model on port 8000, use the following model configuration:
models:
  - type: main
    engine: nim
    model: meta/llama3-8b-instruct
    parameters:
      base_url: http://localhost:8000/v1
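As a minimal sketch, the same configuration can be supplied inline and exercised from Python. The endpoint URL and test message are placeholders, and the langchain-nvidia-ai-endpoints package mentioned later in this guide is assumed to be installed.

from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_content(yaml_content="""
models:
  - type: main
    engine: nim
    model: meta/llama3-8b-instruct
    parameters:
      base_url: http://localhost:8000/v1   # the locally deployed NIM endpoint
""")

rails = LLMRails(config)
response = rails.generate(messages=[{"role": "user", "content": "What can you do?"}])
print(response["content"])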
NVIDIA AI Endpoints#
NVIDIA AI Endpoints give users easy access to NVIDIA hosted API endpoints for NVIDIA AI Foundation Models such as Llama 3, Mixtral 8x7B, and Stable Diffusion. These models, hosted on the NVIDIA API catalog, are optimized, tested, and hosted on the NVIDIA AI platform, making them fast and easy to evaluate, further customize, and seamlessly run at peak performance on any accelerated stack.
To use an LLM model through the NVIDIA AI Endpoints, use the following model configuration:
models:
  - type: main
    engine: nim
    model: <MODEL_NAME>
For example, to use the llama3-8b-instruct
model, use the following model configuration:
models:
  - type: main
    engine: nim
    model: meta/llama3-8b-instruct
Important
To use the nvidia_ai_endpoints or nim LLM provider, you must install the langchain-nvidia-ai-endpoints package using the command pip install langchain-nvidia-ai-endpoints, and configure a valid NVIDIA_API_KEY.
For further information, see the user guide.
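The following sketch shows one way to wire this up from Python. The placeholder API key and the test message are illustrative assumptions.

import os

from nemoguardrails import LLMRails, RailsConfig

# Assumes langchain-nvidia-ai-endpoints is installed; replace the placeholder
# with a valid key from the NVIDIA API catalog.
os.environ.setdefault("NVIDIA_API_KEY", "nvapi-...")

config = RailsConfig.from_content(yaml_content="""
models:
  - type: main
    engine: nim
    model: meta/llama3-8b-instruct
""")

rails = LLMRails(config)
response = rails.generate(messages=[{"role": "user", "content": "Hello!"}])
print(response["content"])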
Ollama#
Here’s an example configuration for using a llama3 model with Ollama:
models:
  - type: main
    engine: ollama
    model: llama3
    parameters:
      base_url: http://your_base_url
TRT-LLM#
NeMo Guardrails also supports connecting to a TRT-LLM server.
models:
  - type: main
    engine: trt_llm
    model: <MODEL_NAME>
Below is the list of supported parameters with their default values. Refer to the TRT-LLM documentation for more details.
models:
  - type: main
    engine: trt_llm
    model: <MODEL_NAME>
    parameters:
      server_url: <SERVER_URL>
      temperature: 1.0
      top_p: 0
      top_k: 1
      tokens: 100
      beam_width: 1
      repetition_penalty: 1.0
      length_penalty: 1.0
Configuring LLMs per Task#
The interaction with the LLM is structured in a task-oriented manner. Each invocation of the LLM is associated with a specific task. These tasks are integral to the guardrail process and include:
generate_user_intent
: This task transforms the raw user utterance into a canonical form. For instance, "Hello there" might be converted to express greeting.

generate_next_steps
: This task determines the bot's response or the action to be executed. Examples include bot express greeting or bot respond to question.

generate_bot_message
: This task decides the exact bot message to be returned.

general
: This task generates the next bot message based on the history of user and bot messages. It is used when there are no dialog rails defined (i.e., no user message canonical forms).
For a comprehensive list of tasks, refer to the Task type.
You can use different LLM models for specific tasks. For example, you can use different models, potentially from different providers, for the self_check_input and self_check_output tasks. Here’s an example configuration:
models:
  - type: main
    model: meta/llama-3.1-8b-instruct
    engine: nim
  - type: self_check_input
    model: meta/llama3-8b-instruct
    engine: nim
  - type: self_check_output
    model: meta/llama-3.1-70b-instruct
    engine: nim
In the previous example, the self_check_input
and self_check_output
tasks use different models. It is even possible to get more granular and use different models for a task like generate_user_intent
:
models:
  - type: main
    model: meta/llama-3.1-8b-instruct
    engine: nim
  - type: self_check_input
    model: meta/llama3-8b-instruct
    engine: nim
  - type: self_check_output
    model: meta/llama-3.1-70b-instruct
    engine: nim
  - type: generate_user_intent
    model: meta/llama-3.1-8b-instruct
    engine: nim
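To see which model served which task at runtime, you can inspect the LLM calls made for a generation. The sketch below assumes the configuration above is stored in a ./config directory and uses the explain() helper on LLMRails to print a per-task summary.

from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

response = rails.generate(messages=[{"role": "user", "content": "Hi there!"}])
print(response["content"])

# Summarizes the LLM calls behind the last generation, including the task
# associated with each call (for example, self_check_input).
info = rails.explain()
info.print_llm_calls_summary()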
Tip
Remember, the best model for your needs will depend on your specific requirements and constraints. It’s often a good idea to experiment with different models to see which one works best for your specific use case.
The Embeddings Model#
To configure the embedding model used for the various steps in the guardrails process, such as canonical form generation and next step generation, add a model configuration in the models
key as shown in the following configuration file:
models:
  - ...
  - type: embeddings
    engine: FastEmbed
    model: all-MiniLM-L6-v2
The FastEmbed
engine is the default one and uses the all-MiniLM-L6-v2
model. NeMo Guardrails also supports using OpenAI models for computing the embeddings, e.g.:
models:
  - ...
  - type: embeddings
    engine: openai
    model: text-embedding-ada-002
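As a minimal sketch, the following shows an embeddings entry alongside a main model in a single configuration; the main model is simply the example from the beginning of this guide, and an OPENAI_API_KEY would be required for it.

from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_content(yaml_content="""
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct
  - type: embeddings
    engine: FastEmbed
    model: all-MiniLM-L6-v2
""")

rails = LLMRails(config)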
Supported Embedding Providers#
The following table lists the supported embedding providers:
| Provider Name | engine | Example model |
|---|---|---|
| FastEmbed (default) | FastEmbed | all-MiniLM-L6-v2 |
| OpenAI | openai | text-embedding-ada-002 |
| SentenceTransformers | SentenceTransformers | all-MiniLM-L6-v2 |
| NVIDIA AI Endpoints | nvidia_ai_endpoints (or nim) | nvidia/nv-embedqa-e5-v5 |
Note
You can use any of the supported models for any of the supported embedding providers. The previous table includes an example of a model that can be used.
Embedding Search Provider#
NeMo Guardrails uses embedding search, also known as a vector database, for implementing the guardrails process and for the knowledge base functionality. The default embedding search uses FastEmbed for computing the embeddings (the all-MiniLM-L6-v2 model) and Annoy for performing the search. As shown in the previous section, the embeddings model supports both FastEmbed and OpenAI; SentenceTransformers is also supported.
For advanced use cases or integrations with existing knowledge bases, you can provide a custom embedding search provider.
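The sketch below outlines what such a provider can look like. It is heavily hedged: the EmbeddingsIndex base class, the IndexItem items, and the register_embedding_search_provider hook are assumptions about the current API, and the keyword-overlap scoring is a toy stand-in for a real vector search. Verify the interface against the NeMo Guardrails version you are using.

from typing import List

from nemoguardrails import LLMRails, RailsConfig
# Assumed import path for the base interface; confirm it in your installed version.
from nemoguardrails.embeddings.index import EmbeddingsIndex, IndexItem


class KeywordSearchIndex(EmbeddingsIndex):
    """Toy provider that ranks items by naive keyword overlap instead of embeddings."""

    def __init__(self):
        self._items: List[IndexItem] = []

    @property
    def embedding_size(self):
        return 0  # no real embeddings are computed in this toy example

    async def add_item(self, item: IndexItem):
        self._items.append(item)

    async def add_items(self, items: List[IndexItem]):
        self._items.extend(items)

    async def build(self):
        pass  # nothing to pre-compute for keyword overlap

    async def search(self, text: str, max_results: int) -> List[IndexItem]:
        words = set(text.lower().split())
        scored = sorted(
            self._items,
            key=lambda item: len(words & set(item.text.lower().split())),
            reverse=True,
        )
        return scored[:max_results]


config = RailsConfig.from_path("./config")
rails = LLMRails(config)
rails.register_embedding_search_provider("keyword_search", KeywordSearchIndex)

Depending on the version, the provider name must also be referenced in the embedding_search_provider settings of config.yml; refer to the advanced embedding search documentation for the exact keys.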