Guardrails Configuration#
Guardrails (or rails) implement flows based on their role. Rails fall into five main categories:
- Input rails: Trigger when the system receives new user input.
- Output rails: Trigger when the system generates new output for the user.
- Dialog rails: Trigger after the system interprets a user message and identifies its canonical form.
- Retrieval rails: Trigger after the system completes the retrieval step (when the `retrieve_relevant_chunks` action finishes).
- Execution rails: Trigger before and after the system invokes an action.
You can configure the active rails using the `rails` key in `config.yml`, as shown in the following example:
```yaml
rails:
  # Input rails trigger when the system receives a new user message.
  input:
    flows:
      - check jailbreak
      - check input sensitive data
      - check toxicity
      - ... # Other input rails

  # Output rails trigger after the system generates a bot message.
  output:
    flows:
      - self check facts
      - self check hallucination
      - check output sensitive data
      - ... # Other output rails

  # Retrieval rails trigger when the system computes `$relevant_chunks`.
  retrieval:
    flows:
      - check retrieval sensitive data
```
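With the rails configured, loading the configuration and generating a response from Python looks roughly like the following minimal sketch (the `./config` path is an assumption about where your `config.yml` lives):

```python
from nemoguardrails import LLMRails, RailsConfig

# Load the configuration directory that contains config.yml.
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

# The active input, output, and retrieval rails run as part of generate().
response = rails.generate(messages=[{"role": "user", "content": "Hello!"}])
print(response["content"])
```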
Flows that are not input, output, or retrieval rails are treated as dialog rails and execution rails: they control the dialog flow and the timing of action invocation. Dialog and execution rail flows don't require explicit enumeration in the config. Several configuration options control their behavior:
```yaml
rails:
  # Dialog rails trigger after the system interprets a user message and computes its canonical form.
  dialog:
    # Whether to use a single LLM call for generating the user intent, next step, and bot message.
    single_call:
      enabled: False

      # Whether to fall back to multiple LLM calls if a single call fails.
      fallback_to_multiple_calls: True

    user_messages:
      # Whether to use only embeddings when interpreting user messages.
      embeddings_only: False
```
Input Rails#
Input rails process user messages. For example:

```colang
define flow self check input
  $allowed = execute self_check_input

  if not $allowed
    bot refuse to respond
    stop
```
Input rails can alter the input by modifying the `$user_message` context variable.
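As an illustration, the following sketch defines a hypothetical custom action, `mask_input_emails` (the name and regex are illustrative, not a built-in rail), that masks email addresses and writes the result back to `$user_message` through `ActionResult` context updates:

```python
import re

from nemoguardrails.actions import action
from nemoguardrails.actions.actions import ActionResult

@action(name="mask_input_emails")
async def mask_input_emails(context: dict):
    """Replace email addresses in the user message with a placeholder."""
    user_message = context.get("user_message", "")
    masked = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", user_message)
    # Returning context_updates overwrites $user_message for later processing steps.
    return ActionResult(return_value=True, context_updates={"user_message": masked})
```

After registering the action with `rails.register_action(mask_input_emails)`, an input rail flow can invoke it with `execute mask_input_emails`.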
Output Rails#
Output rails process bot messages. The `$bot_message` context variable contains the message to process. Output rails can modify the `$bot_message` variable, for example, to mask sensitive information.

To temporarily deactivate output rails for the next bot message, set the `$skip_output_rails` context variable to `True`.
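For example, assuming the message-based API shown earlier, a message with the `context` role can set the variable for a single generation; a sketch:

```python
response = rails.generate(messages=[
    # A "context" role message sets context variables for this call only.
    {"role": "context", "content": {"skip_output_rails": True}},
    {"role": "user", "content": "Summarize the conversation so far."},
])
```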
Streaming Output Configuration#
By default, output rails run on the complete bot response before it is returned. Enable streaming to receive responses sooner.

Set the top-level `streaming: True` field in your `config.yml` file. For the output rails, add the `streaming` field and its configuration parameters:
```yaml
rails:
  output:
    flows:
      - rail name
    streaming:
      enabled: True
      chunk_size: 200
      context_size: 50
      stream_first: True

streaming: True
```
When streaming is enabled, the toolkit applies output rails to token chunks. If a rail blocks a token chunk, the toolkit returns a JSON error object in the following format:
```json
{
  "error": {
    "message": "Blocked by <rail-name> rails.",
    "type": "guardrails_violation",
    "param": "<rail-name>",
    "code": "content_blocked"
  }
}
```
When integrating with the OpenAI Python client, server code catches this JSON error and converts it to an API error following the OpenAI SSE format.
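Outside of the OpenAI client integration, you can detect the error object at the application level. The following sketch, reusing the `rails` instance from the earlier example, inspects each streamed chunk:

```python
import asyncio
import json

async def consume_stream():
    messages = [{"role": "user", "content": "Tell me about your policies."}]
    async for chunk in rails.stream_async(messages=messages):
        try:
            payload = json.loads(chunk)
        except json.JSONDecodeError:
            print(chunk, end="", flush=True)  # Normal token chunk.
            continue
        if isinstance(payload, dict) and "error" in payload:
            # A blocked chunk arrives as the JSON error object shown above.
            print(f"\nBlocked: {payload['error']['message']}")
            break

asyncio.run(consume_stream())
```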
The following table describes the subfields for the `streaming` field:

| Field | Description | Default Value |
|---|---|---|
| `streaming.chunk_size` | Specifies the number of tokens per chunk. The toolkit applies output guardrails to each token chunk. Larger values provide more meaningful information for rail assessment, but add latency while accumulating tokens for a full chunk; higher latency risk occurs when you specify larger values. | `200` |
| `streaming.context_size` | Specifies the number of tokens to keep from the previous chunk for context and processing continuity. Larger values provide continuity across chunks with minimal latency impact; small values might fail to detect cross-chunk violations. Specifying approximately 25% of `streaming.chunk_size` provides a reasonable balance. | `50` |
| `streaming.enabled` | When set to `True`, the toolkit applies output rails to streamed token chunks. | `False` |
| `streaming.stream_first` | When set to `False`, the toolkit applies output rails to chunks before streaming them to the client. By default, the toolkit streams chunks as soon as possible and before applying output rails to them. | `True` |
The following table shows how token count, chunk size, and context size interact to determine the number of rails invocations.
| Input Length (tokens) | Chunk Size | Context Size | Rails Invocations |
|---|---|---|---|
| 512 | 256 | 64 | 3 |
| 600 | 256 | 64 | 3 |
| 256 | 256 | 64 | 1 |
| 1024 | 256 | 64 | 5 |
| 1024 | 256 | 32 | 5 |
| 1024 | 128 | 32 | 11 |
| 512 | 128 | 32 | 5 |
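These counts are consistent with a sliding-window calculation: the first invocation consumes `chunk_size` tokens, and each later invocation advances by `chunk_size - context_size` new tokens. A small illustrative sketch (my reconstruction from the table, not toolkit code):

```python
import math

def rails_invocations(input_length: int, chunk_size: int, context_size: int) -> int:
    """Estimate how many times the output rails run for a streamed response."""
    if input_length <= chunk_size:
        return 1
    new_tokens_per_chunk = chunk_size - context_size
    return 1 + math.ceil((input_length - chunk_size) / new_tokens_per_chunk)

# Matches the table rows, for example:
assert rails_invocations(512, 256, 64) == 3
assert rails_invocations(1024, 128, 32) == 11
```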
Refer to ../getting-started/5-output-rails/README.md#streaming-output for a code sample.
Parallel Execution of Input and Output Rails#
You can configure input and output rails to run in parallel. This can improve latency and throughput.
When to Use Parallel Rails Execution#
- Use parallel execution for I/O-bound rails, such as external API calls to LLMs or third-party integrations.
- Enable parallel execution if you have two or more independent input or output rails without shared state dependencies.
- Use parallel execution in production environments where response latency affects user experience and business metrics.
When Not to Use Parallel Rails Execution#
- Avoid parallel execution for CPU-bound rails; it might not improve performance and can introduce overhead.
- Use sequential mode during development and testing for easier debugging and simpler workflows.
Configuration Example#
To enable parallel execution, set `parallel: True` in the `rails.input` and `rails.output` sections of the `config.yml` file. The following configuration example is tested by NVIDIA and shows how to enable parallel execution for input and output rails.
Note
Input rail mutations can lead to erroneous results during parallel execution because of race conditions arising from the execution order and timing of parallel operations. This can result in output divergence compared to sequential execution. For such cases, use sequential mode.
The following is an example configuration for parallel rails using models from NVIDIA Cloud Functions (NVCF). When you use NVCF models, make sure that you export `NVIDIA_API_KEY` to access those models.
```yaml
models:
  - type: main
    engine: nim
    model: meta/llama-3.1-70b-instruct
  - type: content_safety
    engine: nim
    model: nvidia/llama-3.1-nemoguard-8b-content-safety
  - type: topic_control
    engine: nim
    model: nvidia/llama-3.1-nemoguard-8b-topic-control

rails:
  input:
    parallel: True
    flows:
      - content safety check input $model=content_safety
      - topic safety check input $model=topic_control
  output:
    parallel: True
    flows:
      - content safety check output $model=content_safety
      - self check output
```
Retrieval Rails#
Retrieval rails process the retrieved chunks stored in the `$relevant_chunks` context variable.
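If your application performs retrieval itself, one pattern is to pass the pre-computed chunks through a `context` role message so that retrieval rails still apply; a sketch with illustrative content:

```python
response = rails.generate(messages=[
    # Pre-computed chunks populate $relevant_chunks; retrieval rails process them.
    {
        "role": "context",
        "content": {"relevant_chunks": "Employees receive 20 vacation days per year."},
    },
    {"role": "user", "content": "How many vacation days do I get?"},
])
```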
Dialog Rails#
Dialog rails enforce predefined conversational paths. Define canonical forms for various user messages to trigger dialog flows. See the Hello World bot for a basic example. The ABC bot demonstrates dialog rails preventing the bot from discussing specific topics.
Dialog rails require a three-step process:

1. Generate the canonical form of the user message.
2. Decide the next step(s) and execute them.
3. Generate the bot utterance(s).

See The Guardrails Process for a detailed description. Each step may require an LLM call.
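To see which steps actually required an LLM call, you can use the `explain()` helper after a generation; a sketch reusing the `rails` instance from above:

```python
response = rails.generate(messages=[{"role": "user", "content": "Hello!"}])

# Summarize the LLM calls made during the last generation.
info = rails.explain()
print(f"{len(info.llm_calls)} LLM call(s)")
info.print_llm_calls_summary()
```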
Single Call Mode#
NeMo Guardrails supports "single call" mode since version `0.6.0`. This mode performs all three steps using a single LLM call. Set the `single_call.enabled` flag to `True` to enable it:
```yaml
rails:
  dialog:
    # Whether to try to use a single LLM call for generating the user intent, next step, and bot message.
    single_call:
      enabled: True

      # If a single call fails, whether to fall back to multiple LLM calls.
      fallback_to_multiple_calls: True
```
In typical RAG (Retrieval-Augmented Generation) scenarios, this option reduces latency and uses fewer tokens.
Important
Currently, single call mode can only predict bot messages as the next step. The LLM cannot generalize and execute actions for dynamically generated user canonical form messages.
Embeddings Only#
Use embeddings of the pre-defined user messages to determine the canonical form for user input. This speeds up dialog rails. Set the `embeddings_only` flag to enable this option:
```yaml
rails:
  dialog:
    user_messages:
      # Whether to use only embeddings when interpreting user messages.
      embeddings_only: True
      # Use only the embeddings when the similarity exceeds the specified threshold.
      embeddings_only_similarity_threshold: 0.75
      # When the fallback intent is None and the similarity is below the threshold,
      # the user intent is computed normally using the LLM.
      # When set to a string value, that string is used as the intent.
      embeddings_only_fallback_intent: None
```
Important
Use this option only when you provide sufficient examples. With the threshold set to 0.75, an LLM call is triggered to compute the user intent whenever the similarity falls below this value. If you encounter false positives, increase the threshold to 0.8. Threshold values are model dependent.