Retrieval
This guide explains how to query documents in the Context-Aware RAG system.
Making Queries
Queries can be made to the system using the /chat/completions endpoint of the Retrieval Service.
Request Format
{
  "model": "meta/llama-3.1-70b-instruct",
  "base_url": "https://integrate.api.nvidia.com/v1",
  "messages": [{"role": "user", "content": "Your question here"}],
  "uuid": "unique-request-id"
}
Example Query
import requests
url = "http://localhost:8000/chat/completions"
chat_data = {
    "model": "meta/llama-3.1-70b-instruct",
    "base_url": "https://integrate.api.nvidia.com/v1",
    "messages": [{"role": "user", "content": "Who mentioned the fire?"}],
    "uuid": "your_session_uuid",
}
# json= serializes the body and sets the Content-Type: application/json header;
# timeout is in seconds, adjust it for your deployment
response = requests.post(url, json=chat_data, timeout=60)
response.raise_for_status()  # fail fast on HTTP 4xx/5xx instead of parsing an error body
print(response.json()["choices"][0]["message"]["content"])
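The example indexes straight into the response body. If the body is an error payload or has an unexpected shape, the lookup fails with a bare KeyError or IndexError. A small helper can make that failure explicit. The sketch below assumes only the choices[0].message.content shape that the example already reads. extract_answer is a hypothetical name and is not part of the Retrieval Service.

```python
from typing import Any, Dict


def extract_answer(body: Dict[str, Any]) -> str:
    """Return choices[0].message.content, or raise ValueError saying what was missing."""
    choices = body.get("choices")
    if not choices:
        # Covers a missing key as well as an empty list
        raise ValueError(f"response has no choices: {body!r}")
    message = choices[0].get("message") or {}
    content = message.get("content")
    if content is None:
        raise ValueError(f"first choice has no message content: {choices[0]!r}")
    return content
```

With this helper, the last line of the example becomes print(extract_answer(response.json())).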
Query Parameters
model: The model to use for the completion (e.g., "meta/llama-3.1-70b-instruct")
base_url: The base URL of the API that serves the model (e.g., "https://integrate.api.nvidia.com/v1")
messages: Array of message objects, each with role and content fields
uuid: Unique identifier for the request
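You can assemble and check these parameters on the client so that malformed requests are caught before they reach the service. The sketch below is a hypothetical helper, build_chat_payload. Its defaults are the values used in the examples above, and its role check uses the roles listed under Best Practices. The service's own validation rules are not documented here, so treat these checks as a conservative assumption.

```python
from typing import Dict, List

# Roles named under "Message Structure" in Best Practices
VALID_ROLES = {"system", "user", "assistant"}


def build_chat_payload(
    messages: List[Dict[str, str]],
    session_uuid: str,
    model: str = "meta/llama-3.1-70b-instruct",  # value used in the examples above
    base_url: str = "https://integrate.api.nvidia.com/v1",  # value used in the examples above
) -> Dict[str, object]:
    """Assemble a /chat/completions request body, rejecting malformed messages early."""
    if not messages:
        raise ValueError("messages must contain at least one message")
    for i, msg in enumerate(messages):
        missing = {"role", "content"} - set(msg)
        if missing:
            raise ValueError(f"message {i} is missing {sorted(missing)}")
        if msg["role"] not in VALID_ROLES:
            raise ValueError(f"message {i} has unknown role {msg['role']!r}")
        if not isinstance(msg["content"], str) or not msg["content"].strip():
            raise ValueError(f"message {i} needs non-empty string content")
    return {
        "model": model,
        "base_url": base_url,
        "messages": [dict(m) for m in messages],  # shallow copies; caller's list is untouched
        "uuid": session_uuid,
    }
```

The resulting dictionary can be passed directly as json= to requests.post in the example query.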
Summary Query
Summary queries can be made to the system using the /summary endpoint of the Retrieval Service. The summarization object in the request takes two fields:
start_index: The start index of the batch summary (e.g., 0)
end_index: The end index of the batch summary (e.g., -1)
Request Format
{
  "uuid": "your_session_uuid",
  "summarization": {
    "start_index": 0,
    "end_index": -1
  }
}
Example Query
import requests
url = "http://localhost:8000/summary"
data = {
    "uuid": "your_session_uuid",
    "summarization": {
        "start_index": 0,
        "end_index": -1,
    },
}
# json= serializes the body and sets the Content-Type: application/json header;
# timeout is in seconds, adjust it for your deployment
response = requests.post(url, json=data, timeout=300)
response.raise_for_status()  # fail fast on HTTP 4xx/5xx
print(response.json()["result"])
Best Practices
Question Formulation
Be specific and clear in your questions
Use natural language
Avoid overly complex or multi-part questions
Message Structure
Use clear role assignments (“user”, “assistant”, “system”)
Structure your content clearly within the message
Provide meaningful UUIDs for request tracking
Error Handling
Always check response status codes
Handle timeouts appropriately
Implement retry logic for failed requests
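The three error-handling points above can be combined into a small client-side wrapper. This is a minimal sketch that assumes exponential backoff suits your deployment. with_retries and post_json are hypothetical helper names and are not part of the Retrieval Service. The default delay and timeout values are placeholders.

```python
import time
from typing import Any, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on retry_on exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: re-raise the last error
            sleep(base_delay * 2 ** attempt)  # waits base_delay, then 2x, 4x, ...
    raise AssertionError("unreachable")


def post_json(url: str, payload: dict, timeout: float = 60.0) -> Any:
    """POST payload as JSON, raise on HTTP 4xx/5xx, and return the decoded body."""
    import requests  # third-party; imported here so with_retries has no hard dependency on it

    response = requests.post(url, json=payload, timeout=timeout)
    response.raise_for_status()
    return response.json()
```

Consider the call with_retries(lambda: post_json("http://localhost:8000/summary", data), retry_on=(requests.Timeout, requests.ConnectionError)). It retries timeouts and connection failures, but HTTP errors such as a 400 surface immediately. This is because raise_for_status raises requests.HTTPError, which belongs to neither class. Retrying a POST re-sends the full request, so only retry calls that are safe to repeat.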