AIQ Function/Tool#

The Context Aware RAG AIQ plugin can also be used as a function/tool in custom AIQ workflows.

The directory ./src/vss_ctx_rag/aiq_config/function/ contains two example config files that use Context Aware RAG as a function/tool: one for ingestion and one for retrieval.

Retrieval Function#

This is an example of the config file for using Context Aware RAG as a function/tool for retrieval:

general:
  use_uvloop: true

llms:
  nim_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct
    max_tokens: 2048
    base_url: "https://integrate.api.nvidia.com/v1"


embedders:
  embedding_llm:
    _type: nim
    model_name: nvidia/llama-3.2-nv-embedqa-1b-v2
    truncate: "END"
    base_url: "https://integrate.api.nvidia.com/v1"


functions:
  retrieval_function:
    _type: vss_ctx_rag_retrieval
    llm_name: nim_llm

    vector_db_host: localhost
    vector_db_port: "19530"

    graph_db_uri: bolt://localhost:7687
    graph_db_user: neo4j
    graph_db_password: passneo4j

    embedding_model_name: embedding_llm

    rerank_model_name: "nvidia/llama-3.2-nv-rerankqa-1b-v2"
    rerank_model_url: "https://ai.api.nvidia.com/v1/retrieval/nvidia/llama-3_2-nv-rerankqa-1b-v2/reranking"
    rag_type: "vector-rag" # or "graph-rag"
    chat_batch_size: 1
    summ_batch_size: 5
    summ_batch_max_concurrency: 20

    uuid: "123456"


workflow:
  _type: react_agent
  tool_names: [retrieval_function]
  llm_name: nim_llm
  verbose: true
  retry_parsing_errors: true
  max_retries: 3

Here, the vss_ctx_rag_retrieval function is added as a tool to a LangChain ReAct agent. A ReAct agent uses a language model to decide which tool to call based on the user's query. In this example, the agent uses the vss_ctx_rag_retrieval function to retrieve information from the vector database.
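
The sections of this config refer to each other by name: llm_name and embedding_model_name in the function appear to point at keys under llms and embedders, and tool_names in the workflow points at keys under functions. The following is a minimal sketch of a sanity check for those references, assuming that reading of the config is correct. check_references is a hypothetical helper, not part of AIQ, and the dict is a trimmed, hand-written copy of the YAML above (loading the real file would need a YAML parser).

```python
from __future__ import annotations


def check_references(config: dict) -> list[str]:
    """Hypothetical helper (not part of AIQ): list names that point nowhere."""
    problems = []
    llms = config.get("llms", {})
    embedders = config.get("embedders", {})
    functions = config.get("functions", {})

    # Assumption: llm_name / embedding_model_name refer to keys under
    # the llms / embedders sections of the same file.
    for fn_name, fn in functions.items():
        llm = fn.get("llm_name")
        if llm is not None and llm not in llms:
            problems.append(f"functions.{fn_name}.llm_name: unknown LLM {llm!r}")
        embedder = fn.get("embedding_model_name")
        if embedder is not None and embedder not in embedders:
            problems.append(
                f"functions.{fn_name}.embedding_model_name: unknown embedder {embedder!r}"
            )

    workflow = config.get("workflow", {})
    for tool in workflow.get("tool_names", []):
        if tool not in functions:
            problems.append(f"workflow.tool_names: unknown function {tool!r}")
    workflow_llm = workflow.get("llm_name")
    if workflow_llm is not None and workflow_llm not in llms:
        problems.append(f"workflow.llm_name: unknown LLM {workflow_llm!r}")
    return problems


# Trimmed copy of the retrieval config, written out as a Python dict.
config = {
    "llms": {"nim_llm": {"_type": "nim"}},
    "embedders": {"embedding_llm": {"_type": "nim"}},
    "functions": {
        "retrieval_function": {
            "_type": "vss_ctx_rag_retrieval",
            "llm_name": "nim_llm",
            "embedding_model_name": "embedding_llm",
        }
    },
    "workflow": {
        "_type": "react_agent",
        "tool_names": ["retrieval_function"],
        "llm_name": "nim_llm",
    },
}

print(check_references(config))
```

An empty list means every name resolved. A typo such as tool_names: [retrival_function] would show up as one entry in the returned list.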

Ingestion Function#

This is an example of the config file for using Context Aware RAG as a function/tool for ingestion:

general:
  use_uvloop: true

llms:
  nim_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct
    max_tokens: 2048
    base_url: "https://integrate.api.nvidia.com/v1"


embedders:
  embedding_llm:
    _type: nim
    model_name: nvidia/llama-3.2-nv-embedqa-1b-v2
    truncate: "END"
    base_url: "https://integrate.api.nvidia.com/v1"


functions:
  ingestion_function:
    _type: vss_ctx_rag_ingestion
    llm_name: nim_llm

    vector_db_host: localhost
    vector_db_port: "19530"

    graph_db_uri: bolt://localhost:7687
    graph_db_user: neo4j
    graph_db_password: passneo4j

    embedding_model_name: embedding_llm

    rerank_model_name: "nvidia/llama-3.2-nv-rerankqa-1b-v2"
    rerank_model_url: "https://ai.api.nvidia.com/v1/retrieval/nvidia/llama-3_2-nv-rerankqa-1b-v2/reranking"
    rag_type: "vector-rag" # or "graph-rag"
    chat_batch_size: 1
    summ_batch_size: 5
    summ_batch_max_concurrency: 20

    uuid: "123456"


workflow:
  _type: tool_call_workflow
  tool_names: [ingestion_function]
  llm_name: nim_llm

This config defines a custom tool call workflow that uses the Context Aware RAG ingestion function to ingest documents into the vector database. A custom workflow is used so that the input passed in is treated as a document rather than as a query.

Running the function#

Exporting environment variables#

Export the environment variables for the vector and/or graph database, along with the NVIDIA API key used for the LLM models.

Vector-RAG#

export MILVUS_HOST=<MILVUS_HOST_IP> # Milvus host, e.g. localhost
export MILVUS_PORT=<MILVUS_DB_PORT> # Milvus port, e.g. 19530
export NVIDIA_API_KEY=<NVIDIA_API_KEY> # NVIDIA API key

Graph-RAG#

export GRAPH_DB_URI=<GRAPH_DB_URI> # Neo4j URI, e.g. bolt://localhost:7687
export GRAPH_DB_USERNAME=<GRAPH_DB_USERNAME> # Neo4j username, e.g. neo4j
export GRAPH_DB_PASSWORD=<GRAPH_DB_PASSWORD> # Neo4j password, e.g. password
export NVIDIA_API_KEY=<NVIDIA_API_KEY> # NVIDIA API key
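
A missing variable is easy to overlook, so it can help to check the environment before starting a service. The following is a minimal sketch: the variable names are taken from the export commands above, while the grouping per RAG type and the missing_env_vars helper are assumptions made for illustration.

```python
import os

# Variable names come from the export commands above; grouping them by
# RAG type is an assumption made for this sketch.
REQUIRED_VARS = {
    "vector-rag": ["MILVUS_HOST", "MILVUS_PORT", "NVIDIA_API_KEY"],
    "graph-rag": [
        "GRAPH_DB_URI",
        "GRAPH_DB_USERNAME",
        "GRAPH_DB_PASSWORD",
        "NVIDIA_API_KEY",
    ],
}


def missing_env_vars(rag_type, env=None):
    """Return the required variables that are unset or empty for rag_type."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS[rag_type] if not env.get(name)]


if __name__ == "__main__":
    for rag_type in REQUIRED_VARS:
        missing = missing_env_vars(rag_type)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{rag_type}: {status}")
```

Passing a plain dict as env makes the check easy to exercise without touching the real environment.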

Running Data Ingestion#

aiq serve --config_file=./src/vss_ctx_rag/aiq_config/function/config-ingestion-function.yml --port <PORT>

Running Retrieval#

aiq serve --config_file=./src/vss_ctx_rag/aiq_config/function/config-retrieval-function.yml --port <PORT>
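
Both aiq serve commands start long-running servers, so a client script may need to wait until the chosen port accepts connections before sending requests. The following is a minimal sketch using only the standard library. wait_for_port is a hypothetical helper, and an open port only shows that the server is listening, not that the databases behind it are reachable.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, wait_for_port("localhost", 8000) could be called before the first ingestion request if the ingestion service was started with --port 8000.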

Example Python API calls to the services#

The examples below assume two running services: one for ingestion on port 8000 and one for retrieval on port 8001.

Ingestion Python request#

import requests

url = "http://localhost:8000/generate"
headers = {"Content-Type": "application/json"}
data = {
    "rag_workflow": "The bridge is bright blue."
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

Retrieval Python request#

import requests

url = "http://localhost:8001/generate"
headers = {"Content-Type": "application/json"}
data = {
    "input_message": "Is there a bridge? If so describe it"
}

response = requests.post(url, headers=headers, json=data)
print(response.json())
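
The two requests can be combined into one script that ingests a few documents and then asks a question. The following is a minimal sketch under stated assumptions: it uses the standard library's urllib.request instead of requests so that it has no third-party dependency, it assumes each ingestion request carries one document, and the post_json and ingest_then_ask helpers are hypothetical. The ports and payload keys are copied from the examples above.

```python
import json
import urllib.error
import urllib.request

# Ports match the two services assumed in the examples above.
INGEST_URL = "http://localhost:8000/generate"
RETRIEVE_URL = "http://localhost:8001/generate"


def post_json(url, payload, timeout=60.0):
    """POST payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Include the server's error body, which often explains the failure.
        detail = err.read().decode("utf-8", errors="replace")
        raise RuntimeError(f"{url} returned HTTP {err.code}: {detail}") from err


def ingest_then_ask(documents, question, post=post_json):
    """Ingest each document, then send the question to the retrieval service."""
    for text in documents:
        # Payload key copied from the ingestion example above.
        post(INGEST_URL, {"rag_workflow": text})
    # Payload key copied from the retrieval example above.
    return post(RETRIEVE_URL, {"input_message": question})
```

With both services running, ingest_then_ask(["The bridge is bright blue."], "Is there a bridge? If so describe it") reproduces the two requests shown earlier. The post parameter exists only so the flow can be tested without a live server.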