Creating a RAG Chain

Implementing the Method

The rag_chain method retrieves document chunks from the vector store that are closely related to the query. The chunks are supplied to the LLM as context to augment the query and generate the response. A standalone sketch of the streaming pattern that the method uses appears after the following steps.

  1. Edit the RetrievalAugmentedGeneration/examples/simple_rag_api_catalog/chains.py file and add the following import statements:

    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from RetrievalAugmentedGeneration.common.utils import get_llm, get_config
    
  2. Update the rag_chain method with the following statements:

        def rag_chain(self, query: str, chat_history: List["Message"], **kwargs) -> Generator[str, None, None]:
            """Code to fetch context and form an answer using LLM"""
            logger.info("Using rag to generate response from document")

            settings = get_config()
            # Build the chat prompt: the RAG system template, any prior turns, and the user input.
            system_message = [("system", settings.prompts.rag_template)]
            conversation_history = [(msg.role, msg.content) for msg in chat_history]
            user_input = [("user", "{input}")]
            if conversation_history:
                prompt_template = ChatPromptTemplate.from_messages(
                    system_message + conversation_history + user_input
                )
            else:
                prompt_template = ChatPromptTemplate.from_messages(
                    system_message + user_input
                )

            llm = get_llm(**kwargs)

            chain = prompt_template | llm | StrOutputParser()

            try:
                # Retrieve the chunks that are most relevant to the query and
                # concatenate them into a single context string.
                retriever = vector_store.as_retriever()
                docs = retriever.get_relevant_documents(query)

                context = ""
                for doc in docs:
                    context += doc.page_content + "\n\n"

                # Augment the user input with the retrieved context and stream the response.
                augmented_user_input = (
                    "Context: " + context + "\n\nQuestion: " + query + "\n"
                )
                return chain.stream({"input": augmented_user_input})

            except Exception as e:
                logger.warning(f"Failed to generate response: {e}")
                # Return an empty stream so callers can still iterate over the result.
                return iter([])
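
The method returns the generator produced by chain.stream, and the Chain Server relays each streamed token to the client. The following standalone sketch shows the same prompt, LLM, and output-parser streaming pattern outside of the class. It is illustrative only: it assumes you run it in the Chain Server's Python environment so the imports resolve, and that get_llm can be called without keyword overrides.

    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from RetrievalAugmentedGeneration.common.utils import get_llm

    # Build the same prompt -> LLM -> string-parser pipeline that rag_chain uses.
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer the question using only the supplied context."),
        ("user", "{input}"),
    ])
    chain = prompt | get_llm() | StrOutputParser()  # assumes get_llm() accepts no overrides

    # chain.stream yields string tokens one at a time; rag_chain returns this
    # generator directly so the Chain Server can relay each token to the client.
    augmented_input = "Context: ...retrieved chunks...\n\nQuestion: What is RAG?\n"
    for token in chain.stream({"input": augmented_input}):
        print(token, end="", flush=True)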
    

Building and Running with Docker Compose

Using the containers has one additional step this time: exporting your NVIDIA API key as an environment variable. A short readiness check that you can run after the containers start follows these steps.

  1. Build the container for the Chain Server:

    $ docker compose --env-file deploy/compose/compose.env -f deploy/compose/simple-rag-api-catalog.yaml build chain-server
    
  2. Export your NVIDIA API key in an environment variable:

    $ export NVIDIA_API_KEY=nvapi-...
    
  3. Run the containers:

    $ docker compose --env-file deploy/compose/compose.env -f deploy/compose/simple-rag-api-catalog.yaml up -d
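
With the containers running, you can confirm that the Chain Server answers HTTP requests before moving on to the curl commands in the next section. The following readiness check is a sketch, not part of the example code; it assumes the Chain Server address http://localhost:8081 used below and that the requests package is available.

    import time

    import requests

    url = "http://localhost:8081"  # same address used by the curl examples below

    # Poll the Chain Server until the port answers; any HTTP response means the
    # service is up, so the status code itself is not checked here.
    for _ in range(30):
        try:
            requests.get(url, timeout=2)
            print("Chain Server is up")
            break
        except requests.exceptions.RequestException:
            time.sleep(2)
    else:
        print("Chain Server did not respond; check: docker compose logs chain-server")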
    

Verify the RAG Chain Method Using Curl

You can access the Chain Server at a URL like http://localhost:8081. A scripted equivalent of the following curl requests appears after the example output.

  1. Upload a sample document, such as the README from the repository:

    $ curl http://localhost:8081/documents -F "file=@README.md"
    
  2. Confirm the rag_chain method runs by submitting a query:

    $ curl -H "Content-Type: application/json" http://localhost:8081/generate \
        -d '{"messages":[{"role":"user", "content":"how many models are used in generative AI examples from NVIDIA?"}], "use_knowledge_base": true}'
    

    Example Output

    data: {"id":"0fbc961e-34b6-44e9-a996-9d2f84e794c9","choices":[{"index":0,"message":{"role":"assistant","content":""},"finish_reason":""}]}
    
    data: {"id":"0fbc961e-34b6-44e9-a996-9d2f84e794c9","choices":[{"index":0,"message":{"role":"assistant","content":" The"},"finish_reason":""}]}
    
    data: {"id":"0fbc961e-34b6-44e9-a996-9d2f84e794c9","choices":[{"index":0,"message":{"role":"assistant","content":" text provided mentions several models used in the generative AI examples from NVIDIA, including:\n\n1. Gemma\n2. LoRA\n"},"finish_reason":""}]}
    
    data: {"id":"0fbc961e-34b6-44e9-a996-9d2f84e794c9","choices":[{"index":0,"message":{"role":"assistant","content":"3. SFT (not specified what it stands for)\n4. Starcoder-2\n5. Small language models (SLMs)\n\n"},"finish_reason":""}]}
    
    data: {"id":"0fbc961e-34b6-44e9-a996-9d2f84e794c9","choices":[{"index":0,"message":{"role":"assistant","content":"However, it's unclear whether all of these models are used in every example or just some of them. The specific number of models used in each example"},"finish_reason":""}]}
    
    data: {"id":"0fbc961e-34b6-44e9-a996-9d2f84e794c9","choices":[{"index":0,"message":{"role":"assistant","content":" is not provided."},"finish_reason":""}]}
    
    data: {"id":"0fbc961e-34b6-44e9-a996-9d2f84e794c9","choices":[{"index":0,"message":{"role":"assistant","content":""},"finish_reason":""}]}
    
    data: {"id":"0fbc961e-34b6-44e9-a996-9d2f84e794c9","choices":[{"index":0,"message":{"role":"assistant","content":""},"finish_reason":"[DONE]"}]}
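
You can script the same checks instead of typing the curl commands. The following sketch mirrors the two requests above with the requests package; the endpoints, field names, and payload come directly from the curl examples, while the file name and question text are only illustrations.

    import requests

    base_url = "http://localhost:8081"

    # Upload a sample document, mirroring:
    #   curl http://localhost:8081/documents -F "file=@README.md"
    with open("README.md", "rb") as f:
        upload = requests.post(f"{base_url}/documents", files={"file": f})
    print(upload.status_code)

    # Submit a query with the knowledge base enabled, mirroring the /generate call.
    payload = {
        "messages": [
            {
                "role": "user",
                "content": "how many models are used in generative AI examples from NVIDIA?",
            }
        ],
        "use_knowledge_base": True,
    }
    with requests.post(f"{base_url}/generate", json=payload, stream=True) as resp:
        # The server streams its answer as "data: {...}" lines; print them as they arrive.
        for line in resp.iter_lines():
            if line:
                print(line.decode("utf-8"))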
    

Next Steps

  • You can stop the containers by running the docker compose -f deploy/compose/simple-rag-api-catalog.yaml down command.