Advanced Q&A with LlamaIndex

This notebook demonstrates how to use LlamaIndex to build a more complex retrieval pipeline for a chatbot.

The retrieval method shown in this notebook works well for code documentation; it retrieves more contiguous document blocks that preserve both code snippets and explanations of code.

⚠️ LlamaIndex supports many node parsing and retrieval techniques; this notebook just shows how two of them, HierarchicalNodeParser and AutoMergingRetriever, can be useful for chatting with code documentation.

In this demo, we’ll use the llama_docs_bot GitHub repository as our sample documentation to query. This repository contains the content for a development series with LlamaIndex covering the following topics:

  • LLMs

  • Nodes and documents

  • Evaluation

  • Embeddings

  • Retrieval

Step 1: Prerequisite Setup

By now you should be familiar with these steps:

  1. Create an LLM client.

  2. Set the prompt template for the LLM.

  3. Download an embedding model.

  4. Set the service context.

  5. Split the text.

WARNING! Be sure to replace server_url with the address and port that Triton is running on.

Use the address and port that the Triton server is available on; for example, localhost:8001. If you are running this notebook as part of the Generative AI workflow, you can use the existing URL.

from triton_trt_llm import TensorRTLLM
from llama_index.llms import LangChainLLM
trtllm = TensorRTLLM(server_url="llm:8001", model_name="ensemble", tokens=500)
llm = LangChainLLM(llm=trtllm)
from llama_index import Prompt

LLAMA_PROMPT_TEMPLATE = (
    "<s>[INST] <<SYS>>"
    "Use the following context to answer the user's question. If you don't know the answer, just say that you don't know, don't try to make up an answer."
    "<</SYS>>"
    "Context: {context_str} Question: {query_str} Only return the helpful answer below and nothing else. Helpful answer:[/INST]"
)

qa_template = Prompt(LLAMA_PROMPT_TEMPLATE)
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index.embeddings import LangchainEmbedding
from llama_index import ServiceContext, set_global_service_context

model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": False}
hf_embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/e5-large-v2",
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)
# Load in a specific embedding model
embed_model = LangchainEmbedding(hf_embeddings)
service_context = ServiceContext.from_defaults(
  llm=llm,
  embed_model=embed_model
)
set_global_service_context(service_context)

When splitting the text, we split it into a parent node of 1024 tokens and two child nodes of 510 tokens each. Our leaf nodes' maximum size is 512 tokens, so the child chunk size is kept just under that limit.

from llama_index.text_splitter import TokenTextSplitter
text_splitter_ids = ["1024", "510"]
text_splitter_map = {}
for ids in text_splitter_ids:
    text_splitter_map[ids] = TokenTextSplitter(
        chunk_size=int(ids),
        chunk_overlap=200
    )
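
As a quick check of the splitter configuration, you can split a sample string with the 510-token splitter and inspect the result. This is an illustrative snippet; the sample text is arbitrary placeholder content.

sample_text = "LlamaIndex helps you build retrieval-augmented applications. " * 200
chunks = text_splitter_map["510"].split_text(sample_text)
print(f"{len(chunks)} chunks; first chunk starts with: {chunks[0][:60]}")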

Step 2: Clone the Llama Docs Bot Repo

This repository will be our sample documentation that we chat with.

!git clone https://github.com/run-llama/llama_docs_bot.git

Step 3: Define Document Loading and Node Parsing Function

Assuming hierarchical node parsing is set to true, this function:

  • Parses each directory into a single giant document

  • Chunks the document into a hierarchy of nodes with a top-level chunk size (1024 tokens) and smaller child chunks (this is hierarchical node parsing)

          1024
       /--------\
    1024//2     1024//2
    

Hierarchical Node Parser

The novel part of this step is using LlamaIndex’s Hierarchical Node Parser. This parses documents into a hierarchy of nodes at several chunk sizes.

During retrieval, if a majority of chunks are retrieved that have the same parent chunk, the larger parent chunk is returned instead of the smaller chunks.
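
To see how the hierarchy is wired together, you can parse a small placeholder document with the splitters from Step 1 and inspect the parent/child relationships. This is an illustrative sketch; the sample text is arbitrary.

from llama_index import Document
from llama_index.node_parser import HierarchicalNodeParser, get_leaf_nodes

demo_parser = HierarchicalNodeParser.from_defaults(
    node_parser_ids=text_splitter_ids, node_parser_map=text_splitter_map
)
demo_nodes = demo_parser.get_nodes_from_documents(
    [Document(text="LlamaIndex parses documents into nodes. " * 400)]
)
demo_leaves = get_leaf_nodes(demo_nodes)
print(f"{len(demo_nodes)} total nodes, {len(demo_leaves)} leaf nodes")
# Each leaf node keeps a reference to its larger parent chunk.
print(demo_leaves[0].parent_node)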

Simple Node Parser

If hierarchical parsing is false, a simple node structure is used and returned.

from llama_index import SimpleDirectoryReader, Document
from llama_index.node_parser import HierarchicalNodeParser, SimpleNodeParser, get_leaf_nodes
from llama_index.schema import MetadataMode
from llama_docs_bot.llama_docs_bot.markdown_docs_reader import MarkdownDocsReader

# This function takes in a directory of files, loads them (combining them into one
# large document in the hierarchical case), and parses and returns them as:
# - a hierarchical node structure if it's a hierarchical implementation
# - a simple node structure if it's a non-hierarchical implementation
def load_markdown_docs(filepath, hierarchical=True):
    """Load markdown docs from a directory, excluding all other file types."""
    loader = SimpleDirectoryReader(
        input_dir=filepath,
        required_exts=[".md"],
        file_extractor={".md": MarkdownDocsReader()},
        recursive=True
    )

    documents = loader.load_data()

    if hierarchical:
        # combine all documents into one
        documents = [
            Document(text="\n\n".join(
                    document.get_content(metadata_mode=MetadataMode.ALL)
                    for document in documents
                )
            )
        ]

        # chunk into two levels: 1024-token parents with 510-token children
        # during retrieval, if a majority of a parent's children are retrieved,
        # the parent chunk is returned in their place
        node_parser = HierarchicalNodeParser.from_defaults(node_parser_ids=text_splitter_ids, node_parser_map=text_splitter_map)

        nodes = node_parser.get_nodes_from_documents(documents)
        return nodes, get_leaf_nodes(nodes)
    ########## This is NOT a hierarchical parser for demonstration purposes later in the notebook ##########
    else:
        node_parser = SimpleNodeParser.from_defaults()
        nodes = node_parser.get_nodes_from_documents(documents)
        return nodes
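
For example, once the repository from Step 2 is cloned, you can parse a single directory and compare the two modes. This is an illustrative check; the directory is one of those used in Step 4.

# Hierarchical parsing returns all nodes plus the leaf nodes to index.
all_nodes, leaf_nodes = load_markdown_docs("./llama_docs_bot/docs/getting_started", hierarchical=True)
print(f"hierarchical: {len(all_nodes)} total nodes, {len(leaf_nodes)} leaf nodes")

# Flat parsing returns a single list of nodes.
flat_nodes = load_markdown_docs("./llama_docs_bot/docs/getting_started", hierarchical=False)
print(f"simple: {len(flat_nodes)} nodes")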

Step 4: Load and Parse Documents with Node Parser

First, we define all of the documentation directories we want to pull from.

Next, we load the documentation, store all of the nodes in a SimpleDocumentStore, and index the leaf nodes in a VectorStoreIndex.

docs_directories = {
    "./llama_docs_bot/docs/community": "Useful for information on community integrations with other libraries, vector dbs, and frameworks.",
    "./llama_docs_bot/docs/core_modules/agent_modules": "Useful for information on data agents and tools for data agents.",
    "./llama_docs_bot/docs/core_modules/data_modules": "Useful for information on data, storage, indexing, and data processing modules.",
    "./llama_docs_bot/docs/core_modules/model_modules": "Useful for information on LLMs, embedding models, and prompts.",
    "./llama_docs_bot/docs/core_modules/query_modules": "Useful for information on various query engines and retrievers, and anything related to querying data.",
    "./llama_docs_bot/docs/core_modules/supporting_modules": "Useful for information on supporting modules, like callbacks, evaluators, and other supporting modules.",
    "./llama_docs_bot/docs/getting_started": "Useful for information on getting started with LlamaIndex.",
    "./llama_docs_bot/docs/development": "Useful for information on contributing to LlamaIndex development.",
}
from llama_index import VectorStoreIndex, StorageContext, load_index_from_storage
from llama_index.query_engine import RetrieverQueryEngine

from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.storage.docstore import SimpleDocumentStore
import os
import time

start_time = time.time()
for directory, description in docs_directories.items():
    nodes, leaf_nodes = load_markdown_docs(directory, hierarchical=True)

    docstore = SimpleDocumentStore()
    docstore.add_documents(nodes)
    storage_context = StorageContext.from_defaults(docstore=docstore)

    index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)
    index.storage_context.persist(persist_dir=f"./data_{os.path.basename(directory)}")

print(f"--- {time.time() - start_time} seconds ---")

Step 5: Define Custom Node Post-Processor

A Node PostProcessor takes a list of retrieved nodes and transforms them (filtering, replacement, etc.).

This custom node post-processor uses a simple approach to approximate token counts and returns as many retrieved nodes as fit within a token limit (2,500 tokens by default). The nodes are already sorted by similarity, so the most relevant ones are kept first.

from typing import Callable, Optional

from llama_index.utils import get_tokenizer
from llama_index.schema import MetadataMode

class LimitRetrievedNodesLength:

    def __init__(self, limit: int = 2500, tokenizer: Optional[Callable] = None):
        self._tokenizer = tokenizer or get_tokenizer()
        self.limit = limit

    def postprocess_nodes(self, nodes, query_bundle):
        included_nodes = []
        current_length = 0

        for node in nodes:
            current_length += len(self._tokenizer(node.node.get_content(metadata_mode=MetadataMode.LLM)))
            if current_length > self.limit:
                break
            included_nodes.append(node)

        return included_nodes
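
As a quick sanity check of the token-budget logic, you can run the post-processor over a couple of fabricated retrieval results. This is an illustrative test; NodeWithScore and TextNode are only used here to build dummy nodes.

from llama_index.schema import NodeWithScore, TextNode

dummy_nodes = [
    NodeWithScore(node=TextNode(text="word " * 1500), score=0.9),
    NodeWithScore(node=TextNode(text="word " * 1500), score=0.8),
]
processor = LimitRetrievedNodesLength(limit=2500)
kept = processor.postprocess_nodes(dummy_nodes, query_bundle=None)
print(len(kept))  # expect 1: the second node would push the total past the limit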

Step 6: Build the Retriever and Query Engine

AutoMergingRetriever

The AutoMergingRetriever takes in a set of leaf nodes and recursively merges subsets of leaf nodes that reference a parent node beyond a given threshold. This consolidates potentially fragmented, smaller contexts into a larger context that can help the LLM synthesize an answer from scattered information.

Query Engine

A query engine is an object that takes in a query and returns a response.

It may contain the following components:

  • Retriever: Given a query, retrieves relevant nodes.

    • This example uses an AutoMergingRetriever if it’s a hierarchical implementation. When most of a parent’s children are retrieved, the retrieved nodes are replaced with the larger parent chunk.

  • Node PostProcessor: Takes a list of retrieved nodes and transforms them (filtering, replacement, etc.)

    • This example uses a post-processor that filters the retrieved nodes to a limited length.

  • Response Synthesizer: Takes a list of relevant nodes and synthesizes a response with an LLM.

from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine

retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=12),
    storage_context=storage_context
)

query_engine = RetrieverQueryEngine.from_args(
    retriever,
    text_qa_template=qa_template,
    node_postprocessors=[LimitRetrievedNodesLength(limit=2500)],
    streaming=True
)

Step 7: Stream Response

query = "How do I setup a weaviate vector db? Give me a code sample please."
import time

start_time = time.time()
response = query_engine.query(query)
response.print_response_stream()
print(f"\n--- {time.time() - start_time} seconds ---")

To clear out the cached data, run:

!rm -rf data_*