Q&A with LlamaIndex

This notebook demonstrates how to use LlamaIndex to build a chatbot that references a custom knowledge base.

Suppose you have some text documents (PDF, blog, Notion pages, etc.) and want to ask questions related to the contents of those documents. LLMs, given their proficiency in understanding text, are a great tool for this.

⚠️ The notebook before this one, 02_langchain_index_simple.ipynb, contains the same functionality as this notebook but uses some LangChain components instead of LlamaIndex components.

Concepts that are used in this notebook are explained in-depth in the previous notebook. If you are new to retrieval augmented generation, it is recommended to go through the previous notebook before this one.

Ultimately, we recommend reading about LangChain vs. LlamaIndex and picking the framework, or the combination of components from each, that makes the most sense for your use case. This is discussed a bit further below.

LlamaIndex

LlamaIndex is a data framework for LLM applications to ingest, structure, and access private or domain-specific data. Since LLMs are only trained up to a fixed point in time and do not contain knowledge that is proprietary to an enterprise, they can’t answer questions about new or proprietary information. LlamaIndex helps solve this problem by providing data connectors to ingest data, indices to structure data for storage, and engines to communicate with data.
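To make this concrete, here is a minimal sketch of the typical LlamaIndex flow using the built-in SimpleDirectoryReader and an in-memory vector index. The ./data directory is a hypothetical placeholder, and the sketch falls back to LlamaIndex's default models unless a service context is configured; this notebook substitutes its own LLM, loader, and vector store in the steps below.

from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Sketch only: load files from a hypothetical ./data directory, index them, and query.
documents = SimpleDirectoryReader("./data").load_data()   # data connector (ingest)
index = VectorStoreIndex.from_documents(documents)        # index (structure for storage)
query_engine = index.as_query_engine()                    # engine (communicate with data)
print(query_engine.query("What do these documents cover?"))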

LlamaIndex or LangChain?

It’s recommended to read more about the unique strengths of both LlamaIndex and LangChain. At a high level, LangChain is a more general framework for building applications with LLMs. LangChain is (currently) more mature when it comes to multi-step chains and some other chat functionality such as conversational memory. LlamaIndex has plenty of overlap with LangChain, but is particularly strong for loading data from a wide variety of sources and indexing/querying tasks.

Since LlamaIndex can be used with LangChain, the frameworks’ unique capabilities can be leveraged together; the combination of the two is demonstrated in this notebook.

Step 1: Integrate TensorRT-LLM with LangChain and LlamaIndex

Customized LangChain LLM in LlamaIndex

LangChain allows you to create a custom wrapper for your LLM in case you want to use your own model or one that isn’t natively supported by LangChain. Since we are using LlamaIndex, we have written a custom LangChain LLM wrapper that is also compatible with LlamaIndex.

We can easily take a custom LLM that has been wrapped for LangChain and plug it into LlamaIndex as an LLM! We use LlamaIndex’s LangChainLLM wrapper so the LangChain LLM can be used in LlamaIndex.

WARNING! Be sure to replace server_url with the address and port that Triton is running on.

Use the address and port that Triton is available on; for example, localhost:8001. If you are running this notebook as part of the Generative AI workflow, you can use the existing URL.

from triton_trt_llm import TensorRTLLM
from llama_index.llms import LangChainLLM
trtllm = TensorRTLLM(server_url="llm:8001", model_name="ensemble", tokens=500)
llm = LangChainLLM(llm=trtllm)
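As an optional sanity check, assuming the Triton server at the URL above is reachable, the wrapped LLM can be called directly through LlamaIndex's LLM interface before building the rest of the pipeline.

# Optional sanity check: send a single completion request to the Triton-backed LLM.
completion = llm.complete("What is retrieval augmented generation?")
print(completion.text)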

Step 2: Create a Prompt Template

A prompt template is a common paradigm in LLM development.

It is a pre-defined set of instructions provided to the LLM that guides the output produced by the model. Prompt templates can contain few-shot examples and other guidance, and they are a quick way to engineer the responses from the LLM. Llama 2 accepts the prompt format shown in LLAMA_PROMPT_TEMPLATE, which we construct from:

  • The system prompt

  • The context

  • The user’s question

Much like LangChain, LlamaIndex provides its own abstractions for creating prompts.

from llama_index import Prompt

LLAMA_PROMPT_TEMPLATE = (
    "<s>[INST] <<SYS>>"
    "Use the following context to answer the user's question. If you don't know the answer, just say that you don't know, don't try to make up an answer."
    "<</SYS>>"
    "Context: {context_str} Question: {query_str} Only return the helpful answer below and nothing else. Helpful answer:[/INST]"
)

qa_template = Prompt(LLAMA_PROMPT_TEMPLATE)
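To see what the model will actually receive, the template can be rendered with illustrative placeholder values; the context and question below are made up for demonstration.

# Render the template with dummy values to inspect the final prompt string.
print(
    qa_template.format(
        context_str="Llama 2 models accept sequences of up to 4096 tokens.",
        query_str="What is the context length of Llama 2?",
    )
)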

Step 3: Load Documents

LlamaIndex provides data loaders through Llama Hub. These allow for custom data sources to be connected to your LLM using integrations. For example, integrations are available to load documents from Jira, Outlook Calendar, Slack, Trello, and many other applications.

At the core of Llama Hub is the download_loader function, which downloads a loader file into a module that you can use in your application. Once the loader is downloaded, data is ingested through it. The output of this ingestion is data formatted as LlamaIndex Documents (text and metadata).
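For example, the UnstructuredReader used below can also be fetched at runtime with download_loader instead of being imported from the llama_hub package directly; a quick sketch:

# Alternative to the direct llama_hub import: download the loader from Llama Hub at runtime.
from llama_index import download_loader

UnstructuredReader = download_loader("UnstructuredReader")
loader = UnstructuredReader()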

Similar to the previous notebook with LangChain, an UnstructuredReader is used in this example. However, this time it’s from Llama Hub (LlamaIndex). Again, we load a research paper about Llama 2 from Meta.

Here are some of the other document loaders available from Llama Hub.

! wget -O "llama2_paper.pdf" -nc --user-agent="Mozilla" https://arxiv.org/pdf/2307.09288.pdf
from llama_hub.file.unstructured.base import UnstructuredReader
import time

loader = UnstructuredReader()
start_time = time.time()
documents = loader.load_data(file="llama2_paper.pdf")
print(f"--- {time.time() - start_time} seconds ---")

Step 4: Transform Documents with Text Splitting and a Node Parser

a) Chunk Documents into Nodes

Once documents have been loaded, they are often transformed. One method of transformation is known as chunking, which breaks down large pieces of text, for example, a long document, into smaller segments. This technique is valuable because it helps optimize the relevance of the content returned from the vector database.

This is the same process as the previous notebook; again, we use a LangChain text splitter. In this example, we use a SentenceTransformersTokenTextSplitter. The SentenceTransformersTokenTextSplitter is a specialized text splitter for use with the sentence-transformer models. The default behavior is to split the text into chunks that fit the token window of the sentence transformer model that you would like to use. This sentence transformer model is used to generate the embeddings from documents.

There are some nuanced complexities to text splitting since semantically related text, in theory, should be kept together.

To use LangChain’s SentenceTransformersTokenTextSplitter with LlamaIndex, we wrap the LangChain text splitter in LlamaIndex’s LangchainNodeParser. This is not required, but since LlamaIndex provides a node structure, we choose to use this functionality to enrich how the documents are stored.

Nodes represent chunks of source documents, but they also contain metadata and relationship information with other nodes and index structures. Since nodes provide these additional forms of hierarchy and connections across the data, they can help generate more accurate answers upon retrieval.

from langchain.text_splitter import SentenceTransformersTokenTextSplitter
from llama_index.node_parser import LangchainNodeParser


TEXT_SPLITTER_MODEL = "intfloat/e5-large-v2"
TEXT_SPLITTER_TOKENS_PER_CHUNK = 510
TEXT_SPLITTER_CHUNK_OVERLAP = 200

text_splitter = SentenceTransformersTokenTextSplitter(
    model_name=TEXT_SPLITTER_MODEL,
    tokens_per_chunk=TEXT_SPLITTER_TOKENS_PER_CHUNK,
    chunk_overlap=TEXT_SPLITTER_CHUNK_OVERLAP,
)

node_parser = LangchainNodeParser(text_splitter)
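To see the node structure described above, the parser's output can be inspected directly; this duplicates the parsing that happens during indexing in Step 5, so it is purely illustrative.

# Illustrative only: parse the loaded documents and inspect the first node.
sample_nodes = node_parser.get_nodes_from_documents(documents)
print(len(sample_nodes), "nodes")
print(sample_nodes[0].metadata)        # metadata carried over from the source document
print(sample_nodes[0].relationships)   # links to the source document and neighboring nodes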

b) Configure the Prompt Helper

Additionally, we use a LlamaIndex PromptHelper to help deal with LLM context window token limitations. It calculates the context size available to the LLM by taking the model’s context length and subtracting the token space reserved for the prompt template and the output. It also provides a utility for re-packing text chunks from the index so that the context window is used as fully as possible, minimizing the number of requests sent to the LLM.

  • context_window: context window for the LLM – the context length for Llama 2 is 4k tokens

  • num_output: number of output tokens for the LLM

  • chunk_overlap_ratio: chunk overlap as a ratio to chunk size

  • chunk_size_limit: maximum chunk size to use

from llama_index import PromptHelper

prompt_helper = PromptHelper(
  context_window=4096,
  num_output=256,
  chunk_overlap_ratio=0.1,
  chunk_size_limit=None
)

Step 5: Generate and Store Embeddings

a) Generate Embeddings

Embeddings for documents are created by vectorizing the document text; this vectorization captures the semantic meaning of the text. This allows you to quickly and efficiently find other pieces of text that are similar.

When a user sends in their query, the query is also embedded using the same embedding model that was used to embed the documents. As explained earlier, this allows us to find similar (relevant) documents to the user’s query.

Like other sections in this notebook, we can easily take a LangChain embedding object and use it with LlamaIndex. We use the LangchainEmbedding class, which acts as a wrapper around LangChain’s embedding models.

from langchain.embeddings import HuggingFaceEmbeddings
from llama_index.embeddings import LangchainEmbedding

# Run the embedding model on CPU to conserve GPU memory.
# In the production deployment (the API server shown in the 5th notebook), the model runs on GPU.
model_name="intfloat/e5-large-v2"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": False}
hf_embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)
# Load in a specific embedding model
embed_model = LangchainEmbedding(hf_embeddings)
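To make the idea of embedding similarity concrete, the sketch below embeds a query and two made-up text snippets with the same model and compares them with cosine similarity; a higher score means the snippet is more relevant to the query.

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_emb = embed_model.get_query_embedding("What is the context length of Llama 2?")
doc_embs = [
    embed_model.get_text_embedding("Llama 2 models accept sequences of up to 4096 tokens."),
    embed_model.get_text_embedding("The paper also reports results on safety benchmarks."),
]
print([round(cosine_similarity(query_emb, e), 3) for e in doc_embs])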

b) Store Embeddings

LlamaIndex provides a supporting module, ServiceContext, to bundle commonly used resources during the indexing and querying stage. In this example, we bundle resources we’ve built: the LLM, the embedding model, the node parser, and the prompt helper.

from llama_index import ServiceContext
service_context = ServiceContext.from_defaults(
  llm=llm,
  embed_model=embed_model,
  node_parser=node_parser,
  prompt_helper=prompt_helper
)

Set the service context globally to avoid passing it to every LLM call.

from llama_index import set_global_service_context
set_global_service_context(service_context)

⚠️ In the deployment of this workflow, Milvus is running as a vector database microservice.

from llama_index import VectorStoreIndex
from llama_index.storage.storage_context import StorageContext
from llama_index.vector_stores import MilvusVectorStore

vector_store = MilvusVectorStore(uri="http://milvus:19530", dim=1024, overwrite=False)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_vector_store(vector_store)

Let’s load the documents into the vector database index.

import time
start_time = time.time()
nodes = node_parser.get_nodes_from_documents(documents)
index.insert_nodes(nodes)
print(f"--- {time.time() - start_time} seconds ---")

Step 6: Build the Query Engine and Stream Response

a) Build the Query Engine

A query engine is an object that takes in a query and returns a response. Each vector index has a default corresponding query engine; for example, the default query engine for a vector index performs a standard top-k retrieval over the vector store.

A query engine contains the following components:

  • Retriever

  • Node PostProcessor

  • Response Synthesizer

query_engine = index.as_query_engine(text_qa_template=qa_template, streaming=True)
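For reference, roughly the same engine can be assembled explicitly from the three components listed above; the sketch below uses illustrative settings (a top-k of 2 and a no-op similarity cutoff) rather than the defaults that as_query_engine would choose for you.

# Explicit composition of a query engine from its parts (equivalent in spirit to the call above).
from llama_index import get_response_synthesizer
from llama_index.indices.postprocessor import SimilarityPostprocessor
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.retrievers import VectorIndexRetriever

retriever = VectorIndexRetriever(index=index, similarity_top_k=2)           # Retriever
node_postprocessor = SimilarityPostprocessor(similarity_cutoff=0.0)         # Node PostProcessor
response_synthesizer = get_response_synthesizer(
    text_qa_template=qa_template, streaming=True                            # Response Synthesizer
)

explicit_query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[node_postprocessor],
    response_synthesizer=response_synthesizer,
)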

b) Stream a Response from the Query Engine

Lastly, we pass the query engine a user’s question and stream the response.

import time

start_time = time.time()
response = query_engine.query("what is the context length of llama2?")
response.print_response_stream()
print(f"\n--- {time.time() - start_time} seconds ---")