NVIDIA AI Endpoints, LlamaIndex, and LangChain
This notebook demonstrates how to plug the NVIDIA AI Endpoints mixtral_8x7b LLM and nvolveqa_40k embedding model into LlamaIndex with these customizations.
⚠️ LlamaIndex is under continuous development and supports many retrieval techniques; this notebook only shows how to quickly swap in components such as the LLM and embedding model of your choice. Read the llama-index documentation for the latest information.
Prerequisites
In order to successfully run this notebook, you will need the following -
- Already successfully gone through the setup and generated an API key.
- Verified that you have pip installed all Python packages in requirements.txt.
In this notebook, we will cover the following custom plug-in components -
- LLM using NVIDIA AI Endpoint mixtral_8x7b
- An NVIDIA AI Endpoint embedding nvolveqa_40k
Note: Since we are calling NVIDIA AI Endpoints as an API, there is no additional GPU hardware requirement beyond the prerequisites above.
Step 1 - Load NVIDIA AI Endpoint mixtral_8x7b
Note: check the prerequisites if you have not yet obtained a valid API key.
import getpass
import os

## API Key can be found by going to NVIDIA NGC -> AI Foundation Models -> (some model) -> Get API Code or similar.
## 10K free queries to any endpoint (which is a lot actually).

# del os.environ['NVIDIA_API_KEY']  ## delete key and reset
if os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    print("Valid NVIDIA_API_KEY already in environment. Delete to reset")
else:
    nvapi_key = getpass.getpass("NVAPI Key (starts with nvapi-): ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key
Run a test and see that the model generates an output response:
# Test run and see that you can generate a response successfully
from langchain_nvidia_ai_endpoints import ChatNVIDIA
llm = ChatNVIDIA(model="mixtral_8x7b")  # picks up NVIDIA_API_KEY from the environment
result = llm.invoke("Write a ballad about LangChain.")
print(result.content)
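ChatNVIDIA follows the standard LangChain chat-model interface, so you can also stream the reply token by token instead of waiting for the full response; a minimal sketch (the prompt here is arbitrary):
# Optional: stream the reply token by token
for chunk in llm.stream("Write a haiku about GPUs."):
    print(chunk.content, end="")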
Step 2 - Load the chosen NVIDIA Endpoint Embedding into llama-index
# Create an embeddings instance, wrapping the NVIDIA AI endpoint embedding
# into llama-index via its LangChain embeddings wrapper

# Bring in the embeddings wrapper
from llama_index.embeddings import LangchainEmbedding
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

nv_embedding = NVIDIAEmbeddings(model="nvolveqa_40k", model_type="query")
li_embedding = LangchainEmbedding(nv_embedding)
# Alternatively, if you want to specify whether it will use the query or passage type
# embedder = NVIDIAEmbeddings(model="nvolveqa_40k", model_type="passage")
Note: if you encounter a typing_extensions error, simply reinstall via `pip install typing_extensions==4.7.1 --force-reinstall`
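As a quick sanity check, you can embed a short query directly through the LangChain interface and inspect the resulting vector; a minimal sketch (the query text is arbitrary, and 1024 is the dimensionality nvolveqa_40k is expected to return):
# Optional sanity check: embed a query and inspect the vector dimensionality
vec = nv_embedding.embed_query("What is retrieval-augmented generation?")
print(len(vec))  # expected: 1024 for nvolveqa_40k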
Step 3 - Wrap the NVIDIA embedding endpoint and the NVIDIA mixtral_8x7b endpoint into llama-index's ServiceContext
# Bring in components to change the service context
from llama_index import ServiceContext, set_global_service_context

# Create a new service context instance
service_context = ServiceContext.from_defaults(
    chunk_size=1024,
    llm=llm,
    embed_model=li_embedding,
)

# And set it as the global service context
set_global_service_context(service_context)
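Because the service context is now set globally, llama-index components created from here on (indexes, retrievers, query engines) will default to the NVIDIA LLM and embedding model, so passing service_context explicitly in the following steps is optional.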
Step 4a - Load the text data using llama-index's SimpleDirectoryReader and index it with the built-in VectorStoreIndex
# Load the toy data and build a vector index over it
from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./toy_data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
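Embedding happens at index construction time, so it can be worth persisting the index to disk and reloading it later instead of re-embedding the documents; a minimal sketch using llama-index's storage utilities ("./storage" is an arbitrary directory):
# Optional: persist the index so the embeddings can be reused across sessions
index.storage_context.persist(persist_dir="./storage")

# Reload it later without re-embedding the documents
from llama_index import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context, service_context=service_context)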
Step 4b - Create the query engine we will use to ask questions
# Set up the index query engine using the LLM
query_engine = index.as_query_engine()
# Test out a query in natural language
response = query_engine.query("who is the director of the movie Titanic?")
print(response.metadata)
print(response.response)
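To see which document chunks the answer was grounded in, you can also inspect the response's source nodes, a standard attribute of llama-index query responses:
# Inspect the retrieved chunks and their similarity scores
for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.metadata)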