NVIDIA API Catalog, LlamaIndex, and LangChain

This notebook demonstrates how to plug the NVIDIA API Catalog models ai-mixtral-8x7b-instruct (as the LLM) and ai-embed-qa-4 (as the embedding model) into LlamaIndex with these customizations.

⚠️ LlamaIndex is under continuous development and supports many retrieval techniques; this notebook only shows how to quickly swap in components of your choice, such as the llm and embedding. Read the llama-index documentation for the latest information.

Prerequisite

In order to successfully run this notebook, you will need the following -

  1. Already successfully gone through the setup and generated an API key.

  2. Successfully pip installed all Python packages in requirements.txt (a quick sanity check follows below).
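
If you want to confirm the key packages are importable before proceeding, a minimal check like the one below can help (the package names listed are an assumption about what requirements.txt contains):

import importlib.util

# Verify the core packages used in this notebook are installed
for pkg in ["llama_index", "langchain_nvidia_ai_endpoints"]:
    assert importlib.util.find_spec(pkg) is not None, \
        f"Missing package: {pkg}; run `pip install -r requirements.txt`"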

In this notebook, we will cover the following custom plug-in components -

- An LLM: ai-mixtral-8x7b-instruct

- An embedding model: ai-embed-qa-4

Note: Because we access the models through the NVIDIA API Catalog API, there is no GPU hardware requirement in the prerequisites.


Step 1 - Load ai-mixtral-8x7b-instruct as LLM

Note: check the prerequisites if you have not yet obtained a valid API key.

import getpass
import os


# del os.environ['NVIDIA_API_KEY']  ## delete key and reset
if os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    print("Valid NVIDIA_API_KEY already in environment. Delete to reset")
else:
    nvapi_key = getpass.getpass("NVAPI Key (starts with nvapi-): ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

Run a test and confirm the model generates a response.

# test run and see that you can generate a response successfully
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# ChatNVIDIA reads the NVIDIA_API_KEY environment variable automatically
llm = ChatNVIDIA(model="ai-mixtral-8x7b-instruct", max_tokens=1024)
result = llm.invoke("Write a ballad about LangChain.")
print(result.content)
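
ChatNVIDIA follows the standard LangChain chat-model interface, so you can also stream tokens as they are generated; the prompt below is just an arbitrary example:

# Optional: stream the response token by token instead of waiting for the full text
for chunk in llm.stream("Write a haiku about GPUs."):
    print(chunk.content, end="", flush=True)
print()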

Step 2 - Load the chosen embedding model into llama-index

# Bring in LlamaIndex's wrapper that adapts a LangChain embedding
from llama_index.embeddings import LangchainEmbedding

# Create the NVIDIA embedding instance and wrap it for llama-index
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
nv_embedding = NVIDIAEmbeddings(model="ai-embed-qa-4")
li_embedding = LangchainEmbedding(nv_embedding)
# Alternatively, specify whether the model should embed queries or passages
# embedder = NVIDIAEmbeddings(model="nvolveqa_40k", model_type="passage")
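
As a quick sanity check, you can embed a sample string directly; embed_query is part of the standard LangChain embeddings interface, and the sample text is arbitrary:

# Embed a sample query and inspect the vector dimensionality
vec = nv_embedding.embed_query("What is retrieval-augmented generation?")
print(f"embedding dimension: {len(vec)}")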

Note: if you encounter a typing_extensions error, simply reinstall via: pip install typing_extensions==4.7.1 --force-reinstall

Step 3 - Wrap the NVIDIA embedding model and the NVIDIA ai-mixtral-8x7b-instruct LLM into llama-index’s ServiceContext

# Bring in the pieces needed to configure the global service context
from llama_index import set_global_service_context
from llama_index import ServiceContext

# Create a new service context instance with our LLM and embedding model;
# chunk_size controls how documents are split into nodes during indexing
service_context = ServiceContext.from_defaults(
    chunk_size=1024,
    llm=llm,
    embed_model=li_embedding
)
# And set the service context
set_global_service_context(service_context)

Step 4a - Load the text data using llama-index’s SimpleDirectoryReader and build the built-in VectorStoreIndex

# Build a vector index over the documents in the toy_data directory
from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./toy_data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
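
Optionally, the index can be persisted to disk so it does not have to be rebuilt on every run; ./toy_index below is an arbitrary local path chosen for this example:

# Persist the index to disk
index.storage_context.persist(persist_dir="./toy_index")

# To reload it later:
# from llama_index import StorageContext, load_index_from_storage
# storage_context = StorageContext.from_defaults(persist_dir="./toy_index")
# index = load_index_from_storage(storage_context)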

Step 4b - Use the index as the query engine for asking questions

# Set up the index query engine using the LLM
query_engine = index.as_query_engine()

# Test out a query in natural language
response = query_engine.query("Who is the director of the movie Titanic?")
print(response)

# Inspect the retrieved source nodes and their similarity scores
for node in response.source_nodes:
    print(f"retrieved text: {node.get_text()}, with score: {node.get_score()}")
    print("---" * 10)