HF Checkpoints with LlamaIndex and LangChain

This notebook demonstrates how to plug in a local LLM (Llama-2-13b-chat-hf from the HuggingFace Hub) and the all-MiniLM-L6-v2 embedding model from HuggingFace, and bind both into LlamaIndex with these customizations.

The custom plug-ins shown in this notebook can be replaced; for example, you can swap out the HuggingFace Llama-2-13b-chat-hf checkpoint for a HuggingFace checkpoint from Mistral.
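For instance, a minimal sketch of such a swap (the Mistral identifier below is an assumed example used purely for illustration; any instruction-tuned causal-LM checkpoint you have access to works the same way with the loading helper defined in Step 1):

# Illustrative swap only: point the same loading code at a Mistral checkpoint.
# "mistralai/Mistral-7B-Instruct-v0.2" is an assumed example identifier.
model_name_or_path = "mistralai/Mistral-7B-Instruct-v0.2"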

⚠️ The notebook before this one, 08_Option(1)_llama_index_with_NVIDIA_AI_endpoint.ipynb, contains the same exercise as this notebook but uses models from the NVIDIA AI Catalog via API calls instead of pulling the models’ checkpoints from the HuggingFace model hub and then loading them from host to devices (i.e., GPUs).

Note that, since we load the checkpoints locally, going through this entire notebook will be significantly slower.

If you do decide to go through this notebook, please check the Prerequisites section below.

LlamaIndex supports continuously evolving development and retrieval techniques; this notebook simply shows how to quickly replace components such as the LLM and the embedding model per the user’s choice. Read the llama-index documentation for the latest information.

Prerequisites

In order to successfully run this notebook, you will need the following -

  1. Approval to use the checkpoints, obtained by applying for access to meta-llama

  2. At least 2 NVIDIA GPUs, each with at least 32 GB of memory, preferably on the Ampere architecture

  3. docker and nvidia-docker installed

  4. A registered NVIDIA NGC account with the ability to pull and run NGC PyTorch containers

  5. Necessary Python dependencies installed. Note: if you are using the Dockerfile.gpu_notebook, it should already prepare the environment for you; otherwise, please refer to the Dockerfile for environment building (a minimal install sketch follows below).
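If you are building the environment yourself, a minimal install sketch (the package list and the transformers pin mirror what this notebook uses; prefer the exact versions in the Dockerfile):

# Assumed minimal dependency set; prefer the exact versions pinned in the Dockerfile.
!pip install accelerate transformers==4.33.1 sentence-transformers langchain llama-index --upgrade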

In this notebook, we will cover the following custom plug-in components -

- An LLM loaded locally from [HuggingFace Hub Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) and wrapped into llama-index

- A [HuggingFace embedding all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) 

Step 1 - Load HuggingFace Hub Llama-2-13b-chat-hf

Note: Scroll down and make sure you supply the hf_token in the code block below, replacing [FILL_IN] with your HuggingFace token. For how to generate the token from HuggingFace, please follow the instructions from this link.
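Alternatively, instead of hard-coding the token, a small sketch that reads it from an environment variable (HF_TOKEN is an assumed variable name you would export before launching the notebook):

import os
# Assumes HF_TOKEN was exported before launching Jupyter; falls back to the placeholder.
hf_token = os.environ.get("HF_TOKEN", "[FILL_IN]")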

## uncomment the line below if you have not yet installed the python dependencies
#!pip install accelerate transformers==4.33.1 --upgrade
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
import os
from IPython.display import Markdown, display
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

def load_hf_model(model_name_or_path, device, num_gpus, hf_auth_token, debug=False):
    """Load an HF locally saved checkpoint."""
    if device == "cpu":
        kwargs = {}
    elif device == "cuda":
        kwargs = {"torch_dtype": torch.float16}
        if num_gpus == "auto":
            kwargs["device_map"] = "auto"
        else:
            num_gpus = int(num_gpus)
            if num_gpus != 1:
                kwargs.update(
                    {
                        "device_map": "auto",
                        "max_memory": {i: "13GiB" for i in range(num_gpus)},
                    }
                )
    elif device == "mps":
        kwargs = {"torch_dtype": torch.float16}
        # Avoid bugs in mps backend by not using in-place operations.
        print("mps not supported")
    else:
        raise ValueError(f"Invalid device: {device}")

    if hf_auth_token is None:
        tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)
        model = AutoModelForCausalLM.from_pretrained(
            model_name_or_path, low_cpu_mem_usage=True, **kwargs
        )
    else:
        tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_auth_token=hf_auth_token, use_fast=False)
        model = AutoModelForCausalLM.from_pretrained(
            model_name_or_path, low_cpu_mem_usage=True, use_auth_token=hf_auth_token, **kwargs
        )

    if device == "cuda" and num_gpus == 1:
        model.to(device)

    if debug:
        print(model)

    return model, tokenizer



# Define variable to hold llama2 weights naming
model_name_or_path = "meta-llama/Llama-2-13b-chat-hf"
# Set the auth token variable from HuggingFace (replace [FILL_IN] with your token)
hf_token = "[FILL_IN]"
device = "cuda"
num_gpus = 2

model, tokenizer = load_hf_model(model_name_or_path, device, num_gpus, hf_auth_token=hf_token, debug=False)
# Setup a prompt
prompt = "### User:What is the fastest car in  \
          the world and how much does it cost? \
          ### Assistant:"
# Pass the prompt to the tokenizer
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Setup the text streamer
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

Run a test and watch the model stream its generated response:

output = model.generate(**inputs, streamer=streamer, use_cache=True, max_new_tokens=100)
# Convert the output tokens back to text
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
output_text

Step 2 - Construct prompt template

# Import the prompt wrapper...but for llama index
from llama_index.prompts.prompts import SimpleInputPrompt
# Create a system prompt
system_prompt = """<<SYS>>
You are a helpful, respectful and honest assistant. Always answer as
helpfully as possible, while being safe. Your answers should not include
any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain
why instead of answering something not correct. If you don't know the answer
to a question, please don't share false information.

Your goal is to provide answers relating to the financial performance of
the company.<</SYS>>[INST]
"""
# Throw together the query wrapper
query_wrapper_prompt = SimpleInputPrompt("{query_str} [/INST]")
## do a test query
query_str='What can you help me with?'
query_wrapper_prompt.format(query_str=query_str)
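For reference, the formatted test query should look like the line below (shown as a comment; the system prompt is prepended separately by the LLM wrapper in Step 4):

# Expected value of query_wrapper_prompt.format(query_str=query_str):
# 'What can you help me with? [/INST]'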

Step 3 - Load the chosen HuggingFace embedding

# Create and download the embeddings instance, wrapping the LangChain HuggingFace embedding so llama-index can use it
# Bring in embeddings wrapper
from llama_index.embeddings import LangchainEmbedding
# Bring in HF embeddings - need these to represent document chunks
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
embeddings=LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
)
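As a quick, optional sanity check of the embedding wrapper (a sketch; all-MiniLM-L6-v2 produces 384-dimensional vectors):

# Optional sanity check: embed a short string and inspect the vector length.
sample_vector = embeddings.get_text_embedding("Hello, world!")
print(len(sample_vector))  # expected: 384 for all-MiniLM-L6-v2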

Step 4 - Wrap the locally loaded HuggingFace LLM into llama-index

# Import the llama index HF Wrapper
from llama_index.llms import HuggingFaceLLM
# Create a HF LLM using the llama index wrapper
llm = HuggingFaceLLM(context_window=4096,
                    max_new_tokens=256,
                    system_prompt=system_prompt,
                    query_wrapper_prompt=query_wrapper_prompt,
                    model=model,
                    tokenizer=tokenizer)
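Optionally, you can sanity-check the wrapped LLM before building an index (a sketch; note this triggers a full generation on the GPUs, so it takes a while):

# Optional sanity check of the llama-index LLM wrapper (slow: runs a full generation).
test_response = llm.complete("What can you help me with?")
print(test_response.text)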

Step 5 - Wrap the custom embedding and the locally loaded huggingface llm into llama-index’s ServiceContext

# Bring in stuff to change service context
from llama_index import set_global_service_context
from llama_index import ServiceContext
# Create new service context instance
service_context = ServiceContext.from_defaults(
    chunk_size=1024,
    llm=llm,
    embed_model=embeddings
)
# And set the service context
set_global_service_context(service_context)

Step 6a - Load the text data using llama-index’s SimpleDirectoryReader and build the built-in VectorStoreIndex

# Load the toy data and build the vector store index
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
import torch

documents = SimpleDirectoryReader("./toy_data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
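Optionally, the index can be persisted so later sessions can reload it instead of re-embedding the documents (a sketch; './storage' is an arbitrary directory name chosen here):

# Optional: persist the vector index to disk to avoid re-embedding on the next run.
index.storage_context.persist(persist_dir="./storage")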

Step 6b - Create the query engine we will use to ask questions

# Setup index query engine using LLM
query_engine = index.as_query_engine()
# Test out a query in natural language
response = query_engine.query("Tell me about Sweden's population?")
response.metadata
response.response
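To render the answer nicely in the notebook, a small sketch reusing the Markdown helper imported in Step 1:

# Render the generated answer as Markdown in the notebook output.
display(Markdown(response.response))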