LLM Streaming Client

This notebook demonstrates how to stream responses from the LLM.

Triton Inference Server

The LLM has been deployed to NVIDIA Triton Inference Server and leverages NVIDIA TensorRT-LLM (TRT-LLM), so it’s optimized for low latency and high throughput inference.

The Triton client is used to communicate with the inference server hosting the LLM and is available in LangChain.

Streaming LLM Responses

TRT-LLM on its own can dramatically reduce LLM response latency, but streaming takes the user experience a step further. Instead of waiting for the entire response to be returned from the LLM, chunks of it can be processed as soon as they are available, which reduces the latency perceived by the user.

Step 1: Structure the Query in a Prompt Template

A prompt template is a common paradigm in LLM development.

It is a pre-defined set of instructions provided to the LLM that guides the output produced by the model. Prompt templates can contain few-shot examples and other guidance, and they are a quick way to engineer the responses from the LLM. Llama 2 accepts the prompt format shown in LLAMA_PROMPT_TEMPLATE, which we construct from:

  • The system prompt

  • The context

  • The user’s question

# Llama 2 chat prompt format: system prompt, optional context, then the user question.
LLAMA_PROMPT_TEMPLATE = (
    "<s>[INST] <<SYS>>"
    "{system_prompt}"
    "<</SYS>>"
    "[/INST] {context} </s><s>[INST] {question} [/INST]"
)

system_prompt = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Please ensure that your responses are positive in nature."
context = ""
question = "What is the fastest land animal?"
prompt = LLAMA_PROMPT_TEMPLATE.format(
    system_prompt=system_prompt, context=context, question=question
)
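
To confirm the template was filled in as expected, you can print the assembled prompt before sending it to the server. This quick check only uses the variables defined above.

# Inspect the fully formatted prompt: the system prompt sits inside
# <<SYS>> ... <</SYS>>, followed by the (empty) context and the user
# question wrapped in [INST] ... [/INST] tags.
print(prompt)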

Step 2: Create the Triton Client

WARNING! Be sure to replace `triton_url` with the address and port that Triton is running on; for example, localhost:8001.

If you are running this notebook as part of the AI workflow, you don't have to replace the URL.

from langchain_nvidia_trt.llms import TritonTensorRTLLM

# Address and port of the Triton Inference Server hosting the LLM.
triton_url = "llm:8001"

# Connection and generation parameters passed to the LangChain Triton client.
pload = {
    "tokens": 300,
    "server_url": triton_url,
    "model_name": "ensemble",
    "temperature": 1.0,
    "top_k": 1,
    "top_p": 0,
    "beam_width": 1,
    "repetition_penalty": 1.0,
    "length_penalty": 1.0,
}
client = TritonTensorRTLLM(**pload)
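
Before streaming, you can optionally confirm that the client can reach the server with a single non-streaming request. This is a minimal sketch and assumes the standard LangChain invoke method is available in your installed version of langchain_nvidia_trt.

# Optional sanity check: one non-streaming request to verify connectivity.
# Assumes the standard LangChain LLM `invoke` method is available.
response = client.invoke(prompt)
print(response)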

The following inputs to the LLM can be modified (an illustrative sketch with alternative values follows this list):

  • tokens: the maximum number of tokens (words/sub-words) to generate

  • temperature: [0, 1] – higher values produce more diverse outputs

  • top_k: sample from the k most likely next tokens at each step; a lower value concentrates sampling on the highest-probability tokens (reduces variety)

  • top_p: [0, 1] – cumulative probability cutoff for token selection; lower values restrict sampling to a smaller nucleus of high-probability tokens (reduces variety)

  • repetition_penalty: [1, 2] – penalizes repeated tokens

  • length_penalty: a value of 1 applies no penalty based on the length of the generation
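
As an illustration, the greedy settings above (top_k=1, top_p=0) can be relaxed to sample from a wider pool of candidate tokens, which produces more varied responses. The sketch below reuses the triton_url and model name defined earlier; the specific values are examples only, not recommendations.

# Illustrative sketch: relax the greedy settings for more varied output.
creative_client = TritonTensorRTLLM(
    server_url=triton_url,
    model_name="ensemble",
    tokens=300,
    temperature=0.8,         # slightly below 1.0 to temper randomness
    top_k=40,                # consider the 40 most likely tokens at each step
    top_p=0.9,               # nucleus sampling over 90% cumulative probability
    beam_width=1,
    repetition_penalty=1.1,  # mildly discourage repeated tokens
    length_penalty=1.0,
)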

Step 3: Load the Model and Stream Responses

import time

start_time = time.time()
tokens_generated = 0

# Stream the response and print each chunk as soon as it arrives.
for val in client.stream(prompt):
    tokens_generated += 1
    print(val, end="", flush=True)

total_time = time.time() - start_time
print(f"\n--- Generated {tokens_generated} tokens in {total_time:.2f} seconds ---")
print(f"--- {tokens_generated/total_time:.2f} tokens/sec ---")