LLM Streaming Client
This notebook demonstrates how to stream responses from the LLM.
Triton Inference Server
The LLM has been deployed to NVIDIA Triton Inference Server and leverages NVIDIA TensorRT-LLM (TRT-LLM), so it is optimized for low-latency, high-throughput inference.
The Triton client, available in LangChain, is used to communicate with the inference server that hosts the LLM.
Streaming LLM Responses
TRT-LLM on its own can provide drastic improvements to LLM response latency, but streaming can take the user experience to the next level. Instead of waiting for an entire response to be returned from the LLM, chunks of it can be processed as soon as they are available. This reduces the latency perceived by the user.
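To illustrate the difference, here is a minimal, self-contained sketch that contrasts waiting for a full response with printing chunks as they arrive. The generate_chunks generator is a hypothetical stand-in for any streaming client and is not part of this workflow.

# Conceptual sketch only: `generate_chunks()` is a hypothetical stand-in
# for a streaming LLM client that yields chunks as they are produced.
def generate_chunks():
    for chunk in ["The", " cheetah", " is", " the", " fastest", " land", " animal", "."]:
        yield chunk

# Non-streaming: nothing is shown until every chunk has been collected.
full_response = "".join(generate_chunks())
print(full_response)

# Streaming: each chunk is shown the moment it is available,
# so the first words appear almost immediately.
for chunk in generate_chunks():
    print(chunk, end="", flush=True)
print()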
Step 1: Structure the Query in a Prompt Template
Prompt templates are a common paradigm in LLM development.
They are a pre-defined set of instructions provided to the LLM that guide the output produced by the model. They can contain few-shot examples and additional guidance, and they are a quick way to engineer the responses from the LLM. Llama 2 accepts the prompt format shown in LLAMA_PROMPT_TEMPLATE, which we construct from:
The system prompt
The context
The user’s question
LLAMA_PROMPT_TEMPLATE = (
    "<s>[INST] <<SYS>>"
    "{system_prompt}"
    "<</SYS>>"
    "[/INST] {context} </s><s>[INST] {question} [/INST]"
)
system_prompt = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Please ensure that your responses are positive in nature."
context=""
question='What is the fastest land animal?'
prompt = LLAMA_PROMPT_TEMPLATE.format(system_prompt=system_prompt, context=context, question=question)
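As a quick sanity check, you can print the assembled prompt to confirm that the placeholders were filled as expected:

# Inspect the fully assembled prompt before sending it to the model.
print(prompt)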
Step 2: Create the Triton Client
Use the address and port that Triton Inference Server is available on; for example, localhost:8001.
If you are running this notebook as part of the AI workflow, you don't have to replace the URL.
from langchain_nvidia_trt.llms import TritonTensorRTLLM
triton_url = "llm:8001"
pload = {
    'tokens': 300,
    'server_url': triton_url,
    'model_name': "ensemble",
    'temperature': 1.0,
    'top_k': 1,
    'top_p': 0,
    'beam_width': 1,
    'repetition_penalty': 1.0,
    'length_penalty': 1.0
}
client = TritonTensorRTLLM(**pload)
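Optionally, before sending any prompts, you can verify that the server is reachable and the model is loaded. This check is not part of the workflow; it is a convenience sketch that assumes the tritonclient[grpc] package is installed.

# Optional connectivity check using the standard Triton gRPC client
# (assumes the tritonclient[grpc] package is available).
import tritonclient.grpc as grpcclient

health_client = grpcclient.InferenceServerClient(url=triton_url)
print("Server live:", health_client.is_server_live())
print("Model ready:", health_client.is_model_ready("ensemble"))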
Additional inputs to the LLM can be modified (a sketch of an alternative sampling configuration follows this list):
tokens: the maximum number of tokens (words/sub-words) to generate
temperature: [0, 1] – higher values produce more diverse outputs
top_k: sample from the k most likely next tokens at each step; a lower value concentrates sampling on the highest-probability tokens (reduces variety)
top_p: [0, 1] – cumulative probability cutoff for token selection; lower values sample from a smaller nucleus of tokens (reduces variety)
repetition_penalty: [1, 2] – penalizes repeated tokens
length_penalty: a value of 1 applies no penalty based on generation length
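The configuration above (top_k of 1 and top_p of 0) is effectively greedy decoding. The sketch below shows how a looser sampling configuration might look; the parameter names match those used above, but the values are illustrative only, not tuned recommendations.

# Illustrative example of a more exploratory sampling configuration.
# The values below are examples, not recommendations.
creative_pload = {
    'tokens': 300,
    'server_url': triton_url,
    'model_name': "ensemble",
    'temperature': 0.7,         # sampling temperature; matters once top_k/top_p allow sampling
    'top_k': 40,                # consider the 40 most likely next tokens at each step
    'top_p': 0.9,               # nucleus sampling: keep tokens within 90% cumulative probability
    'beam_width': 1,
    'repetition_penalty': 1.1,  # mildly discourage repeated tokens
    'length_penalty': 1.0       # no additional penalty based on length
}
creative_client = TritonTensorRTLLM(**creative_pload)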
Step 3: Load the Model and Stream Responses
import time
import random

start_time = time.time()
tokens_generated = 0

# Stream the response, printing each chunk as soon as it arrives.
for val in client.stream(prompt):
    tokens_generated += 1
    print(val, end="", flush=True)

total_time = time.time() - start_time
print(f"\n--- Generated {tokens_generated} tokens in {total_time} seconds ---")
print(f"--- {tokens_generated/total_time} tokens/sec")