LLM Examples Introduction
The following simple example shows how to use the LLM API with TinyLlama.
from tensorrt_llm import LLM, SamplingParams


def main():

    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


# The entry point of the program needs to be protected for spawning processes.
if __name__ == '__main__':
    main()
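The example above passes temperature=0.8 and top_p=0.95 to SamplingParams. As a rough illustration of what these two settings conventionally mean, the sketch below applies temperature scaling and nucleus (top-p) filtering to a made-up list of logits in plain Python. The function names and the logits are hypothetical, and this is not how TensorRT-LLM implements sampling internally; it only shows the usual definition of the two parameters.

```python
import math


def apply_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize over the surviving tokens; all other tokens are dropped.
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
probs = apply_temperature(logits, temperature=0.8)
candidates = top_p_filter(probs, top_p=0.95)
print(candidates)  # token 3 falls outside the 0.95 nucleus and is dropped
```

Lower temperatures sharpen the distribution toward the most likely token, and lower top_p values shrink the set of tokens that can be sampled at all.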
The LLM API can be used for both offline and online usage. See more examples of the LLM API here:
- Generate text
- Distributed LLM Generation
- Generate Text Asynchronously
- Generate Text in Streaming
- Generation with Quantization
- Automatic Parallelism with LLM
- Generate text with multiple LoRA adapters
- Control generated text using logits post processor
- Generate text with guided decoding
- Generate Text Using Lookahead Decoding
For more details on how to fully utilize this API, check out:
Supported Models
- Llama (including variants Mistral, Mixtral, InternLM)
- GPT (including variants Starcoder-1/2, Santacoder)
- Gemma-1/2
- Phi-1/2/3
- ChatGLM (including variants glm-10b, chatglm, chatglm2, chatglm3, glm4)
- QWen-1/1.5/2
- Falcon
- Baichuan-1/2
- GPT-J
- Mamba-1/2
Model Preparation
The LLM class supports input from any of the following:
- Hugging Face Hub: Triggers a download from the Hugging Face model hub, such as TinyLlama/TinyLlama-1.1B-Chat-v1.0.
- Local Hugging Face models: Uses a locally stored Hugging Face model.
- Local TensorRT-LLM engine: Built by the trtllm-build tool or saved by the Python LLM API.
Any of these formats can be used interchangeably with the LLM(model=<any-model-path>) constructor.
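Because all three formats go through the same constructor argument, it can be useful to log which kind of input a given string refers to before creating the LLM. The helper below is a minimal sketch of such a check. The function name and the layout heuristics are assumptions, not part of the LLM API: it assumes an engine directory contains *.engine files and a Hugging Face checkpoint contains a config.json.

```python
import tempfile
from pathlib import Path


def describe_model_source(model: str) -> str:
    """Guess which of the three input formats a `model` argument refers to."""
    path = Path(model)
    if not path.is_dir():
        # Not a local directory: treat it as a Hugging Face Hub repo name.
        return "hugging-face-hub"
    if any(path.glob("*.engine")):
        # Assumed layout: engine directories contain *.engine files.
        return "tensorrt-llm-engine"
    if (path / "config.json").is_file():
        # Assumed layout: Hugging Face checkpoints ship a config.json.
        return "local-hugging-face"
    return "unknown"


# Demo: a repo name, then a throwaway directory that imitates a local checkpoint.
print(describe_model_source("TinyLlama/TinyLlama-1.1B-Chat-v1.0"))
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    print(describe_model_source(tmp))
```

The result is only a hint for logging or validation; the LLM constructor itself decides how to interpret the path.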
The following sections show how to use these different formats for the LLM API.
Hugging Face Hub
Using the Hugging Face hub is as simple as specifying the repo name in the LLM constructor:
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
Local Hugging Face Models
Given the popularity of the Hugging Face model hub, the API supports the Hugging Face format as one of the starting points. To use the API with Llama 3.1 models, download the model from the Meta Llama 3.1 8B model page by using the following commands:
git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-8B
After the download finishes, load the model as follows:
llm = LLM(model=<path_to_meta_llama_from_hf>)
Note: Using this model is subject to a particular license. Agree to the terms and authenticate with Hugging Face to begin the download.
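If git clone runs without Git LFS set up, the large weight files in the clone are small text pointer files rather than the real weights. The sketch below checks a cloned directory for such leftover pointers before it is passed to LLM. The helper name is hypothetical and the demo directory is fabricated for illustration; the header string is the standard first line of a Git LFS pointer file.

```python
import tempfile
from pathlib import Path

LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir):
    """Return names of files that are still Git LFS pointers, not real weights."""
    pointers = []
    for path in sorted(Path(model_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            if handle.read(len(LFS_HEADER)) == LFS_HEADER:
                pointers.append(path.name)
    return pointers


# Demo on a throwaway directory that imitates an incomplete clone.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "config.json").write_text("{}")
    (root / "model.safetensors").write_bytes(
        LFS_HEADER + b"\noid sha256:0000\nsize 123\n"
    )
    missing = find_lfs_pointers(root)
    print(missing)  # ['model.safetensors']
```

An empty result suggests the weights were fetched; a non-empty one usually means git lfs install was skipped and the clone should be repeated or completed with git lfs pull.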
From TensorRT-LLM Engine
There are two ways to build the TensorRT-LLM engine:
Using the trtllm-build tool: You can build the TensorRT-LLM engine directly from the Hugging Face model with the trtllm-build tool and then save the engine to disk for later use. Refer to the README in the examples/llama repository on GitHub.
After the engine build finishes, load the model as follows:
llm = LLM(model=<path_to_trt_engine>)
Using an LLM instance: Use an LLM instance to create the engine and persist it to local disk:
llm = LLM(<model-path>)
# Save engine to local disk
llm.save(<engine-dir>)
The saved engine can be reloaded in the same way as above, by passing the engine directory to the LLM constructor.
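Saving and reloading combine naturally into a build-once pattern: build the engine on the first run, then reuse the saved copy on later runs. The helper below is a minimal sketch of that idea. The name load_or_build and the llm_factory parameter are hypothetical (the latter exists only so the logic can be tried without a TensorRT-LLM installation), and treating any non-empty directory as a valid saved engine is a simplifying assumption.

```python
from pathlib import Path


def load_or_build(model, engine_dir, llm_factory=None):
    """Reuse a saved engine if one exists; otherwise build it and save it."""
    if llm_factory is None:
        # Deferred import so the helper can be exercised without TensorRT-LLM.
        from tensorrt_llm import LLM
        llm_factory = LLM

    engine_path = Path(engine_dir)
    if engine_path.is_dir() and any(engine_path.iterdir()):
        # Assumption: a non-empty directory holds a previously saved engine.
        return llm_factory(model=str(engine_path))

    # First run: build from the original model, then persist the engine.
    llm = llm_factory(model=model)
    llm.save(str(engine_path))
    return llm
```

A call such as load_or_build("TinyLlama/TinyLlama-1.1B-Chat-v1.0", "./tinyllama-engine") would then build on the first invocation and load from disk afterwards. A stricter version would also verify that the saved engine matches the current model and build settings.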