# Support Matrix

## GPU Requirements
Running large language models (LLMs) is a heavily GPU-bound workload. An LLM's size is defined by its parameter count, typically measured in billions. These generative AI examples focus on the Llama 2 Chat models from Meta, which are available in three sizes: 7B, 13B, and 70B parameters. All three models perform well, but the 13B model offers a good balance of response quality and GPU memory utilization.
| Model | GPU Memory Requirement |
|---|---|
| Llama-2-7B-Chat | 30 GB |
| Llama-2-13B-Chat | 50 GB |
| Llama-2-70B-Chat | 320 GB |
| Llama-2-7B-Chat AWQ Quantized | 30 GB |
| Nemotron-8B-Chat-SFT | 100 GB |
The required GPU memory can be provided by multiple GPUs on the same machine.
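As a quick sanity check before pulling model weights, you can total the memory across all visible GPUs. This is a minimal sketch, assuming PyTorch with CUDA support is installed; the 50 GB threshold is taken from the Llama-2-13B-Chat row above and should be swapped for your chosen model.

```python
# Illustrative pre-flight check: sum memory across all visible GPUs and
# compare it against the requirement for the chosen model.
import torch

REQUIRED_GIB = 50  # assumption: requirement for Llama-2-13B-Chat

total_gib = sum(
    torch.cuda.get_device_properties(i).total_memory
    for i in range(torch.cuda.device_count())
) / 1024**3

print(f"Visible GPU memory: {total_gib:.1f} GiB (need {REQUIRED_GIB} GiB)")
if total_gib < REQUIRED_GIB:
    raise SystemExit("Insufficient GPU memory for the selected model.")
```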
To perform retrieval augmentation, an embedding model is required. The embedding model converts a sequence of words into a vector of numbers that represents its meaning. Embedding models are much smaller than LLMs and require an additional 2 GB of GPU memory.
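For illustration, the snippet below embeds a short passage with the `sentence-transformers` package and the `all-MiniLM-L6-v2` model; both are stand-ins for this sketch, not necessarily the embedding model used in these examples.

```python
# Minimal embedding sketch. Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Assumption: all-MiniLM-L6-v2 is an illustrative stand-in model.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Each input string becomes one fixed-length vector of floats.
vectors = model.encode(["Retrieval augmented generation grounds LLM answers."])
print(vectors.shape)  # (1, 384): one 384-dimensional vector per input
```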
In the examples, Milvus is the default vector database because it can use the NVIDIA RAFT libraries to GPU-accelerate vector searches. Allow an additional 4 GB of GPU memory for the Milvus database.
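The sketch below shows one way a GPU-accelerated index can be requested through `pymilvus`. The collection name, vector dimension, and `nlist` value are placeholders, and GPU index types such as `GPU_IVF_FLAT` are only available when the Milvus deployment is built with GPU support.

```python
# Hypothetical collection setup with a GPU-accelerated index.
# Assumes a GPU-enabled Milvus instance is running locally.
from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://localhost:19530")

schema = MilvusClient.create_schema(auto_id=True)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=1024)

index_params = client.prepare_index_params()
index_params.add_index(
    field_name="embedding",
    index_type="GPU_IVF_FLAT",  # GPU-accelerated index type
    metric_type="L2",
    params={"nlist": 1024},     # placeholder tuning value
)

client.create_collection(
    collection_name="docs", schema=schema, index_params=index_params
)
```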
## CPU and Memory Requirements
For development purposes, use a machine with at least 10 CPU cores and 64 GB of RAM.
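You can verify both minimums with a short check like the following sketch, which assumes the `psutil` package is installed.

```python
# Illustrative check of the development-machine minimums above.
# Requires: pip install psutil
import os
import psutil

cores = os.cpu_count() or 0
ram_gib = psutil.virtual_memory().total / 1024**3

print(f"CPU cores: {cores}, RAM: {ram_gib:.0f} GiB")
if cores < 10 or ram_gib < 64:
    print("Below the recommended minimum of 10 cores and 64 GB of RAM.")
```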
## Storage Requirements
The two primary storage considerations for retrieval augmented generation are the model weights and the documents in the vector database. The disk space a model needs varies with its parameter count (a back-of-the-envelope sizing sketch follows the table):
| Model | Disk Storage |
|---|---|
| Llama-2-7B-Chat | 30 GB |
| Llama-2-13B-Chat | 50 GB |
| Llama-2-70B-Chat | 150 GB |
| Nemotron-8B-Chat-SFT | 50 GB |
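For intuition, raw FP16 weights occupy roughly two bytes per parameter, about 13 GiB for a 7B model. The larger figures in the table presumably leave headroom for additional artifacts such as converted or optimized copies of the weights, though that breakdown is an assumption here; plan against the table's numbers.

```python
# Back-of-the-envelope estimate of raw weight size only; the table's
# figures above are larger and are the numbers to plan against.
def raw_weight_gib(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate raw weight size (FP16 = 2 bytes per parameter)."""
    return params_billions * 1e9 * bytes_per_param / 1024**3

for name, size_b in [("Llama-2-7B", 7), ("Llama-2-13B", 13), ("Llama-2-70B", 70)]:
    print(f"{name}: ~{raw_weight_gib(size_b):.0f} GiB of FP16 weights")
```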
The disk space needed for the vector database varies with the number of documents you upload. For development purposes, 10 GB is sufficient.
You also need approximately 60 GB of disk space for Docker images.