Inference
Here are the instructions on how to run inference with our repo.
Download/convert the model
Get the model you want to use. You can use any model that's supported by vLLM, TensorRT-LLM or NeMo. You can also use the NVIDIA NIM API for models that are hosted there.
Convert the model if it's not in the format you want to use. No conversion is needed for vLLM inference with HF models (you can pass the model id directly if you want vLLM to download it for you). For the fastest inference, we recommend converting the model to TensorRT-LLM format.
Start the server
Start the server hosting your model. Here is an example (make sure the /hf_models mount is defined in your cluster config). Skip this step if you want to use cloud models through an API.
ns start_server \
--cluster local \
--model /hf_models/Meta-Llama-3.1-8B-Instruct \
--server_type vllm \
--server_gpus 1 \
--server_nodes 1
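If the model needs to execute code, add the --with_sandbox flag.
Send inference requests
The numbered markers such as (1) in the snippets below refer to the notes listed after each snippet. The first snippet queries the self-hosted server started above.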
from nemo_skills.inference.server.model import get_model
from nemo_skills.prompt.utils import get_prompt
llm = get_model(server_type="vllm") # localhost by default
prompt = get_prompt('generic/default', 'llama3-instruct') # (1)!
prompts = [prompt.fill({'question': "What's 2 + 2?"})]
print(prompts[0]) # (2)!
outputs = llm.generate(prompts=prompts)
print(outputs[0]["generation"]) # (3)!
(1) Here we use the generic/default config and the llama3-instruct template. See nemo_skills/prompt for more config/template options, or create your own prompts.
(2) This should print
>>> print(prompts[0])
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

<|eot_id|><|start_header_id|>user<|end_header_id|>

What's 2 + 2?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

If you don't want to use our prompt class, just create this string yourself.
(3) This prints the model's generation.
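The note about creating the prompt string yourself can be sketched in plain Python. This is a minimal sketch that mirrors the string shown in the notes for the llama3-instruct template, assuming the usual Llama 3 convention of a blank line after each header; `build_llama3_prompt` is a hypothetical helper written for illustration and is not part of nemo_skills.

```python
def build_llama3_prompt(question: str, system: str = "") -> str:
    """Assemble a Llama 3 instruct prompt with explicit special tokens.

    Hypothetical helper (not part of nemo_skills). Assumes each header is
    followed by a blank line, as in the printed example.
    """
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt_text = build_llama3_prompt("What's 2 + 2?")
print(prompt_text)
```

A string built this way can be placed in the `prompts` list passed to `llm.generate` instead of the result of `prompt.fill(...)`.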
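For models hosted behind an API (for example NVIDIA NIM), point get_model at the API endpoint instead of a local server:
from nemo_skills.inference.server.model import get_model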
from nemo_skills.prompt.utils import get_prompt
llm = get_model( # (1)!
server_type="openai", # NIM models are using OpenAI API
base_url="https://integrate.api.nvidia.com/v1",
model="meta/llama-3.1-8b-instruct",
)
prompt = get_prompt('generic/default') # (2)!
prompts = [prompt.fill({'question': "What's 2 + 2?"})]
print(prompts[0]) # (3)!
outputs = llm.generate(prompts=prompts)
print(outputs[0]["generation"]) # (4)!
(1) Don't forget to define NVIDIA_API_KEY. To use OpenAI models, define OPENAI_API_KEY and set base_url=https://api.openai.com/v1.
(2) Here we use the generic/default config. Note that with API models we can't add special tokens, so the prompt template is not specified. See nemo_skills/prompt for more config/template options, or create your own prompts.
(3) This prints the filled prompt, which for API models is a list rather than a single string. If you don't want to use our prompt class, just create this list yourself.
(4) This prints the model's generation.
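For API models, creating the list yourself might look like the sketch below. It assumes the list follows the OpenAI chat-message convention (dictionaries with `role` and `content` keys); the exact structure returned by `prompt.fill` may differ, so compare against the output of `print(prompts[0])`. `build_chat_prompt` is a hypothetical helper, not a nemo_skills function.

```python
# Assumption: a chat prompt is a list of {"role", "content"} dictionaries,
# as in the OpenAI chat-completions convention. Check this against the
# output of print(prompts[0]) before relying on the exact shape.
def build_chat_prompt(question: str, system: str = "") -> list:
    """Build a chat-style prompt; the system message is added only if non-empty."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": question})
    return messages


chat_prompt = build_chat_prompt("What's 2 + 2?")
print(chat_prompt)
```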
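For a model that executes code, create a sandbox client and use the code execution model (the server must be started with --with_sandbox):
from nemo_skills.code_execution.sandbox import get_sandbox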
from nemo_skills.inference.server.code_execution_model import get_code_execution_model
from nemo_skills.prompt.utils import get_prompt
sandbox = get_sandbox() # localhost by default
llm = get_code_execution_model(server_type="vllm", sandbox=sandbox)
prompt = get_prompt('generic/default', 'llama3-instruct') # (1)!
prompt.config.system = ( # (2)!
"Environment: ipython\n\n"
"Use Python to solve this math problem."
)
prompts = [prompt.fill({'question': "What's 2 + 2?"})]
print(prompts[0]) # (3)!
outputs = llm.generate(prompts=prompts, **prompt.get_code_execution_args()) # (4)!
print(outputs[0]["generation"]) # (5)!
(1) Here we use the generic/default config and the llama3-instruct template. Note how we update the system message on the next line (you can also include it in the config directly). See nemo_skills/prompt for more config/template options, or create your own prompts.
(2) The 8B model doesn't always follow these instructions, so we recommend using the 70B or 405B model for code execution.
(3) This should print
>>> print(prompts[0])
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Environment: ipython

Use Python to solve this math problem.<|eot_id|><|start_header_id|>user<|end_header_id|>

What's 2 + 2?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

If you don't want to use our prompt class, just create this string yourself.
(4) prompt.get_code_execution_args() simply returns a dictionary with start/stop tokens, so that we know when to stop LLM generation and how to format the output. If you don't want to use our prompt class, just define those parameters directly.
(5) This should print
>>> print(outputs[0]["generation"])
<|python_tag|>print(2 + 2)<|eom_id|><|start_header_id|>ipython<|end_header_id|>

completed
[stdout]
4
[/stdout]<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The answer is 4.
The "4" in the stdout comes directly from the Python interpreter running in the sandbox.
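To illustrate how start/stop tokens delimit a code block in the generation, here is a minimal sketch that uses only the standard library. The token strings are copied from the example output; `extract_code` and `run_locally` are hypothetical helpers, and `run_locally` uses `exec` purely as a stand-in for the sandbox (do not run untrusted model output this way).

```python
import contextlib
import io
from typing import Optional

CODE_BEGIN = "<|python_tag|>"  # token that opens a code block in the example output
CODE_END = "<|eom_id|>"        # token at which generation stops so the code can run


def extract_code(generation: str) -> Optional[str]:
    """Return the text between CODE_BEGIN and CODE_END, or None if either is missing."""
    start = generation.find(CODE_BEGIN)
    if start == -1:
        return None
    start += len(CODE_BEGIN)
    end = generation.find(CODE_END, start)
    if end == -1:
        return None
    return generation[start:end]


def run_locally(code: str) -> str:
    """Execute code and capture its stdout. Illustration only, NOT a sandbox."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()


partial_generation = "<|python_tag|>print(2 + 2)<|eom_id|>"
code = extract_code(partial_generation)
stdout = run_locally(code)
print(repr(code), repr(stdout))
```

In the real flow, the captured output is formatted and appended to the generation so that the model can continue with its final answer.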
Note that for self-hosted models we explicitly add all the special tokens before sending the prompt to the LLM. This is necessary to retain flexibility: for example, it lets us use the base-model format with instruct models, which we found to work better with few-shot examples.
You can learn more about how our prompt formatting works in the prompt format docs.
Note
You can also use a slurm config when launching a server. If you do that, add host=<slurm node hostname> to the get_model/sandbox calls and define the NEMO_SKILLS_SSH_KEY_PATH and NEMO_SKILLS_SSH_SERVER env vars to set up the connection through ssh.
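Before connecting to a server on a slurm node, it can help to confirm that the SSH-related variables are set. The following is a minimal sketch; `missing_ssh_vars` is a hypothetical helper, the key path is a placeholder, and only the two variable names come from the note above.

```python
import os

REQUIRED_SSH_VARS = ("NEMO_SKILLS_SSH_KEY_PATH", "NEMO_SKILLS_SSH_SERVER")


def missing_ssh_vars(env=None):
    """Return the required SSH variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SSH_VARS if not env.get(name)]


# An explicit mapping keeps the example independent of the real environment.
example_env = {"NEMO_SKILLS_SSH_KEY_PATH": "/path/to/key"}  # placeholder value
print(missing_ssh_vars(example_env))
```

Calling `missing_ssh_vars()` with no arguments checks the current process environment.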