# Tool-calling

## Supported benchmarks

### bfcl_v3
BFCL v3 consists of seventeen evaluation subsets that test function-calling capabilities ranging from simple function calls to complex multi-turn interactions.
- Benchmark is defined in `nemo_skills/dataset/bfcl_v3/__init__.py`
- Original benchmark source is here.
#### Data Preparation
To prepare BFCL v3 data for evaluation:
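A minimal sketch of the preparation step, assuming the standard `ns prepare_data` entry point and that the dataset name matches its folder under `nemo_skills/dataset` (`bfcl_v3`):

```bash
# downloads the BFCL v3 files and converts them into per-subset test.jsonl files
ns prepare_data bfcl_v3
```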
This command performs the following operations:
- Downloads the complete set of BFCL v3 evaluation files
- Processes and organizes data into seventeen separate subset folders
- Creates standardized test files in JSONL format
Example output structure:
```
nemo_skills/dataset/bfcl_v3/
├── simple/test.jsonl
├── parallel/test.jsonl
├── multiple/test.jsonl
└── ... (other subsets)
```
## Challenges of tool-calling tasks
There are three key steps in tool-calling that differentiate it from typical text-only tasks:

1. Tool Presentation: Presenting the available tools to the LLM
2. Response Parsing: Extracting and validating tool calls from model-generated text
3. Tool Execution: Executing the tool calls and communicating the results back to the model

For steps 1 and 3, we borrow the implementation choices from the BFCL repo. For tool-call parsing (step 2), we support both client-side and server-side implementations.
### Client-Side Parsing (Default)
When to use: Standard models supported by the BFCL repository

How it works: Uses the parsing logic from BFCL's local inference handlers

Configuration Requirements:

- Model name specification via `++model_name=<model_id>`
Sample Command:
```bash
ns eval \
    --benchmarks bfcl_v3 \
    --cluster dfw \
    --model /hf_models/Qwen3-4B \
    --server_gpus 2 \
    --server_type vllm \
    --output_dir /workspace/qwen3-4b-client-parsing/ \
    ++inference.tokens_to_generate=8192 \
    ++model_name=Qwen/Qwen3-4B-FC
```
### Server-Side Parsing
When to use:

- Models not supported by BFCL client-side parsing
- Custom tool-calling formats
Configuration Requirements:

- Set `++use_client_parsing=False`
- Specify appropriate server arguments. For example, evaluating Qwen models with the vLLM server requires setting `--server_args` as shown in the sample command below.
Sample Command:

The following command evaluates the Qwen3-4B model, which uses a standard tool-calling format supported by vLLM:
```bash
ns eval \
    --benchmarks bfcl_v3 \
    --cluster dfw \
    --model /hf_models/Qwen3-4B \
    --server_gpus 2 \
    --server_type vllm \
    --output_dir /workspace/qwen3-4b-server-parsing/ \
    ++inference.tokens_to_generate=8192 \
    ++use_client_parsing=False \
    --server_args="--enable-auto-tool-choice --tool-call-parser hermes"
```
#### Custom parsing example (NVIDIA Llama-3.3-Nemotron-Super-49B-v1.5)
Some models implement bespoke tool-calling formats that require specialized parsing logic. For example, Llama-3.3-Nemotron-Super-49B-v1.5 uses its own tool-calling format, which requires passing the model-specific parsing script, `llama_nemotron_toolcall_parser_no_streaming.py`, to the server.
```bash
ns eval \
    --cluster=local \
    --benchmarks=bfcl_v3 \
    --model=/workspace/Llama-3_3-Nemotron-Super-49B-v1_5/ \
    --server_gpus=2 \
    --server_type=vllm \
    --output_dir=/workspace/llama_nemotron_49b_1_5_tool_calling/ \
    ++inference.tokens_to_generate=65536 \
    ++inference.temperature=0.6 \
    ++inference.top_p=0.95 \
    ++system_message='' \
    ++use_client_parsing=False \
    --server_args="--tool-parser-plugin \"/workspace/Llama-3_3-Nemotron-Super-49B-v1_5/llama_nemotron_toolcall_parser_no_streaming.py\" \
    --tool-call-parser \"llama_nemotron_json\" \
    --enable-auto-tool-choice"
```
### Configuration Parameters
| Configuration | `++use_client_parsing=True` (client-side) | `++use_client_parsing=False` (server-side) |
|---|---|---|
| `++use_client_parsing` | Default | - |
| `++model_name` | Required for client parsing | - |
| `--server_args` | - | Required for server-side parsing |
Note: To evaluate individual splits of `bfcl_v3`, such as `simple`, use `benchmarks=bfcl_v3.simple`.
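For instance, mirroring the client-side parsing command above but restricting evaluation to the `simple` subset (the output directory name here is illustrative):

```bash
ns eval \
    --benchmarks bfcl_v3.simple \
    --cluster dfw \
    --model /hf_models/Qwen3-4B \
    --server_gpus 2 \
    --server_type vllm \
    --output_dir /workspace/qwen3-4b-simple-only/ \
    ++inference.tokens_to_generate=8192 \
    ++model_name=Qwen/Qwen3-4B-FC
```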
Note: Currently, `ns summarize_results` does not support benchmarks with custom aggregation requirements like BFCL v3. To handle this, the evaluation pipeline automatically launches a dependent job that processes the individual subset scores using our scoring script.