Input JSON Format#
This guide describes the input JSON format for the LLM inference tool. The format supports text-only and multimodal inputs, multi-turn conversations, LoRA adapters, and advanced features.
Basic Structure#
{
"batch_size": 1,
"temperature": 1.0,
"top_p": 0.8,
"top_k": 50,
"max_generate_length": 256,
"apply_chat_template": true,
"enable_thinking": false,
"available_lora_weights": {},
"requests": [
{
"messages": [
{
"role": "user",
"content": "Your message here"
}
],
"lora_name": "optional_lora_name",
"save_system_prompt_kv_cache": false,
"disable_spec_decode": false
}
]
}
Parameters#
Required#
requests: Array of conversation requests
Optional Global Parameters#
batch_size (default: 1): Number of requests per batch
temperature (default: 1.0): Sampling temperature (0.0 = deterministic)
top_p (default: 0.8): Nucleus sampling threshold
top_k (default: 50): Top-k sampling parameter
max_generate_length (default: 256): Maximum number of tokens to generate
apply_chat_template (default: true): Apply chat template formatting
add_generation_prompt (default: true): Add the generation prompt token sequence
enable_thinking (default: false): Enable thinking mode (Qwen3 and later)
available_lora_weights (default: {}): Map of LoRA adapter names to file paths
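The defaulting behavior above can be sketched in Python. This is an illustrative client-side helper, not part of the tool's API: the `GLOBAL_DEFAULTS` values mirror the documented defaults, and `with_defaults` is a hypothetical name.

```python
# Hypothetical sketch of how missing global parameters could be filled in
# client-side before handing the config to the tool. Defaults mirror the
# documented values; the helper itself is an assumption for illustration.
GLOBAL_DEFAULTS = {
    "batch_size": 1,
    "temperature": 1.0,
    "top_p": 0.8,
    "top_k": 50,
    "max_generate_length": 256,
    "apply_chat_template": True,
    "add_generation_prompt": True,
    "enable_thinking": False,
    "available_lora_weights": {},
}

def with_defaults(config: dict) -> dict:
    """Return a copy of config with absent global parameters defaulted."""
    merged = {**GLOBAL_DEFAULTS, **config}
    if "requests" not in merged:
        raise ValueError("'requests' is required")
    return merged

# Example: only 'temperature' is overridden; everything else is defaulted.
cfg = with_defaults({"requests": [], "temperature": 0.0})
```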
Request Fields#
messages (required): Array of conversation messages
lora_name (optional): LoRA adapter name from available_lora_weights
save_system_prompt_kv_cache (optional, default: false): Cache the system prompt KV state for reuse
disable_spec_decode (optional, default: false): Disable EAGLE speculative decoding for this request even if the draft engine is loaded
Message Fields#
role: "system", "user", or "assistant"
content: String (text-only) or Array (multimodal)
Content Array Format:
Text: {"type": "text", "text": "..."}
Image: {"type": "image", "image": "/path/to/image.jpg"}
Video: {"type": "video", "video": "/path/to/video.mp4"} (Note: video support is a placeholder for future releases and is not yet available)
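The content-part shapes above can be built with small helpers. These helper names (`text_part`, `image_part`) are illustrative assumptions, not part of the tool; only the resulting dict shapes come from this guide.

```python
# Illustrative builders for the documented content-array entries.
# The helper names are hypothetical; only the output shapes are documented.
def text_part(text: str) -> dict:
    return {"type": "text", "text": text}

def image_part(path: str) -> dict:
    return {"type": "image", "image": path}

# A multimodal user message in the documented format:
message = {
    "role": "user",
    "content": [
        image_part("/path/to/image.jpg"),
        text_part("Describe this image."),
    ],
}
```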
Examples#
Basic Text Input#
{
"batch_size": 1,
"max_generate_length": 256,
"requests": [
{
"messages": [
{"role": "user", "content": "What is machine learning?"}
]
}
]
}
Multi-Turn Conversation#
{
"requests": [
{
"messages": [
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."},
{"role": "user", "content": "What is the population?"}
]
}
]
}
Multimodal Input (Vision-Language Models)#
{
"requests": [
{
"messages": [
{
"role": "user",
"content": [
{"type": "image", "image": "/path/to/image.jpg"},
{"type": "text", "text": "Describe this image."}
]
}
]
}
]
}
LoRA Adapters#
{
"available_lora_weights": {
"french": "/path/to/french_adapter.safetensors"
},
"requests": [
{
"messages": [
{"role": "user", "content": "Translate to French."}
],
"lora_name": "french"
}
]
}
Note: All requests in the same batch must use the same LoRA adapter.
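The same-adapter-per-batch rule can be enforced client-side before submitting a config. This check is a sketch under the assumption that mixing adapter and no-adapter requests is also disallowed; `check_single_lora` is a hypothetical helper, not a tool API.

```python
# Minimal client-side check for the documented rule that all requests in
# one batch must use the same LoRA adapter. Treats a missing lora_name as
# "no adapter" (an assumption). Hypothetical helper, not a tool API.
def check_single_lora(requests: list[dict]) -> None:
    names = {r.get("lora_name") for r in requests}
    if len(names) > 1:
        raise ValueError(f"Batch mixes LoRA adapters: {names}")
```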
System Prompt KV Cache#
{
"requests": [
{
"messages": [
{"role": "system", "content": "Long system prompt..."},
{"role": "user", "content": "Question?"}
],
"save_system_prompt_kv_cache": true
}
]
}
Raw Format (No Chat Template)#
{
"apply_chat_template": false,
"requests": [
{
"messages": [
{"role": "user", "content": "Raw text without special tokens"}
]
}
]
}
Disable Speculative Decoding#
When using EAGLE speculative decoding, you can disable it for specific requests:
{
"requests": [
{
"messages": [
{"role": "user", "content": "Your question here"}
],
"disable_spec_decode": true
}
]
}
Use cases:
Quality: Some inputs may benefit from standard decoding over EAGLE
Switching strategies: Different batches can use different decoding strategies (one batch with EAGLE, another without)
Debugging: Compare performance with/without speculative decoding
Note: If any request in a batch has disable_spec_decode: true, speculative decoding is disabled for the entire batch; requests within one batch cannot currently use different decoding strategies.
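The batch semantics above reduce to a simple predicate; the sketch below models them for a client-side check. `batch_uses_spec_decode` is a hypothetical name, not the tool's API.

```python
# Sketch of the documented batch semantics: a single request opting out
# disables speculative decoding for the whole batch. Hypothetical helper.
def batch_uses_spec_decode(requests: list[dict]) -> bool:
    return not any(r.get("disable_spec_decode", False) for r in requests)
```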
Notes#
System prompt: Uses the provided system message, or the model's default from its chat template
LoRA: All requests in the same batch must use the same adapter
Paths: Image/video paths may be absolute or relative
Format: Follows the OpenAI chat completion API structure