Model · Fine-tune

mdAgent-Hermes-32B

A multimodal tool-use LoRA fine-tune of Qwen3-VL-32B, optimized for reliable function calling and document understanding — quantized to run on a single on-prem GPU node.

Base model: Qwen3-VL-32B
Method: LoRA · multimodal
Quantization: AWQ · INT4
Result: 82.3% single-turn

Overview

Off-the-shelf models are rarely reliable enough at tool use for production agents: they hallucinate arguments, violate tool schemas, and break on multimodal input. mdAgent-Hermes-32B is a targeted fine-tune that makes a 32B vision-language model dependable at calling the right tool with the right arguments.

Trained with LoRA on curated multimodal tool-use data and then quantized for efficient serving, the model keeps strong function-calling accuracy while fitting a realistic on-prem hardware budget.

Approach

  • LoRA fine-tuning of Qwen3-VL-32B on multimodal tool-use traces.
  • Quantization (AWQ, INT4) for low-memory, high-throughput inference.
  • Rigorous evaluation with evalscope across function-calling and reasoning suites.
  • vLLM serving for production deployment on an A100 cluster.
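To make the LoRA and INT4 choices above concrete, here is a rough back-of-the-envelope sketch in Python. It is not the project's tooling. It assumes "32B" means about 32 billion weights, counts weights only (no KV cache, activations, or AWQ scale/zero-point overhead), and uses a hypothetical 5120 x 5120 projection with rank 16 purely for illustration; the actual LoRA rank and target modules are not stated here.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    LoRA keeps W frozen and learns B @ A, with A of shape (rank, d_in)
    and B of shape (d_out, rank).
    """
    return rank * (d_in + d_out)


NUM_PARAMS = 32e9  # assumption: "32B" read as 32 billion weights

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)
print(f"FP16 weights: {fp16_gib:.1f} GiB")
print(f"INT4 weights: {int4_gib:.1f} GiB")

# Hypothetical layer shape and rank, chosen only to show the scale.
D_MODEL, RANK = 5120, 16
full = D_MODEL * D_MODEL
added = lora_added_params(D_MODEL, D_MODEL, RANK)
print(f"LoRA trains {added:,} of {full:,} weights in this layer "
      f"({100 * added / full:.3f}%)")
```

Under these assumptions the weights shrink from roughly 60 GiB at FP16 to roughly 15 GiB at INT4, which is what makes a single-GPU deployment plausible, and the LoRA adapter touches well under 1% of the weights in the example layer.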

Evaluation

  • 82.3%: single-turn tool-use success
  • BFCL-v3: function-calling benchmark
  • HumanEval: code reasoning
  • GSM8K: math reasoning
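One way to read a single-turn tool-use success rate is as the fraction of prompts where the model's one tool call matches the reference call. The sketch below illustrates that idea only. The exact scoring rule used by evalscope or BFCL-v3 is not described here, so the exact-match criterion, the JSON call format with `name` and `arguments` keys, and the helper names are all assumptions.

```python
import json


def call_matches(predicted: dict, expected: dict) -> bool:
    """Assumed criterion: same tool name and exactly equal arguments."""
    return (
        predicted.get("name") == expected.get("name")
        and predicted.get("arguments") == expected.get("arguments")
    )


def single_turn_success(predictions: list[str], references: list[dict]) -> float:
    """Fraction of raw model outputs that parse to the expected tool call."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = 0
    for raw, ref in zip(predictions, references):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable output counts as a failure
        if isinstance(call, dict) and call_matches(call, ref):
            hits += 1
    return hits / len(references)


# Hypothetical examples: one correct call, one wrong argument, one non-JSON reply.
refs = [
    {"name": "get_weather", "arguments": {"city": "Oslo"}},
    {"name": "get_weather", "arguments": {"city": "Lima"}},
    {"name": "search_docs", "arguments": {"query": "invoice 17"}},
]
preds = [
    '{"name": "get_weather", "arguments": {"city": "Oslo"}}',
    '{"name": "get_weather", "arguments": {"city": "Lisbon"}}',
    "Sure, let me look that up for you.",
]
score = single_turn_success(preds, refs)
print(f"single-turn success: {score:.1%}")
```

Exact matching is the strictest reasonable choice; a real harness may normalize argument types or accept several valid reference calls, which would change the number.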

Stack

PyTorch · LoRA · Qwen3-VL-32B · AWQ · INT4 · vLLM · evalscope · A100